UAB day 4: analytical approaches

Today I am blogging from day 4 of the UAB short course on sequencing.

Speaker 1: Suzanne Leal – Methods to detect rare variant association

The missing heritability

For the vast majority of complex traits, < 10% of the genetic variance is explained by common variants so far
This motivates a search for causal rare variants to explain the “missing heritability”
Other places that the “missing heritability” could lie: multi-gene interaction terms, gene-environment interaction terms, structural variation.
But there is strong evidence that rare variants play a role.
Rare variant associations are difficult to uncover in GWAS. tagSNPs included on chips are usually in low LD with rare variants. Analysis of haplotypes can help to increase power, but still underpowered.
Sequencing is a more powerful approach

Individual variant tests

Same methods used for common variants can be used for rare variants
Single marker tests: Chi-square, Cochran-Armitage trend test
Multiple marker tests: Hotelling’s T², logistic regression, minimal p-value
However, individual variant tests require huge sample sizes to have power for rare variants.
MAF 1%, OR 2.0, disease prevalence 1% requires 6,000 cases and 6,000 controls for 80% power.
MAF 1%, OR 1.5, disease prevalence 1% requires 20,000 cases and 20,000 controls for 80% power.
For analysis of rare with MAF < 0.5% it is usually not possible to ascertain large enough sample sizes to detect associations
This has motivated gene score methods for aggregating variants across a functional unit, usually a gene.

Aggregate tests

Variously called aggregate tests, burden tests, or collapsing methods.
Misclassification is a huge issue. Causal variants may be excluded, neutral variants may be included. Methods must be robust to misclassification.
Frequency cutoffs are usually used to determine which variants to include in the analysis.

A few tests for rare variant association

Combined multivariate and collapsing (CMC) [Li & Leal 2008] Collapsing scheme can be used in a regression framework. All these methods were originally developed _not_ in a regression framework but have been adapted so that you can use covariates in your analysis
Gene-or-region-based analysis of variants of intermediate and low frequency (GRANVIL) - Numerator is number of rare variants for an individual; denominator is number of genotyped rare variant sites. Incomplete genotyping becomes a source of bias. Suppose there are 5 rare variant sites; individual A was only genotyped at 2 sites and had a rare variant at 1. This person’s score is 1/2, which is quite high and misleading – because these are very rare variants, it’s likely they would have been homozygous reference at the other 3 ungenotyped sites and their true score should be 1/5.
Weighted Sum Statistic WSS [Madsen & Browning 2009]. Variants are inversely weighted by frequency in controls. For all of these tests it is advisable to use empirical p-values; for this test it is absolutely essential.
C-alpha
Sequence Kernel Association Test (SKAT)

Imputation of rare variant data

Can be imputed from sequence data e.g. 1000G into GWAS studies
Imputation of rare variants is not very good, but there have been some successful examples. Auer 2012 AJHG
Imputation programs: Mach, Minimac, Impute2

Comparison of methods

Each paper creates simulation data that makes their method look best
Differences in power between different approaches are usually modest, and which model is best depends on underlying nature of variation in data
Predictably, bidirectional tests (C-Alpha, SKAT, SKAT-O) are more powerful when variant effects are bidirectional; unidirectional tests are more powerful when variant effects are unidirectional.
All tests work in a regression framework except C-Alpha, but SKAT works in a regression framework and can be equivalent to C-Alpha.
Allelic structure in populations is a huge confounder and cannot be adequately controlled-for using principal components. Ability to control this confounding may be a factor in choice of test.

NHLBI exome sequencing project

Sequenced ~6700 exomes, about 2/3 European Americans and 1/3 African Americans
Included individuals from HeartGO, LungGO and WHISP.
Selected individuals from extremes of quantitative phenotypes (LDL, blood pressure), from disease endpoints (stroke, chronic obstructive pulmonary disease, early heart attack) and people with deeper phenotyping – selected from 220,000 individuals
One concern was that the extreme phenotype individuals were so far in the extremes you would start to get Mendelian-like traits. But that appears not to have happened.
90x depth on Illumina HiSeq and Genome Analyzers at Broad and University of Washington.
Multi-sample variant calling

Extensive data quality control needed before any analysis

Variant site removal with support vector machine. Using ~20 criteria, find sites to remove. They intentionally added duplicate samples. Found that concordance between duplicates was not great; removed those variants that
Variant call removal – depth < 10x or > 500x. High read depth can indicate the presence of a CNV. Removing the high depth sites improves the Ti/Tv ratio and the concordance between duplicates. Removing low read depth also helped improve concordance, even removing everything < 20x, but the 20x cutoff was deemed too great a sacrifice because it would have removed far more genotypes than the 10x cutoff.
Sex check. For 13 individuals, sex didn’t match – 6 were errors and the rest were Turner or Kleinfelter syndrome.
Duplicates and related samples check. Removed individuals related at third degree or greater.
Population stratification with principal components. Some individuals who self-identified as African American or European American clustered with the opposite group. Not clear how many were sample switches and how many were just self-identification not necessarily being identical to genetic identity. They removed these individuals.

Distribution of allele frequencies

Majority of called variants were singletons
On average each individual had ~100 variants that were singletons in this study and also unseen in other public data sources

Analysis

Analysis packages developed in R will soon be made public
All p values derived empirically
Primary analysis using Combined Multivariate and Collapsing (CMC)
Minor allele frequency filtering using MAFs computed from entire ESP dataset (6700 individuals) as opposed to only individuals with the given phenotype in a particular analysis. No functional filtering; she recommends against strict functional filtering, as you can get vastly different results with different tools, but you can do weighting based on PolyPhen score or similar.
African Americans and European Americans analyzed separately and results combined through a meta-analysis.
Used Bonferroni correction for 20,000 genes. Tests are pretty much independent from each other, so Bonferroni correction is appropriate
Analysis reproduced over 100 associations previously reported in GWAS
Also reported a few novel associations
One new association was rare variants in APOC3 associated with triglyceride levels. Total of about 25 individuals with rare variants out of ~3700 sequenced who were also phenotyped for triglycerides, p ~ 1e-6. However located in same region as APOA1 and several related genes. Coded each gene individually and re-ran CMC. Association appears to be driven by APOC3 and not by nearby genes
Validation using exome chip on ~3400 individuals through Women’s Health Initiative – got even stronger p values than from original exome sequencing study
Model uses a binary indicator for whether the individual has at least one rare variant in the gene
APOC3 and ABCA1 also found to be associated with HDLs. Variants in ABCA1 reduce HDLs (which is bad); there exist two Mendelian diseases in this gene: AR-Tangiers and AD-Familial HDL deficiency. 48 rare variant sites in African Americans in this gene, with p = 1e-4, and 54 rare variant sites in European Americans but no association, p = .15. Rare variants in APOC3 increase HDLs. These associations failed validation with exome chip individually (perhaps because not enough of the variants driving association were included on the chip) but did give p values on order of 1e-6 once imputation was used, so suggestive of a true association.

3883 exomes and phenotypes are available through dbGaP but you need to apply to get access to data, a painful process.

Calculating power to design sequencing studies – SimPOWER or RARE.

Speaker 2: Yun Li – Imputation methods in 1000 Genomes Phase 1

Yun Li is author of MaCH genotype imputation software.

Missing heritability – possibly due to gene-gene interactions [Zuk 2012] or phenotypic robustness [Queitsch 2012].

Imputation in next-generation sequencing data

Thunder – similar to MaCH
Trio-Caller – extension of Thunder to trio data
AbCD – for design of sequencing studies

Hidden markov model underlying MaCH

Rule of thumb: you need at least 10-20 copies of minor allele in your data in order to be able to impute haplotypes for it. Rarer variants cannot be imputed.

Broad uses GATK + Beagle pipeline for haplotyping

See Li 2012 for a review.

AbCD: Arbitrary Coverage Design – a tool for designing studies – evaluate tradeoff between sequencing depth and number of samples sequenced.

MaCH-Admix – an extension of MaCH to address admixed populations

Computational cost of MaCH is O(N²) where N = number of haplotypes in your sample

Speaker 3: Degui Zhi – de novo assembly

If you’re working on an organism that does not yet have a reference, you have to do de novo assembly, that’s how we get a reference genome. If your organism has a reference, then why do assembly? (1) Organism may have sequence not present in the reference, and (2) regions in reference may be divergent.

New frontiers in assembly:

Variant-aware mapping / assembly
Local re-assembly
Haplotype assemby
Metagenomics sequencing – you get a sample from soil, water or air with multiple different species of microorganisms in it, all of which have no reference sequence – all you have is fragmented reads with all the organisms mixed together.
Transcriptome assembly
NGS data compression like ReduceReads.

Overlap graph

Create a graph with each read as a node and each overlap as an edge
Repeat regions will have edges to many nodes
Finding minimal path through all nodes is the Hamiltonian path problem
A simplification is a ‘repeat graph’ where unitigs with same sequence are collapsed. Now you just need to find a path visiting every edge once. This is an Eulerian path problem (finding the Eulerian path in a de Bruijn graph), which is more tractable. See Pevzner 2001.
To identify all repeats you actually need to break up the reads into even smaller pieces of the genome. Advantages: (1) efficient representation of sequence by k-mers instead of reads, (2) explicit representation of repeats, and (3) search for overlaps is simplified because they are on the same edge.

Recent developments

Variant aware assemblers – Cortex [Iqbal 2012], Fermi [Li 2012].
Can even handle significant indels, structural variation
Cannot take advantage of paired-end information
Doesn’t handle sequencing errors that well
Computational complexity explodes with large numbers of individuals

Reference-guided assembly

A mixed alternative to pure mapping-to-reference or pure de novo assembly. Use reference as a guide but use assembly for local corrections and haplotype discovery.
Haplotype assembly is difficult because variants are often too far apart for reads to span, and sequencing errors create false variants. Hence a probabilistic approach is needed. Also to integrate haplotype information from reads and from population.
In 1000G data, about 5-6% of reads span more than one variant site (called ‘jumping reads’). 33-70% (YRI or CEU respectively) of polymorphic sites are covered by ‘jumping reads’. [Zhi 2012]