Read with caution!
This post was written during early stages of trying to understand a complex scientific problem, and we didn't get everything right. The original author no longer endorses the content of this post. It is being left online for historical reasons, but read at your own risk.
This post will explore some of the best and latest techniques for exploring statistical associations in exome data, based on what I read in five papers: two exploring exome data and GWAS in general– Kiezun/Garimella/Do/Stitziel 2012 (Nature Genetics) [supplemental information] and Pasaniuc 2011 (Nature Genetics)– and three dealing specifically with autism spectrum disorder (ASD): Sanders 2012 (Nature), O’Roak 2012 (Nature), Neale 2012 (Nature).
A common theme among all five of these papers is that they explore the techniques required for dealing with complex traits, which simply means traits determined by more than one gene. Fatal Familial Insomnia, for example, is a disease caused by a single genotype: PRNP D178N 129M. This one mutation causes an extreme phenotype and, as far as we know, complete penetrance. By contrast, scientists have found tens of genes in which mutations are associated with autism spectrum disorders, and it is now believed that each of these mutations contributes some multiplier of ASD risk, with many shades of variation in phenotype according to which gene(s) are affected and how. Remember how Ng & Buckingham 2009 found the gene involved in Miller syndrome, as blogged about here recently? That was a single well-characterized phenotype caused by mutations in a single gene, so it was possible to pinpoint that gene with just 4 cases and 8 controls. By contrast, with complex traits you can have many genes, each of which might be neither necessary nor sufficient for the phenotype, so you’ll need larger sample sizes to find these genes. Much larger. One of the major take-homes from Kiezun/Garimella/Do/Stitziel 2012 is the necessity of about 10,000 samples in order to achieve statistical power in complex trait studies.
First, let’s get some vocabulary out of the way. There are common variants and then there are rare variants, with some arbitrary threshold dividing the two, depending on who is talking, but both of these fall in the category of variants-that-have-ever-been-observed-before. By contrast, novel variants are those that someone is identifying for the first time, and de novo variants are those observed in a sample but in neither of the sample’s parents– i.e. a new mutation.
The new understanding seems to be that most of us carry tens of rare and/or de novo deleterious mutations in our genomes– Daniel MacArthur has been instrumental in annotating these variants and exploring their implications– see his blog post on the subject. The loss of function caused by these rare variants might underlie many complex traits (such as ASD). I found very helpful the way Kiezun/Garimella/Do/Stitziel 2012 phrased the importance (and prevalence) of these rare variants:
Not only are there many more rare variants than common ones, but sequencing additional samples continues to uncover additional rare variants. In fact, as sample size increases, the number of observed variants grows much faster than is predicted by the neutral model with constant population size. This relative excess of rare variants can, in part, be attributed to recent population expansion but is also likely to be due to purifying selection. As a consequence, rare variation is enriched for evolutionarily deleterious, and thus functional, variants. Additionally, the proportion of nonsynonymous variants is higher among rare than among common variants. Finally, among rare variants, missense variants predicted to be damaging are more prevalent than variants predicted to be benign.
So rare variants are important and we should study them. But because they’re rare, and especially for de novo mutations, it’s rather unlikely that two samples in your study will have the exact same variant. So good luck finding any statistical significance in a variant-level analysis. Instead you need to at least aggregate up to the gene level and maybe the pathway level. As a gross simplification, suppose you’ve got 1000 cases and 1000 controls. If this were a simple trait like FFI, you’d find 1 variant that is found in 1000 cases and 0 controls. Boom. P = 10 to the negative infinity. For a complex trait, maybe there are actually 1000 variants each of which is found in 1 case and 0 controls– good luck finding the ones that matter amidst all the chaff. But maybe those 1000 variants are enriched in one gene, so you find 1 gene in which each case has a different variant but all controls conform to the reference sequence. Now you’re back in business. But maybe things aren’t that easy– maybe the 1000 variants aren’t even in the same gene. Maybe the phenotypic similarity between your cases owes to the disabling of many different genes in the same pathway, with the final result being the same: such-and-such process doesn’t happen or such-and-such compound doesn’t get synthesized. So then you’ll find 1 gene which has a rare variant in 20 of your cases and another with a rare variant in 15 of your cases and so on. If you could aggregate up to pathway, then things would look much clearer: maybe all 1000 of your cases have one or more rare variant in a gene in Pathway X, while no such variants are observed in Pathway X in controls. Now it really looks like the trait you’re studying is caused by the disabling of Pathway X. (Unfortunately, aggregation to pathways is not quite as objective as to genes; a lot of assignment of genes to pathways is done by text mining of literature abstracts and such).
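The gene-level collapsing idea can be sketched in a few lines of Python: pool each gene’s rare-variant carriers across samples, then test case vs. control carrier counts. The sample IDs, gene names, and counts below are invented for illustration, and the one-sided Fisher’s exact test is computed directly from the hypergeometric distribution.

```python
from collections import defaultdict
from math import comb

# Toy data: (sample_id, gene, is_case). In this hypothetical dataset no two
# samples share a variant, but several distinct variants hit GENE_A.
variants = [
    ("case1", "GENE_A", True), ("case2", "GENE_A", True),
    ("case3", "GENE_A", True), ("case4", "GENE_B", True),
    ("ctrl1", "GENE_B", False),
]

def collapse_by_gene(variants):
    """Count distinct case and control carriers per gene (the 'collapsing' step)."""
    carriers = defaultdict(lambda: [set(), set()])  # gene -> [case ids, ctrl ids]
    for sample, gene, is_case in variants:
        carriers[gene][0 if is_case else 1].add(sample)
    return {g: (len(ca), len(co)) for g, (ca, co) in carriers.items()}

def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]]
    (a = case carriers, b = case non-carriers, etc.): the probability of
    seeing a or more case carriers under the hypergeometric null."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    return sum(comb(col1, x) * comb(n - col1, row1 - x)
               for x in range(a, min(row1, col1) + 1)) / comb(n, row1)

n_cases, n_controls = 100, 100
for gene, (ca, co) in collapse_by_gene(variants).items():
    p = fisher_right_tail(ca, n_cases - ca, co, n_controls - co)
    print(gene, ca, co, round(p, 4))
```

Even this tiny example shows the point: three singleton variants, none of which could be significant alone, collapse into a 3-vs-0 carrier count for GENE_A.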
Hopefully the simplified example above illustrates why aggregating by gene or pathway is important for rare variants. It would be awesome if cases and controls segregated so cleanly in real life, but since they don’t, there are special statistical methods for dealing with rare variants. Kiezun/Garimella/Do/Stitziel 2012 call out a few:
- Combined Multivariate and Collapsing (CMC) test pioneered by Li & Leal 2008
- Weighted Sum Statistic (WSS) due to Madsen & Browning 2009
- Kernel-Based Adaptive Cluster (KBAC) by Liu & Leal 2010
- Variable Threshold (VT) by Price 2010
I browsed these to get a sense of the issues involved and what these methods are all about. Li & Leal 2008 explain the motivation for their CMC test by contrasting it with simple univariate statistical tests on single variants:
One approach is the single-marker test, whereby individual variant sites within a gene are tested for an association with the disease outcome, with standard univariate statistical tests used (e.g., χ2 test, Fisher’s exact test, or Cochran Armitage test for trend) and with the family-wise error rate (FWER) controlled by a multiple-comparison correction (e.g., Bonferroni, permutation). Another approach is to perform a multiple-marker test, which tests multiple variant sites simultaneously with the use of multivariate methods, such as the Fisher product method Hotelling’s T2 test or logistic regression. Both single-marker and multiple-marker tests involve multiplicity (i.e., multiple-testing correction or multiple degrees of freedom), which will reduce power. On the other hand, collapsing methods, which combine information across multiple variant sites, could enrich the association signals and at the same time reduce the number of the test’s degrees of freedom. However, collapsing nonfunctional variants together with functional variants could adversely affect power. In order to unify the advantages of both collapsing and multiple-marker tests, the Combined Multivariate and Collapsing (CMC) method is developed.
The WSS is a way to test association for a group of variants rather than individual variants, KBAC “combines variant classification and association testing in a coherent framework”, and VT groups variants by gene or pathway. A common theme is the use of permutation testing to determine statistical significance. My statistics vocab has always been a bit behind the curve, so I had to look that up. Turns out it’s just a fancy term for adding up the probabilities of the observed assignment of labels and of all other imaginable relabelings that would represent equal or larger departures from the null hypothesis. The very essence of a p value. Here are some references on it: a book chapter, an example from image segmentation, an example of how to do it with R, and a presentation on specifically how to apply it to genetics. The Wikipedia articles on Exact tests and the anchor on permutation tests are also helpful.
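Here is a minimal permutation test sketch in Python (the per-sample rare-variant burden counts are made up for illustration): shuffle the case/control labels many times and ask how often the relabeled data shows a difference at least as extreme as the one actually observed.

```python
import random

def permutation_p(cases, controls, n_perm=10000, seed=0):
    """One-sided permutation p value: how often does a random relabeling of
    case/control status yield a difference in mean burden at least as large
    as the observed difference?"""
    rng = random.Random(seed)
    observed = sum(cases) / len(cases) - sum(controls) / len(controls)
    pool = cases + controls
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        perm = (sum(pool[:len(cases)]) / len(cases)
                - sum(pool[len(cases):]) / len(controls))
        if perm >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing so p is never zero

# Hypothetical per-sample rare-variant counts in one gene:
cases    = [2, 1, 3, 2, 2, 1, 3, 2]
controls = [1, 0, 1, 2, 0, 1, 1, 0]
print(permutation_p(cases, controls))
```

With only 16 samples the full set of relabelings could be enumerated exactly (making this an exact test); random shuffling is the approximation you fall back on when enumeration is infeasible.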
Kiezun/Garimella/Do/Stitziel 2012 provide further discussion of statistical significance and its meaning in relation to genetics. A simple approach which a lot of people seem to use is the Bonferroni correction– if you are going to ask 20,000 questions, then you’ll expect 200 of them to come back significant at the p < .01 threshold even if there is no real association, so an association is only really truly less than 1% likely to have emerged by random chance if its p value is less than .01 / 20,000, i.e. you have to test at p < 5e-7. But these authors point out that (for non-exact tests), you’re relying on asymptotic properties of a test statistic when you use it to derive a p value, and those asymptotic properties might not be met in a small sample size. In that case, the Bonferroni correction is too harsh and even a somewhat larger p value might indicate significance:
An important consideration for exome sequencing studies is selecting the significance threshold that accounts for multiple testing. A simple way of doing this is to adopt a Bonferroni correction for 20,000 independent tests (1 test for each gene), which, for an experiment-wide significance of 0.05, gives a P-value threshold of 2.5 × 10⁻⁶ per gene. However, such a threshold may be overly conservative, because it assumes that each tested gene has sufficient variation to achieve the asymptotic properties for the test statistic. For example, if only two individuals carry nonsynonymous variants in a given gene, the difference between cases and controls never exceeds two total observations—thus, the most significant P value that can be achieved is approximately 0.25, assuming that these two variants are independent. Therefore, unless the study is large, association P values will generally be less significant than expected under the null hypothesis of no association.
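The 0.25 in that quote has a simple back-of-envelope justification: in a balanced study, under the null each carrier is equally likely to be a case or a control, so even the most extreme configuration (every carrier is a case) retains probability (1/2)^k for k carriers. A toy calculation:

```python
def best_achievable_p(n_carriers, case_fraction=0.5):
    """Probability under the null that every carrier lands in the case group:
    the smallest p value any allocation of carriers can possibly yield in a
    balanced case/control study."""
    return case_fraction ** n_carriers

print(best_achievable_p(2))   # 0.25, matching the quote above
# About 19 carriers are needed before even the best case could beat the
# 2.5e-6 Bonferroni threshold from the quote:
print(best_achievable_p(19))
```

So for sparsely-hit genes, no result can clear a Bonferroni-corrected threshold no matter how perfectly cases and controls segregate.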
They propose using the i-stat instead, but I couldn’t find a good web reference for this.
As with Gayan’s HD study mentioned in a previous post, Kiezun/Garimella/Do/Stitziel 2012 also touch on the importance of controlling for population stratification to avoid erroneously significant associations. But whereas Gayan’s main concern was close familial relations, here the authors point out that with rare variants, you need to think about population stratification even for very broad ethnic categories:
Although excess of rare variant tests are fundamentally different from single variant tests, the possibility of stratification still exists, because different ancestries within a structured population sample (for example, African and European ancestry in African-Americans or northern and southern European ancestry in European-Americans) may have different allele frequency spectra due to their different demographic histories. For example, in an exome sequencing study in African-Americans in which disease cases have more African ancestry than controls, one expects to see an excess of rare variants in cases, because African chromosomes carry more rare variants.
On a more technical note, a few of these papers touch on removing PCR duplicates at the alignment stage. What’s up with that? I found a couple of references: a thorough explanation of the issue on SeqAnswers and a much more concise answer on biostars. Basically, when your library of DNA fragments is PCR-amplified during sample preparation, some fragments get copied and sequenced more than once. These duplicate reads aren’t independent observations of the genome, so counting them would artificially inflate the apparent read support for a variant, and they are therefore flagged and removed after alignment.
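To make the idea concrete, here’s a sketch of duplicate marking on made-up reads. Real tools (Picard’s MarkDuplicates, for instance) work on the 5′ unclipped alignment positions of the whole read pair; this toy version just groups single-end reads by (chromosome, start, strand) and keeps the best-quality read in each group.

```python
from collections import defaultdict

# Hypothetical aligned reads: (name, chrom, pos, strand, mean_base_quality).
reads = [
    ("r1", "chr1", 1000, "+", 35.1),
    ("r2", "chr1", 1000, "+", 38.0),  # same start/strand as r1: likely PCR duplicate
    ("r3", "chr1", 1000, "-", 36.2),  # same position, opposite strand: not a duplicate
    ("r4", "chr1", 2500, "+", 30.0),
]

def mark_duplicates(reads):
    """Group reads by (chrom, start, strand) and keep only the highest-quality
    read from each group, on the theory that the rest are PCR copies of the
    same original fragment."""
    groups = defaultdict(list)
    for read in reads:
        _, chrom, pos, strand, _ = read
        groups[(chrom, pos, strand)].append(read)
    return [max(g, key=lambda r: r[4]) for g in groups.values()]

kept = mark_duplicates(reads)
print(sorted(r[0] for r in kept))  # → ['r2', 'r3', 'r4']
```

For rare and de novo variant calling this matters a lot: ten duplicate reads supporting a variant are really only one observation.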
Pasaniuc 2011 addresses a very different issue. Pasaniuc’s focus is on common and kind-of-rare (1-5% of people) variants. For these variants, Pasaniuc’s surprising conclusion is that it makes more economic sense, in terms of dollars spent per unit of statistical power, to sequence a large number of individuals’ exomes at very low coverage (really low, like 0.1x) rather than a small number of exomes at high coverage (or using a SNP array). I find this pretty surprising, and Pasaniuc never gives a succinct one-liner to explain, intuitively, why this should turn out to be the case. If I had to paraphrase it myself, it seems the reason is basically just that since error rates in sequencing are small, the marginal dollar is better spent on checking whether another person has a given variant than confirming whether the first person you sequenced really truly has the variant. Anyway, this is really germane only to variants with > 1% frequency; for really rare and de novo variants, it’s pretty important to know whether each variant you think you observe is real, because you’re not going to observe it in a second person.
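A toy simulation captures the intuition (this is my own illustration, not Pasaniuc’s actual power calculation): spending a fixed total number of reads at one site on many individuals at 1x beats spending it on a few individuals at 30x, if the goal is estimating the population allele frequency, because the binding constraint is sampling people, not sequencing error.

```python
import random

def estimate_freq(true_freq, n_individuals, depth, error_rate=0.01, rng=None):
    """Sequence n diploid individuals at the given depth and naively estimate
    the allele frequency by pooling all reads (each read independently drawn
    from the individual's genotype, then flipped with a small error rate)."""
    rng = rng or random.Random(0)
    alt = total = 0
    for _ in range(n_individuals):
        genotype = sum(rng.random() < true_freq for _ in range(2))  # 0/1/2 alt copies
        for _ in range(depth):
            is_alt = rng.random() < genotype / 2
            if rng.random() < error_rate:
                is_alt = not is_alt  # sequencing error flips the base
            alt += is_alt
            total += 1
    return alt / total

# The same budget of 3000 reads at this locus, spent two ways:
rng = random.Random(42)
true_freq = 0.05
shallow = [estimate_freq(true_freq, 3000, 1, rng=rng) for _ in range(200)]
deep    = [estimate_freq(true_freq, 100, 30, rng=rng) for _ in range(200)]
mse = lambda xs: sum((x - true_freq) ** 2 for x in xs) / len(xs)
print(f"3000 people @ 1x : MSE {mse(shallow):.2e}")
print(f" 100 people @ 30x: MSE {mse(deep):.2e}")
```

The deep design wastes reads re-confirming genotypes that were already known well enough; the shallow design samples 30x as many chromosomes from the population. As noted above, the logic reverses for de novo variants, where confirming an individual genotype is the whole game.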
The other three papers I read all had the same basic pattern: sequence exomes from a group of cases and controls for ASD, and then try to figure out if some kinds of variants are enriched in cases. These were looking at really rare variants: no paper found the same variant twice, though sometimes they found different variants in the same gene in different probands. Clearly, when you’re dealing with rare variants, you better be sure they are real and not sequencing errors, and accordingly, quality control standards are quite high. Sanders 2012 had 8x+ coverage in 95% of bases and yet restricted the search for disease-associated variants to those (83%) with at least 20x coverage in all family members. (Sanders is looking at triads of two unaffected parents and one affected proband, to specifically look for de novo mutations).
Besides QCing the data itself, these authors apply a whole series of tests to check that their associations make sense. For instance, do the quantity and loci of observed de novo mutations match with a model of random mutations (as opposed to sequencing errors)? Neale 2012 and O’Roak 2012 both test this and get a yes. Sanders 2012 and O’Roak 2012 both check for association between the de novo mutation rate and parental age (answer: yes, there is an association); Sanders says this is to “determine whether factors other than diagnosis of ASD could explain our findings”. I suppose the argument someone could make is that parental age determines both mutation rate and ASD status, rather than de novo mutations being causally associated with ASD. Sanders re-runs the analysis controlling for parental age and still finds the results to be significant. Neale 2012 specifically finds that the mutation rates are the same in probands and unaffected siblings but that the probands are enriched for non-synonymous variants. Sanders 2012 also finds that non-synonymous de novo mutations are elevated in cases vs. controls.
A theme of all these papers is that not all variants are created equal. Obviously, non-synonymous variants should matter more (i.e. larger/any effect size, more likely a priori to be causally associated with disease) than synonymous ones; conserved splice sites matter more than not-very-conserved splice sites (see spliceosomal introns on Wikipedia); nonsense is probably a big deal; and there are a whole host of bioinformatics prediction engines to guess, based on phylogenetic conservation and amino acid similarity (and other factors?) how much a given substitution or indel ought to matter: PolyPhen2, SIFT, GERP, PhyloP and Grantham Score (can’t find a good link for this).
The authors all look at the connectedness of the genes disrupted by the de novo variants they find, because finding 20 variants in 20 genes in 20 cases is more meaningful if you can show that those genes are all part of the same pathway. As Neale puts it:
We therefore posed two questions of the group of genes harbouring de novo functional mutations: do the protein products of these genes interact with each other more than expected, and are they unusually enriched in, or connected to, previous curated lists of ASD-implicated genes?
Neale then checked for protein-protein interactions using DAPPLE (a Broad Institute tool based on literature text mining: “DAPPLE looks for significant physical connectivity among proteins encoded for by genes in loci associated to disease according to protein-protein interactions reported in the literature”) and ASD-implicated status using “manually-curated lists of genes associated with high risk for ASD”. Neale found that the genes disrupted by de novo variants in the ASD cases were more interconnected than would be expected by chance.
So basically Neale does this: first, objectively look at what variants seem associated with ASD, then back this up by showing how, in the aggregate, the findings are consistent with literature. By contrast, O’Roak’s approach seems to be: generate a list of associated variants, then cull them based (partly) on literature: “Among the de novo events, we identified 62 top ASD risk contributing mutations based on the deleteriousness of the mutations, functional evidence, or previous studies”. I like Neale’s approach better– the fact that the genes you find are, in the aggregate, enriched for genes already found in the literature helps to validate that you’re on the right track, while staying relatively objective as you don’t filter your findings for specific genes already found in literature (of course there’s still higher-level filtering at work: if you hadn’t replicated at least some findings from literature you’d be less likely to get published).
Sanders narrows down the list of candidate genes partly by looking specifically for those genes where variants were found in more than one case: “we hypothesized that estimating the probability of observing multiple independent de novo SNVs in the same gene in unrelated individuals would provide a more powerful statistical approach to identifying ASD-risk genes than the alternative of comparing mutation counts in affected versus unaffected individuals.” This is basically the same gene- and pathway-level aggregation discussed by Kiezun/Garimella/Do/Stitziel 2012 and seems to add a lot of power for rare variants.
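Sanders’ multiple-hit logic is easy to sanity-check with a back-of-envelope binomial model. The uniform gene model below is my simplification (the real analysis weights genes by size and mutability): how likely is a specific gene, or any gene, to collect two or more of a study’s de novo events by chance?

```python
def p_gene_hit_twice(n_mutations, n_genes=20000):
    """P(a specific gene receives >= 2 of n independent de novo mutations)
    if each mutation lands in it with probability 1/n_genes. Exact binomial:
    one minus the probabilities of zero hits and exactly one hit."""
    p = 1 / n_genes
    p0 = (1 - p) ** n_mutations
    p1 = n_mutations * p * (1 - p) ** (n_mutations - 1)
    return 1 - p0 - p1

def expected_recurrent_genes(n_mutations, n_genes=20000):
    """Expected number of genes hit twice or more by chance alone."""
    return n_genes * p_gene_hit_twice(n_mutations, n_genes)

# With ~250 de novo events (roughly O'Roak's count) spread over 20,000 genes,
# somewhere around 1.5 chance double-hits are expected study-wide. So a
# recurrently hit gene is suggestive, but it takes a proper null model (as
# Sanders builds) to call any particular double-hit significant.
print(p_gene_hit_twice(250))
print(expected_recurrent_genes(250))
```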
All these papers are very recent, so I am assuming that this pretty much gives a cross-section of the state of the art today for GWAS dealing with rare variants. What is impressive is that with relatively small sample sizes (number of samples was in the hundreds in each of these papers) the authors do manage to find publishable associations between particular disrupted genes and phenotype. Kiezun/Garimella/Do/Stitziel 2012 state that you need about 10,000 samples to get sufficient statistical power for GWAS on complex traits, yet here these authors have drawn meaningful conclusions from a few hundred samples. How so? A lot of the magic here seems to lie in who the samples are. If you have affected cases each with two unaffected parents, then you can isolate the de novo variants, which are very few in number– like on the order of 1 per person rather than tens of thousands per person. Then you can further filter for things you think a priori should matter. O’Roak, for instance, finds 248 de novo events in 189 probands (average 1.3 per person), and then throws out the synonymous changes, narrowing them down to 181 variants, then to 120 predicted to be severe based on “sequence conservation and/or biochemical properties” (those prediction engines I mentioned earlier). That’s quite a small search space compared to the hundreds of thousands of variants, in total, that must have been present in these subjects. Looking for these sorts of “tricks” that reduce dimensionality while remaining faithful to the biology of the disease you’re looking at seems to be key to getting a lot of power out of small samples.
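O’Roak’s funnel amounts to a simple filter pipeline. The variant records and the severity flag below are hypothetical stand-ins, with counts chosen only to match the numbers quoted above (248 → 181 → 120):

```python
# Hypothetical de novo events, shaped to reproduce O'Roak's counts: 67
# synonymous events, and 181 non-synonymous of which 120 are predicted severe.
variants = (
    [{"effect": "synonymous", "severe": False}] * 67 +
    [{"effect": "missense", "severe": True}] * 120 +
    [{"effect": "missense", "severe": False}] * 61
)

nonsyn = [v for v in variants if v["effect"] != "synonymous"]  # drop synonymous
severe = [v for v in nonsyn if v["severe"]]  # keep predicted-severe only
print(len(variants), len(nonsyn), len(severe))  # → 248 181 120
```

Each stage of the funnel is a cheap a-priori filter that shrinks the search space before any statistics are run, which is exactly the dimensionality-reduction trick described above.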