Exome sequencing and analysis

Read with caution!

This post was written during early stages of trying to understand a complex scientific problem, and we didn't get everything right. The original author no longer endorses the content of this post. It is being left online for historical reasons, but read at your own risk.

This post will aim to answer the following questions:

How is exome sequencing accomplished?
Why is exome sequencing still a commonly used alternative to increasingly cheap whole genome sequencing?
What are the particular challenges of working with exome data and how can these be mitigated?
What can be achieved with exome data?

How is exome sequencing accomplished?

The most popular method is in-solution capture, also known as liquid hybridization. This technique relies on the fact that we already have a pretty good idea of which parts of the genome are exonic. A few companies (Agilent, Roche NimbleGen; see Sulonen 2011) offer solution-based exome capture kits containing probes complementary to subsequences of (almost) every known exonic region of the human genome. That’s right, we’re going in with an a priori idea of which regions are exonic, and we’re only going to capture those for sequencing. So this technique would never, for instance, discover new exons not previously known to be expressed.

Moreover, it’s not perfectly known or agreed which regions are exonic, and the answer also depends on whether you want only protein-coding sequences or also miRNA sequences. Teer & Mullikin (2010) provide the following discussion of this issue:

The term ‘whole human exome’ can be defined in many different ways. Two companies offer commercial kits for exome capture and have targeted the human consensus coding sequence regions (28), which cover ∼29 Mb of the genome. This is a more conservative set of genes and includes only protein-coding sequence. It covers ∼83% of the RefSeq coding exon bases. Both companies also target selected miRNAs, and extra regions can sometimes be added (Agilent). Although still a subset of the genome, exome capture allows the investigation of a more complete set of human genes with the cost and time advantages of genome capture.

The probes are added to a solution of fragmented genomic DNA, where they hybridize to their complementary sequences. The probes themselves might be RNA or DNA (again see Sulonen 2011). The probes are biotinylated, and biotin binds to streptavidin-coated beads, so once the probes have hybridized to the genomic DNA fragments of interest, you add the beads, wash the bead-bait complexes to remove the undesired (non-exonic) DNA, and then elute the captured DNA. Teer & Mullikin (2010) offer the best diagram I’ve seen of this process.

At this point, you have a solution enriched for the exonic DNA you’re interested in sequencing. Now you just sequence it the same way you would for whole-genome sequencing, most likely using the Illumina method, which Illumina summarizes as follows:

TruSeq technology supports massively parallel sequencing using a proprietary reversible terminator-based method that enables detection of single bases as they are incorporated into growing DNA strands. A fluorescently-labeled terminator is imaged as each dNTP is added and then cleaved to allow incorporation of the next base. Since all four reversible terminator-bound dNTPs are present during each sequencing cycle, natural competition minimizes incorporation bias. The end result is true base-by-base sequencing that enables the industry’s most accurate data for a broad range of applications.

Illumina’s video explains this a lot more clearly, particularly the part from 5:13 to 6:06.

An important thing to keep track of is your coverage, or read depth: the number of reads covering each sequenced base. Typical numbers seem to range from 20x to 80x (for instance, 23andMe’s personal exome product boasted 80x coverage; update: it is no longer offered). But the figure won’t be uniform across the exome, so you end up with sentences like this one: “Exome capture produced 1x coverage of 97% of the exome; 30x coverage was 74%”.
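To make the "Nx coverage of Y% of the exome" phrasing concrete, here is a minimal sketch. It assumes you already have a per-base depth for every targeted base; the function names and the toy depths are invented for illustration and do not come from any real dataset or tool.

```python
def breadth_at_depth(depths, min_depth):
    """Fraction of target bases covered by at least `min_depth` reads."""
    if not depths:
        raise ValueError("no target bases supplied")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def mean_depth(depths):
    """Average number of reads per target base."""
    return sum(depths) / len(depths)


# Toy per-base depths for ten target bases (made up for illustration).
toy_depths = [0, 3, 12, 25, 31, 40, 44, 58, 61, 90]

print(f"mean depth: {mean_depth(toy_depths):.1f}x")
for threshold in (1, 20, 30):
    share = breadth_at_depth(toy_depths, threshold)
    print(f">= {threshold}x: {share:.0%} of target bases")
```

The point of reporting several thresholds is that a respectable mean depth can hide target bases with few or no reads, which is why the mean alone says little about how much of the exome you can actually call.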

By the way, there do exist other methods for exome capture. Teer & Mullikin (2010) discuss some of them, as does Wikipedia, but word on the street is that no one uses the other methods anymore; as of August 2012, it’s all in-solution capture.

Why is exome sequencing still a commonly used alternative to increasingly cheap whole genome sequencing?

tl;dr: It’s still cheaper, both in terms of direct cost to sequence, and in terms of your time and computing power to deal with the data.

Genome Biology asked experts Biesecker, Shianna and Mullikin to discuss this among other issues. As of September 2011, all three stated that whole exome sequencing was still a lot cheaper (~1/6 the cost) than whole genome sequencing, and that cost was still a limiting factor in how much sequence data they could generate. They also said you wait longer for a whole genome than for an exome, and you then have far more data to store and process (the exome is about 30 megabases; the haploid genome is about 3.2 gigabases).
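As a rough back-of-the-envelope check on those sizes, the arithmetic can be sketched as follows. The two constants are the approximate figures quoted above, not precise measurements.

```python
EXOME_BP = 30e6     # ~30 megabases, the approximate figure quoted above
GENOME_BP = 3.2e9   # ~3.2 gigabases, haploid, the approximate figure quoted above

fraction = EXOME_BP / GENOME_BP
fold = GENOME_BP / EXOME_BP

print(f"exome as a share of the genome: {fraction:.2%}")
print(f"genome is roughly {fold:.0f} times larger than the exome")
```

Note that the size ratio (roughly a hundredfold) is much larger than the quoted cost ratio (about sixfold), so sequencing cost evidently does not scale with target size alone.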

Meanwhile, while we are no longer so ignorant as to call non-exonic DNA “junk DNA”, the exome is still highly enriched for Stuff That Matters. To wit, Choi 2009 says:

Protein coding genes constitute only approximately 1% of the human genome but harbor 85% of the mutations with large effects on disease-related traits.

What are the particular challenges of working with exome data and how can these be mitigated?

Most obviously, the exome isn’t everything. Mutations in splice sites, promoter regions and other non-exonic regions matter too. The fact that exome sequencing misses non-coding regions is clearly a limitation, but that’s what exome sequencing is; this is sort of a “known unknown.” Perhaps more troubling is the fact that there are probably also some coding regions that we haven’t yet identified as coding regions, and so we’re missing them. Because this is an “unknown unknown,” it bothers me a bit more. Biesecker and Mullikin touch on this issue:

Also, our current understanding of the genome limits our exome interrogation – nucleotides in regions of the genome not currently recognized to be a gene will be missed by exome approaches.

The blog at Scientific American lists ten things exome sequencing doesn’t capture. This was a nice, concise introduction to the issue. I’d take a few things with a grain of salt: the list includes “gene-gene (epistatic) interactions”, but that’s a limitation of our scientific understanding, not of exome sequence data, and moreover, it’s precisely the sort of science that exome sequencing will help to advance. But the list was useful, and it did alert me to an important issue: exome sequencing (and whole genome sequencing, too) cannot reliably measure the length of CAG repeats in HTT. This is because standard reads from, say, an Illumina HiSeq machine are about 150bp long. To determine the length of the CAG repeat in HTT, you’d need a read that starts with about 20bp that you can uniquely align to HTT, then contains the entire CAG repeat region, then ends with at least 3 or 4 bases that are not CAG, to show you that the repeat region has ended. Huntington’s disease is caused by repeat lengths from about 40 on up, with some cases in the 100s. At the low end of 40, you could just barely fit the entire CAG repeat region into one read: 20 + 40*3 + 4 = 144 < 150. Much longer repeats won’t fit in one read.
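The read-length arithmetic above can be sketched in a few lines. The 20bp anchor, 4bp tail and 150bp read length are the rough figures from this post, not properties of any particular aligner, and the function names are made up for illustration.

```python
def bases_needed(n_repeats, anchor_bp=20, unit_bp=3, tail_bp=4):
    """Read length needed to span the anchor, the whole repeat tract, and the tail."""
    return anchor_bp + n_repeats * unit_bp + tail_bp


def max_spannable_repeats(read_length, anchor_bp=20, unit_bp=3, tail_bp=4):
    """Largest repeat count that fits in a single read of `read_length` bases."""
    usable = read_length - anchor_bp - tail_bp
    return max(usable, 0) // unit_bp


READ_LENGTH = 150  # approximate read length assumed in the text

for n in (40, 42, 43, 100):
    need = bases_needed(n)
    verdict = "fits" if need <= READ_LENGTH else "does not fit"
    print(f"{n} repeats need {need} bp: {verdict} in a {READ_LENGTH} bp read")

print("max repeats in one read:", max_spannable_repeats(READ_LENGTH))
```

Under these assumptions the ceiling sits only just above the disease threshold of about 40 repeats, so a single read can at best tell you that an allele is near the threshold, not how far beyond it a long expansion goes.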

You might think you could overcome this, if you had enough resolution, by measuring the quantity of all-CAG-repeat reads, but other genes besides HTT also have CAG repeat regions, so you would not know where those reads fit into the genome.

An added problem is that, just as CAG repeats have a tendency to lengthen due to slippage during replication in gamete production, particularly spermatogenesis (giving rise to anticipation), they also have a tendency to lengthen during PCR, which means you can’t accurately amplify HTT DNA.

I asked around and apparently the standard way that clinical genetic diagnostics of Huntington’s are done is to use restriction enzymes to cleave DNA immediately on either side of the expanded repeat region and then perform gel electrophoresis on the cleaved region and simply measure how long it is (based on how far it travels on the gel).

What can be achieved with exome data?

The web is littered with references to what seem to be the two most famous examples of successful applications of exome data: one in which exome sequencing was used to find the mutation responsible for a familial syndrome (Miller syndrome) that had already been characterized but whose cause was unknown, and one in which exome data was used as a diagnostic tool, scanning a patient’s exome for known deleterious mutations in order to identify the cause of the patient’s symptoms (revealed to be congenital chloride diarrhea).

1. Identifying the mutation responsible for a familial syndrome. Ng & Buckingham 2009 in Nature Genetics used exome data from just 4 cases and 8 controls to pinpoint the gene (DHODH) in which mutations are responsible for Miller syndrome. The filtering and modeling they did was remarkable in its elegance and power. Two of the four cases were siblings, so they reasoned that the two siblings must share the same causal gene and, because Miller syndrome is so rare, that the causal variants must not be observed in any of the controls nor in dbSNP129 (a database of SNPs, as it sounds like). They also hypothesized that inheritance would be recessive, which meant that each case had to carry two variant alleles in the same gene (either homozygous or compound heterozygous). These few filters alone reduced the pool of candidates to just 9 genes. Comparison to just 1 of the other unrelated cases reduced the pool to 1 gene, DHODH. Even with the Bonferroni correction, this result was highly statistically significant. The researchers then sequenced DHODH in several other Miller syndrome cases and found mutations there as well, thus validating their finding.

2. Diagnosing a patient by discovering that they have a mutation already known to cause a particular syndrome. Choi 2009 performed exome sequencing on a patient with an unidentified ailment. Due to consanguinity, the researchers hypothesized a homozygous recessive mutation. They found 2,405 homozygous variants, including 668 non-synonymous substitutions (i.e. missense), of which 29 were novel. Reasoning that mutations in highly conserved loci are more likely to cause disease, they ranked the novel missense variants by phyloP score (a measure of how highly conserved that locus is across species) and found 10 variants with a score above their chosen threshold of 2. It appears the final step was simply to look at these 10 and reason about which one made the most sense as an explanation for the patient’s symptoms. A mutation in SLC26A3 stood out because the locus was very highly conserved, the read quality was very good, and the gene was known to be related to congenital chloride-losing diarrhea, which would match the patient’s symptoms.
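Both filtering strategies above can be sketched in a few lines. This is a toy illustration and not either paper's actual pipeline: the gene names, variant IDs, scores and function names are all invented, and the recessive model is encoded here as "at least two novel variant alleles in the same gene" (a homozygous variant counts as two), which is an assumption of this sketch.

```python
from collections import defaultdict


def recessive_candidates(case_variants, known_variants, min_alleles=2):
    """Genes in which one case carries >= min_alleles novel variant alleles."""
    alleles_per_gene = defaultdict(int)
    for gene, variant_id, allele_count in case_variants:
        if variant_id in known_variants:  # seen in controls or a SNP database
            continue
        alleles_per_gene[gene] += allele_count
    return {g for g, n in alleles_per_gene.items() if n >= min_alleles}


def shared_candidates(cases, known_variants):
    """Genes that pass the recessive filter in every case."""
    if not cases:
        return set()
    gene_sets = [recessive_candidates(v, known_variants) for v in cases.values()]
    return set.intersection(*gene_sets)


def rank_by_conservation(variants, min_phylop=2.0):
    """Novel missense variants above a phyloP threshold, most conserved first."""
    kept = [v for v in variants
            if v["novel"] and v["effect"] == "missense" and v["phylop"] > min_phylop]
    return sorted(kept, key=lambda v: v["phylop"], reverse=True)


# Toy data, invented for illustration: (gene, variant_id, allele_count).
toy_known = {"v1", "v7"}
toy_cases = {
    "sibling_1": [("GENE_A", "v2", 1), ("GENE_A", "v3", 1),
                  ("GENE_B", "v4", 2), ("GENE_C", "v1", 2)],
    "sibling_2": [("GENE_A", "v2", 1), ("GENE_A", "v3", 1),
                  ("GENE_B", "v4", 2), ("GENE_D", "v5", 1)],
    "unrelated_case": [("GENE_A", "v6", 2), ("GENE_B", "v7", 2)],
}
print(sorted(shared_candidates(toy_cases, toy_known)))

toy_variants = [
    {"gene": "GENE_A", "effect": "missense", "novel": True, "phylop": 5.1},
    {"gene": "GENE_B", "effect": "missense", "novel": True, "phylop": 0.4},
    {"gene": "GENE_C", "effect": "synonymous", "novel": True, "phylop": 4.0},
    {"gene": "GENE_D", "effect": "missense", "novel": False, "phylop": 3.3},
    {"gene": "GENE_E", "effect": "missense", "novel": True, "phylop": 2.6},
]
print([v["gene"] for v in rank_by_conservation(toy_variants)])
```

In the toy data the two siblings alone leave two candidate genes, and adding the unrelated case narrows it to one, which mirrors the shape of the Miller syndrome argument. The second function only produces a shortlist; as in the Choi example, choosing among the survivors still takes human judgment about which gene plausibly explains the symptoms.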

A third application is the use of exome data for genome-wide association studies – in other words, searching the whole exome for variants correlated with a given phenotype. These types of studies will be the subject of my next post.