How to identify a disease-associated repeat expansion

My boss likes to say that repeat regions are the one remaining genetic “black hole” in our era of plentiful genetic information. Common SNPs and copy number variations can be genotyped with specialized chips for on the order of $50 per sample, and next-gen sequencing, which runs more in the $500 – $5000 range depending on your needs, can be used to call rarer SNPs, small indels, copy number variations, and chromosomal translocations and inversions. Yet there still isn’t any really good, affordable, genome-wide way of calling repeat expansions. Many repeats are too long to be definitively called from 100bp or even 200bp reads, and even if they could be called, repeat regions of the genome tend to be hugely underrepresented in sequence data, owing both to PCR amplification difficulties and to the impossibility of uniquely aligning those reads to a single genomic position.

Update 2013-01-07: lobSTR is a great tool for calling repeat length polymorphisms, though it doesn’t do well with repeats that are longer than the read length – see this post.
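To make the read-length problem concrete, here is a minimal Python sketch (my own illustration, not how lobSTR or any other caller actually works). It finds the longest perfect tandem run of a motif in a read and checks whether the read has non-repeat sequence on both sides of it. The function names, the 10-base flank threshold, and the toy reads are all assumptions.

```python
import re

def longest_motif_run(read, motif):
    """Return (repeat_count, start, end) for the longest perfect tandem run of motif."""
    best = (0, 0, 0)
    # (?:MOTIF)+ matches one or more back-to-back copies of the motif
    for m in re.finditer(f"(?:{re.escape(motif)})+", read):
        count = (m.end() - m.start()) // len(motif)
        if count > best[0]:
            best = (count, m.start(), m.end())
    return best

def read_spans_repeat(read, motif, min_flank=10):
    """True only if the run has at least min_flank non-repeat bases on each side."""
    count, start, end = longest_motif_run(read, motif)
    if count == 0:
        return False
    return start >= min_flank and len(read) - end >= min_flank

# Toy reads with made-up flanking sequence
spanning_read = "ACGTACGTAC" + "CAG" * 20 + "TTGACCTGAA"  # 80 bases, repeat fully inside
truncated_read = "ACGTACGTAC" + "CAG" * 30                # 100 bases, repeat runs off the end

print(longest_motif_run(spanning_read, "CAG"))   # (20, 10, 70)
print(read_spans_repeat(spanning_read, "CAG"))   # True
print(read_spans_repeat(truncated_read, "CAG"))  # False
```

In the second read, 30 repeats is only a lower bound, since the read ends before the repeat does. A 100bp read can never fully span anything much longer than about 25 CAG units once you leave room for flanks.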

To this day, repeat expansions are still genotyped using customized assays that target one particular region. For instance, HTT genotyping is usually done by amplifying with PCR primers for the regions adjacent to the CAG tract and then measuring the length of the resulting PCR product. This approach works just fine for genotyping known at-risk family members or symptomatic individuals, where the prior probability of a repeat expansion is high. But this approach just doesn’t fit on a chip and therefore it doesn’t scale when you want unbiased genotyping of all the potentially pathogenic repeat regions in the genome. No wonder Counsyl’s ‘universal genetic test’ chip for pre-pregnancy carrier screening doesn’t include any trinucleotide repeat disorders (Counsyl does offer a test for Fragile X, but only on special request; Natera’s Fragile X test also appears to be a separate protocol from its Natera One product).
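The arithmetic behind sizing a PCR product is simple, and a short Python sketch shows it. This is only an illustration: the 80bp of non-repeat sequence in the amplicon and the product sizes are invented numbers, not the parameters of any real HTT assay, and the function name is my own.

```python
def repeat_count_from_amplicon(amplicon_bp, flank_bp, unit_bp=3):
    """Estimate the number of repeat units from a measured PCR product length.

    flank_bp is the total non-repeat sequence in the amplicon (primers plus
    whatever lies between each primer and the repeat tract).
    """
    repeat_bp = amplicon_bp - flank_bp
    if repeat_bp < 0:
        raise ValueError("amplicon is shorter than its flanking sequence")
    return repeat_bp / unit_bp

# Hypothetical assay with 80bp of non-repeat sequence in the product
print(repeat_count_from_amplicon(140, 80))  # 20.0 repeat units
print(repeat_count_from_amplicon(206, 80))  # 42.0 repeat units
```

A non-integer result would hint at a sizing error or an interruption in the repeat tract. The method also only works for one locus at a time, because the primers and the flank length are specific to that locus.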

This also means that there has never been an equivalent of GWAS for repeat expansions. Instead, repeat expansions tend to be discovered through painstaking analysis of regions already identified as significant based on SNPs that have been found to belong to a disease haplotype in family-based association studies (TDTs). Consider the history of HD: starting as early as 1983, a series of family-based association and recombination studies associated HD with chromosome 4 and eventually narrowed it down to 4p16.3 via the association of SNPs on the disease haplotype. It was 10 years before MacDonald 1993 was able to identify the repeat expansion itself as the cause of the disease.

For many diseases, this saga is ongoing (and surely for some it’s yet to start). I just read about the story of ALS-FTD, a combined disease phenotype with motor neuron degeneration (that’s ALS) and personality/behavioral change (that’s FTD). A series of family-based studies over 2006-2011 identified chromosome 9p21 as the culprit region, until two independent papers published simultaneously in October 2011 announced that a GGGGCC hexanucleotide expansion is the causative mutation [DeJesus-Hernandez 2011 and Renton 2011].

This is interesting for several reasons.

First, this hexanucleotide expansion is noncoding. It’s between exons 1a and 1b of C9ORF72, a not-yet-characterized gene, meaning that for this gene’s transcript 1 (which starts from exon 1b) this repeat expansion is in the promoter region, while in transcripts 2 and 3 (which start from exon 1a), it’s in an intron. How does such a mutation cause disease? DeJesus-Hernandez 2011 shows (1) a total loss of transcript 1 in patients with the mutation, and (2) the appearance of little GGGGCC-rich clumps of RNA in the nucleus (“nuclear RNA foci”). So haploinsufficiency of that transcript variant and RNA toxicity could both be disease mechanisms. There are already plenty of other diseases known to be caused by noncoding repeat expansions, some specifically by loss of expression (e.g. Fragile X) or by toxic amounts of repeat-containing RNA retained in the nucleus (possibly myotonic dystrophy, see Orr & Zoghbi 2007 for a review). But C9ORF72 is another reminder that we shouldn’t be too cavalier about filtering for only exonic mutations when we look for disease variants.

Another interesting aspect of the story is how the researchers were able to identify the repeat, given that repeats are so difficult to genotype. There are two questions here: first, how did they even notice that a repeat might exist, and second, how did they confirm it and measure its length?

As to the first question, DeJesus-Hernandez 2011 states that “In the process of sequencing the non-coding region of C9ORF72, we detected a polymorphic GGGGCC hexanucleotide repeat…” Whether this “sequencing” was next-gen or targeted sequencing is not stated. Renton 2011 gives a bit more detail: “We undertook massively parallel, next-generation, deep re-sequencing of the chromosome 9p21 region…” They then ran a BWA/Picard/GATK pipeline to do alignment and call variants (probably not so different from my exome pipeline, minus the exome-specific steps), filtered for novel (never-before-seen) variants, and looked at those manually:

“sequence data revealed eight novel variants within the 232kb block of linkage disequilibrium containing the previously identified association signal that were not described as polymorphisms in either the 1000 Genomes (April 2009 release) or the dbSNP (build 32) online databases. Six of these variants were located within a 30 base pair (bp) region. When the individual sequence reads within this region were examined and manually re-aligned, they indicated the presence of a hexanucleotide repeat expansion GGGGCC located 63bp centromeric to the first exon of the long transcript of C9ORF72”

Just as I’d have expected, GATK wasn’t able to cleanly call the presence of a repeat: it could tell that something was going on in that region but it took a bit of human intelligence to put the pieces together. That seems to be where we’re still at today with detection of repeats in next-gen sequence data.
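The “filter for novel variants, then notice that several of them pile up in a tiny window” step is easy to sketch in Python. To be clear, this is my own toy version and not Renton’s actual pipeline: the function names, the coordinates, and the threshold of three variants are all made up, chosen only so that the example echoes the “eight novel variants, six within 30bp” pattern from the paper.

```python
def novel_variants(called_positions, known_positions):
    """Keep only the called positions that are absent from the known-variant set."""
    return sorted(p for p in set(called_positions) if p not in known_positions)

def dense_clusters(positions, window=30, min_count=3):
    """Return groups of at least min_count positions that fit within window bp."""
    positions = sorted(positions)
    clusters = []
    i = 0
    while i < len(positions):
        j = i
        # Extend the group while the next position is within window of the first
        while j + 1 < len(positions) and positions[j + 1] - positions[i] <= window:
            j += 1
        if j - i + 1 >= min_count:
            clusters.append(positions[i:j + 1])
            i = j + 1
        else:
            i += 1
    return clusters

# Made-up coordinates on a single chromosome
called = [1000, 5000, 5004, 5010, 5011, 5017, 5025, 9000, 12000]
known = {1000}  # stand-in for a dbSNP / 1000 Genomes lookup

novel = novel_variants(called, known)
print(len(novel))             # 8
print(dense_clusters(novel))  # [[5000, 5004, 5010, 5011, 5017, 5025]]
```

A cluster like this doesn’t tell you that a repeat is there, only that the reads in that window deserve a manual look, which is exactly the human step the authors describe.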

For both groups, even once they had noticed that a repeat was present at that location, validating and measuring it was a challenge, and they used several different protocols. First, DeJesus-Hernandez 2011 tried fluorescent fragment-length analysis, which I infer requires PCR-amplifying the desired region using primers specific to sequences upstream and downstream of the targeted region. They found that “All affected individuals appeared homozygous in this assay, and affected children appeared not to inherit an allele from the affected parent” – suggesting that the disease allele was not getting detected at all because it was too long to amplify with PCR. So both they and Renton instead relied on a repeat-primed PCR method, which involves a forward primer unique to a sequence near the repeat and a reverse primer composed of several repeat units, which can bind anywhere in the repeat region, thus creating amplicons of varying sizes. The reverse primer is used in lower concentrations and is exhausted in a few cycles, after which an anchor primer takes over as the reverse end starting point. (Of the two papers, Renton gives the clearer explanation of this procedure; see ‘Repeat-primed PCR’ under Experimental Procedures.) As you can guess, this method doesn’t give you an exact answer on how long the repeat is, just a lower bound – the genomic repeat region is at least as long as the longest PCR amplicon. Renton was able to detect that the disease alleles had at least 30 repeats, compared to <20 in controls, but didn’t see any amplicons longer than 71 repeats. DeJesus-Hernandez followed up on the PCR with a Southern blot, which suggested that the disease alleles had 700 – 1600 repeats. Neither group was able to positively identify the exact repeat length of any of the disease alleles, and in at least one patient, two expanded repeat lengths were observed, suggesting somatic instability (i.e. the repeat expanding over time in some adult cells).
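Why repeat-primed PCR only gives a lower bound can be shown with a toy model in Python. This is a deliberate simplification of my own: it pretends the reverse primer can land on any repeat unit, ignores the primer’s own length and the anchor primer, and imposes a hypothetical size ceiling on what the assay can resolve. The 100bp flank and the 526bp ceiling are invented numbers, the latter picked so the output lines up with the 71-repeat figure above.

```python
def rp_pcr_amplicons(true_repeats, flank_bp, unit_bp=6, max_amplicon_bp=None):
    """Toy ladder of amplicon sizes: one product per repeat unit the primer can land on."""
    sizes = [flank_bp + k * unit_bp for k in range(1, true_repeats + 1)]
    if max_amplicon_bp is not None:
        # Products longer than the ceiling are assumed to go undetected
        sizes = [s for s in sizes if s <= max_amplicon_bp]
    return sizes

def min_repeats_from_ladder(sizes, flank_bp, unit_bp=6):
    """The longest observed amplicon gives a lower bound on the repeat count."""
    return (max(sizes) - flank_bp) // unit_bp

# Hypothetical assay: 100bp flank, products above 526bp not resolved
control = rp_pcr_amplicons(8, 100, max_amplicon_bp=526)
expanded = rp_pcr_amplicons(1000, 100, max_amplicon_bp=526)

print(min_repeats_from_ladder(control, 100))   # 8
print(min_repeats_from_ladder(expanded, 100))  # 71
```

The control allele is measured exactly because its whole ladder fits under the ceiling, while the 1000-repeat allele looks identical to any other allele of 71 repeats or more. That is why a Southern blot was still needed to get at the real size range.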

From what the wet lab people tell me, these are pretty difficult and labor-intensive protocols to execute correctly. Genotyping a repeat length can be difficult even at a single targeted site where you already know you’re looking for a mutation.

It’s not clear how soon technological advances will make all this easier. One possibility is PacBio’s single molecule sequencing, which according to their brochure gives average read lengths around 3,000 bases; it’s already been used for genotyping Fragile X repeat length, which in the full-blown disease is more than 200 trinucleotide repeats (i.e. more than 600bp).