Calling repeat length polymorphisms with lobSTR

It is generally held to be quite tough to call repeat length polymorphisms from sequence data for several reasons: it’s hard to uniquely align repeat-rich elements to the reference genome, repeats may be longer than the read length, and pileup depth over GC-rich repeats will be depleted due to poor PCR amplification.

Enter lobSTR, created by Melissa Gymrek at the Whitehead Institute. The ‘STR’ stands for Short Tandem Repeat, which simply means 2-6 nucleotide repeat elements. This term is almost never used in the Huntington’s / trinucleotide repeat disease literature, but is the term of choice in law enforcement and forensics circles. The FBI has identified 13 STRs which are highly polymorphic between individuals but highly stable within individuals, meaning they’re ideal for ‘DNA fingerprinting’ – using these repeat lengths to create a DNA profile, “the probability that two unrelated individuals will have the same DNA profile purely by chance is less than 1 in 200 billion” [Dept of Justice 2004]. And lobSTR got some secondary press for its potential forensics applications.

But short tandem repeats also seem to be a good candidate for where some of the ‘missing heritability’ of human disease might lie. As Gymrek points out, repeat lengths mutate much more quickly than SNPs do, and reapet expansions are already known to be responsible for a large number of diseases [see Mirkin 2007 for a review of molecular mechanisms of repeat expansion and its consequences].

lobSTR’s ability to detect repeat length polymorphisms is largely limited by read length. In the original lobSTR paper [Gymrek 2012], Gymrek states that lobSTR only calls lengths for repeats that are completely spanned by a read. So if you’ve got 100b reads and you need 10b flanking each side of the repeat in order to uniquely align, you’d be limited to calling repeats that total less than 80bp. That’s ~26 trinucleotide repeats, less than the pathogenic threshold for HD and most other trinucleotide diseases (see Orr & Zoghbi 2007). However, in v2.0, she added support for using paired-end reads to uniquely align one repeat-rich read without much (or any?) flanking sequence as long as its mate aligns uniquely, and to call minimum repeat lengths based on reads that do not completely span the repeat tract (though this is still in beta). So I figured it was worth a try to see if lobSTR could detect the HTT CAG expansion in my HD exomes.

As background, I have 50 paired end 100bp exomes (49, really, since one turned out to be a duplicate) of HD patients who have been separately genotyped for repeat length in the HTT CAG tract (chr4:3076603-3076666). The hg19 reference genome has 21 CAG repeats (we think the population average is actually around 17), including the CAACAG that usually appears at the end of the repeat region. Each patient in my dataset has an expanded allele between 41 and 55 CAGs in length (i.e. 123 – 165bp of repeats), and a normal allele of between 14 and 32 repeats.

lobSTR was pretty easy to get started with using the binaries and indices provided on the download page. One caveat: after you wget, gunzip and tar -xvf the hg19 index, you’ll notice it has indices for a huge number of different 2-6nt repeats, mostly starting with A and a few starting with C. None starting with G or T, and even CAG, for example, is not in there. On further investigation it became clear that each repeat listed here represents all six different frame/strand combinations, so AGC is standing in for CAG=AGC=GCA=CTG=TGC=GCT.

For all the different sample ids in my dataset, denoted by {sid}, I ran the following code:

lobSTR --p1 /data/HD/dataset/human_hd_extreme_exome/fastq_unzipped/{sid}_1.fq --p2 /data/HD/dataset/human_hd_extreme_exome/fastq_unzipped/{sid}_2.fq -q --index-prefix /data/HD/dataset/lobSTR-index/index_trf_hg19_extend1000/lobSTR_ --out {sid} --max-diff-ref 500
python ~/bin/lobstr_v2.0.2_1.linux64/scripts/lobSTR_alignment_checks.py -f {sid}.aligned.tab --plot
allelotype --command classify --bam {sid}.aligned.bam --haploid chrX,chrY --noise_model ~/bin/lobstr_v2.0.2_1.linux64/models/illumina2 --out {sid} --strinfo ~/bin/lobstr_v2.0.2_1.linux64/data/strinfo.hg19.tab

Running time ranged from 0.5 to 2.5 hours (these are ~41x exomes with 15-27M read pairs total).

Next, to see what calls lobSTR had made for the HTT CAG tract, I grepped out the relevant line from each output file:

grep $'^chr4\t3076603' *.genotypes.tab

Results: for 41 out of the 50 exomes, lobSTR called an allelotype of NA/NA meaning it did not have enough information to make a call. For 2 exomes, lobSTR called the site homozygous reference (allelotype 0/0) and for 7 exomes it called a homozygous variant shorter than the reference (e.g. allelotype -15/-15). Of these 9 cases, the repeat length called was equal to the length of the normal allele that we have on file for that patient in only two cases, and within a few bases plus or minus for the other patients. The expanded allele was not called for any exomes, and all of the minimums given in column 16 (see page 15-16 of the v2.0 manual for table schema) were less than 0, meaning lobSTR had not detected the presence of an expanded allele of unknown length.

None of this is too surprising – lobSTR doesn’t claim to be able to call expansions longer than read length, and the use of partially overlapping reads is still in beta. Moreover the HTT repeat is particularly hard to call because reads are depleted there. In my exomes we have average 41x coverage of exons, but coverage at the edges of the CAG tract ranges from 2 – 10x across the different samples. We’re not totally sure why that is, though we have some ideas. First, the GC-rich nature of this repeat region makes it hard to amplify, and our exomes are pretty low quality – ~27% PCR duplicates on average, which suggests that they were amplified more than is ideal before sequencing. Second, it may just be that the exome capture probes for exon 1 of HTT are of poor quality and aren’t capturing enough molecules from this region. If we had unbiased whole genome data, which we don’t yet, we’d be able to examine whether that’s the case. Third, at the alignment level you have the difficulty of uniquely aligning reads from such a repetitive region, particularly since the CAG repeat tract is flanked on its immediate right by a polyproline CCG repeat.

In short, it’s no fault of lobSTR that it couldn’t make an accurate call of the allelotype at this locus. Still, it is tempting to think there might be more ways to glean repeat length information from sequence data.

First, most of my HD samples have one or two pure CAG reads in the HTT CAG tract, demonstrating that one allele has a repeat of at least 33 trinucleotide units. These reads have no flanking sequence at all but have a mate that aligned uniquely to some nearby stretch of sequence. As far as I can tell from the documentation and the C++ source on SourceForge, lobSTR is currently not taking advantage of these reads – it only uses the mate’s information to gain confidence in an alignment, not to entirely substitute for flanking sequence on the individual read itself. To test this, I tried aligning with --minflank 0 and lobSTR gave me an error: lobSTR: Error: invalid min flank length.

Second, I wonder whether combining lobSTR’s approach with insert size information might be fruitful. Insert sizes from paired end reads are used in some more generic indel callers (see UAB day 1 notes on SVs). While the HTT expanded allele is not long enough to cause insert sizes that are individually anomalous, the distribution of insert sizes could still be informative.

Third, some things may be telling by their absence. Given sufficient read depth around the repeat region, the absence of any read spanning the entire repeat region may suggest a length greater than 100bp, and the absence of a read pair spanning the repeat region may suggest a repeat length greater than the typical range of insert sizes. HTT is a bad example for this, but consider DMPK. In that gene, healthy alleles have 5-37 CTG repeats. That’s up to 111bp of repeat, so it might not be spanned by a single read but is very likely to be spanned by a pair. By contrast, pathogenic repeats in DMPK can be 1000 repeat units (3000bp) in length, where in conventional paired-end sequencing you’d be very unlikely to have a pair that spans the repeat tract (for instance, in my exomes the average insert size is 200bp and 99% of pairs are less than 500bp). Provided that you had enough read depth, the absence of such a spanning pair would be conspicuous.

Overall, my review of lobSTR is very positive: it’s easy to set up, runs fast, is well-documented and is a clever approach to solving a problem no one else has managed to do well (or at all).

Calling repeats longer than read length, though, is still a tough problem.