Testing for repeat length association with phenotype using lobSTR

In yesterday’s post I got acquainted with lobSTR, an innovative new tool for calling repeat length polymorphisms from next-generation sequence data. lobSTR is designed primarily to call repeats that are shorter than the read length and therefore can be completely spanned by a read, with enough flanking sequence on each side to uniquely align the read to the reference genome. However the v2.0 release of lobSTR makes two big improvements: first, it takes into account reads that only partially span a repeat tract, and second, it can run on paired-end reads and use a read’s mate’s alignment to gain confidence in its mapping. Features like these may ultimately help lobSTR to be able to call repeat expansions that are longer than the length of a sequencing read. But as of right now, it looks like lobSTR still struggles with those longer repeats. In yesterday’s post I tested lobSTR on my Huntington’s Disease exomes (which all have expanded CAG repeats of 123-165bp, whereas my reads are only 100bp) and it was unable to call the repeat expansion in any of them.

But I’m still very interested in what lobSTR can do for my HD research by calling repeat expansions elsewhere in the genome. As documented in this backgrounder on HD age of onset, residual age of onset in HD (after controlling for HTT CAG repeat length) is believed to be at least 40% heritable, yet after a decade of research, the HD research community has not confidently identified any specific genetic variations associated with residual age of onset.

You can file this problem under the broader heading of ‘missing heritability’. For many diseases we think (based on family associations) that a phenotype is highly heritable, yet all the common SNPs identified through GWAS put together explain very little of the heritability. Over the last few years, people have been looking for rare SNPs (through exome sequencing or whole genome sequencing), structural variations and copy number variations (through whole genome sequencing or array-CGH), epigenetic modifications (through bisulfite-seq, MeDIP-seq, ChIP-seq, etc.), and gene-environment interactions to explain the missing heritability. But disappointingly, most of the heritability continues to be unexplained.

Because repeat length polymorphisms are so difficult to genotype, they are probably the least-explored type of variation that might be contributing to missing heritability. Yet given the known disease associations with repeat expansions, it’s totally plausible that these could be driving heritability of some phenotypes. In particular, HD residual age of onset could plausibly be associated with repeat expansions in other genes (other than HTT) if some of the disease phenotype is mediated by polyQ polar zippers or if RNA toxicity makes any contribution (see HD reading list). I’m excited about lobSTR because it offers the possibility of testing for such associations on a genome-wide or exome-wide level using sequence data.

I have 50 phenotypically extreme HD exomes (people with very early or very late residual age of onset) and my goal this morning was to call repeat length polymorphisms in these individuals and test for association of these repeat lengths with residual age of onset.

Here’s what I did.

step 1: call repeat lengths with lobSTR

As documented yesterday, I ran lobSTR on all of my exomes:

lobSTR --p1 /data/HD/dataset/human_hd_extreme_exome/fastq_unzipped/{sid}_1.fq --p2 /data/HD/dataset/human_hd_extreme_exome/fastq_unzipped/{sid}_2.fq -q --index-prefix /data/HD/dataset/lobSTR-index/index_trf_hg19_extend1000/lobSTR_ --out {sid} --max-diff-ref 500
python ~/bin/lobstr_v2.0.2_1.linux64/scripts/lobSTR_alignment_checks.py -f {sid}.aligned.tab --plot
allelotype --command classify --bam {sid}.aligned.bam --haploid chrX,chrY --noise_model ~/bin/lobstr_v2.0.2_1.linux64/models/illumina2 --out {sid} --strinfo ~/bin/lobstr_v2.0.2_1.linux64/data/strinfo.hg19.tab

step 2: QC with lobSTR

lobSTR also comes with some handy Python scripts for QC and descriptive statistics. I used these to spot check the results from a few exomes, such as:

python ~/bin/lobstr_v2.0.2_1.linux64/scripts/lobSTR_alignment_checks.py -f 1820.aligned.tab

1820.aligned.tab Number of aligned reads 107786
1820.aligned.tab Number of stitched reads 30128
1820.aligned.tab Number of single end reads 38829
1820.aligned.tab Number of supporting end reads 38829
1820.aligned.tab Percent partial aligned reads 0.488072
1820.aligned.tab Percent reverse strand reads 0.490146
1820.aligned.tab Mean insert size 174.789
1820.aligned.tab Percentage of non-unit allele reads 0.0163735

Those all look pretty good. The non-unit alleles are supposed to be 1-5%, and the reverse strand reads should be about 50%. And the mean insert size of 175 is within ballpark of the figure of 199 that I had calculated for the whole exomes.

python ~/bin/lobstr_v2.0.2_1.linux64/scripts/lobSTR_allelotype_checks.py -f 1820.genotypes.tab

1820.genotypes.tab Number STR covered at least once (includes partially spanning) 28355
1820.genotypes.tab Number STR loci coverage >= 1 18790
1820.genotypes.tab Number STR loci coverage >= 5 827
1820.genotypes.tab Number STR loci coverage >= 10 320
1820.genotypes.tab Mean STR coverage (only non-partial reads counted) 1.58265
1820.genotypes.tab Mean percent agreeing reads 0.994757
1820.genotypes.tab STR period vs. fraction in each category
1820.genotypes.tab Period 0/0 0/1 1/1 1/2
1820.genotypes.tab Period:2 0.193853427896 0.36170212766 0.115839243499 0.328605200946
1820.genotypes.tab Period:3 0.713567839196 0.13567839196 0.0954773869347 0.0552763819095
1820.genotypes.tab Period:4 0.669902912621 0.145631067961 0.145631067961 0.0388349514563
1820.genotypes.tab Period:5 0.770491803279 0.114754098361 0.0655737704918 0.0491803278689
1820.genotypes.tab Period:6 0.926829268293 0.0243902439024 0.0243902439024 0.0243902439024

The thing that jumped out at me here is that the mean coverage is only 1.58. That’s because this is exome sequence data but lobSTR is calling repeats genome-wide. So that 1.58 is the average of high coverage in the exons and very sparse coverage of non-exonic sites. I’ll filter for coverage later, so I decided not to worry about this at this stage. (Alternately, I could have run lobSTR with a custom index for just the exonic repeats – see the lobSTR usage page under ‘Building a custom index’).

step 3: combine lobSTR output

I needed to do a regression using data from all 50 exomes, so I had to combine the lobSTR output from each exome at some point. I decided it was easiest at the bash command line:

for sid in {1820..1869} # sid is sample id
do
    grep -v '^#' ${sid}.genotypes.tab | awk -v sid=$sid '{print sid "\t" $0}' >> all.genotypes.tab
done

I’m using grep to remove the header lines, awk to add the sample id as the first column, and concatenating all the results together with >>. By the way, because this code concatenates, if you run it more than once just remember to rm all.genotypes.tab in between.

step 4: read combined lobSTR output into R

I wrote a few lines of R code to read in the combined lobSTR output, combine it with phenotype information (HD residual age of onset) and sort it by genomic site. In the process I learned the handy R commands merge and order which provide much of the functionality of SQL join and group by operations respectively:

# column names for lobSTR genotypes.tab output. remove the first 'sid' column if you are not adding a sample id to the data.
colnames=c('sid','chrom','left','right','unit','period','refcn','mlatype','cov','agree','conflict','alist','score','a1score','a2score','pcov','lbound','alistp','nstitched')
# read lobSTR genotypes into a data frame named lob
# need to use as.is=TRUE so that characters don't turn into factors. need as characters so you can split on delimiters later.
lob = data.frame(read.table('all.genotypes.tab',header=FALSE,skip=2,col.names=colnames,as.is=TRUE)) # create a dataframe of lobSTR output
lobq = lob[lob$unit%in%c('AGC','AAC'),] # I consider only CAG and CAA repeats (named "q" for glutamine, though these repeats won't all code for glutamine depending on frame and strand.
lobq$a1 = as.numeric(lapply(strsplit(lobq$mlatype,","),"[",1)) # parse first allele value
lobq$a2 = as.numeric(lapply(strsplit(lobq$mlatype,","),"[",2)) # parse second allele value
lobq$maxal = pmax(lobq$a1,lobq$a2,lobq$lbound,na.rm=TRUE) # get longest allele from first, second, and lower bound based on partially spanning reads
lobq$tcov = as.numeric(lobq$cov)+as.numeric(lobq$pcov) # calculate total coverage including spanning and partial reads
phen = data.frame(read.table('phen.txt',header=FALSE,col.names=c('sid','res'),as.is=TRUE,sep=";")) # read in a table of phenotype info
phen$res = as.numeric(phen$res) # convert residual age of onset to numeric
vs = merge(lobq,phen,by="sid") # equivalent of SQL "create table vs as select * from lobq inner join phen on lobq.sid = phen.sid;"
o = vs[with(vs, order(chrom,left,right)), ] # equivalent of SQL "create table o as select * from vs order by chrom, left, right;"

step 5: a few descriptive statistics in R

Once I’ve got a new dataset read into R, I always try to start with some descriptive statistics. First I checked the rate of heterozygous and homozygous calls and how many sites were not called (NA):

# number of allelotype calls containing NA
sum(is.na(lobq$a1) | is.na(lobq$a2),na.rm=TRUE)
# number of heterozygous allelotype calls
sum(lobq$a1 != lobq$a2,na.rm=TRUE)
# number of homozygous reference calls
sum(lobq$a1==0 & lobq$a2==0,na.rm=TRUE)
# number of homozygous non-reference calls
sum(lobq$a1!=0 & lobq$a1==lobq$a2,na.rm=TRUE)
# this should equal the sum of those:
dim(lobq)[1]

And the results looked right and matched what lobSTR_allelotype_checks.py gave above. I also checked how many sample-site combinations had coverage greater than 1 read:

sum(lobq$cov > 1)

The answer was only about 20%, which isn’t too surprising considering that I used lobSTR to call repeat lengths genome-wide even though my data is exome.

Finally, while browsing my data I noticed a preponderance of negative allelotypes, meaning that lobSTR had called an allele length shorter than reference. I wondered if this was objectively true or if I was just seeing things so I checked:

# number of allelotypes with the longest allele longer than reference
sum(o$maxal > 0)
# number of allelotypes with the longest allele shorter than reference
sum(o$maxal < 0)

And sure enough, there were more contracted alleles than expanded alleles (13826 vs. 9051). But once I applied a filter for allelotypes with coverage > 1 read, that trend went away and I had more expanded than contracted alleles, which makes sense considering that I’m testing against the longer allele (o$maxal):

# number of allelotypes with the longest allele longer than reference and coverage > 1
sum(o$maxal > 0 & o$cov > 1)
# number of allelotypes with the longest allele shorter than reference
sum(o$maxal < 0 & o$cov > 1)

And here I got 3071 expanded vs. 1261 contracted. (And by the way – if like me you’re not totally fluent in R, remember that & and && are very different.)

step 6: test for association in R

I wrote this script to crawl through the table and create two models for every site: a linear regression of max allele length vs. residual age of onset, and a Fisher’s exact test for the number of expanded alleles at each site in extreme early vs. extreme late patients:

# set up variables for loops and regression
mincov = 2 # minimum read coverage of a repeat region to include in model
minsamplecount = 40 # minimum number of samples to create a model for the site. 40/50 = require 80% genotyping rate.
maxrow = dim(o)[1] # highest row number in o
sitefirstrow = 1 # first row number referring to a particular genomic site. used in outer loop.
rowno = 1 # row number currently being processed, used in inner loop
results = matrix(nrow = dim(unique(o[2:4]))[1], ncol = 6) # one row for every unique chrom/left/right combination; 6 cols listed on next line
colnames(results) = c("chrom","left","right","count","linearp","fisherp") # name the result matrix columns
resultrow = 1 # counter for inserting results into the result matrix
# outer loop: iterate over repeat sites
while (sitefirstrow < maxrow) { # until we hit the end of the table...
    chrom = o$chrom[sitefirstrow] # grab the chrom/left/right unique identifier for a site
    left = o$left[sitefirstrow]
    right = o$right[sitefirstrow]
    rowno = sitefirstrow # start with the first sample for this site
    maxals = c() # vector to hold largest allele for each sample
    resaos = c() # vector to hold residual age of onset for each sample
    # inner loop: iterate over samples at a given repeat site
    while(o$chrom[rowno] == chrom && o$left[rowno] == left && o$right[rowno] == right && rowno <= maxrow) { # as long as we're on the same site, iterate over samples
        if(o$tcov[rowno] >= mincov) { # skip sample if that site had insufficient coverage in that sample
            maxals = c(maxals,o$maxal[rowno]) # append the sample's max allele length at that site
            resaos = c(resaos,o$res[rowno]) # append the sample's phenotype
        }
        rowno = rowno + 1 # advance to the next sample
    }
    samplecount = length(maxals) # how many samples are included for this site
    if (samplecount >= minsamplecount & length(unique(maxals)) > 1) { # don't bother doing regression if < 3 datapoints or monomorphic
        # linear model of residual age of onset vs. maximum allele repeat length
        model = lm(resaos ~ maxals) 
        linearp = summary(model)$coef["maxals","Pr(>|t|)"] # can also get coefficient: #coeff = summary(model)$coef["maxals","Estimate"]
        # fisher model of number of expanded alleles vs. dichotomous phenotype
        earlyexp = sum(resaos < 0 & maxals > 0) # early onset patients with expanded alleles
        earlynon = sum(resaos < 0 & maxals <= 0) # early onset patients with non-expanded alleles
        lateexp = sum(resaos > 0 & maxals > 0) # late onset patients with expanded alleles
        latenon = sum(resaos > 0 & maxals <= 0) # late onset patients with non-expanded alleles
        ctable = matrix(c(earlyexp,earlynon,lateexp,latenon),nrow=2) # contingency table for fisher's exact test
        fisherp = fisher.test(ctable,alternative="two.sided")$p.value # extract p value from fisher's exact test
        results[resultrow,] = c(chrom,left,right,samplecount,linearp,fisherp) # write out result row
        resultrow = resultrow + 1 # advance to next result row
    }
    sitefirstrow = rowno # advance to the next genomic site
}
results = data.frame(results,stringsAsFactors=FALSE) # convert to data frame
results = results[!is.na(results$chrom),] # get rid of all the extra rows full of NA

In the end, I got a table with results for 131 sites. To review: those 131 sites represent all the places where (1) an AGC or AAC repeat is present, and at least 40 of my exomes had (2) coverage of at least 2 spanning that site, and (3) an allelotype called by lobSTR. Since these sites each represent a separate statistical test, in order to be significant after multiple testing correction I will need to see a p value of .05/131 = .0004 or lower.

discussion

It is good to know that lobSTR can call an appreciable number of repeat length polymorphisms reliably across samples. I am sure it could do even better, especially with longer repeats, if we had longer read lengths. I know some Illumina HiSeq machines can do paired-end 150bp, and apparently Ion Torrent will be able to do single-end 200bp.