Association of repeat length polymorphisms with complex traits and common diseases

Huntington’s Disease is just one of many Mendelian diseases caused by a pathological repeat expansion. A recent review, La Spada and Taylor 2010, lists 22 Mendelian repeat expansion disorders, and more continue to be discovered – most recently a non-coding GGGGCC hexanucleotide expansion in C9ORF72 has been found to segregate with FTD-ALS [DeJesus-Hernandez 2011, Renton 2011].

It is likely that the phenotypic influence of repeat length polymorphisms is not limited to Mendelian diseases. There is also some evidence that repeat length polymorphisms contribute to common traits and complex diseases, and act as secondary modifiers of other diseases.

In reading the linked papers, note the divergent terminology: repeats with a period of 2-6 nucleotides are called short tandem repeats (STRs) by some authors and microsatellites by others (Wikipedia calls them microsatellites). Some authors also include repeats with a period of 1, such as CCCCCCCCCCCC, in the definition of these terms, though these simple runs of a single base are also called homopolymers.
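To make the period terminology concrete, here is a minimal Python sketch that scans a sequence for perfect tandem repeats with a period of 1 to 6 and labels each hit. The function names, the `min_copies` threshold and the toy sequence are my own assumptions for illustration; real repeat callers also tolerate imperfect repeats, which this does not.

```python
import re

def find_tandem_repeats(seq, min_period=1, max_period=6, min_copies=3):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats.

    Illustrative only: exact repeats, no mismatches or indels allowed.
    """
    seq = seq.upper()
    hits = []
    for period in range(min_period, max_period + 1):
        # A unit of `period` bases followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (period, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are just a shorter unit repeated (e.g. "CC", "ACAC"),
            # so each repeat is reported once, at its smallest period.
            if any(unit == unit[:k] * (period // k)
                   for k in range(1, period) if period % k == 0):
                continue
            hits.append((m.start(), unit, len(m.group(0)) // period))
    return sorted(hits)

def classify(unit):
    """Label a repeat unit using the terminology discussed in the text."""
    return "homopolymer" if len(unit) == 1 else "STR/microsatellite"

# Toy sequence (made up): an (AC)_5 dinucleotide repeat and a run of 12 C's.
toy = "GG" + "AC" * 5 + "TT" + "C" * 12 + "GA"
for start, unit, copies in find_tandem_repeats(toy):
    print(start, unit, copies, classify(unit))
```

Whether a period-1 hit should count as an STR at all is exactly the definitional disagreement noted above; the sketch just reports it under its own label.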

associations from the literature

As mentioned in a recent post, Pulst 2005 (ft) discovered that sub-pathological repeat length polymorphisms in CACNA1A contribute to an age of onset phenotype in spinocerebellar ataxia 2, a Mendelian disorder caused by a repeat expansion in ATXN2. So repeat length polymorphisms in one gene can modify the phenotype of a disease-causing repeat expansion in another gene.

Certain repeat length polymorphisms have also been suggested to cause, or modify, disorders that (to our knowledge) aren’t otherwise repeat-related. Wikipedia pointed me to a few examples: Jemaa 2009 found an intronic 27bp repeat in NOS3 associated with hypertension among Tunisians, and Kersting 2008 found an intronic dinucleotide repeat in EGFR to be associated with osteosarcoma. Akagi 2009 examined an intronic 14bp repeat in the asparagine synthetase (ASNS) gene in relation to acute lymphoblastic leukemia and found biochemical evidence that regulatory activity differs with repeat length, though the sample was too small to propose a genotype-phenotype association. And as noted in the sleep post, a protein-coding 54bp repeat in PER3 may be associated with certain sleep phenotypes [Viola 2007, Groeger 2007].

Perhaps the most studied example concerns an (AC)_n dinucleotide repeat length polymorphism near (2.1kb upstream of) ALR2. Ko 1995 found this polymorphism to be associated with diabetic retinopathy among Chinese type II diabetes (T2D) patients. Following that finding, Heesom 1997 found evidence that the same polymorphism was also associated with kidney problems among type I diabetes (T1D) patients. To be clear, T1D and T2D are both complex disorders with some genetic component, and the ALR2 repeat length polymorphism was not proposed as a cause of either disease, but rather as a phenotypic modifier.

It’s not clear if either of those associations is real. Both findings have been replicated by some studies and not by others; one meta-analysis [Xu 2008 (ft)] suggested that the T1D association was real and the T2D association was not. Even assuming a correlation really exists, it is not yet clear that the repeat is the directly causal variant rather than a tag for some other variant. For instance, Kao 1999 (ft) found a separate SNP in the ALR2 promoter region in nearly perfect linkage disequilibrium with the repeat. Interestingly, Shah 1998 found that the repeat length polymorphism strongly correlated with ALR2 gene expression level only in diabetics and not in controls.

Repeat length polymorphisms elsewhere in the genome have also been associated with other diabetic phenotypes, but it is likewise still unclear whether these associations are real [Uthra 2010].

Most of the above studies used relatively small sample sizes – for instance, the T1D and T2D studies had tens to a few hundred patients each (see Xu 2008’s Table 1 and Table 2). No wonder the results were (are?) still debated even more than a decade after the original discovery.

Probably the most compelling association of a repeat length polymorphism with a complex trait comes from Grant 2006. A previous study had reported a linkage peak at 10q for type II diabetes risk. The linkage region was 10.5Mb long and contained 228 microsatellites. Grant genotyped all of these microsatellites in almost 2000 cases and 2000 controls (split across three cohorts), and found that one microsatellite, named DG10S478, was strongly associated with T2D.

This microsatellite is a bit mysterious – the supplementary methods give the genomic location as hg16/b34 chr10:114,460,845-114,461,228 but it appears that the reference genome actually represents the 0-repeat allele, i.e. you can’t see the repeat anywhere in the reference sequence. Grant says the repeat is tetranucleotide, i.e. a period of 4, but never actually says what the repeated sequence is.

In any event, Grant reports that the association of this repeat with T2D is slightly stronger (relative risk 1.56, p = 4.7E-18) than the most strongly associated nearby SNP (rs7903146, relative risk 1.54, p = 2.1E-17), which to my eye would suggest that the repeat is truly causal and the SNP is just a marker. Grant even sequences the nearest exon to look for possible coding variation tagged by the repeat, and finds none. Yet still, Grant never asserts that the repeat is causal, instead offering that DG10S478 might be causal, or might “more likely” be a marker for something else intronic.
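For readers who want to see how an effect size and p-value of this kind fall out of case-control allele counts, here is a rough Python sketch. It assumes a simple 2x2 allelic table (risk allele vs. all other alleles, cases vs. controls) and uses the odds ratio as a stand-in for relative risk. The counts are invented for illustration and are not Grant’s data, and this is not necessarily the method Grant used.

```python
import math

def allelic_association(case_risk, case_other, ctrl_risk, ctrl_other):
    """Odds ratio, Pearson chi-square (1 df, no continuity correction) and
    p-value for a 2x2 table of allele counts."""
    odds_ratio = (case_risk * ctrl_other) / (case_other * ctrl_risk)
    n = case_risk + case_other + ctrl_risk + ctrl_other
    cross = case_risk * ctrl_other - case_other * ctrl_risk
    chi2 = (n * cross ** 2) / (
        (case_risk + case_other) * (ctrl_risk + ctrl_other)
        * (case_risk + ctrl_risk) * (case_other + ctrl_other)
    )
    # With 1 degree of freedom, the chi-square tail probability equals
    # erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value

# Hypothetical counts: 2000 cases and 2000 controls, i.e. 4000 alleles each.
odds_ratio, chi2, p_value = allelic_association(1440, 2560, 1080, 2920)
print(f"OR = {odds_ratio:.2f}, chi2 = {chi2:.1f}, p = {p_value:.1e}")
```

Running the same function on two markers in strong linkage disequilibrium will give very similar numbers for both, which is one reason a slightly smaller p-value for one marker is weak evidence about which of them is causal.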

what’s the mechanism?

In contrast to the Mendelian repeat disorders, where the repeats are very well-established to be causal, in the end it is not clear that there is really solid evidence for any of the above-mentioned repeats being directly causal for these complex traits/diseases.

But there is no shortage of mechanisms by which a repeat length polymorphism could directly influence a phenotype. Consider how repeat expansions cause Mendelian disorders: sometimes they result in a loss of a gene’s expression (e.g. Fragile X) or of just one transcript thereof (FTD-ALS, see DeJesus-Hernandez 2011). Alternatively, protein-coding repeats can cause a gain of function at the protein level – we don’t know exactly how that works, but one theory is that longer repeat tracts make for a pathological exaggeration of the protein’s native function [Orr 2012]. Any of these mechanisms could apply to less extreme repeat length polymorphisms as well. Coding repeats might control the activity level of a protein; intergenic, UTR or intronic repeats might control dosage of a gene or of a specific transcript. One specific possibility is that repeat lengths in promoter regions – especially (AC)_n repeats – control the level of DNA binding by transcription factors [Rockman & Wray 2002 (ft)].

conclusion

Repeat lengths mutate much more quickly than SNPs do [Sun 2012] and are often highly polymorphic. It seems logical that they might well contribute to human phenotypic diversity in complex traits. So far, several repeat length polymorphisms have been proposed to associate with phenotypes, but I couldn’t find an example where the repeat has very confidently been identified as the causal variant.