motivation

When I go to scientific talks, I very often hear an introduction that includes some statement about how “this gene is highly conserved and therefore important.”  My boss once pointed out to me, you never hear someone say “the gene I study is not very highly conserved.”  Yet I also have never heard someone offer a single objective or quantitative metric as evidence that the gene they study is any more highly conserved than any other gene.

This matters because the notion that a gene is “highly conserved” seems to influence people’s thinking about whether a gene is important and whether it could be knocked down as part of a therapeutic strategy.  This comes up often in the prion literature.  Consider this quote from a classic paper [Collinge 1994]:

Although PrP is highly conserved among mammals and widely expressed in early embryogenesis, mice homozygous for disrupted PrP genes appear developmentally and behaviourally normal.

The authors cite [Bueler 1992] regarding the PrP knockout mice but as for the claim of “highly conserved,” one might want to add this note:

citation-needed

a comparison to platypus

Yes, of course, orthologs of our PRNP gene are found in every mammal species, even platypus - a member of the monotremes, the mammalian clade most distant from humans - and they all have strong sequence similarity, as seen in multiple alignments.

But is that so special?  Isn’t that true of most human genes?  86% of human genes have at least one ortholog in platypus [Warren 2008 (ft), see Supplement Table 5].

Duck_billed_platypus_schnabeltier-600px

Above: The platypus has PrP, but it also has its own version of most of our genes. 

Those authors also noted that 41% of human genes have simple 1:1 orthology across all mammals included in the analysis, matching exactly one gene in each of dog, mouse, gray short-tailed opossum (a model marsupial) and platypus  [Supplement-to-the-supplement, Table A11].  But when I looked up PRNP in their database by its Ensembl gene symbol, ENSG00000171867, I discovered it’s not even one of the elite 41%.  By the definitions those authors used, the orthology group for PRNP includes PRND (Doppel, ENSG00000171864), both in humans and some of the other mammals included in the analysis.  Also, when only human and platypus are considered, there are 11,287 1:1 orthology relationships, i.e. accounting for about half of human genes.

Taken together, all of this says that the sole fact that PrP is found across all mammals, including platypus, does not necessarily place it in even the top 50% of most conserved human genes.  The fact that PrP also has orthologs in chicken and frog is slightly stronger, especially (in my mind) since, although there is less sequence identity between our PrP and these more distant species, shedding and alpha cleavage of PrP are still conserved all the way out to chicken [Harris 1993 (ft)], and alpha cleavage may be essential for myelin maintenance [Bremer 2010].  And of course, there are even orthologs in zebrafish, though the conservation is much weaker.

How special do these facts make PrP?  Still not very.  I turned to the JAX homology database, downloaded the “Complete List of Vertebrate Homology Classes“, and ran some SQL queries in R to determine how many human gene symbols have nominal orthologs in various species.

library(sqldf)
setwd('c:/sci/003prigen/blog/coding_conservation')
tbl = read.table('HOM_AllOrganism.rpt',header=TRUE,quote='',comment.char='',sep='\t') # http://www.informatics.jax.org/homology.shtml
colnames(tbl)[1] = "homolog_id" # use SQL-compatible table names
colnames(tbl)[2] = "species"

# count the human gene symbols in this database
sql_query = "
select   count(distinct t1.Symbol)
from     tbl t1
where    t1.species = 'human'
;"
sqldf(sql_query) # 18843.  

# how many human gene symbols have mouse orthologs?
sql_query = "
select   count(distinct t1.Symbol)
from     tbl t1, tbl t2
where    t1.homolog_id = t2.homolog_id
and      t1.species = 'human'
and      t2.species = 'mouse, laboratory'
;"
sqldf(sql_query) #16887

And so on.  JAX only lists 18,843 human gene symbols, not the full 23,705 in hg19 RefSeq, but I’m assuming that’s just because some human genes didn’t have orthologs in any of the species they included.  Mouse orthologs exist, then, for 16887/23705 = 71% of human genes, and if you modify the last SQL query to say ‘chicken’ or ‘zebrafish’ you’ll get figures of 12,379 and 12,251, so in either case ~52% of human genes have orthologs in those species.

So, again, the presence of orthologs by itself – even in zebrafish – does not even necessarily place PrP in the top half of most conserved human genes.  Given that orthology does exist, we can also ask how similar PrP is between humans and other mammals.

sequence evolution rate

When a gene does have orthologs in other species, you can align them and make judgments about the conservation of each nucleotide or each amino acid.  The bioinformatics community is awash in tools for quantifying either how highly conserved a given base or amino acid is, or how flagrantly a missense mutation violates that conservation: PolyPhen2, GERP, PhastCons & PhyloP, SIFT, and so on.  These use phylogenomics and in some cases also physical traits of the original and substituted amino acid.

But those tools are all for conservation of a particular amino acid or nucleotide.  How do you make judgments about the conservation of a whole gene?

The most common metric used is the ratio of non-synonymous to synonymous amino acid substitutions, dN/dS.  dN is the count of non-synonymous substitutions between species being compared, and dS is the count of synonymous substitutions.  This is based on an assumption that evolution doesn’t act on synonymous substitutions, though obviously that’s an oversimplification.

dN/dS is too simple in another way as well: if you glance at the genetic code, some sites in a gene are constrained – ATG is the only methionine (M) codon, so any substitution is necessarily non-synonymous; and ACN is always threonine (T), no matter what N is, so substitutions at that third base are necessarily synonymous.  To normalize for the number of opportunities for synonymous or nonsynonymous substitutions, an alternate formulation is Ka/Ks, where Ka is the ratio of actual to potential non-synonymous substitutions and Ks is the ratio of actual to potential synonymous substitutions.

If you accept for a moment the simplifying assumption that synonymous SNPs are neutral, then Ka/Ks > 1 implies that evolution favors change – this is called positive, Darwinian or directional selection.  Ka/Ks = 1 implies no selection – evolution doesn’t care what happens to this gene.  Ka/Ks < 1 implies evolution resists change – this is called negative or purifying selection.  (There are also such things as balancing selection, which favors heterozygosity, disruptive selection, which favors phenotypic extremes, and stabilizing selection, which favors phenotypic moderation, none of which are directly captured by Ka/Ks.)  See also McDonald-Kreitman Test and Tajima’s D statistic.

The Ka/Ks value for PRNP within humans is 0.20 and between humans and chimpanzees is 0.50 (2 nonsynonymous / 4 synonymous) [Mead 2003], compared to an average of 0.23 for all human-chimpanzee orthologs [de Maghalaes 2007].  In other words, PRNP appears to be under purifying selection, though not necessarily moreso than most of our genes.

On the other hand, the nonsynonymous polymorphisms that do exist in PRNP, especially M129V, appear to be more common than they should be compared to how rare new polymorphisms are.  Using Tajima’s D statistic, it has therefore been argued that evolution has exerted balancing selection on PRNP, favoring the maintenance of polymorphisms that are dominant negative against prion disease [Mead 2003].

does conservation imply deleterious knockout?

The notion that PrP is highly conserved, yet its knockout phenotypes are relatively mild, was part of the original basis for positing the existence of pi, a hypothesized protein that has function similar or identical to that of PrP and carries out PrP’s job when PrP is absent, thus preventing PrP knockout mice from having severe phenotypes.

But all that is premised on an expectation that highly conserved genes should produce dramatic knockout phenotypes.  It seems intuitively obvious that this should be the case, and it was proposed long ago [Wilson 1977], yet people have struggled to demonstrate it, and there appears to be only a slight correlation between gene conservation and knockout severity.

Some of the earliest work was done on a couple hundred mouse and rat genes before we had full genome sequences for any vertebrate [Hurst & Smith 1999].  68 genes classified as “essential” (knockout causes death or infertility) were compared to 107 non-essential genes.  Nominally, the essential genes had lower sequence evolution rates, but this was in part because some of the non-essential genes were immune genes deemed to be under directional selection, i.e. selection favoring change rather than preventing change.  Once this was corrected for, Hurst & Smith found no difference.

But a subsequent, more quantitative analysis of yeast knockouts using growth rate as the quantitative phenotype did find a negative correlation [Hirsh & Fraser 2001]: genes that were more similar between S. cerevisiae and C. elegans slowed down S. cerevisiae’s growth more dramatically when they were knocked out.  Interestingly, though this study found a significant slope in a linear model, it did not find a significant difference in a binary comparison of “essential vs. non-essential”.  That’s because the linear correlation was driven by the most non-essential genes, those which had very little impact on growth rate and very strong divergence from C. elegans.  Combining these with the merely somewhat dispensible genes in the middle of the distribution sapped all the power out of a Welch’s t test.  In contrast, a study of bacterial genomes found both a significant correlation and a significant dichotomous difference between essential and non-essential genes [Jordan 2002].

A re-analysis of Hirsh and Fraser’s yeast data using a much larger dataset concluded that the correlation was far smaller than originally reported, with a gene’s “dispensability” explaining only ~2% of variance in its evolution rate, compared to ~20% in Hirsh’s study [Pal 2003].  As time went on, the datasets got larger and the math got fancier.  It began to appear that dispensability – as well as gene expression level – are indeed correlated with sequence evolution rate, though it is hard to tell how strongly [Wall 2005], and the effect is far from black-and-white – for a visual see Fig 2 in [Zhang & He 2005 (ft)].

The most comprehensive analysis I’ve found is in [Wolf 2006].  Wolf considers several different (but correlated) variables: sequence evolution rate, expression level, number of physical interactors, number of genetic interactors, number of paralogs per species and knockout effect.  Wolf’s narrative interpretation of the principal components of these variables, though speculative, is a highlight: PC1 is said to reflect a gene’s “status” or prominence and ranks highly those genes that are both essential (strong knockout phenotype) but also well backed-up (many paralogs), whereas PC2 reflects “adaptability” – genes that are well backed-up but can also be done without.

A problem (or confusion?) I have about all of these analyses is that they seem to only account for genes that actually have orthologs among all species studied.  You can’t calculate a sequence evolution rate for a gene in species A if there is no equivalent gene in species B to which you can align the sequence.  Therefore it would seem that these analyses all leave out the least highly conserved genes of all – those that have only recently emerged, or have been lost in other species. (Or am I missing something? If so please leave me a comment).  An exception is one study which does consider gene loss events [Krylov 2003].

discussion

I’m just beginning to get a handle on the different ways that people measure gene conservation and the different tools and formulas available for quantifying this.

I got interested in this for a few reasons.  First, people often say that the gene they study is conserved, but rarely back this up with evidence, and I like to quantify claims like this whenever possible.  Second, I often hear people be surprised at how mild PrP knockout phenotypes are given how “highly conserved” it is, and I want to know whether that surprise is really justified.  Third, I’ve heard people argue against PrP depletion as a therapeutic strategy on the basis that PrP is conserved and therefore must be too important to knock down.

In fact, probably no grand pronouncements are possible – if there is only a modest correlation between knockout phenotype severity and sequence evolution rate, then neither should people be surprised that PrP knockouts are viable, nor would it make much of a difference in our expectations if we realized that PrP is, quantitatively, not as highly conserved as we thought.

Still, this notion of conservation keeps coming up, and I’d like to have the conversation based on hard data.  In the literature I only found Simon Mead’s estimates of Ka/Ks for chimpanzee vs. human and it would be interesting to do this for several other species as well – it appears non-trivial based on this Biostars discussion and the number of possible settings in PAML, so I’ve got a bit to learn yet.  I also suspect there may be other, entirely different, quantitative metrics of gene conservation which I haven’t run across yet (please leave a comment if you know!).  And finally, it will be important to tease out whether PrP’s short length biases any of the potential metrics – it is among the smaller proteins in our proteome.