The other day I was emailing with a statistical genetics colleague about a rare SNP associated with a phenotype. I stated that the minor allele frequency (MAF) was .07% in cases and .01% in controls, for a risk ratio of 7. After clicking send, I felt a twinge of regret. Is that actually what risk ratio means? And is risk ratio the right metric to use here? So I did some Googling to get my facts straight. Here’s what I learned.

I’ll start with a non-genetics example for simplicity, and then move on to a genetic example. Suppose you’re testing for an association of an exposure (say, smoking) with a phenotype (say, cancer), and you have a contingency table as follows:

| | cancer | no cancer |
|---|---|---|
| smoker | a | b |
| non-smoker | c | d |

First, recall that the odds of an event are defined as P(event)/P(not event). So for an event with p = .25, the odds are .25/.75 = 1/3. To make things more confusing, colloquially no one refers to odds of one third; instead this would be phrased “three to one against.” An event with p = .75 would have odds .75/.25 = 3, and this would be phrased “three to one for.”
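
As a quick check of that arithmetic, here is a minimal sketch in Python that converts a probability to odds. The function name `odds` is my own, and exact fractions are used so that one third stays one third instead of becoming a rounded decimal.

```python
from fractions import Fraction

def odds(p):
    """Return the odds P(event) / P(not event) for a probability 0 <= p < 1."""
    p = Fraction(p)
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

print(odds(Fraction(1, 4)))  # p = .25 -> 1/3, i.e. "three to one against"
print(odds(Fraction(3, 4)))  # p = .75 -> 3,   i.e. "three to one for"
```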

We can define the following terms:

The odds ratio (OR) is the ratio of the odds of cancer in smokers to the odds of cancer in non-smokers.

OR = (a/b)/(c/d) = (ad)/(bc)

The risk ratio (RR), also called the relative risk, is the ratio of the probability of cancer in smokers to the probability of cancer in non-smokers.

RR = (a/(a+b))/(c/(c+d)) = (a(c+d))/(c(a+b))
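
The two definitions translate directly into code. This is only a sketch: the function names are mine, the counts are hypothetical and chosen purely to illustrate the formulas, and nothing guards against a zero cell.

```python
def odds_ratio(a, b, c, d):
    """OR = (a/b) / (c/d) for a 2x2 table laid out as

                  outcome   no outcome
    exposed          a          b
    unexposed        c          d
    """
    return (a / b) / (c / d)

def risk_ratio(a, b, c, d):
    """RR = (a/(a+b)) / (c/(c+d)) for the same layout."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts, for illustration only
a, b, c, d = 10, 90, 5, 95
print(f"OR = {odds_ratio(a, b, c, d):.2f}")
print(f"RR = {risk_ratio(a, b, c, d):.2f}")
```

Both functions take the same four numbers; they differ only in whether each row is summarized as odds (a/b) or as a probability (a/(a+b)).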

Given that you know a, b, c, and d, you can compute either of these metrics. Yet odds ratio is strongly preferred as the “right” metric to report in almost all scenarios. That seems to be because the quantity that it measures is more fundamental to the biology of what you’re studying, and less likely to change depending on how you’re studying it. Here are some examples to illustrate that.

Suppose we go out and ascertain 1000 cancer patients and 1000 healthy controls, and then ask them whether they smoke. Suppose the data look like this:

| | cancer | no cancer |
|---|---|---|
| smoker | 700 | 200 |
| non-smoker | 300 | 800 |

OR = (700/200)/(300/800) ≈ 9.33
RR = (700/900)/(300/1100) ≈ 2.85

On the other hand, suppose that cancer patients were hard to come by, so we decided we had to make do basing our study results on just 100 of them – but controls are easy, so we still got 1000 controls. Suppose the percentage of the cancer patients that smoked was the same, though. Then our data would look like:

| | cancer | no cancer |
|---|---|---|
| smoker | 70 | 200 |
| non-smoker | 30 | 800 |

OR = (70/200)/(30/800) ≈ 9.33
RR = (70/270)/(30/830) ≈ 7.17

With this sort of skewed ascertainment, the OR remains the same, but the RR is wildly different.
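
The reason is that recruiting only a fraction k of the cases multiplies the whole cancer column by k, and k cancels in the OR but not in the RR. Here is a minimal sketch of that, assuming the subsampling is uniform across smokers and non-smokers; the helper `scale_cases` is hypothetical.

```python
def odds_ratio(a, b, c, d):
    return (a / b) / (c / d)

def risk_ratio(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

def scale_cases(table, k):
    """Multiply the case column (a, c) by k, leaving the controls (b, d) alone."""
    a, b, c, d = table
    return (a * k, b, c * k, d)

full = (700, 200, 300, 800)     # 1000 cases, 1000 controls
fewer = scale_cases(full, 0.1)  # 100 cases, 1000 controls

for name, table in [("1000 cases", full), ("100 cases", fewer)]:
    print(f"{name}: OR = {odds_ratio(*table):.2f}, RR = {risk_ratio(*table):.2f}")
```

Any other value of k leaves the OR where it is and moves the RR somewhere else again.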

Alternatively, suppose that we did our study the opposite way: we ascertained 900 smokers and 900 non-smokers, then asked whether they had cancer. For example’s sake, here are some data that give the same answers as the first example above, to within a rounding error:

| | cancer | no cancer |
|---|---|---|
| smoker | 700 | 200 |
| non-smoker | 245 | 655 |

OR = (700/200)/(245/655) ≈ 9.36
RR = (700/900)/(245/900) ≈ 2.86

In this case, it doesn’t matter if we over-ascertain smokers or non-smokers. Suppose we could only recruit 90 smokers but we got 900 non-smokers:

| | cancer | no cancer |
|---|---|---|
| smoker | 70 | 20 |
| non-smoker | 245 | 655 |

OR = (70/20)/(245/655) ≈ 9.36
RR = (70/90)/(245/900) ≈ 2.86

We get the exact same answer. So in cases where you ascertain on a behavior (or genotype) and then check for a disease phenotype, the RR is a valid measurement – we’ve measured the real probability of getting a disease given that you have a risk factor, rather than cherry-picking people who have the disease. For this reason, the use of RR is mostly confined to longitudinal studies, where a group of individuals is followed for many years to see whether they develop a disease.
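
The algebra behind this is short. As a sketch, suppose the smoker row is recruited at some fraction k of its full size (k is my notation, not part of the tables above), so that a and b become ka and kb:

```latex
\begin{aligned}
\mathrm{RR}' &= \frac{ka/(ka+kb)}{c/(c+d)} = \frac{a/(a+b)}{c/(c+d)} = \mathrm{RR}, \\
\mathrm{OR}' &= \frac{ka/kb}{c/d} = \frac{a/b}{c/d} = \mathrm{OR}.
\end{aligned}
```

If instead the cancer column is scaled by k, as in the case-control examples, a and c become ka and kc. The OR is still unchanged, but the RR becomes (ka/(ka+b))/(kc/(kc+d)), where k no longer cancels.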

In the above examples, I have always considered the “exposure” as a binary: smoker/non-smoker, 0/1. In genetics, the “exposure” is usually a bi-allelic genotype, where the alt allele count in one individual can have three values: 0/1/2. So how do we compute OR in genetics? Intuitively, you can imagine two ways of doing it. You could dichotomize the genotype into a binary exposure variable: “has an alt allele”/“doesn’t have an alt allele”. That would correspond to a dominant model. Or you could count the alleles themselves instead of the people carrying the alleles. That would correspond to an additive (allelic) model. The allelic model is more commonly used. The contingency table should look like this:

| | affected | unaffected |
|---|---|---|
| minor allele count | a | b |
| major allele count | c | d |

OR = (a/b)/(c/d) = (minor allele count in cases / minor allele count in controls) / (major allele count in cases / major allele count in controls)
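
To make the two approaches concrete, here is a sketch that builds both tables from genotype counts, i.e. the number of people carrying 0, 1, or 2 copies of the minor allele. The counts are hypothetical and the function names are my own. Each person contributes two alleles to the allelic table but only one entry to the dominant table.

```python
def minor_alleles(n0, n1, n2):
    return n1 + 2 * n2

def major_alleles(n0, n1, n2):
    return 2 * n0 + n1

def allelic_table(cases, controls):
    """(a, b, c, d) counting alleles; inputs are (n0, n1, n2) genotype counts."""
    return (minor_alleles(*cases), minor_alleles(*controls),
            major_alleles(*cases), major_alleles(*controls))

def dominant_table(cases, controls):
    """(a, b, c, d) counting people: carriers of any minor allele vs. non-carriers."""
    return (cases[1] + cases[2], controls[1] + controls[2],
            cases[0], controls[0])

def odds_ratio(a, b, c, d):
    return (a / b) / (c / d)

# Hypothetical genotype counts (n0, n1, n2), 1000 people in each group
cases = (800, 180, 20)
controls = (900, 95, 5)

print("allelic: ", allelic_table(cases, controls),
      f"OR = {odds_ratio(*allelic_table(cases, controls)):.2f}")
print("dominant:", dominant_table(cases, controls),
      f"OR = {odds_ratio(*dominant_table(cases, controls)):.2f}")
```

The two models generally give different odds ratios for the same data, so it matters which one a reported figure refers to.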

Just to check that this is indeed how OR is calculated in practice in statistical genetics, I looked up the C++ source code for Shaun Purcell’s PLINK, the program that has produced more published odds ratios than all others combined. In assoc.cpp, one can find the lines:

              // with v. large sample N, better to use: ad/bc = a/b * d/c
	      odds[l] = ( (double)A1 / (double)A2 ) * 
		( (double)U2 / (double)U1 ) ;

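Here, A1 is the minor allele count in affecteds, A2 the major allele count in affecteds, and U1 and U2 the respective counts in unaffecteds. In the notation of my table above, Purcell’s formula is (A1/A2)*(U2/U1) = (a/c)*(d/b), which equals (ad)/(bc); the a, b, c, d in his comment label the cells in the order A1, A2, U1, U2 rather than in my order. As per the comment, this rearrangement is less likely to cause overflow than computing (ad)/(bc) directly.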

Now let’s return to my email from earlier. When I divided the minor allele frequency (MAF) in cases by that in controls, I was taking (a/(a+c))/(b/(b+d)). That’s not actually a proper formula for the risk ratio or the odds ratio. In fact, given only the minor allele frequencies in cases and controls and no counts, we can’t compute the RR at all. The OR can be recovered, but only by first converting each frequency to odds, MAF/(1 − MAF), which is a/c in cases and b/d in controls. Yet in this case I actually gave a very close approximation of the odds ratio, because for very rare alleles a ≪ c and b ≪ d, so (a/(a+c))/(b/(b+d)) ≈ (a/c)/(b/d) = (ad)/(bc).
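
To see how good the approximation is, divide the numerator and denominator of each frequency by the major allele count. The following is just a rearrangement of the expressions above:

```latex
\frac{a/(a+c)}{b/(b+d)}
  = \frac{(a/c)\,/\,(1 + a/c)}{(b/d)\,/\,(1 + b/d)}
  = \frac{ad}{bc}\cdot\frac{1 + b/d}{1 + a/c}
```

When a/c and b/d are both tiny, the second factor is very close to 1, so the ratio of minor allele frequencies and the OR nearly coincide.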

Suppose that the counts for my SNP of interest were as follows:

| | affected | unaffected |
|---|---|---|
| minor allele count | 70 | 100 |
| major allele count | 100,000 | 1,000,000 |

OR = (a/b)/(c/d) = 7.00
MAF(cases)/MAF(controls) = (a/(a+c))/(b/(b+d)) = 6.996
RR = (a/(a+b))/(c/(c+d)) = 4.53
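
These three figures can be checked with a few lines of Python. This is only a sketch using the counts from the table above; the variable names are mine.

```python
# Allele counts from the table above
a, b = 70, 100               # minor allele: affected, unaffected
c, d = 100_000, 1_000_000    # major allele: affected, unaffected

odds_ratio = (a / b) / (c / d)
maf_ratio = (a / (a + c)) / (b / (b + d))
risk_ratio = (a / (a + b)) / (c / (c + d))

print(f"OR        = {odds_ratio:.2f}")
print(f"MAF ratio = {maf_ratio:.3f}")
print(f"RR        = {risk_ratio:.2f}")
```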

So for very rare alleles, the ratio of minor allele frequencies is pretty close to the odds ratio, while it’s not necessarily a good approximation of the risk ratio, which anyway isn’t a figure you want. For genetics and genomics, unless you ascertained people on genotype at a young age, the figure you’re interested in is the odds ratio.