Student's t test vs. log-rank test for mouse studies

Of the candidate therapeutic studies reviewed on this blog, most establish statistical significance using a log-rank test, otherwise known as a Mantel-Cox test, which is almost always presented alongside Kaplan-Meier survival curves [e.g. Leidel 2011, Chung 2011, Kawasaki 2007, Eiden 2012]. A handful, though, use a Student’s t test, which is equivalent to a one-way ANOVA with two categories [e.g. Manueldis 1998, Priola 2000, Geissen 2011, Cortes 2012], and some use both [Riemer 2008].

This can actually make a big difference. Let’s take for example the data from the Scutellaria lateriflora study and run these two tests on the exact same data using R:

> library(survival)
Loading required package: splines
> control=c(144,144,144,146,146,154)
> treatment=c(146,158,176,206,206,206)
> days = c(control,treatment)
> group = c(rep(0,6),rep(1,6))
> status = rep(1,12)
> mice = data.frame(days,status,group)
> msurv = with(mice,Surv(days, status == 1))
> survdiff(Surv(days,status==1)~group,data = mice)
Call:
survdiff(formula = Surv(days, status == 1) ~ group, data = mice)

        N Observed Expected (O-E)^2/E (O-E)^2/V
group=0 6        6     2.67      4.17      8.87
group=1 6        6     9.33      1.19      8.87

 Chisq= 8.9  on 1 degrees of freedom, p= 0.0029 
> 
> t.test(control,treatment)
Welch Two Sample t-test
data: control and treatment 
t = -3.2993, df = 5.207, p-value = 0.02024
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -64.896046 -8.437287 
sample estimates:
mean of x mean of y 
 146.3333 183.0000

So with the exact same data we get a roughly sevenfold difference in p value: .003 with log-rank and .02 with a t test. Why?

It seems especially peculiar that the log-rank test gives a smaller p value. Unlike the classic Student’s t test, it doesn’t assume normality or equal variance. (Note that R’s t.test defaults to the Welch variant shown above, which drops the equal-variance assumption but still assumes normality.) Usually, the fewer assumptions a test makes, the less power it has. For instance, if we didn’t want to assume equal variance, we could do a two-sample Kolmogorov-Smirnov test, which asks “what’s the chance these two samples are drawn from the same distribution?” without any assumption about the shape or variance of that distribution, and we’ll often find that it gives a larger (less significant) p value than a t test. For instance, with the above data:

> ks.test(control,treatment)
Two-sample Kolmogorov-Smirnov test
data: control and treatment 
D = 0.8333, p-value = 0.03101
alternative hypothesis: two-sided
Warning message:
In ks.test(control, treatment) : cannot compute exact p-values with ties

We get p = .03 instead of .02 with the t test. So what makes the log-rank test more “powerful”?

Well, it’s not more powerful, actually. You can make up data for which the log-rank test will give a larger (less significant) p value than the t test, for instance:

> control = c(5,5,5,5,5,6)
> treatment = c(6,6,6,6,6,5)
> days = c(control,treatment)
> group = c(rep(0,6),rep(1,6))
> status = rep(1,12)
> mice = data.frame(days,status,group)
> msurv = with(mice,Surv(days, status == 1))
> survdiff(Surv(days,status==1)~group,data = mice)
Call:
survdiff(formula = Surv(days, status == 1) ~ group, data = mice)

        N Observed Expected (O-E)^2/E (O-E)^2/V
group=0 6        6        4       1.0      4.89
group=1 6        6        8       0.5      4.89

 Chisq= 4.9  on 1 degrees of freedom, p= 0.027 
> t.test(control,treatment)

        Welch Two Sample t-test

data:  control and treatment 
t = -2.8284, df = 10, p-value = 0.0179
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -1.1918440 -0.1414893 
sample estimates:
mean of x mean of y 
 5.166667  5.833333

Here we get a slightly smaller (more significant) p value with a t test. So it’s really just that these two tests assume different underlying distributions, and so respond differently to different types of data.

The Student’s t test asks the question, “I assume these two samples are both normally distributed and have the same variance; tell me, what’s the chance they were drawn from distributions with the same mean?”
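As a rough sketch of what that equal-variance assumption means in practice: R’s t.test defaults to the Welch variant (which is what the output earlier in this post shows), and you have to pass var.equal = TRUE to get the classic Student’s version. Using the Scutellaria data from above and nothing beyond base R:

```r
# Scutellaria lateriflora survival times (days), as used earlier in the post
control   <- c(144, 144, 144, 146, 146, 154)
treatment <- c(146, 158, 176, 206, 206, 206)

welch   <- t.test(control, treatment)                    # R's default: Welch, unequal variances allowed
student <- t.test(control, treatment, var.equal = TRUE)  # classic Student's t test, pooled variance

# Compare degrees of freedom and p values side by side
print(c(welch_df = unname(welch$parameter), student_df = unname(student$parameter)))
print(c(welch_p  = welch$p.value,           student_p  = student$p.value))
```

With equal group sizes the t statistic is the same in both versions; only the degrees of freedom change (10 for Student’s versus about 5.2 for Welch), and that is where the difference in p value comes from.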

The log-rank test is more complicated. It asks, “I assume that at each time an event (e.g. the death of a mouse) occurs, the number of events falling in each group is hypergeometrically distributed, given how many mice in each group were still at risk; tell me, what’s the chance the two groups have the same underlying hazard function?” The log-rank test is calculated, as the name suggests, in terms of ranks. It looks at the order in which events happened and compares the number of events observed in each group with the number expected, without any regard to how far apart the events were spaced. In so doing, it reduces ratio-level data to ordinal-level data. This makes it less sensitive to outliers.
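To make that concrete, here is a minimal by-hand sketch of the two-group log-rank arithmetic in base R. The function name logrank_by_hand is my own (it is not part of the survival package), and it assumes group is coded 0/1 and status is 1 for an observed event and 0 for a censored one. It is meant to illustrate the calculation, not to replace survdiff.

```r
# By-hand two-group log-rank test (illustrative sketch only).
# time, status (1 = event, 0 = censored) and group (0/1) are equal-length vectors.
logrank_by_hand <- function(time, status, group) {
  event_times <- sort(unique(time[status == 1]))
  O1 <- 0; E1 <- 0; V <- 0
  for (tt in event_times) {
    at_risk <- time >= tt
    n  <- sum(at_risk)                               # animals at risk just before tt
    n1 <- sum(at_risk & group == 0)                  # ... of which in group 0
    d  <- sum(time == tt & status == 1)              # events at tt, both groups
    d1 <- sum(time == tt & status == 1 & group == 0) # events at tt in group 0
    O1 <- O1 + d1
    E1 <- E1 + d * n1 / n                            # hypergeometric mean
    if (n > 1) {                                     # hypergeometric variance
      V <- V + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    }
  }
  chisq <- (O1 - E1)^2 / V
  c(observed = O1, expected = E1, chisq = chisq,
    p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

# Scutellaria data from earlier in the post: every mouse has an observed event
control   <- c(144, 144, 144, 146, 146, 154)
treatment <- c(146, 158, 176, 206, 206, 206)
res <- logrank_by_hand(time   = c(control, treatment),
                       status = rep(1, 12),
                       group  = rep(c(0, 1), each = 6))
print(round(res, 4))
```

On the Scutellaria data this reproduces the survdiff numbers shown earlier (2.67 expected events in the control group, chi-square of 8.87). Notice that the survival times only ever enter through comparisons (time >= tt, time == tt), never through their actual magnitudes.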

To prove this to myself I tried two different comparisons. First I wrapped the survival test in a function to just handle the simple case where an event is eventually observed for each mouse (more on this simple case later):

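library(survival)

# Log-rank test for the simple case where every animal has an observed event
# (no censoring): vec1 and vec2 are the event times for the two groups.
simplesurvtest = function(vec1,vec2) {
    days = c(vec1,vec2)
    group = c(rep(0,length(vec1)),rep(1,length(vec2)))
    status = rep(1,length(vec1)+length(vec2))  # 1 = event observed for every mouse
    mice = data.frame(days,status,group)
    return (survdiff(Surv(days,status==1)~group,data = mice))
}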

data                                  log-rank p value   t test p value
c(5,5,5,5,5,6) vs. c(5,5,5,5,5,8)     .551               .550
c(5,5,5,5,5,6) vs. c(5,5,5,5,5,52)    .551               .373

Notice that the addition of a wild outlier in the second dataset doesn’t change the log-rank p value at all, while it does change the t test p value.

Arguably that’s a good thing – maybe you don’t want to be sensitive to outliers.
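One way to see this rank-only behaviour, sketched here assuming the survival package is installed: apply any order-preserving transformation to the survival times (a log, chosen arbitrarily) and the log-rank statistic shouldn’t move, while the t test’s p value does.

```r
library(survival)

# Scutellaria data from earlier in the post
control   <- c(144, 144, 144, 146, 146, 154)
treatment <- c(146, 158, 176, 206, 206, 206)
days   <- c(control, treatment)
group  <- rep(c(0, 1), each = 6)
status <- rep(1, 12)               # every mouse has an observed event

# Log-rank chi-square on the raw times and on log-transformed times
chisq_raw <- survdiff(Surv(days, status) ~ group)$chisq
chisq_log <- survdiff(Surv(log(days), status) ~ group)$chisq
print(all.equal(chisq_raw, chisq_log))   # ordering is unchanged, so the statistic is too

# The t test uses the actual magnitudes, so its p value shifts
p_raw <- t.test(control, treatment)$p.value
p_log <- t.test(log(control), log(treatment))$p.value
print(c(raw = p_raw, log = p_log))
```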

On the other hand you can also make up data to “trick” the log-rank test:

control = c(1,2,3,4,5,6)
treatment1 = c(3.01,4.01,5.01,6.01,7.01,8.01)
treatment2 = treatment1 - .02
simplesurvtest(control,treatment1)
simplesurvtest(control,treatment2)
t.test(control,treatment1)
t.test(control,treatment2)

data                                                   log-rank p value   t test p value
c(1,2,3,4,5,6) vs. c(3.01,4.01,5.01,6.01,7.01,8.01)    .041               .092
c(1,2,3,4,5,6) vs. c(2.99,3.99,4.99,5.99,6.99,7.99)    .173               .095

Here, the data are virtually identical, but because the order of events is changed, the log-rank test sees a big difference in significance. The t test is not “fooled” by this.

But hey, just because you can “fool” a test with very artificial data doesn’t discredit it. In any real study you’ll probably follow up with mice once a day or week, not at arbitrarily small time intervals, and the odds of finding such a “tricky” set of data as shown above are vanishingly small.

The thing that’s special about the log-rank test, the thing you get in return for giving up the ratio-level nature of your data, is its ability to handle censoring. Suppose control scrapie mice in your experiment are dying at 120 dpi. One mouse in the treatment group lives to 130 dpi but then dies of cancer, which you consider an unrelated cause. You don’t know when (if ever) this mouse would have died of scrapie, but you know it lived to at least 130 dpi. In the log-rank test, that information is worth something. In the t test, you have no way to gain information from that mouse. For instance, Geissen 2011 uses a t test and simply throws out the mice that died of other causes.
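A minimal sketch of how that mouse would be coded, using made-up survival times (these numbers are hypothetical, not from any study): the censored mouse gets status 0 at 130 dpi, so it counts as at risk up to that day without counting as a scrapie death.

```r
library(survival)

# HYPOTHETICAL survival times in dpi, for illustration only
control   <- c(118, 119, 120, 120, 121, 122)
treatment <- c(126, 128, 130, 131, 133, 135)

days   <- c(control, treatment)
group  <- rep(c(0, 1), each = 6)
status <- rep(1, 12)          # 1 = died of scrapie
status[days == 130] <- 0      # the 130 dpi mouse died of cancer: censored, not an event

# Log-rank test keeps the censored mouse in the risk set until day 130
fit <- survdiff(Surv(days, status) ~ group)
print(fit)

# The t test alternative: the mouse has to be dropped entirely
print(t.test(control, treatment[treatment != 130]))
```

The survdiff output should show all six treatment mice in the N column but only five observed events for that group.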

The same issue arises if you get to the end of your experiment timeline and some or all mice in the treatment group are still alive (this happened with the high IgA, high IgG mice in Goni 2008). There’s simply no way to run a t test on this data, but the log-rank test handles it through different statuses for the mice. Here’s an example in R based on an approximation of the data from Goni’s Figure 3:

> # Goni 2008 data 
> control = c(rep(195,5),rep(200,5),rep(205,10))
> treatment = rep(400,10) # high IgA, high IgG group
> days = c(control,treatment)
> group = c(rep(0,20),rep(1,10))
> status = c(rep(1,20),rep(0,10))
> mice = data.frame(days,status,group)
> msurv = with(mice,Surv(days, status == 1))
> survdiff(Surv(days,status==1)~group,data = mice)
Call:
survdiff(formula = Surv(days, status == 1) ~ group, data = mice)

         N Observed Expected (O-E)^2/E (O-E)^2/V
group=0 20       20    11.33      6.63      22.9
group=1 10        0     8.67      8.67      22.9

 Chisq= 22.9  on 1 degrees of freedom, p= 1.67e-06

The log-rank test is clearly the choice if your data are censored, or if other assumptions (normality, equal variance) of the t test are violated. If the assumptions of the t test do hold (or if you’re willing to just throw out the censored data) then either test is probably acceptable. The Wikipedia page on log-rank notes that “If censored observations are not present in the data then the Wilcoxon rank sum test is appropriate.” The Wilcoxon rank sum test page in turn provides a comparison to the t test, which basically says that a t test is slightly more powerful when its assumptions are met (normality, equal variance) but that the Wilcoxon rank sum test is useful to reduce sensitivity to outliers or when data are originally ordinal-level.
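For comparison, the Wilcoxon rank sum test is available in base R as wilcox.test. A quick sketch on the Scutellaria data from earlier (I pass exact = FALSE because the tied survival times rule out an exact p value anyway, so R would fall back to the normal approximation with a warning):

```r
# Scutellaria data from earlier in the post (no censoring)
control   <- c(144, 144, 144, 146, 146, 154)
treatment <- c(146, 158, 176, 206, 206, 206)

# Wilcoxon rank sum (Mann-Whitney) test using the normal approximation
w <- wilcox.test(control, treatment, exact = FALSE)
print(w)

# Side by side with the Welch t test on the same data
print(c(wilcoxon_p = w$p.value, t_test_p = t.test(control, treatment)$p.value))
```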

For now at least, my current thinking is “use t test if it’s appropriate, and if it’s not, then use the log-rank test”. But the reading I’ve done and the examples I’ve worked through above don’t give a super clear answer, and I’m open to being convinced otherwise. In my next post I’ll argue that the t test is actually the more appropriate test to assume for power calculations, even if you end up doing a log-rank test once you have the data.

Addendum: survdiff is an asymptotic log-rank test; there are instructions for doing an exact log-rank test using surv_test in the R coin package here (in more recent versions of coin the function is called logrank_test).
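If coin isn’t to hand, an exact permutation version can be sketched directly for small, uncensored datasets: enumerate every way of splitting the 12 mice into two groups of 6 (924 splits) and recompute the survdiff chi-square for each. This is my own brute-force illustration, not coin’s implementation, and the two may treat ties differently; chisq_for is a helper name I made up.

```r
library(survival)

# Scutellaria data from earlier in the post; no censoring, so labels are exchangeable
control   <- c(144, 144, 144, 146, 146, 154)
treatment <- c(146, 158, 176, 206, 206, 206)
days   <- c(control, treatment)
status <- rep(1, 12)

# Log-rank chi-square when the mice at positions idx are labelled "treatment"
chisq_for <- function(idx) {
  group <- as.integer(seq_along(days) %in% idx)
  survdiff(Surv(days, status) ~ group)$chisq
}

observed <- chisq_for(7:12)           # the real labelling
splits   <- combn(12, 6)              # all 924 ways to pick 6 "treatment" mice
perm     <- apply(splits, 2, chisq_for)

# Exact permutation p value: share of labellings at least as extreme as the real one
# (small tolerance guards against floating-point noise in the comparison)
p_exact <- mean(perm >= observed - 1e-9)
print(p_exact)
```

Because the chi-square is the same whichever group is called “treatment”, this is a two-sided test. The approach only makes sense without censoring and gets slow quickly as group sizes grow.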