Log-rank test power depends on variance

…assuming the times of the underlying events are normally distributed.

I brought this issue up in my last biostatistics post, comparing the log-rank vs. t tests for calculating power for mouse studies. My claim is that, for mouse studies where all the mice are expected to show symptoms in the course of the study, at times that are approximately normally distributed, you can’t very well use the log-rank test to calculate power. It will give you outlandish answers implying you need 17,000 mice to have 90% power, and that’s because the log-rank power calculation implicitly assumes an underlying Poisson process, and thus an exponential distribution of times to events. Instead, what you need is a test where the power depends on the variance in your actual mouse model, which is probably more normal than it is exponential.

Last night I connected with @a who pointed out that there are lots of ways to get confused about this: the log-rank test statistic has variance 1 regardless of the data– does that mean variance doesn’t matter in the log-rank test?? And the log-rank test converts everything to ordinal data, which makes it less clear exactly how variance of the original data comes into play. But intuitively you can prove to yourself that variance does still matter. Imagine control and treatment groups with normally distributed ages of onset, with different means– the treatment group lives slightly longer. If variance = 0, all of the control mice will get sick before any of the treatment mice do; as variance rises, the amount of the PDFs that overlap will rise and so the times to onset in the two groups will become more intercalated.

But at @a’s suggestion, in this post I am going to use a simulation to show that, if the times to “events” for mice are normally distributed, the actual power of a log-rank test to detect a difference does depend on variance in that normal distribution.

First, I assume that all mice will have an event eventually—there is no censoring. Therefore I’m going to wrap the requisite code from the R survival library in my own function to make it easier to call:

install.packages("survival")
library(survival)
simplesurvtest = function(vec1,vec2) { # vec1, vec2: lists of survival times of mice in control and treatment groups
    days = c(vec1,vec2) # make a single vector with times for both mice
    group = c(rep(0,length(vec1)),rep(1,length(vec2))) # make a matching vector indicating control/treatment status
    status = rep(1,length(vec1)+length(vec2)) # "status" = 1 = "had an event" for all mice
    mice = data.frame(days,status,group) # convert to data frame
    return (survdiff(Surv(days,status==1)~group,data = mice)) # return log-rank test results
}

Next, I assume times to event for both groups are normally distributed, the control group’s normal distribution has mean m (say, 120 days) and there is an effect size es of the drug such that the treatment group’s mean is m*(1+es) (say, for a 10% delay in onset you get 120*1.1 = 132 days). I’ll also assume the control and treatment groups’ normal distributions have the same standard deviation sd, which I think of as a property of the mouse model itself.

Thus for n mice in each of the two groups, and statistical significance at the alpha threshold desired, and a given m, es, and sd, I’ll run niter iterations and calculate power as the fraction of times that a significant result was found. So this function will do all that for me:

logrankpowersim = function(n,m,es,sd,niter,alpha) {
    numsignificant = 0 # number of iterations in which a significant difference was found
    for (iter in 1:niter) {
        control = rnorm(n=n,m=m,sd=sd) # generate normal distribution for control group
	    treatment = rnorm(n=n,m=m*(1+es),sd=sd) # generate normal dist for treatment group
	    a = simplesurvtest(control,treatment) # do a log-rank test
	    p = 1-pchisq(a$chisq,df=1) # extract p value from log-rank test
	    if (p < alpha) { # check if it is significant at the desired threshold
	        numsignificant = numsignificant + 1
	    }
	}
	power = numsignificant/niter
	return (power)
}

So to run a test on a few different standard deviations, do something like this:

m = 120
es = .10
n = 10
alpha = .05
niter = 100
sds = c(4,8,20,40) # list of standard deviations in days to onset that you want to test
for (sd in sds) {
    cat(logrankpowersim(n=n,m=m,es=es,sd=sd,niter=niter,alpha=alpha)) # cat() in R is like echo...
	cat("\n") # except it doesn't add its own line breaks  
}

And here are results:

sd	power
4 days	100%
8 days	78%
20 days	31%
40 days	7%

Sure enough: more variance means less power.

Moreover, in support of my claim that the log-rank power calcuation assumes an exponential distribution, I can re-create the result I got in the last post, of ~90% power with 17,530 mice for a 6% effect size at alpha = .01, if I do use an exponential distribution:

logrankpowersim_exp = function(n,m,es,niter,alpha) {
    numsignificant = 0 # number of iterations in which a significant difference was found
    for (iter in 1:niter) {
        control = rexp(n=n,rate=1/m) # generate exponential distribution for control group
	    treatment = rexp(n=n,rate=1/(m*(1+es))) # generate exponential dist for treatment group
	    a = simplesurvtest(control,treatment) # do a log-rank test
	    p = 1-pchisq(a$chisq,df=1) # extract p value from log-rank test
	    if (p < alpha) { # check if it is significant at the desired threshold
	        numsignificant = numsignificant + 1
	    }
	}
	power = numsignificant/niter
	return (power)
}

m = 120
es = .06
n = 17530/2
alpha = .01
niter = 1000
logrankpowersim_exp(n=n,m=m,es=es,niter=niter,alpha=alpha)

Result: .898

This supports my claim that the log-rank test power calculation assumes the underlying data are exponentially distributed.

In sum, the log-rank test is not the right model to assume when you are doing a power calculation, if your actual data are normally distributed and uncensored, which is usually the case in studies of prion-infected mice. In fact, even if the data have a moderate amount of deviation from normality and there is going to be some censoring, I would still say you should calculate power with a t test, because the log-rank power calculation is just so far off.

The log-rank test is perfect for studies of people, where not everyone will have an “event” and so huge samples are needed to find significance. But if it took thousands of mice to get a significant result, people wouldn’t use mouse models. We design mouse models to be quick and reliable so that we can study a complex disease in a short amount of time at a manageable cost. That ends up meaning you’ll get something closer to a normal distribution.

Does this discredit the use of the log-rank test for mouse studies? Not totally: the test actually does give reasonable results, it’s more just the power calculation I have a problem with. The test arguably still has its place for its ability to handle censoring. That said, after spending some more time thinking about the nature of the underlying data, I am becoming somewhat more of a proponent of the t test for these studies.