Implementation of RVT1 and RVT2 in R and SQL

introduction. RVT1 is a model for finding genes in which more than one rare variant can contribute to a phenotype. I introduced RVT1 in more detail in this post, but here is a quick summary. The idea is that you have called variants on a bunch of case/control samples and you want to test the hypothesis that a number of your cases have different rare variants in the same gene, leading to the same loss of function and same phenotype. This is useful when your sample size is too small to detect the impact of these rare variants individually. Indeed, the ability to look at the impact of multiple variants too rare to be included on SNP arrays is considered to be one of the major benefits of exome sequencing. Formally, Morris and Zeggini 2010 define RVT1 as follows:

Consider a sample of unrelated individuals, phenotyped for a normally distributed trait, and typed for rare variants in a gene or small genomic region. Let n_i denote the number of rare variants for which theith individual has been successfully genotyped, and let r_i denote the number of these variants at which they carry at least one copy of the minor allele. We can model the phenotype, y_i, of the ith individual in a linear regression framework, given by y_i = E[y_i]+ɛ_i, where ɛ_i,σ_E), and

In this expression, x_i denotes a vector of covariate measurements for the ith individual, with corresponding regression coefficients β. The parameter λ is the expected increase in the phenotype for an individual carrying a full complement of minor alleles at rare variants compared to an individual carrying none.

update 2012-10-08: below code now includes RVT2 as well. See Morris and Zeggini 2010 for explanation of that model.

~~Note that when they say “linear regression,” it’s linear if you have a quantitative phenotype and logistic if you have a dichotomous phenotype~~. update 2012-10-08: it may also make more sense to do linear regression even with a dichotomous phenotype, and this code has been updated to do that. see comment. end update There’s already an R implementation of RVT1 in AssotesteR, described here and source code available here. But it’s somewhat limited because it only correlates genotype with phenotype; there is no room to add in x_i, that vector of covariates, such as principal components of population substructure. And for my purposes, most of the battle was just applying the right filters and getting the data into the right format and into R, then getting it out of R again into a format I can play with. At that point I felt like it made more sense to just do my own implementation end-to-end.

assumptions. This post assumes you’ve already called variants and read them into SQL tables as I specified in this GATK pipeline– and that you’ve also gotten annotations as to the (population, not sample) allele frequency of your variants– you can use annovar as described here. You’ll also need R, RPostgreSQL, and PostgreSQL.

the code. Here is an R script with a SQL script embedded in it. The SQL code aggregates the rare variants by gene, R fits the logistic regression model for each gene, and then puts the results into a new SQL table so you can play with them later.

First, create a table of source data in SQL:

-- create table of underlying data for RVT1 and RVT2 models
drop table if exists rvt_data;
create table rvt_data as
select   rvgm.gene_name,
         sv.sid,
         s.phen_code,
         sum(case when ((sv.variant_allele_count > 0 and rvgm.vaf < .125) or (sv.variant_allele_count < 2 and rvgm.vaf > .875)) and use_flag then 1 else 0 end)::numeric ri, -- number of rare variant loci where this sample has a rare variant
         sum(case when use_flag then sv.alleles_genotyped else 0 end)::numeric/2.0 ni -- number of rare variant loci genotyped for this sample
from     variants v, sample_variants sv, samples s,
         (select   e.vid, e.gene_name, coalesce(a.n1000g2012feb_all,0.0) vaf
         from     annovar_genome_summary a, effects e
         where    e.vid = a.vid
         and      (a.n1000g2012feb_all < .125 or a.n1000g2012feb_all is null or a.n1000g2012feb_all > .875) -- specify your MAF threshold in this line. I used 12.5%.
         and      e.effect_impact in ('MODERATE','HIGH') -- sepcify your inclusion criteria here
         group by e.vid, e.gene_name, coalesce(a.n1000g2012feb_all,0.0)
         order by e.vid, e.gene_name) rvgm -- [rare variant]-[gene] match
where    v.vid = sv.vid
and      sv.sid = s.sample_id
and      v.vid = rvgm.vid
group by rvgm.gene_name, sv.sid, s.phen_code
order by rvgm.gene_name, sv.sid
;

Then pull the contents of that table into R, run the analysis and put the RVT1 and RVT2 results each into a new table:

library(RPostgreSQL) # first load package RPostgresql
drv <- dbDriver("PostgreSQL")
readcon <- dbConnect(drv, dbname="mydb", user="postgres", password="password")
writecon <- dbConnect(drv, dbname="mydb", user="postgres", password="password")
nsamples <- 50
readsql <- "
select gene_name, sid, phen_code, ri, ni
from rvt_data
order by gene_name, sid
;"
rs <- dbSendQuery(readcon,readsql)
writesql <- "
drop table if exists rvt1_results;
create table rvt1_results (
 gene_name varchar,
 lambda numeric,
 p_value numeric
);
drop table if exists rvt2_results;
create table rvt2_results (
 gene_name varchar,
 lambda numeric,
 p_value numeric
);
"
tmp <- dbSendQuery(writecon,writesql)
while (!dbHasCompleted(rs)) {  # for each gene
    tbl <- fetch(rs,n=nsamples) # get the data for the fifty samples
    nonzero <- (tbl$ni != 0) # find the rows where ni is nonzero, i.e. the sample had at least one locus genotyped
    tbl <- tbl[nonzero,] # restrict the analysis to rows where ni is nonzero
    if(nrow(tbl) < .9*nsamples) next # only analyze genes where at least 90% of samples can be included
    gene_name <- tbl$gene_name[1]
    # RVT1
    rini <- tbl$ri/tbl$ni # ri/ni
    if (length(unique(rini)) != 1) { # skip monomorphic cases
    model <- lm(tbl$phen_code ~ rini)  # you can add your xi vector of covariates to this model here
    coeff <- summary(model)$coef["rini","Estimate"]
    p_value <- summary(model)$coef["rini","Pr(>|t|)"]
    writesql <- paste("insert into rvt1_results (gene_name,lambda,p_value) values('",gene_name,"',",coeff,",",p_value,");",sep="")
    tmp <- dbSendQuery(writecon,writesql)
    }
    # RVT2
    Iri <- 1*(tbl$ri > 0) # indicator variable for whether any rare variants are present. multiplying by 1 converts boolean t/f to integer 1/0
    if (length(unique(Iri)) != 1) { # skip monomorphic cases
    model <- lm(tbl$phen_code ~ Iri)
    coeff <- summary(model)$coef["Iri","Estimate"]
    p_value <- summary(model)$coef["Iri","Pr(>|t|)"]
    writesql <- paste("insert into rvt2_results (gene_name,lambda,p_value) values('",gene_name,"',",coeff,",",p_value,");",sep="")
    tmp <- dbSendQuery(writecon,writesql)
    }
}
dbDisconnect(readcon)
dbDisconnect(writecon)
dbUnloadDriver(drv)

In writing this I found useful Christopher Manning’s tutorial on logistic regression in R and this stackoverflow discussion of how to extract coefficients from an R model. I also learned that in R it appears there really is no better way to put variables into a SQL statement than by concatenation [1][2][3]. And I discovered that my new least favorite thing about R is that, when results for a variable are undefined, that variable’s coefficient is simply eliminated from the result matrix, thus making indexing impossible. It’s mentioned in this stackoverflow discussion, though the answer marked correct there did not work for me– ~~instead, I just figured out I could use the if/else clause with is.na as shown in the code above.~~ update 2012-10-08: I now check for monomorphic variants with if (length(unique(rini)) != 1). It’s also possible to run the model first and then exclude results with “NA” coefficients with this line of code: if(is.na(coef(model)["rini"])). end update Finally,if you run this and encounter some error you will need to dump all the remaining database results before you can close or reuse the connection. That can be achieved with dbClearResult(rs).

A note about how this code is structured. In a large dataset it’s likely at least one sample will have zero rare variant loci genotyped for at least one gene that has such loci. So if you divide r_i/n_i in your SQL code you’ll get a division by zero error. And if you use a where clause to exclude those rows, then you won’t have exactly nsamples rows per gene in the table you read into R, which would make it more complicated to iterate through the results. So to avoid that problem I only calcuate r_i and n_i in SQL, and then do the zero-checking and division in R.

next steps. After you’ve gotten your p values using this script, you can run this matplotlib scriptto QQ plot them. And once you’ve created variables to capture any population substructure you wish to control for, you can add those into your glm model easily, ex. tbl$phenotype ~ rini + covariates.

update 2012-09-25: the code above has been updated to include variants where the reference genome actually has the minor allele. (Which happens often: in my dataset, about 10% of the variants have a 1000 genomes frequency > 50%, implying the reference has a frequency < 50%). To correctly handle those cases, I increment r_i for samples where the “vaf” (variant allele frequency) is < .125 and the sample has at least one variant allele, OR the vaf > .875 and the sample has at least one reference allele.

update 2012-09-27: fixed bug. sum(sv.alleles_genotyped)::numeric changed to sum(sv.alleles_genotyped)::numeric/2.0

update 2012-10-08: I overhauled all the above code to make the following changes: (1) changed from logistic to linear model (see discussion below), (2) stored model’s input data in a separate table so you can query it later, (3) allowed it to run on the subset of samples with at least one successfully genotyped locus, down to a genotyping rate threshold (currently set to .9) and (4) added the RVT2 model. Also here is a SQL query that provides table showing the RVT1 results along with some summary data about each gene which makes this more intuitive to browse rather than just looking at p values alone:

select   r.gene_name,
         r.lambda,
         r.p_value,
         least(r.p_value*tests.n,1) bonf,  -- p value after correction for multiple testing
         sum(case when d.phen_code = 1 then d.ri else 0 end) case_rv,     -- total number of rare variants in cases
         sum(case when d.phen_code = 0 then d.ri else 0 end) control_rv,  -- total number of rare variants in controls
         max(d.ni) n_loci  -- number of rare variant loci genotyped
from     rvt_data d, rvt1_results r, (select count(*)::numeric n from rvt1_results) tests
where    d.gene_name = r.gene_name
group by r.gene_name,
         r.lambda,
         r.p_value,
         tests.n
order by r.p_value asc
;