In my last post I documented the technical issues I encountered when trying to run the widely used exome sequencing pipeline posted at seqanswers.  That pipeline is based on BWA for alignment, Picard for some intermediate processing steps, and GATK for variant calling, among other tools.  After I was unable to get that approach to work, I started over from scratch and created an alternative pipeline based primarily on bowtie2 for alignment and samtools for processing and variant calling.  The purpose of this post is to document this alternative pipeline.

Overview: This pipeline is for analyzing human exome data.  It assumes you have paired-end reads in the form of FASTQ files from your sequencing company and that your goal is to align these to the genome and then call variants for further analysis.

Below I will include code and estimated running times for every step.  For the record, I’ve got 50 samples with paired-end reads, so 100 FASTQ files, ranging from about 15 to 30 million reads each, and I am running this on an HP Red Hat Linux cluster with 8GB of memory and 8 cores per node.  But IMHO, the details aren’t that important: your running times will obviously differ from mine, but this should give you a sense of the order of magnitude of how long these things take.  That’s helpful because then you know whether to kill a job that’s been running too long, and you can set running-time limits on your jobs (using bsub's -W option if you’re on an LSF scheduler), which will help keep your priority nice and high on shared resources.
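For example, on LSF a wall-clock limit goes on with -W; something like this (the four-hour limit and the log file name are just placeholders to adapt) would wrap the gunzip step from step 1 below:

bsub -W 4:00 -o logs/gunzip.%J.out "gunzip -c 1_1.fq.gz > 1_1.fq"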

0. Prep steps.  You can go ahead and start step 1 and then come back and do these things while waiting for other steps to run.

0.1: By Step 3 you will need a bowtie2-formatted index of hg19.  You could index it yourself but I am told this takes several hours.  Instead just download it:

wget ftp://ftp.cbcb.umd.edu/pub/data/bowtie2_indexes/incl/hg19.zip
unzip hg19.zip

0.2: Also by the time you get to step 3: you will need a samtools index of hg19 which exactly matches the bowtie2 index you just downloaded. What does that mean? hg19 has separate fasta files for all the human chromosomes plus a bunch of random contigs. Which ones you include, and in which order, matters. If you do step 3 with the index found at the above link, which includes random contigs, but then you try to run samtools with an hg19.fa that does not include the random contigs (or, gasp, has them in lexicographic order, i.e. chr1, chr10, chr11, …, chr2, chr20, … instead of numerical order), it won’t work.

So download hg19, extract the tarball, and concatenate exactly the files included in the above bowtie2 index, in exactly the same order, into a new hg19-bt2.fa (so named to remind you that its contents are designed to match those of the bowtie2 index you downloaded). Then create a samtools reference index (*.fai) on it:

wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar -zxvf chromFa.tar.gz
cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa chr1_gl000191_random.fa chr1_gl000192_random.fa chr4_gl000193_random.fa chr4_gl000194_random.fa chr7_gl000195_random.fa chr8_gl000196_random.fa chr8_gl000197_random.fa chr9_gl000198_random.fa chr9_gl000199_random.fa chr9_gl000200_random.fa chr9_gl000201_random.fa chr11_gl000202_random.fa chr17_gl000203_random.fa chr17_gl000204_random.fa chr17_gl000205_random.fa chr17_gl000206_random.fa chr18_gl000207_random.fa chr19_gl000208_random.fa chr19_gl000209_random.fa chr21_gl000210_random.fa chrUn_gl000211.fa chrUn_gl000212.fa chrUn_gl000213.fa chrUn_gl000214.fa chrUn_gl000215.fa chrUn_gl000216.fa chrUn_gl000217.fa chrUn_gl000218.fa chrUn_gl000219.fa chrUn_gl000220.fa chrUn_gl000221.fa chrUn_gl000222.fa chrUn_gl000223.fa chrUn_gl000224.fa chrUn_gl000225.fa chrUn_gl000226.fa chrUn_gl000227.fa chrUn_gl000228.fa chrUn_gl000229.fa chrUn_gl000230.fa chrUn_gl000231.fa chrUn_gl000232.fa chrUn_gl000233.fa chrUn_gl000234.fa chrUn_gl000235.fa chrUn_gl000236.fa chrUn_gl000237.fa chrUn_gl000238.fa chrUn_gl000239.fa chrUn_gl000240.fa chrUn_gl000241.fa chrUn_gl000242.fa chrUn_gl000243.fa chrUn_gl000244.fa chrUn_gl000245.fa chrUn_gl000246.fa chrUn_gl000247.fa chrUn_gl000248.fa chrUn_gl000249.fa > hg19-bt2.fa
samtools faidx hg19-bt2.fa

This will create hg19-bt2.fa.fai, which you’ll need in step 3 (and hg19-bt2.fa itself gets used again in step 8).
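If you want to double-check that your new hg19-bt2.fa really does match the bowtie2 index, a quick sanity check is to compare the sequence names in each. This assumes bowtie2-inspect is on your PATH and lists names in index order, which I believe it does:

bowtie2-inspect -n hg19 > bt2_names.txt
cut -f1 hg19-bt2.fa.fai > fai_names.txt
diff bt2_names.txt fai_names.txt

If diff prints nothing, the names and their order agree.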

For future compatibility when new versions of bowtie2 and/or the hg reference sequence come out: I got the above list and order of files included in the bowtie2 index by examining the file make_hg19.sh which is included in the bowtie2 hg19.zip file you downloaded above.

0.3: By step 8 you will want a BED file listing the genomic regions that constitute the human exome.  Now, even though you targeted the exome when you created your library for sequencing, you will get some reads aligning to non-targeted regions. However, odds are that your coverage of non-targeted regions will be too low to have meaningful results from any variants there anyway, so you may as well not even call those variants in the first place. Instead, save time by only calling variants in your targeted regions, i.e. the exome. If you used the Illumina TruSeq exome capture kit, the official BED file for it is here but you need to log in to download; if you trust things people post on the internet, someone has uploaded a free copy here.  If your exome capture kit was NimbleGen SeqCap, the relevant page is here and the actual files are here: v2.0 and v3.0.
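One quick thing worth doing with whichever BED file you end up with is totaling the targeted bases, since that number is a handy sanity check on the coverage stats in step 11. A one-liner like this should do it (assuming a plain three-plus-column BED with no header or track lines):

awk -F'\t' '{target += $3 - $2} END {print target " targeted bases"}' SeqCap_EZ_Exome_v2.bed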

1. decompression: gunzip

All of the tools you’re about to use can run directly on .gz files without unzipping.  However, I have found it’s not uncommon for .gz files to get corrupted while you’re downloading them from your sequencing company.  If that happens, your alignment job might run for 3 hours and get through 95% of the file before encountering an invalid compression block and crashing, leaving you with nothing.  Better to find that out before investing the three hours of CPU time: gunzip first, which takes just a few minutes, and you’ll know right away if there are any files you need to re-download.

gunzip -c 1_1.fq.gz > 1_1.fq

3 minutes per fq.gz file

If there’s a corrupted file you’ll get a message like invalid compressed data--format violated.

Important note: gunzip’s default behavior is to remove the compressed file when it’s done; to be safe you should probably keep it by using the -c flag (which writes the decompressed data to stdout, hence the redirect) as I’ve done above.
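If you only want the integrity check and don’t care about keeping a decompressed copy around, gzip can also test an archive in place. I still gunzip because the decompressed files get used below, but this is a cheap way to re-verify a file later:

gzip -t 1_1.fq.gz && echo "1_1.fq.gz OK"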

2. raw reads quality control: fastqc

fastqc 1_1.fq -f fastq -o fastqc/

8 minutes per fq file

I’ve provided a script to load the FastQC output data into PostgreSQL to analyze in bulk here.

3. alignment: bowtie2 piped directly to samtools view

bowtie2 --end-to-end --very-fast --rg-id 1 --rg "LB:project_name" --rg "SM:1" --rg "PL:ILLUMINA" -x hg19 -q -1 1_1.fq -2 1_2.fq | samtools view -Sb -h -t hg19-bt2.fa.fai -o 1.bam -

4 hours per pair of fq files

Notice that this is running the fastest possible settings, --end-to-end --very-fast, and piping directly to samtools, which avoids writing a SAM to disk and speeds things up according to this.  Before running this step I did a little experiment to see whether --end-to-end or --local was faster.  I had some per-base sequence content issues at the ends of reads in some of my samples according to FastQC, so I thought soft clipping the ends when necessary might be a good thing and I might prefer --local.  I started two processes on identical input data and killed them after about 10 minutes to see how much data they’d written to the BAM file.  Results:

end-to-end: 1150550016 bytes in 626 seconds = 1,837,939 bytes/sec
local: 683245568 bytes in 606 seconds = 1,127,467 bytes/sec

So end-to-end was roughly 60% faster in that (admittedly crude) test.  I didn’t do any experiments on --very-fast versus --fast, --sensitive, and --very-sensitive, though.  Just note that depending on your needs, the more sensitive options might be important.
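If you want to run the same kind of head-to-head test yourself, the method was nothing fancier than starting two alignments on the same input, letting them write to differently named BAMs (the file names here are hypothetical), and watching the output sizes grow:

# check output sizes every ten minutes while both alignments run
while sleep 600; do ls -l endtoend.bam local.bam; done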

Also note that as mentioned in step 0, I used an already-indexed hg19 from UCSC available here, rather than indexing it myself with bowtie2.

4. sorting: samtools sort

samtools sort -m 8000000000 1.bam 1.srtd

20 minutes per bam

By default, samtools sort only uses 500 MB of memory. If you have more, use more: hence the -m 8000000000 (i.e. 8 GB, in bytes) option used above. It will be faster and won’t have to write temporary files to your disk (which it then leaves there after it is done).

5. reheadering: samtools reheader

This step is necessary because samtools sort (and Picard’s SortSam from what I’m told, though I never got it to run myself) will sort a BAM by coordinates but then not update the header to indicate that the BAM is sorted.  First use view -H to retrieve the original header (this takes just a couple seconds):

samtools view -H 1.srtd.bam > 1originalheader.sam

Then open up the SAM file and edit it. On the first line, change SO:unsorted to SO:coordinate (the value the SAM spec uses for coordinate-sorted files). While you’re at it, double check that the @RG line is well formed; for some reason mine had gotten corrupted and had two “ID” tags in it.  (If you’re doing all this programmatically, say, from Python, it’s easy to mess up the tabs.)  Save the edited header as 1reheader.sam, then apply it to the BAM file using reheader:

samtools reheader 1reheader.sam 1.srtd.bam > 1.srtd.reh.bam

3 minutes per bam
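If you’d rather not open an editor at all, a one-liner along these lines should produce the edited header in one shot (a sketch; it only fixes the SO tag, so still eyeball the @RG line before applying it):

samtools view -H 1.srtd.bam | sed 's/SO:unsorted/SO:coordinate/' > 1reheader.sam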

6. removing PCR duplicates: samtools rmdup

samtools rmdup 1.srtd.reh.bam 1.srtd.reh.ddup.bam

12 minutes per bam

According to FastQC I had about 27% PCR duplicates. This removes them. Picard’s MarkDuplicates is supposed to just mark them rather than remove them, but I couldn’t get Picard to work and I’m also not sure what the benefit is of marking them rather than totally removing them.
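If you want to see exactly how many reads rmdup took out of a given BAM, running samtools flagstat on the before and after files gives you quick read counts to compare (a minute or so each):

samtools flagstat 1.srtd.reh.bam
samtools flagstat 1.srtd.reh.ddup.bam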

7. indexing: samtools index

samtools index 1.srtd.reh.ddup.bam

1 minute per bam

This creates a .bam.bai file in the same directory.  I’m not actually sure if samtools mpileup even uses the .bai index that this creates, but I figured it can’t hurt.
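One thing the .bai definitely buys you is random access, which is handy for spot-checking a region (the coordinates here are arbitrary):

samtools view 1.srtd.reh.ddup.bam chr1:1000000-1001000 | head -5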

8. calling variants: samtools mpileup piped to bcftools view

First create a list of bams to use:

ls *.srtd.reh.ddup.bam > bamlist.txt

Then pile them up, call variants only on the targeted regions according to your BED file (see discussion above), and pipe it directly to bcftools to create a vcf file, all in one line:

samtools mpileup -d 200 -D -B -f ../hg19/fasta/hg19-bt2.fa -b bamlist.txt -l ../bed/SeqCap_EZ_Exome_v2.bed -u | bcftools view - -v -c -g > variants.vcf

16 hours for 50 bams*

Remember, hg19-bt2.fa is the reference sequence for which we created a samtools reference index, hg19-bt2.fa.fai, for use in step 3. Here we use it again.

The mpileup example on the samtools website shows a command that pipes through bcftools twice: once to create a BCF and then again to convert that to a VCF. I don’t know why they do this. The above, going directly from mpileup’s output into a VCF, worked great for me.

*If, while your program is running, you want to check on its progress and that its output is well formed, here’s a trick. First, do this:

head -1000 variants.vcf

Then look at the output. Does it have everything you want in it? Good. Now go down to where the data starts and see how wide the lines are; mine averaged about 800 characters.  So now you can list details (ll, i.e. ls -l) to get the file size in bytes, divide by 800, and there you go: that’s an estimate of the number of lines written to the file so far. Now that you have an idea how many lines there are, use head to retrieve most of them (be conservative so you don’t hit EOF) and pipe them to tail to see what’s at the end.

head -10000 variants.vcf | tail

You will see the most recent genomic positions where variants were called.  Say you’ve gotten through the first 180M bases of chr1 in an hour. That’s about 1/16 of the genome, so you can estimate you’ll probably take about 16 hours to finish.
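If you’d rather not do the size-divided-by-line-width arithmetic by hand, a one-liner like this gives the same rough estimate (assuming GNU ls, where the fifth field of ls -l is the size in bytes, and the ~800 characters per line I measured; substitute your own average):

ls -l variants.vcf | awk '{printf "%.0f lines written so far (approx)\n", $5/800}'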

9. annotating variants: annovar

At this step, you’ve already got raw variants– chromosome, position, reference allele, alternate allele. “Annotating” refers to overlaying humankind’s existing knowledge of our genome onto this information. What gene does this variant fall in? How highly conserved is that position? And so on.

I used annovar, which was a bit tricky to get started with but is quite powerful once you get rolling. Before you can use annovar, you need to download all the databases containing the annotation information. annovar does not have any lookup service that automatically finds the most recent versions of every database and uses them; instead you have to specify what you want.  The defaults are not always the most recent versions of everything (e.g. hg18, dbSNP130, and so on), though the annovar website lists more recent versions of these.  You’ll probably need a few tries to find the right places to download these databases from and to specify the right command-line parameters in order to use them. As of September 3, 2012, these commands (executed from the directory where annovar is installed) seem to reflect the most recent versions of everything and viable places to find them, but I’m sure that won’t be true for long:

annotate_variation.pl -downdb -buildver hg19 gene humandb
annotate_variation.pl -downdb -buildver hg19 band humandb/
annotate_variation.pl -downdb -buildver hg19 tfbs humandb/
annotate_variation.pl -downdb -buildver hg19 mce28way humandb/
annotate_variation.pl -downdb -buildver hg19 segdup humandb/
annotate_variation.pl -downdb -buildver hg19 mirna humandb/
annotate_variation.pl -downdb -buildver hg19 mirnatarget humandb/
annotate_variation.pl -downdb -buildver hg19 evofold humandb/
annotate_variation.pl -downdb -buildver hg19 dgv humandb/
annotate_variation.pl -downdb -buildver hg19 omimgene humandb/
annotate_variation.pl -downdb -buildver hg19 gwascatalog humandb/
annotate_variation.pl -downdb -buildver hg19 phastConsElements28wayPlacMammal humandb/
annotate_variation.pl -downdb -buildver hg19 1000g2012feb humandb/
annotate_variation.pl -downdb -buildver hg19 snp132 humandb/
annotate_variation.pl -downdb -buildver hg19 avsift humandb/
annotate_variation.pl -downdb -buildver hg19 jb_all humandb/
annotate_variation.pl -downdb -buildver hg19 phastConsElements46way humandb/
# ones you have to get directly from annovar use -webfrom annovar
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 ljb_all humandb/
annotate_variation.pl -downdb -webfrom annovar -buildver hg19 esp5400_all humandb/

Then, to actually run annovar, do this:

# convert your vcf to an annovar file
perl ~/bin/annovar/convert2annovar.pl --format vcf4 --includeinfo variants.vcf > variants.annovar
# do the annotation
perl ~/bin/annovar/summarize_annovar.pl --buildver hg19 --ver1000g 1000g2012feb --verdbsnp 132 variants.annovar ~/bin/annovar/humandb -outfile annovar/variantsann

Together these two commands take about 30 minutes.
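A quick way to see which annotation columns you actually got, and in what order, is to look at the header row of the exome summary file; this is worth doing before writing the loading script in step 12 (the file name follows from the -outfile prefix used above):

head -1 annovar/variantsann.exome_summary.csv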

10. converting to plink format: vcftools

PLINK can do a lot. It can actually do a whole GWAS for you, if your needs are pretty standard. Even if you’re planning to do your own modeling and analysis, you might want to use PLINK to model population substructure, which is to say, to cluster samples into groups (ethnic groups or families) in order to control for their relatedness to each other. To get started with PLINK, you have to have ped and map files, or the transposed versions, which are tped and tfam files. You can create those like so:

vcftools --vcf variants.vcf --plink-tped

This took about 30 seconds.

One irritating thing is that when vcftools converts vcf files to tped/tfam files, it leaves strings such as “chr1” in the chromosome field, which PLINK doesn’t recognize; if you then do any association testing, PLINK loses that information and outputs 0 in the chromosome column for every row. To avoid this, I wrote this python script to change the chromosome names to numbers (mapping for X, Y, and M described here) and give the SNPs names that include the chromosome:

import sys

i = open("c:/path/orig.tped","r")
o = open("c:/path/fixed.tped","w")

for line in i.readlines():
    cols = line.split("\t")
    if (line[:3] == "chr"):
        cols[0] = cols[0][3:]# strip off the letters 'chr'
        cols[1] = ".".join((cols[0],cols[1])) # add chr no to pos to form snp id
        if (cols[0] == "X"):
            cols[0] = "23"
        elif (cols[0] == "Y"):
            cols[0] = "24"
        elif (cols[0] == "XY"):
            cols[0] = "25"
        elif (cols[0] == "M"):
            cols[0] = "26"
        fixedline = "\t".join(cols)
        o.write(fixedline)
    else:
        o.write(line)

i.close()
o.close()

Another issue is that at this point your tfam file has no sex or phenotype information. PLINK will still run without this information, but you won’t be able to use a lot of the cooler features. I found it easiest to just copy and paste that information into the relevant columns of the tfam file (believe it or not, I actually used Excel for this).
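If you’d rather script that than use Excel, here’s a minimal sketch. It assumes a hypothetical whitespace-delimited pheno.txt with one “sex phenotype” line per sample, in the same order as the samples in the tfam, and that your tfam is the default out.tfam that vcftools writes (adjust the names to match your files):

# keep the first four tfam columns, append the sex and phenotype columns from pheno.txt
paste -d' ' <(awk '{print $1, $2, $3, $4}' out.tfam) pheno.txt > out.fixed.tfam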

11. calculate coverage: bedtools

In step 2, FastQC gave you quality metrics on the reads themselves.  Now that you’ve done alignment, one of the basic QC metrics you’ll want is your coverage, meaning how many reads overlap each position in the genome.  Since you’re just doing exome sequencing, you probably only need to calculate coverage for the targeted regions, so we’ll again use the BED file you downloaded in step 0.3.

The only tool I could find to calculate coverage is bedtools, though it’s a bit of overkill. I just wanted a few basic stats: average coverage of the target, % of bases with at least 8x coverage, etc., but bedtools only knows how to give you reams and reams of very detailed data.  Hey, at least that gives you flexibility.  In -hist mode, the coverageBed tool will give you a histogram of depth of coverage for each targeted interval, and then a summary at the end with a histogram of coverage for the whole target.  In all, this creates about 400MB worth of histogram data for a several-GB BAM file, which gets kind of burdensome.  I only cared about the summary histogram of target coverage, so I piped bedtools output directly to tail to get a much smaller file containing just the summary rows I cared about:

coverageBed -abam 1.srtd.reh.ddup.bam -b SeqCap_EZ_Exome_v2.bed -hist | tail -10000 > 1.coverage

5 minutes per bam
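The whole-target summary rows start with the literal string “all” (that’s also what the loading script below keys on: depth in the second column, number of bases at that depth in the third), so you can peek at them directly:

grep '^all' 1.coverage | head -5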

Then I wrote a Python script to read the summary rows from all my samples into a PostgreSQL table:

import sys
import csv
import psycopg2

ctsql="""
drop table if exists exome_coverage_histo;
create table exome_coverage_histo (
sample_id integer,
depth integer,
bases integer
);
"""

conn = psycopg2.connect(database="mydb", user="postgres", password="password")
conn.autocommit = True
cur = conn.cursor()

cur.execute(ctsql)

localdir = "c:/your/path/here/"

for i in range(0,50):
    f = open(localdir+str(i)+".coverage","r")
    c = csv.reader(f, delimiter="\t")
    for row in c:
        if (row[0] != "all"):
            continue
        cur.execute("insert into exome_coverage_histo(sample_id,depth,bases)values(%s,%s,%s);",(i,int(row[1]),int(row[2])))
    f.close()

cur.close()
conn.close()

That took about 2 minutes to run.  And then you can get your summary stats with a few SQL queries:

-- Average coverage of target by sample
select sample_id, sum(depth*bases)::numeric / sum(bases)::numeric as avg_coverage
from exome_coverage_histo
group by sample_id
order by avg_coverage
;

-- Average coverage of target across all samples
select sum(depth*bases)::numeric / sum(bases)::numeric as avg_coverage
from exome_coverage_histo
;

-- Percent of target covered at 8x or greater, by sample
select sample_id, sum(case when depth >= 8 then bases else 0 end)::numeric / sum(bases)::numeric as fraction8x
from exome_coverage_histo
group by sample_id
order by fraction8x
;

-- Percent of target covered at 8x or greater, across all samples
select sum(case when depth >= 8 then bases else 0 end)::numeric / sum(bases)::numeric as fraction8x
from exome_coverage_histo
;

12. converting results to database format: Python and PostgreSQL

Unless you’re using PLINK for all of the analysis you plan to do, you’ll probably want to be able to pull results into R (or Stata, Matlab, etc.) to do a bit more with. It’s frustrating, then, that both VCF files and annovar’s annotated CSV files are so terribly non-relational.  If you want to pull your variants into a MySQL database, Ensembl provides a perl script here, though the Ensembl E-R diagram is quite a monster.  In an effort to create something that would be simple and that I’d understand through-and-through, I wrote my own Python script to populate a PostgreSQL database.  It’s not perfect– some (actually, much) relationality is still lost in my schema, and you have to be careful how your genotypes are coded too: this script assumes that your VCF is unphased and contains only two alleles (ref and alt) per line; if you have multi-allelic variants or you called haplotypes you’ll need to modify this.

import sys
import math
import psycopg2 # http://www.initd.org/psycopg/

create_table_sql = """
drop table if exists variants;
create table variants (
 vid serial primary key,
 chrom varchar,
 pos integer,
 id varchar,
 ref varchar,
 alt varchar,
 qual numeric,
 filter varchar,
 dp numeric,
 dp4 varchar,
 mq numeric,
 fq numeric,
 af1 numeric,
 ac1 numeric,
 g3 varchar,
 hwe numeric,
 clr numeric,
 ugt numeric,
 cgt numeric,
 pv4 varchar,
 indel varchar,
 pc2 varchar,
 pchi2 numeric,
 qchi2 numeric,
 pr numeric,
 vdb numeric
);
drop table if exists sample_variants;
create table sample_variants(
 svid serial primary key,
 vid integer,
 sid integer,
 variant_alleles integer,
 gt varchar,
 pl varchar,
 dp numeric,
 gq numeric
);
"""

conn = psycopg2.connect(database="mydb", user="postgres", password="password")
c = conn.cursor()

c.execute(create_table_sql)
conn.commit()

inpath = "c:/your/path/here/variants.vcf"
vcf = open(inpath,"r")

for line in vcf.readlines():
    if(len(line) < 2):
        continue
    if(line[:2] == "##"):
        continue
    if(line[0] == "#"): # header row
        header = line[1:].strip() # remove # sign and newline
        colnames = header.split("\t")
        for i in range(len(colnames)):
            colnames[i] = colnames[i].lower()
        continue
    # write to variants table
    cols = line.strip().split("\t")
    cols[0] = cols[0][3:] # remove the word "chr"
    infolist = cols[7].split(";") # info column needs special treatment
    infofields = []
    for infoitem in infolist:
        infotokens = infoitem.split("=")
        name=infotokens[0].lower()
        if(name == "indel"):
            value = "indel"
        else:
            value=infotokens[1]
        infofields.append((name,value))
    sql = "insert into variants ("
    sql += ",".join(colnames[0:7]) + ","
    data = tuple(cols[0:7])
    ncols = 7
    for infofield in infofields:
        sql += infofield[0] + "," # name
        data += (infofield[1],)  # value
        ncols += 1
    sql = sql[:-1] # remove final comma
    sql += ") values ("
    sql += "%s,"*ncols
    sql = sql[:-1] # remove final comma
    sql += ") returning vid;"
    c.execute(sql,data)
    vid = c.fetchone()[0] # retrieve the newly created vid primary key
    # write to sample variants table
    formatlist = cols[8].split(":") # format column needs special treatment
    gtindex = 0
    for i in range(len(formatlist)):
        formatlist[i] = formatlist[i].lower()
        if (formatlist[i] == "gt"):
            gtindex = i
    for j in range(9,len(cols)): # for all samples
        sid = colnames[j]
        vinfolist = cols[j].split(":")
        assert len(vinfolist) == len(formatlist), "Sample format does not match stated format"
        # extract allele count from genotype
        gt = vinfolist[gtindex]
        variant_alleles = 0
        if(gt == "0/0"):
            variant_alleles = 0
        elif(gt == "0/1" or gt == "1/0"):
            variant_alleles = 1
        elif(gt == "1/1"):
            variant_alleles = 2
        sql = "insert into sample_variants (vid,sid,variant_alleles," + ",".join(formatlist) + ") values (" + "%s,"*(3+len(formatlist))
        sql = sql[:-1] # remove final comma
        sql += ");"
        data = (vid,sid,variant_alleles)
        data += tuple(vinfolist)
        c.execute(sql,data)
    if(vid % 10000 == 0):
        print "Processed " + str(vid) + " samples."

conn.commit()

vcf.close()

c.close()
conn.close()

The above took about an hour to run, with about 500,000 variants.  And then here’s a script to load up your annotations from annovar, which takes only a few minutes.  (UPDATED 2012-09-12: the previous version posted here excluded the vcf_ columns, which are necessary in order to join this data to the vcf tables we created above.)

import sys
import csv
import psycopg2 # http://www.initd.org/psycopg/

create_table_sql = """
drop table if exists annovar;
create table annovar (
 func varchar,
 gene varchar,
 exonicfunc varchar,
 aachange varchar,
 conserved varchar,
 segdup numeric,
 esp5400_all numeric,
 n1000g2012feb_all numeric,
 dbsnp132 varchar,
 avsift numeric,
 ljb_phylop numeric,
 ljb_phylop_pred varchar,
 ljb_sift numeric,
 ljb_sift_pred varchar,
 ljb_polyphen2 numeric,
 ljb_polyphen2_pred varchar,
 ljb_lrt numeric,
 ljb_lrt_pred varchar,
 ljb_mutationtaster numeric,
 ljb_mutationtaster_pred varchar,
 ljb_gerp__ numeric,
 chromosome varchar, 
 pstart integer,
 pend integer,
 ref_allele varchar,
 obs_allele varchar,
 vcf_chrom varchar,
 vcf_pos integer,
 vcf_id varchar,
 vcf_ref varchar,
 vcf_alt varchar,
 vcf_qual numeric,
 vcf_filter varchar
);
"""

conn = psycopg2.connect(database="mydb", user="postgres", password="password")
c = conn.cursor()

c.execute(create_table_sql)
conn.commit()

inpath = "c:/your/path/here/variantsann.exome_summary.csv"
anno = open(inpath,"r")
annocsv = csv.reader(anno,delimiter=',', quotechar='"')

ncols = 26+7 # first 26 columns are annovar data, next 7 join to vcf data

rowid = 0
for line in annocsv:
    if( rowid == 0 ): # skip header row
        rowid +=1
        continue
    cols = line[:ncols]
    cols[21] = cols[21][3:] # trim the characters "chr" from the chromosome column
    cols[26] = cols[26][3:] # trim the characters "chr" from the vcf_chrom column
    for i in range(ncols):
        if (cols[i] == ""): # replace empty strings with nulls
            cols[i] = None
    sql = "insert into annovar values(" + "%s,"*ncols
    sql = sql[:-1]
    sql += ");"
    data = tuple(cols)
    c.execute(sql,data)
    rowid += 1
    if(rowid%10000==0):
        print "Processed " + str(rowid) + " rows."

conn.commit()
anno.close()

c.close()
conn.close()

UPDATE 2012-09-12: You’ll want to add a vid column to the annovar table which matches the vids in the vcf tables.  This SQL code does that and also creates some indices which will help speed up certain queries:

-- join tables in order to add vids that match those in the variants table
create index annovar_chridx on annovar(vcf_chrom);
create index annovar_posidx on annovar(vcf_pos);
alter table annovar add column vid integer;
update annovar a
set    vid = v.vid
from   variants v
where  v.chrom = a.vcf_chrom
and    v.pos = a.vcf_pos
and    v.ref = a.vcf_ref
and    v.alt = a.vcf_alt
;
create index annovar_vid_pkey on annovar(vid);
alter table annovar add primary key (vid);
-- other indices which may be helpful
create index sample_variants_vidx on sample_variants(vid);
create index sample_variants_sidx on sample_variants(sid);
create index variants_chridx on variants(chrom);
create index variants_posidx on variants(pos);

All of which should run in a minute or two.

Now that your data are in SQL you can easily apply filters on coverage or variant type, drill down to individual samples, pull data into R, and much much more.
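For example, a shell one-liner like this pulls every sample carrying a nonsynonymous variant covered at 8x or better in that sample. The table and column names come from the scripts above, but the exact exonicfunc strings, the 8x threshold, and the connection flags are things you’ll want to check against your own setup:

psql -d mydb -U postgres -c "
select a.gene, a.exonicfunc, v.chrom, v.pos, v.ref, v.alt, sv.sid, sv.variant_alleles
from variants v
join annovar a on a.vid = v.vid
join sample_variants sv on sv.vid = v.vid
where a.exonicfunc like 'nonsynonymous%'
and sv.variant_alleles > 0
and sv.dp >= 8;
"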

A few thoughts

After my difficulty using BWA/Picard/GATK, I’m just relieved that all of the software tools listed in this post actually did their job and now I have data to analyze.  That said, GATK really is becoming a standard tool, so I plan to spend at least a bit more time seeing if I can get it to work.  At the Broad lecture on GATK yesterday, one speaker listed some of the ongoing challenges in sequencing, among them calling multi-nucleotide polymorphisms (MNPs).  If a variant is 9 nucleotides long, some variant callers will call it as 9 SNPs, when actually it’s 1 MNP.  This is indeed a problem I have noticed in my data, and while it sounded as though GATK isn’t quite perfect in this realm either, they have clearly been working hard on it and have made some improvements.

Conclusion

Now you’re ready to do some analysis!  Easy, right?  The next time someone talks to you about the $1000 genome and how quick and easy sequencing is these days, point them to this blog post.