Getting started with analyzing sequence data

If you order exome sequencing from Illumina or the like, they’ll provide you with both the raw reads and the finished, aligned sequence. But in the words of one investigator at my institution, “no one who has the capability to do their own alignment trusts someone else to do it.” The perception is that the company that does your sequencing may not have as high of standards for alignment quality as you do, may not tell you the details of the algorithm they use, and/or may not provide the particular descriptive statistics you need regarding the quality of your data.

Thus many researchers find themselves in the position of doing alignments themselves. This is a task I learned about, and implemented parts of, in a bioinformatics class but had never had to actually do for anything that mattered. I’m getting oriented on it today and sharing what I learn here.

The most popular algorithm today for doing alignments is the Burrows-Wheeler Alignment. (Among other things, this approach takes advantage of a bit of wizardry called the Burrows-Wheeler Transform which is an entertaining wiki read). It takes in a reference sequence (such as the human genome*) in FASTA format and raw reads in FASTQ format (basically FASTA + call quality scores) and outputs an alignment in SAM format which you can then manipulate using SAMtools. BWA is open source, well documented and built for the *nix command line. SEQwiki provides an excellent how-to for using these tools for exome sequencing analysis. UMich also provides a useful tutorial on low pass sequence analysis.

*UCSC provides the reference genome version 19, hg19, in a compressed file chromFa.tar.gz with the chromosomes (e.g. chr1.fa) as well as several other files called things like chr4_gl000194_random.fa. What are those? Turns out they are contigs that couldn’t be confidently aligned to a particular location in the chromosome (or genome). Including them tends to reduce mapping quality so people often leave them out unless they’re particularly interested in them for some reason. And by the way, for BWA you need all the chromosomes in a single FASTA file. SeqAnswers gives a bash command to do that (in the correct order so it matches GATK):

cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa
chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa
chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa > hg19.fa

Once you’ve done your alignment and gotten the quality scores you want, odds are you’ll want to call variants. ”Call” here just means to recognize the variants present in your sequence data, so that you go from just having a sequence to having a tabular list of variants (relative to the reference sequence) found in it. More on that later.