For the past week or so I have been working on putting together a pipeline for human exome sequencing. I’ve got raw FASTQ files, two per sample with about 25M paired-end reads each (these files are around 6 GB unzipped), for 50 samples. I want to call variants in order to do some analysis.  So the vision is something like this:

raw unaligned paired-end reads (FASTQ) -> aligned reads (BAM) -> variants (VCF) -> SQL database, PED & MAP files for PLINK, etc.

This is bread and butter and shouldn’t be that hard. But it is. I started out following the “how to” at seqanswers and made it in pretty deep before being stymied by three separate software problems:

1. bwa sampe has speed and memory allocation problems.  For some of my samples it ran successfully and finished in a few hours, but other jobs (on same-sized input files) had been running for 24 hours and I finally had to kill them. I learned from seqanswers that using -A to disable Smith-Waterman alignment can speed things up so I tried again with that, but then I started getting segmentation faults. bwa sampe would start, run for about two seconds and then:

[bwa_sai2sam_pe_core] convert to sequence coordinate...
Segmentation fault (core dumped)

This memory allocation issue is documented on biostars, sourceforge, seqanswers and bioincloud.  It appears to be an unresolved bug in the software.  These fora contain some suggestions as to how to resolve it, but none worked for me and I ran into this problem consistently across different versions of bwa (0.5.9, 0.6.1, and 0.6.2). Doing an strace does indeed reveal that the system is unable to allocate memory. But I’m running this with 8GB of memory; how much does BWA need?! (And by the way, I then tried removing the -A flag again and I still got segmentation faults).

2. Picard does not seem to like the format of SAM files produced by BWA.  It didn’t want to run MarkDuplicates because my SAM file was unsorted.  OK, I figured, I’ll run Picard’s SortSam tool first.  When I did that, I got a format problem: a non-numeric value in ISIZE column for mitochondrial DNA reads:

[Tue Aug 28 12:45:16 EDT 2012] net.sf.picard.sam.SortSam done. Elapsed time: 0.01 minutes.
Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Non-numeric value in ISIZE column; Line 3982
at net.sf.samtools.SAMTextReader.reportFatalErrorParsingLine(
at net.sf.samtools.SAMTextReader.access$400(
at net.sf.samtools.SAMTextReader$RecordIterator.parseInt(
at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(
at net.sf.samtools.SAMTextReader$
at net.sf.samtools.SAMTextReader$
at net.sf.samtools.SAMFileReader$
at net.sf.samtools.SAMFileReader$
at net.sf.picard.sam.SortSam.doWork(
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(
at net.sf.picard.sam.SortSam.main(

I couldn’t find any notes about this problem online.  I did try to sort using samtools instead and then go back to Picard’s MarkDuplicates with the ASSUME SORTED option, as suggested here, but with the VALIDATION_STRINGENCY=LENIENT option MarkDuplicates found the need to ignore a warning on every single mitochondrial DNA read:

java -jar ~/bin/picard-tools-1.74/MarkDuplicates.jar \
INPUT=1sorted.bam \
OUTPUT=1dedup.bam \
METRICS_FILE=metrics \

Ignoring SAM validation error: ERROR: Record 691, Read name FCC0CHTACXX:1:1302:4748:176644#GGCTACAT, Mate Alignment start (436154938) must be <= reference sequence length (16571) on reference chrM
Ignoring SAM validation error: ERROR: Record 692, Read name FCC0CHTACXX:1:2104:8494:167812#GGCTACAT, Mate Alignment start should != 0 because reference name != *.
Ignoring SAM validation error: ERROR: Record 693, Read name FCC0CHTACXX:1:1201:21002:183608#GGCTACAT, Mate Alignment start should != 0 because reference name != *.
Ignoring SAM validation error: ERROR: Record 694, Read name FCC0CHTACXX:1:2303:3184:35872#GGCTACAT, Mate Alignment start (436154812) must be <= reference sequence length (16571) on reference chrM

And on and on.  You can set VALIDATION_STRINGENCY to SILENT to ignore these, but to just ignore all the warnings seems foolish.  I did find one discussion of a related picard issue with mtDNA but I used the same FASTA index file for all steps of my process, so I don’t see how that could be the issue.

3. GATK can’t load my reference sequence dictionary.  Due to problems 1 and 2 above, I didn’t have BAMs for all my samples and hadn’t been able to mark duplicate reads, but I figured I would at least try calling variants on just one sample to see if it would work.  It didn’t:

java -Xmx4g -jar ~/bin/gatk-2.1-5/GenomeAnalysisTK.jar \
-glm BOTH \
-R ../hg19.fa \
-T UnifiedGenotyper \
-I 1.bam \
-o 1.vcf

##### ERROR MESSAGE: Invalid command line: Failed to load reference dictionary

This issue is the subject of plenty of comments in web forums [1][2][3], and after perusing them I think I found the issue: the hg19.dict file I was trying to use was of 0 bytes.  This comment suggested re-creating it with Picard’s CreateSequenceDictionary, but the documentation for that feature seems to suggest that it creates a SAM or BAM file, not a .dict file.

At this point I gave up.  All of these issues are probably fixable, but as they piled up it started to seem easier to just start over with an entirely different approach.  In a moment I will post about the pipeline I finally created using alternative tools.  That said, after attending yesterday’s lecture on GATK at the Broad, I now think perhaps GATK is worth a second try, and perhaps if I use GATK end-to-end some of these compatibility problems will be avoided.  More on that later if I do get it to work.