At the CHGR retreat earlier this week I bumped into Ben Neale and mentioned I’d been aligning unmapped reads to virus genomes.  He told me I could do much better: Heng Li, inventor of BWA, has created an entire ‘decoy genome’ to capture reads from human exome or whole genome sequencing that fail to align to the human reference genome.

The Broad Institute hosts a copy of this decoy genome.  This readme explains what’s in the decoy and how it was used in the 1000 Genomes analysis pipeline, and all the files you need are in this folder.

What’s in this decoy genome, you ask?  For one, it contains the genome for human herpesvirus 4, type 1, somewhat better known as Epstein-Barr virus (EBV) and even better known as the cause of mono.  Don’t be confused by ‘herpes’ in the name: this isn’t herpes simplex virus 1 or 2, which cause oral and genital herpes.  Herpesviridae is a large and pretty successful family of viruses that also includes Varicella zoster, the mastermind behind chicken pox / shingles.  Herpesviridae are all DNA viruses, which means you can pick them up in DNA sequencing – as opposed to most viruses, which have only RNA.  (You can pick those up in RNAseq).

Whether or not you’ve had mono, you’ve probably been exposed to EBV (and many other members of the Herpesviridae family) at some point and are likely to have latent copies of the viral genome lying around in your cells even if you’re not now (or never have been) symptomatic.  For this reason EBV very often turns up in human DNA sequencing.

EBV is also used in research laboratories to ‘immortalizelymphoblasts (white blood cell precursors): the virus makes them replicate and grow indefinitely.  This property of EBV is bad when it causes Hodgkin’s or Burkitt’s lymphoma but good when you’re a researcher and you want an endless supply of human cells for experiments.  So if you’re sequencing DNA from cultured cell lines, you might find EBV that was intentionally introduced in the lab as well.

But the EBV genome is only ~170kb, while the full decoy genome is ~36Mb.  The rest of Broad Institute’s decoy genome was constructed from sequences that, though absent from the current reference sequence (hg19 / GRCh37), are found either in HuRef (Craig Venter‘s genome) or Broad’s de novo assembly of NA12878 (an anonymous 1000 Genomes participant and one of the most thoroughly sequenced people in the world) or various other pieces of human DNA that have been cloned and studied in bacteria over the years.  What is all this stuff, then? Much of it really is human genomic material that most or all people have, but just slipped through the cracks when the reference genome was put together.  For instance, the first entry in the decoy is GL582971, a 1.3Mb piece of human chromosome 7 that just didn’t quite make it into build 37 but is slated for inclusion in build 38 whenever such is released.  It’s currently considered as a patch to b37.

If these are genuine pieces of the human genome, then re-aligning your unmapped reads to them will give you extra information, which can only be a good thing.  But actually from what I hear, the major motivation for people to use the decoy genome is speed.  If you include the decoy in your reference genome when you do the original alignment, many reads will quickly find a very confident alignment in the decoy, thus avoiding countless compute cycles spent trying to Smith-Waterman align it to someplace it doesn’t belong.  This purpose – siphoning off these reads to keep them from slowing down the whole alignment – is why this is called the ‘decoy genome.’

With that in mind, let’s return to Broad’s repository of files.  If you’re just starting alignment now, you can use hs37d5.fa.gz, which is all of human genome b37 plus the decoy, and if you’ve already done alignment and you just want to realign the unmapped reads you can use hs37d5ss.fa.gz, which is just the decoy.

Here I’ll consider the case of realigning unmapped reads to the decoy.

step 1. prep files

hs37d5ss.fa.gz is the decoy reference genome; you still need to gunzip it and bwa index it which takes < 1 minute because it’s pretty small.

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5ss.fa.gz
gunzip hs37d5ss.fa.gz
bwa index hs37d5ss.fa

You’ll also want to pull the unmapped reads out of your original BAM files:

samtools view -f 0x04 -h -b original.bam -o unmapped.bam

For me, starting with BAMs of 30-50M reads, about 10% of which are unmapped, this takes ~10 minutes per BAM.

step 2. re-align

These are basically the same instructions as for aligning to viral genomes:

for i in {1820..1869}
do
 bsub.py short 00:50 "bwa aln -b decoygenome/hs37d5ss.fa hdex2.${i}.unmapped.bam > ${i}.decoy.sai; bwa samse decoygenome/hs37d5ss.fa ${i}.decoy.sai hdex2.${i}.unmapped.bam | samtools view - -Sb -o ${i}.decoy.all.bam; samtools view ${i}.decoy.all.bam -b -F 0x04 -o ${i}.decoy.mapped.bam; samtools view ${i}.decoy.all.bam -b -f 0x04 -o ${i}.decoy.unmapped.bam;"
done

For me this took anywhere from 5 to 50 minutes per sample.

step 3. analyze

You might ask, what fraction of the unmapped reads aligned to the decoy?  samtools can count reads from each:

samtools view -c ${i}.decoy.mapped.bam
samtools view -c ${i}.original.bam

You can also compute read depth (either with bedtools or crudely at the command line) or call variants on these BAMs just like any other BAMs – for specifics see the GATK pipieline and the viral genomes pipeline.

discussion

Odds are that re-aligning to the decoy genome will capture some of your originally unmapped reads, but far from all of them.  What is the rest of that stuff?

To peruse a random subset of my still-unmapped reads I used shuf:

samtools view 1820.decoy.unmapped.bam | shuf | less

Glancing over them, I see that many are low information entropy reads: lots of homopolymer runs e.g. AAAAAAAAAAA and lots of low-quality reads full of unknown bases e.g. NNNNNNNNNNN.  These may well be human genomic DNA but just cannot map uniquely to any place in the reference genome nor the decoy.  If you blast them you’ll get no matches, because blast doesn’t bother trying to match information-poor sequences.

Other reads do look, at first glance, like they’ve got some real information. I blasted a handful of them to see what they are.  All of the ones I checked proved to be human DNA, but mapped equally well in several different places.  There are two issues here.  First, since the read aligned equally well in multiple places, it might be a repeated element, part of some larger satellite or retrotransposon even if it doesn’t look repetitive in and of itself.  Second, the blast matches for these reads were especially enriched for BAC / Fosmid clones rather than legit human reference genome matches.  BAC and Fosmid clones are little pieces of human DNA that someone has isolated and cloned into a bacterium.  I suppose, since blast didn’t find a better alignment in the human reference genome itself for these reads, that some of the matching BACs/Fosmids do represent human DNA but have not yet been incorporated into the reference genome nor the decoy genome.