Today I wanted to take unmapped reads from human exome sequencing and try to see if any of the reads came from viruses (whether due to sample contamination or the person having been infected with the virus).  Because exome capture kits don’t target any viral sequence, you might be skeptical that I’d find anything, but in fact exome data can contain all kinds of things that weren’t captured on purpose – such as mitochrondrial DNA [Picardi & Pesole 2012] – provided their copy number is high enough.  Every cell contains just one diploid human genome but can contain tens or hundreds of mitochondria, and who knows how many viruses it might contain if infected.

step 1. get the viral genomes

I went to NCBI’s repository of reference genomes but it was hard to search for what I wanted.  Exome sequencing uses DNA, so I was only interested in either DNA viruses or retroviruses.  And I only wanted viruses that might infect humans, so I wasn’t interested in the plentitude of available reference genomes for viruses that only infect bananas, chickpeas, etc.

Finally I hit upon this text file listing the reference genomes.  After a bit of experimenting, here’s some bash magic that pulled out the viruses I wanted:

egrep 'DNA|Retro' viruses.txt | grep human > viruslist.txt

98 in total, including a few strains of the ‘classics’: herpes, HPV, HIV as well as several things I had never heard of.  The third column of each row is an accession number from NCBI’s bioproject database.  It took some guesswork to figure out how to use this number to get the genome.  Let’s take Adeno-associated virus 2 as an example; its  bioproject accession number is 14060.

Constructing a URL like this will give you a page about Adeno-associated virus 2:

However that page won’t contain the FASTA reference genome nor a link to it.  It will however contain a link to another page with a URL like this:[BioProject]

Which after a couple seconds will redirect to another URL using a different unique id:

These unique ids are from the nuccore database, which can be used to retrieve FASTA files via eFetch:

Which works perfectly in a browser – Chrome prompts me for a location to save the file and the file has what I want – but fails with wget:

wget > 14060.txt
[1] 13591
[2] 13592
[2]+ Done id=110645916
[em476@eris1n2 virus]$ --2013-01-25 12:51:15--
Resolving, 2607:f220:41e:4290::110
Connecting to||:80... connected.
HTTP request sent, awaiting response... 400 Bad Request
2013-01-25 12:51:15 ERROR 400: Bad Request.

A colleague pointed out to me that Python’s urllib behaves more like a browser than wget, so I gave it a try (output truncated):

>>> import urllib2
>>> conn = urllib2.urlopen("")
>>> result =
>>> conn.close()
>>> result
'>gi|110645916|ref|NC_001401.2| Adeno-associated virus - 2, complete genome\nTTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCCCGACGCCCGG\n

Sure enough, it works.  It also works to get the redirect URL, retrieving the nuccore id based on the bioproject id:

>>> conn = urllib2.urlopen("[BioProject]")
>>> conn
<addinfourl at 43180400 whose fp = <socket._fileobject object at 0x7ff6b78d8ed0>>
>>> dir(conn)
['__doc__', '__init__', '__iter__', '__module__', '__repr__', 'close', 'code', 'fileno', 'fp', 'getcode', 'geturl', 'headers', 'info', 'msg', 'next', 'read', 'readline', 'readlines', 'url']
>>> conn.url
>>> conn.close()

So I wrote this python script:

#!/usr/bin/env python
import sys
import urllib2

# input file should contain one column with a list of NCBI bioproject IDs
# if no input file is supplied, uses stdin
if len(sys.argv) < 2:
    print "Usage: python [inputfile=stdin]"

# open input file or stdin
if len(sys.argv) == 2:
    inpath = sys.argv[1]
    i = open(inpath,"rb")
    i = sys.stdin

for line in i.readlines():
        bioprojectId = line.strip() # remove newline
        bioprojectUrl = ""+bioprojectId+"[BioProject]" # create URL to get redirected
        conn = urllib2.urlopen(bioprojectUrl) # open URL
        nuccoreId = conn.url.split("/")[-1] # retrieve nuccore ID from redirected URL
        conn.close() # close URL
        nuccoreUrl = ""+nuccoreId+"&rettype=fasta" # create URL for FASTA file
        conn = urllib2.urlopen(nuccoreUrl) # open URL
        fasta = # get FASTA text
        conn.close() # close URL
        outpath = nuccoreId + ".fa" # create file path to write FASTA 
        with open(outpath,"wb") as o: # auto-close file after done
            o.write(fasta) # write FASTA text to file
    except urllib2.HTTPError:
        print "Unable to retrieve genome for bioproject " + bioprojectId

i.close() # close input file

And with that written, you just need this one-liner:

cut -f3 viruslist.txt |

So putting that together with the earlier steps, three lines total:

egrep 'DNA|Retro' viruses.txt | grep human > viruslist.txt
cut -f3 viruslist.txt |

Or if you want to be really bold, combine it all into one line:

wget -qO- | egrep 'DNA|Retro' | grep human | cut -f3 |

And with that I was able to retrieve 95 of the 98 viral reference genomes – I got errors for 32033, 41109, and 49137.  On further inspection all three of these say “No data” in the final column of the virus list file, indicating a genome is not available.  Two were (presumably obscure?) strains of HPV, and one was HIV-1 (the most common strain of HIV), but there were two HIV-1 rows in the original virus list and the other one did have a genome associated with it, so it looks like I’ve got HIV-1 covered.

step 2. combine and index viral genomes with bwa

Viral genomes are so small this takes about 4 seconds total:

mkdir ref
cat *.fa > ref/viruses.fa
cd ref
bwa index viruses.fa

Amazing, really, that they can do so much damage with so little information content.

step 3. pull out unmapped reads from original BAMs

I had previously aligned these exomes to human reference genome GRCh37 (using this pileline), so I only want to try aligning the unmapped reads to the viruses. samtools view can pull out unmapped reads by forcing (-f) the unmapped (0x04) flag:

samtools view -f 0x04 -h -b projectname.sampleid.clean.dedup.recal.bam -o projectname.sampleid.unmapped.bam

My original BAMs had about 15-27M pairs of reads, of which ~10% were unmapped, so this command gave about 3-6M single-end unmapped reads as the output.   I recommend sending each file as its own job using a simple loop:

for i in {1820..1869}
do short 0:30 "samtools view -f 0x04 -h -b hdex2.${i}.clean.dedup.recal.bam -o hdex2.${i}.unmapped.bam"

The 30 minute time limit is just to be safe; most files finish in under 10 minutes. is just a very short script I wrote as a proxy for bsub so I wouldn’t have to type in cd and the current path every time I wanted to submit a job to run in my current directory.  Here it is:

#!/usr/bin/env python
import os
import sys

cwd = os.getcwd() # get current directory
queue = sys.argv[1] # get LSF queue name
timelimit = sys.argv[2] # get time limit
cmd = sys.argv[3] # get command

# create bsub command and echo to user
bsubcmd = "bsub -q "+queue+" -W "+timelimit+" \"cd "+cwd+"; "+cmd+"\""
print bsubcmd

# submit the job

step 4. realign unmapped reads to viral genomes

At this point I have BAMs of unmapped reads.  Sometimes one read will map and its mate won’t, so my reads are no longer all necessarily paired-end.  While I could try doing paired-end alignment on the pairs and single-end on the orphans, for simplicity I am just going to treat all the reads as single-end now.  BWA can actually align directly from BAM files, so no need to decompose to FASTQs using Picard.

An incredibly neat trick is to use tee to pipe BWA’s output directly to two different samtools processes to create separate mapped and unmapped files:

for i in {1820..1869}
do short 00:20 "bwa aln -b ref/viruses.fa hdex2.${i}.unmapped.bam > ${i}.virus.sai; bwa samse ref/viruses.fa ${i}.virus.sai hdex2.${i}.unmapped.bam | tee >(samtools view - -Sb -f 0x04 -o ${i}.virus.unmapped.bam) | samtools view - -Sb -F 0x04 -o ${i}.virus.mapped.bam;"

When I did this, I found that although the tee command with parentheses works fine directly on the command line, it didn’t work when part of a bsub command: my LSF system felt there was some syntax error with the parentheses.  If that happens to you too, then do this step the old boring way, writing out a BAM to disk and then reading it back into memory:

for i in {1820..1869}
do short 00:20 "bwa aln -b ref/viruses.fa hdex2.${i}.unmapped.bam > ${i}.virus.sai; bwa samse ref/viruses.fa ${i}.virus.sai hdex2.${i}.unmapped.bam | samtools view - -Sb -o ${i}.virus.all.bam; samtools view ${i}.virus.all.bam -b -F 0x04 -o ${i}.virus.mapped.bam; samtools view ${i}.virus.all.bam -b -f 0x04 -o ${i}.virus.unmapped.bam;"

At this point you’re working with small files so things should be quick – most of my jobs finished in a couple of minutes total.  To be safe, I set the runtime limit at 20 minutes.

Afterwards you can check what fraction of your reads that didn’t map to the human genome did map to a viral genome.  It will probably be a very small fraction.  These commands will give you the read count for each:

samtools view -c 1820.virus.unmapped.bam
samtools view -c 1820.virus.mapped.bam

step 5. see which viral genomes had reads aligned

Simplest way:

samtools view 1820.virus.mapped.bam | cut -f3 | sort | uniq -c

Will run in < 1 minute and will give you a count of reads for each virus. You can easily tell that some viruses are either trace amounts or probably wrong alignments ( < 150 reads per virus per sample) and others are definitely really in the sample (several thousand or tens of thousands of reads for one virus in one sample).

To do it for all your files and compile into one handy table:

for i in {1820..1869}
    samtools view ${i}.virus.mapped.bam | cut -f3 | sort | uniq -c | awk -v i=$i '{print i"\t"$1"\t"$2}' >>

step 6: interpretation

You can compute coverage as (total number of reads) * (read length) / (virus genome size), which is probably a good proxy for viral copy number.