use case

I aligned some RNA-seq data using Gsnap and used the --split-output option to create separate SAM files depending on how read pairs aligned (concordant vs. discordant, unique vs. multiple, mapped vs. unmapped).  Now I want to take the completely unmapped read pairs and throw them against some viral genomes to see if any align.

problem

I am starting from a SAM file of unaligned read pairs sorted by name, called samplename.nomapping.sam, and I want to align them to a new reference genome.  Easier said than done:

  • bwa has an option bwa aln -b to use BAM files as input, but no option for uncompressed SAM files.
  • I can convert my SAM to a BAM with samtools view -Sbh, but after bwa aln, the next step, bwa samse, still requires raw FASTQs as one of the inputs.
  • Picard SamToFastq won’t decompose my SAM to a FASTQ because it “Cannot add sequence that already exists in SAMSequenceDictionary: chr1″.  Googling this error didn’t return anything informative for this scenario.

I figured, how much time to I really want to spend getting these tools to work?  For this simple purpose, it should be possible to just brute force convert the SAMs to FASTQs on the Unix command line.

solution

cat samplename.nomapping.sam | grep -v ^@ | awk 'NR%2==1 {print "@"$1"\n"$10"\n+\n"$11}' > unmapped/samplename_1.fastq 
cat samplename.nomapping.sam | grep -v ^@ | awk 'NR%2==0 {print "@"$1"\n"$10"\n+\n"$11}' > unmapped/samplename_2.fastq

Here’s the breakdown:

  • grep -v ^@ removes the SAM header lines
  • NR%2==1 executes on only the odd-numbered rows (this is to split paired-end output – my input SAM file is name-sorted, so paired reads occupy consecutive lines)
  • NR%2==0 executes on only the even-numbered rows
  • "@"$1"\n"$10"\n+\n"$11 grabs the read name ($1), read ($10), and base quality scores ($11) and writes them out per the FASTQ spec

If you have single-end reads, this should do:

cat samplename.nomapping.sam | grep -v ^@ | awk '{print "@"$1"\n"$10"\n+\n"$11}' > unmapped/samplename.fastq

And if you have paired-end but you want to treat them as single-end, this should do:

cat samplename.nomapping.sam | grep -v ^@ | awk 'NR%2==1 {print "@"$1"_1\n"$10"\n+\n"$11}' > unmapped/samplename.fastq 
cat samplename.nomapping.sam | grep -v ^@ | awk 'NR%2==0 {print "@"$1"_2\n"$10"\n+\n"$11}' >> unmapped/samplename.fastq

disclaimer

Viruses do have introns – for instance, some Epstein-Barr virus introns are mentioned in [Young 2007] so aligning viral reads with BWA – as opposed to a proper RNA-seq aligner like TopHat or Gsnap - is not optimal.