This week I’ll be blogging from the UAB Second Short Course on Next-Generation Sequencing.  These are just my notes from the course, so they may be a bit scattered at times.

Speaker 1: Greg Barsh

HapMap 2005

• ~1M SNPs genotyped on 30 YRI trios, 30 CEU trios, 45 CHB, 44 JPT.
• haplotype blocks: runs of correlated SNPs that persist where recombination is rare. average haplotype block length: YRI 7.3kb, CEU 16.3kb, CHB 13.2kb
• One important conclusion of the project: haplotype structure is surprisingly simple. e.g. a region with 120 possible combinations of SNPs may show only 7 in practice.  small (5-15kb) genomic regions show diversity mainly due to mutation, not recombination

Exome sequencing study of 10 Kabuki syndrome patients

• Only 1 gene had novel mutations in all 10 patients. It was MUC16, but likely a false positive due to enormous size: 14,507 AAs.

Conrad 2011, “Variation in genome-wide mutation rates within and between human families”

• study to determine rate of de novo mutation in the general population
• at first glance, a few thousand de novo mutations per individual, but mostly sequencing errors
• on validation, only 35 – 49 de novos per person, and only 0 – 1 are in protein coding regions

Speaker 2: Shawn Levy

List of direct-to-consumer genotyping options

• 23andMe
• DNA Tribes
• Navigenics
• Identigene
• HomeDNA

Sanger sequencing

• Uses (non-reversible) terminators to create all possible strands of length < N complementary to a template strand of length N. The strands are then separated by length via electrophoresis, and the fluorophore (originally a radioactive label) on each strand reveals its final base. The use of fluorescence made automation and large-scale sequencing possible.
• Sanger sequencing still relevant today due to:
- Very long read lengths (700-1000nt)
- Very low error rate
- Very high per-base quality
But:
- not very parallel (only ~96 reactions, i.e. reads, per instrument run) as opposed to NGS (millions or billions of reads at the same time)
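The chain-termination idea above can be sketched as a toy model: terminators create a fragment ending at every position, and sorting fragments by length reads off the sequence one terminal base at a time. This is pure illustration (no chemistry, and the function name is mine):

```python
# Toy model of Sanger chain-termination sequencing (illustration only).
# Random terminator incorporation yields fragments of every possible
# length; electrophoresis sorts them by length, and the label on each
# fragment's final base reveals one position of the sequence.

def sanger_read(template: str) -> str:
    # One fragment ends at each position i, terminated with a labeled
    # base equal to template[i].
    fragments = [(i + 1, template[i]) for i in range(len(template))]
    fragments.sort()  # electrophoresis: order fragments by length
    # Read the terminal base of each fragment, shortest to longest.
    return "".join(base for _length, base in fragments)

print(sanger_read("GATTACA"))  # reconstructs the template base by base
```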

Next-generation sequencing technologies

Intro

• current costs at Hudson Alpha
- < 10 genomes: $6,700/genome; > 500 genomes: $3,100/genome
• current times
- 2-3 weeks per genome per instrument (Illumina)
- Complete Genomics had a “90 genomes in 90 days” deal for a while
- whole genome now down to about 72 hours as of the last few weeks
• fluorescent chain termination sequencing (automated Sanger) dominated market until 2005
• then NGS arrived, starting with Roche 454

Roche 454

• the first NGS technology introduced
• Sequencing by synthesis; uses the pyrophosphate removed by DNA polymerase in a secondary reaction to detect incorporation on a base by base basis
• Every time a base is incorporated (whether at bench or in your cells), a pyrophosphate is removed
• Roche uses picotiter plates – ~1M wells per plate
• 100-micrometer beads, each carrying multiple copies of the same DNA sequence; one bead per well
• DNA is built from tip back towards where the DNA is attached to the bead
• Fill all wells with one base of dna (e.g. only A) at a time (natural base, no fluorescence)
• You take the pyrophosphate and generate light signal in a reporter reaction
• You add only one base type (ACGT) at a time so if there is a flash of light it means that base incorporated.
• Then rinse and repeat
• This sequencing method has trouble with homopolymeric runs, e.g. AAAAAAAAAA.  On the first flood of A you get 10x the light all at once, rather than 1x ten separate times.  The difference between 1 and 2 As (A vs. AA) is easy to detect, but 9 As vs. 10 As is only an 11% difference in intensity and much harder.  The current chemistry is good up to about 10 or 12, but it’s sequence-context dependent – you may get different results depending on neighboring sequence
• Roche 454 currently ~650 base max read length [sic], (454.com now claims 1000b on the GS FLX+)
• 650M bases per run
• 8-12 hours per run
• $10,000 per run (reagents only)
• good for de novo assembly of genomes
• $10K/650Mb is expensive compared to other NGS technologies but a bargain compared to Sanger sequencing
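The homopolymer problem above comes down to resolving a relative intensity difference of 1/n between runs of n and n+1 identical bases — a quick back-of-envelope sketch:

```python
# Why homopolymer runs are hard for pyrosequencing: light intensity
# scales with run length, so telling n identical bases from n+1 means
# resolving a relative intensity difference of 1/n.

def relative_intensity_gap(n: int) -> float:
    """Fractional intensity difference between runs of n and n+1 bases."""
    return ((n + 1) - n) / n  # = 1/n

print(relative_intensity_gap(1))           # A vs AA: 100% difference, easy
print(round(relative_intensity_gap(9), 3)) # 9 As vs 10 As: ~11%, hard
```

This is why the chemistry tops out around runs of 10-12: the signal gap shrinks toward the noise floor.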

Illumina HiSeq 2000/2500

• largest installed base worldwide for high-throughput instruments
• 2000 been out a few years, 2500 came out this month
• HiSeq2000: ~600Gb/run, in 2 flowcells at 300Gb/flowcell, 8 lanes per flowcell at 200M reads/lane.  $10,000/600Gb run (reagents only) • HiSeq2500: only 180Gb/run, 2 lanes/flowcell, 150Mreads/lane • HiSeq2500 was created for greater speed. a machine that collects less data faster • Illumina uses reversible terminator chemistry (described in Bentley 2008 in Nature) • all four fluorolabeled nucleotides in 1 reaction • then deblock and remove fluorescence • chemistry is about 20 minutes per base • imaging takes most of the cycle time • whereas 454 can do a 650nt read length in 8-12 hours; HiSeq does 100nt reads in 6 days • HiSeq doesn’t have Roche’s problem with homopolymeric runs • location of clusters on flowcell is random (not in picotiter wells like with 454) • originally used NASA’s star-finding algorithms to find where the clusters are • 150 maximum read length • 1 run ~ 12 days • good for re-sequencing, ChIPseq, RNA-seq • cost:$20,000 per run

Ion Torrent “Personal Genome Machine” (PGM) and Proton

• Ion Torrent was founded by the guy who invented 454 before it was sold to Roche
• Ion Torrent now owned by Life Technologies
• PGM lower output, Proton higher output
• both benchtop
• uses wells much smaller than 454.  454 uses ~100um wells, Ion Torrent uses 1.3um down to 0.3um wells
• density 160M – 1B wells per chip
• output in the tens or hundreds of Gb per 2-hour run
• Proton II early 2013, Proton III late 2013
• same technology as 454 but instead of pyrophosphate it watches the H+ ion that comes off
• at a level of abstraction, it is a really sensitive pH meter.
• electronics built into bottom of well
• no secondary reaction
• same issue with homopolymer runs as Roche
• 200nt reads in 24 hours
• Ion Chef is the to-be-released tool to go from sample to beads ready to load into this machine, automatically

Other technologies

• Life Technologies SOLiD:  sequencing by ligation, as opposed to synthesis which others use.
- fluorescently labeled bases
- 2-base encoding
- less popular now
• Oxford Nanopore: not commercially available yet; first datasets expected in 2013. Derrington 2010 describes the MspA protein nanopore; IBM is leading development of a synthetic nanopore
• Single-molecule technologies: Pacific Biosciences – no amplification, works from very small quantities of DNA

how to get only one sequence per bead?

• water droplets act as microreactors.
• on average each water droplet will contain one DNA fragment, one bead, and a share of reactants. some beads end up with no DNA (blank), some with multiple DNA fragments (polyclonal, uninterpretable); the majority are 1:1, a single DNA molecule per bead.
• usually 50 – 80% of wells have active beads.
• sequencing devices have software to mask out those wells
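The blank/monoclonal/polyclonal split above is well modeled by Poisson loading of templates into droplets — a sketch assuming an average of one template per droplet:

```python
import math

# Droplet loading in emulsion PCR is roughly Poisson: at an average of
# lam templates per droplet, some droplets are empty (blank wells), some
# get exactly one template (monoclonal, usable), and some get several
# (polyclonal, masked out by the instrument software).

def droplet_fractions(lam: float) -> dict:
    empty = math.exp(-lam)             # P(0 templates)
    mono = lam * math.exp(-lam)        # P(exactly 1 template)
    poly = 1.0 - empty - mono          # P(2 or more templates)
    return {"empty": empty, "monoclonal": mono, "polyclonal": poly}

# At lam = 1 template/droplet, only ~37% of droplets are monoclonal.
print({k: round(v, 3) for k, v in droplet_fractions(1.0).items()})
```

This is the trade-off behind the 50-80% active-well figure: load too dilute and most wells are blank, too concentrated and most are polyclonal.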

Library prep

• ways to shatter genomic DNA:
- sonication (Covaris)
- nebulization (forcing the sample through a small orifice under nitrogen or other gas pressure)
- enzymatic treatment (DNases, etc.)
• ideal fragment size is just a little bit longer than total read length, e.g. > 200bp if doing 100bp paired end
• for emulsion PCR you want 300-400bp fragments
• for illumina can go up to 1000bp
• but larger fragments give fewer reads because bridge PCR is better at amplifying small fragments
• Illumina is the only one that does paired-end reads – no other platform can read from both ends of the same fragment
• steps in library prep
1. shatter
2. repair ends to make sure they are blunt, not sticky
3. ligate known adapter sequences at each 5′ end
• illumina forked adapter: complementary section, then it forks, not complementary
• bridge PCR: each “circle” on a flowcell represents several hundred to low thousands of identical DNA molecules
• See Bentley 2008 and Shendure 2008 for more details on all this chemistry.

Misc

• in Illumina can get ~1M clusters / mm^2; much more than that and you get almost no output because things overlap
• reduction in sequence quality as position in read increases.  due to imperfect chemistry: sometimes a terminator doesn’t get removed, fluorophore doesn’t get cleaved off etc.
• why is 30x coverage often recommended?  if 30x average coverage, then given the distribution of coverage almost everything will be covered at least 4x or 8x, making it possible to call variants.  (high GC content or low information content sequence will never be highly covered with uniquely aligned reads so you’ll still only get ~91% of genome covered at > 1x)
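The 30x rule of thumb can be checked with a simple Poisson model of per-base coverage (idealized — real coverage is overdispersed, which only strengthens the argument for margin):

```python
import math

# Per-base coverage in uniform regions is roughly Poisson(mean_depth).
# At 30x mean depth, essentially no base falls below the ~4-8x needed
# to call a variant; at 10x, a noticeable fraction does.

def p_coverage_below(mean_depth: float, k: int) -> float:
    """P(coverage < k) under a Poisson(mean_depth) model."""
    return sum(math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
               for i in range(k))

print(p_coverage_below(30, 4))  # vanishingly small (~5e-10)
print(p_coverage_below(10, 4))  # ~1% of bases drop below 4x at 10x mean
```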

Alignment

• Two categories of alignment algorithms.
Hashing: ELAND (proprietary, by Illumina), SOAP (deprecated in favor of SOAP2?), Maq
Burrows-Wheeler Transform: SOAP2, BWA, Bowtie
• Hashing is more memory intensive; BWT is simpler, and the BWT aligners are considered faster
• TopHat is just Bowtie for RNA-seq
• quick QC metric: look at the ratio of reads mapping to the X vs. Y chromosome per sample, then order samples by that ratio.  there should be a sharp wall at the male/female transition; a gentle slope means you’ve got contamination
• most sequences in genome are 20-70% GC
• 70% of normalized coverage at ~70% GC content.  I think he means that reads with 70% GC content will be 30% underrepresented compared to their actual frequency in the genome
• ~90% coverage at 1x+ is good for whole genome (GC content and repetitive sequences make it impossible to do 100% at 1x)
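The X/Y-ratio sex check mentioned above might look like this in practice (the read counts here are invented for illustration):

```python
# Quick-and-dirty sex-check QC sketch: compute the ratio of reads
# mapping to chrX vs chrY for each sample and sort. Clean data shows a
# sharp wall between male-like and female-like ratios; a gradual slope
# suggests sample contamination.

def xy_ratios(samples: dict) -> list:
    """samples: name -> (chrX_reads, chrY_reads). Returns sorted
    (ratio, name) pairs; guard against zero Y counts."""
    ratios = [(x / max(y, 1), name) for name, (x, y) in samples.items()]
    return sorted(ratios)

counts = {  # hypothetical per-sample read counts
    "s1": (1_000_000, 350_000),  # male-like: X/Y ~ 3
    "s2": (1_050_000, 340_000),
    "s3": (2_000_000, 5_000),    # female-like: X/Y in the hundreds
    "s4": (1_950_000, 4_800),
}
for ratio, name in xy_ratios(counts):
    print(name, round(ratio, 1))
```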

Speaker 3: Devin Absher - CNVs and Structural Variations

CNVs represent more interindividual genetic variation than SNVs

• SNPs = 0.08% of genome
• CNVs = 0.12% of genome (and rising) (just in # of base pairs absorbed)

Types of events:

• translocations
• inversions
• loss of heterozygosity – e.g. a segment of the maternal chromosome gets copied onto the paternal chromosome, usually by repair mechanisms
• deletions
• amplifications (aka duplications), including small, common polymorphisms of low copy repeats / segmental duplications, occurring in hot spots within the genome

Two approaches: read depth, and paired-end (insert-size) analysis.

Read depth:

• assumes everything is equally alignable
• but this is not true of repeat regions, which get lower alignability
• PCR duplicates
• GC bias
• ways to overcome these issues:
• A. Yoon 2009: look at 100bp windows and count depth in that window. rolling window. correct for GC content. use a “segmentation” algorithm (like an HMM) to look for change points in read depth
• B. create a sort of “reference genome” for read depth.  get average read depth for 100s of samples to average out the real copy number events, leaving just the expected influence of GC content and repeat regions.  then take the log2 ratio of your sample to this reference.  you still get a lot of noise in very high repeat regions, because even in the reference there was close to 0 depth.
• depth really matters. with 120x you can see CNVs very clearly, with 10x the noise is much greater than signal for short deletions (for 1 Mb deletion you’ll still see it)
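Approach B — the log2 ratio of sample depth to a reference panel's depth — can be sketched like this (all window counts are made up):

```python
import math

# Read-depth CNV sketch: per-window depth in a test sample divided by
# the average depth of a reference panel, on a log2 scale.
# log2 ratio ~ 0 means two copies, ~ -1 a heterozygous deletion,
# ~ +0.58 one extra copy.

def log2_ratios(sample_depth, reference_depth, min_ref=5):
    """Per-window log2(sample/reference); None where the panel itself
    has near-zero depth (high-repeat regions stay uninterpretable)."""
    out = []
    for s, r in zip(sample_depth, reference_depth):
        out.append(math.log2(s / r) if r >= min_ref and s > 0 else None)
    return out

sample = [98, 102, 51, 49, 100, 2]    # windows 3-4 look deleted
panel = [100, 100, 100, 100, 100, 1]  # last window: no panel depth
print([round(x, 2) if x is not None else None
       for x in log2_ratios(sample, panel)])
```

In practice a segmentation algorithm (HMM or similar, as in Yoon 2009) would then find change points in these ratios rather than thresholding window by window.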

Paired-end (insert-size) analysis:

• if both reads of a pair map to the same strand, i.e. head-to-tail, this indicates an inversion
• if insert size is out of normal range, indicates a deletion
• how to do this: either standard paired end(100-500bp) or mate-pair (circularized) 3-5kb fragments, which are more difficult and costly to make
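Flagging anomalous insert sizes is essentially outlier detection on the pair-distance distribution — a minimal sketch with invented sizes:

```python
import statistics

# Paired-end SV sketch: pairs whose apparent insert size falls far
# outside the library's distribution suggest a deletion (too large) or
# an insertion (too small).

def flag_anomalous_pairs(insert_sizes, n_sd=3):
    """Return (index, size) for pairs beyond n_sd standard deviations."""
    mu = statistics.mean(insert_sizes)
    sd = statistics.stdev(insert_sizes)
    return [(i, size) for i, size in enumerate(insert_sizes)
            if abs(size - mu) > n_sd * sd]

# Library with ~300bp inserts; one pair spans a deletion.
sizes = [295, 302, 310, 298, 305, 300, 299, 301, 1450, 303, 297, 300]
print(flag_anomalous_pairs(sizes))
```

Real callers (PEMer, BreakDancer, etc.) demand clusters of such anomalous pairs at the same locus, since a single outlier pair is usually a chimeric artifact.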

Both of these are still an art form. high false negative and false positive rates.

software packages

• PEMer (Korbel 2009): looks for clusters of outliers where distance or orientation are anomalous
• Pindel (Ye 2009)
• BreakDancer (Chen 2009): built with Maq in mind as being the aligner
• SVDetect (Zeitouni 2010). better with SOLiD data.
• VariationHunter – Hormozdiari 2010. tries to not only find anomalous pairs but also look up a db of transposons to see what is likely to be inserted there.  has db of known retroviruses in human genome
• HYDRA: anomalous pairs + transposon detection Quinlan 2010
• BreakSeq: breakpoint detection with overlapping reads
• Genome STRiP: population-based analysis – tries to integrate read depth AND aberrant read pairs. Handsaker 2011.
• GASVPro – also integrates read depth and read-pair analysis, with a different statistical model than Genome STRiP. Sindi 2012.

“old fashioned way”: analyst just integrates results of all of them. Mills 2011 used 12 different CNV detection algorithms on 1000G data. built a set of high confidence copy number events.  this is that gold standard thing from GATK

how to get better alignment around segmental duplications

search through unalignable reads to find anomalies

use different aligners
- first pass with a fast aligner (BWA, Bowtie)
- then re-align unaligned reads using Novoalign or mrFAST/mrsFAST
mrFAST/mrsFAST: chops the read into smaller pieces and tries to align each of them

hundreds of de novo CNVs per generation. orders of magnitude more than de novo SNVs.
also older people have accumulated many more somatic CNVs over their lifetime than young people

all of this is concentrated around segmental duplications.

all of these methods are poor at detecting variants whose size falls within the actual distribution of insert sizes in your library
very hard to call 100-300 bp inserts/deletions
big events of 1kb or more are pretty easy to spot

mechanisms by which segmental duplications arise
- nonhomologous recombination

there is also software, e.g. Velvet, that does de novo assembly to detect CNVs, as an alternative to alignment-based approaches