This week I’ll be blogging from the UAB Second Short Course on Next-Generation Sequencing.  These are just my notes from the course, so they may be a bit scattered at times.

Speaker 1: Greg Barsh

HapMap 2005

• ~1M SNPs genotyped on 30 YRI trios, 30 CEU trios, 45 CHB, 44 JPT.
• haplotype blocks: runs of correlated SNPs that persist where recombination is rare. average haplotype block length: YRI 7.3kb, CEU 16.3kb, CHB 13.2kb
• One important conclusion of the project: haplotype structure is surprisingly simple. e.g. a region with 120 possible combinations of SNPs may show only 7 in practice.  small (5-15kb) genomic regions show diversity mainly due to mutation, not recombination

Exome sequencing study of 10 Kabuki syndrome patients

• Only 1 gene had novel mutations in all 10 patients. It was MUC16, but likely a false positive due to enormous size: 14,507 AAs.

Conrad 2011, “Variation in genome-wide mutation rates within and between human families”

• study to determine rate of de novo mutation in the general population
• at first glance, a few thousand de novo mutations per individual, but mostly sequencing errors
• on validation, only 35 – 49 de novos per person, and only 0 – 1 are in protein coding regions

Speaker 2: Shawn Levy

List of direct-to-consumer genotyping options

• 23andMe
• DNA Tribes
• Navigenics
• Identigene
• HomeDNA

Sanger sequencing

• Uses (non-reversible) terminators to create all possible strands of length < N complementary to a template strand of length N. The strands are then separated by length via electrophoresis, and the fluorophore (originally a radioactive label) on each strand reveals its final base. The use of fluorescence made automation and large-scale sequencing possible.
• Sanger sequencing still relevant today due to:
- Very long read lengths (700-1000nt)
- Very low error rate
- Very high per-base quality
But:
- not very parallel (only ~96 reactions, i.e. reads, per instrument run) as opposed to NGS (millions or billions of reads at the same time)
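The chain-termination idea above can be sketched as a toy model: terminators create a fragment ending at every position, and sorting fragments by length reads off the sequence one terminal base at a time. This is pure illustration (no chemistry, and the function name is mine):

```python
# Toy model of Sanger chain-termination sequencing (illustration only).
# Random terminator incorporation yields fragments of every possible
# length; electrophoresis sorts them by length, and the label on each
# fragment's final base reveals one position of the sequence.

def sanger_read(template: str) -> str:
    # One fragment ends at each position i, terminated with a labeled
    # base equal to template[i].
    fragments = [(i + 1, template[i]) for i in range(len(template))]
    fragments.sort()  # electrophoresis: order fragments by length
    # Read the terminal base of each fragment, shortest to longest.
    return "".join(base for _length, base in fragments)

print(sanger_read("GATTACA"))  # reconstructs the template base by base
```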

Next-generation sequencing technologies

Intro

• current costs at Hudson Alpha
- < 10 genomes: $6,700/genome; > 500 genomes: $3,100/genome
• current times
- 2-3 weeks per genome per instrument (Illumina)
- Complete Genomics had a “90 genomes in 90 days” deal for a while
- whole genome now down to about 72 hours as of the last few weeks
• fluorescent chain termination sequencing (automated Sanger) dominated market until 2005
• then NGS arrived, starting with Roche 454

Roche 454

• the first NGS technology introduced
• Sequencing by synthesis; uses the pyrophosphate removed by DNA polymerase in a secondary reaction to detect incorporation on a base by base basis
• Every time a base is incorporated (whether at bench or in your cells), a pyrophosphate is removed
• Roche uses picotiter plates – ~1M wells per plate
• 100-micrometer beads, each carrying multiple copies of the same DNA sequence; one bead per well
• DNA is built from tip back towards where the DNA is attached to the bead
• Fill all wells with one base of dna (e.g. only A) at a time (natural base, no fluorescence)
• You take the pyrophosphate and generate light signal in a reporter reaction
• You add only one base type (ACGT) at a time so if there is a flash of light it means that base incorporated.
• Then rinse and repeat
• This sequencing method has trouble with homopolymeric runs, e.g. AAAAAAAAAA.  On the first flood of A you get 10x the light all at once, rather than 1x ten separate times.  The difference between 1 and 2 As (A vs. AA) is easy to detect, but 9 As vs. 10 As is only an 11% difference in intensity and much harder.  The current chemistry is good up to about 10 or 12, but it’s sequence-context dependent – you may get different results depending on neighboring sequence
• Roche 454 currently ~650 base max read length [sic], (454.com now claims 1000b on the GS FLX+)
• 650M bases per run
• 8-12 hours per run
• $10,000 per run (reagents only)
• good for de novo assembly of genomes
• $10K/650Mb is expensive compared to other NGS technologies but a bargain compared to Sanger sequencing
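The homopolymer problem above comes down to resolving a relative intensity difference of 1/n between runs of n and n+1 identical bases — a quick back-of-envelope sketch:

```python
# Why homopolymer runs are hard for pyrosequencing: light intensity
# scales with run length, so telling n identical bases from n+1 means
# resolving a relative intensity difference of 1/n.

def relative_intensity_gap(n: int) -> float:
    """Fractional intensity difference between runs of n and n+1 bases."""
    return ((n + 1) - n) / n  # = 1/n

print(relative_intensity_gap(1))           # A vs AA: 100% difference, easy
print(round(relative_intensity_gap(9), 3)) # 9 As vs 10 As: ~11%, hard
```

This is why the chemistry tops out around runs of 10-12: the signal gap shrinks toward the noise floor.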

Illumina HiSeq 2000/2500

• largest installed base worldwide for high-throughput instruments
• 2000 been out a few years, 2500 came out this month
• HiSeq2000: ~600Gb/run, in 2 flowcells at 300Gb/flowcell, 8 lanes per flowcell at 200M reads/lane.  $10,000/600Gb run (reagents only) • HiSeq2500: only 180Gb/run, 2 lanes/flowcell, 150Mreads/lane • HiSeq2500 was created for greater speed. a machine that collects less data faster • Illumina uses reversible terminator chemistry (described in Bentley 2008 in Nature) • all four fluorolabeled nucleotides in 1 reaction • then deblock and remove fluorescence • chemistry is about 20 minutes per base • imaging takes most of the cycle time • whereas 454 can do a 650nt read length in 8-12 hours; HiSeq does 100nt reads in 6 days • HiSeq doesn’t have Roche’s problem with homopolymeric runs • location of clusters on flowcell is random (not in picotiter wells like with 454) • originally used NASA’s star-finding algorithms to find where the clusters are • 150 maximum read length • 1 run ~ 12 days • good for re-sequencing, ChIPseq, RNA-seq • cost:$20,000 per run

Ion Torrent “Personal Genome Machine” (PGM) and Proton

• Ion Torrent was founded by the guy who invented 454 before it was sold to Roche
• Ion Torrent now owned by Life Technologies
• PGM lower output, Proton higher output
• both benchtop
• uses wells much smaller than 454.  454 uses ~100um wells, Ion Torrent uses 1.3um down to 0.3um wells
• density 160M – 1B wells per chip
• output in the tens or hundreds of Gb per 2-hour run
• Proton II early 2013, Proton III late 2013
• same technology as 454 but instead of pyrophosphate it watches the H+ ion that comes off
• at a level of abstraction, it is a really sensitive pH meter.
• electronics built into bottom of well
• no secondary reaction
• same issue with homopolymer runs as Roche
• 200nt reads in 24 hours
• Ion Chef is the to-be-released tool to go from sample to beads ready to load into this machine, automatically

Other technologies

• Life Technologies SOLiD:  sequencing by ligation, as opposed to synthesis which others use.
- fluorescently labeled bases
- 2-base encoding
- less popular now
• Oxford Nanopore: not commercially available yet; first datasets expected in 2013. Derrington 2010 describes the MspA protein nanopore; IBM is leading development of a synthetic nanopore
• Single-molecule technologies: Pacific Biosciences – no amplification, works from very small quantities of DNA

how to get only one sequence per bead?

• water droplets act as microreactors.
• on average each water droplet will contain one DNA fragment, one bead, and a share of reactants. some beads end up with no DNA (blank), some with multiple DNA fragments (polyclonal, uninterpretable); the majority are 1:1, a single DNA molecule per bead.
• usually 50 – 80% of wells have active beads.
• sequencing devices have software to mask out those wells
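The blank/monoclonal/polyclonal split above is well modeled by Poisson loading of templates into droplets — a sketch assuming an average of one template per droplet:

```python
import math

# Droplet loading in emulsion PCR is roughly Poisson: at an average of
# lam templates per droplet, some droplets are empty (blank wells), some
# get exactly one template (monoclonal, usable), and some get several
# (polyclonal, masked out by the instrument software).

def droplet_fractions(lam: float) -> dict:
    empty = math.exp(-lam)             # P(0 templates)
    mono = lam * math.exp(-lam)        # P(exactly 1 template)
    poly = 1.0 - empty - mono          # P(2 or more templates)
    return {"empty": empty, "monoclonal": mono, "polyclonal": poly}

# At lam = 1 template/droplet, only ~37% of droplets are monoclonal.
print({k: round(v, 3) for k, v in droplet_fractions(1.0).items()})
```

This is the trade-off behind the 50-80% active-well figure: load too dilute and most wells are blank, too concentrated and most are polyclonal.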

Library prep

• ways to shatter genomic DNA:
- sonication (Covaris)
- nebulization (forcing the sample through a small orifice under nitrogen or other gas pressure)
- enzymatic treatment (DNases, etc.)
• ideal fragment size is just a little bit longer than total read length, e.g. > 200bp if doing 100bp paired end
• for emulsion PCR you want 300-400bp fragments
• for illumina can go up to 1000bp
• but larger fragments give fewer reads because bridge PCR is better at amplifying small fragments
• Illumina is the only one that does paired-end reads – no other platform can read from both ends of the same fragment
• steps in library prep
1. shatter
2. repair ends to make sure they are blunt, not sticky
3. ligate known adapter sequences at each 5′ end
• illumina forked adapter: complementary section, then it forks, not complementary
• bridge PCR: each “circle” on a flowcell represents several hundred to low thousands of identical DNA molecules
• See Bentley 2008 and Shendure 2008 for more details on all this chemistry.

Misc

• in Illumina can get ~1M clusters / mm^2; much more than that and you get almost no output because things overlap
• reduction in sequence quality as position in read increases.  due to imperfect chemistry: sometimes a terminator doesn’t get removed, fluorophore doesn’t get cleaved off etc.
• why is 30x coverage often recommended?  if 30x average coverage, then given the distribution of coverage almost everything will be covered at least 4x or 8x, making it possible to call variants.  (high GC content or low information content sequence will never be highly covered with uniquely aligned reads so you’ll still only get ~91% of genome covered at > 1x)
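The 30x rule of thumb can be checked with a simple Poisson model of per-base coverage (idealized — real coverage is overdispersed, which only strengthens the argument for margin):

```python
import math

# Per-base coverage in uniform regions is roughly Poisson(mean_depth).
# At 30x mean depth, essentially no base falls below the ~4-8x needed
# to call a variant; at 10x, a noticeable fraction does.

def p_coverage_below(mean_depth: float, k: int) -> float:
    """P(coverage < k) under a Poisson(mean_depth) model."""
    return sum(math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
               for i in range(k))

print(p_coverage_below(30, 4))  # vanishingly small (~5e-10)
print(p_coverage_below(10, 4))  # ~1% of bases drop below 4x at 10x mean
```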

Alignment

• Two categories of alignment algorithms.
Hashing: ELAND (proprietary, by Illumina), SOAP (deprecated in favor of SOAP2?), Maq
Burrows-Wheeler Transform: SOAP2, BWA, Bowtie
• Hashing is more memory intensive; BWT is simpler, and the BWT aligners are considered faster
• TopHat is just Bowtie for RNA-seq
• quick QC metric: look at the ratio of reads mapping to the X vs. Y chromosome per sample, then order samples by that ratio.  there should be a sharp wall at the male/female transition; a gentle slope means you’ve got contamination
• most sequences in genome are 20-70% GC
• 70% of normalized coverage at ~70% GC content.  I think he means that reads with 70% GC content will be 30% underrepresented compared to their actual frequency in the genome
• ~90% coverage at 1x+ is good for whole genome (GC content and repetitive sequences make it impossible to do 100% at 1x)
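The X/Y-ratio sex check mentioned above might look like this in practice (the read counts here are invented for illustration):

```python
# Quick-and-dirty sex-check QC sketch: compute the ratio of reads
# mapping to chrX vs chrY for each sample and sort. Clean data shows a
# sharp wall between male-like and female-like ratios; a gradual slope
# suggests sample contamination.

def xy_ratios(samples: dict) -> list:
    """samples: name -> (chrX_reads, chrY_reads). Returns sorted
    (ratio, name) pairs; guard against zero Y counts."""
    ratios = [(x / max(y, 1), name) for name, (x, y) in samples.items()]
    return sorted(ratios)

counts = {  # hypothetical per-sample read counts
    "s1": (1_000_000, 350_000),  # male-like: X/Y ~ 3
    "s2": (1_050_000, 340_000),
    "s3": (2_000_000, 5_000),    # female-like: X/Y in the hundreds
    "s4": (1_950_000, 4_800),
}
for ratio, name in xy_ratios(counts):
    print(name, round(ratio, 1))
```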

Speaker 3: Devin Absher - CNVs and Structural Variations

CNVs represent more interindividual genetic variation than SNVs

• SNPs = 0.08% of genome
• CNVs = 0.12% of genome (and rising) (just in # of base pairs absorbed)

Types of events:

• translocations
• inversions
• loss of heterozygosity – e.g. a segment of the maternal chromosome gets copied onto the paternal chromosome, usually by repair mechanisms
• deletions
• amplifications (aka duplications), including small, common polymorphisms of low copy repeats / segmental duplications, occurring in hot spots within the genome

Two approaches: read depth, and paired-end (insert-size) analysis.

Read depth:

• assumes everything is equally alignable
• but this is not true of repeat regions, which get lower alignability
• PCR duplicates
• GC bias
• ways to overcome these issues:
• A. Yoon 2009: look at 100bp windows and count depth in that window. rolling window. correct for GC content. use a “segmentation” algorithm (like an HMM) to look for change points in read depth
• B. create a sort of “reference genome” for read depth.  get average read depth for 100s of samples to average out the real copy number events, leaving just the expected influence of GC content and repeat regions.  then take the log2 ratio of your sample to this reference.  you still get a lot of noise in very high repeat regions, because even in the reference there was close to 0 depth.
• depth really matters. with 120x you can see CNVs very clearly, with 10x the noise is much greater than signal for short deletions (for 1 Mb deletion you’ll still see it)
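Approach B — the log2 ratio of sample depth to a reference panel's depth — can be sketched like this (all window counts are made up):

```python
import math

# Read-depth CNV sketch: per-window depth in a test sample divided by
# the average depth of a reference panel, on a log2 scale.
# log2 ratio ~ 0 means two copies, ~ -1 a heterozygous deletion,
# ~ +0.58 one extra copy.

def log2_ratios(sample_depth, reference_depth, min_ref=5):
    """Per-window log2(sample/reference); None where the panel itself
    has near-zero depth (high-repeat regions stay uninterpretable)."""
    out = []
    for s, r in zip(sample_depth, reference_depth):
        out.append(math.log2(s / r) if r >= min_ref and s > 0 else None)
    return out

sample = [98, 102, 51, 49, 100, 2]    # windows 3-4 look deleted
panel = [100, 100, 100, 100, 100, 1]  # last window: no panel depth
print([round(x, 2) if x is not None else None
       for x in log2_ratios(sample, panel)])
```

In practice a segmentation algorithm (HMM or similar, as in Yoon 2009) would then find change points in these ratios rather than thresholding window by window.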

Paired-end (insert-size) analysis:

• if both reads of a pair map to the same strand, i.e. head-to-tail, this indicates an inversion
• if insert size is out of normal range, indicates a deletion
• how to do this: either standard paired end(100-500bp) or mate-pair (circularized) 3-5kb fragments, which are more difficult and costly to make
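Flagging anomalous insert sizes is essentially outlier detection on the pair-distance distribution — a minimal sketch with invented sizes:

```python
import statistics

# Paired-end SV sketch: pairs whose apparent insert size falls far
# outside the library's distribution suggest a deletion (too large) or
# an insertion (too small).

def flag_anomalous_pairs(insert_sizes, n_sd=3):
    """Return (index, size) for pairs beyond n_sd standard deviations."""
    mu = statistics.mean(insert_sizes)
    sd = statistics.stdev(insert_sizes)
    return [(i, size) for i, size in enumerate(insert_sizes)
            if abs(size - mu) > n_sd * sd]

# Library with ~300bp inserts; one pair spans a deletion.
sizes = [295, 302, 310, 298, 305, 300, 299, 301, 1450, 303, 297, 300]
print(flag_anomalous_pairs(sizes))
```

Real callers (PEMer, BreakDancer, etc.) demand clusters of such anomalous pairs at the same locus, since a single outlier pair is usually a chimeric artifact.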

Both of these are still an art form. high false negative and false positive rates.

software packages

• PEMer (Korbel 2009): looks for clusters of outliers where distance or orientation are anomalous
• Pindel (Ye 2009)
• BreakDancer (Chen 2009): built with Maq in mind as being the aligner
• SVDetect (Zeitouni 2010). better with SOLiD data.
• VariationHunter – Hormozdiari 2010. tries to not only find anomalous pairs but also look up a db of transposons to see what is likely to be inserted there.  has db of known retroviruses in human genome
• HYDRA: anomalous pairs + transposon detection Quinlan 2010
• BreakSeq: breakpoint detection with overlapping reads
• Genome STRiP: population-based analysis – tries to integrate read depth AND aberrant read pairs. Handsaker 2011.
• GASVPro – also integrates read depth and read-pair analysis, with a different statistical model than Genome STRiP. Sindi 2012.

“old fashioned way”: analyst just integrates results of all of them. Mills 2011 used 12 different CNV detection algorithms on 1000G data. built a set of high confidence copy number events.  this is that gold standard thing from GATK

how to get better alignment around segmental duplications

search through unalignable reads to find anomalies

use different aligners
- first pass with a fast aligner (BWA, Bowtie)
- then re-align unaligned reads using Novoalign or mrFAST/mrsFAST
mrFAST/mrsFAST: chops the read into smaller pieces and tries to align each of them

hundreds of de novo CNVs per generation. orders of magnitude more than de novo SNVs.
also older people have accumulated many more somatic CNVs over their lifetime than young people

all of this is concentrated around segmental duplications.

all of these methods are poor at detecting variants whose size falls within the actual distribution of insert sizes in your library
very hard to call 100-300 bp inserts/deletions
big events of 1kb or more are pretty easy to spot

mechanisms by which segmental duplications arise
- nonhomologous recombination

there is also software, e.g. Velvet, that does de novo assembly to detect CNVs, as an alternative to alignment-based approaches