UAB day 2: analysis, expression and epigenomics

Today I am blogging from day 2 of the UAB short course on sequencing.

Speaker 1: Braden Boone – Data handling and analysis for high-performance sequencing

Outline of data analysis workflow for Illumina HiSeq sequencing:

HiSeq control software (real time):
- Sequencing images (actual optical images of flowcells)
- Image analysis
- Base calling (*.bcl files)
Hudson Alpha uses Illumina’s CASAVA 1.8 for all downstream steps
- FastQ generation
- Alignment
- Variant calling

Speaker 2: Greg Cooper – Variant analysis and annotation

Quality control

Example analysis of 1000G samples. % of variants in dbSNP is stable within people of one ancestry group – ~88% in Africans, 94% in Europeans, ~93% in Asians.
Allele frequency spectrum. Rule of thumb: 99% of variants in one sample should be at 1% or greater allele frequency, in general (n-1%) at (n)% or greater allele frequency.
Rare and novel variants should be overwhelmingly heterozygous
Number of (non-reference??) stop codons ~ 100 / genome
Ti/Tv ratio ~2 whole genome, 3 in protein coding, whereas random errors would be 0.5
Size distribution of indels: two trends: (1) should be shifted towards small indels (a few bases) and (2) should be shifted towards multiples of 3, i.e. non-frameshift.

Annoatations as QC for individual variants

Study frequency vs. database frequency – if variant is common in your samples and rare in dbSNP / 1000G or vice versa, that is sketchy. Not necessarily wrong but a red flag.
HWE – e.g. a longevity GWAS reported results driven by genotyping errors, clear because all the top reported SNPs were far out of HWE.
Be more skeptical of variants inside repeats / segmental duplications because harder to uniquely align.

Annotations to improve genetic discovery power

High effect size, high allele frequency variants do exist, e.g. warfarin sensitivity SNP, AMD risk SNP (probably these have all already been discovered, don’t count on more??)
Low effect size, high allele frequency: e.g. cholesterol, diabetes risk. Best uncovered by GWAS
High effect size, low allele frequency
Low effect size, low allele frequency – very hard to detect due to low statistical power
The “multiple testing” issue in genome-wide experiments is not really about multiplicity, it’s about low prior probabilities / low quality hypotheses. The goal of functional annotation is to generate better priors / use information wisely to focus on high value variants.
Using information from evolution: purifying selection reduces rates of evolution at functional sites. Therefore sites with higher levels of sequence conservation are more likely to have greater functional importance. Relationship is quantitative.
Shared sequences like motifs, domains etc. are also informative
Focus on non-synonymous variants b/c enriched for functional effects, disease consequences.
Stop codons and splice-site disruptions are even more likely important
Still, most non-synonymous variants are not important.
Three sources of data for classifying non-synonymous SNPs: (1) biochemical info about amino acids, (2) structural predictions from biophysics, (3) evolutionary information, i.e. sequence conservation with deep phylogenetic comparisons
Two types of approaches: (1) model-based: classify each mutation according to its similarity to a theoretical definition of deleterious vs. neutral, or (2) trained-classifier: heuristically learn the properties of “known” deleterious vs. “known” neutral variants.
List of tools for protein-sequence-based predictions of deleteriousness – Cooper & Shendure 2011.
SIFT
MAPP – measures conservation as a function of biochemical properties ex. how tightly is hydropathy conserved, how tightly is polarity conserved, how tightly is volume conserved.
Polyphen – trained classifier based on many features. Polyphen has 4 categories: synonymous, benign non-syn, possibly damaging, and probably damaging. These correlate well with allele frequency – singletons are enriched for “probably damaging”; high frequency variants are enriched for “benign” in Ng 2009.
What about the 99% of variations that are not located in protein coding regions? How to annotate non-coding variants?
Comparative genomics: evolutionary approach usually absed on mammalian multiple alignment, b/c non-protein-coding sequence in humans is almost never conserved outside of the mammalian clade
Functional genomics: promoters, etc.
Tools: phastCons, GERP, Gumby, PhyloP, SCONE, binCons, Chai Cons, Vista
GERP scores: In mammalian multiple alignment (~24 mammals?), neutral average 5 substitutions per site. If 0 substitutions observed at a site, that site gets +5 points for conservation, if 1 substitution, +4 for conservation, etc.
Are GERP scores meaningful? Two lines of evidence: (1) Coding regions are heavily enriched for conserved sites according to GERP, and (2) GERP score correlates well with allele frequency: singletons are enriched for variants in highly conserved sites. So past rates of evolution (measured by GERP) are predictive of current prevalence (i.e. allele frequency). GERP score also predicts disease influence well in variants that were otherwise identified by amino acid conservation measures.
Functional annotations for non-coding regions are just now emerging. Key types of available information:
Transcription factor binding events (discovered via ChIP-seq)
Regions of open chromatin (via DNase HS/FAIRE/others)
Chromatin modifications
Methylation status
Identification of promoters, enhancers, silencers, and insulators via integrative analysis of molecular annotation
ENCODE project has the goal of producing a functional map of whole human genome including non-coding. There are now hundreds of ENCODE tracks, all genome-wide and publicly available

Error rates

All genomic datasets have both random and structured errors
Sources: ex. depth of coverage, chemistry, PCR mistakes, software limitations
Each institution, machine, technician, protocol, sample etc will have its own systematic error, and sequencing is sensitive enough to pick all these up.
Even when errors are rare overall, they might be proportionally enriched in subsets of variants.
Irony: the most interesting variants are the most likely to be an error.
Why? false positive rate != false discovery rate
Evolution reduces variants in functional regions, yet error is uniform, so the fraction of variants that are errors is elevated in functional regions.
Example: looking for de novos in trios. Accepted actual mutation rate is 2.5e-8. 20Mb of exome is prone to non-synonymous variants. Sequencing false positive rate is 1e-6. So we expect for every one child there will be 0.5 real de novos and 20 false de novos. False discovery rate is 97.6%. And this doesn’t even account for the possibility that the parent actually did have the variant and you just had a false negative of missing that varint in the parent.
e.g. Vissers 2010 – aggressive QC led to 51 candidate de novos, only 9 of which validated.
Need to integrate known association singals. ex. if candidate rare variant in gene X might be correlated with phenbotype Y, that’s more interesting if it’s already known that common variants in X are associated with Y. relevant databases: NHGRI, HapMap and GVS.
eQTLs are enriched for functional variants.

Challenges and future directions

Discrete filters are the norm but are not optimal. e.g. eliminating everything in dbSNP to look for causes of new rare mendelian diseases. Better to use a continuous filter like allele frequency.
Conserved vs. neutral, damaging v. benign are dichotomous rather than quantitative
Most causal variants, even for rare and monogenic diseases, are NOT non-synonymous, obvious loss-of-function protein coding mutations.
Fewer than 50% of rare, likely-Mendelian diseases to which exome sequencing has been applied have yielded useful hits.
New direction: unified predictions of disease relevance. e.g. Beta-globin associations with thalassemia.

Q&A

chromHMM or Segway – attempts to integrate all information from different ENCODE tracks
Rule of thumb: if you find a de novo variant in a child and it IS in dbSNP, then it was probably a false negative in the parent. this will be less true as more rare things are added to dbSNP, but for now is still a good rule of thumb.

Speaker 3: Jay Gertz – RNA-seq

Brief history of gene expression assays

Northern blots – very low throughput.
qPCR & nanostring – low to medium throughput, sensitive.
- qPCR: still used. copy RNA to cDNA, make primers for it, and PCR it quantitatively.
- nanostring: attach RNA to a glass side, use current to line the molecules up and hybridize fluorescent probes. very sensitive b/c counting single molecules
Expression microarrays – revolutionary, first way to look at all genes in an organism in one experiement. glass sides with oligos complementary to sequences of interest. create fluorolabeled cDNA based on RNA from organism of interest, let it hybridize to the oligos on the slide, and see where fluorescence appears. This is very high throughput but fairly noisy. Some mRNAs will bind to other genes’ probes by mistake.
Sequencing-based RNA assays. both high throughput and sensitive. subject of today’s talk

Topics for today

mRNA-Seq – usually called RNA-seq
EDGE RNA-seq
micro RNA-seq

mRNA-seq

Step 1: enrichment of mRNAs. mRNAs are only 3% of all RNA in a cell. (ribosomal = 80%, transfer = 15%, small = 2%). you only want to sequence mRNA. approaches: (1) capture mRNA with poly-dT oligo probes attached to magnetic beads, that hybridize to poly-A tails of mRNA. can use “locked” nucleic acids to get tight hybridization. or (2) use beads to remove rRNA. Ribominus (Invitrogen) and Ribozero (Epicentre). (3) Crab DSN (duplex-specific nucleases). Take all RNA, make cDNA, heat to 95C to dissociate, then lower to ~65C and because rRNA are so abundant they are more likely to find their complement, then use a duplex-specific DNase to chew up anything double-stranded. caveat: this will also reduce transcripts of most highly expressed genes, e.g. actin.
how to deal with degraded samples (low quality, hard-to-obtain samples). with mRNA capture, you’ll bias towards 3′ end of mRNAs. Ribozero and Crab DSN work better for degraded samples
Each method requires 100ng – 1μg of total RNA
Step 2: construct library. method is due to Mortazavi 2008.
steps: (1) fragment RNA by heat or chemical methods, (2) make double-stranded cDNA from random priming, (3) end-repair to make blunt ends, (4) add A at 3′ using Taq, (5) ligate adapters which have single T overhang to find the A, (6) PCR amplify
Does not preserve strandedness of RNA
Directional methods do preserve strandedness. Two approaches: (1) ligate known adapters to 5′ and 3′ ends of the transcript itself before reverse transcribing to cDNA, or (2) dUTP cDNA method – mark the complementary strand by synthesizing with U instead of T, then use a selective digester that will digest the strand with the Us in it
These methods are very effective at getting `99.5% reads from positive strand. Very important for, ex. antisense transcription
Efficient library prep protocol using Nextera enzyme mix intended for genomic DNA -see Gertz 2011.

Quality control metrics for RNA-seq

Alignment – at least 75% of reads should map to known RefSeq genes
Complexity of library (number of different molecules after library construction). Of random sample of 10M reads, how many different genomic positions are represented? Should be > 85% complexity (i.e. 85% unique positions). Complexity drops off sharply at 1 pg mRNA (~10 cells); to get good complexity you want at least 10 pg mRNA so at least 100-1000 mammalian cells.
Coefficient of variation across a transcript – you want even coverage across a transcript, minimizing 3′ bias which is especially pronounced in degraded samples

Single-cell RNA-seq

Key step is amplification of cDNA

How many reads is enough?

ENCODE guidelines:
30M paired-end reads for gene expression
100-200M paired-end reads for transcript discovery
The lower the expression of a gene, the more depth you need to get an accurate measure of its expression

When to use microarray vs. RNA-seq?

microarray have limited dynamic range – a small range of expression levels are detectable; RNA-seq can detect much larger range
Larger fold differences in expression can be detected with RNA-seq and not with microarray
microarray only gives you expression measurements; RNA-seq also lets you discover new transcripts, get info about splicing, call SNPs.
RNA-seq is still about 2x more expensive than microarray
For guidance on depth see Tarazona 2011.

Paired vs. single-end reads

Paired end is MUCH more important in RNA-seq than in genome/exome sequencing.
1. RNA is much harder to align to genome. Paired end gives you greater alignment rate
2. Paired end gives you more splicing information
Longer reads are also better b/c you can span more exon junctions, but limited by insert size of fragment. Hudson-Alpha does paired-end 50bp reads. Paired-end is more important than read length.

Computational tools for RNA-seq

With or without a genome you can do: transcript/isoform discovery, expression levels and isoform usage, and viral and bacterial contaminants
With a genome you can do: RNA editing, calling SNPs, fusion genes
If your organism has a reference genome, first step is alignment to genome
Alignment is difficult for two reasons:
1. exons can be 10 or even 100kb apart in genome but adjacent in transcript
2. many reads span exon junctions so they don’t match to the genome itself
TopHat works well for RNA-seq alignment though it is relatively slow
To discover transcripts: CuffLinks, Abyss, Scripture
Compare with annotations from known genes: RefSeq, ENSEMBLE, Gencode (annotations by ENCODE, includes long non-coding RNAs lncRNAs)
If no reference genome, do de novo assembly of transcripts using Trans-ABySS, Trinity, Velvet.
Expression measurements: add up reads mapping to a transcript, normalized by both the length of gene and number of aligned reads in that sample or experiment. this lets you compare across genes. you don’t need to normalize by length of gene if you are only interested in comparing same gene’s expression across different samples.
Measure is Reads Per Kilobase of transcript per Million mapped reads = RPKM. FPKM (“fragments”) is identical
Some methods use raw read counts instead of normalized. This lets you examine power for a particular gene.
Tools for comparing gene expression: CuffLinks, MyRNA, DEseq, DEGseq, baySeq, edgeR, NOIseq, SAMSeq, DEXSeq. some require replicates.

Common computational problems

Outliers can cause problems – a gene with identical expression across experiments except one huge outlier point can have a very small p value by edgeR or DEseq methods.
Different methods all have different biases. SAMSeq is non-parametric – looks only at rank of samples, so it’s not sensitive to outliers. But that’s true even if 6 out of 20 samples are outliers, at which point they are not really outliers, so it might be missing something that really is significant.

Other things you can do

Can align RNA-seq reads to viral and bacterial genomes to detect “contamination”. genomes available from NCBI Genome database.
Call SNPs using Samtools, SNVmix. Not as good as variant calling on genomic DNA, esp. near junctions, and heterozygous variants can only be called if both transcripts have very similar expression level
Identify transcript fusion using TopHat-fusion, deFuse, FusionMap, ChimeraScan (he recommends this). You can discover cancer-specific transcript fusions.
Very hard to do transcript-specific expression. You can do it if you already have phased genome (or exome) sequenced.

EDGE RNA-seq

Eco-P15I-tagged Digital Gene Expression (EDGE)
You cleave to get only the poly-A tail up to a restriction site, then ligate an adapter. Eco-P15I will bind to adapter but then cut 27b downstream of the binding site. Now you have a 26-28b “edge tag” available for sequencing.
If reference genome available, align and count tags for expression
If not available, can align to a similar organism’s reference genome to see which homologous genes are expressed
Library prep is more laborious but you sequence much less material in order to get same amount of expression information

micro RNA-seq

miRNAs regulate expression of mRNAs
only ~21b long
methods for analyzing microRNA:
qPCR – Taqman assay – ligation of primer sequence first
Nanostring – imaging with fluorescent probes
microRNA microarrays
microRNA-seq: the only method for microRNA discovery
protocol for creating miRNA libraries is similar to direction-specific mRNA-seq: (1) ligate adapters to miRNAs, (2) reverse transcribe to cDNA, (3) acrylamide gel size selection for 21b + adapter length – difficult and laborious
To sequence miRNA you only need 2-5M short single-end reads. There are only 1700 known miRNAs in mammals.
Things you can do:
Quantify miRNA expression (spike-in synthetic RNAs may help compare samples)
Identify the processed end
Mutation / editing detection is very difficult due to length – 21b is hard to align uniquely to genome
Analysis of miRNA-seq:
trim adapter from sequence
align to genome or to known miRNAs
count reads
Tools for analyzing: miRNAkey, miRDeep, miRexpress, miRanalyzer

RNA-seq is become more cost effective. also you can use DNA barcodes to pool samples in one lane (on Illumina).

Speaker 4: Devin Absher – Genome-wide analysis of DNA methylation

Interest in epigenetics arises in part out of a search for missing heritability that did not end up being explained by GWAS because all the associated variants found through GWAS had relatively low effect size.

Epigenome reflects genetic factors and environmental factors (on us, and even our parents and grandparents) and disease state itself.

Recent literature has shown that smoking, diet, etc. affect the epigenome.

How to best study the epigenome? There are two major things to study: histone modification and DNA methylation. The two are highly correlated. To study histone modifications use ChIP-seq – pull down modified portions of genome using antibodies to modified histones. To study DNA methylation, three approaches:

Immunoprecipiatation/protein-based: MeDIP-seq, MBD-seq
Restriction enzyme-based: CHARM, MethylSeq/HELP
Bisulfite-based: Reduced representation (RRBS), whole genome, targeted, combined with above assays

Most assays rely on “peak calling” to make quantitative level of enrichment a binary trait – which parts of genome “are” or “aren’t” methylated.

Biology of DNA methylation

Cytosine (C) accounts for almost all methylation in genome – other bases very rarely get methylated.
Methylation in gene promoters strongly correlated with lack of expression
Methylation at other sites can have different effect depending on context.
Most C methylation occurs in a 5′-CpG-3′ context.
3% of cytosines in adult mammalian cells are methylated
CpG-rich sequences are often found at gene promoter.
CpG islands are ~200b regions where >50% GC content, CpGobs/CpGexp > 0.6
Repetitive elements / retrotransposons are very highly methylated
CpG islands – very low methylation levels
CpG “shores” bordering islands – often sites of active regulation through differential methylation. Tend to be a bit more methylated than the islands. Often in cancer the methylation pattern in the shores changes.
CpGs outside CpG islands are ~80% methylated
High methylation typically found in condensed chromatin, low methylation in open chromatin
Effect on expression depends on gene context
3′UTR methylation prevents antisense transcription, thus increasing expression of the gene. (Whereas methylation 5′ of the gene is associated with reduced expression).

Reduced representation bisulfite sequencing

If money was no object, everyone would do whole genome bisulfite sequencing.
To save money, people often try to enrich for regions of interest before doing bisulfite sequencing.
Reduced representation bisulfite sequencing (RRBS) due to Meissner 2005.
Takes advantage of restriction enzyme MspI which cuts CpG sequence (regardless of methylation status). Digest whole genome with this enzyme, then do size selection for small sizes, thus enriching for sites with high CpG content.
Alternate method: Agilent sells a targeted capture kit (similar to an exome capture kit) designed to get promoter regions and things you’re interested in for bisulfite sequencing. This avoids the RRBS issue that CpG-poor genes are underrepresented.
Each read gives you one data point: a CpG at the restriction site
And you’ve biased the sequencing towards CpG-rich genomic regions: CpG islands and shores.
Steps: digest, ligate adapters, size select, bisulfite conversion
This method covers about 75% of gene promoters in the human genome quite well
Bisulfite conversion: methylated cytosines are protected from bisulfite conversion; unmethylated are converted to T. Analysis ends up like SNP calling: ratio of C to T at a base that is C in the reference genome gives us an estimate of the proportion of C that were methylated at that site across the different chromsomes/cells that went into our sample.
This works for all cytosines even if not in a CpG context.
Major issue: strands are no longer complementary because only the C stand gets converted to T (the G on the other side doesn’t get converted to A). This is especially a problem with paired-end sequencing because the read from a more heavily modified strand may not align to the right place in the genome.
One solution: ignore C and T when doing alignment. Mask out all Cs (or just Cs in a CpG context) by converting in silico to a T at the position of every C. Do this to both the reads and the reference genome. Once alignment coordinates are found, go back and pull original reads from FASTQ to get proportion of methylation. This still doesn’t solve the strandedness issue.
Solution for strandedness: convert all reads and both strands of genome C->T, resulting in two copies of the reference genome, one for each strand. Then align each read to both genomes and accept only those reads which can be aligned uniquely to only one genome.
For paried-end reads, need to reverse complement one of the reads so that the first and second read in a pair are capable of aligning to the same strand of the genome. Make sure settings in aligner (e.g. Bowtie) are set to deal with forward-forward pairs.
This all used to be done manually; now software tools are available: Bowtie, Pash, Bismark, BSMAP, BS Seeker, MethylCoder.

Distribution of methylation across genome

Bimodal – very low methylation levels in CpG islands, very high methylation elsewhere. Not much of the genome is in an intermediate state of methyation, but those parts are of much interest: they are probably intermediately methylated because they have either cell type specific methylation or some kind of dynamic regulation.
Results are always very cell type specific. For instance DNA extracted from blood is from white blood cells, which are really a mixture of cell types, each with a different methylation pattern. For instance a region might be fully methylated in T cells and not at all methylated in B cells and so the overall result will look like an intermediate level of methylation. And different blood cells have different ratios of these cell types. An even bigger issue with tumor/normal comparisons because there are often some normal cells in a tumor.
Often cells “choose” one allele to methylate, giving rise to allele-specific expression. This results in perceived intermediate levels of methylation as well.

5-OH (read “hydroxy”) Methylcytosine aka 5hmc

Cells use this modification over short time periods to change methylation state.
a UAB group has evidence that hydroxymethylation is involved in memory consolidation in neurons in animals after fear tests, with long-term effect on gene expression.
Also has influence on splicing: 5-hmc across intron/exon boundaries affects which exons get included in the final transcript
Bisulfite doesn’t distinguish between 5mc and 5hmc – both protect C from conversion to T
However if you pre-treat DNA with an oxidative reaction you can make 5hmc (but not 5mc) sensitive to bisulfite
Do two parallel reactions: one standard bisulfite, and one oxidated bisulfite, thus you can sequence twice as much in order to get three-way distinction between C, 5mc and 5hmc
Nanopore sequencers claim that they can distinguish C, 5mc, and 5hmc without bisulfite conversion. Hopefully when those become commercially available, bisulfite will no longer be necessary.

Q. what depth is needed for bisulfite sequencing?

300x needed for bisulfite based on simulations – this is the point where you saturate the information content. But no one is doing that depth. People are doing more like 20x or 30x. 300x would be warranted only if you need highly accurate calls. But because most things are fully or not-at-all methylated, you can distinguish those groups with about 10x. But then you can only report methylation in 10% increments. Read depth is important if looking for a 20% vs. 30% difference in methylation in a treatment vs. control group.

Q. what depth is needed for MeDIP-seq?

You really need to do pilot experiments to see what depth will be needed for your experiment. In general, methylated C are fairly unevenly distributed across the genome, so it’s likely that if you have enough reads to have average 1x whole genome coverage, this will result in pileups of 10-20x at methylation sites.

Q. how to do quantitative comparisons of methylation levels?

You need to do this in two steps. First, do peak-calling in order to remove noise – you only want to consider sites that have real appreciable methylation. Once you’ve called peaks, integrate area under the curve at each peak and compare that between experiments. However there are many caveats with this approach. First, methylation is heteroskedastic. As mentioned above, the histogram of methylation levels (x axis: % methylated at site, y axis: number of sites at this methylation level) is bimodal. At intermediate methylation levels, the distribution of methylation levels across individuals or across samples is normally distributed, but near the extremes of complete methylation or zero methylation, a normal distribution is impossible (because you can’t have < 0 or > 100% methylation) so you get a beta distribution instead. This heteroskedasticity means it is not valid to use regression models on methylation levels across samples. One approach is to use an m score instead but even this is not a complete solution to the problem.

Q. what accounts for sites of intermediate methylation levels?

Probably a few factors:

Sample contains multiple cell types, some of which are completely methylated and some of which are not at all methylated.
Allele-specific expression is achieved through total methylation of either only the paternal or only the maternal chromosome.
There may also be stochastic behavior, suggested by the fact that we see variation in methylation levels at a given site even within clonal cell lines. But since we don’t yet have protocols for single-cell methylation analysis, this hasn’t been proven.

Q. QC & alignability

Because bisulfite-seq relies on basically a 3-letter genome, % aligned reads drops from 90% to more like 65-70%. To filiter and only use best data, we might only consider high-quality alignments and high-quality SNP calls.

Q. relative value of single base resolution

Methylation is usually correlated across 50 – 100 bp windows but not always. Sometimes a single base’s methylation is very important for a particular transcription factor’s ability to bind, so single base pair resolution of bisulfite sequencing can be important. This is a major advantage of sequencing over methylation arrays, which they still do use at Hudson Alpha.

N.B. epigenetic analysis tells you about potential for expression, which is often correlated with actual expression (RNA-seq) but not completely. In lupus / autoimmune diseases you will see hypomethylation meaning that genes are poised for expression, even between bouts of disease when high expression is not seen in RNA.

Q. arrays vs. sequencing

Arrays are more accurate than very low-depth sequencing, but less good than high-depth sequencing. But they also have cross-hybridization, dynamic range problem, etc. Methyl 450 arrays are generally very good and Hudson Alpha still uses them.

Speaker 5: Rick Myers – Genome-wide measurement of protein:DNA interactions

What we know about gene regulation

Many genes have multiple promoters
Sequence-specific transcription factors bind in this region
Other proteins bind at sites further upstream. Some of these are sequence-specific, others are methylation specific
cis-acting components: recognize small segments (<21b), > 1M non-coding regulatory elements, located proximal to, distal to or within genes. 24,000 protein-coding genes have been identified and that’s probably most of them. So far we have identified 1,000 non-coding RNA genes (miRNA, lincRNA).
trans components: > 1500 DNA-binding proteins in mammals (maybe as much as a tenth of the proteins in our genome are devoted to regulating other proteins), most recognize small, specific cis elements; others are general (e.g. histones, RNA polymerase)
how do we get a comprehensive understanding of gene expression, DNA replication, recombination, and other genomic events mediated by DNA binding proteins?
ENCODE project is one group trying to do comprehensive functional annotation of genome. It’s kind of a ‘baseline’ – it doesn’t look at every cell type, but it does aim to look at every transcription factor, every marker, in at least a few cell types and sometimes a lot of cell types. Nine-month moratorium on genome-wide analysis has been lifted, the public is welcome to publish anything on this.

At peak of human genome project (2001-2), max output was 150 million reads / year from all labs working on it nationwide. Today sequencers get 300 million reads / lane.

ChIP-seq

Measure occupancy of DNA-binding proteins in living cells [Johnson 2007]
Goal: take a snapshot of what the cell looked like at one moment.
Step 1: Use a cross-linker (ex. aldehyde) to link proteins to DNA. This is because although factors bind pretty tightly, they will come off in rough handling as you disassemble the cell to extract DNA, unless you cross-link.
Step 2: Quickly kill the cell – ex. freeze and macerate
Step 3: Isolate chromatin
Step 4. Shear DNA into 300-500b segments
Step 5. Use an antibody specific to the protein you’re interested in to bring down only the segments of DNA bound to that protein. (So far, ENCODE has done this for 160 transcription factors, about 10% of total; the limiting step is the ability to make antibodies specific to every transcription factor)
Step 6. Reverse-crosslink by heating
Step 7. Isolate DNA from protein.
At this point, DNA is in vanishingly small amounts. But you’ll amplify as you build library from this DNA. Size select, mostly to get rid of small pieces. Then do sequencing.
Align reads to genome
Do “peak calling”

Technical requirements for ChIP-seq

Validated, specific, high-titer antibodies
Negative control for every batch. a.k.a. “Mock IP” take some cross-linked, sheared DNA without antibody pull-down and reverse-crosslink and sequence it. You’ll find some regions of genome show up at higher-than-normal rates. This is because sonication of chromatin breaks it more readily in places where a lot of gene regulation is going on, because those areas are histone-free and bound by sequence-specific proteins which do not protect much from shearing. So you need to subtract peak depth of Mock IP from peak depth of ChIP to isolate the effect of the antibody.
2 biological replicates for every transcription factor – you don’t get exactly the same answer. Usually you take the intersection of the sets of peaks between the two samples.
High-quality chromatin from lots of healthy cells
High-complexity library for NGS
Sufficiently deep read depth
Software for peak-calling: QuEST, MACS, SPP, Peakseq, Findpeaks. HudsonAlpha uses MACS. These will handle subtraction of Mock IP results and will control for certain systematic errors that happen in these experiments.
Quality assessment to see if experiment worked

Directionality of ChIP-seq reads identifies location of binding site.

How many reads are enough?

For most proteins, 50 million reads almost completely saturates so that you can call all the peaks that exist. Some (esp. histones) can require 100 million reads.
Depends on which DNA-binding protein – could get 50-90% of existent peaks with 20M reads. Often 20M-30M reads are used.
It’s not just a matter of how many peaks you can call- you will also get better resolution as to exactly where the binding site is, if you have more depth.

Examples of what we can learn from ChIP-seq

If two proteins are always seen binding to the same places, that suggests they are part of a complex.
Ex. ChIP-seq identified NRSF (aka REST1) occupancy in NeuroD exon
Glucocorticoid receptor – GR binds hormone cortisol in cytoplasm, translocates to nucleus, activates and represses transcription of many genes
PER1 – primary transcription factor regulating circadian rhythms
About 11% of genes are in bidirectional pairs – gene A and gene B start about 100 – 1000b away from each other and go in opposite directions 5′ to 3′. This structure is conserved across mammals which suggests it’s important. In the 100-1000b in between the genes exists a bidirectional promoter. These tend to be GC rich and lack a TATA box. There are several over-represented motifs in bidirectional promoters. Turns out GABP binds to almost all bidirectional promoters.
Evolutionary perspective: of our genome, 1% is conserved exons and 3% is conserved non-exons – mostly promoters and such.
Allele-specific occupancy (occupancy = protein-DNA binding). Sometimes ChIP-seq gives you 90% maternal occupancy / 10% paternal occupancy. You can see that almost all binding is occurring on the maternal allele.
Would like to apply ChIP-seq to disease as well.

Q&A

Q. For disease comparisons, could you use quantitative readout (e.g. amount of binding, area-under-curve) instead of binary peak calling?

Yes, we sort of can and people want to. It is difficult though because for example: GABP 4500 sites. Some have 1000 reads, some have 20 reads. Is that strong vs. weak binding? Probably not – weak binding is probably coming off and not getting sequenced at all. More likely it is bound in some cells and not others – because even in cloncal cell lines you have cell cycle, etc. differences. Another possibility: other factors bound there make the epitope more available or not. We’re really not sure what the relative contribution of these different things is. DNase footprinting can tell you fraction of copies of that site that are occupied or not, but we don’t usually do that for this. You can get some hints from quantitative comparison. People in field including myself will often refer to a strong vs. weak binder even though the truth is we can’t tease out that signal from what we see. So yes, you can do a quantitative comparison but it’s difficult to interpret.

Q. Cell type specificity

We see a lot of specificity but be careful – nothing is completely specific. Also a lot of binding doesn’t appear to do anything. e.g. when hormone is available, GR will bind to 4900 promoters but only 250 of those genes will have increased expression. So why the other binding – is it wasted? Some people think that over evolution, promoter sites are born and die often because they are relatively easy to evolve. Or maybe the other sites are a “sink” to hold onto the transcription factors.