Analysis of ChIP-seq experiments
Jan. 13, 2011
Hot Topics: Analysis of ChIP-seq experiments
Hot Topics on the analysis of high-throughput sequencing experiments
• Mapping next generation sequence reads (December 2010)
http://iona.wi.mit.edu/bio/education/hot_topics/shortRead_mapping/Mapping_HTseq.pdf
• ChIP-seq (January 2011)
• RNA-seq (February 2011)
• High throughput sequencing pipeline in Galaxy (March 2011)
Outline
• ChIP-seq overview
• Critical steps in data analysis
• Tools available for analysis and software performance
– BaRC bake-off
– Published evaluations
• Software available on Tak
• Suggested pipelines
ChIP-Seq overview I
Park, P. J., ChIP-seq: advantages and challenges of a maturing technology, Nat Rev Genet. Oct;10(10):669-80 (2009)
ChIP-Seq overview II
Park, P. J., ChIP-seq: advantages and challenges of a maturing technology, Nat Rev Genet. Oct;10(10):669-80 (2009)
Critical steps in data analysis
1. Effective mapping
2. Read extension and signal profile generation
3. Peak assignment
Most original software looked for fold enrichment of the sample over input or expected background, and used a Poisson distribution to assess the significance of the fold enrichment.
Newer versions:
a. Use strand-dependent bimodality
b. Use the background distribution from input DNA, or model the background data, to adjust for local variation
Pepke, S. et al. Computation for ChIP-seq and RNA-seq studies, Nat Methods. Nov (2009)
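The Poisson test mentioned under peak assignment can be sketched as follows. This is a minimal illustration, not the procedure of any particular peak caller: the function names, the `scale` factor (sample depth divided by control depth), and the `min_lambda` floor for windows with no control reads are all assumptions made for this example.

```python
import math

def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    if k <= 0:
        return 1.0
    # P(X = k), computed in log space to avoid overflow for large counts.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Add P(X = k), P(X = k+1), ... until further terms no longer change the sum.
    while term > total * 1e-16:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)

def window_enrichment(sample_count, control_count, scale=1.0, min_lambda=1.0):
    """Fold enrichment and Poisson p-value for one genomic window.

    scale and min_lambda are hypothetical parameters for this sketch.
    """
    # Expected background count in the window, taken from the input control.
    lam = max(control_count * scale, min_lambda)
    fold = sample_count / lam
    p_value = poisson_upper_tail(sample_count, lam)
    return fold, p_value

# Made-up counts: 30 sample reads against 10 control reads in the same window.
fold, p_value = window_enrichment(30, 10)
print(fold, p_value)
```

Taking the expected count from the control in the same window, rather than from a single genome-wide rate, is one simple way to adjust for local variation in the background.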
Critical steps in data analysis
Mapping your reads
• See Hot Topic: Mapping Next Generation Sequence Reads (December 2010)
http://iona.wi.mit.edu/bio/education/hot_topics/shortRead_mapping/Mapping_HTseq.pdf
• If a read maps to several places on the genome, keep one position at random to avoid overcounting the reads
• Example of mapping command
bsub "bowtie -k 1 -n 2 -l 36 --best --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 inputSeq.txt output_mm9.k1.n2.l36.best.map"
-k 1: report 1 alignment per read
-n 2: max number of mismatches in the seed
-l 36: seed length
--best: hits guaranteed best stratum; ties broken by quality
--solexa1.3-quals: input quals are from GA Pipeline ver. >= 1.3
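The idea of keeping one position at random for a multi-mapped read can be illustrated on an already-parsed list of alignments. In the command above, bowtie itself limits the report to one alignment per read through -k 1, so the sketch below only shows the principle; the tuple layout and the function name are assumptions, not part of bowtie.

```python
import random
from collections import defaultdict

def keep_one_alignment_per_read(alignments, seed=0):
    """Keep a single, randomly chosen alignment for every read.

    alignments: iterable of (read_id, chrom, position, strand) tuples
    (a hypothetical layout chosen for this example).
    """
    rng = random.Random(seed)  # fixed seed so the choice is reproducible
    hits_by_read = defaultdict(list)
    for alignment in alignments:
        hits_by_read[alignment[0]].append(alignment)
    # One alignment per read, so multi-mapped reads are counted only once.
    return [rng.choice(hits) for hits in hits_by_read.values()]

# Made-up example: read r1 maps to two places, read r2 to one.
example = [
    ("r1", "chr1", 100, "+"),
    ("r1", "chr5", 900, "-"),
    ("r2", "chr2", 50, "+"),
]
print(keep_one_alignment_per_read(example))
```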
Critical steps in data analysis
Peak calling: Regions that may occur in ChIP-seq data for TFs
Rye, M. B. et al. A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs. N.A.R. Nov (2010)
Critical steps in data analysis
Peak calling: The value of having input control
[Figure: genome browser tracks comparing the sample wig file with the control wig file]
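Wig tracks like the sample and control tracks on this slide are signal profiles of the kind produced in step 2 of the critical steps (read extension and signal profile generation). A minimal sketch of that step, assuming each mapped read is given by the coordinate of its 5' end and its strand, and assuming a fixed fragment length chosen by the user:

```python
def signal_profile(reads, fragment_length, region_length):
    """Per-base coverage after extending each read to the fragment length.

    reads: iterable of (five_prime_position, strand) with 0-based positions
    (a hypothetical layout for this example). Reads on the "+" strand are
    extended to the right, reads on the "-" strand to the left.
    """
    diff = [0] * (region_length + 1)
    for five_prime, strand in reads:
        if strand == "+":
            start, end = five_prime, five_prime + fragment_length
        else:
            start, end = five_prime - fragment_length + 1, five_prime + 1
        # Clip the extended fragment to the region being profiled.
        start, end = max(start, 0), min(end, region_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # Turn the difference array into a running coverage value per base.
    coverage, running = [], 0
    for change in diff[:region_length]:
        running += change
        coverage.append(running)
    return coverage

# Made-up example: one "+" read and one "-" read flanking the same fragment.
profile = signal_profile([(10, "+"), (29, "-")], fragment_length=20, region_length=50)
print(profile[5], profile[10], profile[29], profile[30])
```

Running the same function on the sample reads and on the control reads gives two profiles that can be compared position by position.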
Critical steps in data analysis
Peak calling: Using strand-dependent bimodality in peak calling
Directional methods (e.g. SISSRs) look for the point where reads shift from mapping to the sense strand to mapping to the antisense strand.
These methods can be very precise for sharp binding, but they are less useful for identifying broad enrichment signals, where the shift point is no longer present.
[Figure panels: Sharp binding; Broad binding]
Wilbanks, E.G. et al. Evaluation of Algorithm Performance in ChIP-Seq Peak Detection. PLoS ONE July (2010)
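The shift from sense-strand to antisense-strand reads can be illustrated with a small window scan. This is a simplified sketch of the general idea only and should not be read as the SISSRs algorithm; the window size and the function name are assumptions.

```python
def find_shift_points(sense_starts, antisense_starts, window=20):
    """Estimate positions where the read signal flips from sense to antisense.

    Each window gets a net count (sense reads minus antisense reads). A shift
    point is reported halfway between a window with a positive net count and
    the next non-empty window with a negative net count.
    """
    positions = list(sense_starts) + list(antisense_starts)
    if not positions:
        return []
    low = min(positions)
    n_bins = (max(positions) - low) // window + 1
    net = [0] * n_bins
    for pos in sense_starts:
        net[(pos - low) // window] += 1
    for pos in antisense_starts:
        net[(pos - low) // window] -= 1

    shifts = []
    previous = None  # index of the last window with a non-zero net count
    for i, value in enumerate(net):
        if value == 0:
            continue
        if previous is not None and net[previous] > 0 and value < 0:
            left_center = low + previous * window + window // 2
            right_center = low + i * window + window // 2
            shifts.append((left_center + right_center) // 2)
        previous = i
    return shifts

# Made-up example: sense reads pile up on the left, antisense reads on the right.
print(find_shift_points([100, 105, 110, 115], [180, 185, 190, 195]))
```

For broad enrichment, sense and antisense reads are interleaved across the region, so a scan like this finds no single clean transition, which matches the limitation noted on the slide.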
Pepke, S. Wold, B. Mortazavi, A. Computation for ChIP-seq and RNA-seq studies. Nat Methods. Nov (2009).
BaRC ChIP-seq bakeoff
Software packages we have tried
• FindPeaks: Doesn’t take control data
• PeakSeq: Input has to be Eland output
• ERANGE: Running time is too long
• CisGenome: Easy to use, GUI available
• SISSRs: Recommended for TFs
• MACS: Recommended for TFs and histone modifications
Data used: Marson et al., Cell. 2008 Aug 8;134(3):521-33, (Young lab)
BaRC ChIP-seq bakeoff
Comparison between programs
                 Cisgenome v1.1   SISSRs v1.4   MACS 1.3.7.1 top ~17K   Marson et al.
                 (9403)           (10933)       (out of ~30K)           (16688)
Cisgenome        NA               -             -                       -
SISSRs           83.78            NA            -                       -
MACS top 17K     97.31            96.91         NA                      -
Marson et al.    96.78            96.22         84.01                   NA
Pair-wise comparisons of the peaks for Nanog. Numbers represent the percentage of total peaks from one method (column) that are shared with another method (row).
Data used: Marson et al., Cell. 2008 Aug 8;134(3):521-33, (Young lab)
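A percentage of the kind shown in the table (peaks from the column method that are shared with the row method) could be computed along these lines. This is a minimal sketch that treats any overlap of at least one base as "shared"; the overlap criterion used in the bake-off is not stated on the slide, so that rule and the function name are assumptions.

```python
from collections import defaultdict

def percent_shared(peaks_a, peaks_b):
    """Percentage of peaks in peaks_a that overlap at least one peak in peaks_b.

    Peaks are (chrom, start, end) tuples with half-open coordinates.
    """
    if not peaks_a:
        return 0.0
    b_by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        b_by_chrom[chrom].append((start, end))
    shared = 0
    for chrom, start, end in peaks_a:
        # Two intervals overlap when each one starts before the other ends.
        if any(start < b_end and b_start < end
               for b_start, b_end in b_by_chrom.get(chrom, ())):
            shared += 1
    return 100.0 * shared / len(peaks_a)

# Made-up peak lists for two hypothetical methods.
method_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200), ("chr2", 900, 950)]
method_b = [("chr1", 150, 250), ("chr1", 550, 560), ("chr2", 150, 160)]
print(percent_shared(method_a, method_b), percent_shared(method_b, method_a))
```

The measure is not symmetric: a method that calls more peaks will usually contain most of the peaks of a method that calls fewer, but not the other way round, which is why only one triangle of the table is filled in.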
Other evaluations of ChIP-seq peak calling programs
“Evaluation of Algorithm Performance in ChIP-Seq Peak Detection” (PLoS ONE, July 2010)
Benchmarking ChIP-seq peak calling algorithms
• Agreement between different programs
• Co-occurrence of binding motifs
• Experimental verification (still small amount of data)
Agreement between different programs
Programs that call a larger number of peaks tend to include the peaks called by the programs that call fewer peaks
“Evaluation of Algorithm Performance in ChIP-Seq Peak Detection” (PLoS ONE, July 2010)
Pair-wise comparison of shared peaks for human neuron-restrictive silencer factor (NRSF) and growth-associated binding protein (GABP). Numbers represent the percentage of total peaks from one method (column) that are shared with another method (row). Programs are ordered by increasing number of peaks called.
Evaluation based on number of peaks containing the expected motif
PeakSeq and Hpeak are outliers
“Evaluation of Algorithm Performance in ChIP-Seq Peak Detection” (PLoS ONE, July 2010)
Recommendations
• Include a DNA-input control
• Look at your raw data as well as the peak calls in a genome browser
• If your data/signal is not very strong, try using several peak-calling programs.
• We have had good results using MACS. SISSRs is a good second choice if you are expecting sharp peaks.
Software available on Tak
• MACS
macs
• SISSRs
sissrs
A pipeline for ChIP-seq analysis with MACS
• Mapping reads with bowtie
bsub "bowtie -k 1 -n 2 -l 36 --best --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 inputSeq.txt output_mm9.k1.n2.l36.best.map"
• Calling peaks with MACS
bsub "macs -t sample_mm9.k1.n2.l36.best.map -c inputControl_mm9.k1.n2.l36.best.map --name=test1 --format=BOWTIE --tsize=36 --wig --space=25 --mfold=10,30"
PARAMETERS
• -t TFILE Treatment file
• -c CFILE Control file
• --name=NAME Experiment name, which will be used to generate output file names. DEFAULT: "NA"
• --format=FORMAT Format of tag file, "BED" or "ELAND" or "ELANDMULTI" or "ELANDMULTIPET" or "SAM" or "BAM" or "BOWTIE". DEFAULT: "BED"
• --tsize=TSIZE Tag size. DEFAULT: 25
• --wig Whether or not to save shifted raw tag count at every bp into a wiggle file
• --mfold=MFOLD Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT: 10,30
MACS output files
1. Folder with wig files for control and sample.
2. Excel file containing the following columns: chr, start, end, length, summit, tags, -10*LOG10(pvalue), fold_enrichment, FDR(%)
To visualize the peaks, make a bedgraph file with columns: chr start end fold_enrichment
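The bedgraph file described above could be produced from the MACS peak table with a short script. The sketch below assumes the table is tab-delimited, has the column names listed on this slide in a header row, and may contain comment lines starting with "#". It also assumes that start coordinates in the table are 1-based and shifts them to the 0-based convention used by bedgraph; set `one_based=False` if that does not hold for your file. The function name is hypothetical.

```python
def macs_xls_to_bedgraph(xls_lines, one_based=True):
    """Convert lines of a MACS peak table to bedgraph lines.

    Output columns: chr, start, end, fold_enrichment (tab-separated).
    """
    header = None
    bedgraph = []
    for line in xls_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if header is None:
            header = fields  # first non-comment line holds the column names
            continue
        record = dict(zip(header, fields))
        start = int(record["start"])
        if one_based:
            start -= 1  # assumption: table starts are 1-based
        bedgraph.append(
            f'{record["chr"]}\t{start}\t{record["end"]}\t{record["fold_enrichment"]}'
        )
    return bedgraph

# Made-up table content, for illustration only.
demo = [
    "# demo comment line\n",
    "chr\tstart\tend\tlength\tsummit\ttags\t-10*LOG10(pvalue)\tfold_enrichment\tFDR(%)\n",
    "chr1\t100\t300\t201\t95\t40\t120.5\t12.5\t0.8\n",
]
print("\n".join(macs_xls_to_bedgraph(demo)))
```

To use it on a real file, pass an open file handle in place of `demo` and write the returned lines to a file ending in .bedgraph.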
A pipeline for ChIP-seq analysis with SISSRs
• Mapping reads with bowtie, get SAM output
bsub "bowtie -k 1 -n 2 -l 36 --sam --best --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 inputSeq.txt bowtieoutput_mm9.k1.n2.l36.best.sam"
• Convert formats
Convert SAM to BAM
samtools view -S -b -o OUTFileName.bam INFile.sam
bsub "samtools view -S -b -o bowtieoutput_mm9.k1.n2.l36.best.bam bowtieoutput_mm9.k1.n2.l36.best.sam"
Convert to BED
bsub "bamToBed -i bowtieoutput_mm9.k1.n2.l36.best.bam > bowtieoutput_mm9.k1.n2.l36.best.bed"
• Run SISSRs
sissrs -i bowtieoutput_mm9.k1.n2.l36.best.bed -o outputName -s 2716965481 -b BG_mm9.k1.n2.l36.best.bed -L 200
-s: genome size
-L: upper bound on the DNA fragment length
SISSRs output
outputName.bed
chr1 3053011 3053071 outputName 55.54 .
chr1 3333731 3333791 outputName 12.62 .
Convert it to bedgraph
outputName.bedgraph
chr1 3053011 3053071 55.54
chr1 3333731 3333791 12.62
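The conversion shown above keeps columns 1, 2, 3 and 5 of the SISSRs BED output. A minimal sketch of that step, assuming whitespace-separated fields laid out as in the example lines (the function name is hypothetical):

```python
def sissrs_bed_to_bedgraph(bed_lines):
    """Keep chrom, start, end and score (5th column) from 6-column BED lines."""
    bedgraph = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        chrom, start, end, _name, score = fields[:5]
        bedgraph.append(f"{chrom}\t{start}\t{end}\t{score}")
    return bedgraph

# The two example lines from outputName.bed above.
bed = [
    "chr1 3053011 3053071 outputName 55.54 .",
    "chr1 3333731 3333791 outputName 12.62 .",
]
print("\n".join(sissrs_bed_to_bedgraph(bed)))
```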
Visualization in IGV
[Figure: IGV view with MACS and SISSRs peak tracks alongside the sample wig file and the control wig file]
BaRC’s bake-off peak calls from MACS and SISSRs. Data from Marson et al., Cell. 2008
http://www.broadinstitute.org/software/igv/
References
• Reviews and benchmark papers:
– ChIP-seq: advantages and challenges of a maturing technology (Oct 09) (http://www.nature.com/nrg/journal/v10/n10/full/nrg2641.html)
– Computation for ChIP-seq and RNA-seq studies (Nov 09) (http://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.1371.html)
– Evaluation of Algorithm Performance in ChIP-Seq Peak Detection (July 2010) (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011471)
– Rapid innovation in ChIP-seq peak-calling algorithms is outdistancing benchmarking efforts (Oct 2010) (http://bib.oxfordjournals.org/content/early/2010/11/08/bib.bbq068.full)
– A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs (NAR, Nov 2010) (PMID: 21113027)
• MACS: Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137 http://liulab.dfci.harvard.edu/MACS/index.html
• SISSRs (Site Identification from Short Sequence Reads): Jothi et al. Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR (2008) 36 (16): 5221-5231. http://wiki.bioinformatics.ucdavis.edu/index.php/Bioinformatics_Course_Sissrs
• GPS (Genome Positioning System): Guo et al. Discovering homotypic binding events at high spatial resolution. Bioinformatics (2010) 26 (24). Dave Gifford’s group.