
Analysis of ChIP-seq experiments

Jan. 13, 2011

Hot Topics: Analysis of ChIP-seq experiments

Hot Topics on the analysis of high-throughput sequencing experiments

• Mapping next generation sequence reads (December 2010)

http://iona.wi.mit.edu/bio/education/hot_topics/shortRead_mapping/Mapping_HTseq.pdf

• ChIP-seq (January 2011)

• RNA-seq (February 2011)

• High throughput sequencing pipeline in Galaxy (March 2011)




Outline

• ChIP-seq overview

• Critical steps in data analysis

• Tools available for analysis and software performance

– BaRC bake-off

– Published evaluations

• Software available on Tak

• Suggested pipelines


ChIP-Seq overview I


Park, P. J., ChIP-seq: advantages and challenges of a maturing technology, Nat Rev Genet. Oct;10(10):669-80 (2009)

ChIP-Seq overview II


Park, P. J., ChIP-seq: advantages and challenges of a maturing technology, Nat Rev Genet. Oct;10(10):669-80 (2009)

Critical steps in data analysis

1. Effective mapping

2. Read extension and signal profile generation

3. Peak assignment

Most original software looked for fold enrichment of the sample over input or an expected background, and used a Poisson distribution to assess the significance of the fold enrichment.

Newer versions:

a. Use strand-dependent bimodality

b. Use the background distribution from input DNA, or model the background data, to adjust for local variation


Pepke, S. et al. Computation for ChIP-seq and RNA-seq studies, Nat Methods. Nov (2009)
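The fold-enrichment and Poisson test described in step 3 can be sketched in a few lines. This is an illustration only, not the method of any particular peak caller: the function names, the per-window counts, and the library-size scaling are all assumptions made for the example.

```python
import math

def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the tail directly.

    Intended for the modest per-window counts typical of this example;
    lam must be > 0.
    """
    if k <= 0:
        return 1.0
    # Probability of exactly k events, computed in log space for stability.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Terms shrink once i exceeds lam; stop when they no longer change the sum.
    while term > 0.0 and (i <= lam or term > total * 1e-17):
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)

def window_enrichment(sample_reads, control_reads, sample_total, control_total):
    """Fold enrichment and Poisson p-value for one genomic window.

    The control count is scaled by the ratio of library sizes (an assumed
    normalization) to give the expected number of sample reads.
    control_reads must be > 0.
    """
    expected = control_reads * sample_total / control_total
    fold = sample_reads / expected
    return fold, poisson_upper_tail(sample_reads, expected)

# Hypothetical window: 30 sample reads, 5 control reads, control library twice as deep.
fold, p_value = window_enrichment(30, 5, 1_000_000, 2_000_000)
print(f"fold enrichment = {fold:.1f}, p = {p_value:.2e}")
```

A window with no control reads needs separate handling (for example a genome-wide background rate), which is one reason newer programs model the local background.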

Critical steps in data analysis: mapping your reads

• See Hot Topic: Mapping Next Generation Sequence Reads (December 2010)

• If a read maps to several places on the genome, keep one position at random to avoid over-counting the reads

• Example of mapping command

bsub "bowtie -k 1 -n 2 -l 36 --best --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 inputSeq.txt output_mm9.k1.n2.l36.best.map"

-k 1: report 1 alignment per read
-n 2: max number of mismatches in the seed
-l 36: seed length
--best: hits guaranteed best stratum; ties broken by quality
--solexa1.3-quals: input quals are from GA Pipeline ver. >= 1.3
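The idea of keeping one position per multi-mapped read can be illustrated as follows. This is a minimal sketch, assuming alignments have already been parsed into (read_id, chrom, pos, strand) tuples; the function name and tuple layout are hypothetical, and the sketch does not reproduce how bowtie itself selects the reported alignment.

```python
import random

def keep_one_alignment(alignments, seed=0):
    """Keep a single randomly chosen alignment for every read.

    alignments: iterable of (read_id, chrom, pos, strand) tuples (assumed layout).
    Returns a dict mapping read_id to one (chrom, pos, strand) tuple.
    """
    rng = random.Random(seed)  # fixed seed so the choice is reproducible
    hits_by_read = {}
    for read_id, chrom, pos, strand in alignments:
        hits_by_read.setdefault(read_id, []).append((chrom, pos, strand))
    return {read_id: rng.choice(hits) for read_id, hits in hits_by_read.items()}

# Hypothetical input: read r1 maps to two places, read r2 to one.
alignments = [
    ("r1", "chr1", 1000, "+"),
    ("r1", "chr7", 52000, "-"),
    ("r2", "chr2", 3000, "+"),
]
chosen = keep_one_alignment(alignments)
print(len(chosen))  # one entry per read, so each read is counted once
```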





Peak calling: regions that may occur in ChIP-seq data for TFs


Rye, M. B. et al. A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs. Nucleic Acids Res. Nov (2010)


Peak calling: the value of having an input control

[Figure: sample wig file and control wig file tracks]


Peak calling: using strand-dependent bimodality

Directional methods (e.g., SISSRs) look for the point where reads shift from mapping to the sense strand to mapping to the antisense strand.

These methods can be very precise for sharp binding, but they are less useful for identifying broad enrichment signals, where the shift point is no longer present.

[Figure panels: sharp binding, broad binding]

Wilbanks, E. G. et al. Evaluation of Algorithm Performance in ChIP-Seq Peak Detection. PLoS ONE, July (2010)
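The sense-to-antisense shift can be illustrated with a toy scan over windowed net tag counts. This is a simplified sketch of the general idea, not the SISSRs implementation: the function name, the 20 bp window, and the input lists of read start positions on one chromosome are assumptions.

```python
from collections import Counter

def strand_shift_points(sense_starts, antisense_starts, window=20):
    """Estimate positions where reads switch from sense to antisense.

    Reads are binned into fixed windows, and each window gets a net count
    (sense minus antisense). A shift is reported midway between the last
    window with a positive net count and the next window with a negative one.
    """
    net = Counter()
    for pos in sense_starts:
        net[pos // window] += 1
    for pos in antisense_starts:
        net[pos // window] -= 1

    shifts = []
    prev_sign, prev_window = 0, None
    for w in sorted(net):
        if net[w] == 0:
            continue  # balanced window carries no direction information
        sign = 1 if net[w] > 0 else -1
        if prev_sign > 0 and sign < 0:
            end_of_positive = (prev_window + 1) * window
            start_of_negative = w * window
            shifts.append((end_of_positive + start_of_negative) // 2)
        prev_sign, prev_window = sign, w
    return shifts

# Hypothetical sharp binding site: sense reads upstream, antisense reads downstream.
print(strand_shift_points([100, 105, 110, 118], [160, 165, 170]))
```

For broad enrichment, sense and antisense reads are interleaved across the region, so no single clean transition exists, which matches the limitation noted above. A real tool would also bound the distance between the two clusters and test each candidate for significance.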


Pepke, S. Wold, B. Mortazavi, A. Computation for ChIP-seq and RNA-seq studies. Nat Methods. Nov (2009).

BaRC ChIP-seq bake-off: software packages we have tried

• FindPeaks: doesn’t take control data

• PeakSeq: input has to be Eland output

• ERANGE: running time is too long

• CisGenome: easy to use, GUI available

• SISSRs: recommended for TFs

• MACS: recommended for TFs and histone modifications

Data used: Marson et al., Cell. 2008 Aug 8;134(3):521-33, (Young lab)


BaRC ChIP-seq bake-off: comparison between programs

                 Cisgenome v1.1   SISSRs v1.4   MACS 1.3.7.1 top ~17K   Marson et al.
                 (9403)           (10933)       (out of ~30K)           (16688)
Cisgenome        NA               -             -                       -
SISSRs           83.78            NA            -                       -
MACS top 17K     97.31            96.91         NA                      -
Marson et al.    96.78            96.22         84.01                   NA


Pair-wise comparisons of the peaks for Nanog. Numbers represent the percentage of total peaks from one method (column) that are shared with another method (row).

Data used: Marson et al., Cell. 2008 Aug 8;134(3):521-33, (Young lab)
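A pair-wise percentage of the kind shown in the table can be computed roughly as follows. This is a sketch under stated assumptions: peaks are (chrom, start, end) tuples with half-open coordinates, "shared" means overlapping by at least one base, and the function name is hypothetical. The bake-off may have used a different overlap criterion.

```python
def percent_shared(peaks_a, peaks_b):
    """Percentage of peaks in peaks_a that overlap at least one peak in peaks_b.

    Peaks are (chrom, start, end) tuples with half-open intervals (assumed).
    The measure is asymmetric: percent_shared(a, b) != percent_shared(b, a)
    in general, which is why the table is read column against row.
    """
    if not peaks_a:
        return 0.0
    intervals_b = {}
    for chrom, start, end in peaks_b:
        intervals_b.setdefault(chrom, []).append((start, end))

    shared = 0
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < b_end and b_start < end
               for b_start, b_end in intervals_b.get(chrom, [])):
            shared += 1
    return 100.0 * shared / len(peaks_a)

# Hypothetical peak lists from two programs.
program_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200), ("chr2", 900, 1000)]
program_b = [("chr1", 150, 250), ("chr2", 100, 120), ("chr1", 600, 700)]
print(percent_shared(program_a, program_b), percent_shared(program_b, program_a))
```

The linear scan per peak is fine for tens of thousands of peaks; for larger sets, sorted intervals or a tool such as BEDTools would be the usual choice.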


Other evaluations of ChIP-seq peak-calling programs

“Evaluation of Algorithm Performance in ChIP-Seq Peak Detection” (PLoS ONE, July 2010)

Benchmarking ChIP-seq peak calling algorithms

• Agreement between different programs

• Co-occurrence of binding motifs

• Experimental verification (only a small amount of data so far)


Agreement between different programs

Programs that call a larger number of peaks tend to include the peaks called by the programs that call fewer peaks

“Evaluation of Algorithm Performance in ChIP-Seq Peak Detection” (PLoS ONE, July 2010)

Pair-wise comparison of shared peaks for human neuron-restrictive silencer factor (NRSF) and growth-associated binding protein (GABP). Numbers represent the percentage of total peaks from one method (column) that are shared with another method (row). Programs are ordered by increasing number of peaks called.

Evaluation based on number of peaks containing the expected motif

PeakSeq and Hpeak are outliers

“Evaluation of Algorithm Performance in ChIP-Seq Peak Detection” (PLoS ONE, July 2010)
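A motif-based evaluation of this kind comes down to counting the peaks whose sequence contains the expected motif on either strand. The sketch below assumes the peak sequences have already been extracted and that the motif can be written as a regular expression; the function names and the example motif are made up for illustration and do not represent the NRSF or GABP motif.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def fraction_with_motif(peak_seqs, motif_regex):
    """Fraction of peak sequences that contain the motif on either strand."""
    if not peak_seqs:
        return 0.0
    pattern = re.compile(motif_regex, re.IGNORECASE)
    hits = sum(
        1 for seq in peak_seqs
        if pattern.search(seq) or pattern.search(reverse_complement(seq))
    )
    return hits / len(peak_seqs)

# Made-up sequences and motif: the first matches directly, the second on the
# reverse strand, and the third not at all.
sequences = ["AAACAGCACCTTT", "GGGGTGCTGAAA", "TTTTTTTT"]
print(fraction_with_motif(sequences, "CAGCACC"))
```

Published benchmarks typically score motifs with position weight matrices rather than exact patterns, so this only conveys the shape of the calculation.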

Recommendations

• Include a DNA input control

• Look at your raw data as well as the peak calls in a genome browser

• If your signal is not very strong, try using several peak-calling programs.

• We have had good results using MACS. SISSRs is a good second choice if you are expecting sharp peaks.


Software available on Tak

• MACS

macs

• SISSRs

sissrs



A pipeline for ChIP-seq analysis with MACS

• Mapping reads with bowtie

bsub "bowtie -k 1 -n 2 -l 36 --best --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 inputSeq.txt output_mm9.k1.n2.l36.best.map"

• Calling peaks with MACS

bsub "macs -t sample_mm9.k1.n2.l36.best.map -c inputControl_mm9.k1.n2.l36.best.map --name=test1 --format=BOWTIE --tsize=36 --wig --space=25 --mfold=10,30"

PARAMETERS

• -t TFILE: Treatment file

• -c CFILE: Control file

• --name=NAME: Experiment name, which will be used to generate output file names. DEFAULT: "NA"

• --format=FORMAT: Format of tag file, "BED" or "ELAND" or "ELANDMULTI" or "ELANDMULTIPET" or "SAM" or "BAM" or "BOWTIE". DEFAULT: "BED"

• --tsize=TSIZE: Tag size. DEFAULT: 25

• --wig: Whether or not to save shifted raw tag count at every bp into a wiggle file

• --mfold=MFOLD: Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT: 10,30


MACS output files

1. Folder with wig files for control and sample.

2. Excel file containing the following columns: chr, start, end, length, summit, tags, -10*LOG10(pvalue), fold_enrichment, FDR(%)

To visualize the peaks, make a bedgraph file with columns: chr, start, end, fold_enrichment
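The bedgraph conversion can be sketched as below. The sketch assumes the MACS Excel file is tab-delimited text with "#" comment lines followed by a header row naming the columns listed above, and that its start coordinates are 1-based (so 1 is subtracted for the 0-based bedgraph start). Both points should be checked against the MACS version in use; the function name is hypothetical.

```python
import csv

def macs_xls_to_bedgraph(lines):
    """Convert MACS peak table lines into bedgraph lines.

    lines: iterable of text lines from the MACS .xls file (assumed to be
    tab-delimited, with '#' comments and a header row).
    Returns a list of 'chr<TAB>start<TAB>end<TAB>fold_enrichment' strings.
    """
    data_lines = (l for l in lines if l.strip() and not l.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    col = {name: i for i, name in enumerate(header)}

    bedgraph = []
    for row in rows:
        start0 = int(row[col["start"]]) - 1  # assumed 1-based -> 0-based
        bedgraph.append("\t".join([
            row[col["chr"]],
            str(start0),
            row[col["end"]],
            row[col["fold_enrichment"]],
        ]))
    return bedgraph

# Usage with an assumed file name derived from --name=test1:
# with open("test1_peaks.xls") as handle:
#     for line in macs_xls_to_bedgraph(handle):
#         print(line)
```

Looking columns up by header name rather than by position keeps the sketch working if a MACS version adds or reorders columns.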


A pipeline for ChIP-seq analysis with SISSRs

• Mapping reads with bowtie, get SAM output

bsub "bowtie -k 1 -n 2 -l 36 --sam --best --solexa1.3-quals /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 inputSeq.txt bowtieoutput_mm9.k1.n2.l36.best.sam"

• Convert formats

Convert SAM to BAM: samtools view -S -b -o OUTFileName.bam INFile.sam

bsub "samtools view -S -b -o bowtieoutput_mm9.k1.n2.l36.best.bam bowtieoutput_mm9.k1.n2.l36.best.sam"

Convert to BED

bsub "bamToBed -i bowtieoutput_mm9.k1.n2.l36.best.bam > bowtieoutput_mm9.k1.n2.l36.best.bed"

• Run SISSRs

sissrs -i bowtieoutput_mm9.k1.n2.l36.best.bed -o outputName -s 2716965481 -b BG_mm9.k1.n2.l36.best.bed -L 200

-s: genome size
-L: upper bound on the DNA fragment length


SISSRs output

outputName.bed

chr1 3053011 3053071 outputName 55.54 .

chr1 3333731 3333791 outputName 12.62 .

Convert it to bedgraph

outputName.bedgraph

chr1 3053011 3053071 55.54

chr1 3333731 3333791 12.62
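The conversion shown above keeps columns 1, 2, 3 and 5 of the BED file. A minimal sketch, assuming the six-column layout of the example lines (the function name is hypothetical):

```python
def bed_to_bedgraph(bed_lines):
    """Keep chrom, start, end and score (BED columns 1, 2, 3 and 5)."""
    bedgraph = []
    for line in bed_lines:
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if len(fields) < 5 or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        chrom, start, end, _name, score = fields[:5]
        bedgraph.append(f"{chrom}\t{start}\t{end}\t{score}")
    return bedgraph

# The two peaks from outputName.bed above.
bed = [
    "chr1 3053011 3053071 outputName 55.54 .",
    "chr1 3333731 3333791 outputName 12.62 .",
]
for line in bed_to_bedgraph(bed):
    print(line)
```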


Visualization in IGV

[Figure: IGV tracks showing MACS peaks, SISSRs peaks, the sample wig file and the control wig file]

BaRC’s bake-off peak calls from MACS and SISSRs. Data from Marson et al., Cell. 2008

http://www.broadinstitute.org/software/igv/

References

• Reviews and benchmark papers:

– ChIP-seq: advantages and challenges of a maturing technology (Oct 2009) (http://www.nature.com/nrg/journal/v10/n10/full/nrg2641.html)

– Computation for ChIP-seq and RNA-seq studies (Nov 2009) (http://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.1371.html)

– Evaluation of Algorithm Performance in ChIP-Seq Peak Detection (July 2010) (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011471)

– Rapid innovation in ChIP-seq peak-calling algorithms is outdistancing benchmarking efforts (Oct 2010) (http://bib.oxfordjournals.org/content/early/2010/11/08/bib.bbq068.full)

– A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs (NAR, Nov 2010) (PMID: 21113027)

• MACS: Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137. http://liulab.dfci.harvard.edu/MACS/index.html

• SISSRs (Site Identification from Short Sequence Reads): Jothi et al. Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR (2008) 36 (16): 5221-5231. http://wiki.bioinformatics.ucdavis.edu/index.php/Bioinformatics_Course_Sissrs

• GPS (Genome Positioning System): Guo et al. Discovering homotypic binding events at high spatial resolution. Bioinformatics (2010) 26 (24). Dave Gifford’s group.

