More on TF Motif Finding ChIP-chip / seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

Upload: inoke

Post on 13-Jan-2016


Page 1: More on TF Motif Finding ChIP-chip / seq

More on TF Motif Finding ChIP-chip / seq

Xiaole Shirley Liu

STAT115, STAT215, BIO298, BIST520

Page 2: More on TF Motif Finding ChIP-chip / seq

De novo Sequence Motif Finding

• Goal: look for common sequence patterns enriched in the input data (compared to the genome background)

• Regular expression enumeration

– Pattern driven approach

– Enumerate patterns, check significance in dataset

– Oligonucleotide analysis, MobyDick

• Position weight matrix update

– Data driven approach, use data to refine motifs

– Consensus, EM & Gibbs sampling

– Motif score and Markov background
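
The pattern-driven approach can be sketched as follows: enumerate every w-mer in the input, count it in the input and in a background set, and rank by the frequency ratio. This is only an illustration; the function names, the pseudocount, and the toy sequences are assumptions, and it is not the algorithm of MobyDick or any other specific tool.

```python
from collections import Counter


def count_kmers(seqs, w):
    """Count every overlapping w-mer across a list of DNA strings."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - w + 1):
            counts[seq[i:i + w]] += 1
    return counts


def enrichment(input_seqs, background_seqs, w, pseudocount=1.0):
    """Return (w-mer, ratio) pairs sorted by input/background frequency ratio."""
    fg = count_kmers(input_seqs, w)
    bg = count_kmers(background_seqs, w)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_possible = 4 ** w  # number of distinct w-mers, used to smooth the counts
    ranked = []
    for kmer, n in fg.items():
        fg_freq = (n + pseudocount) / (fg_total + pseudocount * n_possible)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount * n_possible)
        ranked.append((kmer, fg_freq / bg_freq))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Toy data (hypothetical): TGTAAC is planted in every input sequence.
input_seqs = ["AATGTAACGG", "CCTGTAACTT", "GTGTAACAGA"]
background_seqs = ["ACGTACGTAC", "GGCCTTAAGG", "TTAGGCATCG"]
top_kmer, top_ratio = enrichment(input_seqs, background_seqs, w=6)[0]
print(top_kmer, round(top_ratio, 2))
```

A real analysis would replace the simple ratio with a significance test against a genome or Markov background model.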

Page 3: More on TF Motif Finding ChIP-chip / seq

Position Weight Matrix Update

• Advantages

– Can look for motifs of any width

– Flexible with base substitutions

• Disadvantages

– EM and Gibbs sampling: no guaranteed convergence time

– No guaranteed global optimum

Page 4: More on TF Motif Finding ChIP-chip / seq

Motif Finding in Bacteria

• Promoter sequences are short (200-300 bp)

• Motifs are usually long (10-20 bases)

– Some have two blocks with a gap, some are palindromes

– Long motifs are usually very degenerate

• A single microarray experiment sometimes already provides enough information to search for TF motifs

Page 5: More on TF Motif Finding ChIP-chip / seq

Motif Finding in Lower Eukaryotes

• Upstream sequences are longer (500-1000 bp), with some simple repeats

• Motif width varies (5-17 bases)

• Expression clusters provide input sequences of decent quality for TF motif finding

• Motif combinations and redundancy appear, although single motifs are usually significant enough for identification

Page 6: More on TF Motif Finding ChIP-chip / seq

Yeast Promoter Architecture

• Co-occurring regulators suggest physical interaction between the regulators

Page 7: More on TF Motif Finding ChIP-chip / seq

Motif Finding in Higher Eukaryotes

• Upstream sequences are very long (3-20 kb) with repeats; TF motifs could also appear downstream

• Motifs can be short or long (6-20 bases), and appear in combinations and clusters

• Gene expression clusters are not good enough input

• Need:

– Comparative genomics: PhastCons score

– Motif modules: motif clusters

– ChIP-chip/seq

Page 8: More on TF Motif Finding ChIP-chip / seq

Yeast Regulatory Sequence Conservation

Page 9: More on TF Motif Finding ChIP-chip / seq

UCSC PhastCons Conservation

• Functional regulatory sequences are under stronger evolutionary constraint

• Align orthologous sequences together

• PhastCons conservation score (0-1) for each nucleotide in the genome can be downloaded from UCSC

Page 10: More on TF Motif Finding ChIP-chip / seq

Conserved Motif Clusters

• First find conserved regions in the genome

• Then look for repeated transcription factor (TF) binding sites

• They form transcription factor modules

Page 11: More on TF Motif Finding ChIP-chip / seq

Outline

• ChIP-chip on yeast

– Technology and data analysis: MDscan motif finding, regulatory network

• ChIP-X on human

– Tiling microarrays and peak finding

– High throughput sequencing and peak finding

– Data analysis and examples

• Analysis: peak finding, gene expression analysis, sequence motif finding, regulatory network

– Holistic picture of gene regulation

Page 12: More on TF Motif Finding ChIP-chip / seq

Motivation

• Motif finding works well in bacteria, OK in yeast, marginal in worm/fly, and almost never in mammals

• Cistrome: Genome-wide in vivo binding sites of DNA-binding proteins

• ChIP-chip and ChIP-seq give cistrome results

Page 13: More on TF Motif Finding ChIP-chip / seq

ChIP-chip Technology

• Chromatin ImmunoPrecipitation + microarray

– ChIP-on-chip or ChIP-chip

– Also known as Genome Scale Location Analysis

• Detect genome-wide in vivo location of TF and other DNA-binding proteins

– Find all the DNA sequences bound by TF-X?

– Cook all the dishes with cinnamon

• Can learn the regulatory mechanism of a transcription factor or DNA-binding protein much better and faster

Page 14: More on TF Motif Finding ChIP-chip / seq

Chromatin ImmunoPrecipitation (ChIP)

Page 15: More on TF Motif Finding ChIP-chip / seq

TF/DNA Crosslinking in vivo

Page 16: More on TF Motif Finding ChIP-chip / seq

Sonication (~500bp)

Page 17: More on TF Motif Finding ChIP-chip / seq

TF-specific Antibody

Page 18: More on TF Motif Finding ChIP-chip / seq

Immunoprecipitation

Page 19: More on TF Motif Finding ChIP-chip / seq

Reverse Crosslink and DNA Purification

Page 20: More on TF Motif Finding ChIP-chip / seq

Promoter Array Hybridization

[Figure labels: Genes, Intergenic, ChIP]

Page 21: More on TF Motif Finding ChIP-chip / seq

ChIP-DNA chip Detection

• Started in yeast, use promoter cDNA microarray

– ~6000 spots, each 800-1000 bp

• Two color assay

– Control: no antibody, or chromatin (a little bit of everything)

– Need triplicates to cancel noise

• Applied to all yeast TFs

– TF modified to contain a tag

– Tag can be precipitated with immunoglobulin

Page 22: More on TF Motif Finding ChIP-chip / seq

ChIP-chip Motif Finding

• ChIP-chip gives 10-5000 binding regions ~600-1000 bp long. Precise binding motif?

– Raw data is like perfect clustering, plus enrichment values

• MDscan

– High ChIP ranking => true targets, contain more sites

– Search TF motif from highest ranking targets first (high signal / background ratio)

– Refine candidate motifs with all targets

– Used successfully in ChIP-chip motif finding

Page 23: More on TF Motif Finding ChIP-chip / seq

Similarity Defined by m-match

For a given w-mer and any other random w-mer

Given 8-mer: TGTAACGT

m-matches for TGTAACGT:

TGTAACGT matched 8

AGTAACGT matched 7

TGCAACAT matched 6

TGACACGG matched 5

AATAACAG matched 4

Pick a reasonable m to call two w-mers similar
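
The match counts in this example can be reproduced with a few lines of Python. This is a sketch of the m-match idea only: the function names are made up for illustration, and treating "m-match" as "at least m matching positions" is an assumption, not MDscan's actual code.

```python
def match_count(a, b):
    """Number of positions at which two equal-length w-mers agree."""
    if len(a) != len(b):
        raise ValueError("w-mers must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


def m_matches(seed, candidates, m):
    """Return the candidates that agree with the seed in at least m positions."""
    return [c for c in candidates if match_count(seed, c) >= m]


seed = "TGTAACGT"
others = ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]
counts = [match_count(seed, o) for o in others]
similar = m_matches(seed, others, m=6)
print(counts)
print(similar)
```

With m = 6, only the first three w-mers in the list are called similar to the seed.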

Page 24: More on TF Motif Finding ChIP-chip / seq

MDscan Seeds

[Figure: ChIP-chip selected upstream sequences, ordered by enrichment (higher enrichment first)]

A 9-mer: ATTGCAAAT

Seed motif pattern: ATTGCAAAT, TTTGCGAAT, TTTGCAAAT

Other 9-mer groups shown in the figure:

TTGCAAATC, TTGCGAATA, TTGCAAATT, TTGCCCATC

CAAATCCAA, CAAATCCAA, GAAATCCAC

GCAAATCCA, GCAAATTCG, GCAAATCCA, GGAAATCCA, GGAAATCCT

TGCAAATCC, TGCAAATTC

GCCACCGTACCACCGTACCACGGTGCCACGGC…

Page 25: More on TF Motif Finding ChIP-chip / seq

Update Motifs With Remaining Seqs

[Figure labels: Seed1 m-matches, Extreme High Rank, All ChIP-selected targets]

Page 26: More on TF Motif Finding ChIP-chip / seq

Refine the Motifs

[Figure labels: Seed1 m-matches, Extreme High Rank, All ChIP-selected targets]

Page 27: More on TF Motif Finding ChIP-chip / seq

Yeast TF Regulatory Network

[Figure labels: Protein, Gene, Regulate, Transcribe]

Page 28: More on TF Motif Finding ChIP-chip / seq

ChIP-chip Better Explains Expression

[Figure labels: Ndt80 regulated genes, Sum1 regulated genes, Ndt80 & Sum1 regulated genes]

Page 29: More on TF Motif Finding ChIP-chip / seq

Genome Tiling Microarrays

• Promoter array doesn't work for human ChIP-chip

• Binding could appear in much more distant intergenic sequences, introns, exons, or downstream sequences

[Figure labels: Genomic DNA on the chromosome, Tiling probes]

Page 30: More on TF Motif Finding ChIP-chip / seq

DNA Purification

Page 31: More on TF Motif Finding ChIP-chip / seq

ChIP-chip on Tiling Microarray

[Figure labels: ChIP-DNA, Noise, ChIP, Ctrl, Chromosome]

Page 32: More on TF Motif Finding ChIP-chip / seq

ChIP-chip

• Detect genome-wide location of transcription and epigenetic factors

• Affymetrix genome tiling arrays are cheaper

• $2000 7 arrays * 6 million probes * (3 ChIP + 3 Ctrl)

• But data is noisier and less informative

Two peaks? How about ChIP alone? Over 42M probes?

[Figure: ChIP and Ctrl tracks; x-axis: chromosome coordinates; y-axis: log probe intensity]

Page 34: More on TF Motif Finding ChIP-chip / seq

ChIP-chip Analysis: Mann-Whitney U-test

• Affy TAS, Cawley et al (Cell 2004):

– Assign 1 to all probe pairs with MM > PM

– Each probe: rank probes within [-500bp, +500bp] window

– Check whether sum of ChIP ranks is much smaller

– Consider all probes equally

– Half of the probes have MM > PM
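
A rough sketch of the window test described above, in plain Python. It is not the Affymetrix TAS implementation: the function names and toy values are assumptions, the z score uses the normal approximation without tie correction, and rank 1 is given to the largest value so that enriched ChIP probes produce a small rank sum, as on the slide.

```python
import math


def floor_pm_mm(pm, mm):
    """PM - MM signal, with probe pairs where MM > PM assigned 1."""
    return 1 if mm > pm else pm - mm


def average_ranks_desc(values):
    """Rank values from 1 (largest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_z(chip, ctrl):
    """z score of the ChIP rank sum; strongly negative means ChIP is enriched."""
    n1, n2 = len(chip), len(ctrl)
    ranks = average_ranks_desc(list(chip) + list(ctrl))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # no tie correction
    return (u1 - mean_u) / sd_u


# Hypothetical probe signals inside one [-500bp, +500bp] window.
chip_window = [2.1, 1.8, 2.5, 1.9]
ctrl_window = [0.2, -0.1, 0.4, 0.0]
z = rank_sum_z(chip_window, ctrl_window)
print(round(z, 3))
```

Because the test uses only ranks, every probe in the window counts equally, which is the limitation the last two bullets point at.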

[Figure: histogram of (PM – MM)]

Page 35: More on TF Motif Finding ChIP-chip / seq

Affymetrix Tiling Array Peak Finding

• Challenges:

– Massive data, probe values noisy

– Only 1/3 of researchers get it to work the first time

– Previous algorithms only work by comparing 3 ChIP with 3 Ctrl

• Model-based Analysis of Tiling arrays (MAT)

– Work with single ChIP (no rep, no ctrl)

– Find individual failed samples

– More sensitive, specific, and quantitative with 3 ChIP & 3 Ctrl

MAT: Johnson et al, PNAS 2006

Page 36: More on TF Motif Finding ChIP-chip / seq

MAT

• Most of the probes in ChIP-chip measure non-specific hybridization and background noise

• Estimate probe behavior by checking other probes with similar sequence on the same array

• Probe sequence plays a big role in signal value

Page 37: More on TF Motif Finding ChIP-chip / seq

Model Sequence-Specific Probe Effect

• First detailed model of probe sequence on probe signal

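• Example probe: AATGC ACTGT GCACA GATCG GCCAT (7 A, 7 C, 6 G, 5 T), maps to 2 places in genome

• Use all the probes on the array to estimate the parameters

$$\text{Probe signal} = 5\alpha + \beta_{1A} + \beta_{2A} + \beta_{4G} + \beta_{5C} + \dots + 49\gamma_A + 49\gamma_C + 36\gamma_G + 25\gamma_T + \log(2)\,\delta + \varepsilon$$

– $5\alpha$: # of T's, intercept

– $\beta$ terms: position-specific A, C, G effect

– $\gamma$ terms: A, C, G, T count squared

– $\log(2)\,\delta$: 25-mer copy number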
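
To make the bookkeeping in the worked example concrete, the sketch below extracts from a probe the quantities that multiply each parameter. It only builds the model terms and does not fit anything; the function name, the dictionary layout, and the use of the natural logarithm for the copy-number term are illustrative assumptions.

```python
import math
from collections import Counter


def probe_model_terms(seq, copy_number):
    """Break a probe into the quantities that multiply each model parameter."""
    seq = seq.replace(" ", "").upper()
    counts = Counter(seq)
    return {
        "n_T": counts["T"],  # multiplies alpha
        # (position, base) pairs that switch on a beta term; T is the baseline
        "position_terms": [(j, b) for j, b in enumerate(seq, start=1) if b != "T"],
        # squared nucleotide counts, each multiplying a gamma term
        "count_squared": {b: counts[b] ** 2 for b in "ACGT"},
        # log of the 25-mer copy number, multiplies delta (log base assumed)
        "log_copy_number": math.log(copy_number),
    }


terms = probe_model_terms("AATGC ACTGT GCACA GATCG GCCAT", copy_number=2)
print(terms["n_T"])
print(terms["position_terms"][:4])
print(terms["count_squared"])
```

Position 3 of the example probe is a T, which is why no position-3 term appears between the position-2 and position-4 terms.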

Page 38: More on TF Motif Finding ChIP-chip / seq

Probe Standardization

• Fit the probe model array by array

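[Figure: 6M probes grouped into 2K bins]

$$t_i = \frac{\log(PM_i) - \hat{m}_i}{s_{i,\,\text{affinity bin}}}$$

– $\hat{m}_i$: model predicted probe intensity

– $\log(PM_i)$: observed probe intensity

– $s_{i,\,\text{affinity bin}}$: observed probe variance within each bin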
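
A minimal sketch of this standardization step, assuming the model predictions are already available. Probes are sorted by predicted intensity and cut into equal-size bins, and each probe's residual is divided by the spread of observed values in its bin. The function name, the equal-size binning, and the use of the population standard deviation for s are assumptions, not details taken from the MAT software.

```python
import statistics


def standardize_probes(log_pm, predicted, n_bins):
    """Return t_i = (log_pm_i - predicted_i) / s_bin(i) for every probe."""
    if len(log_pm) != len(predicted):
        raise ValueError("inputs must have the same length")
    if not log_pm:
        return []
    # Group probes with similar predicted intensity into the same bin.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    bin_size = -(-len(order) // n_bins)  # ceiling division
    t = [0.0] * len(log_pm)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        s = statistics.pstdev([log_pm[i] for i in members])
        for i in members:
            t[i] = (log_pm[i] - predicted[i]) / s if s > 0 else 0.0
    return t


# Hypothetical values: six probes split into two affinity bins.
predicted = [1.0, 1.1, 1.2, 3.0, 3.1, 3.2]
observed = [0.9, 1.1, 1.6, 2.8, 3.1, 3.9]
t_values = standardize_probes(observed, predicted, n_bins=2)
print([round(t, 2) for t in t_values])
```

After this step, probes with very different sequence-driven baselines are on a comparable scale, so neighboring probes can be combined in a window.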

Page 39: More on TF Motif Finding ChIP-chip / seq

[Figure: probe values at two spike-in regions with concentration 2X]

– Raw probe values: ChIP, Ctrl

– Sequence-based probe behavior standardization: ChIP standardized, Ctrl standardized

– Window-based neighboring probe combination for ChIP-region detection: ChIP window, (ChIP – Ctrl), (3 ChIP – 3 Ctrl)

Page 40: More on TF Motif Finding ChIP-chip / seq

MA2C: Model-based Analysis of 2-Color Arrays

• Normalize probes by GC bins within each array

– How much variance is observed in the GC bin

– Give high confidence probes more weight

• Running window average or median for peak finding

MA2C: Song et al, Genome Biol 2007
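
The running window median can be sketched as below. This is a simplified illustration, not MA2C itself: the function name, window half-width, and toy scores are assumptions, and the confidence weighting of probes is left out.

```python
import bisect
import statistics


def window_medians(positions, scores, half_width):
    """Median score of the probes within half_width bp of each probe.

    positions must be sorted in increasing order and aligned with scores.
    """
    medians = []
    for pos in positions:
        lo = bisect.bisect_left(positions, pos - half_width)
        hi = bisect.bisect_right(positions, pos + half_width)
        medians.append(statistics.median(scores[lo:hi]))
    return medians


# Hypothetical normalized probe scores along one chromosome.
positions = [100, 150, 200, 250, 300, 900]
scores = [0.1, 2.0, 2.4, 2.2, 0.3, 5.0]
smoothed = window_medians(positions, scores, half_width=100)
print(smoothed)
```

The isolated probe at position 900 keeps its own high score because it has no neighbors in the window, so a practical peak finder would also require a minimum number of probes per window.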

Page 41: More on TF Motif Finding ChIP-chip / seq

Is a ChIP experiment working?

MAT: Quality Control

• MAT window scores ~ normal with long tails

• Estimate p-value of normal from left half of data

• FDR = A / B (Ctrl/ChIP peaks are all FPs)

• Spike-in shows MAT FDR estimate is accurate

• Can find individual failed replicate

[Figure labels: Background, Enriched DNA (<1% enriched), A, B]
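
The FDR estimate can be illustrated with a short sketch. It assumes that B is the number of peaks called from ChIP over Ctrl at a given cutoff and A is the number called with the samples swapped (Ctrl over ChIP), all of which are treated as false positives. The function name and the scores are hypothetical.

```python
def estimate_fdr(chip_over_ctrl_scores, ctrl_over_chip_scores, cutoff):
    """FDR = A / B at a score cutoff.

    B counts peaks called from ChIP over Ctrl; A counts peaks called with the
    samples swapped (Ctrl over ChIP), which are all taken as false positives.
    """
    b = sum(1 for s in chip_over_ctrl_scores if s >= cutoff)
    a = sum(1 for s in ctrl_over_chip_scores if s >= cutoff)
    if b == 0:
        return None  # no peaks called, so the FDR is undefined
    return min(1.0, a / b)


# Hypothetical peak scores.
positive = [12.1, 9.4, 8.0, 6.5, 5.2, 4.1]
negative = [5.0, 4.3, 3.9]
fdr_strict = estimate_fdr(positive, negative, cutoff=5.0)
fdr_loose = estimate_fdr(positive, negative, cutoff=4.0)
print(fdr_strict, fdr_loose)
```

A replicate whose estimated FDR stays high at every cutoff is a candidate failed sample.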

Page 42: More on TF Motif Finding ChIP-chip / seq

ChIP-Seq

ChIP-DNA

Noise

Sequence millions of 30-mer ends of fragments

Map 30-mers back to the genome

Page 43: More on TF Motif Finding ChIP-chip / seq

MACS: Model-based Analysis for ChIP-Seq

• Use confident peaks to model shift size

[Figure label: Binding]

Page 45: More on TF Motif Finding ChIP-chip / seq

Peak Calls

• Tag distribution along the genome ~ Poisson distribution (λBG = total tag / genome size)

• ChIP-Seq shows local biases in the genome

– Chromatin and sequencing bias

– 200-300bp control windows have too few tags

– But can look further

Dynamic λlocal = max(λBG, [λctrl, λ1k,] λ5k, λ10k)

[Figure: ChIP and Control tracks with 300bp, 1kb, 5kb, and 10kb windows]

http://liulab.dfci.harvard.edu/MACS/

Zhang et al, Genome Biol, 2008
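
A small sketch of the dynamic λ idea, assuming every rate has already been rescaled to the expected tag count in one candidate window. It is not the MACS source code: the function names, the 300bp window, and the tag counts are made-up illustrations, and differences in sequencing depth between ChIP and control are ignored.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the complement of the CDF."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def scale_to_window(tag_count, span_bp, window_bp):
    """Expected tags per candidate window given a count over a larger span."""
    return tag_count * window_bp / span_bp


def local_lambda(lambda_bg, control_window_rates):
    """Dynamic lambda: the largest of the genome-wide and local control rates."""
    return max([lambda_bg] + list(control_window_rates))


# Hypothetical numbers for one 300bp candidate window.
window_bp = 300
lambda_bg = 1.0                                 # genome-wide expectation
lam_5k = scale_to_window(50, 5000, window_bp)   # 50 control tags within 5kb
lam_10k = scale_to_window(60, 10000, window_bp) # 60 control tags within 10kb
lam = local_lambda(lambda_bg, [lam_5k, lam_10k])
chip_tags = 12
p_local = poisson_sf(chip_tags, lam)
p_bg = poisson_sf(chip_tags, lambda_bg)
print(lam, p_local, p_bg)
```

Taking the maximum makes the test conservative: in a region where the control is locally elevated, the same ChIP tag count gets a larger p-value than it would under the genome-wide rate alone.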

Page 46: More on TF Motif Finding ChIP-chip / seq

CEAS: Cis-regulatory Element Annotation System

• Data Analysis Button for Biologists

http://ceas.cbi.pku.edu.cn

Page 47: More on TF Motif Finding ChIP-chip / seq

Estrogen Receptor

• Carroll et al, Cell 2005

• Overactive in > 70% of breast cancers

• Where does it go in the genome?

• ChIP-chip on chr21/22, motif and expression analysis found its partner FoxA1

[Figure labels: TF??, ER]

Page 48: More on TF Motif Finding ChIP-chip / seq

Estrogen Receptor (ER) Cistrome in Breast Cancer

• Carroll et al, Nat Genet 2006

• ER may function far away (100-200KB) from genes

• Only 20% of ER sites have PhastCons > 0.2

• ER has different effects depending on its collaborators

[Figure labels: AP1, ER, NRIP]
