ChIP-chip Data, Model and Analysis Ying Nian Wu Dept. Of Statistics UCLA Joint with Ming Zheng, Leah Barrera, Bing Ren

Post on 26-Dec-2015

ChIP-chip Data, Model and Analysis

Ying Nian Wu
Dept. of Statistics
UCLA

Joint with Ming Zheng, Leah Barrera, Bing Ren

ChIP-chip

A technology for isolating and identifying the DNA sequences occupied by specific DNA-binding proteins (regulatory sequences) in living cells.

Chromatin immunoprecipitation (ChIP) and microarray analysis (chip) are combined to study protein-DNA interactions in vivo.

Also known as “genome-wide location analysis”.

ChIP-chip process

Step 1: Bound transcription factors are cross-linked to DNA with formaldehyde.

ChIP-chip process (cont’d)

Step 2: Sonication is used to break genomic DNA into small fragments (of various lengths that are difficult to measure, roughly 1-2 kb).

ChIP-chip process (cont’d)

Step 3: A specific antibody is added to immunoprecipitate the DNA segments cross-linked to the target protein.

ChIP-chip process (cont’d)

Step 4.1: The cross-linking between DNA and protein is reversed, and the DNA is amplified by LM-PCR and labeled with the fluorescent dye Cy5.

ChIP-chip process (cont’d)

Step 4.2: As a negative control, a sample of DNA that is not enriched by the immunoprecipitation process is also amplified by LM-PCR and labeled with another dye, Cy3.

ChIP-chip process (cont’d)

Step 5: Both the IP-enriched and IP-unenriched samples are hybridized to the same oligonucleotide array.

ChIP-chip process (cont’d)

Step 6: The microarray is scanned, the Cy5 and Cy3 signal strengths are extracted, and log(Cy5/Cy3) is calculated after normalization.
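Step 6 can be sketched in a few lines of Python. The slides do not say which logarithm base or which normalization is used, so the base-2 logarithm, the median centering, and the `eps` guard below are assumptions for illustration only.

```python
import numpy as np

def log_ratio(cy5, cy3, eps=1e-9):
    """Per-probe log2(Cy5/Cy3), centered so the median probe sits at 0.

    Median centering is an assumed normalization, not the one used in the talk.
    """
    cy5 = np.asarray(cy5, dtype=float)
    cy3 = np.asarray(cy3, dtype=float)
    raw = np.log2((cy5 + eps) / (cy3 + eps))  # eps avoids division by zero
    return raw - np.median(raw)

# Three probes with a flat control channel and a doubling enriched channel.
print(log_ratio([2.0, 4.0, 8.0], [1.0, 1.0, 1.0]))
```

Centering on the median reflects the expectation that most probes are background, so the typical log ratio should be near zero.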

Ren, B. UCSD

Summary of ChIP-chip

•Protein bound to DNA

•Sonication

•Immunoprecipitation

•Amplify DNA and add control

•Hybridize to probes

•Microarray analysis

Ren, B. UCSD

ChIP-chip data

One probe is one data point in the dataset.

The x-axis represents the genomic position of the probe.

The y-axis (the height) denotes the signal strength log(Cy5/Cy3) of each probe.

SignalMap, NimbleGen Inc.

A closer look

SignalMap, NimbleGen Inc.

Cy5 signal

The Cy5 signal strength at a point $P_0$ should be proportional to the probability that an IP-enriched segment contains that point.

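Single binding site scenario

Assume there is only one binding site $B$, at the origin. To contribute to the signal at $P_0$:

1) this binding site is bound by protein;
2) no cut should occur between 0 and $P_0$.

Signal at $P_0$ is proportional to (approx.):

$$q_B \exp\left\{-\int_0^{P_0} \lambda(s)\,ds\right\}$$

where $q_B$ is the probability that the site is bound and $\lambda(s)$ is the rate of cuts at position $s$.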

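Model derivation

Assume the cut rate $\lambda(s)$ to be constant around the binding site. Therefore, the Cy5 signal strength should decrease exponentially with distance from the binding site.

Log(Cy5/Cy3) decreases linearly from the binding site: triangular shape.

$$\log\left(\frac{\mathrm{Cy5}}{\mathrm{Cy3}}\right) = \text{intercept} + \text{slope}_1 \cdot P_0^{-} + \text{slope}_2 \cdot P_0^{+}$$

where, with the binding site at the origin, $P_0^{-} = \min(P_0, 0)$ and $P_0^{+} = \max(P_0, 0)$.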
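The step from an exponential decay to a straight line can be written out as below. This is a sketch under assumed notation that is not legible in the slides: $\lambda$ for the constant cut rate, $q_B$ for the probability that the site at the origin is bound, and $c$ for a roughly constant Cy3 control level.

```latex
\begin{aligned}
\mathrm{Cy5}(P_0) &\propto q_B \exp\{-\lambda\,|P_0|\} \\
\log \mathrm{Cy5}(P_0) &= \text{const} + \log q_B - \lambda\,|P_0| \\
\log\frac{\mathrm{Cy5}(P_0)}{\mathrm{Cy3}(P_0)} &\approx (\text{const} + \log q_B - \log c) - \lambda\,|P_0|
\end{aligned}
```

In this idealized case both sides of the peak share the slope magnitude $\lambda$. One reading of the two separate slopes in the fitted model is that they let the cut rate differ on either side of the binding site.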

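Two binding sites scenario

Consider two BSs, $B_1$ and $B_2$, and a point $P_0$ in between.

Event $A$: $B_1$ is bound and no cut occurs between $B_1$ and $P_0$.

Event $B$: $B_2$ is bound and no cut occurs between $P_0$ and $B_2$.

Signal at $P_0$ is proportional to:

$$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$$

$$= q_1 \exp\left\{-\int_{B_1}^{P_0} \lambda(s)\,ds\right\} + q_2 \exp\left\{-\int_{P_0}^{B_2} \lambda(s)\,ds\right\} - q_1 q_2 \exp\left\{-\int_{B_1}^{B_2} \lambda(s)\,ds\right\}$$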
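The two-site expression can be evaluated numerically. The sketch below assumes a constant cut rate `lam`, so each integral reduces to `lam` times a distance; the function name and the parameter values are illustrative and not taken from the slides.

```python
import math

def two_site_signal(p0, b1, b2, q1, q2, lam):
    """Pr(A) + Pr(B) - Pr(A and B) for a point p0 with b1 <= p0 <= b2.

    q1, q2: probabilities that each site is bound.
    lam: constant cut rate (assumed), so exp{-integral} = exp(-lam * distance).
    """
    if not b1 <= p0 <= b2:
        raise ValueError("p0 must lie between the two binding sites")
    pr_a = q1 * math.exp(-lam * (p0 - b1))
    pr_b = q2 * math.exp(-lam * (b2 - p0))
    pr_ab = q1 * q2 * math.exp(-lam * (b2 - b1))
    return pr_a + pr_b - pr_ab

# Signal at one binding site versus the midpoint between the two sites.
at_site = two_site_signal(0.0, 0.0, 2.0, 0.8, 0.8, 1.0)
midpoint = two_site_signal(1.0, 0.0, 2.0, 0.8, 0.8, 1.0)
print(at_site, midpoint)
```

With these values the midpoint signal is lower than the signal at either site, which is the valley between two peaks that the two-site model predicts.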

General scenario

Event $A$: no cut between $P_0$ and the nearest bound BS to the left.

Event $B$: no cut between $P_0$ and the nearest bound BS to the right.

Signal at $P_0$ is proportional to:

$$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$$

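General scenario

Let $B_1, \ldots, B_m$ be the binding sites to the left of $P_0$, indexed so that $B_m$ is the closest to $P_0$. A site is "good" if it is bound.

$$\Pr(A) = \sum_{i=1}^{m} \Pr(\text{nearest good BS is } B_i)\,\Pr(\text{no cut between } B_i \text{ and } P_0 \mid B_i \text{ is good})$$

$$= \sum_{i=1}^{m} \left[ q_i \prod_{j=i+1}^{m} (1 - q_j) \right] \exp\left\{-\int_{B_i}^{P_0} \lambda(s)\,ds\right\}$$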
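A small Python sketch of the general case follows. It assumes a constant cut rate `lam`, independent binding at each site, and independence of the left and right events (so that Pr(A and B) = Pr(A) Pr(B)); all function and parameter names are hypothetical.

```python
import math

def pr_one_side(p0, sites, probs, lam):
    """Probability that the nearest bound site on one side of p0 exists
    and no cut occurs between it and p0 (constant cut rate lam assumed)."""
    # Visit the sites from nearest to farthest from p0.
    order = sorted(range(len(sites)), key=lambda i: abs(sites[i] - p0))
    total = 0.0
    all_closer_unbound = 1.0
    for i in order:
        no_cut = math.exp(-lam * abs(sites[i] - p0))
        total += probs[i] * all_closer_unbound * no_cut
        all_closer_unbound *= 1.0 - probs[i]
    return total

def general_signal(p0, sites, probs, lam):
    """Pr(A) + Pr(B) - Pr(A)Pr(B), up to a proportionality constant."""
    left = [(s, q) for s, q in zip(sites, probs) if s <= p0]
    right = [(s, q) for s, q in zip(sites, probs) if s > p0]
    pr_a = pr_one_side(p0, [s for s, _ in left], [q for _, q in left], lam)
    pr_b = pr_one_side(p0, [s for s, _ in right], [q for _, q in right], lam)
    return pr_a + pr_b - pr_a * pr_b

print(general_signal(1.0, [0.0, 2.0], [0.8, 0.8], 1.0))
```

With exactly one site on each side, this reduces to the two-binding-site expression.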

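Regression to fit triangle

A simple case: probes are evenly spaced.

$$Y = (y_{-L},\, y_{-L+1},\, \ldots,\, y_{-1},\, y_0,\, y_1,\, \ldots,\, y_{R-1},\, y_R)$$

$$X_L = (-L,\, -L+1,\, \ldots,\, -1,\, 0,\, 0,\, \ldots,\, 0,\, 0)$$

$$X_R = (0,\, 0,\, \ldots,\, 0,\, 0,\, 1,\, \ldots,\, R-1,\, R)$$

Here $L$ and $R$ are the numbers of probes to the left and right of the center probe, and $Y$ is regressed on $X_L$ and $X_R$ with an intercept.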

Best fitted triangle

With the left and right boundaries fixed, the slopes and intercept can be identified by regression.

Over the different combinations of left and right boundaries, choose the one with the minimum residual variance.

This is the best fitted triangle centered at the probe under consideration.
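The boundary search described above can be sketched with ordinary least squares. This is a minimal illustration assuming evenly spaced probes, not the Mpeak source code; the function names, the minimum half-width of 2 probes, and the degrees-of-freedom correction are assumptions.

```python
import numpy as np

def fit_triangle(y, center, left, right):
    """Fit intercept + slope1 * X_L + slope2 * X_R on probes
    center-left .. center+right; return coefficients and residual variance."""
    y = np.asarray(y, dtype=float)
    idx = np.arange(center - left, center + right + 1)
    offset = idx - center
    x_left = np.minimum(offset, 0)    # -L, ..., -1, 0, 0, ..., 0
    x_right = np.maximum(offset, 0)   # 0, ..., 0, 0, 1, ..., R
    design = np.column_stack([np.ones(len(idx)), x_left, x_right])
    coef, *_ = np.linalg.lstsq(design, y[idx], rcond=None)
    resid = y[idx] - design @ coef
    dof = max(len(idx) - 3, 1)        # three fitted parameters
    return coef, float(resid @ resid) / dof

def best_triangle(y, center, max_half_width, min_half_width=2):
    """Try every (left, right) pair and keep the smallest residual variance."""
    best = None
    for left in range(min_half_width, max_half_width + 1):
        for right in range(min_half_width, max_half_width + 1):
            if center - left < 0 or center + right >= len(y):
                continue  # window would run off the array
            coef, var = fit_triangle(y, center, left, right)
            if best is None or var < best[0]:
                best = (var, left, right, coef)
    return best

# Noise-free triangle centered at probe 10.
offset = np.arange(21) - 10
y = 3.0 + 0.5 * np.minimum(offset, 0) - 0.25 * np.maximum(offset, 0)
var, left, right, coef = best_triangle(y, center=10, max_half_width=5)
print(coef)
```

For a peak, the fitted left slope is positive and the right slope is negative; the intercept is the fitted height at the center probe.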

Mpeak process

Arrange the local maxima by their signal strength.

For the first (strongest) local maximum, find the best fitted triangle in a small neighborhood and identify its center as a peak.

Mpeak process

For any other local maximum within the range of this triangle, if the difference between the two fitted values is small, mark it as a non-peak.

Continue this process until every local maximum has been considered or the remaining local maxima are smaller than a threshold.
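The ordering-and-suppression loop can be sketched as follows. This is a deliberate simplification and not the Mpeak implementation: the fitted triangle's range is replaced by a fixed `half_width`, and every local maximum inside that range is suppressed rather than compared through fitted values. All names are hypothetical.

```python
import numpy as np

def local_maxima(y):
    """Indices of interior probes that are higher than both neighbors
    (ties broken toward the left)."""
    return [i for i in range(1, len(y) - 1)
            if y[i] > y[i - 1] and y[i] >= y[i + 1]]

def greedy_peaks(y, half_width, threshold):
    """Accept local maxima from strongest to weakest, suppressing any
    other local maximum within half_width probes of an accepted peak."""
    y = np.asarray(y, dtype=float)
    candidates = sorted(local_maxima(y), key=lambda i: y[i], reverse=True)
    suppressed = set()
    peaks = []
    for i in candidates:
        if y[i] < threshold:
            break                      # all remaining maxima are weaker
        if i in suppressed:
            continue
        peaks.append(i)
        for j in candidates:
            if j != i and abs(j - i) <= half_width:
                suppressed.add(j)      # treated as part of the same peak
    return peaks

signal = [0, 1, 0, 5, 0, 4.5, 0, 0, 0, 0, 3, 0]
print(greedy_peaks(signal, half_width=2, threshold=2))
```

Processing the strongest maxima first means a weaker shoulder never claims probes that a stronger neighboring peak already explains.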

P value of peaks

Null hypothesis: the background signal in ChIP-chip data follows a normal distribution with mean 0.

$\sqrt{n}\,\bar{Y} = \sum_i Y_i / \sqrt{n}$ is used as the statistic for testing: it is zero-mean and variance-stabilized.

Background signals are not independent: probes close to each other tend to be included in the same segment simultaneously.

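Variance approximation

Assume that $Y_i$ correlates with $Y_j$ when $|P_i - P_j| \le m$.

$$\mathrm{Var}\left(\sum_i Y_i / \sqrt{n}\right) = \frac{1}{n} \sum_{i,j} \mathrm{cov}(Y_i, Y_j) = \frac{1}{n} \sum_{i,j:\,|P_i - P_j| \le m} \mathrm{cov}(Y_i, Y_j)$$

$$\approx \mathrm{var}(Y_i) \left( 1 + \sum_{j \ne i:\,|P_i - P_j| \le m} \mathrm{cov}(Y_i, Y_j) / \mathrm{var}(Y_i) \right) = \sigma^2 (1 + f)$$

The marginal variance of the noise, $\sigma^2$, and the auto-correlation factor, $f$, can both be estimated from the input data.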
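One way to turn this approximation into a p-value is sketched below. It assumes evenly spaced probes (so the distance cutoff corresponds to a maximum lag `max_lag`), a one-sided normal test, and sample estimates of the variance and autocorrelation from background probes; the function names are hypothetical and this is not the Mpeak source code.

```python
import math
import numpy as np

def noise_variance_and_factor(background, max_lag):
    """Estimate the marginal variance sigma^2 and the factor
    f = sum over lags 1..max_lag, on both sides, of the autocorrelation."""
    x = np.asarray(background, dtype=float)
    x = x - x.mean()
    sigma2 = float(np.mean(x * x))
    f = 0.0
    for lag in range(1, max_lag + 1):
        cov = float(np.mean(x[:-lag] * x[lag:]))
        f += 2.0 * cov / sigma2   # neighbors at this lag on the left and right
    return sigma2, f

def peak_p_value(values, sigma2, f):
    """One-sided p-value for sum(Y_i)/sqrt(n) under N(0, sigma2 * (1 + f)).
    Requires 1 + f > 0."""
    values = np.asarray(values, dtype=float)
    stat = values.sum() / math.sqrt(len(values))
    z = stat / math.sqrt(sigma2 * (1.0 + f))
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability

# Four probes of height 2 with sigma^2 = 1 and f = 3 give z = 2.
print(peak_p_value([2.0, 2.0, 2.0, 2.0], sigma2=1.0, f=3.0))
```

A positive `f` inflates the variance of the statistic, so correlated neighboring probes count as less evidence than the same number of independent probes would.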

Result

SignalMap, NimbleGen Inc.

Result

SignalMap, NimbleGen Inc.

Result

SignalMap, NimbleGen Inc.

Result

SignalMap, NimbleGen Inc.

How good is the fit?

Result

Kim, T.H. et al. A high-resolution map of active promoters in the human genome. Nature, 436, 876-880

9,328 promoters for known transcripts

1,196 putative promoters for unknown transcripts

Ren, B. UCSD

Comparison with kernel smoothing

Multi-resolution

Peak tree

Why use a model?

A promoter is characterized not only by a large probe signal, but also by a truncated triangle shape.

Identify the neighboring probes whose signals are caused by the same promoter, and pool their information to rank the potential binding sites.

SignalMap, NimbleGen Inc.

Model justification

Intuitively, human vision recognizes the local shape, instead of a single probe, to detect peaks.

Model fitting improves detection: 1) the largest signal may not always be the tip of the best fitted triangle; 2) we can handle outliers caused by malfunctioning probes.

For window smoothing, if the window size is not chosen well, a local maximum of the window average can well be the bottom of a valley.
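The window-smoothing pitfall is easy to reproduce on synthetic data. The toy signal and window sizes below are made up for illustration: two narrow peaks separated by a short valley, smoothed with a centered moving average.

```python
import numpy as np

def moving_average(y, window):
    """Centered moving average; window should be odd."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(y, dtype=float), kernel, mode="same")

# Two narrow peaks (tips at probes 5 and 11) with a valley at probes 7-9.
y = np.zeros(17)
y[4:7] = [1.0, 2.0, 1.0]
y[10:13] = [1.0, 2.0, 1.0]

narrow = int(np.argmax(moving_average(y, 3)))  # maximum of the 3-probe average
wide = int(np.argmax(moving_average(y, 9)))    # maximum of the 9-probe average
print(narrow, wide, y[wide])
```

With the 9-probe window, the only position whose window covers both peaks is the middle of the valley, so the smoothed maximum falls where the raw signal is lowest.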

Model justification

The model gives us a sensible way to choose the range:

This enables us to pool many weak signals together if they form a good triangle, which reduces the chance of false negatives.

This prevents us from pooling too many weak signals together if they do not form a good triangle, which reduces the chance of false positives.

Model justification

Probabilistic approximation: Poisson process.

Fact: two different slopes around the non-differentiable tip.

Functional approximation: line segments locally.

Gives a reasonable fit to the data.

Not enough data for a more complex model.

Not enough computational power to fit a more complex model within minutes.

Software

Fast: ~10 seconds for ~400,000 probes on a regular PC.

Robust to noise (data shown later).

Software and source code publicly available: www.stat.ucla.edu/~zmdl/Mpeak

Chromosome structure

Lodish, H. et al. Molecular Cell Biology.

Histone and transcription

Histone proteins need to be modified and DNA needs to be released for transcription to take place.

LS3 class note, UCLA

Histone and transcription

Ren, B., UCSD

Twin-peak phenomenon

The promoter region lies between two binding sites of the modified histone protein, e.g., acetylated histone H3 (AcH3).

ChIP-chip data for AcH3 show a twin-peak phenomenon, with a valley corresponding to the promoter region.

LS3 class note, UCLA

SignalMap, NimbleGen Inc.

Possible solutions

Fit a twin-peak shape to the data based on the probability model for the two binding sites scenario.

Use Witkin’s scale-space filtering to detect peaks and twin-peaks.