Post on 26-Dec-2015

ChIP-chip: Data, Model and Analysis

Ying Nian Wu

Dept. of Statistics, UCLA

Joint with Ming Zheng, Leah Barrera, Bing Ren

ChIP-chip

A technology for the isolation and identification of the DNA sequences occupied by specific DNA-binding proteins (regulatory sequences) in living cells.

Chromatin immunoprecipitation (ChIP) and microarray analysis (chip) are combined to study protein-DNA interaction in vivo.

Also known as “genome-wide location analysis”.

ChIP-chip process

Step 1: Bound transcription factors are cross-linked to DNA with formaldehyde.

Step 2: Sonication is used to break the genomic DNA into small fragments (of various lengths that are difficult to measure, roughly 1-2 kb).

Step 3: A specific antibody is added to immunoprecipitate the DNA segments cross-linked to the target protein.

Step 4.1: The cross-linking between DNA and protein is reversed, and the DNA is amplified by LM-PCR and labeled with the fluorescent dye Cy5.

Step 4.2: As a negative control, a sample of DNA that is not enriched by immunoprecipitation is also amplified by LM-PCR and labeled with another dye, Cy3.

Step 5: Both the IP-enriched and the unenriched samples are hybridized to the same oligonucleotide array.

Step 6: The microarray is scanned, the Cy5 and Cy3 signal strengths are extracted, and log(Cy5/Cy3) is calculated after normalization.
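The log-ratio in Step 6 can be sketched as follows. This is only an illustration: the slides do not say which logarithm base or which normalization is used, so base 2 and median centering are assumptions, and the function name `log_ratio` is hypothetical.

```python
import numpy as np

def log_ratio(cy5, cy3):
    """Return median-centered log2(Cy5/Cy3) for paired probe intensities."""
    cy5 = np.asarray(cy5, dtype=float)
    cy3 = np.asarray(cy3, dtype=float)
    raw = np.log2(cy5 / cy3)
    # Assumed normalization (not from the slides): subtract the array-wide
    # median so that the bulk of unenriched probes is centered at zero.
    return raw - np.median(raw)

ratios = log_ratio([200.0, 400.0, 1600.0], [100.0, 100.0, 100.0])
```

Each entry of `ratios` is one data point: the height of one probe in the plots that follow.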

Ren, B. UCSD

Summary of ChIP-chip

•Protein bound to DNA

•Sonication

•Immunoprecipitation

•Amplify DNA and add control

•Hybridize to probes

•Microarray analysis

Ren, B. UCSD

ChIP-chip data

One probe is one data point in the dataset.

The x-axis represents the genomic position of the probe.

The y-axis (the height) denotes the signal strength log(Cy5/Cy3) of each probe.

SignalMap, NimbleGen Inc.

A closer look

SignalMap, NimbleGen Inc.

Cy5 signal

The Cy5 signal strength at a point $P_0$ should be proportional to the probability that an IP-enriched segment contains that point.

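Single binding site scenario

Assume there is only one binding site, $B$, at the origin. For a segment to contribute to the signal at $P_0$:

1) this binding site is bound by the protein (with probability $q$);
2) no cut should occur between 0 and $P_0$.

The signal at $P_0$ is (approximately) proportional to:

$$q \exp\left\{-\int_{B}^{P_0} \lambda(s)\,ds\right\}$$

where $\lambda(s)$ is the intensity of sonication cuts at position $s$.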

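Model derivation

Assume the cut intensity $\lambda(s)$ to be constant around the binding site. Then the Cy5 signal strength decreases exponentially with distance from the binding site.

Log(Cy5/Cy3) therefore decreases linearly with distance from the binding site, giving a triangular shape:

$$\log\left(\frac{\mathrm{Cy5}}{\mathrm{Cy3}}\right) = \text{intercept} + \text{slope}_1 \cdot P_0^{-} + \text{slope}_2 \cdot P_0^{+}$$

where $P_0^{-} = \min(P_0, 0)$ and $P_0^{+} = \max(P_0, 0)$ are the parts of $P_0$ to the left and to the right of the binding site.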
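A minimal numerical sketch of this derivation, assuming a constant cut rate on each side of the binding site. The function name and the parameter values (`q`, `rate_left`, `rate_right`) are made up for illustration and are not taken from the slides.

```python
import numpy as np

def single_site_log_signal(positions, site=0.0, q=0.8,
                           rate_left=0.002, rate_right=0.001):
    """Log signal for one binding site under a constant cut rate per side."""
    d = np.asarray(positions, dtype=float) - site
    # The no-cut probability over distance |d| is exp(-rate * |d|),
    # so its logarithm is linear in |d|: a triangle with two slopes.
    rate = np.where(d < 0, rate_left, rate_right)
    return np.log(q) - rate * np.abs(d)

x = np.array([-1000.0, -500.0, 0.0, 500.0, 1000.0])
y = single_site_log_signal(x)
```

The maximum of `y` sits at the binding site, and the two sides fall off with different slopes when the two rates differ.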

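Two binding sites scenario

Consider two BSs, $B_1$ and $B_2$, and a point $P_0$ in between.

Event $A$: $B_1$ is bound and no cut occurs between $B_1$ and $P_0$.

Event $B$: $B_2$ is bound and no cut occurs between $P_0$ and $B_2$.

The signal at $P_0$ is proportional to:

$$\Pr(A) + \Pr(B) - \Pr(A \cap B)$$

$$= q_1 \exp\left\{-\int_{B_1}^{P_0} \lambda(s)\,ds\right\} + q_2 \exp\left\{-\int_{P_0}^{B_2} \lambda(s)\,ds\right\} - q_1 q_2 \exp\left\{-\int_{B_1}^{B_2} \lambda(s)\,ds\right\}$$

where $q_1$ and $q_2$ are the probabilities that $B_1$ and $B_2$ are bound, and $\lambda(s)$ is the cut intensity at position $s$.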
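The two-site signal can be evaluated directly once the cut intensity is taken to be a constant `lam`. That constant rate is an assumption made here for simplicity, and the function name is hypothetical.

```python
import math

def two_site_signal(p0, b1, b2, q1, q2, lam):
    """Pr(A) + Pr(B) - Pr(A and B) for a point p0 with b1 <= p0 <= b2."""
    if not b1 <= p0 <= b2:
        raise ValueError("p0 must lie between b1 and b2")
    pr_a = q1 * math.exp(-lam * (p0 - b1))        # b1 bound, no cut on [b1, p0]
    pr_b = q2 * math.exp(-lam * (b2 - p0))        # b2 bound, no cut on [p0, b2]
    pr_ab = q1 * q2 * math.exp(-lam * (b2 - b1))  # both bound, no cut on [b1, b2]
    return pr_a + pr_b - pr_ab

valley = two_site_signal(500.0, 0.0, 1000.0, 0.9, 0.9, 0.002)
edge = two_site_signal(0.0, 0.0, 1000.0, 0.9, 0.9, 0.002)
```

With these illustrative numbers the midpoint value `valley` is lower than `edge`, so the curve dips between the two sites.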

General scenario

Event $A$: no cut between $P_0$ and the nearest bound BS to the left.

Event $B$: no cut between $P_0$ and the nearest bound BS to the right.

The signal at $P_0$ is proportional to:

$$\Pr(A) + \Pr(B) - \Pr(A \cap B)$$

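General scenario (cont’d)

Let $B_1, \dots, B_m$ be the BSs to the left of $P_0$, with $B_m$ the nearest, and call a BS good if it is bound. Then:

$$\Pr(A) = \sum_{i=1}^{m} \Pr(B_i \text{ is nearest good BS}) \, \Pr(\text{no cut between } B_i \text{ and } P_0 \mid B_i \text{ is good})$$

$$= \sum_{i=1}^{m} \left[ q_i \prod_{j=i+1}^{m} (1 - q_j) \right] \exp\left\{-\int_{B_i}^{P_0} \lambda(s)\,ds\right\}$$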
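A sketch of Pr(A) for several binding sites to the left of the point. It assumes a constant cut rate `lam` and sites ordered left to right, so the last one is nearest to the point; `q[i]` is the probability that site i is bound. The function name and the argument layout are hypothetical.

```python
import math

def pr_left(p0, sites, q, lam):
    """Pr(A): no cut between p0 and the nearest bound site to its left.

    sites: positions B_1 < ... < B_m, all <= p0; q: binding probabilities.
    """
    m = len(sites)
    total = 0.0
    for i in range(m):
        # B_i is the nearest bound site iff it is bound and every site
        # between it and p0 (B_{i+1}, ..., B_m) is not bound.
        nearest = q[i] * math.prod(1.0 - q[j] for j in range(i + 1, m))
        total += nearest * math.exp(-lam * (p0 - sites[i]))
    return total
```

With `lam = 0` the sum reduces to the probability that at least one site is bound, which is a convenient sanity check.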

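Regression to fit triangle

A simple case: probes are evenly spaced.

$$Y = (y_{-L}, y_{-L+1}, \dots, y_{-1}, y_0, y_1, \dots, y_{R-1}, y_R)$$

$$X_L = \left(1, \tfrac{L-1}{L}, \dots, \tfrac{1}{L}, 0, 0, \dots, 0, 0\right)$$

$$X_R = \left(0, 0, \dots, 0, 0, \tfrac{1}{R}, \dots, \tfrac{R-1}{R}, 1\right)$$

Regressing $Y$ on $X_L$ and $X_R$ (with an intercept) gives the two slopes and the intercept of the triangle.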

Best fitted triangle

With the left boundary and the right boundary fixed, we can identify the slopes and the intercept by regression.

Over the different combinations of left and right boundaries, choose the one with the minimum residual variance.

This is the best fitted triangle centered at the probe under consideration.
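The boundary search can be sketched with ordinary least squares. This assumes evenly spaced probes and a design made of an intercept plus one ramp on each side of the center; the function names, the minimum half-width of 2, and the degrees-of-freedom correction are choices made here, not details taken from Mpeak itself.

```python
import numpy as np

def fit_triangle(y, center, left, right):
    """Least-squares triangle on probes center-left .. center+right."""
    idx = np.arange(-left, right + 1)
    x_left = np.where(idx < 0, -idx / left, 0.0)   # 1 at left boundary, 0 at center
    x_right = np.where(idx > 0, idx / right, 0.0)  # 0 at center, 1 at right boundary
    design = np.column_stack([np.ones(idx.size), x_left, x_right])
    obs = np.asarray(y, dtype=float)[center - left:center + right + 1]
    coef, *_ = np.linalg.lstsq(design, obs, rcond=None)
    resid = obs - design @ coef
    # Residual variance, corrected for the 3 fitted parameters.
    return coef, resid.var(ddof=3)

def best_triangle(y, center, max_half_width):
    """Try every (left, right) pair and keep the smallest residual variance."""
    best = None
    for left in range(2, max_half_width + 1):
        for right in range(2, max_half_width + 1):
            if center - left < 0 or center + right >= len(y):
                continue  # window would run off the data
            coef, var = fit_triangle(y, center, left, right)
            if best is None or var < best[0]:
                best = (var, left, right, coef)
    return best
```

`best_triangle` returns the residual variance, the chosen left and right widths, and the fitted (intercept, left coefficient, right coefficient).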

Mpeak process

Arrange the local maxima by their signal strength.

For the first (strongest) local maximum, find the best fitted triangle in a small neighborhood and identify its center as a peak.

For any other local maximum within the range of this triangle, if the difference between the two fitted values is small, mark it as a non-peak.

Continue this process until every local maximum has been considered or the remaining local maxima are smaller than a threshold.
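The greedy ordering in this process can be illustrated with a simplified sketch. Two loud assumptions: a fixed window `half_width` stands in for the range of the fitted triangle, and every local maximum inside that window is suppressed without comparing fitted values. The real Mpeak criteria are therefore not reproduced here, and all names are hypothetical.

```python
import numpy as np

def local_maxima(y):
    """Indices whose value is at least as large as both neighbors."""
    y = np.asarray(y, dtype=float)
    return [i for i in range(1, len(y) - 1)
            if y[i] >= y[i - 1] and y[i] >= y[i + 1]]

def greedy_peaks(y, half_width, threshold):
    """Visit local maxima from strongest to weakest, suppressing covered ones."""
    y = np.asarray(y, dtype=float)
    candidates = sorted(local_maxima(y), key=lambda i: y[i], reverse=True)
    suppressed, peaks = set(), []
    for i in candidates:
        if y[i] < threshold:
            break  # all remaining maxima are weaker still
        if i in suppressed:
            continue
        peaks.append(i)
        # Stand-in for the fitted triangle's range: a fixed window.
        for j in candidates:
            if j != i and abs(j - i) <= half_width:
                suppressed.add(j)
    return peaks

peaks = greedy_peaks([0, 1, 0, 5, 0, 4, 0, 0, 0, 3, 0], half_width=2, threshold=2)
```

In this toy input the maximum at index 5 falls inside the window of the stronger maximum at index 3 and is marked as a non-peak.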

P value of peaks

Null hypothesis: the background signal in ChIP-chip data follows a normal distribution with mean 0.

The statistic $\sqrt{n}\,\bar{Y} = \sum_i Y_i / \sqrt{n}$ is used for testing: it is zero-mean and variance stabilized.

Background signals are not independent: probes close to each other tend to be included in the same segment simultaneously.

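Variance approximation

Assume that $Y_i$ correlates with $Y_j$ when $|P_i - P_j| \le m$. Then:

$$\mathrm{Var}\left(\sum_i Y_i / \sqrt{n}\right) = \frac{1}{n} \sum_{i,j} \mathrm{cov}(Y_i, Y_j) = \frac{1}{n} \sum_{i} \sum_{j:\, |P_i - P_j| \le m} \mathrm{cov}(Y_i, Y_j)$$

$$\approx \mathrm{var}(Y_i) \left(1 + \sum_{j \ne i,\ |P_i - P_j| \le m} \mathrm{cov}(Y_i, Y_j) / \mathrm{var}(Y_i)\right) = \sigma^2 (1 + f)$$

The marginal variance of the noise, $\sigma^2$, and the auto-correlation factor, $f$, can both be estimated from input data.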
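A sketch of how the two quantities might be estimated and used, writing `sigma2` for the marginal noise variance and `f` for the auto-correlation factor. It assumes evenly spaced probes, so that "close" probes are those within `max_lag` index positions, and it ignores edge effects. The function names and the one-sided normal tail are choices made here, not details taken from the slides.

```python
import math
import numpy as np

def autocorr_factor(background, max_lag):
    """Estimate sigma^2 and f, the summed autocorrelation over nearby probes."""
    b = np.asarray(background, dtype=float)
    b = b - b.mean()
    sigma2 = float(np.mean(b * b))
    f = 0.0
    for lag in range(1, max_lag + 1):
        # Each lag counts twice: one neighbor on the left, one on the right.
        f += 2.0 * float(np.mean(b[:-lag] * b[lag:])) / sigma2
    return sigma2, f

def peak_p_value(y_window, sigma2, f):
    """One-sided p-value of sum(Y_i)/sqrt(n) under N(0, sigma2 * (1 + f))."""
    y = np.asarray(y_window, dtype=float)
    z = y.sum() / math.sqrt(y.size) / math.sqrt(sigma2 * (1.0 + f))
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the normal
```

Positive autocorrelation (`f > 0`) inflates the null variance, so the same pooled signal yields a larger, more conservative p-value than it would under independence.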

Result

SignalMap, NimbleGen Inc.

How good is the fit?

Result

Kim, T.H. et al. A high-resolution map of active promoters in the human genome. Nature, 436, 876-880

9,328 promoters for known transcripts

1,196 putative promoters for unknown transcripts

Ren, B. UCSD

Comparison with kernel smoothing

Multi-resolution

Peak tree

Why use a model?

A promoter is characterized not only by a large probe signal, but also by a truncated triangle shape.

The model identifies the neighboring probes whose signals are caused by the same promoter, so that their information can be pooled for ranking the potential binding sites.

SignalMap, NimbleGen Inc.

Model justification

Intuitively, human vision recognizes the local shape, rather than a single probe, to detect peaks.

Model fitting improves detection: 1) the largest signal may not always be the tip of the best fitted triangle; 2) we can handle outliers caused by malfunctioning probes.

For window smoothing, if the window size is not chosen well, a local maximum of the window average may well be the bottom of a valley.

Model justification (cont’d)

The model gives us a sensible way to choose the range:

•It enables us to pool many weak signals together if they form a good triangle, which reduces the chance of a false negative.

•It prevents us from pooling too many weak signals together if they do not form a good triangle, which reduces the chance of a false positive.

Model justification (cont’d)

•Probabilistic approximation: Poisson process

•Fact: two different slopes around the non-differentiable tip

•Functional approximation: line segments locally

•Gives a reasonable fit to the data

•Not enough data for a more complex model

•Not enough computational power to fit a more complex model within minutes

Software

Fast: ~10 seconds for ~400,000 probes on a regular PC.

Robust to noise (data shown later).

Software and source code are publicly available: www.stat.ucla.edu/~zmdl/Mpeak

Chromosome structure

Lodish, H. et al. Molecular Cell Biology.

Histone and transcription

Histone proteins need to be modified and DNA needs to be released for transcription to take place.

LS3 class note, UCLA

Histone and transcription

Ren, B., UCSD

Twin-peak phenomenon

The promoter region is in between two binding sites of the modified histone protein, e.g., Acetylated histone H3 (AcH3).

ChIP-chip data for AcH3 show a twin-peak phenomenon, with a valley corresponding to promoter region.

LS3 class note, UCLA

SignalMap, NimbleGen Inc.

Possible solutions

Fit a twin-peak shape to the data based on the probability model for the two binding sites scenario.

Use Witkin’s scale-space filtering to detect peaks and twin-peaks.
