
Gibbs sampling for motif finding

Yves Moreau

2

Overview

Markov Chain Monte Carlo

Gibbs sampling

Motif finding in cis-regulatory DNA

Biclustering microarray data

3

Markov Chain Monte-Carlo

Markov chain with transition matrix T

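\[ T_{ij} = P(X_{t+1} = j \mid X_t = i) \]

        A       C       G       T
A    0.0643  0.8268  0.0659  0.0430
C    0.0598  0.0484  0.8515  0.0403
G    0.1602  0.3407  0.1736  0.3255
T    0.1507  0.1608  0.3654  0.3231

(rows: current state X_t; columns: next state X_{t+1})

[Figure: state diagram with the four states X=A, X=C, X=G, X=T]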
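The transition matrix above can be turned directly into a sampler. The following is a minimal sketch, assuming the rows are indexed by the current state and the columns by the next state; the function name `simulate_chain`, the start state and the seed are illustrative choices, not part of the slides.

```python
import random

STATES = "ACGT"

# Transition matrix from the slide: T[i][j] = P(X_{t+1} = j | X_t = i).
T = [
    [0.0643, 0.8268, 0.0659, 0.0430],  # from A
    [0.0598, 0.0484, 0.8515, 0.0403],  # from C
    [0.1602, 0.3407, 0.1736, 0.3255],  # from G
    [0.1507, 0.1608, 0.3654, 0.3231],  # from T
]


def simulate_chain(n_steps, start="A", seed=0):
    """Return a string of n_steps successive states of the chain."""
    rng = random.Random(seed)
    state = STATES.index(start)
    visited = []
    for _ in range(n_steps):
        visited.append(STATES[state])
        # Draw the next state from the row of T for the current state.
        state = rng.choices(range(4), weights=T[state])[0]
    return "".join(visited)


sequence = simulate_chain(60)
print(sequence)
```

Because A mostly moves to C and C mostly moves to G, the output tends to be rich in the run "ACG".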

4

Markov Chain Monte-Carlo

Markov chains can sample from complex distributions

ACGCGGTGTGCGTTTGACGAACGGTTACGCGACGTTTGGTACGTGCGGTGTACGTGTACGACGGAGTTTGCGGGACGCGTACGCGCGTGACGTACGCGTGAGACGCGTGCGCGCGGACGCACGGGCGTGCGCGCGTCGCGAACGCGTTTGTGTTCGGTGCACCGCGTTTGACGTCGGTTCACGTGACGCGTAGTTCGACGACGTGACACGGACGTACGCGACCGTACTCGCGTTGACACGATACGGCGCGGCGGGCGCGGACGTACGCGTACACGCGGGAACGCGCGTGTTTACGACGTGACGTCGCACGCGTCGGTGTGACGGCGGTCGGTACACGTCGACGTTGCGACGTGCGTGCTGACGGAACGACGACGCGACGCACGGCGTGTTCGCGGTGCGG

[Figure: percentage (%) of A, C, G and T as a function of position]

5

Markov Chain Monte-Carlo

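Let us look at the transition after two steps

\[
\begin{aligned}
T^{(2)}_{ij} &= P(X_{t+2} = j \mid X_t = i) \\
&= \sum_k P(X_{t+2} = j \mid X_{t+1} = k, X_t = i)\, P(X_{t+1} = k \mid X_t = i) \\
&= \sum_k P(X_{t+2} = j \mid X_{t+1} = k)\, P(X_{t+1} = k \mid X_t = i) \\
&= \sum_k T_{ik} T_{kj}
\end{aligned}
\]

\[ T^{(2)} = T \cdot T = T^2 \]

Similarly, after n steps

\[ T^{(n)} = P(X_{t+n} \mid X_t) = T^n \]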
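The two-step result can be checked numerically. The sketch below, in plain Python, compares the explicit sum over the intermediate state k with the corresponding entry of the matrix product; the helper names `matmul` and `matpow` are illustrative.

```python
# Transition matrix from the slides, states ordered A, C, G, T.
T = [
    [0.0643, 0.8268, 0.0659, 0.0430],
    [0.0598, 0.0484, 0.8515, 0.0403],
    [0.1602, 0.3407, 0.1736, 0.3255],
    [0.1507, 0.1608, 0.3654, 0.3231],
]


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(t, n):
    """n-step transition matrix T^n, computed by repeated multiplication."""
    size = len(t)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, t)
    return result


T2 = matpow(T, 2)

# Probability of going from A (index 0) to C (index 1) in two steps,
# written as the explicit sum over the intermediate state k.
p_a_to_c = sum(T[0][k] * T[k][1] for k in range(4))
print(T2[0][1], p_a_to_c)
```

Each row of T^n is still a probability distribution, since a product of row-stochastic matrices is row-stochastic.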

6

Markov Chain Monte-Carlo

Stationary distribution

If the samples are generated according to the distribution $\pi$, the samples at the next step will also be generated according to $\pi$

\[ \pi T = \pi \]

$\pi$ is a left eigenvector of $T$ (with eigenvalue 1)

Equilibrium distribution

\[ T^{\infty} = \lim_{n \to \infty} T^n = \mathbf{1}\,\pi \]

Rows of $T^{\infty}$ are stationary distributions

From an arbitrary initial condition and after a sufficient number of steps (burn-in), the successive states of the Markov chain are samples from a stationary distribution

\[
\begin{pmatrix} 0.1188 & 0.2788 & 0.3905 & 0.2119 \end{pmatrix}
\begin{pmatrix}
0.0643 & 0.8268 & 0.0659 & 0.0430 \\
0.0598 & 0.0484 & 0.8515 & 0.0403 \\
0.1602 & 0.3407 & 0.1736 & 0.3255 \\
0.1507 & 0.1608 & 0.3654 & 0.3231
\end{pmatrix}
=
\begin{pmatrix} 0.1188 & 0.2788 & 0.3905 & 0.2119 \end{pmatrix}
\]
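One simple way to find the stationary distribution numerically is to start from any distribution and apply T repeatedly. This is a sketch under the assumption that the chain converges, which holds here because every entry of T is positive; the function name and the iteration count are illustrative.

```python
# Transition matrix from the slides, states ordered A, C, G, T.
T = [
    [0.0643, 0.8268, 0.0659, 0.0430],
    [0.0598, 0.0484, 0.8515, 0.0403],
    [0.1602, 0.3407, 0.1736, 0.3255],
    [0.1507, 0.1608, 0.3654, 0.3231],
]


def stationary_distribution(t, n_iter=500):
    """Iterate pi <- pi T from the uniform distribution."""
    n = len(t)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        # (pi T)_j = sum_i pi_i T_ij : pi multiplies T from the left.
        pi = [sum(pi[i] * t[i][j] for i in range(n)) for j in range(n)]
    return pi


pi = stationary_distribution(T)
print([round(p, 4) for p in pi])
```

The result should agree, up to rounding, with the vector (0.1188, 0.2788, 0.3905, 0.2119) given on the slide.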

7

Detailed balance

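A sufficient condition for $\pi$ to be a stationary distribution of the Markov chain is that $\pi$ and $T$ satisfy the condition of detailed balance

\[ \pi_i T_{ij} = \pi_j T_{ji}, \quad \forall i, j \]

Proof:

\[ (\pi T)_i = \sum_j \pi_j T_{ji} = \sum_j \pi_i T_{ij} = \pi_i \sum_j T_{ij} = \pi_i, \quad \forall i \]

Problem: disjoint regions in probability space (the chain cannot cross between them, so convergence to $\pi$ is not guaranteed)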
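The condition is sufficient but not necessary, which a short numerical check makes concrete. In the sketch below, the example chain from the earlier slides is stationary for its π yet violates detailed balance, while a chain built with a Metropolis-style rule satisfies it by construction. The Metropolis construction and the target (0.5, 0.3, 0.2) are illustrative assumptions brought in for the example; they do not appear in the slides.

```python
def satisfies_detailed_balance(pi, t, tol=1e-9):
    """Check pi_i T_ij == pi_j T_ji for all pairs (i, j)."""
    n = len(pi)
    return all(abs(pi[i] * t[i][j] - pi[j] * t[j][i]) <= tol
               for i in range(n) for j in range(n))


def is_stationary(pi, t, tol=1e-9):
    """Check (pi T)_i == pi_i for all i."""
    n = len(pi)
    return all(abs(sum(pi[j] * t[j][i] for j in range(n)) - pi[i]) <= tol
               for i in range(n))


def metropolis_matrix(pi):
    """Transition matrix with uniform proposals and Metropolis acceptance."""
    n = len(pi)
    t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                t[i][j] = min(1.0, pi[j] / pi[i]) / (n - 1)
        # Remaining probability mass stays on the current state.
        t[i][i] = 1.0 - sum(t[i])
    return t


target = [0.5, 0.3, 0.2]  # hypothetical target distribution
reversible_T = metropolis_matrix(target)

# Example chain and stationary vector from the earlier slides (4 decimals).
slide_T = [
    [0.0643, 0.8268, 0.0659, 0.0430],
    [0.0598, 0.0484, 0.8515, 0.0403],
    [0.1602, 0.3407, 0.1736, 0.3255],
    [0.1507, 0.1608, 0.3654, 0.3231],
]
slide_pi = [0.1188, 0.2788, 0.3905, 0.2119]

print(satisfies_detailed_balance(target, reversible_T), is_stationary(target, reversible_T))
print(satisfies_detailed_balance(slide_pi, slide_T, tol=1e-3), is_stationary(slide_pi, slide_T, tol=1e-3))
```

The loose tolerance in the last line only absorbs the four-decimal rounding of the slide values.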

8

Gibbs sampling

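Markov chain for Gibbs sampling

\[ P(A, B, C) \;\rightarrow\; P(A \mid B, C),\; P(B \mid A, C),\; P(C \mid A, B) \]

\[
\begin{aligned}
& & &(a_0, b_0, c_0) \\
a_{i+1} &\sim P(A \mid B = b_i, C = c_i) & &\rightarrow (a_{i+1}, b_i, c_i) \\
b_{i+1} &\sim P(B \mid A = a_{i+1}, C = c_i) & &\rightarrow (a_{i+1}, b_{i+1}, c_i) \\
c_{i+1} &\sim P(C \mid A = a_{i+1}, B = b_{i+1}) & &\rightarrow (a_{i+1}, b_{i+1}, c_{i+1})
\end{aligned}
\]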
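The update scheme can be illustrated on a small discrete example. The sketch below runs a Gibbs sampler over three binary variables A, B and C; the joint table `JOINT` is a made-up distribution used only for illustration, and the burn-in length and seed are arbitrary.

```python
import random

# Hypothetical joint distribution P(A, B, C) over three binary variables.
JOINT = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.05, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.15, (1, 1, 1): 0.30,
}


def sample_conditional(state, index, rng):
    """Draw variable `index` from its conditional given the other two."""
    weights = []
    for value in (0, 1):
        candidate = list(state)
        candidate[index] = value
        # The conditional is proportional to the joint with the others fixed.
        weights.append(JOINT[tuple(candidate)])
    return rng.choices((0, 1), weights=weights)[0]


def gibbs(n_samples, burn_in=500, seed=0):
    """Return empirical frequencies of (a, b, c) after burn-in."""
    rng = random.Random(seed)
    state = [0, 0, 0]  # arbitrary initial condition (a_0, b_0, c_0)
    counts = {key: 0 for key in JOINT}
    for step in range(burn_in + n_samples):
        for index in range(3):  # update A, then B, then C
            state[index] = sample_conditional(state, index, rng)
        if step >= burn_in:
            counts[tuple(state)] += 1
    return {key: count / n_samples for key, count in counts.items()}


frequencies = gibbs(20000)
for key in sorted(JOINT):
    print(key, JOINT[key], round(frequencies[key], 3))
```

After burn-in, the empirical frequencies should be close to the entries of the joint table even though the sampler only ever uses the conditionals.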

9

Gibbs sampling

Detailed balance

\[ P(x)\, T(y \mid x) = P(y)\, T(x \mid y), \quad \forall x, y \]

Detailed balance for the Gibbs sampler

\[ P(x) = P(x_1, \dots, x_n), \qquad T(y \mid x) = P(x_i' \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) \]

with $x = (x_1, \dots, x_i, \dots, x_n)$ and $y = (x_1, \dots, x_i', \dots, x_n)$

Prove detailed balance

\[
P(x_1, \dots, x_n)\, P(x_i' \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)
= P(x_1, \dots, x_i', \dots, x_n)\, P(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)
\]

Bayes’ rule

\[
\frac{P(x_1, \dots, x_n)\, P(x_1, \dots, x_i', \dots, x_n)}{P(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)}
= \frac{P(x_1, \dots, x_i', \dots, x_n)\, P(x_1, \dots, x_n)}{P(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)}
\]

Q.E.D.

10

Data augmentation Gibbs sampling

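Introducing unobserved variables often simplifies the expression of the likelihood

A Gibbs sampler can then be set up

\[ P(\theta, M \mid D) \;\rightarrow\; P(\theta \mid M, D),\; P(M \mid \theta, D) \]

$\theta$: model parameters, $M$: missing data, $D$: data

[A second pair of conditionals, indexed by $i$ and $j$, is not recoverable from the transcript]

Samples from the Gibbs sampler can be used to estimate parameters

\[
\theta_{\mathrm{PME}} = E(\theta \mid D) = \iint \theta\, P(\theta, M \mid D)\, dM\, d\theta \approx \frac{1}{N} \sum_{k=1}^{N} \theta_k
\]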
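A toy case shows how the missing data and the parameters are sampled in turn and how the sample average estimates the posterior mean. The sketch below assumes a coin with unknown bias θ, a uniform prior, some observed tosses (D) and some unobserved tosses (M); the model, the function name `augmented_gibbs` and all numbers are hypothetical choices for illustration.

```python
import random


def augmented_gibbs(heads, tails, n_missing, n_samples=5000, burn_in=500, seed=0):
    """Posterior mean estimate of theta from a data augmentation Gibbs sampler."""
    rng = random.Random(seed)
    theta = 0.5
    kept = []
    for step in range(burn_in + n_samples):
        # M | theta, D : impute the unobserved tosses.
        missing_heads = sum(rng.random() < theta for _ in range(n_missing))
        # theta | M, D : Beta posterior under a uniform prior.
        theta = rng.betavariate(1 + heads + missing_heads,
                                1 + tails + n_missing - missing_heads)
        if step >= burn_in:
            kept.append(theta)
    # Average of the sampled parameters after burn-in.
    return sum(kept) / len(kept)


estimate = augmented_gibbs(heads=14, tails=6, n_missing=5)
# The missing tosses carry no information, so the exact posterior is
# Beta(1 + 14, 1 + 6), whose mean is 15 / 22.
exact = (1 + 14) / (2 + 14 + 6)
print(estimate, exact)
```

In this example the augmentation is not needed to obtain the answer, which is what makes the estimate easy to compare with the exact posterior mean.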

11

Pros and cons

Pros
  Clear probabilistic interpretation
  Bayesian framework
  “Global optimization”

Cons
  Mathematical details not easy to work out
  Relatively slow

12

Motif finding

13

Gibbs sampler

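Gibbs sampling for motif finding

Set up a Gibbs sampler for the joint probability of the motif matrix and the alignment given the sequences

\[ P(\theta, A \mid S) \;\rightarrow\; P(\theta \mid A, S),\; P(A \mid \theta, S) \]

$\theta$: motif matrix, $A$: alignment, $S$: sequences

Sequence by sequence

\[ P(A \mid S, \theta) = \prod_{i=1}^{K} P(a_i \mid S_i, \theta) \]

Lawrence et al.
  One motif of fixed length
  One occurrence per sequence
  Background model based on single nucleotides
  Too sensitive to noise
  Lots of parameter tuning

Motif model: NCACGTGN

       1     2     3     4     5     6     7     8
A    0.3   0.01  0.8   0.03  0.04  0.04  0.03  0.2
C    0.4   0.9   0.1   0.8   0.01  0.01  0.02  0.2
G    0.2   0.04  0.05  0.03  0.9   0.05  0.9   0.4
T    0.1   0.05  0.05  0.04  0.05  0.9   0.05  0.2

Background model

A    0.32
C    0.16
G    0.24
T    0.28

Likelihood of a sequence $S = b_1 b_2 \dots b_L$ with a motif of width $W$ starting at position $a$

\[ P(S \mid a, \theta_W, B_0) = P_{bg_1}\, P_{\mathrm{Motif}}\, P_{bg_2} \]

\[ P_{bg_1} = \prod_{j=1}^{a-1} q_{0, b_j} \qquad P_{\mathrm{Motif}} = \prod_{j=1}^{W} q_{j, b_{a+j-1}} \qquad P_{bg_2} = \prod_{j=a+W}^{L} q_{0, b_j} \]

[Figure: 500 bp sequence upstream of the translation start, split into background, motif and background segments]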
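The likelihood of a sequence for a given motif position can be computed directly from the two tables. The sketch below uses the motif matrix and the single-nucleotide background from the slide, works in log space to avoid underflow on long sequences, and uses a 0-based start position; the test sequence is made up for illustration.

```python
import math

# Motif matrix from the slide (NCACGTGN): one list per base, one entry per position.
MOTIF = {
    "A": [0.3, 0.01, 0.8, 0.03, 0.04, 0.04, 0.03, 0.2],
    "C": [0.4, 0.9, 0.1, 0.8, 0.01, 0.01, 0.02, 0.2],
    "G": [0.2, 0.04, 0.05, 0.03, 0.9, 0.05, 0.9, 0.4],
    "T": [0.1, 0.05, 0.05, 0.04, 0.05, 0.9, 0.05, 0.2],
}
# Single-nucleotide background model from the slide.
BACKGROUND = {"A": 0.32, "C": 0.16, "G": 0.24, "T": 0.28}
W = 8  # motif width


def log_likelihood(seq, a):
    """log P(S | a, theta, B0) for a motif starting at 0-based position a."""
    total = 0.0
    for pos, base in enumerate(seq):
        if a <= pos < a + W:
            total += math.log(MOTIF[base][pos - a])  # inside the motif
        else:
            total += math.log(BACKGROUND[base])      # background on either side
    return total


seq = "TTGACCACGTGTAAT"  # hypothetical sequence containing CACGTG
best_start = max(range(len(seq) - W + 1), key=lambda a: log_likelihood(seq, a))
print(best_start, seq[best_start:best_start + W])
```

Picking the maximum is only a way to inspect the likelihood; the Gibbs sampler draws the position at random in proportion to it instead.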

15

Gibbs motif finding

Initialization
  Sequences
  Random motif matrix

Iteration
  Sequence scoring
  Alignment update
  Motif instances
  Motif matrix

Termination
  Convergence of the alignment and of the motif matrix

17

Gibbs motif finding

Sequence scoring

\[
W(x_l^i) = \frac{P(x_l^i \mid \theta_W, S_i)}{P(x_l^i \mid B_0, S_i)} = \prod_{j=1}^{W} \frac{q_{j, b_{l+j-1}}}{q_{0, b_{l+j-1}}}, \qquad x_l^i = b_l b_{l+1} \dots b_{l+W-1}
\]

where $x_l^i$ is the segment of width $W$ starting at position $l$ of sequence $S_i$
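The scoring and alignment-update steps for a single sequence might look as follows. This is a sketch that assumes the weight of each window is the ratio of its probability under the motif matrix to its probability under the single-nucleotide background, and that the new position is drawn in proportion to these weights; the function names and the example sequence are illustrative.

```python
import random

# Motif matrix (NCACGTGN) and background model from the earlier slide.
MOTIF = {
    "A": [0.3, 0.01, 0.8, 0.03, 0.04, 0.04, 0.03, 0.2],
    "C": [0.4, 0.9, 0.1, 0.8, 0.01, 0.01, 0.02, 0.2],
    "G": [0.2, 0.04, 0.05, 0.03, 0.9, 0.05, 0.9, 0.4],
    "T": [0.1, 0.05, 0.05, 0.04, 0.05, 0.9, 0.05, 0.2],
}
BACKGROUND = {"A": 0.32, "C": 0.16, "G": 0.24, "T": 0.28}
WIDTH = 8


def window_weights(seq):
    """Motif-to-background probability ratio for every window of the sequence."""
    weights = []
    for start in range(len(seq) - WIDTH + 1):
        ratio = 1.0
        for j, base in enumerate(seq[start:start + WIDTH]):
            ratio *= MOTIF[base][j] / BACKGROUND[base]
        weights.append(ratio)
    return weights


def sample_position(seq, rng):
    """Alignment update: draw a start position in proportion to its weight."""
    weights = window_weights(seq)
    return rng.choices(range(len(weights)), weights=weights)[0]


seq = "TTGACCACGTGTAAT"  # hypothetical sequence containing CACGTG
rng = random.Random(0)
draws = [sample_position(seq, rng) for _ in range(1000)]
print(max(set(draws), key=draws.count))
```

Sampling rather than taking the best window is what lets the procedure leave poor alignments early on, when the motif matrix is still close to random.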

22

Gibbs motif finding

Initialization
  Sequences
  Random motif matrix

Iteration
  Sequence scoring
  Alignment update
  Motif instances
  Motif matrix

Termination
  Stabilization of the motif matrix (not of the alignment)
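Put together, the loop can be sketched as below. Several details are assumptions made for the example rather than taken from the slides: the random motif matrix is obtained from a random initial alignment, the matrix is re-estimated with pseudocounts from all sequences except the one being updated, the background is estimated from the input sequences, and a fixed number of iterations stands in for a convergence test. The toy sequences are made up.

```python
import random

BASES = "ACGT"


def motif_matrix(sites, pseudocount=1.0):
    """Per-position base frequencies of the aligned sites, with pseudocounts."""
    matrix = []
    for j in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix


def gibbs_motif_finder(seqs, width, n_iter=200, seed=0):
    """Return one start position per sequence and the final motif matrix."""
    rng = random.Random(seed)
    # Single-nucleotide background estimated from all sequences.
    joined = "".join(seqs)
    background = {b: (joined.count(b) + 1) / (len(joined) + 4) for b in BASES}
    # Initialization: a random alignment, which implies a random motif matrix.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_iter):
        for i, seq in enumerate(seqs):
            # Motif matrix from the motif instances of all other sequences.
            sites = [s[a:a + width]
                     for k, (s, a) in enumerate(zip(seqs, starts)) if k != i]
            matrix = motif_matrix(sites)
            # Sequence scoring: motif versus background for every window.
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for j, base in enumerate(seq[start:start + width]):
                    ratio *= matrix[j][base] / background[base]
                weights.append(ratio)
            # Alignment update: sample the new position from the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    final_sites = [s[a:a + width] for s, a in zip(seqs, starts)]
    return starts, motif_matrix(final_sites)


seqs = ["TTGACCACGTGTAAT", "GGCACGTGATTACCA", "ATTTGCACGTGCCTA", "CCATACACGTGTTGA"]
starts, matrix = gibbs_motif_finder(seqs, width=6, n_iter=50)
print(starts, [s[a:a + 6] for s, a in zip(seqs, starts)])
```

Because the alignment keeps being resampled, individual positions can still change after the motif matrix has settled, which is why termination is judged on the matrix rather than on the alignment.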

23

Motif Sampler (extended Gibbs sampling)

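Model
  One motif of fixed length per round
  Several occurrences per sequence

Each sequence has a discrete probability distribution over the number of copies of the motif (under a maximum bound)

Multiple motifs found in successive rounds by masking occurrences of previous motifs

Improved background model based on oligonucleotides

Gapped motifs

Motif model: NCACGTGN

       1     2     3     4     5     6     7     8
A    0.3   0.01  0.8   0.03  0.04  0.04  0.03  0.2
C    0.4   0.9   0.1   0.8   0.01  0.01  0.02  0.2
G    0.2   0.04  0.05  0.03  0.9   0.05  0.9   0.4
T    0.1   0.05  0.05  0.04  0.05  0.9   0.05  0.2

Background model (order $m$)

\[ P(b_j \mid b_{j-1} b_{j-2} \dots b_{j-m}) \]

Likelihood of a sequence with $c$ motif occurrences starting at positions $a_1, \dots, a_c$

\[ P(S \mid a, c, \theta_W, B_m) = P_{bg}^{0} \prod_{i=1}^{c} P_{\mathrm{Motif}}^{i}\, P_{bg}^{i} \]

\[ P_{\mathrm{Motif}}^{i} = \prod_{j=1}^{W} q_{j, b_{a_i+j-1}} \]

\[ P_{bg}^{0} = P(b_1, \dots, b_m) \times \prod_{j=m+1}^{a_1-1} P(b_j \mid b_{j-1} \dots b_{j-m}) \]

\[ P_{bg}^{i} = \prod_{j=a_i+W}^{a_{i+1}-1} P(b_j \mid b_{j-1} \dots b_{j-m}) \]

[Figure: 500 bp sequence upstream of the translation start]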
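A background model based on oligonucleotides can be estimated by counting which base follows each context of length m. The sketch below is one simple way to do this, assuming pseudocounts and a uniform fallback for contexts never seen in training; the function names and the training sequence are illustrative and are not the Motif Sampler implementation.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train_background(seqs, m, pseudocount=1.0):
    """Estimate P(b_j | b_{j-m} ... b_{j-1}) from training sequences."""
    counts = defaultdict(lambda: {b: pseudocount for b in BASES})
    for s in seqs:
        for j in range(m, len(s)):
            counts[s[j - m:j]][s[j]] += 1
    return {context: {b: c[b] / sum(c.values()) for b in BASES}
            for context, c in counts.items()}


def background_log_prob(seq, start, end, model, m):
    """log of prod_{j=start}^{end-1} P(b_j | preceding m bases); needs start >= m."""
    uniform = {b: 0.25 for b in BASES}  # fallback for unseen contexts
    return sum(math.log(model.get(seq[j - m:j], uniform)[seq[j]])
               for j in range(start, end))


model = train_background(["ACACACACAC"], m=1)
print(model["A"]["C"], model["C"]["A"])
print(background_log_prob("ACAC", 1, 4, model, 1))
```

With m = 0 this reduces to the single-nucleotide background used by the original Gibbs sampler.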