university of north carolina at chapel...

30
A hierarchical regression mixture model for inferring gene regulatory networks Mayetri Gupta [email protected] University of North Carolina at Chapel Hill [email protected] -- p.1/30

Upload: others

Post on 20-Jan-2021

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

A hierarchical regression mixture modelfor inferring gene regulatory networks

Mayetri [email protected]

University of North Carolina at Chapel Hill

[email protected] -- p.1/30

Page 2: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Upstream regulation ↔ Downstream expression

Fundamental question: how can we understand the biologicalmechanisms leading to disease?

...gtggtTAGAATagcgactgttttt... gene 1

...taggTATAATacagtctgacaaaa... gene 2

...cagcaacattgaTATAATtgccat... gene 3

...ctaaaacaatTATTATttatcagg... gene 4

0

1

2

bits | 1 T 2 A 3 GT 4 TA 5 GCA 6 T|

Co-regulated genes sharesimilar upstream patterns

Identify genes that are differentially expressed under differenttreatments or conditions

[email protected] -- p.2/30

Page 3: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Gene Regulation: DNA Motifs

Proteins bind to DNA to activate gene transcription

0

1

2

bits |

1 T 2 A 3 GT

4 TA 5 GCA 6 T|Position specific weight

matrix (PSWM)

or Motif

[email protected] -- p.3/30

Page 4: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Gene regulation in complex genomes

Harder problem: many transcription factors working inco-ordination

LARGE sequence search space: using sequence data only→ many false positives?

[email protected] -- p.4/30

Page 5: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Upstream regulation ↔ Downstream expression

Gene expression contains information about sequence motifs

Sequence may contain information on gene co-regulation

Expression clustering→ Motif discovery

or

Expression clustering↔ Motif discovery ?

What if initial clustering is inaccurate?

[email protected] -- p.5/30

Page 6: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Cell-cycle data set (Spellman, Mol Biol Cell. 1998)

28 63 98

−2−1

01

23

G1G2/MM/G1SS/G2

PS

fragreplacem

ents

gene

expr

essi

on

time (minutes)

Measurements over 18 time points, 3 different experiments∼ 800 genes known to be cell-cycle dependent

Do clusters of genes share common TFs?Do certain TFs work combinatorially on groups of genes?

color: time when gene is active

[email protected] -- p.6/30

Page 7: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Using Gene Expression in Motif Discovery

REDUCER (Bussemaker, Nat. Genet. 2001) correlatesexpression of gene with number of motif occurrences

MDScan (Liu, Nat Biotech. 2002) Most strongly differentiallyexpressed genes → candidate motifs.

Motif Regressor (Conlon, PNAS 2003)Multiple regression model: Sum of motif effects explainsgene expression

Yg = α+M

∑m=1

βmSmg + εg

Yg: expression of gene g; Smg: motif-match score

[email protected] -- p.7/30

Page 8: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Using Gene Expression in Motif Discovery

Non-parametric approaches

Phuong et al. (Bioinformatics, 2004): Classification Trees(CART)

Arbitrary decision criterion based on the number ofoccurrences of a motif type

Multivariate adaptive regression splines (Das et al, PNAS2004)

Joint sequence-expression model without parametric connection

Holmes and Bruno (ISMB proc.,2000) joint likelihood forsequence and expression data

[email protected] -- p.8/30

Page 9: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Using motif information in gene clustering

Infer sets of transcription factors involved in regulating groups ofgenes

Higher transcriptional activity → greater presence of TFbinding sites, more pronounced expression changes

Genes within a “cluster” may be correlated, with or withoutsharing common transcription factors

Measurements on the same gene in different conditions maybe correlated due to sharing the same upstream transcriptionfactor binding sites

[email protected] -- p.9/30

Page 10: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Linear mixed effects model

y = Xβ+Zb+ ε

β: fixed effectsb: random effects

ε ∼ N(0,τ−10 I)

y: gene expression

Fixed effects: Sequence MotifLevels of factor are reproduced exactly if experiment isrepeated

Random effects: Expression clusterLevels of factor (expression + clusters) may not bereproduced exactly if experiment is repeated

[email protected] -- p.10/30

Page 11: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Joint Model for Sequence-Expression

Complication: gene cluster identity cannot be assumed knownZ matrix not “fixed”

Conditional on cluster k, (k = 1, . . . ,K), vector of log-expressionvalues of gene g generated from mixed-effects model:

Yg|zg = k,X , parameters ∼ N(ξg +XTg βk1,σ2

kI) (≡ fk)

[email protected] -- p.11/30

Page 12: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Joint Model for Sequence-Expression

Complication: gene cluster identity cannot be assumed knownZ matrix not “fixed”

Conditional on cluster k, (k = 1, . . . ,K), vector of log-expressionvalues of gene g generated from mixed-effects model:

Yg|zg = k,X , parameters ∼ N(ξg +XTg βk1,σ2

kI) (≡ fk)

Unconditionally, a mixture model

P(Y |X , parameters) = ∏genes g

[∑

cluster kπk fk(Yg| parameters)

]

Which β’s significant, in which [email protected] -- p.12/30

Page 13: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Bayesian hierarchical formulation

Prior distributions for cluster k parameters:“Expression” model:

µk ∼ N(·,v2k0I)

σ2k ∼ InvGamma(·, ·)

ξg|zg = k ∼ N(µk,τ0σ2kI)

“Sequence” model: βk ∼ N(β0,V k)Probabilities of cluster membership:P(zg = k) = πk

(π1, . . . ,πK) ∼ Dirichlet(α1, . . . ,αK)

G genes; T measurements per gene

[email protected] -- p.13/30

Page 14: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Bayesian hierarchical formulation

For each βk, we use multivariate extension of g-prior (Zellner,1986), so that

V k =cσ2

k

TS−1

k , where Sk = ∑zg=k

XgXTg

Why use g-prior?

Computational efficiency, varying c −→ more/less informative

Induces dependence among genes in a cluster due tosequence effects

Cov(Yg,Yg′|Zg,Zg′ = k) = v2k0I +

cσ2k

T1[XT

g S−1k Xg

]1T

[email protected] -- p.14/30

Page 15: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Regulatory Motif Model

· · ·θ0 θ0 θ1 · · · θ6 θ0 θ0 · · ·

Every non-siteposition multinomialwithθ0 = (θ01, . . . ,θ04)

Every motif position imultinomial withθi = (θi1, . . . ,θi4)

Product Multinomial modelChallenge: Find position of sites and θ’s

MEME (Bailey, ISMB 1994) Gibbs Motif Sampler (Liu, JASA 1995), AlignAce (Roth, Nat. BioTech 1998),

BioProspector (Liu, Pac. Symp. Biocomp. 2001), Stochastic Dictionary (Gupta, JASA 2003)

[email protected] -- p.15/30

Page 16: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Sequence Motif Scoring

Starting set of motifs

De-novo: Different clusters of genes exhibiting “strong” up- ordown-regulation (MDScan)

Databases: Derived from experimental data

Motif score for w-width motif j and upstream sequence g:

Xg j = ∑positions i

P(seq(i, i+w−1)|motif j)P(seq(i, i+w−1)| null)

[email protected] -- p.16/30

Page 17: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Sequence motif selectionInitial set of D motif candidates (D can be large!)In regression model, want to know which motifs correlated“significantly” with response (gene expression)

u = (u1, . . . ,uD) where

u j =

{1 if motif j is in model

0 otherwise.

Prior probability of motif inclusion

P(u) =D

∏j=1

ηu j(1−η)1−u j

Variable selection from LARGE potential [email protected] -- p.17/30

Page 18: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Outline of method

select motifs from

given motifs update clusters

update model parametersgiven cluster membership

and functional motifs ineach class

large initial set

active in each class

Complications

Cluster membership z unknown

Number of clusters K may be unknown (assume fixed for now)

Number of motif candidates is large

[email protected] -- p.18/30

Page 19: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Parameter updating using MCMC

Update from joint posterior distribution

P(θ,β,u,z|Y ,X ,K)

θ = (µ,σ2,π)

For updating steps for parameters θ,β and z, marginalize overother parameters for efficiency (Conjugate forms permit this)

For updating u, use evolutionary Monte Carlo (Liang andWong, JASA 2001)

Select motifs that have most effect on expression, and differentiatemost among clusters

[email protected] -- p.19/30

Page 20: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Three simulated data sets

Regression coeffs. for motifs cor-

responding to 3 PSWMs from

JASPAR database SAP1, SRF,

amd MEF2

200 “genes” in K = 2 clusters

2 measurements each

Data 1 Data 2 Data 3

Coef. C1 C2 C1 C2 C1 C2

βM1 2 0 2 0 2 -2

βM2 0 2 0 -2 0 0

βM3 0 0 0 0 0 0

(Motif 3 not present in data)

+

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

++

++

++

+

+

+

+

++

+

+

+

++

+

++++

+

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

+

+

o

o

o

o

o o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o o

oo

oo

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o o o

ooo

o o

o

o

o

o

o

o

ooo

o

o

oo

o

o

o

o

o

o

o

o

o

o

−1 0 1 2

01

23

4

x11

+

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

++

++

++

+

+

+

+

++

+

+

+

++

+

+++

++

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+ +

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

+

+

o

o

o

o

oo

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

oo

oo

o o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

ooo

ooo

oo

o

o

o

o

o

o

ooo

o

o

oo

o

o

o

o

o

o

o

o

o

o

−1.0 −0.5 0.0 0.5 1.0 1.5 2.0

01

23

4

x12

Y1 +

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

++

++

++

+

+

+

+

++

+

+

+

++

+

+++

++

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

+

+

o

o

o

o

o o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o o

oo

o o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o o o

ooo

o o

o

o

o

o

o

o

o oo

o

o

oo

o

o

o

o

o

o

o

o

o

o

−1.0 −0.5 0.0 0.5 1.0

01

23

4

x13

Y1

++

+

+

++

+

+ ++

++

+

+

++

+

+

+

+

+

+

+

+

+++

+

+

+

+

+

+

++

+

+

++

+

++

+ ++

+

+

+

++

+

+

++

+

+++

+

+

+

++

+

++

++

+

+

+

+++

+

+

++

+

+++

+

+

+

+

+

++

+

+

+

+

++

+

+

+

++

o

oo

o

o

oo

oo

o

o

o

o

o

o

o o

o

o

oo

o oo

o

oo

o

oo

oo

o

o

ooooo

oo

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o o

o

o

oo

o

o

o

o

oooo

o

o

oo

o

o

o

o

oo

o

o oo

o

o

o

o

o

o

o

o

o

o

o

o

−1 0 1 2

−10

12

34

5

x31

++

+

+

++

+

+ ++

++

+

+

++

+

+

+

+

+

+

+

+

+ ++

+

+

+

+

+

+

++

+

+

++

+

++

+ + +

+

+

+

++

+

+

+ +

+

++++

+

+

++

+

+ +

++

+

+

+

++ +

+

+

+ +

+

++ +

+

+

+

+

+

+++

+

+

+

++

+

+

+

++

o

oo

o

o

oo

oo

o

o

o

o

o

o

oo

o

o

oo

ooo

o

oo

o

oo

oo

o

o

oo ooo

oo

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

oo

o

o

oo

o

o

o

o

oooo

o

o

oo

o

o

o

o

oo

o

ooo

o

o

o

o

o

o

o

o

o

o

o

o

−1.0 −0.5 0.0 0.5 1.0 1.5 2.0

−10

12

34

5

x32

Y3

++

+

+

++

+

+ ++

+ +

+

+

++

+

+

+

+

+

+

+

+

+++

+

+

+

+

+

+

++

+

+

++

+

+ +

+++

+

+

+

++

+

+

+ +

+

+ ++

+

+

+

++

+

++

++

+

+

+

+++

+

+

+ +

+

++ +

+

+

+

+

+

++

+

+

+

+

++

+

+

+

++

o

oo

o

o

oo

oo

o

o

o

o

o

o

o o

o

o

oo

ooo

o

oo

o

o o

oo

o

o

o oo oo

oo

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o o

o

o

oo

o

o

o

o

o oo o

o

o

o o

o

o

o

o

oo

o

o oo

o

o

o

o

o

o

o

o

o

o

o

o

−1.0 −0.5 0.0 0.5 1.0

−10

12

34

5

x33

Y3

++

++

+

++

+

+

+

+

+

++

+

+

+

++

+

++

+

+

+

+

++

+

+

+

+++

+

+ +

+

+

++

+

+

+

++

+

+

++

+

+

+

+

++

+

+

+

++

++

+

+

+ +

+

++

+

+

++

+ ++

+

++

+

++

+

++

+

++

+

+

++

+

+

+

+

+

+

+

o

o

o

o

o

o

o

oo

o

o

o

o

ooo

o

o

oo o

o

oo

o

o

o

o

oo

ooo

o

o

oo

o

oo

oo

o

o

o

oo

o

o oo

o

o

o

o

o

o

o o

o

o

o

oo

o

oo

ooo

o

oo

o

oo

oo

o

o

o

o

o o

o

o

oo

o

o

o

o

ooooo

oo

o

−1.0 −0.5 0.0 0.5 1.0 1.5 2.0 2.5

−20

24

++

++

+

++

+

+

+

+

+

+ +

+

+

+

++

+

+ +

+

+

+

+

++

+

+

+

+++

+

++

+

+

++

+

+

+

++

+

+

++

+

+

+

+

++

+

+

+

++

++

+

+

++

+

++

+

+

++

+ ++

+

++

+

+ +

+

++

+

++

+

+

++

+

+

+

+

+

+

+

o

o

o

o

o

o

o

oo

o

o

o

o

ooo

o

o

oo o

o

oo

o

o

o

o

oo

ooo

o

o

oo

o

oo

oo

o

o

o

oo

o

o oo

o

o

o

o

o

o

o o

o

o

o

oo

o

oo

ooo

o

oo

o

oo

o o

o

o

o

o

oo

o

o

oo

o

o

o

o

o oooo

o o

o

−0.5 0.0 0.5 1.0 1.5 2.0

−20

24

Y2

++

++

+

++

+

+

+

+

+

++

+

+

+

++

+

+ +

+

+

+

+

++

+

+

+

+ ++

+

+ +

+

+

++

+

+

+

++

+

+

++

+

+

+

+

++

+

+

+

++

++

+

+

++

+

+ +

+

+

+++ +++

++

+

++

+

++

+

+ +

+

+

++

+

+

+

+

+

+

+

o

o

o

o

o

o

o

oo

o

o

o

o

ooo

o

o

oo o

o

oo

o

o

o

o

o o

ooo

o

o

oo

o

oo

oo

o

o

o

oo

o

o oo

o

o

o

o

o

o

oo

o

o

o

oo

o

oo

oo o

o

oo

o

oo

o o

o

o

o

o

oo

o

o

oo

o

o

o

o

o ooo

o

o o

o

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−20

24

Y2

S1 S2 S3Y

(3)

1Y

(2)

1Y

(1)

1

Y1 with motif 1,2,3 scores (columns)

in 3 data sets (rows)

[email protected] -- p.20/30

Page 21: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Simulation study results: Bayes factors

Data 1 Data 2 Data 3

1 2 3 4 5 6

−3

00

−2

50

−2

00

PS

fragreplacem

ents

K

P(M

K)

1 2 3 4 5 6

−7

00

−6

00

−5

00

−4

00

data3

PS

fragreplacem

ents

K

P(M

K)

1 2 3 4 5 6

−7

00

−6

00

−5

00

−4

00

−3

00

data2

PS

fragreplacem

ents

K

P(M

K)

Optimal choice: K = 2

Marginal model probability for MK through Double mixtureimportance sampling

P(Y |MK)=̂1

NtNs∑

t∑

sP(Y |Z(t)

,θ(s),K)

π1(θ(s)|z(t))

f1(θ(s)|z(t))

π2(z(t))

f2(z(t))

[email protected] -- p.21/30

Page 22: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Simulation: β estimates

Data 1 Data 2 Data 3

1 2

−0

.50

.00

.51

.01

.52

.0

1 2

−0

.50

.00

.51

.01

.52

.0

PS

fragreplacem

ents

motif type

coef

ficie

nt

c1c2

1 2

−3

−2

−1

01

2

1 2

−3

−2

−1

01

2

PS

fragreplacem

entsmotif type

coef

ficie

nt

c1c2

−2

−1

01

2−

2−

10

12

PS

fragreplacem

ents

motif type

coef

ficie

nt c1c2

Data sets 1 and 2: Motifs 1,2 selectedData set 3: Motif 1 selected

[email protected] -- p.22/30

Page 23: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Spellman (1998) data

28 63 98

−2−1

01

23

G1G2/MM/G1SS/G2

PS

fragreplacem

ents

gene

expr

essi

on

time (minutes)

Pick different groups of genes that are highly differentially expressed

Sets of motif candidates of widths 7-12 bp (using MDscan)

Motif overlaps lead to collinearity: remove motifs with correlation > 0.5 with a

higher-ranked one → 32 candidate motifs

Two consecutive time points: 1-2, 3-4, . . ., 17-18

[email protected] -- p.23/30

Page 24: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Number of clusters K∗

log(B̂F) compared to 1-component model

22 2

2

2

2 2

2 2−2

00

2040

time interval

log(

BF)

1 2 3 4 5 6 7 8 9

3

3

3

3

3

3

3

3 3

Interval 1 2 3 4 5 6 7 8 9K∗ 1 3 1 2 2 2 2 1 1

0

1

2

bits

1

TA

C

2

T

AG

3

AGC

4

A

CG

5

TA

6

TA

7

TA

8

G

CTA

9

GTA

0

1

2

bits

1

TA

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

0

1

2

bits

1

TA

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

0

1

2

bits

1

GTA

2

CGTA

3

GTC

4GTA

5CTGA

6

GTC

7

CGTA

0

1

2

bits

1

TA

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

0

1

2

bits

1

C

TAG

2

C

TAG

3

TCGA

4

T

GAC

5

G

TCA

6

CATG

7

CTGA

0

1

2

bits

1

TGAC

2

TCAG

3

TGCA

4

GCTA

5

T

CAG

6

GCTA

7

TCAG

0

1

2

bits

1

G

TAC

2

AT

3

GCTA

4

CTA

5

AT

6

A

CT

7

GCTA

0

1

2

bits

1

TA

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

SCB MCB MCB SFF MCB MCM1 MCM1 MCB

0

1

2

bits

1

GTA

2

CGTA

3

GTC

4

GTA

5

CTGA

6

GTC

7

CGTA

0

1

2bi

ts

1

GTA

2

CGTA

3GTC

4

GTA

5

CTGA

6

GTC

7

CGTA

0

1

2

bits

1

T

A

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

0

1

2

bits

1

T

A

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

SFF SFF MCB MCB

0

1

2

bits

1GTA

2CGTA

3

GTC

4

GTA

5

CTGA

6

GTC

7

CGTA

SFFSignificant motifs over time [email protected] -- p.24/30

Page 25: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Motif influence at different time intervalsMotif index −→

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

Post

erio

rpro

b.of

sele

ctio

n

Selected motif types for 9 time intervals for optimal K

[email protected] -- p.25/30

Page 26: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Significant motifs match experimental PSWMs

Index TF name Consensus Expt. Phase4 MCM1 CGAAGAG/CTCTTCG CCNNNWWRGG M6 GCN1 TCAGTCA/TGACTGA TCAGTCA7 [CSRE] GGACAGA/TCTGTCC [YCGGAYRRAWGG]

10 MCB ACGCGTA/TACGCGT WCGCGW G116 SFF AACAACA/TGTTGTT GTMAACAA M18 MCM1 CCAATTAGG/CCTAATTGG CCNNNWWRGG M20 [RME1] TTCAGGTAC/GTACCTGAA [GAACCTCAA]22 SCB CGCGAAAAA/TTTTTCGCG CNCGAAA G125 PHO4 CGTACGTAC/GTACGTACG CACGTK29 − CTTCGCATC/GATGCGAAG

K ≡ {G or T} ; M ≡ {A or C} ; N ≡ {A or C or G or T} ; R ≡ {A or G} ; W ≡ {A or T} ; Y ≡ {C or T}

[email protected] -- p.26/30

Page 27: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Summary

Treating gene expression clustering as a variable may help indiscovering relationships between functional sequence motifs, andgroups of genes they regulate

Different groups of genes may behave as a cluster at differenttime points

Upstream sequence motifs “constant” but effects/interactionsover time may vary

[email protected] -- p.27/30

Page 28: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Further extensions

Motif scoring issues: sensitivity, co-occurrence of sites

Efficient model selection

Extension to high density ChIP tiling arrays

Acknowledgement:Joseph G. Ibrahim (UNC), Jason Lieb (UNC)

UNC high-performance scientific computing group

[email protected] -- p.28/30

Page 29: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Model Selection: number of clusters K

Likelihood-based methods not valid (BIC, etc.)

Bayes factor: ratio of model marginal probabilitiesMarginal probability for model MK

P(Y |MK) = ∑z

θP(Y |Z,K)p(θ|z)p(Z|K)dθ

Double mixture importance sampling

P(Y |MK)=̂1

NtNs∑

t∑

sP(Y |Z(t)

,θ(s),K)

π1(θ(s)|z(t))

f1(θ(s)|z(t))

π2(z(t))

f2(z(t))

[email protected] -- p.29/30

Page 30: University of North Carolina at Chapel Hillpeople.bu.edu/gupta/papers/talks/wnar05-regmix.pdfgupta@bios.unc.edu University of North Carolina at Chapel Hill gupta@bios.unc.edu -- p.1/30

Model Selection: Number of clusters

Double mixture importance sampling

P(Y |MK)=̂1

NtNs∑

t∑

sP(Y |Z(t)

,θ(s),K)

π1(θ(s)|z(t))

f1(θ(s)|z(t))

π2(z(t))

f2(z(t))

Challenge: Good sampling densities f (·) for z and θ

For a simpler case (θ marginalized), Raftery et al (TR, 2003)propose permutation-based methods to find good candidatesampling densities for z

[email protected] -- p.30/30