university of north carolina at chapel...

A hierarchical regression mixture modelfor inferring gene regulatory networks

Mayetri [email protected]

University of North Carolina at Chapel Hill

[email protected] -- p.1/30

Upstream regulation ↔ Downstream expression

Fundamental question: how can we understand the biologicalmechanisms leading to disease?

...gtggtTAGAATagcgactgttttt... gene 1

...taggTATAATacagtctgacaaaa... gene 2

...cagcaacattgaTATAATtgccat... gene 3

...ctaaaacaatTATTATttatcagg... gene 4

0

1

2

bits | 1 T 2 A 3 GT 4 TA 5 GCA 6 T|

Co-regulated genes sharesimilar upstream patterns

Identify genes that are differentially expressed under differenttreatments or conditions


Gene Regulation: DNA Motifs

Proteins bind to DNA to activate gene transcription

0

1

2

bits |

1 T 2 A 3 GT

4 TA 5 GCA 6 T|Position specific weight

matrix (PSWM)

or Motif


Gene regulation in complex genomes

Harder problem: many transcription factors working inco-ordination

LARGE sequence search space: using sequence data only→ many false positives?


Upstream regulation ↔ Downstream expression

Gene expression contains information about sequence motifs

Sequence may contain information on gene co-regulation

Expression clustering→ Motif discovery

or

Expression clustering↔ Motif discovery ?

What if initial clustering is inaccurate?


Cell-cycle data set (Spellman, Mol Biol Cell. 1998)

28 63 98

−2−1

01

23

G1G2/MM/G1SS/G2

PS

fragreplacem

ents

gene

expr

essi

on

time (minutes)

Measurements over 18 time points, 3 different experiments∼ 800 genes known to be cell-cycle dependent

Do clusters of genes share common TFs?Do certain TFs work combinatorially on groups of genes?

color: time when gene is active


Using Gene Expression in Motif Discovery

REDUCER (Bussemaker, Nat. Genet. 2001) correlatesexpression of gene with number of motif occurrences

MDScan (Liu, Nat Biotech. 2002) Most strongly differentiallyexpressed genes → candidate motifs.

Motif Regressor (Conlon, PNAS 2003)Multiple regression model: Sum of motif effects explainsgene expression

Yg = α+M

∑m=1

βmSmg + εg

Yg: expression of gene g; Smg: motif-match score


Using Gene Expression in Motif Discovery

Non-parametric approaches

Phuong et al. (Bioinformatics, 2004): Classification Trees(CART)

Arbitrary decision criterion based on the number ofoccurrences of a motif type

Multivariate adaptive regression splines (Das et al, PNAS2004)

Joint sequence-expression model without parametric connection

Holmes and Bruno (ISMB proc.,2000) joint likelihood forsequence and expression data


Using motif information in gene clustering

Infer sets of transcription factors involved in regulating groups ofgenes

Higher transcriptional activity → greater presence of TFbinding sites, more pronounced expression changes

Genes within a “cluster” may be correlated, with or withoutsharing common transcription factors

Measurements on the same gene in different conditions maybe correlated due to sharing the same upstream transcriptionfactor binding sites


Linear mixed effects model

y = Xβ+Zb+ ε

β: fixed effectsb: random effects

ε ∼ N(0,τ−10 I)

y: gene expression

Fixed effects: Sequence MotifLevels of factor are reproduced exactly if experiment isrepeated

Random effects: Expression clusterLevels of factor (expression + clusters) may not bereproduced exactly if experiment is repeated


Joint Model for Sequence-Expression

Complication: gene cluster identity cannot be assumed knownZ matrix not “fixed”

Conditional on cluster k, (k = 1, . . . ,K), vector of log-expressionvalues of gene g generated from mixed-effects model:

Yg|zg = k,X , parameters ∼ N(ξg +XTg βk1,σ2

kI) (≡ fk)


Joint Model for Sequence-Expression

Complication: gene cluster identity cannot be assumed knownZ matrix not “fixed”

Conditional on cluster k, (k = 1, . . . ,K), vector of log-expressionvalues of gene g generated from mixed-effects model:

Yg|zg = k,X , parameters ∼ N(ξg +XTg βk1,σ2

kI) (≡ fk)

Unconditionally, a mixture model

P(Y |X , parameters) = ∏genes g

[∑

cluster kπk fk(Yg| parameters)

]

Which β’s significant, in which [email protected] -- p.12/30

Bayesian hierarchical formulation

Prior distributions for cluster k parameters:“Expression” model:

µk ∼ N(·,v2k0I)

σ2k ∼ InvGamma(·, ·)

ξg|zg = k ∼ N(µk,τ0σ2kI)

“Sequence” model: βk ∼ N(β0,V k)Probabilities of cluster membership:P(zg = k) = πk

(π1, . . . ,πK) ∼ Dirichlet(α1, . . . ,αK)

G genes; T measurements per gene


Bayesian hierarchical formulation

For each βk, we use multivariate extension of g-prior (Zellner,1986), so that

V k =cσ2

k

TS−1

k , where Sk = ∑zg=k

XgXTg

Why use g-prior?

Computational efficiency, varying c −→ more/less informative

Induces dependence among genes in a cluster due tosequence effects

Cov(Yg,Yg′|Zg,Zg′ = k) = v2k0I +

cσ2k

T1[XT

g S−1k Xg

]1T


Regulatory Motif Model

· · ·θ0 θ0 θ1 · · · θ6 θ0 θ0 · · ·

Every non-siteposition multinomialwithθ0 = (θ01, . . . ,θ04)

Every motif position imultinomial withθi = (θi1, . . . ,θi4)

Product Multinomial modelChallenge: Find position of sites and θ’s

MEME (Bailey, ISMB 1994) Gibbs Motif Sampler (Liu, JASA 1995), AlignAce (Roth, Nat. BioTech 1998),

BioProspector (Liu, Pac. Symp. Biocomp. 2001), Stochastic Dictionary (Gupta, JASA 2003)


Sequence Motif Scoring

Starting set of motifs

De-novo: Different clusters of genes exhibiting “strong” up- ordown-regulation (MDScan)

Databases: Derived from experimental data

Motif score for w-width motif j and upstream sequence g:

Xg j = ∑positions i

P(seq(i, i+w−1)|motif j)P(seq(i, i+w−1)| null)


Sequence motif selectionInitial set of D motif candidates (D can be large!)In regression model, want to know which motifs correlated“significantly” with response (gene expression)

u = (u1, . . . ,uD) where

u j =

{1 if motif j is in model

0 otherwise.

Prior probability of motif inclusion

P(u) =D

∏j=1

ηu j(1−η)1−u j

Variable selection from LARGE potential [email protected] -- p.17/30

Outline of method

select motifs from

given motifs update clusters

update model parametersgiven cluster membership

and functional motifs ineach class

large initial set

active in each class

Complications

Cluster membership z unknown

Number of clusters K may be unknown (assume fixed for now)

Number of motif candidates is large


Parameter updating using MCMC

Update from joint posterior distribution

P(θ,β,u,z|Y ,X ,K)

θ = (µ,σ2,π)

For updating steps for parameters θ,β and z, marginalize overother parameters for efficiency (Conjugate forms permit this)

For updating u, use evolutionary Monte Carlo (Liang andWong, JASA 2001)

Select motifs that have most effect on expression, and differentiatemost among clusters


Three simulated data sets

Regression coeffs. for motifs cor-

responding to 3 PSWMs from

JASPAR database SAP1, SRF,

amd MEF2

200 “genes” in K = 2 clusters

2 measurements each

Data 1 Data 2 Data 3

Coef. C1 C2 C1 C2 C1 C2

βM1 2 0 2 0 2 -2

βM2 0 2 0 -2 0 0

βM3 0 0 0 0 0 0

(Motif 3 not present in data)

+

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

++

++

++

+

+

+

+

++

+

+

+

++

+

++++

+

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

+

+

o

o

o

o

o o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o o

oo

oo

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o o o

ooo

o o

o

o

o

o

o

o

ooo

o

o

oo

o

o

o

o

o

o

o

o

o

o

−1 0 1 2

01

23

4

x11

+

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

++

++

++

+

+

+

+

++

+

+

+

++

+

+++

++

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+ +

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

+

+

o

o

o

o

oo

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

oo

oo

o o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

ooo

ooo

oo

o

o

o

o

o

o

ooo

o

o

oo

o

o

o

o

o

o

o

o

o

o

−1.0 −0.5 0.0 0.5 1.0 1.5 2.0

01

23

4

x12

Y1 +

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

++

++

++

+

+

+

+

++

+

+

+

++

+

+++

++

+

+

+

+

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

++

+

+

+

+

+

+

+

+

+

+

+

+

o

o

o

o

o o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o o

oo

o o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o o o

ooo

o o

o

o

o

o

o

o

o oo

o

o

oo

o

o

o

o

o

o

o

o

o

o

−1.0 −0.5 0.0 0.5 1.0

01

23

4

x13

Y1

++

+

+

++

+

+ ++

++

+

+

++

+

+

+

+

+

+

+

+

+++

+

+

+

+

+

+

++

+

+

++

+

++

+ ++

+

+

+

++

+

+

++

+

+++

+

+

+

++

+

++

++

+

+

+

+++

+

+

++

+

+++

+

+

+

+

+

++

+

+

+

+

++

+

+

+

++

o

oo

o

o

oo

oo

o

o

o

o

o

o

o o

o

o

oo

o oo

o

oo

o

oo

oo

o

o

ooooo

oo

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o o

o

o

oo

o

o

o

o

oooo

o

o

oo

o

o

o

o

oo

o

o oo

o

o

o

o

o

o

o

o

o

o

o

o

−1 0 1 2

−10

12

34

5

x31

++

+

+

++

+

+ ++

++

+

+

++

+

+

+

+

+

+

+

+

+ ++

+

+

+

+

+

+

++

+

+

++

+

++

+ + +

+

+

+

++

+

+

+ +

+

++++

+

+

++

+

+ +

++

+

+

+

++ +

+

+

+ +

+

++ +

+

+

+

+

+

+++

+

+

+

++

+

+

+

++

o

oo

o

o

oo

oo

o

o

o

o

o

o

oo

o

o

oo

ooo

o

oo

o

oo

oo

o

o

oo ooo

oo

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

oo

o

o

oo

o

o

o

o

oooo

o

o

oo

o

o

o

o

oo

o

ooo

o

o

o

o

o

o

o

o

o

o

o

o

−1.0 −0.5 0.0 0.5 1.0 1.5 2.0

−10

12

34

5

x32

Y3

++

+

+

++

+

+ ++

+ +

+

+

++

+

+

+

+

+

+

+

+

+++

+

+

+

+

+

+

++

+

+

++

+

+ +

+++

+

+

+

++

+

+

+ +

+

+ ++

+

+

+

++

+

++

++

+

+

+

+++

+

+

+ +

+

++ +

+

+

+

+

+

++

+

+

+

+

++

+

+

+

++

o

oo

o

o

oo

oo

o

o

o

o

o

o

o o

o

o

oo

ooo

o

oo

o

o o

oo

o

o

o oo oo

oo

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o o

o

o

oo

o

o

o

o

o oo o

o

o

o o

o

o

o

o

oo

o

o oo

o

o

o

o

o

o

o

o

o

o

o

o

−1.0 −0.5 0.0 0.5 1.0

−10

12

34

5

x33

Y3

++

++

+

++

+

+

+

+

+

++

+

+

+

++

+

++

+

+

+

+

++

+

+

+

+++

+

+ +

+

+

++

+

+

+

++

+

+

++

+

+

+

+

++

+

+

+

++

++

+

+

+ +

+

++

+

+

++

+ ++

+

++

+

++

+

++

+

++

+

+

++

+

+

+

+

+

+

+

o

o

o

o

o

o

o

oo

o

o

o

o

ooo

o

o

oo o

o

oo

o

o

o

o

oo

ooo

o

o

oo

o

oo

oo

o

o

o

oo

o

o oo

o

o

o

o

o

o

o o

o

o

o

oo

o

oo

ooo

o

oo

o

oo

oo

o

o

o

o

o o

o

o

oo

o

o

o

o

ooooo

oo

o

−1.0 −0.5 0.0 0.5 1.0 1.5 2.0 2.5

−20

24

++

++

+

++

+

+

+

+

+

+ +

+

+

+

++

+

+ +

+

+

+

+

++

+

+

+

+++

+

++

+

+

++

+

+

+

++

+

+

++

+

+

+

+

++

+

+

+

++

++

+

+

++

+

++

+

+

++

+ ++

+

++

+

+ +

+

++

+

++

+

+

++

+

+

+

+

+

+

+

o

o

o

o

o

o

o

oo

o

o

o

o

ooo

o

o

oo o

o

oo

o

o

o

o

oo

ooo

o

o

oo

o

oo

oo

o

o

o

oo

o

o oo

o

o

o

o

o

o

o o

o

o

o

oo

o

oo

ooo

o

oo

o

oo

o o

o

o

o

o

oo

o

o

oo

o

o

o

o

o oooo

o o

o

−0.5 0.0 0.5 1.0 1.5 2.0

−20

24

Y2

++

++

+

++

+

+

+

+

+

++

+

+

+

++

+

+ +

+

+

+

+

++

+

+

+

+ ++

+

+ +

+

+

++

+

+

+

++

+

+

++

+

+

+

+

++

+

+

+

++

++

+

+

++

+

+ +

+

+

+++ +++

++

+

++

+

++

+

+ +

+

+

++

+

+

+

+

+

+

+

o

o

o

o

o

o

o

oo

o

o

o

o

ooo

o

o

oo o

o

oo

o

o

o

o

o o

ooo

o

o

oo

o

oo

oo

o

o

o

oo

o

o oo

o

o

o

o

o

o

oo

o

o

o

oo

o

oo

oo o

o

oo

o

oo

o o

o

o

o

o

oo

o

o

oo

o

o

o

o

o ooo

o

o o

o

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−20

24

Y2

S1 S2 S3Y

(3)

1Y

(2)

1Y

(1)

1

Y1 with motif 1,2,3 scores (columns)

in 3 data sets (rows)


Simulation study results: Bayes factors


1 2 3 4 5 6

−3

00

−2

50

−2

00

PS

fragreplacem

ents

K

P(M

K)

1 2 3 4 5 6

−7

00

−6

00

−5

00

−4

00

data3

PS

fragreplacem

ents

K

P(M

K)

1 2 3 4 5 6

−7

00

−6

00

−5

00

−4

00

−3

00

data2

PS

fragreplacem

ents

K

P(M

K)

Optimal choice: K = 2

Marginal model probability for MK through Double mixtureimportance sampling

P(Y |MK)=̂1

NtNs∑

t∑

sP(Y |Z(t)

,θ(s),K)

π1(θ(s)|z(t))

f1(θ(s)|z(t))

π2(z(t))

f2(z(t))


Simulation: β estimates


1 2

−0

.50

.00

.51

.01

.52

.0

1 2

−0

.50

.00

.51

.01

.52

.0

PS

fragreplacem

ents

motif type

coef

ficie

nt

c1c2

1 2

−3

−2

−1

01

2

1 2

−3

−2

−1

01

2

PS

fragreplacem

entsmotif type

coef

ficie

nt

c1c2

−2

−1

01

2−

2−

10

12

PS

fragreplacem

ents

motif type

coef

ficie

nt c1c2

Data sets 1 and 2: Motifs 1,2 selectedData set 3: Motif 1 selected


Spellman (1998) data

28 63 98

−2−1

01

23

G1G2/MM/G1SS/G2

PS

fragreplacem

ents

gene

expr

essi

on

time (minutes)

Pick different groups of genes that are highly differentially expressed

Sets of motif candidates of widths 7-12 bp (using MDscan)

Motif overlaps lead to collinearity: remove motifs with correlation > 0.5 with a

higher-ranked one → 32 candidate motifs

Two consecutive time points: 1-2, 3-4, . . ., 17-18


Number of clusters K∗

log(B̂F) compared to 1-component model

22 2

2

2

2 2

2 2−2

00

2040

time interval

log(

BF)

1 2 3 4 5 6 7 8 9

3

3

3

3

3

3

3

3 3

Interval 1 2 3 4 5 6 7 8 9K∗ 1 3 1 2 2 2 2 1 1

0

1

2

bits

1

TA

C

2

T

AG

3

AGC

4

A

CG

5

TA

6

TA

7

TA

8

G

CTA

9

GTA

0

1

2

bits

1

TA

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

0

1

2

bits

1

TA

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

0

1

2

bits

1

GTA

2

CGTA

3

GTC

4GTA

5CTGA

6

GTC

7

CGTA

0

1

2

bits

1

TA

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

0

1

2

bits

1

C

TAG

2

C

TAG

3

TCGA

4

T

GAC

5

G

TCA

6

CATG

7

CTGA

0

1

2

bits

1

TGAC

2

TCAG

3

TGCA

4

GCTA

5

T

CAG

6

GCTA

7

TCAG

0

1

2

bits

1

G

TAC

2

AT

3

GCTA

4

CTA

5

AT

6

A

CT

7

GCTA

0

1

2

bits

1

TA

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

SCB MCB MCB SFF MCB MCM1 MCM1 MCB

0

1

2

bits

1

GTA

2

CGTA

3

GTC

4

GTA

5

CTGA

6

GTC

7

CGTA

0

1

2bi

ts

1

GTA

2

CGTA

3GTC

4

GTA

5

CTGA

6

GTC

7

CGTA

0

1

2

bits

1

T

A

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

0

1

2

bits

1

T

A

2

A

C

3

T

AG

4

A

TC

5

A

G

6

AT

7

CTA

SFF SFF MCB MCB

0

1

2

bits

1GTA

2CGTA

3

GTC

4

GTA

5

CTGA

6

GTC

7

CGTA

SFFSignificant motifs over time [email protected] -- p.24/30

Motif influence at different time intervalsMotif index −→

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

PS

fragreplacem

ents

Post

erio

rpro

b.of

sele

ctio

n

Selected motif types for 9 time intervals for optimal K


Significant motifs match experimental PSWMs

Index TF name Consensus Expt. Phase4 MCM1 CGAAGAG/CTCTTCG CCNNNWWRGG M6 GCN1 TCAGTCA/TGACTGA TCAGTCA7 [CSRE] GGACAGA/TCTGTCC [YCGGAYRRAWGG]

10 MCB ACGCGTA/TACGCGT WCGCGW G116 SFF AACAACA/TGTTGTT GTMAACAA M18 MCM1 CCAATTAGG/CCTAATTGG CCNNNWWRGG M20 [RME1] TTCAGGTAC/GTACCTGAA [GAACCTCAA]22 SCB CGCGAAAAA/TTTTTCGCG CNCGAAA G125 PHO4 CGTACGTAC/GTACGTACG CACGTK29 − CTTCGCATC/GATGCGAAG

K ≡ {G or T} ; M ≡ {A or C} ; N ≡ {A or C or G or T} ; R ≡ {A or G} ; W ≡ {A or T} ; Y ≡ {C or T}


Summary

Treating gene expression clustering as a variable may help indiscovering relationships between functional sequence motifs, andgroups of genes they regulate

Different groups of genes may behave as a cluster at differenttime points

Upstream sequence motifs “constant” but effects/interactionsover time may vary


Further extensions

Motif scoring issues: sensitivity, co-occurrence of sites

Efficient model selection

Extension to high density ChIP tiling arrays

Acknowledgement:Joseph G. Ibrahim (UNC), Jason Lieb (UNC)

UNC high-performance scientific computing group


Model Selection: Number of clusters

Double mixture importance sampling

P(Y |MK)=̂1

NtNs∑

t∑

sP(Y |Z(t)

,θ(s),K)

π1(θ(s)|z(t))

f1(θ(s)|z(t))

π2(z(t))

f2(z(t))

Challenge: Good sampling densities f (·) for z and θ

For a simpler case (θ marginalized), Raftery et al (TR, 2003)propose permutation-based methods to find good candidatesampling densities for z


university of north carolina at chapel...

Documents