
A hierarchical regression mixture model for inferring gene regulatory networks

Mayetri Gupta (gupta@bios.unc.edu)

University of North Carolina at Chapel Hill

gupta@bios.unc.edu -- p.1/30

Upstream regulation ↔ Downstream expression

Fundamental question: how can we understand the biological mechanisms leading to disease?

...gtggtTAGAATagcgactgttttt... gene 1

...taggTATAATacagtctgacaaaa... gene 2

...cagcaacattgaTATAATtgccat... gene 3

...ctaaaacaatTATTATttatcagg... gene 4

[Figure: sequence logo (bits) of the shared motif, positions 1-6: T, A, G/T, T/A, G/C/A, T]

Co-regulated genes share similar upstream patterns

Identify genes that are differentially expressed under different treatments or conditions

Gene Regulation: DNA Motifs

Proteins bind to DNA to activate gene transcription

[Figure: sequence logo (bits), positions 1-6: T, A, G/T, T/A, G/C/A, T]

Position-specific weight matrix (PSWM), or motif
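As a rough illustration of what a PSWM holds, the sketch below tabulates column-wise base frequencies from the four upper-case sites shown for genes 1-4 (TAGAAT, TATAAT, TATAAT, TATTAT). The helper name `build_pswm` and the pseudocount of 0.5 are assumptions made for this example and are not taken from the talk.

```python
BASES = "ACGT"

def build_pswm(sites, pseudocount=0.5):
    """Column-wise base frequencies for equal-length aligned sites.

    Hypothetical helper: the pseudocount (an assumption) keeps bases that
    were never observed in a column at a nonzero probability.
    """
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    pswm = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i].upper()] += 1
        total = sum(counts.values())
        pswm.append({b: counts[b] / total for b in BASES})
    return pswm

# The four upper-case sites from the example upstream sequences.
sites = ["TAGAAT", "TATAAT", "TATAAT", "TATTAT"]
pswm = build_pswm(sites)

# Most probable base in each column.
consensus = "".join(max(col, key=col.get) for col in pswm)
```

Each entry of `pswm` is one motif position; a sequence logo is a graphical summary of the same columns.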

Gene regulation in complex genomes

Harder problem: many transcription factors working in co-ordination

LARGE sequence search space: using sequence data only → many false positives?

Upstream regulation ↔ Downstream expression

Gene expression contains information about sequence motifs

Sequence may contain information on gene co-regulation

Expression clustering → Motif discovery

Expression clustering ↔ Motif discovery ?

What if initial clustering is inaccurate?

Cell-cycle data set (Spellman, Mol Biol Cell. 1998)

[Figure: cell-cycle expression data against time (minutes); phase labels G1, S, S/G2, G2/M, M/G1]

Measurements over 18 time points, 3 different experiments

∼ 800 genes known to be cell-cycle dependent

Do clusters of genes share common TFs?

Do certain TFs work combinatorially on groups of genes?

color: time when gene is active

Using Gene Expression in Motif Discovery

REDUCE (Bussemaker, Nat. Genet. 2001) correlates expression of a gene with the number of motif occurrences

MDScan (Liu, Nat Biotech. 2002): most strongly differentially expressed genes → candidate motifs

Motif Regressor (Conlon, PNAS 2003): multiple regression model in which the sum of motif effects explains gene expression

$Y_g = \alpha + \sum_{m=1}^{M} \beta_m S_{mg} + \varepsilon_g$

$Y_g$: expression of gene $g$; $S_{mg}$: motif-match score
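The regression above can be sketched with ordinary least squares. This is a minimal illustration on noise-free toy data, not the Motif Regressor implementation; the function names `solve_linear` and `fit_motif_regression` and the toy scores are assumptions for the example.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_motif_regression(scores, y):
    """Least-squares fit of Y_g = alpha + sum_m beta_m S_mg.

    scores[g] = [S_1g, ..., S_Mg]; returns [alpha, beta_1, ..., beta_M].
    """
    design = [[1.0] + list(row) for row in scores]  # intercept column first
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yg for r, yg in zip(design, y)) for i in range(p)]
    return solve_linear(xtx, xty)  # normal equations

# Toy data (hypothetical): 5 genes, 2 motifs, alpha = 0.5, beta = (2, -1), no noise.
scores = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [0.5 + 2.0 * s1 - 1.0 * s2 for s1, s2 in scores]
coef = fit_motif_regression(scores, y)
```

With real expression data the error term is nonzero and the fitted coefficients are only estimates of the motif effects.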

Using Gene Expression in Motif Discovery

Non-parametric approaches

Phuong et al. (Bioinformatics, 2004): Classification Trees (CART)

Arbitrary decision criterion based on the number of occurrences of a motif type

Multivariate adaptive regression splines (Das et al., PNAS 2004)

Joint sequence-expression model without parametric connection

Holmes and Bruno (ISMB proc., 2000): joint likelihood for sequence and expression data

Using motif information in gene clustering

Infer sets of transcription factors involved in regulating groups of genes

Higher transcriptional activity → greater presence of TF binding sites, more pronounced expression changes

Genes within a “cluster” may be correlated, with or without sharing common transcription factors

Measurements on the same gene in different conditions may be correlated due to sharing the same upstream transcription factor binding sites

Linear mixed effects model

$y = X\beta + Zb + \varepsilon$

$\beta$: fixed effects; $b$: random effects

$\varepsilon \sim N(0, \tau_0^{-1} I)$

$y$: gene expression

Fixed effects: sequence motif. Levels of the factor are reproduced exactly if the experiment is repeated

Random effects: expression cluster. Levels of the factor (expression + clusters) may not be reproduced exactly if the experiment is repeated

Joint Model for Sequence-Expression

Complication: gene cluster identity cannot be assumed known; the Z matrix is not “fixed”

Conditional on cluster k (k = 1, . . . , K), the vector of log-expression values of gene g is generated from a mixed-effects model:

$Y_g \mid z_g = k, X, \text{parameters} \sim N(\xi_g + X_g^T \beta_k \mathbf{1}, \; \sigma_k^2 I) \quad (\equiv f_k)$


Unconditionally, a mixture model

$P(Y \mid X, \text{parameters}) = \prod_{\text{genes } g} \sum_{\text{clusters } k} \pi_k f_k(Y_g \mid \text{parameters})$

Which β’s are significant, and in which cluster?
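The mixture likelihood can be evaluated numerically as in the following sketch. It treats the gene effects ξ_g, the coefficients β_k, the variances σ_k² and the weights π_k as known inputs, which is a simplification made for illustration (in the model they are unknown and given priors). All function names and the toy numbers are hypothetical.

```python
import math

def log_normal_iso(y, mean, var):
    """Log density of N(mean, var * I) evaluated at the vector y."""
    t = len(y)
    sq = sum((a - m) ** 2 for a, m in zip(y, mean))
    return -0.5 * (t * math.log(2.0 * math.pi * var) + sq / var)

def gene_log_terms(y_g, xi_g, x_g, betas, variances, weights):
    """log(pi_k) + log f_k(Y_g) for each cluster k."""
    terms = []
    for beta_k, var_k, pi_k in zip(betas, variances, weights):
        shift = sum(x * b for x, b in zip(x_g, beta_k))  # X_g^T beta_k (scalar)
        mean = [xi + shift for xi in xi_g]               # xi_g + X_g^T beta_k * 1
        terms.append(math.log(pi_k) + log_normal_iso(y_g, mean, var_k))
    return terms

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def mixture_loglik(Y, xi, X, betas, variances, weights):
    """sum over genes g of log sum_k pi_k f_k(Y_g)."""
    return sum(
        log_sum_exp(gene_log_terms(y, s, x, betas, variances, weights))
        for y, s, x in zip(Y, xi, X)
    )

def membership_probs(y_g, xi_g, x_g, betas, variances, weights):
    """P(z_g = k | Y_g, parameters) for each cluster k."""
    terms = gene_log_terms(y_g, xi_g, x_g, betas, variances, weights)
    norm = log_sum_exp(terms)
    return [math.exp(t - norm) for t in terms]

# Toy example (hypothetical numbers): 2 genes, T = 2, one motif, K = 2.
Y = [[2.1, 1.9], [-0.1, 0.2]]
xi = [[0.0, 0.0], [0.0, 0.0]]
X = [[1.0], [1.0]]
betas = [[2.0], [0.0]]      # the motif matters in cluster 1 only
variances = [1.0, 1.0]
weights = [0.5, 0.5]

loglik = mixture_loglik(Y, xi, X, betas, variances, weights)
probs_gene0 = membership_probs(Y[0], xi[0], X[0], betas, variances, weights)
```

The membership probabilities are the quantities a sampler would use when updating the cluster labels z_g.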

Bayesian hierarchical formulation

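Prior distributions for cluster k parameters:

“Expression” model:

$\mu_k \sim N(\cdot, v_{k0}^2 I)$

$\sigma_k^2 \sim \text{InvGamma}(\cdot, \cdot)$

$\xi_g \mid z_g = k \sim N(\mu_k, \tau_0 \sigma_k^2 I)$

“Sequence” model: $\beta_k \sim N(\beta_0, V_k)$

Probabilities of cluster membership: $P(z_g = k) = \pi_k$

$(\pi_1, \ldots, \pi_K) \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_K)$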

G genes; T measurements per gene

Bayesian hierarchical formulation

For each $\beta_k$, we use a multivariate extension of the g-prior (Zellner, 1986), so that

$V_k = \frac{c \sigma_k^2}{T} S_k^{-1}$, where $S_k = \sum_{g: z_g = k} X_g X_g^T$

Why use g-prior?

Computational efficiency; varying c → more/less informative

Induces dependence among genes in a cluster due to sequence effects

$\mathrm{Cov}(Y_g, Y_{g'} \mid z_g = z_{g'} = k) = v_{k0}^2 I + \frac{c \sigma_k^2}{T} X_g^T S_k^{-1} X_{g'} \mathbf{1}\mathbf{1}^T$

Regulatory Motif Model

· · ·θ0 θ0 θ1 · · · θ6 θ0 θ0 · · ·

Every non-site position: multinomial with $\theta_0 = (\theta_{01}, \ldots, \theta_{04})$

Every motif position i: multinomial with $\theta_i = (\theta_{i1}, \ldots, \theta_{i4})$

Product multinomial model

Challenge: find the positions of the sites and the θ’s

MEME (Bailey, ISMB 1994), Gibbs Motif Sampler (Liu, JASA 1995), AlignACE (Roth, Nat. BioTech 1998),

BioProspector (Liu, Pac. Symp. Biocomp. 2001), Stochastic Dictionary (Gupta, JASA 2003)

Sequence Motif Scoring

Starting set of motifs

De-novo: Different clusters of genes exhibiting “strong” up- or down-regulation (MDScan)

Databases: Derived from experimental data

Motif score for w-width motif j and upstream sequence g:

$X_{gj} = \sum_{\text{positions } i} \frac{P(\text{seq}(i, i+w-1) \mid \text{motif } j)}{P(\text{seq}(i, i+w-1) \mid \text{null})}$
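The score can be computed by sliding a window of width w along the upstream sequence and summing likelihood ratios, as in the sketch below. It assumes a product-multinomial motif (one column of base probabilities per position), an independent-base null model, and a forward-strand scan only; the names `motif_score`, `window_prob` and `null_prob` and the toy matrix are hypothetical.

```python
def window_prob(window, pswm):
    """P(window | motif): product of per-position base probabilities."""
    p = 1.0
    for base, col in zip(window, pswm):
        p *= col[base]
    return p

def null_prob(window, background):
    """P(window | null): independent draws from the background frequencies."""
    p = 1.0
    for base in window:
        p *= background[base]
    return p

def motif_score(seq, pswm, background):
    """Sum over start positions i of P(seq(i, i+w-1) | motif) / P(seq(i, i+w-1) | null)."""
    seq = seq.upper()
    w = len(pswm)
    return sum(
        window_prob(seq[i:i + w], pswm) / null_prob(seq[i:i + w], background)
        for i in range(len(seq) - w + 1)
    )

# Toy inputs (hypothetical): a width-2 motif favouring "AT", uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pswm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
score = motif_score("gatc", pswm, background)

# A motif identical to the background gives ratio 1 per window.
flat = [dict(background), dict(background)]
flat_score = motif_score("gatc", flat, background)
```

Because every window contributes a ratio, a sequence with no strong site still gets a positive score; only sequences containing good matches score well above the number of windows.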

Sequence motif selection

Initial set of D motif candidates (D can be large!)

In the regression model, we want to know which motifs are correlated “significantly” with the response (gene expression)

$u = (u_1, \ldots, u_D)$, where $u_j = 1$ if motif j is in the model and $u_j = 0$ otherwise

Prior probability of motif inclusion

$P(u) = \prod_{j=1}^{D} \eta^{u_j} (1 - \eta)^{1 - u_j}$

Variable selection from a LARGE potential set
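A small sketch of the inclusion prior, written on the log scale, shows how a small η penalizes models with many motifs. The function name `log_inclusion_prior` and the value η = 0.1 are assumptions chosen for illustration.

```python
import math

def log_inclusion_prior(u, eta):
    """log P(u) = sum_j [u_j log(eta) + (1 - u_j) log(1 - eta)]."""
    n_in = sum(u)  # number of motifs included in the model
    return n_in * math.log(eta) + (len(u) - n_in) * math.log(1.0 - eta)

# Hypothetical comparison with D = 4 candidates and eta = 0.1.
eta = 0.1
u_small = [1, 0, 0, 0]   # one motif in the model
u_large = [1, 1, 1, 0]   # three motifs in the model
lp_small = log_inclusion_prior(u_small, eta)
lp_large = log_inclusion_prior(u_large, eta)
```

With η below 0.5, each additional motif lowers the prior probability, so a motif enters the model only when the expression data support it.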

Outline of method

select motifs from large initial set, active in each class

given motifs, update clusters

update model parameters given cluster membership and functional motifs in each class

Complications

Cluster membership z unknown

Number of clusters K may be unknown (assume fixed for now)

Number of motif candidates is large

Parameter updating using MCMC

Update from the joint posterior distribution

$P(\theta, \beta, u, z \mid Y, X, K)$

$\theta = (\mu, \sigma^2, \pi)$

For the updating steps for parameters θ, β and z, marginalize over other parameters for efficiency (conjugate forms permit this)

For updating u, use evolutionary Monte Carlo (Liang and Wong, JASA 2001)

Select motifs that have the most effect on expression, and differentiate most among clusters

Three simulated data sets

Regression coefficients for motifs corresponding to 3 PSWMs from the JASPAR database: SAP1, SRF, and MEF2

200 “genes” in K = 2 clusters

2 measurements each

         Data 1      Data 2      Data 3
Coef.    C1    C2    C1    C2    C1    C2
βM1       2     0     2     0     2    -2
βM2       0     2     0    -2     0     0
βM3       0     0     0     0     0     0

(Motif 3 not present in data)

[Figure: Y1 plotted against motif 1, 2, 3 scores S1, S2, S3 (columns) in the 3 data sets (rows)]

Simulation study results: Bayes factors

[Figure: Bayes factors, three panels, horizontal axis 1 to 6]

Optimal choice: K = 2

Marginal model probability for $M_K$ through double mixture importance sampling

$\hat{P}(Y \mid M_K) = \frac{1}{N_t N_s} \sum_{t} \sum_{s} P(Y \mid z^{(t)}, \theta^{(s)}, K) \, \frac{\pi_1(\theta^{(s)} \mid z^{(t)})}{f_1(\theta^{(s)} \mid z^{(t)})} \, \frac{\pi_2(z^{(t)})}{f_2(z^{(t)})}$

Simulation: β estimates

[Figure: three panels of β estimates by motif type, for clusters c1 and c2]

Data sets 1 and 2: Motifs 1, 2 selected

Data set 3: Motif 1 selected

Spellman (1998) data

[Figure: cell-cycle expression data against time (minutes); phase labels G1, S, S/G2, G2/M, M/G1]

Pick different groups of genes that are highly differentially expressed

Sets of motif candidates of widths 7-12 bp (using MDScan)

Motif overlaps lead to collinearity: remove motifs with correlation > 0.5 with a higher-ranked one → 32 candidate motifs

Two consecutive time points: 1-2, 3-4, . . ., 17-18
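The collinearity filter in the list above could look like the following greedy pass. This is a sketch under assumptions: motif score columns are ordered best-ranked first, correlation is the Pearson correlation of motif scores across genes, and each candidate is compared only against motifs already kept. The names `pearson` and `filter_collinear` and the toy columns are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length score vectors (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return 0.0
    return cov / math.sqrt(va * vb)

def filter_collinear(score_columns, threshold=0.5):
    """Keep a motif only if its correlation with every kept, higher-ranked motif is <= threshold.

    score_columns[j] holds the scores of motif j across genes, best-ranked motif first.
    Returns the indices of the motifs that are kept.
    """
    kept = []
    for j, col in enumerate(score_columns):
        if all(pearson(col, score_columns[i]) <= threshold for i in kept):
            kept.append(j)
    return kept

# Toy scores (hypothetical): motif 1 nearly duplicates motif 0, motif 2 does not.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [1.0, -1.0, 1.0, -1.0],
]
kept = filter_collinear(columns)
```

Whether strongly negative correlations should also be filtered is a design choice; this sketch follows the stated rule and drops only correlations above the threshold.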

Number of clusters K∗

[Figure: log(B̂F) compared to the 1-component model, by time interval 1-9]

Interval   1  2  3  4  5  6  7  8  9
K∗         1  3  1  2  2  2  2  1  1

SCB MCB MCB SFF MCB MCM1 MCM1 MCB

SFF SFF MCB MCB

SFF

Significant motifs over time intervals

Motif influence at different time intervals

[Figure: nine panels, one per time interval; horizontal axis: motif index, 0 to 30]

Selected motif types for 9 time intervals for optimal K

Significant motifs match experimental PSWMs

Index  TF name  Consensus              Expt.            Phase
4      MCM1     CGAAGAG/CTCTTCG        CCNNNWWRGG       M
6      GCN1     TCAGTCA/TGACTGA        TCAGTCA
7      [CSRE]   GGACAGA/TCTGTCC        [YCGGAYRRAWGG]
10     MCB      ACGCGTA/TACGCGT        WCGCGW           G1
16     SFF      AACAACA/TGTTGTT        GTMAACAA         M
18     MCM1     CCAATTAGG/CCTAATTGG    CCNNNWWRGG       M
20     [RME1]   TTCAGGTAC/GTACCTGAA    [GAACCTCAA]
22     SCB      CGCGAAAAA/TTTTTCGCG    CNCGAAA          G1
25     PHO4     CGTACGTAC/GTACGTACG    CACGTK
29     −        CTTCGCATC/GATGCGAAG

K ≡ {G or T} ; M ≡ {A or C} ; N ≡ {A or C or G or T} ; R ≡ {A or G} ; W ≡ {A or T} ; Y ≡ {C or T}
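The consensus column lists each motif together with its reverse complement, and the experimental patterns use the degenerate codes defined in the legend. The sketch below shows one way to check such entries: it computes a reverse complement and tests whether a discovered consensus contains an experimental pattern on either strand. The function names are hypothetical, and strict containment is an assumption that is narrower than the looser notion of "match" used on the slide.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Degenerate codes from the legend, expressed as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
    "R": "[AG]", "W": "[AT]", "Y": "[CT]",
}

def reverse_complement(seq):
    """Reverse the sequence and complement each base."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def contains_consensus(seq, consensus):
    """True if the degenerate consensus occurs in seq or in its reverse complement."""
    pattern = "".join(IUPAC[c] for c in consensus)
    return any(re.search(pattern, s) for s in (seq, reverse_complement(seq)))

# Rows taken from the table of significant motifs.
mcb_hit = contains_consensus("ACGCGTA", "WCGCGW")      # MCB, index 10
scb_hit = contains_consensus("CGCGAAAAA", "CNCGAAA")   # SCB, index 22
pho4_hit = contains_consensus("CGTACGTAC", "CACGTK")   # PHO4, index 25
```

Under this strict check the MCB and SCB rows contain their experimental pattern in full, while the PHO4 row overlaps it only partially and is not reported as a hit.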

Summary

Treating gene expression clustering as a variable may help in discovering relationships between functional sequence motifs, and groups of genes they regulate

Different groups of genes may behave as a cluster at different time points

Upstream sequence motifs “constant” but effects/interactions over time may vary

Further extensions

Motif scoring issues: sensitivity, co-occurrence of sites

Efficient model selection

Extension to high density ChIP tiling arrays

Acknowledgement: Joseph G. Ibrahim (UNC), Jason Lieb (UNC)

UNC high-performance scientific computing group

Model Selection: number of clusters K

Likelihood-based methods not valid (BIC, etc.)

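Bayes factor: ratio of model marginal probabilities

Marginal probability for model $M_K$

$P(Y \mid M_K) = \sum_{z} \int_{\theta} P(Y \mid z, \theta, K) \, p(\theta \mid z) \, p(z \mid K) \, d\theta$

Double mixture importance sampling

$\hat{P}(Y \mid M_K) = \frac{1}{N_t N_s} \sum_{t} \sum_{s} P(Y \mid z^{(t)}, \theta^{(s)}, K) \, \frac{\pi_1(\theta^{(s)} \mid z^{(t)})}{f_1(\theta^{(s)} \mid z^{(t)})} \, \frac{\pi_2(z^{(t)})}{f_2(z^{(t)})}$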


Challenge: Good sampling densities f (·) for z and θ

For a simpler case (θ marginalized), Raftery et al. (TR, 2003) propose permutation-based methods to find good candidate sampling densities for z
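The role of the sampling density can be seen in a much simpler setting than the double mixture estimator. The sketch below uses plain (single) importance sampling to estimate a marginal likelihood for a toy conjugate model, y | θ ∼ N(θ, 1) with θ ∼ N(0, 1), where the exact answer N(y; 0, 2) is available for comparison. The toy model, the sampling density N(y/2, 1), and all names are assumptions for illustration; this is not the estimator used in the talk.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal_by_importance_sampling(y, n_samples, rng):
    """Estimate p(y) = integral of p(y | theta) pi(theta) d(theta).

    Toy model (assumption): y | theta ~ N(theta, 1), theta ~ N(0, 1).
    Sampling density f (assumption): N(y / 2, 1), centred on the posterior mean
    and wider than the posterior, so the importance weights stay bounded.
    """
    total = 0.0
    for _ in range(n_samples):
        theta = rng.gauss(y / 2.0, 1.0)               # draw theta from f
        likelihood = normal_pdf(y, theta, 1.0)        # P(y | theta)
        prior = normal_pdf(theta, 0.0, 1.0)           # pi(theta)
        proposal = normal_pdf(theta, y / 2.0, 1.0)    # f(theta)
        total += likelihood * prior / proposal        # weight by pi / f
    return total / n_samples

rng = random.Random(0)  # fixed seed so the run is reproducible
y_obs = 1.0
estimate = marginal_by_importance_sampling(y_obs, 20000, rng)
exact = normal_pdf(y_obs, 0.0, 2.0)  # closed form: y ~ N(0, 2) marginally
```

The closer f is to the posterior, the less the weights vary and the more stable the estimate, which is why finding good sampling densities for z and θ is the main difficulty.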
