
Outline

Who regulates whom and when?
Model
Learning algorithm
Evaluation
Wet lab experiments
Perspective: why does it work?

Gene Regulation: Simple Example

[Figure: a regulated gene controlled by an activator and a repressor, shown in three states (State 1, State 2, State 3) that differ in which of the two regulators is present.]

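Regulation Tree

[Figure: DNA microarray measurements of the regulators feed a regulation tree. The tree asks "Activator?" and "Repressor?", each with true/false branches, and its leaves are State 1, State 2 and State 3. The tree is the regulation program of a module: the expression of the module genes is a function of activator expression and repressor expression.]

Genes in the same module share the same regulation program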
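A minimal sketch of how such a regulation tree could be represented and evaluated. The tree shape (Activator? first, then Repressor?), the threshold of 0.0 and the names `Leaf`, `Split` and `lookup_state` are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    state: str  # e.g. "State 1"


@dataclass
class Split:
    regulator: str      # name of the regulator that is tested
    threshold: float    # the test is "expression > threshold"
    if_false: "Node"
    if_true: "Node"


Node = Union[Leaf, Split]


def lookup_state(tree: Node, expression: Dict[str, float]) -> str:
    """Walk the tree for one experiment and return the leaf it maps to."""
    node = tree
    while isinstance(node, Split):
        is_up = expression[node.regulator] > node.threshold
        node = node.if_true if is_up else node.if_false
    return node.state


# Assumed shape: no activator -> State 1; activator only -> State 2;
# activator and repressor -> State 3.
tree = Split(
    "activator", 0.0,
    if_false=Leaf("State 1"),
    if_true=Split("repressor", 0.0, Leaf("State 2"), Leaf("State 3")),
)

print(lookup_state(tree, {"activator": 1.2, "repressor": -0.8}))
```

Every gene in the module is looked up through the same tree, which is what "sharing a regulation program" amounts to in this sketch.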

Module Networks

Goal: discover regulatory modules and their regulators
Module genes: set of genes that are similarly controlled
Regulation program: expression as a function of regulators

[Figure: modules, each with a regulation tree over regulators such as HAP4 and CMK1 (true/false branches).]

Expression level in each module is a function of expression of regulators

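Module Network Probabilistic Model

[Figure: probabilistic model over Experiment and Gene. Each gene has a Module variable ("What module does gene g belong to?"). Each experiment has Regulator1, Regulator2 and Regulator3 variables ("Expression level of Regulator1 in experiment"). The Expression Level depends on both through P(Level | Module, Regulators), drawn as a tree over HAP4 and CMK1. Other labels in the figure: BMH1, GIC2.]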
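A minimal sketch of how P(Level | Module, Regulators) could be evaluated, assuming each leaf of a module's regulation program holds one Gaussian. The program below, its threshold, and its means and standard deviations are hypothetical numbers chosen for illustration; only the gene and regulator names come from the slides.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def respiration_program(regulators):
    """Hypothetical one-split program: returns the (mean, std) of the leaf."""
    if regulators["HAP4"] > 0.0:
        return (1.5, 0.5)   # HAP4 up: module genes induced (made-up values)
    return (-0.2, 0.5)      # HAP4 down (made-up values)


PROGRAMS = {"respiration": respiration_program}
MODULE_OF = {"COX4": "respiration", "COX6": "respiration"}


def level_log_prob(gene, level, regulators):
    """log P(Level | Module, Regulators) for one gene in one experiment."""
    mean, std = PROGRAMS[MODULE_OF[gene]](regulators)
    return gaussian_logpdf(level, mean, std)


print(level_log_prob("COX4", 1.4, {"HAP4": 0.9}))
```

The gene enters only through its module assignment, so two genes in the same module get the same distribution in a given experiment.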


Learning Problem

Goal: find gene module assignments and tree structures that maximize P(M|D)

[Figure: the probabilistic model (Experiment, Gene, Module, Regulator1, Regulator2, Regulator3, Expression Level), with the gene module assignments and the tree structures over regulators such as HAP4 and CMK1 marked as the unknowns.]

Hard: 5000-10000 genes, ~500 regulators

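Learning Algorithm Overview

[Figure: iterative loop. Clustering gives an initial gene module assignment. The algorithm then learns regulation programs (trees over regulators such as HAP4 and CMK1) for the regulatory modules, relearns the gene assignments to modules, and repeats.]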
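The loop can be sketched as below. This is a simplified stand-in, not the algorithm from the talk: each regulation program is limited to a single split, the score is squared error rather than the Bayesian score, and the initial assignment is passed in rather than computed by clustering. The function names and the `min_side` parameter are assumptions.

```python
import numpy as np


def fit_program(module_data, regulators, min_side=5):
    """One-split program: (regulator index, threshold, mean_low, mean_high)."""
    if module_data.shape[0] == 0:
        return (0, np.inf, 0.0, 0.0)            # empty module predicts 0
    overall = module_data.mean()
    best = (np.inf, 0, np.inf, overall, overall)  # fallback: no split
    n_exp = regulators.shape[1]
    for r in range(regulators.shape[0]):
        order = np.argsort(regulators[r])       # experiments sorted by regulator
        for cut in range(min_side, n_exp - min_side + 1):
            low, high = order[:cut], order[cut:]
            mu_low = module_data[:, low].mean()
            mu_high = module_data[:, high].mean()
            err = ((module_data[:, low] - mu_low) ** 2).sum() \
                + ((module_data[:, high] - mu_high) ** 2).sum()
            if err < best[0]:
                thr = 0.5 * (regulators[r, order[cut - 1]] + regulators[r, order[cut]])
                best = (err, r, thr, mu_low, mu_high)
    return best[1:]


def predict(program, regulators):
    """Predicted module expression in every experiment."""
    r, thr, mu_low, mu_high = program
    return np.where(regulators[r] > thr, mu_high, mu_low)


def learn_module_network(data, regulators, n_modules, init_assign, n_iter=20, min_side=5):
    """Alternate between learning programs and reassigning genes to modules."""
    assign = np.asarray(init_assign).copy()
    for _ in range(n_iter):
        programs = [fit_program(data[assign == m], regulators, min_side)
                    for m in range(n_modules)]
        preds = np.stack([predict(p, regulators) for p in programs])
        errors = ((data[:, None, :] - preds[None, :, :]) ** 2).sum(axis=2)
        new_assign = errors.argmin(axis=1)       # best module for each gene
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    programs = [fit_program(data[assign == m], regulators, min_side)
                for m in range(n_modules)]
    return assign, programs
```

Here `data` is a genes x experiments array and `regulators` a regulators x experiments array. Like any hard-assignment scheme of this kind, the loop can stop in a local optimum that depends on the initial assignment.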

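Learning Regulation Programs

[Figure: module genes (rows) by experiments (columns), first with experiments sorted in original order, then with experiments sorted by Hap4 expression.]

Score without a split, where $\mu$ and $\sigma$ are the parameters of the Gaussian over the module's expression levels:

$$\log P(M \mid D) \propto \log P(D \mid \mu, \sigma) + \log P(\mu, \sigma)$$

Candidate split on HAP4 ($\uparrow$ and $\downarrow$ mark the experiments on the two sides of the split):

$$\log P(M \mid D) \propto \log P(D_{\mathrm{HAP4}\uparrow} \mid \mu_{\mathrm{HAP4}\uparrow}, \sigma_{\mathrm{HAP4}\uparrow}) + \log P(D_{\mathrm{HAP4}\downarrow} \mid \mu_{\mathrm{HAP4}\downarrow}, \sigma_{\mathrm{HAP4}\downarrow}) + \log P(\mu_{\mathrm{HAP4}\uparrow}, \sigma_{\mathrm{HAP4}\uparrow}, \mu_{\mathrm{HAP4}\downarrow}, \sigma_{\mathrm{HAP4}\downarrow})$$

Candidate split on SIP4:

$$\log P(M \mid D) \propto \log P(D_{\mathrm{SIP4}\uparrow} \mid \mu_{\mathrm{SIP4}\uparrow}, \sigma_{\mathrm{SIP4}\uparrow}) + \log P(D_{\mathrm{SIP4}\downarrow} \mid \mu_{\mathrm{SIP4}\downarrow}, \sigma_{\mathrm{SIP4}\downarrow}) + \log P(\mu_{\mathrm{SIP4}\uparrow}, \sigma_{\mathrm{SIP4}\uparrow}, \mu_{\mathrm{SIP4}\downarrow}, \sigma_{\mathrm{SIP4}\downarrow})$$

After choosing HAP4, a further split on CMK1:

$$\log P(M \mid D) \propto \log P(D_{\mathrm{HAP4}\uparrow} \mid \mu_{\mathrm{HAP4}\uparrow}, \sigma_{\mathrm{HAP4}\uparrow}) + \log P(D_{\mathrm{CMK1}\uparrow} \mid \mu_{\mathrm{CMK1}\uparrow}, \sigma_{\mathrm{CMK1}\uparrow}) + \log P(D_{\mathrm{CMK1}\downarrow} \mid \mu_{\mathrm{CMK1}\downarrow}, \sigma_{\mathrm{CMK1}\downarrow}) + \ldots$$

[Figure: module genes against Hap4 expression, with the resulting regulation tree over the regulators HAP4 and CMK1.]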
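A minimal sketch of comparing candidate splits by how much they raise the data log-likelihood. It uses maximum-likelihood Gaussians on each side and leaves out the prior term, so it is a simplification of the Bayesian score on the slide. The expression values, the variance floor and the function names are made up for illustration.

```python
import math


def gaussian_loglik(values, var_floor=1e-6):
    """Log-likelihood of values under a Gaussian fitted to those same values."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    var = max(ss / n, var_floor)   # floor avoids log(0) on constant data
    return -0.5 * n * math.log(2 * math.pi * var) - ss / (2 * var)


def split_gain(levels, regulator, threshold):
    """Gain from splitting the experiments on 'regulator > threshold'."""
    up = [l for l, r in zip(levels, regulator) if r > threshold]
    down = [l for l, r in zip(levels, regulator) if r <= threshold]
    if not up or not down:
        return float("-inf")       # not a real split
    return gaussian_loglik(up) + gaussian_loglik(down) - gaussian_loglik(levels)


# Hypothetical module expression in six experiments, and two candidate regulators.
levels = [-1.1, -0.9, -1.0, 1.0, 0.9, 1.1]
hap4 = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
sip4 = [1.0, -1.0, 1.5, -1.5, 2.0, -2.0]

gain_hap4 = split_gain(levels, hap4, 0.0)
gain_sip4 = split_gain(levels, sip4, 0.0)
print(gain_hap4, gain_sip4)
```

In this toy data the module tracks HAP4, so sorting the experiments by HAP4 separates low from high expression and the HAP4 split gets the larger gain.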

Learning Algorithm Performance

[Plot: Bayesian score (avg. per gene, axis from -131 to -128) versus algorithm iterations (0 to 20).]
Significant improvements across learning iterations

[Plot: gene module assignment changes (% from total, axis from 0 to 50) versus algorithm iterations (0 to 20).]
Many genes (50%) change module assignment in learning


Yeast Stress Data

Genes: selected 2355 that showed activity
Experiments (173): diverse environmental stress conditions (heat shock, nitrogen depletion, …)

Comparison to Bayesian Networks

Bayesian Network (Friedman et al. ’00; Hartemink et al. ’01)
Expression level of each gene is a function of expression of regulators
[Figure: fragment of learned Bayesian network over Cmk1, Hap4, Mig1, Ste12, Yap1 and Gic1; 2355 variables (genes), 173 instances (experiments).]
Problems: robustness, interpretability

Module Network (SPRKF ’03, UAI)
[Figure: module-level model with Regulator1, Regulator2, Regulator3, Module and Level.]
Solutions: robustness through sharing parameters, interpretability through a module-level model

[Plot: test data log-likelihood (gain per instance, axis from -150 to 150) versus number of modules (0 to 500), with Bayesian network performance as the reference.]
Learn which parameters are shared (by learning which genes are in the same module)

From Model to Regulatory Modules

[Figure: from the learned model (Regulator1, Regulator2, Regulator3, Module, Level) to regulatory modules with regulation trees over HAP4 and CMK1.]

Biologically relevant?

Respiration Module

[Figure: regulation program and module genes, with the predicted regulator marked.]

Module genes functionally coherent? Energy production (oxid. phos. 26/55, P < 10^-30)
Module genes known targets of predicted regulators? Hap4 + Msn4 known to regulate module genes

Energy, Osmolarity, & cAMP Signaling

Tpk1: regulation by non-TFs (Tpk1 is a catalytic subunit of cAMP-dependent protein kinase)
Module contains known Tpk1 targets (e.g. Tps1)
Tpk1-mediated STRE motif (50/64 genes; p < 3x10^-11)

EM: Biological Improvement

[Plot: negative log p-value (module network) versus negative log p-value (standard clustering), both axes from 0 to 45.]

[Figure: global view of the inferred modules and their regulators. Modules are numbered and grouped under amino acid metabolism, energy and cAMP signaling, DNA and RNA processing, and nuclear.
Regulators shown: Hap4, Xbp1, Yer184c, Yap6, Gat1, Ime4, Lsg1, Msn4, Gac1, Gis1, Ypl230w, Not3, Sip2, Kin82, Cmk1, Tpk1, Ppt1, Tpk2, Pph3, Bmh1, Gcn20.
Motifs shown: STRE, HAP234, REPCAR, CAT8, ADR1, HSF, HAC1, XBP1, MCM1, ABF_C, GATA, GCN4, CBF1_B, GCR1, MIG1, N41, N26, N30, N36, N11, N4, N14, N13, N18.
Legend: regulation supported in literature; inferred regulation; regulator (signaling molecule); regulator (transcription factor); experimentally tested regulator; module (number); enriched cis-regulatory motif.]

Biological Evaluation Summary

Are the module genes functionally coherent? 46/50
Are some module genes known targets of the predicted regulators? 30/50

Functionally coherent = module genes enriched for GO annotations with hypergeometric p-value < 0.01 (corrected for multiple hypotheses)

Known targets = direct biological experiments reported in the literature
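A minimal sketch of the hypergeometric enrichment test behind "functionally coherent". The slides do not say which multiple-hypothesis correction was used, so the Bonferroni step is an assumption. The annotation count of 80 and the 1000 tested annotations are hypothetical; 26 of 55 module genes and 2355 genes in total are figures from the slides.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a module of
    n genes drawn without replacement from N genes, K of which are annotated."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def bonferroni(p, n_tests):
    """Simple multiple-hypothesis correction (assumed, not from the slides)."""
    return min(1.0, p * n_tests)


# 26 of 55 module genes carry the annotation; background of 2355 genes,
# 80 of which are assumed to carry the annotation (hypothetical count).
p = hypergeom_pvalue(26, 55, 80, 2355)
print(p, bonferroni(p, 1000))
```

A module would count as coherent for an annotation when the corrected p-value falls below 0.01.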


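From Model to Detailed Predictions

Prediction: regulator ‘X’ regulates process ‘Y’
Experiment: knock out ‘X’ and repeat experiment

[Figure: module with regulators HAP4 and Ypl230w; Ypl230w is knocked out (X) and the effect on the module is the open question (?).]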

Does ‘X’ Regulate Predicted Genes?

Experiment: knock out Ypl230w (stationary phase)
[Figure: wild-type versus mutant expression, with a >4x threshold marking regulated genes.]
1334 regulated genes (312 expected by chance)
Rank modules by regulated genes

Modules predicted to be regulated by Ypl230w:

Module                            Sig.
Protein folding                   P<0.0001
Cell differentiation              P<0.02
Glycolysis and folding            P<0.04
Mitochondrial and protein fate    P<0.04

Ypl230w regulates computationally predicted genes

Does ‘X’ Regulate Predicted Genes?

Ppt1 knockout (hypo-osmotic stress), wild-type versus mutant: 1014 regulated genes

Module                                Sig.
Ribosomal and phosphate metabolism    P<0.009
Amino acid and purine metabolism      P<0.01
mRNA, rRNA and tRNA processing        P<0.02
Protein folding                       P<0.02
Cell cycle                            P<0.02

Kin82 knockout (heat shock), wild-type versus mutant: 1034 regulated genes

Module                                 Sig.
Energy and osmotic stress              P<0.0001
Energy, osmolarity & cAMP signaling    P<0.006
mRNA, rRNA and tRNA processing         P<0.02

Wet Lab Experiments Summary

3/3 regulators regulate computationally predicted genes

New yeast biology suggested:
  Ypl230w activates protein-folding, cell wall and ATP-binding genes
  Ppt1 represses phosphate metabolism and rRNA processing
  Kin82 activates energy and osmotic stress genes


Why does it work?

Underlying assumption: regulators are transcriptionally regulated
Regulators are part of regulatory structures in which they are themselves regulated*
Statistical methods can detect associations between regulators and their targets

* [Shen-Orr et al., ’02] find many such structures

Regulator Chain
Respiration module: Phd1 (TF), Hap4 (TF); targets Cox4, Cox6, Atp17
[Plot: active protein level and mRNA expression level over time for Phd1, Hap4 and the targets.]

Auto Regulation
Snf kinase regulated processes module: Yap6 (TF); targets Vid24, Tor1, Gut2

Positive Signaling Loop
Sporulation and cAMP pathway module: Sip2 (SM), Msn4 (TF); targets Vid24, Tor1, Gut2

Negative Signaling Loop
Energy and osmotic stress module: Tpk1 (SM), Msn4 (TF); targets Nth1, Tps1, Glo1

Legend for the four diagrams: black = regulators that cannot be detected; red = correctly predicted regulator; blue = targets

Why Does it Work?

Feed-forward and feedback loops

Some transcription factors and signal transduction molecules have a detectable expression signature

Module Networks infers their regulatory relationships

Assignment

Download the yeast stress expression dataset
Download the list of transcription factor regulators
Randomly partition the dataset in a 5-fold cross validation scheme
For k=50:
  Create a hard-clustering model (use code from earlier exercise). At each array, this model has a separate Gaussian distribution for each of the 50 values of the cluster variable
  Use the assignment of genes to clusters that you learned in the hard-clustering, and for each cluster, learn a decision tree with at most: (1) one split (2) two splits (3) three splits
    Note 1: allow only splits with >=5 arrays in each side of the split
    Note 2: split question is whether the expression level of the transcription factor is greater than some value
    Note 3: at each leaf of the resulting model, there is a single Gaussian distribution that is used for all arrays that map to that leaf
  Compute the log-likelihood of the test data for each model (hard-clustering, and each of the three regulation models)
  Plot the avg. and std. test log-likelihood for each model
  For the model with two splits on each cluster, use the Gaussian distribution at each array to sample a new expression dataset with exactly the same number of genes and number of arrays. For each original gene and array, you sample from the Gaussian distribution associated with that gene and that array
  Learn a model with two splits for each cluster
  Plot the number of regulation tree splits that are identical between the model that sampled the data and the new model that you learned
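A possible starting point for the cross-validation bookkeeping in the assignment. The assignment does not say whether genes or arrays are partitioned, so the sketch splits an arbitrary list of item indices. The function names and the fixed seed are assumptions, and model fitting is left out.

```python
import random
import statistics


def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train, test) index lists for a random n-fold partition."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


def summarize(test_logliks):
    """Average and standard deviation of per-fold test log-likelihoods."""
    return statistics.mean(test_logliks), statistics.stdev(test_logliks)


# Example: partition the 173 arrays of the yeast stress data.
for train, test in five_fold_splits(173):
    print(len(train), len(test))
```

Each model (hard clustering and the one-, two- and three-split regulation models) would be fitted on `train`, scored on `test`, and the five scores passed to `summarize` for the plot.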