
Probabilistic modelling in computational biology

Dirk Husmeier

Biomathematics & Statistics Scotland

James Watson & Francis Crick, 1953

Frederick Sanger, 1980

Network reconstruction from postgenomic data

Model parameters θ

Friedman et al. (2000), J. Comp. Biol. 7, 601-620

Marriage between graph theory and probability theory

Bayes net

ODE model

Model parameters θ

Probability theory: likelihood

Model parameters θ

Bayesian networks: integral analytically tractable!
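As a sketch of the standard setting behind this statement (conjugate parameter priors, e.g. the BDe/BGe scores; notation assumed here, not taken verbatim from the slides), the score of a structure is the marginal likelihood with the parameters θ integrated out:

```latex
% Marginal likelihood of a network structure G for data D,
% with the parameters \theta integrated out analytically
% (closed form under conjugate priors, e.g. BDe/BGe):
P(D \mid G) \;=\; \int P(D \mid G, \theta)\, P(\theta \mid G)\, d\theta
```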

UAI 1994

Identify the best network structure

Ideal scenario: Large data sets, low noise

Uncertainty about the best network structure

Limited number of experimental replications, high noise

Sample of high-scoring networks

Feature extraction, e.g. marginal posterior probabilities of the edges

Posterior probability close to 1: high-confidence edge. Close to 0: high-confidence non-edge. Intermediate values: uncertainty about the edge.
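A minimal sketch (not the authors' code) of this feature extraction step: average the 0/1 adjacency matrices over the sample of high-scoring networks to obtain marginal posterior edge probabilities. The variable names are illustrative assumptions.

```python
import numpy as np

def edge_posteriors(graph_sample):
    """Estimate marginal posterior edge probabilities by averaging
    the 0/1 adjacency matrices of sampled network structures."""
    return np.mean(np.stack(graph_sample), axis=0)

# Hypothetical toy sample of three 3-node networks:
sample = [np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]]),
          np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]]),
          np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]])]
P = edge_posteriors(sample)
# P[i, j] close to 1: high-confidence edge i -> j;
# close to 0: high-confidence non-edge; around 0.5: uncertain.
print(P)
```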

[Plot: the number of possible network structures grows super-exponentially with the number of nodes]

Sampling with MCMC

Madigan & York (1995), Giudici & Castelo (2003)
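A minimal sketch of structure MCMC in this spirit: propose a single-edge change and accept it with the Metropolis-Hastings probability. The log_score stub is a hypothetical placeholder for the analytically tractable log marginal likelihood plus log prior; acyclicity and fan-in constraints are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_score(adj):
    """Placeholder for log P(D | G) + log P(G); in practice this would be
    an analytically tractable marginal-likelihood score (e.g. BGe/BDe)."""
    return -np.abs(adj.sum() - 3.0)  # toy score favouring ~3 edges

def mcmc_structures(n_nodes, n_samples):
    """Single-edge-flip Metropolis-Hastings over directed graphs
    (acyclicity and fan-in constraints omitted for brevity)."""
    adj = np.zeros((n_nodes, n_nodes), dtype=int)
    sample = []
    for _ in range(n_samples):
        i, j = rng.integers(n_nodes, size=2)
        if i == j:                                   # no self-loops; keep current graph
            sample.append(adj.copy())
            continue
        proposal = adj.copy()
        proposal[i, j] = 1 - proposal[i, j]          # add or delete one edge
        log_alpha = log_score(proposal) - log_score(adj)
        if np.log(rng.random()) < log_alpha:         # Metropolis-Hastings acceptance
            adj = proposal
        sample.append(adj.copy())
    return sample

graphs = mcmc_structures(n_nodes=4, n_samples=1000)
```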

Overview

• Introduction

• Limitations

• Methodology

• Application to morphogenesis

• Application to synthetic biology

Homogeneity assumption

Interactions don’t change with time

Limitations of the homogeneity assumption

Example: 4 genes, 10 time points

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10

X(1) X1,1 X1,2 X1,3 X1,4 X1,5 X1,6 X1,7 X1,8 X1,9 X1,10

X(2) X2,1 X2,2 X2,3 X2,4 X2,5 X2,6 X2,7 X2,8 X2,9 X2,10

X(3) X3,1 X3,2 X3,3 X3,4 X3,5 X3,6 X3,7 X3,8 X3,9 X3,10

X(4) X4,1 X4,2 X4,3 X4,4 X4,5 X4,6 X4,7 X4,8 X4,9 X4,10
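A minimal sketch of what the homogeneity assumption means formally, assuming a first-order dynamic Bayesian network with notation chosen here (pa(n) denotes the parents of node n):

```latex
% Homogeneous first-order DBN: the same parameters \theta_n apply
% to every time point t.
P(X \mid G, \theta) \;=\; \prod_{n=1}^{N} \prod_{t=2}^{T}
    P\!\left(X_{n,t} \,\middle|\, X_{\mathrm{pa}(n),\,t-1},\, \theta_n\right)
```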

Supervised learning. Here: 2 components

[Same 4-gene × 10-time-point matrix, with the time points partitioned into two known segments]

Changepoint model

Parameters can change with time

[Same 4-gene × 10-time-point matrix, now with the time points partitioned into three segments inferred from the data]

Unsupervised learning. Here: 3 components

Extension of the model

θ: model parameters
k: number of components (here: 3)
h: allocation vector

Analytically integrate out the parameters θ
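A sketch of how the allocation vector enters the likelihood, using the symbols above and the same assumed DBN notation as before; each component has its own parameter set:

```latex
% Changepoint model: the allocation vector h assigns each time point t
% to one of k components, and each component has its own parameters
% \theta_n^{(h_t)}.
P(X \mid G, h, \theta) \;=\; \prod_{n=1}^{N} \prod_{t=2}^{T}
    P\!\left(X_{n,t} \,\middle|\, X_{\mathrm{pa}(n),\,t-1},\,
             \theta_n^{(h_t)}\right)
```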

P(network structure | changepoints, data)

P(changepoints | network structure, data)

Birth, death, and relocation moves

RJMCMC within Gibbs

Dynamic programming, complexity N²
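A schematic sketch of this RJMCMC-within-Gibbs scheme, alternating between the two conditionals stated above. Both samplers are hypothetical placeholder stubs for illustration only; in the real method they would implement structure sampling with the parameters integrated out and the birth/death/relocation changepoint moves.

```python
import random

def sample_structure(changepoints, data):
    """Placeholder for a draw from P(network structure | changepoints, data)."""
    return {"edges": set()}  # hypothetical stub

def sample_changepoints(structure, data):
    """Placeholder for a draw from P(changepoints | structure, data)."""
    return sorted(random.sample(range(1, len(data)), k=random.randint(0, 2)))

def gibbs(data, n_iter=100):
    """Alternate between the two conditional distributions (Gibbs sampling)."""
    structure, changepoints = {"edges": set()}, []
    sample = []
    for _ in range(n_iter):
        structure = sample_structure(changepoints, data)
        changepoints = sample_changepoints(structure, data)
        sample.append((structure, changepoints))
    return sample

draws = gibbs(data=list(range(10)))
```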

Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University (Andrew Millar's group)

- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3
- Transcriptional profiles at 4 × 13 time points, sampled at 2h intervals under constant light, for 4 experimental conditions

Circadian rhythms in Arabidopsis thaliana

Comparison with the literature

Precision: proportion of identified interactions that are correct

Recall (= sensitivity): proportion of true interactions that we successfully recovered

Specificity: proportion of non-interactions that are successfully avoided
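In terms of the true/false positive and negative counts used below, these standard definitions read:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}
```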

Which interactions from the literature are found?

[Network over the nine circadian genes CCA1, LHY, PRR9, GI, ELF3, TOC1, ELF4, PRR5 and PRR3; blue edges: activations, red edges: inhibitions; true positives and false negatives marked]

True positives (TP) = 8

False negatives (FN) = 5

Recall = 8/13 = 62%

What proportion of the predicted interactions is confirmed by the literature?

[Predicted network over the same genes; blue edges: activations, red edges: inhibitions; true positives and false positives marked]

True positives (TP) = 8

False positives (FP) = 13

Precision = 8/21 = 38%


True positives (TP) = 8
False positives (FP) = 13
False negatives (FN) = 5
True negatives (TN) = 9² - 8 - 13 - 5 = 55

Sensitivity = recall = TP/(TP+FN) = 62%
Specificity = TN/(TN+FP) = 81% (the proportion of avoided non-interactions)

Model extension. So far: non-stationarity in the regulatory process.

Non-stationarity in the network structure

Flexible network structure.

Model parameters θ

Use prior knowledge!

Flexible network structure.

Flexible network structure with regularization

Hyperparameter

Normalization factor

Exponential prior versus binomial prior with conjugate beta hyperprior
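One common form of such an exponential prior, given here as a hedged sketch rather than the exact prior used in the talk: the segment-specific structures are coupled by penalising edge differences between adjacent segments, with hyperparameter β and normalisation factor Z(β).

```latex
% Exponential prior coupling the segment-specific structures G^{(1)},...,G^{(k)}:
% |G^{(i)} \ominus G^{(i+1)}| counts edge differences between adjacent segments,
% \beta is the regularisation hyperparameter, Z(\beta) the normalisation factor.
P\!\left(G^{(1)}, \dots, G^{(k)} \,\middle|\, \beta\right)
  \;=\; \frac{1}{Z(\beta)}
  \exp\!\left(-\beta \sum_{i=1}^{k-1}
      \bigl|\,G^{(i)} \ominus G^{(i+1)}\bigr|\right)
```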

NIPS 2010

Overview

• Introduction

• Limitations

• Methodology

• Application to morphogenesis

• Application to synthetic biology

Morphogenesis in Drosophila melanogaster

• Gene expression measurements at 66 time points during the life cycle of Drosophila (Arbeitman et al., Science, 2002).

• Selection of 11 genes involved in muscle development.

Zhao et al. (2006), Bioinformatics 22

Can we learn the morphogenetic transitions embryo → larva, larva → pupa, pupa → adult?

Average posterior probabilities of transitions

Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult

Can we learn changes in the regulatory network structure?

Overview

• Introduction

• Limitations

• Methodology

• Application to morphogenesis

• Application to synthetic biology

Can we learn the switch Galactose → Glucose?

Can we learn the network structure?

Task 1: Changepoint detection

Switch of the carbon source: Galactose → Glucose

Task 2: Network reconstruction

Precision: proportion of identified interactions that are correct

Recall: proportion of true interactions that we successfully recovered

BANJO: conventional homogeneous DBN. TSNI: method based on differential equations.

Inference: optimization, “best” network

Sample of high-scoring networks

Marginal posterior probabilities of the edges

[Toy example: marginal posterior edge probabilities (P = 1, P = 0.5, P = 0, …) compared against the true network; an edge is predicted whenever its posterior exceeds the threshold]

Threshold   0.9    0.4    -0.01
Precision   1      2/3    1/2
Recall      1/2    1      1

Lowering the threshold step by step traces out the precision-recall curve.
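A minimal sketch of how such a precision-recall curve is traced out by sweeping the threshold over the marginal edge posteriors; the four candidate edges and their posteriors below are hypothetical toy values chosen to reproduce the table above.

```python
import numpy as np

def precision_recall_curve(posteriors, truth, thresholds):
    """Sweep a threshold over marginal edge posteriors and report precision
    and recall against the true network (both given as flat arrays over all
    candidate edges)."""
    rows = []
    for thr in thresholds:
        pred = posteriors > thr                     # predicted edges at this threshold
        tp = np.sum(pred & (truth == 1))
        fp = np.sum(pred & (truth == 0))
        fn = np.sum(~pred & (truth == 1))
        precision = tp / (tp + fp) if tp + fp > 0 else 1.0
        recall = tp / (tp + fn) if tp + fn > 0 else 1.0
        rows.append((thr, precision, recall))
    return rows

# Hypothetical toy example: four candidate edges, two of them true.
posteriors = np.array([1.0, 0.5, 0.5, 0.0])
truth = np.array([1, 1, 0, 0])
for thr, prec, rec in precision_recall_curve(posteriors, truth, [0.9, 0.4, -0.01]):
    print(f"threshold {thr:+.2f}: precision {prec:.2f}, recall {rec:.2f}")
```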

Future work

How are we getting from here …

… to there?!

Input: prior knowledge. Learn: MCMC.
