Learning Bayesian networks from postgenomic data with an improved structure MCMC sampling scheme Dirk Husmeier Marco Grzegorczyk 1) Biomathematics & Statistics Scotland 2) Centre for Systems Biology at Edinburgh

Post on 13-Dec-2015

Learning Bayesian networks from postgenomic data with an improved structure MCMC sampling scheme

Dirk Husmeier, Marco Grzegorczyk

1) Biomathematics & Statistics Scotland
2) Centre for Systems Biology at Edinburgh

Systems Biology

[Figure: signalling schematic. Labels: cell membrane, nucleus, protein activation cascade, phosphorylation, TF (twice), -> cell response.]

Raf signalling network

From Sachs et al., Science (2005)

[Figure labels: unknown; high-throughput experiments; postgenomic data; machine learning; statistical methods.]

Differential equation models

• Multiple parameter sets can offer equally plausible solutions.

• Multimodality in parameter space: point estimates become meaningless.

• Overfitting problem: not suitable for model selection.

• Bayesian approach: computing the marginal likelihood is computationally challenging.

Bayesian networks

[Figure: example network over the nodes A, B, C, D, E, F, annotated with the labels NODES and EDGES.]

• Marriage between graph theory and probability theory.

• Directed acyclic graph (DAG) representing conditional independence relations.

• It is possible to score a network in light of the data: P(D|M), D: data, M: network structure.

• We can infer how well a particular network explains the observed data.

P(A,B,C,D,E,F) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|D) P(F|C,D)

Learning Bayesian networks

P(M|D) = P(D|M) P(M) / Z

M: Network structure. D: Data
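When the candidate structures can be enumerated, the normalisation constant Z is simply the sum of P(D|M) P(M) over all M. The sketch below does this in log space for numerical stability; the three structure labels and their log scores are invented for illustration.

```python
import math

def posterior_from_log_scores(log_likelihood, log_prior):
    """Return P(M|D) for each structure M, given log P(D|M) and log P(M) as dicts."""
    log_joint = {m: log_likelihood[m] + log_prior[m] for m in log_likelihood}
    top = max(log_joint.values())
    # log Z via the log-sum-exp trick, so that very negative scores do not underflow.
    log_z = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {m: math.exp(v - log_z) for m, v in log_joint.items()}

# Made-up log marginal likelihoods for three candidate structures, uniform prior.
log_lik = {"M1": -1050.0, "M2": -1052.0, "M3": -1049.0}
log_pri = {m: math.log(1 / 3) for m in log_lik}
post = posterior_from_log_scores(log_lik, log_pri)
```

For realistic numbers of nodes the sum over structures is intractable, which is what motivates sampling from P(M|D) instead.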

MCMC in structure space

Madigan & York (1995), Giudici & Castelo (2003)
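As a rough illustration of conventional structure MCMC, the sketch below runs Metropolis-Hastings over DAGs with single-edge additions, deletions and reversals. The function names, the toy score and the uniform choice among neighbours are assumptions made for this example; they are not taken from the slides.

```python
import math
import random

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes that have no incoming edge from the remaining nodes."""
    remaining = set(nodes)
    while remaining:
        roots = [n for n in remaining
                 if not any(c == n and p in remaining for p, c in edges)]
        if not roots:
            return False
        remaining -= set(roots)
    return True

def neighbours(nodes, edges):
    """All DAGs reachable by one edge addition, deletion or reversal."""
    out = []
    for a in nodes:
        for b in nodes:
            if a == b:
                continue
            if (a, b) in edges:
                out.append(edges - {(a, b)})                # deletion
                reversed_ = (edges - {(a, b)}) | {(b, a)}   # reversal
                if is_acyclic(nodes, reversed_):
                    out.append(reversed_)
            elif (b, a) not in edges:
                added = edges | {(a, b)}                    # addition
                if is_acyclic(nodes, added):
                    out.append(added)
    return out

def structure_mcmc(nodes, log_score, n_steps, rng):
    """log_score(edges) stands for log P(D|M) + log P(M), up to a constant."""
    current, samples = frozenset(), []
    for _ in range(n_steps):
        nbrs = neighbours(nodes, current)
        proposal = frozenset(rng.choice(nbrs))
        # Hastings factor: ratio of the two neighbourhood sizes.
        log_ratio = (log_score(proposal) - log_score(current)
                     + math.log(len(nbrs))
                     - math.log(len(neighbours(nodes, proposal))))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = proposal
        samples.append(current)
    return samples

nodes = ["A", "B", "C"]
toy_log_score = lambda edges: -1.0 * len(edges)   # hypothetical score favouring sparse graphs
samples = structure_mcmc(nodes, toy_log_score, 200, random.Random(0))
```

Each step changes at most one edge, so many iterations may be needed to move between distant high-scoring graphs.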

Alternative paradigm: order MCMC

Instead of MCMC in structure space:

MCMC in order space

Problem: Distortion of the prior distribution

[Figure: two-node example with nodes A and B. The two node orders, (A, B) and (B, A), each carry probability 0.5; each order branches to its compatible graphs with probability 0.5 per branch; the induced probabilities of the three graphs are 0.25, 0.5 and 0.25.]
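The numbers in the two-node example can be reproduced with a short sketch. It assumes a uniform prior over node orders and, given an order, a uniform distribution over the graphs consistent with it; both are assumptions for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations, product

def consistent_graphs(order):
    """All DAGs whose edges point from earlier to later nodes in `order`."""
    pairs = [(order[i], order[j])
             for i in range(len(order)) for j in range(i + 1, len(order))]
    for keep in product((False, True), repeat=len(pairs)):
        yield frozenset(p for p, k in zip(pairs, keep) if k)

def order_induced_prior(nodes):
    """Uniform over orders, then uniform over the graphs consistent with each order."""
    orders = list(permutations(nodes))
    prior = defaultdict(Fraction)
    for order in orders:
        graphs = list(consistent_graphs(order))
        for g in graphs:
            prior[g] += Fraction(1, len(orders) * len(graphs))
    return dict(prior)

prior = order_induced_prior(["A", "B"])
# prior: no edge -> 1/2, A->B -> 1/4, B->A -> 1/4
```

The edgeless graph is consistent with both orders and so collects twice the mass of either single-edge graph, even though nothing in the prior over orders favours it explicitly.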

Proposed new paradigm

Current work with Marco Grzegorczyk

• MCMC in structure space rather than order space.

• Design new proposal moves that achieve faster mixing and convergence.

First idea

Propose new parents from the distribution:

• Identify those new parents that are involved in the formation of directed cycles.

• Orphan them, and sample new parents for them subject to the acyclicity constraint.

1) Select a node

2) Sample new parents

3) Find directed cycles

4) Orphan “loopy” parents

5) Sample new parents for these parents

Problem: This move is not reversible

Path via illegal structure

Devise a simpler move that is reversible

• Identify a pair of nodes X → Y.

• Orphan both nodes.

• Sample new parents from the “Boltzmann distribution” subject to the acyclicity constraint, such that the inverse edge Y → X is included.

[Figure labels: C1, C2, C1,2]

1) Select an edge

2) Orphan the nodes involved

3) Constrained resampling of the parents
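The three steps can be sketched as a proposal function. This is an illustration under stated assumptions, not the authors' implementation: `local_log_score`, the fan-in limit `max_fan_in` and all function names are hypothetical, and the acceptance step with its Hastings factor is omitted.

```python
import math
import random
from itertools import combinations

def descendants(node, parents):
    """Nodes reachable from `node`; `parents` maps each node to its set of parents."""
    found, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for child, pa in parents.items():
            if cur in pa and child not in found:
                found.add(child)
                stack.append(child)
    return found

def sample_parents(node, parents, local_log_score, rng,
                   must_include=frozenset(), max_fan_in=3):
    """Draw a parent set with probability proportional to exp(local score), restricted
    to sets that keep the graph acyclic and contain `must_include`."""
    allowed = [n for n in parents
               if n != node and n not in descendants(node, parents)]
    candidates = [frozenset(c)
                  for k in range(max_fan_in + 1)
                  for c in combinations(allowed, k)
                  if must_include <= frozenset(c)]
    weights = [math.exp(local_log_score(node, c)) for c in candidates]
    return rng.choices(candidates, weights=weights)[0]

def reversal_proposal(parents, edge, local_log_score, rng):
    x, y = edge                                  # 1) selected edge x -> y
    new = {n: set(p) for n, p in parents.items()}
    new[x], new[y] = set(), set()                # 2) orphan both nodes
    # 3) resample: x must now have y as a parent, then y gets new parents
    new[x] = set(sample_parents(x, new, local_log_score, rng,
                                must_include=frozenset({y})))
    new[y] = set(sample_parents(y, new, local_log_score, rng))
    return new

parents = {"A": set(), "B": {"A"}, "C": {"B"}}
toy_score = lambda node, pa: -0.5 * len(pa)      # hypothetical local score
proposal = reversal_proposal(parents, ("A", "B"), toy_score, random.Random(1))
```

Excluding a node's descendants from its candidate parents is what enforces acyclicity here; since X becomes a child of Y in the first resampling, it can never be drawn again as a parent of Y.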

This move is reversible!

Simple idea

Mathematical challenge:

• Show that condition of detailed balance is satisfied.

• Derive the Hastings factor …

• … which is a function of various partition functions

Acceptance probability
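The formula on this slide is not preserved in the transcript. For orientation only, the generic Metropolis-Hastings acceptance probability for a move from structure M to M' has the form below, where Q denotes the proposal distribution (a symbol introduced here, not taken from the slides) and the second fraction is the Hastings factor. The specific expression in terms of partition functions is not reproduced.

```latex
\[
A(M \to M') = \min\left\{ 1,\;
\frac{P(D \mid M')\, P(M')}{P(D \mid M)\, P(M)} \times
\frac{Q(M \mid M')}{Q(M' \mid M)} \right\}
\]
```

The normalisation constant Z of the posterior cancels in the first fraction, so only the score P(D|M) P(M) of each structure is needed.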

Ergodicity

• The new move is reversible but …

• … not irreducible

[Figure: graphs over the two nodes A and B.]

• Theorem: A mixture with an ergodic transition kernel gives an ergodic Markov chain.

• REV-MCMC: at each step randomly switch between a conventional structure MCMC step and the proposed new move.
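The random switching can be sketched as a mixture of two transition kernels. The mixing probability `p_new` and the two toy kernels on the states 0..3 are assumptions chosen to keep the example small; they stand in for the graph moves and are not taken from the slides.

```python
import random

def mixture_step(state, conventional_step, new_move_step, rng, p_new=0.1):
    """Apply the new move with probability p_new, otherwise the conventional step."""
    kernel = new_move_step if rng.random() < p_new else conventional_step
    return kernel(state, rng)

# Toy stand-ins on the states 0..3 (not graph moves):
def ring_walk(state, rng):
    """Can reach every state on its own."""
    return (state + rng.choice((-1, 1))) % 4

def even_jump(state, rng):
    """Symmetric jump 0 <-> 2; leaves odd states unchanged, so it is not irreducible."""
    return (state + 2) % 4 if state % 2 == 0 else state

rng = random.Random(0)
state, visited = 0, set()
for _ in range(1000):
    state = mixture_step(state, ring_walk, even_jump, rng)
    visited.add(state)
```

On its own `even_jump` would never leave the set {0, 2} when started at 0; combined with `ring_walk`, the chain visits all four states.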

Evaluation

• Does the new method avoid the bias intrinsic to order MCMC?

• How do convergence and mixing compare to structure and order MCMC?

• What is the effect on the network reconstruction accuracy?

Results

• Analytical comparison of the convergence properties

• Empirical comparison of the convergence properties

• Evaluation of the systematic bias

• Molecular regulatory network reconstruction with prior knowledge

Analytical comparison of the convergence properties

• Generate data from a noisy XOR

• Enumerate all 3-node networks

• Compute the posterior distribution p°

• Compute the Markov transition matrix A for the different MCMC methods

• Compute the Markov chain p(t+1)= A p(t)

• Compute the (symmetrized) KL divergence KL(t)= <p(t), p°>
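The last two steps can be sketched on a toy three-state chain. The target distribution, the Metropolis transition matrix built from it, and the choice of KL(p||q) + KL(q||p) as the symmetrised divergence are all assumptions for illustration; the talk's own matrices are built over the enumerated 3-node networks.

```python
import math

def step(A, p):
    """p(t+1) = A p(t), with A[i][j] the probability of moving from state j to state i."""
    n = len(p)
    return [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]

def sym_kl(p, q, eps=1e-12):
    """KL(p||q) + KL(q||p), with a small floor to avoid log(0)."""
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy target (stands in for the posterior) and a Metropolis kernel that has it as stationary law.
target = [0.2, 0.3, 0.5]
n = len(target)
A = [[0.0] * n for _ in range(n)]
for j in range(n):
    for i in range(n):
        if i != j:
            A[i][j] = min(1.0, target[i] / target[j]) / (n - 1)
    A[j][j] = 1.0 - sum(A[i][j] for i in range(n) if i != j)

p = [1.0, 0.0, 0.0]          # start with all mass on the first state
trace = []
for t in range(50):
    trace.append(sym_kl(p, target))
    p = step(A, p)
```

Plotting `trace` against t gives one convergence curve per sampler, which allows different transition matrices to be compared without any sampling noise.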

Solid line: REV-MCMC. Other lines: structure MCMC and different versions of inclusion-driven MCMC

Empirical comparison of the convergence and mixing properties

• Standard benchmark data: Alarm network (Beinlich et al. 1989) for monitoring patients in intensive care

• 37 nodes, 46 directed edges

• Generate data sets of different size

• Compare the three MCMC algorithms under the same computational costs:

structure MCMC (1.0E6)

order MCMC (1.0E5)

REV-MCMC (1.0E5)

What are the implications for network reconstruction?

ROC curves

Area under the ROC curve (AUROC)

[Figure: example ROC curves with AUC = 1, AUC = 0.75 and AUC = 0.5.]
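A minimal sketch of how an AUROC value can be computed from edge scores, for example posterior edge probabilities, and a gold-standard labelling. It uses the rank interpretation of the area under the ROC curve; the scores and labels below are invented.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen true edge is scored above a randomly
    chosen non-edge, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]                              # 1 = true edge, 0 = non-edge
perfect = auroc([0.9, 0.8, 0.2, 0.1], labels)      # every edge ranked above every non-edge
partial = auroc([0.9, 0.3, 0.6, 0.1], labels)      # one of four edge/non-edge pairs misordered
chance = auroc([0.5, 0.5, 0.5, 0.5], labels)       # uninformative scores
```

The three calls give 1.0, 0.75 and 0.5 respectively, matching the reference values on the slide.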

Conclusion

• Structure MCMC has convergence and mixing difficulties.

• Order MCMC and REV-MCMC show a similar (and much better) performance.

• How about the bias?

Evaluation of the systematic bias using standard benchmark data

• Standard machine learning benchmark data: FLARE and VOTE

• Restriction to 5 nodes: complete enumeration possible (~1.0E4 structures)

• The true posterior probabilities of edge features can be computed

• Compute the difference between the true scores and those obtained with MCMC
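The enumeration step can be sketched as follows. The code lists all DAGs over a small node set by brute force and computes exact posterior probabilities of directed edges under a user-supplied `log_score` (a hypothetical stand-in for log P(D|M) + log P(M)); subtracting these from MCMC estimates gives the deviations. The example uses three nodes with a flat score to stay fast.

```python
import math
from itertools import product

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes that have no incoming edge from the remaining nodes."""
    remaining = set(nodes)
    while remaining:
        roots = [n for n in remaining
                 if not any(c == n and p in remaining for p, c in edges)]
        if not roots:
            return False
        remaining -= set(roots)
    return True

def all_dags(nodes):
    """Brute force: try every subset of ordered node pairs and keep the acyclic ones."""
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    for mask in product((0, 1), repeat=len(pairs)):
        edges = frozenset(p for p, m in zip(pairs, mask) if m)
        if is_acyclic(nodes, edges):
            yield edges

def edge_posteriors(nodes, log_score):
    """Exact posterior probability of each directed edge, by summing over all DAGs."""
    dags = list(all_dags(nodes))
    top = max(log_score(g) for g in dags)
    weights = [math.exp(log_score(g) - top) for g in dags]
    z = sum(weights)
    return {(a, b): sum(w for g, w in zip(dags, weights) if (a, b) in g) / z
            for a in nodes for b in nodes if a != b}

nodes = ["A", "B", "C"]
exact = edge_posteriors(nodes, lambda g: 0.0)      # flat score: every DAG equally likely
estimated = {e: 0.30 for e in exact}               # made-up MCMC estimates
deviation = {e: estimated[e] - exact[e] for e in exact}
```

Three nodes give 25 DAGs and four give 543; five nodes give 29,281, which is still enumerable, although this brute-force loop over all 2^20 edge subsets is slow in pure Python.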

Deviations between true and estimated directed edge feature posterior probabilities

Raf regulatory network

From Sachs et al., Science (2005)

Raf signalling pathway

• Cellular signalling network of 11 phosphorylated proteins and phospholipids in human immune system cells

• Deregulation leads to carcinogenesis

• Extensively studied in the literature: gold standard network

Data

Prior knowledge

Flow cytometry data

• Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins

• 5400 cells have been measured under 9 different cellular conditions (cues)

• Downsampling to 10 & 100 instances (5 separate subsets): indicative of microarray experiments

Biological Prior Knowledge

Biological prior knowledge matrix B (for “belief”)

An entry of the matrix indicates some knowledge about the relationship between genes i and j

Define the energy of a graph G

Prior distribution over networks

Energy of a network
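The formulas on these slides are not preserved in the transcript. One common way to turn a prior knowledge matrix into a prior over networks, assumed here purely for illustration, is to define the energy as E(G) = sum over pairs of |B_ij - G_ij| and the prior as P(G) proportional to exp(-beta E(G)). The names `energy`, `graph_prior` and the hyperparameter `beta` are hypothetical.

```python
import math

def energy(B, G):
    """Assumed form: E(G) = sum over ordered pairs i != j of |B_ij - G_ij|."""
    n = len(B)
    return sum(abs(B[i][j] - G[i][j]) for i in range(n) for j in range(n) if i != j)

def graph_prior(B, graphs, beta):
    """Assumed Gibbs form: P(G) = exp(-beta * E(G)) / Z, normalised over `graphs`."""
    weights = [math.exp(-beta * energy(B, G)) for G in graphs]
    z = sum(weights)
    return [w / z for w in weights]

# Two genes; B[0][1] = 0.9 expresses a strong belief in an edge from gene 0 to gene 1.
B = [[0.0, 0.9],
     [0.1, 0.0]]
forward = [[0, 1], [0, 0]]     # adjacency matrix with the edge 0 -> 1
backward = [[0, 0], [1, 0]]    # adjacency matrix with the edge 1 -> 0
empty = [[0, 0], [0, 0]]
prior = graph_prior(B, [forward, backward, empty], beta=2.0)
```

Under this assumed form a graph that agrees with the prior knowledge has low energy and therefore high prior probability, and beta controls how strongly the prior knowledge is trusted.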

Prior knowledge

Sachs et al.

Edge    Non-edge
0.9     0.1
0.6     0.4
0.55    0.45

AUROC scores

Conclusion

• True prior knowledge that is strong: no significant difference.

• True prior knowledge that is weak: order MCMC leads to a slight yet significant deterioration (significant at the p = 0.01 level, obtained from a paired t-test).
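A paired t-test compares two methods on the same data sets by testing whether the mean of the per-data-set differences is zero. The sketch below computes the test statistic only; the AUROC values are invented for illustration and are not results from the talk.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d))), len(d) - 1

# Made-up AUROC scores of two samplers on the same five data subsets.
method_one = [0.80, 0.82, 0.79, 0.85, 0.81]
method_two = [0.78, 0.79, 0.78, 0.82, 0.80]
t_stat, dof = paired_t(method_one, method_two)
```

The p-value follows from the t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` computes both the statistic and the p-value.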

Prior knowledge from KEGG

Flow cytometry data and KEGG

Conclusions

• The new method avoids the bias intrinsic to order MCMC.

• Its convergence and mixing are similar to order MCMC; both methods outperform structure MCMC.

• We can get an improvement over order MCMC when using explicit prior knowledge.

Thank you!

Any questions?