learning regulatory networks from postgenomic data and prior knowledge dirk husmeier 1)...
TRANSCRIPT
Learning regulatory networks from postgenomic data and prior knowledge
Dirk Husmeier
1) Biomathematics & Statistics Scotland
2) Centre for Systems Biology at Edinburgh
Raf signalling network
From Sachs et al Science 2005
Systems Biology
unknown
high-throughput experiment
s
postgenomic data
machine learning
statistical methods
Bayesian networks
A
CB
D
E F
NODES
EDGES
•Marriage between graph theory and probability theory.
•Directed acyclic graph (DAG) representing conditional independence relations.
•It is possible to score a network in light of the data: P(D|M), D:data, M: network structure.
•We can infer how well a particular network explains the observed data.
),|()|(),|()|()|()(
),,,,,(
DCFPDEPCBDPACPABPAP
FEDCBAP
Model
Parameters
Learning Bayesian networks
P(M|D) = P(D|M) P(M) / Z
M: Network structure. D: Data
MCMC in structure spaceMadigan & York (1995), Guidici & Castello (2003)
Alternative paradigm: order MCMC
Machine Learning, 2004
Successful application of Bayesian networks to the Raf regulatory network
From Sachs et al Science 2005
Flow cytometry data
• Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins
• 5400 cells have been measured under 9 different cellular conditions (cues)
• Optimzation with hill climbing
• Perfect reconstruction
Microarray data Spellman et al (1998)Cell cycle73 samples
Tu et al (2005)Metabolic cycle36 samples
Ge
nes
Ge
nes
time time
AUC scores TP for FP=5
Part 1
Integration of prior knowledge
+
+
+
+…
Use TF binding motifs in promoter sequences
Biological prior knowledge matrix
Biological Prior Knowledge
Define the energy of a Graph G
Indicates some knowledge aboutthe relationship between genes i and j
Prior distribution over networks
Deviation between the network G and the prior knowledge B: Graph: є {0,1}
Prior knowledge: є [0,1]“Energy”
Hyperparameter
New contribution
• Generalization to more sources of prior knowledge
• Inferring the hyperparameters
• Bayesian approach
Multiple sources of prior knowledge
Sample networks and hyperparameters from the posterior distribution
Bayesian networkswith two sources of prior
Data
BNs + MCMC
Recovered Networks and trade off parameters
Source 1 Source 2
1 2
Bayesian networkswith two sources of prior
Data
BNs + MCMC
Source 1 Source 2
1 2
Recovered Networks and trade off parameters
Bayesian networkswith two sources of prior
Data
BNs + MCMC
Source 1 Source 2
1 2
Recovered Networks and trade off parameters
Sample networks and hyperparameters from the posterior distribution with MCMC
Metropolis-Hastings scheme
Proposal probabilities
Sample networks and hyperparameters from the posterior distribution
Metropolis-Hastings scheme
Proposal probabilities
Sample networks and hyperparameters from the posterior distribution
Metropolis-Hastings scheme
Proposal probabilities
Prior distribution
Rewriting the energy
Energy of a network
Approximation of the partition function
Partition function of an ideal gas
Evaluation on the Raf regulatory network
From Sachs et al Science 2005
Evaluation: Raf signalling pathway
• Cellular signalling network of 11 phosphorylated proteins and phospholipids in human immune systems cell
• Deregulation carcinogenesis
• Extensively studied in the literature gold standard network
DataPrior knowledge
Flow cytometry data
• Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins
• 5400 cells have been measured under 9 different cellular conditions (cues)
• Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments
Microarray example Spellman et al (1998)Cell cycle73 samples
Tu et al (2005)Metabolic cycle36 samples
Ge
nes
Ge
nes
time time
DataPrior knowledge
Prior knowledge from KEGG
Prior distribution
Prior knowledge from KEGG
Raf network
0.25
00.5
0
0.5
0.87
0
1
0.5
0 0
0.5
0
10.71
0
0
Data and prior knowledge
+ KEGG
+ Random
Evaluation
• Can the method automatically evaluate how useful the different sources of prior knowledge are?
• Do we get an improvement in the regulatory network reconstruction?
• Is this improvement optimal?
Sampled values of the
hyperparameters
Bayesian networkswith two sources of prior knowledge
Data
BNs + MCMC
Recovered Networks and trade off parameters
Random KEGG
1 2
Bayesian networkswith two sources of prior knowledge
Data
BNs + MCMC
Random KEGG
1 2
Recovered Networks and trade off parameters
Evaluation
• Can the method automatically evaluate how useful the different sources of prior knowledge are?
• Do we get an improvement in the regulatory network reconstruction?
• Is this improvement optimal?
•We use the Area Under the Receiver Operating
Characteristic Curve (AUC).
0.5<AUC<1
AUC=1AUC=0.5
Performance evaluation:ROC curves
5 FP counts
BN
GGM
RN
Alternative performance evaluation: True positive (TP) scores
Flow cytometry data and KEGG
Evaluation
• Can the method automatically evaluate how useful the different sources of prior knowledge are?
• Do we get an improvement in the regulatory network reconstruction?
• Is this improvement optimal?
Learning the trade-off hyperparameter
• Repeat MCMC simulations for large set of fixed hyperparameters β
• Obtain AUC scores for each value of β
• Compare with the proposed scheme in which β is automatically inferred.
Mean and standard deviation of the sampled trade off parameter
Flow cytometry data and KEGG
Part 2
Combining data from different experimental conditions
What if we have multiple data sets obtained under different experimental conditions?
Example: Cytokine network
• Infection
•Treatment with IFN
•Infection and treatment with IFN
Collaboration with Peter Ghazal, Paul Dickinson, Kevin Robertson, Thorsten Forster & Steve Watterson.
datadata data datadata data
Monolithic Individual
datadata data datadata data
Monolithic Individual
Propose a compromise between the two
M1 M221
D1 D2
M*
MII
DI
. . .
Compromise between the two previous ways of combining the data
BGe or BDe
Ideal gas approximation
MCMC
Empirical evaluation
Real application: macrophages infected with CMV and pre-treated
with IFN-γ
No gold-standard
Simulated data from the Raf signalling network
Simulated dataRaf network
Simulated data
v-Raf network
Simulated data
Raf network
v-Raf network
Simulated data
Simulated data
Simulated Data
Weights between nodes are different for different data sets.
Simulated Data
Weights between nodes are different for different data sets.
5 data sets100 data points each
1 random data set (pure noise)
1 data set from the modified network
3 data sets from the Raf network, but with
different regulations strengths
M1 M221
D1
M*
MII
DI
. . .
Compromise between the two previous ways of combining the data
Corrupt, noisy data
Modified network
Raf network
Posterior distribution of ß
M1 M221
D1
M*
MII
DI
. . .
Compromise between the two previous ways of combining the data
0
5 data sets100 data points each
1 random data set (pure noise)
1 data set from the modified network
3 data sets from the Raf network, but with
different regulations strengths
Corrupt, noisy data
Modified network
Raf network
Network reconstruction accuracy
Convergence problems
Coupling method
Std MCMC
Data sets:1 rand (blue)3 raf 1 vraf (cyan)
Traceplots of sampled
hyperparameters;
Gaussian data set
log likelihood
• The MCMC simulations have convergence problems.• If the simulations “converge”:
– Random data set is identified and switched off.– Data from a slightly modified network are also
identified.– The reconstructed network outperforms the two
competing approaches.• Future work:
The convergence problems need to be addressed.
Conclusions – Part 2
Part 3
Markov chain Monte Carlo
Learning Bayesian networks
P(M|D) = P(D|M) P(M) / Z
M: Network structure. D: Data
MCMC in structure spaceMadigan & York (1995), Guidici & Castello (2003)
Main idea
Propose new parents from the distribution:
•Identify those new parents that are involved in the formation of directed cycles.
•Orphan them, and sample new parents for them subject to the acyclicity constraint.
1) Select a node
2) Sample new parents 3) Find directed cycles
4) Orphan “loopy” parents
5) Sample new parents for these parents
Mathematical Challenge:
• Show that condition of detailed balance is satisfied.
• Derive the Hastings factor …
• … which is a function of various partition functions
Acceptance probability
Summary
• Learning Bayesian networks from postgenomic data
• Integration of biological prior knowledge
• Learning regulatory networks from heterogeneous data obtained under different experimental conditions
• Improving MCMC
Acknowledgements
Funding from the
Scottish Government
Rural and Environment Research and Analysis Directorate (RERAD)
Collaboration with
Adriano Werhli
Marco Grzegorczyk
Adriano Werhli
Marco Grzegorzcyk
Thank you!
Any questions?