

Using Bayesian Networks to Analyze Whole-Genome Expression Data

Nir Friedman, Iftach Nachman, Dana Pe’er

Institute of Computer Science, The Hebrew University of Jerusalem

Biological Background

• Similar DNA... yet different expression of proteins

• The expression profile depends on:
  • Tissue
  • External conditions
  • Growth stages
  • ...

• Gene expression is responsible for cell activity (including regulation of expression)

DNA → RNA → Protein

cDNA microarray: DNA hybridization measures abundance of RNA

• Recently developed technologies allow parallel measurement of the expression level of thousands of genes/proteins

• This allows biologists to view the cell as a complete system

Big Challenge

Extracting meaningful information from the expression data:

• Infer regulatory mechanisms
• Reveal function of proteins
• Experiment planning

Prior Work

Clustering of expression data

• Groups together genes with similar expression patterns
• Disadvantage: does not reveal structural relations between genes

Boolean Networks

• Deterministic models of the logical interactions between genes
• Disadvantages: deterministic, impractical for real data

We suggest a probabilistic framework capable of learning complex relations between genes.

Bayesian Networks

A Bayesian Network (BN) is a graphical representation of a probability distribution.

Advantages:

• Compact & intuitive representation
• Captures causal relationships
• Efficient model learning
• Deals with noisy data
• Integration of prior knowledge
• Effective inference for experiment planning

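[Figure: example Bayesian network over Gene A, Gene B, Gene C, Gene D and Gene E, with a conditional probability table P(A | E, B) indexed by the values of E and B. The extracted table rows are 0.9/0.1, 0.2/0.8, 0.01/0.99 and 0.9/0.1; the parent assignment for each row is not recoverable from the transcript.]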

Qualitative part: directed acyclic graph (DAG)

• Nodes: random variables of interest
• Edges: direct (causal) influence

Quantitative part:

• Local probability models
• Set of conditional probability distributions
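As a minimal sketch of how the two parts fit together: the joint distribution factorizes into one local term per node, P(X1, ..., Xn) = Π P(Xi | parents(Xi)). The Python below evaluates that product for a hypothetical three-node network E → A ← B. The variable names, the binary domains, and every probability value are illustrative assumptions, not numbers taken from the poster.

```python
from itertools import product

# Qualitative part (hypothetical DAG): each variable maps to its parents.
parents = {"E": (), "B": (), "A": ("E", "B")}

# Quantitative part (hypothetical values):
# cpt[var][parent_values] = P(var = True | parents = parent_values)
cpt = {
    "E": {(): 0.1},
    "B": {(): 0.3},
    "A": {
        (True, True): 0.95,
        (True, False): 0.3,
        (False, True): 0.6,
        (False, False): 0.05,
    },
}


def joint_probability(assignment):
    """Multiply one local conditional probability per node."""
    p = 1.0
    for var, pa in parents.items():
        key = tuple(assignment[q] for q in pa)
        p_true = cpt[var][key]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p


# Sanity check: the factorized joint sums to 1 over all assignments.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
```

The table for A has four rows while a full joint over three binary variables has eight entries; this saving grows quickly when each gene has only a few parents, which is the "compact representation" advantage listed above.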


Preliminary Experiments

Data from Spellman et al. (Mol. Bio. of the Cell 1998), http://genome-www.stanford.edu/cell-cycle

• Contains 76 samples covering the whole yeast genome
• Different methods for synchronizing the cell cycle in yeast
• Time series at intervals of a few minutes (5-20 min)

Spellman et al. identified 800 cell-cycle regulated genes and clustered them

• 250 of these genes were combined in 8 clusters

We took these 250 genes and

• Discretized them into three levels of expression
• Ran a 100-fold bootstrap using our sparse learning algorithm
• Computed confidence in predictions

Evaluation

Pairs with 80% confidence were evaluated against the original clustering:

• 70% of these were intra-cluster
• The rest show interesting inter-cluster relations

Biological Insight

M. Linial, Life Sciences, Hebrew U., examined the relations

• Most relations involved unknown/putative proteins...
• ...but we can guess functions based on homologies
• ...and they mostly make a lot of biological sense
• Only 3 pairs were considered suspicious
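The discretization step can be sketched as follows. The poster does not say how the three levels were chosen; the symmetric threshold of 0.5 on a log expression ratio, the function name `discretize`, and the sample values are all assumptions made for illustration.

```python
def discretize(log_ratio, threshold=0.5):
    """Map a log expression ratio to -1 (under-expressed), 0 (baseline)
    or 1 (over-expressed). The threshold is a hypothetical choice."""
    if log_ratio > threshold:
        return 1
    if log_ratio < -threshold:
        return -1
    return 0


# Hypothetical expression profile of one gene across four samples.
profile = [-1.2, 0.1, 0.7, -0.3]
levels = [discretize(v) for v in profile]
```

Three levels keep the conditional probability tables small, which matters when only a few dozen samples are available.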

Future Directions & Work

To get better results, we need:

• More data! Publicly available gene expression experiments are extremely small.
• Frequent samples: current sampling is far below the rate of the regulation process.
• External variables: we want to relate regulation to external events: stimuli, temperature, nutrient levels, etc.

We plan to improve modeling by:

• More suitable local distribution models
• Correct handling of hidden variables: can we recognize hidden causes of coordinated regulation events?
• Improving computational efficiency
• Incorporating prior knowledge: we need to incorporate the large mass of biological knowledge, and insight from sequence/structure databases
• Learning from interventions: how to learn causality from knockout experiments? How to plan such experiments? Related issues have been examined in the BN literature.

References

N. Friedman, I. Nachman, and D. Pe’er. Learning of Bayesian Network structure from massive datasets: The “sparse candidate algorithm”. HUJI tech report CS99-3. (Submitted)

N. Friedman, M. Goldszmidt, and A. Wyner. Data Analysis with Bayesian Networks: A Bootstrap Approach. HUJI tech report CS99-4. (Submitted)

N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using Bayesian Networks to analyze whole genome expression data: A Preliminary Investigation. HUJI tech report CS99-6. (In preparation)

D. Heckerman. A tutorial on learning with Bayesian Networks. In Learning in Graphical Models, MIT Press, 1998.

J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, Calif., 1988.

Spellman et al. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol. Bio. of the Cell, vol. 9, December 1998.

Bayesian Networks for Gene Expression

We want to apply methods for learning Bayesian networks to analyze gene expression experiments.

• Measured expression level of each gene
• Random variables affecting one another

Possible extensions: random variables that measure

• External stimuli
• Environment parameters (temperature, nutrients, pH, etc.)
• Biological factors

Learning Bayesian Networks

[Figure: Data + Prior information → Learner → Bayesian network over nodes A, B, C, D, E]

Efficient algorithms exist for learning a BN from data. Learning a BN can:

• Reveal the underlying structure of the domain: direct relations between variables
• Find causal influence
• Discover hidden variables

Technical Challenges

Issues:

• Massive number of variables (thousands)
• Small number of samples (dozens)
• Sparse networks (only a small number of genes directly affect one another)

Crucial aspects:

• Computational complexity
• Statistical significance of features in learned models

To address these issues we developed:

Sparse Candidate algorithm: an efficient heuristic search that relies on sparseness

• Choose a candidate set for direct influence for each gene
• Find the optimal BN constrained on the candidates
• Iteratively improve the candidate set

Bootstrap confidence estimates

• Use resampling to generate perturbations of the training data
• Use the number of times a feature is repeated among networks learned from these datasets to estimate the confidence of Bayesian network features

[Figure labels: candidates; parents in BN]
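The bootstrap confidence estimate can be illustrated with a rough sketch. It is not the Sparse Candidate algorithm: the function `learn_features` is a hypothetical stand-in learner that, for each gene, keeps its single highest-scoring partner by empirical mutual information, which only loosely mirrors the "choose candidate set" step. The gene names, the toy data, and the parameters `k`, `n_boot` and `seed` are likewise assumptions.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def learn_features(data, k=1):
    """Stand-in learner: each gene keeps its k highest-MI partners;
    the returned features are unordered gene pairs."""
    genes = sorted(data)
    features = set()
    for g in genes:
        ranked = sorted(
            (h for h in genes if h != g),
            key=lambda h: mutual_information(data[g], data[h]),
            reverse=True,
        )
        features.update(frozenset((g, h)) for h in ranked[:k])
    return features


def bootstrap_confidence(data, n_boot=100, seed=0):
    """Fraction of resampled datasets in which each feature is learned."""
    rng = random.Random(seed)
    n = len(next(iter(data.values())))
    counts = Counter()
    for _ in range(n_boot):
        # Resample the experiments (columns) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = {g: [col[i] for i in idx] for g, col in data.items()}
        counts.update(learn_features(resampled))
    return {pair: c / n_boot for pair, c in counts.items()}


# Toy discretized data: g1 and g2 move together, g3 does not follow them.
data = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "g2": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "g3": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
}
confidence = bootstrap_confidence(data)
```

Features that persist across most resampled datasets, such as the g1/g2 pair here, get confidence near 1, while features that depend on a few particular samples get lower values.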

Network Learned

[Figure: the learned network; the legend bins confidence into the ranges 0.9 to 1.0, 0.8 to 0.9, 0.7 to 0.8, 0.6 to 0.7, 0.5 to 0.6, 0.4 to 0.5 and 0.0 to 0.4.]