Dependency networks
Sushmita Roy, BMI/CS 576, [email protected]
Nov 26th, 2013
Goals for today
• Introduction to dependency networks
• GENIE3: a network inference algorithm for learning a dependency network from gene expression data
• Comparison of various network inference algorithms
What you should know
• What are dependency networks?
• How do they differ from Bayesian networks?
• Learning a dependency network from expression data
• Evaluation of various network inference methods
Graphical models for representing regulatory networks
• Bayesian networks
• Dependency networks
Structure
[Figure: a small regulatory network (Msb2, Sho1, Ste20). Random variables encode expression levels; regulators X1 and X2 point to target Y3. Edges correspond to some form of statistical dependency, summarized by a function: Y3 = f(X1, X2).]
Dependency network
• A type of probabilistic graphical model
• As in Bayesian networks, has
  – A graph component
  – A probability component
• Unlike Bayesian networks, can have cyclic dependencies
Dependency Networks for Inference, Collaborative Filtering and Data Visualization. Heckerman, Chickering, Meek, Rounthwaite, Kadie, 2000
Notation
• Xi: the ith random variable
• X = {X1, ..., Xp}: the set of p random variables
• xik: an assignment of Xi in the kth sample
• x-ik: the set of assignments to all variables other than Xi in the kth sample
Dependency networks
[Figure: a target Xj with its set of regulators; Xj is predicted from its regulators by a function fj.]
• Function: fj can be of different types
• Learning requires estimating each of the fj functions
• In all cases, learning tries to minimize an error of predicting Xj from its neighborhood
Different representations of the fj function
• If X is continuous
  – fj can be a linear function
  – fj can be a regression tree
  – fj can be a random forest: an ensemble of trees
• If X is discrete
  – fj can be a conditional probability table
  – fj can be a conditional probability tree
Linear regression
[Figure: scatter of output Y against input X with a fitted line.]
Linear regression assumes that the output (Y) is a linear function of the input (X): Y = aX + b, where a is the slope and b is the intercept.
Estimating the regression coefficients
• Assume we have N training samples
• We want to minimize the sum of squared errors between the true and predicted values of the output Y
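The least-squares fit described above can be sketched in a few lines of numpy; the data here is illustrative, not from the lecture.

```python
import numpy as np

# Minimal least-squares sketch: fit Y = a*X + b by minimizing
# the sum of squared errors over the N training samples.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0  # noise-free line with slope 2, intercept 1

# Design matrix [X, 1] so lstsq solves for [slope, intercept].
A = np.column_stack([X, np.ones_like(X)])
(slope, intercept), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(slope, intercept)  # slope ≈ 2.0, intercept ≈ 1.0
```

With noisy data the same call returns the coefficients that minimize the sum of squared residuals.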
An example random forest for predicting gene expression
[Figure: an ensemble of regression trees mapping inputs to an output; a selected path for a set of genes follows tests such as Sox6 > 0.5.]
Considerations for learning regression trees
• Assessing the purity of samples under a leaf node
  – Minimize prediction error
  – Minimize entropy
• How to determine when to stop building a tree?
  – Minimum number of data points at each leaf node
  – Depth of the tree
  – Purity of the data points under any leaf node
Algorithm for learning a regression tree
• Input: output variable Xj, input variables X-j (all variables other than Xj)
• Initialize the tree to a single node with all samples under the node, and estimate
  – mc: the mean of all samples under the node
  – S: the sum of squared errors
• Repeat until there are no more nodes to split
  – Search over all input variables and split values, and compute S for the possible splits
  – Pick the variable and split value that give the greatest reduction in error
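The split-search step above can be sketched as an exhaustive scan over variables and thresholds; the function name and data are illustrative.

```python
import numpy as np

def best_split(X, y):
    """Search all input variables and candidate split values; return
    the split with the greatest reduction in the sum of squared
    errors S. A minimal sketch of the tree-growing step."""
    n, p = X.shape
    base = np.sum((y - y.mean()) ** 2)  # S at the current node
    best = (None, None, 0.0)            # (variable, threshold, reduction)
    for j in range(p):
        for t in np.unique(X[:, j])[:-1]:   # thresholds between samples
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            s = (np.sum((left - left.mean()) ** 2)
                 + np.sum((right - right.mean()) ** 2))
            if base - s > best[2]:
                best = (j, t, base - s)
    return best
```

A full tree learner would recurse on the two halves until one of the stopping criteria above is met.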
GENIE3: GEne Network Inference with Ensemble of trees
• Solves a set of regression problems, one per random variable
• Models non-linear dependencies
• Outputs a directed, cyclic graph with a confidence for each edge
• Focuses on generating a ranking over edges rather than a graph structure and parameters
Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. Van Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, Pierre Geurts, PLoS ONE, 2010
GENIE3 algorithm sketch
• For each gene j, generate input/output pairs LSj = {(x-jk, xjk), k = 1..N}
  – Use a feature selection technique on LSj, such as tree building, to compute wij for all genes i ≠ j
  – wij quantifies the confidence of the edge between Xi and Xj
• Generate a global ranking of regulators based on each wij
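The per-gene loop can be sketched with scikit-learn's RandomForestRegressor standing in for the paper's own forest implementation; the function name, default parameters, and importance normalization are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def genie3_sketch(expr, n_trees=100, seed=0):
    """GENIE3-style sketch: for each gene j, regress x_j on all
    other genes with a random forest and read off the forest's
    feature importances as edge confidences w_ij.
    expr: array of shape (samples, genes)."""
    n_samples, p = expr.shape
    W = np.zeros((p, p))  # W[i, j] = confidence of edge i -> j
    for j in range(p):
        others = [i for i in range(p) if i != j]
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        rf.fit(expr[:, others], expr[:, j])
        W[others, j] = rf.feature_importances_
    return W
```

Flattening and sorting the entries of W gives the global edge ranking the slide describes.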
GENIE3 algorithm sketch
[Figure from Huynh-Thu et al.]
Feature selection in GENIE3
• A random forest represents each fj
• Learning the random forest
  – Generate M = 1000 bootstrap samples
  – At each node to be split, search for the best split among K randomly selected variables
  – K was set to p-1 or (p-1)^(1/2)
Computing the importance weight of each predictor
• Feature importance is computed at each test node
• Remember there can be multiple test nodes per regulator
• For a test node, importance is given by the reduction in variance achieved by splitting on that node:
  I = #S · Var(S) - #St · Var(St) - #Sf · Var(Sf)
  – S: the set of data samples that reach the test node; St, Sf: the subsets for which the test is true or false
  – #S: size of the set S
  – Var(S): variance of the output variable in set S
Computing the importance of a predictor
• For a single tree, the overall importance is the sum over all points in the tree where this variable is used to split
• For an ensemble, the importance is averaged over all trees
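The variance-reduction score for a single test node can be written directly from the definitions above; the function name and data are illustrative.

```python
import numpy as np

def split_importance(y_node, mask):
    """Importance of one test node: the reduction in size-weighted
    variance when the node's samples are split by `mask`, i.e.
    #S*Var(S) - #St*Var(St) - #Sf*Var(Sf)."""
    S, St, Sf = y_node, y_node[mask], y_node[~mask]
    return (len(S) * np.var(S)
            - len(St) * np.var(St)
            - len(Sf) * np.var(Sf))
```

Summing this score over every node that tests a given regulator, then averaging over trees, yields the edge weight wij.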
Computational complexity of GENIE3
• Complexity per variable: O(TKN log N)
  – T: the number of trees
  – K: the number of random attributes selected per split
  – N: the learning sample size
Evaluation of network inference methods
• Assume we know what the “right” network is
• One can use precision-recall curves to evaluate the predicted network
• The area under the PR curve (AUPR) quantifies performance
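Given the true edge labels and the predicted edge confidences, AUPR can be computed with scikit-learn; the labels and scores below are illustrative, not from any DREAM dataset.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# 1 = edge present in the "right" network, 0 = absent.
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
# Predicted confidences for the same candidate edges.
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.05])

precision, recall, _ = precision_recall_curve(y_true, scores)
aupr = auc(recall, precision)  # area under the precision-recall curve
print(aupr)
```

A perfect ranking (all true edges above all false ones) gives AUPR = 1; a random ranking hovers near the fraction of true edges.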
AUPR-based performance comparison
DREAM: Dialogue on Reverse Engineering Assessment and Methods, a community effort to assess regulatory network inference
DREAM5 challenge; previous challenges: 2006, 2007, 2008, 2009, 2010
Marbach et al. 2012, Nature Methods
Where do different methods rank?
[Figure from Marbach et al., 2010; “Community” and “Random” predictions are shown as reference points.]
Comparing module (LeMoNe) and per-gene (CLR) methods
Summary of network inference methods
• Probabilistic graphical models provide a natural representation of networks
• A lot of network inference is done using gene expression data
• Many algorithms exist; we have seen three
  – Bayesian networks
    • Sparse candidates
    • Module networks
  – Dependency networks
    • GENIE3
• Algorithms can be grouped into per-gene and per-module methods