K.U.Leuven
Department of Computer Science
Predicting gene functions using hierarchical multi-label
decision tree ensembles
Celine Vens, Leander Schietgat, Jan Struyf, Hendrik Blockeel, Dragi Kocev, Sašo Džeroski
• Classification: a common machine learning task, e.g.,
• Given: genes with known function
• Task: predict function for new genes
• Special case: hierarchical multi-label classification (HMC)
• a gene can have multiple functions
• functions are organized in a hierarchy
• tree (e.g., MIPS FunCat)
• DAG (e.g., Gene Ontology)
Hierarchy constraint: if a gene is labeled with function X, then it is also labeled with all parents of X (and hence with all ancestors of X)
Hierarchical Multi-Label Classification (HMC) for Gene Function Prediction
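The hierarchy constraint can be illustrated with a small sketch. The class identifiers (5, 5/1, 40, 40/3, 40/16) are taken from the example table in these slides; the parent links are an assumption read off the slash notation, and the helper names `close_under_hierarchy` and `satisfies_constraint` are hypothetical, not part of any Clus tool.

```python
# Hypothetical toy hierarchy: each class maps to its list of parents.
# A tree gives at most one parent per class; a DAG may give several.
PARENTS = {
    "5": [],
    "5/1": ["5"],
    "40": [],
    "40/3": ["40"],
    "40/16": ["40"],
}


def close_under_hierarchy(labels, parents):
    """Return the labels plus all their ancestors, so the constraint holds."""
    closed = set()
    stack = list(labels)
    while stack:
        cls = stack.pop()
        if cls not in closed:
            closed.add(cls)
            # Walk upwards; repeated application reaches every ancestor.
            stack.extend(parents.get(cls, []))
    return closed


def satisfies_constraint(labels, parents):
    """True if every parent of every label is itself in the label set."""
    label_set = set(labels)
    return all(p in label_set for c in label_set for p in parents.get(c, []))


# A gene annotated with 5/1 and 40/16 must also carry 5 and 40.
annotation = close_under_hierarchy({"5/1", "40/16"}, PARENTS)
print(sorted(annotation))
```

Because parents are stored as a list, the same code works unchanged for a DAG such as the Gene Ontology, where a class may have more than one parent.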
Predictions in Functional Genomics
• S. cerevisiae (13 datasets) and A. thaliana (12 datasets)
• two of biology’s model organisms
• most genes are annotated, ideal for testing purposes
• method can be applied to other organisms
• Data
• based on sequence statistics, phenotype, secondary structure, homology, microarray data,…
Predictive Clustering Trees
• Our focus is on decision trees
• Advantages: fast to build, noise-resistant, fast to apply, accurate predictions, easy to interpret, …
• General framework: predictive clustering trees (PCTs)
Input: genes with features and known functions
[Table: one row per gene (G1, G2, …, G6, …); columns are Name, the features A1, A2, …, An, and the classes 1, …, 5, 5/1, …, 40, 40/3, 40/16, …; an x marks each function a gene is annotated with]
Algorithm: top-down induction of PCTs (PCT-algo)
Output: PCT
Decision Trees for HMC: Different Approaches
• Clus-SC (standard approach): learns one tree per class
• Clus-HSC (special-purpose approach): learns one tree per class + hierarchy constraint
• Clus-HMC (our approach): learns one single tree for all classes
• Compared on: hierarchy constraint, identification of global features, predictive performance, model size, efficiency
Predictive Clustering Forests
• Ensembles
• Less interpretability
• Better performance
• Algorithm: Clus-HMC-Ens
[Figure: the training set is resampled into 50 bootstrap replicates (1, 2, 3, …, n); Clus-HMC is run on each replicate, giving 50 PCTs; applied to the test set, these give 50 predictions (L1, L2, L3, …, Ln), which are merged into one combined prediction L]
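The bootstrap-and-combine scheme in the figure can be sketched as follows. This is a minimal illustration of bagging, not the Clus-HMC-Ens implementation: the base learner `fit_mean_learner` is a deliberately trivial stand-in for Clus-HMC, the function names are hypothetical, and averaging the per-class scores is an assumption, since the slide only says the 50 predictions are combined.

```python
import random


def fit_mean_learner(X, Y):
    """Stand-in base learner (NOT Clus-HMC): always predicts the mean
    label vector of its training sample, ignoring the features."""
    n = len(Y)
    mean = [sum(col) / n for col in zip(*Y)]
    return lambda x: mean


def bagging_fit(X, Y, fit_base, n_models=50, seed=0):
    """Train n_models base models, each on its own bootstrap replicate."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        # Bootstrap replicate: n training examples drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_base([X[i] for i in idx], [Y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Combine predictions by averaging per-class scores (an assumption)."""
    preds = [m(x) for m in models]
    return [sum(col) / len(preds) for col in zip(*preds)]


# Toy data: 4 genes, 1 feature, 3 classes (0/1 label vectors).
X = [[0.1], [0.4], [0.8], [0.9]]
Y = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
models = bagging_fit(X, Y, fit_mean_learner)
scores = bagging_predict(models, [0.5])
print(len(models), scores)
```

Passing the base learner as a function keeps the ensemble logic separate from the tree learner, so any learner that returns a per-class score vector could be plugged in.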
Decision Trees for HMC: Different Approaches
• Clus-SC (standard approach): learns one tree per class
• Clus-HSC (special-purpose approach): learns one tree per class + hierarchy constraint
• Clus-HMC (our approach): learns one single tree for all classes
• Clus-HMC-Ens (variant of our approach): learns a forest
• Compared on: hierarchy constraint, identification of global features, predictive performance, model size, efficiency
Evaluation
• Measure: precision-recall
• precision: percentage of predicted functions that are correct (TP/(TP+FP))
• recall: percentage of actual functions that are predicted by the algorithm (TP/(TP+FN))
• Average PR curve
– Consider (instance, class) couples
– A couple is true if the instance has the class, and predicted true if the instance is predicted to have the class
                  predicted true   predicted false
actually true          TP               FN
actually false         FP               TN
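A minimal sketch of how one point on the average PR curve could be computed by pooling all (instance, class) couples. The function names and toy numbers are hypothetical, and returning precision 1.0 when nothing is predicted is a convention assumed here, not something stated in the slides.

```python
def pr_point(scores, truth, threshold):
    """Precision and recall over all (instance, class) couples.

    scores[i][c] is the predicted score that instance i has class c;
    truth[i][c] is 1 if instance i really has class c, else 0.
    A couple is predicted true when its score reaches the threshold.
    """
    tp = fp = fn = 0
    for score_row, truth_row in zip(scores, truth):
        for s, t in zip(score_row, truth_row):
            predicted = s >= threshold
            if predicted and t:
                tp += 1
            elif predicted and not t:
                fp += 1
            elif t:
                fn += 1
    # Assumed convention: with no positive predictions, precision is 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(scores, truth, thresholds):
    """One (precision, recall) point per threshold."""
    return [pr_point(scores, truth, t) for t in thresholds]


# Toy example: 2 instances, 3 classes.
scores = [[0.9, 0.2, 0.6], [0.4, 0.8, 0.1]]
truth = [[1, 0, 0], [1, 1, 0]]
print(pr_curve(scores, truth, [0.0, 0.5, 1.0]))
```

Varying the threshold from high to low trades precision for recall, which traces out the curve.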
[Figure panels: S. cerevisiae-FunCat (hom); A. thaliana-GO (seq); S. cerevisiae-FunCat (expr); A. thaliana-GO (interpro)]
• Clus-HMC-Ens performs better than Clus-HMC (average AUC improvement of 7%)
• Clus-HMC performs better than C4.5H (state-of-the-art system for HMC): at the same recall as C4.5H, average precision improves by 20.9%
• Comparison with SVMs (Barutcuoglu et al.)
– Learn one SVM per class
– Correct hierarchy constraint (HC) violations with a Bayesian model
• Clus-HMC outperforms (or is comparable to) state-of-the-art methods on functional genomics tasks
• Ensembles of Clus-HMC are able to boost performance, if the user is willing to give up on interpretability
• “Revenge of the decision trees”
Conclusions