Machine Learning for Functional Genomics II
Matt Hibbs
http://cbfg.jax.org
Functional Genomics
• Identify the roles played by genes/proteins
Sealfon et al., 2006.
Promise of Computational Functional Genomics
[Cycle diagram: Data & Existing Knowledge → Computational Approaches → Predictions → Laboratory Experiments → back to Data & Existing Knowledge]
Computational Solutions
• Machine learning & data mining
– Use existing data to make new predictions
• Similarity search algorithms
• Bayesian networks
• Support vector machines
• etc.
– Validate predictions with follow-up lab work
• Visualization & exploratory analysis
– Seeing and interacting with data is important
– Show data so that questions can be answered
• Scalability, incorporating statistics, etc.
Bayesian Networks
• Encode dependence relationships between observed and unobserved events
• Example: an unobserved "Raining?" node linked to observable evidence: "Jim brought umbrella," "Cloudy this morning," "Rain in forecast"
Bayesian Network Overview
• Graphical representation of relationships
– Probabilistic information from data to concepts
Bayesian Network Overview
Bayes' Rule: P(A|B) ∝ P(A) P(B|A)
P(FR | CE, AP, Y2H)
P(FR | CE=yes, AP=yes, Y2H=yes)
= α P(FR) P(CE=yes|FR) Σ_PI P(PI|FR) P(AP=yes|PI) P(Y2H=yes|PI)
Normalizing so that P(FR=yes) + P(FR=no) = 0.0105α + 0.0216α = 1
gives P(FR=yes) = 0.327 (up from the prior of 0.10)
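The normalization step above can be checked numerically: the unnormalized values 0.0105 and 0.0216 come from the slide, and α is simply whatever constant makes the two posteriors sum to 1.

```python
# Unnormalized posteriors from the slide:
# P(FR=yes | evidence) = 0.0105 * alpha, P(FR=no | evidence) = 0.0216 * alpha
unnorm_yes = 0.0105
unnorm_no = 0.0216

alpha = 1.0 / (unnorm_yes + unnorm_no)   # alpha makes the posteriors sum to 1
p_fr_yes = unnorm_yes * alpha

print(round(p_fr_yes, 3))  # 0.327, up from the prior of 0.10
```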
Naïve Bayes
• No internal hidden nodes
• Greatly simplifies the problem, reducing computational complexity and time
• Imposes an independence assumption
Naïve Bayes
Bayes' Rule: P(A|B) ∝ P(A) P(B|A)
P(FR | D1, D2, D3, D4)
= α P(FR) P(D1|FR) P(D2|FR) P(D3|FR) P(D4|FR)
Assumes that all measures are independent given FR
Learning Naïve Bayes Nets

Prior over FR:
          FR = yes   FR = no
counts       100        900
prob.        0.1        0.9

CPT for dataset D1:
FR     # D1 = yes   # D1 = no   P(D1=yes)
yes        70           30         0.7
no        300          600         0.33
…
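A minimal sketch of how the counts above become a learned naïve Bayes model, assuming the single-evidence case (only dataset D1 observed); the counts are exactly those in the table.

```python
# Counts from the table: prior over FR and a CPT for dataset D1
counts_fr = {"yes": 100, "no": 900}
counts_d1 = {"yes": {"yes": 70, "no": 30},     # FR=yes row
             "no":  {"yes": 300, "no": 600}}   # FR=no row

total = sum(counts_fr.values())
p_fr = {fr: c / total for fr, c in counts_fr.items()}            # P(FR)
p_d1 = {fr: {d: c / sum(row.values()) for d, c in row.items()}   # P(D1 | FR)
        for fr, row in counts_d1.items()}

# Posterior for an observed D1=yes, via Bayes' rule with normalization
unnorm = {fr: p_fr[fr] * p_d1[fr]["yes"] for fr in p_fr}
alpha = 1.0 / sum(unnorm.values())
posterior = {fr: v * alpha for fr, v in unnorm.items()}
print(round(posterior["yes"], 3))  # → 0.189
```

Seeing D1=yes roughly doubles the belief in a functional relationship relative to the 0.1 prior.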
Steps for Bayesian network integration
• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Run inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network
Gold Standard Construction
• Gene Ontology annotations used to define known functional relationships
– Threshold for positive relationships
– Threshold for negative relationships
Myers et al., 2006
Gold Standard Used For Training
[Figure: global gold standard, showing positive and negative relationships]
Gene-Gene Scores
• Binary data
– PPI, co-localization, synthetic lethality
– Can use binary scores
– Can use profiles to generate scores (dot product)
• Continuous data
– Profile distance metrics
• Binning results
– Converts everything to the discrete case
Distance Metrics
• Choice of distance measure is important for quantifying relationships in datasets
• Pair-wise metrics compare vectors of numbers
– e.g. genes x & y, each with n measurements
Euclidean Distance: d(x, y) = sqrt(Σᵢ (xᵢ − yᵢ)²)
Pearson Correlation: r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (sqrt(Σᵢ (xᵢ − x̄)²) · sqrt(Σᵢ (yᵢ − ȳ)²))
Spearman Correlation: Pearson correlation computed on the ranks of the values
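The three metrics above can be sketched in a few lines of plain Python; the example profiles are invented, and the rank helper ignores ties, which a production implementation would average.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation between two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson on the ranks (ties not averaged here)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# Two perfectly correlated (but not identical) toy profiles
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(euclidean(x, y), pearson(x, y), spearman(x, y))
```

Note how Euclidean distance is nonzero for these profiles while both correlations are 1.0, which is why correlation-style metrics are usually preferred for expression data.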
Sensible Binning
• The commonly used Pearson correlation yields greatly different distributions of correlations across datasets
• These differences complicate comparisons
[Histograms of Pearson correlations between all pairs of genes: DeRisi et al., 97; Primig et al., 00]
Sensible Binning
• Fisher Z-transform plus Z-scoring equalizes the distributions
• Increases comparability between datasets
[Histograms of Z-scores between all pairs of genes]
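The transform just mentioned is easy to sketch: the Fisher z-transform is 0.5·ln((1+r)/(1−r)) (equivalently atanh), and z-scoring then standardizes each dataset's transformed values. The example correlations are invented.

```python
import math

def fisher_z(r):
    """Fisher z-transform of a correlation; equivalent to math.atanh(r)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def zscore(values):
    """Standardize transformed correlations to mean 0, sd 1 within a dataset."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# Hypothetical correlations from one dataset: after transform + z-scoring
# they land on a scale comparable across datasets with different spreads.
rs = [-0.8, -0.2, 0.0, 0.3, 0.9]
zs = zscore([fisher_z(r) for r in rs])
print([round(z, 2) for z in zs])
```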
Pre-calculation and Storage
• Pair-wise distances only need to be calculated once, even if using different binnings
• Typical mouse microarray: ~5-20k genes
• ~16M pair-wise distances
• ~50-700 MB of storage for one dataset
• ~800 datasets in GEO
• ~200 GB for all datasets
Counting & Learning
• Conceptually straightforward
• Counting
– Look at each pair in a dataset, see which bin it falls into, and increment a counter
– But… you need to do this ~16M times per dataset
– "Dumb" parallelization works: each dataset is independent
• Learning CPTs
– Fractions based on counts
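The counting step above can be sketched as a single pass over the pairs; the bin edges, pair scores, and gold-standard labels here are made up for illustration.

```python
from bisect import bisect

bin_edges = [-1.0, 0.0, 1.0]             # hypothetical z-score bin boundaries

def bin_of(score):
    return bisect(bin_edges, score)       # bin index 0..len(bin_edges)

# (gene_a, gene_b) -> dataset score; label True = positive pair in gold standard
scores = {("g1", "g2"): 1.5, ("g1", "g3"): -0.2, ("g2", "g3"): 0.4,
          ("g1", "g4"): -1.3}
labels = {("g1", "g2"): True, ("g1", "g3"): False, ("g2", "g3"): True,
          ("g1", "g4"): False}

n_bins = len(bin_edges) + 1
counts = {True: [0] * n_bins, False: [0] * n_bins}
for pair, s in scores.items():
    counts[labels[pair]][bin_of(s)] += 1  # one counter increment per pair

# Learning the CPT: fraction of each class's pairs falling in each bin
cpt = {fr: [c / sum(row) for c in row] for fr, row in counts.items()}
print(cpt)
```

Because each dataset has its own counters, the loop parallelizes trivially across datasets, exactly as the slide notes.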
Inference
• Also pretty straightforward
– For all pairs of genes…
• For each dataset
– Look up the value from the pre-calculated distances
– Determine the bin and value from the CPT
– Multiply the probability into the product
• Do this for FR=yes and FR=no
• Normalize out α
• Store the result
• ~1.5 GB result file
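The per-pair loop above can be sketched as follows, using log-probabilities so the product over many datasets does not underflow; the CPT numbers and prior are invented.

```python
import math

# Hypothetical CPTs: P(bin | FR) for two datasets, three bins each
cpts = [
    {True: [0.2, 0.3, 0.5], False: [0.5, 0.3, 0.2]},   # dataset 1
    {True: [0.1, 0.4, 0.5], False: [0.6, 0.3, 0.1]},   # dataset 2
]
prior = {True: 0.1, False: 0.9}

def posterior_fr(observed_bins):
    """P(FR=yes | one observed bin per dataset), naive-Bayes style."""
    log_p = {fr: math.log(prior[fr]) for fr in (True, False)}
    for cpt, b in zip(cpts, observed_bins):
        for fr in (True, False):
            log_p[fr] += math.log(cpt[fr][b])    # multiply in log space
    m = max(log_p.values())
    unnorm = {fr: math.exp(v - m) for fr, v in log_p.items()}
    alpha = 1.0 / sum(unnorm.values())           # normalize out alpha
    return unnorm[True] * alpha

print(round(posterior_fr([2, 2]), 3))  # both datasets in their top bin → 0.581
```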
Evaluation Metrics
• TPs, FPs, TNs, FNs
• Agnostic to pairs not appearing in the standard
• ROC curves: Sensitivity-Specificity
• PR curves: Precision-Recall
Precision Recall Curves
• Predictions are ordered by confidence
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
[Plot: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1) over the ordered predictions]
Summary Statistics
• AUC: area under the (ROC) curve
– equivalent to the Mann-Whitney U statistic
• Average precision: the average of the precisions calculated at each true positive
– a quantized version of the area under the precision-recall curve (AUPRC)
• Precision @ n% recall
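Average precision as defined above (precision averaged at each true positive in the ranked list) is compact to implement; the example ranking is invented.

```python
def average_precision(ranked_labels):
    """Average of the precisions computed at each true positive, walking
    down a list of predictions ordered by decreasing confidence."""
    tp = 0
    precisions = []
    for i, is_pos in enumerate(ranked_labels, start=1):
        if is_pos:
            tp += 1
            precisions.append(tp / i)    # precision at this recall point
    return sum(precisions) / len(precisions)

# Hypothetical ranking: True = gold-standard positive
ranked = [True, True, False, True, False, False]
print(round(average_precision(ranked), 3))  # (1 + 1 + 3/4) / 3 → 0.917
```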
Cross Validation
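Cross validation for this pipeline amounts to splitting the gold-standard pairs into folds, training on all but one fold, and evaluating on the held-out fold. A minimal sketch, assuming a simple round-robin fold assignment:

```python
import random

def k_fold_splits(items, k, seed=0):
    """Yield (train, test) splits so every item is held out exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)    # fixed seed for reproducibility
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

# Toy stand-ins for gold-standard gene pairs
pairs = list(range(12))
for train, test in k_fold_splits(pairs, k=4):
    assert len(test) == 3 and len(train) == 9
```

As a later slide notes, purely random folds can be biased; stratifying so each fold balances positives and negatives is the usual remedy.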
Graph Analysis for Predictions
• c_i = confidence that gene g_i has the function
• S = set of genes in the function
• G = set of all genes
• w_i,j = weight of the edge between g_i and g_j
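The exact scoring formula is not legible from the slide; one common choice consistent with the variables listed (c_i, S, G, w_i,j) scores a gene by its edge weight into the function's gene set relative to its total edge weight. The network and the formula's exact form here are assumptions, sketched only to illustrate the idea.

```python
# Hypothetical weighted network: w[(a, b)] = probability of a functional
# relationship between genes a and b (from the Bayesian integration).
w = {("g1", "g2"): 0.9, ("g1", "g3"): 0.8, ("g1", "g4"): 0.1,
     ("g2", "g3"): 0.7}
S = {"g2", "g3"}                 # genes annotated to the function
G = {"g1", "g2", "g3", "g4"}     # all genes

def weight(a, b):
    return w.get((a, b), w.get((b, a), 0.0))

def confidence(gi):
    """c_i: weight into S over total weight out of g_i (assumed form)."""
    into_s = sum(weight(gi, gj) for gj in S if gj != gi)
    total = sum(weight(gi, gj) for gj in G if gj != gi)
    return into_s / total if total else 0.0

print(round(confidence("g1"), 3))  # 1.7 / 1.8 → 0.944
```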
Steps for Our Evaluation
• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Run inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network
Bayesian Network Integration
• Gene expression: gene expression datasets 1, 2, …, N
• Physical interactions: yeast two-hybrid dataset 1, co-precipitation dataset 1
• Other: transcription factor binding sites, localization, curated literature
• Genetic interactions: synthetic lethality dataset, synthetic rescue dataset
Myers et al., 2005; Huttenhower et al., 2006; Guan et al., 2008
Pipeline: data integration via a Bayesian network → user-selected query focuses the search → probabilistic, weighted networks of gene function → results displayed (e.g., new genes predicted to interact with known mitochondrial genes)
Basic Approach Applied Several Times
Myers et al., 2005; 2007
Huttenhower et al., 2007; 2009
Guan et al., 2008
Limitations and Improvements
• Original work was designed for yeast and a general notion of "functionally related"
– Ignores the reality that some genes are related only under certain conditions
– Treats multi-cellular organisms as big single-celled organisms
• Increased specificity can be used to improve results
– 2nd iteration of bioPIXIE included biological processes in its gold standards
– Currently working on a 2nd-generation mouseNET to account for tissues and developmental stages
General mouseNET Approach
Global Gold Standard
[Figure: global gold standard, showing positive and negative relationships]
Specific Gold Standards
• Not all datasets capture all functional relationships
– Process/pathway specific
• Functionally related genes aren't always functionally related
– Tissue specific
– Developmental stage specific
Specific Gold Standard Construction
[Figure: global gold standard vs. specific gold standard, showing positive and negative relationships]
Tissue/Stage Gold Standards
• Based on data from GXD
• Cross-reference Theiler stages with the mammalian anatomy hierarchy
• 729 total intersections
– ranging from 50 to ~3500 genes
– not including post-natal stages
Initial Computational Evaluations
Preliminary Results
• Running 4-fold cross validation using tissue/stage-specific GO-based gold standards
[Plots: training evaluation; test evaluation]
Preliminary Results
• Accounting for developmental stage helps
[Plots: training evaluation; test evaluation]
Preliminary Results
• Many specific tissue/stage combinations are overfitting
[Plots: training evaluation; test evaluation]
Preliminary Results
• Folds were randomly generated and are biased; we need to balance positives and negatives
New Visualization Interface
• Graphle
Simple Things, Long Times
• No single step is too complicated
• Mostly O(G²D)
• ~16M pairs × ~800 datasets × 4 folds
• Evaluating one fold takes ~7 hours
• So far have results for ~200 tissue/stages
– Should take ~3 days on the cluster
– Actually took ~15 days
Bayesian Network Utility
• Bayesian networks are a powerful tool
• Currently improving on the existing MouseNET project by incorporating tissue/stage information
• Preliminary results are promising, but the standards may be too limited
• A multiple-stage process may be useful
– predict tissue/stage-specific expression
– use these predictions in functional gold standards
– use a continuous gold standard?
From Relationships to Phenotypes
• Use the outputs of Bayesian data integration as inputs to a phenotype prediction problem
• For each gene, the vector of relationship probabilities is used as its feature vector
• Use a Support Vector Machine (SVM) to classify genes involved in a phenotype vs. not involved in that phenotype
• Process repeated for hundreds of phenotypes
SVM Methodology
• Every feature vector is thought of as a point in space
• Points nearer to each other tend to belong to the same class
• In our case, we have a ~20k-dimensional space where each point is a gene, and its location is determined by its relationship probabilities
[Matrix: ~20K genes × ~20K probabilities]
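The slides illustrate the separating-hyperplane idea with figures; as a stand-in, here is a toy linear SVM trained by stochastic subgradient descent on the hinge loss (a Pegasos-style update). The 2-D points are invented stand-ins for the ~20k-dimensional gene vectors, not the actual method or data from the talk.

```python
import random

def train_linear_svm(X, y, lam=0.1, epochs=300, seed=0):
    """Toy linear SVM: minimize hinge loss + L2 penalty by subgradient
    descent. Labels y must be +1 or -1."""
    rng = random.Random(seed)
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):   # random pass order
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * (sum(wk * xk for wk, xk in zip(w, X[i])) + b)
            w = [wk * (1 - eta * lam) for wk in w]     # shrink (L2 penalty)
            if margin < 1:                             # inside the margin
                w = [wk + eta * y[i] * xk for wk, xk in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def predict(w, b, x):
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1

# Two linearly separable clusters standing in for involved / not-involved genes
X = [[1.0, 1.2], [1.5, 0.8], [1.2, 1.0],
     [-1.0, -1.1], [-1.3, -0.7], [-0.9, -1.2]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])  # should recover the training labels
```

In practice the talk uses SVMlight/SVMperf (next slide) rather than hand-rolled code; this sketch only shows what "finding a separating hyperplane" means.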
Software for SVM
• SVMlight & SVMperf from Cornell
• http://svmlight.joachims.org/
• Several simple kernels implemented; additional code can be written to use custom kernels
• The "perf" version maximizes different statistics (AUC, precision, etc.)
Phenotype Predictions
• Using the MGI phenotype info as a starting point, predicted genes for ~1150 phenotypes
Phenotype Predictions
• Every gene with at least one annotated allele is considered "involved" in the phenotype
Phenotype Predictions
• Selected phenotypes with >30 and <500 annotations, non-identical
• An SVM was trained for each phenotype
• Classification predictions created for all tested phenotypes
• Can assess prediction performance computationally
Evaluation Metrics
• TPs, FPs, TNs, FNs
• Agnostic to pairs not appearing in the standard
• ROC curves: Sensitivity-Specificity (TPR-FPR)
• PR curves: Precision-Recall
PR-curves vs. ROC Curves
• ROC gives you credit for correctly predicting negatives
• For function/phenotype predictions, we are realistically only concerned with positives
• Further, we care most about high-confidence positives
• PR-curves are better at showing this
Performance Measurements
• On average, 10-fold improvement over random
– Median: ~7.5-fold over random
– Max ~100, min ~1
• Some phenotypes we can predict well; others, not so much
Some Top Phenotypes
• Arrested B cell differentiation
• Abnormal joint morphology
• Abnormal cell cycle checkpoint
• Decreased circulating hormone level
• Abnormal liver development
• …
Some Bottom Phenotypes
• hepatoma
• head bobbing
• disheveled coat
• necrosis
• increased glycogen level
• lethargy
• …
PR vs. ROC
[Curves: ROC AUC = 0.63; the PR curve shows ~45-fold improvement over random]
PR vs. ROC
[Curves: ROC AUC = 0.70; the PR curve shows only ~3-fold improvement over random]
Some Interesting Phenotypes
Laboratory Evaluation
• Computational evaluation is helpful, but not the real goal
• Cheryl has kindly been testing two predictions related to bone phenotypes
– Timp2 and Abcg8
– Timp2-/- female mice have decreased bone density, and possible morphological defects
– Abcg8-/- male mice have increased bone density
Timp2 Preliminary Results
[Image: Timp2-/-, 5 days old]
Results are complementary to Quantitative Genetics
Conclusions & Plans
• Bayes nets and SVMs are powerful tools
• Careful construction of training sets (gold standards) is key
• Computational evaluations need to be appropriate to the problem context
• Laboratory evaluations are critical
• Complementary approaches are good
Acknowledgements
• Hibbs Lab
– Karen Dowell
– Tongjun Gu
– Al Simons
• Olga Troyanskaya Lab
– Patrick Bradley
– Maria Chikina
– Yuanfang Guan
• Chad Myers
• David Hess
• Florian Markowetz
• Edo Airoldi
• Curtis Huttenhower
• Kai Li Lab
– Grant Wallace
• Amy Caudy
• Maitreya Dunham
• Botstein, Kruglyak, Broach, Rose labs
• Kyuson Yun
• Carol Bult