JM - http://folding.chmcc.org 1
Machine Learning for Studies of Genotype-Phenotype Correlations
Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org 2
Outline
Motivating story: correlating inputs and outputs
Learning with a teacher (supervised learning)
Model selection, feature selection and generalization
k-Nearest Neighbors, Least Squares regression, Support Vector Machines and some other machine learning approaches
Genotype-phenotype correlations and predictive fingerprints of phenotypes
Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001
Early results for JRA SNP data (D. Glass et al.)
JM - http://folding.chmcc.org 3
Of statistics and machine learning
t-Test vs. regression or decision trees
Assessment vs. predictive models
[Figure: treatment group mean vs. control group mean; continuous variables vs. discrete (categorical) variables]
JM - http://folding.chmcc.org 4
Choice of the model, problem representation and feature selection: another simple example
[Figure: scatter plots with axes height, weight, estrogen and testosterone; groups F, M, adults and children]
JM - http://folding.chmcc.org 5
Three phases in supervised learning protocols
Training data: examples with class assignment are given
Learning:
i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type;
ii) adaptive parameters in the model need to be optimized to provide correct classification of training examples (e.g. by minimizing the number of misclassified training vectors)
Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding a trade-off between accuracy and generalization is not trivial)
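The three phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-class data are synthetic, the classifier is a plain k-NN with squared Euclidean distance, and the helper names (knn_predict, accuracy, cross_validate) are made up for this example. It shows why training accuracy alone is misleading: 1-NN always classifies its own training points correctly, while the cross-validated estimate is the one that speaks to generalization.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(
        train,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)),
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, test, k):
    """Fraction of test points classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def cross_validate(data, k, folds=5):
    """Average held-out accuracy over `folds` disjoint validation folds."""
    scores = []
    for i in range(folds):
        test = data[i::folds]
        train = [p for j, p in enumerate(data) if j % folds != i]
        scores.append(accuracy(train, test, k))
    return sum(scores) / folds

# Synthetic, overlapping two-class data (an assumption for illustration only).
rng = random.Random(0)
data = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), c)
        for c in (0, 1) for _ in range(50)]
rng.shuffle(data)

train_acc_1nn = accuracy(data, data, k=1)  # each point is its own nearest neighbor
cv_acc_1nn = cross_validate(data, k=1)
cv_acc_9nn = cross_validate(data, k=9)
print(train_acc_1nn, cv_acc_1nn, cv_acc_9nn)
```

The value of k plays the role of model complexity here; comparing the cross-validated accuracies for several k is one simple way to choose it.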
JM - http://folding.chmcc.org 6
Model complexity, training set size and generalization
[Figure: data points fitted with linear, cubic and 7th-degree polynomials]
JM - http://folding.chmcc.org 7
Examples of machine learning algorithms for classification and regression problems
Linear perceptron, Least Squares
LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts, kernel non-linear generalizations)
SVM (Support Vector Machines) (optimal, wide margin linear cuts, kernel non-linear generalizations)
Decision trees (logical rules)
k-NN (k-Nearest Neighbors) (simple non-parametric)
Neural networks (general non-linear models, adaptivity, “artificial brain”)
JM - http://folding.chmcc.org 8
Decision trees provide a piecewise linear solution
[Figure: decision tree cuts separating classes 0 and 1]
JM - http://folding.chmcc.org 9
Support Vector Machines provide a wide margin solution (separating hyperplane)
wx+b=0
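A minimal sketch of how a separating hyperplane wx+b=0 is used once w and b are known: the sign of wx+b gives the class, and |wx+b| / ||w|| gives the distance of a point from the hyperplane, which is the quantity a wide margin solution tries to keep large for the training points. The weights below are hypothetical, not the result of any SVM training, and the function names are made up for this example.

```python
import math

def decision_value(w, b, x):
    """Value of w.x + b for a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Class +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def distance_to_hyperplane(w, b, x):
    """Geometric distance of x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm

# Hypothetical hyperplane in two dimensions (not fitted to data).
w, b = (3.0, 4.0), -5.0

print(classify(w, b, (3.0, 4.0)), distance_to_hyperplane(w, b, (3.0, 4.0)))
print(classify(w, b, (0.0, 0.0)), distance_to_hyperplane(w, b, (0.0, 0.0)))
```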
JM - http://folding.chmcc.org 10
Optimizing adaptable parameters in the model
Find a model y(x;w) that describes the objects of each class as a function of the features and adaptive parameters (weights) w.
Prediction: given x (e.g. LDL=240, age=52, sex=male), assign the class C (e.g. if y(x;w) > 0.5 then C=1, i.e. likely to suffer from a stroke or heart attack in the next 5 years)
[Figure: model output y(x;w) plotted between 0 and 1]
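The decision rule above can be illustrated with a logistic form of y(x;w), which is one common choice and an assumption here, since the slides do not fix the functional form. The weights and the offset are hypothetical numbers picked for the example, not fitted to any clinical data.

```python
import math

def y(x, w, b):
    """Logistic model: maps the weighted sum of features into (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, w, b, threshold=0.5):
    """Assign class C=1 when the model output exceeds the threshold."""
    return 1 if y(x, w, b) > threshold else 0

# Hypothetical weights for (LDL, age, sex), with sex coded as 1 = male.
w = (0.01, 0.04, 0.5)
b = -4.5

patient = (240, 52, 1)      # the example from the slide
low_risk = (100, 30, 0)     # a made-up contrasting example

print(y(patient, w, b), classify(patient, w, b))
print(y(low_risk, w, b), classify(low_risk, w, b))
```

In practice w and b would be obtained by minimizing a training error on labeled examples; only the prediction step is shown here.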
JM - http://folding.chmcc.org 11
Training accuracy vs. generalization
JM - http://folding.chmcc.org 12
Case Study: Sporadic Breast Cancer
Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001
Study based on 200 white women with sporadic primary invasive breast cancer who were treated at Vanderbilt University Medical Center during 1982-96
Patients with sporadic breast cancer were frequency age-matched to control patients at Vanderbilt University Medical Center who had been hospitalized for various acute and chronic illnesses
Analysis focused on the genes: COMT (MIM 116790), 22q11.2; CYP1A1 (MIM 108330), 15q22-qter; CYP1B1 (MIM 601771), 2p21-22; GSTM1 (MIM 138350), 1p13.3; and GSTT1 (MIM 600436), 22q11.2
Case-control study (machine learning to the rescue)
JM - http://folding.chmcc.org 13
Polymorphisms in the genes of interest
Genes involved in oxidative metabolism of estrogens
JM - http://folding.chmcc.org 14
Genotype representation and identification of predictive loci (fingerprints): MDR
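A simplified sketch of the core MDR step, written from the general description of the method: for a chosen set of loci, samples are pooled into multi-locus genotype cells, each cell is labeled high risk when its case/control ratio reaches a threshold (1.0 assumed here), and the resulting rule is scored by its misclassification rate. The function names and the toy data are made up for illustration, and the score below is a training error; the published method selects models by cross-validation.

```python
from collections import Counter
from itertools import combinations, product

def mdr_error(samples, loci, threshold=1.0):
    """Misclassification rate of the high/low-risk rule built on `loci`."""
    cases, controls = Counter(), Counter()
    for genotype, is_case in samples:
        cell = tuple(genotype[i] for i in loci)
        (cases if is_case else controls)[cell] += 1
    errors = 0
    for cell in set(cases) | set(controls):
        high_risk = cases[cell] >= threshold * controls[cell]
        # Controls are misclassified in high-risk cells, cases in low-risk cells.
        errors += controls[cell] if high_risk else cases[cell]
    return errors / len(samples)

def best_combination(samples, n_loci, k):
    """Exhaustive search over all k-locus combinations."""
    return min(combinations(range(n_loci), k),
               key=lambda loci: mdr_error(samples, loci))

# Toy data: 4 loci coded 0/1/2; disease status depends only on an
# interaction between loci 0 and 2 (an assumption for illustration).
samples = [(g, (g[0] == 1) != (g[2] == 1))
           for g in product((0, 1, 2), repeat=4)]

print(best_combination(samples, n_loci=4, k=2))
print(mdr_error(samples, (0, 2)), mdr_error(samples, (0, 1)))
```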
JM - http://folding.chmcc.org 15
Main effects (individual SNPs and chi2 test)
For the simulated data shown before:
        High Risk   Low Risk   total
AA      27          24         51
Aa      36          38         74
aa      21          24         45
total   84          86         170

chi2 = sum over cells of (O-E)^2 / E
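As a worked example, the chi2 statistic for the genotype table above can be computed directly from the observed counts, with expected counts E taken from the row and column totals under independence. The variable names are chosen for this sketch only.

```python
# Observed genotype counts from the table: (high risk, low risk).
observed = {"AA": (27, 24), "Aa": (36, 38), "aa": (21, 24)}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

chi2 = 0.0
for genotype, row in observed.items():
    row_total = sum(row)
    for j, o in enumerate(row):
        # Expected count if genotype and risk class were independent.
        e = row_total * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

dof = (len(observed) - 1) * (len(col_totals) - 1)
print(round(chi2, 3), dof)
```

The statistic comes out at about 0.41 with 2 degrees of freedom, far below the 5% critical value of 5.99, so this single locus shows no significant main effect.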
JM - http://folding.chmcc.org 16
Genotype/haplotype representations
Example genotypes: AABBCC, AaBBCC, AABbCC, aaBBCC, aabbcc, AAbbCC
In general, 3^n genotypes for n biallelic loci.
Vector representation: xyz, with components 0, 1; or x, y, z = 0, 1, 2
In general, highly dimensional representations …
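One possible vector coding, sketched under the assumption that each locus is represented by the number of lowercase alleles it carries (AA = 0, Aa = 1, aa = 2). The slides do not say which coding was used, so the encode function below is hypothetical.

```python
from itertools import product

def encode(genotype):
    """Map e.g. 'AaBBcc' to per-locus counts of the lowercase allele."""
    pairs = [genotype[i:i + 2] for i in range(0, len(genotype), 2)]
    return tuple(sum(ch.islower() for ch in pair) for pair in pairs)

print(encode("AABBCC"))  # (0, 0, 0)
print(encode("AaBBCC"))  # (1, 0, 0)
print(encode("aabbcc"))  # (2, 2, 2)

# All possible genotypes for n biallelic loci: 3**n vectors.
n = 3
all_genotypes = list(product((0, 1, 2), repeat=n))
print(len(all_genotypes))  # 27
```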
JM - http://folding.chmcc.org 17
Multiple loci and more complex fingerprints
JM - http://folding.chmcc.org 18
Cross-validation results
JM - http://folding.chmcc.org 19
The role of gene-gene interactions in multifactorial disease: towards even more complex traits …
CYP1A1, GSTM1, and GSTT1 polymorphisms were examined before in a case-control study of 328 white and 108 African American women, using multiple logistic-regression analysis (Bailey et al. 1998b). None of the enzyme genotypes individually or combined were associated with an increased risk for breast cancer. However, COMT and CYP1B1 were not included in the analysis, because their roles in the catechol-estrogen pathway and/or their various polymorphisms were only recently elucidated.
Here, the influence of each genotype on disease risk appears to be dependent on the genotypes at each of the other loci: gene-gene interactions.
JM - http://folding.chmcc.org 20
Complexity of the model and power calculations: as before adopted from Ritchie et al.
In logistic regression, as each additional main effect is included in the model, the number of possible interaction terms grows exponentially.
On the other hand, simulation studies by Peduzzi et al. (1996) suggest that having fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients.
Hosmer and Lemeshow (2000) suggest that logistic-regression models should contain no more than P < min(n1,n0)/10 parameters, where n1 is the number of events of type 1 and n0 is the number of events of type 0.
For the 200 cases and the 200 controls evaluated in the present study, this formula suggests that no more than 19 parameters should be estimated in a logistic-regression model.
JM - http://folding.chmcc.org 21
Complexity of the model and power calculations: as before adopted from Ritchie et al.
The number of regression terms needed to describe the interactions among a subset, k, of n biallelic loci is (n choose k) × 2^k (Wade 2000).
Thus, for 10 genes, we would need 20 parameters to model the main effects (assuming two dummy variables per biallelic locus), 180 parameters to model the two-way interactions, 960 parameters to model the three-way interactions, 3,360 parameters to model the four-way interactions, and so forth. The MDR method avoids the problems associated with the use of parametric statistics to model high-order interactions.
At the same time, MDR involves sampling (evaluation) of different combinations of loci – exponential scaling anyway …
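The arithmetic behind these statements is easy to check. The sketch below evaluates the (n choose k) × 2^k term count for 10 loci, the min(n1,n0)/10 limit for 200 cases and 200 controls, and the number of k-locus subsets an exhaustive MDR search would have to evaluate. The function names are made up for this example. Note that the formula as stated gives 960 terms for k = 3.

```python
from math import ceil, comb

def max_parameters(n1, n0):
    """Largest integer P satisfying P < min(n1, n0) / 10."""
    return ceil(min(n1, n0) / 10) - 1

def interaction_terms(n, k):
    """Regression terms for k-way interactions among n biallelic loci."""
    return comb(n, k) * 2 ** k

limit = max_parameters(200, 200)
terms = {k: interaction_terms(10, k) for k in range(1, 5)}
subsets = {k: comb(10, k) for k in range(1, 5)}

print(limit)    # 19
print(terms)    # {1: 20, 2: 180, 3: 960, 4: 3360}
print(subsets)  # {1: 10, 2: 45, 3: 120, 4: 210}
```

Already the main effects of 10 loci (20 parameters) exceed the limit of 19, and the number of locus subsets to search keeps growing with k, which is the scaling issue noted for MDR.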
JM - http://folding.chmcc.org 22
Some conclusions from Ritchie et al.
“If MDR is going to be used for genome scans with hundreds to thousands of single-nucleotide polymorphisms, then it will be necessary to develop machine learning strategies to optimize the selection of polymorphisms to be modeled, since an exhaustive search of all possible combinations will not be possible. We are currently exploring the use of parallel genetic algorithms (Cantú-Paz 2000) as a robust machine learning approach.”
Feature selection and aggregation, inferring a classifier (approximator), validating prediction using cross-validation and independent new data, i.e., applying machine learning approaches …
JM - http://folding.chmcc.org 23
Reducing (somewhat) the complexity of the problem: LD, haplotype blocks and tagging SNPs
JM - http://folding.chmcc.org 24
Reducing (somewhat) the complexity of the problem: LD, haplotype blocks and tagging SNPs
Muse and Gibson, 2004
JM - http://folding.chmcc.org 25
Merging bottom-up and top-down approaches
Main effects and interactions (for limited k-tuples): “statistics-based” approach, collaboration with Jack Collins and his group (NCI)
Selection of loci/SNPs (feature selection) based on the initial (limited) statistical analysis: use haplotype-based Tag SNPs
Combining promising features into a complex pattern (predictive fingerprint): machine learning
JM - http://folding.chmcc.org 26
Some early results for JRA (joint work with the Rheumatology and Human Genetics Divisions)
771 SNPs from chromosome 2 and 765 from chromosome 7 (regions around previously implicated loci with high LOD scores for associations with JRA subtypes)
Haplotype blocks identified and representative SNPs derived
Feature selection based on chi2 statistics and other measures
Training and assessment using cross-validation on a set of about 200 data points (in several classes), case-control type of study, multiple machine learning methods applied
No significant correlation of individual SNPs with clinical classes observed
Top 20 SNPs, when combined into a classifier, yield classification accuracy of about 70% for the problem of distinguishing between joint erosion and lack thereof (for affected individuals, baseline 62%)
Much less success for the classification into JRA subtypes, i.e., it appears that SNPs included in the study cannot be used to predict if a person is likely to have a specific (clinically defined) disease subtype (e.g., poly vs. pauci)
JM - http://folding.chmcc.org 27
Haplotype-based tag SNPs on chr2 vs. joint erosion …
JM - http://folding.chmcc.org 28
Next steps …
Use larger data sets with careful selection of informative SNPs using prior knowledge and feature selection algorithms
Use expression profiling to define “molecular” phenotypes as classes and find predictive patterns in SNPs
Validate, validate, validate …