Upload: emily-merritt

Post on 05-Jan-2016

JM - http://folding.chmcc.org 1

Machine Learning for Studies of Genotype-Phenotype Correlations

Jarek Meller

Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC

Outline

Motivating story: correlating inputs and outputs

Learning with a teacher (supervised learning)

Model selection, feature selection and generalization

k-Nearest Neighbors, Least Squares regression, Support Vector Machines and some other machine learning approaches

Genotype-phenotype correlations and predictive fingerprints of phenotypes

Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001

Early results for JRA SNP data (D. Glass et al.)

Of statistics and machine learning

t-Test vs. regression or decision trees

Assessment vs. predictive models

[Figure: treatment group mean vs. control group mean; continuous variables vs. discrete (categorical) variables coded 0/1 or 0/1/2]

Choice of the model, problem representation and feature selection: another simple example

[Figure: example with the features height, weight, estrogen and testosterone; classes F vs. M and adults vs. children]

Three phases in supervised learning protocols

Training data: examples with class assignment are given

Learning: i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type; ii) adaptive parameters in the model need to be optimized to provide correct classification of training examples (e.g. minimizing the number of misclassified training vectors)

Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding the trade-off between accuracy and generalization is not trivial)
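
The three phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the k-NN classifier, the squared Euclidean distance, the 5-fold split and the toy data are all hypothetical choices made for brevity, not the protocol used in the talk.

```python
import random
from collections import Counter

def knn_predict(train, x, k=3):
    """Classify x by majority vote among its k nearest training examples."""
    # Model choice (assumed): squared Euclidean distance on raw attributes.
    nearest = sorted(
        train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(data, n_folds=5, k=3, seed=0):
    """Estimate generalization accuracy with n-fold cross-validation."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    correct = 0
    for i, held_out in enumerate(folds):
        # "Learning" for k-NN just means storing the remaining folds.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        correct += sum(knn_predict(train, x, k) == y for x, y in held_out)
    return correct / len(data)

# Training data: hypothetical examples with class assignment (two clusters).
toy = [((float(i % 5), float(i // 5)), 0) for i in range(25)]
toy += [((10.0 + i % 5, 10.0 + i // 5), 1) for i in range(25)]

cv_accuracy = cross_val_accuracy(toy)
print(f"5-fold cross-validated accuracy: {cv_accuracy:.2f}")
```

Each example is classified only by a model that never saw it, which is why the cross-validated figure is a fairer estimate of generalization than the accuracy on the training set.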

Model complexity, training set size and generalization

[Figure: data points on axes from 0 to 8, fitted with linear, cubic and 7th-degree polynomials]
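
The effect of model complexity on the training error can be reproduced with a small least-squares sketch. The eight data points below are hypothetical (the values from the slide are not available), and the exact rational arithmetic is a convenience for a dependency-free example; in practice a numerical routine would normally be used.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A c = b by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]

def polyfit(xs, ys, degree):
    """Least-squares coefficients (lowest power first) via the normal equations."""
    V = [[Fraction(x) ** p for p in range(degree + 1)] for x in xs]
    AtA = [[sum(row[i] * row[j] for row in V) for j in range(degree + 1)]
           for i in range(degree + 1)]
    Aty = [sum(row[i] * y for row, y in zip(V, ys)) for i in range(degree + 1)]
    return solve(AtA, Aty)

def sse(coeffs, xs, ys):
    """Sum of squared residuals of the fitted polynomial on (xs, ys)."""
    return sum(
        (sum(c * Fraction(x) ** p for p, c in enumerate(coeffs)) - y) ** 2
        for x, y in zip(xs, ys)
    )

# Hypothetical noisy, roughly linear training data (8 points).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1, 1, 3, 2, 4, 6, 5, 7]

train_sse = {d: sse(polyfit(xs, ys, d), xs, ys) for d in (1, 3, 7)}
for d, err in train_sse.items():
    print(f"degree {d}: training SSE = {float(err):.3f}")
```

With eight points, the 7th-degree polynomial passes through every training point, so its training error is zero. That says nothing about how it behaves on new data, which is the point of the slide.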

Examples of machine learning algorithms for classification and regression problems

Linear perceptron, Least Squares

LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts, kernel non-linear generalizations)

SVM (Support Vector Machines) (optimal, wide margin linear cuts, kernel non-linear generalizations)

Decision trees (logical rules)

k-NN (k-Nearest Neighbors) (simple non-parametric)

Neural networks (general non-linear models, adaptivity, “artificial brain”)

Decision trees provide a piecewise linear solution

[Figure: decision-tree partition of a feature space; axis ticks 0 and 1]

Support Vector Machines provide a wide margin solution (separating hyperplane)

wx+b=0
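
As a small illustration of the separating hyperplane, the sketch below classifies points by the sign of w·x + b and computes the width of the margin between the planes w·x + b = +1 and w·x + b = -1, which is 2/||w||. The weight vector and offset are hypothetical values, not the result of training an SVM.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; zero exactly on the separating hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the planes w.x + b = +1 and w.x + b = -1."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 - 3 = 0 (not learned from data).
w, b = (1.0, 1.0), -3.0

print(classify(w, b, (3.0, 2.0)), classify(w, b, (0.0, 1.0)))
print(f"margin width: {margin_width(w):.3f}")
```

Training an SVM amounts to choosing w and b so that this margin is as wide as possible while the training points stay on the correct side.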

Optimizing adaptable parameters in the model

Find a model y(x;w) that describes the objects of each class as a function of the features and adaptive parameters (weights) w.

Prediction: given x (e.g. LDL=240, age=52, sex=male), assign the class C (e.g. if y(x;w) > 0.5 then C=1, i.e. likely to suffer from a stroke or heart attack in the next 5 years)

[Figure: y(x;w) with values between 0 and 1]
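
The prediction rule can be written out as a short sketch. The slide does not say which form y(x;w) takes, so a logistic model is assumed here, and the weights and bias are made-up numbers with no clinical meaning.

```python
import math

def y_model(x, w, bias):
    """Assumed logistic model: y(x;w) = 1 / (1 + exp(-(bias + w.x)))."""
    z = bias + sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + math.exp(-z))

def predict_class(x, w, bias, threshold=0.5):
    """Assign C=1 when y(x;w) exceeds the threshold, otherwise C=0."""
    return 1 if y_model(x, w, bias) > threshold else 0

# Hypothetical weights for (LDL, age, sex) with sex coded male=1, female=0.
w = (0.01, 0.05, 0.5)
bias = -5.0

patient = (240, 52, 1)  # LDL=240, age=52, sex=male, as in the slide's example
print(f"y(x;w) = {y_model(patient, w, bias):.3f}, C = {predict_class(patient, w, bias)}")
```

Optimizing the adaptable parameters means choosing w and the bias so that this rule reproduces the known classes of the training examples as well as possible.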

Training accuracy vs. generalization

Case Study: Sporadic Breast Cancer

Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001

Study based on 200 white women with sporadic primary invasive breast cancer who were treated at Vanderbilt University Medical Center during 1982-96

Patients with sporadic breast cancer were frequency age-matched to control patients at Vanderbilt University Medical Center who had been hospitalized for various acute and chronic illnesses

Analysis focused on the genes: COMT (MIM 116790), 22q11.2; CYP1A1 (MIM 108330), 15q22-qter; CYP1B1 (MIM 601771), 2p21-22; GSTM1 (MIM 138350), 1p13.3; and GSTT1 (MIM 600436), 22q11.2

Case-control study (machine learning to the rescue)

Polymorphisms in the genes of interest

Genes involved in oxidative metabolism of estrogens

Genotype representation and identification of predictive loci (fingerprints): MDR
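
The core idea of MDR can be sketched as follows: for a chosen set of loci, every multilocus genotype cell is pooled into a high-risk or low-risk group according to its case/control ratio, and the resulting two-group rule is scored by its misclassification rate. This is a simplified sketch and not the procedure of Ritchie et al.: cross-validation is omitted, and the toy data, the ratio threshold of 1.0 and all function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations, product

def mdr_error(genotypes, status, loci, threshold=1.0):
    """Misclassification rate after pooling genotype cells into high/low risk."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for g, s in zip(genotypes, status):
        counts[tuple(g[i] for i in loci)][s] += 1
    errors = 0
    for controls, cases in counts.values():
        high_risk = cases >= threshold * controls
        # A high-risk cell misclassifies its controls, a low-risk cell its cases.
        errors += controls if high_risk else cases
    return errors / len(status)

def best_combination(genotypes, status, k):
    """Exhaustively evaluate all k-locus combinations and return the best one."""
    n_loci = len(genotypes[0])
    return min(
        combinations(range(n_loci), k),
        key=lambda loci: mdr_error(genotypes, status, loci),
    )

# Toy data: every genotype over 3 loci once; a subject is a case (1) when
# exactly one of the first two loci is heterozygous (a pure interaction).
toy_genotypes = list(product((0, 1, 2), repeat=3))
toy_status = [int((g[0] == 1) != (g[1] == 1)) for g in toy_genotypes]

best_pair = best_combination(toy_genotypes, toy_status, 2)
print("best pair of loci:", best_pair)
print("single-locus errors:", [round(mdr_error(toy_genotypes, toy_status, (i,)), 3) for i in range(3)])
```

In this toy example no single locus classifies the subjects without error, while the pair of interacting loci does, which is the kind of pattern a main-effects analysis would miss.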

Main effects (individual SNPs and chi2 test)

For the simulated data shown before:

Genotype   High Risk   Low Risk   Total
AA         27          24         51
Aa         36          38         74
aa         21          24         45
Total      84          86         170

$\chi^2 = \sum (O - E)^2 / E$
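
The chi-square statistic for the table above can be checked directly. The sketch computes the expected counts from the row and column totals and sums (O - E)^2 / E over the six cells; for a 3 x 2 table there are 2 degrees of freedom, in which case the p-value has the closed form exp(-chi2 / 2). The function and variable names are illustrative.

```python
import math

# Observed counts per genotype: (high risk, low risk), from the table above.
observed = {"AA": (27, 24), "Aa": (36, 38), "aa": (21, 24)}

def chi2_statistic(table):
    """Pearson chi-square statistic: sum over cells of (O - E)^2 / E."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat

chi2 = chi2_statistic(observed)
# With (3 - 1) * (2 - 1) = 2 degrees of freedom, P(X > x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

The statistic comes out at about 0.41 with a p-value of about 0.82, so this single locus shows no main effect in the tabulated data.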

Genotype/haplotype representations

AABBCC, AaBBCC, AABbCC, aaBBCC, aabbcc, AAbbCC

In general, $3^n$ genotypes for n biallelic loci.

Vector representation: (x, y, z) → {0, 1}; x, y, z = 0, 1, 2

In general, highly dimensional representations …
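
A possible encoding of the genotype strings as vectors is sketched below. The convention of counting lowercase alleles per locus (AA = 0, Aa = 1, aa = 2) is an assumption; the slide only states that each component takes the values 0, 1 or 2.

```python
from itertools import product

def encode_genotype(genotype):
    """Map e.g. 'AaBBCC' to (1, 0, 0) by counting lowercase alleles per locus."""
    pairs = [genotype[i:i + 2] for i in range(0, len(genotype), 2)]
    return tuple(sum(allele.islower() for allele in pair) for pair in pairs)

def all_genotypes(n_loci):
    """Enumerate all 3**n multilocus genotypes for n biallelic loci."""
    return list(product((0, 1, 2), repeat=n_loci))

for g in ("AABBCC", "AaBBCC", "AABbCC", "aaBBCC", "aabbcc", "AAbbCC"):
    print(g, "->", encode_genotype(g))
print("genotypes for 3 loci:", len(all_genotypes(3)))
```

For three loci there are 27 possible vectors; the count triples with every additional locus, which is why the representation quickly becomes high dimensional.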

Multiple loci and more complex fingerprints

Cross-validation results

The role of gene-gene interactions in multifactorial disease: towards even more complex traits …

CYP1A1, GSTM1, and GSTT1 polymorphisms were examined before in a case-control study of 328 white and 108 African American women, using multiple logistic-regression analysis (Bailey et al. 1998b). None of the enzyme genotypes individually or combined were associated with an increased risk for breast cancer. However, COMT and CYP1B1 were not included in the analysis, because their roles in the catechol-estrogen pathway and/or their various polymorphisms were only recently elucidated.

Here, the influence of each genotype on disease risk appears to be dependent on the genotypes at each of the other loci: gene-gene interactions.

Complexity of the model and power calculations: as before, adapted from Ritchie et al.

In logistic regression, as each additional main effect is included in the model, the number of possible interaction terms grows exponentially.

On the other hand, simulation studies by Peduzzi et al. (1996) suggest that having fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients.

Hosmer and Lemeshow (2000) suggest that the number of parameters P in a logistic-regression model should satisfy P < min(n1, n0)/10, where n1 is the number of events of type 1 and n0 is the number of events of type 0.

For the 200 cases and the 200 controls evaluated in the present study, this formula suggests that no more than 19 parameters should be estimated in a logistic-regression model.
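
The arithmetic behind the limit of 19 parameters is a one-liner; the helper name below is hypothetical.

```python
import math

def max_logistic_parameters(n1, n0):
    """Largest integer P that satisfies P < min(n1, n0) / 10."""
    return math.ceil(min(n1, n0) / 10) - 1

# 200 cases and 200 controls: P < 20, so at most 19 parameters.
print(max_logistic_parameters(200, 200))
```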

Complexity of the model and power calculations: as before, adapted from Ritchie et al.

The number of regression terms needed to describe the interactions among a subset, k, of n biallelic loci is $\binom{n}{k} \times 2^k$ (Wade 2000).

Thus, for 10 genes, we would need 20 parameters to model the main effects (assuming two dummy variables per biallelic locus), 180 parameters to model the two-way interactions, 960 parameters to model the three-way interactions, 3,360 parameters to model the four-way interactions, and so forth. The MDR method avoids the problems associated with the use of parametric statistics to model high-order interactions.

At the same time, MDR involves sampling (evaluation) of different combinations of loci – exponential scaling anyway …
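
Both counts on this slide are easy to tabulate: the number of regression terms for k-way interactions, (n choose k) × 2^k, and the number of k-locus combinations that an exhaustive search such as MDR has to evaluate, (n choose k). The function names and the choice of 1,000 SNPs as a second example are illustrative.

```python
from math import comb

def interaction_terms(n_loci, k):
    """Regression terms for k-way interactions: (n choose k) * 2**k."""
    return comb(n_loci, k) * 2 ** k

def locus_combinations(n_loci, k):
    """Number of k-locus subsets an exhaustive search must evaluate."""
    return comb(n_loci, k)

terms_for_10_genes = {k: interaction_terms(10, k) for k in range(1, 5)}
print(terms_for_10_genes)  # {1: 20, 2: 180, 3: 960, 4: 3360}

pairs_among_1000_snps = locus_combinations(1000, 2)
print(pairs_among_1000_snps)  # 499500
```

For 10 genes the formula gives 20, 180, 960 and 3,360 terms for k = 1 to 4, already far beyond the 19 parameters that 200 cases and 200 controls can support.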

Some conclusions from Ritchie et al.

“If MDR is going to be used for genome scans with hundreds to thousands of single-nucleotide polymorphisms, then it will be necessary to develop machine learning strategies to optimize the selection of polymorphisms to be modeled, since an exhaustive search of all possible combinations will not be possible. We are currently exploring the use of parallel genetic algorithms (Cantú-Paz 2000) as a robust machine learning approach.”

Feature selection and aggregation, inferring a classifier (approximator), validating prediction using cross-validation and independent new data, i.e., applying machine learning approaches …

Reducing (somewhat) the complexity of the problem: LD, haplotype blocks and tagging SNPs

Reducing (somewhat) the complexity of the problem: LD, haplotype blocks and tagging SNPs

Muse and Gibson, 2004

Merging bottom-up and top-down approaches

Main effects and interactions (for limited k-tuples): “statistics-based” approach, collaboration with Jack Collins and his group (NCI)

Selection of loci/SNPs (feature selection) based on the initial (limited) statistical analysis: use haplotype-based Tag SNPs

Combining promising features into a complex pattern (predictive fingerprint): machine learning

Some early results for JRA (joint work with the Rheumatology and Human Genetics Divisions)

771 SNPs from chromosome 2 and 765 from chromosome 7 (regions around previously implicated loci with high LOD scores for associations with JRA subtypes)

Haplotype blocks identified and representative SNPs derived

Feature selection based on chi2 statistics and other measures

Training and assessment using cross-validation on a set of about 200 data points (in several classes), case-control type of study, multiple machine learning methods applied

No significant correlation of individual SNPs with clinical classes observed

Top 20 SNPs, when combined into a classifier, yield classification accuracy of about 70% for the problem of distinguishing between joint erosion and lack thereof (for affected individuals, baseline 62%)

Much less success for the classification into JRA subtypes, i.e., it appears that SNPs included in the study cannot be used to predict if a person is likely to have a specific (clinically defined) disease subtype (e.g., poly vs. pauci)
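
The feature-selection step and the comparison against a baseline can be sketched as follows. This is not the pipeline used for the JRA data: the toy genotype columns and labels are invented, and the baseline is assumed to be the accuracy of always predicting the majority class.

```python
def chi2_stat(column, labels):
    """Chi-square statistic of one SNP column (values 0/1/2) against class labels."""
    n = len(labels)
    stat = 0.0
    for v in sorted(set(column)):
        for c in sorted(set(labels)):
            observed = sum(1 for x, y in zip(column, labels) if x == v and y == c)
            expected = column.count(v) * labels.count(c) / n
            stat += (observed - expected) ** 2 / expected
    return stat

def top_k_snps(snp_columns, labels, k):
    """Indices of the k SNP columns with the largest chi-square statistic."""
    ranked = sorted(
        range(len(snp_columns)),
        key=lambda j: chi2_stat(snp_columns[j], labels),
        reverse=True,
    )
    return ranked[:k]

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class (assumed baseline)."""
    return max(labels.count(c) for c in set(labels)) / len(labels)

# Hypothetical toy data: 8 individuals, 2 SNPs; class 1 = erosion, 0 = none.
labels = [1, 1, 1, 1, 1, 0, 0, 0]
snp_columns = [
    [2, 2, 2, 2, 2, 0, 0, 0],  # strongly associated with the class
    [0, 1, 2, 0, 1, 2, 0, 1],  # essentially unrelated to the class
]

print("selected SNP indices:", top_k_snps(snp_columns, labels, 1))
print("majority-class baseline:", majority_baseline(labels))
```

When selection like this is combined with cross-validation, the ranking should be recomputed inside each training fold; selecting features on the full data set first makes the cross-validated accuracy look better than it is.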

Haplotype-based tag SNPs on chr2 vs. joint erosion …

Next steps …

Use larger data sets with careful selection of informative SNPs using prior knowledge and feature selection algorithms

Use expression profiling to define “molecular” phenotypes that serve as classes, and find predictive patterns in SNPs

Validate, validate, validate …