
JM - http://folding.chmcc.org 1

Machine Learning for Studies of Genotype-Phenotype Correlations

Jarek Meller

Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC


Outline

Motivating story: correlating inputs and outputs

Learning with a teacher (supervised learning)

Model selection, feature selection and generalization

k-Nearest Neighbors, Least Squares regression, Support Vector Machines and some other machine learning approaches

Genotype-phenotype correlations and predictive fingerprints of phenotypes

Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001

Early results for JRA SNP data (D. Glass et al.)


Of statistics and machine learning

t-Test vs. regression or decision trees

Assessment vs. predictive models

[Figure: treatment group mean vs. control group mean; continuous variables vs. discrete (categorical) variables coded as 0, 1 or 0, 1, 2]


Choice of the model, problem representation and feature selection: another simple example

[Figure: example plots with axis labels height, weight, estrogen and testosterone, and class labels F, M, adults and children]


Three phases in supervised learning protocols

Training data: examples with class assignment are given

Learning:

i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type;

ii) adaptive parameters in the model need to be optimized to provide correct classification of training examples (e.g. by minimizing the number of misclassified training vectors)

Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding the trade-off between accuracy and generalization is not trivial)
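The three phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-dimensional toy data, the midpoint-threshold classifier and the helper names (`k_fold_indices`, `train_threshold`, `cross_validate`) are hypothetical and not taken from the talk.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, nearly equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_threshold(xs, ys):
    """Learning: place the threshold midway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0


def accuracy(threshold, xs, ys):
    """Fraction of examples classified correctly (class 1 lies above the threshold)."""
    hits = sum((1 if x > threshold else 0) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def cross_validate(xs, ys, k):
    """Validation: train on k-1 folds, test on the held-out fold, average."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        test_x = [xs[i] for i in fold]
        test_y = [ys[i] for i in fold]
        scores.append(accuracy(train_threshold(train_x, train_y), test_x, test_y))
    return sum(scores) / len(scores)


# Hypothetical training data: one feature, two classes.
xs = [1.0, 4.2, 1.5, 3.9, 2.0, 4.5, 2.2, 3.6, 2.9, 3.1]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
cv_accuracy = cross_validate(xs, ys, k=5)
print(cv_accuracy)
```

The held-out folds never influence the threshold, which is what makes the averaged score an estimate of generalization rather than of training accuracy.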


Model complexity, training set size and generalization

[Figure: data points with linear, cubic and 7th-degree polynomial fits; both axes run from 0 to 8]
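The effect of model complexity can be reproduced with a small sketch. It assumes hypothetical data (a straight line plus alternating noise) and compares only the two extremes: a least-squares line and the degree-7 polynomial that passes through all eight training points. The cubic fit is omitted to keep the example dependency-free.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def lagrange_eval(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through all n points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


def mse(predicted, actual):
    """Mean squared error."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical data: y = x with alternating noise of +/- 0.5 at x = 1..8.
train_x = [float(i) for i in range(1, 9)]
train_y = [x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]
test_x = [x + 0.5 for x in train_x[:-1]]  # midpoints 1.5 .. 7.5
test_y = list(test_x)                     # noise-free targets on the line y = x

slope, intercept = fit_line(train_x, train_y)
line_train = mse([slope * x + intercept for x in train_x], train_y)
line_test = mse([slope * x + intercept for x in test_x], test_y)
poly_train = mse([lagrange_eval(train_x, train_y, x) for x in train_x], train_y)
poly_test = mse([lagrange_eval(train_x, train_y, x) for x in test_x], test_y)

print("line:     train", line_train, "test", line_test)
print("degree 7: train", poly_train, "test", poly_test)
```

The degree-7 polynomial reproduces the training points exactly but oscillates between them, so its error on the held-out midpoints is far larger than that of the line.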


Examples of machine learning algorithms for classification and regression problems

Linear perceptron, Least Squares, LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts, kernel non-linear generalizations)

SVM (Support Vector Machines) (optimal, wide margin linear cuts, kernel non-linear generalizations)

Decision trees (logical rules)

k-NN (k-Nearest Neighbors) (simple non-parametric)

Neural networks (general non-linear models, adaptivity, “artificial brain”)
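As an illustration of the simplest entry in this list, here is a minimal k-NN sketch. The Euclidean distance, the value k=3 and the toy points are assumptions chosen for the example, not choices made in the talk.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(
        (math.dist(point, query), label) for point, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with two classes.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, (1.1, 1.0)))
print(knn_predict(train_x, train_y, (3.0, 3.0)))
```

There is no training step: the model is the stored data, and the only adaptive choices are k and the distance measure.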


Decision trees provide a piecewise linear solution

[Figure: decision-tree decision boundary; class labels 0 and 1]


Support Vector Machines provide a wide margin solution (separating hyperplane)

w·x + b = 0
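A small sketch of how a separating hyperplane is used once w and b are known. The weights below are hypothetical and were not obtained by training; the margin formula 2/||w|| assumes the usual canonical scaling in which |w·x + b| = 1 at the support vectors.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane in two dimensions: x1 + x2 - 3 = 0.
w, b = (1.0, 1.0), -3.0
print(classify(w, b, (2.0, 2.0)), classify(w, b, (1.0, 1.0)), margin_width(w))
```

Training an SVM amounts to choosing w and b so that this margin is as wide as possible while the training points stay on the correct side.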


Optimizing adaptable parameters in the model

Find a model y(x;w) that describes the objects of each class as a function of the features and adaptive parameters (weights) w.

Prediction: given x (e.g. LDL=240, age=52, sex=male), assign the class C (e.g. if y(x;w) > 0.5 then C=1, i.e. likely to suffer a stroke or heart attack in the next 5 years).

[Figure: y(x;w) plotted on a scale from 0 to 1]
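One common choice for y(x;w) is a logistic function of a weighted sum of the features. The sketch below assumes that form; the weights are hypothetical placeholders with no clinical meaning, and the encoding sex=male as 1 is likewise an assumption.

```python
import math


def y(x, w):
    """Logistic model: w[0] is the bias, w[1:] are the feature weights."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, w, threshold=0.5):
    """Assign C=1 when the model output exceeds the threshold, else C=0."""
    return 1 if y(x, w) > threshold else 0


# Hypothetical weights (bias, LDL, age, sex); illustrative only.
w = (-8.0, 0.02, 0.06, 0.5)
x = (240.0, 52.0, 1.0)  # LDL=240, age=52, sex=male encoded as 1

print(y(x, w), predict_class(x, w))
```

Optimizing the adaptable parameters means adjusting w until the predicted classes agree with the known classes of the training examples as often as possible.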


Training accuracy vs. generalization


Case Study: Sporadic Breast Cancer

Ritchie et al., Multifactor-Dimensionality Reduction Reveals High-Order Interactions among Estrogen-Metabolism Genes in Sporadic Breast Cancer, Am. J. Hum. Genet., 69:138-147, 2001

Study based on 200 white women with sporadic primary invasive breast cancer who were treated at Vanderbilt University Medical Center during 1982-96

Patients with sporadic breast cancer were frequency age-matched to control patients at Vanderbilt University Medical Center who had been hospitalized for various acute and chronic illnesses

Analysis focused on the genes: COMT (MIM 116790), 22q11.2; CYP1A1 (MIM 108330), 15q22-qter; CYP1B1 (MIM 601771), 2p21-22; GSTM1 (MIM 138350), 1p13.3; and GSTT1 (MIM 600436), 22q11.2

Case-control study (machine learning to the rescue)


Polymorphisms in the genes of interest

Genes involved in oxidative metabolism of estrogens


Genotype representation and identification of predictive loci (fingerprints): MDR


Main effects (individual SNPs and chi2 test)

For the simulated data shown before:

Genotype   High Risk   Low Risk   total
AA         27          24         51
Aa         36          38         74
aa         21          24         45
total      84          86         170

chi2 = Σ (O − E)² / E
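The statistic for the table above can be computed directly. In this sketch the expected count E of each cell is taken as row total × column total / grand total, which is the standard independence assumption; the 5% critical value for 2 degrees of freedom (5.991) is a textbook constant.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows AA, Aa, aa; columns High Risk, Low Risk (counts from the table above).
observed = [[27, 24], [36, 38], [21, 24]]
chi2 = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_5PCT_DF2 = 5.991  # chi-square critical value, alpha = 0.05, 2 d.o.f.

print(chi2, degrees_of_freedom, chi2 > CRITICAL_5PCT_DF2)
```

The statistic stays well below the critical value, so this single locus shows no significant main effect on its own.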


Genotype/haplotype representations

AABBCC, AaBBCC, AABbCC, aaBBCC, aabbcc, AAbbCC

In general, 3^n genotypes for n biallelic loci.

xyz

0, 1 ; x, y, z = 0, 1, 2

Vector representation:

In general, highly dimensional representations …
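A possible encoding of such genotypes as vectors, sketched under one assumption: each locus is written as two letters and is mapped to the number of lowercase alleles it contains (0, 1 or 2). The slide does not state which allele is counted, so this convention is illustrative.

```python
from itertools import product


def encode_genotype(genotype):
    """Map e.g. 'AaBBCC' to (1, 0, 0): lowercase-allele count per locus."""
    if len(genotype) % 2 != 0:
        raise ValueError("expected two letters per locus")
    return tuple(
        sum(ch.islower() for ch in genotype[i:i + 2])
        for i in range(0, len(genotype), 2)
    )


def all_genotypes(n_loci):
    """Enumerate every multilocus genotype vector; there are 3**n_loci of them."""
    return list(product((0, 1, 2), repeat=n_loci))


for g in ["AABBCC", "AaBBCC", "AABbCC", "aaBBCC", "aabbcc", "AAbbCC"]:
    print(g, encode_genotype(g))
print(len(all_genotypes(3)))
```

With three loci there are already 27 possible vectors, and the count triples with every additional locus.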


Multiple loci and more complex fingerprints


Cross-validation results


The role of gene-gene interactions in multifactorial disease: towards even more complex traits …

CYP1A1, GSTM1, and GSTT1 polymorphisms were examined before in a case-control study of 328 white and 108 African American women, using multiple logistic-regression analysis (Bailey et al. 1998b). None of the enzyme genotypes individually or combined were associated with an increased risk for breast cancer. However, COMT and CYP1B1 were not included in the analysis, because their roles in the catechol-estrogen pathway and/or their various polymorphisms were only recently elucidated.

Here, the influence of each genotype on disease risk appears to be dependent on the genotypes at each of the other loci: gene-gene interactions.


Complexity of the model and power calculations (as before, adapted from Ritchie et al.)

In logistic regression, as each additional main effect is included in the model, the number of possible interaction terms grows exponentially.

On the other hand, simulation studies by Peduzzi et al. (1996) suggest that having fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients.

Hosmer and Lemeshow (2000) suggest that the number of parameters P in a logistic-regression model should satisfy P < min(n1,n0)/10, where n1 is the number of events of type 1 and n0 is the number of events of type 0.

For the 200 cases and the 200 controls evaluated in the present study, this formula suggests that no more than 19 parameters should be estimated in a logistic-regression model.
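The arithmetic behind this limit is short enough to check directly. The function name `max_parameters` is hypothetical; it returns the largest integer P satisfying P < min(n1, n0)/10.

```python
def max_parameters(n1, n0):
    """Largest integer P with P < min(n1, n0) / 10 (integer arithmetic only)."""
    return (min(n1, n0) - 1) // 10


print(max_parameters(200, 200))  # 200 cases and 200 controls
```

For 200 cases and 200 controls the bound is min(200, 200)/10 = 20, and the largest integer strictly below 20 is 19.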


Complexity of the model and power calculations (as before, adapted from Ritchie et al.)

The number of regression terms needed to describe the interactions among a subset, k, of n biallelic loci is (n choose k) × 2^k (Wade 2000).

Thus, for 10 genes, we would need 20 parameters to model the main effects (assuming two dummy variables per biallelic locus), 180 parameters to model the two-way interactions, 960 parameters to model the three-way interactions, 3,360 parameters to model the four-way interactions, and so forth. The MDR method avoids the problems associated with the use of parametric statistics to model high-order interactions.
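The parameter counts follow directly from the formula (n choose k) × 2^k. A quick check for n = 10 (the function name is hypothetical):

```python
from math import comb


def interaction_terms(n, k):
    """Regression terms for k-way interactions among n biallelic loci."""
    return comb(n, k) * 2 ** k


for k in range(1, 5):
    print(k, interaction_terms(10, k))
```

For n = 10 the formula evaluates to 20, 180, 960 and 3,360 terms for k = 1, 2, 3 and 4, far more than the roughly 19 parameters that 200 cases and 200 controls can support.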

At the same time, MDR involves sampling (evaluation) of different combinations of loci – exponential scaling anyway …
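The core MDR step can be sketched as follows, based on the description in Ritchie et al.: for a chosen set of loci, each multilocus genotype cell is labelled high risk when its case:control ratio meets a threshold (1.0 here), and the resulting two-class rule is scored by its misclassification rate. The toy data, the function names and the omission of cross-validation are simplifications made for this example.

```python
from collections import defaultdict
from itertools import combinations


def mdr_error(genotypes, status, loci, threshold=1.0):
    """Misclassification rate of the high-risk/low-risk rule for given loci."""
    cases, controls = defaultdict(int), defaultdict(int)
    for genotype, is_case in zip(genotypes, status):
        cell = tuple(genotype[i] for i in loci)
        if is_case:
            cases[cell] += 1
        else:
            controls[cell] += 1
    errors = 0
    for cell in set(cases) | set(controls):
        high_risk = cases[cell] >= threshold * controls[cell]
        # High-risk cells predict "case", so their controls are misclassified.
        errors += controls[cell] if high_risk else cases[cell]
    return errors / len(status)


def best_model(genotypes, status, n_loci, k):
    """Exhaustively evaluate all k-locus combinations; return the best one."""
    return min(
        combinations(range(n_loci), k),
        key=lambda loci: mdr_error(genotypes, status, loci),
    )


# Hypothetical data: disease status depends on loci 0 and 1 jointly
# (an XOR-like interaction); locus 2 is uninformative.
genotypes = [(a, b, c) for a in (0, 2) for b in (0, 2) for c in (0, 1, 2)]
status = [1 if (a == 0) != (b == 0) else 0 for a, b, c in genotypes]

print(mdr_error(genotypes, status, (0,)))
print(mdr_error(genotypes, status, (0, 1)))
print(best_model(genotypes, status, n_loci=3, k=2))
```

Neither locus is informative alone, yet the pair classifies the toy data perfectly. `best_model` visits every one of the (n choose k) combinations, which is the scaling problem noted above.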


Some conclusions from Ritchie et al.

“If MDR is going to be used for genome scans with hundreds to thousands of single-nucleotide polymorphisms, then it will be necessary to develop machine learning strategies to optimize the selection of polymorphisms to be modeled, since an exhaustive search of all possible combinations will not be possible. We are currently exploring the use of parallel genetic algorithms (Cantú-Paz 2000) as a robust machine learning approach.”

Feature selection and aggregation, inferring a classifier (approximator), validating prediction using cross-validation and independent new data, i.e., applying machine learning approaches …


Reducing (somewhat) the complexity of the problem: LD, haplotype blocks and tagging SNPs


Reducing (somewhat) the complexity of the problem: LD, haplotype blocks and tagging SNPs

Muse and Gibson, 2004


Merging bottom-up and top-down approaches

Main effects and interactions (for limited k-tuples): “statistics-based” approach, collaboration with Jack Collins and his group (NCI)

Selection of loci/SNPs (feature selection) based on the initial (limited) statistical analysis: use haplotype-based Tag SNPs

Combining promising features into a complex pattern (predictive fingerprint): machine learning


Some early results for JRA (joint work with the Rheumatology and Human Genetics Divisions)

771 SNPs from chromosome 2 and 765 from chromosome 7 (regions around previously implicated loci with high LOD scores for association with JRA subtypes)

Haplotype blocks identified and representative SNPs derived

Feature selection based on chi2 statistics and other measures

Training and assessment using cross-validation on a set of about 200 data points (in several classes), case-control type of study, multiple machine learning methods applied

No significant correlation of individual SNPs with clinical classes observed

Top 20 SNPs, when combined into a classifier, yield classification accuracy of about 70% for the problem of distinguishing between joint erosion and lack thereof (for affected individuals, baseline 62%)

Much less success for the classification into JRA subtypes, i.e., it appears that SNPs included in the study cannot be used to predict if a person is likely to have a specific (clinically defined) disease subtype (e.g., poly vs. pauci)


Haplotype-based tag SNPs on chr2 vs. joint erosion …


Next steps …

Use larger data sets with careful selection of informative SNPs using prior knowledge and feature selection algorithms

Use expression profiling to define “molecular” phenotypes to define classes and find predictive patterns in SNPs

Validate, validate, validate …
