statistics for microarray data analysis with r session 8: discrimination class web site: darlene

45
Statistics for Microarray Data Analysis with R Session 8: Discrimination ss web site: http://ludwig-sun2.unil.ch/~darlene/

Upload: franklin-hart

Post on 12-Jan-2016

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Statistics for Microarray Data Analysis with R

Session 8: Discrimination

Class web site: http://ludwig-sun2.unil.ch/~darlene/

Page 2: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Today’s Outline

• Discrimination (classification)

• Some classification rules

• Performance assessment

• Discrimination (classification) in R

• Exercises

Page 3: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

cDNA gene expression data

Data on G genes for n samples

Genes

mRNA samples

Gene expression level of gene i in mRNA sample j

= (normalized) Log( Red intensity / Green intensity)

sample1 sample2 sample3 sample4 sample5 …

1 0.46 0.30 0.80 1.51 0.90 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...

Page 4: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Classification

• Task: assign objects to classes (groups) on the basis of measurements made on the objects

• Unsupervised: classes unknown, want to discover them from the data (cluster analysis)

• Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Page 5: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Discrimination

• Objects (e.g. arrays) are to be classified as belonging to one of a number of predefined classes {1, 2, …, K}

• Each object associated with a class label (or response) Y {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)

• Aim: predict Y from X

Page 6: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Example: Tumor Classification• Reliable and precise classification essential for

successful cancer treatment

• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables

• Uncertainties in diagnosis remain; likely that existing classes are heterogeneous

• Characterize molecular variations among tumors by monitoring gene expression (microarray)

• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)

Page 7: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Tumor Classification Using Gene Expression Data

Three main types of statistical problems associated with tumor classification:

• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering)

• Classification of malignancies into known classes (supervised learning – discrimination)

• Identification of “marker” genes that characterize the different tumor classes (feature or variable selection)

Page 8: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Classifiers• A predictor or classifier partitions (divides) the

space of gene expression profiles into K disjoint subsets, A1, ..., AK, such that for a sample with expression profile X=(X1, ...,XG) in Ak the predicted class is k

• Classifiers are built from a learning set (LS) L = (X1, Y1), ..., (Xn,Yn)

• Classifier C built from a learning set L: C( . ,L): X {1,2, ... ,K}

• Predicted class for observation X:C(X,L) = k if X is in Ak

Page 9: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Maximum likelihood discriminant rule

• A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest

• For example, if I toss a coin 10 times and I see 7 heads, the MLE for p = probability of heads is .7

• If the histograms for the individual classes are known, the maximum likelihood (ML) discriminant rule predicts the class of an observation X to be the one with the highest bar on the histogram (density scale)

Page 10: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Fisher Linear Discriminant Analysis

First applied in 1935 by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA):

1. finds linear combinations of the gene expression profiles X=X1,...,XG with large ratios of between-groups to within-groups sums of squares - discriminant variables;

2. predicts the class of an observation X by the class whose mean vector is closest to X in terms of the discriminant variables

Page 11: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Gaussian ML Discriminant Rules

• For multivariate Gaussian (normal) class densities X|Y= k ~ N(k,k), the ML classifier is

C(X) = argmink {(X - k) k-1

(X - k)’ + log| k |}

• In general, this is a quadratic rule (Quadratic discriminant analysis, or QDA)

• In practice, population mean vectors k and covariance matrices k are estimated by corresponding sample quantities

Page 12: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Gaussian ML Discriminant Rules

• When all class densities have the same covariance matrix, k = the discriminant rule is linear (Linear discriminant analysis, or LDA; FLDA for k = 2)

• When all class densities have the same diagonal covariance matrix =diag(1

2… G2),

the discriminant rule is again linear (Diagonal linear discriminant analysis, or DLDA)

Page 13: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Nearest Neighbor Classification

• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation)

• k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows:– find the k observations in the learning set closest to

X– predict the class of X by majority vote, i.e., choose

the class that is most common among those k observations.

• The number of neighbors k can be chosen by cross-validation (more on this later)

Page 14: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Classification Trees

• Partition the feature space into a set of rectangles, then fit a simple model in each one

• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets (starting with X itself)

• Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier

Page 15: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Classification Tree

Page 16: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Three Aspects of Tree Construction

• Split Selection Rule

• Split-stopping Rule

• Class assignment Rule

Different approaches to these three issues (e.g. CART: Classification And Regression Trees, Breiman et al. (1984); C4.5 and C5.0, Quinlan (1993))

Page 17: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Three Rules (CART)

• Splitting: At each node, choose split maximizing decrease in impurity (e.g. misclassification error)

• Split-stopping: Grow large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with lowest misclassification rate

• Class assignment: For each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node

Page 18: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene
Page 19: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene
Page 20: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene
Page 21: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene
Page 22: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Other Classifiers Include…

• Support vector machines (SVMs)

• Neural networks

• Bayesian regression methods

Page 23: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Features

• Feature selection– Automatic with trees– For DA, NN need preliminary selection– Need to account for selection when

assessing performance

• Missing data– Automatic imputation with trees– Otherwise, impute (or ignore)

Page 24: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Performance assessment (I)

• Resubstitution estimation: error rate on the learning set– Problem: downward bias

• Test set estimation: divide cases in learning set into two sets, L1 and L2; classifier built using L1, error rate computed for L2. L1 and L2 must be iid.

– Problem: reduced effective sample size

Page 25: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Performance assessment (II)• V-fold cross-validation (CV) estimation:

Cases in learning set randomly divided into V subsets of (nearly) equal size (for example, 5 or 10 subsets). Build classifiers leaving one set out; test set error rates computed on left out set and averaged to get overall error– Bias-variance tradeoff: smaller V can

give larger bias but smaller variance

• Out-of-bag estimation: covered below

• ROC curves: not covered at all, I’m still learning about this!

Page 26: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Cross-validation

learning learning test

learning test learning

test learning learning

error

error

error

average error

Page 27: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Performance assessment (III)

• Common to do feature selection using all of the data, then CV only for model building and classification

• However, usually features are unknown and the intended inference includes feature selection. Then, CV estimates as above tend to be downward biased

• Features should be selected only from the learning set used to build the model (and not the entire set)

Page 28: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

How NOT to estimate error

• ** DON’T DO THIS ** DON’T DO THIS **• Use the whole data set to choose which

variables (features) to use in the classifier

• Divide the data into (10, say) subsets for CV

• Leave out a subset and build a classifier with features chosen from the whole data set

• Use the classifier to predict the left out subset

• Average over left out subsets to estimate error

• ** DON’T DO THIS ** DON’T DO THIS **

Page 29: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Aggregating classifiers

• Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set

• Many possibilities for perturbing

• The multiple versions of the predictor are aggregated by (weighted) voting

Page 30: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Bagging

• Bagging = Bootstrap aggregating

• Nonparametric Bootstrap (standard bagging): perturbed learning sets drawn at random with replacement from the learning sets; predictors built for each perturbed dataset and aggregated by plurality voting (wb = 1)

• Parametric Bootstrap: perturbed learning sets are some known distribution (e.g. multivariate Gaussian)

• Convex pseudo-data (Breiman 1996)

Page 31: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Aggregation By-products: Out-of-bag estimation of error

rate

• Out-of-bag error rate estimate: unbiased

• Expect about 1/3 of cases to be left out of each bootstrap sample

• Use these left out cases from each bootstrap sample as a test set

• Classify these test set cases, and compare to the true class labels to get the out-of-bag estimate of the error rate

Page 32: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Boosting

• Freund and Schapire (1997), Breiman (1998)

• Data resampled adaptively so that the weights in the resampling are increased for those cases most often misclassified

• Predictor aggregation done by weighted voting

Page 33: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Comparison of classifiers

• Dudoit, Fridlyand, Speed (JASA, 2002)

• FLDA

• DLDA

• DQDA

• NN

• CART

• Bagging and boosting

Page 34: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Comparison study datasets

• Leukemia – Golub et al. (1999)n = 72 samples, G = 3,571 genes3 classes (B-cell ALL, T-cell ALL, AML)

• Lymphoma – Alizadeh et al. (2000)n = 81 samples, G = 4,682 genes3 classes (B-CLL, FL, DLBCL)

• NCI 60 – Ross et al. (2000)N = 64 samples, p = 5,244 genes8 classes

Page 35: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Leukemia data, 2 classes: Test set error rates;150 LS/TS runs

Page 36: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Leukemia data, 3 classes: Test set error rates;150 LS/TS runs

Page 37: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Lymphoma data, 3 classes: Test set error rates; N=150 LS/TS runs

Page 38: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

NCI 60 data :Test set error rates;150 LS/TS runs

Page 39: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Results

• In the main comparison, NN and DLDA had the smallest error rates, FLDA had the highest

• Aggregation improved the performance of CART classifiers, the largest gains being with boosting and bagging with convex pseudo-data

• Dettling and Bühlmann, improved the performance of boosting (LogitBoost)

Page 40: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Results, cont

• For the lymphoma and leukemia datasets, increasing the number of genes to G=200 didn't greatly affect the performance of the various classifiers; there was an improvement for the NCI 60 dataset.

• More careful selection of a small number of genes (10) improved the performance of FLDA dramatically

Page 41: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Comparison study – Discussion (I)

• “Diagonal” LDA: ignoring correlation between genes helped here

• Unlike classification trees and nearest neighbors, LDA is unable to take into account gene interactions

• Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into mechanisms underlying the class distinctions

Page 42: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Comparison study – Discussion (II)

• Classification trees are capable of handling and revealing interactions between variables

• Useful by-product of aggregated classifiers: prediction votes, variable importance statistics

• Variable selection: A crude criterion such as BSS/WSS may not identify the genes that discriminate between all the classes and may not reveal interactions between genes

• With larger training sets, expect improvement in performance of aggregated classifiers

Page 43: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Acknowledgements

• Sandrine Dudoit

• Jane Fridlyand

• Yee Hwa (Jean) Yang

• Terry Speed

• www.bioconductor.org

Page 44: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

R: discrimination

• A number of R packages (libraries) contain functions to carry out discrimination, including: – MASS: lda, qda

– sma: dlda

– class: knn

– rpart: classification and regression trees (recursive partitioning)

– ipred: bagging

– e1071: svm

– LogitBoost: boosting

Page 45: Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site: darlene

Exercise: discrimination

• These ideas have been hard!

• The handout gives a very brief guide to using some of the classification routines in R, including a little bit of cross-validation

• There are many others as well, feel free to explore...