TRANSCRIPT
Class discrimination for microarray studies
Vlad Popovici
Swiss Institute of Bioinformatics
February 5th, 2008
Vlad Popovici (SIB) Class discrimination for microarray studies February 5th, 2008 1 / 45
Outline
1 Introduction
2 Discriminant analysis
3 Performance assessment
4 Estimating the performance parameters
Introduction
Example: ER status prediction
Questions:
How to decide which patient is ER+ and which is ER-?
What is the expected error?
What if I prefer to detect most of ER+?
Introduction
Know your problem!
Remember: good study ←→ clear objectives.
Problems:
Class Comparison: find genes differentially expressed between predefined classes;
Class Prediction: predict one of the predefined classes using the gene expressions;
Class Discovery: cluster analysis – define new classes using clusters of genes/specimens.
Introduction
Class prediction
Typical applications:
predict treatment response
predict patient relapse
predict the phenotype
toxico–genomics: predict which chemicals are toxic
. . .
Characteristics:
supervised learning: requires labelled training data
the goal is prediction accuracy
uses some measure of similarity
relies on feature selection
quite often incorrectly used
Introduction
Usage problems:
improper methodological approach:
− a well-fitted model does not ensure good prediction (overfitted model)
− too many features used in the model (curse of dimensionality)
− feature selection on the full dataset(!)
reproducibility:
− improper/insufficient validation
− batch effects unaccounted for
− insufficiently documented
therapeutic relevance
Introduction
External validation
[Workflow: Data acquisition → Design decisions (feature selection method(s), classifier(s), performance criterion) → Model construction (feature selection, classifier design, performance estimation, model selection) → External validation]
Data acquisition: everything up to (and including) normalization
Design decisions: should be taken before real modeling
Model design: DO NOT USE ALL DATA AT ONCE!!
External validation: other datasets; clinical trials: phase II and III
Discriminant analysis
Discriminant analysis
Goal: Find a separation boundary between the classes.
[Figure: decision boundary f(x) = 0 separating the regions f(x) > 0 and f(x) < 0]
Discriminant analysis
Representing data
[Figure: expression matrix of p features (probesets, e.g. 1007_s_at, 117_at, 1053_at, 211584_s_at, 211585_at) × n samples (Tumor 1, . . . , Tumor n)]
each element to be classified is a vector
usually we classify tumors/samples/patients → use columns
p ≫ n
Discriminant analysis
Formalism:
Data points: X = {xi ∈ R^p | i = 1, . . . , n}
x could be the log ratios or log signals
Labels: Y = {yi ∈ {ω1, . . . , ωk} | i = 1, . . . , n} (k classes); e.g. ω1 = pCR and ω2 = non-pCR
Easier: take yi ∈ {1, 2, . . . , k} or yi ∈ {−1, +1} (for two classes)
2-class problem (dichotomy): Given a finite set of points X and their corresponding labels Y (a training set), find a discriminant function f such that
f(x) > 0 for x ∈ ω1 and f(x) < 0 for x ∈ ω2, for all x.
Discriminant analysis
Assumptions:
the training set is representative of the whole population
the characteristics of the population do not change over time (e.g. same experimental conditions)
Comments:
for all x: i.e. infer a rule that works for unseen data – generalization
perfect classification of the training data does not ensure generalization; e.g. f(xi) = yi will hardly work on new data
as stated, the problem is ill-posed: there is an infinity of solutions
real data is noisy: usually there is no perfect solution
Discriminant analysis
Linear discriminants
With x = (x1, . . . , xn)^t, w = (w1, . . . , wn)^t,
Linear discriminant functions
f(x) = w^t x + w0 = ∑_{i=1}^{n} wi xi + w0,  w, x ∈ R^n, w0 ∈ R
New problem: optimize some criterion
(w*, w0*) = arg max_{w, w0} J(X, Y; w, w0)
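As a minimal sketch (not from the slides), a linear discriminant and its sign-based decision rule look like this in Python; the weights `w` and offset `w0` are illustrative values, which in practice come from optimizing a criterion J:

```python
import numpy as np

# Illustrative weights and offset; in practice they are obtained by
# optimizing a criterion J (e.g. Fisher's LDA, next slides).
w = np.array([1.0, -2.0])
w0 = 0.5

def f(x):
    # linear discriminant: f(x) = w^t x + w0
    return float(w @ x + w0)

def predict(x):
    # decide omega1 if f(x) > 0, otherwise omega2
    return "omega1" if f(x) > 0 else "omega2"
```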
Discriminant analysis
Geometry of the linear discriminants
[Figure: separating hyperplane with normal vector w and offset w0; w is orthogonal to the boundary and w0 shifts it away from the origin]
Discriminant analysis
Fisher’s LDA
Fisher’s criterion
J(w) = (w^t S_B w) / (w^t S_W w)
where
S_B = (µ1 − µ2)(µ1 − µ2)^t is the between-class scatter matrix (µi is the average of class i)
S_W = (1/(n − 2)) (n1 Σ1 + n2 Σ2) is the pooled within-class covariance matrix (Σi is the estimated covariance of class i)
Discriminant analysis
Fisher’s criterion (in plain English): Find the direction w along which the two classes are best separated – in some sense.
Solution:
w = S_W^{-1} (µ1 − µ2)
w0 = ??
− assuming data is normal with equal covariances: closed-form formula
− alternative: estimate w0 by line search
− can be used to embed prior probabilities in the classifier
[Figure: projection direction w and threshold w0]
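The solution above can be sketched in NumPy; `fisher_lda` is an illustrative name, and the midpoint threshold is just one of the w0 choices the slide mentions (it assumes normal classes with equal covariances and equal priors):

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher's LDA direction w = S_W^{-1}(mu1 - mu2); a minimal sketch.

    X1, X2: arrays of shape (n_i, p) holding the samples of each class.
    The threshold w0 uses the midpoint rule (equal covariances, equal
    priors); line search is the alternative mentioned in the slides.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False, bias=True)   # covariance of class 1
    S2 = np.cov(X2, rowvar=False, bias=True)   # covariance of class 2
    Sw = (n1 * S1 + n2 * S2) / (n1 + n2 - 2)   # pooled within-class covariance
    w = np.linalg.solve(Sw, mu1 - mu2)         # S_W^{-1}(mu1 - mu2)
    w0 = -0.5 * w @ (mu1 + mu2)                # midpoint threshold
    return w, w0

# classify by the sign of f(x) = w^t x + w0: positive -> class 1
```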
Discriminant analysis
Versions of Fisher’s DA
Under the normality assumption:
if the covariance matrices are equal and the features are uncorrelated: Diagonal LDA (the covariance matrices are diagonal);
if the covariance matrices are not equal: Quadratic DA
Discriminant analysis
Versions of Fisher’s DA
[Figure: decision boundaries of the DA variants, from Duda, Hart & Stork, Pattern Classification]
Discriminant analysis
A Bayesian perspective
2 classes, ω1, ω2
one continuous feature (e.g. log expression)
p(x|ωi): class-conditional distribution
from Bayes’ formula:
p(ωi|x) = p(x|ωi) p(ωi) / p(x)
posterior = likelihood × prior / evidence
optimal decision (Bayes decision rule): decide x ∈ ω1 if p(ω1|x) > p(ω2|x), otherwise decide x ∈ ω2
p(ωi) = ?  p(x|ωi) = ?
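A sketch of the Bayes decision rule for one continuous feature and two classes; the Gaussian class-conditional densities and all parameter values are illustrative assumptions, not from the slides:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    # Gaussian density, used here as an (assumed) model for p(x | omega_i)
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def bayes_decide(x, prior1, mu1, sigma1, mu2, sigma2):
    """Decide between omega1 and omega2 by comparing posteriors.
    Returns the decision and p(omega1 | x) as a confidence measure."""
    prior2 = 1.0 - prior1
    evidence = (normal_pdf(x, mu1, sigma1) * prior1
                + normal_pdf(x, mu2, sigma2) * prior2)
    post1 = normal_pdf(x, mu1, sigma1) * prior1 / evidence  # p(omega1 | x)
    return ("omega1" if post1 > 0.5 else "omega2"), post1
```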
Discriminant analysis
Comments:
taking p(x|ωi) to be normal densities ⇒ the Bayes decision boundary is one of the previous DA forms
this hypothesis also allows estimating the posteriors ⇒ we can have a confidence measure on the decision
estimating p(x|ωi) from data ⇒ Naïve Bayes classifier
Discriminant analysis
Support Vector Machines (SVM)
separation of the classes is measured in terms of margin
rigorous mathematical framework (structural risk minimization)
linear discriminant (optimal hyperplane)
[Figure: maximum-margin hyperplane; points inside the margin or on the wrong side count as errors]
Discriminant analysis
Principle of SVM: Maximize the margin while minimizing the training error.
Optimization problem: find the hyperplane such that
1/margin² + cost · ∑_i error_i
is minimized.
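A rough sketch of this trade-off by subgradient descent on the hinge loss (minimizing ||w||², which is proportional to 1/margin², plus cost times the sum of margin violations); this is a toy illustration, not a production SVM solver, and the learning-rate/epoch settings are arbitrary assumptions:

```python
import numpy as np

def linear_svm(X, y, cost=1.0, lr=0.01, epochs=500):
    """Soft-margin linear SVM via subgradient descent (a sketch).
    Minimizes ||w||^2 + cost * sum of hinge losses, i.e. (1/margin^2)
    plus the weighted training error of the slide.
    X: (n, p) data matrix; y: labels in {-1, +1}."""
    n, p = X.shape
    w, w0 = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        viol = margins < 1                     # inside margin or misclassified
        gw = 2 * w - cost * (y[viol, None] * X[viol]).sum(axis=0)
        g0 = -cost * y[viol].sum()
        w, w0 = w - lr * gw, w0 - lr * g0
    return w, w0
```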
Discriminant analysis
Kernel trick
Q: What if data is not linearly separable?
A: Transform data such that it becomes (more or less) linearly separable.
Kernels: polynomial, radial basis function, etc.
Discriminant analysis
Other commonly used classifiers
nearest centroid, k−NN
logistic regression
classification trees (CART, C4.5); random forests
compound covariate predictor
Discriminant analysis
Final remarks on classifiers
try to understand the classifier: assumptions, strong/weak points
feature selection should be related to the classifier used (e.g. the t-test is suited for LDA but not for SVM)
complex classifiers generally need more data for training
use the simplest classifier that gives good results...
...but not a simpler one!
Performance assessment
Aspects of classifiers’ performance
discriminability: how well the rule predicts unseen data
reliability: robustness of the prediction, or how well the posterior probabilities are estimated
Performance assessment
Counting errors
basic: how many times the predicted class differs from the true class (0–1 loss)
cost-aware methods: some errors may be more costly
root mean square error (RMSE): requires yi ∈ {0, 1} and that the classifier outputs posterior probabilities pi:
RMSE = sqrt( (1/n) ∑_i (yi − pi)² )
mean absolute error (MAE): same requirements as RMSE:
MAE = (1/n) ∑_i |yi − pi|
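The two formulas translate directly to NumPy; a small sketch with illustrative function names:

```python
import numpy as np

def rmse(y, p):
    # y: true labels in {0, 1}; p: predicted posterior probabilities
    y, p = np.asarray(y, float), np.asarray(p, float)
    return float(np.sqrt(np.mean((y - p) ** 2)))

def mae(y, p):
    # mean absolute error, same inputs as rmse
    y, p = np.asarray(y, float), np.asarray(p, float)
    return float(np.mean(np.abs(y - p)))
```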
Performance assessment
Confusion matrix
(for 2 classes, called positive and negative, respectively)

                        Ground truth (gold standard)
                        Positive    Negative
Predicted   Positive    a           b
            Negative    c           d

a + b + c + d = n
ideally: b = c = 0
a: true positive, b: false positive, c: false negative, d: true negative
Performance assessment
Error/performance metrics
These are point estimates:
apparent positives = a + c, apparent negatives = b + d
false positive rate FPR = b/(b + d), false negative rate FNR = c/(a + c)
accuracy ACC = (a + d)/n
misclassification rate Err = (b + c)/n
sensitivity SN = a/(a + c) = true positive rate
specificity SP = d/(b + d) = true negative rate
positive predictive value PPV = a/(a + b)
negative predictive value NPV = d/(c + d)
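The point estimates above can be collected in one helper; a sketch with an illustrative name:

```python
def confusion_metrics(a, b, c, d):
    """Point estimates from the 2x2 confusion matrix
    (a: TP, b: FP, c: FN, d: TN), as defined above."""
    n = a + b + c + d
    return {
        "ACC": (a + d) / n,   # accuracy
        "Err": (b + c) / n,   # misclassification rate
        "SN":  a / (a + c),   # sensitivity = true positive rate
        "SP":  d / (b + d),   # specificity = true negative rate
        "FPR": b / (b + d),
        "FNR": c / (a + c),
        "PPV": a / (a + b),
        "NPV": d / (c + d),
    }
```

Note that SN + FNR = 1 and SP + FPR = 1 by construction.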
Performance assessment
ROC and AUC
What if we would like to see the effect of trading off the true positive rate against the false positive rate (SN and 1 − SP)?
Performance assessment
ROC and AUC
What if we would like to see the effect of trading off the true positive rate against the false positive rate (SN and 1 − SP)?
vary the threshold
for each value of the threshold, measure TPR and FPR
[Figure: ROC curve, true positive rate (SN) vs. false positive rate (1 − SP), both from 0 to 1]
ROC = Receiver Operating Characteristic
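Sweeping the threshold can be sketched as follows (an illustration assuming no tied scores; `roc_points` and `auc` are hypothetical names, and the area is computed with the trapezoidal rule, anticipating the AUC slide):

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold.
    scores: classifier outputs (higher = more 'positive'); labels in {0, 1}."""
    order = np.argsort(-np.asarray(scores, float))   # sort by decreasing score
    labels = np.asarray(labels)[order]
    P = labels.sum()
    N = len(labels) - P
    tpr = np.concatenate([[0.0], np.cumsum(labels) / P])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / N])
    return fpr, tpr

def auc(fpr, tpr):
    # trapezoidal rule for the area under the ROC curve
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
```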
Performance assessment
ROC and AUC
Different types of ROC curves:
[Figure: three ROC curves R0, R1, R2 compared against the diagonal of a random classifier]
Performance assessment
ROC and AUC
ROC curves can be used for comparing classifiers...
[Figure: two ROC curves, R1 and R2, one lying above the other over the whole range]
Performance assessment
ROC and AUC
ROC curves can be used for comparing classifiers... but not always:
[Figure: two crossing ROC curves, R1 and R2 – neither dominates over the whole range]
Performance assessment
ROC and AUC
The Area Under the Curve (AUC) is a summary of the ROC...
[Figure: shaded area under an ROC curve]
...and can be used for comparing classifiers.
Estimating the performance parameters
Estimation schemes
Why estimation? Generally there is no analytic form of the error rate.
We discuss the error rate, but this applies to any performance metric.
Simple schemes:
apparent error: estimate on the training set
holdout estimate: unique split train/test
Resampling schemes:
k–fold cross–validation
repeated k–fold cross–validation
leave–one–out
bootstrapping
Estimating the performance parameters
k-fold cross-validation
separate train and test sets
randomly divide the data into k subsets (folds) – you may also choose to enforce the proportions of the classes (stratified CV)
train on k − 1 folds and test on the held-out fold
estimate the error as the average error measured on the held-out folds
[Figure: data split into k folds; each fold serves once as the test set]
usually k = 5 or k = 10
if k = n ⇒ leave-one-out estimator
improved estimation: repeated k-CV (e.g. 100 × (5-CV))
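The procedure can be sketched generically (a plain, non-stratified version; `kfold_error` and the train/predict callables are illustrative conventions, not from the slides):

```python
import numpy as np

def kfold_error(X, y, train_fn, predict_fn, k=5, seed=0):
    """Plain k-fold CV error estimate (a sketch; not stratified).
    train_fn(Xtr, ytr) -> model; predict_fn(model, Xts) -> predicted labels."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # random assignment to folds
    folds = np.array_split(idx, k)
    errs = []
    for j in range(k):
        ts = folds[j]                      # held-out fold
        tr = np.concatenate([folds[i] for i in range(k) if i != j])
        model = train_fn(X[tr], y[tr])
        errs.append(float(np.mean(predict_fn(model, X[ts]) != y[ts])))
    return float(np.mean(errs)), errs      # E_{k-CV} and the per-fold errors
```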
Estimating the performance parameters
k-fold cross-validation
From the k folds:
ε1, . . . , εk: errors on the test folds
E_{k-CV} = (1/k) ∑_{j=1}^{k} εj
estimated standard deviation
Confidence intervals (simple version – binomial approximation):
E ≈ E ± ( 0.5/n + z sqrt( E(1 − E)/n ) )
where n is the dataset size and z = Φ⁻¹(1 − α/2), for a 1 − α confidence interval (e.g. z = 1.96 for a 95% conf. interval)
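The interval formula in code form; `error_ci` is an illustrative name, and the clamping to [0, 1] is an added convenience not stated on the slide:

```python
from math import sqrt

def error_ci(E, n, z=1.96):
    """Binomial-approximation confidence interval for an error rate E
    estimated from n samples (the simple version from the slide):
    E +/- (0.5/n + z * sqrt(E(1-E)/n)); z = 1.96 for a 95% interval."""
    half = 0.5 / n + z * sqrt(E * (1 - E) / n)
    return max(0.0, E - half), min(1.0, E + half)   # clamp to [0, 1]
```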
Estimating the performance parameters
Bootstrap error estimation
[Figure: B bootstrap datasets (X1, Y1), (X2, Y2), . . . , (XB, YB) resampled from (X, Y), giving errors E1, E2, . . . , EB]
1. generate a new dataset (Xb, Yb) by resampling with replacement from the original dataset (X, Y)
2. train the classifier on (Xb, Yb) and test on the left-out data, to obtain an error Eb
3. repeat 1–2 for b = 1, . . . , B and collect the Eb
Estimating the performance parameters
Bootstrap error estimation
estimate the error: for example, use the .632 estimator
E = 0.368 E0 + 0.632 (1/B) ∑_{b=1}^{B} Eb
where E0 is the error rate on the full training set (X, Y)
use the empirical distribution of the Eb to obtain confidence intervals
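Steps 1–3 and the .632 estimator together, as a sketch (`bootstrap632` and the train/predict callables are illustrative conventions; replicates with an empty left-out set are simply skipped, an edge case the slides do not discuss):

```python
import numpy as np

def bootstrap632(X, y, train_fn, predict_fn, B=100, seed=0):
    """The .632 bootstrap error estimate, as a sketch.
    Each replicate trains on a resample (with replacement) and tests on
    the left-out points; E = 0.368*E0 + 0.632*mean(Eb)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    model0 = train_fn(X, y)
    E0 = np.mean(predict_fn(model0, X) != y)    # error on the full training set
    Eb = []
    for _ in range(B):
        boot = rng.integers(0, n, n)            # resample with replacement
        oob = np.setdiff1d(np.arange(n), boot)  # left-out points
        if len(oob) == 0:
            continue                            # rare: nothing left out
        model = train_fn(X[boot], y[boot])
        Eb.append(np.mean(predict_fn(model, X[oob]) != y[oob]))
    return float(0.368 * E0 + 0.632 * np.mean(Eb))
```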
Estimating the performance parameters
Case study (1)
Build a classifier to predict ER status:
select the top 2 probesets using the ratio of between-groups to within-groups sum of squares (similar to a t-test)
use LDA to discriminate between ER- and ER+
Algorithm
Let (X, Y) be the full dataset. For all k = 1 . . . K folds:
let the training set be Xtr = X \ X(k), Ytr = Y \ Y(k) (all but the k-th fold), and let the test set be Xts = X(k), Yts = Y(k)
select the best 2 probesets on (Xtr, Ytr) and train LDA on these data
test on (Xts, Yts) and record the values for error, AUC, ...
Compute the expected error, AUC, ...
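The feature-ranking step can be sketched as follows (`bw_ratio` and `top_features` are illustrative names; the score is the per-feature ratio of between-groups to within-groups sum of squares described above):

```python
import numpy as np

def bw_ratio(X, y):
    """Per-feature ratio of between-groups to within-groups sum of squares
    (two classes, labels y in {0, 1}) – the gene-ranking score of the case study."""
    m = X.mean(axis=0)
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    n0, n1 = (y == 0).sum(), (y == 1).sum()
    B = n0 * (m0 - m) ** 2 + n1 * (m1 - m) ** 2
    W = ((X[y == 0] - m0) ** 2).sum(axis=0) + ((X[y == 1] - m1) ** 2).sum(axis=0)
    return B / W

def top_features(X, y, d=2):
    # indices of the d best-ranked features; call this on the TRAINING
    # fold only, never on the full dataset (selection bias!)
    return np.argsort(-bw_ratio(X, y))[:d]
```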
Estimating the performance parameters
Case study (1)
CV scheme       Error    95% CI for Error     AUC
5-CV            0.0692   0.0073–0.1311        0.966
10-CV           0.0615   0.0026–0.1204        0.970
20 × (5-CV)     0.0615   (0.0538–0.0692)*     0.969

* from the quantiles of the empirical distribution of the error
Estimating the performance parameters
Case study (2)
Build a classifier to predict patients with pathologic complete response (pCR):
find the best number of probesets to include in the model using the ratio of between-groups to within-groups sum of squares (similar to a t-test)
use LDA to discriminate between patients with pCR and those without
the best number of probesets is a meta-parameter, which has to be estimated within a cross-validation
problem: if we estimate the meta-parameter on the full training set, we will likely fail to correctly estimate the performance (optimistic bias)
Estimating the performance parameters
Case study (2)
Two nested (external) CV loops:
[Figure: an outer CV loop splits the data into train/test sets; an inner CV loop, run on the outer training set only, selects the meta-parameter]
Estimating the performance parameters
What we do learn from CV:
the expected performance of the modeling recipe;
the imprecision in estimating the performance; we can also have a look at:
− which features are most stable
− which points are always misclassified
What we do not learn from CV:
the best features
the best classifier
the best meta-parameters
We obtain these by training on the full dataset (no CV).
Bibliography
Bibliography
Duda, Hart, Stork: Pattern Classification
Hastie, Tibshirani, Friedman: The Elements of Statistical Learning
T. Fawcett: ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, HP Laboratories Tech. Rep. HPL-2003-4
A. Webb: Statistical Pattern Recognition
I. Shmulevich, E. Dougherty: Genomic Signal Processing