Classification (10/03/07): Diagnosing disease by gene expression patterns (Golub et al. 1999)
Two types of statistical learning
• Supervised – The classes are predefined, and the membership of a set of objects is known. Try to develop a rule to predict the membership of a new object.
• Unsupervised – Discover clusters of patterns from observed data. Both the memberships and the clusters need to be identified.
• Classification is a kind of supervised learning.
How good is good enough?
• Suppose a test is used to screen for a certain disease. The test has 99% sensitivity and 99% specificity.
• The disease is rare: 1 case out of 1 million people.
• Question: Is this test useful?
• Misclassification rate of the test = P(healthy) × P(false positive) + P(diseased) × P(false negative) = 0.999999 × 0.01 + 0.000001 × 0.01 = 0.01
• If we predict that no one has the disease, the misclassification rate = 0.000001 × 1 = 0.000001
• Does that mean the test is no good?
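A quick numeric check of both error rates (a sketch; the 99%/99% test characteristics and the one-in-a-million prevalence are the numbers from this slide):

```python
# Misclassification rates for the screening-test example.
prevalence = 1e-6      # 1 case per million people
sensitivity = 0.99     # P(test positive | diseased)
specificity = 0.99     # P(test negative | healthy)

# The test errs on false positives among the healthy
# and false negatives among the diseased.
err_test = (1 - prevalence) * (1 - specificity) + prevalence * (1 - sensitivity)

# Trivial rule: declare everyone healthy; it errs only on the diseased.
err_trivial = prevalence * 1.0

print(f"test error rate:    {err_test:.6f}")     # ~0.01
print(f"trivial error rate: {err_trivial:.6f}")  # 0.000001
```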
Loss function
• Often our goal is to minimize the misclassification error rate.
• Sometimes an error in one direction outweighs an error in the other. For example, it is more costly to classify a sick patient as healthy than to classify a healthy patient as sick.
• In general, we want to minimize a loss function L(Ctrue, Cpredict).
Procedure for developing a classifier
• Collect data with known class associations.
• Take out a subset and don't touch it. This will be the testing subset.
• Build a model using information from the rest of the data, i.e., the training set.
• Apply the trained model to the testing data. Evaluate model performance.
• If you use all the data to train your model, you will overfit and the reported performance will be exaggerated. (The protocol is sketched below.)
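A minimal sketch of this protocol, assuming scikit-learn is available; the toy data, the kNN classifier, and the 30% test fraction are illustrative choices, not part of the original slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))       # toy expression matrix: 100 samples x 20 genes
y = rng.integers(0, 2, size=100)     # known class labels

# Hold out a test subset and do not touch it during model building.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Apply the trained model to the held-out test data only once, to evaluate.
print("test accuracy:", model.score(X_test, y_test))
```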
k-nearest-neighbor classifier
[Figure: a query point with its k = 5 nearest neighbors numbered 1-5.]
• Find k-nearest neighbors
• Classify the unknown case by majority vote.
• Despite its simplicity, kNN can be effective.
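A from-scratch sketch of the majority-vote rule just described (Euclidean distance; function and variable names are my own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of k nearest neighbors
    votes = Counter(y_train[nearest])                 # count class labels among them
    return votes.most_common(1)[0][0]

# Toy usage: two well-separated classes.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.5]), k=3))   # -> 1
```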
Issues with k-nearest-neighbor classifier
• Computationally intensive.
• How to choose k?
• Nearest neighbors may not be close (especially when X is high-dimensional).
– Most genes are probably irrelevant to the prediction anyhow.
– Pre-select features using dimension reduction methods (discussed by Prof. Cai last time).
– Dimension reduction is important for other classifiers as well.
Feature selection
• The dimension of the model (the number of genes) is very high.
• It is hard to find close neighbors in high-dimensional space.
• Many genes are irrelevant.
• Pre-select genes using dimension reduction methods (a sketch follows below).
• Dimension reduction is required for other models as well.
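The sketch below illustrates one simple pre-selection heuristic, keeping the genes with the largest variance; this is only one of many possible dimension reduction methods:

```python
import numpy as np

def top_variance_genes(X, n_keep=50):
    """Return column indices of the n_keep genes with the largest variance."""
    variances = X.var(axis=0)                 # per-gene variance across samples
    return np.argsort(variances)[::-1][:n_keep]

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 1000))               # 30 samples, 1000 genes
keep = top_variance_genes(X, n_keep=50)
X_reduced = X[:, keep]                        # reduced matrix for downstream classifiers
print(X_reduced.shape)                        # (30, 50)
```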
Classification Methods
• Linear discriminant analysis (LDA)
• Logistic regression
• Classification trees
• Support vector machine (SVM)
• Neural network
• Many other methods!
Linear Discriminant Analysis (LDA)
Approximate the probability distribution within each class by a Gaussian distribution.
Bayes rule gives the posterior distribution:

$$P(G = k \mid X = x) = \frac{P(x, G = k)}{P(x)} = \frac{P(x \mid G = k)\,P(G = k)}{\sum_j P(x \mid G = j)\,P(G = j)}$$

Select the k with the largest posterior probability:

$$B(x) = \arg\max_k P(G = k \mid X = x)$$

This rule minimizes the average misclassification rate. The maximum likelihood rule is equivalent to the Bayes rule with a uniform prior. The decision boundary is

$$P(G = 1 \mid X = x) = P(G = 2 \mid X = x)$$
LDA
• The boundary is linear if the covariance matrices of the two classes are the same (a sketch of this case follows below).
• Otherwise, the boundary is quadratic and the method is called quadratic discriminant analysis (QDA).
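A sketch of the Gaussian plug-in rule for the shared-covariance (linear-boundary) case; the estimators follow the usual plug-in recipe, and the names are my own:

```python
import numpy as np

def lda_fit(X, y):
    """Plug-in estimates: class means, pooled covariance, class priors."""
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    # Pooled within-class covariance (shared across classes -> linear boundary).
    centered = np.vstack([X[y == k] - means[k] for k in classes])
    cov = centered.T @ centered / (len(X) - len(classes))
    priors = {k: np.mean(y == k) for k in classes}
    return classes, means, np.linalg.inv(cov), priors

def lda_predict(x, classes, means, cov_inv, priors):
    """Pick k maximizing log P(x | G=k) + log P(G=k) (constants cancel)."""
    scores = {k: -0.5 * (x - means[k]) @ cov_inv @ (x - means[k]) + np.log(priors[k])
              for k in classes}
    return max(scores, key=scores.get)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([1, 2], 50)
params = lda_fit(X, y)
print(lda_predict(np.array([2.5, 2.5]), *params))   # likely class 2
```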
Logistic regression
• Model the log-odds of the k-th class versus a reference class, e.g. the 1st class:

$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = 1 \mid X = x)} = \beta_{k0} + \beta_k^T x, \qquad k = 2, \ldots, K$$

Equivalently,

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{j=2}^{K} \exp(\beta_{j0} + \beta_j^T x)}, \qquad k = 2, \ldots, K,$$

$$\Pr(G = 1 \mid X = x) = \frac{1}{1 + \sum_{j=2}^{K} \exp(\beta_{j0} + \beta_j^T x)}.$$

Select the k with the largest Pr(G = k | X = x).

Question: how do we estimate the β's?
Fitting logistic regression model
• Let $p_k(x_i; \beta) = \Pr(G = k \mid X = x_i; \beta)$, where $\beta$ collects all the parameters $\{\beta_{k0}, \beta_k\}$. Maximize the conditional log-likelihood

$$\ell(\beta) = \sum_i \log p_{g_i}(x_i; \beta).$$

• In the special case of two classes, let $y_i = 0$ when $g_i = 1$ and $y_i = 1$ when $g_i = 2$, and let each $x_i$ include the constant term 1. Then

$$\ell(\beta) = \sum_i \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log\big(1 - p(x_i; \beta)\big) \right] = \sum_i \left[ y_i \beta^T x_i - \log\big(1 + e^{\beta^T x_i}\big) \right].$$

• The maximum is achieved when

$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_i x_i \big(y_i - p(x_i; \beta)\big) = 0.$$
Fitting logistic regression model (ctd)
• Since this is a non-linear equation, it can only be solved numerically. This is achieved by the Newton-Raphson method:

$$\beta^{new} = \beta^{old} - \left( \frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial \ell(\beta)}{\partial \beta},$$

where

$$\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} = -\sum_i x_i x_i^T \, p(x_i; \beta)\big(1 - p(x_i; \beta)\big).$$

Note: global convergence is not guaranteed.
• The case of multiple classes can be solved similarly.
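A numpy sketch of the two-class Newton-Raphson update above; each x_i is augmented with a constant 1 for the intercept, and the fixed iteration count is an arbitrary choice:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Two-class logistic regression via Newton-Raphson.

    X: (n, p) design matrix; y: 0/1 labels. A column of ones is prepended
    so that beta[0] is the intercept.
    """
    Xa = np.hstack([np.ones((len(X), 1)), X])
    beta = np.zeros(Xa.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xa @ beta))        # p(x_i; beta)
        grad = Xa.T @ (y - p)                       # dl/dbeta
        W = p * (1 - p)                             # weights p(1 - p)
        hess = -(Xa * W[:, None]).T @ Xa            # d2l / dbeta dbeta^T
        beta = beta - np.linalg.solve(hess, grad)   # Newton step
    return beta

# Toy usage: data generated from a logistic latent-variable model.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + 0.5 + rng.logistic(size=200) > 0).astype(float)
print(fit_logistic_newton(X, y))   # roughly recovers (0.5, 1.5, -2.0)
```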
Naïve Bayes method
From Bayes' rule,

$$\log \frac{\Pr(G = k \mid X)}{\Pr(G = l \mid X)} = \log \frac{\pi_k \, p_k(X)}{\pi_l \, p_l(X)},$$

where $\pi_k$ is the prior probability of class k and $p_k(X) = \Pr(X \mid G = k)$.

If $X = (X_1, \ldots, X_J)$ is high-dimensional (J = the number of genes considered), $p_k(X)$ is difficult to estimate. However, if we assume the $X_j$'s are independent of each other, i.e.

$$p_k(X) = \prod_j p_{kj}(X_j),$$

then the $p_{kj}(X_j)$ can be easily estimated.
Naïve Bayes method
Therefore,

$$\log \frac{\Pr(G = k \mid X)}{\Pr(G = l \mid X)} = \log \frac{\pi_k \, p_k(X)}{\pi_l \, p_l(X)} = \log \frac{\pi_k}{\pi_l} + \sum_j \log \frac{p_{kj}(X_j)}{p_{lj}(X_j)}.$$
Note: Surprisingly, even though the assumption that the X_j's are independent is almost never met, the naïve Bayes classifier often performs well, even beating more sophisticated methods.
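A sketch assuming each p_kj(X_j) is modeled as a Gaussian density (one common choice for continuous expression values); the independence assumption turns the product into a sum of per-gene log densities:

```python
import numpy as np
from scipy.stats import norm

def naive_bayes_fit(X, y):
    """Estimate per-class, per-feature Gaussian parameters and class priors."""
    classes = np.unique(y)
    params = {k: (X[y == k].mean(axis=0), X[y == k].std(axis=0, ddof=1))
              for k in classes}
    priors = {k: np.mean(y == k) for k in classes}
    return classes, params, priors

def naive_bayes_predict(x, classes, params, priors):
    """Pick k maximizing log pi_k + sum_j log p_kj(x_j)."""
    scores = {}
    for k in classes:
        mu, sd = params[k]
        scores[k] = np.log(priors[k]) + norm.logpdf(x, mu, sd).sum()
    return max(scores, key=scores.get)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(1, 1, (40, 5))])
y = np.repeat([1, 2], 40)
model = naive_bayes_fit(X, y)
print(naive_bayes_predict(np.full(5, 0.8), *model))
```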
Classification tree
Goal: Predict whether a person owns a house by asking a few questions with yes or no answers.
Predictors: Age, Car Type, etc.
[Decision tree:]
Age >= 30?
├─ yes → YES
└─ no → Car Type?
    ├─ sports car → NO
    └─ minivan → YES
Regression tree: Algorithm
The response variable is continuous.

Goal: Select a partition of regions (nodes) $R_1, \ldots, R_M$, so that the response can be modeled as a constant $c_m$ in each region.

Step 1: For a splitting variable $X_j$ and a splitting point $s$, define

$$R_1(j, s) = \{X \mid X_j \le s\}, \qquad R_2(j, s) = \{X \mid X_j > s\}.$$

Seek $j$ and $s$ so that

$$\sum_i \big(y_i^{(1)} - \bar{y}^{(1)}\big)^2 + \sum_i \big(y_i^{(2)} - \bar{y}^{(2)}\big)^2$$

is minimized, where the $y_i^{(m)}$ are the responses in $R_m(j, s)$ and $\bar{y}^{(m)}$ is their mean.

Step 2: For each $R_m$, refine the partition by repeating Step 1; stop when the number of nodes reaches a predefined cutoff.
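A sketch of the Step 1 search: exhaustively score every candidate (j, s) by the summed squared deviations around the two region means (names are my own):

```python
import numpy as np

def best_split(X, y):
    """Find (j, s) minimizing the sum of squared errors of the two regions."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):                    # splitting variable X_j
        for s in np.unique(X[:, j])[:-1]:          # candidate split points
            left, right = X[:, j] <= s, X[:, j] > s
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(100, 3))
y = np.where(X[:, 1] > 4, 5.0, 1.0) + rng.normal(0, 0.1, 100)
print(best_split(X, y))    # should pick j=1 with s near 4
```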
Classification tree: Pruning
Let $T_0$ be the full tree. Define a subtree $T \subset T_0$ to be any tree that can be obtained by pruning $T_0$, and let $|T|$ be its number of terminal nodes. Let

$$\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i.$$

The quality of fit in node m is given by

$$Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2.$$

Define a cost-complexity criterion, for a pre-selected level $\alpha \ge 0$:

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m \, Q_m(T) + \alpha \, |T|.$$

Seek the subtree $T_\alpha \subseteq T_0$ that minimizes $C_\alpha(T)$.
Classification tree: Pruning
Find the weak link, that is, the node whose collapse leads to the minimum increase of

$$\sum_m N_m \, Q_m(T).$$

Repeat the above procedure until a single-node tree is reached.

Theorem (Breiman et al. 1984): The optimal subtree is contained in the above sequence of subtrees.

The level of $\alpha$ can be determined through cross-validation. (We will talk about cross-validation later.)
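A small sketch of the final comparison: given the leaf summaries (N_m, Q_m) for each subtree in the weakest-link sequence (assumed already computed), pick the subtree minimizing C_α(T):

```python
# Each candidate subtree is summarized by its leaves' (N_m, Q_m) pairs.
# C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|
def cost_complexity(leaves, alpha):
    return sum(n * q for n, q in leaves) + alpha * len(leaves)

# Hypothetical weakest-link sequence: bigger trees fit better but cost more.
subtrees = [
    [(30, 0.10), (30, 0.12), (20, 0.08), (20, 0.05)],  # 4 leaves
    [(60, 0.20), (20, 0.08), (20, 0.05)],              # 3 leaves
    [(100, 0.40)],                                     # root only
]
alpha = 5.0
best = min(subtrees, key=lambda t: cost_complexity(t, alpha))
print(len(best), cost_complexity(best, alpha))
```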
Classification tree
• Classification trees differ from regression trees in the quality term $Q_m(T)$.
• For a regression tree, minimize

$$Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2.$$

• For a classification tree, let $\hat{p}_{mk}$ be the proportion of class k in node m and $k(m)$ the majority class; minimize one of:
– Misclassification error: $\frac{1}{N_m} \sum_{i \in R_m} I\big(y_i \ne k(m)\big) = 1 - \hat{p}_{m\,k(m)}$
– Gini index: $\sum_{k \ne k'} \hat{p}_{mk} \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$
– Cross-entropy (deviance): $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
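The three impurity measures, sketched as functions of the vector of class proportions p̂_mk in a node:

```python
import numpy as np

def misclassification_error(p):
    """1 - p_hat_{m k(m)}: one minus the majority-class proportion."""
    return 1.0 - np.max(p)

def gini(p):
    """sum_k p_k (1 - p_k), equivalently the sum over k != k' of p_k p_k'."""
    return np.sum(p * (1.0 - p))

def cross_entropy(p):
    """-sum_k p_k log p_k (deviance); 0 log 0 treated as 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.7, 0.2, 0.1])   # class proportions in node m
print(misclassification_error(p), gini(p), cross_entropy(p))
```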
Classification tree
• Advantages
– Visually intuitive
– Mathematically "simple"
• Drawbacks
– Unstable: tree structures are sensitive to the data
– Theoretical properties are not well understood
Cross-validation
• The data is divided into a training subset and a testing subset.
• Model building must be independent of the testing subset, including variable selection, tree structure, and so on.
Example: n-fold cross-validation
• A dataset is randomly divided into n subsets of equal size.
• Each subset is selected in turn as the testing set, whereas the rest are used as the training set.
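A sketch of n-fold cross-validation over a random index partition; the kNN classifier from scikit-learn stands in for any model-building step:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def n_fold_cv(X, y, n_folds=5, seed=0):
    """Average test accuracy over n folds; each fold is the test set once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)      # n subsets of (nearly) equal size
    scores = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
        scores.append(model.score(X[test], y[test]))
    return float(np.mean(scores))

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
y = np.repeat([0, 1], 50)
print(n_fold_cv(X, y))
```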
Bootstrap methods
Idea: Randomly draw, with replacement, B samples from the training data, each of the same size as the original training set. Fit the model to each resampled data set, then treat the original training data as testing data.

Estimate:

$$\widehat{\mathrm{Err}}_{Boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L\big(y_i, \hat{f}^{*b}(x_i)\big)$$

Improved version, which for each observation i averages only over the bootstrap samples that do not contain i:

$$\widehat{\mathrm{Err}}_{Boot}^{(1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L\big(y_i, \hat{f}^{*b}(x_i)\big),$$

where $C^{-i}$ is the set of bootstrap samples that do not contain observation i.
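A sketch of both bootstrap estimates with 0/1 loss; the improved (leave-one-out) version averages, for each observation, only over the bootstrap fits whose samples did not contain it:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def bootstrap_errors(X, y, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    losses = np.zeros((B, n))             # 0/1 loss of fit b on observation i
    in_sample = np.zeros((B, n), bool)    # did fit b's sample contain i?
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # draw n points with replacement
        model = KNeighborsClassifier(n_neighbors=3).fit(X[idx], y[idx])
        losses[b] = (model.predict(X) != y).astype(float)
        in_sample[b, np.unique(idx)] = True
    err_boot = losses.mean()              # plain bootstrap estimate
    # Leave-one-out version: for each i, average only over b with i left out.
    out = ~in_sample
    err_loo = np.mean([losses[out[:, i], i].mean() for i in range(n)
                       if out[:, i].any()])
    return err_boot, err_loo

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(1.5, 1, (40, 3))])
y = np.repeat([0, 1], 40)
print(bootstrap_errors(X, y))
```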
Use cross-validation to select parameters
• A classifier may have several tunable parameters: for example, the number of nearest neighbors k in kNN, or the pruning level α for a classification tree.
• These parameters can be selected by CV. In these cases, the full dataset is divided into three parts: training set, testing set 1, and testing set 2.
• Testing set 1 is used to tune parameters. So it cannot be used to objectively estimate model performance. Therefore, testing set 2 is needed.
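A sketch of the three-way split, tuning the number of neighbors k on testing set 1 and reporting performance on testing set 2 (the split fractions and the candidate grid are arbitrary):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (90, 4)), rng.normal(1.5, 1, (90, 4))])
y = np.repeat([0, 1], 90)

# Split into training set, testing set 1 (tuning), testing set 2 (final).
idx = rng.permutation(len(X))
train, tune, final = idx[:100], idx[100:140], idx[140:]

# Use testing set 1 to pick the number of neighbors k.
best_k = max([1, 3, 5, 7, 9],
             key=lambda k: KNeighborsClassifier(n_neighbors=k)
                           .fit(X[train], y[train]).score(X[tune], y[tune]))

# Only testing set 2 gives an objective estimate of the tuned model's performance.
model = KNeighborsClassifier(n_neighbors=best_k).fit(X[train], y[train])
print(best_k, model.score(X[final], y[final]))
```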
Acknowledgement
• Sources of slides:
– Cheng Li
– http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf
– www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt