Classification (10/03/07): Diagnosing disease by gene expression patterns (Golub et al. 1999)
Two types of statistical learning
• Supervised – The classes are predefined, and the membership of a set of objects is known. Try to develop a rule to predict the membership of a new object.
• Unsupervised – Discover clusters of patterns from observed data. Both the memberships and the clusters need to be identified.
• Classification is a kind of supervised learning.
How good is good enough?
• Suppose a test is used to screen for a certain disease. The test has 99% sensitivity and 99% specificity.
• The disease is rare: 1 case out of 1 million people.
• Question: Is this test useful?
• Misclassification rate of the test = P(healthy) × P(false positive) + P(diseased) × P(false negative) = 0.999999 × 0.01 + 0.000001 × 0.01 = 0.01
• If we predict that no one has the disease, the misclassification rate = 0.000001 × 1 = 0.000001
• Does that mean the test is no good?
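A quick numeric check of both error rates (a sketch; the 99%/99% test characteristics and the one-in-a-million prevalence are the numbers from this slide):

```python
# Misclassification rates for the screening-test example.
prevalence = 1e-6      # 1 case per million people
sensitivity = 0.99     # P(test positive | diseased)
specificity = 0.99     # P(test negative | healthy)

# The test errs on false positives among the healthy
# and false negatives among the diseased.
err_test = (1 - prevalence) * (1 - specificity) + prevalence * (1 - sensitivity)

# Trivial rule: declare everyone healthy; it errs only on the diseased.
err_trivial = prevalence * 1.0

print(f"test error rate:    {err_test:.6f}")     # ~0.01
print(f"trivial error rate: {err_trivial:.6f}")  # 0.000001
```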
Loss function
• Often our goal is to minimize the misclassification error rate.
• Sometimes an error in one direction outweighs an error in the other. For example, it is more costly to classify a sick patient as healthy than to classify a healthy patient as sick.
• In general, we want to minimize a loss function L(Ctrue, Cpredict).
Procedure for developing a classifier
• Collect data with known class associations.
• Take out a subset and don't touch it. This will be the testing subset.
• Build a model using information from the rest of the data, i.e., the training set.
• Apply the trained model to the testing data. Evaluate model performance.
• If you use all the data to train your model, you will overfit and the reported performance will be exaggerated. (The protocol is sketched below.)
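A minimal sketch of this protocol, assuming scikit-learn is available; the toy data, the kNN classifier, and the 30% test fraction are illustrative choices, not part of the original slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))       # toy expression matrix: 100 samples x 20 genes
y = rng.integers(0, 2, size=100)     # known class labels

# Hold out a test subset and do not touch it during model building.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Apply the trained model to the held-out test data only once, to evaluate.
print("test accuracy:", model.score(X_test, y_test))
```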
k-nearest-neighbor classifier
[Figure: a query point with its k = 5 nearest neighbors numbered 1-5.]
• Find k-nearest neighbors
• Classify the unknown case by majority vote.
• Despite its simplicity, kNN can be effective.
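A from-scratch sketch of the majority-vote rule just described (Euclidean distance; function and variable names are my own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of k nearest neighbors
    votes = Counter(y_train[nearest])                 # count class labels among them
    return votes.most_common(1)[0][0]

# Toy usage: two well-separated classes.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.5]), k=3))   # -> 1
```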
Issues with k-nearest-neighbor classifier
• Computationally intensive.
• How to choose k?
• Nearest neighbors may not be close (especially when X is high-dimensional).
– Most genes are probably irrelevant to the prediction anyhow.
– Pre-select features using dimension reduction methods (discussed by Prof. Cai last time).
– Dimension reduction is important for other classifiers as well.
Feature selection
• The dimension of the model (the number of genes) is very high.
• It is hard to find close neighbors in high-dimensional space.
• Many genes are irrelevant.
• Pre-select genes using dimension reduction methods (a sketch follows below).
• Dimension reduction is required for other models as well.
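The sketch below illustrates one simple pre-selection heuristic, keeping the genes with the largest variance; this is only one of many possible dimension reduction methods:

```python
import numpy as np

def top_variance_genes(X, n_keep=50):
    """Return column indices of the n_keep genes with the largest variance."""
    variances = X.var(axis=0)                 # per-gene variance across samples
    return np.argsort(variances)[::-1][:n_keep]

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 1000))               # 30 samples, 1000 genes
keep = top_variance_genes(X, n_keep=50)
X_reduced = X[:, keep]                        # reduced matrix for downstream classifiers
print(X_reduced.shape)                        # (30, 50)
```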
Classification Methods
• Linear discriminant analysis (LDA)
• Logistic regression
• Classification trees
• Support vector machine (SVM)
• Neural network
• Many other methods!
Linear Discriminant Analysis (LDA)
Approximate the probability distribution within each class by a Gaussian distribution.
Bayes rule gives the posterior distribution:

$$P(G = k \mid X = x) = \frac{P(x, G = k)}{P(x)} = \frac{P(x \mid G = k)\,P(G = k)}{\sum_j P(x \mid G = j)\,P(G = j)}$$

Select the k with the largest posterior probability:

$$B(x) = \arg\max_k P(G = k \mid X = x)$$

This rule minimizes the average misclassification rate. The maximum likelihood rule is equivalent to the Bayes rule with a uniform prior. The decision boundary is

$$P(G = 1 \mid X = x) = P(G = 2 \mid X = x)$$
LDA
• The boundary is linear if the covariance matrices of the two classes are the same (a sketch of this case follows below).
• Otherwise, the boundary is quadratic and the method is called quadratic discriminant analysis (QDA).
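A sketch of the Gaussian plug-in rule for the shared-covariance (linear-boundary) case; the estimators follow the usual plug-in recipe, and the names are my own:

```python
import numpy as np

def lda_fit(X, y):
    """Plug-in estimates: class means, pooled covariance, class priors."""
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    # Pooled within-class covariance (shared across classes -> linear boundary).
    centered = np.vstack([X[y == k] - means[k] for k in classes])
    cov = centered.T @ centered / (len(X) - len(classes))
    priors = {k: np.mean(y == k) for k in classes}
    return classes, means, np.linalg.inv(cov), priors

def lda_predict(x, classes, means, cov_inv, priors):
    """Pick k maximizing log P(x | G=k) + log P(G=k) (constants cancel)."""
    scores = {k: -0.5 * (x - means[k]) @ cov_inv @ (x - means[k]) + np.log(priors[k])
              for k in classes}
    return max(scores, key=scores.get)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([1, 2], 50)
params = lda_fit(X, y)
print(lda_predict(np.array([2.5, 2.5]), *params))   # likely class 2
```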
Logistic regression
• Model the log-odds of the k-th class versus a reference class, e.g. the 1st class:

$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = 1 \mid X = x)} = \beta_{k0} + \beta_k^T x, \qquad k = 2, \ldots, K$$

Equivalently,

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{j=2}^{K} \exp(\beta_{j0} + \beta_j^T x)}, \qquad k = 2, \ldots, K,$$

$$\Pr(G = 1 \mid X = x) = \frac{1}{1 + \sum_{j=2}^{K} \exp(\beta_{j0} + \beta_j^T x)}.$$

Select the k with the largest Pr(G = k | X = x).

Question: how do we estimate the β's?
Fitting logistic regression model
• Let $p_k(x_i; \beta) = \Pr(G = k \mid X = x_i; \beta)$, where $\beta$ collects all the parameters $\{\beta_{k0}, \beta_k\}$. Maximize the conditional log-likelihood

$$\ell(\beta) = \sum_i \log p_{g_i}(x_i; \beta).$$

• In the special case of two classes, let $y_i = 0$ when $g_i = 1$ and $y_i = 1$ when $g_i = 2$, and let each $x_i$ include the constant term 1. Then

$$\ell(\beta) = \sum_i \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log\big(1 - p(x_i; \beta)\big) \right] = \sum_i \left[ y_i \beta^T x_i - \log\big(1 + e^{\beta^T x_i}\big) \right].$$

• The maximum is achieved when

$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_i x_i \big(y_i - p(x_i; \beta)\big) = 0.$$
Fitting logistic regression model (ctd)
• Since this is a non-linear equation, it can only be solved numerically. This is achieved by the Newton-Raphson method:

$$\beta^{new} = \beta^{old} - \left( \frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial \ell(\beta)}{\partial \beta},$$

where

$$\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} = -\sum_i x_i x_i^T \, p(x_i; \beta)\big(1 - p(x_i; \beta)\big).$$

Note: global convergence is not guaranteed.
• The case of multiple classes can be solved similarly.
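A numpy sketch of the two-class Newton-Raphson update above; each x_i is augmented with a constant 1 for the intercept, and the fixed iteration count is an arbitrary choice:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Two-class logistic regression via Newton-Raphson.

    X: (n, p) design matrix; y: 0/1 labels. A column of ones is prepended
    so that beta[0] is the intercept.
    """
    Xa = np.hstack([np.ones((len(X), 1)), X])
    beta = np.zeros(Xa.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xa @ beta))        # p(x_i; beta)
        grad = Xa.T @ (y - p)                       # dl/dbeta
        W = p * (1 - p)                             # weights p(1 - p)
        hess = -(Xa * W[:, None]).T @ Xa            # d2l / dbeta dbeta^T
        beta = beta - np.linalg.solve(hess, grad)   # Newton step
    return beta

# Toy usage: data generated from a logistic latent-variable model.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + 0.5 + rng.logistic(size=200) > 0).astype(float)
print(fit_logistic_newton(X, y))   # roughly recovers (0.5, 1.5, -2.0)
```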
Naïve Bayes method
From Bayes' rule,

$$\log \frac{\Pr(G = k \mid X)}{\Pr(G = l \mid X)} = \log \frac{\pi_k \, p_k(X)}{\pi_l \, p_l(X)},$$

where $\pi_k$ is the prior probability of class k and $p_k(X) = \Pr(X \mid G = k)$.

If $X = (X_1, \ldots, X_J)$ is high-dimensional (J = the number of genes considered), $p_k(X)$ is difficult to estimate. However, if we assume the $X_j$'s are independent of each other, i.e.

$$p_k(X) = \prod_j p_{kj}(X_j),$$

then the $p_{kj}(X_j)$ can be easily estimated.
Naïve Bayes method
Therefore,

$$\log \frac{\Pr(G = k \mid X)}{\Pr(G = l \mid X)} = \log \frac{\pi_k \, p_k(X)}{\pi_l \, p_l(X)} = \log \frac{\pi_k}{\pi_l} + \sum_j \log \frac{p_{kj}(X_j)}{p_{lj}(X_j)}.$$
Note: Surprisingly, even though the assumption that the X_j's are independent is almost never met, the naïve Bayes classifier often performs well, even beating more sophisticated methods.
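A sketch assuming each p_kj(X_j) is modeled as a Gaussian density (one common choice for continuous expression values); the independence assumption turns the product into a sum of per-gene log densities:

```python
import numpy as np
from scipy.stats import norm

def naive_bayes_fit(X, y):
    """Estimate per-class, per-feature Gaussian parameters and class priors."""
    classes = np.unique(y)
    params = {k: (X[y == k].mean(axis=0), X[y == k].std(axis=0, ddof=1))
              for k in classes}
    priors = {k: np.mean(y == k) for k in classes}
    return classes, params, priors

def naive_bayes_predict(x, classes, params, priors):
    """Pick k maximizing log pi_k + sum_j log p_kj(x_j)."""
    scores = {}
    for k in classes:
        mu, sd = params[k]
        scores[k] = np.log(priors[k]) + norm.logpdf(x, mu, sd).sum()
    return max(scores, key=scores.get)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(1, 1, (40, 5))])
y = np.repeat([1, 2], 40)
model = naive_bayes_fit(X, y)
print(naive_bayes_predict(np.full(5, 0.8), *model))
```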
Classification tree
Goal: Predict whether a person owns a house by asking a few questions with yes or no answers.
Predictors: Age, Car Type, etc.
[Decision tree:]
Age >= 30?
├─ yes → YES
└─ no → Car Type?
    ├─ sports car → NO
    └─ minivan → YES
Regression tree: Algorithm
The response variable is continuous.

Goal: Select a partition of regions (nodes) $R_1, \ldots, R_M$, so that the response can be modeled as a constant $c_m$ in each region.

Step 1: For a splitting variable $X_j$ and a splitting point $s$, define

$$R_1(j, s) = \{X \mid X_j \le s\}, \qquad R_2(j, s) = \{X \mid X_j > s\}.$$

Seek $j$ and $s$ so that

$$\sum_i \big(y_i^{(1)} - \bar{y}^{(1)}\big)^2 + \sum_i \big(y_i^{(2)} - \bar{y}^{(2)}\big)^2$$

is minimized, where the $y_i^{(m)}$ are the responses in $R_m(j, s)$ and $\bar{y}^{(m)}$ is their mean.

Step 2: For each $R_m$, refine the partition by repeating Step 1; stop when the number of nodes reaches a predefined cutoff.
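A sketch of the Step 1 search: exhaustively score every candidate (j, s) by the summed squared deviations around the two region means (names are my own):

```python
import numpy as np

def best_split(X, y):
    """Find (j, s) minimizing the sum of squared errors of the two regions."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):                    # splitting variable X_j
        for s in np.unique(X[:, j])[:-1]:          # candidate split points
            left, right = X[:, j] <= s, X[:, j] > s
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(100, 3))
y = np.where(X[:, 1] > 4, 5.0, 1.0) + rng.normal(0, 0.1, 100)
print(best_split(X, y))    # should pick j=1 with s near 4
```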
Classification tree: Pruning
Let $T_0$ be the full tree. Define a subtree $T \subset T_0$ to be any tree that can be obtained by pruning $T_0$, and let $|T|$ be its number of terminal nodes. Let

$$\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i.$$

The quality of fit in node m is given by

$$Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2.$$

Define a cost-complexity criterion, for a pre-selected level $\alpha \ge 0$:

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m \, Q_m(T) + \alpha \, |T|.$$

Seek the subtree $T_\alpha \subseteq T_0$ that minimizes $C_\alpha(T)$.
Classification tree: Pruning
Find the weak link, that is, the node whose collapse leads to the minimum increase of

$$\sum_m N_m \, Q_m(T).$$

Repeat the above procedure until a single-node tree is reached.

Theorem (Breiman et al. 1984): The optimal subtree is contained in the above sequence of subtrees.

The level of $\alpha$ can be determined through cross-validation. (We will talk about cross-validation later.)
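A small sketch of the final comparison: given the leaf summaries (N_m, Q_m) for each subtree in the weakest-link sequence (assumed already computed), pick the subtree minimizing C_α(T):

```python
# Each candidate subtree is summarized by its leaves' (N_m, Q_m) pairs.
# C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|
def cost_complexity(leaves, alpha):
    return sum(n * q for n, q in leaves) + alpha * len(leaves)

# Hypothetical weakest-link sequence: bigger trees fit better but cost more.
subtrees = [
    [(30, 0.10), (30, 0.12), (20, 0.08), (20, 0.05)],  # 4 leaves
    [(60, 0.20), (20, 0.08), (20, 0.05)],              # 3 leaves
    [(100, 0.40)],                                     # root only
]
alpha = 5.0
best = min(subtrees, key=lambda t: cost_complexity(t, alpha))
print(len(best), cost_complexity(best, alpha))
```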
Classification tree
• Classification trees differ from regression trees in the quality term $Q_m(T)$.
• For a regression tree, minimize

$$Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2.$$

• For a classification tree, let $\hat{p}_{mk}$ be the proportion of class k in node m and $k(m)$ the majority class; minimize one of:
– Misclassification error: $\frac{1}{N_m} \sum_{i \in R_m} I\big(y_i \ne k(m)\big) = 1 - \hat{p}_{m\,k(m)}$
– Gini index: $\sum_{k \ne k'} \hat{p}_{mk} \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$
– Cross-entropy (deviance): $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
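The three impurity measures, sketched as functions of the vector of class proportions p̂_mk in a node:

```python
import numpy as np

def misclassification_error(p):
    """1 - p_hat_{m k(m)}: one minus the majority-class proportion."""
    return 1.0 - np.max(p)

def gini(p):
    """sum_k p_k (1 - p_k), equivalently the sum over k != k' of p_k p_k'."""
    return np.sum(p * (1.0 - p))

def cross_entropy(p):
    """-sum_k p_k log p_k (deviance); 0 log 0 treated as 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.7, 0.2, 0.1])   # class proportions in node m
print(misclassification_error(p), gini(p), cross_entropy(p))
```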
Classification tree
• Advantages
– Visually intuitive
– Mathematically "simple"
• Drawbacks
– Unstable: tree structures are sensitive to the data
– Theoretical properties are not well understood
Cross-validation
• The data is divided into a training subset and a testing subset.
• Model building must be independent of the testing subset, including variable selection, tree structure, and so on.
Example: n-fold cross-validation
• A dataset is randomly divided into n subsets of equal size.
• Each subset is selected in turn as the testing set, whereas the rest are used as the training set.
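A sketch of n-fold cross-validation over a random index partition; the kNN classifier from scikit-learn stands in for any model-building step:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def n_fold_cv(X, y, n_folds=5, seed=0):
    """Average test accuracy over n folds; each fold is the test set once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)      # n subsets of (nearly) equal size
    scores = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
        scores.append(model.score(X[test], y[test]))
    return float(np.mean(scores))

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
y = np.repeat([0, 1], 50)
print(n_fold_cv(X, y))
```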
Bootstrap methods
Idea: Randomly draw, with replacement, B samples from the training data, each of the same size as the original training set. Fit the model to each resampled data set, then treat the original training data as testing data.

Estimate:

$$\widehat{\mathrm{Err}}_{Boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L\big(y_i, \hat{f}^{*b}(x_i)\big)$$

Improved version, which for each observation i averages only over the bootstrap samples that do not contain i:

$$\widehat{\mathrm{Err}}_{Boot}^{(1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L\big(y_i, \hat{f}^{*b}(x_i)\big),$$

where $C^{-i}$ is the set of bootstrap samples that do not contain observation i.
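A sketch of both bootstrap estimates with 0/1 loss; the improved (leave-one-out) version averages, for each observation, only over the bootstrap fits whose samples did not contain it:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def bootstrap_errors(X, y, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    losses = np.zeros((B, n))             # 0/1 loss of fit b on observation i
    in_sample = np.zeros((B, n), bool)    # did fit b's sample contain i?
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # draw n points with replacement
        model = KNeighborsClassifier(n_neighbors=3).fit(X[idx], y[idx])
        losses[b] = (model.predict(X) != y).astype(float)
        in_sample[b, np.unique(idx)] = True
    err_boot = losses.mean()              # plain bootstrap estimate
    # Leave-one-out version: for each i, average only over b with i left out.
    out = ~in_sample
    err_loo = np.mean([losses[out[:, i], i].mean() for i in range(n)
                       if out[:, i].any()])
    return err_boot, err_loo

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(1.5, 1, (40, 3))])
y = np.repeat([0, 1], 40)
print(bootstrap_errors(X, y))
```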
Use cross-validation to select parameters
• A classifier may have several tunable parameters: for example, the number of nearest neighbors k in kNN, or the pruning level α for a classification tree.
• These parameters can be selected by CV. In these cases, the full dataset is divided into three parts: training set, testing set 1, and testing set 2.
• Testing set 1 is used to tune parameters. So it cannot be used to objectively estimate model performance. Therefore, testing set 2 is needed.
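A sketch of the three-way split, tuning the number of neighbors k on testing set 1 and reporting performance on testing set 2 (the split fractions and the candidate grid are arbitrary):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (90, 4)), rng.normal(1.5, 1, (90, 4))])
y = np.repeat([0, 1], 90)

# Split into training set, testing set 1 (tuning), testing set 2 (final).
idx = rng.permutation(len(X))
train, tune, final = idx[:100], idx[100:140], idx[140:]

# Use testing set 1 to pick the number of neighbors k.
best_k = max([1, 3, 5, 7, 9],
             key=lambda k: KNeighborsClassifier(n_neighbors=k)
                           .fit(X[train], y[train]).score(X[tune], y[tune]))

# Only testing set 2 gives an objective estimate of the tuned model's performance.
model = KNeighborsClassifier(n_neighbors=best_k).fit(X[train], y[train])
print(best_k, model.score(X[final], y[final]))
```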
Acknowledgement
• Sources of slides:
– Cheng Li
– http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf
– www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt