02 classification
TRANSCRIPT
8/9/2019 02 Classification
http://slidepdf.com/reader/full/02-classification 1/18
IE 527
Intelligent Engineering Systems
Basic concepts
Model/performance evaluation
Overfitting
The task of learning a target function f that maps each
attribute set x to one of the predefined class labels y.
Given a collection of records (training set):
Each record contains a set of attributes; one of the
attributes (the target) is the class.
Find a model for the class attribute as a function of the
values of the other attributes.
Goal: previously unseen records should be assigned a
class as accurately as possible.
A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into training and test sets,
with the training set used to build the model and the test set used to validate it.
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein
as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.
Descriptive modeling
To explain what features define the class label
Predictive modeling
To predict the class label of unknown records
Systematic approaches to build classification models
from an input data set
Employ a learning algorithm to identify a model that
best fits the relationship between the attribute set and
the class label
Decision Tree based methods
Artificial Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
…
The model should both fit the input data well and
correctly predict the class labels of unknown records
(generalization).
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Induction: Training Set -> Learning algorithm -> Learn Model -> Model
Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Deduction: Model + Test Set -> Apply Model
Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Induced Model (Decision Tree)
Refund
  Yes -> NO
  No -> MarSt
    Married -> NO
    Single, Divorced -> TaxInc
      < 80K -> NO
      > 80K -> YES
Root/Internal nodes - Splitting attributes
Leaf nodes - Class labels
MarSt
  Married -> NO
  Single, Divorced -> Refund
    Yes -> NO
    No -> TaxInc
      < 80K -> NO
      > 80K -> YES
There could be more than one tree that
fits the same data!
Refund
  Yes -> NO
  No -> MarSt
    Married -> NO
    Single, Divorced -> TaxInc
      < 80K -> NO
      > 80K -> YES
Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
Assign Cheat to “No”
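Applying the induced tree to a record can be sketched in a few lines of Python. This is only an illustration of walking the Refund / MarSt / TaxInc tree from the slides. The function names `parse_income` and `classify` are hypothetical, and sending an income of exactly 80K down the "> 80K" branch is an assumption, since the tree does not say which side the boundary belongs to.

```python
def parse_income(value):
    """Convert a string such as '125K' into a number (125000)."""
    return int(value.rstrip("K")) * 1000


def classify(refund, marital_status, taxable_income):
    """Walk the tree from the root to a leaf and return the Cheat label."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income.
    # Assumption: exactly 80K follows the "> 80K" branch.
    if parse_income(taxable_income) < 80_000:
        return "No"
    return "Yes"


# Training data from the slides: (Refund, Marital Status, Taxable Income, Cheat)
training_data = [
    ("Yes", "Single", "125K", "No"),
    ("No", "Married", "100K", "No"),
    ("No", "Single", "70K", "No"),
    ("Yes", "Married", "120K", "No"),
    ("No", "Divorced", "95K", "Yes"),
    ("No", "Married", "60K", "No"),
    ("Yes", "Divorced", "220K", "No"),
    ("No", "Single", "85K", "Yes"),
    ("No", "Married", "75K", "No"),
    ("No", "Single", "90K", "Yes"),
]

# Count how many training records the tree misclassifies.
training_errors = sum(
    classify(refund, status, income) != cheat
    for refund, status, income, cheat in training_data
)

# The test record from the slide: Refund = No, Married, 80K.
test_label = classify("No", "Married", "80K")
```

The test record never reaches the TaxInc split, because Refund = No leads to MarSt and Married is already a leaf, so the boundary assumption does not affect its label.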
Multiple methods are available to classify or predict.
For each method, multiple choices are available for parameter settings.
To choose the best model, we need to assess each model’s performance.
Metrics for Performance Evaluation
How do we evaluate the performance of a model?
Methods for Performance Evaluation
How do we obtain reliable estimates of the metrics?
Methods for Model Comparison
How do we compare the relative performance among competing
models?
Error = classifying a record as belonging to one class
when it belongs to another class.
Error rate = (no. of misclassified records) / (total no. of records)
Can also use other measures of error (especially for prediction,
where the error of each instance is $e_i = y_i - \hat{y}_i$) such as
Total SSE (Sum of Squared Errors) = $\sum_{i=1}^{n} e_i^2$
RMSE (Root Mean Squared Error) = $\sqrt{\sum_{i=1}^{n} e_i^2 / n}$
Naïve rule: Classify all records as belonging to the most
prevalent class or random classification (50-50) (or the average value for prediction).
Often used as a benchmark: we hope to do better than that
(exception: when the goal is to identify high-value but rare outcomes,
we may do well by doing worse than the naïve rule)
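The error measures and the naïve rule above can be sketched as follows. The helper names are hypothetical. The class labels reuse the Cheat column of the ten-record training table (7 No, 3 Yes), while the numeric values `y_true` and `y_pred` are made up purely to illustrate SSE and RMSE.

```python
import math
from collections import Counter


def error_rate(actual, predicted):
    """Fraction of records whose predicted class differs from the actual class."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def sse(actual, predicted):
    """Sum of squared errors, with e_i = y_i - yhat_i."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sse(actual, predicted) / len(actual))


def naive_rule(labels):
    """Return the most prevalent class, to be predicted for every record."""
    return Counter(labels).most_common(1)[0][0]


# Cheat column of the training table: 7 "No" and 3 "Yes".
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
majority = naive_rule(labels)
naive_error = error_rate(labels, [majority] * len(labels))

# Illustrative numeric targets and predictions (not from the slides).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
total_sse = sse(y_true, y_pred)
total_rmse = rmse(y_true, y_pred)
```

Any candidate classifier for this table would be compared against `naive_error` as the benchmark.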
Performance of a model w.r.t. predictive capability
Confusion Matrix
                   PREDICTED CLASS
ACTUAL CLASS       Class = Yes (1)   Class = No (0)
Class = Yes (1)    201 (TP)          85 (FN)
Class = No (0)     25 (FP)           2689 (TN)
Performance metrics
Error rate = (no. of wrong predictions) / (total no. of predictions)
= (FP+FN) / (TP+TN+FP+FN) = (25+85)/3000 = 3.67%
Accuracy = (no. of correct predictions) / (total no. of predictions)
= (TP+TN) / (TP+TN+FP+FN) = (201+2689)/3000 = 96.33%
= 1 - (error rate)
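As a quick check of the arithmetic, the error rate and accuracy can be recomputed from the four cells of the confusion matrix. This is a minimal sketch; the function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return (error_rate, accuracy) from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    error_rate = (fp + fn) / total
    accuracy = (tp + tn) / total
    return error_rate, accuracy


# Counts from the confusion matrix on the slide.
TP, FN, FP, TN = 201, 85, 25, 2689
total = TP + FN + FP + TN
error_rate, accuracy = confusion_metrics(TP, FN, FP, TN)

print(f"n = {total}, error rate = {error_rate:.2%}, accuracy = {accuracy:.2%}")
```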
Consider a 2-class problem:
No. of class 0 = 9990; no. of class 1 = 10
If a model predicts everything to be class 0, accuracy =
9990/10000 = 99.9%
Accuracy is misleading because the model does not detect any
class 1 object!
Accuracy may not be well suited for evaluating models
derived from imbalanced data sets.
Often a correct classification of the rare class (class 1) has a
greater value than a correct classification of the majority class!
In other words, misclassification cost is asymmetric.
One type of error (FP or FN) may be acceptable, while the other (FN or FP) must not be allowed.
Example: tax fraud, identity theft, response to promotions,
network intrusion, predicting flight delay, etc.
In such cases, we want to tolerate greater overall error (reduced
accuracy) in return for better classifying the important class.
+: rare but more important
-: majority but less important
                 PREDICTED CLASS
ACTUAL CLASS     +            -
+                f++ (TP)     f+- (FN)
-                f-+ (FP)     f-- (TN)
TPR (sensitivity) = TP/(TP+FN) = % of + class correctly classified
TNR (specificity) = TN/(FP+TN) = % of – class correctly classified
FPR = FP/(FP+TN) = 1 – TNR
FNR = FN/(TP+FN) = 1 – TPR
Oversample the important class for training (but not for
validation/testing)
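To see why these rates matter for imbalanced data, the sketch below applies them to the earlier example of 9990 class 0 records and 10 class 1 records, with a model that predicts everything to be class 0. Treating class 1 as + and class 0 as - is the assumption here, and the function name `rates` is hypothetical.

```python
def rates(tp, fn, fp, tn):
    """Accuracy plus the four class-wise rates from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "TPR": tp / (tp + fn),  # sensitivity
        "TNR": tn / (fp + tn),  # specificity
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
    }


# Predicting everything as class 0 (-): all 10 positives become FN.
always_negative = rates(tp=0, fn=10, fp=0, tn=9990)

for name, value in always_negative.items():
    print(f"{name}: {value:.3f}")
```

Accuracy stays at 99.9% while TPR drops to 0, which is exactly the situation the class-wise rates are meant to expose.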
+: rare but important
-: less important
Cost matrix C(i,j)
                 PREDICTED CLASS
ACTUAL CLASS     +               -
+                C(+,+) (TP)     C(+,-) (FN)
-                C(-,+) (FP)     C(-,-) (TN)
Ctotal(M) = TP*C(+,+) + FP*C(-,+) + FN*C(+,-) + TN*C(-,-)
C(i,j) (or C(j|i)): Cost of (mis)classifying a class i object as class j
For a symmetric, 0/1 cost matrix (C(+,+)=C(-,-)=0, C(+,-)=C(-,+)=1)
Ctotal(M) = FP + FN = n * (error rate)
Find a model that yields the lowest cost.
If FN are most costly, reduce the FN errors by extending the decision
boundary toward the negative class to cover more positives, at the
expense of generating additional false alarms (FP).
Cost Matrix C(i,j)
                 PREDICTED CLASS
ACTUAL CLASS     +       -
+                -1      100
-                1       0

Model M1 (or Attr. A1)
                 PREDICTED CLASS
ACTUAL CLASS     +       -
+                150     40
-                60      250
Accuracy = 400/500 = 80%
Cost = 3910

Model M2 (or Attr. A2)
                 PREDICTED CLASS
ACTUAL CLASS     +       -
+                250     45
-                5       200
Accuracy = 450/500 = 90%
Cost = 4255 (larger due to more FN)

Select M1 (or A1)
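The comparison of M1 and M2 can be reproduced with a short sketch that applies the Ctotal(M) formula cell by cell. The dictionary layout keyed by (actual, predicted) and the function names are illustrative choices, not part of the slides.

```python
# Cost matrix C(i, j): keys are (actual class, predicted class).
COST = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}

# Confusion matrices of the two models, using the same key layout.
M1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
M2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}


def total_cost(confusion, cost=COST):
    """Ctotal(M): sum of count * cost over all four cells."""
    return sum(count * cost[cell] for cell, count in confusion.items())


def accuracy(confusion):
    """(TP + TN) / total number of records."""
    correct = confusion[("+", "+")] + confusion[("-", "-")]
    return correct / sum(confusion.values())


for name, model in (("M1", M1), ("M2", M2)):
    print(f"{name}: accuracy = {accuracy(model):.0%}, cost = {total_cost(model)}")

# The model with the lowest total cost is preferred, even if less accurate.
best = min((("M1", M1), ("M2", M2)), key=lambda item: total_cost(item[1]))[0]
```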
For m classes, confusion matrix has m rows and m
columns
Theoretically, there are m(m-1) misclassification costs, since any case could be misclassified in m-1 ways
Practically too many to work with
In decision-making context, though, such complexity rarely arises – one class is usually of primary interest
Classifications may reduce to “important” vs. “unimportant”
Methods for Performance Evaluation
How do we obtain reliable estimates of the metrics?
Holdout
Reserve 2/3 for training and 1/3 for testing
Fewer training records
Highly dependent on the composition of training/test sets
Training & test sets no longer independent of each other
Random subsampling
Repeat k holdouts; $acc = \sum_{i=1}^{k} acc_i / k$, where $acc_i$ = accuracy at the i-th iteration
Can’t control the number of times each record is used for testing and training
Cross validation
Partition data into k equal-sized disjoint subsets
k-fold: train on k-1 partitions, test on the remaining one; repeat k times
Total error by summing up the errors for all k runs
Leave-one-out: a special case where k = n – good for small samples
Utilizing as much data as possible for training; test sets mutually exclusive
Computationally expensive; high variance (only one record in each test set)
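The k-fold procedure above can be sketched with the standard library alone. The names `k_fold_splits`, `cross_validated_error`, and the `count_errors` callback are hypothetical, and the fixed random seed is an assumption made only for reproducibility. When n is not divisible by k, the folds in this sketch differ in size by at most one record.

```python
import random


def k_fold_splits(n_records, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k disjoint folds.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_error(n_records, k, count_errors, seed=0):
    """Sum the test errors over all k runs and divide by the number of records.

    count_errors(train, test) is a caller-supplied (hypothetical) function that
    trains on `train` and returns the number of misclassified records in `test`.
    """
    total = sum(
        count_errors(train, test)
        for train, test in k_fold_splits(n_records, k, seed)
    )
    return total / n_records


# 5-fold split of 10 records; leave-one-out is the special case k = n.
five_fold = list(k_fold_splits(10, 5))
leave_one_out = list(k_fold_splits(10, 10))
```

Because every record appears in exactly one test fold, summing the errors over the k runs counts each record once.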
Stratified sampling
For imbalanced classes, e.g. consider 100 + and 1000 -.
Undersampling for –:
A random sample of 100 –; Focused undersampling
Underrepresented -
Oversampling for +:
Replicate + until (no. of +) = (no. of -); Generate new + by interpolation
Overfitting possible
Hybrid
Bootstrap
Training set composed by sampling with replacement (possible duplicates)
The rest can become part of the test set.
Good for small samples (like leave-one-out); low variance
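A bootstrap split can be sketched as below. Drawing exactly n records for the training set is an assumption (a common choice, but the slide does not state the sample size), and the function name `bootstrap_split` is hypothetical.

```python
import random


def bootstrap_split(n_records, seed=0):
    """Sample n record indices with replacement; unsampled records form the test set."""
    rng = random.Random(seed)
    # Assumption: the training set has the same size n as the original data.
    train = [rng.randrange(n_records) for _ in range(n_records)]
    chosen = set(train)
    test = [i for i in range(n_records) if i not in chosen]
    return train, test


train, test = bootstrap_split(10)
print("training indices (duplicates possible):", train)
print("test indices (never sampled):", test)
```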
Methods for Model Comparison
How do we compare the relative performance among competing
models?
Developed in 1950s for signal detection theory to analyze
noisy signals
Characterize the trade-off between positive hits and false alarms
ROC curve plots TPR (on y-axis) against FPR (on x-axis)
Performance of each classifier represented as a point on the ROC curve
Changing the threshold of algorithm, sample distribution or cost matrix changes the location of the point
(TPR,FPR) along cutoff values from 0 to 1
(0,0): Model predicts everything to be – (cutoff = 1)
(1,1): Model predicts everything to be + (cutoff = 0)
(1,0): The ideal model (hitting the upper-left corner; area under ROC = 1)
Diagonal line
Random guessing (naïve classifier)
Classify as a + with a fixed prob. p
TPR (= pn+/n+) = FPR (= pn-/n-) = p
Below diagonal line
Prediction is worse than guessing!
M1 vs. M2
M1 is better for small FPR
M2 is better for large FPR
Area under ROC curve (AUC)
Ideal: AUC = 1
Random guessing: AUC = 0.5
The larger the AUC, the better the model
Apply the classifier to each test instance to produce its posterior probability to be +
Sort the instances in increasing order of the P(+)
Apply cutoff at each unique value of P(+)
Assign + to instances ≥ cutoff, – to instances < cutoff
Initially TPR = FPR = 1
Count the number of TP, FP, TN, FN at each cutoff
Increase cutoff to the next higher; repeat until the highest
Plot TPR against FPR
Instance  P(+)  True Class
1         0.95  +
2         0.93  +
3         0.87  -
4         0.85  +
5         0.85  -
6         0.85  -
7         0.76  -
8         0.53  +
9         0.43  -
10        0.25  +
Cutoff Table
Class  +     -     +     -     -     -     +     -     +     +
P(+)   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP     5     4     4     3     3     3     3     2     2     1     0
FP     5     5     4     4     3     2     1     1     0     0     0
TN     0     0     1     1     2     3     4     4     5     5     5
FN     0     1     1     2     2     2     2     3     3     4     5
TPR    1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR    1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0
ROC Curve (figure: TPR on the y-axis against FPR on the x-axis)
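The cutoff procedure can be sketched on the ten test instances as follows. The function names `roc_points` and `auc` are hypothetical, the final cutoff of 1.00 assumes every score is below 1, and the trapezoidal area is one common way to approximate AUC rather than something the slides prescribe. Because the sketch uses one cutoff per unique P(+) value, the three tied instances at 0.85 collapse into a single point, whereas the table lists them in three separate columns.

```python
def roc_points(scores, labels):
    """Return (cutoff, TPR, FPR) for each unique score, plus a final cutoff of 1.00."""
    n_pos = labels.count("+")
    n_neg = labels.count("-")
    points = []
    # Assumption: all scores are below 1.00, so the last cutoff predicts all "-".
    for cutoff in sorted(set(scores)) + [1.00]:
        # Instances with P(+) >= cutoff are assigned "+".
        tp = sum(s >= cutoff and y == "+" for s, y in zip(scores, labels))
        fp = sum(s >= cutoff and y == "-" for s, y in zip(scores, labels))
        points.append((cutoff, tp / n_pos, fp / n_neg))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    curve = sorted((fpr, tpr) for _, tpr, fpr in points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# P(+) and true class of the ten test instances from the slide.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "+", "-", "-", "-", "+", "-", "+"]

points = roc_points(scores, labels)
area = auc(points)
for cutoff, tpr, fpr in points:
    print(f"cutoff {cutoff:.2f}: TPR = {tpr:.1f}, FPR = {fpr:.1f}")
```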
Given two models:
Model M1: accuracy = 85%, tested on 30 instances
Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2?
Estimate Confidence Intervals for accuracy
Prediction can be regarded as Bernoulli trials (2 possible outcomes),
which follow a binomial distribution with p (true accuracy).
For large test sets, (empirical) acc ~ N(p, p(1-p)/n)
$$P\left( Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1-p)/n}} < Z_{1-\alpha/2} \right) = 1 - \alpha$$
Confidence interval for p:
$$p = \frac{2 n \cdot acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2} \sqrt{Z_{\alpha/2}^2 + 4 n \cdot acc - 4 n \cdot acc^2}}{2 (n + Z_{\alpha/2}^2)}$$
Compare performance of two models
Testing statistical significance by Z or t-test
H0: d = e1 – e2 = 0
H1: d ≠ 0
See Section 4.6 in Tan et al. (2006) for more details.
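The interval for p can be evaluated for M1 and M2 with a few lines of Python. This is a sketch that assumes a 95% confidence level (Z = 1.96), which the slides do not specify; the function name `accuracy_interval` is hypothetical.

```python
import math


def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p, given empirical acc on n instances.

    Assumption: z = 1.96 corresponds to a 95% confidence level.
    """
    center = 2 * n * acc + z ** 2
    spread = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom


m1 = accuracy_interval(0.85, 30)    # M1: 85% on 30 instances
m2 = accuracy_interval(0.75, 5000)  # M2: 75% on 5000 instances

# Overlapping intervals mean the observed difference may be due to chance.
overlap = m1[0] <= m2[1] and m2[0] <= m1[1]

print(f"M1: [{m1[0]:.3f}, {m1[1]:.3f}]")
print(f"M2: [{m2[0]:.3f}, {m2[1]:.3f}]")
print("intervals overlap:", overlap)
```

The interval for M1 is much wider because it was tested on only 30 instances, so its higher observed accuracy is not enough by itself to conclude that M1 is better.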
Generalization: A good classification model must not only fit training
data well but also accurately classify unseen records (test/new data).
Overfitting: a model that fits training data too well can have a
poorer generalization than a model with a higher training error.
Underfitting: When a
model is too simple, both
training and test errors are
large (the model has yet
to learn the data)
Overfitting: Once the tree
becomes too large, its test
error begins to increase
while its training error
continues to decrease
Decision boundary is distorted by a (mislabeled) noise
point that should be ignored by the decision tree.
Lack of data points makes it difficult to predict the class labels
correctly
Decision boundary is made by only few records falling in the region
Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task
Overfitting results in decision trees that are more
complex than necessary.
The chance of overfitting increases as the model becomes more
complex.
Training error no longer provides a good estimate of how well the
tree will perform on previously unseen records.
Need new ways for estimating generalization errors
Occam’s Razor
Given two models of similar generalization errors, one should
prefer the simpler model over the more complex model.
For a complex model, there is a greater chance that it was fitted by chance or by noise in the data and/or that it overfits the data.
Therefore, one should include model complexity when evaluating a
model.
Reduce the number of nodes in a decision tree (pruning).