02 classification

8/9/2019 02 Classification


IE 527

Intelligent Engineering Systems

Basic concepts

Model/performance evaluation

Overfitting




Classification: the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.

Given a collection of records (training set):

Each record contains a set of attributes; one of the attributes (the target) is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.


Predicting tumor cells as benign or malignant

Classifying credit card transactions as legitimate or fraudulent

Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

Categorizing news stories as finance, weather, entertainment, sports, etc.

Descriptive modeling

To explain what features define the class label

Predictive modeling

To predict the class label of unknown records



Systematic approaches to build classification models from an input data set

Employ a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label

Decision Tree based methods

Artificial Neural Networks

Naïve Bayes and Bayesian Belief Networks

Support Vector Machines

…

The model should both fit the input data well and correctly predict the class labels of unknown records (generalization).

Figure: a learning algorithm learns a model from the training set (induction); the model is then applied to the test set (deduction).

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?



Training Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Induced Model (Decision Tree)

Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

Root/Internal nodes - Splitting Attributes

Leaf nodes - Class labels

Another decision tree induced from the same training data:

MarSt?
  Married: NO
  Single, Divorced: Refund?
    Yes: NO
    No: TaxInc?
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!



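Applying the model to test data

Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root: Refund = No leads to the MarSt node, and MarSt = Married leads to the leaf NO.

Assign Cheat to “No”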
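The tree traversal above can be sketched as plain nested conditions. This is only an illustration of the Refund/MarSt/TaxInc tree from the slides; the function name `classify` and the handling of an income of exactly 80K (sent to the "NO" branch) are assumptions, since the slides only label the branches "< 80K" and "> 80K".

```python
def classify(refund: str, marital_status: str, taxable_income_k: float) -> str:
    """Walk the induced tree from the root and return the predicted Cheat label."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (in thousands).
    # Assumption: exactly 80K is treated like the "< 80K" branch.
    return "Yes" if taxable_income_k > 80 else "No"


# The ten training records: (Refund, Marital Status, Taxable Income in K, Cheat).
training = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

# Number of training records the tree misclassifies.
training_errors = sum(classify(r, m, inc) != label for r, m, inc, label in training)

# The test record from the slide: Refund = No, Married, 80K.
prediction = classify("No", "Married", 80)
print(training_errors, prediction)
```

The test record never reaches the income split, so the 80K boundary assumption does not affect its label.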




Multiple methods are available to classify or predict.

For each method, multiple choices are available for parameter settings.

To choose the best model, we need to assess each model’s performance.

Metrics for Performance Evaluation

How do we evaluate the performance of a model?

Methods for Performance Evaluation

How do we obtain reliable estimates of the metrics?

Methods for Model Comparison

How do we compare the relative performance among competing models?



Error = classifying a record as belonging to one class when it belongs to another class.

Error rate = (no. of misclassified records) / (total no. of records)

Other measures of error can also be used, especially for Prediction, where the error of each instance is $e_i = y_i - \hat{y}_i$, such as

Total SSE (Sum of Squared Errors) = $\sum_{i=1}^{n} e_i^2$

RMSE (Root Mean Squared Error) = $\sqrt{\sum_{i=1}^{n} e_i^2 / n}$

Naïve rule: classify all records as belonging to the most prevalent class, or classify at random (50-50); for Prediction, use the average value.

Often used as a benchmark: we hope to do better than that (exception: when the goal is to identify high-value but rare outcomes, we may do well by doing worse than the naïve rule).
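As a small illustration of the two error measures and of the naïve benchmark for Prediction, the sketch below computes SSE and RMSE for a model and for the "always predict the average" rule. The numbers in `actual` and `model_pred` are made up for the example and do not come from the slides.

```python
import math


def sse(actual, predicted):
    """Sum of squared errors: sum of (y_i - y_hat_i)^2."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error: sqrt(SSE / n)."""
    return math.sqrt(sse(actual, predicted) / len(actual))


# Hypothetical numeric targets and model predictions (illustration only).
actual = [3.0, 5.0, 7.0, 9.0]
model_pred = [2.5, 5.5, 6.0, 9.0]

# Naive rule for Prediction: always predict the average value.
mean_value = sum(actual) / len(actual)
naive_pred = [mean_value] * len(actual)

print("model:", sse(actual, model_pred), rmse(actual, model_pred))
print("naive:", sse(actual, naive_pred), rmse(actual, naive_pred))
```

A model is only interesting here if its RMSE is clearly below that of the naïve rule.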

Performance of a model w.r.t. predictive capability

Confusion Matrix

                          PREDICTED CLASS
                          Class = Yes (1)   Class = No (0)
ACTUAL  Class = Yes (1)   201 (TP)          85 (FN)
CLASS   Class = No (0)    25 (FP)           2689 (TN)

Performance metrics

Error rate = (no. of wrong predictions) / (total no. of predictions)

= (FP+FN) / (TP+TN+FP+FN) = (25+85)/3000 = 3.67%

Accuracy = (no. of correct predictions) / (total no. of predictions)

= (TP+TN) / (TP+TN+FP+FN) = (201+2689)/3000 = 96.33%

= 1 - (error rate)
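A minimal sketch of how the four cells are counted from actual and predicted labels, and how accuracy and error rate follow from them. The label lists are synthetic: they are constructed only so that the counts match the confusion matrix above, and the function name `confusion_counts` is an assumption for this example.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, TN for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts


# Synthetic labels that reproduce the slide's counts (201 TP, 85 FN, 25 FP, 2689 TN).
actual = [1] * 201 + [1] * 85 + [0] * 25 + [0] * 2689
predicted = [1] * 201 + [0] * 85 + [1] * 25 + [0] * 2689

c = confusion_counts(actual, predicted)
total = sum(c.values())
accuracy = (c["TP"] + c["TN"]) / total
error_rate = (c["FP"] + c["FN"]) / total
print(dict(c), round(100 * accuracy, 2), round(100 * error_rate, 2))
```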



Consider a 2-class problem:

No. of class 0 = 9990; no. of class 1 = 10

If a model predicts everything to be class 0, accuracy = 9990/10000 = 99.9%

Accuracy is misleading because the model does not detect any class 1 object!

Accuracy may not be well suited for evaluating models derived from imbalanced data sets.

Often a correct classification of the rare class (class 1) has a greater value than a correct classification of the majority class!

In other words, misclassification cost is asymmetric.

One type of error (FP or FN) is acceptable, but the other must not be allowed.

Example: tax fraud, identity theft, response to promotions, network intrusion, predicting flight delay, etc.

In such cases, we want to tolerate greater overall error (reduced accuracy) in return for better classifying the important class.

+: rare but more important

-: majority but less important

             PREDICTED CLASS
             +           -
ACTUAL  +    f++ (TP)    f+- (FN)
CLASS   -    f-+ (FP)    f-- (TN)

TPR (sensitivity) = TP/(TP+FN) = % of + class correctly classified

TNR (specificity) = TN/(FP+TN) = % of – class correctly classified

FPR = FP/(FP+TN) = 1 – TNR

FNR = FN/(TP+FN) = 1 – TPR

Oversample the important class for training (but not for validation/testing)
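The sketch below evaluates these rates for two cases taken from the slides: the 3000-record confusion matrix, and the model that predicts everything as class 0 on the 9990/10 data set. The helper name `rates` is an assumption for this example. It shows why accuracy alone hides the problem: the second model has 99.9% accuracy but a TPR of zero.

```python
def rates(tp, fn, fp, tn):
    """Class-wise rates from the four confusion-matrix cells."""
    return {
        "TPR": tp / (tp + fn),   # sensitivity
        "TNR": tn / (fp + tn),   # specificity
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Confusion matrix from the 3000-record example.
slide_model = rates(tp=201, fn=85, fp=25, tn=2689)

# Model that predicts every record as class 0 (the negative class);
# the 10 rare class-1 records all become false negatives.
all_negative = rates(tp=0, fn=10, fp=0, tn=9990)

for name, r in [("slide model", slide_model), ("all negative", all_negative)]:
    print(name, {k: round(v, 4) for k, v in r.items()})
```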



+: rare but important

-: less important

             PREDICTED CLASS
C(i,j)       +              -
ACTUAL  +    C(+,+) (TP)    C(+,-) (FN)
CLASS   -    C(-,+) (FP)    C(-,-) (TN)

Ctotal(M) = TP*C(+,+) + FP*C(-,+) + FN*C(+,-) + TN*C(-,-)

C(i,j) (or C(j|i)): Cost of (mis)classifying class i object as class j

For a symmetric, 0/1 cost matrix (C(+,+)=C(-,-)=0, C(+,-)=C(-,+)=1)

Ctotal(M) = FP + FN = n * (error rate)

Find a model that yields the lowest cost.

If FN are most costly, reduce the FN errors by extending the decision boundary toward the negative class to cover more positives, at the expense of generating additional false alarms (FP).

Cost Matrix

             PREDICTED CLASS
C(i,j)       +       -
ACTUAL  +    -1      100
CLASS   -    1       0

Model M1 (or Attr. A1)

             PREDICTED CLASS
             +       -
ACTUAL  +    150     40
CLASS   -    60      250

Accuracy = 400/500 = 80%

Cost = 3910

Model M2 (or Attr. A2)

             PREDICTED CLASS
             +       -
ACTUAL  +    250     45
CLASS   -    5       200

Accuracy = 450/500 = 90%

Cost = 4255 (larger due to more FN)

Select M1 (or A1)
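The cost comparison can be reproduced with a few lines. This sketch stores each matrix as a dictionary keyed by cell name (a representation chosen for the example, not prescribed by the slides) and applies Ctotal(M) = TP*C(+,+) + FN*C(+,-) + FP*C(-,+) + TN*C(-,-).

```python
def total_cost(confusion, cost):
    """Sum of (count * cost) over the four confusion-matrix cells."""
    return sum(confusion[cell] * cost[cell] for cell in ("TP", "FN", "FP", "TN"))


def accuracy(confusion):
    return (confusion["TP"] + confusion["TN"]) / sum(confusion.values())


# Cost matrix from the slide: C(+,+) = -1, C(+,-) = 100, C(-,+) = 1, C(-,-) = 0.
cost = {"TP": -1, "FN": 100, "FP": 1, "TN": 0}

m1 = {"TP": 150, "FN": 40, "FP": 60, "TN": 250}
m2 = {"TP": 250, "FN": 45, "FP": 5, "TN": 200}

for name, m in [("M1", m1), ("M2", m2)]:
    print(name, "accuracy =", accuracy(m), "cost =", total_cost(m, cost))

# Pick the model with the lowest total cost.
best = min([("M1", m1), ("M2", m2)], key=lambda item: total_cost(item[1], cost))[0]
```

M2 is more accurate, but its five extra false negatives at a cost of 100 each outweigh everything else, so the lower-cost choice is M1.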



For m classes, the confusion matrix has m rows and m columns

Theoretically, there are m(m-1) misclassification costs, since any case could be misclassified in m-1 ways

Practically too many to work with

In a decision-making context, though, such complexity rarely arises – one class is usually of primary interest

Classifications may reduce to “important” vs. “unimportant”










Holdout

Reserve 2/3 for training and 1/3 for testing

Fewer training records

Highly dependent on the composition of training/test sets

Training & test sets no longer independent of each other

Random subsampling

Repeat k holdouts; acc = $\sum_{i=1}^{k} acc_i / k$, where $acc_i$ = accuracy at the i-th iteration

No control over how many times each record is used for testing and training

Cross validation

Partition data into k equal-sized disjoint subsets

k-fold: train on k-1 partitions, test on the remaining one; repeat k times

Total error by summing up the errors for all k runs

Leave-one-out: a special case where k = n – good for small samples

Utilizing as much data as possible for training; test sets mutually exclusive

Computationally expensive; high variance (only one record in each test set)
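A minimal sketch of k-fold cross validation using only the standard library. The function names are assumptions for this example, and `majority_label_errors` is a hypothetical stand-in for a real learning algorithm: it "trains" by picking the most common training label. The labels are the class column of the ten-record training table (seven No, three Yes).

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Slicing with step k gives k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def majority_label_errors(labels, train, test):
    """Hypothetical learner: predict the most common training label, count test errors."""
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] != majority for i in test)


labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Total error is the sum of the errors over all k runs, divided by n.
total_errors = sum(
    majority_label_errors(labels, train, test)
    for train, test in k_fold_indices(len(labels), k=5)
)
cv_error_rate = total_errors / len(labels)
print(cv_error_rate)
```

Calling `k_fold_indices(n, k=n)` gives leave-one-out: every test set then holds exactly one record.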


Stratified sampling

For imbalanced classes, e.g. consider 100 + and 1000 -.

Undersampling for –:

Take a random sample of 100 –, or use focused undersampling

Drawback: the – class may be underrepresented

Oversampling for +:

Replicate + until (no. of +) = (no. of -), or generate new + by interpolation

Drawback: overfitting possible

Hybrid: combine undersampling and oversampling
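The two simplest variants, random undersampling of the majority class and oversampling of the rare class by replication, can be sketched as follows. The record representation (a tuple of class label and id) and the function names are assumptions for this example; focused undersampling and interpolation-based oversampling are not shown.

```python
import random


def undersample(majority, minority, rng):
    """Keep all minority records plus an equally sized random sample of the majority."""
    return rng.sample(majority, len(minority)) + minority


def oversample(majority, minority, rng):
    """Replicate randomly chosen minority records until both classes are the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


rng = random.Random(0)
positives = [("+", i) for i in range(100)]    # rare class
negatives = [("-", i) for i in range(1000)]   # majority class

under = undersample(negatives, positives, rng)
over = oversample(negatives, positives, rng)

print(len(under), len(over))
```

Both results are meant for training only; the validation and test sets keep the original class proportions.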

Bootstrap

Training set composed by sampling with replacement (possible duplicates)

The rest can become part of the test set.

Good for small samples (like leave-one-out); low variance
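A minimal sketch of one bootstrap round, working on record indices. The function name `bootstrap_split` is an assumption for this example. Records that are never drawn form the test set; for large n, roughly 63.2% (1 − 1/e) of the distinct records are expected to land in the training sample.

```python
import random


def bootstrap_split(n, seed=0):
    """Draw n indices with replacement for training; unused indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]   # duplicates are possible
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


train, test = bootstrap_split(1000)
distinct_fraction = len(set(train)) / 1000
print(len(train), len(test), distinct_fraction)
```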











ROC (Receiver Operating Characteristic)

Developed in the 1950s for signal detection theory to analyze noisy signals

Characterize the trade-off between positive hits and false alarms

ROC curve plots TPR (on y-axis) against FPR (on x-axis)

Performance of each classifier represented as a point on the ROC curve

Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point




Points written as (TPR,FPR), along cutoff values from 0 to 1

(0,0): Model predicts everything to be – (cutoff = 1)

(1,1): Model predicts everything to be + (cutoff = 0)

(1,0): The ideal model (hitting the upper-left corner; area under ROC = 1)

Diagonal line

Random guessing (naïve classifier)

Classify as a + with a fixed prob. p

TPR (= p·n+/n+) = FPR (= p·n-/n-) = p, where n+ and n- are the numbers of + and – records

Below diagonal line

Prediction is worse than guessing!

M1 vs. M2

M1 is better for small FPR

M2 is better for large FPR

Area under ROC curve (AUC)

Ideal: AUC = 1

Random guessing: AUC = 0.5

The larger the AUC, the better the model


Apply the classifier to each test instance to produce its posterior probability of being +, P(+)

Sort the instances in increasing order of P(+)

Apply a cutoff at each unique value of P(+)

Assign + to instances ≥ cutoff, – to instances < cutoff

Initially TPR = FPR = 1

Count the number of TP, FP, TN, FN at each cutoff

Increase the cutoff to the next higher value; repeat until the highest

Plot TPR against FPR

Instance  P(+)  True Class
1         0.95  +
2         0.93  +
3         0.87  -
4         0.85  +
5         0.85  -
6         0.85  -
7         0.76  -
8         0.53  +
9         0.43  -
10        0.25  +

Cutoff Table

Class     +     -     +     -     -     -     +     -     +     +
P(+) ≥    0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP        5     4     4     3     3     3     3     2     2     1     0
FP        5     5     4     4     3     2     1     1     0     0     0
TN        0     0     1     1     2     3     4     4     5     5     5
FN        0     1     1     2     2     2     2     3     3     4     5
TPR       1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR       1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve (figure): TPR plotted against FPR for the cutoffs in the table
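The construction steps can be sketched in a few lines using the ten instances from the slide. The function names are assumptions for this example. Note one difference from the table: the code applies one cutoff per unique P(+) value, so the three instances tied at 0.85 produce a single point, whereas the table steps through them one at a time. The area under the curve is estimated with the trapezoidal rule, which is a choice made here and not something the slides specify.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per cutoff, from the lowest cutoff to the highest."""
    n_pos = sum(1 for y in labels if y == "+")
    n_neg = len(labels) - n_pos
    # One cutoff per unique score, plus 1.0 so that the last point is (0, 0).
    cutoffs = sorted(set(scores)) + [1.0]
    points = []
    for c in cutoffs:
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == "+")
        fp = sum(1 for s, y in zip(scores, labels) if s >= c and y == "-")
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC polyline."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "+", "-", "-", "-", "+", "-", "+"]

points = roc_points(scores, labels)
area = auc(points)
for fpr, tpr in points:
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
```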

Given two models:

Model M1: accuracy = 85%, tested on 30 instances

Model M2: accuracy = 75%, tested on 5000 instances

Can we say M1 is better than M2?

Estimate Confidence Intervals for accuracy

Prediction can be regarded as Bernoulli trials (2 possible outcomes), which follow a binomial distribution with p (true accuracy).

For large test sets, (empirical) acc ~ N(p, p(1-p)/n)

Compare performance of two models

Testing statistical significance by Z or t-test

H0: d = e1 – e2 = 0

H1: d ≠ 0

See Section 4.6 in Tan et al. (2006) for more details.

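$$P\left(-Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1-p)/n}} \le Z_{1-\alpha/2}\right) = 1 - \alpha$$

Confidence interval for p (n = number of test instances):

$$p = \frac{2\,n\,acc + Z_{\alpha/2}^{2} \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^{2} + 4\,n\,acc - 4\,n\,acc^{2}}}{2\,(n + Z_{\alpha/2}^{2})}$$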
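Applied to the two models above, the interval formula can be evaluated as follows, assuming a 95% confidence level (Z = 1.96). The second function is an addition that is not on the slide: it uses the standard normal approximation for the difference of two independent error rates, with variance e1(1-e1)/n1 + e2(1-e2)/n2, as one way to carry out the Z-test of H0: d = 0. Both function names are assumptions for this example.

```python
import math


def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p given empirical acc on n test records."""
    center = 2 * n * acc + z ** 2
    spread = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom


def difference_interval(e1, n1, e2, n2, z=1.96):
    """Interval for d = e1 - e2 (assumed normal approximation, independent test sets)."""
    d = e1 - e2
    half_width = z * math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d - half_width, d + half_width


m1_low, m1_high = accuracy_interval(0.85, 30)     # M1: 85% on 30 instances
m2_low, m2_high = accuracy_interval(0.75, 5000)   # M2: 75% on 5000 instances
d_low, d_high = difference_interval(0.15, 30, 0.25, 5000)

print(f"M1: [{m1_low:.3f}, {m1_high:.3f}]")
print(f"M2: [{m2_low:.3f}, {m2_high:.3f}]")
print(f"d = e1 - e2: [{d_low:.3f}, {d_high:.3f}]")
```

Under these assumptions the interval for M1 is much wider than the one for M2 and overlaps it, and the interval for d contains 0, so the observed difference is not statistically significant at the 95% level.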



Generalization: A good classification model must not only fit training data well but also accurately classify unseen records (test/new data).

Overfitting: a model that fits training data too well can have a poorer generalization than a model with a higher training error.

Underfitting: When a model is too simple, both training and test errors are large (the model has yet to learn the data)

Overfitting: Once the tree becomes too large, its test error begins to increase while its training error continues to decrease



Overfitting due to noise

Decision boundary is distorted by a (mislabeled) noise point that should be ignored by the decision tree.

Overfitting due to insufficient training records

Lack of data points makes it difficult to predict the class labels correctly

Decision boundary is made by only a few records falling in the region

An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task



Overfitting results in decision trees that are more complex than necessary.

The chance of overfitting increases as the model becomes more complex.

Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.

Need new ways for estimating generalization errors

Occam’s Razor

Given two models of similar generalization errors, one should prefer the simpler model over the more complex model.

For a complex model, there is a greater chance that it was fitted by chance or by noise in the data and/or that it overfits the data.

Therefore, one should include model complexity when evaluating a model.

Reduce the number of nodes in a decision tree (pruning).
