Methods for Model Comparison - Boston University (cs-people.bu.edu/evimaria/cs565-10/lect13.pdf)


Model Evaluation

• Metrics for Performance Evaluation

– How to evaluate the performance of a model?

• Methods for Performance Evaluation

– How to obtain reliable estimates?

• Methods for Model Comparison

– How to compare the relative performance of different models?

Metrics for Performance Evaluation

• Focus on the predictive capability of a model

– Rather than how fast it classifies or builds models, scalability, etc.

• Confusion Matrix:

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a: TP       b: FN
CLASS    Class=No     c: FP       d: TN

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

Metrics for Performance Evaluation…

• Most widely-used metric:

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
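A minimal Python sketch (an illustration, not from the original slides) that tallies the four confusion-matrix cells and computes accuracy from them:

def confusion_matrix(actual, predicted, positive="Yes"):
    # Tally a (TP), b (FN), c (FP), d (TN) for a two-class problem.
    tp = fn = fp = tn = 0
    for y, y_hat in zip(actual, predicted):
        if y == positive and y_hat == positive:
            tp += 1      # a: true positive
        elif y == positive:
            fn += 1      # b: false negative
        elif y_hat == positive:
            fp += 1      # c: false positive
        else:
            tn += 1      # d: true negative
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    # (a + d) / (a + b + c + d)
    return (tp + tn) / (tp + fn + fp + tn)

actual    = ["Yes", "Yes", "No", "No",  "Yes"]
predicted = ["Yes", "No",  "No", "Yes", "Yes"]
print(accuracy(*confusion_matrix(actual, predicted)))   # 0.6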

Limitation of Accuracy

• Consider a 2-class problem

– Number of Class 0 examples = 9990

– Number of Class 1 examples = 10

• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%

– Accuracy is misleading because the model does not detect a single class 1 example

Cost Matrix

                      PREDICTED CLASS
C(i|j)                Class=Yes    Class=No
ACTUAL   Class=Yes    C(Yes|Yes)   C(No|Yes)
CLASS    Class=No     C(Yes|No)    C(No|No)

C(i|j): Cost of misclassifying class j example as class i

Computing Cost of Classification

Cost matrix:

                      PREDICTED CLASS
C(i|j)                +      -
ACTUAL      +        -1    100
CLASS       -         1      0

Model M1:

                      PREDICTED CLASS
                      +      -
ACTUAL      +       150     40
CLASS       -        60    250

Accuracy = 80%, Cost = 3910

Model M2:

                      PREDICTED CLASS
                      +      -
ACTUAL      +       250     45
CLASS       -         5    200

Accuracy = 90%, Cost = 4255
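A minimal Python sketch (an illustration, not from the original slides) that reproduces these costs by multiplying each confusion-matrix cell by the corresponding entry of the cost matrix:

def total_cost(counts, costs):
    # counts and costs are ordered (TP, FN, FP, TN), i.e. (a, b, c, d)
    return sum(n * c for n, c in zip(counts, costs))

cost_matrix = (-1, 100, 1, 0)        # C(+|+), C(-|+), C(+|-), C(-|-)
m1 = (150, 40, 60, 250)              # Model M1
m2 = (250, 45, 5, 200)               # Model M2

print(total_cost(m1, cost_matrix))   # 3910
print(total_cost(m2, cost_matrix))   # 4255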

Cost vs Accuracy

Count matrix:

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

Cost matrix:

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    p           q
CLASS    Class=No     q           p

N = a + b + c + d

Accuracy = (a + d) / N

Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N - a - d)
     = q N - (q - p)(a + d)
     = N [q - (q - p) Accuracy]

Accuracy is proportional to cost if
  1. C(Yes|No) = C(No|Yes) = q
  2. C(Yes|Yes) = C(No|No) = p

Cost-Sensitive Measures

Precision (p) = a / (a + c) = TP / (TP + FP)

Recall (r) = a / (a + b) = TP / (TP + FN)

F-measure (F) = 2 r p / (r + p) = 2a / (2a + b + c) = 2 TP / (2 TP + FP + FN)

Precision is biased towards C(Yes|Yes) & C(Yes|No)

Recall is biased towards C(Yes|Yes) & C(No|Yes)

F-measure is biased towards all except C(No|No)

Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
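A minimal Python sketch (an illustration, not from the original slides) computing precision, recall and F-measure from the confusion-matrix cells, using model M1 from the earlier cost example:

def precision(tp, fp):
    return tp / (tp + fp)            # a / (a + c)

def recall(tp, fn):
    return tp / (tp + fn)            # a / (a + b)

def f_measure(tp, fn, fp):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * r * p / (r + p)       # equivalently 2*TP / (2*TP + FP + FN)

tp, fn, fp = 150, 40, 60             # Model M1
print(precision(tp, fp))             # ~0.714
print(recall(tp, fn))                # ~0.789
print(f_measure(tp, fn, fp))         # 0.75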

Model Evaluation

• Metrics for Performance Evaluation

– How to evaluate the performance of a model?

• Methods for Performance Evaluation

– How to obtain reliable estimates?

• Methods for Model Comparison

– How to compare the relative performance of different models?

Methods for Performance Evaluation

• How to obtain a reliable estimate of performance?

• Performance of a model may depend on other factors besides the learning algorithm:

– Class distribution

– Cost of misclassification

– Size of training and test sets

Learning Curve

Learning curve shows how accuracy changes with varying sample size

Requires a sampling schedule for creating the learning curve

Effect of small sample size:

- Bias in the estimate

- Variance of estimate
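A minimal Python sketch (an illustration, not from the original slides; the nearest-centroid classifier and the synthetic Gaussian data are assumptions made only for this example) that traces a learning curve over a sampling schedule of increasing training sizes:

import random

def nearest_centroid_accuracy(train, test):
    # One centroid per class; predict the class whose centroid is closer.
    centroids = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        centroids[label] = sum(xs) / len(xs)
    hits = sum(1 for x, y in test
               if min(centroids, key=lambda c: abs(x - centroids[c])) == y)
    return hits / len(test)

random.seed(0)
make_data = lambda n: ([(random.gauss(0, 1), 0) for _ in range(n)] +
                       [(random.gauss(2, 1), 1) for _ in range(n)])
test_set = make_data(500)

for n in (5, 10, 50, 100, 500):      # sampling schedule
    print(n, round(nearest_centroid_accuracy(make_data(n), test_set), 3))

The small-sample runs show exactly the effect noted above: the accuracy estimate is both more biased and more variable.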

Methods of Estimation

• Holdout

– Reserve 2/3 for training and 1/3 for testing

• Random subsampling

– Repeated holdout

• Cross validation

– Partition data into k disjoint subsets

– k-fold: train on k-1 partitions, test on the remaining one (see the sketch after this list)

– Leave-one-out: k=n

• Bootstrap

– Sampling with replacement
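A minimal Python sketch (an illustration, not from the original slides) of the k-fold split and of bootstrap sampling:

import random

def k_fold_indices(n, k, seed=0):
    # Partition the n record indices into k disjoint folds and yield
    # (train_indices, test_indices) pairs; leave-one-out is k = n.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_indices(n=10, k=5):
    print(sorted(test))

# Bootstrap: a sample of size n drawn with replacement
bootstrap = [random.randrange(10) for _ in range(10)]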

Model Evaluation

• Metrics for Performance Evaluation

– How to evaluate the performance of a model?

• Methods for Performance Evaluation

– How to obtain reliable estimates?

• Methods for Model Comparison

– How to compare the relative performance of different models?

ROC (Receiver Operating Characteristic)

• Developed in the 1950s in signal detection theory to analyze noisy signals

– Characterizes the trade-off between positive hits and false alarms

• ROC curve plots TPR (on the y-axis) against FPR (on the x-axis)

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)

                   PREDICTED CLASS
                   Yes       No
Actual    Yes      a (TP)    b (FN)
          No       c (FP)    d (TN)

ROC (Receiver Operating Characteristic)

• Performance of each classifier is represented as a point on the ROC curve

– changing the threshold of the algorithm, the sample distribution or the cost matrix changes the location of the point

ROC Curve

At threshold t:

TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88

- 1-dimensional data set containing 2 classes (positive and negative)

- any point located at x > t is classified as positive

ROC Curve

(TP, FP):

• (0,0): declare everything to be negative class

• (1,1): declare everything to be positive class

• (1,0): ideal

• Diagonal line:

– Random guessing

– Below diagonal line:

• prediction is opposite of the true class

                   PREDICTED CLASS
                   Yes       No
Actual    Yes      a (TP)    b (FN)
          No       c (FP)    d (TN)

Using ROC for Model Comparison

Neither model consistently outperforms the other

M1 is better for small FPR

M2 is better for large FPR

Area Under the ROC Curve

Ideal: Area = 1

Random guess:

Area = 0.5

How to Construct an ROC Curve

Instance   P(+|A)   True Class
1          0.95     +
2          0.93     +
3          0.87     -
4          0.85     -
5          0.85     -
6          0.85     +
7          0.76     -
8          0.53     +
9          0.43     -
10         0.25     +

• Use a classifier that produces a posterior probability P(+|A) for each test instance A

• Sort the instances according to P(+|A) in decreasing order

• Apply a threshold at each unique value of P(+|A) (a code sketch follows the table below)

• Count the number of TP, FP, TN, FN at each threshold

• TP rate, TPR = TP/(TP+FN)

• FP rate, FPR = FP/(FP + TN)

How to Construct an ROC Curve

Class          +      -      +      -      -      -      +      -      +      +
Threshold >=   0.25   0.43   0.53   0.76   0.85   0.85   0.85   0.87   0.93   0.95   1.00
TP             5      4      4      3      3      3      3      2      2      1      0
FP             5      5      4      4      3      2      1      1      0      0      0
TN             0      0      1      1      2      3      4      4      5      5      5
FN             0      1      1      2      2      2      2      3      3      4      5
TPR            1      0.8    0.8    0.6    0.6    0.6    0.6    0.4    0.4    0.2    0
FPR            1      1      0.8    0.8    0.6    0.4    0.2    0.2    0      0      0
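Following the procedure above, a minimal Python sketch (an illustration, not from the original slides) that sweeps a threshold over the scores, collects the (FPR, TPR) points, and estimates the area under the curve with the trapezoidal rule:

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+',  '+',  '-',  '-',  '-',  '+',  '-',  '+',  '-',  '+']

P, N = labels.count('+'), labels.count('-')
points = []                                        # (FPR, TPR) per threshold
for t in sorted(set(scores) | {1.0}, reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
    points.append((fp / N, tp / P))

# points runs from (0, 0) to (1, 1); the trapezoidal rule gives the area under it
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points)
print(auc)                                         # 0.56 for this data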

ROC Curve: plot of the (FPR, TPR) points above

Ensemble Methods

• Construct a set of classifiers from the training data

• Predict class label of previously unseen records by aggregating predictions made by multiple classifiers

General Idea

Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt from the original training data D

Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct, one per data set

Step 3: Combine the classifiers into a single classifier C*

Why does it work?

• Suppose there are 25 base classifiers

– Each classifier has error rate ε = 0.35

– Assume the classifiers are independent

– Probability that the ensemble classifier (majority vote of the 25) makes a wrong prediction:

  Σ_{i=13}^{25} C(25, i) ε^i (1 - ε)^(25-i) ≈ 0.06
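A minimal Python sketch (an illustration, not from the original slides) that evaluates this binomial sum:

from math import comb

eps, n = 0.35, 25
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
              for i in range(13, n + 1))   # 13 or more of the 25 are wrong
print(round(p_wrong, 3))                   # ~0.06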

Examples of Ensemble Methods

• How to generate an ensemble of classifiers?

– Bagging

– Boosting

Bagging

• Sampling with replacement

• Build classifier on each bootstrap sample

• Each record has probability 1 - (1 - 1/n)^n ≈ 0.632 of being included in a given bootstrap sample (a sampling sketch follows the table below)

Original Data        1   2   3   4   5   6   7   8   9   10
Bagging (Round 1)    7   8  10   8   2   5  10  10   5    9
Bagging (Round 2)    1   4   9   1   2   3   2   7   3    2
Bagging (Round 3)    1   8   5  10   5   5   9   6   3    7
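A minimal Python sketch (an illustration, not from the original slides) that draws bootstrap samples of this kind:

import random

def bootstrap_sample(records, rng):
    # Sample with replacement, same size as the original data set.
    return [rng.choice(records) for _ in records]

records = list(range(1, 11))
rng = random.Random(42)
for r in (1, 2, 3):
    print(f"Bagging (Round {r}):", bootstrap_sample(records, rng))
# A base classifier would then be built on each bootstrap sample and the
# individual predictions combined, e.g. by majority vote.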

Boosting

• An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records

– Initially, all N records are assigned equal weights

– Unlike bagging, weights may change at the end of each boosting round

Boosting

• Records that are wrongly classified will have their weights increased

• Records that are classified correctly will have their weights decreased

Original Data        1   2   3   4   5   6   7   8   9   10
Boosting (Round 1)   7   3   2   8   7   9   4  10   6    3
Boosting (Round 2)   5   4   9   4   2   5   1   7   4    2
Boosting (Round 3)   4   4   8  10   4   5   4   6   3    4

• Example 4 is hard to classify

• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds

Example: AdaBoost

• Base classifiers: C1, C2, …, CT

• Training data: N pairs (x_j, y_j), each carrying a weight w_j

• Error rate of classifier C_i (δ(·) is 1 if its argument is true and 0 otherwise):

  ε_i = (1/N) Σ_{j=1..N} w_j δ(C_i(x_j) ≠ y_j)

• Importance of a classifier:

  α_i = (1/2) ln((1 - ε_i) / ε_i)

Example: AdaBoost

• Weight update for every boosting round j:

  w_i^(j+1) = (w_i^(j) / Z_j) × exp(-α_j)   if C_j(x_i) = y_i
  w_i^(j+1) = (w_i^(j) / Z_j) × exp(+α_j)   if C_j(x_i) ≠ y_i

  where Z_j is the normalization factor that keeps the weights summing to 1

• If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n

• Classification by the final ensemble:

  C*(x) = argmax_y Σ_{j=1..T} α_j δ(C_j(x) = y)
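Putting the pieces together, a minimal Python sketch of AdaBoost (an illustration, not the slides' exact procedure: it reweights rather than resamples, and it assumes simple 1-d threshold stumps as base classifiers on data laid out like the illustration that follows):

import math

def train_stump(x, y, w):
    # Pick the threshold/sign with the lowest weighted error rate.
    best = None
    for t in sorted(set(x)):
        for sign in (+1, -1):
            pred = [sign if xi > t else -sign for xi in x]
            err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(x, y, rounds=3):
    n = len(x)
    w = [1.0 / n] * n                     # equal initial weights
    model = []                            # list of (alpha, threshold, sign)
    for _ in range(rounds):
        err, t, sign = train_stump(x, y, w)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, sign))
        # Increase weights of misclassified records, decrease the others,
        # then divide by Z so the weights again sum to 1.
        w = [wi * math.exp(-alpha if (sign if xi > t else -sign) == yi else alpha)
             for wi, xi, yi in zip(w, x, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model

def predict(model, xi):
    # C*(x): sign of the importance-weighted vote of the base classifiers.
    score = sum(alpha * (sign if xi > t else -sign) for alpha, t, sign in model)
    return 1 if score > 0 else -1

x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
y = [ 1,   1,   1,  -1,  -1,  -1,  -1,  -1,   1,   1]
model = adaboost(x, y)
print([predict(model, xi) for xi in x])   # recovers y exactly after 3 rounds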

Illustrating AdaBoost

(Figure: the ten training points, their weights after each boosting round, and each round's predictions.)

Original data:          + + + - - - - - + +     initial weight 0.1 on every point

Boosting Round 1 (B1):  + + + - - - - - - -     weights shown: 0.0094, 0.0094, 0.4623;  α = 1.9459

Boosting Round 2 (B2):  - - - - - - - - + +     weights shown: 0.3037, 0.0009, 0.0422;  α = 2.9323

Boosting Round 3 (B3):  + + + + + + + + + +     weights shown: 0.0276, 0.1819, 0.0038;  α = 3.8744

Overall:                + + + - - - - - + +
