
Page 1: Dbm630 lecture08

DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University, Semester 2/2011

Lecture 8: Classification and Prediction Evaluation

by Kritsada Sriphaew (sriphaew.k AT gmail.com)

Page 2: Dbm630 lecture08

Topics

2

Train, test and validation sets

Evaluation on large data; unbalanced data

Evaluation on small data: cross-validation, bootstrap

Comparing data mining schemes: significance tests, lift chart / ROC curve

Numeric prediction evaluation

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 3: Dbm630 lecture08

3

Evaluation in Classification Tasks

How predictive is the model we learned?

Error on the training data is not a good indicator of performance on future data

Q: Why?

A: Because new data will probably not be exactly the same as the training data!

Overfitting: fitting the training data too precisely usually leads to poor results on new data

Classification and Prediction: Evaluation

Page 4: Dbm630 lecture08

4

Model Selection and Bias-Variance Tradeoff

Typical behavior of the test and training error as model complexity is varied.

[Figure: prediction error (y-axis) versus model complexity (x-axis, low to high) for the training sample and the test sample. Low complexity corresponds to high bias and low variance; high complexity corresponds to low bias and high variance.]

Classification and Prediction: Evaluation

Page 5: Dbm630 lecture08

Classification and Prediction: Evaluation 5

Classifier error rate

Natural performance measure for classification problems: error rate.

Success: instance's class is predicted correctly

Error: instance’s class is predicted incorrectly

Error rate: proportion of errors made over the whole set of instances

Resubstitution error: error rate on the training data (too optimistic!)

Generalization error: error rate on test data

Example: accuracy rises from 85% to 87%, so the error rate (1 - accuracy) falls from 15% to 13%.

Absolute: a 2% improvement in accuracy, i.e. a 2% error reduction.

Relative: 2/85 ≈ 2.35% improvement rate; 2/15 ≈ 13.3% error reduction rate.

Page 6: Dbm630 lecture08

6

Evaluation on LARGE data

If many (thousands) of examples are available, including several hundred examples from each class, then how can we evaluate our classifier model?

A simple evaluation is sufficient

For example, randomly split data into training and test sets (usually 2/3 for train, 1/3 for test)

Build a classifier using the train set and evaluate it using the test set.
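A minimal sketch of this holdout evaluation, assuming scikit-learn is available; the decision-tree classifier and the synthetic data are illustrative choices, not part of the original slides.

```python
# Hedged sketch: 2/3 train, 1/3 test holdout evaluation (illustrative classifier and data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)  # 2/3 for training, 1/3 for testing

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))
```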

Classification and Prediction: Evaluation

Page 7: Dbm630 lecture08

7

Classification Step 1: Split data into train and test sets

[Diagram: historical data ("the past") with known results (+/-) is split into a training set and a testing set.]

Classification and Prediction: Evaluation

Page 8: Dbm630 lecture08

8

Classification Step 2: Build a model on a training set

[Diagram: the training set, with known +/- results from the past, is fed to the model builder; the testing set is held aside.]

Classification and Prediction: Evaluation

Page 9: Dbm630 lecture08

9

Classification Step 3: Evaluate on the test set (and maybe re-train)

[Diagram: the built model predicts Y/N labels for the testing-set instances; the predictions (+/-) are compared with the known results and fed back to the model builder.]

Classification and Prediction: Evaluation

Page 10: Dbm630 lecture08

10

A note on parameter tuning

It is important that the test data is not used in any way to create the classifier

Some learning schemes operate in two stages:

Stage 1: builds the basic structure

Stage 2: optimizes parameter settings

The test data can’t be used for parameter tuning!

Proper procedure uses three sets: training data, validation data, and test data

Validation data is used to optimize parameters
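A minimal sketch of this three-set procedure, assuming scikit-learn; the tree depth being tuned and the synthetic data are illustrative assumptions.

```python
# Hedged sketch: train / validation / test split, with parameter tuning on validation data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=3000, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_depth, best_acc = None, 0.0
for depth in (2, 4, 8, None):                       # tune max_depth on the validation set
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    acc = accuracy_score(y_val, clf.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))  # test set used exactly once
```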

Classification and Prediction: Evaluation

Page 11: Dbm630 lecture08

11

Classification: Train, Validation, Test split

[Diagram: data with known results is split into a training set, a validation set and a final test set. The model builder learns from the training set, its Y/N predictions on the validation set are evaluated to tune the model, and the final model is evaluated exactly once on the final test set.]

Classification and Prediction: Evaluation

Page 12: Dbm630 lecture08

12

Unbalanced data

Sometimes, classes have very unequal frequency

Accommodation prediction: 97% stay, 3% don’t stay

medical diagnosis: 90% healthy, 10% disease

eCommerce: 99% don’t buy, 1% buy

Security: >99.99% of Americans are not terrorists

Similar situation with multiple classes

Majority class classifier can be 97% correct, but useless

Solution: with two classes, a good approach is to build BALANCED train and test sets and train the model on a balanced set: randomly select the desired number of minority-class instances, then add an equal number of randomly selected majority-class instances.

That is, we ignore the effect of the number of instances for each class.
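A minimal sketch of building such a balanced sample, assuming numpy; the class sizes and features are toy values chosen to mirror a 97%/3% split.

```python
# Hedged sketch: down-sample the majority class to the size of the minority class.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 970 + [1] * 30)          # toy unbalanced labels: 97% vs 3%
X = rng.normal(size=(len(y), 5))            # toy features

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
keep_majority = rng.choice(majority, size=len(minority), replace=False)
balanced_idx = rng.permutation(np.concatenate([minority, keep_majority]))

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print(np.bincount(y_bal))                   # -> [30 30], equal class counts
```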

Classification and Prediction: Evaluation

Page 13: Dbm630 lecture08

13

Evaluation on SMALL data

The holdout method reserves a certain amount for testing and uses the remainder for training. Usually: one third for testing, the rest for training.

For "unbalanced" datasets, samples might not be representative: few or no instances of some classes.

Stratified sample: an advanced version of balancing the data. Make sure that each class is represented with approximately equal proportions in both subsets.

What if we have a small data set?

The chosen 2/3 for training may not be representative.

The chosen 1/3 for testing may not be representative.

Classification and Prediction: Evaluation

Page 14: Dbm630 lecture08

14

Repeated Holdout Method

Holdout estimate can be made more reliable by repeating the process with different subsamples

In each iteration, a certain proportion is randomly selected for training (possibly with stratification)

The error rates on the different iterations are averaged to yield an overall error rate

This is called the repeated holdout method

Still not optimum: the different test sets overlap.

Can we prevent overlapping?

Classification and Prediction: Evaluation

Page 15: Dbm630 lecture08

15

Cross-validation

Cross-validation avoids overlapping test sets

First step: data is split into k subsets of equal size

Second step: each subset in turn is used for testing and the remainder for training

This is called k-fold cross-validation

Often the subsets are stratified before the cross-validation is performed

The error estimates are averaged to yield an overall error estimate
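A minimal sketch of stratified k-fold cross-validation, assuming scikit-learn; the classifier, the class weights, and k = 10 are illustrative choices.

```python
# Hedged sketch: stratified 10-fold cross-validation, averaging the fold scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("mean accuracy over 10 folds:", scores.mean())
```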

Classification and Prediction: Evaluation

Page 16: Dbm630 lecture08

16

Cross-validation Example

Break up the data into groups of the same size (possibly with stratification).

Hold aside one group for testing and use the rest to build the model.

Repeat with a different group held out until every group has been used once for testing.

[Diagram: the data split into equal-sized groups; one group marked "Test", the remaining groups marked "Train".]

Classification and Prediction: Evaluation

Page 17: Dbm630 lecture08

17

More on cross-validation

Standard method for evaluation: stratified ten-fold cross-validation

Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate

Stratification reduces the estimate’s variance

Even better: repeated stratified cross-validation

E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)

Classification and Prediction: Evaluation

Page 18: Dbm630 lecture08

18

Leave-One-Out cross-validation

Leave-One-Out: remove one instance for testing and use the others for training; a particular form of cross-validation:

Set the number of folds equal to the number of training instances

i.e., for n training instances, build classifier n times

Makes best use of the data

Involves no random subsampling

Very computationally expensive
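A minimal sketch of leave-one-out cross-validation, assuming scikit-learn; the dataset size and classifier are illustrative.

```python
# Hedged sketch: leave-one-out CV builds the classifier n times (n = number of instances).
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("LOO accuracy estimate:", scores.mean())   # classifier is built 100 times here
```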

Classification and Prediction: Evaluation

Page 19: Dbm630 lecture08

19

Leave-One-Out-CV and stratification

Disadvantage of Leave-One-Out-CV: stratification is not possible

It guarantees a non-stratified sample because there is only one instance in the test set!

Extreme example: random dataset split equally into two classes

Best inducer predicts majority class

50% accuracy on fresh data

Leave-One-Out-CV estimate is 100% error!

Classification and Prediction: Evaluation

Page 20: Dbm630 lecture08

20

*The bootstrap

CV uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set.

The bootstrap uses sampling with replacement to form the training set

Sample a dataset of n instances n times with replacement to form a new dataset of n instances

Use this data as the training set

Use the instances from the original dataset that do not occur in the new training set for testing
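A minimal sketch of forming one bootstrap training set, assuming numpy; the dataset size is arbitrary and instances are represented by their indices.

```python
# Hedged sketch: sample n instances with replacement for training; test on the left-out ones.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
data_idx = np.arange(n)

train_idx = rng.choice(data_idx, size=n, replace=True)   # bootstrap sample of size n
test_idx = np.setdiff1d(data_idx, train_idx)             # instances never picked -> test set

print("fraction of distinct instances in training set:", len(np.unique(train_idx)) / n)  # ~0.632
print("fraction left for testing:", len(test_idx) / n)                                   # ~0.368
```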

Classification and Prediction: Evaluation

Page 21: Dbm630 lecture08

Evaluating the Accuracy of a Classifier or Predictor

Bootstrap method

The training tuples are sampled uniformly with replacement

Each time a tuple is selected, it is equally likely to be selected again and re-added to the training set

There are several bootstrap methods; the commonly used one is the .632 bootstrap, which works as follows

Given a data set of d tuples

The data set is sampled d times, with replacement, resulting in a bootstrap sample (training set) of d samples

It is very likely that some of the original data tuples will occur more than once in this sample

The data tuples that did not make it into the training set end up forming the test set

Suppose we try this out several times: on average 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set

21 Classification and Prediction: Evaluation

Page 22: Dbm630 lecture08

22

*The 0.632 bootstrap

Also called the 0.632 bootstrap

For n instances, a particular instance has a probability of 1–1/n of not being picked

Thus its probability of ending up in the test data is:

$\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368$

This means the training data will contain approximately 63.2% of the instances

Classification and Prediction: Evaluation

Note: e is an irrational constant approximately equal to 2.718281828
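A quick numeric check of the formula above; the sample sizes are arbitrary.

```python
# Hedged sketch: (1 - 1/n)^n approaches 1/e ≈ 0.368 as n grows.
import math
for n in (10, 100, 1000):
    print(n, (1 - 1 / n) ** n, "vs 1/e =", math.exp(-1))
```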

Page 23: Dbm630 lecture08

23

*Estimating error with the bootstrap

The error estimate on the test data will be very pessimistic

Trained on just ~63% of the instances

Therefore, combine it with the re-substitution error:

The re-substitution error gets less weight than the error on the test data

Repeat process several times with different replacement samples; average the results

$err = 0.632 \cdot err_{\text{test instances}} + 0.368 \cdot err_{\text{training instances}}$

Classification and Prediction: Evaluation

Page 24: Dbm630 lecture08

24

*More on the bootstrap

Probably the best way of estimating performance for very small datasets

However, it has some problems

Consider the random dataset from above

A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data

Bootstrap estimate for this classifier: $err = 0.632 \times 50\% + 0.368 \times 0\% = 31.6\%$

True expected error: 50%

Classification and Prediction: Evaluation

Page 25: Dbm630 lecture08

27

Evaluating Two-class Classification (Lift Chart vs. ROC Curve)

Information Retrieval or Search Engine

An application to find a set of related documents given a set of keywords.

Hard Decision vs. Soft Decision

Focus on soft decision

Multiclass

Class probability (Ranking)

Class by class evaluation

Example: promotional mailout

Situation 1: classifier predicts that 0.1% of all households will respond

Situation 2: classifier predicts that 0.4% of the 100000 most promising households will respond

Classification and Prediction: Evaluation

Page 26: Dbm630 lecture08

28

Confusion Matrix (Two-class)

Also called contingency table

Classification and Prediction: Evaluation

                              Actual Class
                              Yes                 No
Predicted Class   Yes         True Positive       False Positive
                  No          False Negative      True Negative

Page 27: Dbm630 lecture08

29

Measures in Information Retrieval

precision: percentage of retrieved documents that are relevant.

recall: percentage of relevant documents that are returned.

F-measure: the combination measure of recall and precision.

$\text{precision} = \frac{TP}{TP + FP}$

$\text{recall} = \frac{TP}{TP + FN}$

$F\text{-measure} = \frac{2 \cdot \text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}}$

Precision/recall curves have a hyperbolic shape.

Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall).
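A minimal sketch of these three measures computed from TP/FP/FN counts; the counts below are illustrative, not from the slides.

```python
# Hedged sketch: precision, recall and F-measure from confusion-matrix counts.
def prf(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

print(prf(tp=60, fp=20, fn=40))   # -> (0.75, 0.6, 0.666...)
```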

Classification and Prediction: Evaluation

Page 28: Dbm630 lecture08

30

Measures in Two-Class Classification

For the positive class:

$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

$\text{precision} = \frac{TP}{TP + FP}$

$\text{recall} = \frac{TP}{TP + FN}$

For the negative class (accuracy is the same):

$\text{precision} = \frac{TN}{TN + FN}$

$\text{recall} = \frac{TN}{TN + FP}$

Usually, we focus only on the positive class ("True" or "Yes" cases); therefore, only the precision and recall of the positive class are used for performance comparison.

$\text{TP Rate} = \frac{TP}{TP + FN} = \text{recall}$

$\text{FP Rate} = \frac{FP}{FP + TN}$

Classification and Prediction: Evaluation

Page 29: Dbm630 lecture08

Confusion matrix: Example

Actual \ Predicted         Buys_computer = yes    Buys_computer = no    Total
Buys_computer = yes        6,954                  46                    7,000
Buys_computer = no         412                    2,588                 3,000
Total                      7,366                  2,634                 10,000

FN: number of tuples of class buys_computer = yes that were labeled by the classifier as buys_computer = no (here 46).

FP: number of tuples of class buys_computer = no that were labeled by the classifier as buys_computer = yes (here 412).
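A minimal worked computation of accuracy, precision and recall for the positive class ("buys_computer = yes") using the counts in the confusion matrix above.

```python
# Hedged sketch: two-class measures from the buys_computer confusion matrix.
TP, FN = 6954, 46      # actual yes: predicted yes / predicted no
FP, TN = 412, 2588     # actual no : predicted yes / predicted no

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# -> accuracy=0.9542 precision=0.9441 recall=0.9934
```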

Page 30: Dbm630 lecture08

Cumulative Gains Chart/Lift Chart/ ROC curve

Classification and Prediction: Evaluation 32

They are visual aids for measuring model performance

Cumulative gains measure the effectiveness of a predictive model in terms of the TP rate (the percentage of true responses captured).

Lift measures the effectiveness of a predictive model as the ratio between the results obtained with and without the predictive model.

Cumulative gains and lift charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.

ROC measures the effectiveness of a predictive model by plotting the percentage of positive responses against the percentage of negative responses, i.e. TP rate against FP rate. The greater the area under the ROC curve, the better the model.

Page 31: Dbm630 lecture08

33

Generating Charts

Instances are sorted according to their predicted probability of being a true positive:

Classification and Prediction: Evaluation

Page 32: Dbm630 lecture08

Cumulative Gains Chart

Classification and Prediction: Evaluation 34

The x-axis shows the percentage of samples

The y-axis shows the percentage of positive responses (or true positive rate). This is a percentage of the total possible positive responses

Baseline (overall response rate): If we sample X% of data then we will receive X% of the total positive responses.

Lift Curve: Using the predictions of the response model, calculate the percentage of positive responses for the percent of customers contacted and map these points to create the lift curve.
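A minimal sketch, assuming numpy, of how the points of a cumulative gains curve can be computed from predicted scores; the labels and scores are synthetic.

```python
# Hedged sketch: cumulative gains = % of all positives captured vs % of instances contacted.
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                  # toy actual responses (0/1)
scores = y * 0.5 + rng.random(200)                # toy model scores, correlated with y

order = np.argsort(-scores)                       # best-scored instances first
captured = np.cumsum(y[order]) / y.sum()          # % of positive responses captured so far
contacted = np.arange(1, len(y) + 1) / len(y)     # % of instances contacted so far
print(list(zip(contacted[39::40], captured[39::40])))  # curve points at 20% increments
```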

Page 33: Dbm630 lecture08

35

A Sample Cumulative Gains Chart

[Figure: x-axis = % of samples (from 10% to 100%); y-axis = TP rate / % positive responses (0 to 100%). The model's gains curve lies above the baseline.]

Baseline: %positive_responses = %sample_size

Classification and Prediction: Evaluation

Page 34: Dbm630 lecture08

Lift Chart

Classification and Prediction: Evaluation 36

The x-axis shows the percentage of samples

The y-axis shows the lift: the ratio of positive responses obtained with the model to those obtained without the model

To plot the chart: Calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model

Example: For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders. The y-value of the lift curve at 10% is 30 / 10 = 3.
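A minimal worked version of the example above; the percentages are taken directly from it.

```python
# Hedged sketch: lift at a given contact depth (10% contacted, model captures 30% of responders).
pct_contacted = 0.10
pct_responders_with_model = 0.30          # read off the cumulative gains curve
pct_responders_no_model = pct_contacted   # baseline: random contact captures 10%

lift = pct_responders_with_model / pct_responders_no_model
print("lift at 10% contacted:", lift)     # -> 3.0
```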

Page 35: Dbm630 lecture08

37

A Sample Lift Chart

[Figure: x-axis = % of samples (from 10% to 100%); y-axis = lift. The model's lift curve starts above the baseline and falls toward it as more samples are included.]

Baseline: lift = 1 (the model performs no better than random selection)

Classification and Prediction: Evaluation

Page 36: Dbm630 lecture08

38

ROC Curve

ROC curves are similar to lift charts

“ROC” stands for “receiver operating characteristic”

Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel

Differences:

y axis shows percentage of true positives in sample

x axis shows percentage of false positives in sample (rather than sample size)
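A minimal sketch of computing ROC points, assuming scikit-learn's roc_curve; the labels and scores are illustrative only.

```python
# Hedged sketch: ROC points (FP rate vs TP rate) and area under the curve from predicted scores.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.65, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FP rate:", fpr)
print("TP rate:", tpr)
print("area under the ROC curve:", auc(fpr, tpr))
```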

Classification and Prediction: Evaluation

Page 37: Dbm630 lecture08

39

A Sample ROC Curve

[Figure: ROC curve for the promotional mailout example; x-axis = false positive rate, y-axis = true positive rate. Annotations mark points corresponding to 400 responses and 1000 responses out of 1,000,000 mailouts.]

Baseline: TP Rate = FP Rate

Classification and Prediction: Evaluation

Page 38: Dbm630 lecture08

40

Extending to Multiple-Class Classification

[Table: n-class confusion matrix. Rows = ACTUAL CLASS $C_1, \dots, C_n$; columns = PREDICTED CLASS $C_1, \dots, C_n$. Cell $n_{ij}$ counts instances of actual class $C_i$ predicted as class $C_j$. Row sums are $R_1, \dots, R_n$; column sums are $P_1, \dots, P_n$; $T$ is the total number of instances, and $n_{11} + n_{22} + \dots + n_{nn}$ is the number of correctly classified instances.]

Recall of class $C_i$ = $n_{ii} / R_i$; precision of class $C_i$ = $n_{ii} / P_i$.

Classification and Prediction: Evaluation

Page 39: Dbm630 lecture08

41

Measures in Multiple-Class Classification

$\text{accuracy} = \frac{n_{11} + n_{22} + \cdots + n_{nn}}{T}$

$\text{recall}(C_i) = \frac{n_{ii}}{R_i}$

$\text{precision}(C_i) = \frac{n_{ii}}{P_i}$

Classification and Prediction: Evaluation

Page 40: Dbm630 lecture08

42

Numeric Prediction Evaluation

Same strategies: independent test set, cross-validation, significance tests, etc.

Difference: error measures or generalization error

Actual target values: a1, a2,…, an

Predicted target values: p1, p2,…, pn

Most popular measure: mean-squared error

Easy to manipulate mathematically

$\text{MSE} = \frac{(p_1 - a_1)^2 + (p_2 - a_2)^2 + \cdots + (p_n - a_n)^2}{n}$

Classification and Prediction: Evaluation

Page 41: Dbm630 lecture08

43

Other Measures

The root mean-squared (rms) error:

$\text{RMSE} = \sqrt{\frac{(p_1 - a_1)^2 + (p_2 - a_2)^2 + \cdots + (p_n - a_n)^2}{n}}$

The mean absolute error is less sensitive to outliers than the mean-squared error:

$\text{MAE} = \frac{|p_1 - a_1| + |p_2 - a_2| + \cdots + |p_n - a_n|}{n}$

Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500).
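A minimal sketch computing the error measures above with numpy; the actual and predicted values are toy numbers.

```python
# Hedged sketch: mean-squared, root mean-squared and mean absolute error.
import numpy as np

a = np.array([500.0, 300.0, 200.0, 400.0])   # actual target values (toy)
p = np.array([450.0, 330.0, 180.0, 440.0])   # predicted target values (toy)

mse = np.mean((p - a) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(p - a))
print(f"MSE={mse:.1f}  RMSE={rmse:.2f}  MAE={mae:.1f}")
```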

Classification and Prediction: Evaluation

Page 42: Dbm630 lecture08

44

Improvement on the Mean

Often we want to know how much the scheme improves on simply predicting the average.

The relative squared error is:

$\text{RSE} = \frac{(p_1 - a_1)^2 + (p_2 - a_2)^2 + \cdots + (p_n - a_n)^2}{(a_1 - \bar{a})^2 + (a_2 - \bar{a})^2 + \cdots + (a_n - \bar{a})^2}$, where $\bar{a}$ is the average of the actual values.

The relative absolute error is:

$\text{RAE} = \frac{|p_1 - a_1| + |p_2 - a_2| + \cdots + |p_n - a_n|}{|a_1 - \bar{a}| + |a_2 - \bar{a}| + \cdots + |a_n - \bar{a}|}$

Classification and Prediction: Evaluation

Page 43: Dbm630 lecture08

45

The Correlation Coefficient

Measures the statistical correlation between the predicted values and the actual values.

Scale independent, between -1 and +1.

Good performance leads to large values!

$\rho = \frac{S_{PA}}{\sqrt{S_P S_A}}$

where

$S_{PA} = \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}{n - 1}$, $\quad S_P = \frac{\sum_i (p_i - \bar{p})^2}{n - 1}$, $\quad S_A = \frac{\sum_i (a_i - \bar{a})^2}{n - 1}$
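A minimal sketch computing the relative errors and the correlation coefficient with numpy; the data values are toy numbers, and np.corrcoef is used in place of the explicit $S_{PA}/\sqrt{S_P S_A}$ ratio, to which it is equivalent.

```python
# Hedged sketch: relative squared error, relative absolute error, correlation coefficient.
import numpy as np

a = np.array([500.0, 300.0, 200.0, 400.0])   # actual values (toy)
p = np.array([450.0, 330.0, 180.0, 440.0])   # predicted values (toy)
a_bar = a.mean()

rse = np.sum((p - a) ** 2) / np.sum((a - a_bar) ** 2)
rae = np.sum(np.abs(p - a)) / np.sum(np.abs(a - a_bar))
rho = np.corrcoef(p, a)[0, 1]                # equals S_PA / sqrt(S_P * S_A)
print(f"RSE={rse:.3f}  RAE={rae:.3f}  correlation={rho:.3f}")
```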

Classification and Prediction: Evaluation

Page 44: Dbm630 lecture08

Practice 1

Classification and Prediction: Evaluation 46

A company wants to do a mail marketing campaign. It costs the company $1 for each item mailed. They have information on 100,000 customers. Create a cumulative gains and a lift chart from the following data.

Page 45: Dbm630 lecture08

Results

Classification and Prediction: Evaluation 47

Page 46: Dbm630 lecture08

Practice 2

Classification and Prediction: Evaluation 48

Using the response model P(x) = Salary(x)/1000 + Age(x) for customer x and the data table shown below, construct the cumulative gains chart, lift chart and ROC curve. Ties in ranking should be broken by assigning the higher rank to whoever appears first in the table.

Customer Name   Salary   Age   Actual Response   P(x)   Rank
A               10000    39    N
B               50000    21    Y
C               65000    25    Y
D               62000    30    Y
E               67000    19    Y
F               69000    48    N
G               65000    12    Y
H               64000    51    N
I               71000    65    Y
J               73000    42    N

(The P(x) and Rank columns are to be filled in.)
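A minimal sketch of the scoring and ranking step of this practice: P(x) = Salary(x)/1000 + Age(x), ranked from highest to lowest score, with ties broken by order of appearance in the table.

```python
# Hedged sketch: compute P(x) and ranks for the Practice 2 customers.
customers = [  # (name, salary, age, actual_response)
    ("A", 10000, 39, "N"), ("B", 50000, 21, "Y"), ("C", 65000, 25, "Y"),
    ("D", 62000, 30, "Y"), ("E", 67000, 19, "Y"), ("F", 69000, 48, "N"),
    ("G", 65000, 12, "Y"), ("H", 64000, 51, "N"), ("I", 71000, 65, "Y"),
    ("J", 73000, 42, "N"),
]

scored = [(name, salary / 1000 + age, resp) for name, salary, age, resp in customers]
ranked = sorted(scored, key=lambda t: -t[1])   # stable sort keeps table order on ties
for rank, (name, px, resp) in enumerate(ranked, start=1):
    print(rank, name, px, resp)
```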

Page 47: Dbm630 lecture08

49

[Figure: Cumulative Gains Chart (x-axis = % of samples from 10% to 100%; y-axis = % positive responses from 0% to 100%, with baseline) and Lift Chart (x-axis = % of samples from 10% to 100%; y-axis = lift from 0.00 to 2.00, with baseline).]

Customer Name   Salary   Age   Actual Response   P(x)   Rank
A               10000    39    N                  49    10
B               50000    21    Y                  71     9
C               65000    25    Y                  90     6
D               62000    30    Y                  92     5
E               67000    19    Y                  86     7
F               69000    48    N                 117     2
G               65000    12    Y                  77     8
H               64000    51    N                 115     3
I               71000    65    Y                 136     1
J               73000    42    N                 115     4

Ordered by P(x):

Customer Name   Actual Response   P(x)   Rank
I               Y                 136     1
F               N                 117     2
H               N                 115     3
J               N                 115     4
D               Y                  92     5
C               Y                  90     6
E               Y                  86     7
G               Y                  77     8
B               Y                  71     9
A               N                  49    10

Page 48: Dbm630 lecture08

50

[Figure: ROC Curve; x-axis = FPR (0% to 100%), y-axis = TPR (0% to 100%).]

Customer Name   Actual Response   #Sample   TP   FP   TPR        FPR
I               Y                  1         1    0    16.67%      0.00%
F               N                  2         1    1    16.67%     25.00%
H               N                  3         1    2    16.67%     50.00%
J               N                  4         1    3    16.67%     75.00%
D               Y                  5         2    3    33.33%     75.00%
C               Y                  6         3    3    50.00%     75.00%
E               Y                  7         4    3    66.67%     75.00%
G               Y                  8         5    3    83.33%     75.00%
B               Y                  9         6    3   100.00%     75.00%
A               N                 10         6    4   100.00%    100.00%

TPR = TP / 6 (total positive responses); FPR = FP / 4 (total negative responses).

Page 49: Dbm630 lecture08

51

*Predicting performance

Assume the estimated error rate is 25%. How close is this to the true error rate? That depends on the amount of test data.

Prediction is just like tossing a biased (!) coin: "head" is a "success", "tail" is an "error".

In statistics, a succession of independent events like this is called a Bernoulli process

Statistical theory provides us with confidence intervals for the true underlying proportion!

Classification and Prediction: Evaluation

Page 50: Dbm630 lecture08

52

*Confidence intervals

We can say: p lies within a certain specified interval with a certain specified confidence

Example: S=750 successes in N=1000 trials

Estimated success rate: 75%

How close is this to true success rate p?

Answer: with 80% confidence, p ∈ [73.2%, 76.7%]

Another example: S=75 and N=100

Estimated success rate: 75%

With 80% confidence, p ∈ [69.1%, 80.1%]

Classification and Prediction: Evaluation

Page 51: Dbm630 lecture08

53

*Mean and variance (also Mod 7)

Mean and variance for a Bernoulli trial: p, p (1–p)

Expected success rate f=S/N

Mean and variance for f : p, p (1–p)/N

For large enough N, f follows a Normal distribution

A c% confidence interval $[-z \le X \le z]$ for a random variable X with 0 mean is given by:

$\Pr[-z \le X \le z] = c$

With a symmetric distribution:

$\Pr[-z \le X \le z] = 1 - 2\Pr[X \ge z]$

Classification and Prediction: Evaluation

Page 52: Dbm630 lecture08

54

*Confidence limits

Confidence limits for the normal distribution with 0 mean and a variance of 1:

Pr[X ≥ z]    z
0.1%         3.09
0.5%         2.58
1%           2.33
5%           1.65
10%          1.28
20%          0.84
40%          0.25

Thus: $\Pr[-1.65 \le X \le 1.65] = 90\%$

To use this we have to reduce our random variable f to have 0 mean and unit variance

Classification and Prediction: Evaluation

Page 53: Dbm630 lecture08

55

*Transforming f

Transformed value for f (i.e. subtract the mean and divide by the standard deviation):

$\frac{f - p}{\sqrt{p(1 - p)/N}}$

Resulting equation:

$\Pr\left[-z \le \frac{f - p}{\sqrt{p(1 - p)/N}} \le z\right] = c$

Solving for p:

$p = \left( f + \frac{z^2}{2N} \pm z \sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}} \right) \Bigg/ \left( 1 + \frac{z^2}{N} \right)$
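A minimal sketch of the interval formula above, checked against the examples on the next slide (f = 0.75, z = 1.28 for 80% confidence).

```python
# Hedged sketch: confidence interval for the true success rate p given observed rate f.
import math

def confidence_interval(f: float, n: int, z: float):
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

for n in (1000, 100, 10):
    lo, hi = confidence_interval(0.75, n, 1.28)
    print(f"N={n}: p in [{lo:.3f}, {hi:.3f}]")
# N=1000 -> about [0.732, 0.767]; N=100 -> [0.691, 0.801]; N=10 -> [0.549, 0.881]
```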

Classification and Prediction: Evaluation

Page 54: Dbm630 lecture08

56

*Examples

f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767]

f = 75%, N = 100, c = 80% (so that z = 1.28): p ∈ [0.691, 0.801]

Note that the normal distribution assumption is only valid for large N (i.e. N > 100)

f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881] (should be taken with a grain of salt)

Classification and Prediction: Evaluation