dbm630 lecture08
TRANSCRIPT
DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 8: Classification and Prediction Evaluation
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
Train, Test and Validation sets
Evaluation on Large data / Unbalanced data
Evaluation on Small data: Cross validation, Bootstrap
Comparing data mining schemes: Significance test, Lift Chart / ROC curve
Numeric Prediction Evaluation
Evaluation in Classification Tasks
How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data
Q: Why?
A: Because new data will probably not be exactly the same as the training data!
Overfitting – fitting the training data too precisely - usually leads to poor results on new data
Model Selection and Bias-Variance Tradeoff
[Figure: typical behavior of the test and training error as model complexity is varied. Prediction error (y-axis) against model complexity from low to high (x-axis); the training-sample error keeps decreasing while the test-sample error is U-shaped. Low complexity: high bias, low variance; high complexity: low bias, high variance.]
Classifier error rate
Natural performance measure for classification problems: error rate
Success: instance's class is predicted correctly
Error: instance's class is predicted incorrectly
Error rate: proportion of errors made over the whole set of instances
Resubstitution error: error rate on training data (a too-optimistic measure!)
Generalization error: error rate on test data
Example: Accuracy vs. Error rate (1 - Accuracy)
Accuracy: 85% -> 87%, i.e. a 2% improvement (2.35% relative improvement rate)
Error rate: 15% -> 13%, i.e. a 2% error reduction (13.3% relative error reduction rate)
Evaluation on LARGE data
If many (thousands) of examples are available, including several hundred examples from each class, then how can we evaluate our classifier model?
A simple evaluation is sufficient
For example, randomly split data into training and test sets (usually 2/3 for train, 1/3 for test)
Build a classifier using the train set and evaluate it using the test set.
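A minimal sketch of this split-and-evaluate procedure, assuming scikit-learn is available (the decision tree learner and the synthetic make_classification data are placeholders, not part of the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a large labeled data set
X, y = make_classification(n_samples=3000, random_state=0)

# Randomly split: 2/3 for training, 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # build on the train set
accuracy = model.score(X_test, y_test)                                # evaluate on the test set
print("accuracy:", accuracy, "error rate:", 1 - accuracy)
```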
Classification Step 1: Split data into train and test sets
[Diagram: historical data with results known (labels + + - - +, "THE PAST") is split into a training set and a testing set.]
Classification Step 2: Build a model on a training set
[Diagram: the training set (results known) is fed to a model builder to produce a model; the testing set is held aside.]
Classification Step 3: Evaluate on test set (and maybe re-train)
[Diagram: the model built from the training set makes +/- predictions on the testing set; the predictions are compared with the known results (Y/N) and the feedback may be used to re-train the model builder.]
A note on parameter tuning
It is important that the test data is not used in any way to create the classifier
Some learning schemes operate in two stages:
Stage 1: builds the basic structure
Stage 2: optimizes parameter settings
The test data can’t be used for parameter tuning!
Proper procedure uses three sets: training data, validation data, and test data
Validation data is used to optimize parameters
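A minimal sketch of the three-set procedure, again assuming scikit-learn; the max_depth grid and the decision tree are illustrative assumptions only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)

# Carve off the final test set first, then split the rest into train / validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Stage 2: optimize a parameter using the VALIDATION set only
best_depth, best_acc = None, 0.0
for depth in (2, 4, 8, 16):
    acc = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Final model: refit with the chosen parameter, then touch the test set exactly once
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_rest, y_rest)
print("chosen max_depth:", best_depth, "test accuracy:", final.score(X_test, y_test))
```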
Classification: Train, Validation, Test split
[Diagram: data with results known is split into a training set, a validation set, and a final test set. The model builder learns from the training set; its +/- predictions on the validation set are evaluated to optimize parameters; the final model is then evaluated once on the final test set.]
Unbalanced data
Sometimes, classes have very unequal frequency
Accommodation prediction: 97% stay, 3% don’t stay
medical diagnosis: 90% healthy, 10% disease
eCommerce: 99% don’t buy, 1% buy
Security: >99.99% of Americans are not terrorists
Similar situation with multiple classes
Majority class classifier can be 97% correct, but useless
Solution: With two classes, a good approach is to build BALANCED train and test sets, and train the model on a balanced set:
randomly select the desired number of minority-class instances
add an equal number of randomly selected majority-class instances
That is, we ignore the effect of the number of instances for each class (see the sketch below).
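One possible sketch of building such a balanced set by undersampling the majority class (plain NumPy; the 0/1 labels and the helper name are assumptions):

```python
import numpy as np

def balanced_indices(y, seed=0):
    """Indices of a balanced subset: all minority-class instances plus an
    equal number of randomly chosen majority-class instances."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    keep = rng.choice(majority, size=len(minority), replace=False)
    return np.concatenate([minority, keep])

y = np.array([0] * 990 + [1] * 10)   # e.g. eCommerce: 99% don't buy, 1% buy
idx = balanced_indices(y)
print(len(idx), y[idx].mean())       # 20 instances, 50% positive
```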
Evaluation on SMALL data
The holdout method reserves a certain amount for testing and uses the remainder for training
Usually: one third for testing, the rest for training
For "unbalanced" datasets, samples might not be representative: few or no instances of some classes
Stratified sample: advanced version of balancing the data; make sure that each class is represented with approximately equal proportions in both subsets
What if we have a small data set?
The chosen 2/3 for training may not be representative.
The chosen 1/3 for testing may not be representative.
Repeated Holdout Method
Holdout estimate can be made more reliable by repeating the process with different subsamples
In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
The error rates on the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method
Still not optimum: the different test sets overlap.
Can we prevent overlapping?
Cross-validation
Cross-validation avoids overlapping test sets
First step: data is split into k subsets of equal size
Second step: each subset in turn is used for testing and the remainder for training
This is called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
The error estimates are averaged to yield an overall error estimate
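A minimal sketch of stratified k-fold cross-validation, assuming scikit-learn; the learner is a placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Stratified 10-fold cross-validation; the ten fold estimates are averaged
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("estimated accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
print("estimated error rate: %.3f" % (1 - scores.mean()))
```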
Cross-validation Example
Break up data into groups of the same size (possibly with stratification)
Hold aside one group for testing and use the rest to build the model
Repeat with a different group held out as test data until every group has been used for testing
[Diagram: the data shown as equal-sized blocks; in each iteration one block is the Test set and the remaining blocks form the Train set.]
More on cross-validation
Standard method for evaluation: stratified ten-fold cross-validation
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
Stratification reduces the estimate’s variance
Even better: repeated stratified cross-validation
E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
Leave-One-Out cross-validation
Leave-One-Out: remove one instance for testing and use the others for training; a particular form of cross-validation:
Set the number of folds equal to the number of training instances
i.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Very computationally expensive
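A corresponding sketch for leave-one-out, under the same scikit-learn assumption (note the small n, since n classifiers must be built):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)   # a small data set

# n folds for n instances: the classifier is built n times
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("LOO accuracy estimate:", scores.mean())   # each fold score is 0 or 1
```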
Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible
It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: random dataset split equally into two classes
Best inducer predicts majority class
50% accuracy on fresh data
Leave-One-Out-CV estimate is 100% error!
*The bootstrap
CV uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set
The bootstrap uses sampling with replacement to form the training set
Sample a dataset of n instances n times with replacement to form a new dataset of n instances
Use this data as the training set
Use the instances from the original dataset that do not occur in the new training set for testing
Evaluating the Accuracy of a Classifier or Predictor
Bootstrap method
The training tuples are sampled uniformly with replacement
Each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
There are several bootstrap methods; the commonly used one is the .632 bootstrap, which works as follows
Given a data set of d tuples
The data set is sampled d times, with replacement, resulting in a bootstrap training set of d samples
It is very likely that some of the original data tuples will occur more than once in this sample
The data tuples that did not make it into the training set end up forming the test set
Suppose we try this out several times; on average 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set
*The 0.632 bootstrap
Also called the 0.632 bootstrap
For n instances, a particular instance has a probability of 1–1/n of not being picked
Thus its probability of ending up in the test data is:
$\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$
This means the training data will contain approximately 63.2% of the instances
Note: e is an irrational constant approximately equal to 2.718281828
*Estimating error with the bootstrap
The error estimate on the test data will be very pessimistic
Trained on just ~63% of the instances
Therefore, combine it with the re-substitution error:
The re-substitution error gets less weight than the error on the test data
Repeat process several times with different replacement samples; average the results
$err = 0.632 \times error_{test\_instances} + 0.368 \times error_{training\_instances}$
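A rough sketch of this .632 bootstrap estimate, under the same scikit-learn assumption (the 50 bootstrap repetitions and the learner are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)
n = len(X)

estimates = []
for _ in range(50):                               # repeat with different replacement samples
    boot = rng.integers(0, n, size=n)             # sample n instances WITH replacement
    oob = np.setdiff1d(np.arange(n), boot)        # instances not drawn form the test set
    model = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    err_test = 1 - model.score(X[oob], y[oob])
    err_train = 1 - model.score(X[boot], y[boot]) # resubstitution error
    estimates.append(0.632 * err_test + 0.368 * err_train)

print(".632 bootstrap error estimate:", np.mean(estimates))
```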
*More on the bootstrap
Probably the best way of estimating performance for very small datasets
However, it has some problems
Consider the random dataset from above
A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data
Bootstrap estimate for this classifier:
$err = 0.632 \times 50\% + 0.368 \times 0\% = 31.6\%$
True expected error: 50%
Evaluating Two-class Classification (Lift Chart vs. ROC Curve)
Information Retrieval or Search Engine
An application to find a set of related documents given a set of keywords.
Hard Decision vs. Soft Decision
Focus on soft decision
Multiclass
Class probability (Ranking)
Class by class evaluation
Example: promotional mailout
Situation 1: classifier predicts that 0.1% of all households will respond
Situation 2: classifier predicts that 0.4% of the 100000 most promising households will respond
Confusion Matrix (Two-class)
Also called contingency table
                         Actual Class
                         Yes               No
Predicted Class   Yes    True Positive     False Positive
                  No     False Negative    True Negative
Measures in Information Retrieval
precision: percentage of retrieved documents that are relevant
recall: percentage of relevant documents that are returned
F-measure: the combined measure of recall and precision
Precision/recall curves have hyperbolic shape
Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
$precision = \frac{TP}{TP + FP}$
$recall = \frac{TP}{TP + FN}$
$F\text{-}measure = \frac{2 \times recall \times precision}{recall + precision}$
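A small sketch of these three measures computed directly from confusion-matrix counts (the example counts are hypothetical):

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

# e.g. 40 relevant documents retrieved, 10 irrelevant retrieved, 20 relevant missed
print(prf(tp=40, fp=10, fn=20))   # -> (0.8, 0.666..., 0.727...)
```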
Measures in Two-Class Classification
For the positive class:
$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
$precision = \frac{TP}{TP + FP}$
$recall = \frac{TP}{TP + FN}$
For the negative class:
$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
$precision = \frac{TN}{TN + FN}$
$recall = \frac{TN}{TN + FP}$
Usually, we focus only on the positive class ("True" cases or "Yes" cases); therefore, only the precision and recall of the positive class are used for performance comparison
$TP\ Rate = \frac{TP}{TP + FN} = recall$
$FP\ Rate = \frac{FP}{FP + TN}$
Confusion matrix: Example
Actual \ Predicted     Buys_computer = yes   Buys_computer = no   Total
Buys_computer = yes    6,954                 46                   7,000
Buys_computer = no     412                   2,588                3,000
Total                  7,366                 2,634                10,000
No. of tuples of class buys_computer=yes that were labeled by the classifier as class buys_computer=no: FN (= 46)
No. of tuples of class buys_computer=no that were labeled by the classifier as class buys_computer=yes: FP (= 412)
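Plugging the counts from this table into the positive-class measures defined earlier (TP = 6,954, FN = 46, FP = 412, TN = 2,588) gives, as a quick check:

$accuracy = \frac{6954 + 2588}{10000} = 95.42\%$, $precision = \frac{6954}{6954 + 412} \approx 94.4\%$, $recall = \frac{6954}{7000} \approx 99.3\%$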
Cumulative Gains Chart/Lift Chart/ ROC curve
They are visual aids for measuring model performance
Cumulative Gains is a measure of the effectiveness of a predictive model in terms of TP Rate (% true responses)
Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model
Cumulative gains and lift charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.
ROC is a measure of the effectiveness of a predictive model in terms of % positive responses against % negative responses, i.e. TP Rate against FP Rate. The greater the area under the ROC curve, the better the model.
Generating Charts
Instances are sorted according to their predicted probability of being a true positive:
Cumulative Gains Chart
The x-axis shows the percentage of samples
The y-axis shows the percentage of positive responses (or true positive rate). This is a percentage of the total possible positive responses
Baseline (overall response rate): If we sample X% of data then we will receive X% of the total positive responses.
Lift Curve: Using the predictions of the response model, calculate the percentage of positive responses for the percent of customers contacted and map these points to create the lift curve.
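A possible sketch of this sorting-and-accumulating step in NumPy (the function name and the toy scores are assumptions, not the lecture's notation):

```python
import numpy as np

def cumulative_gains(y_true, scores):
    """Fraction of all positive responses captured when taking the top-ranked instances."""
    order = np.argsort(-np.asarray(scores))        # sort by predicted probability, descending
    hits = np.cumsum(np.asarray(y_true)[order])    # cumulative true positives
    pct_sampled = np.arange(1, len(order) + 1) / len(order)
    pct_positives = hits / hits[-1]                # fraction of the total positive responses
    return pct_sampled, pct_positives

# Toy example: 10 instances, 4 responders; the baseline is pct_positives == pct_sampled
y = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(cumulative_gains(y, p))
```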
A Sample Cumulative Gains Chart
[Chart: x-axis = % of samples (10% ... 100%); y-axis = TP Rate / % positive responses (0-100%). The gains curve lies above the baseline; Baseline: %positive_responses = %sample_size.]
Lift Chart
The x-axis shows the percentage of samples
The y-axis shows the ratio of true positives with model and without model
To plot the chart: Calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model
Example: For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders. The y-value of the lift curve at 10% is 30 / 10 = 3.
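The lift values then follow by dividing each gains value by the corresponding fraction sampled; a short continuation reusing the hypothetical cumulative_gains helper from the previous sketch:

```python
pct_sampled, pct_positives = cumulative_gains(y, p)
lift = pct_positives / pct_sampled     # e.g. 30% of responders in the top 10% of samples -> lift 3
print(list(zip(pct_sampled, lift)))
```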
A Sample Lift Chart
[Chart: x-axis = % of samples (10% ... 100%); y-axis = lift, the ratio of true positives with the model to true positives without it. The lift curve starts high and declines toward the no-model baseline (lift = 1) as the sample grows.]
ROC Curve
ROC curves are similar to lift charts
“ROC” stands for “receiver operating characteristic”
Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel
Differences:
y axis shows percentage of true positives in sample
x axis shows percentage of false positives in sample (rather than sample size)
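A minimal sketch of computing these ROC points by ranking instances on their predicted score (pure NumPy; the toy labels and scores are made up for illustration):

```python
import numpy as np

def roc_points(y_true, scores):
    """TP Rate and FP Rate after each instance, ranked by predicted score (descending)."""
    y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]
    tp = np.cumsum(y)                  # true positives seen so far
    fp = np.cumsum(1 - y)              # false positives seen so far
    return fp / (len(y) - y.sum()), tp / y.sum()   # FPR (x-axis), TPR (y-axis)

fpr, tpr = roc_points([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.4, 0.2])
print(list(zip(fpr, tpr)))
```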
A Sample ROC Curve
[Chart: True Positive Rate (y-axis) against False Positive Rate (x-axis); annotations mark points with 400 and 1,000 responses in the promotional mailout example. Baseline: TP Rate = FP Rate.]
Extending to Multiple-Class Classification
[Table: an n x n confusion matrix. Rows = actual classes C1..Cn, columns = predicted classes C1..Cn, entries n_ij; row sums R_1..R_n, column sums P_1..P_n, grand total T. The diagonal entries n_11, n_22, ..., n_nn count the correctly classified instances; recall of class C_i is n_ii / R_i and precision of class C_i is n_ii / P_i.]
Measures in Multiple-Class Classification
$accuracy = \frac{n_{11} + n_{22} + \cdots + n_{nn}}{T}$
$recall(C_i) = \frac{n_{ii}}{R_i}$
$precision(C_i) = \frac{n_{ii}}{P_i}$
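A small sketch of these per-class measures on a hypothetical 3-class confusion matrix (the counts are invented for illustration):

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = actual classes, columns = predicted classes
cm = np.array([[50,  3,  2],
               [ 4, 60,  6],
               [ 1,  5, 44]])

R = cm.sum(axis=1)            # row sums R_i (actual totals per class)
P = cm.sum(axis=0)            # column sums P_i (predicted totals per class)
T = cm.sum()                  # grand total

accuracy = np.trace(cm) / T   # (n11 + n22 + ... + nnn) / T
recall = np.diag(cm) / R      # recall(Ci) = n_ii / R_i
precision = np.diag(cm) / P   # precision(Ci) = n_ii / P_i
print(accuracy, recall, precision)
```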
Numeric Prediction Evaluation
Same strategies: independent test set, cross-validation, significance tests, etc.
Difference: error measures or generalization error
Actual target values: a1, a2,…, an
Predicted target values: p1, p2,…, pn
Most popular measure: mean-squared error
Easy to manipulate mathematically
$MSE = \frac{(p_1 - a_1)^2 + \cdots + (p_n - a_n)^2}{n}$
Other Measures
The root mean-squared (rms) error:
$RMSE = \sqrt{\frac{(p_1 - a_1)^2 + \cdots + (p_n - a_n)^2}{n}}$
The mean absolute error is less sensitive to outliers than the mean-squared error:
$MAE = \frac{|p_1 - a_1| + \cdots + |p_n - a_n|}{n}$
Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)
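A minimal sketch of these three error measures in NumPy (the predicted and actual values are hypothetical):

```python
import numpy as np

def mse(p, a):  return np.mean((np.asarray(p, float) - np.asarray(a, float)) ** 2)
def rmse(p, a): return np.sqrt(mse(p, a))
def mae(p, a):  return np.mean(np.abs(np.asarray(p, float) - np.asarray(a, float)))

predicted = [500, 300, 120]   # hypothetical predictions p_1..p_n
actual    = [550, 280, 100]   # hypothetical actual values a_1..a_n
print(mse(predicted, actual), rmse(predicted, actual), mae(predicted, actual))
```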
Improvement on the Mean
Often we want to know how much the scheme improves on simply predicting the average
The relative squared error is ($\bar{a}$ is the average of the actual values):
$RSE = \frac{(p_1 - a_1)^2 + \cdots + (p_n - a_n)^2}{(a_1 - \bar{a})^2 + \cdots + (a_n - \bar{a})^2}$
The relative absolute error is:
$RAE = \frac{|p_1 - a_1| + \cdots + |p_n - a_n|}{|a_1 - \bar{a}| + \cdots + |a_n - \bar{a}|}$
The Correlation Coefficient
Measures the statistical correlation between the predicted values and the actual values
Scale independent, between -1 and +1
Good performance leads to large values!
$\rho = \frac{S_{PA}}{\sqrt{S_P S_A}}$
where $S_{PA} = \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}{n - 1}$, $S_P = \frac{\sum_i (p_i - \bar{p})^2}{n - 1}$, $S_A = \frac{\sum_i (a_i - \bar{a})^2}{n - 1}$
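A sketch of the relative errors and the correlation coefficient from the last two slides, using the same hypothetical values (plain NumPy; the function names are assumptions):

```python
import numpy as np

def relative_squared_error(p, a):
    p, a = np.asarray(p, float), np.asarray(a, float)
    return np.sum((p - a) ** 2) / np.sum((a - a.mean()) ** 2)

def relative_absolute_error(p, a):
    p, a = np.asarray(p, float), np.asarray(a, float)
    return np.sum(np.abs(p - a)) / np.sum(np.abs(a - a.mean()))

def correlation(p, a):
    p, a = np.asarray(p, float), np.asarray(a, float)
    s_pa = np.sum((p - p.mean()) * (a - a.mean())) / (len(p) - 1)
    s_p  = np.sum((p - p.mean()) ** 2) / (len(p) - 1)
    s_a  = np.sum((a - a.mean()) ** 2) / (len(a) - 1)
    return s_pa / np.sqrt(s_p * s_a)

predicted = [500, 300, 120, 210]   # hypothetical values
actual    = [550, 280, 100, 200]
print(relative_squared_error(predicted, actual),
      relative_absolute_error(predicted, actual),
      correlation(predicted, actual))
```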
Practice 1
A company wants to do a mail marketing campaign. It costs the company $1 for each item mailed. They have information on 100,000 customers. Create a cumulative gains and a lift chart from the following data.
Results
Practice 2
Using the response model P(x) = Salary(x)/1000 + Age(x) for customer x and the data table shown below, construct the cumulative gains, lift charts and ROC curves. Ties in ranking should be arbitrarily broken by assigning a higher rank to the customer who appears first in the table.
Customer Name   Salary   Age   Actual Response   P(x)   Rank
A               10000    39    N
B               50000    21    Y
C               65000    25    Y
D               62000    30    Y
E               67000    19    Y
F               69000    48    N
G               65000    12    Y
H               64000    51    N
I               71000    65    Y
J               73000    42    N
[Answer chart, Lift Chart: x-axis = % of samples (10% ... 100%); y-axis = lift (0.00 - 2.00); lift curve and baseline shown.]
[Answer chart, Cumulative Gains Chart: x-axis = % of samples (10% ... 100%); y-axis = % positive responses (0% - 100%); gains curve and baseline shown.]
Customer Name   Salary   Age   Actual Response   P(x)   Rank
A               10000    39    N                 49     10
B               50000    21    Y                 71     9
C               65000    25    Y                 90     6
D               62000    30    Y                 92     5
E               67000    19    Y                 86     7
F               69000    48    N                 117    2
G               65000    12    Y                 77     8
H               64000    51    N                 115    3
I               71000    65    Y                 136    1
J               73000    42    N                 115    4
Ordered by P(x):
Customer Name   Actual Response   P(x)   Rank
I               Y                 136    1
F               N                 117    2
H               N                 115    3
J               N                 115    4
D               Y                 92     5
C               Y                 90     6
E               Y                 86     7
G               Y                 77     8
B               Y                 71     9
A               N                 49     10
(The Cumulative Gains Chart and Lift Chart above are plotted from this ranking.)
ROC Curve
[Answer chart, ROC Curve: TPR (y-axis) against FPR (x-axis), plotted from the table below.]
Customer Name   Actual Response   #Sample   TP   FP   TPR       FPR
I               Y                 1         1    0    16.67%    0.00%
F               N                 2         1    1    16.67%    25.00%
H               N                 3         1    2    16.67%    50.00%
J               N                 4         1    3    16.67%    75.00%
C               Y                 5         2    3    33.33%    75.00%
E               Y                 6         3    3    50.00%    75.00%
D               Y                 7         4    3    66.67%    75.00%
G               Y                 8         5    3    83.33%    75.00%
B               Y                 9         6    3    100.00%   75.00%
A               N                 10        6    4    100.00%   100.00%
*Predicting performance
Assume the estimated error rate is 25%. How close is this to the true error rate? Depends on the amount of test data
Prediction is just like tossing a biased (!) coin: "head" is a "success", "tail" is an "error"
In statistics, a succession of independent events like this is called a Bernoulli process
Statistical theory provides us with confidence intervals for the true underlying proportion!
*Confidence intervals
We can say: p lies within a certain specified interval with a certain specified confidence
Example: S=750 successes in N=1000 trials
Estimated success rate: 75%
How close is this to true success rate p?
Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
Another example: S=75 and N=100
Estimated success rate: 75%
With 80% confidence, p ∈ [69.1%, 80.1%]
*Mean and variance (also Mod 7)
Mean and variance for a Bernoulli trial: p, p (1–p)
Expected success rate f=S/N
Mean and variance for f : p, p (1–p)/N
For large enough N, f follows a Normal distribution
A c% confidence interval [-z ≤ X ≤ z] for a random variable X with 0 mean is given by:
$\Pr[-z \le X \le z] = c$
With a symmetric distribution:
$\Pr[-z \le X \le z] = 1 - 2 \Pr[X \ge z]$
*Confidence limits
Confidence limits for the normal distribution with 0 mean and a variance of 1:
Pr[X ≥ z]    z
0.1%         3.09
0.5%         2.58
1%           2.33
5%           1.65
10%          1.28
20%          0.84
40%          0.25
Thus: $\Pr[-1.65 \le X \le 1.65] = 90\%$
To use this we have to reduce our random variable f to have 0 mean and unit variance
*Transforming f
Transformed value for f: $\dfrac{f - p}{\sqrt{p(1-p)/N}}$ (i.e. subtract the mean and divide by the standard deviation)
Resulting equation: $\Pr\!\left[-z \le \dfrac{f - p}{\sqrt{p(1-p)/N}} \le z\right] = c$
Solving for p: $p = \left(f + \dfrac{z^2}{2N} \pm z\sqrt{\dfrac{f}{N} - \dfrac{f^2}{N} + \dfrac{z^2}{4N^2}}\right) \bigg/ \left(1 + \dfrac{z^2}{N}\right)$
*Examples
f = 75%, N = 1000, c = 80% (so that z = 1.28): $p \in [0.732, 0.767]$
f = 75%, N = 100, c = 80% (so that z = 1.28): $p \in [0.691, 0.801]$
Note that normal distribution assumption is only valid for large N (i.e. N > 100)
f = 75%, N = 10, c = 80% (so that z = 1.28): $p \in [0.549, 0.881]$ (should be taken with a grain of salt)
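A small sketch of this interval formula; with f = 0.75 and z = 1.28 it should reproduce the three intervals above (up to rounding):

```python
from math import sqrt

def confidence_interval(f, N, z):
    """Interval for the true success rate p, given observed success rate f on N test instances."""
    centre = f + z * z / (2 * N)
    spread = z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom  = 1 + z * z / N
    return (centre - spread) / denom, (centre + spread) / denom

# z = 1.28 corresponds to c = 80% (10% in each tail)
for N in (1000, 100, 10):
    print(N, confidence_interval(0.75, N, 1.28))
```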