dbm630 lecture08
TRANSCRIPT
DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 8: Classification and Prediction Evaluation
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
Train, Test and Validation sets
Evaluation on Large data / Unbalanced data
Evaluation on Small data: Cross validation, Bootstrap
Comparing data mining schemes: Significance test, Lift Chart / ROC curve
Numeric Prediction Evaluation
Evaluation in Classification Tasks
How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data
Q: Why?
A: Because new data will probably not be exactly the same as the training data!
Overfitting – fitting the training data too precisely - usually leads to poor results on new data
Model Selection and Bias-Variance Tradeoff
[Figure: typical behavior of the test and training error as model complexity is varied. Prediction error (y-axis) against model complexity from low to high (x-axis); the training-sample error keeps decreasing while the test-sample error is U-shaped. Low complexity: high bias, low variance; high complexity: low bias, high variance.]
Classifier error rate
Natural performance measure for classification problems: error rate
Success: instance's class is predicted correctly
Error: instance's class is predicted incorrectly
Error rate: proportion of errors made over the whole set of instances
Resubstitution error: error rate on training data (a too-optimistic measure!)
Generalization error: error rate on test data
Example: Accuracy vs. Error rate (1 - Accuracy)
Accuracy: 85% -> 87%, i.e. a 2% improvement (2.35% relative improvement rate)
Error rate: 15% -> 13%, i.e. a 2% error reduction (13.3% relative error reduction rate)
Evaluation on LARGE data
If many (thousands) of examples are available, including several hundred examples from each class, then how can we evaluate our classifier model?
A simple evaluation is sufficient
For example, randomly split data into training and test sets (usually 2/3 for train, 1/3 for test)
Build a classifier using the train set and evaluate it using the test set.
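A minimal sketch of this split-and-evaluate procedure, assuming scikit-learn is available (the decision tree learner and the synthetic make_classification data are placeholders, not part of the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a large labeled data set
X, y = make_classification(n_samples=3000, random_state=0)

# Randomly split: 2/3 for training, 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # build on the train set
accuracy = model.score(X_test, y_test)                                # evaluate on the test set
print("accuracy:", accuracy, "error rate:", 1 - accuracy)
```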
Classification Step 1: Split data into train and test sets
[Diagram: historical data with results known (labels + + - - +, "THE PAST") is split into a training set and a testing set.]
Classification Step 2: Build a model on a training set
[Diagram: the training set (results known) is fed to a model builder to produce a model; the testing set is held aside.]
Classification Step 3: Evaluate on test set (and maybe re-train)
[Diagram: the model built from the training set makes +/- predictions on the testing set; the predictions are compared with the known results (Y/N) and the feedback may be used to re-train the model builder.]
A note on parameter tuning
It is important that the test data is not used in any way to create the classifier
Some learning schemes operate in two stages:
Stage 1: builds the basic structure
Stage 2: optimizes parameter settings
The test data can’t be used for parameter tuning!
Proper procedure uses three sets: training data, validation data, and test data
Validation data is used to optimize parameters
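A minimal sketch of the three-set procedure, again assuming scikit-learn; the max_depth grid and the decision tree are illustrative assumptions only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)

# Carve off the final test set first, then split the rest into train / validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Stage 2: optimize a parameter using the VALIDATION set only
best_depth, best_acc = None, 0.0
for depth in (2, 4, 8, 16):
    acc = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Final model: refit with the chosen parameter, then touch the test set exactly once
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_rest, y_rest)
print("chosen max_depth:", best_depth, "test accuracy:", final.score(X_test, y_test))
```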
Classification: Train, Validation, Test split
[Diagram: data with results known is split into a training set, a validation set, and a final test set. The model builder learns from the training set; its +/- predictions on the validation set are evaluated to optimize parameters; the final model is then evaluated once on the final test set.]
Unbalanced data
Sometimes, classes have very unequal frequency
Accommodation prediction: 97% stay, 3% don’t stay
medical diagnosis: 90% healthy, 10% disease
eCommerce: 99% don’t buy, 1% buy
Security: >99.99% of Americans are not terrorists
Similar situation with multiple classes
Majority class classifier can be 97% correct, but useless
Solution: With two classes, a good approach is to build BALANCED train and test sets, and train the model on a balanced set:
randomly select the desired number of minority-class instances
add an equal number of randomly selected majority-class instances
That is, we ignore the effect of the number of instances for each class (see the sketch below).
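One possible sketch of building such a balanced set by undersampling the majority class (plain NumPy; the 0/1 labels and the helper name are assumptions):

```python
import numpy as np

def balanced_indices(y, seed=0):
    """Indices of a balanced subset: all minority-class instances plus an
    equal number of randomly chosen majority-class instances."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    keep = rng.choice(majority, size=len(minority), replace=False)
    return np.concatenate([minority, keep])

y = np.array([0] * 990 + [1] * 10)   # e.g. eCommerce: 99% don't buy, 1% buy
idx = balanced_indices(y)
print(len(idx), y[idx].mean())       # 20 instances, 50% positive
```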
Evaluation on SMALL data
The holdout method reserves a certain amount for testing and uses the remainder for training
Usually: one third for testing, the rest for training
For "unbalanced" datasets, samples might not be representative: few or no instances of some classes
Stratified sample: advanced version of balancing the data; make sure that each class is represented with approximately equal proportions in both subsets
What if we have a small data set?
The chosen 2/3 for training may not be representative.
The chosen 1/3 for testing may not be representative.
Repeated Holdout Method
Holdout estimate can be made more reliable by repeating the process with different subsamples
In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
The error rates on the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method
Still not optimum: the different test sets overlap.
Can we prevent overlapping?
Cross-validation
Cross-validation avoids overlapping test sets
First step: data is split into k subsets of equal size
Second step: each subset in turn is used for testing and the remainder for training
This is called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
The error estimates are averaged to yield an overall error estimate
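A minimal sketch of stratified k-fold cross-validation, assuming scikit-learn; the learner is a placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Stratified 10-fold cross-validation; the ten fold estimates are averaged
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("estimated accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
print("estimated error rate: %.3f" % (1 - scores.mean()))
```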
Cross-validation Example
Break up data into groups of the same size (possibly with stratification)
Hold aside one group for testing and use the rest to build the model
Repeat with a different group held out as test data until every group has been used for testing
[Diagram: the data shown as equal-sized blocks; in each iteration one block is the Test set and the remaining blocks form the Train set.]
More on cross-validation
Standard method for evaluation: stratified ten-fold cross-validation
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
Stratification reduces the estimate’s variance
Even better: repeated stratified cross-validation
E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
Leave-One-Out cross-validation
Leave-One-Out: remove one instance for testing and use the others for training; a particular form of cross-validation:
Set the number of folds equal to the number of training instances
i.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Very computationally expensive
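A corresponding sketch for leave-one-out, under the same scikit-learn assumption (note the small n, since n classifiers must be built):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)   # a small data set

# n folds for n instances: the classifier is built n times
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("LOO accuracy estimate:", scores.mean())   # each fold score is 0 or 1
```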
Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible
It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: random dataset split equally into two classes
Best inducer predicts majority class
50% accuracy on fresh data
Leave-One-Out-CV estimate is 100% error!
*The bootstrap
CV uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set
The bootstrap uses sampling with replacement to form the training set
Sample a dataset of n instances n times with replacement to form a new dataset of n instances
Use this data as the training set
Use the instances from the original dataset that do not occur in the new training set for testing
Evaluating the Accuracy of a Classifier or Predictor
Bootstrap method
The training tuples are sampled uniformly with replacement
Each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
There are several bootstrap methods; the commonly used one is the .632 bootstrap, which works as follows
Given a data set of d tuples
The data set is sampled d times, with replacement, resulting in a bootstrap training set of d samples
It is very likely that some of the original data tuples will occur more than once in this sample
The data tuples that did not make it into the training set end up forming the test set
Suppose we try this out several times; on average 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set
*The 0.632 bootstrap
Also called the 0.632 bootstrap
For n instances, a particular instance has a probability of 1–1/n of not being picked
Thus its probability of ending up in the test data is:
$\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$
This means the training data will contain approximately 63.2% of the instances
Note: e is an irrational constant approximately equal to 2.718281828
*Estimating error with the bootstrap
The error estimate on the test data will be very pessimistic
Trained on just ~63% of the instances
Therefore, combine it with the re-substitution error:
The re-substitution error gets less weight than the error on the test data
Repeat process several times with different replacement samples; average the results
$err = 0.632 \times error_{test\_instances} + 0.368 \times error_{training\_instances}$
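A rough sketch of this .632 bootstrap estimate, under the same scikit-learn assumption (the 50 bootstrap repetitions and the learner are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)
n = len(X)

estimates = []
for _ in range(50):                               # repeat with different replacement samples
    boot = rng.integers(0, n, size=n)             # sample n instances WITH replacement
    oob = np.setdiff1d(np.arange(n), boot)        # instances not drawn form the test set
    model = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    err_test = 1 - model.score(X[oob], y[oob])
    err_train = 1 - model.score(X[boot], y[boot]) # resubstitution error
    estimates.append(0.632 * err_test + 0.368 * err_train)

print(".632 bootstrap error estimate:", np.mean(estimates))
```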
*More on the bootstrap
Probably the best way of estimating performance for very small datasets
However, it has some problems
Consider the random dataset from above
A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data
Bootstrap estimate for this classifier:
$err = 0.632 \times 50\% + 0.368 \times 0\% = 31.6\%$
True expected error: 50%
Evaluating Two-class Classification (Lift Chart vs. ROC Curve)
Information Retrieval or Search Engine
An application to find a set of related documents given a set of keywords.
Hard Decision vs. Soft Decision
Focus on soft decision
Multiclass
Class probability (Ranking)
Class by class evaluation
Example: promotional mailout
Situation 1: classifier predicts that 0.1% of all households will respond
Situation 2: classifier predicts that 0.4% of the 100000 most promising households will respond
Confusion Matrix (Two-class)
Also called contingency table
                         Actual Class
                         Yes               No
Predicted Class   Yes    True Positive     False Positive
                  No     False Negative    True Negative
Measures in Information Retrieval
precision: percentage of retrieved documents that are relevant
recall: percentage of relevant documents that are returned
F-measure: the combined measure of recall and precision
Precision/recall curves have hyperbolic shape
Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
$precision = \frac{TP}{TP + FP}$
$recall = \frac{TP}{TP + FN}$
$F\text{-}measure = \frac{2 \times recall \times precision}{recall + precision}$
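A small sketch of these three measures computed directly from confusion-matrix counts (the example counts are hypothetical):

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

# e.g. 40 relevant documents retrieved, 10 irrelevant retrieved, 20 relevant missed
print(prf(tp=40, fp=10, fn=20))   # -> (0.8, 0.666..., 0.727...)
```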
Measures in Two-Class Classification
For the positive class:
$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
$precision = \frac{TP}{TP + FP}$
$recall = \frac{TP}{TP + FN}$
For the negative class:
$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
$precision = \frac{TN}{TN + FN}$
$recall = \frac{TN}{TN + FP}$
Usually, we focus only on the positive class ("True" cases or "Yes" cases); therefore, only the precision and recall of the positive class are used for performance comparison
$TP\ Rate = \frac{TP}{TP + FN} = recall$
$FP\ Rate = \frac{FP}{FP + TN}$
Confusion matrix: Example
Actual \ Predicted     Buys_computer = yes   Buys_computer = no   Total
Buys_computer = yes    6,954                 46                   7,000
Buys_computer = no     412                   2,588                3,000
Total                  7,366                 2,634                10,000
No. of tuples of class buys_computer=yes that were labeled by the classifier as class buys_computer=no: FN (= 46)
No. of tuples of class buys_computer=no that were labeled by the classifier as class buys_computer=yes: FP (= 412)
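Plugging the counts from this table into the positive-class measures defined earlier (TP = 6,954, FN = 46, FP = 412, TN = 2,588) gives, as a quick check:

$accuracy = \frac{6954 + 2588}{10000} = 95.42\%$, $precision = \frac{6954}{6954 + 412} \approx 94.4\%$, $recall = \frac{6954}{7000} \approx 99.3\%$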
Cumulative Gains Chart/Lift Chart/ ROC curve
They are visual aids for measuring model performance
Cumulative Gains is a measure of the effectiveness of a predictive model in terms of TP Rate (% true responses)
Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model
Cumulative gains and lift charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model.
ROC is a measure of the effectiveness of a predictive model in terms of % positive responses against % negative responses, i.e. TP Rate against FP Rate. The greater the area under the ROC curve, the better the model.
Generating Charts
Instances are sorted according to their predicted probability of being a true positive:
Cumulative Gains Chart
The x-axis shows the percentage of samples
The y-axis shows the percentage of positive responses (or true positive rate). This is a percentage of the total possible positive responses
Baseline (overall response rate): If we sample X% of data then we will receive X% of the total positive responses.
Lift Curve: Using the predictions of the response model, calculate the percentage of positive responses for the percent of customers contacted and map these points to create the lift curve.
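A possible sketch of this sorting-and-accumulating step in NumPy (the function name and the toy scores are assumptions, not the lecture's notation):

```python
import numpy as np

def cumulative_gains(y_true, scores):
    """Fraction of all positive responses captured when taking the top-ranked instances."""
    order = np.argsort(-np.asarray(scores))        # sort by predicted probability, descending
    hits = np.cumsum(np.asarray(y_true)[order])    # cumulative true positives
    pct_sampled = np.arange(1, len(order) + 1) / len(order)
    pct_positives = hits / hits[-1]                # fraction of the total positive responses
    return pct_sampled, pct_positives

# Toy example: 10 instances, 4 responders; the baseline is pct_positives == pct_sampled
y = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(cumulative_gains(y, p))
```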
A Sample Cumulative Gains Chart
[Chart: x-axis = % of samples (10% ... 100%); y-axis = TP Rate / % positive responses (0-100%). The gains curve lies above the baseline; Baseline: %positive_responses = %sample_size.]
Lift Chart
The x-axis shows the percentage of samples
The y-axis shows the ratio of true positives with model and without model
To plot the chart: Calculate the points on the lift curve by determining the ratio between the result predicted by our model and the result using no model
Example: For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders. The y-value of the lift curve at 10% is 30 / 10 = 3.
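The lift values then follow by dividing each gains value by the corresponding fraction sampled; a short continuation reusing the hypothetical cumulative_gains helper from the previous sketch:

```python
pct_sampled, pct_positives = cumulative_gains(y, p)
lift = pct_positives / pct_sampled     # e.g. 30% of responders in the top 10% of samples -> lift 3
print(list(zip(pct_sampled, lift)))
```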
A Sample Lift Chart
[Chart: x-axis = % of samples (10% ... 100%); y-axis = lift, the ratio of true positives with the model to true positives without it. The lift curve starts high and declines toward the no-model baseline (lift = 1) as the sample grows.]
ROC Curve
ROC curves are similar to lift charts
“ROC” stands for “receiver operating characteristic”
Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel
Differences:
y axis shows percentage of true positives in sample
x axis shows percentage of false positives in sample (rather than sample size)
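A minimal sketch of computing these ROC points by ranking instances on their predicted score (pure NumPy; the toy labels and scores are made up for illustration):

```python
import numpy as np

def roc_points(y_true, scores):
    """TP Rate and FP Rate after each instance, ranked by predicted score (descending)."""
    y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]
    tp = np.cumsum(y)                  # true positives seen so far
    fp = np.cumsum(1 - y)              # false positives seen so far
    return fp / (len(y) - y.sum()), tp / y.sum()   # FPR (x-axis), TPR (y-axis)

fpr, tpr = roc_points([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.4, 0.2])
print(list(zip(fpr, tpr)))
```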
A Sample ROC Curve
[Chart: True Positive Rate (y-axis) against False Positive Rate (x-axis); annotations mark points with 400 and 1,000 responses in the promotional mailout example. Baseline: TP Rate = FP Rate.]
Extending to Multiple-Class Classification
[Table: an n x n confusion matrix. Rows = actual classes C1..Cn, columns = predicted classes C1..Cn, entries n_ij; row sums R_1..R_n, column sums P_1..P_n, grand total T. The diagonal entries n_11, n_22, ..., n_nn count the correctly classified instances; recall of class C_i is n_ii / R_i and precision of class C_i is n_ii / P_i.]
Measures in Multiple-Class Classification
$accuracy = \frac{n_{11} + n_{22} + \cdots + n_{nn}}{T}$
$recall(C_i) = \frac{n_{ii}}{R_i}$
$precision(C_i) = \frac{n_{ii}}{P_i}$
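A small sketch of these per-class measures on a hypothetical 3-class confusion matrix (the counts are invented for illustration):

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = actual classes, columns = predicted classes
cm = np.array([[50,  3,  2],
               [ 4, 60,  6],
               [ 1,  5, 44]])

R = cm.sum(axis=1)            # row sums R_i (actual totals per class)
P = cm.sum(axis=0)            # column sums P_i (predicted totals per class)
T = cm.sum()                  # grand total

accuracy = np.trace(cm) / T   # (n11 + n22 + ... + nnn) / T
recall = np.diag(cm) / R      # recall(Ci) = n_ii / R_i
precision = np.diag(cm) / P   # precision(Ci) = n_ii / P_i
print(accuracy, recall, precision)
```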
Numeric Prediction Evaluation
Same strategies: independent test set, cross-validation, significance tests, etc.
Difference: error measures or generalization error
Actual target values: a1, a2,…, an
Predicted target values: p1, p2,…, pn
Most popular measure: mean-squared error
Easy to manipulate mathematically
$MSE = \frac{(p_1 - a_1)^2 + \cdots + (p_n - a_n)^2}{n}$
Other Measures
The root mean-squared (rms) error:
$RMSE = \sqrt{\frac{(p_1 - a_1)^2 + \cdots + (p_n - a_n)^2}{n}}$
The mean absolute error is less sensitive to outliers than the mean-squared error:
$MAE = \frac{|p_1 - a_1| + \cdots + |p_n - a_n|}{n}$
Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)
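A minimal sketch of these three error measures in NumPy (the predicted and actual values are hypothetical):

```python
import numpy as np

def mse(p, a):  return np.mean((np.asarray(p, float) - np.asarray(a, float)) ** 2)
def rmse(p, a): return np.sqrt(mse(p, a))
def mae(p, a):  return np.mean(np.abs(np.asarray(p, float) - np.asarray(a, float)))

predicted = [500, 300, 120]   # hypothetical predictions p_1..p_n
actual    = [550, 280, 100]   # hypothetical actual values a_1..a_n
print(mse(predicted, actual), rmse(predicted, actual), mae(predicted, actual))
```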
Improvement on the Mean
Often we want to know how much the scheme improves on simply predicting the average
The relative squared error is ($\bar{a}$ is the average of the actual values):
$RSE = \frac{(p_1 - a_1)^2 + \cdots + (p_n - a_n)^2}{(a_1 - \bar{a})^2 + \cdots + (a_n - \bar{a})^2}$
The relative absolute error is:
$RAE = \frac{|p_1 - a_1| + \cdots + |p_n - a_n|}{|a_1 - \bar{a}| + \cdots + |a_n - \bar{a}|}$
The Correlation Coefficient
Measures the statistical correlation between the predicted values and the actual values
Scale independent, between -1 and +1
Good performance leads to large values!
$\rho = \frac{S_{PA}}{\sqrt{S_P S_A}}$
where $S_{PA} = \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}{n - 1}$, $S_P = \frac{\sum_i (p_i - \bar{p})^2}{n - 1}$, $S_A = \frac{\sum_i (a_i - \bar{a})^2}{n - 1}$
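A sketch of the relative errors and the correlation coefficient from the last two slides, using the same hypothetical values (plain NumPy; the function names are assumptions):

```python
import numpy as np

def relative_squared_error(p, a):
    p, a = np.asarray(p, float), np.asarray(a, float)
    return np.sum((p - a) ** 2) / np.sum((a - a.mean()) ** 2)

def relative_absolute_error(p, a):
    p, a = np.asarray(p, float), np.asarray(a, float)
    return np.sum(np.abs(p - a)) / np.sum(np.abs(a - a.mean()))

def correlation(p, a):
    p, a = np.asarray(p, float), np.asarray(a, float)
    s_pa = np.sum((p - p.mean()) * (a - a.mean())) / (len(p) - 1)
    s_p  = np.sum((p - p.mean()) ** 2) / (len(p) - 1)
    s_a  = np.sum((a - a.mean()) ** 2) / (len(a) - 1)
    return s_pa / np.sqrt(s_p * s_a)

predicted = [500, 300, 120, 210]   # hypothetical values
actual    = [550, 280, 100, 200]
print(relative_squared_error(predicted, actual),
      relative_absolute_error(predicted, actual),
      correlation(predicted, actual))
```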
Practice 1
A company wants to do a mail marketing campaign. It costs the company $1 for each item mailed. They have information on 100,000 customers. Create a cumulative gains and a lift chart from the following data.
Results
Practice 2
Using the response model P(x) = Salary(x)/1000 + Age(x) for customer x and the data table shown below, construct the cumulative gains, lift charts and ROC curves. Ties in ranking should be arbitrarily broken by assigning a higher rank to the customer who appears first in the table.
Customer Name   Salary   Age   Actual Response   P(x)   Rank
A               10000    39    N
B               50000    21    Y
C               65000    25    Y
D               62000    30    Y
E               67000    19    Y
F               69000    48    N
G               65000    12    Y
H               64000    51    N
I               71000    65    Y
J               73000    42    N
[Answer chart, Lift Chart: x-axis = % of samples (10% ... 100%); y-axis = lift (0.00 - 2.00); lift curve and baseline shown.]
[Answer chart, Cumulative Gains Chart: x-axis = % of samples (10% ... 100%); y-axis = % positive responses (0% - 100%); gains curve and baseline shown.]
Customer Name   Salary   Age   Actual Response   P(x)   Rank
A               10000    39    N                 49     10
B               50000    21    Y                 71     9
C               65000    25    Y                 90     6
D               62000    30    Y                 92     5
E               67000    19    Y                 86     7
F               69000    48    N                 117    2
G               65000    12    Y                 77     8
H               64000    51    N                 115    3
I               71000    65    Y                 136    1
J               73000    42    N                 115    4
Ordered by P(x):
Customer Name   Actual Response   P(x)   Rank
I               Y                 136    1
F               N                 117    2
H               N                 115    3
J               N                 115    4
D               Y                 92     5
C               Y                 90     6
E               Y                 86     7
G               Y                 77     8
B               Y                 71     9
A               N                 49     10
(The Cumulative Gains Chart and Lift Chart above are plotted from this ranking.)
ROC Curve
[Answer chart, ROC Curve: TPR (y-axis) against FPR (x-axis), plotted from the table below.]
Customer Name   Actual Response   #Sample   TP   FP   TPR       FPR
I               Y                 1         1    0    16.67%    0.00%
F               N                 2         1    1    16.67%    25.00%
H               N                 3         1    2    16.67%    50.00%
J               N                 4         1    3    16.67%    75.00%
C               Y                 5         2    3    33.33%    75.00%
E               Y                 6         3    3    50.00%    75.00%
D               Y                 7         4    3    66.67%    75.00%
G               Y                 8         5    3    83.33%    75.00%
B               Y                 9         6    3    100.00%   75.00%
A               N                 10        6    4    100.00%   100.00%
*Predicting performance
Assume the estimated error rate is 25%. How close is this to the true error rate? Depends on the amount of test data
Prediction is just like tossing a biased (!) coin: "head" is a "success", "tail" is an "error"
In statistics, a succession of independent events like this is called a Bernoulli process
Statistical theory provides us with confidence intervals for the true underlying proportion!
*Confidence intervals
We can say: p lies within a certain specified interval with a certain specified confidence
Example: S=750 successes in N=1000 trials
Estimated success rate: 75%
How close is this to true success rate p?
Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
Another example: S=75 and N=100
Estimated success rate: 75%
With 80% confidence, p ∈ [69.1%, 80.1%]
*Mean and variance (also Mod 7)
Mean and variance for a Bernoulli trial: p, p (1–p)
Expected success rate f=S/N
Mean and variance for f : p, p (1–p)/N
For large enough N, f follows a Normal distribution
A c% confidence interval [-z ≤ X ≤ z] for a random variable X with 0 mean is given by:
$\Pr[-z \le X \le z] = c$
With a symmetric distribution:
$\Pr[-z \le X \le z] = 1 - 2 \Pr[X \ge z]$
*Confidence limits
Confidence limits for the normal distribution with 0 mean and a variance of 1:
Pr[X ≥ z]    z
0.1%         3.09
0.5%         2.58
1%           2.33
5%           1.65
10%          1.28
20%          0.84
40%          0.25
Thus: $\Pr[-1.65 \le X \le 1.65] = 90\%$
To use this we have to reduce our random variable f to have 0 mean and unit variance
*Transforming f
Transformed value for f: $\dfrac{f - p}{\sqrt{p(1-p)/N}}$ (i.e. subtract the mean and divide by the standard deviation)
Resulting equation: $\Pr\!\left[-z \le \dfrac{f - p}{\sqrt{p(1-p)/N}} \le z\right] = c$
Solving for p: $p = \left(f + \dfrac{z^2}{2N} \pm z\sqrt{\dfrac{f}{N} - \dfrac{f^2}{N} + \dfrac{z^2}{4N^2}}\right) \bigg/ \left(1 + \dfrac{z^2}{N}\right)$
*Examples
f = 75%, N = 1000, c = 80% (so that z = 1.28): $p \in [0.732, 0.767]$
f = 75%, N = 100, c = 80% (so that z = 1.28): $p \in [0.691, 0.801]$
Note that normal distribution assumption is only valid for large N (i.e. N > 100)
f = 75%, N = 10, c = 80% (so that z = 1.28): $p \in [0.549, 0.881]$ (should be taken with a grain of salt)
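A small sketch of this interval formula; with f = 0.75 and z = 1.28 it should reproduce the three intervals above (up to rounding):

```python
from math import sqrt

def confidence_interval(f, N, z):
    """Interval for the true success rate p, given observed success rate f on N test instances."""
    centre = f + z * z / (2 * N)
    spread = z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom  = 1 + z * z / N
    return (centre - spread) / denom, (centre + spread) / denom

# z = 1.28 corresponds to c = 80% (10% in each tail)
for N in (1000, 100, 10):
    print(N, confidence_interval(0.75, N, 1.28))
```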