Chapter 5 – Evaluating Predictive Performance
Data Mining for Business Analytics (Shmueli, Patel & Bruce)
Why Evaluate?
Multiple methods are available to classify or predict
For each method, multiple choices are available for settings
To choose the best model, we need to assess each model's performance
Types of outcome
Predicted numerical value: outcome variable is numerical and continuous
Predicted class membership: outcome variable is categorical
Propensity: probability of class membership when the outcome variable is categorical
Accuracy Measures (Continuous outcome)
Evaluating Predictive Performance
Not the same as goodness-of-fit (R², S_Y/X)
Predictive performance is measured using the validation dataset
Benchmark is the average, used as the prediction

Prediction accuracy measures
ei = yi – ŷi, where yi = actual y value and ŷi = predicted y value
MAE (Mean Absolute Error; also known as MAD, Mean Absolute Deviation) = (1/n) Σ |ei|
Average Error = (1/n) Σ ei
MAPE (Mean Absolute Percent Error) = 100% x (1/n) Σ |ei / yi|
RMSE (Root-Mean-Squared Error) = sqrt( (1/n) Σ ei² )
Total SSE (Total Sum of Squared Errors) = Σ ei²
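The accuracy measures above can be sketched in plain Python. This is a minimal illustration; the function name `accuracy_measures` and the tiny y/ŷ vectors are made up for the example.

```python
from math import sqrt

def accuracy_measures(y, y_hat):
    """Compute the continuous-outcome accuracy measures from the slides.

    y and y_hat are equal-length sequences of actual and predicted values.
    """
    errors = [yi - fi for yi, fi in zip(y, y_hat)]  # ei = yi - y-hat_i
    n = len(errors)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "AverageError": sum(errors) / n,
        "MAPE": 100 * sum(abs(e / yi) for e, yi in zip(errors, y)) / n,
        "RMSE": sqrt(sum(e * e for e in errors) / n),
        "TotalSSE": sum(e * e for e in errors),
    }

# Hypothetical actual and predicted values, just to exercise the formulas
m = accuracy_measures([10, 20, 30], [12, 18, 33])
```

Note that MAE, Average Error, and MAPE only differ in whether and how each error term is normalized, which is why they share the same loop over the residuals.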
Excel Miner
Example: Multiple Regression using the Boston Housing dataset

Excel Miner
Residuals for the Training dataset; Residuals for the Validation dataset

SAS Enterprise Miner
Example: Multiple Regression using the Boston Housing dataset

SAS Enterprise Miner
Residuals for the Validation dataset

SAS Enterprise Miner
Boxplot of residuals
Lift Chart
Order predicted y-values from highest to lowest
X-axis = cases from 1 to n
Y-axis = cumulative predicted value of y
Two lines are plotted:
one with the average ȳ as the predicted value
one with ŷ found using a prediction model
Excel Miner: Lift Chart
Decile-wise Lift Chart
Order predicted y-values from highest to lowest
X-axis = % of cases from 10% to 100%, i.e., 1st decile to 10th decile
Y-axis = cumulative predicted value of y
For each decile, the ratio of the sum of ŷ to the sum of ȳ is plotted
XLMiner: Decile-wise Lift Chart
Accuracy Measures (Classification)
Misclassification error
Error = classifying a record as belonging to one class when it belongs to another class.
Error rate = percent of misclassified records out of the total records in the validation data
Benchmark: Naïve Rule
Naïve rule: classify all records as belonging to the most prevalent class
Used only as a benchmark to evaluate more complex classifiers
Separation of Records
"High separation of records" means that using the predictor variables attains low error
"Low separation of records" means that using the predictor variables does not improve much on the naïve rule
High Separation of Records
Low Separation of Records
Classification Confusion Matrix

                 Predicted Class
                 1      0
Actual Class  1  201    85
              0  25     2689

201 1's correctly classified as "1" (true positives)
85 1's incorrectly classified as "0" (false negatives)
25 0's incorrectly classified as "1" (false positives)
2689 0's correctly classified as "0" (true negatives)
Error Rate

                 Predicted Class
                 1      0
Actual Class  1  201    85
              0  25     2689

Overall error rate = (25 + 85)/3000 = 3.67%
Accuracy = 1 - err = (201 + 2689)/3000 = 96.33%
If there are multiple classes, the error rate is: (sum of misclassified records)/(total records)
Classification Matrix: Meaning of each cell
(rows = Actual Class, columns = Predicted Class)
n1,1 = no. of C1 cases classified correctly as C1
n1,2 = no. of C1 cases classified incorrectly as C2
n2,1 = no. of C2 cases classified incorrectly as C1
n2,2 = no. of C2 cases classified correctly as C2
Misclassification rate = err = (n1,2 + n2,1) / n
Accuracy = 1 - err = (n1,1 + n2,2) / n
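As a quick sketch, the error rate and accuracy can be computed directly from the four confusion-matrix counts used above:

```python
# Error rate and accuracy from the confusion-matrix counts above.
n11, n12 = 201, 85    # actual "1": predicted "1", predicted "0"
n21, n22 = 25, 2689   # actual "0": predicted "1", predicted "0"

n = n11 + n12 + n21 + n22        # 3000 records in total
err = (n12 + n21) / n            # (85 + 25)/3000, about 3.67%
accuracy = (n11 + n22) / n       # (201 + 2689)/3000, about 96.33%
```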
Propensity
Propensities are estimated probabilities that a case belongs to each of the classes
They are used in two ways:
To generate predicted class membership
To rank-order cases by probability of belonging to a particular class of interest
Most data mining algorithms classify via a 2-step process:
For each case,
1. Compute the probability of belonging to the class of interest
2. Compare it to the cutoff value, and classify accordingly

The default cutoff value is 0.50
If p >= 0.50, classify as "1"; if p < 0.50, classify as "0"
Different cutoff values can be used
Typically, the error rate is lowest for cutoff = 0.50
Propensity and Cutoff for Classification
Cutoff Table

Actual Class  Prob. of "1"    Actual Class  Prob. of "1"
1             0.996           1             0.506
1             0.988           0             0.471
1             0.984           0             0.337
1             0.980           1             0.218
1             0.948           0             0.199
1             0.889           0             0.149
1             0.848           0             0.048
0             0.762           0             0.038
1             0.707           0             0.025
1             0.681           0             0.022
1             0.656           0             0.016
0             0.622           0             0.004
If cutoff is 0.50: thirteen records are classified as “1”
If cutoff is 0.80: seven records are classified as “1”
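Cutoff-based classification can be sketched in a few lines; the list below holds the 24 propensities from the cutoff table (rounded to three decimals), and `classify` is a hypothetical helper name:

```python
# Classifying by propensity cutoff, using the 24 propensities
# from the cutoff table above.
propensities = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
                0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
                0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]

def classify(probs, cutoff=0.50):
    """Classify as "1" when the propensity meets or exceeds the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

ones_at_050 = sum(classify(propensities, 0.50))  # 13 records classified as "1"
ones_at_080 = sum(classify(propensities, 0.80))  # 7 records classified as "1"
```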
Confusion Matrix for Different Cutoffs

Cutoff = 0.25
                          Predicted Class
                          owner   non-owner
Actual Class  owner       11      1
              non-owner   4       8

Cutoff = 0.75
                          Predicted Class
                          owner   non-owner
Actual Class  owner       7       5
              non-owner   1       11
When One Class is More Important
In many cases it is more important to identify members of one class:
Tax fraud
Credit default
Response to a promotional offer
Predicting delayed flights
In such cases, overall accuracy is not a good measure for evaluating the classifier; use sensitivity and specificity instead.
Sensitivity
Suppose that "C1" is the important class. Sensitivity is the ability (probability) to correctly detect membership in the "C1" class and is given by
Sensitivity = n1,1 / (n1,1 + n1,2)
Sensitivity is also called the hit rate or true positive rate
Specificity
Suppose that "C1" is the important class. Specificity is the ability (probability) to correctly rule out members who belong to the "C2" class and is given by
Specificity = n2,2 / (n2,1 + n2,2)
1 - Specificity = false positive rate
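A small sketch applying both definitions to the earlier 3000-record confusion matrix, with "1" taken as the important class C1:

```python
# Sensitivity and specificity for the confusion matrix shown earlier,
# with "1" as the important class C1.
n11, n12 = 201, 85    # actual 1s: predicted 1, predicted 0
n21, n22 = 25, 2689   # actual 0s: predicted 1, predicted 0

sensitivity = n11 / (n11 + n12)   # true positive rate (hit rate)
specificity = n22 / (n21 + n22)
fpr = 1 - specificity             # false positive rate
```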
ROC Curve
Receiver Operating Characteristic curve: plots sensitivity against 1 - specificity as the cutoff varies
Asymmetric Misclassification Costs
Misclassification Costs May Differ
The cost of making a misclassification error may be higher for one class than the other(s)
Looked at another way, the benefit of making a correct classification may be higher for one class than the other(s)
Example – Response to Promotional Offer
Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse)
The "naïve rule" (classify everyone as "0") has an error rate of 1% (seems good)

The Confusion Matrix
           Predict Class 0   Predict Class 1
Actual 0   990               0
Actual 1   10                0

Suppose that using DM we can correctly classify eight 1's as 1's
It comes at the cost of misclassifying twenty 0's as 1's and two 1's as 0's

           Predict Class 0   Predict Class 1
Actual 0   970               20
Actual 1   2                 8

Error rate = (2 + 20)/1000 = 2.2% (higher than the naïve rate)
Introducing Costs & Benefits
Suppose:
Profit from a "1" is $10
Cost of sending an offer is $1
Then:
Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit
Under the DM predictions, 28 offers are sent:
8 respond, with a profit of $10 each
20 fail to respond, at a cost of $1 each
972 receive nothing (no cost, no profit)
Net profit = $80 - $20 = $60
Profit Matrix
           Predict Class 0   Predict Class 1
Actual 0   0                 -$20
Actual 1   0                 $80
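The net-profit arithmetic from the promotional-offer example, as a minimal sketch (variable names are made up; counts and dollar figures are the ones from the slides):

```python
# Net profit for the promotional-offer example:
# profit from a "1" is $10, cost of each offer sent is $1.
responders, nonresponders_mailed = 8, 20   # the 28 predicted "1"s
offers_sent = responders + nonresponders_mailed

profit = responders * 10            # $80 from the 8 who respond
cost = nonresponders_mailed * 1     # $20 wasted on the 20 who do not
net_profit = profit - cost          # $60, vs. $0 under the naive rule
```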
Minimize Opportunity Costs
As we see, it is best to convert everything to costs, as opposed to a mix of costs and benefits
E.g., instead of "benefit from sale", refer to "opportunity cost of lost sale"
This leads to the same decisions, but referring only to costs allows greater applicability
Cost Matrix with Opportunity Costs
Recall the original confusion matrix (profit from a "1" = $10, cost of sending an offer = $1):

           Predict Class 0   Predict Class 1
Actual 0   970               20
Actual 1   2                 8

Opportunity costs:
           Predict Class 0     Predict Class 1
Actual 0   970 x $0 = $0       20 x $1 = $20
Actual 1   2 x $10 = $20       8 x $1 = $8

Total opportunity cost = 0 + 20 + 20 + 8 = $48
Average Misclassification Cost
q1 = cost of misclassifying an actual C1 as belonging to C2
q2 = cost of misclassifying an actual C2 as belonging to C1
Average misclassification cost = (q1 x n1,2 + q2 x n2,1) / n
Look for a classifier that minimizes this average cost.
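As a hedged illustration of the formula, with assumed costs q1 = 5 and q2 = 1 (a hypothetical 5:1 cost ratio) applied to the earlier 3000-record matrix:

```python
# Average misclassification cost, with assumed costs and the
# misclassification counts from the earlier confusion matrix.
q1, q2 = 5.0, 1.0     # assumed: misclassifying a C1 is 5x as costly
n12, n21 = 85, 25     # misclassified C1s and misclassified C2s
n = 3000

avg_cost = (q1 * n12 + q2 * n21) / n   # per-record average cost
```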
Generalize to Cost Ratio
Sometimes actual costs and benefits are hard to estimate
Need to express everything in terms of costs (i.e., cost of misclassification per record)
A good practical substitute for individual costs is the ratio of misclassification costs (e.g., “misclassifying fraudulent firms is 5 times worse than misclassifying solvent firms”)
Multiple Classes
For m classes, the confusion matrix has m rows and m columns
Theoretically, there are m(m-1) misclassification costs, since any case from one of the m classes could be misclassified into any one of the m-1 other classes
Practically, these are too many to work with
In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest
Judging Ranking Performance
Lift Chart for Binary Data
Input: scored validation dataset, i.e., actual class and propensity (probability) of belonging to the class of interest (C1)
Sort records in descending order of propensity to belong to the class of interest
Compute the cumulative number of C1 members for each row
The lift chart plots row number (no. of records) on the x-axis and the cumulative number of C1 members on the y-axis
Lift Chart for Binary Data - Example
(Propensity = predicted probability of belonging to class "1")

Case No.  Propensity  Actual class  Cumulative actual classes
1         0.9959767   1             1
2         0.9875331   1             2
3         0.9844564   1             3
4         0.9804396   1             4
5         0.9481164   1             5
6         0.8892972   1             6
7         0.8476319   1             7
8         0.7628063   0             7
9         0.7069919   1             8
10        0.6807541   1             9
11        0.6563437   1             10
12        0.6224195   0             10
13        0.5055069   1             11
14        0.4713405   0             11
15        0.3371174   0             11
16        0.2179678   1             12
17        0.1992404   0             12
18        0.1494827   0             12
19        0.0479626   0             12
20        0.0383414   0             12
21        0.0248510   0             12
22        0.0218060   0             12
23        0.0161299   0             12
24        0.0035600   0             12
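The cumulative column of the example table can be reproduced with a short loop; the list below holds the 24 actual classes from the table, already sorted by descending propensity:

```python
# Cumulative count of class-"1" members, as in the example table above.
actuals = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
           1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

cumulative, total = [], 0
for a in actuals:
    total += a
    cumulative.append(total)
# cumulative (y-axis) plotted against row number 1..24 (x-axis)
# gives the lift chart
```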
Lift Chart with Costs and Benefits
Sort records in descending order of probability of success (success = belonging to the class of interest)
For each record, compute the cost/benefit of the actual outcome
Compute a column of cumulative cost/benefit
Plot cumulative cost/benefit on the y-axis and row number (no. of records) on the x-axis
Lift Chart with Cost/Benefit - Example
(Propensity = predicted probability of belonging to class "1")

Case No.  Propensity  Actual class  Cost/Benefit  Cumulative Cost/Benefit
1         0.9959767   1             10            10
2         0.9875331   1             10            20
3         0.9844564   1             10            30
4         0.9804396   1             10            40
5         0.9481164   1             10            50
6         0.8892972   1             10            60
7         0.8476319   1             10            70
8         0.7628063   0             -1            69
9         0.7069919   1             10            79
10        0.6807541   1             10            89
11        0.6563437   1             10            99
12        0.6224195   0             -1            98
13        0.5055069   1             10            108
14        0.4713405   0             -1            107
15        0.3371174   0             -1            106
16        0.2179678   1             10            116
17        0.1992404   0             -1            115
18        0.1494827   0             -1            114
19        0.0479626   0             -1            113
20        0.0383414   0             -1            112
21        0.0248510   0             -1            111
22        0.0218060   0             -1            110
23        0.0161299   0             -1            109
24        0.0035600   0             -1            108
Lift Curve May Go Negative
If the total net benefit from all cases is negative, the reference line will have a negative slope
Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum
Negative slope of the reference curve
Oversampling and Asymmetric Costs
Rare Cases
Examples: responder to a mailing, someone who commits fraud, debt defaulter
Often we oversample rare cases to give the model more information to work with
Typically use 50% "1" and 50% "0" for training
Asymmetric costs/benefits typically go hand in hand with the presence of a rare but important class
Example
The following graphs show optimal classification under three scenarios:
assuming equal costs of misclassification
assuming that misclassifying "o" is five times the cost of misclassifying "x"
an oversampling scheme allowing DM methods to incorporate asymmetric costs
Classification: equal costs
Classification: Unequal Costs
Suppose that failing to catch "o" is 5 times as costly as failing to catch "x".
Oversampling for Asymmetric Costs
Oversample "o" to appropriately weight misclassification costs, without or with replacement

Equal Number of Responders
Sample an equal number of responders and non-responders
An Oversampling Procedure
1. Separate the responders (rare) from the non-responders
2. Randomly assign half the responders to the training sample, plus an equal number of non-responders
3. Remaining responders go to the validation sample
4. Add non-responders to the validation data, to maintain the original ratio of responders to non-responders
5. Randomly take the test set (if needed) from the validation data
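The five steps above can be sketched as a small Python function. `oversample_split` is a hypothetical helper, cases are reduced to bare class labels (1 = responder), and the test-set step is omitted for brevity:

```python
import random

def oversample_split(responders, nonresponders, seed=42):
    """Sketch of the oversampling procedure from the slides."""
    rng = random.Random(seed)
    resp, nonresp = responders[:], nonresponders[:]
    rng.shuffle(resp)
    rng.shuffle(nonresp)
    half = len(resp) // 2
    # Steps 1-2: half the responders plus an equal number of non-responders
    train = resp[:half] + nonresp[:half]
    # Steps 3-4: remaining responders, plus non-responders at the
    # original responder-to-non-responder ratio
    valid_resp = resp[half:]
    ratio = len(nonresponders) / len(responders)
    valid_nonresp = nonresp[half:half + round(ratio * len(valid_resp))]
    return train, valid_resp + valid_nonresp

# 20 responders among 1000 cases (a 2% response rate)
train, valid = oversample_split([1] * 20, [0] * 980)
```

With 20 responders and 980 non-responders, the training sample holds 10 of each (50/50), while the validation sample keeps the original 2% responder rate.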
Assessing Model Performance
1. Score the model with a validation dataset selected without oversampling
2. Score the model with an oversampled validation dataset and reweight the results to remove the effects of oversampling
Method 1 is straightforward and easier to implement.
Adjusting the Confusion Matrix for Oversampling

                Whole data   Sample
Responders      2%           50%
Nonresponders   98%          50%

Example:
One responder in the whole data = 50/2 = 25 in the sample
One nonresponder in the whole data = 50/98 = 0.5102 in the sample
Adjusting the Confusion Matrix for Oversampling
Suppose that the confusion matrix with the oversampled validation dataset is as follows:

Classification matrix, oversampled data (validation)
           Predicted 0   Predicted 1   Total
Actual 0   390           110           500
Actual 1   80            420           500
Total      470           530           1000

Misclassification rate = (80 + 110)/1000 = 0.19, or 19%
Percentage of records predicted as "1" = 530/1000 = 0.53, or 53%
Adjusting the Confusion Matrix for Oversampling
Weight for responder (Actual 1) = 25; weight for nonresponder (Actual 0) = 0.5102

Classification matrix, reweighted
           Predicted 0           Predicted 1           Total
Actual 0   390/0.5102 = 764.4    110/0.5102 = 215.6    980
Actual 1   80/25 = 3.2           420/25 = 16.8         20
Total      767.6                 232.4                 1000

Misclassification rate = (3.2 + 215.6)/1000 = 0.219, or 21.9%
Percentage of records predicted as "1" = 232.4/1000 = 0.2324, or 23.24%
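The reweighting above can be sketched by dividing each cell by the oversampling weight of its actual class; the dictionary layout is made up for the example, but the counts and weights are the ones from the slides:

```python
# Reweight the oversampled confusion matrix back to whole-data
# proportions: weight 25 for responders, 50/98 for nonresponders.
w = {"1": 25, "0": 50 / 98}
matrix = {("0", "0"): 390, ("0", "1"): 110,   # (actual, predicted): count
          ("1", "0"): 80,  ("1", "1"): 420}

reweighted = {cell: count / w[cell[0]] for cell, count in matrix.items()}

n = sum(reweighted.values())                                  # 1000
err = (reweighted[("1", "0")] + reweighted[("0", "1")]) / n   # about 21.9%
pct_pred_1 = (reweighted[("0", "1")] + reweighted[("1", "1")]) / n
```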
Adjusting the Lift Curve for Oversampling
Sort records in descending order of probability of success (success = belonging to the class of interest)
For each record, compute the cost/benefit of the actual outcome
Divide the value by the oversampling weight of the actual class
Compute a cumulative column of weighted cost/benefit
Plot the cumulative weighted cost/benefit on the y-axis and row number (no. of records) on the x-axis
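These steps can be sketched as follows, using the example's cost/benefit values and oversampling weights; the list of actual classes is abbreviated for illustration:

```python
# Weighted cost/benefit for the oversampling-adjusted lift curve:
# each record's cost/benefit is divided by its class's oversampling
# weight, then accumulated.
cost_benefit = {1: 50, 0: -3}
weight = {1: 20, 0: 0.6}

actuals = [1, 1, 1, 0, 1, 0]   # abbreviated, sorted by descending propensity
running, cumulative = 0.0, []
for a in actuals:
    running += cost_benefit[a] / weight[a]
    cumulative.append(running)
# each responder adds 50/20 = 2.50; each nonresponder adds -3/0.6 = -5.00
```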
Adjusting the Lift Curve for Oversampling - Example
Suppose that the cost/benefit values and oversampling weights are as follows:

           Cost/Benefit   Oversampling weight
Actual 0   -3             0.6
Actual 1   50             20

Case No.  Propensity  Actual class  Weighted Cost/Benefit  Cumulative weighted Cost/Benefit
1         0.9959767   1             2.50                   2.50
2         0.9875331   1             2.50                   5.00
3         0.9844564   1             2.50                   7.50
4         0.9804396   1             2.50                   10.00
5         0.9481164   1             2.50                   12.50
6         0.8892972   1             2.50                   15.00
7         0.8476319   1             2.50                   17.50
8         0.7628063   0             -5.00                  12.50
9         0.7069919   1             2.50                   15.00
10        0.6807541   1             2.50                   17.50
11        0.6563437   1             2.50                   20.00
12        0.6224195   0             -5.00                  15.00
13        0.5055069   1             2.50                   17.50
14        0.4713405   0             -5.00                  12.50
15        0.3371174   0             -5.00                  7.50
16        0.2179678   1             2.50                   10.00
17        0.1992404   0             -5.00                  5.00
18        0.1494827   0             -5.00                  0.00
19        0.0479626   0             -5.00                  -5.00
20        0.0383414   0             -5.00                  -10.00
21        0.0248510   0             -5.00                  -15.00
22        0.0218060   0             -5.00                  -20.00
23        0.0161299   0             -5.00                  -25.00
24        0.0035600   0             -5.00                  -30.00