Characterizing Model Errors and Differences
Stephen D. Bay and Michael J. Pazzani
Information and Computer Science
University of California, Irvine
{sbay,pazzani}@ics.uci.edu
Evaluation Tools
• loss/accuracy
• confusion matrices
• ROC curves
• Kappa statistic (Cohen, 1960)
Problem: Cannot answer questions like
• “On which types of examples is my classifier most and least accurate?”
• “What are the differences between these two classifiers given that they have the same accuracy?”
Adult data set
• Census database
  – 48,000 examples
  – 12 demographic variables
  – classification task: predict salary >$50K or ≤$50K
  – C5 accuracy ~85%
• available from UCI Machine Learning Repository (Blake & Merz) http://www.ics.uci.edu/~mlearn/MLRepository.html
Our Goal

• Characterize model errors or model differences in the feature space of the problem
Examples:
Classifier MC4 is 21% less accurate than average on people who are between 45 and 55 years of age, are high school graduates, and are married. This represents 115 misclassified instances.
MC4 and naive Bayes are 9% less likely to agree than average on people who have Masters degrees and are married. This represents 50 instances with different predictions.
MC4 is a C4.5 clone (Kohavi, Sommerfield & Dougherty, 1997)
Framework

• Simple meta-learning framework
Age   Sex   Occupation        Salary   Agree
34    M     Tech-Support      >$50K    0
49    F     Prof-Specialty    >$50K    1
24    F     Exec-Managerial   ≤$50K    0
57    M     Admin-Clerical    ≤$50K    1
MErr: does the model agree with the true class labels?
MDiff: do two models agree with each other?
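The meta-attribute construction behind MErr and MDiff can be sketched in a few lines; the function and variable names here are illustrative, not from the talk:

```python
# Sketch of the meta-learning framework: derive an "agree" meta-attribute
# and attach it to the original feature vectors, so that any rule learner
# can then describe where agreement fails.

def merr_dataset(records, model, true_labels):
    """MErr: does the model agree with the true class labels?"""
    return [dict(r, agree=int(model(r) == y))
            for r, y in zip(records, true_labels)]

def mdiff_dataset(records, model_a, model_b):
    """MDiff: do two models agree with each other?"""
    return [dict(r, agree=int(model_a(r) == model_b(r)))
            for r in records]
```

A rule learner run on the output with `agree` as the class then yields descriptions of error (or disagreement) regions in the original feature space.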
Exploratory Research(Dietterich, 1996)
• new task: generating descriptive rule sets for model errors and differences
• existing solutions do not work well
  – (i.e., although C5 is a very good classifier, it is not appropriate for this task)
• qualitative and quantitative results
• define criteria for measuring quality of results
C5

[Figure: C5 decision tree learned for the Agree meta-attribute. The root splits on marital status (divorced, never married, married); subtrees split on capital gains (≤$3500 vs. >$3500), salary, and education, with several leaves predicting agree=1.]
STUCCO

Let X be a conjunction of attribute-value pairs, such as occupation=sales or sex=female ∧ age=[45,55].

Find all X such that P(X | c1) ≠ P(X | c2)

Two stages:
• search
• summarization

Bay, S.D. & Pazzani, M.J. (1999). Detecting change in categorical data: Mining contrast sets. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
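The search stage can be sketched as a brute-force scan over small conjunctions; the real STUCCO algorithm prunes the search and uses chi-square significance tests with corrections rather than the fixed support-difference threshold used in this toy version:

```python
from itertools import combinations

def contrast_sets(rows, groups, min_dev=0.05, max_size=2):
    """Toy contrast-set search: find conjunctions of attribute=value
    pairs whose support differs between two groups (0 and 1) by at
    least min_dev."""
    pairs = sorted({(a, r[a]) for r in rows for a in r})
    found = []
    for size in range(1, max_size + 1):
        for conj in combinations(pairs, size):
            if len({a for a, _ in conj}) < size:  # one value per attribute
                continue
            supp = []
            for g in (0, 1):
                members = [r for r, grp in zip(rows, groups) if grp == g]
                hits = sum(all(r.get(a) == v for a, v in conj)
                           for r in members)
                supp.append(hits / len(members))
            if abs(supp[0] - supp[1]) >= min_dev:
                found.append((conj, supp[0], supp[1]))
    return found
```

For MErr, the two groups are simply agree=1 and agree=0, so the surviving conjunctions describe where the model's behavior deviates.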
Discriminative vs. Characteristic Learning
• Classifiers can be broadly classified as discriminative or characteristic (Rubinstein & Hastie, 1997)
• normally, given X, we select the class y so that P(y|X) is maximized
• discriminative learners model P(y|X) directly; characteristic learners model P(X|y)

Bayes Rule:

P(y|X) = P(X|y) P(y) / P(X)
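As a toy illustration of the characteristic route, one can pick the class that maximizes the Bayes-rule numerator P(X|y)·P(y), since P(X) does not depend on y; the probability tables below are invented for the example, not taken from the talk:

```python
def classify(x, prior, likelihood):
    """Characteristic classification: argmax_y P(x|y) * P(y).
    P(x) cancels because it is constant across classes."""
    return max(prior, key=lambda y: likelihood[(x, y)] * prior[y])

# Illustrative (made-up) tables for a single attribute value.
prior = {">50K": 0.25, "<=50K": 0.75}
likelihood = {("married", ">50K"): 0.8,
              ("married", "<=50K"): 0.4}
```

Here P(X|y)P(y) is 0.8 × 0.25 = 0.2 for >50K and 0.4 × 0.75 = 0.3 for ≤50K, so the characteristic classifier picks ≤50K.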
C5 vs. STUCCO
• discriminative vs. characteristic
• incomplete vs. complete
• unordered vs. hierarchical rule sets
Leads to very different rule sets
Rule Set Examples, C5

Rule                                                      Model accuracy   Effect size
Never married AND capital gains = 0 AND salary > $50K         -84.8%         -145.1
Divorced AND capital gains = 0 AND salary > $50K              -84.8%         -122.2
Education = Bachelors AND married AND occupation = Sales
  AND capital gains = 0 AND salary < $50K                     -63.9%          -42.8

MC4 Errors on Adult
Rule Set Examples, STUCCO

Rule                                                      Model accuracy   Effect size
occupation = Adm-clerical                                     +4.6%          +84.3
marital status = married                                     -12.5%         -924.9
occupation = Adm-clerical AND marital status = married       -17.4%          -88.8

MC4 Errors on Adult
Practical Differences
MC4 is 6% more accurate than average on people who have a Bachelors degree, are married, work in a professional specialty, reported a capital gain of $0, and have a salary > $50K. This represents 13 correctly classified instances.
MC4 is 26% less accurate than average on people who have a salary > $50K. This represents 1013 misclassified instances.
MC4 is 13% less accurate than average on people who are married. This represents 925 misclassified instances.
C5 has a fragmentation problem
C5 is incomplete and misses the following rules
Evaluation
Queries:
• MC4 Errors
• 1NN vs. 5NN
• Naïve Bayes vs. SuperParent (Keogh & Pazzani, 1999)
Criteria:
• substantial effect
• comprehensible
• stable
Results

                     MC4 Errors     1NN vs. 5NN    NB vs. SP
                     C5     STUCCO  C5     STUCCO  C5     STUCCO
Number of rules      52     143     685    123     46     192
Median effect size   44     148     2      64      38     115
Average rule size    2.9    2.0     3.8    2.1     2.6    2.6
Stability            0.25   0.54    0.05   0.48    0.24   0.40
agreement(X, Y) = |X ∩ Y| / |X ∪ Y|

Stability: expected agreement between rule sets generated from the same distribution.
Effect Size: if we could make the agreement the same as the average, how many examples would be affected?
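Both measures are simple to compute. This sketch treats a rule set as a set of hashable rules, and reads effect size as coverage times the deviation from the average rate; that reading of the slide's definition is an assumption on my part:

```python
def agreement(X, Y):
    """Jaccard agreement between two rule sets: |X ∩ Y| / |X ∪ Y|."""
    X, Y = set(X), set(Y)
    return len(X & Y) / len(X | Y) if X | Y else 1.0

def effect_size(n_covered, delta_accuracy):
    """Examples affected if the rule's region matched the average:
    coverage times the accuracy deviation (interpretation, not verbatim
    from the talk)."""
    return n_covered * delta_accuracy
```

For example, a rule covering 200 instances with accuracy 12.5 points below average has an effect size of 200 × (-0.125) = -25 instances.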
Stability, MC4 Errors

[Figure: scatter of size of union (y-axis) vs. agreement (x-axis) for C5 and STUCCO rule sets]
Stability, 1NN vs. 5NN

[Figure: scatter of size of union (y-axis) vs. agreement (x-axis) for C5 and STUCCO rule sets]
Stability, NB vs. SP

[Figure: scatter of size of union (y-axis) vs. agreement (x-axis) for C5 and STUCCO rule sets]
Accuracy Difference vs. Effect Size

[Figure: scatter of effect size (y-axis) vs. accuracy difference (x-axis) for C5 and STUCCO rules]
Summary
• Can treat problem of characterizing model performance as a meta-learning problem
• may require a different bias from discriminative learners
• other factors important beyond validity of rules
Future Work
• generalize to loss
• investigate how to summarize rules for humans
• classifier comparisons– single vs. multiple models– comparing ensemble methods
Set-Enumeration Search (Rymon, 1992)

{}
{1} {2} {3} {4}
{1,2} {1,3} {1,4} {2,3} {2,4} {3,4}
{1,2,3} {1,2,4} {1,3,4} {2,3,4}
{1,2,3,4}
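The tree above can be traversed with a short recursive generator; each node is extended only with items that come after its largest element, so every subset is generated exactly once:

```python
def set_enumeration(items):
    """Depth-first walk of Rymon's (1992) set-enumeration tree.
    Node {i1 < ... < ik} is expanded only with items larger than ik,
    so no subset is visited twice."""
    def expand(node, rest):
        yield node
        for i, item in enumerate(rest):
            yield from expand(node + [item], rest[i + 1:])
    return list(expand([], list(items)))
```

This is the enumeration order under which STUCCO-style searches can prune: once a node fails a support bound, its whole subtree of supersets can be skipped.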
Rule Summarization
• Rules are summarized hierarchically to present only surprising findings
• Given the rules A → C and B → C, when do we show A ∧ B → C?
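One way to read the summarization rule is: show a conjunction only when its effect cannot be anticipated from the rules it specializes. A minimal sketch with an invented threshold test follows; the paper's actual statistical criterion may differ:

```python
def surprising(rules, threshold=0.05):
    """Hierarchical summarization sketch: rules maps a frozenset of
    conditions -> effect on accuracy.  A conjunction is shown only if
    its effect differs from every shown subset rule by more than
    `threshold` (illustrative criterion, an assumption)."""
    shown = {}
    for conds in sorted(rules, key=len):       # general to specific
        effect = rules[conds]
        parents = [e for c, e in shown.items() if c < conds]
        if not parents or all(abs(effect - e) > threshold for e in parents):
            shown[conds] = effect
    return shown
```

Under this test, A ∧ B → C is suppressed when its effect is close to that of A → C or B → C alone, and surfaced when it is genuinely different.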
Iterative Process of Building Machine Learning Systems
reprinted with permission from
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996) The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39, 27-34.
Copyright 1996 ACM