ICS 278: Data Mining
Lectures 7 and 8: Classification Algorithms
Padhraic Smyth, Department of Information and Computer Science
University of California, Irvine
Notation
• Variables X, Y, … with values x, y (lower case)
• Vectors indicated by boldface X
• Components of X indicated by Xj, with values xj
• "Matrix" data set D with n rows and p columns
  – The jth column contains the values for variable Xj
  – The ith row contains a vector of measurements on object i, indicated by x(i)
  – The jth measurement value for the ith object is xj(i)
• Unknown parameter for a model = θ
  – Can also use other Greek letters (e.g., α, β)
  – Vector of parameters = θ (boldface)
Classification
• Predictive modeling: predict Y given X
  – Y is real-valued => regression
  – Y is categorical => classification
• Classification
  – Many applications: speech recognition, document classification, OCR, loan approval, face recognition, etc.
Classification v. Regression
• Similar in many ways…
  – Both learn a mapping from X to Y
  – Both are sensitive to the dimensionality of X
  – Generalization to new data is important in both
    • Test error versus model complexity
  – Many models can be used for either classification or regression, e.g.,
    • Trees, neural networks
• Most important differences
  – Categorical Y versus real-valued Y
  – Different score functions
    • E.g., classification error versus squared error
Decision Region Terminology
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), showing Decision Region 1, Decision Region 2, and the Decision Boundary between them]
Probabilistic view of Classification
• Notation: let there be K classes c1, …, cK
• Class marginals: p(ck) = probability of class k
• Class-conditional probabilities: p( x | ck ) = probability of x given ck, k = 1, …, K
• Posterior class probabilities (by Bayes rule):
  p( ck | x ) = p( x | ck ) p(ck) / p(x), k = 1, …, K
  where p(x) = Σj p( x | cj ) p(cj)
In theory this is all we need; in practice this may not be the best approach.
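A minimal numeric sketch of this computation, assuming a one-dimensional x with two Gaussian class-conditional densities (the means, standard deviations, and priors below are illustrative, not from the lecture):

    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.5, 0.5])  # illustrative p(c1), p(c2)

    def likelihoods(x):
        # Class-conditional densities p(x | c1), p(x | c2), with made-up parameters
        return np.array([norm.pdf(x, loc=0, scale=1), norm.pdf(x, loc=2, scale=1)])

    def posterior(x):
        # Bayes rule: p(ck | x) = p(x | ck) p(ck) / sum_j p(x | cj) p(cj)
        joint = likelihoods(x) * priors
        return joint / joint.sum()

    print(posterior(1.0))  # at the midpoint of the two means -> roughly equal posteriors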
Example of Probabilistic Classification
[Figure sequence: two class-conditional densities p( x | c1 ) and p( x | c2 ) plotted on a shared x-axis; below them, the posterior p( c1 | x ) on a 0-to-1 scale, crossing 0.5 where the two densities intersect]
Decision Regions and Bayes Error Rate
[Figure: class-conditional densities p( x | c1 ) and p( x | c2 ), with the x-axis partitioned into alternating regions labelled Class c1 and Class c2 and the overlap area shaded]
• Optimal decision regions = regions where one class is more likely
• Optimal decision regions => optimal decision boundaries
• Bayes error rate = fraction of examples misclassified by the optimal classifier = shaded area above (see equation 10.3 in text)
Procedure for optimal Bayes classifier
• For each class, learn a model p( x | ck )
  – E.g., each class is multivariate Gaussian with its own mean and covariance
• Use Bayes rule to obtain p( ck | x )
  => this yields the optimal decision regions/boundaries
  => use these decision regions/boundaries for classification
• Correct in theory… but practical problems include:
  – How do we model p( x | ck )?
  – Even if we know the model for p( x | ck ), modeling a distribution or density will be very difficult in high dimensions (e.g., p = 100)
• Alternative approach: model the decision boundaries directly
Three types of classifiers
• Generative (or class-conditional) classifiers:
  – Learn models for p( x | ck ), use Bayes rule to find decision boundaries
  – Examples: naïve Bayes models, Gaussian classifiers
• Regression (or posterior class probability) classifiers:
  – Learn a model for p( ck | x ) directly
  – Examples: logistic regression (see lectures 5/6), neural networks
• Discriminative classifiers:
  – No probabilities; learn the decision boundaries directly
  – Examples:
    • Linear boundaries: perceptrons, linear SVMs
    • Piecewise linear boundaries: decision trees, nearest-neighbor classifiers
    • Non-linear boundaries: non-linear SVMs
  – Note: one can usually "post-fit" class probability estimates p( ck | x ) to a discriminative classifier
Which type of classifier is appropriate?
• Let's look at the score functions:
  – c(i) = true class, c( x(i) ; θ ) = class predicted by the classifier
• Class-mismatch loss functions:
  S(θ) = (1/n) Σi Cost[ c(i), c( x(i) ; θ ) ]
  where Cost(i, j) = cost of misclassifying true class i as predicted class j
  e.g., Cost(i, j) = 0 if i = j, = 1 otherwise (misclassification error or 0-1 loss)
  and, more generally, Cost(i, j) is a K x K matrix of losses (e.g., surgery, spam email, etc.)
• Class-probability loss functions:
  S(θ) = (1/n) Σi log p( c(i) | x(i) ; θ ) (log probability score)
  or S(θ) = (1/n) Σi [ c(i) − p( c(i) | x(i) ; θ ) ]² (Brier score)
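A small sketch of these score functions on toy two-class predictions (the labels and predicted probabilities below are illustrative):

    import numpy as np

    y_true = np.array([0, 1, 1, 0])            # c(i): true class labels (illustrative)
    p_one = np.array([0.2, 0.9, 0.6, 0.4])     # p( c = 1 | x(i) ; theta ) from some model
    y_pred = (p_one >= 0.5).astype(int)        # predicted class for 0-1 loss

    zero_one = np.mean(y_pred != y_true)                                  # misclassification error
    log_score = np.mean(np.log(np.where(y_true == 1, p_one, 1 - p_one)))  # log probability score
    brier = np.mean((y_true - p_one) ** 2)                                # Brier score
    print(zero_one, log_score, brier)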
Example: classifying spam email
• 0-1 loss function
  – Appropriate if we just want to maximize accuracy
• Asymmetric cost matrix
  – Appropriate if missing non-spam emails is more "costly" than failing to detect spam emails
• Probability loss
  – Appropriate if we want to rank all emails by p(spam | email features), e.g., to allow the user to look at emails via a ranked list
• In general: don't solve a harder problem than you need to; don't model aspects of the problem you don't need to (e.g., modeling p(x|c)) - Vapnik, 1996
Classes of classifiers
• Class-conditional/probabilistic, based on p( x | ck )
  – Naïve Bayes (simple, but often effective in high dimensions)
  – Parametric generative models, e.g., Gaussian (can be effective in low-dimensional problems; leads to quadratic boundaries in general)
• Regression-based, modeling p( ck | x ) directly
  – Logistic regression: simple, linear in "odds" space
  – Neural networks: non-linear extension of logistic regression; can be difficult to work with
• Discriminative models, which focus on locating optimal decision boundaries
  – Linear discriminants, perceptrons: simple, sometimes effective
  – Support vector machines: generalization of linear discriminants; can be quite effective, though computational complexity is an issue
  – Nearest neighbor: simple, but can scale poorly in high dimensions
  – Decision trees: a "Swiss army knife", often effective in high dimensions
Naïve Bayes Classifiers
• Generative probabilistic model with a conditional independence assumption on p( x | ck ), i.e.,
  p( x | ck ) = Πj p( xj | ck )
• Typically used with nominal variables
  – Real-valued variables are discretized to create nominal versions
  – (The alternative is to model each p( xj | ck ) with a parametric model; this is less widely used)
• Comments (see the sketch below):
  – Simple to train (just estimate the conditional probabilities for each feature-class pair)
  – Often works surprisingly well in practice
    • e.g., state of the art for text classification; the basis of many widely used spam filters
  – Feature selection can be helpful, e.g., information gain
  – Note that even if the CI assumptions are not met, it may still be able to approximate the optimal decision boundaries (this seems to happen in practice)
  – However… on most problems it can usually be beaten with a more complex model (plus more work)
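A minimal sketch of naïve Bayes training and prediction for nominal features, assuming add-one (Laplace) smoothing of the conditional probability estimates; the function names and the tiny dataset are illustrative:

    import math
    from collections import Counter, defaultdict

    def train_nb(X, y):
        # X: rows of nominal feature values; y: class labels (toy data below)
        n, p = len(X), len(X[0])
        priors = {c: cnt / n for c, cnt in Counter(y).items()}
        counts = defaultdict(Counter)              # counts[(j, c)][value]
        for row, c in zip(X, y):
            for j, v in enumerate(row):
                counts[(j, c)][v] += 1
        n_values = [len(set(row[j] for row in X)) for j in range(p)]

        def cond(j, v, c):                         # p( xj = v | c ), add-one smoothed
            cj = counts[(j, c)]
            return (cj[v] + 1) / (sum(cj.values()) + n_values[j])

        return priors, cond, p

    def predict_nb(x, priors, cond, p):
        # argmax over classes of log p(c) + sum_j log p( xj | c )
        scores = {c: math.log(pc) + sum(math.log(cond(j, x[j], c)) for j in range(p))
                  for c, pc in priors.items()}
        return max(scores, key=scores.get)

    X = [["sunny", "hot"], ["rainy", "cool"], ["sunny", "cool"], ["rainy", "hot"]]
    y = ["no", "yes", "yes", "no"]
    print(predict_nb(["sunny", "cool"], *train_nb(X, y)))   # -> "yes"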
Announcements
• Homework 2 is now online on the Web page
  – Due next Thursday in class
  – Homework 1 is still being graded
• Projects
  – Interim report due 2 weeks from now (more details later)
  – More traffic data now online
  – Locations of VDS stations now known (contact Ram Hariharan)
• Schedule:
  – Today: more on classification
  – Next: clustering, pattern-finding, dimension reduction
  – After that: specific topics such as text, Web, credit scoring, etc.
Link between Logistic Regression and Naïve Bayes
For two classes C and C̄ and a document d containing words w, naïve Bayes gives

  log [ P(C | d) / P(C̄ | d) ] = log [ P(C) / P(C̄) ] + Σ_{w ∈ d} log [ P(w | C) / P(w | C̄) ]

while logistic regression models the same log-odds directly as

  log [ P(C | d) / P(C̄ | d) ] = Σ_{w ∈ d} w_w

i.e., both make the log-odds a sum of one term per word: naïve Bayes fixes each term from its class-conditional estimates, while logistic regression learns the weights w_w directly.
Linear Discriminant Classifiers
• Linear Discriminant Analysis (LDA)
  – Earliest known classifier (R. A. Fisher, 1936)
  – See section 10.4 for the mathematical details
  – Find a projection onto a vector such that the means for each class (2 classes) are separated as much as possible, with the variances taken into account appropriately (see the sketch below)
  – Reduces to a special case of the parametric Gaussian classifier in certain situations
  – Many subsequent variations on this basic theme (e.g., regularized LDA)
• Other linear discriminants
  – Decision boundary = (p−1)-dimensional hyperplane in p dimensions
  – Perceptron learning algorithms (pre-dated neural networks)
    • Simple "error correction" based learning algorithms
  – SVMs: use a sophisticated "margin" idea for selecting the hyperplane
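A minimal numpy sketch of Fisher's two-class projection, assuming the usual within-class scatter formulation (the direction maximizing the separation of the projected means relative to the projected variances); the data points are illustrative:

    import numpy as np

    X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class 1 points (illustrative)
    X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])   # class 2 points (illustrative)

    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Fisher direction: w proportional to S_w^{-1} (m1 - m2)
    w = np.linalg.solve(S_w, m1 - m2)
    print(w / np.linalg.norm(w))   # unit vector to project the data onto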
Nearest Neighbor Classifiers
• kNN: select the k nearest neighbors to x from the training data and predict the majority class among these neighbors (see the sketch below)
• k is a parameter:
  – Small k: "noisier" estimates; large k: "smoother" estimates
  – The best value of k is often chosen by cross-validation
• Comments
  – Virtually assumption-free
  – Interesting theoretical properties:
    Bayes error < error(kNN) < 2 x Bayes error (asymptotically)
• Disadvantages
  – Can scale poorly with dimensionality: sensitive to the distance metric
  – Requires fast lookup at run time to do classification with large n
  – Does not provide any interpretable "model"
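A minimal sketch of a brute-force kNN classifier with Euclidean distance (the helper name and toy data are illustrative):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
        nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
        return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote

    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([2.9, 3.0])))   # -> 1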
Local Decision Boundaries
[Figure: three class-1 points and three class-2 points in a two-dimensional feature space (Feature 1 vs. Feature 2), with a query point marked "?"]
Boundary? Points that are equidistant between points of class 1 and class 2.
Note: locally the boundary is
(1) linear (because of Euclidean distance)
(2) halfway between the 2 class points
(3) at right angles to the connector
Finding the Decision Boundaries
[Figure sequence: for each neighboring pair of class-1 and class-2 points, the local boundary segment is the perpendicular bisector of the line joining them; repeating this pair by pair builds up the decision boundary piece by piece]
Overall Boundary = Piecewise Linear
[Figure: joining the local segments yields the overall piecewise linear decision boundary, separating the Decision Region for Class 1 from the Decision Region for Class 2]
Decision Tree Classifiers
• Widely used in practice
  – Can handle both real-valued and nominal inputs (unusual)
  – Good with high-dimensional data
• Similar algorithms to those used in constructing regression trees
• Historically, developed both in statistics and computer science
  – Statistics: Breiman, Friedman, Olshen and Stone, CART, 1984
  – Computer science: Quinlan, ID3, C4.5 (1980s-1990s)
Decision Tree Example
[Figure sequence: data plotted as Debt (y-axis) vs. Income (x-axis); the tree first splits on Income > t1, then on Debt > t2 within one branch, then on Income > t3, progressively carving the plane into rectangular decision regions]
Note: tree boundaries are piecewise linear and axis-parallel
Decision Trees are not stable
Moving just one example slightly may lead to a quite different tree and space partition: trees lack stability against small perturbations of the data.
(Figure from Duda, Hart & Stork, Chap. 8)
Decision Tree Pseudocode
A runnable Python version of the tree-design recursion (quality_score is assumed to return a (score, threshold) pair for one variable, as in the split criteria on the next slide):

    def tree_design(X, C, quality_score, min_size=5):
        # Leaf test first: stop if the node is pure or too small
        if len(set(C)) == 1 or len(C) <= min_size:
            return {"class": max(set(C), key=list(C).count)}
        d = len(X[0])
        # Score each of the d variables and pick the best (variable, threshold) split
        scores = [quality_score([x[j] for x in X], C) for j in range(d)]
        X_split = max(range(d), key=lambda j: scores[j][0])
        threshold = scores[X_split][1]
        left = [(x, c) for x, c in zip(X, C) if x[X_split] <= threshold]
        right = [(x, c) for x, c in zip(X, C) if x[X_split] > threshold]
        if not left or not right:                 # degenerate split: make a leaf
            return {"class": max(set(C), key=list(C).count)}
        X_left, C_left = zip(*left)
        X_right, C_right = zip(*right)
        return {"var": X_split, "threshold": threshold,
                "left": tree_design(X_left, C_left, quality_score, min_size),
                "right": tree_design(X_right, C_right, quality_score, min_size)}
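A hypothetical quality_score based on the Gini index (from the next slide) could be plugged in like this; this is a sketch, not the lecture's implementation:

    def gini_quality(column, C):
        # Return (score, threshold); score = negative weighted Gini impurity of the split
        def gini(labels):
            n = len(labels)
            return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))
        best = (-float("inf"), None)
        for t in sorted(set(column))[:-1]:        # candidate thresholds
            left = [c for x, c in zip(column, C) if x <= t]
            right = [c for x, c in zip(column, C) if x > t]
            score = -(len(left) * gini(left) + len(right) * gini(right))
            best = max(best, (score, t))
        return best

    tree = tree_design([[2.0, 1.0], [2.5, 3.0], [8.0, 2.5], [9.0, 0.5]],
                       ["no", "no", "yes", "yes"], gini_quality, min_size=1)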
Binary split selection criteria
• Q(t) = N1 Q1(t) + N2 Q2(t), where t is the threshold
• Let p1k be the proportion of class-k points in region 1
• Error criterion for a branch: Q1(t) = 1 − p1k*, where k* is the majority class in region 1
• Gini index: Q1(t) = Σk p1k (1 − p1k)
• Cross-entropy: Q1(t) = − Σk p1k log p1k
• Cross-entropy and Gini work better in general
  – They tend to give higher rank to splits with more extreme class distributions
  – Consider a [(300,100), (100,300)] split versus [(400,0), (200,200)] (see the numeric sketch below)
Computational Complexity for a Binary Tree
• At the root node, for each of p variables
  – Sort all values, compute the quality of each split
  – O(p N log N) time for real-valued or ordinal variables
• Subsequent internal-node operations each take O(N' log N')
  – e.g., a balanced tree of depth K requires
    p N log N + 2 (p (N/2) log(N/2)) + 4 (p (N/4) log(N/4)) + … + 2^K (p (N/2^K) log(N/2^K))
    = p N ( log N + log(N/2) + log(N/4) + … + log(N/2^K) )
• This assumes the data are in main memory
  – If the data are on disk, then repeated access of subsets at different nodes may be very slow (impossible to pre-index)
Splitting on a nominal attribute
• Nominal attribute with m values
  – e.g., the name of a state or a city in marketing data
• 2^(m−1) possible subsets => exhaustive search is O(2^(m−1))
  – For small m, a simple approach is to branch on specific values
  – But for large m this may not work well
• Neat trick for the 2-class problem (see the sketch below):
  – For each predictor value, calculate the proportion of class 1's
  – Order the m values according to these proportions
  – Now treat as an ordinal variable and select the best split (linear in m)
  – This gives the optimal split for the Gini index, among all possible 2^(m−1) splits (Breiman et al., 1984)
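A sketch of the ordering trick for binary labels in {0, 1}; the helper names and toy data are hypothetical:

    from collections import defaultdict

    def order_nominal_values(values, y):
        # Proportion of class 1 for each nominal value
        totals, ones = defaultdict(int), defaultdict(int)
        for v, c in zip(values, y):
            totals[v] += 1
            ones[v] += c
        return sorted(totals, key=lambda v: ones[v] / totals[v])

    def weighted_gini(values, y, subset):
        def gini(labels):
            p = sum(labels) / len(labels)
            return 2 * p * (1 - p)          # two-class Gini index
        left = [c for v, c in zip(values, y) if v in subset]
        right = [c for v, c in zip(values, y) if v not in subset]
        return len(left) * gini(left) + len(right) * gini(right)

    def best_nominal_split(values, y):
        # Scan the m-1 ordinal split points; each prefix of the ordering is a candidate subset
        order = order_nominal_values(values, y)
        candidates = [set(order[:i]) for i in range(1, len(order))]
        return min(candidates, key=lambda s: weighted_gini(values, y, s))

    print(best_nominal_split(["CA", "NV", "CA", "OR", "NV", "OR"], [1, 0, 1, 0, 1, 0]))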
How to Choose the Right-Sized Tree?
[Figure: predictive error (y-axis) vs. size of decision tree (x-axis); error on the training data keeps decreasing with tree size, while error on test data has a minimum, giving an ideal range for tree size]
Choosing a Good Tree for Prediction
• General idea
  – Grow a large tree
  – Prune it back to create a family of subtrees ("weakest link" pruning)
  – Score the subtrees and pick the best one
• Massive data sizes (e.g., n ~ 100k data points)
  – Use a training data set to fit a set of trees
  – Use a validation data set to score the subtrees
• Smaller data sizes (e.g., n ~ 1k or less)
  – Use cross-validation
  – Use explicit penalty terms (e.g., Bayesian methods)
Example: Spam Email Classification
• Data set (from the UCI Machine Learning Archive):
  – 4601 email messages from 1999
  – Manually labelled as spam (40%), non-spam (60%)
  – 54 features: percentage of words matching a specific word/character
    • business, address, internet, free, george, !, $, etc.
  – Plus average/longest/sum lengths of uninterrupted sequences of CAPS
• Error rates (Hastie, Tibshirani, Friedman, 2001)
  – Training: 3056 emails; testing: 1536 emails
  – Decision tree: 8.7%
  – Logistic regression: 7.6%
  – Naïve Bayes: 10% (typically)
Treating Missing Data in Trees
• Missing values are common in practice
• Approaches to handling missing values
  – During training:
    • Ignore rows with missing values (inefficient)
  – During testing:
    • Send the example being classified down both branches and average the predictions
  – Replace missing values with an "imputed value" (can be suboptimal)
• Other approaches
  – Treat "missing" as a unique value (useful if missing values are correlated with the class)
  – Surrogate splits method
    • Search for and store "surrogate" variables/splits during training
Other Issues with Classification Trees
• Why use binary splits?
  – Multiway splits can be used, but cause fragmentation
• Linear combination splits?
  – Can produce small improvements
  – Optimization is much more difficult (need weights and a split point)
  – Trees become much less interpretable
• Model instability
  – A small change in the data can lead to a completely different tree
  – Model averaging techniques (like bagging) can be useful
• Tree "bias"
  – Poor at approximating non-axis-parallel boundaries
• Producing rule sets from tree models (e.g., C5.0)
Why Trees are widely used in Practice
• Can handle high-dimensional data
  – Builds a model using 1 dimension at a time
• Can handle any type of input variables
  – Categorical, real-valued, etc.
  – Most other methods require data of a single type (e.g., only real-valued)
• Trees are (somewhat) interpretable
  – A domain expert can "read off" the tree's logic
• Tree algorithms are relatively easy to code and test
Limitations of Trees
• Representational bias
  – Classification: piecewise linear boundaries, parallel to the axes
  – Regression: piecewise constant surfaces
• High variance
  – Trees can be "unstable" as a function of the sample
    • e.g., a small change in the data -> a completely different tree
  – This causes two problems:
    • 1. High variance contributes to prediction error
    • 2. High variance reduces interpretability
  – Trees are good candidates for model combining
    • Often used with boosting and bagging
• Trees do not scale well to massive data sets (e.g., N in the millions)
  – Repeated random access of subsets of the data
Evaluating Classification Results (in general)
• Summary statistics:
  – Empirical estimate of the score function on test data, e.g., error rate
• More detailed breakdowns
  – E.g., "confusion matrices"
  – Can be quite useful in detecting systematic errors
• Detection vs. false-alarm plots (2 classes)
  – Binary classifier with a real-valued output for each example, where higher means more likely to be class 1
  – For each possible threshold, calculate (see the sketch below):
    • Detection rate = fraction of class 1 detected
    • False alarm rate = fraction of class 2 detected
  – Plot y (detection rate) versus x (false alarm rate)
  – Also known as ROC, precision-recall, or specificity/sensitivity plots
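A minimal sketch of sweeping the threshold to trace out the detection vs. false-alarm (ROC) curve; the scores and labels below are illustrative:

    import numpy as np

    y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])                  # true classes (1 vs. 0)
    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1])  # classifier outputs

    for t in sorted(set(scores), reverse=True):
        pred = scores >= t
        detection = np.mean(pred[y_true == 1])     # fraction of class 1 detected
        false_alarm = np.mean(pred[y_true == 0])   # fraction of the other class flagged
        print(f"threshold {t:.1f}: detection {detection:.2f}, false alarm {false_alarm:.2f}")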
Bagging for Combining Classifiers
• Training data set of size N
• Generate B "bootstrap" sampled data sets of size N
  – Bootstrap sample = sample with replacement
  – e.g., B = 100
• Build B models (e.g., trees), one for each bootstrap sample
  – The intuition is that bootstrapping "perturbs" the data enough to make the models more resistant to true variability
• For prediction, combine the predictions from the B models (see the sketch below)
  – E.g., for classification, p(c | x) = fraction of the B models that predict c
  – Plus: generally improves accuracy for models such as trees
  – Negative: lose interpretability
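A minimal sketch of bagging trees, assuming scikit-learn's DecisionTreeClassifier is available (B and the toy data are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagged_trees(X, y, B=100, seed=0):
        rng = np.random.default_rng(seed)
        N = len(X)
        models = []
        for _ in range(B):
            idx = rng.integers(0, N, size=N)   # bootstrap: draw N rows with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagged_prob(models, x, c):
        # p(c | x) = fraction of the B models that predict class c
        votes = [m.predict(x.reshape(1, -1))[0] for m in models]
        return np.mean([v == c for v in votes])

    X = np.array([[0.0], [0.2], [1.0], [1.2]]); y = np.array([0, 0, 1, 1])
    print(bagged_prob(bagged_trees(X, y, B=25), np.array([1.1]), c=1))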
[Figure from Hastie, Tibshirani, and Friedman, 2001: green = majority vote; purple = averaging the probabilities]
[Illustration of boosting: color of points = class label; diameter of points = weight at each iteration; dashed line = single-stage classifier; green line = combined, boosted classifier; dotted blue in the last two panels = bagging. From G. Rätsch, PhD thesis, 2001]
Support Vector Machines (will be discussed again later)
• Support vector machines
  – Use a different loss function, the "margin" (see the sketch below)
    • Results in a convex optimization problem, solvable by quadratic programming
  – The decision boundary is represented by examples in the training data
  – Linear version: clever placement of the hyperplane
  – Non-linear version: the "kernel trick" for high-dimensional problems
  – Computational complexity can be O(N^3) without speedups
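As a preview, a small sketch of the margin-based loss (the hinge loss) commonly used to train SVMs, with illustrative labels in {−1, +1}:

    import numpy as np

    y = np.array([1, 1, -1, -1])          # labels coded as -1 / +1 (illustrative)
    f = np.array([2.3, 0.4, -1.7, 0.2])   # signed distances from a candidate hyperplane

    hinge = np.maximum(0, 1 - y * f)      # zero only for points beyond the margin
    print(hinge, hinge.mean())            # misclassified or in-margin points incur loss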
Summary on Classifiers
• Simple models (but can be effective)
  – Logistic regression
  – Naïve Bayes
  – k nearest-neighbors
• Decision trees
  – Good for high-dimensional problems with different data types
• State of the art:
  – Support vector machines
  – Boosted trees
• Many tradeoffs in interpretability, score functions, etc.
Decision Tree Classifiers
  Task: Classification
  Representation: Decision boundaries = hierarchy of axis-parallel boundaries
  Score Function: Cross-validated error
  Search/Optimization: Greedy search in tree space
  Data Management: None specified
  Models, Parameters: Tree
Naïve Bayes Classifier
  Task: Classification
  Representation: Conditional independence probability model
  Score Function: Likelihood
  Search/Optimization: Closed-form probability estimates
  Data Management: None specified
  Models, Parameters: Conditional probability tables
Logistic Regression
  Task: Classification
  Representation: Log-odds(C) = linear function of X's
  Score Function: Log-likelihood
  Search/Optimization: Iterative (Newton) method
  Data Management: None specified
  Models, Parameters: Logistic weights
Nearest Neighbor Classifier
  Task: Classification
  Representation: Memory-based
  Score Function: Cross-validated error (for selecting k)
  Search/Optimization: None
  Data Management: None specified
  Models, Parameters: None
Support Vector Machines
  Task: Classification
  Representation: Hyperplanes
  Score Function: "Margin"
  Search/Optimization: Convex optimization (quadratic programming)
  Data Management: None specified
  Models, Parameters: None
Software (same as for Regression)
• MATLAB
  – Many free "toolboxes" on the Web for regression and prediction
  – e.g., see http://lib.stat.cmu.edu/matlab/ and in particular the CompStats toolbox
• R
  – General-purpose statistical computing environment (successor to S)
  – Free (!)
  – Widely used by statisticians; has a huge library of functions and visualization tools
• Commercial tools
  – SAS, other statistical packages
  – Data mining packages
  – Often not programmable: they offer a fixed menu of items
Reading
• For this class: Chapter 10
  – Covers both general concepts in classification and a broad range of classifiers
• Suggested background reading for further information:
  – The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, Springer-Verlag, 2001
  – Learning with Kernels, B. Schölkopf and A. Smola, MIT Press, 2002
  – Classification and Regression Trees, L. Breiman, J. Friedman, R. Olshen, and C. Stone, Wadsworth Press, 1984