cs 277: data mining notes on classification padhraic smyth department of computer science university...

122
CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Upload: britton-cox

Post on 22-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

CS 277: Data Mining

Notes on Classification

Padhraic SmythDepartment of Computer Science

University of California, Irvine

Page 2: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Review

• Models that are linear in parameters , e.g.,

y = 0 + 1 x1 + 2 x2 + 12 x1 x2

With least squares objective function -> solving a set of linear equations

• Models that are non-linear in parameters, e.g., logistic

y = 1/ 1 + exp[ - (0 + 1 x1 + 2 x2 + 12 x1 x2 ) ]

Solution requires non-linear optimization methods, e.g., iterative

search

Page 3: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Optimization: Gradient Ascent (or Descent)

– Select an initial heuristically or randomly

– Update:• Compute the local gradient of the objective function S

SSd, ………….., Spdp – Gives direction of maximum increase of S function (at this point in

space)

• Move a small distance in this direction, i.e., uphill

+ S = learning rate (e.g., 0.1)

• Repeat until convergence– e.g., is no longer changing, S ~ 0, i.e., we are a (local) maximum

– Repeat from different initial conditions if local maxima exist in S()

– Many different versions (e.g., batch, sequential/stochastic, etc)

Page 4: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Optimization: Newton’s Method

• Use 2nd order derivative information in update rule:

+ Swhere is the Hessian matrix, a p x p matrix of 2nd derivatives of S evaluated at

• Requires O(Np2) computations to compute Hessian, O(p3) to invert– Can approximate with diagonal, yields O(Np)

• Gives optimal convergence rate if S() is quadratic – May be particularly helpful near minimum (or maximum) (think Taylor’s series

expansion)

• For more discussion see Section 8.3 in the text

Page 5: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Model Evaluation

• Let MSEtest be the mean-square error of our learned predictor function, evaluated on test data

• Useful to report MSEtest / MSEbaseline

– e.g., where MSEbaseline = i [y(i) – y]2 (on test data points)

where y = mean of y values on the training data

- ideally we would like MSEtest / MSEbaseline to be much less than 1.

• Can also plot histograms of individual errors: MSE might be dominated by outliers

Page 6: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Classification

• Predictive modeling: predict Y given X– Y is real-valued => regression– Y is categorical => classification

• Often use C rather than Y to indicate the “class variable”

• Classification– Many applications: speech recognition, document

classification, OCR, loan approval, face recognition, etc

Page 7: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Classification v. Regression

• Similar in many ways…– both learn a mapping from X to C or Y– Both sensitive to dimensionality of X – Generalization to new data is important in both

• Test error versus model complexity

– Many models can be used for either classification or regression, e.g.,• trees, neural networks

• Most important differences– Categorical Y versus real-valued Y– Different score functions

• E.g., classification error versus squared error

Page 8: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Decision Region Terminology

2 3 4 5 6 7 8 9 10-1

0

1

2

3

4

5

6

Feature 1

Fea

ture

2

TWO-CLASS DATA IN A TWO-DIMENSIONAL FEATURE SPACE

DecisionRegion 2

DecisionRegion 1

DecisionBoundary

Page 9: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Probabilistic View of Classification

• Notation: K classes c1,…..cK

• Class probabilities: p(ck) = probability of class k

• Class-conditional probabilities p( x | ck ) = probability of x given ck , k = 1,…K

• Posterior class probabilities (by Bayes rule)

p( ck | x ) = p( x | ck ) p(ck) / p(x) , k = 1,…K

where p(x) = p( x | cj ) p(cj)

In theory this is all we need….in practice this may not be best approach.

Page 10: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Bayes Rules for Classification

Consider 2 class case c1, c2

Goal of classification: given x, predict c1 or c2

Optimal decision rule: choose c1 if p(c1 | x) > 0.5, otherwise choose c2

=> we would like to know p(c1 | x),

By Bayes rule, p(c1 | x ) = p(x | c1) p(c1) / p(x)

= p(x | c1) p(c1) / ( p(x | c1) p(c1) + p(x | c2) p(c2) )

= p(x , c1) / ( p(x , c1) + p(x , c2) )

Page 11: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Probabilistic Classification for 1-dimensional x

p( x , c1 )p( x , c2 )

Note that p( x , c ) = p(x | c) p(c)

Page 12: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Probabilistic Classification for 1-dimensional x

p( x , c1 )p( x , c2 )

p( c1 | x )1

0.5

0

Page 13: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Probabilistic Classification for 1-dimensional x

p( x , c1 )p( x , c2 )

p( c1 | x )1

0.5

0

Page 14: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Decision Regions and Bayes Error Rate

p( x , c1 )p( x , c2 )

Class c1 Class c2 Class c1Class c2Class c2

Optimal decision regions = regions where 1 class is more likely

Optimal decision regions optimal decision boundaries

Page 15: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Decision Regions and Bayes Error Rate

p( x , c1 )p( x , c2 )

Class c1 Class c2 Class c1Class c2Class c2

Optimal decision regions = regions where 1 class is more likely

Optimal decision regions optimal decision boundaries

Bayes error rate = fraction of examples misclassified by optimal classifier = shaded area above (see equation 10.3 in text)

Page 16: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Procedure for optimal Bayes classifier

• For each class learn a model p( x | ck )

– E.g., each class is multivariate Gaussian with its own mean and covariance

• Use Bayes rule to obtain p( ck | x )

=> this yields the optimal decision regions/boundaries => use these decision regions/boundaries for classification

• Correct in theory…. but practical problems include:– How do we model p( x | ck ) ?

– Even if we know the model for p( x | ck ), modeling a distribution or density will be very difficult in high dimensions (e.g., p = 100)

• Alternative approach: model the decision boundaries directly

Page 17: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

3 Types of Classifiers

• Generative (or class-conditional) classifiers:– Learn models for p( x | ck ), use Bayes rule to find decision boundaries– Examples: naïve Bayes models, Gaussian classifiers

• Regression-based classifiers:– Learn a model for p( ck | x ) directly– Example: logistic regression, neural networks

• Discriminative classifiers– No probabilities– Learn the decision boundaries directly– Examples:

• Linear boundaries: perceptrons, linear SVMs• Piecewise linear boundaries: decision trees, nearest-neighbor classifiers• Non-linear boundaries: non-linear SVMs

– Note: one can usually “post-fit” class probability estimates p( ck | x ) to a discriminative classifier, e.g., often done with SVMs

Page 18: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Generative Classifier

p( x , c1 )p( x , c2 )

Page 19: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Regression-based Classifier

p( x , c1 )p( x , c2 )

p( c1 | x )1

0.5

0

Page 20: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Discriminative Classifier

p( x , c1 )p( x , c2 )

p( c1 | x )1

0.5

0

Page 21: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

What type of cost function is appropriate?

• Lets look at the score functions:– c(i) = true class, c(x(i) ; ) = class predicted by the classifier

Class-mismatch loss functions:

S() = 1/n i Cost [c(i), c(x(i) ; ) ]

where cost(i, j) = cost of misclassifying true class i as predicted class j

e.g., cost(i,j) = 0 if i=j, = 1 otherwise (misclassification error or 0-1 loss)

and more generally cost(i,j) is a matrix of K x K losses (e.g., surgery, spam email, etc)

Class-probability loss functions, c = 0 or 1

S() = 1/n i log p(c(i) | x(i) ; ) (log probability score)

or S() = 1/n i [ c(i) – p(c(i) | x(i) ; ) ]2 (Brier score)

Page 22: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Example: cost functions for classifying spam email

• 0-1 loss function– Appropriate if we just want to maximize accuracy

• Asymmetric cost matrix– Appropriate if missing non-spam emails is more “costly” than

failing to detect spam emails

• Probability loss– Appropriate if we wanted to rank all emails by p(spam | email

features), e.g., to allow the user to look at emails via a ranked list.

• In general: don’t solve a harder problem than you need to, or don’t model aspects of the problem you don’t need to

Page 23: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Examples of Classifiers

• Generative/class-conditional/probabilistic, based on p( x | ck ),

– Naïve Bayes (simple, but often effective in high dimensions)– Parametric generative models, e.g., Gaussian (can be effective in low-

dimensional problems: leads to quadratic boundaries in general)

• Regression-based, model p( ck | x ) directly– Logistic regression: simple, linear in “odds” space, widely used in industry– Neural network: non-linear extension of logistic, can be difficult to work

with

• Discriminative models, focus on locating optimal decision boundaries– Linear discriminants, perceptrons: simple, sometimes effective– Support vector machines: generalization of linear discriminants, can be

quite effective, computational complexity can be an issue– Nearest neighbor: simple, can scale poorly in high dimensions– Decision trees: often effective in high dimensions, but biased

Page 24: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Generative Classifiers

(classifiers that estimate p(x | c) and then use Bayes rule to compute p(c | x)

Page 25: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

A Generative Classifier: Naïve Bayes

• Generative probabilistic model with conditional independence assumption on p( x | ck ), i.e.

p( x | ck ) = p( xj | ck )

or, log p( x | ck ) = log [ p( xj | ck ) ]

• Useful in high-dimensional problems

• Typically used with nominal or ordinal variables– Real-valued variables discretized to create ordinal versions

• e.g., Supervised and unsupervised discretization of continuous features, Dougherty, Kohavi, and Sahami, ICML 1995

– alternative for real-valued x is to model each p( xj | ck ) with a parametric density model, e.g., Gaussian. Less widely used.

Page 26: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

A Generative Classifier: Naïve Bayes

• Comments:– Simple to train (just estimate conditional probabilities for each feature-class

pair)

– Often works surprisingly well in practice• e.g., good baseline for text classification, basis of many widely used spam filters

– Feature selection can be helpful, • e.g., select the K best individual features

– Note that even if independence assumptions are not met, it may still be able to approximate the optimal decision boundaries (seems to happen in practice)

• See On the optimality of the simple Bayesian classifier under zero-one loss, Domingos and Pazzani, Machine Learning, 2004

– However…. on most problems can usually be beaten with a more complex model (plus more work)

Page 27: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Regression-Based Classifiers

(classifiers that estimate p(c | x) directly)

Page 28: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Regression-based Classification

• Consider regression once again, but where y now takes values 0 or 1

• Regression will try to learn an f function to approximate E[y | x] at each x

• For binary y we have E[y | x] = y p(y |x) y

= p(y=1|x) . 1 + p(y=0|x) . 0

= p(y=1|x)

=> For binary classification problems a regression model will try to approximate p(y=1|x) (posterior class probabilities)– e.g., this is what logistic regression and neural networks do

Page 29: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Predicting an output between 0 and 1

• We often have a problem where y lies between 0 and 1– probability that a patient with attributes X will survive 10 years– proportion of people in Zip code X who will buy a product

• We could use linear regression, but…..

• Instead we can use the logistic function

log p(y=1|x)/log p(y=0|x) = 0 + j xj

Equivalently, p(y=1|x) = 1/[1 + exp(- 0 - j xj ) ]

We model the log-odds as a linear function of the input variables. This is known as logistic regression.

(Note: neural networks can be thought of as multi-layer logistic models)

Page 30: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

1-dimensional case

p(y=1| x ) = 1/[1 + exp(- 0 - x ) ]

For simplicity assume ’s are both >0

As x -> + infinity, p(y=1 | x) -> 1As x -> - infinity, p(y=1 | x) -> 0

P(y=1|x) = 0.5 when? - 0 - x = 0 -> x = - 0 /

- location of logistic curve controlled by - 0 /

- steepness of curve controlled by

Page 31: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Likelihood-based Objective Function

• Conditional Log-Likelihood

– likelihood = probability of observed data

– Select parameters to maximize the (log) likelihood of the y’s given the x’s (“conditional maximum likelihood”)

S() = i log p( y(i) | x(i) ; )

= i y(i) log p( y(i)=1| x(i) ; ) + [1-y(i)] log(1- p( y(i)=1| x(i) ; ))

Page 32: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Fitting a Logistic Regression Model

• Iterative Reweighted Least Squares (IRLS)– Can compute the 2nd derivative directly as weighted matrix

• Forms the basis for an iterative 2nd order Newton scheme• Each iteration is equivalent to a weighted regression problem, O(p3)

– see (e.g.) Komarek and Moore (2005) for speedups for sparse data• Known as iteratively reweighted least-squares• Log-likelihood here is convex: so it is quite stable (only one global

maximum!).

• Stochastic gradient descent– Often faster for large data sets (large N, large p)– See notes by Charles Elkan for reference

• http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf

Page 33: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Link between Logistic Regression and Naïve Bayes

( | ) ( ) ( | )log log log

( | ) ( ) ( | )w d

P C d P C P w C

P C d P C P w C

Naïve Bayes

Logistic Regression

( | )log

( | ) ww d

P C dw

P C d

Page 34: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Evaluating Classifiers

• Evaluate on independent test data (as with regression)

• Measures of performance on test data:– Classification accuracy (or error)

• or cost function if “costs” of errors are not symmetric• Confusion matrices:

– K x K matrix where entry(i,j) contains number of test examples that were predicted to be class i, and truly belonged to class j

– Diagonal elements = examples classified correctly– Off-diagonal elements = misclassified examples– Useful with more than 2 classes for figuring out which classes are most

“confused”

– Log-probability score on test data• Useful if we want to measure how good (well-callibrated) p(c|x) estimates are

– Ranking performance• How well does a classifier rank new examples?

– Receiver-operating characteristics– Lift curves

Page 35: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Imbalanced Class Distributions

• Common in data mining to have one class be much less likely than the others– e.g., 0.1% of examples are fraudulent or have a disease

• If we train a standard classifier on a random sample of data it is very difficult to beat the “majority classifier” in terms of accuracy

• Approaches:– Stratified sampling: artificially create training data with 50% of each

class being present, and then “correct” for this in prediction• E.g., learn p(x|c) on stratified data and use true p( c ) when predicting with a

probabilistic model

– Use a different score function:• We are often interested in scoring/screening/ranking cases when using the

model• Thus, scores such as “how many of the class of interest are ranked in the top

1% of predictions” may be more relevant than overall accuracy (e.g., in document retrieval)

Page 36: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Ranking and Lift Curves

• Many problems where we are interested in ranking examples in terms of how likely they are to the “positive” class – E.g., credit scoring, fraud detection, medical screening,

document retrieval– E.g., use classifier to rank N test examples according to p(c|x)

and then pick the top K, where K is much smaller than N

• Lift curve– n = number of true positives that appear in top K% of ranked list– r = number of true positives that would appear if we ranked

randomly– n/r is the “lift” provided by the classifier for top K%

• e.g., K = 10%, r = 200, n = 300, lift = 1.5, or 50% increase in lift• Random ranking gives lift = 1, or 0% increase in lift

Page 37: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

• Target variable = response/no-response from mailing campaign

• Training and test sets each of size 250k

• Standard model had 80 variables: variable selection reduced this to 7

• Note non-monotonicity in lower curve (undesirable)

Page 38: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Receiver Operating Characteristic (ROC) plots

• Rank the N test examples by p(c|x)– or whatever real-number our classifier produces that indicates

likelihood of belonging to class 1

• Let k = number of true class 1 examples, and m = number of true class 0 examples, and k+m = N

• For all possible thresholds t for this ranked list– count number of true positives kt

• true positive rate = kt /k– count number of “false alarms”, mt

• false positive rate = mt /m

– ROC plot = plot of true positive rate kt v false positive rate mt

Page 39: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

ROC Example

N = 10 examples, k = 6 true class 1’s, m = 4 class 0’s

The first column is a possible ranking from a classifier

Rank True Class

TruePositives

FalsePositives

1 1 1 0

2 1 2 0

3 1 3 0

4 1 4 0

5 0 4 1

6 1 5 1

7 0 5 2

8 1 6 2

9 0 6 3

10 0 6 4

Page 40: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

ROC Plot

• Area under curve (AUC) often used as a metric to summarize ROC• Online example at http://www.anaesthetist.com/mnm/stats/roc/

Diagonal linecorrespondsto randomranking

Page 41: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Calibration

• In addition to ranking we may be interested in how accurate our estimates of p(c|x) are,– i.e., if the model says p(c|x) = 0.9, how accurate is this

number?

• Calibration:– a model is well-calibrated if its probabilistic predictions match

real-world empirical frequencies– i.e., if a classifier predicts p(c|x) = 0.9 for 100 examples, then

on average we would expect about 90 of these examples to belong to class c, and 10 not to.

– We can estimate calibration curves by binning a classifier’s probabilistic predictions, and measuring how many

Page 42: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Calibration in Probabilistic Prediction

Page 43: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Examples of Classifiers

• Generative/class-conditional/probabilistic, based on p( x | ck ),

– Naïve Bayes (simple, but often effective in high dimensions)– Parametric generative models, e.g., Gaussian (can be effective in low-

dimensional problems: leads to quadratic boundaries in general)

• Regression-based, model p( ck | x ) directly– Logistic regression: simple, linear in “odds” space, widely used in industry– Neural network: non-linear extension of logistic, can be difficult to work

with

• Discriminative models, focus on locating optimal decision boundaries– Linear discriminants, perceptrons: simple, sometimes effective– Support vector machines: generalization of linear discriminants, can be

quite effective, computational complexity can be an issue– Nearest neighbor: simple, can scale poorly in high dimensions– Decision trees: often effective in high dimensions, but biased

Page 44: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Discriminative Classifiers

Page 45: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Nearest Neighbor Classifiers

• kNN: select the k nearest neighbors to x from the training data and select the majority class from these neighbors

• k is a parameter: – Small k: “noisier” estimates, Large k: “smoother” estimates– Best value of k often chosen by cross-validation

• Comments– Virtually assumption free– Gives piecewise linear boundaries (i.e., non-linear overall)– Interesting theoretical properties:

Bayes error < error(kNN) < 2 x Bayes error (asymptotically)

• Disadvantages– Can scale poorly with dimensionality: sensitive to distance metric– Requires fast lookup at run-time to do classification with large n– Does not provide any interpretable “model”

Page 46: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Local Decision Boundaries

1

1

1

2

2

2

Feature 1

Feature 2

?

Boundary? Points that are equidistantbetween points of class 1 and 2Note: locally the boundary is(1) linear (because of Euclidean distance)(2) halfway between the 2 class points(3) at right angles to connector

Page 47: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Finding the Decision Boundaries

1

1

1

2

2

2

Feature 1

Feature 2

?

Page 48: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Finding the Decision Boundaries

1

1

1

2

2

2

Feature 1

Feature 2

?

Page 49: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Finding the Decision Boundaries

1

1

1

2

2

2

Feature 1

Feature 2

?

Page 50: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Overall Boundary = Piecewise Linear

1

1

1

2

2

2

Feature 1

Feature 2

?

Decision Region for Class 1

Decision Region for Class 2

Page 51: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Example: Choosing k in kNN (example from G. Ridgeway, 2003)

Page 52: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Decision Tree Classifiers

– Widely used in practice• Can handle both real-valued and nominal inputs (unusual)• Good with high-dimensional data

– similar algorithms as used in constructing regression trees

– historically, developed both in statistics and computer science• Statistics:

– Breiman, Friedman, Olshen and Stone, CART, 1984• Computer science:

– Quinlan, ID3, C4.5 (1980’s-1990’s)

Page 53: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Decision Tree Example

Income

Debt

Page 54: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Decision Tree Example

t1 Income

DebtIncome > t1

??

Page 55: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Decision Tree Example

t1

t2

Income

DebtIncome > t1

Debt > t2

??

Page 56: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Decision Tree Example

t1t3

t2

Income

DebtIncome > t1

Debt > t2

Income > t3

Page 57: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Decision Tree Example

t1t3

t2

Income

DebtIncome > t1

Debt > t2

Income > t3

Note: tree boundaries are piecewiselinear and axis-parallel

Page 58: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Decision Tree Pseudocode

node = tree-design(Data = {X,C})

For i = 1 to d

quality_variable(i) = quality_score(Xi, C)

end

node = {X_split, Threshold } for max{quality_variable}

{Data_right, Data_left} = split(Data, X_split, Threshold)

if node == leaf?

return(node)

else

node_right = tree-design(Data_right)

node_left = tree-design(Data_left)

end

end

Page 59: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Binary split selection criteria

• Q(t) = N1Q1(t) + N2Q2(t), where t is the threshold

• Let p1k be the proportion of class k points in region 1

• Error criterion for a branch Q1(t) = 1 - p1k*

• Gini index: Q1(t) = k p1k (1 - p1k)

• Cross-entropy: Q1(t) = k p1k log p1k

• Cross-entropy and Gini work better in practice than direct minimization of classification error at each node

Page 60: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Computational Complexity for a Binary Tree

• At the root node, for each of p variables– Sort all values, compute quality for each split– O(pN log N) time for real-valued or ordinal variables

• Subsequent internal node operations each take O(N’ log N’)

• This assumes data are in main memory– If data are on disk then repeated access of subsets at different

nodes may be very slow (impossible to pre-index)– Note: time difference between retrieving data in RAM and data

on disk may be O(103) or more.

Page 61: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Splitting on a nominal attribute

• Nominal attribute with m values– e.g., the name of a state or a city in marketing data

• 2m-1 possible subsets => exhaustive search is O(2m-1)– For small m, a simple approach is to branch on specific values– But for large m this may not work well

• Neat trick for the 2-class problem:– For each predictor value calculate the proportion of class 1’s – Order the m values according to these proportions– Now treat as an ordinal variable and select the best split (linear

in m) – This gives the optimal split for the Gini index, among all

possible 2m-1 splits (Breiman et al, 1984).

Page 62: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

How to Choose the Right-Sized Tree?

PredictiveError

Size of Decision Tree

Error on Training Data

Error on Test Data

Ideal Rangefor Tree Size

Page 63: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Choosing a Good Tree for Prediction

• General idea– grow a large tree– prune it back to create a family of subtrees

• “weakest link” pruning

– score the subtrees and pick the best one

• Massive data sizes (e.g., n ~ 100k data points)– use training data set to fit a set of trees– use a validation data set to score the subtrees

• Smaller data sizes (e.g., n ~1k or less)– use cross-validation– use explicit penalty terms (e.g., Bayesian methods)

Page 64: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Example: Spam Email Classification

• Data Set: (from the UCI Machine Learning Archive)– 4601 email messages from 1999– Manually labelled as spam (60%), non-spam (40%)– 54 features: percentage of words matching a specific

word/character• Business, address, internet, free, george, !, $, etc

– Average/longest/sum lengths of uninterrupted sequences of CAPS

• Error Rates (Hastie, Tibshirani, Friedman, 2001)– Training: 3056 emails, Testing: 1536 emails– Decision tree = 8.7%– Logistic regression: error = 7.6%– Naïve Bayes = 10% (typically)

Page 65: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Data Mining Lectures Lectures 9/10: Classification Padhraic Smyth, UC Irvine

Page 66: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Page 67: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Treating Missing Data in Trees

• Missing values are common in practice

• Approaches to handing missing values – During training

• Ignore rows with missing values (inefficient)– During testing

• Send the example being classified down both branches and average predictions

– Replace missing values with an “imputed value” (can be suboptimal)

• Other approaches – Treat “missing” as a unique value (useful if missing values are

correlated with the class)– Surrogate splits method

• Search for and store “surrogate” variables/splits during training

Page 68: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Other Issues with Classification Trees• Why use binary splits?

– Multiway splits can be used, but cause fragmentation

• Linear combination splits?– can produces small improvements – optimization is much more difficult (need weights and split point)– Trees are much less interpretable

• Model instability– A small change in the data can lead to a completely different tree– Model averaging techniques (like bagging) can be useful

• Tree “bias”– Poor at approximating non-axis-parallel boundaries

• Producing rule sets from tree models (e.g., c5.0)

Page 69: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Why Trees are useful in Practice

• Can handle high dimensional data– builds a model using 1 dimension at time

• Can handle any type of input variables– categorical, real-valued, etc– most other methods require data of a single type (e.g., only

real-valued)

• Trees are (somewhat) interpretable– domain expert can “read off” the tree’s logic

Page 70: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Limitations of Trees

• Representational Bias– classification: piecewise linear boundaries, parallel to axes– regression: piecewise constant surfaces

• Trees do not scale well to massive data sets (e.g., N in millions)– repeated (unpredictable) access of subsets of the data

• High Variance– trees can be “unstable” as a function of the sample

• e.g., small change in the data -> completely different tree– causes two problems

• 1. High variance contributes to prediction error• 2. High variance reduces interpretability

– Trees are good candidates for model combining• Often used with boosting and bagging

Page 71: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Figure from Duda, Hart & Stork, Chap. 8

Moving just one example slightly may lead to quite different trees and space partition

Lack of stability against small perturbation of data.

Decision Trees are not stable

Page 72: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Example of Tree Instability2 trees fit to 2 splits of data, from G. Ridgeway,

2003

Page 73: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Linear Classifiers

Some slides adapted from Introduction to Information Retrieval, by Manning, Raghavan, and Schutze, Cambridge University Press, 2009, with additional slides from Ray Mooney.

Page 74: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Linear Classifiers

• Linear classifier (two-class case) wT x + b > 0

– w is a p-dimensional vector of weights (learned from the data)

– w0 is a threshold (also learned from the data)

– Equation of linear hyperplane (decision boundary) wT x + b = 0

Note that in p-dimensions this defines a p-1 dimensional hyperplane

Page 75: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Linear Discriminant Analysis (LDA)

Earliest known classification algorithm (1936, R.A. Fisher)

Find a linear projection onto a vector such that means for each class (2 classes) are separated as much as possible (with variances taken into account appropriately)

See section 10.4 in PDM for more mathematical details

Reduces to a special case of generative Gaussian classifier in certain situations

Many subsequent variations on this basic theme (e.g., regularized LDA)

Page 76: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Examples of Linear Classifiers

• Perceptrons:– f(x) = wT x + b > 0 , classify x as 1 if f(x) > 0

• Logistic Regression– See definition in earlier slides

– Boundary occurs at log(p/1-p) = 0, i.e., T x + 0

• Naïve Bayes– Linear (additive) form when we look at log P(c=1 | x)

• Linear support vector machines– Same functional form as perceptrons, but trained differently

Page 77: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

2-dimensional Example

• wT = [1 -1], b = 0

• Equation for decision boundary is wT x + b = 0

or x1 – x2 = 0, x1 = x2

length of weight vector |w| = sqrt(wT w) = sqrt(2)

Page 78: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Geometry of Linear Classifiers

Decision boundary:wT x + b = 0

Direction of w vector

Distance of x fromthe boundary is (wT x + b )/ |w|

wT = [1 -1], b = 0

x1

x2

Page 79: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Linear Separability in High Dimensions

Consider n sample points in p dimensions– Binary labels => 2n possible labelings (or dichotomies) – A labeling is linearly separable if we can separate the labels with

a hyperplane

– Let f(n,p) = fraction of the 2n possible labelings that are linearly separable

f(n, p) = 1 n <= p + 1

2/ 2n (n-1 choose i) n > p+1

Page 80: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

If n < p+1, then pointswill be linearlyseparable (for large p)

Page 81: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

81

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for weights.

• Some methods find a separating hyperplane, but not an optimal one [according to some criterion of expected goodness]

– E.g., perceptron

• Support Vector Machine (SVM) finds a “maximum margin” solution– Maximizes the distance between the

hyperplane and the “difficult points” close to decision boundary

– One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

Ch. 15

Page 82: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

82

Support Vector Machine (SVM)

Support vectors

Maximizesmargin

• SVMs maximize the margin around the separating hyperplane.

• A.k.a. large margin classifiers

• The decision function is fully specified by a subset of training samples, the support vectors.

• Solving SVMs is a quadratic programming problem

• Seen by many as the most successful current text classification method*

*but other discriminative methods often perform very similarly

Sec. 15.1

Narrowermargin

Page 83: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

83

Distance of a Point to the Hyperplane

• Distance from example to the separator is

(where y is +1 or -1) w

xw byr

T

r

ρx

x′

w

Derivation of finding r:

Dotted line x’−x is perpendicular todecision boundary so parallel to w.

Unit vector is w/|w|, so line is rw/|w|.

x’ = x – y r w/|w|. x’ satisfies wTx’+b = 0.

So wT(x –y r w/|w|) + b = 0Recall that |w| = sqrt(wTw).

So, solving for r gives:r = y(wTx + b)/|w|

Sec. 15.1

Page 84: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Optimal Separating Hyperplane

• Solution to constrained optimization problem:

(Here yi {-1, 1} is the binary class label for example i)

• Without loss of generality, let ||w|| = 1/M

• Unique solution for a linearly separable data set • Margin M of the classifier

– the distance between the separating hyperplane and the closest training samples

– optimal separating hyperplane maximum margin

n1,...,i ,M)(||w||

1 subject to M max 0

ww, 0

wxwy iT

i

n1,...,i ,1)( subject to ||w|| min 0 ww, 0

wxwy iT

i

Page 85: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Optimal Hyperplane, Support Vectors, and Margin

Circles = support vectors = points on convex hull that are closest to hyperplane

M = margin = distance of support vectors from hyperplane

Goal is to find weight vectorthat maximizes M

Theory tells us that max-marginhyperplane leads to good generalization (see workby Vapnik in 1990’s)

Page 86: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Sketch of Optimization Problem

• Define Lagrangian as a function of w vector, and ’s

• The solution must satisfy

• Points with i > 0 are called support vectors and distance from hyperplane = M

1]- )([ ||w|| 2

1 )L(w,

n

1i0

2

wxwy iT

ii

0]1)([ 0 wxwy iT

ii

Page 87: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Sketch of Optimization Problem

• This results in a quadratic programming optimization problem

• Good news: – convex function of unknowns, unique optimum– Variety of well-known algorithms for finding this optimum

• Bad news:– Quadratic programming in general scales as O(n3)

– In practice takes O(na), where a ~ 1.6 to 2 - e.g., Mining the Web: Discovering Knowledge from

Hypertext data, S. Chakrabarti, Chapter 5, p166)

– Faster methods also available, specialized for SVMs• E.g., cutting plane method of Joachims, 2006

Page 88: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

From Chakrabarti, Chapter 5, 2002Timing results on text classification

Page 89: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

89

Soft Margin Classification

• If the training data is not linearly separable, slack variables ξi can be added to allow misclassification of difficult or noisy examples.

• Misclassifier points incur a cost in training, proportional to distance from hyperplane

• Try to minimize training set errors, and to place hyperplane “far” from each class (large margin)

ξj

ξi

Sec. 15.2.1

Page 90: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

90

Non-linear SVMs

• Datasets that are linearly separable (with some noise) work out great:

• But what are we going to do if the dataset is just too hard?

• How about … mapping data to a higher-dimensional space:

0

x2

x

0 x

0 x

Sec. 15.2.3

Page 91: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

91

Non-linear SVMs: Feature spaces

• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Sec. 15.2.3

Page 92: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Lecture 15: Text Classification Padhraic Smyth, UC Irvine

Support Vector Machines• If i > 0 then the distance of xi from the separating hyperplane is M

– Support vectors - points with associated i > 0

• The decision function f(x) is computed from support vectors as

=> prediction can be fast if i are sparse (i.e., most are zero)

• Non-linearly-separable case: can generalize to allow “slack” constraints

• Non-linear SVMs: replace original x vector with non-linear functions of x– “kernel trick” : can solve high-d problem without working directly in high d

• Computational speedups: can reduce training time to near- linear– e.g Platt’s SMO algorithm, Joachim’s SVMLight

n

ii

Tii xxyxf

1

)(

Page 93: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Experiments by Komarek and Moore, 2005

Page 94: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Accuracies and Training TimeKomarek and Moore, 2005

Page 95: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Accuracies and Training TimeKomarek and Moore, 2005

Page 96: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Summary on Classifiers

• Simple models (but can be effective)– Logistic regression– Naïve Bayes– K nearest-neighbors

• Decision trees– Good for high-dimensional problems with different data types

• More sophisticated:– Support vector machines– Boosting (e.g., boosting with naïve Bayes or with decision

stumps)

• Many tradeoffs in interpretability, score functions, etc

Page 97: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Software Tools

• MATLAB– Many free “toolboxes” on the Web for regression and prediction– e.g., see http://lib.stat.cmu.edu/matlab/

and in particular the CompStats toolbox

• R– General purpose statistical computing environment (successor to S)– Free (!)– Widely used by statisticians, has a huge library of functions and

visualization tools

• Weka– Widely used JAVA library for machine learning– Many Notes on Classification, support for cross-validation, evaluation, etc

• Commercial tools– SAS, other statistical packages– Data mining packages– Often are not progammable: offer a fixed menu of items

Page 98: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Reading

• From text: Chapter 10:– Covers both general concepts in classification and a broad range of

classifiers

• See also pointers on class Web site under Background Reading

• Additional reading for further information:

– Elements of Statistical Learning, • T. Hastie, R. Tibshirani, and J. Friedman, Springer Verlag, 2001

– Classification Trees, • Breiman, Friedman, Olshen, and Stone, Wadsworth Press, 1984.

– SVMs• T. Joachims, Learning to Classify Text using Support Vector Machines. Kluwer, 2002.

Page 99: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Additional Slides

Page 100: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Model Averaging Techniques

Page 101: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Model Averaging

• Can average over parameters and models– E.g., weighted linear combination of predictions from multiple

models

y = wk yk

– Why? Any predictions from a point estimate of parameters or a single model has only a small chance of the being the best

– Averaging makes our predictions more stable and less sensitive to random variations in a particular data set (good for less stable models like trees)

Page 102: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Model Averaging

• Model averaging flavors– Fully Bayesian: average over uncertainty in parameters and

models

– “empirical Bayesian”: learn weights over multiple models• E.g., stacking and bagging

– Build multiple simple models in a systematic way and combine them, e.g.,

• Boosting: will say more about this in later lectures

• Random forests: stochastically perturb the data, learn multiple trees, and then combine for prediction

Page 103: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Bagging for Combining Classifiers

• Training data sets of size N

• Generate B “bootstrap” sampled data sets of size N – Bootstrap sample = sample with replacement – e.g. B = 100

• Build B models (e.g., trees), one for each bootstrap sample– Intuition is that the bootstrapping “perturbs” the data enough to make

the models more resistant to true variability

• For prediction, combine the predictions from the B models– E.g., for classification p(c | x) = fraction of B models that predict c

– Plus: generally improves accuracy on models such as trees– Negative: lose interpretability

– Related techniques: random forests, boosting.

Page 104: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

green = majority votepurple = averaging the probabilities

From Hastie, Tibshirani,and Friedman, 2001

Page 105: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Illustration of Boosting:Color of points = class labelDiameter of points = weight at each iterationDashed line: single stage classifier. Green line: combined, boosted classifierDotted blue in last two: bagging (from G. Rätsch, Phd thesis, 2001)

Page 106: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Classification Case Study

Page 107: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Link Prediction in Coauthor GraphsO Madadhain, Hutchins, Smyth, SIGKDD, 2005

• Binary classification problem– Training data:

• graph of coauthor links, 100k authors, 300k links• data over several years

– Test data: coauthor graph for same authors in a future year– Classification problem:

• predict if pair(A,B) will coauthor• Training and test pairs selected in various ways

• Compared a variety of different classifiers and evaluation metrics– Skewed class distribution

• No link present (class 0) in 93.8 % of test examples• Link present (class 1) in 6.2 % of test examples

Page 108: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Evaluation Metrics

• Classification error– If p(link[A,B]) > 0.5, predict a link

• Brier Score

– [ p(link[A,B] – (A,B) ]2

• ROC Area– area under ROC plot (between 0 and 1)

Page 109: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Link Prediction Evaluation

ClassificationError

Baseline 6.2

Single Feature 6.2

Naïve Bayes 15.5

Logistic 6.1

Boosting 6.4

Averaged 6.2

Page 110: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Link Prediction Evaluation

ClassificationError

ROCArea

Baseline 6.2 0.50

Single feature 6.2 0.54

Naïve Bayes 15.5 0.78

Logistic 6.1 0.80

Boosting 6.4 0.79

Averaged 6.2 0.80

Page 111: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Link Prediction Evaluation

ClassificationError

ROCArea

BrierScore

Baseline 6.2 0.50 100.0

Single feature 6.2 0.54 98.6

Naïve Bayes 15.5 0.78 211.7

Logistic 6.1 0.80 83.1

Boosting 6.4 0.79 83.4

Averaged 6.2 0.80 82.2

Page 112: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Lift Curves for Different Models

Base Rate of links = 6.2%

Page 113: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Interpretation of Ranking at Top of Ranked List

• Top 50 ranked candidates– Averaged: contains 44 true links – Logistic: contains 40 true links– Baseline: contains 3 true links

• Top 500 ranked candidates– Averaged: contains 300 true links– Logistic: contains 298 true links– Baseline: contains 31 true links

Page 114: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Lift Curves for Different Models

Base Rate of links = 0.2%

Page 115: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Task

Decision Tree Classifiers

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Classification

Decision boundaries = hierarchy of axis-parallel

Cross-validated error

Greedy search in tree space

None specified

Tree

Page 116: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Task

Naïve Bayes Classifier

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Classification

Conditional independence probability model

Likelihood

Closed form probability estimates

None specified

Conditional probability tables

Page 117: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Task

Logistic Regression

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Classification

Log-odds(C) = linear function of X’s

Log-likelihood

Iterative (Newton) method

None specified

Logistic weights

Page 118: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Task

Nearest Neighbor Classifier

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Classification

Memory-based

Cross-validated error (for selecting k)

None

None specified

None

Page 119: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Task

Support Vector Machines

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Classification

Hyperplanes

“Margin”

Convex optimization

(quadratic programming)

None specified

None

Page 120: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Task

Neural Networks

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Regression

Y = nonlin function of X’s

Least-squares

Gradient descent

None specified

Network weights

Page 121: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Task

Multivariate Linear Regression

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Regression

Y = Weighted linear sum of X’s

Least-squares

Linear algebra

None specified

Regression coefficients

Page 122: CS 277: Data Mining Notes on Classification Padhraic Smyth Department of Computer Science University of California, Irvine

Data Mining Lectures Notes on Classification © Padhraic Smyth, UC Irvine

Task

Autoregressive Time Series Models

Representation

Score Function

Search/Optimization

Data Management

Models, Parameters

Time Series Regression

X = Weighted linear sum of earlier X’s

Least-squares

Linear algebra

None specified

Regression coefficients