More Methodology; Nearest-Neighbor Classifiers (Sec 4.7)

Review: Properties of DTs

•Axis-orthogonal, hyperrectangular, piecewise-constant models

•Categorical labels

•Non-metric

Separation of train & test

•Fundamental principle (1st amendment of ML):

•Don’t evaluate accuracy (performance) of your classifier (learning system) on the same data used to train it!

Holdout data

•Usual to “hold out” a separate set of data for testing; not used to train classifier

•A.k.a., test set, holdout set, evaluation set, etc.

•E.g.,

•Accuracy measured on the training data is training set accuracy

•Accuracy measured on the holdout data is test set (or generalization) accuracy

Gotchas...

•What if you’re unlucky when you split data into train/test?

•E.g., all train data are class A and all test are class B?

•No “red” things show up in training data

•Best answer: stratification

•Try to make sure class (+feature) ratios are same in train/test sets (and same as original data)

•Why does this work?

•Almost as good: randomization

•Shuffle data randomly before split

•Why does this work?
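A minimal sketch of a stratified split, for illustration only. The function name `stratified_split` and the parameters `test_fraction` and `seed` are assumptions, not from the slides. It stratifies on class labels only (not on feature ratios) and also shuffles within each class, so it combines both answers above.

```python
import random
from collections import defaultdict

def stratified_split(X, Y, test_fraction=0.3, seed=0):
    """Split (X, Y) so each class keeps roughly the same ratio in train and test."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(Y):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # randomize within each class
        n_test = round(len(indices) * test_fraction)  # this class's share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so the classes are interleaved in the output.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    X_train = [X[i] for i in train_idx]
    Y_train = [Y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    Y_test = [Y[i] for i in test_idx]
    return X_train, Y_train, X_test, Y_test
```

With 80 examples of class A and 20 of class B and `test_fraction=0.25`, the test set gets 20 A's and 5 B's, the same 4:1 ratio as the original data.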

More gotchas...

•What if your data set is small?

•Might not be able to get perfect stratification

•Can’t get really representative accuracy from any single train/test split

•A: cross-validation

for (i=0;i<k;++i) {
  // fold i (N/k examples) is the test set; the rest is the training set
  [Xtrain,Ytrain,Xtest,Ytest]=splitData(X,Y,N/k,i);
  model[i]=train(Xtrain,Ytrain);
  cvAccs[i]=measureAcc(model[i],Xtest,Ytest);
}
avgAcc=mean(cvAccs);
stdAcc=stddev(cvAccs);
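The loop above might look roughly like this in Python. This is a sketch under assumptions: `train` and `measure_acc` are caller-supplied placeholders standing in for the slide's `train` and `measureAcc`, `X` and `Y` are plain lists, and folds are contiguous blocks of the shuffled data. It needs `k >= 2` so the standard deviation is defined.

```python
import random
import statistics

def cross_validate(X, Y, k, train, measure_acc, seed=0):
    """k-fold CV: returns (mean accuracy, std dev of accuracy, per-fold accuracies)."""
    n = len(X)
    order = list(range(n))
    random.Random(seed).shuffle(order)             # shuffle once before partitioning
    X = [X[j] for j in order]
    Y = [Y[j] for j in order]

    cv_accs = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k      # boundaries of fold i
        X_test, Y_test = X[lo:hi], Y[lo:hi]        # fold i is held out for testing
        X_train = X[:lo] + X[hi:]                  # all other folds form the training set
        Y_train = Y[:lo] + Y[hi:]
        model = train(X_train, Y_train)
        cv_accs.append(measure_acc(model, X_test, Y_test))

    return statistics.mean(cv_accs), statistics.stdev(cv_accs), cv_accs
```

Every example is used for testing exactly once and for training k-1 times, which is why this gives a more representative estimate than any single split on small data.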

CV in pix

[Figure: original data [X;Y] → random shuffle [X’;Y’] → k-way partition [X1’;Y1’], [X2’;Y2’], ..., [Xk’;Yk’] → k train/test sets → k accuracies, e.g., 53.7%, 85.1%, 73.2%]

But is it really learning?

•Now we know how well our models are performing

•But are they really learning?

•Maybe any classifier would do as well

•E.g., a default classifier (pick the most likely class) or a random classifier

•How can we tell if the model is learning anything?

•Go back to first definitions

•What does it mean to learn something?
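One way to make the "default classifier" comparison concrete is to compute the accuracy you would get by always predicting the most common training class. The helper name `default_classifier_accuracy` is hypothetical; this is only a baseline sketch, not a test of whether learning has occurred.

```python
from collections import Counter

def default_classifier_accuracy(Y_train, Y_test):
    """Accuracy of always predicting the most common class seen in training."""
    majority_class, _ = Counter(Y_train).most_common(1)[0]
    hits = sum(1 for y in Y_test if y == majority_class)
    return hits / len(Y_test)
```

If a trained model's test accuracy is not clearly above this number, there is little evidence that it has picked up anything beyond the class frequencies.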

The learning curve

•Train on successively larger fractions of data

•Watch how accuracy (performance) changes
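A learning curve can be sketched as follows. As assumptions for this example, `train` and `measure_acc` are caller-supplied placeholder functions, the training data has already been shuffled, and the test set stays fixed while the training subsample grows.

```python
def learning_curve(X_train, Y_train, X_test, Y_test, train, measure_acc, steps=10):
    """Return [(n_examples, test_accuracy), ...] for growing training subsets."""
    curve = []
    for step in range(1, steps + 1):
        # Size of this subsample: step/steps of the training data, at least 1.
        n = max(1, round(len(X_train) * step / steps))
        model = train(X_train[:n], Y_train[:n])    # assumes data is already shuffled
        curve.append((n, measure_acc(model, X_test, Y_test)))
    return curve
```

Plotting accuracy against `n_examples` shows both how fast the model improves and where the accuracy levels off.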

Measuring variance

•Cross validation helps you get better estimate of accuracy for small data

•Randomization (shuffling the data) helps guard against poor splits/ordering of the data

•Learning curves help assess learning rate/asymptotic accuracy

•Still one big missing component: variance

•Definition: Variance of a classifier is the fraction of error due to the specific data set it’s trained on

Measuring variance

•Variance tells you how much you expect your classifier/performance to change when you train it on a new (but similar) data set

•E.g., take 5 samplings of a data source; train/test 5 classifiers

•Accuracies: 74.2, 90.3, 58.1, 80.6, 90.3

•Mean accuracy: 78.7%

•Std dev of acc: 13.4%

•Variance is usually a function of both classifier and data source

•High variance classifiers are very susceptible to small changes in data
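The mean and standard deviation quoted above can be checked with Python's standard library. Note that the 13.4% figure matches the sample standard deviation (dividing by n-1), which is what `statistics.stdev` computes.

```python
import statistics

# The five accuracies from the example above.
accuracies = [74.2, 90.3, 58.1, 80.6, 90.3]

mean_acc = statistics.mean(accuracies)
std_acc = statistics.stdev(accuracies)    # sample std dev (divides by n - 1)

print(f"mean = {mean_acc:.1f}%, std dev = {std_acc:.1f}%")
# prints: mean = 78.7%, std dev = 13.4%
```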

Putting it all together

•Suppose you want to measure the expected accuracy of your classifier, assess learning rate, and measure variance all at the same time?

for (i=0;i<10;++i) {                     // variance reps
  shuffle data
  do 10-way CV partition of data
  for each train/test partition {        // xval
    for (pct=0.1;pct<=0.9;pct+=0.1) {    // LC
      subsample pct fraction of training set
      train on subsample, test on test set
    }
  }
  avg across all folds of CV partition
  generate learning curve for this partition
}
get mean and std across all curves

Putting it all together

[Figure: “hepatitis” data]

5 minutes of math...

•Decision trees are non-metric

•Don’t know anything about relations between instances, except sets induced by feature splits

•Often, we have well-defined distances between points

•Idea of distance encapsulated by a metric

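5 minutes of math...

•Definition: a metric function $d : X \times X \to \mathbb{R}$

•is a function that obeys the following properties:

•Identity: $d(x, y) = 0 \iff x = y$

•Symmetry: $d(x, y) = d(y, x)$

•Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$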

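5 minutes of math...

•Examples:

•Euclidean distance: $d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}$

* Note: omitting the square root no longer yields a true metric (the triangle inequality can fail), but it preserves the ordering of distances and usually won’t change our results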


•Manhattan (taxicab) distance: $d(x, y) = \sum_{i} |x_i - y_i|$

•Distance travelled along a grid between two points

•No diagonals allowed


•What if some attribute is categorical?

•Typical answer is 0/1 distance:

•For each attribute, add 1 if the instances differ in that attribute, else 0
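The three distances above can be written in a few lines each. The function names are chosen for this sketch; instances are assumed to be equal-length tuples of attribute values.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Grid (taxicab) distance: sum of per-axis absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def zero_one(a, b):
    """0/1 distance for categorical attributes: count the attributes that differ."""
    return sum(1 for x, y in zip(a, b) if x != y)
```

For example, between (0, 0) and (3, 4) the Euclidean distance is 5 and the Manhattan distance is 7, while ("red", "small") and ("red", "large") are at 0/1 distance 1.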

Distances in classification

•Nearest neighbor: find the nearest instance to the query point in feature space, return the class of that instance

•Simplest possible distance-based classifier

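•With more notation: given training data $\{(x_i, y_i)\}_{i=1}^{N}$ and a query point $x$, predict $\hat{y} = y_{i^*}$, where $i^* = \arg\min_i d(x, x_i)$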
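A minimal nearest-neighbor classifier, as a sketch: the name `nearest_neighbor` and the `dist` parameter are assumptions for this example, and the default distance is Euclidean via `math.dist` (Python 3.8+). There is no training step; the "model" is just the stored training data.

```python
import math

def nearest_neighbor(query, X_train, Y_train, dist=math.dist):
    """Return the label of the training instance closest to the query point."""
    # Scan every training instance and keep the index with the smallest distance.
    best = min(range(len(X_train)), key=lambda i: dist(query, X_train[i]))
    return Y_train[best]
```

Any distance function taking two points can be passed as `dist`, for instance a Manhattan or 0/1 distance.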

Properties of NN

•Training time of NN?

•Classification time?

•Geometry of model?

NN miscellany

•Slight generalization: k-Nearest neighbors (k-NN)

•Find k training instances closest to query point

•Vote among them for label

•Q: How does this affect system?

•Gotcha: unscaled dimensions

•What happens if one axis is measured in microns and one in light-years?

•Usual trick is to scale each axis to [-1,1] range
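The two ideas above, voting among k neighbors and rescaling each axis to [-1, 1], can be combined in a short sketch. The helper names `fit_scaler`, `scale`, and `knn_classify` are hypothetical. The scaling uses the min and max of the training data on each axis (so query points outside that range can fall outside [-1, 1]), distances are Euclidean, and ties in the vote are broken by whichever label `Counter` lists first.

```python
import math
from collections import Counter

def fit_scaler(X):
    """Record the per-axis min and max of the training data."""
    columns = list(zip(*X))
    return [min(c) for c in columns], [max(c) for c in columns]

def scale(point, mins, maxs):
    """Map each coordinate linearly so the training range becomes [-1, 1]."""
    return tuple(
        0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0
        for v, lo, hi in zip(point, mins, maxs)
    )

def knn_classify(query, X_train, Y_train, k=3):
    """Label a query by majority vote among its k nearest (scaled) neighbors."""
    mins, maxs = fit_scaler(X_train)
    scaled_train = [scale(x, mins, maxs) for x in X_train]
    q = scale(query, mins, maxs)

    # Indices of the k training points closest to the query.
    nearest = sorted(range(len(X_train)),
                     key=lambda i: math.dist(q, scaled_train[i]))[:k]
    votes = Counter(Y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Without the scaling step, an axis with a huge numeric range would dominate the distance and the other axes would have almost no influence on which neighbors are chosen.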
