
Focus Article

Nearest-neighbor methods

Clifton Sutton∗

∗Correspondence to: [email protected]
George Mason University, Statistics, 1700 Nguyen Engineering Building, Fairfax, Virginia

Nearest-neighbor classification is a simple and popular method for supervised classification. The basic method is to classify a query point as belonging to a certain class if, among the k nearest neighbors of the query point, more of them belong to this class than to any other class. Variations of the basic method include weighted nearest-neighbor methods and adaptive nearest-neighbor methods (which allow some variables to have greater influence than other variables). © 2012 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2012, 4:307–309. doi: 10.1002/wics.1195

Keywords: classification; distance; local; majority

INTRODUCTION

Nearest-neighbor classification is an extremely simple and popular method of classification which can be competitive in many situations. It is a local method, as opposed to being model based, and it uses few assumptions about the data. Nearest-neighbor methods tend to work well when the class boundaries are somewhat complex, perhaps due to multimodal densities for the class distributions. If class boundaries are close to linear, then other classification methods may perform better.

THE BASIC METHOD

Suppose one has a training data set consisting of n observations of the form (x_i, g_i), where x_i is a p-dimensional vector composed of the values of p predictor variables for the ith observation, and g_i is the class of the ith observation. To classify a query for which a vector of predictor variable values, x_q, is available but the class is not, the method of k-nearest neighbors (KNN) predicts the unknown class based on the classes associated with the k predictor vectors of the training set that are closest to x_q, using a simple majority decision rule. For example, if k = 15 and 5 of a query's 15 nearest neighbors are from class 1, 7 of them are from class 2, and 3 of them are from class 3, then the predicted class for the query is class 2.
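As a concrete illustration of the majority rule, here is a minimal Python sketch of the basic method; the function and variable names (knn_predict, X_train, g_train, x_q) are illustrative rather than taken from the article.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, g_train, x_q, k=15):
    """Predict the class of a query point x_q by majority vote
    among its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    votes = Counter(g_train[i] for i in nearest)    # count class labels among the neighbors
    return votes.most_common(1)[0][0]               # class with the most votes
```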

The predictor variable values of the training set are usually scaled so that each sample mean is 0 and each sample standard deviation is 1, and then the same linear transformations are applied to the query vector values. It is assumed that the numbers of cases of the various classes in the training set are proportional to the proportions of members of the classes in the population from which the query is drawn, i.e., the representation of each class in the training set is proportional to the appropriate prior probability for the class of the query. In a situation for which one wants to give equal prior probability to each class represented in the training set (e.g., if there is no reason to favor one class over any other class), the number of cases for each class should be the same in the training set; if this is not the case, then the majority voting method used to predict the class of the query should be adjusted. (One way to do the adjustment would be to let each of the KNN contribute a vote with a weight proportional to the inverse of the number of training set cases having the class of that nearest neighbor.)
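The standardization and the prior-adjusted vote can be sketched as follows. This is an illustrative Python implementation: the helper names are not from the article, and the inverse-class-frequency weighting follows the parenthetical suggestion in the text.

```python
import numpy as np
from collections import Counter

def standardize(X_train, x_q):
    """Center and scale each predictor using the training sample,
    then apply the same linear transformation to the query vector."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    return (X_train - mu) / sd, (x_q - mu) / sd

def knn_predict_equal_priors(X_train, g_train, x_q, k=15):
    """Majority vote in which each neighbor's vote is weighted by the
    inverse of its class's frequency in the training set, so that
    unequal class sizes do not act as unequal prior probabilities."""
    X_std, q_std = standardize(X_train, x_q)
    class_counts = Counter(g_train)
    dists = np.linalg.norm(X_std - q_std, axis=1)
    votes = Counter()
    for i in np.argsort(dists)[:k]:
        votes[g_train[i]] += 1.0 / class_counts[g_train[i]]
    return votes.most_common(1)[0][0]
```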

To determine the KNN of a query, Euclidean distance is usually used, but one could try other metrics, with L1 distance being a popular alternative. Often the value of k used is more crucial than the choice of metric. To determine a good value to use for k, a leave-one-out cross-validation approach can be used (i.e., classify each observation of the training sample using nearest neighbors based on the other n − 1 observations, using various values for k, and then determine which value of k produced the fewest misclassifications when the results are pooled). The same approach can also be used to select a metric from two or more possibilities. (R's train.kknn function uses cross-validation to select a value of k and a metric (and a kernel, or weight function, if one wants to try weighted nearest-neighbor classification).)
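A leave-one-out cross-validation search for k, as described above, might look like the following sketch; the candidate grid of k values is an arbitrary assumption, and the same loop could be repeated for each candidate metric.

```python
import numpy as np
from collections import Counter

def loocv_error(X, g, k):
    """Leave-one-out cross-validation error of k-nearest-neighbor
    classification: each observation is classified using the other n - 1."""
    n = len(g)
    errors = 0
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                       # exclude the held-out point itself
        nearest = np.argsort(dists)[:k]
        pred = Counter(g[j] for j in nearest).most_common(1)[0][0]
        errors += (pred != g[i])
    return errors / n

def choose_k(X, g, candidates=(1, 3, 5, 7, 9, 11, 13, 15)):
    """Return the candidate k with the fewest pooled misclassifications."""
    return min(candidates, key=lambda k: loocv_error(X, g, k))
```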

WEIGHTED NEAREST NEIGHBORS

In addition to the basic KNN classification rule described above, one can also consider using weighted nearest neighbors. With weighted nearest neighbors, instead of having each of the KNN count equally, without regard to their distances from x_q, each of the nearest neighbors is given a weighted vote, with a weight that decreases as the distance between the neighbor and the query increases, and with the distance to the (k + 1)th nearest neighbor determining the distance at which the weight is 0. For example, one can let the weights given to the KNN decrease linearly, giving weight 1 if a neighbor's predictor vector is exactly equal to x_q, and giving weight 0.5 if a neighbor's distance to x_q is half of the distance between the query point and its (k + 1)th nearest neighbor. (Note that some describe the weighting using kernels, but in high dimensions it may be easiest to think of the weighting simply in terms of a function that gives decreasing weight with increasing distance, assigning positive weights to at most k neighbors. Using the method described above, it can be noted that only k − 1 neighbors will be given positive weights if the kth and (k + 1)th nearest neighbors are at the same distance from the query.)
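A sketch of this linear weighting scheme, assuming Euclidean distance, is given below; weights fall from 1 at distance zero to 0 at the distance of the (k + 1)th nearest neighbor.

```python
import numpy as np
from collections import Counter

def weighted_knn_predict(X_train, g_train, x_q, k=15):
    """Weighted nearest-neighbor vote with linearly decreasing weights:
    a neighbor at distance d gets weight 1 - d / d_(k+1), where d_(k+1)
    is the distance to the (k+1)th nearest neighbor, so the weight
    reaches 0 at that distance (assumes that distance is positive)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    order = np.argsort(dists)
    d_cutoff = dists[order[k]]                 # distance to the (k+1)th neighbor
    votes = Counter()
    for i in order[:k]:
        votes[g_train[i]] += max(0.0, 1.0 - dists[i] / d_cutoff)
    return votes.most_common(1)[0][0]
```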

OTHER VARIATIONS

As described above, nearest-neighbor methods have no built-in variable selection capability which could be used to give increased weight (in the distance determinations) to the most important variables and little or no weight to variables which are 'noise' variables, supplying little or no information which is useful for the classification task. Hastie, Tibshirani, and Friedman also note that 'When nearest-neighbor classification is carried out in a high-dimensional feature space, the nearest neighbors of a point can be very far away, causing bias and degrading the performance of the rule' (Ref 1, p. 427). Two general techniques which can be used in an attempt to combat the presence of noise variables and the so-called curse of dimensionality are (1) the use of locally adaptive neighborhoods and (2) global dimension reduction.

With adaptive nearest-neighbor methods, in a sense the metric is locally adapted to give less useful predictor variables little or no weight. Hastie and Tibshirani2 describe the discriminant adaptive nearest neighbor (DANN) rule, which deforms the neighborhood used at a query point away from a spherical neighborhood (which effectively gives equal weight to each predictor variable) in such a way that the neighborhood is stretched out in certain directions. These directions may be roughly parallel to class boundaries, so that a majority vote using training observations in the deformed neighborhood may better determine on which side of a class boundary a query point is located. Deforming the neighborhood is equivalent to adapting the metric, giving the predictor variables different weights with regard to determining the distances to neighboring points. DANN adapts the metric differently at each query point, using a between-class covariance matrix and a pooled within-class covariance matrix based on a specified number of nearest neighbors (found using a conventional metric). Friedman3 proposed a simpler locally adaptive method with which rectangular neighborhoods are determined at each query point by successively narrowing a hypercube in selected dimensions.
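The following Python sketch is only in the spirit of the DANN construction summarized above: it builds a local metric from a pooled within-class covariance W and a between-class covariance B computed on a conventional neighborhood of the query, stretching distances along between-class directions. The choices eps = 1, a local neighborhood of 50 points, and the plain covariance estimates are assumptions of this sketch, not the authors' exact prescription.

```python
import numpy as np
from collections import Counter

def local_adaptive_metric(X_local, g_local, eps=1.0):
    """Build a local metric: whiten with the pooled within-class covariance W,
    stretch along between-class directions B, and add eps*I so that no
    direction is collapsed entirely."""
    classes, counts = np.unique(g_local, return_counts=True)
    p = X_local.shape[1]
    overall_mean = X_local.mean(axis=0)
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for c, n_c in zip(classes, counts):
        Xc = X_local[g_local == c]
        mu_c = Xc.mean(axis=0)
        W += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - overall_mean).reshape(-1, 1)
        B += (n_c / len(g_local)) * (diff @ diff.T)
    W /= len(g_local)
    # symmetric inverse square root of W via its eigendecomposition
    vals, vecs = np.linalg.eigh(W)
    W_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
    B_star = W_inv_sqrt @ B @ W_inv_sqrt
    return W_inv_sqrt @ (B_star + eps * np.eye(p)) @ W_inv_sqrt

def adaptive_knn_predict(X_train, g_train, x_q, k=15, k_local=50):
    """Classify x_q with k nearest neighbors under a metric adapted from the
    k_local conventional (Euclidean) nearest neighbors of the query."""
    d0 = np.linalg.norm(X_train - x_q, axis=1)
    local = np.argsort(d0)[:k_local]
    Sigma = local_adaptive_metric(X_train[local], g_train[local])
    diffs = X_train - x_q
    d_adapt = np.einsum('ij,jk,ik->i', diffs, Sigma, diffs)   # (x - x_q)^T Sigma (x - x_q)
    nearest = np.argsort(d_adapt)[:k]
    return Counter(g_train[i] for i in nearest).most_common(1)[0][0]
```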

In contrast to locally adaptive nearest-neighbor methods, which in a way perform local dimension reduction, one can also try global dimension reduction, and apply traditional nearest-neighbor methods to some selected subspace of the original feature space. Hastie and Tibshirani2 describe a variation of their DANN method which does this. The lower-dimensional subspace used is based on the eigenvectors of an averaged between-centroids sum of squares matrix.

Alternatively, some simple method could be used to reduce the dimension of the data vectors. One could compute a one-way analysis of variance F statistic for each predictor variable, as if one were going to test whether or not the distribution means for all of the classes are the same, and then use only those variables having an F statistic value above a chosen threshold, thus reducing the dimension by simply not using some of the variables. Or one could standardize the variables and determine the first r principal component directions (r < p), and then apply the nearest-neighbor methods described above on predictor variables corresponding to these r directions. For either approach, the dimension selected can be determined in a data-driven manner. These two approaches to dimensionality reduction can be combined, first using the F statistics to hopefully eliminate noise variables and weak predictors, and then using principal components to further reduce the dimension.
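The F-statistic screening and principal components steps can be sketched as follows; the threshold of 4.0 and r = 2 are placeholders, and in practice the same centering, scaling, and projection fitted on the training set must also be applied to any query vector before running the nearest-neighbor rule.

```python
import numpy as np

def anova_f_statistics(X, g):
    """One-way ANOVA F statistic for each predictor: between-class mean square
    divided by within-class mean square, as if testing equal class means."""
    classes = np.unique(g)
    n, p = X.shape
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(p)
    ss_within = np.zeros(p)
    for c in classes:
        Xc = X[g == c]
        mu_c = Xc.mean(axis=0)
        ss_between += len(Xc) * (mu_c - grand_mean) ** 2
        ss_within += ((Xc - mu_c) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = n - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

def reduce_dimension(X, g, f_threshold=4.0, r=2):
    """Drop variables with small F statistics, then project the survivors
    onto their first r principal component directions (after standardizing).
    Returns the reduced training data plus the pieces needed to transform queries."""
    keep = anova_f_statistics(X, g) > f_threshold
    Z = X[:, keep]
    mu, sd = Z.mean(axis=0), Z.std(axis=0, ddof=1)
    Z = (Z - mu) / sd
    # principal component directions = right singular vectors of the standardized data
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:r].T, keep, mu, sd, Vt[:r]
```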

MULTIPLE QUERIES FROM ONE CLASS

When there is a group of queries which are known to be of the same class, the information gained from applying a nearest-neighbor method to each query point separately can be pooled to arrive at a class prediction for the group. One way to do this would be to use a simple majority rule which weights each query point equally by giving one vote to the predicted class of each query point. However, this simplistic method ignores the fact that sometimes the nearest-neighbor method arrives at a rather decisive prediction (e.g., if all of a query point's KNN are of the same class) and in other cases the predicted class may be chosen because one class just barely beats out a number of other classes. To better incorporate the strength of each prediction, the weights of the votes from the separate predictions can be pooled together to arrive at a predicted class for the entire group of query points. For example, if there is a group of 20 query points known to belong to the same class, and 15 nearest neighbors were used to obtain a predicted class for each query point, a majority rule can be applied to the combined collection of 300 nearest neighbors to arrive at a predicted class for the group. If weighted nearest neighbors were used for each separate prediction, then the overall vote for each class can be obtained by considering the 300 weights corresponding to the 15 nearest neighbors of each of the 20 query points, and summing all of the weights assigned to each particular class.
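A pooled vote over a group of queries known to share a class might be sketched as follows; replacing the unit vote with a distance-based weight gives the weighted version described above.

```python
import numpy as np
from collections import Counter

def predict_group_class(X_train, g_train, X_queries, k=15):
    """Pool the k nearest neighbors of every query point in a group that is
    known to share one class, and take a single majority vote over the
    combined collection (e.g., 20 queries x 15 neighbors = 300 votes)."""
    pooled = Counter()
    for x_q in X_queries:
        dists = np.linalg.norm(X_train - x_q, axis=1)
        for i in np.argsort(dists)[:k]:
            pooled[g_train[i]] += 1     # one vote per neighbor; use a distance-based
                                        # weight here for weighted nearest neighbors
    return pooled.most_common(1)[0][0]
```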

A variation which may be appealing when there are outliers in the query points, or when there is a possibility that none of the training sample's classes is the class of the group of query points, is to reduce the voting weights when the nearest neighbors are relatively far away. This can be done by dividing the weights by either the radius or the volume of the p-dimensional ball centered at the query point and having a radius equal to the distance between the query point and its (k + 1)th nearest neighbor. Then query points that are relatively far from the training set points will result in large radii for the balls, and thus a low weighted vote even for the winning class. Experience may suggest a threshold to use so that if the weighted vote of the winning class does not exceed it, the conclusion will be that the group of query points does not belong to any of the classes represented in the training data.
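A sketch of this variant with a 'none of the above' decision follows; it divides by the radius (the volume of the ball would work similarly), and the threshold value is purely illustrative since, as the text notes, it would be set from experience.

```python
import numpy as np
from collections import Counter

def predict_group_or_none(X_train, g_train, X_queries, k=15, threshold=10.0):
    """Pooled group vote in which each query's votes are divided by the radius
    of the ball reaching its (k+1)th nearest neighbor, so isolated queries
    contribute little; if even the winning class falls below the threshold,
    report that the group matches none of the training classes."""
    pooled = Counter()
    for x_q in X_queries:
        dists = np.linalg.norm(X_train - x_q, axis=1)
        order = np.argsort(dists)
        radius = dists[order[k]]                   # distance to the (k+1)th neighbor
        for i in order[:k]:
            pooled[g_train[i]] += 1.0 / radius     # radius-normalized vote
    winner, score = pooled.most_common(1)[0]
    return winner if score > threshold else None   # None: no represented class
```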

ORIGINS

The basic idea of using nearest neighbors for classification seems to have been first proposed in an unpublished 1951 technical report by Fix and Hodges4 (later reprinted in Agrawala5). This report proposed nonparametric estimation of the likelihood ratio for the purpose of performing classification. Silverman and Jones6 provide an excellent review of and commentary about the original report by Fix and Hodges,4 and also the 1952 report by Fix and Hodges7 (later reprinted in Agrawala5), which further considers KNN classification (mostly for the case of k = 1).

Silverman and Jones6 indicate that the first appearance of a nearest-neighbors multivariate density estimator in the published literature was in the 1965 paper by Loftsgaarden and Quesenberry.8 For the one-dimensional case, they suggested that the density at a point x be estimated by

k / (2 n d_k(x)),

where n is the sample size, k is a positive integer, and d_k(x) is the distance from the point x to the kth nearest neighbor in the sample. However, it should be noted that this idea was proposed earlier, in 1951, by Fix and Hodges.4
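For completeness, the one-dimensional estimator can be written directly; this small sketch simply evaluates k / (2 n d_k(x)) for a given sample.

```python
import numpy as np

def knn_density_1d(x, sample, k):
    """One-dimensional k-nearest-neighbor density estimate k / (2 n d_k(x)),
    where d_k(x) is the distance from x to its kth nearest sample point."""
    n = len(sample)
    d_k = np.sort(np.abs(np.asarray(sample) - x))[k - 1]
    return k / (2.0 * n * d_k)
```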

REFERENCES

1. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag; 2001.

2. Hastie T, Tibshirani R. Discriminant adaptive nearest-neighbor classification. IEEE Pattern Recognit Mach Intell 1996, 18:607–616.

3. Friedman J. Flexible metric nearest-neighbor classification, Technical Report, Stanford University, Stanford, CA, 1994.

4. Fix E, Hodges JL. Discriminatory analysis—nonparametric discrimination: consistency properties, Technical Report, USAF School of Aviation Medicine, Randolph Field, TX, 1951. (Reprinted as pp. 261–279 of Agrawala, 1977.)

5. Agrawala AK. Machine Recognition of Patterns. New York: IEEE Press; 1977.

6. Silverman BW, Jones MC. E. Fix and J. L. Hodges (1951): an important contribution to nonparametric discriminant analysis and density estimation. Int Stat Rev 1989, 57:233–247.

7. Fix E, Hodges JL. Discriminatory analysis—nonparametric discrimination: small sample performance, Technical Report, USAF School of Aviation Medicine, Randolph Field, TX, 1952. (Reprinted as pp. 280–322 of Agrawala, 1977.)

8. Loftsgaarden DO, Quesenberry CP. A nonparametric estimate of a multivariate density function. Ann Math Stat 1965, 36:1049–1051.
