CS 7616 Pattern Recognition
Non-parametric methods: Kernels and other Naïve Things
Aaron Bobick, School of Interactive Computing
Administrivia
• The next PS will use the same data with two more methods: probably k-NN and maybe Naïve Bayes
Today brought to you by…
• This lecture (and I'm guessing some more to come!) graciously provided (with real email!) by Aarti Singh @ CMU
Parametric methods
• Assume some functional form (Gaussian, logistic, linear, quadratic) for
  • P(X|Y) and P(Y) as in Bayes (remember Y is now label)
  • P(Y|X) as in logistic, linear and nonlinear regression
• Estimate parameters (μ, Σ, w, b) using MLE/MAP or through gradient ascent
• Pro – need few data points to learn parameters
• Con – strong distributional assumptions, not satisfied in practice
  • Doesn't really work very well (sometimes? Often?)
Example
[Scatter plot: clusters labeled by digit]
Hand-written digit images projected as points on a two-dimensional (nonlinear) feature space
Non-Parametric methods
• Typically don't make any distributional assumptions
• As we have more data, we should be able to learn more complex models
• Let the number of parameters scale with the amount of training data
• Today, we will see some nonparametric methods for
  • Density estimation
  • Classification
  • Maybe regression and Naïve Bayes
Histogram density estimate
• Partition the feature space into distinct bins with widths Δ_i and count the number of observations, n_i, in each bin.
• Often, the same width is used for all bins, Δ_i = Δ
• Δ acts as a smoothing parameter.
Image src: Bishop book

$$\hat{p}(x) = \sum_i \frac{n_i}{n\,\Delta_i}\,\mathbf{1}\{x \in \text{Bin}_i\}$$
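A minimal NumPy sketch of this estimator (the function name and the equal-width binning over the data range are illustrative choices, not from the slides):

import numpy as np

def histogram_density(data, x, n_bins=10):
    """Histogram density estimate: count in the bin containing x, divided by n * bin width."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    delta = edges[1] - edges[0]                   # common bin width
    counts, _ = np.histogram(data, bins=edges)    # n_i for each bin
    i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    return counts[i] / (n * delta)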
Bias – Variance tradeoff
Bias – how close is the mean of the estimate to the truth
Variance – how much does the estimate vary around its mean
• Choice of bin-width ∆ or #bins (# bins = 1/∆)
  • Small ∆, large #bins: "Small bias, Large variance" (p(x) approx. constant per bin)
  • Large ∆, small #bins: "Large bias, Small variance" (more data per bin, stable estimate)
Choice of #bins
Image src: Bishop book
[Figure: histogram estimates for several bin widths; # bins = 1/∆; for fixed n, as ∆ decreases, n_i decreases]
MSE = Bias² + Variance
Histogram as MLE
• Underlying model – the density is constant on each bin
• Parameters p_j : density in bin j
• Note: ∑_j (Δ ⋅ p_j) = 1, since the density must integrate to 1
• Maximize the likelihood of the data under the probability model with parameters p_j
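Sketch of the maximization (the standard Lagrange-multiplier argument, written out here for completeness): with n_j points falling in bin j,

$$\max_{\{p_j\}} \prod_{j} p_j^{\,n_j} \quad \text{s.t.} \quad \sum_j \Delta\, p_j = 1 \;\;\Longrightarrow\;\; \hat{p}_j = \frac{n_j}{n\,\Delta},$$

which is exactly the histogram estimate above.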
Kernel density estimate
• Histogram – blocky estimate
• Kernel density estimate aka "Parzen/moving window method"
[Plots: blocky histogram estimate vs. smooth kernel density estimate of the same data, x ∈ [−5, 5], density 0–0.35]
Kernel density estimate
• more generally
[Figure: window/kernel function with support [−1, 1]]
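In its standard (Parzen-window) form, with bandwidth h and kernel K:

$$\hat{p}(x) = \frac{1}{n\,h}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

The boxcar case takes K as the uniform window on [−1, 1]; a Gaussian K gives the "bumps" on the next slide.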
Kernel density estimation
Gaussian bumps (red) around six data points and their sum (blue)
• Place small "bumps" at each data point, determined by the kernel function. • The estimator consists of a (normalized) "sum of bumps”.
• Note that where the points are denser the density estimate will have higher values.
Img src: Wikipedia
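A minimal sketch of the "sum of Gaussian bumps" estimator (bandwidth h = 0.5 and the example data points are illustrative, not values from the figure):

import numpy as np

def gaussian_kde(data, x, h=0.5):
    """KDE with Gaussian bumps of bandwidth h centered at each data point (1-D sketch)."""
    data = np.asarray(data, dtype=float)[:, None]   # shape (n, 1)
    x = np.atleast_1d(x)[None, :]                   # shape (1, m)
    bumps = np.exp(-0.5 * ((x - data) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return bumps.mean(axis=0)                       # average of the n bumps at each query x

# e.g. six illustrative data points, as in the figure description:
# density = gaussian_kde([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2], x=np.linspace(-5, 10, 200))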
Kernels
Any kernel function K that satisfies K(u) ≥ 0 and ∫ K(u) du = 1 (and is typically symmetric about 0) can be used.
Kernels
• Finite support – only the local points are needed to compute the estimate
• Infinite support – all points are needed to compute the estimate, but such kernels (e.g., the Gaussian) are quite popular since the estimate is smoother
Choice of kernel bandwidth
"Bart-Simpson" density
[Plots: bandwidth too small, just right, too large]
Histograms vs. Kernel density estimation
The bin width ∆ (histogram) and the bandwidth h (KDE) play the same smoothing role.
How to pick Δ (bin width) or h (bandwidth)
• Fundamental question of how to pick the bin width:
• The fewer samples you have, the bigger you need the bin to be to avoid accidental variations in the density estimate.
• Actually, the fewer samples you have near x, the bigger the bin has to be around x.
• Obvious idea: make the bin size a function of the number of samples in the neighborhood of x.
k-NN (Nearest Neighbor) density estimation
• Histogram
• Kernel density est
Before: Fix ∆, estimate the number of points within ∆ of x (n_i or n_x) from the data
Now: Fix n_x = k, estimate ∆ from the data (the volume of the ball around x that contains k training pts)
• k-NN density est
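In symbols (standard forms):

$$\hat{p}(x) = \frac{n_x}{n\,\Delta} \;\;(\text{fix the window } \Delta), \qquad \hat{p}_{k\text{NN}}(x) = \frac{k}{n\,V_k(x)} \;\;(\text{fix } k),$$

where V_k(x) is the volume (in 1-D, the width) of the smallest ball around x containing k training points.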
k-NN density estimation
k acts as a smoother.
Not very popular for density estimation – expensive to compute, bad estimates. But a related version for classification is quite popular…
From Density Estimation to Classification
k-NN classifier
[Figure: training documents in feature space, labeled Sports, Science, Arts]
k-NN classifier
[Figure: the same training documents plus a test document]
k-NN classifier (k=5)
[Figure: test document with its 5 nearest training documents inside a ball of radius ∆_{k,x}]
What should we predict? … Average? Majority? Why?
k-NN classifier
• Optimal (Bayes) Classifier: predict argmax_y P(y) p(x|y)
• k-NN Classifier: plug in nonparametric estimates of P(y) and p(x|y), where
  n_y = # total training pts of class y
  k_y = # training pts of class y that lie within the ∆_k ball
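Plugging the k-NN density estimates into the optimal rule (the standard derivation):

$$\hat{P}(y) = \frac{n_y}{n}, \qquad \hat{p}(x \mid y) = \frac{k_y}{n_y\,V}, \qquad \hat{P}(y)\,\hat{p}(x\mid y) = \frac{k_y}{n\,V} \;\Longrightarrow\; \hat{y} = \arg\max_y k_y,$$

i.e., a majority vote among the k nearest neighbors.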
1-Nearest Neighbor (kNN) classifier
[Figure: test document with its single nearest neighbor]
2-Nearest Neighbor (kNN) classifier
[Figure: test document with its 2 nearest neighbors]
Even values of k are not used in practice (ties)
3-Nearest Neighbor (kNN) classifier
[Figure: test document with its 3 nearest neighbors]
5-Nearest Neighbor (kNN) classifier
[Figure: test document with its 5 nearest neighbors]
What is the best K?
Bias-variance tradeoff:
• Larger K => predicted label is more stable
• Smaller K => predicted label is more accurate
• Similar to density estimation
1-NN classifier – decision boundary
[Figure: Voronoi diagram of the training points]
k-NN classifier – decision boundary
• K acts as a smoother (Bias-variance tradeoff)
Case Study: kNN for Web Classification
• Dataset
  • 20 Newsgroups (20 classes)
  • Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
  • 61,118 words, 18,774 documents
  • Class label descriptions
Experimental Setup
• Training/Test Sets:
  • 50%-50% random split
  • 10 runs; report average results
• Evaluation Criteria: classification accuracy on the test set
Results: Binary Classes
[Plot: accuracy vs. k for three binary tasks – alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, comp.windows.x vs. rec.motorcycles]
From Classification to Regression
Temperature sensing
• What is the temperature in the room? → Average
• What is the temperature at location x? → "Local" Average
Kernel Regression
• Aka Local Regression
• Nadaraya-Watson Kernel Estimator: a weighted average of the training responses, where each training point is weighted based on its distance to the test point
• Boxcar kernel yields a local average
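The Nadaraya-Watson estimator in its standard form:

$$\hat{f}(x) = \sum_{i=1}^{n} w_i(x)\,Y_i, \qquad w_i(x) = \frac{K\!\left(\frac{x - X_i}{h}\right)}{\sum_{j=1}^{n} K\!\left(\frac{x - X_j}{h}\right)}$$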
Kernels
Choice of kernel bandwidth h
[Plots of the fit for h = 1, 10, 50, 200, labeled from "too small" through "just right" to "too large"]
Choice of kernel is not that important
Kernel Regression as Weighted Least Squares
Kernel regression corresponds to the locally constant estimator obtained from (locally) weighted least squares, i.e. set f(X_i) = β (a constant).
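Concretely (the standard statement of this equivalence):

$$\hat{f}(x) = \hat{\beta}(x) = \arg\min_{\beta}\; \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)(Y_i - \beta)^2 = \frac{\sum_{i} K\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i} K\!\left(\frac{x - X_i}{h}\right)},$$

which is the Nadaraya-Watson estimate above.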
Kernel Regression as Weighted Least Squares (cont.)
Notice that the weighted-least-squares solution with f(X_i) = β (a constant) recovers the kernel regression estimate at x.
Local Linear/Polynomial Regression
Local polynomial regression corresponds to the locally polynomial estimator obtained from (locally) weighted least squares, i.e. set f(X_i) to a local polynomial of degree p around X.
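The locally weighted objective in standard form (p = 1 gives local linear regression):

$$\min_{\beta_0,\dots,\beta_p}\; \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)\Bigl(Y_i - \beta_0 - \beta_1 (X_i - x) - \dots - \beta_p (X_i - x)^p\Bigr)^2, \qquad \hat{f}(x) = \hat{\beta}_0.$$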
When doesn't this work?
• Not enough points in a bin of reasonable size…
• When does that happen?
• When the dimension of the space gets too large.
• For example: with 10 bins per dimension, a d-dimensional space has 10^d bins, so even modest d leaves most bins empty.
Curse of dimensionality…
Bayesian Revisited and "Naïve Bayes"
• Given data X, the posterior probability of a hypothesis h, P(h|X), follows from Bayes' theorem
• MAP (maximum a posteriori) hypothesis
• Practical difficulty: requires estimation of the multi-dimensional density P(X|h) with limited data. (h is, for example, a class C_j)
$$P(h\mid X) = \frac{P(X\mid h)\,P(h)}{P(X)}$$

$$h_{MAP} \equiv \arg\max_{h \in H} P(h\mid X) = \arg\max_{h \in H} P(X\mid h)\,P(h)$$
Naïve Bayes Classifier and kernel densities
• A simplifying assumption: attributes are conditionally independent:
• Assumes that the densities of the elements of the vector are independent given the class.
• Many fewer parameters. E.g., for 5-dimensional Gaussians, 10 parameters instead of 20 (μ and Σ)
• And you can use kernel density estimation to learn each!
• Conventional wisdom: Naïve Bayes gives poor densities but good classification!
$$P(C_j \mid X) \propto P(C_j)\,\prod_{i=1}^{n} P(x_i \mid C_j)$$
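A minimal sketch of this idea (class and parameter names are my own, and the Gaussian-kernel bandwidth h is an arbitrary default, not a value from the course):

import numpy as np

class KDENaiveBayes:
    """Naive Bayes where each per-class, per-feature density P(x_i | C_j)
    is a 1-D Gaussian kernel density estimate."""
    def __init__(self, h=0.5):
        self.h = h

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}   # P(C_j)
        self.samples_ = {c: X[y == c] for c in self.classes_}        # training data per class
        return self

    def _log_kde(self, train_col, x_col):
        # log of a 1-D KDE: mean of Gaussian bumps centered at the training values
        z = (x_col[:, None] - train_col[None, :]) / self.h
        bumps = np.exp(-0.5 * z ** 2) / (self.h * np.sqrt(2 * np.pi))
        return np.log(bumps.mean(axis=1) + 1e-300)

    def predict(self, X):
        X = np.asarray(X, float)
        scores = []
        for c in self.classes_:
            log_post = np.log(self.priors_[c])
            # conditional independence: sum the per-feature log densities
            for d in range(X.shape[1]):
                log_post = log_post + self._log_kde(self.samples_[c][:, d], X[:, d])
            scores.append(log_post)
        return self.classes_[np.argmax(np.vstack(scores), axis=0)]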
Summary • Instance based/non-parametric approaches
Four things make a memory-based learner (a sketch combining them follows below):
1. A distance metric, dist(x, X_i): Euclidean (and many more)
2. How many nearby neighbors / what radius to look at: k, ∆/h
3. A weighting function (optional): W based on kernel K
4. How to fit with the local points: average, majority vote, weighted average, poly fit
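A minimal sketch assembling the four ingredients (Euclidean distance, k neighbors, Gaussian kernel weights, weighted vote or weighted average); all names and defaults are illustrative:

import numpy as np

def knn_predict(X_train, y_train, x, k=5, h=1.0, classify=True):
    """Memory-based prediction for a single query point x."""
    d = np.linalg.norm(X_train - x, axis=1)          # 1. distance metric
    nn = np.argsort(d)[:k]                           # 2. the k nearest neighbors
    w = np.exp(-0.5 * (d[nn] / h) ** 2)              # 3. kernel weighting
    if classify:                                     # 4a. weighted majority vote
        labels = np.unique(y_train[nn])
        return labels[np.argmax([w[y_train[nn] == c].sum() for c in labels])]
    return np.dot(w, y_train[nn]) / w.sum()          # 4b. weighted average (regression)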
Summary • Parametric vs Nonparametric approaches
Nonparametric models place very mild assumptions on the data distribution and provide good models for complex data
Parametric models rely on very strong (simplistic) distributional assumptions
Nonparametric models (other than histograms) require storing and computing with the entire data set.
Parametric models, once fitted, are much more efficient in terms of storage and computation.
Remaining slides not used yet • Taken from 600.325/425 Declarative Methods - J. Eisner
Intuition behind memory-based learning • Similar inputs map to similar outputs
• If not true, learning is impossible
• If true, learning reduces to defining "similar"
• Not all similarities created equal
• guess J. D. Salinger’s weight • who are the similar people? • similar occupation, age, diet, genes, climate, …
• guess J. D. Salinger’s IQ • similar occupation, writing style, fame, SAT score, …
• Superficial vs. deep similarities?
• B. F. Skinner and the behaviorism movement (what do brains actually do?)
parts of slide thanks to Rich Caruana
1-Nearest Neighbor • Define a distance d(x1,x2) between any 2 examples
• examples are feature vectors • so could just use Euclidean distance …
• Training: Index the training examples for fast lookup. • Test: Given a new x, find the closest x1 from training.
Classify x the same as x1 (positive or negative)
• Can learn complex decision boundaries
• As training size → ∞, the error rate is at most 2x the Bayes-optimal rate (i.e., the error rate you'd get from knowing the true model that generated the data – whatever it is!)
From Hastie, Tibshirani, Friedman 2001 p418
1-Nearest Neighbor – decision boundary
k-Nearest Neighbor
• Average of k points more reliable when: • noise in training vectors x • noise in training labels y • classes partially overlap
[Scatter plot: + and o classes in attribute_1 vs. attribute_2 space]
slide thanks to Rich Caruana (modified)
Instead of picking just the single nearest neighbor, pick the k nearest neighbors and have them vote
From Hastie, Tibshirani, Friedman 2001 p418
slide thanks to Rich Caruana (modified)
1 Nearest Neighbor – decision boundary
From Hastie, Tibshirani, Friedman 2001 p418
slide thanks to Rich Caruana (modified)
15 Nearest Neighbors – it’s smoother!
How to choose “k” • Odd k (often 1, 3, or 5):
• Avoids problem of breaking ties (in a binary classifier) • Large k:
• less sensitive to noise (particularly class noise) • better probability estimates for discrete classes • larger training sets allow larger values of k
• Small k: • captures fine structure of problem space better • may be necessary with small training sets
• Balance between large and small k • What does this remind you of?
• As training set approaches infinity, and k grows large, kNN becomes Bayes optimal
slide thanks to Rich Caruana (modified)
From Hastie, Tibshirani, Friedman 2001 p419
slide thanks to Rich Caruana (modified)
why?
Cross-Validation
• Models usually perform better on training data than on future test cases
• 1-NN is 100% accurate on training data!
• "Leave-one-out" cross validation:
  • "remove" each case one at a time
  • use it as a test case with the remaining cases as the training set
  • average performance over all test cases
• LOOCV is impractical with most learning methods, but extremely efficient with MBL! (see the sketch below)
slide thanks to Rich Caruana
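A simple O(n²) sketch of LOOCV for a k-NN classifier; because MBL has no training step, LOOCV just means voting among each point's k nearest neighbors while excluding the point itself (function name and structure are my own):

import numpy as np

def loocv_knn_accuracy(X, y, k=1):
    X, y = np.asarray(X, float), np.asarray(y)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    np.fill_diagonal(D, np.inf)                                # exclude the held-out point
    correct = 0
    for i in range(len(X)):
        nn = np.argsort(D[i])[:k]
        votes, counts = np.unique(y[nn], return_counts=True)
        correct += votes[np.argmax(counts)] == y[i]
    return correct / len(X)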
Distance-Weighted kNN • hard to pick large vs. small k
• may not even want k to be constant • use large k, but more emphasis on nearer neighbors?
We define relative weights for the k-NN, where $x_1, \dots, x_k$ are the k nearest neighbors of x and $y_1, \dots, y_k$ their labels:

$$w_i = \frac{1}{Dist(x_i, x)}, \quad \text{or maybe } \frac{1}{Dist(x_i, x)^{\beta}}, \quad \text{or often } \exp\bigl(-\beta \cdot Dist(x_i, x)\bigr)$$

$$prediction(x) = \frac{\sum_{i=1}^{k} w_i\, y_i}{\sum_{i=1}^{k} w_i}$$
Combining k-NN with other methods, #1 • Instead of having the k-NN simply vote, put them into a little
machine learner! • To classify x, train a “local” classifier on its k nearest neighbors
(maybe weighted). • polynomial, neural network, …
parts of slide thanks to Rich Caruana
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
[Scatter plot: + and o classes in attribute_1 vs. attribute_2 space]
parts of slide thanks to Rich Caruana
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
• Problem #1:
  • What if the input represents physical weight not in pounds but in milligrams?
  • Then small differences in the physical weight dimension have a huge effect on distances, overwhelming other features.
  • Should really correct for these arbitrary "scaling" issues.
  • One simple idea: rescale so that standard deviation = 1 (see the sketch after the figure).
[Scatter plots: weight (lb) vs. attribute_2, and weight (mg) vs. attribute_2 (bad): in the mg plot these o's are now "closer" to + than to each other]
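A minimal sketch of the rescaling idea (per-dimension standardization; variable and function names are mine):

import numpy as np

def standardize(X_train, X_test):
    """Rescale each feature so its standard deviation is 1 on the training set,
    then apply the same scaling to test points before computing distances."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma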
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
• Problem #2:
  • What if some dimensions are more correlated with the true label? (more relevant, or less noisy)
  • Stretch those dimensions out so that they are more important in determining distance.
  • One common technique is called "information gain."
parts of slide thanks to Rich Caruana
[Scatter plots: most relevant attribute vs. attribute_2, before and after stretching the more relevant dimension (good)]
Weighted Euclidean Distance
• large weight s_i: attribute i is more important
• small weight s_i: attribute i is less important
• zero weight s_i: attribute i doesn't matter
$$d(x, x') = \sum_{i=1}^{N} s_i \,(x_i - x'_i)^2$$
slide thanks to Rich Caruana (modified)
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
• Problem #3:
  • Do we really want to decide separately and theoretically how to scale each dimension?
  • Could simply pick dimension scaling factors to maximize performance on development data (maybe do leave-one-out).
  • Similarly, pick the number of neighbors k and how to weight them.
  • Especially useful if the performance measurement is complicated (e.g., 3 classes and differing misclassification costs).
[Scatter plot: + and o classes in attribute_1 vs. attribute_2 space]
should replot on log scale before measuring dist
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
• Problem #4:
  • Is it the original input dimensions that we want to scale?
  • What if the true clusters run diagonally? Or curve?
  • We can transform the data first by extracting a different, useful set of features from it:
    • Linear discriminant analysis
    • Hidden layer of a neural network
[Scatter plots: attribute_1 vs. attribute_2, and exp(attribute_1) vs. attribute_2]
i.e., redescribe the data by how a different type of learned classifier internally sees it
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
• Problem #5:
  • Do we really want to transform the data globally?
  • What if different regions of the data space behave differently?
  • Could find 300 "nearest" neighbors (using the global transform), then locally transform that subset of the data to redefine "near"
  • Maybe could use decision trees to split up the data space first
[Scatter plot: attribute_1 vs. attribute_2 with two regions showing opposite +/o structure; query points marked "?"]
Why are we doing all this preprocessing? • Shouldn’t the user figure out a smart way to transform the data
before giving it to k-NN?
• Sure, that’s always good, but what will the user try? • Probably a lot of the same things we’re discussing. • She’ll stare at the training data and try to figure out how to transform it
so that close neighbors tend to have the same label. • To be nice to her, we’re trying to automate the most common parts of
that process – like scaling the dimensions appropriately. • We may still miss patterns that her visual system or expertise can find.
So she may still want to transform the data. • On the other hand, we may find patterns that would be hard for her to
see.
Tangent: Decision Trees (a different simple method)
Is this Reuters article an Earnings Announcement? Split on the feature that reduces our uncertainty most:

all docs: 2301/7681 = 0.3
  contains "cents" ≥ 2 times: 1607/1704 = 0.943  ("yes" side)
    contains "versus" ≥ 2 times: 1398/1403 = 0.996
    contains "versus" < 2 times: 209/301 = 0.694
  contains "cents" < 2 times: 694/5977 = 0.116  ("no" side)
    contains "net" ≥ 1 time: 422/541 = 0.780
    contains "net" < 1 time: 272/5436 = 0.050

example thanks to Manning & Schütze
Booleans, Nominals, Ordinals, and Reals
• Consider attribute value differences: (x_i – x'_i): what does subtraction do?
  • Reals: easy! full continuum of differences
  • Integers: not bad: discrete set of differences
  • Ordinals: not bad: discrete set of differences
  • Booleans: awkward: Hamming distance 0 or 1
  • Nominals? not good! recode as Booleans?
slide thanks to Rich Caruana (modified)
"Curse of Dimensionality"
• Pictures on previous slides showed 2-dimensional data
• What happens with lots of dimensions?
• 10 training samples cover the space less & less well …
images thanks to Charles Annis
"Curse of Dimensionality"
• A deeper perspective on this: • Random points chosen in a high-dimensional space tend to all be pretty much
equidistant from one another!
• (Proof sketch: in 1000 dimensions, the squared distance between two random points is a sum of 1000 squared coordinate differences, i.e., 1000 times their sample average; since 1000 is large, this average is usually close to its true mean, so all pairwise distances are about the same.)
• So each test example is about equally close to most training examples. • We need a lot of training examples to expect one that is unusually close to the
test example.
images thanks to Charles Annis
"Curse of Dimensionality"
• Also, with lots of dimensions/attributes/features, the irrelevant ones may overwhelm the relevant ones:
• So the ideas from previous slides grow in importance:
  • feature weights (scaling)
  • feature selection (try to identify & discard irrelevant features)
    • but with lots of features, some irrelevant ones will probably accidentally look relevant on the training data
  • smooth by allowing more neighbors to vote (e.g., larger k)
$$d(x, x') = \sum_{i=1}^{\text{relevant}} (x_i - x'_i)^2 \;+\; \sum_{j=1}^{\text{irrelevant}} (x_j - x'_j)^2$$
slide thanks to Rich Caruana (modified)
Advantages of Memory-Based Methods
• Lazy learning: don't do any work until you know what you want to predict (and from what variables!)
  • never need to learn a global model
  • many simple local models taken together can represent a more complex global model
• Learns arbitrarily complicated decision boundaries
• Very efficient cross-validation
• Easy to explain to users how it works
  • … and why it made a particular decision!
• Can use any distance metric: string-edit distance, …
  • handles missing values, time-varying distributions, ...
slide thanks to Rich Caruana (modified)
Weaknesses of Memory-Based Methods
• Curse of Dimensionality
  • often works best with 25 or fewer dimensions
• Classification runtime scales with training set size
  • clever indexing may help (K-D trees? locality-sensitive hash?)
  • large training sets will not fit in memory
• Sometimes you wish NN stood for "neural net" instead of "nearest neighbor"
  • Simply averaging nearby training points isn't very subtle
  • Naive distance functions are overly respectful of the input encoding
• For regression (predict a number rather than a class), the extrapolated surface has discontinuities
slide thanks to Rich Caruana (modified)
Current Research in MBL
• Condensed representations to reduce memory requirements and speed up neighbor finding, to scale to 10^6–10^12 cases
• Learn better distance metrics • Feature selection • Overfitting, VC-dimension, ... • MBL in higher dimensions • MBL in non-numeric domains:
• Case-Based or Example-Based Reasoning • Reasoning by Analogy
slide thanks to Rich Caruana