CS 7616 Pattern Recognition – Spring 2014 (College of Computing, 2014-02-25)
Non-parametric: Kernel methods CS7616 Pattern Recognition – A. Bobick
Aaron Bobick, School of Interactive Computing
Non-parametric methods: Kernels and other Naïve Things
Administrivia
• Next PS will be same data, two more methods. Probably K-NN and maybe Naïve Bayes
Today brought to you by… • This lecture (and I’m guessing some more to come!) graciously
provided (with real email!) by Aarti Singh @ CMU
Parametric methods • Assume some functional form (Gaussian, logistic, linear,
quadratic) for • P(X|Y) and P(Y) as in Bayes (remember Y is now the label) • P(Y|X) as in logistic, linear and nonlinear regression
• Estimate parameters (μ, Σ, w, b) using MLE/MAP or through gradient ascent
• Pro – need few data points to learn parameters • Con – Strong distributional assumptions, not satisfied in
practice • Doesn’t really work very well (sometimes? Often?)
Example
(Figure: clusters labeled with digit classes 1, 2, 3, 4, 5, 7, 8, 9)
Hand-written digit images projected as points in a two-dimensional (nonlinear) feature space
Non-Parametric methods • Typically don’t make any distributional assumptions • As we have more data, we should be able to learn more
complex models • Let number of parameters scale with number of training data
• Today, we will see some nonparametric methods for
• Density estimation • Classification • Maybe Regression and Naïve Bayes
Histogram density estimate • Partition the feature space into distinct bins with widths Δ_i
and count the number of observations, n_i, in each bin.
• Often, the same width is used for all bins, Δ_i = Δ
• Δ acts as a smoothing parameter.
Image src: Bishop book
p̂(x) = ∑_i [n_i / (n Δ_i)] · 1{x ∈ Bin_i}
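A minimal Python sketch of this estimator (the function name, the bin handling, and the equal-width choice Δ_i = Δ are illustrative, not from the slides):

```python
import numpy as np

def histogram_density(data, x, n_bins=10):
    """Histogram density estimate: p_hat(x) = n_i / (n * width) for x's bin."""
    data = np.asarray(data, dtype=float)
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    width = edges[1] - edges[0]
    counts, _ = np.histogram(data, bins=edges)
    # Locate the bin containing x (clip so x == data.max() lands in the last bin).
    i = min(np.searchsorted(edges, x, side="right") - 1, n_bins - 1)
    if i < 0:
        return 0.0                      # x falls left of every bin
    return counts[i] / (len(data) * width)
```

Because the counts sum to n and each bin has width Δ, the estimate integrates to 1 by construction.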
Bias – Variance tradeoff
Bias – how close is the mean of the estimate to the truth
Variance – how much does the estimate vary around its mean
• Choice of bin-width ∆ or #bins
Small ∆, large #bins: “Small bias, Large variance” (p(x) approx. constant per bin)
Large ∆, small #bins: “Large bias, Small variance” (more data per bin, stable estimate)
# bins = 1/∆
Choice of #bins
Image src: Bishop book
# bins = 1/∆
fixed n: as ∆ decreases, n_i decreases
MSE = Bias² + Variance
Histogram as MLE • Underlying model – density is constant on each bin. Parameters p_j: density in bin j
Note that ∑_j (Δ · p_j) = 1, since the density must integrate to 1
• Maximize likelihood of the data under the probability model with parameters p_j
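The maximization goes through in one step with a Lagrange multiplier for the normalization constraint (a standard derivation, sketched here rather than reproduced from the slides):

```latex
% Likelihood of n i.i.d. points with n_j landing in bin j: \prod_j p_j^{n_j}
\mathcal{L} = \sum_j n_j \log p_j + \lambda \Big( 1 - \sum_j \Delta\, p_j \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial p_j} = \frac{n_j}{p_j} - \lambda \Delta = 0
\;\Longrightarrow\; p_j = \frac{n_j}{\lambda \Delta}.
% The constraint \sum_j \Delta p_j = 1 then forces \lambda = \sum_j n_j = n:
\hat{p}_j = \frac{n_j}{n\,\Delta}
```

which is exactly the histogram estimate above.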
• Histogram – blocky estimate
• Kernel density estimate aka “Parzen/moving window method”
Kernel density estimate
(Plots: density estimates of the same data over x ∈ [−5, 5], density axis 0 to 0.35)
Kernel density estimate • More generally, with kernel K and bandwidth h:
p̂(x) = (1/(n h)) ∑_{i=1}^{n} K((x − x_i) / h)
(boxcar kernel: K(u) = ½ · 1{|u| ≤ 1}, i.e., support [−1, 1])
Kernel density estimation
Gaussian bumps (red) around six data points and their sum (blue)
• Place small "bumps" at each data point, determined by the kernel function. • The estimator consists of a (normalized) "sum of bumps".
• Note that where the points are denser the density estimate will have higher values.
Img src: Wikipedia
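The bumps-and-sum construction can be sketched directly with a Gaussian kernel (function and parameter names are illustrative):

```python
import numpy as np

def kde(data, x, h=0.5):
    """Gaussian kernel density estimate: normalized sum of bumps at the data."""
    data = np.asarray(data, dtype=float)
    u = (x - data) / h                                 # scaled distance to each point
    bumps = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # one N(0,1) bump per point
    return bumps.sum() / (len(data) * h)               # normalized sum of bumps
```

Where the data points are denser, more bumps overlap and the estimate is higher, as noted above.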
Kernels
Any kernel function that satisfies K(u) ≥ 0 and ∫ K(u) du = 1 can be used
Kernels
Finite support – only need local points to compute the estimate
Infinite support – need all points to compute the estimate, but quite popular since smoother
Choice of kernel bandwidth
“Bart-Simpson” Density
(Panels: bandwidth too small – spiky; just right; too large – oversmoothed)
Histograms vs. Kernel density estimation
Bin width ∆ (histogram) and bandwidth h (KDE) play the same role: each acts as a smoother.
How to pick Δ (bin width) or h (bandwidth) • Fundamental question of how to pick the bin width: • The fewer samples you have, the bigger you need the bin to
be to avoid accidental variations in density estimate. • Actually, the fewer samples you have near x, the bigger the
bin has to be around x. • Obvious idea: make the bin size a function of the number of
samples in the neighborhood of x.
k-NN (Nearest Neighbor) density estimation
• Histogram
• Kernel density est
Before: Fix ∆, estimate number of points within ∆ of x (n_i or n_x) from data
Now: Fix n_x = k, estimate ∆ from data (volume of ball around x that contains k training pts)
• k-NN density est
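A 1-D sketch of the k-NN density estimate, p̂(x) = k / (n · volume), where the volume is that of the smallest ball around x holding k training points (function name is mine):

```python
import numpy as np

def knn_density_1d(data, x, k=5):
    """k-NN density estimate in 1-D: p_hat(x) = k / (n * volume of k-NN ball)."""
    data = np.asarray(data, dtype=float)
    r = np.sort(np.abs(data - x))[k - 1]   # distance to the k-th nearest neighbor
    volume = 2.0 * r                       # a 1-D "ball" is the interval [x-r, x+r]
    return k / (len(data) * volume)
```

Note the role reversal versus the histogram: the count k is fixed and the volume adapts to the local data.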
k-NN density estimation
k acts as a smoother.
Not very popular for density estimation – expensive to compute and gives poor estimates. But a related version for classification is quite popular…
From Density Estimation to Classification
k-NN classifier
Sports
Science
Arts
k-NN classifier
Sports
Science
Arts
Test document
k-NN classifier (k=5)
Sports
Science
Arts
Test document
What should we predict? … Average? Majority? Why?
∆_{k,x}: radius of the smallest ball around the test point containing its k nearest neighbors
k-NN classifier • Optimal Classifier: f*(x) = argmax_y P(x|y) P(y)
• k-NN Classifier: plug in the k-NN estimates P̂(x|y) = k_y / (n_y V) and P̂(y) = n_y / n, giving f̂(x) = argmax_y k_y
where n_y = # total training pts of class y, and k_y = # training pts of class y that lie within the ∆_k ball
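A sketch of the resulting rule, a majority vote (argmax_y k_y) over Euclidean distance; Euclidean is just the common default, as the summary slide later notes:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=5):
    """Predict the majority label among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]          # majority vote = argmax_y k_y
```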
1-Nearest Neighbor (kNN) classifier
Sports
Science
Arts
2-Nearest Neighbor (kNN) classifier
Sports
Science
Arts
Even k is not used in practice (ties are possible)
3-Nearest Neighbor (kNN) classifier
Sports
Science
Arts
5-Nearest Neighbor (kNN) classifier
Sports
Science
Arts
What is the best K?
Bias-variance tradeoff
Larger K ⇒ predicted label is more stable (lower variance)
Smaller K ⇒ predicted label tracks fine structure (lower bias)
Similar to density estimation
1-NN classifier – decision boundary
Voronoi Diagram
k-NN classifier – decision boundary
• K acts as a smoother (Bias-variance tradeoff)
Case Study: kNN for Web Classification • Dataset
• 20 News Groups (20 classes) • Download: http://people.csail.mit.edu/jrennie/20Newsgroups/ • 61,118 words, 18,774 documents • Class labels descriptions
Experimental Setup • Training/Test Sets:
• 50%-50% randomly split. • 10 runs • report average results
• Evaluation Criteria:
Results: Binary Classes
(Plots: accuracy vs. k for three pairs: alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, comp.windows.x vs. rec.motorcycles)
From Classification to Regression
Temperature sensing
• What is the temperature in the room? (Average)
• What is the temperature at location x? ("Local" average)
Kernel Regression
• Aka local regression • Nadaraya-Watson kernel estimator:
f̂(x) = ∑_i w_i(x) Y_i, where w_i(x) = K((x − X_i)/h) / ∑_j K((x − X_j)/h)
• Weight each training point based on distance to the test point
• Boxcar kernel yields a local average
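A Gaussian-kernel sketch of the Nadaraya-Watson estimator (swapping in a boxcar kernel would reduce it to the plain local average mentioned above; names are illustrative):

```python
import numpy as np

def nadaraya_watson(X, Y, x, h=1.0):
    """Kernel regression: weighted average of Y, weights from a Gaussian kernel."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    w = np.exp(-0.5 * ((x - X) / h) ** 2)   # kernel weight per training point
    return (w * Y).sum() / w.sum()          # locally weighted average
```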
Kernels
Choice of kernel bandwidth h
(Panels: h = 1, 10, 50, 200; too small gives a spiky fit, too large oversmooths, an intermediate h is just right)
Choice of kernel is not that important
Kernel Regression as Weighted Least Squares
Weighted least squares: min ∑_i K((x − X_i)/h) (Y_i − f(X_i))²
Kernel regression corresponds to the locally constant estimator obtained from (locally) weighted least squares, i.e. set f(X_i) = β (a constant)
Kernel Regression as Weighted Least Squares
Notice that setting f(X_i) = β (a constant) and solving the weighted least squares gives β̂ = ∑_i K((x − X_i)/h) Y_i / ∑_i K((x − X_i)/h), which is exactly the Nadaraya-Watson estimator.
Local Linear/Polynomial Regression
Weighted least squares: min ∑_i K((x − X_i)/h) (Y_i − f(X_i))²
Local polynomial regression corresponds to the locally polynomial estimator obtained from (locally) weighted least squares, i.e. set
f(X_i) = β₀ + β₁ (X_i − x) + … + β_p (X_i − x)^p (local polynomial of degree p around x)
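A sketch of the locally weighted least-squares fit of degree p; with p = 0 it reduces to the locally constant (kernel regression) case above. The Gaussian weight choice and names are illustrative:

```python
import numpy as np

def local_poly_predict(X, Y, x, h=1.0, p=1):
    """Local polynomial regression: weighted LS fit of degree p around x."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    w = np.exp(-0.5 * ((x - X) / h) ** 2)      # kernel weights centered at x
    A = np.vander(X - x, p + 1)                # columns (X-x)^p, ..., (X-x)^0
    sw = np.sqrt(w)                            # sqrt-weights fold WLS into OLS
    beta, *_ = np.linalg.lstsq(A * sw[:, None], Y * sw, rcond=None)
    return beta[-1]                            # polynomial value at X = x
```

For data that are exactly linear, the local linear fit (p = 1) recovers the true line regardless of the weights.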
When doesn’t this work? • Not enough points in a bin of reasonable size…
• When does that happen?
• When the dimension of the space gets too large.
• For example:
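One way to make the example concrete: in a unit hypercube, a bin spanning even 10% of every axis covers only 0.1^d of the volume, so its expected occupancy collapses as d grows (the numbers here are illustrative):

```python
def expected_bin_count(n, d, width=0.1):
    """Expected number of uniform points (out of n) landing in one bin of
    side `width` inside the unit hypercube [0, 1]^d."""
    return n * width ** d

# Even a generous bin is empty in high dimension:
for d in (1, 2, 10, 20):
    print(f"d={d}: expect {expected_bin_count(100_000, d):.3g} points per bin")
```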
Curse of dimensionality…
Bayesian Revisited and “Naïve Bayes” • Given data X, the posterior probability of a hypothesis h, P(h|X), follows Bayes’ theorem:
P(h|X) = P(X|h) P(h) / P(X)
• MAP (maximum a posteriori) hypothesis:
h_MAP ≡ argmax_{h∈H} P(h|X) = argmax_{h∈H} P(X|h) P(h)
• Practical difficulty: requires estimation of multi-dimensional densities P(X|h) with limited data. (h is, for example, class C_j)
Naïve Bayes Classifier and kernel densities • A simplifying assumption: attributes are conditionally independent:
P(C_j | X) ∝ P(C_j) ∏_{i=1}^{n} P(x_i | C_j)
• Assumes that the density of each element of a vector is independent given the class.
• Many fewer parameters. E.g. for 5-dimensional Gaussians, 10 parameters instead of 20 (μ and Σ)
• And you can use kernel density estimation to learn each!
• Conventional wisdom: Naïve Bayes gives poor densities but good classification!
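A sketch combining the two ideas as suggested: Naïve Bayes where each one-dimensional class-conditional P(x_i | C_j) is a kernel density estimate. The bandwidth and all names are mine, not from the slides:

```python
import numpy as np

def gauss_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def naive_bayes_kde_predict(X, y, x, h=0.5):
    """Naive Bayes with a 1-D KDE per attribute:
    argmax_c log P(c) + sum_i log p_hat(x_i | c)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    best, best_score = None, -np.inf
    for c in np.unique(y):
        Xc = X[y == c]                            # training rows of class c
        prior = len(Xc) / len(X)
        log_lik = sum(
            np.log(gauss_kernel((x[i] - Xc[:, i]) / h).sum() / (len(Xc) * h))
            for i in range(X.shape[1])            # one 1-D KDE per attribute
        )
        score = np.log(prior) + log_lik
        if score > best_score:
            best, best_score = c, score
    return best
```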
Summary • Instance based/non-parametric approaches
Four things make a memory based learner: 1. A distance metric, dist(x,Xi)
Euclidean (and many more) 2. How many nearby neighbors/radius to look at?
k, ∆/h 3. A weighting function (optional)
W based on kernel K 4. How to fit with the local points?
Average, Majority vote, Weighted average, Poly fit
Summary • Parametric vs Nonparametric approaches
Nonparametric models place very mild assumptions on the data distribution and provide good models for complex data
Parametric models rely on very strong (simplistic) distributional assumptions
Nonparametric models (not histograms) require storing and computing with the entire data set.
Parametric models, once fitted, are much more efficient in terms of storage and computation.
Remaining slides not used yet • Taken from 600.325/425 Declarative Methods - J. Eisner
Intuition behind memory-based learning • Similar inputs map to similar outputs
• If not true, learning is impossible • If true, learning reduces to defining “similar”
• Not all similarities created equal
• guess J. D. Salinger’s weight • who are the similar people? • similar occupation, age, diet, genes, climate, …
• guess J. D. Salinger’s IQ • similar occupation, writing style, fame, SAT score, …
• Superficial vs. deep similarities?
• B. F. Skinner and the behaviorism movement
parts of slide thanks to Rich Caruana
what do brains actually do?
1-Nearest Neighbor • Define a distance d(x1,x2) between any 2 examples
• examples are feature vectors • so could just use Euclidean distance …
• Training: Index the training examples for fast lookup. • Test: Given a new x, find the closest x1 from training.
Classify x the same as x1 (positive or negative)
• Can learn complex decision boundaries
• As training size → ∞, the error rate is at most 2× the Bayes-optimal rate (i.e., the error rate you’d get from knowing the true model that generated the data – whatever it is!)
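A brute-force sketch of the train/test recipe above; in practice the “index” would be a k-d tree or similar for fast lookup, but plain distances keep the sketch short:

```python
import numpy as np

class OneNN:
    """1-NN classifier: 'training' just stores (indexes) the examples."""
    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = list(y)
        return self

    def predict(self, x):
        d = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        return self.y[int(np.argmin(d))]   # label of the closest training point
```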
From Hastie, Tibshirani, Friedman 2001 p418
1-Nearest Neighbor – decision boundary
k-Nearest Neighbor
• Average of k points more reliable when: • noise in training vectors x • noise in training labels y • classes partially overlap
(Figure: 2-D training data on axes attribute_1 vs. attribute_2, ‘+’ and ‘o’ classes partially overlapping)
slide thanks to Rich Caruana (modified)
Instead of picking just the single nearest neighbor, pick the k nearest neighbors and have them vote
From Hastie, Tibshirani, Friedman 2001 p418
slide thanks to Rich Caruana (modified)
1 Nearest Neighbor – decision boundary
From Hastie, Tibshirani, Friedman 2001 p418
slide thanks to Rich Caruana (modified)
15 Nearest Neighbors – it’s smoother!
How to choose “k” • Odd k (often 1, 3, or 5):
• Avoids problem of breaking ties (in a binary classifier) • Large k:
• less sensitive to noise (particularly class noise) • better probability estimates for discrete classes • larger training sets allow larger values of k
• Small k: • captures fine structure of problem space better • may be necessary with small training sets
• Balance between large and small k • What does this remind you of?
• As training set approaches infinity, and k grows large, kNN becomes Bayes optimal
slide thanks to Rich Caruana (modified)
From Hastie, Tibshirani, Friedman 2001 p419
slide thanks to Rich Caruana (modified)
why?
Cross-Validation
• Models usually perform better on training data than on future test cases
  • 1-NN is 100% accurate on training data!
• “Leave-one-out” cross-validation (LOOCV):
  • “remove” each case one at a time
  • use it as the test case, with the remaining cases as the training set
  • average performance over all test cases
• LOOCV is impractical with most learning methods, but extremely efficient with memory-based learning (MBL)!
slide thanks to Rich Caruana
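LOOCV with a memory-based learner really is cheap: there is nothing to retrain, each fold just excludes the held-out point from the neighbor search. A stdlib-only sketch for 1-NN (the function name is mine):

```python
import math

def loocv_accuracy_1nn(data):
    """Leave-one-out cross-validation for 1-NN.

    `data` is a list of (feature_vector, label) pairs.  Each case is held
    out once, classified by its single nearest neighbor among the rest,
    and the fraction classified correctly is returned.
    """
    correct = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]                       # "remove" one case
        nearest = min(rest, key=lambda p: math.dist(p[0], x))
        correct += (nearest[1] == y)
    return correct / len(data)

data = [((0, 0), "+"), ((1, 0), "+"), ((0, 1), "+"),
        ((5, 5), "o"), ((6, 5), "o"), ((5, 6), "o")]
print(loocv_accuracy_1nn(data))
```

Note the contrast with training-set accuracy: 1-NN is always 100% on its own training data, but LOOCV gives an honest estimate.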
Distance-Weighted kNN
• hard to pick large vs. small k
  • may not even want k to be constant
• use large k, but put more emphasis on nearer neighbors?
prediction(x) = ( Σ_{i=1..k} w_i · y_i ) / ( Σ_{i=1..k} w_i )

where x_1, …, x_k are the k-NN and y_1, …, y_k their labels.

We define the relative weights for the k-NN:

w_i = 1 / Dist(x_i, x), or often 1 / Dist(x_i, x)², or maybe exp(−β · Dist(x_i, x))
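A sketch of the weighted prediction above, here for real-valued labels; the `eps` guard against division by zero when a neighbor coincides with the query is my addition, not part of the slide:

```python
import math

def weighted_knn_predict(train, query, k=5, eps=1e-9):
    """Distance-weighted k-NN prediction with w_i = 1 / Dist(x_i, x).

    Returns sum(w_i * y_i) / sum(w_i) over the k nearest neighbors,
    so nearer neighbors dominate even when k is large.
    """
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    weights = [1.0 / (math.dist(x, query) + eps) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# 1-D toy data with label y = x; a query at 0.9 should predict about 0.9.
train = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((3.0,), 3.0)]
print(weighted_knn_predict(train, (0.9,), k=2))
```

Swapping the weight function for 1/Dist² or exp(−β·Dist) only changes the `weights` line.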
Combining k-NN with other methods, #1
• Instead of having the k-NN simply vote, put them into a little machine learner!
• To classify x, train a “local” classifier on its k nearest neighbors (maybe weighted).
  • polynomial, neural network, …
parts of slide thanks to Rich Caruana
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
[Figure: + and o classes scattered over attribute_1 vs. attribute_2]
parts of slide thanks to Rich Caruana
These o’s are now “closer” to + than to each other
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
• Problem #1:
  • What if the input represents physical weight not in pounds but in milligrams?
  • Then small differences in the physical-weight dimension have a huge effect on distances, overwhelming the other features.
  • Should really correct for these arbitrary “scaling” issues.
  • One simple idea: rescale each dimension so that its standard deviation = 1.
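The rescaling idea can be sketched as follows; `rescale_unit_std` is a hypothetical helper of mine, not something from the slides. Dividing each dimension by its standard deviation means no arbitrary unit choice (pounds vs. milligrams) can dominate Euclidean distance:

```python
import statistics

def rescale_unit_std(vectors):
    """Rescale each dimension of the data so its standard deviation is 1.

    `vectors` is a list of equal-length numeric tuples.  Constant columns
    (std = 0) are left unscaled to avoid division by zero.
    """
    columns = list(zip(*vectors))                     # column-wise view
    sds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [tuple(v / s for v, s in zip(vec, sds)) for vec in vectors]

# Weight recorded in milligrams dwarfs the second attribute until rescaled.
raw = [(100000.0, 1.0), (120000.0, 2.0), (110000.0, 3.0)]
scaled = rescale_unit_std(raw)
print(scaled)
```

After rescaling, every column has unit standard deviation, so each attribute contributes on the same scale.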
[Figures: weight (lb) vs. attribute_2 — the two classes separate cleanly; weight (mg) vs. attribute_2 — the weight axis dominates the distances (“bad”)]
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
• Problem #2:
  • What if some dimensions are more correlated with the true label?
    • (more relevant, or less noisy)
  • Stretch those dimensions out so that they are more important in determining distance.
  • One common way to choose the stretch factors is based on “information gain.”
parts of slide thanks to Rich Caruana
[Figures: most relevant attribute vs. attribute_2, before and after stretching the relevant dimension (“good”)]
Weighted Euclidean Distance
• large weight s_i: attribute i is more important
• small weight s_i: attribute i is less important
• zero weight s_i: attribute i doesn’t matter
d(x, x') = Σ_{i=1..N} s_i · (x_i − x'_i)²
slide thanks to Rich Caruana (modified)
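The weighted distance on this slide translates directly to code. Note the formula is a squared distance with no square root, which does not change neighbor rankings; the function name is mine:

```python
def weighted_sq_dist(x, xp, s):
    """Weighted squared Euclidean distance: sum_i s_i * (x_i - x'_i)**2.

    Large s_i makes attribute i more important; s_i = 0 ignores it.
    """
    return sum(si * (xi - xpi) ** 2 for si, xi, xpi in zip(s, x, xp))

# With s = (1, 0) the second attribute doesn't matter at all.
print(weighted_sq_dist((1.0, 2.0), (4.0, 6.0), (1.0, 0.0)))
```

Plugging this in as the k-NN distance function is all it takes to use learned or hand-set attribute weights.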
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
• Problem #3:
  • Do we really want to decide separately and theoretically how to scale each dimension?
  • Could simply pick dimension scaling factors to maximize performance on development data (maybe do leave-one-out)
  • Similarly, pick the number of neighbors k and how to weight them.
  • Especially useful if the performance measure is complicated (e.g., 3 classes and differing misclassification costs).
[Figure: + and o classes over attribute_1 vs. attribute_2]
should replot on log scale before measuring dist
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
• Problem #4:
  • Is it the original input dimensions that we want to scale?
  • What if the true clusters run diagonally? Or curve?
  • We can transform the data first by extracting a different, useful set of features from it:
    • linear discriminant analysis
    • hidden layer of a neural network
[Figures: attribute_1 vs. attribute_2, and the same data replotted as exp(attribute_1) vs. attribute_2]
i.e., redescribe the data by how a different type of learned classifier internally sees it
Now back to that distance function
• Euclidean distance treats all of the input dimensions as equally important
• Problem #5:
  • Do we really want to transform the data globally?
  • What if different regions of the data space behave differently?
  • Could find 300 “nearest” neighbors (using the global transform), then locally transform that subset of the data to redefine “near”
  • Maybe could use decision trees to split up the data space first
[Figure: two separate regions of + and o clusters over attribute_1 vs. attribute_2, with query points marked “?”]
Why are we doing all this preprocessing?
• Shouldn’t the user figure out a smart way to transform the data before giving it to k-NN?
• Sure, that’s always good, but what will the user try?
  • Probably a lot of the same things we’re discussing.
  • She’ll stare at the training data and try to figure out how to transform it so that close neighbors tend to have the same label.
  • To be nice to her, we’re trying to automate the most common parts of that process – like scaling the dimensions appropriately.
  • We may still miss patterns that her visual system or expertise can find, so she may still want to transform the data.
  • On the other hand, we may find patterns that would be hard for her to see.
Tangent: Decision Trees (a different simple method)
• Is this Reuters article an Earnings Announcement? (2301/7681 = 0.3 of all docs)
• Split on the feature that reduces our uncertainty most:
  • contains “cents” ≥ 2 times (1607/1704 = 0.943):
    • contains “net” ≥ 1 time: 1398/1403 = 0.996 → “yes”
    • contains “net” < 1 time: 209/301 = 0.694
  • contains “cents” < 2 times (694/5977 = 0.116):
    • contains “versus” ≥ 2 times: 422/541 = 0.780
    • contains “versus” < 2 times: 272/5436 = 0.050 → “no”

example thanks to Manning & Schütze
Booleans, Nominals, Ordinals, and Reals
• Consider attribute value differences (x_i – x'_i): what does subtraction do?
  • Reals: easy! full continuum of differences
  • Integers: not bad: discrete set of differences
  • Ordinals: not bad: discrete set of differences
  • Booleans: awkward: Hamming distance of 0 or 1
  • Nominals? not good! recode as Booleans?
slide thanks to Rich Caruana (modified)
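Recoding a nominal as Booleans is one-hot encoding: one 0/1 indicator per category, so subtraction becomes a 0-or-1 Hamming-style comparison. A tiny sketch (the function name is mine):

```python
def one_hot(value, categories):
    """Recode a nominal attribute as a tuple of Boolean indicators,
    one per category, with exactly one indicator set."""
    return tuple(1 if value == c else 0 for c in categories)

colors = ("red", "green", "blue")
print(one_hot("green", colors))
```

After this recoding, two equal nominal values differ by 0 in every indicator, and two unequal values differ by 1 in exactly two indicators.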
“Curse of Dimensionality”
• Pictures on previous slides showed 2-dimensional data
• What happens with lots of dimensions?
• 10 training samples cover the space less & less well …
images thanks to Charles Annis
“Curse of Dimensionality”
• A deeper perspective on this:
  • Random points chosen in a high-dimensional space tend to all be pretty much equidistant from one another!
  • (Why: in 1000 dimensions, the squared distance between two random points is a sum of 1000 squared coordinate differences – in effect, a sample variance of those differences. Since 1000 is large, this sample variance is usually close to the true variance, so nearly every pair of points ends up at nearly the same distance.)
• So each test example is about equally close to most training examples.
• We need a lot of training examples to expect one that is unusually close to the test example.
images thanks to Charles Annis
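The equidistance claim is easy to check by simulation. This is my sketch, not from the slides; the spread statistic (standard deviation divided by mean of the pairwise distances) is one of several reasonable choices:

```python
import math
import random

def distance_spread(dim, n_points=50, seed=0):
    """Relative spread (std/mean) of pairwise distances among random
    points in the unit cube [0,1]^dim.  As dim grows the spread shrinks:
    in high dimension all points look roughly equidistant."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return math.sqrt(var) / mean

print(distance_spread(2), distance_spread(1000))
```

In 2 dimensions the nearest pair is far closer than the farthest; in 1000 dimensions the spread collapses to a few percent of the mean distance.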
“Curse of Dimensionality”
• Also, with lots of dimensions/attributes/features, the irrelevant ones may overwhelm the relevant ones:
• So the ideas from previous slides grow in importance:
  • feature weights (scaling)
  • feature selection (try to identify & discard irrelevant features)
    • but with lots of features, some irrelevant ones will probably accidentally look relevant on the training data
  • smooth by allowing more neighbors to vote (e.g., larger k)
d(x, x') = Σ_{i=1..relevant} (x_i − x'_i)² + Σ_{j=1..irrelevant} (x_j − x'_j)²
slide thanks to Rich Caruana (modified)
Advantages of Memory-Based Methods
• Lazy learning: don’t do any work until you know what you want to predict (and from what variables!)
  • never need to learn a global model
  • many simple local models taken together can represent a more complex global model
• Learns arbitrarily complicated decision boundaries
• Very efficient cross-validation
• Easy to explain to users how it works
  • … and why it made a particular decision!
• Can use any distance metric: string-edit distance, …
  • handles missing values, time-varying distributions, ...
slide thanks to Rich Caruana (modified)
Weaknesses of Memory-Based Methods
• Curse of Dimensionality
  • often works best with 25 or fewer dimensions
• Classification runtime scales with training set size
  • clever indexing may help (k-d trees? locality-sensitive hashing?)
  • large training sets will not fit in memory
• Sometimes you wish NN stood for “neural net” instead of “nearest neighbor”
  • simply averaging nearby training points isn’t very subtle
  • naive distance functions are overly respectful of the input encoding
• For regression (predicting a number rather than a class), the extrapolated surface has discontinuities
slide thanks to Rich Caruana (modified)
Current Research in MBL
• Condensed representations to reduce memory requirements and speed up neighbor finding, to scale to 10^6–10^12 cases
• Learn better distance metrics
• Feature selection
• Overfitting, VC-dimension, ...
• MBL in higher dimensions
• MBL in non-numeric domains:
  • Case-Based or Example-Based Reasoning
  • Reasoning by Analogy
slide thanks to Rich Caruana