Geoff Gordon, Carnegie Mellon School of Computer Science
ggordon@cs.cmu.edu
June 15, 2004

TRANSCRIPT
Support vector machines
The SVM is a machine learning algorithm which
• solves classification problems
• uses a flexible representation of the class boundaries
• implements automatic complexity control to reduce overfitting
• has a single global minimum which can be found in polynomial time
It is popular because
• it can be easy to use
• it often has good generalization performance
• the same algorithm solves a variety of problems with little tuning
SVM concepts
Perceptrons
Convex programming and duality
Using maximum margin to control complexity
Representing nonlinear boundaries with feature expansion
The “kernel trick” for efficient optimization
Outline
• Classification problems
• Perceptrons and convex programs
• From perceptrons to SVMs
• Advanced topics
Classification example—Fisher’s irises
[Scatter plot: petal measurements for the three classes setosa, versicolor, and virginica]
Iris data
Three species of iris
Measurements of petal length, width
Iris setosa is linearly separable from I. versicolor and I. virginica
Example—Boston housing data
[Scatter plot: Industry & Pollution (x-axis) vs. % Middle & Upper Class (y-axis)]
A good linear classifier
[Scatter plot with a linear separator: Industry & Pollution (x-axis) vs. % Middle & Upper Class (y-axis)]
Example—NIST digits
[Six sample digit images]
28 × 28 = 784 features in [0,1]
Class means
[Mean images of the two classes]
A linear separator
Sometimes a nonlinear classifier is better
Sometimes a nonlinear classifier is better
Classification problem
[Figure: data matrix X paired with label vector y, illustrated with the housing scatter plot and the nonlinear 2-D example]
Data points X = [x1;x2;x3; . . .] with xi ∈ Rn
Labels y = [y1; y2; y3; . . .] with yi ∈ {−1,1}
Solution is a subset of Rn, the “classifier”
Often represented as a test f(x, learnable parameters) ≥ 0
Define: decision surface, linear separator, linearly separable
What is the goal?
Classify new data with fewest possible mistakes
Proxy: minimize some function on training data
min_w ∑_i l(yi f(xi; w)) + l0(w)

That's l(f(x)) for +ve examples, l(−f(x)) for -ve
[Plots: piecewise-linear loss (left) and logistic loss (right), as functions of the margin]
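The two surrogate losses can be sketched in numpy. This assumes the piecewise-linear loss has the hinge form max(0, 1 − z); the slides do not pin down the exact formula, so treat this as one plausible choice:

```python
import numpy as np

def piecewise_linear_loss(z):
    # hinge-style surrogate: zero once the margin z = y*f(x) reaches 1
    return np.maximum(0.0, 1.0 - z)

def logistic_loss(z):
    # smooth surrogate log(1 + exp(-z)), computed stably via logaddexp
    return np.logaddexp(0.0, -z)

z = np.linspace(-3.0, 3.0, 7)
pl, lg = piecewise_linear_loss(z), logistic_loss(z)
```

Both functions are convex in z and upper-bound the 0/1 mistake indicator, which is what makes them usable proxies.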
Getting fancy
Text? Hyperlinks? Relational database records?
• difficult to featurize with a reasonable number of features
• but what if we could handle large or infinite feature sets?
Outline
• Classification problems
• Perceptrons and convex programs
• From perceptrons to SVMs
• Advanced topics
Perceptrons
Weight vector w, bias c
Classification rule: sign(f(x)) where f(x) = x · w + c
Penalty for mispredicting: l(yf(x)) = [yf(x)]−
This penalty is convex in w, so all minima are global
Note: unit-length x vectors
Training perceptrons
Perceptron learning rule: on mistake,
w += yx
c += y
That is, gradient descent on l(yf(x)), since
∇w [y(x · w + c)]− = −yx if y(x · w + c) ≤ 0, and 0 otherwise
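The update rule above can be sketched as a small numpy loop. The toy data and the `train_perceptron` name are illustrative, not from the slides:

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Mistake-driven perceptron: on each misclassified example, w += y*x, c += y."""
    n = X.shape[1]
    w, c = np.zeros(n), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + c) <= 0:   # mistake (or exactly on the boundary)
                w += yi * xi
                c += yi
                mistakes += 1
        if mistakes == 0:                # converged: all examples correct
            break
    return w, c

# toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, c = train_perceptron(X, y)
preds = np.sign(X @ w + c)
```

On separable data like this, the loop provably stops after finitely many mistakes.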
Perceptron demo
Perceptrons as linear inequalities
Linear inequalities (for separable case):
y(x · w + c) > 0
That’s
x · w + c > 0 for positive examples
x · w + c < 0 for negative examples
i.e., ensure correct classification of training examples
Note > not ≥
Version space
x · w + c = 0
As a fn of x: hyperplane w/ normal w at distance c/‖w‖ from origin
As a fn of w: hyperplane w/ normal x at distance c/‖x‖ from origin
Convex programs
Convex program:
min f(x) subject to
gi(x) ≤ 0 i = 1 . . . m
where f and gi are convex functions
Perceptron is almost a convex program (> vs. ≥)
Trick: write
y(x · w + c) ≥ 1
Slack variables
If not linearly separable, add slack variable s ≥ 0
y(x · w + c) + s ≥ 1
Then ∑_i si is the total amount by which the constraints are violated

So try to make ∑_i si as small as possible
Perceptron as convex program
The final convex program for the perceptron is:
min_{s,w,c} ∑_i si subject to
(yi xi) · w + yi c + si ≥ 1
si ≥ 0
We will try to understand this program using convex duality
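As a sanity check, this linear program can be handed to an off-the-shelf solver. The sketch below uses `scipy.optimize.linprog` on made-up separable data; the variable stacking [w, c, s] is my own choice, not from the slides:

```python
import numpy as np
from scipy.optimize import linprog

# toy separable data; z_i = y_i * x_i
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Z = y[:, None] * X
m, n = X.shape

# variables stacked as [w (n), c (1), s (m)]; objective is sum(s)
obj = np.concatenate([np.zeros(n + 1), np.ones(m)])
# constraint Zw + yc + s >= 1, written as -(Zw + yc + s) <= -1
A_ub = -np.hstack([Z, y[:, None], np.eye(m)])
b_ub = -np.ones(m)
bounds = [(None, None)] * (n + 1) + [(0, None)] * m   # w, c free; s >= 0
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
```

Because this toy data is linearly separable, the optimal total slack `res.fun` comes out zero.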
Duality
To every convex program corresponds a dual
Solving original (primal) is equivalent to solving dual
Dual often provides insight
Can derive dual by using Lagrange multipliers to eliminate constraints
Lagrange Multipliers
Way to phrase constrained optimization problem as a game
max_x f(x) subject to g(x) ≥ 0
(assume f, g are convex downward)
max_x min_{a≥0} f(x) + a g(x)

If x plays g(x) < 0, then a wins: playing big numbers makes the payoff approach −∞

If x plays g(x) ≥ 0, then a must play 0
Lagrange Multipliers: the picture
Lagrange Multipliers: the caption
Problem: maximize
f(x, y) = 6x + 8y
subject to
g(x, y) = x2 + y2 − 1 ≥ 0
Using a Lagrange multiplier a,
max_{x,y} min_{a≥0} f(x, y) + a g(x, y)

At optimum,

0 = ∇f(x, y) + a ∇g(x, y) = (6, 8) + 2a (x, y)
Duality for the perceptron
Notation: zi = yixi and Z = [z1; z2; . . .], so that:
min_{s,w,c} ∑_i si subject to
Zw + cy + s ≥ 1
s ≥ 0

Using a Lagrange multiplier vector a ≥ 0,

min_{s,w,c} max_a ∑_i si − aT(Zw + cy + s − 1)
subject to s ≥ 0, a ≥ 0
Duality cont’d
From last slide:
min_{s,w,c} max_a ∑_i si − aT(Zw + cy + s − 1)
subject to s ≥ 0, a ≥ 0
Minimize wrt w, c explicitly by setting gradient to 0:
0 = aTZ
0 = aTy
Minimizing wrt s yields inequality:
0 ≤ 1 − a
Duality cont’d
Final form of dual program for perceptron:
max_a 1Ta subject to
0 = aTZ
0 = aTy
0 ≤ a ≤ 1
Problems with perceptrons
Vulnerable to overfitting when many input features
Not very expressive (XOR)
Outline
• Classification problems
• Perceptrons and convex programs
• From perceptrons to SVMs
• Advanced topics
Modernizing the perceptron
Three extensions:
• Margins
• Feature expansion
• Kernel trick
Result is called a Support Vector Machine (reason given below)
Margins
Margin is the signed ⊥ distance from an example to the decision boundary
+ve margin points are correctly classified, -ve margin means error
SVMs are maximum margin
Maximize minimum distance from data to separator
“Ball center” of version space (caveats)
Other centers: analytic center, center of mass, Bayes point
Note: if not linearly separable, must trade margin vs. errors
Why do margins help?
If our hypothesis is near the boundary of decision space, we don't necessarily learn much from our mistakes
If we're far away from any boundary, a mistake has to eliminate a large volume from version space
Why margins help, explanation 2
Occam’s razor: “simple” classifiers are likely to do better in practice
Why? There are fewer simple classifiers than complicated ones, so we are less likely to be able to fool ourselves by finding a really good fit by accident.
What does “simple” mean? Anything, as long as you tell me before you see the data.
Explanation 2 cont’d
“Simple” can mean:
• Low-dimensional
• Large margin
• Short description length
For this lecture we are interested in large margins and compact descriptions
By contrast, many classical complexity control methods (AIC, BIC) rely on low dimensionality alone
Why margins help, explanation 3
Margin loss is an upper bound on number of mistakes
Why margins help, explanation 4
Optimizing the margin
Most common method: convex quadratic program
Efficient algorithms exist (essentially the same as some interior point LP algorithms)
Because QP is strictly∗ convex, unique global optimum
Next few slides derive the QP. Notation:
• Assume w.l.o.g. ‖xi‖2 = 1
• Ignore slack variables for now (i.e., assume linearly separable)
∗if you ignore the intercept term
Optimizing the margin, cont’d
Margin M is the ⊥ distance to the decision surface: for a positive example,
(x − Mw/‖w‖) · w + c = 0
x · w + c = Mw · w/‖w‖ = M‖w‖
SVM maximizes M > 0 such that all margins are ≥ M :
max_{M>0,w,c} M subject to
(yi xi) · w + yi c ≥ M‖w‖
Notation: zi = yixi and Z = [z1; z2; . . .], so that:
Zw + yc ≥ M‖w‖
Note λw, λc is a solution if w, c is
Optimizing the margin, cont’d
Divide by M‖w‖ to get (Zw + yc)/(M‖w‖) ≥ 1

Define v = w/(M‖w‖) and d = c/(M‖w‖), so that ‖v‖ = ‖w‖/(M‖w‖) = 1/M

max_{v,d} 1/‖v‖ subject to
Zv + yd ≥ 1

Maximizing 1/‖v‖ is minimizing ‖v‖, which is minimizing ‖v‖²

min_{v,d} ‖v‖² subject to
Zv + yd ≥ 1

Add slack variables to handle the non-separable case:

min_{s≥0,v,d} ‖v‖² + C ∑_i si subject to
Zv + yd + s ≥ 1
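At the optimum each slack equals the hinge loss max(0, 1 − yi(xi · v + d)), so the program can be attacked directly with subgradient descent on the equivalent unconstrained objective. A rough numpy sketch; step size, epoch count, and the toy data are arbitrary choices, and a real solver would use the QP form:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on ||v||^2 + C * sum_i max(0, 1 - y_i (x_i . v + d))."""
    n = X.shape[1]
    v, d = np.zeros(n), 0.0
    for _ in range(epochs):
        margins = y * (X @ v + d)
        active = margins < 1            # examples with nonzero slack
        gv = 2 * v - C * (y[active, None] * X[active]).sum(axis=0)
        gd = -C * y[active].sum()
        v -= lr * gv
        d -= lr * gd
    return v, d

X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-2.5, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
v, d = train_linear_svm(X, y)
preds = np.sign(X @ v + d)
```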
Modernizing the perceptron
Three extensions:
• Margins
• Feature expansion
• Kernel trick
Feature expansion
Given an example x = [a b ...]

Could add new features like a², ab, a⁷b³, sin(b), ...

Same optimization as before, but with longer x vectors and so a longer w vector

Classifier: “is 3a + 2b + a² + 3ab − a⁷b³ + 4 sin(b) ≥ 2.6?”

This classifier is nonlinear in the original features, but linear in the expanded feature space

We have replaced x by φ(x) for some nonlinear φ, so the decision boundary is the nonlinear surface w · φ(x) + c = 0
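A concrete instance: with the expansion φ([a, b]) = [a, b, ab], XOR-style data that no line can split becomes linearly separable. The data and weights below are hand-picked for illustration:

```python
import numpy as np

def expand(x):
    """phi([a, b]) = [a, b, a*b] -- one illustrative choice of nonlinear features."""
    a, b = x
    return np.array([a, b, a * b])

# XOR-style data: not linearly separable in (a, b)...
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# ...but separable after expansion: the weight vector (0, 0, 1) with c = 0 works
Phi = np.array([expand(x) for x in X])
w, c = np.array([0.0, 0.0, 1.0]), 0.0
preds = np.sign(Phi @ w + c)
```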
Feature expansion example
[Plots: data in the original features (x, y) on the left; the same data lifted to (x, y, xy) on the right]
Some popular feature sets
Polynomials of degree k
1, a, a², b, b², ab

Neural nets (sigmoids)
tanh(3a + 2b − 1), tanh(5a − 4), . . .

RBFs of radius σ
exp(−(1/σ²)((a − a₀)² + (b − b₀)²))
Feature expansion problems
Feature expansion techniques yield lots of features
E.g. a polynomial of degree k on n original features yields O(n^k) expanded features
E.g. RBFs yield infinitely many expanded features!
Inefficient (for i = 1 to infinity do ...)
Overfitting (VC-dimension argument)
How to fix feature expansion
We have already shown we can handle the overfitting problem: even if we have lots of parameters, large margins keep classifiers simple
“All” that’s left is efficiency
Solution: kernel trick
Modernizing the perceptron
Three extensions:
• Margins
• Feature expansion
• Kernel trick
Kernel trick
Way to make optimization efficient when there are lots of features
Compute one Lagrange multiplier per training example instead of one weight per feature (part I)
Use a kernel function to avoid ever representing w (part II)
Will mean we can handle infinitely many features!
Kernel trick, part I
min_{w,c} ‖w‖²/2 subject to Zw + yc ≥ 1

min_{w,c} max_{a≥0} wTw/2 + a · (1 − Zw − yc)

Minimize wrt w, c by setting derivatives to 0:

0 = w − ZTa,   0 = a · y

Substitute back in for w, c:

max_{a≥0} a · 1 − aT Z ZT a/2 subject to a · y = 0

Note: to allow slacks, add an upper bound a ≤ C
What did we just do?
max_{0≤a≤C} a · 1 − aT Z ZT a/2 subject to a · y = 0
Now we have a QP in a instead of w, c
Once we solve for a, we can find w = ZTa to use for classification
We also need c, which we can get from complementarity:
yi (xi · w + c) = 1 whenever ai > 0
or as dual variable for a · y = 0
Representation of w
Optimal w = ZTa is a linear combination of rows of Z
I.e., w is a linear combination of (signed) training examples
I.e., w has a finite representation even if there are infinitely many features
Support vectors
Examples with ai > 0 are called support vectors
“Support vector machine” = learning algorithm (“machine”) based on support vectors
Often many fewer than number of training examples (a is sparse)
This is the “short description” of an SVM mentioned above
Intuition for support vectors
Partway through optimization
Suppose we have 5 positive support vectors and 5 negative, all with equal weights
Best w so far is ZTa: diff between mean of +ve SVs, mean of -ve
Averaging xi · w + c = yi for all SVs yields
c = −x̄ · w
That is,
• Compute mean of +ve SVs, mean of -ve SVs
• Draw the line between means, and its perpendicular bisector
• This bisector is current classifier
At end of optimization
Gradient wrt ai is 1 − yi(xi · w + c)
Increase ai if (scaled) margin < 1, decrease if margin > 1
Stable iff (ai = 0 AND margin ≥ 1) OR margin = 1
How to avoid writing down weights
Suppose number of features is really big or even infinite?
Can’t write down X, so how do we solve the QP?
Can’t even write down w, so how do we classify new examples?
Solving the QP
max_{0≤a≤C} a · 1 − aT Z ZT a/2 subject to a · y = 0
Write G = Z ZT (called the Gram matrix)

That is, Gij = yi yj (xi · xj)

max_{0≤a≤C} a · 1 − aT G a/2 subject to a · y = 0

With m training examples, G is m × m, which is (often) small enough for us to solve the QP even if we can't write down X
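Building G from explicit features is one line of numpy (toy data of my own):

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0, -1.0])

# G = Z Z^T with Z = diag(y) X, i.e. G_ij = y_i y_j (x_i . x_j)
Z = y[:, None] * X
G = Z @ Z.T
```

The same matrix can be written as `np.outer(y, y) * (X @ X.T)`, which separates the label signs from the raw dot products.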
Can we compute G directly, without computing X?
Kernel trick, part II
Yes, we can compute G directly—sometimes!
Recall that xi was the result of applying a nonlinear feature expansion function φ to some shorter vector (say qi)
Define K(qi,qj) = φ(qi) · φ(qj)
Mercer kernels
K is called a (Mercer) kernel function
Satisfies the Mercer condition: ∫∫ K(q,q′) g(q) g(q′) dq dq′ ≥ 0 for every square-integrable function g
The Mercer condition for a function K is analogous to nonnegative definiteness for a matrix
In many cases there is a simple expression for K even if there isn't one for φ
In fact, it sometimes happens that we know K without knowing φ
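The matrix analogue is easy to check numerically: for a valid kernel, every Gram matrix built from it is nonnegative definite. A sketch using the Gaussian RBF kernel on made-up points:

```python
import numpy as np

# Matrix analogue of the Mercer condition: for a valid kernel, the Gram
# matrix K_ij = K(q_i, q_j) is positive semidefinite for any point set.
def rbf(q, qp, sigma=1.0):
    return np.exp(-np.sum((q - qp) ** 2) / sigma ** 2)

Q = [np.array([0.0]), np.array([1.0]), np.array([3.0])]  # illustrative points
K = np.array([[rbf(qi, qj) for qj in Q] for qi in Q])

eigvals = np.linalg.eigvalsh(K)   # should all be >= 0 (up to rounding)
```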
Example kernels
Polynomial (typical component of φ might be 17q₁²q₂³q₄)
K(q, q′) = (1 + q·q′)ᵏ
Sigmoid (typical component tanh(q₁ + 3q₂))
K(q, q′) = tanh(a q·q′ + b)
Gaussian RBF (typical component exp(−½(q₁ − 5)²))
K(q, q′) = exp(−‖q − q′‖²/σ²)
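The three kernels above are one-liners; a sketch (a, b, k, σ are hyperparameters the user chooses):

```python
import numpy as np

# Sketch implementations of the three example kernels.
def poly_kernel(q, qp, k=2):
    return (1.0 + np.dot(q, qp)) ** k

def sigmoid_kernel(q, qp, a=1.0, b=0.0):
    return np.tanh(a * np.dot(q, qp) + b)

def rbf_kernel(q, qp, sigma=1.0):
    return np.exp(-np.linalg.norm(q - qp) ** 2 / sigma ** 2)
```

(One caveat: the sigmoid kernel does not satisfy the Mercer condition for all choices of a and b, though it is often used in practice anyway.)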
Detail: polynomial kernel
Suppose x = (1, √2 q, q²)ᵀ
Then x′·x = 1 + 2qq′ + q²(q′)²
From previous slide,
K(q, q′) = (1 + qq′)² = 1 + 2qq′ + q²(q′)²
Dot product + addition + exponentiation vs. O(nᵏ) terms
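The identity above is easy to verify numerically: the explicit feature map x(q) = (1, √2 q, q²) reproduces the kernel (1 + qq′)². A quick check at arbitrary test points:

```python
import numpy as np

# Numerical check: the explicit feature map x(q) = (1, sqrt(2) q, q^2)
# reproduces the polynomial kernel (1 + q q')^2 for scalar q.
def phi(q):
    return np.array([1.0, np.sqrt(2.0) * q, q ** 2])

q, qp = 0.5, -2.0                 # arbitrary test points
lhs = np.dot(phi(q), phi(qp))     # x . x' via the feature expansion
rhs = (1.0 + q * qp) ** 2         # direct kernel evaluation
```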
The new decision rule
Recall original decision rule: sign(x · w + c)
Use representation in terms of support vectors:
sign(x·Zᵀa + c) = sign(Σᵢ x·xᵢ yᵢ aᵢ + c) = sign(Σᵢ K(q, qᵢ) yᵢ aᵢ + c)
Since there are usually not too many support vectors, this is a reasonably fast calculation
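As a tiny illustration of this rule, here is a sketch with made-up support vectors, multipliers a, and intercept c (all values hypothetical, as if produced by solving the QP):

```python
import numpy as np

# Kernelized decision rule: sign(sum_i a_i y_i K(q, q_i) + c).
def decide(q, svs, y, a, c, kernel):
    return np.sign(c + sum(ai * yi * kernel(q, qi)
                           for qi, yi, ai in zip(svs, y, a)))

linear = lambda q, qp: float(np.dot(q, qp))
svs = [np.array([-1.0]), np.array([1.0])]   # hypothetical support vectors
y   = [-1.0, 1.0]
a   = [0.5, 0.5]
c   = 0.0
```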
Summary of SVM algorithm
Training:
• Compute Gram matrix Gᵢⱼ = yᵢ yⱼ K(qᵢ, qⱼ)
• Solve QP to get a
• Compute intercept c by using complementarity or duality
Classification:
• Compute kᵢ = K(q, qᵢ) for support vectors qᵢ
• Compute f = c + Σᵢ aᵢ kᵢ yᵢ
• Test sign(f)
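The whole pipeline can be sketched end to end. Note this is a simplified sketch, not the full algorithm above: instead of a proper QP solver it uses projected gradient ascent on the dual, and for brevity it drops the intercept c (and with it the constraint a·y = 0); a real implementation would use an off-the-shelf QP solver and recover c from complementarity. All data below is illustrative.

```python
import numpy as np

def linear_kernel(q, qp):
    return float(np.dot(q, qp))

def train(Q, y, kernel, C=1.0, eta=0.05, iters=2000):
    """Approximate dual solution by projected gradient ascent
    on a.1 - a'Ga/2 over the box 0 <= a <= C (intercept omitted)."""
    m = len(Q)
    G = np.array([[y[i] * y[j] * kernel(Q[i], Q[j]) for j in range(m)]
                  for i in range(m)])
    a = np.zeros(m)
    for _ in range(iters):
        a = np.clip(a + eta * (1.0 - G @ a), 0.0, C)  # gradient is 1 - Ga
    return a

def classify(q, Q, y, a, kernel):
    return np.sign(sum(ai * yi * kernel(q, qi)
                       for qi, yi, ai in zip(Q, y, a)))

# Toy 1-D training set, linearly separable
Q = [np.array([-2.0]), np.array([-1.0]), np.array([1.0]), np.array([2.0])]
y = np.array([-1.0, -1.0, 1.0, 1.0])
a = train(Q, y, linear_kernel)
```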
Outline
• Classification problems
• Perceptrons and convex programs
• From perceptrons to SVMs
• Advanced topics
Advanced kernels
All problems so far: each example is a list of numbers
What about text, relational DBs, . . . ?
Insight: K(x, y) can be defined even when x and y are not of fixed length
Examples:
• String kernels
• Path kernels
• Tree kernels
• Graph kernels
String kernels
Pick λ ∈ (0,1)
cat ↦ c, a, t, λ ca, λ at, λ² ct, λ² cat
Strings are similar if they share lots of nearly-contiguous substrings
Works for words in phrases too: "man bites dog" similar to "man bites hot dog," less similar to "dog bites man"
There is an efficient dynamic-programming algorithm to evaluate this kernel (Lodhi et al., 2002)
Combining kernels
Suppose K(x, y) and K ′(x, y) are kernels
Then so are
• K + K ′
• αK for α > 0
Given a set of kernels K₁, K₂, …, can search for best
K = α₁K₁ + α₂K₂ + ⋯
using cross-validation, etc.
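These closure properties are straightforward to use in code; a sketch combining two kernels and checking that the resulting Gram matrix is still nonnegative definite (points and weights are illustrative):

```python
import numpy as np

# Closure properties: if k1 and k2 are kernels, so is a1*k1 + a2*k2
# for a1, a2 > 0; the combined Gram matrix stays PSD.
def combine(k1, k2, a1=1.0, a2=1.0):
    return lambda x, y: a1 * k1(x, y) + a2 * k2(x, y)

lin = lambda x, y: float(np.dot(x, y))
rbf = lambda x, y: float(np.exp(-np.sum((x - y) ** 2)))
k = combine(lin, rbf, a1=0.5, a2=2.0)

pts = [np.array([0.0]), np.array([1.0]), np.array([-1.0])]
K = np.array([[k(p, q) for q in pts] for p in pts])
```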
“Kernel X”
Kernel trick isn't limited to SVMs
Works whenever we can express an algorithm using only sums and dot products of training examples
Examples:
• kernel Fisher discriminant
• kernel logistic regression
• kernel linear and ridge regression
• kernel SVD or PCA
• 1-class learning / density estimation
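Kernel ridge regression is perhaps the simplest of these: the fit depends on the data only through the Gram matrix, so the kernel trick applies directly. A sketch (toy data, linear kernel): solve (K + λI)α = y, then predict with f(x) = Σᵢ αᵢ K(x, xᵢ).

```python
import numpy as np

# Kernel ridge regression sketch: fit via (K + lam I) alpha = y,
# predict via f(x) = sum_i alpha_i K(x, x_i).
def fit_kernel_ridge(X, y, kernel, lam=1e-3):
    m = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    return np.linalg.solve(K + lam * np.eye(m), y)

def predict(x, X, alpha, kernel):
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X))

lin = lambda u, v: float(np.dot(u, v))
X = [np.array([1.0]), np.array([2.0]), np.array([3.0])]   # toy data: y = 2x
y = np.array([2.0, 4.0, 6.0])
alpha = fit_kernel_ridge(X, y, lin, lam=1e-6)
```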
Summary
Perceptrons are a simple, popular way to learn a classifier
They suffer from inefficient use of data, overfitting, and lack of expressiveness
SVMs fix these problems using margins and feature expansion
In order to make feature expansion computationally feasible, we need the kernel trick
Kernel trick avoids writing out high-dimensional feature vectors by use of Lagrange multipliers and the representer theorem
SVMs are popular classifiers because they usually achieve good error rates and can handle unusual types of data
References
http://www.cs.cmu.edu/~ggordon/SVMs
these slides, together with code
http://svm.research.bell-labs.com/SVMdoc.html
Burges’s SVM tutorial
http://citeseer.nj.nec.com/burges98tutorial.html
Burges's paper "A Tutorial on Support Vector Machines for Pattern Recognition" (1998)
References
Huma Lodhi, Craig Saunders, Nello Cristianini, John Shawe-Taylor, Chris Watkins. "Text Classification using String Kernels." 2002.
http://www.cs.rhbnc.ac.uk/research/compint/areas/
comp_learn/sv/pub/slide1.ps
Slides by Stitson & Weston
http://oregonstate.edu/dept/math/CalculusQuestStudyGuides/
vcalc/lagrang/lagrang.html
Lagrange Multipliers
http://svm.research.bell-labs.com/SVT/SVMsvt.html
on-line SVM applet