Support Vector Machines
Tao Tao
Department of computer science
University of Illinois
Adapting content and even slides from:
• Gentle Guide to Support Vector Machines, Ming-Hsuan Yang
• Support Vector and Kernel Methods, Thorsten Joachims
Problem
Optimal hyper-plane to classify data points
How to choose this hyper-plane?
What is “optimal”?
Intuition: to maximize the margin
What is “optimal”?
Statistically: risk minimization
• Risk function: Risk_P(h) = P(h(x) ≠ y) = ∫ Δ(h(x) ≠ y) dP(x,y), for h ∈ H
  h: hyper-plane function; x: vector; y ∈ {1, -1}; Δ: indicator function
• Minimization: h_opt = argmin_h {Risk_P(h)}
In practice…
Given N observations (x_i, y_i), with labels y_i ∈ {1, -1}
Looking for a mapping: x → f(x,α) ∈ {1, -1}
Expected risk: R(α) = ∫ ½ |y − f(x,α)| dP(x,y)
Empirical risk: R_emp(α) = (1/2N) ∑_i |y_i − f(x_i,α)|
Question: Are they consistent in terms of minimization?
Vapnik/Chervonenkis (VC) dimension
Definition: the VC dimension of H is the maximum number h of examples that can be split into two sets in all 2^h ways using functions from H.
Example: for hyper-planes in R² the VC dimension is 3 (in R^n it is n+1).
But 4 points (e.g. the XOR configuration) cannot be shattered:
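The claim above can be checked by brute force. The following sketch (our own example, not from the slides) tests every labeling of the four corners of a square for linear separability, posing each as an LP feasibility problem; the square, the vocabulary of 16 labelings, and the LP formulation are our choices.

```python
# Brute-force check that linear classifiers in R^2 cannot shatter 4 points:
# test every labeling of the square's corners for linear separability.
from itertools import product

import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Feasibility LP: does (w, b) exist with y_i(w.x_i + b) >= 1 for all i?"""
    # Variables z = (w1, w2, b); constraints -y_i(w.x_i + b) <= -1.
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)
    return res.status == 0  # 0 = feasible, 2 = infeasible

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
count = sum(separable(square, labels) for labels in product([-1, 1], repeat=4))
print(count)  # 14 of the 16 labelings; the two XOR labelings fail
```

All 2³ = 8 labelings of three non-collinear points do succeed, which is why the VC dimension in R² is exactly 3.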
Upper bound for expected risk
R(α) ≤ R_emp(α) + √( (h(log(2N/h) + 1) − log(η/4)) / N )
This bound on the expected risk holds with probability 1 − η.
h: VC dimension
First term: training error. Second term: VC confidence.
Keeping both small avoids overfitting.
Error vs. VC dimension
Want to minimize the expected risk?
It is not enough just to minimize the empirical risk.
We also need to choose an appropriate VC dimension, making both terms of the bound small.
Solution: Structural Risk Minimization (SRM)
Structural risk minimization
Nested structure of hypothesis spaces: H_1 ⊂ H_2 ⊂ … ⊂ H_n ⊂ …
h(n) ≤ h(n+1), where h(n) is the VC dimension of H_n
Tradeoff between VC dimension and empirical risk:
pick the H_n whose minimum empirical risk, combined with its VC confidence, gives the lowest bound.
Linear SVM
Given x_i ∈ R^n.
Linearly separable: there exist w ∈ R^n and b ∈ R such that y_i(w●x_i + b) ≥ 1.
Scale (w, b) so that the distance of the closest points, say x_j, equals 1/||w||.
Optimal separating hyper-plane (OSH): maximize 1/||w||.
Linear SVM example
Given (x_i, y_i), find (w, b) such that <w●x> + b = 0,
with the additional requirement min_i |<w●x_i> + b| = 1.
f(x, w, b) = sgn(<w●x> + b)

ID   x1  x2  x3  x4  x5  x6  x7    y
D1    1   2   0   0   2   0   2    1
D2    0   0   0   3   0   1   1   -1
D3    0   2   1   0   0   0   3    1
D4    0   0   1   1   1   1   1   -1

w  =  2   3  -1  -3  -1  -1   0,  b = 1
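A quick sanity check (ours, not part of the slides) that the (w, b) above classifies all four documents correctly; each y_i(w●x_i + b) should be positive:

```python
# Verify the example hyper-plane: compute the signed margins y_i(w.x_i + b).
import numpy as np

X = np.array([
    [1, 2, 0, 0, 2, 0, 2],   # D1
    [0, 0, 0, 3, 0, 1, 1],   # D2
    [0, 2, 1, 0, 0, 0, 3],   # D3
    [0, 0, 1, 1, 1, 1, 1],   # D4
])
y = np.array([1, -1, 1, -1])
w = np.array([2, 3, -1, -3, -1, -1, 0])
b = 1

margins = y * (X @ w + b)
print(margins)  # [7 9 6 5] -- every point is on the correct side
```

Note the smallest |<w●x_i> + b| here is 5, not 1, so this (w, b) satisfies separability but not yet the normalization min_i |<w●x_i> + b| = 1; rescaling (w, b) by 1/5 would achieve it.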
VC dimension upper bound
Lemma [Vapnik 1995]:
• let R be the radius of the smallest ball covering all x: {x : ||x − a|| < R},
• let f_{w,b}(x) = sgn(<w●x> + b) be the decision functions,
• with ||w|| ≤ A.
Then the VC dimension h satisfies h < R²A² + 1.
Note ||w|| = 1/δ, where δ is the margin length.
So …
Maximizing the margin δ
⇒ Minimizing ||w||
⇒ Smallest acceptable VC dimension
⇒ Constructing an optimal hyper-plane
Is everything clear??
How to do it? Quadratic Programming!
Constrained quadratic programming
Minimize ½ <w●w>
Subject to y_i(<w●x_i> + b) ≥ 1, i = 1, …, N
Solve it with Lagrange multipliers to find the saddle point.
For more details, go to the book:
An introduction to Support Vector Machines
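As a minimal numerical sketch (ours, not from the slides or the book), the primal QP above can be handed to a general-purpose solver; here we use SciPy's SLSQP on a tiny hand-made dataset. A real SVM implementation would instead solve the dual, as later slides explain.

```python
# Solve the hard-margin primal QP directly with SLSQP:
# minimize 1/2 ||w||^2  subject to  y_i(w.x_i + b) >= 1.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

def objective(z):                       # z = (w1, w2, b)
    return 0.5 * (z[0] ** 2 + z[1] ** 2)

constraints = [{"type": "ineq",
                "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)  # close to w = (0.5, 0.5), b = 0: the closest points get margin 1
```

The support vectors here are (1, 1) and (−1, −1); the optimum makes exactly these constraints tight.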
What is “support vectors”?
y_i(w●x_i + b) ≥ 1
Most of the x_i satisfy the strict inequality;
the x_i that achieve equality
are called support vectors.
Inseparable data
Soft margin classifier
Loosen the margin by introducing N nonnegative slack variables ξ = (ξ_1, ξ_2, …, ξ_N),
so that y_i(<w●x_i> + b) ≥ 1 − ξ_i.
Problem:
Minimize ½ <w●w> + C ∑_i ξ_i
Subject to y_i(<w●x_i> + b) ≥ 1 − ξ_i
          ξ_i ≥ 0
C and ξ
C:
• C small: maximize the minimum distance (wide margin)
• C large: minimize the number of misclassified points
ξ_i:
• ξ_i > 1: misclassified point
• 0 < ξ_i < 1: correctly classified, but closer to the hyper-plane than 1/||w||
• ξ_i = 0: margin vectors
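The effect of C can be seen by solving the soft-margin QP twice on the same data. The sketch below (our own construction: the dataset, the outlier at 2.2, and the use of SciPy's SLSQP are all assumptions, not from the slides) shows that a small C yields a smaller ||w||, i.e. a wider margin.

```python
# Soft-margin QP for two values of C:
# minimize 1/2 ||w||^2 + C * sum(xi)  s.t.  y_i(w.x_i + b) >= 1 - xi_i, xi >= 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0], [1, 0], [2.2, 0], [3, 0], [4, 0]])
y = np.array([-1, -1, -1, 1, 1])   # the -1 point at 2.2 sits close to the +1 class
n = len(y)

def fit(C):
    # Variables z = (w1, w2, b, xi_1, ..., xi_n).
    def objective(z):
        w, xi = z[:2], z[3:]
        return 0.5 * w @ w + C * xi.sum()
    cons = [{"type": "ineq",
             "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0 + z[3 + i]}
            for i in range(n)]
    bounds = [(None, None)] * 3 + [(0, None)] * n   # slacks are nonnegative
    res = minimize(objective, x0=np.zeros(3 + n), bounds=bounds, constraints=cons)
    return res.x[:2]

w_small, w_large = fit(C=0.01), fit(C=100.0)
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
# small C -> smaller ||w|| (wider margin); large C -> near the hard-margin fit
```

With C = 0.01 the solver happily pays slack to keep ||w|| tiny; with C = 100 it essentially reproduces the hard-margin solution squeezed between x = 2.2 and x = 3.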
Nonlinear SVM
Idea: map the data from R² into R³, where a nonlinear boundary becomes a linear one.
Feature space
Φ maps the input space into a higher-dimensional feature space:
(a | b | c)  →Φ→  (a | b | c | aa | ab | ac | bb | bc | cc)
Problem: very many parameters! O(N^p) attributes in the feature space, for N original attributes and degree p.
Solution: Kernel methods!
Dual representations
Lagrange multipliers:
L(w, b, α) = ½ <w●w> − ∑_i α_i [y_i(<w●x_i> + b) − 1],  α_i ≥ 0
Require ∂L/∂w = 0 and ∂L/∂b = 0:
w = ∑_i α_i y_i x_i,   ∑_i α_i y_i = 0
Substitute these back into L to obtain the dual problem.
Constrained QP using the dual
Maximize ∑_i α_i − ½ α^T D α
Subject to α_i ≥ 0 and ∑_i α_i y_i = 0,
where D is an N×N matrix such that D_ij = y_i y_j <x_i●x_j>.
Observation: the only way the data points appear in the training problem is in the form of dot products <x_i●x_j>.
Go back to nonlinear SVM…
Original decision function: f(x) = sgn(∑_i α_i y_i <x_i●x> + b)
Expanding to a high-dimensional space: f(x) = sgn(∑_i α_i y_i (Φ(x_i)●Φ(x)) + b)
Problem: Φ is computationally expensive.
Fortunately, we only ever need the dot products Φ(x_i)●Φ(x_j).
Kernel function
K(x_i, x_j) = Φ(x_i)●Φ(x_j)
• No need to know Φ explicitly
• Replace <x_i●x_j> by K(x_i, x_j)
• All previous derivations for the linear SVM still hold
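A small concrete check (our own example, not from the slides): for x ∈ R², the polynomial kernel K(x, z) = (<x●z>)² equals the dot product of the explicit feature map Φ(x) = (x₁², √2·x₁x₂, x₂²), so the kernel gives the feature-space dot product without ever forming Φ.

```python
# Kernel trick in miniature: K(x, z) = (x.z)^2 vs. explicit Phi(x).Phi(z).
import numpy as np

def K(x, z):
    return float(x @ z) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), phi(x) @ phi(z))  # both ~1, up to floating-point rounding
```

The kernel costs one 2-dimensional dot product; the explicit map already needs 3 dimensions, and for degree p on N attributes it would need O(N^p).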
How to choose the kernel function? Mercer's condition (necessary and sufficient):
K(u, v) is symmetric, and for every g with ∫ g(u)² du < ∞,
∫∫ K(u, v) g(u) g(v) du dv ≥ 0
Some examples of kernel functions:
• Polynomial: K(x, y) = (<x●y> + 1)^p
• Gaussian (RBF): K(x, y) = exp(−||x − y||² / 2σ²)
• Sigmoid: K(x, y) = tanh(κ<x●y> − δ)
Multiple classes (k)
• One-against-the-rest: k SVMs
• One-against-one: k(k−1)/2 SVMs
• K-class SVM
• John Platt’s DAG method
Application in text classification
Count each term’s occurrences in an article; the article thereby becomes a vector x.
Attributes: terms
Values: occurrence or frequency
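A toy illustration of the step above (ours, not from the slides; the vocabulary and tokenizer are placeholder choices): turn an article into a term-count vector over a fixed vocabulary.

```python
# Bag-of-words vectorization: article -> vector of term counts.
import re
from collections import Counter

vocabulary = ["svm", "margin", "kernel", "text"]

def vectorize(article, vocab=vocabulary):
    tokens = re.findall(r"[a-z]+", article.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

doc = "Text classification with SVM: the SVM maximizes the margin"
print(vectorize(doc))  # [2, 1, 0, 1]
```

Each component is one attribute (a term); in practice the raw counts are often replaced by frequencies or TF-IDF weights.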
Conclusions
• Linear SVM
• VC dimension
• Soft margin classifier
• Dual representation
• Nonlinear SVM
• Kernel methods
• Multi-class classifiers
Thank you!