Support Vector Machines (SVMs)
Chapter 5 (Duda et al.)
CS479/679 Pattern Recognition
Dr. George Bebis
Learning through “empirical risk” minimization
• Estimate g(x) from a finite set of observations by minimizing an error function, for example, the training error (also called empirical risk):
class labels:
$$z_k = \begin{cases} +1 & \text{if } \mathbf{x}_k \in \omega_1 \\ -1 & \text{if } \mathbf{x}_k \in \omega_2 \end{cases}$$

$$R_{emp} = \frac{1}{2n}\sum_{k=1}^{n}\left[z_k - g(\mathbf{x}_k)\right]^2$$
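As a small numeric illustration of this definition (a minimal sketch; the toy data and the linear g(x) below are assumptions, not from the slides):

```python
import numpy as np

# Toy data: rows of X are samples x_k, z holds class labels in {+1, -1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
z = np.array([1.0, 1.0, -1.0, -1.0])

# An assumed linear discriminant g(x) = w^t x + w0, for illustration only.
w, w0 = np.array([1.0, 1.0]), 0.0
g = np.sign(X @ w + w0)          # predicted labels in {+1, -1}

# Empirical risk: R_emp = (1/2n) * sum_k [z_k - g(x_k)]^2
# equals 2 * (fraction misclassified) when g outputs hard labels in {+1, -1}.
R_emp = np.mean((z - g) ** 2) / 2
print(R_emp)                     # 0.0 when every sample is classified correctly
```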
Learning through “empirical risk” minimization (cont’d)
• Conventional empirical risk minimization does not imply good generalization performance.
– There could be several different functions g(x) which all approximate the training data set well.
– Difficult to determine which function would have the best generalization performance.
Learning through “empirical risk” minimization (cont’d)
[Figure: two candidate separating hyperplanes B1 and B2. Which solution is better, Solution 1 or Solution 2?]
Statistical Learning: Capacity and VC dimension
• To guarantee good generalization performance, the capacity (i.e., complexity) of the learned functions must be controlled.
• Functions with high capacity are more complicated (i.e., have many degrees of freedom).
[Figure: a low-capacity function vs. a high-capacity function.]
Statistical Learning: Capacity and VC dimension (cont’d)
• How do we measure capacity?
– In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of capacity.
– The VC dimension can be used to derive a probabilistic upper bound on the generalization error of a classifier.
Statistical Learning: Capacity and VC dimension (cont’d)
• A function that (1) minimizes the empirical risk and (2) has low VC dimension
will generalize well regardless of the dimensionality of the input space:
with probability (1-δ); (n: # of training examples)
(Vapnik, 1995, “Structural Risk Minimization Principle”)
$$err_{true} \le err_{train} + \sqrt{\frac{VC\left(\log(2n/VC) + 1\right) - \log(\delta/4)}{n}}$$

Minimizing this bound, over both the training error and the VC confidence term, is the principle of structural risk minimization.
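A small sketch that evaluates this bound numerically (the training error, VC dimension, and δ values are illustrative assumptions):

```python
import numpy as np

def vc_bound(err_train, n, vc, delta):
    """Vapnik's probabilistic upper bound on the true error:
    err_true <= err_train + sqrt((vc*(log(2n/vc)+1) - log(delta/4)) / n)."""
    penalty = np.sqrt((vc * (np.log(2 * n / vc) + 1) - np.log(delta / 4)) / n)
    return err_train + penalty

# The bound tightens as n grows and loosens as the VC dimension grows.
for n in (100, 1000, 10000):
    print(n, vc_bound(err_train=0.05, n=n, vc=10, delta=0.05))
```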
VC dimension and margin of separation
• Vapnik has shown that maximizing the margin of separation (i.e., empty space between classes) is equivalent to minimizing the VC dimension.
• The optimal hyperplane is the one giving the largest margin of separation between the classes.
Margin of separation and support vectors
• How is the margin defined?
– The margin is defined by the distance of the nearest training samples from the hyperplane.
– We refer to these samples as support vectors.
– Intuitively speaking, these are the most difficult samples to classify.
Margin of separation and support vectors (cont’d)
[Figure: two different solutions, hyperplanes B1 and B2, with their corresponding margins bounded by b11, b12 and b21, b22.]
SVM Overview
• SVMs are primarily two-class classifiers but can be extended to multiple classes.
• Training performs structural risk minimization to achieve good generalization performance.
• The optimization criterion is the margin of separation between classes.
• Training is equivalent to solving a quadratic programming problem with linear constraints.
Linear SVM: separable case
• Linear discriminant:
$$g(\mathbf{x}) = \mathbf{w}^t\mathbf{x} + w_0$$
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.

• Class labels:
$$z_k = \begin{cases} +1 & \text{if } \mathbf{x}_k \in \omega_1 \\ -1 & \text{if } \mathbf{x}_k \in \omega_2 \end{cases}$$

• Consider the equivalent problem:
$$z_k\,g(\mathbf{x}_k) > 0 \quad \text{or} \quad z_k(\mathbf{w}^t\mathbf{x}_k + w_0) > 0, \quad k = 1, 2, \ldots, n$$
Linear SVM: separable case (cont’d)
• The distance of a point xk from the separating hyperplane should satisfy the constraint:
$$\frac{z_k\,g(\mathbf{x}_k)}{\|\mathbf{w}\|} \ge b, \quad b > 0$$

• To constrain the length of w (for uniqueness of the solution), we impose:
$$b\,\|\mathbf{w}\| = 1$$

• Using the above constraint:
$$z_k\,g(\mathbf{x}_k) \ge 1 \quad \text{or} \quad z_k(\mathbf{w}^t\mathbf{x}_k + w_0) \ge 1$$
Linear SVM: separable case (cont’d)
maximize the margin:
$$\frac{2}{\|\mathbf{w}\|}$$
subject to
$$z_k(\mathbf{w}^t\mathbf{x}_k + w_0) \ge 1, \quad k = 1, 2, \ldots, n$$
This is a quadratic programming problem (equivalently, minimize ||w||²/2 under the same constraints).
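A minimal sketch of this primal QP with a general-purpose solver (the toy data are assumed; in practice a dedicated QP solver would be used):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: rows are x_k, z holds labels in {+1, -1}.
X = np.array([[2.0, 2.0], [2.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])
z = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# Variables v = (w, w0); minimize ||w||^2 / 2 s.t. z_k (w^t x_k + w0) >= 1.
obj = lambda v: 0.5 * v[:d] @ v[:d]
cons = [{'type': 'ineq',
         'fun': (lambda v, k=k: z[k] * (X[k] @ v[:d] + v[d]) - 1.0)}
        for k in range(n)]

res = minimize(obj, np.zeros(d + 1), constraints=cons)
w, w0 = res.x[:d], res.x[d]
print("w =", w, " w0 =", w0, " margin =", 2 / np.linalg.norm(w))
```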
Linear SVM: separable case (cont’d)
• Using Lagrange optimization, minimize:
$$L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{k=1}^{n}\lambda_k\left[z_k(\mathbf{w}^t\mathbf{x}_k + w_0) - 1\right], \quad \lambda_k \ge 0$$

• It is easier to solve the “dual” problem (Kuhn-Tucker construction): maximize
$$\sum_{k=1}^{n}\lambda_k - \frac{1}{2}\sum_{k,j}\lambda_k\lambda_j z_k z_j\,\mathbf{x}_k^t\mathbf{x}_j$$
subject to λk ≥ 0 and Σk λk zk = 0.
Linear SVM: separable case (cont’d)
• The solution is given by:
$$\mathbf{w} = \sum_{k=1}^{n}\lambda_k z_k \mathbf{x}_k, \qquad w_0 = z_k - \mathbf{w}^t\mathbf{x}_k \ \text{ (for any support vector } \mathbf{x}_k\text{)}$$

• Substituting into g(x) = w^t x + w0:
$$g(\mathbf{x}) = \sum_{k=1}^{n}\lambda_k z_k(\mathbf{x}_k \cdot \mathbf{x}) + w_0 \qquad \text{(dot product)}$$
Linear SVM: separable case (cont’d)
• It can be shown that if xk is not a support vector, then the corresponding λk = 0.

Only the support vectors contribute to the solution:
$$g(\mathbf{x}) = \sum_{k=1}^{n}\lambda_k z_k(\mathbf{x}_k \cdot \mathbf{x}) + w_0$$
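The following sketch solves the dual numerically on assumed toy data, recovers w and w0, and shows that non-support vectors receive λk ≈ 0:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])
z = np.array([1.0, 1.0, -1.0, -1.0])
n = len(z)
G = (z[:, None] * X) @ (z[:, None] * X).T   # G[k, j] = z_k z_j (x_k . x_j)

# Maximize sum(lam) - 0.5 lam^t G lam  ==  minimize the negative.
dual = lambda lam: 0.5 * lam @ G @ lam - lam.sum()
res = minimize(dual, np.zeros(n),
               bounds=[(0, None)] * n,
               constraints=[{'type': 'eq', 'fun': lambda lam: lam @ z}])
lam = res.x

w = (lam * z) @ X                       # w = sum_k lam_k z_k x_k
sv = np.argmax(lam)                     # index of one support vector
w0 = z[sv] - X[sv] @ w                  # w0 = z_k - w^t x_k at a support vector
print("lambda =", np.round(lam, 4))     # near-zero entries: non-support vectors
print("w =", w, " w0 =", w0)
```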
Linear SVM: non-separable case
• Allow misclassifications (i.e., a soft-margin classifier) by introducing positive error (slack) variables ψk:
$$z_k(\mathbf{w}^t\mathbf{x}_k + w_0) \ge 1 - \psi_k, \quad k = 1, 2, \ldots, n$$
Linear SVM: non-separable case (cont’d)
• The objective becomes minimizing
$$\frac{1}{2}\|\mathbf{w}\|^2 + c\sum_{k=1}^{n}\psi_k$$
subject to the relaxed constraints, where the constant c controls the trade-off between the margin and the misclassification errors.
• This aims to prevent outliers from affecting the optimal hyperplane.
Linear SVM: non-separable case (cont’d)
• It is again easier to solve the “dual” problem (Kuhn-Tucker construction): maximize
$$\sum_{k=1}^{n}\lambda_k - \frac{1}{2}\sum_{k,j}\lambda_k\lambda_j z_k z_j\,\mathbf{x}_k^t\mathbf{x}_j$$
now subject to 0 ≤ λk ≤ c and Σk λk zk = 0.

• The solution has the same form as in the separable case:
$$g(\mathbf{x}) = \sum_{k=1}^{n}\lambda_k z_k(\mathbf{x}_k \cdot \mathbf{x}) + w_0$$
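A quick illustration of the c trade-off using scikit-learn’s linear SVC (the data and c values are assumptions for illustration; scikit-learn names the constant C):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
z = np.hstack([np.ones(50), -np.ones(50)])   # overlapping classes

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, z)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"margin = {2 / np.linalg.norm(clf.coef_):.3f}")
# Small C -> wider margin, more slack (more support vectors);
# large C -> narrower margin, fewer misclassifications tolerated.
```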
Nonlinear SVM
• Extending these concepts to the non-linear case involves mapping the data to a high-dimensional space h:
$$\boldsymbol{\Phi}: \mathbf{x}_k \rightarrow \boldsymbol{\Phi}(\mathbf{x}_k) = \begin{bmatrix}\varphi_1(\mathbf{x}_k)\\ \varphi_2(\mathbf{x}_k)\\ \vdots\\ \varphi_h(\mathbf{x}_k)\end{bmatrix}$$
• Mapping the data to a sufficiently high-dimensional space is likely to make them linearly separable in that space.
Nonlinear SVM (cont’d)
linear SVM:
$$g(\mathbf{x}) = \sum_{k=1}^{n}\lambda_k z_k(\mathbf{x}_k \cdot \mathbf{x}) + w_0$$

non-linear SVM:
$$g(\mathbf{x}) = \sum_{k=1}^{n}\lambda_k z_k\big(\boldsymbol{\Phi}(\mathbf{x}_k) \cdot \boldsymbol{\Phi}(\mathbf{x})\big) + w_0$$
Nonlinear SVM (cont’d)
non-linear SVM:
$$g(\mathbf{x}) = \sum_{k=1}^{n}\lambda_k z_k\big(\boldsymbol{\Phi}(\mathbf{x}_k) \cdot \boldsymbol{\Phi}(\mathbf{x})\big) + w_0$$

• The disadvantage of this approach is that the mapping xk → Φ(xk) can be very expensive to compute!
• Is there an efficient way to compute Φ(xk) · Φ(x)?
The kernel trick
• Compute the dot products using a kernel function:
$$K(\mathbf{x}_k, \mathbf{x}) = \boldsymbol{\Phi}(\mathbf{x}_k) \cdot \boldsymbol{\Phi}(\mathbf{x})$$

The discriminant becomes:
$$g(\mathbf{x}) = \sum_{k=1}^{n}\lambda_k z_k\,K(\mathbf{x}_k, \mathbf{x}) + w_0$$
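A numeric check of the trick for the degree-2 polynomial kernel K(x, y) = (x · y)² (the explicit map Φ below is the standard one for two-dimensional inputs, used here for illustration):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for v = (v1, v2)."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])

explicit = phi(x) @ phi(y)      # dot product in the mapped space
kernel   = (x @ y) ** 2         # K(x, y) = (x . y)^2, no mapping needed
print(explicit, kernel)         # both equal 1.0
```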
The kernel trick (cont’d)
• Comments:
– Kernel functions that can be expressed as a dot product in some space satisfy Mercer’s condition (see Burges’ paper).
– Mercer’s condition does not tell us how to construct Φ() or even what the high-dimensional space is.
• Advantages of the kernel trick:
– There is no need to know Φ().
– Computations remain feasible even if the feature space has high dimensionality.
Polynomial Kernel
$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^d$$
Polynomial Kernel - Example
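For d = 2 and x, y ∈ R², the example works out as follows (a standard derivation, reconstructed here):

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^2 = (x_1y_1 + x_2y_2)^2 = x_1^2y_1^2 + 2x_1y_1x_2y_2 + x_2^2y_2^2$$
$$= (x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2) \cdot (y_1^2,\ \sqrt{2}\,y_1y_2,\ y_2^2) = \boldsymbol{\Phi}(\mathbf{x}) \cdot \boldsymbol{\Phi}(\mathbf{y})$$

so the kernel computes the dot product in the mapped space without ever forming Φ(x).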
Common Kernel functions
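Standard choices usually listed here (the exact parameterization on the original slide is assumed):
• Polynomial: K(x, y) = (x · y + 1)^d
• Gaussian (RBF): K(x, y) = exp(−||x − y||² / (2σ²))
• Sigmoid: K(x, y) = tanh(κ(x · y) + θ)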
Example
[Worked example (Duda et al., Problem 4): the training data are mapped to an h = 6 dimensional feature space; the solution yields w0 = 0 and the weight vector w, from which the discriminant g(x) is obtained.]
Comments
• SVM training is based on exact optimization, not on approximate methods (it is a global optimization method with no local optima).
• SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well even from small training sets.
• Performance depends on the choice of the kernel and its parameters.
• Complexity depends on the number of support vectors, not on the dimensionality of the transformed space.
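As a closing illustration, a minimal end-to-end sketch with scikit-learn and an RBF kernel (the data and hyperparameters are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Two noisy concentric rings: not linearly separable in the input space.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.hstack([np.full(100, 1.0), np.full(100, 3.0)]) + rng.normal(0, 0.2, 200)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
z = np.hstack([np.ones(100), -np.ones(100)])

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
print("CV accuracy:", cross_val_score(clf, X, z, cv=5).mean())
clf.fit(X, z)
print("support vectors:", clf.n_support_.sum(), "of", len(z))
```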