Complexity and Support Vector Machines
Aleix Ruiz de Villa
Grup d’Estudi de Machine Learning de Barcelona, King’s Offices
May 6th, 2014
Outline
1 Introduction: Introduction, Framework, History
2 Statistical Learning Theory: ERM, Principle of parsimony, Consistency
3 Introduction to SVM - Classification: Hard Margin Classifiers, Regularization - Soft Margin
4 Kernels: Definition, Reproducing Kernel Hilbert Spaces
5 SVM: Definition, SVM for regression, Statistical Properties - Gaussian Kernel
6 Implementation
Introduction
Typical framework with linear regression:
(1) We have some data (x_1, y_1), . . . , (x_n, y_n) ∈ R^{p+1}.
(2) We consider the family of functions
y = w_0 + w_1 x^1 + . . . + w_p x^p + ε = f(x) + ε,
where ε is some random error.
(3) We play (include/exclude, transform, ...) with the covariates x^k.
(4) We fit our models with the data,
min_{f∈H} ∑_i (y_i − f(x_i))² = min_{w_0,...,w_p} ∑_i (y_i − w_0 − w_1 x_i^1 − . . . − w_p x_i^p)²,
obtaining optimal w_0(n), . . . , w_p(n).
(5) We use the optimal model to make predictions (evaluate on new x) and validate them (when we have the response y).
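As a minimal sketch of steps (4) and (5), the snippet below fits the coefficients by least squares in NumPy and checks the prediction error on fresh data; the data-generating function, noise level, and sample sizes are illustrative assumptions.

```python
# Minimal sketch of step (4): fitting w_0, ..., w_p by least squares.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                              # covariates x^1, ..., x^p
w_true = np.array([1.0, -2.0, 0.5])                      # illustrative "true" coefficients
y = 0.7 + X @ w_true + rng.normal(scale=0.3, size=n)     # y = w_0 + <w, x> + eps

# Add the intercept column and solve min_w sum_i (y_i - w_0 - <w, x_i>)^2.
Xb = np.column_stack([np.ones(n), X])
w_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print("estimated (w_0, ..., w_p):", np.round(w_hat, 3))

# Step (5): predict on new data and check the prediction error.
X_new = rng.normal(size=(50, p))
y_new = 0.7 + X_new @ w_true + rng.normal(scale=0.3, size=50)
pred = np.column_stack([np.ones(50), X_new]) @ w_hat
print("mean squared prediction error:", np.mean((y_new - pred) ** 2))
```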
Motivation
The best w* would be obtained if we had an infinite sample, n = ∞.
We think (hope) that w(n) is a good approximation of w*.
We play with the covariates, but adding covariates decreases the fitting error without necessarily improving the prediction error (when new data x arrive).
We would like an infinite sample of new data to evaluate the chosen model.
We've got
Data D = {(x_1, y_1), . . . , (x_n, y_n) ∈ R^{m+1}} sampled independently from an unknown probability distribution P(x, y) (y = f(x) + ε is a particular case); the sampled data are the source of randomness.
We decide
- A loss function L(y, t).
- A family of functions H (our models).
We calculate
- Empirical risk (training error) R_n(f) = (1/n) ∑_i L(y_i, f(x_i)) for f ∈ H.
- Optimal empirical function f_n = arg min_H R_n(f).
Goal quantities
- Expected risk (generalization error) R(f) = ∫ L(y, f(x)) dP(x, y) for f ∈ H.
- Minimum expected risk R* = min_H R(f).
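A small sketch of these quantities: the empirical risk is an average over the sample, while the expected risk can only be approximated here by drawing many fresh points from the (in practice unknown) distribution P. The distribution, the model f, and the squared loss below are illustrative choices.

```python
# Hedged sketch: empirical risk R_n(f) on the sample vs. a Monte Carlo
# estimate of the expected risk R(f) on fresh draws from the same P(x, y).
import numpy as np

rng = np.random.default_rng(1)

def sample(n):                       # a stand-in for the unknown P(x, y)
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

def f(x):                            # some fixed model in H (here: a line)
    return 0.8 * x

def L(y, t):                         # squared loss
    return (y - t) ** 2

x_train, y_train = sample(50)
R_n = np.mean(L(y_train, f(x_train)))        # empirical risk (training error)

x_big, y_big = sample(200_000)
R = np.mean(L(y_big, f(x_big)))              # ~ expected risk (generalization error)

print(f"R_n(f) = {R_n:.3f}   R(f) ~ {R:.3f}")
```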
Framework - Example loss functions
Problem (Classification)
The y's are labels, y_i ∈ {−1, 1}.
L(y, f(x)) = 1 if sign(f(x)) ≠ y, and 0 if sign(f(x)) = y.
Problem (Regression)
The y's are continuous variables.
L(y, f(x)) = (y − f(x))².
Our setting deals with more loss functions than these two.
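The two example losses in code (the toy labels and scores are made up):

```python
import numpy as np

def zero_one_loss(y, fx):            # classification: 1 if sign(f(x)) != y
    return (np.sign(fx) != y).astype(float)

def squared_loss(y, fx):             # regression
    return (y - fx) ** 2

y = np.array([1, -1, 1, -1])
fx = np.array([0.3, 0.8, -1.2, -0.1])
print(zero_one_loss(y, fx))          # [0. 1. 1. 0.]
print(squared_loss(y, fx))
```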
History
1962 The perceptron (a primitive version of neural networks) was introduced.
1968 Vapnik and Chervonenkis started the theoretical analysis of the learning process.
1985 The backpropagation algorithm was discovered → neural networks.
1990 The VC theoretical analysis of the learning process was completed.
1995 SVMs were introduced.
ERM
One of the most natural learning processes is empirical risk minimization (ERM). Given data D, we want to find f_n ∈ H such that
R_n(f_n) = min_H R_n(f).
Problem (Principle of parsimony)
Adding variables in multilinear regression may suffer from overfitting:
- better fitting,
- worse predictions.
Problem (Consistency)
The degenerate model f = ∑_i y_i 1_{x_i}:
- has null empirical risk,
- brings no insight into the process (bad predictions).
Principle of Parsimony
Principle
Of two equivalent theories or explanations, all other things being equal, the simpler one is to be preferred.
We need to measure the complexity of our models (H) ⇒ VC dimension.
Definition (VC dimension)
We say that H shatters a set of points P = {x_1, ..., x_k} if H can realize all possible labellings of P, that is, if by evaluating on P functions of the form sign(f + b) with f ∈ H and b ∈ R we obtain every possible combination of labels S = {y_1, ..., y_k}, y_i ∈ {−1, 1}.
The VC dimension of the set H is the maximum number of points that H can shatter.
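A rough illustration of the definition: the brute-force sketch below searches for an affine classifier sign(〈w, x〉 + b) in R² realizing each labelling of a point set. The random search over (w, b) is a crude illustrative device, not part of the definition.

```python
# Hedged sketch: check whether affine classifiers in R^2 shatter a point set.
import numpy as np
from itertools import product

def can_realize(points, labels, trials=5_000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        w = rng.normal(size=2)
        b = rng.normal()
        if np.all(np.sign(points @ w + b) == labels):
            return True
    return False

def shatters(points):
    # every one of the 2^k labellings must be realizable
    return all(can_realize(points, np.array(lab))
               for lab in product([-1, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # 3 generic points
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # XOR configuration

print("3 points shattered:", shatters(three))   # True: lines in R^2 have VC dim 3
print("4 points shattered:", shatters(four))    # False for this configuration
```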
Example
The set of hyperplanes in R^n,
∑_{i=1}^n w_i x_i + b = 〈w, x〉 + b,
has VC dimension n + 1.
Example
The set of functions f = ∑_i y_i 1_{x_i} has infinite VC dimension.
Example
The set of hyperplanes with b = 0 on {‖x‖ ≤ R} and satisfying ‖w‖ ≤ Λ has VC dimension at most R²Λ².
Example
The VC dimension of the set of Gaussian functions in R^m is (m² + 3m)/2.
Example
The VC dimension of the set of functions H = {f(x) = sign(sin(θx)) : θ ∈ R} is ∞.
Theorem (Vapnik, Chervonenkis)
Given a class of functions H with VC dimension h, if A ≤ L(Y, f(X)) ≤ B almost surely, then given η > 0, with probability at least 1 − η, for all f ∈ H we have
|R_n(f) − R(f)| ≤ ((B − A)/2) √ζ,
where
ζ = 4 (h (ln(2n/h) + 1) − ln(η/4)) / n.
Moreover, with probability at least 1 − 2η,
R(f_n) − inf_H R(f) ≤ (B − A) √(−ln η / (2n)) + ((B − A)/2) √ζ.
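To get a feel for the bound, one can plug numbers into the confidence term; for the 0-1 loss A = 0 and B = 1, and the choices of h, n, η below are arbitrary illustrative values.

```python
# Hedged sketch: evaluate the VC confidence term above numerically.
import numpy as np

def vc_gap_bound(n, h, eta, A=0.0, B=1.0):
    zeta = 4 * (h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n
    return (B - A) / 2 * np.sqrt(zeta)

for n in [1_000, 10_000, 100_000, 1_000_000]:
    print(n, round(vc_gap_bound(n, h=10, eta=0.05), 3))
# The bound on |R_n(f) - R(f)| shrinks roughly like sqrt(h log(n) / n).
```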
Consistency
For a fixed f, by the law of large numbers, R_n(f) → R(f).
What we actually do, however, is:
n = 1: find f_1 minimizing min_H R_1(f)
n = 2: find f_2 minimizing min_H R_2(f)
...
n: find f_n minimizing min_H R_n(f)
Consistency
We expect Rn(fn)→ minH R, but this is not always the case.For instance, the family of functions f = sign(
∑i yi1xi ), in
the classification case, if PX is absolutely continuous,
minH Rn(f ) = 0R(fn) =
∫L(y , fn(x))dP(x , y) =
∫L(y , sign(0))dP(x , y) =
P(Y = −1)
Theorem
If VC dimension of H is finite, then ERM is consitent, i.e.
Rn(fn)→ minHR(f )
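A quick numerical illustration of this counterexample, with an illustrative distribution and the slide's convention sign(0) = +1:

```python
# Hedged sketch: a classifier that memorizes the training points has zero
# empirical risk but predicts sign(0) on every unseen point, so its expected
# risk stays at chance level P(Y = -1) = 0.5 here.
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(0, 1, size=n)                 # P_X absolutely continuous
    y = np.where(x > 0.5, 1, -1)                  # a perfectly learnable rule
    return x, y

x_tr, y_tr = sample(100)

def f_memorize(x):
    # sum_i y_i 1_{x_i}: returns y_i on a stored x_i, 0 elsewhere
    out = np.zeros_like(x)
    for xi, yi in zip(x_tr, y_tr):
        out[x == xi] = yi
    return out

def sgn(t):                                        # slide convention: sign(0) = +1
    return np.where(t >= 0, 1, -1)

train_err = np.mean(sgn(f_memorize(x_tr)) != y_tr)
x_te, y_te = sample(10_000)                        # fresh points: never seen before
test_err = np.mean(sgn(f_memorize(x_te)) != y_te)
print(train_err, test_err)                         # 0.0 and roughly 0.5
```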
Hard Margin Classifiers
Given data (x_1, y_1), . . . , (x_n, y_n) ∈ R^{m+1}, where the y_i's are the labels, y_i ∈ {−1, 1}, we are looking for a function
f(x) = ∑_{i=1}^m w_i x_i + b = 〈w, x〉 + b,
that classifies the variable x:
sign(f(x)) = 1 if f(x) ≥ 0, and −1 if f(x) < 0.
Given a hyperplane
h : 〈w, z〉 + p = 0,
and a point x, we want to find the distance d(x, h).
If ‖w‖ = 1, then 〈w, x〉 = Π_w(x) is the projection of x onto the direction of w.
For an arbitrary w, Π_w(x) = 〈w/‖w‖, x〉.
For z belonging to the hyperplane h,
〈w, z〉 + p = 0, hence 〈w/‖w‖, z〉 + p/‖w‖ = 0, so Π_w(z) = −p/‖w‖.
The margin is
D_{w,p}(x) = d(x, h) = Π_w(x) − (−p/‖w‖) = (〈w, x〉 + p)/‖w‖.
Suppose that the set P = {(x_1, y_1), . . . , (x_n, y_n)} ⊂ R^{m+1} with y_i ∈ {−1, 1} is separable by hyperplanes. A point (x, y) is well classified by h if
y D_{w,p}(x) = y (〈w, x〉 + p)/‖w‖ ≥ 0.
Problem: Find the hyperplane with the largest margin, i.e.
max_{w,p} min_i y_i D_{w,p}(x_i).
Problem (1):
max_{w,p,M} M
s.t. y_i (〈w, x_i〉 + p)/‖w‖ ≥ M, for all i.
Problem (2):
max_{w,p,M} M
s.t. y_i (〈w, x_i〉 + p) ≥ 1, for all i, and ‖w‖ M = 1.
Problems (1) and (2) are equivalent: given (w, p, M), take w → w/(‖w‖M) and p → p/M.
Problem (Hard Margin Classification)
min_{w,p} (1/2)‖w‖²
s.t. y_i (〈w, x_i〉 + p) ≥ 1, for all i.
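As a sketch, this QP can be handed directly to a general-purpose constrained solver on a tiny separable toy set; the data and the use of SciPy's SLSQP are illustrative choices, not the dedicated solvers used in practice.

```python
# Hedged sketch: hard-margin QP on a tiny, linearly separable toy set.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],     # class +1
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # class -1
y = np.array([1, 1, 1, -1, -1, -1])

def objective(z):                     # z = (w_1, w_2, p); minimize (1/2)||w||^2
    w = z[:2]
    return 0.5 * np.dot(w, w)

constraints = [{"type": "ineq",       # y_i(<w, x_i> + p) - 1 >= 0
                "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, p = res.x[:2], res.x[2]
print("w =", np.round(w, 3), " p =", round(p, 3))
print("margin = 1/||w|| =", round(1 / np.linalg.norm(w), 3))
print("all constraints satisfied:", np.all(y * (X @ w + p) >= 1 - 1e-6))
```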
Regularization - Soft Margin
When D cannot be separated by hyperplanes, finding the minimum number of misclassifications is an NP-hard problem. Consider instead the following quadratic optimization problem:
Problem (Soft Margin Classification)
min_{w,p,ξ} (1/2)‖w‖² + C ∑_i ξ_i
s.t. y_i (〈w, x_i〉 + p) ≥ 1 − ξ_i, for all i,
ξ_i ≥ 0, for all i.
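This soft-margin problem is, up to parameterization, what a linear SVM solver optimizes; a sketch with scikit-learn's SVC, where C controls the trade-off between the margin term and the slack penalties (the overlapping blob data are an illustrative assumption).

```python
# Hedged sketch: soft-margin linear SVM for several values of C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])   # overlapping classes
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: ||w||={np.linalg.norm(w):.2f}, "
          f"training error={np.mean(clf.predict(X) != y):.2f}, "
          f"#support vectors={len(clf.support_)}")
```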
Karush-Kuhn-Tucker
Optimization Problem
minimize_x f(x)
subject to g_i(x) = 0, i = 1, . . . , m,
h_i(x) ≤ 0, i = 1, . . . , k.
Consider the Lagrangian
L(x, λ, ν) = f(x) + ∑_i λ_i g_i(x) + ∑_i ν_i h_i(x).
Optimality conditions
If x is optimal, then there exist λ, ν such that
∂_x L(x, λ, ν) = ∂_x f(x) + ∑_i λ_i ∂_x g_i(x) + ∑_i ν_i ∂_x h_i(x) = 0,
∂_{λ_i} L(x, λ, ν) = g_i(x) = 0,
h_i(x) ≤ 0,
ν_i h_i(x) = 0,
ν_i ≥ 0.
min_{w,p,ξ} (1/2)‖w‖² + C ∑_i ξ_i
s.t. 1 − ξ_i − y_i (〈w, x_i〉 + p) ≤ 0, for all i,
−ξ_i ≤ 0, for all i.
Karush-Kuhn-Tucker sufficiency conditions:
L = (1/2)‖w‖² + C ∑_i ξ_i + ∑_i α_i (1 − ξ_i − y_i (〈w, x_i〉 + p)) − ∑_i η_i ξ_i,
α_i, η_i ≥ 0,
∂L/∂w = w − ∑_i y_i x_i α_i = 0,
∂L/∂p = 0,
∂L/∂ξ = 0,
α_i (1 − ξ_i − y_i (〈w, x_i〉 + p)) = 0,
η_i ξ_i = 0.
Consequences:
w = ∑_i x_i y_i α_i, so the classification function is
f(x) = 〈w, x〉 + p = ∑_i y_i α_i 〈x, x_i〉 + p.
α_i is zero whenever the corresponding constraint is not active, i.e. when y_i (〈w, x_i〉 + p) > 1 − ξ_i.
- w depends only on the badly classified points and on the points exactly on the margin: robustness.
- Sparse representation: "dimensionality reduction" (of a kind).
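These quantities can be read off a fitted model; a sketch reconstructing w = ∑_i y_i α_i x_i from scikit-learn's dual coefficients (dual_coef_ stores the products y_i α_i for the support vectors), reusing the toy data above.

```python
# Hedged sketch: primal w recovered from the dual expansion of a fitted SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_   # sum_i (y_i alpha_i) x_i
print("w (primal):", np.round(clf.coef_[0], 3))
print("w (dual)  :", np.round(w_from_dual, 3))
print("support vectors:", len(clf.support_), "out of", len(y))  # sparse expansion
```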
min_{w,p} (1/2)‖w‖² + C ∑_i ξ_i
s.t. y_i (〈w, x_i〉 + p) ≥ 1 − ξ_i, for all i,
ξ_i ≥ 0, for all i.
The constraints force
ξ_i ≥ max{0, 1 − y_i (〈w, x_i〉 + p)} = L(y_i, 〈w, x_i〉 + p),
with equality at the optimum, where L(y, t) = max{0, 1 − yt} (the hinge loss).
Problem
min_{w,p} λ‖w‖² + (1/n) ∑_i L(y_i, 〈w, x_i〉 + p)
Interpretation:
(1/n) ∑_i L(y_i, 〈w, x_i〉 + p) = R_n(w, p),
‖w‖ measures the complexity;
it is a trade-off between minimizing errors and the complexity of the model. The term λ‖w‖² regularizes the empirical risk.
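This regularized-hinge-loss form is, up to the exact correspondence between λ and the library's constants, what a linear model trained with the hinge loss and an L2 penalty minimizes; a sketch with SGDClassifier, where alpha plays roughly the role of λ (toy data again).

```python
# Hedged sketch: regularized hinge loss via stochastic gradient descent.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 1.0, size=(100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SGDClassifier(loss="hinge", penalty="l2", alpha=0.01, max_iter=2000,
                    tol=1e-4, random_state=0).fit(X, y)
print("w =", np.round(clf.coef_[0], 3), " p =", round(clf.intercept_[0], 3))
print("training error:", np.mean(clf.predict(X) != y))
```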
Problem
min_{w,p} λ‖w‖² + (1/n) ∑_i L(y_i, 〈w, x_i〉 + p)
- The problem is stated in terms of scalar products (with no explicit dependence on the input dimension).
- We want to deal with non-linear decision functions
⇒ substitute the scalar product by another type of scalar product (kernels).
Hilbert Space
A Hilbert space H is a space with a scalar product 〈·, ·〉:
- Symmetric: 〈x, y〉 = 〈y, x〉
- Bilinear: 〈α_1 x_1 + α_2 x_2, y〉 = α_1 〈x_1, y〉 + α_2 〈x_2, y〉
- Positive definite: 〈x, x〉 ≥ 0
inducing a norm ‖x‖ = √〈x, x〉.
Examples:
- R^n with x · y = ∑_i x_i y_i, so ‖x‖ = √(∑_i x_i²)
- Square-integrable functions with 〈f, g〉 = ∫ fg, so ‖f‖ = √(∫ f²)
Main Idea
Map every x to a Hilbert space H and then compute the solution of the soft margin problem in H.
Definition (Kernels)
A function k : R^m × R^m → R is called a kernel if it is symmetric and positive definite, i.e.
k(x, x′) = k(x′, x),
and for all x_1, . . . , x_k and α = (α_1, . . . , α_k),
∑_{i,j} α_i α_j k(x_i, x_j) = αᵀ K α ≥ 0,
where K = {k(x_i, x_j)}_{i,j} (the Gram matrix).
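A numerical sanity check of the definition on a concrete kernel: build the Gram matrix for the Gaussian RBF kernel and verify symmetry and positive semi-definiteness via its eigenvalues; the sample points and the bandwidth are arbitrary illustrative choices.

```python
# Hedged sketch: symmetry and PSD check of a Gaussian-kernel Gram matrix.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
gamma = 2.0

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / gamma)                 # k(x, x') = exp(-||x - x'||^2 / gamma)

print("symmetric:", np.allclose(K, K.T))
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())  # >= 0 (up to round-off) <=> a^T K a >= 0
```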
Examples of kernels:
- Gaussian RBF kernels k(x, x′) = e^{−‖x−x′‖²/γ}.
- Polynomial kernels k(x, x′) = (〈x, x′〉 + c)^d.
- Sums and products of kernels.
- Taylor kernels: given an analytic function f(x) = ∑_{k=1}^∞ a_k x^k defined on the ball {‖x‖ < r} with a_k ≥ 0,
k(x, x′) = ∑_{k=1}^∞ a_k 〈x, x′〉^k
is defined on the ball {‖x‖ < √r}.
- Fourier kernels.
Suppose Φ : R^m → H, where H is a Hilbert space (a vector space with a scalar product). Then
k(x, x′) = 〈Φ(x), Φ(x′)〉
is a kernel. The map Φ will be called the feature map and H the feature space.
Example:
(x_1 x′_1 + x_2 x′_2 + c)² = (x_1², x_2², √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c) · (x′_1², x′_2², √2 x′_1 x′_2, √(2c) x′_1, √(2c) x′_2, c)ᵀ
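The identity in this example is easy to verify numerically (the test vectors and the constant c are arbitrary):

```python
# Hedged sketch: (<x, x'> + c)^2 == <Phi(x), Phi(x')> in R^2.
import numpy as np

c = 1.5

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x  = np.array([0.3, -1.2])
xp = np.array([2.0,  0.7])

lhs = (x @ xp + c) ** 2
rhs = phi(x) @ phi(xp)
print(lhs, rhs, np.isclose(lhs, rhs))   # identical up to round-off
```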
(Pre)Definition
Given a kernel k, its RKHS (reproducing kernel Hilbert space) is the smallest feature space of k containing the functions k(·, x).
Theorem
Given a kernel k, there exists a unique RKHS H.
RKHSs are spaces of functions. Idea of the proof of the theorem above:
- H_pre := {f(x) = ∑_{i=1}^l α_i k(x, x_i) ; l ∈ N, x_i ∈ R^m, α_i ∈ R}.
- For f = ∑ α_i k(·, x_i) and g = ∑ β_j k(·, x′_j),
〈f, g〉 = ∑ α_i β_j k(x_i, x′_j),
and ‖f‖²_{H_pre} = 〈f, f〉 = ∑ α_i α_j k(x_i, x_j).
- H is the completion of H_pre (H_pre plus all the limits in the norm ‖·‖_{H_pre}).
- The map Φ(x) = k(·, x) is a feature map:
〈Φ(x), Φ(x′)〉 = 〈k(·, x), k(·, x′)〉 = k(x, x′).
- Reproducing property: 〈f, k(·, x)〉 = f(x) (= 〈f, Φ(x)〉) for all f ∈ H.
Definition (SVM)
Given data D, a support vector machine is the solution f_{D,λ} of the problem
min_{f∈H} λ‖f‖²_H + (1/n) ∑_i L(y_i, f(x_i)),
where H is an RKHS with kernel k, and L is a loss function.
The function f_{D,λ} exists and is unique.
- Classification problems with the hinge loss, logistic loss, or least squares loss.
- Regression problems with the least squares loss, the ε-insensitive loss (L(y, t) = max{0, |y − t| − ε}), or quantile regression.
- Applications in dimension reduction (kernel PCA).
Theorem (Representer theorem)
There exist α_1, . . . , α_n ∈ R such that
f_{D,λ}(x) = ∑_i α_i k(x, x_i).
Problem (SVM)
min_{α∈R^n} λ ∑_{i,j} α_i α_j k(x_i, x_j) + (1/n) ∑_i L(y_i, ∑_j α_j k(x_i, x_j))
- The kernel k provides non-linearity.
- Mapping to the RKHS retains the optimization structure and allows us to solve the problem in the input space (non-linear programming; λ is usually set by cross-validation).
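In practice this finite-dimensional problem is what kernel SVM solvers handle; a sketch with an RBF-kernel SVC where the regularization constant is chosen by cross-validation (C plays the role of 1/λ up to constants; the two-moons data set is an illustrative choice).

```python
# Hedged sketch: RBF-kernel SVM with cross-validated regularization.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("cross-validated accuracy:", round(grid.best_score_, 3))
```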
For regression, the ε-insensitive loss function is typically used:
L(x) = |x|_ε = 0 if |x| ≤ ε, and |x| − ε if |x| ≥ ε; that is, |x|_ε = max{0, |x| − ε}.
- Using a non-differentiable loss produces sparsity in the solution.
- Using the absolute value gives robustness.
SVM for regression
min_{f∈H} λ‖f‖² + (1/n) ∑_i |y_i − f(x_i)|_ε  =  min_α λ αᵀKα + (1/n) ∑_i |y_i − ∑_j α_j k(x_i, x_j)|_ε,
where K = {k(x_i, x_j)}_{i,j}. Equivalently, as a constrained problem:
min_{α,ξ⁺,ξ⁻} λ αᵀKα + ∑_i (ξ⁺_i + ξ⁻_i)
subject to ξ⁺_i ≥ y_i − ∑_j α_j k(x_i, x_j) − ε,
ξ⁻_i ≥ −y_i + ∑_j α_j k(x_i, x_j) + ε,
ξ⁺_i, ξ⁻_i ≥ 0.
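This is the problem scikit-learn's SVR solves (in its C-parameterized form); a short sketch on noisy sine data, where the data, C, and ε are illustrative choices.

```python
# Hedged sketch: epsilon-insensitive support vector regression.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors used:", len(reg.support_), "out of", len(y))
print("fraction of training points inside the epsilon tube:",
      np.mean(np.abs(y - reg.predict(X)) <= 0.1 + 1e-6))
```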
Statistical Properties - Gaussian Kernel
Properties:
- k_γ(x, x) > 0.
- Φ_γ is injective.
- If X is compact, H_γ is dense in C(X).
- For a complex differentiable (holomorphic) function f : C^m → C, consider
‖f‖_γ = ( (2^m / (π^m γ^{2m})) ∫_{C^m} |f(z)|² e^{γ^{−2} ∑_j (z_j − z̄_j)²} dz )^{1/2}.
Then H_γ = {Re(f) | f is holomorphic and ‖f‖_γ < ∞}.
- The VC dimension of H_γ is ∞.
Recall that R(f) = ∫ L(x, y, f(x)) dP(x, y) is the expected risk function. Consider
R* = inf_F R(f),
where F = {f measurable}, and
R*_H = min_H R(f).
Theorem (Consistency)
Given a sample of size n, consider a sequence λ_n such that lim_{n→∞} λ_n = 0 and lim_{n→∞} λ_n^p n = ∞ for some p > 1. Then
lim_{n→∞} R(f_{D,λ_n}) = R*_H.
Moreover, R*_H = R*.
(For example, λ_n = n^{−1/2} satisfies both conditions, taking any 1 < p < 2.)
Implementation
- C++, Java: LIBSVM
- C: SVMlight
- R: packages 'e1071' (LIBSVM), 'kernlab'
- Python: scikit-learn
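A minimal usage sketch of the Python option (scikit-learn's SVC and SVR are built on LIBSVM):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))
```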