
Complexity and Support Vector Machines

Aleix Ruiz de Villa

Grup d'Estudi de Machine Learning de Barcelona, King's Offices

May 6th, 2014


1 Introduction: Introduction, Framework, History

2 Statistical Learning Theory: ERM, Principle of parsimony, Consistency

3 Introduction to SVM - Classification: Hard Margin Classifiers, Regularization - Soft Margin

4 Kernels: Definition, Reproducing Kernel Hilbert Spaces

5 SVM: Definition, SVM for regression, Statistical Properties - Gaussian Kernel

6 Implementation


Introduction

Typical framework with linear regression:

(1) We have some data (x_1, y_1), . . . , (x_n, y_n) ∈ R^{p+1}.

(2) We consider the family of functions

y = w_0 + w_1 x^1 + . . . + w_p x^p + ε = f(x) + ε,

where ε is some random error.

(3) We play (include/exclude, transform, ...) with the covariates x^k.

(4) We fit our models with the data,

min_{f ∈ H} ∑_{i=1}^{n} (y_i − f(x_i))² = min_{w_0,...,w_p} ∑_{i=1}^{n} (y_i − w_0 − w_1 x_i^1 − . . . − w_p x_i^p)²,

obtaining the optimal w_0(n), . . . , w_p(n).

(5) We use the optimal model to make predictions (evaluate it on new x) and validate them (when we have the response y).
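A minimal sketch of step (4) in Python with NumPy; the synthetic data and its coefficients are purely illustrative:

import numpy as np

# Synthetic data (x_i, y_i), illustrative only
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=n)

# Least squares fit: min over w_0,...,w_p of sum_i (y_i - w_0 - w_1 x_i^1 - ... - w_p x_i^p)^2
X1 = np.column_stack([np.ones(n), X])          # prepend the intercept column
w_opt, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(w_opt)                                   # the optimal w_0(n), ..., w_p(n)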


Motivation

The best w∗ would be obtained if we had an infinite sample, n = ∞.

We think (hope) that w(n) is a good approximation of w∗.

We play with the covariates because more covariates

decrease the fitting error,
but do not necessarily improve the prediction error (when new data x arrives).

We would like an infinite sample of new data to evaluate the chosen model.


We've got

Data D = {(x_1, y_1), . . . , (x_n, y_n) ∈ R^{m+1}}, sampled independently from an unknown probability distribution P(x, y) (y = f(x) + ε is a particular case; in the slides, green color emphasizes the source of randomness).

We decide

A loss function L(y, t).
A family of functions H (our models).

We calculate

Empirical risk (training error): R_n(f) = (1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) for f ∈ H.
Optimal empirical function: f_n = arg min_{f ∈ H} R_n(f).

Goal quantities

Expected risk (generalization error): R(f) = ∫ L(y, f(x)) dP(x, y) for f ∈ H.
Minimum expected risk: R∗ = min_{f ∈ H} R(f).
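A short sketch of the calculated quantities, assuming a squared error loss and a fixed candidate model f (both chosen here only for illustration):

import numpy as np

def empirical_risk(f, X, y, loss):
    # R_n(f) = (1/n) * sum_i L(y_i, f(x_i))
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

squared_loss = lambda y, t: (y - t) ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=50)

f = lambda x: x @ np.array([1.0, 2.0])     # one fixed candidate from H
print(empirical_risk(f, X, y, squared_loss))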


Framework - Examples of loss functions

Problem (Classification)

Y's are labels, y_i ∈ {−1, 1}.

L(y, f(x)) = 1 if sign(f(x)) ≠ y, and 0 if sign(f(x)) = y.

Problem (Regression)

Y's are continuous variables.

L(y, f(x)) = (y − f(x))².

Our setting deals with more loss functions.
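The two example losses as small Python functions (a sketch; sign(0) is counted as +1 here):

def zero_one_loss(y, fx):
    # classification: 1 if sign(f(x)) != y, 0 otherwise
    pred = 1 if fx >= 0 else -1
    return 0 if pred == y else 1

def squared_loss(y, fx):
    # regression: (y - f(x))^2
    return (y - fx) ** 2

print(zero_one_loss(-1, 0.3), squared_loss(1.2, 0.7))   # 1, 0.25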


History

1962 The perceptron (a primitive version of neural networks) was introduced.

1968 Vapnik and Chervonenkis started the theoretical analysis of the learning process.

1985 The backpropagation algorithm was discovered → neural networks.

1990 The VC theoretical analysis of the learning process was completed.

1995 SVMs were introduced.


ERM

One of the most natural learning processes is empirical risk minimization (ERM). Given data D, we want to find f_n ∈ H such that

R_n(f_n) = min_{f ∈ H} R_n(f).
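A toy sketch of ERM. The family H here is a small grid of threshold classifiers, an illustrative choice not taken from the slides:

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200)
y = np.where(x > 0.25, 1, -1)              # unknown "truth" generating the labels

H = np.linspace(-1, 1, 101)                # thresholds parameterizing the family of models
def emp_risk(theta):
    pred = np.where(x > theta, 1, -1)
    return np.mean(pred != y)              # 0-1 empirical risk R_n

f_n = min(H, key=emp_risk)                 # ERM: the threshold minimizing R_n over H
print(f_n, emp_risk(f_n))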


Problem (Principle of parsimony)

Adding variables in multilinear regression may lead to overfitting:

better fitting,
worse predictions.

Problem (Consistency)

Degenerate model f = ∑_i y_i 1_{x_i}:

null empirical risk,
brings no insight into the process (bad predictions).


Principle of Parsimony

Principle

Of two equivalent theories or explanations, all other things being equal, the simpler one is to be preferred.

We need to measure the complexity of our models (H) ⇒ VC dimension.

Definition (VC dimension)

We say that H shatters a set of points P = {x_1, ..., x_k} if H can classify all the possible labels on P, that is, if evaluating on P functions of the form sign(f + b) with f ∈ H and b ∈ R, we obtain all the possible combinations S = {y_1, ..., y_k}, y_i ∈ {−1, 1}.

The VC dimension of the set H is the maximum number of points that H can shatter.
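A small numerical illustration (not from the slides): hyperplanes in R² shatter 3 points in general position, i.e. every labeling can be realized by a linear separator, consistent with the VC dimension n + 1 = 3 stated in the example below.

import itertools
import numpy as np
from sklearn.svm import SVC

P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])       # 3 points in general position
shattered = True
for labels in itertools.product([-1, 1], repeat=3):
    if len(set(labels)) == 1:
        continue                                          # a constant labeling is trivially realized
    clf = SVC(kernel="linear", C=1e6).fit(P, labels)      # (near) hard margin linear classifier
    shattered &= bool((clf.predict(P) == np.array(labels)).all())
print("3 points shattered by hyperplanes in R^2:", shattered)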


Example

The set of hyperplanes in R^n,

∑_{i=1}^{n} w_i x_i + b = 〈w, x〉 + b,

has VC dimension n + 1.

Example

The set of functions f = ∑_i y_i 1_{x_i} has infinite VC dimension.


Example

The set of hyperplanes with b = 0 on {‖x‖ ≤ R} and satisfying ‖w‖ ≤ Λ has VC dimension less than or equal to R²Λ².

Example

The VC dimension of the set of Gaussian functions in R^m is (m² + 3m)/2.

Example

The VC dimension of the set of functions H = {f(x) = sign(sin(θx)) : θ ∈ R} is ∞.


Theorem (Vapnik, Chervonenkis)

Given a class of functions H with VC dimension h, if A ≤ L(Y, f(X)) ≤ B almost surely, then given η > 0, with probability at least 1 − η, for all f ∈ H we have

|R_n(f) − R(f)| ≤ ((B − A)/2) √ζ,

where

ζ = 4 (h (ln(2n/h) + 1) − ln(η/4)) / n.

Moreover, with probability at least 1 − 2η,

R(f_n) − inf_{f ∈ H} R(f) ≤ (B − A) √(−ln η / (2n)) + ((B − A)/2) √ζ.
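A quick numeric sketch of the first bound for an illustrative choice of h and η, with a loss bounded in [0, 1] (so B − A = 1):

import numpy as np

def vc_bound(h, n, eta, A=0.0, B=1.0):
    # zeta = 4 * (h * (ln(2n/h) + 1) - ln(eta/4)) / n
    zeta = 4.0 * (h * (np.log(2.0 * n / h) + 1.0) - np.log(eta / 4.0)) / n
    return (B - A) / 2.0 * np.sqrt(zeta)      # uniform bound on |R_n(f) - R(f)| over H

for n in (1_000, 10_000, 100_000):
    print(n, vc_bound(h=10, n=n, eta=0.05))   # the bound shrinks as n grows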


Consistency

For a fixed f, by the law of large numbers, R_n(f) → R(f).

What we actually do is:

n = 1: find f_1 minimizing R_1 over H
n = 2: find f_2 minimizing R_2 over H
...
n: find f_n minimizing R_n over H


We expect R_n(f_n) → min_{f ∈ H} R(f), but this is not always the case. For instance, take the family of functions f = sign(∑_i y_i 1_{x_i}) in the classification case. If P_X is absolutely continuous,

min_{f ∈ H} R_n(f) = 0,
R(f_n) = ∫ L(y, f_n(x)) dP(x, y) = ∫ L(y, sign(0)) dP(x, y) = P(Y = −1).

Theorem

If the VC dimension of H is finite, then ERM is consistent, i.e.

R_n(f_n) → min_{f ∈ H} R(f).


Hard Margin Classifiers

Given data (x_1, y_1), . . . , (x_n, y_n) ∈ R^{m+1}, where the y_i's are the labels, y_i ∈ {−1, 1}, we are looking for a function

f(x) = ∑_{i=1}^{m} w_i x_i + b = 〈w, x〉 + b,

that classifies the variable x,

sign(f(x)) = 1 if f(x) ≥ 0, and −1 if f(x) < 0.


Given a hyperplane

h : 〈w, z〉 + b = 0,

and a point x, we want to find the distance d(x, h).


If ‖w‖ = 1, then 〈w, x〉 = Π_w(x) is the projection of x onto the direction of w.

For an arbitrary w, Π_w(x) = 〈w/‖w‖, x〉.

For z belonging to the hyperplane h (writing p = b for the offset),

〈w, z〉 + p = 0, hence 〈w/‖w‖, z〉 + p/‖w‖ = 0 and Π_w(z) = −p/‖w‖.

The margin is

D_{w,p}(x) = d(x, h) = Π_w(x) − (−p/‖w‖) = (〈w, x〉 + p)/‖w‖.
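A two-line numerical check of the distance formula (illustrative numbers):

import numpy as np

w, p = np.array([3.0, 4.0]), 2.0        # hyperplane <w, z> + p = 0
x = np.array([1.0, 1.0])
D = (w @ x + p) / np.linalg.norm(w)     # signed distance D_{w,p}(x)
print(D)                                # (3 + 4 + 2) / 5 = 1.8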


Suppose that the set P = {(x_1, y_1), . . . , (x_n, y_n)} ⊂ R^{m+1} with y_i ∈ {−1, 1} is separable by hyperplanes. A point (x, y) is well classified by h if

y D_{w,p}(x) = y (〈w, x〉 + p)/‖w‖ ≥ 0.

Problem: find the hyperplane with the largest margin, i.e.

max_{w,p} min_i y_i D_{w,p}(x_i).


Problem (1):

max_{w,p,M} M
s.t. y_i (〈w, x_i〉 + p)/‖w‖ ≥ M, for all i.

Problem (2):

max_{w,p,M} M
s.t. y_i (〈w, x_i〉 + p) ≥ 1, for all i,
‖w‖ M = 1.

Problems (1) and (2) are equivalent, via the change of variables

w ↦ w/(‖w‖M),  p ↦ p/(‖w‖M).


Since ‖w‖M = 1, maximizing M amounts to minimizing ‖w‖, so problem (2) becomes:

Problem (Hard Margin Classification)

min_{w,p} (1/2)‖w‖²
s.t. y_i (〈w, x_i〉 + p) ≥ 1, for all i.


Regularization - Soft Margin

When D cannot be separated by hyperplanes, finding the minimum number of misclassifications is an NP-hard problem. Consider the following quadratic optimization problem:

Problem (Soft Margin Classification)

min_{w,p,ξ} (1/2)‖w‖² + C ∑_i ξ_i
s.t. y_i (〈w, x_i〉 + p) ≥ 1 − ξ_i, for all i,
ξ_i ≥ 0, for all i.
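A minimal sketch of soft margin classification with scikit-learn (one of the libraries listed in the implementation section); here C plays exactly the role above, trading margin width against the slack penalties, and the overlapping data are synthetic:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)      # overlapping classes: not separable by a hyperplane

for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_, clf.score(X, y))   # support vectors per class, training accuracy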


Karush-Kuhn-Tucker

Optimization Problem

minimize_x f(x)
subject to g_i(x) = 0, i = 1, . . . , m,
h_i(x) ≤ 0, i = 1, . . . , k.

Consider the Lagrangian

L(x, λ, ν) = f(x) + ∑_i λ_i g_i(x) + ∑_i ν_i h_i(x).

Optimality conditions

If x is optimal, then there exist λ, ν such that

∂_x L(x, λ, ν) = ∂_x f(x) + ∑_i λ_i ∂_x g_i(x) + ∑_i ν_i ∂_x h_i(x) = 0,
∂_{λ_i} L(x, λ, ν) = g_i(x) = 0,
h_i(x) ≤ 0,
ν_i h_i(x) = 0,
ν_i ≥ 0.


Rewriting the soft margin problem with its inequality constraints in standard form:

min_{w,p,ξ} (1/2)‖w‖² + C ∑_i ξ_i
1 − ξ_i − y_i (〈w, x_i〉 + p) ≤ 0, for all i,
−ξ_i ≤ 0, for all i.

Karush-Kuhn-Tucker sufficiency conditions:

L = (1/2)‖w‖² + C ∑_i ξ_i + ∑_i α_i (1 − ξ_i − y_i (〈w, x_i〉 + p)) − ∑_i η_i ξ_i,
α_i, η_i ≥ 0,
∂L/∂w = w − ∑_i y_i x_i α_i = 0,
∂L/∂p = 0,
∂L/∂ξ = 0,
α_i (1 − ξ_i − y_i (〈w, x_i〉 + p)) = 0,
η_i ξ_i = 0.


Consequences:

w = ∑_i x_i y_i α_i, so the classification function is

f(x) = 〈w, x〉 + p = ∑_i y_i α_i 〈x, x_i〉 + p.

α_i is zero unless the constraint is active, i.e. unless x_i lies on the margin or violates it.

- w depends only on the misclassified points and the points on the margin (the support vectors): robustness.
- Sparse representation: "dimensionality reduction" (kind of).
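These consequences can be read off a fitted linear SVC in scikit-learn: dual_coef_ stores y_i α_i for the support vectors only (the sparse representation), and w can be reconstructed as ∑_i y_i α_i x_i. A sketch on synthetic data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 1, size=(40, 2)), rng.normal(2, 1, size=(40, 2))])
y = np.array([-1] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_      # sum_i y_i alpha_i x_i
print(np.allclose(w_from_alphas, clf.coef_))                # True: matches the fitted w
print(clf.support_.size, "support vectors out of", len(X))  # only a few points carry weight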


In the soft margin problem

min_{w,p} (1/2)‖w‖² + C ∑_i ξ_i
s.t. y_i (〈w, x_i〉 + p) ≥ 1 − ξ_i, for all i,
ξ_i ≥ 0, for all i,

the constraints force

ξ_i ≥ max{0, 1 − y_i (〈w, x_i〉 + p)} = L(y_i, 〈w, x_i〉 + p),

where L(y, t) = max{0, 1 − yt} (hinge loss); at the optimum this holds with equality.


Problem

min_{w,p} λ‖w‖² + (1/n) ∑_i L(y_i, 〈w, x_i〉 + p)

Interpretation:

(1/n) ∑_i L(y_i, 〈w, x_i〉 + p) = R_n(w, p),
‖w‖ measures the complexity.

It is a trade-off between minimizing errors and the complexity of the model. The term λ‖w‖² regularizes the empirical risk.


Problem

min_{w,p} λ‖w‖² + (1/n) ∑_i L(y_i, 〈w, x_i〉 + p)

The problem is stated in terms of scalar products (with no explicit dependence on the input dimension).

We want to deal with nonlinear decision functions

⟹ substitute the scalar product by other types of scalar products (kernels).


Hilbert Space

A Hilbert space H is a space with a scalar product 〈·, ·〉:

Symmetric: 〈x, y〉 = 〈y, x〉
Bilinear: 〈α_1 x_1 + α_2 x_2, y〉 = α_1 〈x_1, y〉 + α_2 〈x_2, y〉
Positive definite: 〈x, x〉 ≥ 0

inducing a norm ‖x‖ = √〈x, x〉.

Examples:

R^n with x · y = ∑_i x_i y_i, so ‖x‖ = √(∑_i x_i²).
Square integrable functions with 〈f, g〉 = ∫ fg, so ‖f‖ = √(∫ f²).

Main Idea

Map every x to a Hilbert space H and then compute the solution of the soft margin problem in H.


Definition (Kernels)

A function k : R^m × R^m → R is called a kernel if it is symmetric and positive definite, i.e.

k(x, x′) = k(x′, x),

and for all x_1, . . . , x_k and α = (α_1, . . . , α_k),

∑_{i,j} α_i α_j k(x_i, x_j) = α^T K α ≥ 0,

where K = {k(x_i, x_j)}_{i,j} is the Gram matrix.


Gaussian RBF kernels: k(x, x′) = e^{−‖x−x′‖²/γ}.

Polynomial kernels: k(x, x′) = (〈x, x′〉 + c)^d.

Sums and products of kernels.

Taylor kernels: given an analytic function f(x) = ∑_{k=1}^{∞} a_k x^k defined on the ball {‖x‖ < r} with a_k ≥ 0,

k(x, x′) = ∑_{k=1}^{∞} a_k 〈x, x′〉^k

is defined on the ball {‖x‖ < √r}.

Fourier kernels.
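A sketch of the first two kernels, together with a numerical check of the positive definiteness required by the definition above (γ, c, d are illustrative values):

import numpy as np

def gaussian_kernel(x, xp, gamma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / gamma)

def polynomial_kernel(x, xp, c=1.0, d=3):
    return (x @ xp + c) ** d

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 4))
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])        # Gram matrix
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)    # symmetric, PSD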


Suppose Φ : R^m → H, where H is a Hilbert space (a vector space with a scalar product). Then

k(x, x′) = 〈Φ(x), Φ(x′)〉

is a kernel. The map Φ will be called the feature map and H the feature space.

Example:

(x_1 x′_1 + x_2 x′_2 + c)² = (x_1², x_2², √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c) · (x′_1², x′_2², √2 x′_1 x′_2, √(2c) x′_1, √(2c) x′_2, c)^T.
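A quick numerical check that the degree-2 polynomial kernel equals the scalar product of the explicit feature maps above (the values of x, x′ and c are illustrative):

import numpy as np

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2*c)*x1, np.sqrt(2*c)*x2, c])

x, xp, c = np.array([0.7, -1.2]), np.array([2.0, 0.5]), 1.5
lhs = (x @ xp + c) ** 2            # kernel evaluation
rhs = phi(x, c) @ phi(xp, c)       # scalar product in the feature space
print(np.isclose(lhs, rhs))        # True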


(Pre)Definition

Given a kernel k, its RKHS (reproducing kernel Hilbert space) is the smallest feature space of k containing the functions k(·, x).

Theorem

Given a kernel k, there exists a unique RKHS H.


RKHSs are spaces of functions. Idea of the proof of the theorem above:

H_pre := {f(x) = ∑_{i=1}^{l} α_i k(x, x_i); l ∈ N, x_i ∈ R^m, α_i ∈ R}.

For f = ∑ α_i k(·, x_i) and g = ∑ β_j k(·, x′_j),

〈f, g〉 = ∑ α_i β_j k(x_i, x′_j),

and ‖f‖²_{H_pre} = 〈f, f〉 = ∑ α_i α_j k(x_i, x_j).

H is the completion of H_pre (H_pre plus all the limits in the norm ‖·‖_{H_pre}).

The map Φ(x) = k(·, x) is a feature map:

〈Φ(x), Φ(x′)〉 = 〈k(·, x), k(·, x′)〉 = k(x, x′).

Reproducing property: 〈f, k(·, x)〉 = f(x) (= 〈f, Φ(x)〉) for all f ∈ H.


Definition

Definition (SVM)

Given data D, a support vector machine is the solution f_{D,λ} of the problem

min_{f ∈ H} λ‖f‖²_H + (1/n) ∑_i L(y_i, f(x_i)),

where H is an RKHS with kernel k, and L is a loss function.

The function f_{D,λ} exists and is unique.


Classification problems: hinge loss, logistic loss, least squares loss.

Regression problems: least squares loss, the ε-insensitive loss (L(y, t) = max{0, |y − t| − ε}), quantile regression.

Applications in dimension reduction (kernel PCA).


Theorem (Representer theorem)

There exist α_1, . . . , α_n ∈ R such that

f_{D,λ}(x) = ∑_i α_i k(x, x_i).

Problem (SVM)

min_{α ∈ R^n} λ ∑_{i,j} α_i α_j k(x_i, x_j) + (1/n) ∑_i L(y_i, ∑_j α_j k(x_i, x_j))

The kernel k provides nonlinearity.

Mapping to the RKHS retains the optimization structure and allows us to solve the problem in the input space (nonlinear programming; λ is usually set by cross validation).
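For the least squares loss the finite dimensional problem above has a closed form solution (kernel ridge regression): α = (K + nλI)⁻¹ y. A sketch with a Gaussian kernel on synthetic 1-D data (γ and λ are illustrative):

import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(-3, 3, size=40))
y = np.sin(x) + 0.1 * rng.normal(size=40)

gamma, lam = 1.0, 1e-3
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / gamma)              # Gram matrix k(x_i, x_j)
alpha = np.linalg.solve(K + len(x) * lam * np.eye(len(x)), y)    # minimizer for the least squares loss

f = lambda t: np.exp(-(t - x) ** 2 / gamma) @ alpha              # f_{D,lambda}(t) = sum_i alpha_i k(t, x_i)
print(f(0.5), np.sin(0.5))                                       # prediction vs. the noiseless target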


For regression, the ε-insensitive loss function is typically used:

L(x) = |x|_ε = 0 if |x| ≤ ε, and |x| − ε if |x| ≥ ε; that is, |x|_ε = max{0, |x| − ε}.

Using a nondifferentiable loss produces sparsity in the solution.

Using the absolute value gives robustness.
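The ε-insensitive loss as a one-liner (a sketch):

import numpy as np

def eps_insensitive(r, eps=0.1):
    # |r|_eps = max{0, |r| - eps}: zero inside the eps-tube, linear outside
    return np.maximum(0.0, np.abs(r) - eps)

print(eps_insensitive(0.05), eps_insensitive(0.5))   # 0.0, 0.4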


SVM for regression

min_{f ∈ H} λ‖f‖² + (1/n) ∑_i |y_i − f(x_i)|_ε =
min_α λ α^T K α + (1/n) ∑_i |y_i − ∑_j α_j k(x_i, x_j)|_ε,

where K = {k(x_i, x_j)}_{i,j}. Equivalently, introducing slack variables,

min_{α,ξ⁺,ξ⁻} λ α^T K α + ∑_i (ξ⁺_i + ξ⁻_i)
subject to ξ⁺_i ≥ y_i − ∑_j α_j k(x_i, x_j) − ε,
ξ⁻_i ≥ −y_i + ∑_j α_j k(x_i, x_j) + ε,
ξ⁺_i, ξ⁻_i ≥ 0.


Statistical Properties - Gaussian Kernel

Properties:

k_γ(x, x) > 0.

Φ_γ is injective.

If X is compact, H_γ is dense in C(X).

For a complex differentiable (holomorphic) function f : C^m → C, consider

‖f‖_γ = ( (2^m / (π^m γ^{2m})) ∫_{C^m} |f(z)|² e^{γ^{-2} ∑_j (z_j − z̄_j)²} dz )^{1/2}.

Then H_γ = {Re(f) | f is holomorphic and ‖f‖_γ < ∞}.

The VC dimension of H_γ is ∞.


Recall that R(f) = ∫ L(x, y, f(x)) dP(x, y) is the expected risk function. Consider

R∗ = inf_F R(f),

where F = {f measurable}, and

R∗_H = min_{f ∈ H} R(f).

Theorem (Consistency)

Given a sample of size n, consider a sequence λ_n such that lim_{n→∞} λ_n = 0 and lim_{n→∞} λ_n^p n = ∞ for some p > 1. Then

lim_{n→∞} R(f_{D,λ_n}) = R∗_H.

Moreover, R∗_H = R∗.


Implementation

C++, Java: Libsvm

C: SVM Light

R: packages ’e1071’ (Libsvm), ’kernlab’

Python: scikit-learn
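A minimal end-to-end example with the Python option (scikit-learn, whose SVC wraps Libsvm); the kernel and hyperparameter values are illustrative and would normally be chosen by cross validation:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")      # Gaussian kernel, soft margin SVM
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())                               # cross-validated accuracy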
