Complexity and Support Vector Machines
Aleix Ruiz de Villa
Grup d’Estudi de Machine Learning de Barcelona, King’s Offices
May 6th, 2014
Outline
1 Introduction: Introduction, Framework, History
2 Statistical Learning Theory: ERM, Principle of parsimony, Consistency
3 Introduction to SVM - Classification: Hard Margin Classifiers, Regularization - Soft Margin
4 Kernels: Definition, Reproducing Kernel Hilbert Spaces
5 SVM: Definition, SVM for regression, Statistical Properties - Gaussian Kernel
6 Implementation
Introduction
Typical framework with linear regression:
(1) We have some data (x_1, y_1), . . . , (x_n, y_n) ∈ R^{p+1}.
(2) We consider the family of functions
y = w_0 + w_1 x^1 + . . . + w_p x^p + ε = f(x) + ε,
where ε is some random error.
(3) We play (include/exclude, transform, ...) with the covariates x^k.
(4) We fit our models with the data,
min_{f∈H} ∑_i (y_i − f(x_i))² = min_{w_0,...,w_p} ∑_i (y_i − w_0 − w_1 x_i^1 − . . . − w_p x_i^p)²,
obtaining optimal w_0(n), . . . , w_p(n).
(5) We use the optimal model to make predictions (evaluate on new x) and validate them (when we have the response y).
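As a minimal sketch of steps (4) and (5), the snippet below fits the coefficients by least squares in NumPy and checks the prediction error on fresh data; the data-generating function, noise level, and sample sizes are illustrative assumptions.

```python
# Minimal sketch of step (4): fitting w_0, ..., w_p by least squares.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                              # covariates x^1, ..., x^p
w_true = np.array([1.0, -2.0, 0.5])                      # illustrative "true" coefficients
y = 0.7 + X @ w_true + rng.normal(scale=0.3, size=n)     # y = w_0 + <w, x> + eps

# Add the intercept column and solve min_w sum_i (y_i - w_0 - <w, x_i>)^2.
Xb = np.column_stack([np.ones(n), X])
w_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print("estimated (w_0, ..., w_p):", np.round(w_hat, 3))

# Step (5): predict on new data and check the prediction error.
X_new = rng.normal(size=(50, p))
y_new = 0.7 + X_new @ w_true + rng.normal(scale=0.3, size=50)
pred = np.column_stack([np.ones(50), X_new]) @ w_hat
print("mean squared prediction error:", np.mean((y_new - pred) ** 2))
```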
Motivation
The best w* would be obtained if we had an infinite sample, n = ∞.
We think (hope) that w(n) is a good approximation of w*.
We play with the covariates, but adding covariates decreases the fitting error without necessarily improving the prediction error (when new data x arrive).
We would like an infinite sample of new data to evaluate the chosen model.
We've got
Data D = {(x_1, y_1), . . . , (x_n, y_n) ∈ R^{m+1}} sampled independently from an unknown probability distribution P(x, y) (y = f(x) + ε is a particular case); the sampled data are the source of randomness.
We decide
- A loss function L(y, t).
- A family of functions H (our models).
We calculate
- Empirical risk (training error) R_n(f) = (1/n) ∑_i L(y_i, f(x_i)) for f ∈ H.
- Optimal empirical function f_n = arg min_H R_n(f).
Goal quantities
- Expected risk (generalization error) R(f) = ∫ L(y, f(x)) dP(x, y) for f ∈ H.
- Minimum expected risk R* = min_H R(f).
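A small sketch of these quantities: the empirical risk is an average over the sample, while the expected risk can only be approximated here by drawing many fresh points from the (in practice unknown) distribution P. The distribution, the model f, and the squared loss below are illustrative choices.

```python
# Hedged sketch: empirical risk R_n(f) on the sample vs. a Monte Carlo
# estimate of the expected risk R(f) on fresh draws from the same P(x, y).
import numpy as np

rng = np.random.default_rng(1)

def sample(n):                       # a stand-in for the unknown P(x, y)
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

def f(x):                            # some fixed model in H (here: a line)
    return 0.8 * x

def L(y, t):                         # squared loss
    return (y - t) ** 2

x_train, y_train = sample(50)
R_n = np.mean(L(y_train, f(x_train)))        # empirical risk (training error)

x_big, y_big = sample(200_000)
R = np.mean(L(y_big, f(x_big)))              # ~ expected risk (generalization error)

print(f"R_n(f) = {R_n:.3f}   R(f) ~ {R:.3f}")
```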
Framework - Example loss functions
Problem (Classification)
The y's are labels, y_i ∈ {−1, 1}.
L(y, f(x)) = 1 if sign(f(x)) ≠ y, and 0 if sign(f(x)) = y.
Problem (Regression)
The y's are continuous variables.
L(y, f(x)) = (y − f(x))².
Our setting deals with more loss functions than these two.
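The two example losses in code (the toy labels and scores are made up):

```python
import numpy as np

def zero_one_loss(y, fx):            # classification: 1 if sign(f(x)) != y
    return (np.sign(fx) != y).astype(float)

def squared_loss(y, fx):             # regression
    return (y - fx) ** 2

y = np.array([1, -1, 1, -1])
fx = np.array([0.3, 0.8, -1.2, -0.1])
print(zero_one_loss(y, fx))          # [0. 1. 1. 0.]
print(squared_loss(y, fx))
```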
History
1962 The perceptron (a primitive version of neural networks) was introduced.
1968 Vapnik and Chervonenkis started the theoretical analysis of the learning process.
1985 The backpropagation algorithm was discovered → neural networks.
1990 The VC theoretical analysis of the learning process was completed.
1995 SVMs were introduced.
ERM
One of the most natural learning processes is empirical risk minimization (ERM). Given data D, we want to find f_n ∈ H such that
R_n(f_n) = min_H R_n(f).
Problem (Principle of parsimony)
Adding variables in multilinear regression may suffer from overfitting:
- better fitting,
- worse predictions.
Problem (Consistency)
The degenerate model f = ∑_i y_i 1_{x_i}:
- has null empirical risk,
- brings no insight into the process (bad predictions).
Principle of Parsimony
Principle
Of two equivalent theories or explanations, all other things being equal, the simpler one is to be preferred.
We need to measure the complexity of our models (H) ⇒ VC dimension.
Definition (VC dimension)
We say that H shatters a set of points P = {x_1, ..., x_k} if H can realize all possible labellings of P, that is, if by evaluating on P functions of the form sign(f + b) with f ∈ H and b ∈ R we obtain every possible combination of labels S = {y_1, ..., y_k}, y_i ∈ {−1, 1}.
The VC dimension of the set H is the maximum number of points that H can shatter.
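A rough illustration of the definition: the brute-force sketch below searches for an affine classifier sign(〈w, x〉 + b) in R² realizing each labelling of a point set. The random search over (w, b) is a crude illustrative device, not part of the definition.

```python
# Hedged sketch: check whether affine classifiers in R^2 shatter a point set.
import numpy as np
from itertools import product

def can_realize(points, labels, trials=5_000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        w = rng.normal(size=2)
        b = rng.normal()
        if np.all(np.sign(points @ w + b) == labels):
            return True
    return False

def shatters(points):
    # every one of the 2^k labellings must be realizable
    return all(can_realize(points, np.array(lab))
               for lab in product([-1, 1], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # 3 generic points
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # XOR configuration

print("3 points shattered:", shatters(three))   # True: lines in R^2 have VC dim 3
print("4 points shattered:", shatters(four))    # False for this configuration
```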
Example
The set of hyperplanes in R^n,
∑_{i=1}^n w_i x_i + b = 〈w, x〉 + b,
has VC dimension n + 1.
Example
The set of functions f = ∑_i y_i 1_{x_i} has infinite VC dimension.
Example
The set of hyperplanes with b = 0 on {‖x‖ ≤ R} and satisfying ‖w‖ ≤ Λ has VC dimension at most R²Λ².
Example
The VC dimension of the set of Gaussian functions in R^m is (m² + 3m)/2.
Example
The VC dimension of the set of functions H = {f(x) = sign(sin(θx)) : θ ∈ R} is ∞.
Theorem (Vapnik, Chervonenkis)
Given a class of functions H with VC dimension h, if A ≤ L(Y, f(X)) ≤ B almost surely, then given η > 0, with probability at least 1 − η, for all f ∈ H we have
|R_n(f) − R(f)| ≤ ((B − A)/2) √ζ,
where
ζ = 4 (h (ln(2n/h) + 1) − ln(η/4)) / n.
Moreover, with probability at least 1 − 2η,
R(f_n) − inf_H R(f) ≤ (B − A) √(−ln η / (2n)) + ((B − A)/2) √ζ.
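To get a feel for the bound, one can plug numbers into the confidence term; for the 0-1 loss A = 0 and B = 1, and the choices of h, n, η below are arbitrary illustrative values.

```python
# Hedged sketch: evaluate the VC confidence term above numerically.
import numpy as np

def vc_gap_bound(n, h, eta, A=0.0, B=1.0):
    zeta = 4 * (h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n
    return (B - A) / 2 * np.sqrt(zeta)

for n in [1_000, 10_000, 100_000, 1_000_000]:
    print(n, round(vc_gap_bound(n, h=10, eta=0.05), 3))
# The bound on |R_n(f) - R(f)| shrinks roughly like sqrt(h log(n) / n).
```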
Consistency
For a fixed f, by the law of large numbers, R_n(f) → R(f).
What we actually do, however, is:
n = 1: find f_1 minimizing min_H R_1(f)
n = 2: find f_2 minimizing min_H R_2(f)
...
n: find f_n minimizing min_H R_n(f)
Consistency
We expect Rn(fn)→ minH R, but this is not always the case.For instance, the family of functions f = sign(
∑i yi1xi ), in
the classification case, if PX is absolutely continuous,
minH Rn(f ) = 0R(fn) =
∫L(y , fn(x))dP(x , y) =
∫L(y , sign(0))dP(x , y) =
P(Y = −1)
Theorem
If VC dimension of H is finite, then ERM is consitent, i.e.
Rn(fn)→ minHR(f )
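A quick numerical illustration of this counterexample, with an illustrative distribution and the slide's convention sign(0) = +1:

```python
# Hedged sketch: a classifier that memorizes the training points has zero
# empirical risk but predicts sign(0) on every unseen point, so its expected
# risk stays at chance level P(Y = -1) = 0.5 here.
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(0, 1, size=n)                 # P_X absolutely continuous
    y = np.where(x > 0.5, 1, -1)                  # a perfectly learnable rule
    return x, y

x_tr, y_tr = sample(100)

def f_memorize(x):
    # sum_i y_i 1_{x_i}: returns y_i on a stored x_i, 0 elsewhere
    out = np.zeros_like(x)
    for xi, yi in zip(x_tr, y_tr):
        out[x == xi] = yi
    return out

def sgn(t):                                        # slide convention: sign(0) = +1
    return np.where(t >= 0, 1, -1)

train_err = np.mean(sgn(f_memorize(x_tr)) != y_tr)
x_te, y_te = sample(10_000)                        # fresh points: never seen before
test_err = np.mean(sgn(f_memorize(x_te)) != y_te)
print(train_err, test_err)                         # 0.0 and roughly 0.5
```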
Hard Margin Classifiers
Given data (x_1, y_1), . . . , (x_n, y_n) ∈ R^{m+1}, where the y_i's are the labels, y_i ∈ {−1, 1}, we are looking for a function
f(x) = ∑_{i=1}^m w_i x_i + b = 〈w, x〉 + b,
that classifies the variable x:
sign(f(x)) = 1 if f(x) ≥ 0, and −1 if f(x) < 0.
Given a hyperplane
h : 〈w, z〉 + p = 0,
and a point x, we want to find the distance d(x, h).
If ‖w‖ = 1, then 〈w, x〉 = Π_w(x) is the projection of x onto the direction of w.
For an arbitrary w, Π_w(x) = 〈w/‖w‖, x〉.
For z belonging to the hyperplane h,
〈w, z〉 + p = 0, hence 〈w/‖w‖, z〉 + p/‖w‖ = 0, so Π_w(z) = −p/‖w‖.
The margin is
D_{w,p}(x) = d(x, h) = Π_w(x) − (−p/‖w‖) = (〈w, x〉 + p)/‖w‖.
Suppose that the set P = {(x_1, y_1), . . . , (x_n, y_n)} ⊂ R^{m+1} with y_i ∈ {−1, 1} is separable by hyperplanes. A point (x, y) is well classified by h if
y D_{w,p}(x) = y (〈w, x〉 + p)/‖w‖ ≥ 0.
Problem: Find the hyperplane with the largest margin, i.e.
max_{w,p} min_i y_i D_{w,p}(x_i).
Problem (1):
max_{w,p,M} M
s.t. y_i (〈w, x_i〉 + p)/‖w‖ ≥ M, for all i.
Problem (2):
max_{w,p,M} M
s.t. y_i (〈w, x_i〉 + p) ≥ 1, for all i, and ‖w‖ M = 1.
Problems (1) and (2) are equivalent: given (w, p, M), take w → w/(‖w‖M) and p → p/M.
Problem (Hard Margin Classification)
min_{w,p} (1/2)‖w‖²
s.t. y_i (〈w, x_i〉 + p) ≥ 1, for all i.
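As a sketch, this QP can be handed directly to a general-purpose constrained solver on a tiny separable toy set; the data and the use of SciPy's SLSQP are illustrative choices, not the dedicated solvers used in practice.

```python
# Hedged sketch: hard-margin QP on a tiny, linearly separable toy set.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],     # class +1
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # class -1
y = np.array([1, 1, 1, -1, -1, -1])

def objective(z):                     # z = (w_1, w_2, p); minimize (1/2)||w||^2
    w = z[:2]
    return 0.5 * np.dot(w, w)

constraints = [{"type": "ineq",       # y_i(<w, x_i> + p) - 1 >= 0
                "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, p = res.x[:2], res.x[2]
print("w =", np.round(w, 3), " p =", round(p, 3))
print("margin = 1/||w|| =", round(1 / np.linalg.norm(w), 3))
print("all constraints satisfied:", np.all(y * (X @ w + p) >= 1 - 1e-6))
```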
Regularization - Soft Margin
When D cannot be separated by hyperplanes, finding the minimum number of misclassifications is an NP-hard problem. Consider instead the following quadratic optimization problem:
Problem (Soft Margin Classification)
min_{w,p,ξ} (1/2)‖w‖² + C ∑_i ξ_i
s.t. y_i (〈w, x_i〉 + p) ≥ 1 − ξ_i, for all i,
ξ_i ≥ 0, for all i.
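This soft-margin problem is, up to parameterization, what a linear SVM solver optimizes; a sketch with scikit-learn's SVC, where C controls the trade-off between the margin term and the slack penalties (the overlapping blob data are an illustrative assumption).

```python
# Hedged sketch: soft-margin linear SVM for several values of C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])   # overlapping classes
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: ||w||={np.linalg.norm(w):.2f}, "
          f"training error={np.mean(clf.predict(X) != y):.2f}, "
          f"#support vectors={len(clf.support_)}")
```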
Karush-Kuhn-Tucker
Optimization Problem
minimize_x f(x)
subject to g_i(x) = 0, i = 1, . . . , m,
h_i(x) ≤ 0, i = 1, . . . , k.
Consider the Lagrangian
L(x, λ, ν) = f(x) + ∑_i λ_i g_i(x) + ∑_i ν_i h_i(x).
Optimality conditions
If x is optimal, then there exist λ, ν such that
∂_x L(x, λ, ν) = ∂_x f(x) + ∑_i λ_i ∂_x g_i(x) + ∑_i ν_i ∂_x h_i(x) = 0,
∂_{λ_i} L(x, λ, ν) = g_i(x) = 0,
h_i(x) ≤ 0,
ν_i h_i(x) = 0,
ν_i ≥ 0.
min_{w,p,ξ} (1/2)‖w‖² + C ∑_i ξ_i
s.t. 1 − ξ_i − y_i (〈w, x_i〉 + p) ≤ 0, for all i,
−ξ_i ≤ 0, for all i.
Karush-Kuhn-Tucker sufficiency conditions:
L = (1/2)‖w‖² + C ∑_i ξ_i + ∑_i α_i (1 − ξ_i − y_i (〈w, x_i〉 + p)) − ∑_i η_i ξ_i,
α_i, η_i ≥ 0,
∂L/∂w = w − ∑_i y_i x_i α_i = 0,
∂L/∂p = 0,
∂L/∂ξ = 0,
α_i (1 − ξ_i − y_i (〈w, x_i〉 + p)) = 0,
η_i ξ_i = 0.
Consequences:
w = ∑_i x_i y_i α_i, so the classification function is
f(x) = 〈w, x〉 + p = ∑_i y_i α_i 〈x, x_i〉 + p.
α_i is zero whenever the corresponding constraint is not active, i.e. when y_i (〈w, x_i〉 + p) > 1 − ξ_i.
- w depends only on the badly classified points and on the points exactly on the margin: robustness.
- Sparse representation: "dimensionality reduction" (of a kind).
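These quantities can be read off a fitted model; a sketch reconstructing w = ∑_i y_i α_i x_i from scikit-learn's dual coefficients (dual_coef_ stores the products y_i α_i for the support vectors), reusing the toy data above.

```python
# Hedged sketch: primal w recovered from the dual expansion of a fitted SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_   # sum_i (y_i alpha_i) x_i
print("w (primal):", np.round(clf.coef_[0], 3))
print("w (dual)  :", np.round(w_from_dual, 3))
print("support vectors:", len(clf.support_), "out of", len(y))  # sparse expansion
```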
min_{w,p} (1/2)‖w‖² + C ∑_i ξ_i
s.t. y_i (〈w, x_i〉 + p) ≥ 1 − ξ_i, for all i,
ξ_i ≥ 0, for all i.
The constraints force
ξ_i ≥ max{0, 1 − y_i (〈w, x_i〉 + p)} = L(y_i, 〈w, x_i〉 + p),
with equality at the optimum, where L(y, t) = max{0, 1 − yt} (the hinge loss).
Problem
min_{w,p} λ‖w‖² + (1/n) ∑_i L(y_i, 〈w, x_i〉 + p)
Interpretation:
(1/n) ∑_i L(y_i, 〈w, x_i〉 + p) = R_n(w, p),
‖w‖ measures the complexity;
it is a trade-off between minimizing errors and the complexity of the model. The term λ‖w‖² regularizes the empirical risk.
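This regularized-hinge-loss form is, up to the exact correspondence between λ and the library's constants, what a linear model trained with the hinge loss and an L2 penalty minimizes; a sketch with SGDClassifier, where alpha plays roughly the role of λ (toy data again).

```python
# Hedged sketch: regularized hinge loss via stochastic gradient descent.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 1.0, size=(100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SGDClassifier(loss="hinge", penalty="l2", alpha=0.01, max_iter=2000,
                    tol=1e-4, random_state=0).fit(X, y)
print("w =", np.round(clf.coef_[0], 3), " p =", round(clf.intercept_[0], 3))
print("training error:", np.mean(clf.predict(X) != y))
```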
Problem
min_{w,p} λ‖w‖² + (1/n) ∑_i L(y_i, 〈w, x_i〉 + p)
- The problem is stated in terms of scalar products (with no explicit dependence on the input dimension).
- We want to deal with non-linear decision functions
⇒ substitute the scalar product by another type of scalar product (kernels).
Hilbert Space
A Hilbert space H is a space with a scalar product 〈·, ·〉:
- Symmetric: 〈x, y〉 = 〈y, x〉
- Bilinear: 〈α_1 x_1 + α_2 x_2, y〉 = α_1 〈x_1, y〉 + α_2 〈x_2, y〉
- Positive definite: 〈x, x〉 ≥ 0
inducing a norm ‖x‖ = √〈x, x〉.
Examples:
- R^n with x · y = ∑_i x_i y_i, so ‖x‖ = √(∑_i x_i²)
- Square-integrable functions with 〈f, g〉 = ∫ fg, so ‖f‖ = √(∫ f²)
Main Idea
Map every x to a Hilbert space H and then compute the solution of the soft margin problem in H.
Definition (Kernels)
A function k : R^m × R^m → R is called a kernel if it is symmetric and positive definite, i.e.
k(x, x′) = k(x′, x),
and for all x_1, . . . , x_k and α = (α_1, . . . , α_k),
∑_{i,j} α_i α_j k(x_i, x_j) = αᵀ K α ≥ 0,
where K = {k(x_i, x_j)}_{i,j} (the Gram matrix).
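A numerical sanity check of the definition on a concrete kernel: build the Gram matrix for the Gaussian RBF kernel and verify symmetry and positive semi-definiteness via its eigenvalues; the sample points and the bandwidth are arbitrary illustrative choices.

```python
# Hedged sketch: symmetry and PSD check of a Gaussian-kernel Gram matrix.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
gamma = 2.0

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / gamma)                 # k(x, x') = exp(-||x - x'||^2 / gamma)

print("symmetric:", np.allclose(K, K.T))
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())  # >= 0 (up to round-off) <=> a^T K a >= 0
```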
Examples of kernels:
- Gaussian RBF kernels k(x, x′) = e^{−‖x−x′‖²/γ}.
- Polynomial kernels k(x, x′) = (〈x, x′〉 + c)^d.
- Sums and products of kernels.
- Taylor kernels: given an analytic function f(x) = ∑_{k=1}^∞ a_k x^k defined on the ball {‖x‖ < r} with a_k ≥ 0,
k(x, x′) = ∑_{k=1}^∞ a_k 〈x, x′〉^k
is defined on the ball {‖x‖ < √r}.
- Fourier kernels.
Suppose Φ : R^m → H, where H is a Hilbert space (a vector space with a scalar product). Then
k(x, x′) = 〈Φ(x), Φ(x′)〉
is a kernel. The map Φ will be called the feature map and H the feature space.
Example:
(x_1 x′_1 + x_2 x′_2 + c)² = (x_1², x_2², √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c) · (x′_1², x′_2², √2 x′_1 x′_2, √(2c) x′_1, √(2c) x′_2, c)ᵀ
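The identity in this example is easy to verify numerically (the test vectors and the constant c are arbitrary):

```python
# Hedged sketch: (<x, x'> + c)^2 == <Phi(x), Phi(x')> in R^2.
import numpy as np

c = 1.5

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x  = np.array([0.3, -1.2])
xp = np.array([2.0,  0.7])

lhs = (x @ xp + c) ** 2
rhs = phi(x) @ phi(xp)
print(lhs, rhs, np.isclose(lhs, rhs))   # identical up to round-off
```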
(Pre)Definition
Given a kernel k, its RKHS (reproducing kernel Hilbert space) is the smallest feature space of k containing the functions k(·, x).
Theorem
Given a kernel k, there exists a unique RKHS H.
RKHSs are spaces of functions. Idea of the proof of the theorem above:
- H_pre := {f(x) = ∑_{i=1}^l α_i k(x, x_i) ; l ∈ N, x_i ∈ R^m, α_i ∈ R}.
- For f = ∑ α_i k(·, x_i) and g = ∑ β_j k(·, x′_j),
〈f, g〉 = ∑ α_i β_j k(x_i, x′_j),
and ‖f‖²_{H_pre} = 〈f, f〉 = ∑ α_i α_j k(x_i, x_j).
- H is the completion of H_pre (H_pre plus all the limits in the norm ‖·‖_{H_pre}).
- The map Φ(x) = k(·, x) is a feature map:
〈Φ(x), Φ(x′)〉 = 〈k(·, x), k(·, x′)〉 = k(x, x′).
- Reproducing property: 〈f, k(·, x)〉 = f(x) (= 〈f, Φ(x)〉) for all f ∈ H.
Definition (SVM)
Given data D, a support vector machine is the solution f_{D,λ} of the problem
min_{f∈H} λ‖f‖²_H + (1/n) ∑_i L(y_i, f(x_i)),
where H is an RKHS with kernel k, and L is a loss function.
The function f_{D,λ} exists and is unique.
- Classification problems with the hinge loss, logistic loss, or least squares loss.
- Regression problems with the least squares loss, the ε-insensitive loss (L(y, t) = max{0, |y − t| − ε}), or quantile regression.
- Applications in dimension reduction (kernel PCA).
Theorem (Representer theorem)
There exist α_1, . . . , α_n ∈ R such that
f_{D,λ}(x) = ∑_i α_i k(x, x_i).
Problem (SVM)
min_{α∈R^n} λ ∑_{i,j} α_i α_j k(x_i, x_j) + (1/n) ∑_i L(y_i, ∑_j α_j k(x_i, x_j))
- The kernel k provides non-linearity.
- Mapping to the RKHS retains the optimization structure and allows us to solve the problem in the input space (non-linear programming; λ is usually set by cross-validation).
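In practice this finite-dimensional problem is what kernel SVM solvers handle; a sketch with an RBF-kernel SVC where the regularization constant is chosen by cross-validation (C plays the role of 1/λ up to constants; the two-moons data set is an illustrative choice).

```python
# Hedged sketch: RBF-kernel SVM with cross-validated regularization.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("cross-validated accuracy:", round(grid.best_score_, 3))
```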
For regression, the ε-insensitive loss function is typically used:
L(x) = |x|_ε = 0 if |x| ≤ ε, and |x| − ε if |x| ≥ ε; that is, |x|_ε = max{0, |x| − ε}.
- Using a non-differentiable loss produces sparsity in the solution.
- Using the absolute value gives robustness.
SVM for regression
min_{f∈H} λ‖f‖² + (1/n) ∑_i |y_i − f(x_i)|_ε  =  min_α λ αᵀKα + (1/n) ∑_i |y_i − ∑_j α_j k(x_i, x_j)|_ε,
where K = {k(x_i, x_j)}_{i,j}. Equivalently, as a constrained problem:
min_{α,ξ⁺,ξ⁻} λ αᵀKα + ∑_i (ξ⁺_i + ξ⁻_i)
subject to ξ⁺_i ≥ y_i − ∑_j α_j k(x_i, x_j) − ε,
ξ⁻_i ≥ −y_i + ∑_j α_j k(x_i, x_j) + ε,
ξ⁺_i, ξ⁻_i ≥ 0.
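This is the problem scikit-learn's SVR solves (in its C-parameterized form); a short sketch on noisy sine data, where the data, C, and ε are illustrative choices.

```python
# Hedged sketch: epsilon-insensitive support vector regression.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors used:", len(reg.support_), "out of", len(y))
print("fraction of training points inside the epsilon tube:",
      np.mean(np.abs(y - reg.predict(X)) <= 0.1 + 1e-6))
```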
Statistical Properties - Gaussian Kernel
Properties:
- k_γ(x, x) > 0.
- Φ_γ is injective.
- If X is compact, H_γ is dense in C(X).
- For a complex differentiable (holomorphic) function f : C^m → C, consider
‖f‖_γ = ( (2^m / (π^m γ^{2m})) ∫_{C^m} |f(z)|² e^{γ^{−2} ∑_j (z_j − z̄_j)²} dz )^{1/2}.
Then H_γ = {Re(f) | f is holomorphic and ‖f‖_γ < ∞}.
- The VC dimension of H_γ is ∞.
Recall that R(f) = ∫ L(x, y, f(x)) dP(x, y) is the expected risk function. Consider
R* = inf_F R(f),
where F = {f measurable}, and
R*_H = min_H R(f).
Theorem (Consistency)
Given a sample of size n, consider a sequence λ_n such that lim_{n→∞} λ_n = 0 and lim_{n→∞} λ_n^p n = ∞ for some p > 1. Then
lim_{n→∞} R(f_{D,λ_n}) = R*_H.
Moreover, R*_H = R*.
(For example, λ_n = n^{−1/2} satisfies both conditions, taking any 1 < p < 2.)
Implementation
- C++, Java: LIBSVM
- C: SVMlight
- R: packages 'e1071' (LIBSVM), 'kernlab'
- Python: scikit-learn
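A minimal usage sketch of the Python option (scikit-learn's SVC and SVR are built on LIBSVM):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))
```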