Institute of Computer Science - Support Vector Machines... performance evaluation, statistical learning...
TRANSCRIPT
So far…
Supervised machine learning
Linear models
Least squares regression
Fisher’s discriminant, Perceptron, Logistic model
Non-linear models
Neural networks, Decision trees, Association rules
Unsupervised machine learning
Clustering/EM, PCA
Generic scaffolding
Probabilistic modeling, ML/MAP estimation
Performance evaluation, Statistical learning theory
Linear algebra, Optimization methods
Coming up next
Supervised machine learning
Linear models
Least squares regression, SVM
Fisher’s discriminant, Perceptron, Logistic regression, SVM
Non-linear models
Neural networks, Decision trees, Association rules
SVM, Kernel-XXX
Unsupervised machine learning
Clustering/EM, PCA, Kernel-XXX
Generic scaffolding
Probabilistic modeling, ML/MAP estimation
Performance evaluation, Statistical learning theory
Linear algebra, Optimization methods
Kernels
First things first
SVM ($y \in \{-1, 1\}$):
library('e1071')
m <- svm(X, factor(y), kernel = 'linear')   # factor(y) forces classification rather than regression
predict(m, newX)
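For concreteness, a minimal end-to-end sketch on synthetic, linearly separable data (all names and numbers here are illustrative):

library(e1071)
set.seed(0)
X <- rbind(matrix(rnorm(40, mean = -2), ncol = 2),   # 20 points around (-2, -2)
           matrix(rnorm(40, mean =  2), ncol = 2))   # 20 points around ( 2,  2)
y <- factor(rep(c(-1, 1), each = 20))
m <- svm(X, y, kernel = 'linear')
newX <- matrix(c(-2, -2, 2, 2), ncol = 2, byrow = TRUE)
predict(m, newX)   # expected: -1  1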
Quiz
This line is called …
This vector is …
Those lines are …
$f(\mathbf{x}) = {?}$
$\mathbf{x}_1 = {?}$   $y_1 = {?}$
Functional margin of $\mathbf{x}_1$?
Geometric margin of $\mathbf{x}_1$?
Distance to origin?
Quiz (answers)
This line is called the separating hyperplane.
This vector is the normal $\mathbf{w}$.
Those lines are isolines (level lines).
$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$
$\mathbf{x}_1 = (2, 6)$; $y_1 = -1$
Functional margin: $y_1 \cdot f(\mathbf{x}_1) \approx 2$
Geometric margin: $y_1 \cdot f(\mathbf{x}_1)/|\mathbf{w}| \approx 3\sqrt{2}$
Distance to origin: $d = b/|\mathbf{w}|$
Quiz
Suppose we scale $\mathbf{w}$ and $b$ by some constant (example: $\mathbf{w} \to 2\mathbf{w}$, $b = 0$). Will it:
Affect the separating hyperplane? How?
No: $\mathbf{w}^T\mathbf{x} + b = 0 \Leftrightarrow 2\mathbf{w}^T\mathbf{x} + 2b = 0$
Affect the functional margins? How?
Yes: $(2\mathbf{w}^T\mathbf{x} + 2b)\, y = 2 \cdot (\mathbf{w}^T\mathbf{x} + b)\, y$
Affect the geometric margins? How?
No: $\frac{2\mathbf{w}^T\mathbf{x} + 2b}{|2\mathbf{w}|} = \frac{\mathbf{w}^T\mathbf{x} + b}{|\mathbf{w}|}$
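A quick numeric check of these answers in R (the numbers are illustrative, not those of the figure):

norm2 <- function(v) sqrt(sum(v^2))
f <- function(w, b, x) sum(w * x) + b
w <- c(1, -1); b <- 2; x1 <- c(2, 6); y1 <- -1
y1 * f(w, b, x1)                    # functional margin: 2
y1 * f(2*w, 2*b, x1)                # scaling doubles it: 4
y1 * f(w, b, x1) / norm2(w)         # geometric margin: ~1.41
y1 * f(2*w, 2*b, x1) / norm2(2*w)   # unchanged: ~1.41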
Which classifier is best?
Maximal margin classifier
Why maximal margin?
Well-defined, single stable solution
Noise-tolerant
Small parameterization
(Fairly) efficient algorithms exist for finding it
Maximal margin: Separable case
$f(\mathbf{x}) = 1$
$f(\mathbf{x}) = -1$
$\forall i:\ f(\mathbf{x}_i)\, y_i \ge 1$
The (geometric) distance to the isoline $f(\mathbf{x}) = 1$ is:
$d = \frac{f(\mathbf{x})}{|\mathbf{w}|} = \frac{1}{|\mathbf{w}|}$
Maximal margin: Separable case
Among all linear classifiers $(\mathbf{w}, b)$ which keep all points at a functional margin of 1 or more, we look for the one with the largest distance $d$ to the corresponding isolines, i.e. the largest geometric margin.
As $d = \frac{1}{|\mathbf{w}|}$, this is equivalent to finding the classifier with minimal $|\mathbf{w}|$, which in turn is equivalent to finding the classifier with minimal $|\mathbf{w}|^2$.
Compare
"Generic" linear classification (separable case):
Find $(\mathbf{w}, b)$ such that all points are classified correctly,
i.e. $f(\mathbf{x}_i)\, y_i > 0$.
Maximal margin classification (separable case):
Find $(\mathbf{w}, b)$ such that all points are classified correctly with a fixed functional margin,
i.e. $f(\mathbf{x}_i)\, y_i \ge 1$,
and $|\mathbf{w}|^2$ is minimal.
Remember
SVM optimization problem (separable case):
$\min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2$
so that
$(\mathbf{w}^T\mathbf{x}_i + b)\, y_i \ge 1$
General case ("soft margin")
The same, but we also penalize all margin violations.
SVM optimization problem:
$\min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2 + C \sum_i \xi_i$
where, with margin $m_i = f(\mathbf{x}_i)\, y_i$,
$\xi_i = (1 - f(\mathbf{x}_i)\, y_i)_+ = (1 - m_i)_+ = \mathrm{hinge}(m_i)$
Hinge loss: $\mathrm{hinge}(m) = (1 - m)_+$
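As a small sketch, the hinge loss and the resulting soft-margin objective can be written directly in R (function and variable names here are mine, for illustration only):

hinge <- function(m) pmax(1 - m, 0)      # (1 - m)_+, elementwise

# soft-margin SVM objective for a given (w, b), assuming y in {-1, 1}:
svm_objective <- function(w, b, X, y, C) {
  m <- as.vector(X %*% w + b) * y        # margins m_i = f(x_i) y_i
  0.5 * sum(w^2) + C * sum(hinge(m))     # 1/2 |w|^2 + C * sum of slacks
}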
Classification loss functions
"Generic" classification: $\min_{\mathbf{w},b} \sum_i [m_i < 0]$
Perceptron: $\min_{\mathbf{w},b} \sum_i (-m_i)_+$
Least squares classification*: $\min_{\mathbf{w},b} \sum_i (m_i - 1)^2$
Boosting: $\min_{\mathbf{w},b} \sum_i \exp(-m_i)$
Logistic regression: $\min_{\mathbf{w},b} \sum_i \log(1 + e^{-m_i})$
Regularized logistic regression: $\min_{\mathbf{w},b} \sum_i \log(1 + e^{-m_i}) + \lambda \cdot \frac{1}{2}|\mathbf{w}|^2$
SVM: $\min_{\mathbf{w},b} \sum_i (1 - m_i)_+ + \frac{1}{2C}|\mathbf{w}|^2$
L2-SVM: $\min_{\mathbf{w},b} \sum_i (1 - m_i)_+^2 + \frac{1}{2C}|\mathbf{w}|^2$
L1-regularized L2-SVM: $\min_{\mathbf{w},b} \sum_i (1 - m_i)_+^2 + \frac{1}{2C}\|\mathbf{w}\|_1$
… etc
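These losses are easy to compare visually as functions of the margin $m_i = f(\mathbf{x}_i)\, y_i$; a quick R sketch (the plotting choices are arbitrary):

m <- seq(-3, 3, by = 0.01)
losses <- cbind("0-1"         = as.numeric(m < 0),
                "perceptron"  = pmax(-m, 0),
                "squared"     = (m - 1)^2,
                "exponential" = exp(-m),
                "logistic"    = log(1 + exp(-m)),
                "hinge"       = pmax(1 - m, 0))
matplot(m, losses, type = "l", lty = 1, ylim = c(0, 4),
        xlab = "margin m", ylab = "loss")
legend("topright", legend = colnames(losses), col = 1:6, lty = 1)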
In general
$\min_{\mathbf{w},b} \sum_i \phi(m_i) + \lambda \cdot \Omega(\mathbf{w})$
(model fit + model complexity)
Compare to MAP estimation:
$\max_{\mathrm{Model}} \sum_i \log P(x_i \mid \mathrm{Model}) + \log P(\mathrm{Model})$
i.e.
$\max_{\mathrm{Model}} \log P(\mathrm{Data} \mid \mathrm{Model}) + \log P(\mathrm{Model})$
(likelihood + model prior)
Solving the SVM
$\min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2 + C \sum_i (1 - f(\mathbf{x}_i)\, y_i)_+$
Introducing slack variables $\xi_i$, this is equivalent to:
$\min_{\mathbf{w},b,\boldsymbol{\xi}} \frac{1}{2}|\mathbf{w}|^2 + C \sum_i \xi_i$
such that
$f(\mathbf{x}_i)\, y_i - (1 - \xi_i) \ge 0$, $\quad \xi_i \ge 0$
Quadratic function with linear constraints!
Quadratic programming
Minimize
$f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x} + \mathbf{c}^T \mathbf{x}$
subject to:
$\mathbf{A}\mathbf{x} \ge \mathbf{b}$, $\quad \mathbf{C}\mathbf{x} = \mathbf{d}$
> library(quadprog)
> solve.QP(Q, -c, A, b, neq)   # 5th argument: the number of equality constraints (quadprog calls it meq)
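A hypothetical toy problem showing the interface (note that quadprog expects the constraints as columns of its Amat argument, so that t(Amat) %*% x >= bvec, with the first meq of them treated as equalities):

library(quadprog)
# minimize 1/2 (x1^2 + x2^2) - x1 - x2  subject to  x1 + x2 = 1,  x1 >= 0,  x2 >= 0
Q <- diag(2)
c_ <- c(-1, -1)
A <- cbind(c(1, 1), c(1, 0), c(0, 1))      # one constraint per column
b <- c(1, 0, 0)
solve.QP(Q, -c_, A, b, meq = 1)$solution   # -> 0.5 0.5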
A popular trick in optimization:
$\min_x f(x)$, s.t. $g(x) \ge 0$
is equivalent to:
$\min_x \max_{\alpha \ge 0} f(x) - \alpha g(x)$
(If $g(x) < 0$, the inner max is $+\infty$, so the outer min enforces feasibility; if $g(x) \ge 0$, the inner max is attained at $\alpha = 0$.)
Solving the SVM: Dual
$\min_{\mathbf{w},b,\boldsymbol{\xi}} \frac{1}{2}|\mathbf{w}|^2 + C \sum_i \xi_i$ such that $f(\mathbf{x}_i)\, y_i - (1 - \xi_i) \ge 0$, $\xi_i \ge 0$
is equivalent to:
$\min_{\mathbf{w},b,\boldsymbol{\xi}} \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta} \ge 0} \frac{1}{2}|\mathbf{w}|^2 + C \sum_i \xi_i - \sum_i \alpha_i \big(f(\mathbf{x}_i)\, y_i - (1 - \xi_i)\big) - \sum_i \beta_i \xi_i$
Grouping the $\xi_i$ terms:
$\min_{\mathbf{w},b,\boldsymbol{\xi}} \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta} \ge 0} \frac{1}{2}|\mathbf{w}|^2 + \sum_i \xi_i (C - \alpha_i - \beta_i) - \sum_i \alpha_i \big(f(\mathbf{x}_i)\, y_i - 1\big)$
At the optimum $C - \alpha_i - \beta_i = 0$ (otherwise the min over $\xi_i$ would be $-\infty$), and since $\beta_i \ge 0$ this means $0 \le \alpha_i \le C$. Hence:
$\min_{\mathbf{w},b} \max_{\boldsymbol{\alpha}} \frac{1}{2}|\mathbf{w}|^2 - \sum_i \alpha_i \big(f(\mathbf{x}_i)\, y_i - 1\big)$, $\quad 0 \le \alpha_i \le C$
Sparsity: $\alpha_i$ is nonzero only for those points which have $f(\mathbf{x}_i)\, y_i - 1 \le 0$.
Now swap the min and the max (this can be done, in particular, because everything is nice and convex):
$\max_{\boldsymbol{\alpha}} \min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2 - \sum_i \alpha_i \big(f(\mathbf{x}_i)\, y_i - 1\big)$, $\quad 0 \le \alpha_i \le C$
Next, solve the inner (unconstrained) min as usual:
$\nabla_{\mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0$
$\nabla_b = -\sum_i \alpha_i y_i = 0$
Express $\mathbf{w}$ and substitute:
$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ (the dual representation)
$\sum_i \alpha_i y_i = 0$ (the "balance" condition)
This yields:
$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$
$0 \le \alpha_i \le C$, $\quad \sum_i \alpha_i y_i = 0$
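For completeness, the substitution written out (the last step uses $\sum_i \alpha_i y_i = 0$, which kills the $b$ term):
$\frac{1}{2}\Big|\sum_i \alpha_i y_i \mathbf{x}_i\Big|^2 - \sum_i \alpha_i \Big(\Big(\sum_j \alpha_j y_j \mathbf{x}_j\Big)^T \mathbf{x}_i + b\Big) y_i + \sum_i \alpha_i = \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j - \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_j^T \mathbf{x}_i - b \sum_i \alpha_i y_i + \sum_i \alpha_i = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$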
Solving the SVM: Dual
In matrix form:
$\max_{\boldsymbol{\alpha}} \mathbf{1}^T \boldsymbol{\alpha} - \frac{1}{2} \boldsymbol{\alpha}^T (\mathbf{K} \circ \mathbf{Y}) \boldsymbol{\alpha}$
$0 \le \boldsymbol{\alpha} \le C$, $\quad \mathbf{y}^T \boldsymbol{\alpha} = 0$
where $K_{ij} = \mathbf{x}_i^T \mathbf{x}_j$, $Y_{ij} = y_i y_j$
Solving the SVM: Dual
$\min_{\boldsymbol{\alpha}} \frac{1}{2} \boldsymbol{\alpha}^T (\mathbf{K} \circ \mathbf{Y}) \boldsymbol{\alpha} - \mathbf{1}^T \boldsymbol{\alpha}$
$\boldsymbol{\alpha} \ge 0$, $\quad -\boldsymbol{\alpha} \ge -C$, $\quad \mathbf{y}^T \boldsymbol{\alpha} = 0$
Then find $b$ from the condition*:
$f(\mathbf{x}_i)\, y_i = 1$ if $0 < \alpha_i < C$
*see homework, it's actually not that easy!
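Putting it together, a sketch of solving this dual with quadprog on toy two-dimensional data (the data and all names are illustrative; a tiny ridge is added to K ∘ Y because solve.QP requires a strictly positive definite matrix):

library(quadprog)
set.seed(1)
n <- 40
X <- rbind(matrix(rnorm(n, mean = -2), ncol = 2),   # 20 points per class
           matrix(rnorm(n, mean =  2), ncol = 2))
y <- rep(c(-1, 1), each = n/2)
C <- 1
K <- X %*% t(X)                           # linear kernel: K_ij = x_i' x_j
D <- K * (y %*% t(y)) + diag(1e-8, n)     # K o Y, jittered for positive definiteness
A <- cbind(y, diag(n), -diag(n))          # columns: y'a = 0 (equality), a >= 0, -a >= -C
b0 <- c(0, rep(0, n), rep(-C, n))
alpha <- solve.QP(D, rep(1, n), A, b0, meq = 1)$solution
w  <- colSums(alpha * y * X)                     # w = sum_i alpha_i y_i x_i
sv <- which(alpha > 1e-6 & alpha < C - 1e-6)     # margin support vectors
b  <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)  # from f(x_i) y_i = 1 (assumes sv is non-empty)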
Support vectors
[Figure: training points annotated with their dual coefficients; most have $\alpha_i = 0$, a few show values such as 0.5 and 1, and the marked margin violators have $\alpha_i = C$.]
Support vectors
$\sum_i \alpha_i y_i = 0$
$0 \le \alpha_i \le C$
Sparsity
The dual solution is often very sparse; this allows the optimization to be performed efficiently (the "working set" approach).
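Continuing the quadprog sketch above, the sparsity is easy to observe by binning the $\alpha_i$ (the thresholds are arbitrary numerical tolerances):

eps <- 1e-6
table(cut(alpha, breaks = c(-Inf, eps, C - eps, Inf),
          labels = c("alpha = 0", "0 < alpha < C", "alpha = C")))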
Kernels
$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$
$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
$f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
$f(\mathbf{x}) = \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$, where $K$ is the kernel function
Examples:
$f(x) = w_1 x + w_2 x^2 + b$
$f(\mathbf{x}) = \sum_i \alpha_i y_i \exp(-|\mathbf{x}_i - \mathbf{x}|^2) + b$
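The e1071 interface from the beginning of the lecture accepts nonlinear kernels directly and predicts via this dual form; a sketch on illustrative data that is not linearly separable:

library(e1071)
set.seed(2)
X <- matrix(runif(400, -2, 2), ncol = 2)
y <- factor(ifelse(X[, 1]^2 + X[, 2]^2 < 2, 1, -1))     # circular class boundary
m <- svm(X, y, kernel = 'radial', gamma = 1, cost = 1)  # K(u, v) = exp(-gamma |u - v|^2)
mean(predict(m, X) == y)                                # training accuracy, close to 1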
Quiz
SVM is a __________ linear classifier.
Margin maximization can be achieved via minimization of ______________.
SVM uses _____ loss and _______ regularization.
Besides hinge loss I also know ____ loss and ___ loss.
SVM in both primal and dual form is solved using ________ programming.
Quiz
In the primal formulation we solve for the parameter vector ___. In the dual formulation we solve for ___ instead.
The _____ form of the SVM is typically sparse.
Support vectors are those training points for which _______.
The relation between primal and dual variables is: $\_\_\_ = \sum_i \_\_\_\_$.
A kernel is a generalization of the _____ product.