Support Vector Machines

Konstantin Tretyakov ([email protected])

MTAT.03.227 Machine Learning


Page 1

Support Vector Machines

Konstantin Tretyakov ([email protected])

MTAT.03.227 Machine Learning

Page 2

So far…

Page 3

So far…

Supervised machine learning

Linear models

Non-linear models

Unsupervised machine learning

Generic scaffolding

Page 4

So far…

Supervised machine learning

Linear models

Least squares regression

Fisher’s discriminant, Perceptron, Logistic model

Non-linear models

Neural networks, Decision trees, Association rules

Unsupervised machine learning

Clustering/EM, PCA

Generic scaffolding

Probabilistic modeling, ML/MAP estimation

Performance evaluation, Statistical learning theory

Linear algebra, Optimization methods

Page 5

Coming up next

Supervised machine learning

Linear models

Least squares regression, SVM

Fisher’s discriminant, Perceptron, Logistic regression, SVM

Non-linear models

Neural networks, Decision trees, Association rules

SVM, Kernel-XXX

Unsupervised machine learning

Clustering/EM, PCA, Kernel-XXX

Generic scaffolding

Probabilistic modeling, ML/MAP estimation

Performance evaluation, Statistical learning theory

Linear algebra, Optimization methods

Kernels

Page 6

First things first

SVM: ($y \in \{-1, 1\}$)

library(e1071)
m <- svm(X, factor(y), kernel = 'linear')   # a factor response selects C-classification (numeric y would trigger regression)
predict(m, newX)
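
As a quick sanity check, here is a self-contained toy run of the above; the data X, y, newX are not from the slides, just simulated for illustration:

library(e1071)
set.seed(1)
X <- rbind(matrix(rnorm(20, mean =  2), ncol = 2),   # 10 points of class +1
           matrix(rnorm(20, mean = -2), ncol = 2))   # 10 points of class -1
y <- factor(rep(c(1, -1), each = 10))
m <- svm(X, y, kernel = 'linear')
newX <- matrix(c(2, 2, -2, -2), ncol = 2, byrow = TRUE)
predict(m, newX)   # expected labels: 1, -1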

Page 7

Quiz


This line is called …

This vector is …

Those lines are …

$f(\mathbf{x}) = {}$?

$\mathbf{x}_1 = {}$?  $y_1 = {}$?

Functional margin of $\mathbf{x}_1$?

Geometric margin of $\mathbf{x}_1$?

Distance to origin?

Page 8

Quiz

Separating hyperplane

Normal $\mathbf{w}$

Isolines (level lines)

$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$

$\mathbf{x}_1 = (2, 6)$; $y_1 = -1$

$y_1 \cdot f(\mathbf{x}_1) \approx 2$

$f(\mathbf{x}_1)/\|\mathbf{w}\| \approx 3\sqrt{2}$

$d = b/\|\mathbf{w}\|$

Page 9

Quiz

Suppose we scale 𝒘 and 𝑏 by some constant.

Will it:

Affect the separating hyperplane? How?

Affect the functional margins? How?

Affect the geometric margins? How?

Page 10

Quiz

Example: 𝒘 → 2𝒘, 𝑏 = 0

Page 11

Quiz

Suppose we scale $\mathbf{w}$ and $b$ by some constant.

Will it:

Affect the separating hyperplane? How?

No: $\mathbf{w}^T\mathbf{x} + b = 0 \Leftrightarrow 2\mathbf{w}^T\mathbf{x} + 2b = 0$

Affect the functional margins? How?

Yes: $(2\mathbf{w}^T\mathbf{x} + 2b)\,y = 2 \cdot (\mathbf{w}^T\mathbf{x} + b)\,y$

Affect the geometric margins? How?

No: $\dfrac{2\mathbf{w}^T\mathbf{x} + 2b}{\|2\mathbf{w}\|} = \dfrac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|}$
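
These answers are easy to verify numerically; a small sketch in R (the particular w, b, x, y below are made up for illustration):

w <- c(1, 1); b <- -2; x <- c(3, 1); y <- 1
(w %*% x + b) * y                              # functional margin: 2
(2*w %*% x + 2*b) * y                          # ...doubles after scaling: 4
(w %*% x + b) * y / sqrt(sum(w^2))             # geometric margin: 1.414...
(2*w %*% x + 2*b) * y / sqrt(sum((2*w)^2))     # ...unchanged: 1.414...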

Page 12

Which classifier is best?

Page 13

Maximal margin classifier

Page 14

Why maximal margin?

Well-defined, single stable solution

Noise-tolerant

Small parameterization

(Fairly) efficient algorithms exist for finding it

Page 15

Maximal margin: Separable case

(Figure: maximal margin separator with isolines $f(\mathbf{x}) = 1$ and $f(\mathbf{x}) = -1$.)

Page 16

Maximal margin: Separable case

(Figure: isolines $f(\mathbf{x}) = 1$ and $f(\mathbf{x}) = -1$.)

$\forall i\;\; f(\mathbf{x}_i)\,y_i \ge 1$

Page 17

Maximal margin: Separable case

(Figure: isolines $f(\mathbf{x}) = 1$ and $f(\mathbf{x}) = -1$.)

The (geometric) distance to the isoline $f(\mathbf{x}) = 1$ is:

Page 18

Maximal margin: Separable case

(Figure: isolines $f(\mathbf{x}) = 1$ and $f(\mathbf{x}) = -1$.)

The (geometric) distance to the isoline $f(\mathbf{x}) = 1$ is:

$d = \dfrac{f(\mathbf{x})}{\|\mathbf{w}\|} = \dfrac{1}{\|\mathbf{w}\|}$

Page 19

Maximal margin: Separable case

Among all linear classifiers $(\mathbf{w}, b)$

… which keep all points at a functional margin of 1 or more,

… we shall look for the one which has the largest distance $d$ to the corresponding isolines, i.e. the largest geometric margin.

As $d = \dfrac{1}{\|\mathbf{w}\|}$, this is equivalent to finding the classifier with minimal $\|\mathbf{w}\|$,

… which is equivalent to finding the classifier with minimal $\|\mathbf{w}\|^2$.

(Pages 20–23: figures only.)

Page 24

Compare

“Generic” linear classification (separable case):

Find $(\mathbf{w}, b)$ such that all points are classified correctly, i.e. $f(\mathbf{x}_i)\,y_i > 0$.

Maximal margin classification (separable case):

Find $(\mathbf{w}, b)$ such that all points are classified correctly with a fixed functional margin, i.e. $f(\mathbf{x}_i)\,y_i \ge 1$, and $\|\mathbf{w}\|^2$ is minimal.

Page 25

Remember

SVM optimization problem (separable case):

$\min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2$

so that

$(\mathbf{w}^T\mathbf{x}_i + b)\,y_i \ge 1$

Page 26

General case (“soft margin”)

The same, but we also penalize all margin violations.

SVM optimization problem:

$\min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

where

$\xi_i = \left(1 - f(\mathbf{x}_i)\,y_i\right)_+$

Page 27

General case (“soft margin”)

The same, but we also penalize all margin violations.

SVM optimization problem:

$\min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \left(1 - f(\mathbf{x}_i)\,y_i\right)_+$

Page 28

General case (“soft margin”)

The same, but we also penalize all margin violations.

SVM optimization problem:

$\min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i (1 - m_i)_+$

where $m_i = f(\mathbf{x}_i)\,y_i$ is the functional margin of point $i$.

Page 29

General case (“soft margin”)

The same, but we also penalize all margin violations.

SVM optimization problem:

$\min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \mathrm{hinge}(m_i)$

where

$\mathrm{hinge}(m_i) = (1 - m_i)_+$

Page 30

Hinge loss: $\mathrm{hinge}(m_i) = (1 - m_i)_+$
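
The hinge loss is a one-liner in R; a small sketch (the function name is ours, not from the slides):

hinge <- function(m) pmax(1 - m, 0)   # (1 - m)_+
hinge(c(-1, 0, 0.5, 1, 2))            # 2.0 1.0 0.5 0.0 0.0
# Loss is zero once the margin reaches 1; violations are penalized linearly.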

Page 31

Classification loss functions

“Generic” classification:

$\min_{\mathbf{w},b}\ \sum_i [m_i < 0]$

Page 32

Classification loss functions

Perceptron:

Page 33

Classification loss functions

Perceptron:

$\min_{\mathbf{w},b}\ \sum_i (-m_i)_+$

Page 34

Classification loss functions

Least squares classification*:

$\min_{\mathbf{w},b}\ \sum_i (m_i - 1)^2$

Page 35

Classification loss functions

Boosting:

$\min_{\mathbf{w},b}\ \sum_i \exp(-m_i)$

Page 36

Classification loss functions

Logistic regression:

$\min_{\mathbf{w},b}\ \sum_i \log(1 + e^{-m_i})$

Page 37

Classification loss functions

Regularized logistic regression:

$\min_{\mathbf{w},b}\ \sum_i \log(1 + e^{-m_i}) + \lambda\,\frac{1}{2}\|\mathbf{w}\|^2$

Page 38

Classification loss functions

SVM:

$\min_{\mathbf{w},b}\ \sum_i (1 - m_i)_+ + \frac{1}{2C}\|\mathbf{w}\|^2$

Page 39

Classification loss functions

L2-SVM:

$\min_{\mathbf{w},b}\ \sum_i (1 - m_i)_+^2 + \frac{1}{2C}\|\mathbf{w}\|^2$

Page 40

Classification loss functions

L1-regularized L2-SVM:

$\min_{\mathbf{w},b}\ \sum_i (1 - m_i)_+^2 + \frac{1}{2C}\|\mathbf{w}\|_1$

… etc

Page 41

In general

$\min_{\mathbf{w},b}\ \sum_i \phi(m_i) + \lambda \cdot \Omega(\mathbf{w})$

(the loss term $\sum_i \phi(m_i)$ measures model fit; the regularizer $\Omega(\mathbf{w})$ measures model complexity)
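
All of the preceding objectives are instances of this template, differing only in $\phi$. A sketch in R collecting them side by side (the list names are ours; the definitions are transcribed from the slides above):

phi <- list(
  zero_one   = function(m) as.numeric(m < 0),   # "generic" classification
  perceptron = function(m) pmax(-m, 0),
  squares    = function(m) (m - 1)^2,           # least squares classification
  boosting   = function(m) exp(-m),
  logistic   = function(m) log(1 + exp(-m)),
  hinge      = function(m) pmax(1 - m, 0)       # SVM
)
m <- seq(-2, 2, by = 0.5)
round(sapply(phi, function(f) f(m)), 2)   # compare the surrogates on a grid of margins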

Page 42

Compare to MAP estimation

$\max_{\text{Model}}\ \sum_i \log P(x_i \mid \text{Model}) + \log P(\text{Model})$

(the first term is the likelihood; the second is the model prior)

Page 43

Compare to MAP estimation

$\max_{\text{Model}}\ \log P(\text{Data} \mid \text{Model}) + \log P(\text{Model})$

(the first term is the likelihood; the second is the model prior)

Page 44

Solving the SVM

$\min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \left(1 - f(\mathbf{x}_i)\,y_i\right)_+$

Page 45

Solving the SVM

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

such that

$f(\mathbf{x}_i)\,y_i \ge 1 - \xi_i$

$\xi_i \ge 0$

Page 46

Solving the SVM

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

such that

$f(\mathbf{x}_i)\,y_i - (1 - \xi_i) \ge 0$

$\xi_i \ge 0$

Page 47

Solving the SVM

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

such that

$f(\mathbf{x}_i)\,y_i - (1 - \xi_i) \ge 0$

$\xi_i \ge 0$

Quadratic function with linear constraints!

Page 48

Solving the SVM

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

such that

$f(\mathbf{x}_i)\,y_i - (1 - \xi_i) \ge 0$

$\xi_i \ge 0$

Quadratic function with linear constraints!

Quadratic programming:

Minimize

$f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x}$

subject to:

$\mathbf{A}\mathbf{x} \ge \mathbf{b}, \quad \mathbf{C}\mathbf{x} = \mathbf{d}$

Page 49

Solving the SVM

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

such that

$f(\mathbf{x}_i)\,y_i - (1 - \xi_i) \ge 0$

$\xi_i \ge 0$

Quadratic function with linear constraints!

Quadratic programming:

Minimize

$f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x} + \mathbf{c}^T\mathbf{x}$

subject to:

$\mathbf{A}\mathbf{x} \ge \mathbf{b}, \quad \mathbf{C}\mathbf{x} = \mathbf{d}$

> library(quadprog)
> solve.QP(Q, -c, A, b, neq)

(quadprog's solve.QP(Dmat, dvec, Amat, bvec, meq) minimizes ½ x'Dmat x − dvec'x subject to t(Amat) %*% x ≥ bvec, with the first meq constraints treated as equalities; hence dvec = −c for the objective above, and the slide's neq corresponds to meq.)
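
To make this concrete, here is a hedged sketch (ours, not from the slides) that solves the hard-margin primal for a toy 2-D dataset with quadprog, optimizing z = (w1, w2, b):

library(quadprog)
set.seed(1)
X <- rbind(matrix(rnorm(20, mean =  2), ncol = 2),   # class +1
           matrix(rnorm(20, mean = -2), ncol = 2))   # class -1
y <- rep(c(1, -1), each = 10)
n <- nrow(X); d <- ncol(X)
Dmat <- diag(c(rep(1, d), 1e-8))   # penalize only w; a tiny ridge on b keeps Dmat positive definite
dvec <- rep(0, d + 1)              # no linear term in the objective
Amat <- t(cbind(y * X, y))         # constraints y_i (w'x_i + b) >= 1, as t(Amat) %*% z >= bvec
bvec <- rep(1, n)
z <- solve.QP(Dmat, dvec, Amat, bvec, meq = 0)$solution
w <- z[1:d]; b <- z[d + 1]
min(y * (X %*% w + b))             # all functional margins >= 1 (up to numerical error)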

Page 50

A popular trick in optimization:

$\min_x f(x) \quad \text{s.t.} \quad g(x) \ge 0$

is equivalent to:

$\min_x\ \max_{\alpha \ge 0}\ f(x) - \alpha\,g(x)$

(If $g(x) \ge 0$, the inner max is attained at $\alpha = 0$; if $g(x) < 0$, it drives the value to $+\infty$, ruling such $x$ out.)

Page 51

Solving the SVM: Dual

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{such that} \quad f(\mathbf{x}_i)\,y_i - (1 - \xi_i) \ge 0,\ \xi_i \ge 0$

Page 52

Solving the SVM: Dual

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{such that} \quad f(\mathbf{x}_i)\,y_i - (1 - \xi_i) \ge 0,\ \xi_i \ge 0$

Is equivalent to:

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \max_{\boldsymbol\alpha \ge 0,\, \boldsymbol\beta \ge 0}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - (1 - \xi_i)\right) - \sum_i \beta_i\,\xi_i$


Page 54

Solving the SVM: Dual

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{such that} \quad f(\mathbf{x}_i)\,y_i - (1 - \xi_i) \ge 0,\ \xi_i \ge 0$

Is equivalent to:

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \max_{\boldsymbol\alpha \ge 0,\, \boldsymbol\beta \ge 0}\ \frac{1}{2}\|\mathbf{w}\|^2 + \sum_i \xi_i\left(C - \alpha_i - \beta_i\right) - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right)$

Page 55

Solving the SVM: Dual

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{such that} \quad f(\mathbf{x}_i)\,y_i - (1 - \xi_i) \ge 0,\ \xi_i \ge 0$

Is equivalent to:

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \max_{\boldsymbol\alpha \ge 0,\, \boldsymbol\beta \ge 0}\ \frac{1}{2}\|\mathbf{w}\|^2 + \sum_i \xi_i\left(C - \alpha_i - \beta_i\right) - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right)$

For the minimum over the (now unconstrained) $\xi_i$ to be bounded, the coefficient of each $\xi_i$ must vanish:

$C - \alpha_i - \beta_i = 0$

Page 56

Solving the SVM: Dual

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{such that} \quad f(\mathbf{x}_i)\,y_i - (1 - \xi_i) \ge 0,\ \xi_i \ge 0$

Is equivalent to:

$\min_{\mathbf{w},b,\boldsymbol\xi}\ \max_{\boldsymbol\alpha \ge 0,\, \boldsymbol\beta \ge 0}\ \frac{1}{2}\|\mathbf{w}\|^2 + \sum_i \xi_i\left(C - \alpha_i - \beta_i\right) - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right)$

Since $\beta_i = C - \alpha_i$ and $\beta_i \ge 0$:

$0 \le \alpha_i \le C$

Page 57

Solving the SVM: Dual

$\min_{\mathbf{w},b}\ \max_{\boldsymbol\alpha}\ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right), \qquad 0 \le \alpha_i \le C$

Page 58

Solving the SVM: Dual

$\min_{\mathbf{w},b}\ \max_{\boldsymbol\alpha}\ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right), \qquad 0 \le \alpha_i \le C$

Sparsity: $\alpha_i$ is nonzero only for those points which have $f(\mathbf{x}_i)\,y_i - 1 \le 0$.

Page 59

Solving the SVM: Dual

$\min_{\mathbf{w},b}\ \max_{\boldsymbol\alpha}\ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right), \qquad 0 \le \alpha_i \le C$

Now swap the min and the max (possible here in particular because everything is nice and convex).

Page 60

Solving the SVM: Dual

$\max_{\boldsymbol\alpha}\ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right), \qquad 0 \le \alpha_i \le C$

Next solve the inner (unconstrained) min as usual.

Page 61

Solving the SVM: Dual

$\max_{\boldsymbol\alpha}\ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right), \qquad 0 \le \alpha_i \le C$

Next solve the inner (unconstrained) min as usual:

$\nabla_{\mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0$

$\nabla_b = -\sum_i \alpha_i y_i = 0$

Page 62

Solving the SVM: Dual

$\max_{\boldsymbol\alpha}\ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right), \qquad 0 \le \alpha_i \le C$

Express $\mathbf{w}$ and substitute:

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$

$\sum_i \alpha_i y_i = 0$

Page 63

Solving the SVM: Dual

$\max_{\boldsymbol\alpha}\ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right), \qquad 0 \le \alpha_i \le C$

Express $\mathbf{w}$ and substitute:

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$

$\sum_i \alpha_i y_i = 0$

(the first equation is the dual representation of $\mathbf{w}$)

Page 64

Solving the SVM: Dual

$\max_{\boldsymbol\alpha}\ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right), \qquad 0 \le \alpha_i \le C$

Express $\mathbf{w}$ and substitute:

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$

$\sum_i \alpha_i y_i = 0$

(the second equation is a “balance” condition on the $\alpha_i$)

Page 65

Solving the SVM: Dual

$\max_{\boldsymbol\alpha}\ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i\left(f(\mathbf{x}_i)\,y_i - 1\right), \qquad 0 \le \alpha_i \le C$

Express $\mathbf{w}$ and substitute:

$\max_{\boldsymbol\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j, \qquad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0$

Page 66

Solving the SVM: Dual

$\max_{\boldsymbol\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j, \qquad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0$

Page 67

Solving the SVM: Dual

$\max_{\boldsymbol\alpha}\ \mathbf{1}^T\boldsymbol\alpha - \frac{1}{2}\boldsymbol\alpha^T\left(\mathbf{K} \circ \mathbf{Y}\right)\boldsymbol\alpha, \qquad 0 \le \boldsymbol\alpha \le C, \quad \mathbf{y}^T\boldsymbol\alpha = 0$

where $K_{ij} = \mathbf{x}_i^T\mathbf{x}_j$ and $Y_{ij} = y_i y_j$.

Page 68

Solving the SVM: Dual

$\min_{\boldsymbol\alpha}\ \frac{1}{2}\boldsymbol\alpha^T\left(\mathbf{K} \circ \mathbf{Y}\right)\boldsymbol\alpha - \mathbf{1}^T\boldsymbol\alpha, \qquad \boldsymbol\alpha \ge 0, \quad -\boldsymbol\alpha \ge -C, \quad \mathbf{y}^T\boldsymbol\alpha = 0$

Then find $b$ from the condition*:

$f(\mathbf{x}_i)\,y_i = 1 \text{ if } 0 < \alpha_i < C$

*see homework, it’s actually not that easy!
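
This is again a QP in standard form, so quadprog applies. A hedged sketch (ours, not from the slides), reusing the numeric X, y from the primal quadprog sketch above:

library(quadprog)
n <- nrow(X)
K <- X %*% t(X)                 # linear kernel matrix, K_ij = x_i' x_j
Y <- y %o% y                    # Y_ij = y_i y_j
C <- 1
Dmat <- K * Y + diag(1e-8, n)   # K ∘ Y, plus a tiny ridge since solve.QP needs positive definiteness
dvec <- rep(1, n)               # the 1' alpha term
Amat <- cbind(y, diag(n), -diag(n))        # y' alpha = 0 (equality), alpha >= 0, -alpha >= -C
bvec <- c(0, rep(0, n), rep(-C, n))
alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
w <- colSums(alpha * y * X)     # dual representation: w = sum_i alpha_i y_i x_i
sv <- which(alpha > 1e-5 & alpha < C - 1e-5)    # on-margin support vectors
b <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)  # f(x_i) y_i = 1 and y_i = ±1 give b = y_i - w'x_i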

Page 69


Support vectors

Page 70

Support vectors

(Figure: the dataset with each point labeled by its $\alpha_i$ value: 0, 0.5, 1, or C.)

$\sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C$

Page 71

Sparsity

The dual solution is often very sparse; this allows the optimization to be performed efficiently via the “working set” approach.

Page 72

Kernels

$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$

$f(\mathbf{x}) = \sum_i \alpha_i y_i\, \mathbf{x}_i^T\mathbf{x} + b$

$f(\mathbf{x}) = \sum_i \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b$

Page 73

Kernels

$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$

$f(\mathbf{x}) = \sum_i \alpha_i y_i\, \mathbf{x}_i^T\mathbf{x} + b$

$f(\mathbf{x}) = \sum_i \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b$

($K$ is the kernel function)

Page 74

Kernels

$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$

$f(\mathbf{x}) = \sum_i \alpha_i y_i\, \mathbf{x}_i^T\mathbf{x} + b$

$f(\mathbf{x}) = \sum_i \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b$

Examples:

$f(x) = w_1 x + w_2 x^2 + b$

$f(\mathbf{x}) = \sum_i \alpha_i y_i \exp\left(-\|\mathbf{x}_i - \mathbf{x}\|^2\right) + b$
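
The second example is the Gaussian (RBF) kernel; in e1071 it is selected with kernel = 'radial', whose gamma parameter plays the role of the scaling inside $\exp(-\gamma\|\mathbf{x}_i - \mathbf{x}\|^2)$. A brief sketch, reusing the toy data from before:

library(e1071)
m <- svm(X, factor(y), kernel = 'radial', gamma = 1, cost = 1)
predict(m, X)   # fitted labels; cost is the C of the soft-margin objective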

Page 75

Quiz

SVM is a __________ linear classifier.

Margin maximization can be achieved via minimization of ______________.

SVM uses _____ loss and _______ regularization.

Besides hinge loss I also know ____ loss and ___ loss.

SVM in both primal and dual form is solved using ________ programming.

Page 76

Quiz

In primal formulation we solve for parameter vector ___. In dual formulation we solve for ___ instead.

_____ form of SVM is typically sparse.

Support vectors are those training points for which _______.

The relation between primal and dual variables is: $\_\_\_ = \sum_i \_\_\_\_\_\_$

A kernel is a generalization of the _____ product.

Page 77