Institute of Computer Science - Support Vector Machines... performance evaluation, statistical learning...
TRANSCRIPT
So far…
Supervised machine learning
Linear models
Least squares regression
Fisher’s discriminant, Perceptron, Logistic model
Non-linear models
Neural networks, Decision trees, Association rules
Unsupervised machine learning
Clustering/EM, PCA
Generic scaffolding
Probabilistic modeling, ML/MAP estimation
Performance evaluation, Statistical learning theory
Linear algebra, Optimization methods
Coming up next
Supervised machine learning
Linear models
Least squares regression, SVM
Fisher’s discriminant, Perceptron, Logistic regression, SVM
Non-linear models
Neural networks, Decision trees, Association rules
SVM, Kernel-XXX
Unsupervised machine learning
Clustering/EM, PCA, Kernel-XXX
Generic scaffolding
Probabilistic modeling, ML/MAP estimation
Performance evaluation, Statistical learning theory
Linear algebra, Optimization methods
Kernels
First things first
SVM ($y \in \{-1, 1\}$):
library('e1071')
m <- svm(X, factor(y), kernel = 'linear')   # factor(y) forces classification rather than regression
predict(m, newX)
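For concreteness, a minimal end-to-end sketch on synthetic, linearly separable data (all names and numbers here are illustrative):

library(e1071)
set.seed(0)
X <- rbind(matrix(rnorm(40, mean = -2), ncol = 2),   # 20 points around (-2, -2)
           matrix(rnorm(40, mean =  2), ncol = 2))   # 20 points around ( 2,  2)
y <- factor(rep(c(-1, 1), each = 20))
m <- svm(X, y, kernel = 'linear')
newX <- matrix(c(-2, -2, 2, 2), ncol = 2, byrow = TRUE)
predict(m, newX)   # expected: -1  1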
Quiz
This line is called …
This vector is …
Those lines are …
$f(\mathbf{x}) = {?}$
$\mathbf{x}_1 = {?}$   $y_1 = {?}$
Functional margin of $\mathbf{x}_1$?
Geometric margin of $\mathbf{x}_1$?
Distance to origin?
Quiz (answers)
This line is called the separating hyperplane.
This vector is the normal $\mathbf{w}$.
Those lines are isolines (level lines).
$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$
$\mathbf{x}_1 = (2, 6)$; $y_1 = -1$
Functional margin: $y_1 \cdot f(\mathbf{x}_1) \approx 2$
Geometric margin: $y_1 \cdot f(\mathbf{x}_1)/|\mathbf{w}| \approx 3\sqrt{2}$
Distance to origin: $d = b/|\mathbf{w}|$
Quiz
Suppose we scale $\mathbf{w}$ and $b$ by some constant (example: $\mathbf{w} \to 2\mathbf{w}$, $b = 0$). Will it:
Affect the separating hyperplane? How?
No: $\mathbf{w}^T\mathbf{x} + b = 0 \Leftrightarrow 2\mathbf{w}^T\mathbf{x} + 2b = 0$
Affect the functional margins? How?
Yes: $(2\mathbf{w}^T\mathbf{x} + 2b)\, y = 2 \cdot (\mathbf{w}^T\mathbf{x} + b)\, y$
Affect the geometric margins? How?
No: $\frac{2\mathbf{w}^T\mathbf{x} + 2b}{|2\mathbf{w}|} = \frac{\mathbf{w}^T\mathbf{x} + b}{|\mathbf{w}|}$
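A quick numeric check of these answers in R (the numbers are illustrative, not those of the figure):

norm2 <- function(v) sqrt(sum(v^2))
f <- function(w, b, x) sum(w * x) + b
w <- c(1, -1); b <- 2; x1 <- c(2, 6); y1 <- -1
y1 * f(w, b, x1)                    # functional margin: 2
y1 * f(2*w, 2*b, x1)                # scaling doubles it: 4
y1 * f(w, b, x1) / norm2(w)         # geometric margin: ~1.41
y1 * f(2*w, 2*b, x1) / norm2(2*w)   # unchanged: ~1.41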
Which classifier is best?
Maximal margin classifier
Why maximal margin?
Well-defined, single stable solution
Noise-tolerant
Small parameterization
(Fairly) efficient algorithms exist for finding it
Maximal margin: Separable case
$f(\mathbf{x}) = 1$
$f(\mathbf{x}) = -1$
$\forall i:\ f(\mathbf{x}_i)\, y_i \ge 1$
The (geometric) distance to the isoline $f(\mathbf{x}) = 1$ is:
$d = \frac{f(\mathbf{x})}{|\mathbf{w}|} = \frac{1}{|\mathbf{w}|}$
Maximal margin: Separable case
Among all linear classifiers $(\mathbf{w}, b)$ which keep all points at a functional margin of 1 or more, we look for the one with the largest distance $d$ to the corresponding isolines, i.e. the largest geometric margin.
As $d = \frac{1}{|\mathbf{w}|}$, this is equivalent to finding the classifier with minimal $|\mathbf{w}|$, which in turn is equivalent to finding the classifier with minimal $|\mathbf{w}|^2$.
Compare
"Generic" linear classification (separable case):
Find $(\mathbf{w}, b)$ such that all points are classified correctly,
i.e. $f(\mathbf{x}_i)\, y_i > 0$.
Maximal margin classification (separable case):
Find $(\mathbf{w}, b)$ such that all points are classified correctly with a fixed functional margin,
i.e. $f(\mathbf{x}_i)\, y_i \ge 1$,
and $|\mathbf{w}|^2$ is minimal.
Remember
SVM optimization problem (separable case):
$\min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2$
so that
$(\mathbf{w}^T\mathbf{x}_i + b)\, y_i \ge 1$
General case ("soft margin")
The same, but we also penalize all margin violations.
SVM optimization problem:
$\min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2 + C \sum_i \xi_i$
where, with margin $m_i = f(\mathbf{x}_i)\, y_i$,
$\xi_i = (1 - f(\mathbf{x}_i)\, y_i)_+ = (1 - m_i)_+ = \mathrm{hinge}(m_i)$
Hinge loss: $\mathrm{hinge}(m) = (1 - m)_+$
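As a small sketch, the hinge loss and the resulting soft-margin objective can be written directly in R (function and variable names here are mine, for illustration only):

hinge <- function(m) pmax(1 - m, 0)      # (1 - m)_+, elementwise

# soft-margin SVM objective for a given (w, b), assuming y in {-1, 1}:
svm_objective <- function(w, b, X, y, C) {
  m <- as.vector(X %*% w + b) * y        # margins m_i = f(x_i) y_i
  0.5 * sum(w^2) + C * sum(hinge(m))     # 1/2 |w|^2 + C * sum of slacks
}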
Classification loss functions
"Generic" classification: $\min_{\mathbf{w},b} \sum_i [m_i < 0]$
Perceptron: $\min_{\mathbf{w},b} \sum_i (-m_i)_+$
Least squares classification*: $\min_{\mathbf{w},b} \sum_i (m_i - 1)^2$
Boosting: $\min_{\mathbf{w},b} \sum_i \exp(-m_i)$
Logistic regression: $\min_{\mathbf{w},b} \sum_i \log(1 + e^{-m_i})$
Regularized logistic regression: $\min_{\mathbf{w},b} \sum_i \log(1 + e^{-m_i}) + \lambda \cdot \frac{1}{2}|\mathbf{w}|^2$
SVM: $\min_{\mathbf{w},b} \sum_i (1 - m_i)_+ + \frac{1}{2C}|\mathbf{w}|^2$
L2-SVM: $\min_{\mathbf{w},b} \sum_i (1 - m_i)_+^2 + \frac{1}{2C}|\mathbf{w}|^2$
L1-regularized L2-SVM: $\min_{\mathbf{w},b} \sum_i (1 - m_i)_+^2 + \frac{1}{2C}\|\mathbf{w}\|_1$
… etc
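These losses are easy to compare visually as functions of the margin $m_i = f(\mathbf{x}_i)\, y_i$; a quick R sketch (the plotting choices are arbitrary):

m <- seq(-3, 3, by = 0.01)
losses <- cbind("0-1"         = as.numeric(m < 0),
                "perceptron"  = pmax(-m, 0),
                "squared"     = (m - 1)^2,
                "exponential" = exp(-m),
                "logistic"    = log(1 + exp(-m)),
                "hinge"       = pmax(1 - m, 0))
matplot(m, losses, type = "l", lty = 1, ylim = c(0, 4),
        xlab = "margin m", ylab = "loss")
legend("topright", legend = colnames(losses), col = 1:6, lty = 1)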
In general
$\min_{\mathbf{w},b} \sum_i \phi(m_i) + \lambda \cdot \Omega(\mathbf{w})$
(model fit + model complexity)
Compare to MAP estimation:
$\max_{\mathrm{Model}} \sum_i \log P(x_i \mid \mathrm{Model}) + \log P(\mathrm{Model})$
i.e.
$\max_{\mathrm{Model}} \log P(\mathrm{Data} \mid \mathrm{Model}) + \log P(\mathrm{Model})$
(likelihood + model prior)
Solving the SVM
$\min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2 + C \sum_i (1 - f(\mathbf{x}_i)\, y_i)_+$
Introducing slack variables $\xi_i$, this is equivalent to:
$\min_{\mathbf{w},b,\boldsymbol{\xi}} \frac{1}{2}|\mathbf{w}|^2 + C \sum_i \xi_i$
such that
$f(\mathbf{x}_i)\, y_i - (1 - \xi_i) \ge 0$, $\quad \xi_i \ge 0$
Quadratic function with linear constraints!
Quadratic programming
Minimize
$f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x} + \mathbf{c}^T \mathbf{x}$
subject to:
$\mathbf{A}\mathbf{x} \ge \mathbf{b}$, $\quad \mathbf{C}\mathbf{x} = \mathbf{d}$
> library(quadprog)
> solve.QP(Q, -c, A, b, neq)   # 5th argument: the number of equality constraints (quadprog calls it meq)
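A hypothetical toy problem showing the interface (note that quadprog expects the constraints as columns of its Amat argument, so that t(Amat) %*% x >= bvec, with the first meq of them treated as equalities):

library(quadprog)
# minimize 1/2 (x1^2 + x2^2) - x1 - x2  subject to  x1 + x2 = 1,  x1 >= 0,  x2 >= 0
Q <- diag(2)
c_ <- c(-1, -1)
A <- cbind(c(1, 1), c(1, 0), c(0, 1))      # one constraint per column
b <- c(1, 0, 0)
solve.QP(Q, -c_, A, b, meq = 1)$solution   # -> 0.5 0.5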
A popular trick in optimization:
$\min_x f(x)$, s.t. $g(x) \ge 0$
is equivalent to:
$\min_x \max_{\alpha \ge 0} f(x) - \alpha g(x)$
(If $g(x) < 0$, the inner max is $+\infty$, so the outer min enforces feasibility; if $g(x) \ge 0$, the inner max is attained at $\alpha = 0$.)
Solving the SVM: Dual
$\min_{\mathbf{w},b,\boldsymbol{\xi}} \frac{1}{2}|\mathbf{w}|^2 + C \sum_i \xi_i$ such that $f(\mathbf{x}_i)\, y_i - (1 - \xi_i) \ge 0$, $\xi_i \ge 0$
is equivalent to:
$\min_{\mathbf{w},b,\boldsymbol{\xi}} \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta} \ge 0} \frac{1}{2}|\mathbf{w}|^2 + C \sum_i \xi_i - \sum_i \alpha_i \big(f(\mathbf{x}_i)\, y_i - (1 - \xi_i)\big) - \sum_i \beta_i \xi_i$
Grouping the $\xi_i$ terms:
$\min_{\mathbf{w},b,\boldsymbol{\xi}} \max_{\boldsymbol{\alpha} \ge 0,\, \boldsymbol{\beta} \ge 0} \frac{1}{2}|\mathbf{w}|^2 + \sum_i \xi_i (C - \alpha_i - \beta_i) - \sum_i \alpha_i \big(f(\mathbf{x}_i)\, y_i - 1\big)$
At the optimum $C - \alpha_i - \beta_i = 0$ (otherwise the min over $\xi_i$ would be $-\infty$), and since $\beta_i \ge 0$ this means $0 \le \alpha_i \le C$. Hence:
$\min_{\mathbf{w},b} \max_{\boldsymbol{\alpha}} \frac{1}{2}|\mathbf{w}|^2 - \sum_i \alpha_i \big(f(\mathbf{x}_i)\, y_i - 1\big)$, $\quad 0 \le \alpha_i \le C$
Sparsity: $\alpha_i$ is nonzero only for those points which have $f(\mathbf{x}_i)\, y_i - 1 \le 0$.
Now swap the min and the max (this can be done, in particular, because everything is nice and convex):
$\max_{\boldsymbol{\alpha}} \min_{\mathbf{w},b} \frac{1}{2}|\mathbf{w}|^2 - \sum_i \alpha_i \big(f(\mathbf{x}_i)\, y_i - 1\big)$, $\quad 0 \le \alpha_i \le C$
Next, solve the inner (unconstrained) min as usual:
$\nabla_{\mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0$
$\nabla_b = -\sum_i \alpha_i y_i = 0$
Express $\mathbf{w}$ and substitute:
$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ (the dual representation)
$\sum_i \alpha_i y_i = 0$ (the "balance" condition)
This yields:
$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$
$0 \le \alpha_i \le C$, $\quad \sum_i \alpha_i y_i = 0$
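For completeness, the substitution written out (the last step uses $\sum_i \alpha_i y_i = 0$, which kills the $b$ term):
$\frac{1}{2}\Big|\sum_i \alpha_i y_i \mathbf{x}_i\Big|^2 - \sum_i \alpha_i \Big(\Big(\sum_j \alpha_j y_j \mathbf{x}_j\Big)^T \mathbf{x}_i + b\Big) y_i + \sum_i \alpha_i = \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j - \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_j^T \mathbf{x}_i - b \sum_i \alpha_i y_i + \sum_i \alpha_i = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$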
Solving the SVM: Dual
In matrix form:
$\max_{\boldsymbol{\alpha}} \mathbf{1}^T \boldsymbol{\alpha} - \frac{1}{2} \boldsymbol{\alpha}^T (\mathbf{K} \circ \mathbf{Y}) \boldsymbol{\alpha}$
$0 \le \boldsymbol{\alpha} \le C$, $\quad \mathbf{y}^T \boldsymbol{\alpha} = 0$
where $K_{ij} = \mathbf{x}_i^T \mathbf{x}_j$, $Y_{ij} = y_i y_j$
Solving the SVM: Dual
$\min_{\boldsymbol{\alpha}} \frac{1}{2} \boldsymbol{\alpha}^T (\mathbf{K} \circ \mathbf{Y}) \boldsymbol{\alpha} - \mathbf{1}^T \boldsymbol{\alpha}$
$\boldsymbol{\alpha} \ge 0$, $\quad -\boldsymbol{\alpha} \ge -C$, $\quad \mathbf{y}^T \boldsymbol{\alpha} = 0$
Then find $b$ from the condition*:
$f(\mathbf{x}_i)\, y_i = 1$ if $0 < \alpha_i < C$
*see homework, it's actually not that easy!
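Putting it together, a sketch of solving this dual with quadprog on toy two-dimensional data (the data and all names are illustrative; a tiny ridge is added to K ∘ Y because solve.QP requires a strictly positive definite matrix):

library(quadprog)
set.seed(1)
n <- 40
X <- rbind(matrix(rnorm(n, mean = -2), ncol = 2),   # 20 points per class
           matrix(rnorm(n, mean =  2), ncol = 2))
y <- rep(c(-1, 1), each = n/2)
C <- 1
K <- X %*% t(X)                           # linear kernel: K_ij = x_i' x_j
D <- K * (y %*% t(y)) + diag(1e-8, n)     # K o Y, jittered for positive definiteness
A <- cbind(y, diag(n), -diag(n))          # columns: y'a = 0 (equality), a >= 0, -a >= -C
b0 <- c(0, rep(0, n), rep(-C, n))
alpha <- solve.QP(D, rep(1, n), A, b0, meq = 1)$solution
w  <- colSums(alpha * y * X)                     # w = sum_i alpha_i y_i x_i
sv <- which(alpha > 1e-6 & alpha < C - 1e-6)     # margin support vectors
b  <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)  # from f(x_i) y_i = 1 (assumes sv is non-empty)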
Support vectors
[Figure: training points annotated with their dual coefficients; most have $\alpha_i = 0$, a few show values such as 0.5 and 1, and the marked margin violators have $\alpha_i = C$.]
Support vectors
$\sum_i \alpha_i y_i = 0$
$0 \le \alpha_i \le C$
Sparsity
The dual solution is often very sparse; this allows the optimization to be performed efficiently (the "working set" approach).
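Continuing the quadprog sketch above, the sparsity is easy to observe by binning the $\alpha_i$ (the thresholds are arbitrary numerical tolerances):

eps <- 1e-6
table(cut(alpha, breaks = c(-Inf, eps, C - eps, Inf),
          labels = c("alpha = 0", "0 < alpha < C", "alpha = C")))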
Kernels
$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$
$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
$f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
$f(\mathbf{x}) = \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$, where $K$ is the kernel function
Examples:
$f(x) = w_1 x + w_2 x^2 + b$
$f(\mathbf{x}) = \sum_i \alpha_i y_i \exp(-|\mathbf{x}_i - \mathbf{x}|^2) + b$
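The e1071 interface from the beginning of the lecture accepts nonlinear kernels directly and predicts via this dual form; a sketch on illustrative data that is not linearly separable:

library(e1071)
set.seed(2)
X <- matrix(runif(400, -2, 2), ncol = 2)
y <- factor(ifelse(X[, 1]^2 + X[, 2]^2 < 2, 1, -1))     # circular class boundary
m <- svm(X, y, kernel = 'radial', gamma = 1, cost = 1)  # K(u, v) = exp(-gamma |u - v|^2)
mean(predict(m, X) == y)                                # training accuracy, close to 1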
Quiz
SVM is a __________ linear classifier.
Margin maximization can be achieved via minimization of ______________.
SVM uses _____ loss and _______ regularization.
Besides hinge loss I also know ____ loss and ___ loss.
SVM in both primal and dual form is solved using ________ programming.
Quiz
In the primal formulation we solve for the parameter vector ___. In the dual formulation we solve for ___ instead.
The _____ form of the SVM is typically sparse.
Support vectors are those training points for which _______.
The relation between primal and dual variables is: $\_\_\_ = \sum_i \_\_\_\_$.
A kernel is a generalization of the _____ product.