Support Vector Machines
CMPUT 466/551, Nilanjan Ray
Agenda
• Linear support vector classifier
  – Separable case
  – Non-separable case
• Non-linear support vector classifier
• Kernels for classification
• SVM as a penalized method
• Support vector regression
Linear Support Vector Classifier: Separable Case
Primal problem:
$$\min_{\beta,\,\beta_0}\;\frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i(x_i^T\beta + \beta_0) \ge 1,\; i = 1,\dots,N.$$

Dual problem (simpler optimization):
$$\max_{\alpha}\;\sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\alpha_i\alpha_k y_i y_k\, x_i^T x_k \quad \text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\; \alpha_i \ge 0.$$

Dual problem in matrix-vector form:
$$\max_{\alpha}\;\mathbf{1}^T\alpha - \frac{1}{2}\alpha^T\,[\mathrm{diag}(y)\,XX^T\,\mathrm{diag}(y)]\,\alpha \quad \text{subject to} \quad y^T\alpha = 0,\; \alpha \ge 0.$$

Compare with the implementation in simple_svm.m.
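As a concrete illustration (this is a minimal sketch, not the course's simple_svm.m), the dual above can be solved with quadprog from MATLAB's Optimization Toolbox; the variable names X (N-by-p data matrix) and y (N-by-1 labels in {-1,+1}) are assumptions.

```matlab
% Minimal sketch: separable-case dual solved as a quadratic program.
% Assumes X is N-by-p (one sample per row) and y is N-by-1 with entries +/-1.
N = size(X, 1);
H = (y*y') .* (X*X') + 1e-8*eye(N);   % diag(y)*X*X'*diag(y), plus a tiny ridge for stability
f = -ones(N, 1);                      % quadprog minimizes, so negate the linear term 1'*alpha
Aeq = y'; beq = 0;                    % equality constraint y'*alpha = 0
lb = zeros(N, 1);                     % alpha_i >= 0 (no upper bound in the separable case)
alpha = quadprog(H, f, [], [], Aeq, beq, lb, []);
```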
Linear SVC (AKA Optimal Hyperplane)…
After solving the dual problem we obtain the $\alpha_i$'s; how do we construct the hyperplane from here?

To obtain $\beta$, use the equation:
$$\beta = \sum_i \alpha_i y_i x_i.$$

How do we obtain $\beta_0$? We need the complementary slackness conditions, which are part of the Karush-Kuhn-Tucker (KKT) conditions for the primal optimization problem.

Complementary slackness means:
$$\alpha_i\,[y_i(x_i^T\beta + \beta_0) - 1] = 0, \qquad \alpha_i > 0 \;\Rightarrow\; y_i(x_i^T\beta + \beta_0) = 1, \qquad y_i(x_i^T\beta + \beta_0) > 1 \;\Rightarrow\; \alpha_i = 0.$$

Training points corresponding to nonzero (positive) $\alpha_i$'s are support vectors.

$\beta_0$ is computed from the points for which $\alpha_i > 0$, using $y_i(x_i^T\beta + \beta_0) = 1$.
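Continuing the sketch above, the hyperplane can be recovered from the $\alpha_i$'s as follows; the threshold 1e-6 used to identify support vectors is an assumed numerical tolerance.

```matlab
% Recover the hyperplane from the dual solution (separable case).
beta = X' * (alpha .* y);              % beta = sum_i alpha_i y_i x_i
sv = find(alpha > 1e-6);               % support vectors: alpha_i > 0
beta0 = mean(y(sv) - X(sv,:) * beta);  % from y_i(x_i'*beta + beta0) = 1, averaged over SVs
yhat = sign(X * beta + beta0);         % predicted labels for the training points
```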
Optimal Hyperplane/Support Vector Classifier
An interesting interpretation of the equality constraint in the dual problem,
$$\sum_i \alpha_i y_i = 0,$$
is as follows: the $\alpha_i$ act as forces on both sides of the hyperplane, and the net force on the hyperplane is zero.
Linear Support Vector Classifier: Non-separable Case
From Separable to Non-separable
In the separable case the margin width is $1/\|\beta\|$ on each side of the hyperplane, and if in addition $\|\beta\| = 1$, then the margin width is 1. This is the reason that in the primal problem we have the inequality constraints

$$y_i(x_i^T\beta + \beta_0) \ge 1, \quad \forall i. \qquad (1)$$

These inequality constraints ensure that no point lies inside the margin area. In the non-separable case such constraints must be violated, so they are modified to

$$y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; \forall i.$$

So the primal optimization problem becomes:

$$\min_{\beta,\,\beta_0,\,\xi}\;\frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i,\; \xi_i \ge 0,\; \forall i.$$

The positive parameter $\gamma$ controls the extent to which points are allowed to violate (1).
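To make the role of the slack variables concrete, here is a small MATLAB sketch that evaluates the slacks $\xi_i$ and the primal objective for a given candidate hyperplane; the names beta, beta0 and gamma_cost (the parameter $\gamma$) are assumed to come from elsewhere, e.g. the solver sketched earlier.

```matlab
% Slack variables for a candidate hyperplane (beta, beta0):
% xi_i = max(0, 1 - y_i*(x_i'*beta + beta0)); xi_i > 0 means point i violates (1).
xi = max(0, 1 - y .* (X*beta + beta0));
n_violations = sum(xi > 0);                        % points inside the margin or misclassified
primal_obj = 0.5*(beta'*beta) + gamma_cost*sum(xi); % objective of the non-separable primal
```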
Non-separable Case: Finding Dual Function
• Lagrangian function minimization:
$$L = \frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\,[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)] - \sum_{i=1}^{N}\mu_i\xi_i,$$
$$\text{subject to: } \alpha_i \ge 0,\; \mu_i \ge 0,\; \xi_i \ge 0,\; \forall i.$$

• Solve:
$$\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad (1)$$
$$\frac{\partial L}{\partial \beta_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y_i = 0, \qquad (2)$$
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = \gamma - \mu_i, \;\forall i. \qquad (3)$$

• Substitute (1), (2) and (3) in L to form the dual function:
$$q(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i}\sum_{k}\alpha_i\alpha_k y_i y_k\, x_i^T x_k, \quad \text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\; 0 \le \alpha_i \le \gamma.$$
Dual optimization: dual variables to primal variables
After solving the dual problem we obtain the $\alpha_i$'s; how do we construct the hyperplane from here?

To obtain $\beta$, use the equation:
$$\beta = \sum_i \alpha_i y_i x_i.$$

How do we obtain $\beta_0$? We use the complementary slackness conditions for the primal optimization problem:
$$\alpha_i\,[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)] = 0,$$
$$\mu_i\,\xi_i = 0,$$
$$\alpha_i > 0 \;\Rightarrow\; y_i(x_i^T\beta + \beta_0) = 1 - \xi_i,$$
$$y_i(x_i^T\beta + \beta_0) > 1 - \xi_i \;\Rightarrow\; \alpha_i = 0.$$

Training points corresponding to nonzero $\alpha_i$'s are support vectors.

$\beta_0$ is computed from the points for which $y_i(x_i^T\beta + \beta_0) = 1$ and $\xi_i = 0$, i.e., the margin points with $0 < \alpha_i < \gamma$. (The average is taken over such points.)

$\gamma$ is chosen by cross-validation; typically it should be greater than 1/N.
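A minimal MATLAB sketch of the non-separable dual, again via quadprog: the only change from the separable case is the box constraint $\alpha_i \le \gamma$, and $\beta_0$ is averaged over the margin points with $0 < \alpha_i < \gamma$. The names gamma_cost and tol are assumptions.

```matlab
% Non-separable dual: same QP as before, but with the box constraint 0 <= alpha_i <= gamma.
N = size(X, 1);
H = (y*y') .* (X*X') + 1e-8*eye(N);    % small ridge for numerical stability
alpha = quadprog(H, -ones(N,1), [], [], y', 0, zeros(N,1), gamma_cost*ones(N,1));
beta = X' * (alpha .* y);
tol = 1e-6;
margin = find(alpha > tol & alpha < gamma_cost - tol);  % points with 0 < alpha_i < gamma (xi_i = 0)
beta0 = mean(y(margin) - X(margin,:) * beta);           % average over the margin support vectors
```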
Example: Non-separable Case
Non-linear support vector classifier
Let's take a look at the solution of the optimal separating hyperplane in terms of the dual variables:
$$f(x) = x^T\beta + \beta_0 = \sum_{i=1}^{N}\alpha_i y_i\, x^T x_i + \beta_0, \qquad \beta_0 = y_j - \sum_{i=1}^{N}\alpha_i y_i\, x_j^T x_i \;\; \text{for some } \alpha_j > 0.$$

Let's take a look at the dual cost function for the optimal separating hyperplane:
$$q(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i}\sum_{k}\alpha_i\alpha_k y_i y_k\, x_i^T x_k, \quad \text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\; \alpha_i \ge 0.$$

An invaluable observation: all these equations involve the feature points only through inner products.
Non-linear support vector classifier…
An invaluable observation: all these equations involve the feature points only through inner products.

This property is particularly convenient when the input feature space has a large dimension.

For example, suppose we want a classifier that is additive in the feature components, not linear. Such a classifier is expected to perform better on problems with a non-linear classification boundary:
$$f(x) = \sum_{p=1}^{M}\beta_p h_p(x) + \beta_0.$$

The $h_p$ are non-linear functions of the input features. Example: the input space is $x = (x_1, x_2)$ and the $h$'s are second-order polynomials:
$$h_1(x_1,x_2) = 1,\quad h_2(x_1,x_2) = \sqrt{2}\,x_1,\quad h_3(x_1,x_2) = \sqrt{2}\,x_2,$$
$$h_4(x_1,x_2) = x_1^2,\quad h_5(x_1,x_2) = x_2^2,\quad h_6(x_1,x_2) = \sqrt{2}\,x_1 x_2,$$
so that the classifier is now non-linear:
$$f(x_1,x_2) = \beta_0 + \beta_1 + \sqrt{2}\,\beta_2 x_1 + \sqrt{2}\,\beta_3 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \sqrt{2}\,\beta_6 x_1 x_2.$$

Because of the inner-product property, this non-linear classifier can still be computed by the same methods used for finding the linear optimal hyperplane.
Non-linear support vector classifier…
Denote: $h(x) = [h_1(x), \dots, h_M(x)]^T$.

The non-linear classifier:
$$f(x) = \sum_{p=1}^{M}\beta_p h_p(x) + \beta_0 = h(x)^T\beta + \beta_0.$$

The dual cost function:
$$q(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i}\sum_{k}\alpha_i\alpha_k y_i y_k\, h(x_i)^T h(x_k), \quad \text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\; \alpha_i \ge 0.$$

The non-linear classifier in dual variables:
$$f(x) = \sum_{i=1}^{N}\alpha_i y_i\, h(x)^T h(x_i) + \beta_0, \qquad \beta_0 = y_j - \sum_{i=1}^{N}\alpha_i y_i\, h(x_j)^T h(x_i) \;\; \text{for some } \alpha_j > 0.$$

Thus, in the dual variable space the non-linear classifier is expressed using inner products only!
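A minimal MATLAB sketch of the kernelized decision function in dual variables; it assumes that alpha, beta0, the training data Xtr, ytr, and a kernel function handle kfun(A, B) (returning the Gram matrix between the rows of A and B) are already available.

```matlab
% Kernelized decision function: f(x) = sum_i alpha_i y_i K(x_i, x) + beta0.
% kfun(A, B) is assumed to return the matrix [K(a_i, b_j)] for rows a_i of A and b_j of B.
svm_decision = @(Xnew) kfun(Xnew, Xtr) * (alpha .* ytr) + beta0;
yhat = sign(svm_decision(Xtest));   % class labels for new points Xtest (one point per row)
```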
Non-linear support vector classifier…
With the previous non-linear feature vector,
$$h_1(x_1,x_2) = 1,\quad h_2(x_1,x_2) = \sqrt{2}\,x_1,\quad h_3(x_1,x_2) = \sqrt{2}\,x_2,$$
$$h_4(x_1,x_2) = x_1^2,\quad h_5(x_1,x_2) = x_2^2,\quad h_6(x_1,x_2) = \sqrt{2}\,x_1 x_2,$$
the inner product takes a particularly interesting form:
$$h(a_1,a_2)^T h(b_1,b_2) = 1 + 2a_1b_1 + 2a_2b_2 + a_1^2b_1^2 + a_2^2b_2^2 + 2a_1a_2b_1b_2 = (1 + a_1b_1 + a_2b_2)^2 = K\big((a_1,a_2),(b_1,b_2)\big) = K\big((b_1,b_2),(a_1,a_2)\big).$$

K is called a kernel function. Computational savings: instead of 6 products, we compute 3 products.
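A quick MATLAB check of this identity on a random pair of 2-D points; the handle name h2 for the degree-2 feature map is an assumption.

```matlab
% Verify that h(a)'*h(b) = (1 + a'*b)^2 for the degree-2 feature map above.
h2 = @(x) [1; sqrt(2)*x(1); sqrt(2)*x(2); x(1)^2; x(2)^2; sqrt(2)*x(1)*x(2)];
a = randn(2,1); b = randn(2,1);
lhs = h2(a)' * h2(b);        % explicit 6-dimensional inner product
rhs = (1 + a'*b)^2;          % kernel evaluation
disp(abs(lhs - rhs))         % ~0 up to rounding error
```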
Kernel Functions
So, if the inner product can be expressed in terms of a symmetric function K,
$$h(x_i)^T h(x_j) = K(x_i, x_j),$$
then we can apply the SV tool.

Well, not quite! We need another property of K, called positive (semi-)definiteness. Why? The dual function answers this question.
$$q(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i}\sum_{k}\alpha_i\alpha_k y_i y_k\, h(x_i)^T h(x_k) = \sum_i \alpha_i - \frac{1}{2}\sum_{i}\sum_{k}\alpha_i\alpha_k y_i y_k\, K(x_i, x_k) = \mathbf{1}^T\alpha - \frac{1}{2}\alpha^T[\mathrm{diag}(y)\,\mathbf{K}\,\mathrm{diag}(y)]\,\alpha,$$
where $\mathbf{K} = [K_{ik}]$ with $K_{ik} = K(x_i, x_k) = h(x_i)^T h(x_k)$.

Maximizing the dual is a convex optimization problem (the dual objective is concave) when the matrix $\mathbf{K}$ is positive semi-definite.

Thus the kernel function K must satisfy two properties: symmetry and positive (semi-)definiteness.
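A small MATLAB sketch that checks these two properties numerically on a Gram matrix built from the training points; the kernel handle kfun and the tolerances are assumptions.

```matlab
% Check symmetry and positive semi-definiteness of the Gram matrix K = [K(x_i, x_k)].
K = kfun(Xtr, Xtr);                          % N-by-N Gram matrix on the training points
is_symmetric = norm(K - K', 'fro') < 1e-10;  % K(x_i, x_k) = K(x_k, x_i)
min_eig = min(eig((K + K')/2));              % symmetrize before eig for numerical safety
is_psd = min_eig > -1e-10;                   % all eigenvalues (numerically) non-negative
```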
Kernel Functions…
Thus we need h(x)'s that define a kernel function.
In practice we don’t even need to define h(x)! All we need is the kernel function!
Example kernel functions:
• dth-degree polynomial: $K(x, x') = (1 + x^T x')^d$
• Radial kernel: $K(x, x') = \exp(-\|x - x'\|^2 / c)$
• Neural network: $K(x, x') = \tanh(k_1\, x^T x' + k_2)$
The real question now is how to design a kernel function.
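The three example kernels above, written as MATLAB function handles operating on matrices with one point per row; the parameter names d, c, k1, k2 follow the formulas on this slide, and Xtr is an assumed training matrix (implicit expansion requires MATLAB R2016b or later).

```matlab
% Example kernel functions; A is m-by-p, B is n-by-p, each row is one point.
sqdist      = @(A, B) sum(A.^2, 2) + sum(B.^2, 2)' - 2*(A*B');   % pairwise squared distances
poly_kernel = @(A, B, d) (1 + A*B').^d;                          % d-th degree polynomial
rbf_kernel  = @(A, B, c) exp(-sqdist(A, B)/c);                   % radial (RBF) kernel
nn_kernel   = @(A, B, k1, k2) tanh(k1*(A*B') + k2);              % neural network kernel
K = rbf_kernel(Xtr, Xtr, 1.0);                                   % example Gram matrix, c = 1
```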
Example
SVM as a Penalty Method
With $f(x) = h(x)^T\beta + \beta_0$, the following optimization
$$\min_{\beta,\,\beta_0}\;\sum_{i=1}^{N}[1 - y_i f(x_i)]_{+} + \frac{\lambda}{2}\|\beta\|^2$$
is equivalent to:
$$\min_{\beta,\,\beta_0,\,\xi}\;\frac{1}{2}\|\beta\|^2 + \gamma\sum_{i}\xi_i \quad \text{subject to} \quad y_i f(x_i) \ge 1 - \xi_i,\; \xi_i \ge 0,\; \forall i.$$

SVM is a penalized optimization method for binary classification.
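A short MATLAB sketch that evaluates this penalized (hinge-loss) objective for a linear decision function f(x) = x'*beta + beta0; X, y, beta, beta0 and lambda are assumed, and X can equally be a matrix of basis-function values h(x_i)'.

```matlab
% Penalized hinge-loss objective: sum_i [1 - y_i f(x_i)]_+ + (lambda/2)*||beta||^2.
fx = X*beta + beta0;
obj = sum(max(0, 1 - y.*fx)) + 0.5*lambda*(beta'*beta);
```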
Negative Binomial Log-likelihood (LR Loss Function) Example
This is essentially non-linear logistic regression.
SVM for Regression
The penalty view of SVM leads to regression
With the following optimization,)()( 0 Txhxf
,2
))((min2
1, 0
N
ii xfyV
where, V(.) is a regression loss function.
SV Regression: Loss Functions
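The figure from this slide is not reproduced here. As one commonly used example of such a loss V, the epsilon-insensitive loss of support vector regression can be written in MATLAB as below; epsilon, lambda, beta and beta0 are assumed parameters for illustration.

```matlab
% Epsilon-insensitive loss, a common choice of V for support vector regression:
% V(r) = 0 if |r| <= epsilon, and |r| - epsilon otherwise.
V_eps = @(r, epsilon) max(0, abs(r) - epsilon);
obj = sum(V_eps(y - (X*beta + beta0), 0.1)) + 0.5*lambda*(beta'*beta);  % example with epsilon = 0.1
```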