Support Vector Machines
CMPUT 466/551, Nilanjan Ray
Agenda
• Linear support vector classifier
  – Separable case
  – Non-separable case
• Non-linear support vector classifier
• Kernels for classification
• SVM as a penalized method
• Support vector regression
Linear Support Vector Classifier: Separable Case
Primal problem:
$$\min_{\beta,\,\beta_0}\;\frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i(x_i^T\beta + \beta_0) \ge 1,\; i = 1,\dots,N.$$

Dual problem (simpler optimization):
$$\max_{\alpha}\;\sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\alpha_i\alpha_k y_i y_k\, x_i^T x_k \quad \text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\; \alpha_i \ge 0.$$

Dual problem in matrix-vector form:
$$\max_{\alpha}\;\mathbf{1}^T\alpha - \frac{1}{2}\alpha^T\,[\mathrm{diag}(y)\,XX^T\,\mathrm{diag}(y)]\,\alpha \quad \text{subject to} \quad y^T\alpha = 0,\; \alpha \ge 0.$$

Compare with the implementation in simple_svm.m.
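As a concrete illustration (this is a minimal sketch, not the course's simple_svm.m), the dual above can be solved with quadprog from MATLAB's Optimization Toolbox; the variable names X (N-by-p data matrix) and y (N-by-1 labels in {-1,+1}) are assumptions.

```matlab
% Minimal sketch: separable-case dual solved as a quadratic program.
% Assumes X is N-by-p (one sample per row) and y is N-by-1 with entries +/-1.
N = size(X, 1);
H = (y*y') .* (X*X') + 1e-8*eye(N);   % diag(y)*X*X'*diag(y), plus a tiny ridge for stability
f = -ones(N, 1);                      % quadprog minimizes, so negate the linear term 1'*alpha
Aeq = y'; beq = 0;                    % equality constraint y'*alpha = 0
lb = zeros(N, 1);                     % alpha_i >= 0 (no upper bound in the separable case)
alpha = quadprog(H, f, [], [], Aeq, beq, lb, []);
```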
Linear SVC (AKA Optimal Hyperplane)…
After solving the dual problem we obtain the $\alpha_i$'s; how do we construct the hyperplane from here?

To obtain $\beta$, use the equation:
$$\beta = \sum_i \alpha_i y_i x_i.$$

How do we obtain $\beta_0$? We need the complementary slackness conditions, which are part of the Karush-Kuhn-Tucker (KKT) conditions for the primal optimization problem.

Complementary slackness means:
$$\alpha_i\,[y_i(x_i^T\beta + \beta_0) - 1] = 0, \qquad \alpha_i > 0 \;\Rightarrow\; y_i(x_i^T\beta + \beta_0) = 1, \qquad y_i(x_i^T\beta + \beta_0) > 1 \;\Rightarrow\; \alpha_i = 0.$$

Training points corresponding to nonzero (positive) $\alpha_i$'s are support vectors.

$\beta_0$ is computed from the points for which $\alpha_i > 0$, using $y_i(x_i^T\beta + \beta_0) = 1$.
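Continuing the sketch above, the hyperplane can be recovered from the $\alpha_i$'s as follows; the threshold 1e-6 used to identify support vectors is an assumed numerical tolerance.

```matlab
% Recover the hyperplane from the dual solution (separable case).
beta = X' * (alpha .* y);              % beta = sum_i alpha_i y_i x_i
sv = find(alpha > 1e-6);               % support vectors: alpha_i > 0
beta0 = mean(y(sv) - X(sv,:) * beta);  % from y_i(x_i'*beta + beta0) = 1, averaged over SVs
yhat = sign(X * beta + beta0);         % predicted labels for the training points
```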
Optimal Hyperplane/Support Vector Classifier
An interesting interpretation of the equality constraint in the dual problem,
$$\sum_i \alpha_i y_i = 0,$$
is as follows: the $\alpha_i$ act as forces on both sides of the hyperplane, and the net force on the hyperplane is zero.
Linear Support Vector Classifier: Non-separable Case
From Separable to Non-separable
In the separable case the margin width is $1/\|\beta\|$ on each side of the hyperplane, and if in addition $\|\beta\| = 1$, then the margin width is 1. This is the reason that in the primal problem we have the inequality constraints

$$y_i(x_i^T\beta + \beta_0) \ge 1, \quad \forall i. \qquad (1)$$

These inequality constraints ensure that no point lies inside the margin area. In the non-separable case such constraints must be violated, so they are modified to

$$y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; \forall i.$$

So the primal optimization problem becomes:

$$\min_{\beta,\,\beta_0,\,\xi}\;\frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i,\; \xi_i \ge 0,\; \forall i.$$

The positive parameter $\gamma$ controls the extent to which points are allowed to violate (1).
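To make the role of the slack variables concrete, here is a small MATLAB sketch that evaluates the slacks $\xi_i$ and the primal objective for a given candidate hyperplane; the names beta, beta0 and gamma_cost (the parameter $\gamma$) are assumed to come from elsewhere, e.g. the solver sketched earlier.

```matlab
% Slack variables for a candidate hyperplane (beta, beta0):
% xi_i = max(0, 1 - y_i*(x_i'*beta + beta0)); xi_i > 0 means point i violates (1).
xi = max(0, 1 - y .* (X*beta + beta0));
n_violations = sum(xi > 0);                        % points inside the margin or misclassified
primal_obj = 0.5*(beta'*beta) + gamma_cost*sum(xi); % objective of the non-separable primal
```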
Non-separable Case: Finding Dual Function
• Lagrangian function minimization:
$$L = \frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\,[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)] - \sum_{i=1}^{N}\mu_i\xi_i,$$
$$\text{subject to: } \alpha_i \ge 0,\; \mu_i \ge 0,\; \xi_i \ge 0,\; \forall i.$$

• Solve:
$$\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad (1)$$
$$\frac{\partial L}{\partial \beta_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y_i = 0, \qquad (2)$$
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = \gamma - \mu_i, \;\forall i. \qquad (3)$$

• Substitute (1), (2) and (3) in L to form the dual function:
$$q(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i}\sum_{k}\alpha_i\alpha_k y_i y_k\, x_i^T x_k, \quad \text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\; 0 \le \alpha_i \le \gamma.$$
Dual optimization: dual variables to primal variables
After solving the dual problem we obtain the $\alpha_i$'s; how do we construct the hyperplane from here?

To obtain $\beta$, use the equation:
$$\beta = \sum_i \alpha_i y_i x_i.$$

How do we obtain $\beta_0$? We use the complementary slackness conditions for the primal optimization problem:
$$\alpha_i\,[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)] = 0,$$
$$\mu_i\,\xi_i = 0,$$
$$\alpha_i > 0 \;\Rightarrow\; y_i(x_i^T\beta + \beta_0) = 1 - \xi_i,$$
$$y_i(x_i^T\beta + \beta_0) > 1 - \xi_i \;\Rightarrow\; \alpha_i = 0.$$

Training points corresponding to nonzero $\alpha_i$'s are support vectors.

$\beta_0$ is computed from the points for which $y_i(x_i^T\beta + \beta_0) = 1$ and $\xi_i = 0$, i.e., the margin points with $0 < \alpha_i < \gamma$. (The average is taken over such points.)

$\gamma$ is chosen by cross-validation; typically it should be greater than 1/N.
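A minimal MATLAB sketch of the non-separable dual, again via quadprog: the only change from the separable case is the box constraint $\alpha_i \le \gamma$, and $\beta_0$ is averaged over the margin points with $0 < \alpha_i < \gamma$. The names gamma_cost and tol are assumptions.

```matlab
% Non-separable dual: same QP as before, but with the box constraint 0 <= alpha_i <= gamma.
N = size(X, 1);
H = (y*y') .* (X*X') + 1e-8*eye(N);    % small ridge for numerical stability
alpha = quadprog(H, -ones(N,1), [], [], y', 0, zeros(N,1), gamma_cost*ones(N,1));
beta = X' * (alpha .* y);
tol = 1e-6;
margin = find(alpha > tol & alpha < gamma_cost - tol);  % points with 0 < alpha_i < gamma (xi_i = 0)
beta0 = mean(y(margin) - X(margin,:) * beta);           % average over the margin support vectors
```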
Example: Non-separable Case
Non-linear support vector classifier
Let's take a look at the solution of the optimal separating hyperplane in terms of the dual variables:
$$f(x) = x^T\beta + \beta_0 = \sum_{i=1}^{N}\alpha_i y_i\, x^T x_i + \beta_0, \qquad \beta_0 = y_j - \sum_{i=1}^{N}\alpha_i y_i\, x_j^T x_i \;\; \text{for some } \alpha_j > 0.$$

Let's take a look at the dual cost function for the optimal separating hyperplane:
$$q(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i}\sum_{k}\alpha_i\alpha_k y_i y_k\, x_i^T x_k, \quad \text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\; \alpha_i \ge 0.$$

An invaluable observation: all these equations involve the feature points only through inner products.
Non-linear support vector classifier…
An invaluable observation: all these equations involve the feature points only through inner products.

This property is particularly convenient when the input feature space has a large dimension.

For example, suppose we want a classifier that is additive in the feature components, not linear. Such a classifier is expected to perform better on problems with a non-linear classification boundary:
$$f(x) = \sum_{p=1}^{M}\beta_p h_p(x) + \beta_0.$$

The $h_p$ are non-linear functions of the input features. Example: the input space is $x = (x_1, x_2)$ and the $h$'s are second-order polynomials:
$$h_1(x_1,x_2) = 1,\quad h_2(x_1,x_2) = \sqrt{2}\,x_1,\quad h_3(x_1,x_2) = \sqrt{2}\,x_2,$$
$$h_4(x_1,x_2) = x_1^2,\quad h_5(x_1,x_2) = x_2^2,\quad h_6(x_1,x_2) = \sqrt{2}\,x_1 x_2,$$
so that the classifier is now non-linear:
$$f(x_1,x_2) = \beta_0 + \beta_1 + \sqrt{2}\,\beta_2 x_1 + \sqrt{2}\,\beta_3 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \sqrt{2}\,\beta_6 x_1 x_2.$$

Because of the inner-product property, this non-linear classifier can still be computed by the same methods used for finding the linear optimal hyperplane.
Non-linear support vector classifier…
Denote: $h(x) = [h_1(x), \dots, h_M(x)]^T$.

The non-linear classifier:
$$f(x) = \sum_{p=1}^{M}\beta_p h_p(x) + \beta_0 = h(x)^T\beta + \beta_0.$$

The dual cost function:
$$q(\alpha) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i}\sum_{k}\alpha_i\alpha_k y_i y_k\, h(x_i)^T h(x_k), \quad \text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\; \alpha_i \ge 0.$$

The non-linear classifier in dual variables:
$$f(x) = \sum_{i=1}^{N}\alpha_i y_i\, h(x)^T h(x_i) + \beta_0, \qquad \beta_0 = y_j - \sum_{i=1}^{N}\alpha_i y_i\, h(x_j)^T h(x_i) \;\; \text{for some } \alpha_j > 0.$$

Thus, in the dual variable space the non-linear classifier is expressed using inner products only!
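A minimal MATLAB sketch of the kernelized decision function in dual variables; it assumes that alpha, beta0, the training data Xtr, ytr, and a kernel function handle kfun(A, B) (returning the Gram matrix between the rows of A and B) are already available.

```matlab
% Kernelized decision function: f(x) = sum_i alpha_i y_i K(x_i, x) + beta0.
% kfun(A, B) is assumed to return the matrix [K(a_i, b_j)] for rows a_i of A and b_j of B.
svm_decision = @(Xnew) kfun(Xnew, Xtr) * (alpha .* ytr) + beta0;
yhat = sign(svm_decision(Xtest));   % class labels for new points Xtest (one point per row)
```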
Non-linear support vector classifier…
With the previous non-linear feature vector,
$$h_1(x_1,x_2) = 1,\quad h_2(x_1,x_2) = \sqrt{2}\,x_1,\quad h_3(x_1,x_2) = \sqrt{2}\,x_2,$$
$$h_4(x_1,x_2) = x_1^2,\quad h_5(x_1,x_2) = x_2^2,\quad h_6(x_1,x_2) = \sqrt{2}\,x_1 x_2,$$
the inner product takes a particularly interesting form:
$$h(a_1,a_2)^T h(b_1,b_2) = 1 + 2a_1b_1 + 2a_2b_2 + a_1^2b_1^2 + a_2^2b_2^2 + 2a_1a_2b_1b_2 = (1 + a_1b_1 + a_2b_2)^2 = K\big((a_1,a_2),(b_1,b_2)\big) = K\big((b_1,b_2),(a_1,a_2)\big).$$

K is called a kernel function. Computational savings: instead of 6 products, we compute 3 products.
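A quick MATLAB check of this identity on a random pair of 2-D points; the handle name h2 for the degree-2 feature map is an assumption.

```matlab
% Verify that h(a)'*h(b) = (1 + a'*b)^2 for the degree-2 feature map above.
h2 = @(x) [1; sqrt(2)*x(1); sqrt(2)*x(2); x(1)^2; x(2)^2; sqrt(2)*x(1)*x(2)];
a = randn(2,1); b = randn(2,1);
lhs = h2(a)' * h2(b);        % explicit 6-dimensional inner product
rhs = (1 + a'*b)^2;          % kernel evaluation
disp(abs(lhs - rhs))         % ~0 up to rounding error
```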
Kernel Functions
So, if the inner product can be expressed in terms of a symmetric function K,
$$h(x_i)^T h(x_j) = K(x_i, x_j),$$
then we can apply the SV tool.

Well, not quite! We need another property of K, called positive (semi-)definiteness. Why? The dual function answers this question.
$$q(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i}\sum_{k}\alpha_i\alpha_k y_i y_k\, h(x_i)^T h(x_k) = \sum_i \alpha_i - \frac{1}{2}\sum_{i}\sum_{k}\alpha_i\alpha_k y_i y_k\, K(x_i, x_k) = \mathbf{1}^T\alpha - \frac{1}{2}\alpha^T[\mathrm{diag}(y)\,\mathbf{K}\,\mathrm{diag}(y)]\,\alpha,$$
where $\mathbf{K} = [K_{ik}]$ with $K_{ik} = K(x_i, x_k) = h(x_i)^T h(x_k)$.

Maximizing the dual is a convex optimization problem (the dual objective is concave) when the matrix $\mathbf{K}$ is positive semi-definite.

Thus the kernel function K must satisfy two properties: symmetry and positive (semi-)definiteness.
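A small MATLAB sketch that checks these two properties numerically on a Gram matrix built from the training points; the kernel handle kfun and the tolerances are assumptions.

```matlab
% Check symmetry and positive semi-definiteness of the Gram matrix K = [K(x_i, x_k)].
K = kfun(Xtr, Xtr);                          % N-by-N Gram matrix on the training points
is_symmetric = norm(K - K', 'fro') < 1e-10;  % K(x_i, x_k) = K(x_k, x_i)
min_eig = min(eig((K + K')/2));              % symmetrize before eig for numerical safety
is_psd = min_eig > -1e-10;                   % all eigenvalues (numerically) non-negative
```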
Kernel Functions…
Thus we need h(x)'s that define a kernel function.
In practice we don’t even need to define h(x)! All we need is the kernel function!
Example kernel functions:
• dth-degree polynomial: $K(x, x') = (1 + x^T x')^d$
• Radial kernel: $K(x, x') = \exp(-\|x - x'\|^2 / c)$
• Neural network: $K(x, x') = \tanh(k_1\, x^T x' + k_2)$
The real question now is how to design a kernel function.
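The three example kernels above, written as MATLAB function handles operating on matrices with one point per row; the parameter names d, c, k1, k2 follow the formulas on this slide, and Xtr is an assumed training matrix (implicit expansion requires MATLAB R2016b or later).

```matlab
% Example kernel functions; A is m-by-p, B is n-by-p, each row is one point.
sqdist      = @(A, B) sum(A.^2, 2) + sum(B.^2, 2)' - 2*(A*B');   % pairwise squared distances
poly_kernel = @(A, B, d) (1 + A*B').^d;                          % d-th degree polynomial
rbf_kernel  = @(A, B, c) exp(-sqdist(A, B)/c);                   % radial (RBF) kernel
nn_kernel   = @(A, B, k1, k2) tanh(k1*(A*B') + k2);              % neural network kernel
K = rbf_kernel(Xtr, Xtr, 1.0);                                   % example Gram matrix, c = 1
```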
Example
SVM as a Penalty Method
With $f(x) = h(x)^T\beta + \beta_0$, the following optimization
$$\min_{\beta,\,\beta_0}\;\sum_{i=1}^{N}[1 - y_i f(x_i)]_{+} + \frac{\lambda}{2}\|\beta\|^2$$
is equivalent to:
$$\min_{\beta,\,\beta_0,\,\xi}\;\frac{1}{2}\|\beta\|^2 + \gamma\sum_{i}\xi_i \quad \text{subject to} \quad y_i f(x_i) \ge 1 - \xi_i,\; \xi_i \ge 0,\; \forall i.$$

SVM is a penalized optimization method for binary classification.
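A short MATLAB sketch that evaluates this penalized (hinge-loss) objective for a linear decision function f(x) = x'*beta + beta0; X, y, beta, beta0 and lambda are assumed, and X can equally be a matrix of basis-function values h(x_i)'.

```matlab
% Penalized hinge-loss objective: sum_i [1 - y_i f(x_i)]_+ + (lambda/2)*||beta||^2.
fx = X*beta + beta0;
obj = sum(max(0, 1 - y.*fx)) + 0.5*lambda*(beta'*beta);
```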
Negative Binomial Log-likelihood (LR Loss Function) Example
This is essentially non-linear logistic regression.
SVM for Regression
The penalty view of SVM leads to regression
With the following optimization,)()( 0 Txhxf
,2
))((min2
1, 0
N
ii xfyV
where, V(.) is a regression loss function.
SV Regression: Loss Functions
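The figure from this slide is not reproduced here. As one commonly used example of such a loss V, the epsilon-insensitive loss of support vector regression can be written in MATLAB as below; epsilon, lambda, beta and beta0 are assumed parameters for illustration.

```matlab
% Epsilon-insensitive loss, a common choice of V for support vector regression:
% V(r) = 0 if |r| <= epsilon, and |r| - epsilon otherwise.
V_eps = @(r, epsilon) max(0, abs(r) - epsilon);
obj = sum(V_eps(y - (X*beta + beta0), 0.1)) + 0.5*lambda*(beta'*beta);  % example with epsilon = 0.1
```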