
Page 1: Svm

Introduction to SVMs

Page 2: Svm

SVMs

• Geometric – Maximizing the Margin

• Kernel Methods – making nonlinear decision boundaries linear, efficiently!

• Capacity – Structural Risk Minimization

Page 3: Svm

SVM History

• SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis

• SVM was first introduced by Boser, Guyon and Vapnik in COLT-92

• SVM became famous when, using pixel maps as input, it gave accuracy comparable to NNs with hand-designed features in a handwriting recognition task

• SVM is closely related to: kernel machines (a generalization of SVMs), large-margin classifiers, reproducing kernel Hilbert spaces, Gaussian processes, and Boosting

Page 4: Svm

Linear Classifiers

A linear classifier maps an input x to an estimated label y_est:

f(x,w,b) = sign(w · x - b)

[Figure: scatter plot of datapoints; one symbol denotes +1, the other denotes -1.]

How would you classify this data?
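As a side illustration (not part of the slides), here is a minimal NumPy sketch of the decision rule f(x,w,b) = sign(w · x - b) used throughout this deck; the weight vector, bias, and test points are made-up values.

```python
import numpy as np

def linear_classify(x, w, b):
    """Linear classifier from the slide: f(x, w, b) = sign(w . x - b)."""
    return np.sign(np.dot(w, x) - b)

# Hypothetical 2-D example: w and b are arbitrary illustrative values.
w = np.array([2.0, -1.0])
b = 0.5
print(linear_classify(np.array([3.0, 1.0]), w, b))   # lands on the +1 side
print(linear_classify(np.array([-1.0, 2.0]), w, b))  # lands on the -1 side
```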

Pages 5-7: Svm

Linear Classifiers (the same slide repeated, each time with a different candidate separating line drawn through the data): f(x,w,b) = sign(w · x - b). How would you classify this data?

Page 8: Svm

Linear Classifiers

f(x,w,b) = sign(w · x - b)

Any of these would be fine..

..but which is best?

Page 9: Svm

Classifier Margin

f(x,w,b) = sign(w · x - b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Page 10: Svm

Maximum Margin

f(x,w,b) = sign(w · x - b)

The maximum margin linear classifier is the linear classifier with the maximum margin.

This is the simplest kind of SVM, called an LSVM (Linear SVM).

Page 11: Svm

Maximum Margin

f(x,w,b) = sign(w · x - b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM, called an LSVM (Linear SVM).

Support Vectors are those datapoints that the margin pushes up against.

Page 12: Svm

Why Maximum Margin?

f(x,w,b) = sign(w · x - b)

The maximum margin linear classifier is the simplest kind of SVM (the LSVM, or Linear SVM); Support Vectors are the datapoints that the margin pushes up against. Why prefer the maximum margin?

1. Intuitively this feels safest.

2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.

3. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.

4. Empirically it works very, very well.

Page 13: Svm

A “Good” Separator

[Figure: scatter plot of X and O datapoints with a separating line between the two classes.]

Page 14: Svm

Noise in the Observations

[Figure: the same scatter plot of X and O datapoints.]

Page 15: Svm

Ruling Out Some Separators

[Figure: the same scatter plot of X and O datapoints.]

Page 16: Svm

Lots of Noise

[Figure: the same scatter plot of X and O datapoints.]

Page 17: Svm

Maximizing the Margin

[Figure: the same scatter plot of X and O datapoints.]

Page 18: Svm

Specifying a line and margin

• How do we represent this mathematically?

• …in m input dimensions?

[Figure: the Plus-Plane, the Minus-Plane, and the Classifier Boundary, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.]

Page 19: Svm

Specifying a line and margin

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

[Figure: the lines w·x+b = +1, w·x+b = 0, and w·x+b = -1, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.]

Classify as:
+1 if w · x + b >= 1
-1 if w · x + b <= -1
Universe explodes if -1 < w · x + b < 1

Page 20: Svm

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

Claim: The vector w is perpendicular to the Plus Plane. Why?

[Figure: the planes w·x+b = +1, 0, -1; M = Margin Width.]

How do we compute M in terms of w and b?

Page 21: Svm

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

Claim: The vector w is perpendicular to the Plus Plane. Why? Let u and v be two vectors on the Plus Plane. What is w · (u - v)?

And so of course the vector w is also perpendicular to the Minus Plane.

[Figure: the planes w·x+b = +1, 0, -1; M = Margin Width.]

How do we compute M in terms of w and b?

Page 22: Svm

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x- (any location in R^m, not necessarily a datapoint)

[Figure: the planes w·x+b = +1, 0, -1, with x- and x+ marked; M = Margin Width.]

How do we compute M in terms of w and b?

Page 23: Svm

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-
• Claim: x+ = x- + λw for some value of λ. Why?

[Figure: the planes w·x+b = +1, 0, -1, with x- and x+ marked; M = Margin Width.]

How do we compute M in terms of w and b?

Page 24: Svm

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-
• Claim: x+ = x- + λw for some value of λ. Why?

The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ travel some distance in direction w.

[Figure: the planes w·x+b = +1, 0, -1, with x- and x+ marked; M = Margin Width.]

How do we compute M in terms of w and b?

Page 25: Svm

Computing the margin width

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M

It's now easy to get M in terms of w and b.

[Figure: the planes w·x+b = +1, 0, -1, with x- and x+ marked; M = Margin Width.]

Page 26: Svm

Computing the margin width

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M

It's now easy to get M in terms of w and b:

w · (x- + λw) + b = 1
=> w · x- + b + λ w·w = 1
=> -1 + λ w·w = 1
=> λ = 2 / (w·w)

Page 27: Svm

Computing the margin width

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M
• λ = 2 / (w·w)

M = |x+ - x-| = |λw| = λ|w| = λ √(w·w) = 2 √(w·w) / (w·w) = 2 / √(w·w)

So M = Margin Width = 2 / √(w·w).
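A quick numerical sanity check of this result (my own illustration; the w and b values are arbitrary): the distance between the planes w·x+b = -1 and w·x+b = +1 comes out equal to 2/√(w·w).

```python
import numpy as np

w = np.array([3.0, 4.0])   # arbitrary example weight vector (||w|| = 5)
b = -2.0                   # arbitrary bias

# Pick any point x_minus on the minus-plane w.x + b = -1, then step lambda*w
# in the direction of w to land on the plus-plane, as in the derivation above.
x_minus = np.array([0.0, (-1.0 - b) / w[1]])   # satisfies w.x + b = -1
lam = 2.0 / np.dot(w, w)                       # lambda from the slide
x_plus = x_minus + lam * w                     # lies on w.x + b = +1

print(np.dot(w, x_plus) + b)                   # ~= 1.0, confirming x_plus is on the plus-plane
print(np.linalg.norm(x_plus - x_minus))        # margin M
print(2.0 / np.sqrt(np.dot(w, w)))             # 2 / sqrt(w.w): the same value
```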

Page 28: Svm

Learning the Maximum Margin Classifier

Given a guess of w and b we can:
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin, M = 2 / √(w·w)

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?

Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?

Page 29: Svm

• Linear Programming

Find w = argmax c · w

subject to:
w · a_i <= b_i, for i = 1, …, m
w_j >= 0, for j = 1, …, n

Don't worry… it's good for you…

There are fast algorithms for solving linear programs, including the simplex algorithm and Karmarkar's algorithm.

Page 30: Svm

Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.

Page 31: Svm

Quadratic Programming

Quadratic criterion: Find  arg max_u  c + dᵀu + (uᵀ R u)/2

Subject to n additional linear inequality constraints:
a_11 u_1 + a_12 u_2 + … + a_1m u_m <= b_1
a_21 u_1 + a_22 u_2 + … + a_2m u_m <= b_2
:
a_n1 u_1 + a_n2 u_2 + … + a_nm u_m <= b_n

And subject to e additional linear equality constraints:
a_(n+1)1 u_1 + a_(n+1)2 u_2 + … + a_(n+1)m u_m = b_(n+1)
:
a_(n+e)1 u_1 + a_(n+e)2 u_2 + … + a_(n+e)m u_m = b_(n+e)

Page 32: Svm

Quadratic Programming

Quadratic criterion: Find  arg max_u  c + dᵀu + (uᵀ R u)/2, subject to the same n linear inequality constraints and e linear equality constraints as on the previous slide.

There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they are very fiddly… you probably don't want to write one yourself.)

Page 33: Svm

Learning the Maximum Margin Classifier

Given a guess of w, b we can:
• Compute whether all data points are in the correct half-planes
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = ±1.

[Figure: the planes w·x+b = +1, 0, -1; M = 2 / √(w·w).]

What should our quadratic optimization criterion be?  Minimize w·w.

How many constraints will we have?  R.

What should they be?
w · x_k + b >= 1 if y_k = 1
w · x_k + b <= -1 if y_k = -1
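For illustration only: one way to hand this QP to an off-the-shelf solver. The slides do not prescribe a tool; the use of CVXPY and the toy data below are my own assumptions, a minimal sketch rather than the deck's method.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: y_k in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Minimize w.w subject to the R constraints y_k (w.x_k + b) >= 1 above.
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()

print("w =", w.value, "b =", b.value)
print("margin width =", 2.0 / np.linalg.norm(w.value))
```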

Page 34: Svm

Uh-oh!

[Figure: a dataset in which the two classes (+1 and -1) cannot be cleanly separated by a line.]

This is going to be a problem! What should we do?

Idea 1: Find minimum w·w, while minimizing the number of training set errors.

Problem: two things to minimize makes for an ill-defined optimization.

Page 35: Svm

Uh-oh!

[Figure: the same not-quite-separable dataset.]

This is going to be a problem! What should we do?

Idea 1.1: Minimize  w·w + C (#train errors),  where C is a tradeoff parameter.

There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?

Page 36: Svm

Uh-oh!

[Figure: the same not-quite-separable dataset.]

This is going to be a problem! What should we do?

Idea 1.1: Minimize  w·w + C (#train errors),  where C is a tradeoff parameter.

Problem: this can't be expressed as a Quadratic Programming problem, and solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)

So… any other ideas?

Page 37: Svm

Uh-oh!

[Figure: the same not-quite-separable dataset.]

This is going to be a problem! What should we do?

Idea 2.0: Minimize  w·w + C (distance of error points to their correct place).

Page 38: Svm

Learning Maximum Margin with Noise

Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = ±1.

[Figure: the planes w·x+b = +1, 0, -1; M = 2 / √(w·w).]

What should our quadratic optimization criterion be? How many constraints will we have? What should they be?

Page 39: Svm

Large-margin Decision Boundary

• The decision boundary should be as far away from the data of both classes as possible – we should maximize the margin, m
• The distance between the origin and the line wᵀx = k is k/||w||

[Figure: Class 1 and Class 2 separated by a decision boundary with margin m.]

Page 40: Svm

Finding the Decision Boundary

• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i
• The decision boundary should classify all points correctly
• The decision boundary can be found by solving the following constrained optimization problem
• This is a constrained optimization problem. Solving it requires some new tools – feel free to ignore the following several slides; what is important is the constrained optimization problem above
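For reference, the problem can be written out in the standard form that matches the "Minimize w·w" formulation used elsewhere in these slides (the 1/2 factor is the usual convention; the slide's exact notation may differ):

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\,\|\mathbf{w}\|^{2}
\quad\text{subject to}\quad
y_i\,(\mathbf{w}^{T}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n .
```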

Page 41: Svm

Back to the Original Problem

• The Lagrangian is L(w, b, α) – note that ||w||² = wᵀw
• Setting the gradient of L w.r.t. w and b to zero, we have:
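For reference, the standard Lagrangian for this problem and the conditions obtained by setting its gradient with respect to w and b to zero are as follows (a reconstruction in conventional notation, which may differ from the original slide's rendering):

```latex
\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\,\mathbf{w}^{T}\mathbf{w}
  - \sum_{i=1}^{n} \alpha_i \left( y_i\,(\mathbf{w}^{T}\mathbf{x}_i + b) - 1 \right),
  \qquad \alpha_i \ge 0,

\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 0
  \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i\, y_i\, \mathbf{x}_i,
\qquad
\frac{\partial \mathcal{L}}{\partial b} = 0
  \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i\, y_i = 0 .
```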

Page 42: Svm

• The Karush-Kuhn-Tucker conditions are:
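For this problem the standard KKT conditions (again a conventional reconstruction, not copied from the slide) are stationarity, dual feasibility, primal feasibility, and complementary slackness:

```latex
\nabla_{\mathbf{w}}\mathcal{L} = \mathbf{0}, \qquad
\frac{\partial \mathcal{L}}{\partial b} = 0, \qquad
\alpha_i \ge 0, \qquad
y_i(\mathbf{w}^{T}\mathbf{x}_i + b) - 1 \ge 0, \qquad
\alpha_i \left( y_i(\mathbf{w}^{T}\mathbf{x}_i + b) - 1 \right) = 0 .
```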


Page 43: Svm

The Dual Problem

• If we substitute w = Σ_i α_i y_i x_i into the Lagrangian L, we have:
• Note that Σ_i α_i y_i = 0
• This is a function of the α_i only

Page 44: Svm

The Dual Problem

• The new objective function is in terms of the α_i only
• It is known as the dual problem: if we know w, we know all α_i; if we know all α_i, we know w
• The original problem is known as the primal problem
• The objective function of the dual problem needs to be maximized!
• The dual problem is therefore: maximize this objective subject to the properties of the α_i introduced with the Lagrange multipliers (α_i ≥ 0) and the result of differentiating the original Lagrangian w.r.t. b (Σ_i α_i y_i = 0)
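For reference, substituting w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 back into the Lagrangian gives the dual in its standard form, consistent with the "Equivalent Dual QP" written out later in these slides (a reconstruction, not the slide's own rendering):

```latex
\max_{\boldsymbol{\alpha}}\;
\sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^{T}\mathbf{x}_j
\quad\text{subject to}\quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 .
```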

Page 45: Svm

The Dual Problem

• This is a quadratic programming (QP) problem – a global maximum over the α_i can always be found

• w can be recovered by  w = Σ_i α_i y_i x_i

Page 46: Svm

Characteristics of the Solution

• Many of the α_i are zero
– w is a linear combination of a small number of data points
– This "sparse" representation can be viewed as data compression, as in the construction of the kNN classifier

• x_i with non-zero α_i are called support vectors (SV)
– The decision boundary is determined only by the SV
– Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write  w = Σ_{j=1}^{s} α_{t_j} y_{t_j} x_{t_j}

• For testing with a new data point z
– Compute  wᵀz + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} (x_{t_j}ᵀ z) + b  and classify z as class 1 if the sum is positive, and class 2 otherwise
– Note: w need not be formed explicitly

Page 47: Svm

A Geometrical Interpretation

[Figure: Class 1 and Class 2 with the decision boundary; the support vectors carry α₁ = 0.8, α₆ = 1.4, α₈ = 0.6, while α₂ = α₃ = α₄ = α₅ = α₇ = α₉ = α₁₀ = 0.]

Page 48: Svm

Non-linearly Separable Problems

• We allow "errors" ε_i in classification; they are based on the output of the discriminant function wᵀx + b
• Σ_i ε_i approximates the number of misclassified samples

[Figure: Class 1 and Class 2, with slack for points on the wrong side of their margin.]

Page 49: Svm

Learning Maximum Margin with Noise

Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = ±1.

[Figure: the planes w·x+b = +1, 0, -1; M = 2 / √(w·w).]

What should our quadratic optimization criterion be?
Minimize  ½ w·w + C Σ_{k=1}^{R} ε_k

How many constraints will we have?  R
What should they be?
w · x_k + b >= 1 - ε_k if y_k = 1
w · x_k + b <= -1 + ε_k if y_k = -1

Page 50: Svm

Learning Maximum Margin with Noise

Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = ±1 (m = # input dimensions, R = # records).

[Figure: the planes w·x+b = +1, 0, -1; M = 2 / √(w·w).]

What should our quadratic optimization criterion be?
Minimize  ½ w·w + C Σ_{k=1}^{R} ε_k

Our original (noiseless data) QP had m+1 variables: w_1, w_2, …, w_m, and b.
Our new (noisy data) QP has m+1+R variables: w_1, w_2, …, w_m, b, ε_1, …, ε_R.

How many constraints will we have?  R
What should they be?
w · x_k + b >= 1 - ε_k if y_k = 1
w · x_k + b <= -1 + ε_k if y_k = -1

Page 51: Svm

Learning Maximum Margin with Noise

Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = ±1.

[Figure: the planes w·x+b = +1, 0, -1; M = 2 / √(w·w).]

What should our quadratic optimization criterion be?
Minimize  ½ w·w + C Σ_{k=1}^{R} ε_k

How many constraints will we have?  R
What should they be?
w · x_k + b >= 1 - ε_k if y_k = 1
w · x_k + b <= -1 + ε_k if y_k = -1

There's a bug in this QP. Can you spot it?

Page 52: Svm

Learning Maximum Margin with Noise

Given a guess of w, b we can:
• Compute the sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = ±1.

[Figure: the planes w·x+b = +1, 0, -1; M = 2 / √(w·w).]

What should our quadratic optimization criterion be?
Minimize  ½ w·w + C Σ_{k=1}^{R} ε_k

How many constraints will we have?  2R
What should they be?
w · x_k + b >= 1 - ε_k if y_k = 1
w · x_k + b <= -1 + ε_k if y_k = -1
ε_k >= 0 for all k
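As a concrete (purely illustrative) way to solve this soft-margin problem: scikit-learn's SVC with a linear kernel minimizes an objective of exactly this ½ w·w + C Σ ε_k form. The library choice and toy data are my own assumptions, not part of the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one point on the wrong side, so some slack epsilon_k > 0 is needed.
X = np.array([[2.0, 2.0], [3.0, 1.0], [1.5, 2.5],
              [-2.0, -2.0], [-3.0, -1.0], [2.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# C is the tradeoff parameter between a wide margin and small total slack.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]     # note: sklearn's decision function is sign(w.x + b)
print("w =", w, "b =", b)
print("margin width =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```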


Page 54: Svm

An Equivalent Dual QP

Maximize  Σ_{k=1}^{R} α_k - ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl   where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
0 <= α_k <= C for all k
Σ_{k=1}^{R} α_k y_k = 0

(This is equivalent to the primal problem:
Minimize  ½ w·w + C Σ_{k=1}^{R} ε_k
subject to  w · x_k + b >= 1 - ε_k if y_k = 1,
            w · x_k + b <= -1 + ε_k if y_k = -1,
            ε_k >= 0 for all k.)

Page 55: Svm

An Equivalent Dual QP

Maximize  Σ_{k=1}^{R} α_k - ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl   where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
0 <= α_k <= C for all k
Σ_{k=1}^{R} α_k y_k = 0

Then define:
w = Σ_{k=1}^{R} α_k y_k x_k
b = y_K (1 - ε_K) - x_K · w,   where K = argmax_k α_k

Then classify with:  f(x,w,b) = sign(w · x - b)
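As an illustration of "define w = Σ α_k y_k x_k" (my own example; the library is an assumption, not the slides' tool): for a fitted linear-kernel SVC, scikit-learn stores α_k y_k for the support vectors in dual_coef_, so w can be rebuilt from the support vectors alone and matches the solver's own coef_.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [-0.5, -1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_[0] holds alpha_k * y_k for each support vector.
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print("w rebuilt from support vectors:", w_from_dual)
print("w reported by the solver:      ", clf.coef_[0])
print("number of support vectors:", len(clf.support_))
```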

Page 56: Svm

Example XOR problem revisited:

Let the nonlinear mapping be:
Φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)ᵀ
and: Φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)ᵀ

Therefore the feature space is 6-D, with input data in 2-D.

x1 = (-1,-1), d1 = -1;  x2 = (-1, 1), d2 = 1;  x3 = (1,-1), d3 = 1;  x4 = (1, 1), d4 = -1

Page 57: Svm

Q(a) = Σ_i a_i - ½ Σ_i Σ_j a_i a_j d_i d_j Φ(x_i)ᵀΦ(x_j)
     = a1 + a2 + a3 + a4 - ½(9 a1a1 - 2 a1a2 - 2 a1a3 + 2 a1a4 + 9 a2a2 + 2 a2a3 - 2 a2a4 + 9 a3a3 - 2 a3a4 + 9 a4a4)

To optimize Q, we only need to set (due to the optimality conditions) ∂Q(a)/∂a_i = 0 for i = 1, …, 4, which gives:

1 = 9 a1 - a2 - a3 + a4
1 = -a1 + 9 a2 + a3 - a4
1 = -a1 + a2 + 9 a3 - a4
1 = a1 - a2 - a3 + 9 a4
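These four equations form a small linear system; a quick numerical check (my own, using NumPy) confirms the solution quoted on the next slide, a1 = a2 = a3 = a4 = 1/8.

```python
import numpy as np

# Coefficient matrix of the four optimality equations above (rows: dQ/da_i = 0).
A = np.array([[ 9, -1, -1,  1],
              [-1,  9,  1, -1],
              [-1,  1,  9, -1],
              [ 1, -1, -1,  9]], dtype=float)
rhs = np.ones(4)

a = np.linalg.solve(A, rhs)
print(a)   # [0.125 0.125 0.125 0.125], i.e. a_i = 1/8 for all i
```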

Page 58: Svm

The solution of which gives the optimal values: a0,1 = a0,2 = a0,3 = a0,4 = 1/8

w0 = Σ_i a0,i d_i Φ(x_i) = 1/8 [-Φ(x1) + Φ(x2) + Φ(x3) - Φ(x4)] = (0, 0, -1/√2, 0, 0, 0)ᵀ

where the first element of w0 gives the bias b (so b = 0).

Page 59: Svm

From earlier we have that the optimal hyperplane is defined by:  w0ᵀ Φ(x) = 0

That is:
(0, 0, -1/√2, 0, 0, 0) (1, x1², √2 x1x2, x2², √2 x1, √2 x2)ᵀ = -x1x2 = 0

which is the optimal decision boundary for the XOR problem. Furthermore we note that the solution is unique, since the optimal decision boundary is unique.

Page 60: Svm

Output for polynomial RBF

Page 61: Svm

Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:

z_k = (x_k, x_k²)

[Figure: a hard 1-D dataset plotted around x = 0.]

Page 62: Svm

For a non-linearly separable problem we have to first map the data onto a feature space so that it becomes linearly separable: x_i → Φ(x_i)

Given the training data sample {(x_i, y_i), i = 1, …, N}, find the optimum values of the weight vector w and bias b:

w = Σ_i a0,i y_i Φ(x_i)

where a0,i are the optimal Lagrange multipliers determined by maximizing the following objective function

Q(a) = Σ_{i=1}^{N} a_i - ½ Σ_{i=1}^{N} Σ_{j=1}^{N} a_i a_j d_i d_j Φ(x_i)ᵀΦ(x_j)

subject to the constraints

Σ_i a_i y_i = 0 ;  a_i >= 0

Page 63: Svm

SVM building procedure:

1. Pick a nonlinear mapping Φ
2. Solve for the optimal weight vector

However: how do we pick the function Φ?

• In practical applications, if it is not totally impossible to find Φ, it is very hard
• In the previous example, the function Φ is quite complex: how would we find it?

Answer: the Kernel Trick

Page 64: Svm

Notice that in the dual problem the images of the input vectors are involved only through inner products, meaning that the optimization can be performed in the (lower-dimensional) input space and that the inner product can be replaced by an inner-product kernel.

How do we relate the output of the SVM to the kernel K? Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulation:

Q(a) = Σ_i a_i - ½ Σ_{i,j} a_i a_j d_i d_j Φ(x_i)ᵀΦ(x_j)
     = Σ_i a_i - ½ Σ_{i,j} a_i a_j d_i d_j K(x_i, x_j)

Page 65: Svm

The hyperplane is defined by  wᵀΦ(x) + b = 0,
or  wᵀφ(x) = 0,  where φ(x) = [φ0(x), φ1(x), …, φ_m1(x)]ᵀ with φ0(x) = 1 (so w0 plays the role of b),
i.e.  Σ_{j=0}^{m1} w_j φ_j(x) = 0.

From the optimality conditions:  w = Σ_{i=1}^{N} a_i d_i φ(x_i)

Page 66: Svm

Thus:  Σ_{i=1}^{N} a_i d_i φ(x_i)ᵀφ(x) = 0

and so the boundary is:  Σ_{i=1}^{N} a_i d_i K(x_i, x) = 0

and the output is:  wᵀφ(x) = Σ_{i=1}^{N} a_i d_i K(x_i, x)

where:  K(x_i, x) = φ(x_i)ᵀφ(x) = Σ_{j=0}^{m1} φ_j(x_i) φ_j(x)

Page 67: Svm

In the XOR problem, we chose to use the kernel function:
K(x, xi) = (xᵀxi + 1)² = 1 + x1² xi1² + 2 x1x2 xi1xi2 + x2² xi2² + 2 x1xi1 + 2 x2xi2

which implied the form of our nonlinear functions:
Φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)ᵀ
and: Φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)ᵀ

However, we did not need to calculate Φ at all, and could simply have used the kernel to calculate:
Q(a) = Σ_i a_i - ½ Σ_{i,j} a_i a_j d_i d_j K(x_i, x_j)

maximized and solved for the a_i, and derived the hyperplane via:
Σ_{i=1}^{N} a_i d_i K(x_i, x) = 0

Page 68: Svm

We therefore only need a suitable choice of kernel function; cf. Mercer's Theorem:

Let K(x, y) be a continuous symmetric kernel defined in the closed interval [a, b]. The kernel K can be expanded in the form K(x, y) = Φ(x)ᵀΦ(y) provided it is positive definite.

Some of the usual choices for K are:
Polynomial SVM: (xᵀxi + 1)^p, p specified by user
RBF SVM: exp(-(1/2σ²) ||x - xi||²), σ specified by user
MLP SVM: tanh(s0 xᵀxi + s1)

Page 69: Svm

An Equivalent Dual QP

Maximize  Σ_{k=1}^{R} α_k - ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl   where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
0 <= α_k <= C for all k
Σ_{k=1}^{R} α_k y_k = 0

Then define:
w = Σ_{k=1}^{R} α_k y_k x_k
b = y_K (1 - ε_K) - x_K · w,   where K = argmax_k α_k

Then classify with:  f(x,w,b) = sign(w · x - b)

Datapoints with α_k > 0 will be the support vectors… so this sum only needs to be over the support vectors.

Page 70: Svm

Quadratic Basis Functions

Φ(x) = ( 1,                                   ← constant term
         √2 x1, √2 x2, …, √2 xm,              ← linear terms
         x1², x2², …, xm²,                    ← pure quadratic terms
         √2 x1x2, √2 x1x3, …, √2 x1xm,
         √2 x2x3, …, √2 x_(m-1)xm )ᵀ          ← quadratic cross-terms

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2.

You may be wondering what those √2's are doing. You should be happy that they do no harm; you'll find out why they're there soon.

Page 71: Svm

QP with basis functions

Maximize  Σ_{k=1}^{R} α_k - ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl   where Q_kl = y_k y_l (Φ(x_k) · Φ(x_l))

Subject to these constraints:
0 <= α_k <= C for all k
Σ_{k=1}^{R} α_k y_k = 0

Then define:
w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
b = y_K (1 - ε_K) - x_K · w,   where K = argmax_k α_k

Then classify with:  f(x,w,b) = sign(w · Φ(x) - b)

Page 72: Svm

QP with basis functions

(Same maximization, constraints, and classifier as on the previous slide, with Q_kl = y_k y_l (Φ(x_k) · Φ(x_l)).)

We must do R²/2 dot products to get this matrix ready. Each dot product requires m²/2 additions and multiplications. The whole thing costs R² m² / 4. Yeeks!

… or does it?

Page 73: Svm

Quadratic Dot Products

Φ(a) · Φ(b) = 1 + 2 Σ_{i=1}^{m} a_i b_i + Σ_{i=1}^{m} a_i² b_i² + 2 Σ_{i=1}^{m} Σ_{j=i+1}^{m} a_i a_j b_i b_j

(constant term + linear terms + pure quadratic terms + quadratic cross-terms)

Page 74: Svm

Quadratic Dot Products

Just out of casual, innocent interest, let's look at another function of a and b:

(a · b + 1)² = (a · b)² + 2 (a · b) + 1
             = (Σ_{i=1}^{m} a_i b_i)² + 2 Σ_{i=1}^{m} a_i b_i + 1
             = Σ_{i=1}^{m} Σ_{j=1}^{m} a_i b_i a_j b_j + 2 Σ_{i=1}^{m} a_i b_i + 1
             = Σ_{i=1}^{m} a_i² b_i² + 2 Σ_{i=1}^{m} Σ_{j=i+1}^{m} a_i b_i a_j b_j + 2 Σ_{i=1}^{m} a_i b_i + 1

Page 75: Svm

Quadratic Dot Products

Compare:
Φ(a) · Φ(b)  = 1 + 2 Σ_i a_i b_i + Σ_i a_i² b_i² + 2 Σ_i Σ_{j>i} a_i a_j b_i b_j
(a · b + 1)² = 1 + 2 Σ_i a_i b_i + Σ_i a_i² b_i² + 2 Σ_i Σ_{j>i} a_i b_i a_j b_j

They're the same! And this is only O(m) to compute!
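A quick numerical confirmation of this identity (my own sketch, not from the slides): build Φ explicitly with the √2 factors from the "Quadratic Basis Functions" slide and compare Φ(a)·Φ(b) with (a·b + 1)².

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Quadratic basis expansion with the sqrt(2) factors; term order is irrelevant."""
    sqrt2 = np.sqrt(2.0)
    cross = [sqrt2 * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], sqrt2 * x, x ** 2, cross))

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

lhs = phi_quadratic(a) @ phi_quadratic(b)   # explicit ~m^2/2-dimensional dot product
rhs = (a @ b + 1.0) ** 2                    # the O(m) kernel evaluation
print(lhs, rhs, np.isclose(lhs, rhs))       # identical up to float rounding
```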

Page 76: Svm

QP with Quadratic basis functions

Maximize  Σ_{k=1}^{R} α_k - ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl   where Q_kl = y_k y_l (Φ(x_k) · Φ(x_l))

Subject to these constraints:  0 <= α_k <= C for all k,  Σ_{k=1}^{R} α_k y_k = 0

Then define w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) and b = y_K (1 - ε_K) - x_K · w where K = argmax_k α_k, and classify with f(x,w,b) = sign(w · Φ(x) - b).

We must do R²/2 dot products to get this matrix ready. Each dot product now only requires m additions and multiplications.

Page 77: Svm

Higher Order Polynomials

Polynomial | Φ(x)                           | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | φ(a)·φ(b) | Cost to build Q_kl matrix efficiently | Cost if 100 inputs
Quadratic  | all m²/2 terms up to degree 2  | m² R² / 4                               | 2,500 R²           | (a·b+1)²  | m R² / 2                              | 50 R²
Cubic      | all m³/6 terms up to degree 3  | m³ R² / 12                              | 83,000 R²          | (a·b+1)³  | m R² / 2                              | 50 R²
Quartic    | all m⁴/24 terms up to degree 4 | m⁴ R² / 48                              | 1,960,000 R²       | (a·b+1)⁴  | m R² / 2                              | 50 R²

Page 78: Svm

QP with Quintic basis functions

Maximize  Σ_{k=1}^{R} α_k - ½ Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl   where Q_kl = y_k y_l (Φ(x_k) · Φ(x_l))

Subject to these constraints:
0 <= α_k <= C for all k
Σ_{k=1}^{R} α_k y_k = 0

Then define:
w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
b = y_K (1 - ε_K) - x_K · w,   where K = argmax_k α_k

Then classify with:  f(x,w,b) = sign(w · Φ(x) - b)

Page 79: Svm

QP with Quintic basis functions

(Same maximization, constraints, and classifier as on the previous slide.)

We must do R²/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.

But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)

Page 80: Svm

QP with Quintic basis functions

(Same maximization, constraints, and classifier as before.)

We must do R²/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.

But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?) Because each w · Φ(x) (see below) needs 75 million operations. What can be done?

The use of Maximum Margin magically makes this not a problem.

Page 81: Svm

QP with Quintic basis functions

(Same maximization, constraints, and classifier as before.)

Each w · Φ(x) would seem to need 75 million operations. The use of Maximum Margin magically makes this not a problem:

w · Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) · Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k (x_k · x + 1)⁵

Only Sm operations (S = #support vectors).


Page 83: Svm

QP with Quintic basis functions

(Same maximization, constraints, and classifier as before; evaluation needs only Sm operations, S = #support vectors.)

Why SVMs don't overfit as much as you'd think: no matter what the basis function, there are really only up to R parameters: α_1, α_2, …, α_R, and usually most are set to zero by the Maximum Margin.

Asking for small w·w is like "weight decay" in Neural Nets, like the Ridge Regression parameter in Linear Regression, and like the use of Priors in Bayesian Regression: all designed to smooth the function and reduce overfitting.

Page 84: Svm

SVM Kernel Functions

• K(a,b) = (a · b + 1)^d is an example of an SVM Kernel Function

• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function:
– Radial-Basis-style Kernel Function:  K(a,b) = exp(-(a - b)² / (2σ²))
– Neural-net-style Kernel Function:  K(a,b) = tanh(κ a·b - δ)

σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM.
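As an illustration of these kernel choices in an off-the-shelf SVM (the library, dataset, and parameter values below are my own assumptions, not from the slides): scikit-learn's SVC exposes polynomial, RBF, and sigmoid (tanh) kernels, with the "magic parameters" surfacing as degree, gamma and coef0, typically chosen by cross-validation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # an XOR-like, non-linear problem

models = {
    "polynomial": SVC(kernel="poly", degree=2, coef0=1.0),  # (gamma*a.b + coef0)^degree
    "RBF":        SVC(kernel="rbf", gamma=1.0),             # exp(-gamma*||a - b||^2)
    "sigmoid":    SVC(kernel="sigmoid", coef0=-1.0),        # tanh(gamma*a.b + coef0)
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name:10s} mean CV accuracy: {scores.mean():.2f}")
```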

Page 85: Svm

SVM Implementations

• Sequential Minimal Optimization (SMO): Platt's efficient implementation of SVM training – available in Weka

• SVMlight – http://svmlight.joachims.org/

Page 86: Svm

References

• Tutorial on VC-dimension and Support Vector Machines:
C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html

• The VC/SRM/SVM Bible:
V. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.