MACHINE LEARNING Classifiers: Support Vector Machine
Source: lasa.epfl.ch/teaching/lectures/ML_Phd/Slides/SVM.pdf




What is Classification?

Example: detecting facial attributes, with classes such as "children" and "female adult".
(Image credits: Sony (Make Believe); He & Zhang, Pattern Recognition, 2011.)

The training set must be as unambiguous as possible. This is not easy, especially as members of different classes may share similar attributes.

Learning implies generalization: determining which features of the members of a class make that class most distinguishable from the other classes.


Multi-Class Classification

(Example classes: children, female adult, male adult.)

Whenever possible, the classes should be balanced

Garbage model: male adult versus anything that is neither a female adult nor a child. The classes can no longer be balanced!


Classifiers

There is a plethora of classifiers, e.g.:

- Neural networks (feed-forward with backpropagation, multi-layer perceptron)
- Decision trees (C4.5, random forest)
- Kernel methods (support vector machine, Gaussian process classifier)
- Mixtures of linear classifiers (boosting)

In this class, we will cover only SVM, and boosting for mixtures of classifiers.

Each classifier type has its pros and cons:

- Complex models: embed non-linearity, but require heavy computation
- Simple models: often needed in large numbers, hence high memory use
- Many hyperparameters: require extensive cross-validation to determine the optimal classifier
- Some classifiers come with guarantees of a globally optimal solution; others have only local optimality guarantees


Support Vector Machine

Brief history:

SVM was invented by Vladimir Vapnik. It started with the invention of statistical learning theory (Vapnik, 1979). The current form of SVM was presented in Boser, Guyon and Vapnik (1992) and Cortes and Vapnik (1995).

Textbooks:

An easy introduction to SVM is given in Learning with Kernels by Bernhard Schölkopf and Alexander Smola.

A good survey of the theory behind SVM is given in Support Vector Machines and Other Kernel-Based Learning Methods by Nello Cristianini and John Shawe-Taylor.


Support Vector Machine

The success of SVM is mainly due to:

- Its ease of use (lots of software available, good documentation)
- Excellent performance on a variety of datasets
- Good solvers making the optimization (learning phase) very quick
- Very fast retrieval time, which does not hinder practical applications

It has been applied to numerous classification problems:

- Computer vision (face detection, object recognition, feature categorization, etc.)
- Bioinformatics (categorization of gene expression and microarray data)
- WWW (categorization of websites)
- Production (quality control, defect detection)
- Robotics (categorization of sensor readings)
- Finance (bankruptcy prediction)


Optimal Linear Classification

• Which choice is better?
• How could we formulate this problem?

(Figure: candidate separating lines labeled 'good', 'OK', and 'bad'.)


Linear Classifiers

(Figure: datapoints labeled -1 and +1; a classifier f with parameters (w, b) maps an input x to an estimated label y_est.)

How would you classify this data?

Decision function: f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)
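A minimal NumPy sketch of this decision function, with made-up weights w and offset b:

```python
import numpy as np

def linear_classify(X, w, b):
    """Linear decision function f(x; w, b) = sgn(<w, x> + b), applied row-wise."""
    return np.sign(X @ w + b)

# Made-up 2D example: decision boundary x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0], [0.0, 0.0]])
print(linear_classify(X, w, b))   # [ 1. -1.]
```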


Linear Classifiers

(Figure: several candidate separating lines, each of the form f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b).)

Any of these would be fine... but which is best?

Classifier Margin

Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)

Classifier Margin

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM, or LSVM).

Support vectors are those datapoints that the margin pushes up against.

f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)

Classifier Margin

We need to:
1. determine a measure of the margin;
2. maximize this measure.

Determining the Optimal Separating Hyperplane

Separating hyperplane: \{x : \langle w, x \rangle + b = 0\}
Margin boundaries: \{x : \langle w, x \rangle + b = +1\} and \{x : \langle w, x \rangle + b = -1\}

Definition: the points on the margin on either side of the hyperplane satisfy |\langle w, x \rangle + b| = 1.

Determining the Optimal Separating Hyperplane

Class with label y = -1: points with \langle w, x \rangle + b \le -1 (e.g. level sets \langle w, x \rangle + b = -2, -3)
Class with label y = +1: points with \langle w, x \rangle + b \ge +1 (e.g. level sets \langle w, x \rangle + b = +2, +3)

Points on either side of the separating plane give negative and positive values of \langle w, x \rangle + b, respectively.

Decision function: f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)

Determining the Optimal Separating Hyperplane

What is the distance from a point x to the hyperplane \langle w, x \rangle + b = 0?

Determining the Optimal Separating Hyperplane

Let x' be the orthogonal projection of x onto the hyperplane, i.e. \langle w, x' \rangle + b = 0, and let w/\|w\| be the unit normal vector.

Projection of x - x' onto w/\|w\|:
\langle \frac{w}{\|w\|}, x - x' \rangle = \frac{1}{\|w\|} (\langle w, x \rangle - \langle w, x' \rangle) = \frac{1}{\|w\|} (\langle w, x \rangle + b)

Distance to the hyperplane: \frac{|\langle w, x \rangle + b|}{\|w\|}

The margin between the two classes is at least 2/\|w\|.
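A quick numerical check of the distance formula and of the 2/\|w\| margin width (w, b and x are made-up values):

```python
import numpy as np

w = np.array([3.0, 4.0])   # made-up hyperplane normal, ||w|| = 5
b = -5.0
x = np.array([3.0, 1.0])

# Signed distance of x to the hyperplane <w, x> + b = 0
print((w @ x + b) / np.linalg.norm(w))   # (9 + 4 - 5) / 5 = 1.6

# Width of the band between the margin boundaries <w, x> + b = -1 and +1
print(2.0 / np.linalg.norm(w))           # 0.4
```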

Determining the Optimal Separating Hyperplane

Take two points x^1 and x^2, one on each margin boundary:
\langle w, x^1 \rangle + b = +1
\langle w, x^2 \rangle + b = -1

Subtracting, \langle w, x^1 - x^2 \rangle = 2, hence \langle \frac{w}{\|w\|}, x^1 - x^2 \rangle = \frac{2}{\|w\|}.

The margin between the two classes is at least 2/\|w\|.

Determining the Optimal Separating Hyperplane

The separation is measured by the margin 2/\|w\|. Maximizing it is equivalent to minimizing \|w\|/2; better still is to minimize the convex form \frac{1}{2}\|w\|^2.

Determining the Optimal Separating Hyperplane

• Finding the optimal separating hyperplane turns out to be an optimization problem of the following form:

\min_{w, b} \; \frac{1}{2}\|w\|^2
subject to y^i (\langle w, x^i \rangle + b) \ge 1, \quad i = 1, 2, \dots, M
(i.e. \langle w, x^i \rangle + b \ge +1 when y^i = +1, and \langle w, x^i \rangle + b \le -1 when y^i = -1)

• N + 1 parameters (N: dimension of the data)
• M constraints (M: number of datapoints)
• It is called the primal problem.
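As an illustration (not from the slides), this primal problem can be handed directly to a generic constrained optimizer; a minimal sketch on a made-up, linearly separable toy dataset:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data: rows of X, labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.5], [-0.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):
    w = p[:2]                       # p = (w1, w2, b); b is not penalized
    return 0.5 * w @ w              # (1/2) ||w||^2

def margin_constraints(p):
    w, b = p[:2], p[2]
    return y * (X @ w + b) - 1.0    # every entry must be >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:2], res.x[2]
print(w, b)
print(y * (X @ w + b))              # all values >= 1 (up to solver tolerance)
```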

Determining the Optimal Separating Hyperplane

Rephrase the minimization-under-constraints problem in terms of the Lagrange multipliers \alpha_i, i = 1, \dots, M (M: number of datapoints), one for each inequality constraint, to obtain the Lagrangian:

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{M} \alpha_i \left[ y^i (\langle w, x^i \rangle + b) - 1 \right], \quad \text{with } \alpha_i \ge 0

(Minimization of a convex function under linear constraints through Lagrange multipliers gives the globally optimal solution.)

Determining the Optimal Separating Hyperplane

The solution of this problem is found by maximizing over \alpha and minimizing over w and b:

\max_{\alpha \ge 0} \; \min_{w, b} \; L(w, b, \alpha)

where L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{M} \alpha_i \left[ y^i (\langle w, x^i \rangle + b) - 1 \right].

Determining the Optimal Separating Hyperplane

Requiring that the gradient of L with respect to w vanishes:

\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{M} \alpha_i y^i x^i

The vector defining the hyperplane is determined by the training points. Note that while w is unique (minimization of a convex function), the \alpha_i are not unique.

Determining the Optimal Separating Hyperplane

Requiring that the gradient of L with respect to b vanishes:

\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{M} \alpha_i y^i = 0

This requires at least one datapoint in each class.

Determining the Optimal Separating Hyperplane

Stationarity of L with respect to the multipliers \alpha recovers the original constraints:

y^i (\langle w, x^i \rangle + b) \ge 1

Determining the Optimal Separating Hyperplane

Complete optimization problem, given by the Karush-Kuhn-Tucker (KKT) conditions:

\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{M} \alpha_i y^i x^i \quad \text{(dual feasibility)}
\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{M} \alpha_i y^i = 0 \quad \text{(dual feasibility)}
\frac{\partial L(w, b, \alpha)}{\partial \alpha} = 0 \;\Rightarrow\; y^i (\langle w, x^i \rangle + b) \ge 1 \quad \text{(primal feasibility)}

\alpha_i \left[ y^i (\langle w, x^i \rangle + b) - 1 \right] = 0, \; i = 1, \dots, M \quad \text{(complementarity conditions)}
\alpha_i \ge 0, \; i = 1, \dots, M

Determining the Optimal Separating Hyperplane

The feasibility conditions of the KKT require \alpha_i \ge 0 for all points x^i, and require the primal constraints to be satisfied:

y^i (\langle w, x^i \rangle + b) \ge 1 \quad \text{(points correctly classified)}

Determining the Optimal Separating Hyperplane

The \alpha_i, i = 1, \dots, M, determine the solution w:
- all pairs of datapoints (x^i, y^i) for which \alpha_i > 0 are the support vectors;
- all pairs of datapoints (x^i, y^i) for which \alpha_i = 0 are "irrelevant" when computing the margin.

Determining the Optimal Separating Hyperplane

Consider three cases for a datapoint x^i relative to the hyperplane \{x : \langle w, x \rangle + b = 0\} and the margin boundaries \{x : \langle w, x \rangle + b = \pm 1\}:

- 0 < y^i (\langle w, x^i \rangle + b) < 1: the point lies inside the margin;
- y^i (\langle w, x^i \rangle + b) \ge 1, i.e. \langle w, x^i \rangle + b \le -1 for y^i = -1 or \ge +1 for y^i = +1: the point lies outside the margin;
- y^i (\langle w, x^i \rangle + b) < 0, i.e. \langle w, x^i \rangle + b > 0 for y^i = -1 or < 0 for y^i = +1: the point does not satisfy the constraint (it is misclassified).

Determining the Optimal Separating Hyperplane

The decision function is then expressed in terms of the support vectors, using w = \sum_{i=1}^{M} \alpha_i y^i x^i (from \partial L / \partial w = 0):

f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y^i \langle x^i, x \rangle + b \right)

Use the complementarity condition \alpha_i \left[ y^i (\langle w, x^i \rangle + b) - 1 \right] = 0 to compute b.

To determine how good the hyperplane is, use cross-validation to get an estimate of the error on the testing set.
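For illustration (not the lecture's own code), scikit-learn's SVC exposes the dual solution, so w, b and the decision function above can be reconstructed from the support vectors; a sketch on a made-up toy dataset:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.5], [-0.5, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # very large C ~ hard margin

# w = sum_i alpha_i y^i x^i: dual_coef_ holds alpha_i * y^i for the support vectors
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()

# b from the complementarity condition on a support vector x_s: y_s(<w, x_s> + b) = 1
s = clf.support_[0]
b = y[s] - w @ X[s]

print(w, b, clf.intercept_)        # b should match clf.intercept_ up to tolerance
print(np.sign(X @ w + b), y)       # decision function recovers the training labels
```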

Non-Separable Data Sets

(Figure: datapoints labeled +1 and -1 that no hyperplane can separate.)

This is going to be a problem! What should we do?

Idea: introduce some slack on the constraints.

Support Vector Machine for Non-Separable Datasets

The constraints are relaxed by introducing slack variables \xi_i \ge 0, i = 1, \dots, M:

\langle w, x^i \rangle + b \ge +1 - \xi_i \quad \text{if } y^i = +1
\langle w, x^i \rangle + b \le -1 + \xi_i \quad \text{if } y^i = -1

This can again be expressed in compact notation:

y^i (\langle w, x^i \rangle + b) \ge 1 - \xi_i

Support Vector Machine for Non-Separable Datasets

The objective function adds a penalty for too-large slack variables:

\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^{M} \xi_j

Find a trade-off between maximizing the margin and minimizing the classification errors. C > 0 weights the influence of the penalty term.

Support Vector Machine for Non-Separable Datasets

Optimization-under-constraints problem of the form:

\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^{M} \xi_j
subject to y^j (\langle w, x^j \rangle + b) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad j = 1, \dots, M

The hyperplane has the same form of solution as in the separable case:

w = \sum_{j=1}^{M} \alpha_j y^j x^j
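A small sketch of this margin/error trade-off using scikit-learn's soft-margin SVC (note that scikit-learn's objective uses C times the sum of slacks rather than the C/M normalization written above; the data are made up):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping blobs: no perfect linear separation, so slack is needed
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_.ravel())
    errors = int(np.sum(clf.predict(X) != y))
    print(f"C={C:>6}: margin width={margin:.2f}, "
          f"training errors={errors}, support vectors={len(clf.support_)}")
```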

Support Vector Machine for Non-Separable Datasets

(Figure: margin boundaries \langle w, x \rangle + b = \pm 1, with the support vectors lying on or inside them.)

The decision boundary is determined only by the support vectors!

w = \sum_{i=1}^{M} \alpha_i y^i x^i

\alpha_i = 0 for non-support vectors; \alpha_i \neq 0 for support vectors.

Non-Linear Classification

What if the points in the input space cannot be separated by a linear hyperplane?

The Kernel Trick for SVM

As usual, observe that the decision function of the linear SVM computes an inner product across pairs of observations, \langle x^i, x^j \rangle:

f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y^i \langle x^i, x \rangle + b \right)

Non-Linear Classification with Support Vector Machines

The decision function in linear classification with SVM was given by:

f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y^i \langle x^i, x \rangle + b \right)

Replace the linear plane \langle w, x \rangle + b in the original space by a linear plane in feature space: x \mapsto \Phi(x) and \langle w, \Phi(x) \rangle + b (w now lives in feature space!). The decision function becomes:

f(x) = \mathrm{sgn}(\langle w, \Phi(x) \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y^i \langle \Phi(x^i), \Phi(x) \rangle + b \right)

Non-Linear Classification with Support Vector Machines

Use the kernel trick, exploiting the fact that the decision function depends only on a dot product in feature space:

k(x^i, x^j) = \langle \Phi(x^i), \Phi(x^j) \rangle

The decision function becomes:

f(x) = \mathrm{sgn}(\langle w, \Phi(x) \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y^i \, k(x^i, x) + b \right)
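As a sanity check (not part of the slides) that a kernel really is a dot product in some feature space, here the homogeneous polynomial kernel of degree 2 in two dimensions is compared against its explicit feature map \Phi:

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, x') = <x, x'>^2 in two dimensions."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def k(x, xp):
    """Homogeneous polynomial kernel of degree 2."""
    return (x @ xp) ** 2

x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])
print(k(x, xp), phi(x) @ phi(xp))   # both equal (1*3 + 2*(-1))^2 = 1.0
```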

Non-Linear Classification with Support Vector Machines

The optimization problem in feature space becomes (the kernel replaces the dot product):

\max_{\alpha} \; L(\alpha) = \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i, j = 1}^{M} \alpha_i \alpha_j y^i y^j \, k(x^i, x^j)
subject to \alpha_i \ge 0 for all i = 1, \dots, M, and \sum_{i=1}^{M} \alpha_i y^i = 0.

The decision function in feature space is computed as follows (again through the kernel):

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y^i \, k(x^i, x) + b \right)

Non-Linear Classification with Support Vector Machines

The offset b can be estimated from the KKT conditions. Best is to compute the expectation over the constraints, and hence to compute an estimate of b through regression:

b = \frac{1}{M} \sum_{j=1}^{M} \left( y^j - \sum_{i=1}^{M} \alpha_i y^i \, k(x^i, x^j) \right)
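A minimal NumPy sketch of this estimate, assuming the dual variables alpha, the labels y, the data X and a kernel function are already available (all names are hypothetical):

```python
import numpy as np

def estimate_offset(alpha, y, X, kernel):
    """b = (1/M) * sum_j [ y^j - sum_i alpha_i y^i k(x^i, x^j) ].
    In practice the average is often restricted to the support vectors (alpha_i > 0)."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix K[i, j]
    return np.mean(y - (alpha * y) @ K)

# Example kernel (the width value is a placeholder)
rbf = lambda x, xp, sigma=0.5: np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))
```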

How to read out the result of SVM

(Figures: the hyperplane, the support vectors, the margin, and a color gradient indicating the distance to the hyperplane.)

The Hyperparameters of SVM

SVM decision function:

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{M} \alpha_i y^i \, k(x^i, x) + b \right)

The kernel has several open parameters (hyperparameters) that need to be determined before running SVM:

Gaussian kernel: k(x, x') = e^{-\|x - x'\|^2 / (2\sigma^2)} \;\to\; kernel width \sigma

Inhomogeneous polynomial kernel: k(x, x') = (\langle x, x' \rangle + d)^p, \; p > 0, \; d \ge 0 \;\to\; order p of the polynomial (usually d = 1)
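The two kernels written out as NumPy functions; sigma, p and d are the open hyperparameters, and the default values below are arbitrary placeholders:

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=0.5):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); sigma is the kernel width."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, xp, p=3, d=1.0):
    """Inhomogeneous polynomial kernel k(x, x') = (<x, x'> + d)^p, p > 0, d >= 0."""
    return (x @ xp + d) ** p
```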

The Hyperparameters of SVM

Optimization-under-constraints problem of the form:

\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^{M} \xi_j
subject to y^j (\langle w, x^j \rangle + b) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad j = 1, \dots, M

C, which determines the cost associated with incorrectly classified datapoints, is an open parameter of the optimization function.
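Both C and the kernel width are typically selected by cross-validation; a sketch with scikit-learn's GridSearchCV (the parameter grid is an arbitrary placeholder, and scikit-learn parameterizes the RBF kernel by gamma rather than by the width sigma directly):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.1, 1, 10, 100, 1000],        # penalty weight
    "gamma": [0.01, 0.1, 1, 10],         # RBF parameter, gamma = 1 / (2 * sigma^2)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train)           # X_train, y_train: your labelled data
# print(search.best_params_, search.best_score_)
```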


Non-Linear Support Vector Machines: Examples

Effect of the penalty factor C

RBF kernel width = 0.20; C = 1000; several misclassified datapoints

Effect of the penalty factor C

RBF kernel width = 0.20; C = 2000; fewer misclassified datapoints

Effect of the width of the Gaussian kernel

RBF kernel width = 0.001; C = 1000; 113 support vectors out of 345 datapoints

Effect of the width of the Gaussian kernel

RBF kernel width = 0.008; C = 1000; 64 support vectors out of 345 datapoints

Effect of the width of the Gaussian kernel

RBF kernel width = 0.02; C = 1000; 33 support vectors out of 345 datapoints

Different optimization runs end up with different solutions

Several combinations of the support vectors yield the same optimum.

Support Vector Machine for Non-Separable Datasets

The original objective function:

\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^{M} \xi_j

Determining C may be difficult in practice.

Support Vector Machine for Non-Separable Datasets

\nu-SVM is an alternative that automatically optimizes the trade-off between model complexity (the largest margin) and the penalty on the error:

\min_{w, b, \xi, \rho} \; \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{M} \sum_{i=1}^{M} \xi_i
subject to y^i (\langle w, x^i \rangle + b) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \rho \ge 0.

Support Vector Machine for Non-Separable Datasets

\nu is an upper bound on the fraction of margin errors (i.e. the datapoints misclassified or lying inside the margin).

\nu is a lower bound on the fraction of support vectors.
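\nu-SVM is available, for example, as NuSVC in scikit-learn; a sketch (with made-up data) illustrating that the fraction of support vectors grows with \nu:

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

for nu in (0.05, 0.2, 0.9):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=0.5).fit(X, y)
    print(f"nu={nu}: fraction of support vectors = {len(clf.support_) / len(X):.2f}")
```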

\nu-SVM with \nu = 0.001, RBF kernel width 0.1

Increase in the number of support vectors with \nu = 0.2

Increase in the number of support vectors with \nu = 0.9

Increase in the error with \nu = 0.2

Increase in the error with \nu = 0.9

Summary: When to use SVM?

SVMs have a wide range of applications for all types of data (vision, text, handwriting, etc.).

SVM is very powerful for large-scale classification: optimized solvers exist for the training stage, and recall is rapid.

One issue is that the computation grows linearly with the number of support vectors, and the algorithm is not very sparse in support vectors.

Another issue is that it can predict only two classes. For multi-class classification, one needs to run several two-class classifiers.

Multi-Class SVM

(Example classes: children, female adult, male adult, separated by classifiers f_1, f_2, f_3.)

Construct a set of K binary classifiers f^1, \dots, f^K, each trained to separate one class from the rest.

Compute the class label in a winner-take-all approach:

j^{*} = \arg\max_{j = 1, \dots, K} \; \sum_{i=1}^{M} \alpha_i^{j} y^i \, k(x^i, x) + b^{j}

It is sufficient to compute only K - 1 classifiers for K classes, but computing the K-th classifier may provide tighter bounds on the K-th class.
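A sketch of the one-against-all scheme with scikit-learn's OneVsRestClassifier (classes and data are made up); the winner-take-all rule is the argmax over the K per-class decision functions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(2)
# Three made-up classes, e.g. 0 = children, 1 = female adult, 2 = male adult
X = np.vstack([rng.normal(c, 0.7, (50, 2)) for c in (0.0, 2.5, 5.0)])
y = np.repeat([0, 1, 2], 50)

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.5)).fit(X, y)
scores = ovr.decision_function(X[:3])   # one column per binary classifier f_j
print(scores.argmax(axis=1))            # winner-take-all: argmax_j f_j(x)
print(ovr.predict(X[:3]))               # same result
```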


Multi-Class SVM

Drawbacks of combining multiple binary classifiers:

- How to reject an instance if it belongs to none of the classes (garbage model, or a threshold on the minimum of the associated classifier functions; but it is difficult to compare the scales of the different classifier functions).
- Asymmetric classification (some classes have many more positive examples than others); one can play with the C penalty to give a relative influence as a function of the number of patterns, e.g. C = 10*M.

An alternative is to compute all the classes as part of one optimization function. Performance seems, however, comparable to one-against-all optimization (Frank & Hlavak, Multi-class support vector machine, 2002; J. Weston and C. Watkins, Multi-class support vector machines, 1998).

Relevance Vector Machine

Even though SVM usually results in a relatively small number of support vectors compared to the total number of datapoints, nothing ensures that the solution is sparse.

SVM also requires finding hyperparameters (C, the kernel width) and a special form for the basis functions (the kernel must satisfy the Mercer conditions).

RVM relaxes these two assumptions by taking a Bayesian approach.

Relevance Vector Machine

Start from the solution of SVM:

y(x) = \mathrm{sgn}(f(x)), \quad f(x) = \sum_{i=1}^{M} \alpha_i \, k(x, x^i) + b

Rewrite the solution of SVM as a linear combination over M basis functions:

y(x) = \alpha^{\top} \Phi(x), \quad \Phi(x) = \left( \Phi_1(x), \dots, \Phi_M(x) \right)^{\top}

A sparse solution has the majority of the entries of \alpha equal to zero, e.g. \alpha = (1, 0, 0, \dots, 1, 0)^{\top}.

Relevance Vector Machine

Model the distribution of the class label y^i \in \{0, 1\} with a Bernoulli distribution, replacing the sign function with the continuous but steep sigmoid function g(z) = 1 / (1 + e^{-z}):

p(y^i \mid x^i; \alpha) = g\left( \sum_{j=1}^{M} \alpha_j \Phi_j(x^i) + b \right)^{y^i} \left( 1 - g\left( \sum_{j=1}^{M} \alpha_j \Phi_j(x^i) + b \right) \right)^{1 - y^i}

Doing maximum likelihood would lead to overfitting, as we have more parameters than datapoints. Instead, approximate the distribution of the \alpha_j with a probability density function, which reduces the number of parameters to estimate.

The optimal \alpha cannot be computed in closed form; one must perform an iterative method similar to expectation maximization (see Tipping 2001, supplementary material).
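A minimal sketch of this probabilistic read-out, i.e. the class-membership probability under the sigmoid/Bernoulli model for a given weight vector alpha (all values are hypothetical; estimating alpha itself requires the iterative procedure referenced above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rvm_class_probability(x, X_basis, alpha, b, kernel):
    """p(y = 1 | x) = g( sum_i alpha_i k(x, x^i) + b ), with g the sigmoid."""
    z = sum(a * kernel(x, xi) for a, xi in zip(alpha, X_basis)) + b
    return sigmoid(z)

# Example kernel (the width value is a placeholder)
rbf = lambda x, xp, sigma=0.5: np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))
```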

Relevance Vector Machine

(Figure: effect of a fit parameter in [0, 1]; the smaller the value, the more closely the true parameters are fitted.)



Relevance Vector Machine

RVM with l=0.04, kernel width = 0.01

477 datapoints, 19 support vectors


Relevance Vector Machine

SVM with C=1000, kernel width = 0.01

477 datapoints, 51 support vectors


Pros and Cons of RVM

Pros:

- Yields sparser solutions (fewer support vectors)
- Outputs estimates of the posterior probability of class membership (uncertainty on the predicted class label)

Cons:

- Iterative method (needs a stopping criterion and an arbitrary value for the precision)
- Very computationally intensive at training time (memory and number of iterations grow with the square and the cube of the number of basis functions, i.e. with the number of datapoints)
