Ch3: LINEAR CLASSIFIERS

In this chapter, we will focus on the design of linear classifiers, regardless of the underlying distributions describing the training data.

The Problem: Consider a two-class task with classes ω1, ω2. The linear discriminant function is

g(x) = w^T x + w_0 = w_1 x_1 + w_2 x_2 + ... + w_l x_l + w_0,

where w = [w_1, w_2, ..., w_l]^T is known as the weight vector and w_0 as the threshold. The decision hyperplane is g(x) = 0.

Assume x_1, x_2 are two points on the decision hyperplane. Then

w^T x_1 + w_0 = 0 = w^T x_2 + w_0  =>  w^T (x_1 - x_2) = 0.

Hence, w is orthogonal to the decision hyperplane g(x) = w^T x + w_0 = 0.

(Figure: a point x, its projection x_p on the hyperplane, and the distances d and z, drawn for l = 2.) The distance of the hyperplane from the origin is

d = \frac{|w_0|}{\sqrt{w_1^2 + w_2^2}}

and the distance z of a point x from the hyperplane is

z = \frac{|g(x)|}{\sqrt{w_1^2 + w_2^2}}.

Indeed, write x = x_p + z \frac{w}{\|w\|}, where x_p is the projection of x on the hyperplane (w/\|w\| is colinear with w and has unit norm). Since g(x_p) = 0,

g(x) = w^T x + w_0 = w^T x_p + w_0 + z \frac{w^T w}{\|w\|} = z \|w\|,

therefore

z = \frac{g(x)}{\|w\|},  and in particular  d(0, H) = \frac{g(0)}{\|w\|} = \frac{w_0}{\|w\|}.
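As a quick numerical illustration of these formulas, the sketch below (NumPy, with hypothetical values for w, w_0 and x) evaluates g(x), the signed distance z, and the distance of the hyperplane from the origin:

import numpy as np

w = np.array([1.0, 1.0])              # weight vector (assumed example values)
w0 = -0.5                             # threshold
x = np.array([0.4, 0.05])             # an arbitrary point

g = w @ x + w0                        # discriminant value g(x)
z = g / np.linalg.norm(w)             # signed distance of x from the hyperplane
d0 = abs(w0) / np.linalg.norm(w)      # distance of the hyperplane from the origin
print(g, z, d0)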

The Perceptron Algorithm

Assume linearly separable classes, i.e., there exists w* such that

w*^T x > 0  for all x in ω1
w*^T x < 0  for all x in ω2.

The case w*^T x + w_0* falls under the above formulation, since with the augmented (extended) vectors

w' = [w*^T, w_0*]^T,   x' = [x^T, 1]^T

we have w*^T x + w_0* = w'^T x'.

Our goal: Compute a solution, i.e., a hyperplane w, so that

w^T x > 0,  x in ω1
w^T x < 0,  x in ω2.

• The steps
– Define a cost function to be minimized.
– Choose an algorithm to minimize the cost function.
– The minimum corresponds to a solution.

The Cost Function

J(w) = \sum_{x \in Y} \delta_x w^T x,

where Y is the subset of the training vectors wrongly classified by w, and

\delta_x = -1  if x in Y and x in ω1
\delta_x = +1  if x in Y and x in ω2.

With this choice J(w) >= 0. When Y = ∅ (empty set), a solution has been achieved and J(w) = 0.

• J(w) is piecewise linear (WHY?)

The Algorithm

• The philosophy of gradient descent is adopted:

w(t+1) = w(t) - \rho_t \left.\frac{\partial J(w)}{\partial w}\right|_{w = w(t)},

where ρ_t is a sequence of positive real numbers.

• In the general gradient descent scheme, w(new) = w(old) + Δw with

\Delta w = -\rho \left.\frac{\partial J(w)}{\partial w}\right|_{w = w(old)}.

• Wherever the derivative is valid,

\frac{\partial J(w)}{\partial w} = \frac{\partial}{\partial w} \sum_{x \in Y} \delta_x w^T x = \sum_{x \in Y} \delta_x x.

Hence

w(t+1) = w(t) - \rho_t \sum_{x \in Y} \delta_x x.

This is the celebrated Perceptron Algorithm.
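A minimal NumPy sketch of this batch update follows; the ±1 label convention, the zero initialization and the constant ρ are illustrative assumptions, not part of the slides:

import numpy as np

# Batch perceptron sketch. labels: array of +1 (omega1) / -1 (omega2).
def perceptron(X, labels, rho=0.1, max_iter=1000):
    X_ext = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented vectors [x^T, 1]^T
    w = np.zeros(X_ext.shape[1])                       # augmented weights [w^T, w0]^T
    for _ in range(max_iter):
        scores = X_ext @ w
        mis1 = (labels == +1) & (scores <= 0)          # misclassified omega1 vectors (delta = -1)
        mis2 = (labels == -1) & (scores >= 0)          # misclassified omega2 vectors (delta = +1)
        if not mis1.any() and not mis2.any():
            break                                      # Y is empty: a separating w has been found
        grad = -X_ext[mis1].sum(axis=0) + X_ext[mis2].sum(axis=0)   # sum_{x in Y} delta_x x
        w = w - rho * grad                             # w(t+1) = w(t) - rho * sum delta_x x
    return w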

The Perceptron Algorithm (figure).

An example (figure): if at step t a single vector x in ω1 is misclassified, the update reduces to

w(t+1) = w(t) + \rho_t x.

The perceptron algorithm converges in a finite number of iteration steps to a solution if the sequence ρ_t is chosen so that

\lim_{t \to \infty} \sum_{k=0}^{t} \rho_k = \infty   and   \lim_{t \to \infty} \frac{\sum_{k=0}^{t} \rho_k^2}{\left(\sum_{k=0}^{t} \rho_k\right)^2} = 0,

e.g., ρ_t = c / t.

Example: At some stage t the perceptron algorithm results in w_1 = 1, w_2 = 1, w_0 = -0.5, i.e., w(t) = [1, 1, -0.5]^T in the augmented space. The corresponding hyperplane is

x_1 + x_2 - 0.5 = 0.

The vectors [0.4, 0.05]^T in ω1 and [-0.2, 0.75]^T in ω2 are misclassified by it. With ρ = 0.7, the next weight vector will be

w(t+1) = [1, 1, -0.5]^T - 0.7(-1)[0.4, 0.05, 1]^T - 0.7(+1)[-0.2, 0.75, 1]^T = [1.42, 0.51, -0.5]^T,

and the new hyperplane is 1.42 x_1 + 0.51 x_2 - 0.5 = 0.
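The single update of this example can be checked directly (a small NumPy sketch):

import numpy as np

w = np.array([1.0, 1.0, -0.5])      # current augmented weight vector [w1, w2, w0]
x1 = np.array([0.4, 0.05, 1.0])     # misclassified vector from omega1 (delta = -1)
x2 = np.array([-0.2, 0.75, 1.0])    # misclassified vector from omega2 (delta = +1)
rho = 0.7

w_next = w - rho * ((-1) * x1 + (+1) * x2)
print(w_next)                       # [ 1.42  0.51 -0.5 ]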

A useful variant of the perceptron algorithm

It is a reward and punishment type of algorithm:

w(t+1) = w(t) + \rho x(t),  if x(t) in ω1 and w^T(t) x(t) <= 0
w(t+1) = w(t) - \rho x(t),  if x(t) in ω2 and w^T(t) x(t) >= 0
w(t+1) = w(t),  otherwise.

Correct classification → the reward is no action.
Incorrect classification → the punishment is the cost of correction.
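A sketch of this online (pattern-by-pattern) variant, under the same assumed ±1 label convention:

import numpy as np

def perceptron_online(X, labels, rho=0.1, n_epochs=100):
    X_ext = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented vectors
    w = np.zeros(X_ext.shape[1])
    for _ in range(n_epochs):
        corrected = False
        for x, y in zip(X_ext, labels):
            s = w @ x
            if y == +1 and s <= 0:        # omega1 vector on the wrong side: punish
                w = w + rho * x
                corrected = True
            elif y == -1 and s >= 0:      # omega2 vector on the wrong side: punish
                w = w - rho * x
                corrected = True
            # correctly classified vector: the reward is no action
        if not corrected:                 # a full pass without corrections: done
            break
    return w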

The perceptron

It is a learning machine that learns from the training vectors via the perceptron algorithm. The network is called perceptron or neuron; the w_i's are known as synapses or synaptic weights and w_0 as the threshold.

if w^T x + w_0 > 0, assign x to ω1
if w^T x + w_0 < 0, assign x to ω2.

Least Squares Methods

If the classes are linearly separable, the perceptron output results in ±1.

If the classes are NOT linearly separable, we shall compute the weights w so that the difference between

• the actual output of the classifier, w^T x, and
• the desired outputs, e.g.,
  y = +1 if x in ω1,  y = -1 if x in ω2,

is SMALL.

SMALL, in the mean square error sense, means to choose w so that the cost function

J(w) = E[(y - w^T x)^2]

is minimum:

\hat{w} = \arg\min_w J(w),

where y(x) = ±1 are the corresponding desired responses. Equivalently,

J(w) = P(ω_1) \int (1 - w^T x)^2 p(x|ω_1) dx + P(ω_2) \int (1 + w^T x)^2 p(x|ω_2) dx.

Minimizing J(w) with respect to w results in

\frac{\partial J(w)}{\partial w} = 2 E[x (y - x^T w)] = 0  =>  E[x x^T] \hat{w} = E[x y]  =>  \hat{w} = R_x^{-1} E[x y],

where R_x is the autocorrelation matrix

R_x \equiv E[x x^T] = \begin{bmatrix} E[x_1 x_1] & E[x_1 x_2] & \dots & E[x_1 x_l] \\ \vdots & & & \vdots \\ E[x_l x_1] & E[x_l x_2] & \dots & E[x_l x_l] \end{bmatrix}

and E[x y] = [E[x_1 y], \dots, E[x_l y]]^T is the cross-correlation vector. The condition E[x (y - x^T \hat{w})] = 0 is known as the orthogonality condition.
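In practice the expectations are replaced by sample averages over the training set; a minimal sketch, assuming X is the N x l data matrix and y the vector of ±1 desired responses:

import numpy as np

def mse_weights(X, y):
    R_x = (X.T @ X) / len(y)            # sample estimate of the autocorrelation matrix E[x x^T]
    r_xy = (X.T @ y) / len(y)           # sample estimate of the cross-correlation vector E[x y]
    return np.linalg.solve(R_x, r_xy)   # w_hat = R_x^{-1} E[x y]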

Interpretation of the MSE estimate as an orthogonal projection on the input vector elements' subspace: the minimum mean square error solution results if the error is orthogonal to each x_i, and thus orthogonal to the vector subspace spanned by x_i, i = 1, 2, ..., l; in other words, if y is approximated by its orthogonal projection on that subspace.

Multi-class generalization

• The goal is to compute M linear discriminant functions

g_i(x) = w_i^T x,   i = 1, 2, ..., M,

according to the MSE criterion.

• Adopt as desired responses y_i:

y_i = 1 if x in ω_i,  y_i = 0 otherwise.

• Let y = [y_1, y_2, ..., y_M]^T

• and the matrix W = [w_1, w_2, ..., w_M].

• The goal is to compute \hat{W}:

\hat{W} = \arg\min_W E\left[\|y - W^T x\|^2\right] = \arg\min_W \sum_{i=1}^{M} E\left[(y_i - w_i^T x)^2\right].

• The above is equivalent to M separate MSE minimization problems. That is: design each w_i so that its desired output is 1 for x in ω_i and 0 for any other class.

Remark: The MSE criterion belongs to a more general class of cost functions with the following important property:

• The value of g_i(x) is an estimate, in the MSE sense, of the a-posteriori probability P(ω_i|x), provided that the desired responses used during training are y_i = 1 for x in ω_i and 0 otherwise.
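A compact sketch of this M-class design with one-hot desired responses, using sample moments; the integer-label input format and the invertibility of X^T X are assumptions:

import numpy as np

def mse_multiclass(X, t, M):
    """X: N x l data matrix, t: length-N integer labels in {0, ..., M-1}."""
    Y = np.eye(M)[t]                          # one-hot desired responses, one row per x_i
    W = np.linalg.solve(X.T @ X, X.T @ Y)     # column i solves the MSE problem for class omega_i
    return W

# Classification: assign x to the class with the largest discriminant g_i(x) = W[:, i]^T x,
# e.g., label = int(np.argmax(W.T @ x))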

Mean square error regression: Let y in R^M, x in R^l be jointly distributed random vectors with joint pdf p(y, x).

• The goal: given the value of x, estimate the value of y. In the pattern recognition framework, given x one wants to estimate the respective label y = ±1.

• The MSE estimate \hat{y} of y, given x, is defined as:

\hat{y} = \arg\min_{\tilde{y}} E\left[\|y - \tilde{y}\|^2\right].

• It turns out that:

\hat{y} = E[y|x] = \int y \, p(y|x) \, dy.

The above is known as the regression of y given x and it is, in general, a non-linear function of x. If p(y, x) is Gaussian, the MSE regressor is linear.

A criterion closely related to the MSE is the sum of error squares, or simply the least squares (LS), criterion. SMALL in the sum of error squares sense means

J(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2 = \sum_{i=1}^{N} e_i^2,

where (y_i, x_i), i = 1, ..., N, are the training pairs, that is, the input x_i and its corresponding class label y_i (±1). Minimizing with respect to w gives

\frac{\partial J(w)}{\partial w} = 2 \sum_{i=1}^{N} x_i (y_i - x_i^T w) = 0  =>  \left(\sum_{i=1}^{N} x_i x_i^T\right) \hat{w} = \sum_{i=1}^{N} x_i y_i.

Pseudoinverse Matrix

Define

X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}  (an N x l matrix),
y = [y_1, \dots, y_N]^T  (the corresponding desired responses),

so that

X^T = [x_1, x_2, \dots, x_N]  (an l x N matrix).

Then

X^T X = \sum_{i=1}^{N} x_i x_i^T,   X^T y = \sum_{i=1}^{N} x_i y_i.

Thus the LS normal equations can be written as

\left(\sum_{i=1}^{N} x_i x_i^T\right) \hat{w} = \sum_{i=1}^{N} x_i y_i  <=>  (X^T X) \hat{w} = X^T y  =>  \hat{w} = (X^T X)^{-1} X^T y \equiv X^{\dagger} y,

where X^{\dagger} = (X^T X)^{-1} X^T is known as the pseudoinverse of X (for N > l).

Assume X square (N = l) and invertible. Then

X^{\dagger} = (X^T X)^{-1} X^T = X^{-1} (X^T)^{-1} X^T = X^{-1}.

Assume N > l. Then, in general, there is no solution w that satisfies all equations simultaneously:

x_1^T w = y_1
x_2^T w = y_2
...                    X w = y : N equations, l unknowns.
x_N^T w = y_N

The "solution" \hat{w} = X^{\dagger} y corresponds to the minimum sum of error squares solution.

Assume N < l. Then there are infinitely many solutions satisfying X w = y; in this case the pseudoinverse takes the form

\hat{w} = X^T (X X^T)^{-1} y \equiv X^{\dagger} y,

which is the minimum-norm solution.

Example: Class ω1 consists of the vectors [0.4, 0.5]^T, [0.6, 0.5]^T, [0.1, 0.4]^T, [0.2, 0.7]^T, [0.3, 0.3]^T and class ω2 of [0.4, 0.6]^T, [0.6, 0.2]^T, [0.7, 0.4]^T, [0.8, 0.6]^T, [0.7, 0.5]^T. Using the augmented vectors [x^T, 1]^T and desired responses +1 for ω1 and -1 for ω2:

X = [ 0.4 0.5 1
      0.6 0.5 1
      0.1 0.4 1
      0.2 0.7 1
      0.3 0.3 1
      0.4 0.6 1
      0.6 0.2 1
      0.7 0.4 1
      0.8 0.6 1
      0.7 0.5 1 ],   y = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]^T.

Then

X^T X = [[2.8, 2.24, 4.8], [2.24, 2.41, 4.7], [4.8, 4.7, 10]],   X^T y = [-1.6, 0.1, 0.0]^T,

and

\hat{w} = (X^T X)^{-1} X^T y \approx [-3.22, 0.24, 1.43]^T.
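The computation can be reproduced with a few lines of NumPy (np.linalg.pinv(X) @ y gives the same result here, since X has full column rank):

import numpy as np

X = np.array([[0.4, 0.5, 1], [0.6, 0.5, 1], [0.1, 0.4, 1], [0.2, 0.7, 1], [0.3, 0.3, 1],
              [0.4, 0.6, 1], [0.6, 0.2, 1], [0.7, 0.4, 1], [0.8, 0.6, 1], [0.7, 0.5, 1]])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])

print(X.T @ X)                              # [[2.8, 2.24, 4.8], [2.24, 2.41, 4.7], [4.8, 4.7, 10]]
print(X.T @ y)                              # [-1.6, 0.1, 0.0]
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                # approx [-3.22, 0.24, 1.43]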

The Bias-Variance Dilemma

A classifier g(x) is a learning machine that tries to predict the class label y of x. In practice, a finite data set D = {(y_i, x_i), i = 1, 2, ..., N} is used for its training; let us write g(x; D). Observe that:

For some training sets D, the training may result in good estimates; for some others the result may be worse.

The average performance of the classifier can be tested against the MSE-optimal value E[y|x], in the mean squares sense, that is:

E_D\left[(g(x; D) - E[y|x])^2\right],

where E_D is the mean over all possible data sets D.

The above is written as:

E_D\left[(g(x; D) - E[y|x])^2\right] = \left(E_D[g(x; D)] - E[y|x]\right)^2 + E_D\left[(g(x; D) - E_D[g(x; D)])^2\right].

In the above, the first term is the contribution of the bias and the second term is the contribution of the variance.

For a finite D, there is a trade-off between the two terms: increasing the bias reduces the variance and vice versa. This is known as the bias-variance dilemma.

Using a complex model results in low bias but high variance, as one changes from one training set to another. Using a simple model results in high bias but low variance.

(Figure) Two extreme cases: an estimate chosen independently of the training set D → high bias, zero variance; g1(x), a polynomial of high degree fitted to the training points → zero bias at the training points, but variance equal to the variance of the noise.

LOGISTIC DISCRIMINATION

Let an M-class task, ω1, ω2, ..., ωM. In logistic discrimination, the logarithms of the likelihood ratios are modeled via linear functions, i.e.,

\ln\frac{P(ω_i|x)}{P(ω_M|x)} = w_{i,0} + w_i^T x,   i = 1, 2, ..., M-1.

Taking into account that

\sum_{i=1}^{M} P(ω_i|x) = 1,

it can easily be shown that the above is equivalent to modeling the posterior probabilities as follows.

P(ω_M|x) = \frac{1}{1 + \sum_{i=1}^{M-1} \exp(w_{i,0} + w_i^T x)}

P(ω_i|x) = \frac{\exp(w_{i,0} + w_i^T x)}{1 + \sum_{j=1}^{M-1} \exp(w_{j,0} + w_j^T x)},   i = 1, 2, ..., M-1.

For the two-class case it turns out that

P(ω_2|x) = \frac{1}{1 + \exp(w_0 + w^T x)},   P(ω_1|x) = \frac{\exp(w_0 + w^T x)}{1 + \exp(w_0 + w^T x)}.
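These posteriors are exactly a softmax over the linear scores, with a fixed zero score for ω_M; a small sketch, assuming the parameters w_{i,0}, w_i have already been estimated:

import numpy as np

def posteriors(x, W, w0):
    """W: (M-1) x l matrix with rows w_i, w0: length-(M-1) vector of w_{i,0}."""
    a = np.append(w0 + W @ x, 0.0)      # linear scores; the last (zero) entry is class omega_M
    e = np.exp(a - a.max())             # subtract the max for numerical stability
    return e / e.sum()                  # [P(omega_1|x), ..., P(omega_M|x)]

# Two-class case: P(omega_1|x) = posteriors(x, w.reshape(1, -1), np.array([w0]))[0]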

The unknown parameters w_i, w_{i,0}, i = 1, 2, ..., M-1, are usually estimated by maximum likelihood arguments; any optimization algorithm can then be used to perform the required maximization.

Logistic discrimination is a useful tool, since it allows linear modeling and at the same time ensures that the posterior probabilities add up to one.

Support Vector Machines

The goal: Given two linearly separable classes, design the classifier

g(x) = w^T x + w_0 = 0

that leaves the maximum margin from both classes.

Margin: Each hyperplane is characterized by

• its direction in space, i.e., w,
• its position in space, i.e., w_0.

For EACH direction w, choose the hyperplane that leaves the SAME distance from the nearest points of each class. The margin is twice this distance.

Our goal is to search for the direction that gives the maximum possible margin.

The distance of a point \hat{x} from the hyperplane (w, w_0) is given by

z = \frac{|g(\hat{x})|}{\|w\|}.

Scale w, w_0 so that at the nearest points from each class the discriminant function is ±1:

g(x) = 1 for x in ω1  and  g(x) = -1 for x in ω2.

Thus the margin is given by

\frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}.

Also, the following is valid:

w^T x + w_0 >= 1,  for all x in ω1
w^T x + w_0 <= -1,  for all x in ω2.

SVM (linear) classifier: g(x) = w^T x + w_0.

Minimize

J(w) = \frac{1}{2}\|w\|^2

subject to

y_i (w^T x_i + w_0) >= 1,   i = 1, 2, ..., N,

where y_i = +1 for x_i in ω1 and y_i = -1 for x_i in ω2.

The above is justified since, by minimizing \|w\|, the margin 2/\|w\| is maximized.

The above is a nonlinear (quadratic) optimization task, subject to a set of linear inequality constraints. The Karush-Kuhn-Tucker (KKT) conditions state that the minimizer satisfies:

• (1) \frac{\partial L(w, w_0, \lambda)}{\partial w} = 0
• (2) \frac{\partial L(w, w_0, \lambda)}{\partial w_0} = 0
• (3) \lambda_i >= 0,  i = 1, 2, ..., N
• (4) \lambda_i \left[y_i (w^T x_i + w_0) - 1\right] = 0,  i = 1, 2, ..., N

where L(.) is the Lagrangian:

L(w, w_0, \lambda) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \lambda_i \left[y_i (w^T x_i + w_0) - 1\right].

The solution: from the above, it turns out that

w = \sum_{i=1}^{N} \lambda_i y_i x_i,   \sum_{i=1}^{N} \lambda_i y_i = 0.

Remarks:

• The Lagrange multipliers can be either zero or positive. Thus,

w = \sum_{i=1}^{N_s} \lambda_i y_i x_i,   N_s <= N,

where the sum involves only the vectors corresponding to positive Lagrange multipliers (λ_i ≠ 0). By condition (4), these vectors contributing to w satisfy

y_i (w^T x_i + w_0) = 1,  i.e.,  w^T x_i + w_0 = ±1.

These vectors are known as SUPPORT VECTORS and are the closest vectors, from each class, to the classifier.

Once w is computed, w_0 is determined from conditions (4):

\lambda_i \left[y_i (w^T x_i + w_0) - 1\right] = 0,  i = 1, 2, ..., N.

The optimal hyperplane classifier of a support vector machine is UNIQUE.

Although the solution is unique, the resulting Lagrange multipliers are not unique.

Feature vectors corresponding to λ_i = 0 can either lie outside the "class separation band", defined as the region between the two hyperplanes w^T x + w_0 = ±1, or they can also lie on one of these hyperplanes.

Although w is explicitly given, w_0 can be implicitly obtained by any of the (complementary slackness) conditions.

Dual Problem Formulation

The SVM formulation is a convex programming problem, with

• a convex cost function,
• a convex region of feasible solutions.

Thus, its solution can be achieved via its dual problem, i.e.,

• maximize L(w, w_0, \lambda)
• subject to

w = \sum_{i=1}^{N} \lambda_i y_i x_i,   \sum_{i=1}^{N} \lambda_i y_i = 0,   \lambda >= 0.

Combine the above to obtain:

maximize_\lambda  \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j

subject to

\sum_{i=1}^{N} \lambda_i y_i = 0,   \lambda >= 0.

Remarks:

• Support vectors enter into the game in pairs, in the form of inner products x_i^T x_j.
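The dual is a quadratic program and, for small problems, can be handed to any general-purpose constrained optimizer; the sketch below uses SciPy's SLSQP purely for illustration (dedicated QP/SMO solvers are used in practice):

import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    """Hard-margin dual. X: N x l data matrix, y: length-N vector of +/-1 labels."""
    N = len(y)
    K = (X @ X.T) * np.outer(y, y)                        # y_i y_j x_i^T x_j
    fun = lambda lam: 0.5 * lam @ K @ lam - lam.sum()     # negative of the dual objective
    cons = {'type': 'eq', 'fun': lambda lam: lam @ y}     # sum_i lambda_i y_i = 0
    res = minimize(fun, np.zeros(N), method='SLSQP',
                   bounds=[(0, None)] * N, constraints=[cons])
    lam = res.x
    w = (lam * y) @ X                                     # w = sum_i lambda_i y_i x_i
    sv = lam > 1e-6                                       # support vectors (lambda_i > 0)
    w0 = np.mean(y[sv] - X[sv] @ w)                       # from y_i (w^T x_i + w0) = 1
    return w, w0, lam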

Non-Separable Classes

In this case, there is no hyperplane such that

y_i (w^T x_i + w_0) >= 1,   i = 1, 2, ..., N.

Recall that the margin is defined as the distance between the two hyperplanes

w^T x + w_0 = 1   and   w^T x + w_0 = -1.

The training vectors now belong to one of three possible categories:

1) Vectors outside the band which are correctly classified, i.e.,
   y_i (w^T x + w_0) > 1
2) Vectors inside the band which are correctly classified, i.e.,
   0 <= y_i (w^T x + w_0) < 1
3) Vectors misclassified, i.e.,
   y_i (w^T x + w_0) < 0.

All three cases above can be represented as

y_i (w^T x + w_0) >= 1 - \xi_i

with

(1) \xi_i = 0,   (2) 0 < \xi_i <= 1,   (3) \xi_i > 1.

The ξ_i are known as slack variables.

The goal of the optimization is now two-fold:

• Maximize the margin.
• Minimize the number of patterns with ξ_i > 0.

One way to achieve this goal is via the cost

J(w, w_0, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} I(\xi_i),

where C is a constant and

I(\xi_i) = 1 if \xi_i > 0,   I(\xi_i) = 0 if \xi_i = 0.

• I(.) is not differentiable. In practice, we use the approximation

J(w, w_0, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i.

• Following a similar procedure as before, we obtain the KKT conditions below.
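Note that with the constraints y_i (w^T x_i + w_0) >= 1 - ξ_i, ξ_i >= 0, the optimal slack is ξ_i = max(0, 1 - y_i (w^T x_i + w_0)), so this approximate cost can also be minimized directly in the primal; a plain sub-gradient sketch (the learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

def soft_margin_primal(X, y, C=1.0, lr=0.01, n_iter=2000):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + w0))."""
    N, l = X.shape
    w, w0 = np.zeros(l), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + w0)
        viol = margins < 1                                    # points with xi_i > 0
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0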

KKT conditions

The Lagrangian is

L(w, w_0, \xi, \lambda, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \mu_i \xi_i - \sum_{i=1}^{N} \lambda_i \left[y_i (w^T x_i + w_0) - 1 + \xi_i\right]

and the corresponding KKT conditions are:

(1) \frac{\partial L}{\partial w} = 0  =>  w = \sum_{i=1}^{N} \lambda_i y_i x_i
(2) \frac{\partial L}{\partial w_0} = 0  =>  \sum_{i=1}^{N} \lambda_i y_i = 0
(3) \frac{\partial L}{\partial \xi_i} = 0  =>  C - \mu_i - \lambda_i = 0,  i = 1, 2, ..., N
(4) \lambda_i \left[y_i (w^T x_i + w_0) - 1 + \xi_i\right] = 0,  i = 1, 2, ..., N
(5) \mu_i \xi_i = 0,  i = 1, 2, ..., N
(6) \lambda_i >= 0, \mu_i >= 0,  i = 1, 2, ..., N.

The associated Wolfe dual representation now becomes: maximize L(w, w_0, ξ, λ, μ) subject to the equality constraints (1)-(3) above and λ_i >= 0, μ_i >= 0.

The associated dual problem is:

maximize_\lambda  \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j

subject to

\sum_{i=1}^{N} \lambda_i y_i = 0,   0 <= \lambda_i <= C,  i = 1, 2, ..., N.

Remarks: The only difference with the separable class case is the existence of C in the constraints 0 <= λ_i <= C.

Training the SVM

A major problem is the high computational cost. To this end, decomposition techniques are used. The rationale behind them is as follows:

• Start with an arbitrary data subset (working set) that can fit in memory. Perform optimization via a general-purpose optimizer.
• The resulting support vectors remain in the working set, while the other samples are replaced by new ones (from outside the set) that severely violate the KKT conditions.
• Repeat the procedure.
• The above procedure guarantees that the cost function decreases.
• Platt's Sequential Minimal Optimization (SMO) algorithm chooses a working set of only two samples, so that an analytic optimization solution can be obtained at each step.

Example (figure): Two nonseparable classes and the resulting SVM linear classifier (full line) with the associated margin (dotted lines) for the values (a) C = 0.2 and (b) C = 1000. In the latter case, the location and direction of the classifier, as well as the width of the margin, have changed in order to include a smaller number of points inside the margin. Observe the effect of the different values of C in the case of non-separable classes.
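A sketch of this kind of experiment, assuming scikit-learn is available (its linear SVC is trained with an SMO-type solver); the data set here is synthetic, generated only so that the two classes overlap:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0.0, 0.0], scale=0.8, size=(50, 2)),     # class omega1
               rng.normal(loc=[2.0, 2.0], scale=0.8, size=(50, 2))])    # class omega2 (overlapping)
y = np.array([1] * 50 + [-1] * 50)

for C in (0.2, 1000):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, w0 = clf.coef_[0], clf.intercept_[0]
    margin = 2.0 / np.linalg.norm(w)                  # width of the separation band
    print(f"C={C}: margin width = {margin:.3f}, support vectors = {len(clf.support_vectors_)}")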

Multi-class generalization

Although theoretical generalizations exist, the most popular approach in practice is to treat the problem as M two-class problems (one against all). For each one of the classes ω_i, we seek to design an optimal discriminant function g_i(x), i = 1, 2, ..., M, so that

g_i(x) > g_j(x),  for all j ≠ i,  if x in ω_i.

Classification rule: assign x to ω_i if i = \arg\max_k g_k(x).

This technique, however, may lead to indeterminate regions. Alternatives include the one-against-one approach, the error-correcting-coding approach, and extending the two-class SVM mathematical formulation to the M-class problem.

Error Correcting Coding method: for an M-class task, L binary classifiers are trained, and the Hamming distance (HD) of the resulting code word is measured against the M class code words.
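A minimal sketch of the one-against-all decision rule; the M discriminant functions (rows of W and entries of w0) are assumed to have been trained already, each as "class i versus the rest":

import numpy as np

def one_vs_all_predict(x, W, w0):
    """W: M x l matrix of weight vectors, w0: length-M vector of thresholds."""
    g = W @ x + w0             # g_i(x) = w_i^T x + w_{i,0}, i = 1, ..., M
    return int(np.argmax(g))   # index of the class with the largest discriminant value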