Linear Discriminant Functions (vision.psych.umn.edu/users/schrater/schrater_lab/...)
TRANSCRIPT
Linear Discriminant Functions
1
Linear discriminant functions and decision surfaces
• Definition: a linear discriminant function is a function that is a linear combination of the components of x
g(x) = w^t x + w_0 (1), where w is the weight vector and w_0 the bias
• A two-category classifier with a discriminant function of the form (1) uses the following rule: Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
⇔ Decide ω1 if w^t x > -w_0 and ω2 otherwise. If g(x) = 0, x is assigned to either class
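The two-category rule above can be sketched in a few lines of Python; the weights below are hypothetical, chosen only for illustration:

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return float(np.dot(w, x) + w0)

def decide(x, w, w0):
    """Return 1 (omega_1) if g(x) > 0, 2 (omega_2) if g(x) < 0, None on the boundary."""
    val = g(x, w, w0)
    if val > 0:
        return 1
    if val < 0:
        return 2
    return None  # g(x) = 0: the point may be assigned to either class

# Hypothetical weights for illustration
w, w0 = np.array([1.0, -2.0]), 0.5
print(decide(np.array([3.0, 1.0]), w, w0))  # g = 3 - 2 + 0.5 = 1.5 > 0 -> 1
```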
2
3
– The equation g(x) = 0 defines the decision surface that separates points assigned to the category ω1 from points assigned to the category ω2
– When g(x) is linear, the decision surface is a hyperplane
– Algebraic measure of the distance from x to the hyperplane (interesting result!)
4
5
– In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface
– The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias
$$x = x_p + r\,\frac{w}{\|w\|}$$
since $g(x_p) = 0$ and $w^t w = \|w\|^2$,
$$g(x) = w^t x + w_0 = w^t\!\left(x_p + r\,\frac{w}{\|w\|}\right) + w_0 = g(x_p) + r\,\|w\| \;\Rightarrow\; r = \frac{g(x)}{\|w\|}$$
in particular, $d([0,0], H) = \frac{|w_0|}{\|w\|}$
[Figure: point x at distance r from its projection x_p onto the hyperplane H, measured along the normal w]
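The distance result r = g(x)/||w|| can be checked numerically; the hyperplane below is made up for illustration:

```python
import numpy as np

def signed_distance(x, w, w0):
    """r = g(x) / ||w||: positive on the side w points to, negative on the other."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5
w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([0.0, 0.0]), w, w0))  # d([0,0], H) = |w0|/||w|| = 1, with sign -1
```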
6
– The multi-category case
• We define c linear discriminant functions
$$g_i(x) = w_i^t x + w_{i0}, \quad i = 1, \ldots, c$$
and assign x to ωi if gi(x) > gj(x) ∀ j ≠ i; in case of ties, the classification is undefined
• In this case, the classifier is a "linear machine"
• A linear machine divides the feature space into c decision regions, with gi(x) being the largest discriminant if x is in the region Ri
• For two contiguous regions Ri and Rj, the boundary that separates them is a portion of the hyperplane Hij defined by:
gi(x) = gj(x) ⇔ (wi – wj)^t x + (wi0 – wj0) = 0
• wi – wj is normal to Hij and
$$d(x, H_{ij}) = \frac{g_i(x) - g_j(x)}{\|w_i - w_j\|}$$
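The linear machine's argmax rule can be sketched as follows (the 3-class weights are made up; classes are indexed 0 to c-1):

```python
import numpy as np

def linear_machine(x, W, w0):
    """Evaluate g_i(x) = w_i^t x + w_i0 for all c classes and return the argmax index."""
    return int(np.argmax(W @ x + w0))

# Hypothetical 3-class machine in a 2-D feature space: rows of W are the w_i
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(linear_machine(np.array([2.0, 1.0]), W, w0))  # g = (2, 1, -2.5) -> class 0
```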
7
8
Generalized Linear Discriminant Functions
• Decision boundaries which separate between classes may not always be linear
• The complexity of the boundaries may sometimes require the use of highly non-linear surfaces
• A popular approach to generalize the concept of linear decision functions is to consider a generalized decision function as:
g(x) = w1 f1(x) + w2 f2(x) + … + wN fN(x) + wN+1 (1)
where fi(x), 1 ≤ i ≤ N, are scalar functions of the pattern x, x ∈ Rn (Euclidean space)
9
• Introducing fN+1(x) = 1 we get:
$$g(x) = \sum_{i=1}^{N+1} w_i f_i(x) = w^T y$$
$$\text{where } w = (w_1, w_2, \ldots, w_N, w_{N+1})^T \text{ and } y = (f_1(x), f_2(x), \ldots, f_N(x), f_{N+1}(x))^T$$
• This latter representation of g(x) implies that any decision function defined by equation (1) can be treated as linear in the (N + 1)-dimensional space (N + 1 > n)
• g(x) maintains its non-linearity characteristics in Rn
10
• The most commonly used generalized decision function is g(x) for which fi(x) (1 ≤ i ≤ N) are polynomials:
$$g(x) = w_0 + \sum_{i=1}^{N} w_i x_i + \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij} x_i x_j + \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=1}^{N} w_{ijk} x_i x_j x_k + \ldots$$
• Quadratic decision functions for a 2-dimensional feature space:
$$g(x) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2 + w_6$$
$$\text{here } w = (w_1, w_2, \ldots, w_6)^T \text{ and } y = (x_1^2,\ x_1 x_2,\ x_2^2,\ x_1,\ x_2,\ 1)^T$$
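The 2-D quadratic case amounts to mapping x into the feature vector y and taking a dot product; a minimal sketch, with a made-up weight vector that encodes the unit circle as the boundary:

```python
import numpy as np

def quadratic_features(x):
    """y = (x1^2, x1*x2, x2^2, x1, x2, 1) for a 2-D pattern x."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2, x1, x2, 1.0])

def g(x, w):
    """g(x) = w^T y, linear in the lifted 6-D space."""
    return float(np.dot(w, quadratic_features(x)))

# Hypothetical weights: g(x) = x1^2 + x2^2 - 1, a circular decision boundary
w_circle = np.array([1.0, 0.0, 1.0, 0.0, 0.0, -1.0])
print(g((0.0, 0.0), w_circle))  # -1.0: inside the circle
print(g((2.0, 0.0), w_circle))  # 3.0: outside the circle
```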
11
• For patterns x ∈ Rn, the most general quadratic decision function is given by:
$$g(x) = \sum_{i=1}^{n} w_{ii} x_i^2 + \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij} x_i x_j + \sum_{i=1}^{n} w_i x_i + w_{n+1} \quad (2)$$
• The number of terms on the right-hand side is:
$$l = N + 1 = n + \frac{n(n-1)}{2} + n + 1 = \frac{(n+1)(n+2)}{2}$$
This is the total number of weights, which are the free parameters of the problem
– n = 3 => 10-dimensional
– n = 10 => 66-dimensional
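The parameter count (n+1)(n+2)/2 is easy to check in code, which confirms the two examples (10 weights for n = 3, 66 for n = 10):

```python
def num_quadratic_weights(n):
    """l = N + 1 = n + n(n-1)/2 + n + 1 = (n+1)(n+2)/2 free parameters."""
    return (n + 1) * (n + 2) // 2

print(num_quadratic_weights(3))   # 10
print(num_quadratic_weights(10))  # 66
```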
12
• In the case of polynomial decision functions of order m, a typical fi(x) is given by:
$$f_i(x) = x_{i_1}^{e_1} x_{i_2}^{e_2} \cdots x_{i_m}^{e_m}$$
$$\text{where } 1 \le i_1, i_2, \ldots, i_m \le n \text{ and each } e_i,\ 1 \le i \le m, \text{ is 0 or 1}$$
– It is a polynomial with a degree between 0 and m. To avoid repetitions, i1 ≤ i2 ≤ … ≤ im
$$g^m(x) = \sum_{i_1=1}^{n} \sum_{i_2=i_1}^{n} \cdots \sum_{i_m=i_{m-1}}^{n} w_{i_1 i_2 \ldots i_m}\, x_{i_1} x_{i_2} \cdots x_{i_m} + g^{m-1}(x)$$
where gm(x) is the most general polynomial decision function of order m
13
Example 1: Let n = 3 and m = 2, then:
$$g^2(x) = \sum_{i_1=1}^{3} \sum_{i_2=i_1}^{3} w_{i_1 i_2}\, x_{i_1} x_{i_2} + g^1(x)$$
$$= w_{11} x_1^2 + w_{12} x_1 x_2 + w_{13} x_1 x_3 + w_{22} x_2^2 + w_{23} x_2 x_3 + w_{33} x_3^2 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4$$
Example 2: Let n = 2 and m = 3, then:
$$g^3(x) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} \sum_{i_3=i_2}^{2} w_{i_1 i_2 i_3}\, x_{i_1} x_{i_2} x_{i_3} + g^2(x)$$
$$= w_{111} x_1^3 + w_{112} x_1^2 x_2 + w_{122} x_1 x_2^2 + w_{222} x_2^3 + g^2(x)$$
$$\text{where } g^2(x) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} w_{i_1 i_2}\, x_{i_1} x_{i_2} + g^1(x) = w_{11} x_1^2 + w_{12} x_1 x_2 + w_{22} x_2^2 + w_1 x_1 + w_2 x_2 + w_3$$
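The non-repeating index tuples i1 ≤ i2 ≤ … ≤ im used in these sums are exactly what `itertools.combinations_with_replacement` enumerates, which makes the term counts in both examples easy to verify:

```python
from itertools import combinations_with_replacement

def monomial_indices(n, m):
    """All index tuples with 1 <= i1 <= i2 <= ... <= im <= n (degree-m terms, no repeats)."""
    return list(combinations_with_replacement(range(1, n + 1), m))

# Example 1 (n = 3, m = 2): x1^2, x1x2, x1x3, x2^2, x2x3, x3^2 -> 6 tuples
print(monomial_indices(3, 2))
# Example 2 (n = 2, m = 3): x1^3, x1^2 x2, x1 x2^2, x2^3 -> 4 tuples
print(monomial_indices(2, 3))
```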
14
15
17
• Augmentation: append a constant component 1 to x, so the bias folds into the weight vector
• Incorporate labels into data
• Margin
18
Learning Linear Classifiers
20
Basic Ideas
Directly "fit" a linear boundary in feature space.
• 1) Define an error function for classification
– Number of errors?
– Total distance of mislabeled points from boundary?
– Least squares error
– Within/between class scatter
• 2) Optimize the error function on training data
– Gradient descent
– Modified gradient descent
– Various search procedures
– How much data is needed?
21
Two-category case
• Given x1, x2, …, xn sample points, with true category labels: α1, α2, …, αn
• Decisions are made according to:
$$\text{if } a^t x_i > 0,\ \text{class } \omega_1 \text{ is chosen; if } a^t x_i < 0,\ \text{class } \omega_2 \text{ is chosen}$$
$$\alpha_i = +1 \text{ if point } x_i \text{ is from class } \omega_1, \qquad \alpha_i = -1 \text{ if point } x_i \text{ is from class } \omega_2$$
• Now these decisions are wrong when a^t xi is negative and xi belongs to class ω1. Let yi = αi xi. Then a^t yi > 0 when xi is correctly labelled, negative otherwise.
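The label-folding step yi = αi xi is a one-liner; a minimal sketch with made-up samples:

```python
import numpy as np

def label_normalize(X, alpha):
    """y_i = alpha_i * x_i, so a separating a satisfies a^t y_i > 0 for every sample."""
    return X * np.asarray(alpha, dtype=float)[:, None]

# Hypothetical samples (rows) and their +/-1 labels
X = np.array([[1.0, 2.0], [-3.0, 0.5]])
alpha = [1, -1]   # x_1 from omega_1, x_2 from omega_2
ys = label_normalize(X, alpha)
```

After this normalization, looking for a separating vector reduces to a one-sided condition: every correctly classified sample gives a positive dot product with a.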
22
• Separating vector (solution vector)
– Find a such that a^t yi > 0 for all i
• a^t yi = 0 defines a hyperplane with a as normal
23
• Problem: the solution is not unique
• Add additional constraints: a margin, the distance away from the boundary: a^t yi > b
24
Margins in data space
Larger margins promote uniqueness for underconstrained problems
25
Finding a solution vector by minimizing an error criterion
What error function? How do we weight the points? All the points or only error points?
Only error points:
Perceptron Criterion
$$J_P(a) = \sum_{y_i \in Y} (-a^t y_i), \qquad Y = \{y_i \mid a^t y_i < 0\}$$
Perceptron Criterion with Margin
$$J_P(a) = \sum_{y_i \in Y} (-a^t y_i), \qquad Y = \{y_i \mid a^t y_i < b\}$$
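The Perceptron criterion is straightforward to evaluate; a minimal sketch with hypothetical normalized samples:

```python
import numpy as np

def perceptron_criterion(a, ys, b=0.0):
    """J_P(a): sum of -a^t y_i over the misclassified set Y = {y_i | a^t y_i < b}."""
    vals = ys @ a
    return float(np.sum(-vals[vals < b]))

# Hypothetical normalized samples (rows are y_i) and a weight vector
ys = np.array([[1.0, 0.0], [-2.0, 0.0], [0.0, 1.0]])
a = np.array([1.0, 0.0])
print(perceptron_criterion(a, ys))  # only the second sample is misclassified: J_P = 2.0
```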
26
27
Minimizing Error via Gradient Descent
28
• The min (max) problem:
$$\min_x f(x)$$
• But we learned in calculus how to solve that kind of question!
29
Motivation
• Not exactly.
• Functions: $f : \mathbb{R}^n \to \mathbb{R}$
• High order polynomials:
$$x - \frac{1}{6}x^3 + \frac{1}{120}x^5 - \frac{1}{5040}x^7$$
• What about functions that don't have an analytic representation: a "black box"
30
$$f(x, y) := \cos\!\left(\tfrac{1}{2}x\right)\cos\!\left(\tfrac{1}{2}y\right)x$$
[Figure: surface plot of f]
31
Directional Derivatives: first, the one-dimensional derivative
32
Directional Derivatives: Along the Axes…
$$\frac{\partial f(x,y)}{\partial x}, \qquad \frac{\partial f(x,y)}{\partial y}$$
33
Directional Derivatives: In a general direction…
$$\frac{\partial f(x,y)}{\partial v}, \qquad v \in \mathbb{R}^2, \quad \|v\| = 1$$
34
Directional Derivatives
$$\frac{\partial f(x,y)}{\partial x}, \qquad \frac{\partial f(x,y)}{\partial y}$$
35
The Gradient: Definition in the plane $\mathbb{R}^2$
$$f : \mathbb{R}^2 \to \mathbb{R}, \qquad \nabla f(x,y) := \left(\frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y}\right)$$
36
The Gradient: Definition
$$f : \mathbb{R}^n \to \mathbb{R}, \qquad \nabla f(x_1, \ldots, x_n) := \left(\frac{\partial f}{\partial x_1},\ \ldots,\ \frac{\partial f}{\partial x_n}\right)$$
37
The Gradient Properties
• The gradient defines a (hyper)plane approximating the function infinitesimally:
$$\Delta z = \frac{\partial f}{\partial x}\,\Delta x + \frac{\partial f}{\partial y}\,\Delta y$$
38
The Gradient Properties
• Computing directional derivatives
• By the chain rule (important for later use), for $\|v\| = 1$:
$$\frac{\partial f}{\partial v}(p) = \langle (\nabla f)_p,\ v \rangle$$
(the rate of change along v, at a point p)
39
The Gradient Properties
$\frac{\partial f}{\partial v}\big|_p$ is maximal choosing $v = \frac{(\nabla f)_p}{\|(\nabla f)_p\|}$ and minimal choosing $v = -\frac{(\nabla f)_p}{\|(\nabla f)_p\|}$
(intuition: the gradient points in the direction of greatest change)
40
The Gradient Properties
• Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function around p; if f has a local minimum (maximum) at p, then
$$(\nabla f)_p = \vec{0}$$
(necessary for a local min (max))
41
The Gradient Properties
Intuition
42
The Gradient Properties
Formally: for any $v \in \mathbb{R}^n \setminus \{0\}$ we get:
$$0 = \frac{d f(p + t\,v)}{dt}\bigg|_{t=0} = \langle (\nabla f)_p,\ v \rangle \;\Rightarrow\; (\nabla f)_p = \vec{0}$$
43
The Gradient Properties
• We found the best INFINITESIMAL DIRECTION at each point
• Looking for the minimum using a "blind man" procedure
• How can we get to the minimum using this knowledge?
44
Steepest Descent
• Algorithm
Initialization: $x_0 \in \mathbb{R}^n$
loop while $\|\nabla f(x_i)\| > \varepsilon$
    compute search direction $h_i = \nabla f(x_i)$
    update $x_{i+1} = x_i - \lambda\, h_i$
end loop
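The steepest-descent loop is only a few lines; a minimal sketch assuming a fixed learning rate λ, applied to a made-up quadratic bowl whose minimum is known:

```python
import numpy as np

def steepest_descent(grad_f, x0, lam=0.1, eps=1e-6, max_iter=10_000):
    """Loop while ||grad f(x_i)|| > eps: h_i = grad f(x_i), x_{i+1} = x_i - lam * h_i."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = grad_f(x)                 # search direction (the gradient)
        if np.linalg.norm(h) <= eps:  # stopping criterion
            break
        x = x - lam * h               # update step
    return x

# Hypothetical example: f(x) = ||x - c||^2 has gradient 2(x - c) and minimum at c
c = np.array([1.0, -2.0])
x_min = steepest_descent(lambda x: 2.0 * (x - c), np.zeros(2))
```

A fixed λ works here because the bowl is well conditioned; the next slide's point is precisely that the rate can be set more intelligently.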
45
Setting the Learning Rate Intelligently
Minimizing this approximation:
46
Minimizing Perceptron Criterion
Perceptron Criterion
$$J_P(a) = \sum_{y_i \in Y} -a^t y_i, \qquad Y = \{y_i \mid a^t y_i < 0\}$$
$$\nabla_a J_P(a) = \sum_{y_i \in Y} -y_i$$
Which gives the update rule:
$$a(k+1) = a(k) + \eta_k \sum_{y_i \in Y_k} y_i$$
where $Y_k$ is the misclassified set at step k
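The update rule translates directly into a batch perceptron loop; a minimal sketch assuming a fixed η and samples that are already augmented and label-normalized (y_i = α_i x_i):

```python
import numpy as np

def batch_perceptron(ys, eta=1.0, max_epochs=1_000):
    """a(k+1) = a(k) + eta * sum of misclassified y_i; stops when no errors remain."""
    a = np.zeros(ys.shape[1])
    for _ in range(max_epochs):
        mis = ys[ys @ a <= 0.0]  # misclassified set Y_k (<= so the a = 0 start makes progress)
        if len(mis) == 0:        # no errors: a is a separating vector
            return a
        a = a + eta * mis.sum(axis=0)
    return a

# Hypothetical label-normalized samples y_i (rows); this set is linearly separable
ys = np.array([[1.0, 0.2], [-1.0, 0.5], [0.0, 1.0]])
a = batch_perceptron(ys)
```

On separable data like this, the classical convergence result guarantees the loop terminates with a^t y_i > 0 for all i; on non-separable data it would run until max_epochs.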
47
Batch Perceptron Update
48
49
Simplest Case
50
When does the perceptron rule converge?
51
52
Different Error Functions
Four learning criteria as a function of weights in a linear classifier.
At the upper left is the total number of patterns misclassified, which is piecewise constant and hence unacceptable for gradient descent procedures.
At the upper right is the Perceptron criterion (Eq. 16), which is piecewise linear and acceptable for gradient descent.
The lower left is squared error (Eq. 32), which has nice analytic properties and is useful even when the patterns are not linearly separable.
The lower right is the squared error with margin. A designer may adjust the margin b in order to force the solution vector to lie toward the middle of the b = 0 solution region, in hopes of improving generalization of the resulting classifier.
53
Three Different Issues
1) Class boundary expressiveness
2) Error definition
3) Learning method
54
Fisher's Linear Discriminant
55
Fisher's Linear Discriminant: Intuition
56
Write this in terms of w
We are looking for a good projection
So,
Next,
57
Define the scatter matrix:
Then,
Thus the denominator can be written:
58
Rayleigh Quotient
Maximizing is equivalent to solving:
With solution:
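The equations on these slides did not survive transcription, but the classical closed-form Fisher solution, w proportional to S_W^{-1}(m1 - m2) with S_W the within-class scatter, can be sketched directly (the 2-D sample arrays below are made up for illustration):

```python
import numpy as np

def fisher_direction(X1, X2):
    """w proportional to S_W^{-1}(m1 - m2): the classical Fisher linear discriminant."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)   # scatter matrix of class 1
    S2 = (X2 - m2).T @ (X2 - m2)   # scatter matrix of class 2
    Sw = S1 + S2                   # within-class scatter
    return np.linalg.solve(Sw, m1 - m2)

# Made-up 2-D samples, one class per array (rows are points)
X1 = np.array([[2.0, 0.0], [3.0, 1.0], [3.0, -1.0], [2.0, 1.0]])
X2 = np.array([[-2.0, 0.0], [-3.0, 1.0], [-3.0, -1.0], [-2.0, 1.0]])
w = fisher_direction(X1, X2)
```

Projecting the data onto this w separates the class means while keeping each class's projected scatter small, which is exactly what maximizing the Rayleigh quotient achieves.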
59
60
61
62