eecs349 linear discriminants


Page 1: eecs349 linear discriminants

Bryan Pardo, EECS 349 Machine Learning, 2015

Thanks to Mark Cartwright for his contributions to these slides. Thanks to Alpaydin, Bishop, and Duda/Hart/Stork for images and ideas.

Machine Learning

Topic: Linear Discriminants


Page 2: eecs349 linear discriminants

Discrimination Learning Task

There is a set of possible examples: X = {x1, ..., xn}

Each example is a vector of k real-valued attributes: xi = <xi1, ..., xik>

A target function f : X → Y maps X onto some categorical variable Y

The DATA is a set of tuples <example, response value>: {<x1, y1>, ..., <xn, yn>}

Find a hypothesis h such that ∀x, h(x) ≈ f(x)

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012

Page 3: eecs349 linear discriminants

Reminder about notation

• x is a vector of attributes <x1, x2, …, xk>

• w is a vector of weights <w1, w2, …, wk>

• Given this… g(x) = w0 + w1x1 + w2x2 + … + wkxk

• We can notate it with linear algebra as g(x) = w0 + wᵀx

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 3

Page 4: eecs349 linear discriminants

It is more convenient if…

• g(x) = w0 + wᵀx is ALMOST what we want, but that pesky offset w0 is not in the linear algebra part yet.

• If we define w to include w0 and x to include an x0 that is always 1, now…

x is a vector of attributes <1, x1, x2, …, xk>

w is a vector of weights <w0, w1, w2, …, wk>

• This lets us notate things as… g(x) = wᵀx
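As a quick concreteness check, here is this augmented-vector notation in numpy (the weight values are the ones used in the worked example later in the deck; the data point is made up):

```python
import numpy as np

w = np.array([-5.0, 0.0, 1.0])        # w = <w0, w1, w2>, bias weight first
x_raw = np.array([2.0, 6.0])          # a raw example with k = 2 attributes

x = np.concatenate(([1.0], x_raw))    # prepend x0 = 1 so the offset joins the dot product
g = w @ x                             # g(x) = w0 + w1*x1 + w2*x2 = w^T x
print(g)                              # -5 + 0*2 + 1*6 = 1.0
```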

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 4

Page 5: eecs349 linear discriminants

Visually: Where to draw the line?

[Figure: data points plotted in the (a1, a2) plane; where should the separating line be drawn?]

Bryan Pardo, Machine Learning: EECS 349

Page 6: eecs349 linear discriminants

Two-Class Classification

h(x) = 1 if g(x) > 0, −1 otherwise

g(x) = 0 defines a decision boundary that splits the space in two:

g(x) = w0 + w1x1 + w2x2 = 0

[Figure: the boundary line in the (x1, x2) plane, with g(x) > 0 on one side and g(x) < 0 on the other.]

If a line exists that does this without error, the classes are linearly separable.

Bryan Pardo, Machine Learning: EECS 349

Page 7: eecs349 linear discriminants

Example 2-D decision boundaries

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 7

0 = g(x) = w0 + w1x1 + w2x2 = wᵀx

[Three plots in the (x1, x2) plane, one per weight setting:]

w0 = −5, w1 = 0, w2 = 1

w0 = −5, w1 = 0.5, w2 = 0

w0 = 0, w1 = 1, w2 = −1

Page 8: eecs349 linear discriminants

What’s the difference?

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 8

0 = g(x) = w0 + w1x1 + w2x2 = wᵀx

[Two plots in the (x1, x2) plane:]

w0 = 0, w1 = 1, w2 = −1

w0 = 0, w1 = −1, w2 = 1

What’s the difference between these two?

Page 9: eecs349 linear discriminants

Loss/Objective function

• To train a model (e.g. learn the weights of a useful line) we define a measure of the "goodness" of that model (e.g. the number of misclassified points).

• We make that measure a function of the parameters of the model (and the data).

• This is called a loss function, or an objective function.

• We want to minimize the loss (or maximize the objective) by picking good model parameters.

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 9

Page 10: eecs349 linear discriminants

Classification via regression

• Linear regression's loss function is the squared distance from a data point to the line, summed over all data points.

• The line that minimizes this function can be calculated by applying a simple formula.

• Can we find a decision boundary in one step, by just repurposing the math we used for finding a regression line?

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 10

w = (XᵀX)⁻¹Xᵀy

Page 11: eecs349 linear discriminants

Classification via regression

• Label each class by a number

• Call that number the response variable

• Analytically derive a regression line

• Round the regression output to the nearest label number (a short sketch in numpy follows below)

Bryan Pardo, Machine Learning: EECS 349 Fall 2014 11
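A minimal numpy sketch of this recipe, assuming two classes labeled 0 and 1 and toy one-attribute data made up for illustration:

```python
import numpy as np

# toy data: class 0 clusters near x = 1, class 1 clusters near x = 3
X_raw = np.array([[0.8], [1.0], [1.2], [2.8], [3.0], [3.2]])
y = np.array([0, 0, 0, 1, 1, 1])                   # class labels used as the response variable

X = np.hstack([np.ones((len(X_raw), 1)), X_raw])   # prepend x0 = 1 for the offset term
w = np.linalg.solve(X.T @ X, X.T @ y)              # w = (X^T X)^(-1) X^T y

predictions = np.rint(X @ w).astype(int)           # round regression output to nearest label
print(w, predictions)                              # predictions should match y on this toy data
```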

Page 12: eecs349 linear discriminants

An example

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012 12

[Figure: points labeled 0 and 1 along the x axis, with the response r on the vertical axis; the fitted regression line and the resulting decision boundary at r = 0.5 are shown.]

Page 13: eecs349 linear discriminants

What happens now?

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012 13

[Figure: the same style of plot, points labeled 0 and 1 with a fitted regression line and the r = 0.5 decision boundary, after the data has changed.]

Page 14: eecs349 linear discriminants

Classification via regression take-away

• Closed form solution: just hand me the data and I can apply that simple formula for getting the regression line.

• Very sensitive to outliers

• What’s the natural mapping from categories to the real numbers?

• Not used in practice (too finicky)

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 14

Page 15: eecs349 linear discriminants

What can we do instead?

• Let’s define an objective (aka “loss”) function that directly measures the thing we want to get right

• Then let’s try and find the line that minimizes the loss.

• How about basing our loss function on the number of misclassifications?

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 15

Page 16: eecs349 linear discriminants

h(x) = 1 if g(x) > 0, −1 otherwise

sum of squared errors (SSE):

SSE = Σ_{i=1}^{n} (yi − h(xi))²

g(x) = w0 + w1x1 + w2x2 = 0 = wᵀx

[Figure: a candidate boundary with g(x) > 0 on one side and g(x) < 0 on the other; for this boundary, SSE = 16.]

Bryan Pardo, Machine Learning: EECS 349 Fall 2017

Page 17: eecs349 linear discriminants

h(x) = 1 if g(x) > 0, −1 otherwise

sum of squared errors (SSE):

SSE = Σ_{i=1}^{n} (yi − h(xi))²

g(x) = w0 + w1x1 + w2x2 = 0 = wᵀx

[Figure: a boundary that separates the two classes; for this boundary, SSE = 0.]

Bryan Pardo, Machine Learning: EECS 349 Fall 2017
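A minimal numpy sketch of this loss, using {+1, −1} labels and made-up points and weights; each misclassified point contributes (1 − (−1))² = 4 to the SSE:

```python
import numpy as np

def h(X, w):
    """Classify each row of X (already augmented with x0 = 1) as +1 or -1."""
    return np.where(X @ w > 0, 1, -1)

def sse(X, y, w):
    """Sum of squared errors between the true labels and h(x)."""
    return np.sum((y - h(X, w)) ** 2)

# made-up 2-D data: two points per class
X = np.array([[1, 2.0, 6.0], [1, 3.0, 7.0], [1, 5.0, 1.0], [1, 6.0, 2.0]])
y = np.array([1, 1, -1, -1])

print(sse(X, y, np.array([0.0, -1.0, 1.0])))   # separates the classes: SSE = 0
print(sse(X, y, np.array([0.0, 1.0, -1.0])))   # every point on the wrong side: SSE = 16
```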

Page 18: eecs349 linear discriminants

No closed form solution!

• For many objective functions we can't find a formula to get the best model parameters, like we could with regression.

• The objective function from the previous slide is one of those ”no closed form solution” functions.

• This means we have to make various guesses for what the weights should be and try them out.

• Let’s look at the perceptron approach.

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 18

Page 19: eecs349 linear discriminants

Let’s learn a decision boundary

• We'll do 2-class classification
• We'll learn a linear decision boundary: 0 = g(x) = wᵀx
• Things on each side of 0 get their class labels according to the sign of what g(x) outputs:

h(x) = 1 if g(x) > 0, −1 otherwise

• We will use the Perceptron algorithm.

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 19

Page 20: eecs349 linear discriminants

The Perceptron

• Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408

• The "first wave" in neural networks

• A linear classifier

Page 21: eecs349 linear discriminants

A single perceptron

[Figure: a single perceptron. An input vector x1…x5 plus a bias input x0 = 1 feeds through weights w0…w5 into a sum of weighted inputs, followed by a step function that produces the output.]

o(x) = 1 if Σi wi xi > 0, 0 otherwise

Page 22: eecs349 linear discriminants

Weights define a hyperplane in the input space

[Figure: a perceptron computing AND. Inputs x1 and x2 have weight 0.5 each; the bias input (always 1) has weight −0.99. In the (x1, x2) plane, the line 0.5x1 + 0.5x2 − 0.99 = 0 separates the "1 region" from the "0 region".]

o(x) = 1 if Σi wi xi > 0, 0 otherwise

Page 23: eecs349 linear discriminants

Classifies any (linearly separable) data

[Figure: the same AND perceptron and its decision line 0.5x1 + 0.5x2 − 0.99 = 0, separating the 1 region from the 0 region.]

Page 24: eecs349 linear discriminants

Different logical functions are possible

[Figure: a perceptron computing OR. Inputs x1 and x2 have weight 0.5 each; the bias input has weight −0.49. The line 0.5x1 + 0.5x2 − 0.49 = 0 separates the 1 region from the 0 region.]

o(x) = 1 if Σi wi xi > 0, 0 otherwise
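A minimal Python sketch of these logic units, using the weights from the AND and OR slides and the step rule o(x) = 1 if Σ wi xi > 0:

```python
import numpy as np

def perceptron_unit(w, x):
    """Step-function unit: output 1 if the weighted sum (bias included) is positive."""
    x = np.concatenate(([1.0], x))        # prepend the bias input x0 = 1
    return 1 if w @ x > 0 else 0

AND_W = np.array([-0.99, 0.5, 0.5])       # weights from the AND slide: bias, x1, x2
OR_W  = np.array([-0.49, 0.5, 0.5])       # weights from the OR slide

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_unit(AND_W, [a, b]), perceptron_unit(OR_W, [a, b]))
# prints the truth tables: AND fires only on (1,1); OR fires on everything except (0,0)
```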

Page 25: eecs349 linear discriminants

And, Or, Not are easy to define

[Figure: a perceptron computing NOT of a single input x, with weight −1 on x and weight 0 on the bias input; the boundary −x = 0 puts the 1 region at x < 0 and the 0 region at x > 0.]

o(x) = 1 if Σi wi xi > 0, 0 otherwise

Page 26: eecs349 linear discriminants

One perceptron: Only linear decisions

This is XOR.

It can’t learn XOR.

Page 27: eecs349 linear discriminants

Combining perceptrons can make any Boolean function

[Figure: a small network of perceptron units wired together to compute XOR; the weights shown include 1, 0.6, and −0.6, plus bias weights of 0.]

…if you can set the weights & connections right

Page 28: eecs349 linear discriminants

Defining our goal

D is our data, consisting of training examples <x, y>. Remember y is the true label (drawn from {1, −1}) and x is the thing being labeled.

Our goal: make (wᵀx)y > 0 for all <x, y> ∈ D

Think about why this is the goal.
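One way to see it: (wᵀx)y is positive when wᵀx and the label y ∈ {+1, −1} have the same sign, i.e. when the point lands on the correct side of the boundary. A quick numeric check in numpy, using the weights and points from the next slides:

```python
import numpy as np

def correctly_classified(w, x, y):
    """True when the sign of w.x agrees with the label y in {+1, -1}."""
    return (w @ x) * y > 0

w = np.array([-5.0, 0.0, 1.0])                                   # the weights used below
print(correctly_classified(w, np.array([1.0, 5.0, 7.0]), +1))    # True:  (w.x)y = 2
print(correctly_classified(w, np.array([1.0, 2.0, 6.0]), -1))    # False: (w.x)y = -1
```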

Bryan Pardo, Machine Learning: EECS 349 28

Page 29: eecs349 linear discriminants

An example

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 29

w = [w0, w1, w2] = [−5, 0, 1]

[Figure: the line g(x) = 0 in the (x1, x2) plane, with g(x) > 0 on one side and g(x) < 0 on the other; the data points are (5,7) and (2,6).]

Goal: classify (5,7) as +1 and (2,6) as −1 by putting a line between them.

Our objective function is… (wᵀx)y > 0

Start with a randomly placed line.

Measure the objective for each point.

Move the line if the objective isn't met.

Page 30: eecs349 linear discriminants

An example

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 30

w = [w0, w1, w2] = [−5, 0, 1]

[Figure: the same line and the points (5,7) and (2,6).]

Goal: classify (5,7) as +1 and (2,6) as −1 by putting a line between them.

Our objective function is… (wᵀx)y > 0

Start with a randomly placed line.

(wᵀx)y = ([−5, 0, 1] · [1, 5, 7]) (1) = 2 > 0
Objective met. Don't move the line.

Page 31: eecs349 linear discriminants

An example

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 31

w = [w0, w1, w2] = [−5, 0, 1]

[Figure: the same line and the points (5,7) and (2,6).]

Goal: classify (5,7) as +1 and (2,6) as −1 by putting a line between them.

Our objective function is… (wᵀx)y > 0

Start with a randomly placed line.

(wᵀx)y = ([−5, 0, 1] · [1, 2, 6]) (−1) = (−5 + 6)(−1) = −1 < 0
Objective not met. Move the line.

Page 32: eecs349 linear discriminants

An example

Bryan Pardo, Machine Learning: EECS 349 Fall 2017 32

w = [w0, w1, w2] = [−5, 0, 1]

[Figure: the same line and the points (5,7) and (2,6).]

Goal: classify (5,7) as +1 and (2,6) as −1 by putting a line between them.

Our objective function is… (wᵀx)y > 0

Start with a randomly placed line.

Let's update the line by doing w = w + xy:

w = w + xy = [−5, 0, 1] + [1, 2, 6](−1) = [−6, −2, −5]
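A quick numeric check of this update step in numpy, with the same numbers as the slide:

```python
import numpy as np

w = np.array([-5.0, 0.0, 1.0])     # current weights [w0, w1, w2]
x = np.array([1.0, 2.0, 6.0])      # misclassified point (2, 6), augmented with x0 = 1
y = -1                             # its true label

print((w @ x) * y)                 # -1.0 -> objective (w.x)y > 0 not met
w = w + x * y                      # perceptron update: w = w + xy
print(w)                           # [-6. -2. -5.]
```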

Page 33: eecs349 linear discriminants

Now what ?

• What does the decision boundary look like when w= [−6,−2, −5] ? Does it misclassify the blue dot now?

• What if we update it the same way, each time we find a misclassified point?

• Could this approach be used to find a good separation line for a lot of data?

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012 33

Page 34: eecs349 linear discriminants

Perceptron Algorithm

Bryan Pardo, EECS 349 Fall 2017 34

The classification function:

h(x) = 1 if g(x) > 0, −1 otherwise

The decision boundary:

0 = g(x) = wᵀx

The weight update algorithm:

w = some random setting
n = |D| = size of data set
Do
    i = (i + 1) mod n
    if h(xi) ≠ yi then w = w + xi yi
Until ∀i, h(xi) = yi

Warning: Only guaranteed to terminate if classes are linearly separable!

This means you have to add another exit condition for when you've gone through the data too many times and suspect you'll never terminate.
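A minimal Python sketch of this loop, assuming {+1, −1} labels, inputs already augmented with x0 = 1, and a max_epochs cap as the extra exit condition mentioned above:

```python
import numpy as np

def perceptron_train(X, y, max_epochs=1000):
    """X: (n, k+1) array with a leading column of ones; y: labels in {+1, -1}."""
    n, d = X.shape
    w = np.random.randn(d)                     # start with a randomly placed line
    for _ in range(max_epochs):                # extra exit condition: cap the passes
        mistakes = 0
        for i in range(n):                     # one pass over the data
            if np.sign(X[i] @ w) != y[i]:      # h(x_i) != y_i (sign(0) counts as a mistake)
                w = w + X[i] * y[i]            # w = w + x_i y_i
                mistakes += 1
        if mistakes == 0:                      # until every point is classified correctly
            return w
    return w                                   # may not separate if classes aren't linearly separable

# tiny usage example with the two points from the earlier slides
X = np.array([[1.0, 5.0, 7.0], [1.0, 2.0, 6.0]])
y = np.array([+1, -1])
w = perceptron_train(X, y)
print(w, np.sign(X @ w))
```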

Page 35: eecs349 linear discriminants

Perceptron Algorithm

• Example:

[Figure: 2-D data on [−1, 1] × [−1, 1]; red is the positive class, blue is the negative class.]

Bryan Pardo, Machine Learning: EECS 349

Page 36: eecs349 linear discriminants

Perceptron Algorithm

• Example (cont’d):

[Figure: the same red (positive) and blue (negative) data at a later step of the algorithm.]

Bryan Pardo, Machine Learning: EECS 349

Page 37: eecs349 linear discriminants

Perceptron Algorithm

• Example (cont’d):

[Figure: the same red (positive) and blue (negative) data at a later step of the algorithm.]

Bryan Pardo, Machine Learning: EECS 349

Page 38: eecs349 linear discriminants

Perceptron Algorithm

• Example (cont’d):

[Figure: the same red (positive) and blue (negative) data at a later step of the algorithm.]

Bryan Pardo, Machine Learning: EECS 349

Page 39: eecs349 linear discriminants

Multi-class Classification


When there are N classes you can classify using N discriminant functions.

Choose the class c from the set of all classes C whose function gc(x) has the maximum output

Geometrically divides feature space into N convex decision regions

h(x) = argmax over c ∈ C of gc(x)

[Figure: the (a1, a2) feature space divided into convex decision regions, one per class.]

Bryan Pardo, Machine Learning: EECS 349

Page 40: eecs349 linear discriminants

Multi-class Classification


Remember gc(x) is the inner product of the feature vector for the example x with the weights of the decision boundary hyperplane for class c. If gc(x) is getting more positive, that means x is deeper inside its "yes" region.

Therefore, if you train a bunch of 2-way classifiers (one for each class) and pick the output of the classifier that says the example is deepest in its region, you have a multi-class classifier.

Bryan Pardo, Machine Learning: EECS 349

c = h(x) = argmax over c ∈ C of gc(x)

(where c is a class label and C is the set of all classes)
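A minimal numpy sketch of this one-discriminant-per-class scheme, assuming a weight matrix W whose rows are already-trained per-class weight vectors (the numbers below are hypothetical):

```python
import numpy as np

def predict_multiclass(W, x):
    """W: (num_classes, k+1) per-class weights; x: example augmented with x0 = 1.
    Returns the index of the class whose discriminant g_c(x) = w_c . x is largest."""
    scores = W @ x                  # g_c(x) for every class c at once
    return int(np.argmax(scores))

# hypothetical weights for 3 classes over 2 attributes (bias weight first)
W = np.array([[-5.0, 0.0,  1.0],
              [ 0.0, 1.0, -1.0],
              [-2.0, 0.5,  0.0]])
print(predict_multiclass(W, np.array([1.0, 2.0, 6.0])))   # class with the largest score -> 0 here
```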

Page 41: eecs349 linear discriminants

Pairwise Multi-class Classification


If the classes are not linearly separable into singly connected convex regions, they may still be pairwise separable, using N(N−1)/2 linear discriminants.

gij(x | wij, wij0) = wij0 + Σ_{l=1}^{K} wijl xl

choose Ci if ∀j ≠ i, gij(x) > 0

[Figure: the (a1, a2) feature space partitioned by pairwise linear discriminants.]

Bryan Pardo, Machine Learning: EECS 349
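A minimal sketch of the pairwise rule, assuming a dict of already-trained pairwise weight vectors with the sign convention gij(x) > 0 meaning "looks like class i rather than class j" (all weights below are hypothetical):

```python
import numpy as np

def predict_pairwise(pairwise_w, num_classes, x):
    """pairwise_w[(i, j)] holds the weights of g_ij; returns a class i such that
    g_ij(x) > 0 for every j != i, or None if no class wins all its pairwise tests."""
    for i in range(num_classes):
        if all(pairwise_w[(i, j)] @ x > 0 for j in range(num_classes) if j != i):
            return i
    return None   # ambiguous region: no class beats all the others

# hypothetical weights for 3 classes over 2 attributes (bias first); g_ji = -g_ij
w01 = np.array([-1.0, 1.0, 0.0])
w02 = np.array([ 0.0, 0.0, 1.0])
w12 = np.array([ 2.0, -1.0, 1.0])
pairwise_w = {(0, 1): w01, (1, 0): -w01,
              (0, 2): w02, (2, 0): -w02,
              (1, 2): w12, (2, 1): -w12}
print(predict_pairwise(pairwise_w, 3, np.array([1.0, 3.0, 1.0])))   # -> 0 here
```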

Page 42: eecs349 linear discriminants

Appendix

(stuff I didn't have time to discuss in class… and for which I haven't updated the notation)

Page 43: eecs349 linear discriminants

Linear Discriminants

• A linear combination of the attributes.

• Easily interpretable

• Are optimal when classes are Gaussian and share a covariance matrix

g(x | w, w0) = w0 + wᵀx = w0 + Σ_{i=1}^{k} wi ai

Bryan Pardo, Machine Learning: EECS 349

Page 44: eecs349 linear discriminants

Fisher Linear Discriminant Criteria

• Can think of as dimensionality reduction from K-dimensions to 1

• Objective:
  – Maximize the difference between class means
  – Minimize the variance within the classes

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012 44

[Figure: two classes of 2-D data projected onto a line; the projection of a point is wᵀx.]

J(w) = (m2 − m1)² / (s1² + s2²)

where si and mi are the sample variance and mean for class i in the projected dimension. We want to maximize J.

Page 45: eecs349 linear discriminants

Fisher Linear Discriminant Criteria

• Solution (a numeric sketch follows this list): w ∝ SW⁻¹(m2 − m1)

  where SW = S1 + S2 is the within-class scatter matrix and mi is the mean vector of class i.

• However, this only finds the direction (w) of the decision boundary. You must still solve for the threshold (w0).

• Can be expanded to multiple classes
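A minimal numpy sketch of the Fisher direction under the standard textbook solution stated above (the toy 2-D data is made up):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher LDA direction: w proportional to S_W^{-1} (m2 - m1),
    where S_W is the sum of the two within-class scatter matrices."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S1 + S2, m2 - m1)

# toy 2-D data for two classes
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 0.5, size=(20, 2))
X2 = rng.normal([2, 1], 0.5, size=(20, 2))

w = fisher_direction(X1, X2)
projected = np.concatenate([X1, X2]) @ w     # the 1-D projection w.x from the previous slide
print(w)
```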

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012 45

Page 46: eecs349 linear discriminants

Logistic Regression (Discrimination)

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012 46

[Figure: data from two classes plotted in the (a1, a2) plane.]

Page 47: eecs349 linear discriminants

Logistic Regression (Discrimination)

• Discriminant model but well-grounded in probability

• Flexible assumptions (exponential family class-conditional densities)

• Differentiable error function (“cross entropy”)

• Works very well when classes are linearly separable

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012 47

Page 48: eecs349 linear discriminants

Logistic Regression (Discrimination)

• Probabilistic discriminative model
• Models the posterior probability
• To see this, let's start with the 2-class formulation:

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012 48

p(C1|x) = p(x|C1)p(C1) / [p(x|C1)p(C1) + p(x|C2)p(C2)]
        = 1 / (1 + exp(−log[p(x|C1)p(C1) / (p(x|C2)p(C2))]))
        = 1 / (1 + exp(−a))
        = σ(a)

where a = log[p(x|C1)p(C1) / (p(x|C2)p(C2))] and σ(·) is the logistic sigmoid function.

Page 49: eecs349 linear discriminants

Logistic Regression (Discrimination)

"Squashing function" that maps the real line into the interval (0, 1).

[Figure: the logistic sigmoid function plotted over −5 to 5, rising from 0 toward 1 and crossing 0.5 at 0.]

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012 49

Page 50: eecs349 linear discriminants

Logistic Regression (Discrimination)

Mark Cartwright and Bryan Pardo, Machine Learning: EECS 349 Fall 2012 50

For the exponential family of densities, a = log[p(x|C1)p(C1) / (p(x|C2)p(C2))] is a linear function of x.

Therefore we can model the posterior probability as a logistic sigmoid acting on a linear function of the attribute vector, and simply solve for the weight vector w (e.g. treat it as a discriminant model):

p(C1|x) = σ(a) = σ(wᵀx)

To classify: choose C1 if p(C1|x) > 0.5 (equivalently, if wᵀx > 0); otherwise choose C2.
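A minimal numpy sketch of this model, assuming an already-trained weight vector w (training, e.g. by gradient descent on the cross-entropy error mentioned earlier, is not shown; the weights and example below are hypothetical):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: squashes the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def posterior_c1(w, x):
    """p(C1 | x) modeled as the sigmoid of a linear function of the attributes."""
    return sigmoid(w @ x)

def classify(w, x):
    """Choose C1 when the posterior exceeds 0.5, i.e. when w.x > 0."""
    return "C1" if posterior_c1(w, x) > 0.5 else "C2"

w = np.array([-5.0, 0.0, 1.0])              # hypothetical trained weights (bias first)
x = np.array([1.0, 5.0, 7.0])               # an example augmented with x0 = 1
print(posterior_c1(w, x), classify(w, x))   # sigmoid(2) ~ 0.88, so C1
```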