Linear classifiers

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2016


Topics

Discriminant functions

Linear classifiers

Perceptron

Fisher's linear discriminant (LDA)

Multi-class classification


SVMs will be covered in later lectures.


Classification problem

Given: a training set

a labeled set of 𝑁 input-output pairs 𝐷 = {(𝒙^(𝑖), 𝑦^(𝑖))}_{𝑖=1}^{𝑁}

𝑦 ∈ {1, … , 𝐾}

Goal: Given an input 𝒙, assign it to one of 𝐾 classes

Examples:

Spam filter

Handwritten digit recognition


Discriminant functions

A discriminant function can directly assign each vector 𝒙 to a specific class 𝑘

A popular way of representing a classifier

Many classification methods are based on discriminant functions

Assumption: the classes are taken to be disjoint

The input space is thereby divided into decision regions; the boundaries between them are called decision boundaries or decision surfaces.


Discriminant Functions

Discriminant functions: a discriminant function 𝑓_𝑖(𝒙) for each class 𝒞_𝑖 (𝑖 = 1, … , 𝐾)

𝒙 is assigned to class 𝒞_𝑖 if 𝑓_𝑖(𝒙) > 𝑓_𝑗(𝒙) ∀𝑗 ≠ 𝑖

Thus, we can easily divide the feature space into 𝐾 decision regions:

∀𝒙: 𝑓_𝑖(𝒙) > 𝑓_𝑗(𝒙) ∀𝑗 ≠ 𝑖 ⇒ 𝒙 ∈ ℛ_𝑖  (ℛ_𝑖: region of the 𝑖-th class)

Decision surfaces (or boundaries) can also be found using discriminant functions

The boundary between ℛ_𝑖 and ℛ_𝑗, separating samples of these two categories, satisfies 𝑓_𝑖(𝒙) = 𝑓_𝑗(𝒙)


Discriminant Functions: Two-Category

First, we explain the two-category classification problem and then discuss multi-category problems.

Binary classification: a target variable 𝑦 ∈ {0, 1} or 𝑦 ∈ {−1, 1}

For a two-category problem, we only need to find a single function 𝑓: ℝ^𝑑 → ℝ:

𝑓_1(𝒙) = 𝑓(𝒙)

𝑓_2(𝒙) = −𝑓(𝒙)

Decision surface: 𝑓(𝒙) = 0


Linear classifiers


Decision boundaries are linear in 𝒙, or linear in some

given set of functions of 𝒙

Linearly separable data: data points that can be exactly

classified by a linear decision surface.

Why linear classifiers?

They are relatively easy to compute.

Even when they are not optimal, their simplicity can make them worth using.

In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers.


Two Category

𝑓(𝒙; 𝒘) = 𝒘^𝑇𝒙 + 𝑤_0 = 𝑤_0 + 𝑤_1𝑥_1 + … + 𝑤_𝑑𝑥_𝑑

𝒙 = [𝑥_1, 𝑥_2, … , 𝑥_𝑑], 𝒘 = [𝑤_1, 𝑤_2, … , 𝑤_𝑑]

𝑤_0: bias

if 𝒘^𝑇𝒙 + 𝑤_0 ≥ 0 then 𝒞_1, else 𝒞_2

Decision surface (boundary): 𝒘^𝑇𝒙 + 𝑤_0 = 0

𝒘 is orthogonal to every vector lying within the decision surface
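As a quick illustration (not part of the original slides), a minimal NumPy sketch of this decision rule; the weights, bias, and sample points below are arbitrary choices for illustration:

```python
import numpy as np

# Hypothetical weights for a 2-feature problem (chosen arbitrarily).
w = np.array([2.0, -1.0])   # [w_1, w_2]
w0 = -1.0                   # bias

X = np.array([[1.0, 0.5],   # a few sample points (one per row)
              [0.2, 1.5],
              [3.0, 1.0]])

scores = X @ w + w0                   # f(x; w) = w^T x + w_0 for each row of X
labels = np.where(scores >= 0, 1, 2)  # class C_1 if the score is >= 0, else C_2
print(scores, labels)
```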


Example

[Figure: a two-dimensional example (axes 𝑥_1, 𝑥_2) showing a linear decision boundary; the boundary drawn on the slide is labeled 4𝑥_1 − 𝑥_2 = 0.]

if 𝒘^𝑇𝒙 + 𝑤_0 ≥ 0 then 𝒞_1, else 𝒞_2


Linear classifier: Two Category

Decision boundary is a (𝑑 − 1)-dimensional hyperplane 𝐻 in the 𝑑-dimensional feature space

The orientation of 𝐻 is determined by the normal vector [𝑤_1, … , 𝑤_𝑑]

𝑤_0 determines the location of the surface; the perpendicular distance from the origin to the decision surface is |𝑤_0|/‖𝒘‖

Writing 𝒙 = 𝒙_⊥ + 𝑟 𝒘/‖𝒘‖, where 𝒙_⊥ is the projection of 𝒙 onto the decision surface (so 𝑓(𝒙_⊥) = 0):

𝒘^𝑇𝒙 + 𝑤_0 = 𝑟‖𝒘‖ ⇒ 𝑟 = (𝒘^𝑇𝒙 + 𝑤_0)/‖𝒘‖

𝑓(𝒙) = 𝒘^𝑇𝒙 + 𝑤_0 thus gives a signed measure of the perpendicular distance 𝑟 of the point 𝒙 from the decision surface
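The signed-distance formula can be checked numerically; a small sketch with made-up numbers that projects a point onto the hyperplane and verifies that the projection lies on the surface:

```python
import numpy as np

w = np.array([2.0, -1.0])   # arbitrary weights for illustration
w0 = -1.0
x = np.array([3.0, 1.0])

# Signed distance r = (w^T x + w_0) / ||w|| of x from the hyperplane w^T x + w_0 = 0.
r = (w @ x + w0) / np.linalg.norm(w)

# Project x onto the hyperplane: x_perp = x - r * w / ||w||, which should satisfy f(x_perp) = 0.
x_perp = x - r * w / np.linalg.norm(w)
print(r, w @ x_perp + w0)   # the second value is ~0 up to floating-point error
```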


Linear boundary: geometry

[Figure: the hyperplane 𝒘^𝑇𝒙 + 𝑤_0 = 0, with its normal vector 𝒘, separates the region where 𝒘^𝑇𝒙 + 𝑤_0 > 0 from the region where 𝒘^𝑇𝒙 + 𝑤_0 < 0.]


Non-linear decision boundary

Choose non-linear features; the classifier is still linear in the parameters 𝒘

𝒙 = [𝑥_1, 𝑥_2]

𝝓(𝒙) = [1, 𝑥_1, 𝑥_2, 𝑥_1², 𝑥_2², 𝑥_1𝑥_2]

𝒘 = [𝑤_0, 𝑤_1, … , 𝑤_𝑚] = [−1, 0, 0, 1, 1, 0]

if 𝒘^𝑇𝝓(𝒙) ≥ 0 then 𝑦 = 1, else 𝑦 = −1

Decision boundary: −1 + 𝑥_1² + 𝑥_2² = 0 (the unit circle in the original feature space)
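A small sketch of this particular feature map and weight vector (the weights are taken from the slide; the two test points are illustrative):

```python
import numpy as np

def phi(x):
    """Quadratic feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # boundary: -1 + x1^2 + x2^2 = 0

for x in [np.array([0.5, 0.5]), np.array([1.0, 1.0])]:
    y = 1 if w @ phi(x) >= 0 else -1
    print(x, y)   # a point inside the unit circle gets y = -1, one outside gets y = 1
```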


Cost Function for linear classification

Finding linear classifiers can be formulated as an optimization problem:

Select how to measure the prediction loss

Based on the training set 𝐷 = {(𝒙^(𝑖), 𝑦^(𝑖))}_{𝑖=1}^{𝑁}, a cost function 𝐽(𝒘) is defined

Solve the resulting optimization problem to find the parameters:

Find the optimal 𝑓(𝒙) = 𝑓(𝒙; 𝒘̂) where 𝒘̂ = argmin_𝒘 𝐽(𝒘)

Criterion or cost functions for classification:

We will investigate several cost functions for the classification problem


SSE cost function for classification

For 𝐾 = 2, the SSE cost function is

𝐽(𝒘) = Σ_{𝑖=1}^{𝑁} (𝒘^𝑇𝒙^(𝑖) − 𝑦^(𝑖))²

The SSE cost function is not suitable for classification:

Least-squares loss penalizes predictions that are "too correct" (those that lie a long way on the correct side of the decision boundary)

Least-squares loss also lacks robustness to noise
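A toy numeric illustration of the first point (the scores below are made up): for a sample with 𝑦 = 1, a strongly correct score is penalized more heavily by squared error than a mildly wrong one.

```python
# For a sample with target y = 1, compare the squared error of three hypothetical scores.
y = 1
for score in [1.0, 5.0, -0.5]:        # w^T x: exactly on target, "too correct", misclassified
    print(score, (score - y) ** 2)    # losses: 0.0, 16.0, 2.25
```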


SSE cost function for classification

[Figure (Bishop), 𝐾 = 2: the squared error (𝒘^𝑇𝒙 − 𝑦)² plotted against 𝒘^𝑇𝒙 for 𝑦 = 1 and for 𝑦 = −1; predictions that are correct but far from the target value are still penalized by SSE.]


SSE cost function for classification

Is it more suitable if we set 𝑓(𝒙; 𝒘) = 𝑔(𝒘^𝑇𝒙)?

𝐽(𝒘) = Σ_{𝑖=1}^{𝑁} (sign(𝒘^𝑇𝒙^(𝑖)) − 𝑦^(𝑖))²

sign(𝑧) = −1 if 𝑧 < 0, and 1 if 𝑧 ≥ 0

𝐽(𝒘) is then a piecewise-constant function of 𝒘 that reflects the number of misclassifications, i.e., the training error incurred in classifying the training samples (𝐾 = 2).


Perceptron algorithm

Linear classifier, two-class: 𝑦 ∈ {−1, 1}

𝑦 = 1 for 𝒞_1, 𝑦 = −1 for 𝒞_2

𝑓(𝒙; 𝒘) = sign(𝒘^𝑇𝒙)

Goal:

∀𝑖, 𝒙^(𝑖) ∈ 𝒞_1 ⇒ 𝒘^𝑇𝒙^(𝑖) > 0

∀𝑖, 𝒙^(𝑖) ∈ 𝒞_2 ⇒ 𝒘^𝑇𝒙^(𝑖) < 0


Perceptron criterion

𝐽_𝑃(𝒘) = − Σ_{𝑖∈ℳ} 𝒘^𝑇𝒙^(𝑖) 𝑦^(𝑖)

ℳ: subset of training data that are misclassified

There may be many solutions; which one should we choose?


Cost function

[Figure (Duda, Hart & Stork, 2002): the number of misclassifications as a cost function 𝐽(𝒘), and the Perceptron cost function 𝐽_𝑃(𝒘), each plotted over (𝑤_0, 𝑤_1).]

There may be many solutions under these cost functions


Batch Perceptron

"Gradient descent" to solve the optimization problem:

𝒘^(𝑡+1) = 𝒘^(𝑡) − 𝜂 ∇_𝒘 𝐽_𝑃(𝒘^(𝑡))

∇_𝒘 𝐽_𝑃(𝒘) = − Σ_{𝑖∈ℳ} 𝒙^(𝑖) 𝑦^(𝑖)

Batch Perceptron:

Initialize 𝒘
Repeat
  𝒘 ← 𝒘 + 𝜂 Σ_{𝑖∈ℳ} 𝒙^(𝑖) 𝑦^(𝑖)
Until ‖𝜂 Σ_{𝑖∈ℳ} 𝒙^(𝑖) 𝑦^(𝑖)‖ < 𝜃

Batch Perceptron converges in a finite number of steps for linearly separable data.
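A minimal NumPy sketch of the batch update above on a tiny synthetic, linearly separable set; the data, learning rate 𝜂, and threshold 𝜃 are illustrative choices, not from the slides:

```python
import numpy as np

# Toy linearly separable data; each x is augmented with a leading 1 so w includes the bias.
X = np.array([[1.0, 2.0, 2.5], [1.0, 1.0, 2.0], [1.0, 3.0, 0.5], [1.0, 2.5, -0.5]])
y = np.array([1, 1, -1, -1])

w = np.zeros(X.shape[1])
eta, theta = 1.0, 1e-6

while True:
    scores = X @ w
    miscls = y * scores <= 0                              # the set M of misclassified samples
    update = eta * (X[miscls] * y[miscls, None]).sum(axis=0)
    if np.linalg.norm(update) < theta:                    # stop when the batch update is (near) zero
        break
    w = w + update

print(w, np.sign(X @ w))
```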


Stochastic gradient descent for Perceptron

Single-sample perceptron: if 𝒙^(𝑖) is misclassified,

𝒘^(𝑡+1) = 𝒘^(𝑡) + 𝜂 𝒙^(𝑖) 𝑦^(𝑖)

Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps.

Fixed-increment single-sample Perceptron (𝜂 can be set to 1 and the proof still works):

Initialize 𝒘, 𝑡 ← 0
Repeat
  𝑡 ← 𝑡 + 1
  𝑖 ← 𝑡 mod 𝑁
  if 𝒙^(𝑖) is misclassified then
    𝒘 ← 𝒘 + 𝒙^(𝑖) 𝑦^(𝑖)
Until all patterns are properly classified
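A sketch of the fixed-increment single-sample rule (𝜂 = 1), reusing the same toy separable data as the batch sketch above; note the loop only terminates because the data are separable:

```python
import numpy as np

X = np.array([[1.0, 2.0, 2.5], [1.0, 1.0, 2.0], [1.0, 3.0, 0.5], [1.0, 2.5, -0.5]])
y = np.array([1, 1, -1, -1])
N = len(y)

w = np.zeros(X.shape[1])
t = 0
while np.any(y * (X @ w) <= 0):      # until all patterns are properly classified
    i = t % N                        # cycle through the samples
    if y[i] * (X[i] @ w) <= 0:       # x^(i) is misclassified
        w = w + X[i] * y[i]          # fixed-increment update (eta = 1)
    t += 1

print(w)
```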


Perceptron: Example

[Figure (Bishop): at each step, 𝒘 changes in a direction that corrects the error on a misclassified sample.]


Convergence of Perceptron

For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge.

[Duda, Hart & Stork, 2002]


Pocket algorithm

For data that are not linearly separable (e.g., due to noise):

Keeps in its pocket the best 𝒘 encountered up to now.

Pocket algorithm:

Initialize 𝒘
for 𝑡 = 1, … , 𝑇
  𝑖 ← 𝑡 mod 𝑁
  if 𝒙^(𝑖) is misclassified then
    𝒘_new ← 𝒘 + 𝒙^(𝑖) 𝑦^(𝑖)
    if 𝐸_train(𝒘_new) < 𝐸_train(𝒘) then
      𝒘 ← 𝒘_new
end

𝐸_train(𝒘) = (1/𝑁) Σ_{𝑛=1}^{𝑁} [sign(𝒘^𝑇𝒙^(𝑛)) ≠ 𝑦^(𝑛)]
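A sketch of the pocket variant shown above, where an update is kept only if it lowers the training error; the noisy, non-separable toy data and the iteration budget 𝑇 are illustrative:

```python
import numpy as np

def train_error(w, X, y):
    """Fraction of samples with sign(w^T x) != y (sign(0) treated as +1)."""
    pred = np.where(X @ w >= 0, 1, -1)
    return np.mean(pred != y)

rng = np.random.default_rng(0)
X = np.c_[np.ones(40), rng.normal(size=(40, 2))]   # augmented toy inputs
y = np.where(X[:, 1] + X[:, 2] > 0, 1, -1)
y[:3] *= -1                                        # flip a few labels to make the data noisy

w = np.zeros(3)
N, T = len(y), 200
for t in range(1, T + 1):
    i = t % N
    if y[i] * (X[i] @ w) <= 0:                     # x^(i) is misclassified
        w_new = w + X[i] * y[i]
        if train_error(w_new, X, y) < train_error(w, X, y):
            w = w_new                              # keep the better weights ("pocket")

print(w, train_error(w, X, y))
```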


Linear Discriminant Analysis (LDA)

Fisher's Linear Discriminant Analysis:

Dimensionality reduction

Finds linear combinations of the features with a large ratio of between-group scatter to within-group scatter (used as new discriminant variables)

Classification

Predicts the class of an observation 𝒙 by first projecting it onto the space of discriminant variables and then classifying it in this space


Good Projection for Classification

What is a good criterion?

Separating different classes in the projected space


LDA Problem

Problem definition:

𝐶 = 2 classes

{(𝒙^(𝑖), 𝑦^(𝑖))}_{𝑖=1}^{𝑁} training samples, with 𝑁_1 samples from the first class (𝒞_1) and 𝑁_2 samples from the second class (𝒞_2)

Goal: find the direction 𝒘 that we hope will enable accurate classification

The projection of a sample 𝒙 onto a line in direction 𝒘 is 𝒘^𝑇𝒙

What is a good measure of the separation between the projected points of the different classes?

Page 31: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Measure of Separation in the Projected Direction

[Figure: Bishop]

Is the direction of the line joining the class means a good candidate for 𝒘?

Page 32: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Measure of Separation in the Projected Direction

The direction of the line joining the class means is the solution of the following problem, which maximizes the separation of the projected class means:

max_𝒘 𝐽(𝒘) = (𝜇_1′ − 𝜇_2′)²  s.t. ‖𝒘‖ = 1

𝜇_1′ = 𝒘^𝑇𝝁_1,  𝝁_1 = (1/𝑁_1) Σ_{𝒙^(𝑖)∈𝒞_1} 𝒙^(𝑖)

𝜇_2′ = 𝒘^𝑇𝝁_2,  𝝁_2 = (1/𝑁_2) Σ_{𝒙^(𝑖)∈𝒞_2} 𝒙^(𝑖)

What is the problem with a criterion that considers only (𝜇_1′ − 𝜇_2′)²?

It does not consider the variances of the classes in the projected direction

Page 33: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

LDA Criteria

Fisher's idea: maximize a function that gives

a large separation between the projected class means,

while also achieving a small variance within each class, thereby minimizing the class overlap:

𝐽(𝒘) = (𝜇_1′ − 𝜇_2′)² / (𝑠_1′² + 𝑠_2′²)

Page 34: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

LDA Criteria

The scatters of the original data are:

𝑠_1² = Σ_{𝒙^(𝑖)∈𝒞_1} ‖𝒙^(𝑖) − 𝝁_1‖²

𝑠_2² = Σ_{𝒙^(𝑖)∈𝒞_2} ‖𝒙^(𝑖) − 𝝁_2‖²

The scatters of the projected data are:

𝑠_1′² = Σ_{𝒙^(𝑖)∈𝒞_1} (𝒘^𝑇𝒙^(𝑖) − 𝒘^𝑇𝝁_1)²

𝑠_2′² = Σ_{𝒙^(𝑖)∈𝒞_2} (𝒘^𝑇𝒙^(𝑖) − 𝒘^𝑇𝝁_2)²

Page 35: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

LDA Criteria

𝐽(𝒘) = (𝜇_1′ − 𝜇_2′)² / (𝑠_1′² + 𝑠_2′²)

(𝜇_1′ − 𝜇_2′)² = (𝒘^𝑇𝝁_1 − 𝒘^𝑇𝝁_2)² = 𝒘^𝑇(𝝁_1 − 𝝁_2)(𝝁_1 − 𝝁_2)^𝑇𝒘

𝑠_1′² = Σ_{𝒙^(𝑖)∈𝒞_1} (𝒘^𝑇𝒙^(𝑖) − 𝒘^𝑇𝝁_1)² = 𝒘^𝑇 [Σ_{𝒙^(𝑖)∈𝒞_1} (𝒙^(𝑖) − 𝝁_1)(𝒙^(𝑖) − 𝝁_1)^𝑇] 𝒘

(and similarly for 𝑠_2′²)

Page 36: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

LDA Criteria

𝐽(𝒘) = (𝒘^𝑇𝑺_𝐵𝒘) / (𝒘^𝑇𝑺_𝑊𝒘)

Between-class scatter matrix:

𝑺_𝐵 = (𝝁_1 − 𝝁_2)(𝝁_1 − 𝝁_2)^𝑇

Within-class scatter matrix:

𝑺_𝑊 = 𝑺_1 + 𝑺_2

𝑺_1 = Σ_{𝒙^(𝑖)∈𝒞_1} (𝒙^(𝑖) − 𝝁_1)(𝒙^(𝑖) − 𝝁_1)^𝑇

𝑺_2 = Σ_{𝒙^(𝑖)∈𝒞_2} (𝒙^(𝑖) − 𝝁_2)(𝒙^(𝑖) − 𝝁_2)^𝑇

(scatter matrix = 𝑁 × covariance matrix)

Page 37: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

LDA Derivation

𝐽(𝒘) = (𝒘^𝑇𝑺_𝐵𝒘) / (𝒘^𝑇𝑺_𝑊𝒘)

∂𝐽(𝒘)/∂𝒘 = [2𝑺_𝐵𝒘 (𝒘^𝑇𝑺_𝑊𝒘) − 2𝑺_𝑊𝒘 (𝒘^𝑇𝑺_𝐵𝒘)] / (𝒘^𝑇𝑺_𝑊𝒘)²

Setting ∂𝐽(𝒘)/∂𝒘 = 0:

(𝒘^𝑇𝑺_𝑊𝒘) 𝑺_𝐵𝒘 = (𝒘^𝑇𝑺_𝐵𝒘) 𝑺_𝑊𝒘

⇒ 𝑺_𝐵𝒘 = 𝜆 𝑺_𝑊𝒘, where 𝜆 = (𝒘^𝑇𝑺_𝐵𝒘)/(𝒘^𝑇𝑺_𝑊𝒘) = 𝐽(𝒘)

Page 38: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

LDA Derivation

This is a generalized eigenvalue problem: 𝑺_𝐵𝒘 = 𝜆𝑺_𝑊𝒘. If 𝑺_𝑊 is full-rank, it becomes 𝑺_𝑊^{−1}𝑺_𝐵𝒘 = 𝜆𝒘, so we can solve the eigenvalue problem immediately.

Moreover, 𝑺_𝐵𝒘 (for any vector 𝒘) points in the same direction as 𝝁_1 − 𝝁_2:

𝑺_𝐵𝒘 = (𝝁_1 − 𝝁_2)(𝝁_1 − 𝝁_2)^𝑇𝒘 ∝ 𝝁_1 − 𝝁_2

Thus:

𝒘 ∝ 𝑺_𝑊^{−1}(𝝁_1 − 𝝁_2)

Page 39: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

LDA Algorithm

Find 𝝁_1 and 𝝁_2 as the means of class 1 and class 2, respectively

Find 𝑺_1 and 𝑺_2 as the scatter matrices of class 1 and class 2, respectively

𝑺_𝑊 = 𝑺_1 + 𝑺_2,  𝑺_𝐵 = (𝝁_1 − 𝝁_2)(𝝁_1 − 𝝁_2)^𝑇

Feature extraction:

𝒘 = 𝑺_𝑊^{−1}(𝝁_1 − 𝝁_2) is the eigenvector corresponding to the largest eigenvalue of 𝑺_𝑊^{−1}𝑺_𝐵

Classification:

𝒘 = 𝑺_𝑊^{−1}(𝝁_1 − 𝝁_2)

Using a threshold on 𝒘^𝑇𝒙, we can classify 𝒙
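A compact NumPy sketch of this two-class Fisher LDA recipe on synthetic Gaussian data; the midpoint-between-projected-means threshold is one common choice and is not specified on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # class 1 samples
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(100, 2))   # class 2 samples

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)        # scatter matrix of class 1
S2 = (X2 - mu2).T @ (X2 - mu2)        # scatter matrix of class 2
Sw = S1 + S2                          # within-class scatter

w = np.linalg.solve(Sw, mu1 - mu2)    # w proportional to Sw^{-1} (mu1 - mu2)

# Classify by thresholding the projection w^T x; here: midpoint between the projected means.
threshold = 0.5 * (w @ mu1 + w @ mu2)
def predict(x):
    return 1 if w @ x >= threshold else 2

print(w, predict(np.array([0.5, 0.2])), predict(np.array([2.8, 2.1])))
```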

Page 40: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Multi-class classification

Solutions to multi-category problems:

Extend the learning algorithm to support multi-class:

A function 𝑓_𝑖(𝒙) is found for each class 𝑖

𝑦 = argmax_{𝑖=1,…,𝑐} 𝑓_𝑖(𝒙)  (𝒙 is assigned to class 𝐶_𝑖 if 𝑓_𝑖(𝒙) > 𝑓_𝑗(𝒙) ∀𝑗 ≠ 𝑖)

Convert the problem to a set of two-class problems

Page 41: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Converting a multi-class problem to a set of two-class problems

"One versus rest" (or "one against all"):

For each class 𝐶_𝑖, a linear discriminant function that separates the samples of 𝐶_𝑖 from all the other samples is found (see the sketch after this list).

Works when the classes are totally linearly separable.

"One versus one":

𝑐(𝑐 − 1)/2 linear discriminant functions are used, one to separate the samples of each pair of classes.

Works when the classes are pairwise linearly separable.
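A brief sketch of the one-versus-rest construction, using the single-sample perceptron as the base binary learner; the slides do not fix a particular base classifier, and the three-class data set below is illustrative:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Single-sample perceptron on labels y in {-1, +1}; X is already augmented with a 1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:
                w = w + yi * xi
    return w

def one_vs_rest(X, y, classes):
    # One binary problem per class: that class (+1) versus all the others (-1).
    return {c: perceptron(X, np.where(y == c, 1, -1)) for c in classes}

def predict(W, x):
    # Assign the class whose discriminant w_c^T x is largest.
    return max(W, key=lambda c: W[c] @ x)

# Tiny illustrative data set with 3 well-separated classes.
X = np.array([[1, 0.0, 0.0], [1, 0.5, 0.2], [1, 3.0, 0.0],
              [1, 3.2, 0.4], [1, 0.0, 3.0], [1, 0.3, 3.1]])
y = np.array([0, 0, 1, 1, 2, 2])
W = one_vs_rest(X, y, classes=[0, 1, 2])
print([predict(W, x) for x in X])
```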

Page 42: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Multi-class classification

One-vs-all (one-vs-rest)

[Figure: a three-class example in the (𝑥_1, 𝑥_2) plane; for each of Class 1, Class 2, and Class 3, a binary classifier separates that class from the remaining classes.]

Page 43: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Multi-class classification

One-vs-one

[Figure: the same three-class example in the (𝑥_1, 𝑥_2) plane; a binary classifier is trained for each pair of classes.]

Page 44: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Multi-class classification: ambiguity

[Figure (Duda, Hart & Stork, 2002): ambiguous regions for one-versus-rest (left) and one-versus-one (right).]

Converting the multi-class problem to a set of two-class

problems can lead to regions in which the classification is

undefined

Page 45: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Multi-class classification: linear machine

A discriminant function 𝑓_𝑖(𝒙) = 𝒘_𝑖^𝑇𝒙 + 𝑤_𝑖0 for each class 𝒞_𝑖 (𝑖 = 1, … , 𝐾):

𝒙 is assigned to class 𝒞_𝑖 if 𝑓_𝑖(𝒙) > 𝑓_𝑗(𝒙) ∀𝑗 ≠ 𝑖

Decision surfaces (boundaries) can also be found using the discriminant functions

Boundary between the contiguous regions ℛ_𝑖 and ℛ_𝑗: 𝑓_𝑖(𝒙) = 𝑓_𝑗(𝒙), i.e.

(𝒘_𝑖 − 𝒘_𝑗)^𝑇𝒙 + (𝑤_𝑖0 − 𝑤_𝑗0) = 0

Page 46: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Multi-class classification: linear machine

[Figure (Duda, Hart & Stork, 2002): decision regions of a linear machine.]

Page 47: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Perceptron: multi-class

𝑦̂ = argmax_{𝑖=1,…,𝑐} 𝒘_𝑖^𝑇𝒙

𝐽_𝑃(𝑾) = − Σ_{𝑖∈ℳ} (𝒘_{𝑦^(𝑖)} − 𝒘_{𝑦̂^(𝑖)})^𝑇 𝒙^(𝑖)

ℳ: subset of training data that are misclassified, ℳ = {𝑖 | 𝑦̂^(𝑖) ≠ 𝑦^(𝑖)}

Multi-class Perceptron:

Initialize 𝑾 = [𝒘_1, … , 𝒘_𝑐], 𝑡 ← 0
Repeat
  𝑡 ← 𝑡 + 1
  𝑖 ← 𝑡 mod 𝑁
  if 𝒙^(𝑖) is misclassified then
    𝒘_{𝑦̂^(𝑖)} ← 𝒘_{𝑦̂^(𝑖)} − 𝒙^(𝑖)
    𝒘_{𝑦^(𝑖)} ← 𝒘_{𝑦^(𝑖)} + 𝒙^(𝑖)
Until all patterns are properly classified
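A sketch of these multi-class perceptron updates, with one weight vector per class; the toy data are illustrative and a fixed iteration budget is added so the loop also ends if the data happen not to be separable:

```python
import numpy as np

# Toy data: 3 classes in 2-D, augmented with a leading 1 so each w_i includes a bias.
X = np.array([[1, 0.0, 0.0], [1, 0.5, 0.2], [1, 3.0, 0.0],
              [1, 3.2, 0.4], [1, 0.0, 3.0], [1, 0.3, 3.1]])
y = np.array([0, 0, 1, 1, 2, 2])          # true class indices
c, N = 3, len(y)

W = np.zeros((c, X.shape[1]))             # one weight vector w_i per class
for t in range(1, 100 * N + 1):           # iteration budget in case the data are not separable
    i = t % N
    y_hat = int(np.argmax(W @ X[i]))      # predicted class: argmax_i w_i^T x
    if y_hat != y[i]:                     # x^(i) is misclassified
        W[y_hat] -= X[i]                  # demote the wrongly predicted class
        W[y[i]] += X[i]                   # promote the true class
    if np.all(np.argmax(W @ X.T, axis=0) == y):
        break                             # all patterns properly classified

print(np.argmax(W @ X.T, axis=0))         # should match y for this separable toy set
```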

Page 48: Sharif University of Technologymaktabfile.ir/wp-content/uploads/2017/11/Linear-classifiers.pdf · Linear classifiers CE-717: Machine Learning Sharif University of Technology M. Soleymani

Resources

C. Bishop, "Pattern Recognition and Machine Learning", Chapter 4.1.