Linear classifiers
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani
Fall 2016
Topics
Discriminant functions
Linear classifiers
Perceptron
Fisher
Multi-class classification
SVM will be covered in later lectures.
Classification problem
Given: a training set, i.e., a labeled set of $N$ input-output pairs $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $y \in \{1, \dots, K\}$
Goal: Given an input 𝒙, assign it to one of 𝐾 classes
Examples:
Spam filter
Handwritten digit recognition
…
Discriminant functions
A discriminant function can directly assign each vector $\mathbf{x}$ to a specific class $k$
A popular way of representing a classifier
Many classification methods are based on discriminant functions
Assumption: the classes are taken to be disjoint
The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Discriminant Functions
Discriminant functions: a discriminant function $f_i(\mathbf{x})$ for each class $\mathcal{C}_i$ ($i = 1, \dots, K$):
$\mathbf{x}$ is assigned to class $\mathcal{C}_i$ if:
$f_i(\mathbf{x}) > f_j(\mathbf{x}) \quad \forall j \neq i$
Thus, we can easily divide the feature space into $K$ decision regions:
$\forall \mathbf{x},\; f_i(\mathbf{x}) > f_j(\mathbf{x})\; \forall j \neq i \;\Rightarrow\; \mathbf{x} \in \mathcal{R}_i$
($\mathcal{R}_i$: region of the $i$-th class)
Decision surfaces (or boundaries) can also be found using discriminant functions
Boundary of $\mathcal{R}_i$ and $\mathcal{R}_j$, separating samples of these two categories:
$\forall \mathbf{x},\; f_i(\mathbf{x}) = f_j(\mathbf{x})$
Discriminant Functions: Two-Category
First, we explain the two-category classification problem and then discuss multi-category problems.
For a two-category problem, we need only a single function $f: \mathbb{R}^d \to \mathbb{R}$:
$f_1(\mathbf{x}) = f(\mathbf{x})$
$f_2(\mathbf{x}) = -f(\mathbf{x})$
Decision surface: $f(\mathbf{x}) = 0$
Binary classification: a target variable $y \in \{0, 1\}$ or $y \in \{-1, 1\}$
Linear classifiers
Decision boundaries are linear in 𝒙, or linear in some
given set of functions of 𝒙
Linearly separable data: data points that can be exactly
classified by a linear decision surface.
Why linear classifier?
Even when they are not optimal, their simplicity makes them useful: they are relatively easy to compute
In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers.
Two Category
$f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x} + w_0 = w_0 + w_1 x_1 + \dots + w_d x_d$, where $\mathbf{x} = [x_1\; x_2\; \dots\; x_d]$ and $\mathbf{w} = [w_1\; w_2\; \dots\; w_d]$
𝑤0: bias
if 𝒘𝑇𝒙 + 𝑤0 ≥ 0 then 𝒞1 else 𝒞2
Decision surface (boundary): 𝒘𝑇𝒙 + 𝑤0 = 0
𝒘 is orthogonal to every vector lying within the decision surface
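As a concrete illustration of this decision rule, here is a minimal NumPy sketch; the weights, bias, and test point are made up for the example and are not from the slides.

```python
import numpy as np

def predict(x, w, w0):
    """Two-class linear rule: C1 if w^T x + w0 >= 0, else C2."""
    return "C1" if w @ x + w0 >= 0 else "C2"

# Hypothetical weights, bias, and test point
w = np.array([1.0, -1.0])
w0 = 0.5
x = np.array([2.0, 1.0])
print(predict(x, w, w0))  # w @ x + w0 = 1.5 >= 0, so "C1"
```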
Example
[Figure: a two-dimensional example in the $(x_1, x_2)$ plane showing a linear decision boundary.]
if $\mathbf{w}^T\mathbf{x} + w_0 \geq 0$ then $\mathcal{C}_1$, else $\mathcal{C}_2$
Linear classifier: Two Category
The decision boundary is a $(d-1)$-dimensional hyperplane $H$ in the $d$-dimensional feature space
The orientation of $H$ is determined by the normal vector $[w_1, \dots, w_d]$
$w_0$ determines the location of the surface; the signed normal distance from the origin to the decision surface is $-\frac{w_0}{\|\mathbf{w}\|}$
Writing $\mathbf{x} = \mathbf{x}_\perp + r\frac{\mathbf{w}}{\|\mathbf{w}\|}$, where $\mathbf{x}_\perp$ is the projection of $\mathbf{x}$ onto the surface $f(\mathbf{x}) = 0$:
$\mathbf{w}^T\mathbf{x} + w_0 = r\|\mathbf{w}\| \;\Rightarrow\; r = \frac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$
Thus $f(\mathbf{x})$ gives a signed measure of the perpendicular distance $r$ of the point $\mathbf{x}$ from the decision surface
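To make the geometry concrete, a small sketch (with made-up numbers) computing the signed distance $r$ of a point from the hyperplane:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance r = (w^T x + w0) / ||w|| from the hyperplane w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

# Hypothetical hyperplane and two test points
w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([0.0, 0.0]), w, w0))  # -1.0: the origin lies on the negative side
print(signed_distance(np.array([3.0, 4.0]), w, w0))  #  4.0: this point lies on the positive side
```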
Linear boundary: geometry
[Figure: geometry of the linear boundary, showing the regions $\mathbf{w}^T\mathbf{x} + w_0 > 0$ and $\mathbf{w}^T\mathbf{x} + w_0 < 0$ on either side of the surface $\mathbf{w}^T\mathbf{x} + w_0 = 0$, the normal vector $\mathbf{w}$, and the signed distance $\frac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$.]
Non-linear decision boundary
Choose non-linear features
The classifier is still linear in the parameters $\mathbf{w}$
[Figure: circular decision boundary $-1 + x_1^2 + x_2^2 = 0$ in the $(x_1, x_2)$ plane.]
if $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) \geq 0$ then $y = 1$, else $y = -1$
$\boldsymbol{\phi}(\mathbf{x}) = [1,\; x_1,\; x_2,\; x_1^2,\; x_2^2,\; x_1 x_2]$
$\mathbf{w} = [w_0, w_1, \dots, w_m] = [-1, 0, 0, 1, 1, 0]$
$\mathbf{x} = [x_1, x_2]$
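A small sketch of this idea, using the feature map and weight vector from the slide: the classifier is linear in $\mathbf{w}$ but produces a circular boundary in the original input space.

```python
import numpy as np

def phi(x):
    """Quadratic feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # boundary: -1 + x1^2 + x2^2 = 0

def predict(x):
    return 1 if w @ phi(x) >= 0 else -1

print(predict([0.0, 0.0]))   # -1: inside the unit circle
print(predict([2.0, 0.0]))   #  1: outside the unit circle
```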
Cost Function for linear classification
Finding linear classifiers can be formulated as an optimization problem:
Select how to measure the prediction loss
Based on the training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, a cost function $J(\mathbf{w})$ is defined
Solve the resulting optimization problem to find the parameters:
Find the optimal $f(\mathbf{x}) = f(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
Criterion or cost functions for classification:
We will investigate several cost functions for the classification problem
SSE cost function for classification
The SSE cost function is not suitable for classification:
The least-squares loss penalizes "too correct" predictions (those that lie a long way on the correct side of the decision boundary)
The least-squares loss also lacks robustness to noise
$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)^2$  ($K = 2$)
SSE cost function for classification
[Figure (from Bishop), $K = 2$: the squared error $(\mathbf{w}^T\mathbf{x} - y)^2$ plotted against $\mathbf{w}^T\mathbf{x}$ for $y = 1$ and for $y = -1$; correct predictions that lie far on the correct side are still penalized by SSE.]
SSE cost function for classification
Is it more suitable if we set $f(\mathbf{x}; \mathbf{w}) = g(\mathbf{w}^T\mathbf{x})$?
$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2$  ($K = 2$)
$\operatorname{sign}(z) = \begin{cases} -1, & z < 0 \\ 1, & z \geq 0 \end{cases}$
$J(\mathbf{w})$ is now a piecewise-constant function of $\mathbf{w}$, proportional to the number of misclassifications, i.e., the training error incurred in classifying the training samples
[Figure: $(\operatorname{sign}(\mathbf{w}^T\mathbf{x}) - y)^2$ plotted against $\mathbf{w}^T\mathbf{x}$ for $y = 1$.]
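A quick numeric check of why plain SSE misbehaves for classification (the score below is made up): a confidently correct prediction gets a large squared error on the raw score, while the thresholded loss gives it zero.

```python
def sign(z):
    """sign(z) = -1 if z < 0, else 1 (as defined above)."""
    return -1.0 if z < 0 else 1.0

# A confidently correct prediction for a sample with y = 1
score, y = 5.0, 1.0
print((score - y) ** 2)        # 16.0: SSE on the raw score penalizes the "too correct" prediction
print((sign(score) - y) ** 2)  #  0.0: the thresholded loss does not
# A misclassified sample contributes (sign - y)^2 = 4, so this cost counts errors (times 4).
```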
Perceptron algorithm
A linear classifier
Two classes: $y \in \{-1, 1\}$ ($y = -1$ for $\mathcal{C}_2$, $y = 1$ for $\mathcal{C}_1$)
$f(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$
Goal:
$\forall i,\; \mathbf{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} > 0$
$\forall i,\; \mathbf{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} < 0$
Perceptron criterion
$J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{w}^T\mathbf{x}^{(i)} y^{(i)}$
$\mathcal{M}$: the subset of training data that are misclassified
Many solutions? Which solution among them?
Cost function
[Figure (from Duda, Hart & Stork, 2002): the number of misclassifications as a cost function $J(\mathbf{w})$, and the Perceptron cost function $J_P(\mathbf{w})$, each plotted over $(w_0, w_1)$.]
Both cost functions may have many minimizing solutions
Batch Perceptron
"Gradient descent" to solve the optimization problem:
$\mathbf{w}^{t+1} = \mathbf{w}^{t} - \eta \nabla_{\mathbf{w}} J_P(\mathbf{w}^{t})$
$\nabla_{\mathbf{w}} J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
The batch Perceptron converges in a finite number of steps for linearly separable data:
  Initialize $\mathbf{w}$
  Repeat
    $\mathbf{w} = \mathbf{w} + \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
  Until $\left\| \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)} \right\| < \theta$
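A minimal sketch of the batch Perceptron update above, assuming a constant bias feature has already been appended to each $\mathbf{x}^{(i)}$ and labels are in $\{-1, +1\}$; the toy data are made up.

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, theta=1e-6, max_iter=1000):
    """Batch Perceptron: add eta * sum of x^(i) y^(i) over misclassified samples."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mis = np.sign(X @ w) != y                     # misclassified set M (sign(0) = 0 counts too)
        update = eta * (X[mis] * y[mis, None]).sum(axis=0)
        if np.linalg.norm(update) < theta:            # stopping criterion from the pseudocode
            break
        w += update
    return w

# Toy linearly separable data; the first column is the constant bias feature
X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0], [1, -1.0, -1.0], [1, -2.0, 0.5]])
y = np.array([1, 1, -1, -1])
w = batch_perceptron(X, y)
print(np.sign(X @ w))  # matches y once the data are separated
```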
Stochastic gradient descent for Perceptron
Single-sample Perceptron: if $\mathbf{x}^{(i)}$ is misclassified:
$\mathbf{w}^{t+1} = \mathbf{w}^{t} + \eta \mathbf{x}^{(i)} y^{(i)}$
Perceptron convergence theorem: if the training data are linearly separable, the single-sample Perceptron is also guaranteed to find a solution in a finite number of steps
Fixed-increment single-sample Perceptron ($\eta$ can be set to 1 and the proof still works):
  Initialize $\mathbf{w}$, $t \leftarrow 0$
  Repeat
    $t \leftarrow t + 1$
    $i \leftarrow t \bmod N$
    if $\mathbf{x}^{(i)}$ is misclassified then
      $\mathbf{w} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
  Until all patterns are properly classified
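A sketch of the fixed-increment single-sample variant, under the same assumptions as before (bias feature included in $\mathbf{x}$, labels in $\{-1, +1\}$); the epoch cap is added so the loop terminates even on non-separable data.

```python
import numpy as np

def single_sample_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample Perceptron (eta = 1): cycle through the samples,
    updating w only on the currently visited misclassified sample."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            if np.sign(w @ X[i]) != y[i]:   # misclassified (sign(0) = 0 also counts)
                w += X[i] * y[i]
                errors += 1
        if errors == 0:                      # all patterns properly classified
            return w
    return w                                 # may not have converged if data are not separable
```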
Example
[Figure: Perceptron example.]
Perceptron: Example
Change $\mathbf{w}$ in a direction that corrects the error [Bishop]
Convergence of Perceptron
For data sets that are not linearly separable, the single-sample Perceptron learning algorithm will never converge
[Duda, Hart & Stork, 2002]
Pocket algorithm
For data that are not linearly separable (e.g., due to noise):
Keep "in a pocket" the best $\mathbf{w}$ encountered so far.
  Initialize $\mathbf{w}$
  for $t = 1, \dots, T$
    $i \leftarrow t \bmod N$
    if $\mathbf{x}^{(i)}$ is misclassified then
      $\mathbf{w}_{new} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
      if $E_{train}(\mathbf{w}_{new}) < E_{train}(\mathbf{w})$ then
        $\mathbf{w} = \mathbf{w}_{new}$
  end
$E_{train}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left[\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(n)}) \neq y^{(n)}\right]$
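A sketch of the pocket idea under the same assumptions as before (bias feature included, labels in $\{-1, +1\}$); following the pseudocode above, an update is only kept when it lowers the training error.

```python
import numpy as np

def train_error(w, X, y):
    """Fraction of training samples misclassified by sign(w^T x)."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, T=1000):
    """Pocket algorithm: run Perceptron updates but keep only weight vectors
    that improve the training error."""
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = t % N
        if np.sign(w @ X[i]) != y[i]:
            w_new = w + X[i] * y[i]
            if train_error(w_new, X, y) < train_error(w, X, y):
                w = w_new                    # keep the better weight vector in the pocket
    return w
```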
Linear Discriminant Analysis (LDA)
Fisher's Linear Discriminant Analysis:
Dimensionality reduction
Finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as the new discriminant variables)
Classification
Predicts the class of an observation 𝒙 by first projecting it to the
space of discriminant variables and then classifying it in this space
Good Projection for Classification
What is a good criterion?
Separating different classes in the projected space
[Figures: the same two-class data projected onto different directions $\mathbf{w}$; a good projection separates the classes well.]
LDA Problem
Problem definition:
$C = 2$ classes
$\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ training samples, with $N_1$ samples from the first class ($\mathcal{C}_1$) and $N_2$ samples from the second class ($\mathcal{C}_2$)
Goal: find the best direction $\mathbf{w}$ that we hope will enable accurate classification
The projection of a sample $\mathbf{x}$ onto a line in direction $\mathbf{w}$ is $\mathbf{w}^T\mathbf{x}$
What is a measure of the separation between the projected points of different classes?
Measure of Separation in the Projected Direction
[Figure from Bishop]
Is the direction of the line joining the class means a good candidate for $\mathbf{w}$?
Measure of Separation in the Projected Direction
The direction of the line joining the class means is the solution of the following problem (it maximizes the separation of the projected class means):
$\max_{\mathbf{w}} \; J(\mathbf{w}) = (\mu_1' - \mu_2')^2 \quad \text{s.t.} \; \|\mathbf{w}\| = 1$
where $\mu_1' = \mathbf{w}^T\boldsymbol{\mu}_1$, $\mu_2' = \mathbf{w}^T\boldsymbol{\mu}_2$, $\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \mathbf{x}^{(i)}$, $\boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \mathbf{x}^{(i)}$
What is the problem with a criterion that considers only $(\mu_1' - \mu_2')$?
It does not consider the variances of the classes in the projected direction
LDA Criteria
Fisher's idea: maximize a function that gives
a large separation between the projected class means,
while also achieving a small variance within each class, thereby minimizing the class overlap:
$J(\mathbf{w}) = \frac{(\mu_1' - \mu_2')^2}{s_1'^2 + s_2'^2}$
LDA Criteria
The scatters of the original data are:
$s_1^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left\|\mathbf{x}^{(i)} - \boldsymbol{\mu}_1\right\|^2$
$s_2^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \left\|\mathbf{x}^{(i)} - \boldsymbol{\mu}_2\right\|^2$
The scatters of the projected data are:
$s_1'^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2$
$s_2'^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2$
LDA Criteria
$J(\mathbf{w}) = \frac{(\mu_1' - \mu_2')^2}{s_1'^2 + s_2'^2}$
$(\mu_1' - \mu_2')^2 = \left(\mathbf{w}^T\boldsymbol{\mu}_1 - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2 = \mathbf{w}^T (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{w}$
$s_1'^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2 = \mathbf{w}^T \left[\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)^T\right] \mathbf{w}$
LDA Criteria
$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$
Between-class scatter matrix: $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
Within-class scatter matrix: $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$
$\mathbf{S}_1 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)^T$
$\mathbf{S}_2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_2)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_2)^T$
(scatter matrix = $N \times$ covariance matrix)
LDA Derivation
$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$
$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \frac{2\mathbf{S}_B\mathbf{w}\,(\mathbf{w}^T \mathbf{S}_W \mathbf{w}) - 2\mathbf{S}_W\mathbf{w}\,(\mathbf{w}^T \mathbf{S}_B \mathbf{w})}{(\mathbf{w}^T \mathbf{S}_W \mathbf{w})^2}$
$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = 0 \;\Rightarrow\; (\mathbf{w}^T \mathbf{S}_W \mathbf{w})\,\mathbf{S}_B\mathbf{w} - (\mathbf{w}^T \mathbf{S}_B \mathbf{w})\,\mathbf{S}_W\mathbf{w} = 0 \;\Rightarrow\; \mathbf{S}_B\mathbf{w} = J(\mathbf{w})\,\mathbf{S}_W\mathbf{w}$
LDA Derivation
This is a generalized eigenvalue problem: $\mathbf{S}_B\mathbf{w} = \lambda \mathbf{S}_W\mathbf{w}$, i.e., $\mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w} = \lambda\mathbf{w}$ if $\mathbf{S}_W$ is full-rank
$\mathbf{S}_B\mathbf{w}$ (for any vector $\mathbf{w}$) points in the same direction as $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$:
$\mathbf{S}_B\mathbf{w} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{w} \;\propto\; \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$
Thus, we can solve the eigenvalue problem immediately:
$\mathbf{w} \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$
LDA Algorithm
Find $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ as the means of class 1 and class 2, respectively
Find $\mathbf{S}_1$ and $\mathbf{S}_2$ as the scatter matrices of class 1 and class 2, respectively
$\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, $\;\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
Feature extraction:
$\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, which is the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}_W^{-1}\mathbf{S}_B$
Classification:
$\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$; using a threshold on $\mathbf{w}^T\mathbf{x}$, we can classify $\mathbf{x}$
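A compact sketch of these steps; the data are made up, and the particular threshold used here (the midpoint of the projected means) is one simple choice, not the only one.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher direction w = S_W^{-1} (mu1 - mu2) for two classes given as row-sample arrays."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)        # scatter matrix of class 2
    Sw = S1 + S2                          # within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu2)    # proportional to S_W^{-1}(mu1 - mu2)
    threshold = w @ (mu1 + mu2) / 2       # midpoint of the projected class means
    return w, threshold

# Made-up 2-D data for the two classes
rng = np.random.default_rng(0)
X1 = rng.normal([2, 2], 0.5, size=(20, 2))
X2 = rng.normal([0, 0], 0.5, size=(20, 2))
w, thr = fisher_lda(X1, X2)
print((X1 @ w > thr).mean(), (X2 @ w < thr).mean())  # fraction correctly classified per class
```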
Multi-class classification
Solutions to multi-category problems:
Extend the learning algorithm to support multi-class:
A function $f_i(\mathbf{x})$ is found for each class $i$, and $\mathbf{x}$ is assigned to class $C_i$ if $f_i(\mathbf{x}) > f_j(\mathbf{x})\; \forall j \neq i$, i.e.,
$y = \operatorname{argmax}_{i=1,\dots,c} f_i(\mathbf{x})$
Convert the problem to a set of two-class problems
Converting a multi-class problem to a set of two-class problems
“one versus rest” or “one against all”
For each class 𝐶𝑖, a linear discriminant function that separates
samples of 𝐶𝑖 from all the other samples is found.
Totally linearly separable
“one versus one”
$c(c-1)/2$ linear discriminant functions are used, one to separate the samples of each pair of classes.
Pairwise linearly separable
(A sketch of both schemes is given below.)
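A minimal sketch of how the two constructions build their binary sub-problems, using a Perceptron as the binary learner; the helper names (`one_vs_rest`, `one_vs_one`) are just for this illustration, and a bias feature is assumed to be included in X.

```python
import numpy as np
from itertools import combinations

def perceptron(X, y, epochs=50):
    """Simple binary Perceptron (labels in {-1, +1}); returns the weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:
                w += xi * yi
    return w

def one_vs_rest(X, y, classes):
    """Train one binary classifier per class: class c versus all the rest."""
    return {c: perceptron(X, np.where(y == c, 1, -1)) for c in classes}

def one_vs_one(X, y, classes):
    """Train c(c-1)/2 binary classifiers, one per pair of classes."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = perceptron(X[mask], np.where(y[mask] == a, 1, -1))
    return models
```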
Multi-class classification
One-vs-all (one-vs-rest)
[Figure: classes 1, 2, and 3 in the $(x_1, x_2)$ plane and the binary sub-problem built for each class against the rest.]
Multi-class classification
One-vs-one
[Figure: classes 1, 2, and 3 in the $(x_1, x_2)$ plane and the binary sub-problem built for each pair of classes.]
Multi-class classification: ambiguity
[Figure (from Duda, Hart & Stork, 2002): ambiguous regions under the one-versus-rest and one-versus-one constructions.]
Converting the multi-class problem to a set of two-class
problems can lead to regions in which the classification is
undefined
Multi-class classification: linear machine
A discriminant function $f_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$ for each class $\mathcal{C}_i$ ($i = 1, \dots, K$):
$\mathbf{x}$ is assigned to class $\mathcal{C}_i$ if $f_i(\mathbf{x}) > f_j(\mathbf{x})\; \forall j \neq i$
Decision surfaces (boundaries) can also be found using the discriminant functions
Boundary of the contiguous regions $\mathcal{R}_i$ and $\mathcal{R}_j$: $\forall \mathbf{x},\; f_i(\mathbf{x}) = f_j(\mathbf{x})$, i.e.,
$(\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} + (w_{i0} - w_{j0}) = 0$
Multi-class classification: linear machine
[Figure (from Duda, Hart & Stork, 2002): decision regions of a linear machine.]
Perceptron: multi-class
$y = \operatorname{argmax}_{i=1,\dots,c} \; \mathbf{w}_i^T\mathbf{x}$
$J_P(\mathbf{W}) = -\sum_{i \in \mathcal{M}} \left(\mathbf{w}_{y^{(i)}} - \mathbf{w}_{\hat{y}^{(i)}}\right)^T \mathbf{x}^{(i)}$
$\mathcal{M}$: the subset of training data that are misclassified, $\mathcal{M} = \{i \mid \hat{y}^{(i)} \neq y^{(i)}\}$
  Initialize $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_c]$, $t \leftarrow 0$
  Repeat
    $t \leftarrow t + 1$, $i \leftarrow t \bmod N$
    if $\mathbf{x}^{(i)}$ is misclassified (predicted class $\hat{y}^{(i)} \neq y^{(i)}$) then
      $\mathbf{w}_{\hat{y}^{(i)}} = \mathbf{w}_{\hat{y}^{(i)}} - \mathbf{x}^{(i)}$
      $\mathbf{w}_{y^{(i)}} = \mathbf{w}_{y^{(i)}} + \mathbf{x}^{(i)}$
  Until all patterns are properly classified
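A minimal sketch of this multi-class Perceptron, assuming a bias feature in each $\mathbf{x}^{(i)}$ and integer labels $0, \dots, c-1$ (the data layout is an assumption of this example).

```python
import numpy as np

def multiclass_perceptron(X, y, num_classes, max_epochs=100):
    """Multi-class Perceptron: predict argmax_i w_i^T x; on a mistake, reward the
    true class's weight vector and penalize the predicted one."""
    N, d = X.shape
    W = np.zeros((num_classes, d))            # one weight vector per class (rows)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(N):
            y_hat = int(np.argmax(W @ X[i]))  # predicted class
            if y_hat != y[i]:
                W[y_hat] -= X[i]              # penalize the wrongly predicted class
                W[y[i]] += X[i]               # reward the true class
                mistakes += 1
        if mistakes == 0:                     # all patterns properly classified
            break
    return W
```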
Resources
C. Bishop, “Pattern Recognition and Machine Learning”,
Chapter 4.1.