Linear classifiers
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani
Fall 2016
Topics
Discriminant functions
Linear classifiers
Perceptron
Fisher
Multi-class classification
SVM will be covered in later lectures.
Classification problem
Given: a training set, i.e., a labeled set of $N$ input-output pairs $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $y \in \{1, \dots, K\}$
Goal: Given an input 𝒙, assign it to one of 𝐾 classes
Examples:
Spam filter
Handwritten digit recognition
…
Discriminant functions
A discriminant function can directly assign each vector $\mathbf{x}$ to a specific class $k$
A popular way of representing a classifier
Many classification methods are based on discriminant functions
Assumption: the classes are taken to be disjoint
The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Discriminant Functions
Discriminant functions: a discriminant function $f_i(\mathbf{x})$ for each class $\mathcal{C}_i$ ($i = 1, \dots, K$):
$\mathbf{x}$ is assigned to class $\mathcal{C}_i$ if:
$f_i(\mathbf{x}) > f_j(\mathbf{x}) \quad \forall j \neq i$
Thus, we can easily divide the feature space into $K$ decision regions:
$\forall \mathbf{x},\; f_i(\mathbf{x}) > f_j(\mathbf{x})\; \forall j \neq i \;\Rightarrow\; \mathbf{x} \in \mathcal{R}_i$
($\mathcal{R}_i$: region of the $i$-th class)
Decision surfaces (or boundaries) can also be found using discriminant functions
Boundary of $\mathcal{R}_i$ and $\mathcal{R}_j$, separating samples of these two categories:
$\forall \mathbf{x},\; f_i(\mathbf{x}) = f_j(\mathbf{x})$
Discriminant Functions: Two-Category
First, we explain the two-category classification problem and then discuss multi-category problems.
For a two-category problem, we need only a single function $f: \mathbb{R}^d \to \mathbb{R}$:
$f_1(\mathbf{x}) = f(\mathbf{x})$
$f_2(\mathbf{x}) = -f(\mathbf{x})$
Decision surface: $f(\mathbf{x}) = 0$
Binary classification: a target variable $y \in \{0, 1\}$ or $y \in \{-1, 1\}$
Linear classifiers
Decision boundaries are linear in 𝒙, or linear in some
given set of functions of 𝒙
Linearly separable data: data points that can be exactly
classified by a linear decision surface.
Why linear classifier?
Even when they are not optimal, their simplicity makes them useful: they are relatively easy to compute
In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers.
Two Category
$f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x} + w_0 = w_0 + w_1 x_1 + \dots + w_d x_d$, where $\mathbf{x} = [x_1\; x_2\; \dots\; x_d]$ and $\mathbf{w} = [w_1\; w_2\; \dots\; w_d]$
𝑤0: bias
if 𝒘𝑇𝒙 + 𝑤0 ≥ 0 then 𝒞1 else 𝒞2
Decision surface (boundary): 𝒘𝑇𝒙 + 𝑤0 = 0
𝒘 is orthogonal to every vector lying within the decision surface
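As a concrete illustration of this decision rule, here is a minimal NumPy sketch; the weights, bias, and test point are made up for the example and are not from the slides.

```python
import numpy as np

def predict(x, w, w0):
    """Two-class linear rule: C1 if w^T x + w0 >= 0, else C2."""
    return "C1" if w @ x + w0 >= 0 else "C2"

# Hypothetical weights, bias, and test point
w = np.array([1.0, -1.0])
w0 = 0.5
x = np.array([2.0, 1.0])
print(predict(x, w, w0))  # w @ x + w0 = 1.5 >= 0, so "C1"
```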
Example
[Figure: a two-dimensional example in the $(x_1, x_2)$ plane showing a linear decision boundary.]
if $\mathbf{w}^T\mathbf{x} + w_0 \geq 0$ then $\mathcal{C}_1$, else $\mathcal{C}_2$
Linear classifier: Two Category
The decision boundary is a $(d-1)$-dimensional hyperplane $H$ in the $d$-dimensional feature space
The orientation of $H$ is determined by the normal vector $[w_1, \dots, w_d]$
$w_0$ determines the location of the surface; the signed normal distance from the origin to the decision surface is $-\frac{w_0}{\|\mathbf{w}\|}$
Writing $\mathbf{x} = \mathbf{x}_\perp + r\frac{\mathbf{w}}{\|\mathbf{w}\|}$, where $\mathbf{x}_\perp$ is the projection of $\mathbf{x}$ onto the surface $f(\mathbf{x}) = 0$:
$\mathbf{w}^T\mathbf{x} + w_0 = r\|\mathbf{w}\| \;\Rightarrow\; r = \frac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$
Thus $f(\mathbf{x})$ gives a signed measure of the perpendicular distance $r$ of the point $\mathbf{x}$ from the decision surface
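To make the geometry concrete, a small sketch (with made-up numbers) computing the signed distance $r$ of a point from the hyperplane:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance r = (w^T x + w0) / ||w|| from the hyperplane w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

# Hypothetical hyperplane and two test points
w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([0.0, 0.0]), w, w0))  # -1.0: the origin lies on the negative side
print(signed_distance(np.array([3.0, 4.0]), w, w0))  #  4.0: this point lies on the positive side
```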
Linear boundary: geometry
[Figure: geometry of the linear boundary, showing the regions $\mathbf{w}^T\mathbf{x} + w_0 > 0$ and $\mathbf{w}^T\mathbf{x} + w_0 < 0$ on either side of the surface $\mathbf{w}^T\mathbf{x} + w_0 = 0$, the normal vector $\mathbf{w}$, and the signed distance $\frac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$.]
Non-linear decision boundary
Choose non-linear features
The classifier is still linear in the parameters $\mathbf{w}$
[Figure: circular decision boundary $-1 + x_1^2 + x_2^2 = 0$ in the $(x_1, x_2)$ plane.]
if $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) \geq 0$ then $y = 1$, else $y = -1$
$\boldsymbol{\phi}(\mathbf{x}) = [1,\; x_1,\; x_2,\; x_1^2,\; x_2^2,\; x_1 x_2]$
$\mathbf{w} = [w_0, w_1, \dots, w_m] = [-1, 0, 0, 1, 1, 0]$
$\mathbf{x} = [x_1, x_2]$
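A small sketch of this idea, using the feature map and weight vector from the slide: the classifier is linear in $\mathbf{w}$ but produces a circular boundary in the original input space.

```python
import numpy as np

def phi(x):
    """Quadratic feature map phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # boundary: -1 + x1^2 + x2^2 = 0

def predict(x):
    return 1 if w @ phi(x) >= 0 else -1

print(predict([0.0, 0.0]))   # -1: inside the unit circle
print(predict([2.0, 0.0]))   #  1: outside the unit circle
```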
Cost Function for linear classification
Finding linear classifiers can be formulated as an optimization problem:
Select how to measure the prediction loss
Based on the training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, a cost function $J(\mathbf{w})$ is defined
Solve the resulting optimization problem to find the parameters:
Find the optimal $f(\mathbf{x}) = f(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
Criterion or cost functions for classification:
We will investigate several cost functions for the classification problem
SSE cost function for classification
The SSE cost function is not suitable for classification:
The least-squares loss penalizes "too correct" predictions (those that lie a long way on the correct side of the decision boundary)
The least-squares loss also lacks robustness to noise
$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)^2$  ($K = 2$)
SSE cost function for classification
[Figure (from Bishop), $K = 2$: the squared error $(\mathbf{w}^T\mathbf{x} - y)^2$ plotted against $\mathbf{w}^T\mathbf{x}$ for $y = 1$ and for $y = -1$; correct predictions that lie far on the correct side are still penalized by SSE.]
SSE cost function for classification
Is it more suitable if we set $f(\mathbf{x}; \mathbf{w}) = g(\mathbf{w}^T\mathbf{x})$?
$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2$  ($K = 2$)
$\operatorname{sign}(z) = \begin{cases} -1, & z < 0 \\ 1, & z \geq 0 \end{cases}$
$J(\mathbf{w})$ is now a piecewise-constant function of $\mathbf{w}$, proportional to the number of misclassifications, i.e., the training error incurred in classifying the training samples
[Figure: $(\operatorname{sign}(\mathbf{w}^T\mathbf{x}) - y)^2$ plotted against $\mathbf{w}^T\mathbf{x}$ for $y = 1$.]
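A quick numeric check of why plain SSE misbehaves for classification (the score below is made up): a confidently correct prediction gets a large squared error on the raw score, while the thresholded loss gives it zero.

```python
def sign(z):
    """sign(z) = -1 if z < 0, else 1 (as defined above)."""
    return -1.0 if z < 0 else 1.0

# A confidently correct prediction for a sample with y = 1
score, y = 5.0, 1.0
print((score - y) ** 2)        # 16.0: SSE on the raw score penalizes the "too correct" prediction
print((sign(score) - y) ** 2)  #  0.0: the thresholded loss does not
# A misclassified sample contributes (sign - y)^2 = 4, so this cost counts errors (times 4).
```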
Perceptron algorithm
A linear classifier
Two classes: $y \in \{-1, 1\}$ ($y = -1$ for $\mathcal{C}_2$, $y = 1$ for $\mathcal{C}_1$)
$f(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$
Goal:
$\forall i,\; \mathbf{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} > 0$
$\forall i,\; \mathbf{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} < 0$
Perceptron criterion
$J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{w}^T\mathbf{x}^{(i)} y^{(i)}$
$\mathcal{M}$: the subset of training data that are misclassified
Many solutions? Which solution among them?
Cost function
[Figure (from Duda, Hart & Stork, 2002): the number of misclassifications as a cost function $J(\mathbf{w})$, and the Perceptron cost function $J_P(\mathbf{w})$, each plotted over $(w_0, w_1)$.]
Both cost functions may have many minimizing solutions
Batch Perceptron
"Gradient descent" to solve the optimization problem:
$\mathbf{w}^{t+1} = \mathbf{w}^{t} - \eta \nabla_{\mathbf{w}} J_P(\mathbf{w}^{t})$
$\nabla_{\mathbf{w}} J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
The batch Perceptron converges in a finite number of steps for linearly separable data:
  Initialize $\mathbf{w}$
  Repeat
    $\mathbf{w} = \mathbf{w} + \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
  Until $\left\| \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)} \right\| < \theta$
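A minimal sketch of the batch Perceptron update above, assuming a constant bias feature has already been appended to each $\mathbf{x}^{(i)}$ and labels are in $\{-1, +1\}$; the toy data are made up.

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, theta=1e-6, max_iter=1000):
    """Batch Perceptron: add eta * sum of x^(i) y^(i) over misclassified samples."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mis = np.sign(X @ w) != y                     # misclassified set M (sign(0) = 0 counts too)
        update = eta * (X[mis] * y[mis, None]).sum(axis=0)
        if np.linalg.norm(update) < theta:            # stopping criterion from the pseudocode
            break
        w += update
    return w

# Toy linearly separable data; the first column is the constant bias feature
X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0], [1, -1.0, -1.0], [1, -2.0, 0.5]])
y = np.array([1, 1, -1, -1])
w = batch_perceptron(X, y)
print(np.sign(X @ w))  # matches y once the data are separated
```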
Stochastic gradient descent for Perceptron
Single-sample Perceptron: if $\mathbf{x}^{(i)}$ is misclassified:
$\mathbf{w}^{t+1} = \mathbf{w}^{t} + \eta \mathbf{x}^{(i)} y^{(i)}$
Perceptron convergence theorem: if the training data are linearly separable, the single-sample Perceptron is also guaranteed to find a solution in a finite number of steps
Fixed-increment single-sample Perceptron ($\eta$ can be set to 1 and the proof still works):
  Initialize $\mathbf{w}$, $t \leftarrow 0$
  Repeat
    $t \leftarrow t + 1$
    $i \leftarrow t \bmod N$
    if $\mathbf{x}^{(i)}$ is misclassified then
      $\mathbf{w} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
  Until all patterns are properly classified
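A sketch of the fixed-increment single-sample variant, under the same assumptions as before (bias feature included in $\mathbf{x}$, labels in $\{-1, +1\}$); the epoch cap is added so the loop terminates even on non-separable data.

```python
import numpy as np

def single_sample_perceptron(X, y, max_epochs=100):
    """Fixed-increment single-sample Perceptron (eta = 1): cycle through the samples,
    updating w only on the currently visited misclassified sample."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            if np.sign(w @ X[i]) != y[i]:   # misclassified (sign(0) = 0 also counts)
                w += X[i] * y[i]
                errors += 1
        if errors == 0:                      # all patterns properly classified
            return w
    return w                                 # may not have converged if data are not separable
```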
Example
[Figure: Perceptron example.]
Perceptron: Example
Change $\mathbf{w}$ in a direction that corrects the error [Bishop]
Convergence of Perceptron
For data sets that are not linearly separable, the single-sample Perceptron learning algorithm will never converge
[Duda, Hart & Stork, 2002]
Pocket algorithm
For data that are not linearly separable (e.g., due to noise):
Keep "in a pocket" the best $\mathbf{w}$ encountered so far.
  Initialize $\mathbf{w}$
  for $t = 1, \dots, T$
    $i \leftarrow t \bmod N$
    if $\mathbf{x}^{(i)}$ is misclassified then
      $\mathbf{w}_{new} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
      if $E_{train}(\mathbf{w}_{new}) < E_{train}(\mathbf{w})$ then
        $\mathbf{w} = \mathbf{w}_{new}$
  end
$E_{train}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left[\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(n)}) \neq y^{(n)}\right]$
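A sketch of the pocket idea under the same assumptions as before (bias feature included, labels in $\{-1, +1\}$); following the pseudocode above, an update is only kept when it lowers the training error.

```python
import numpy as np

def train_error(w, X, y):
    """Fraction of training samples misclassified by sign(w^T x)."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, T=1000):
    """Pocket algorithm: run Perceptron updates but keep only weight vectors
    that improve the training error."""
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = t % N
        if np.sign(w @ X[i]) != y[i]:
            w_new = w + X[i] * y[i]
            if train_error(w_new, X, y) < train_error(w, X, y):
                w = w_new                    # keep the better weight vector in the pocket
    return w
```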
Linear Discriminant Analysis (LDA)
Fisher's Linear Discriminant Analysis:
Dimensionality reduction
Finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as the new discriminant variables)
Classification
Predicts the class of an observation 𝒙 by first projecting it to the
space of discriminant variables and then classifying it in this space
Good Projection for Classification
What is a good criterion?
Separating different classes in the projected space
[Figures: the same two-class data projected onto different directions $\mathbf{w}$; a good projection separates the classes well.]
LDA Problem
Problem definition:
$C = 2$ classes
$\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ training samples, with $N_1$ samples from the first class ($\mathcal{C}_1$) and $N_2$ samples from the second class ($\mathcal{C}_2$)
Goal: find the best direction $\mathbf{w}$ that we hope will enable accurate classification
The projection of a sample $\mathbf{x}$ onto a line in direction $\mathbf{w}$ is $\mathbf{w}^T\mathbf{x}$
What is a measure of the separation between the projected points of different classes?
Measure of Separation in the Projected Direction
[Figure from Bishop]
Is the direction of the line joining the class means a good candidate for $\mathbf{w}$?
Measure of Separation in the Projected Direction
The direction of the line joining the class means is the solution of the following problem (it maximizes the separation of the projected class means):
$\max_{\mathbf{w}} \; J(\mathbf{w}) = (\mu_1' - \mu_2')^2 \quad \text{s.t.} \; \|\mathbf{w}\| = 1$
where $\mu_1' = \mathbf{w}^T\boldsymbol{\mu}_1$, $\mu_2' = \mathbf{w}^T\boldsymbol{\mu}_2$, $\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \mathbf{x}^{(i)}$, $\boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \mathbf{x}^{(i)}$
What is the problem with a criterion that considers only $(\mu_1' - \mu_2')$?
It does not consider the variances of the classes in the projected direction
LDA Criteria
Fisher's idea: maximize a function that gives
a large separation between the projected class means,
while also achieving a small variance within each class, thereby minimizing the class overlap:
$J(\mathbf{w}) = \frac{(\mu_1' - \mu_2')^2}{s_1'^2 + s_2'^2}$
LDA Criteria
The scatters of the original data are:
$s_1^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left\|\mathbf{x}^{(i)} - \boldsymbol{\mu}_1\right\|^2$
$s_2^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \left\|\mathbf{x}^{(i)} - \boldsymbol{\mu}_2\right\|^2$
The scatters of the projected data are:
$s_1'^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2$
$s_2'^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2$
LDA Criteria
$J(\mathbf{w}) = \frac{(\mu_1' - \mu_2')^2}{s_1'^2 + s_2'^2}$
$(\mu_1' - \mu_2')^2 = \left(\mathbf{w}^T\boldsymbol{\mu}_1 - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2 = \mathbf{w}^T (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{w}$
$s_1'^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2 = \mathbf{w}^T \left[\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)^T\right] \mathbf{w}$
LDA Criteria
$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$
Between-class scatter matrix: $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
Within-class scatter matrix: $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$
$\mathbf{S}_1 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)^T$
$\mathbf{S}_2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_2)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_2)^T$
(scatter matrix = $N \times$ covariance matrix)
LDA Derivation
$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$
$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \frac{2\mathbf{S}_B\mathbf{w}\,(\mathbf{w}^T \mathbf{S}_W \mathbf{w}) - 2\mathbf{S}_W\mathbf{w}\,(\mathbf{w}^T \mathbf{S}_B \mathbf{w})}{(\mathbf{w}^T \mathbf{S}_W \mathbf{w})^2}$
$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = 0 \;\Rightarrow\; (\mathbf{w}^T \mathbf{S}_W \mathbf{w})\,\mathbf{S}_B\mathbf{w} - (\mathbf{w}^T \mathbf{S}_B \mathbf{w})\,\mathbf{S}_W\mathbf{w} = 0 \;\Rightarrow\; \mathbf{S}_B\mathbf{w} = J(\mathbf{w})\,\mathbf{S}_W\mathbf{w}$
LDA Derivation
This is a generalized eigenvalue problem: $\mathbf{S}_B\mathbf{w} = \lambda \mathbf{S}_W\mathbf{w}$, i.e., $\mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w} = \lambda\mathbf{w}$ if $\mathbf{S}_W$ is full-rank
$\mathbf{S}_B\mathbf{w}$ (for any vector $\mathbf{w}$) points in the same direction as $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$:
$\mathbf{S}_B\mathbf{w} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{w} \;\propto\; \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$
Thus, we can solve the eigenvalue problem immediately:
$\mathbf{w} \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$
LDA Algorithm
Find $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ as the means of class 1 and class 2, respectively
Find $\mathbf{S}_1$ and $\mathbf{S}_2$ as the scatter matrices of class 1 and class 2, respectively
$\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, $\;\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
Feature extraction:
$\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, which is the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}_W^{-1}\mathbf{S}_B$
Classification:
$\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$; using a threshold on $\mathbf{w}^T\mathbf{x}$, we can classify $\mathbf{x}$
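A compact sketch of these steps; the data are made up, and the particular threshold used here (the midpoint of the projected means) is one simple choice, not the only one.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher direction w = S_W^{-1} (mu1 - mu2) for two classes given as row-sample arrays."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)        # scatter matrix of class 2
    Sw = S1 + S2                          # within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu2)    # proportional to S_W^{-1}(mu1 - mu2)
    threshold = w @ (mu1 + mu2) / 2       # midpoint of the projected class means
    return w, threshold

# Made-up 2-D data for the two classes
rng = np.random.default_rng(0)
X1 = rng.normal([2, 2], 0.5, size=(20, 2))
X2 = rng.normal([0, 0], 0.5, size=(20, 2))
w, thr = fisher_lda(X1, X2)
print((X1 @ w > thr).mean(), (X2 @ w < thr).mean())  # fraction correctly classified per class
```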
Multi-class classification
Solutions to multi-category problems:
Extend the learning algorithm to support multi-class:
A function $f_i(\mathbf{x})$ is found for each class $i$, and $\mathbf{x}$ is assigned to class $C_i$ if $f_i(\mathbf{x}) > f_j(\mathbf{x})\; \forall j \neq i$, i.e.,
$y = \operatorname{argmax}_{i=1,\dots,c} f_i(\mathbf{x})$
Convert the problem to a set of two-class problems
Converting a multi-class problem to a set of two-class problems
“one versus rest” or “one against all”
For each class 𝐶𝑖, a linear discriminant function that separates
samples of 𝐶𝑖 from all the other samples is found.
Totally linearly separable
“one versus one”
$c(c-1)/2$ linear discriminant functions are used, one to separate the samples of each pair of classes.
Pairwise linearly separable
(A sketch of both schemes is given below.)
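A minimal sketch of how the two constructions build their binary sub-problems, using a Perceptron as the binary learner; the helper names (`one_vs_rest`, `one_vs_one`) are just for this illustration, and a bias feature is assumed to be included in X.

```python
import numpy as np
from itertools import combinations

def perceptron(X, y, epochs=50):
    """Simple binary Perceptron (labels in {-1, +1}); returns the weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:
                w += xi * yi
    return w

def one_vs_rest(X, y, classes):
    """Train one binary classifier per class: class c versus all the rest."""
    return {c: perceptron(X, np.where(y == c, 1, -1)) for c in classes}

def one_vs_one(X, y, classes):
    """Train c(c-1)/2 binary classifiers, one per pair of classes."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = perceptron(X[mask], np.where(y[mask] == a, 1, -1))
    return models
```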
Multi-class classification
One-vs-all (one-vs-rest)
[Figure: classes 1, 2, and 3 in the $(x_1, x_2)$ plane and the binary sub-problem built for each class against the rest.]
Multi-class classification
One-vs-one
[Figure: classes 1, 2, and 3 in the $(x_1, x_2)$ plane and the binary sub-problem built for each pair of classes.]
Multi-class classification: ambiguity
[Figure (from Duda, Hart & Stork, 2002): ambiguous regions under the one-versus-rest and one-versus-one constructions.]
Converting the multi-class problem to a set of two-class
problems can lead to regions in which the classification is
undefined
Multi-class classification: linear machine
A discriminant function $f_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$ for each class $\mathcal{C}_i$ ($i = 1, \dots, K$):
$\mathbf{x}$ is assigned to class $\mathcal{C}_i$ if $f_i(\mathbf{x}) > f_j(\mathbf{x})\; \forall j \neq i$
Decision surfaces (boundaries) can also be found using the discriminant functions
Boundary of the contiguous regions $\mathcal{R}_i$ and $\mathcal{R}_j$: $\forall \mathbf{x},\; f_i(\mathbf{x}) = f_j(\mathbf{x})$, i.e.,
$(\mathbf{w}_i - \mathbf{w}_j)^T\mathbf{x} + (w_{i0} - w_{j0}) = 0$
Multi-class classification: linear machine
[Figure (from Duda, Hart & Stork, 2002): decision regions of a linear machine.]
Perceptron: multi-class
$y = \operatorname{argmax}_{i=1,\dots,c} \; \mathbf{w}_i^T\mathbf{x}$
$J_P(\mathbf{W}) = -\sum_{i \in \mathcal{M}} \left(\mathbf{w}_{y^{(i)}} - \mathbf{w}_{\hat{y}^{(i)}}\right)^T \mathbf{x}^{(i)}$
$\mathcal{M}$: the subset of training data that are misclassified, $\mathcal{M} = \{i \mid \hat{y}^{(i)} \neq y^{(i)}\}$
  Initialize $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_c]$, $t \leftarrow 0$
  Repeat
    $t \leftarrow t + 1$, $i \leftarrow t \bmod N$
    if $\mathbf{x}^{(i)}$ is misclassified (predicted class $\hat{y}^{(i)} \neq y^{(i)}$) then
      $\mathbf{w}_{\hat{y}^{(i)}} = \mathbf{w}_{\hat{y}^{(i)}} - \mathbf{x}^{(i)}$
      $\mathbf{w}_{y^{(i)}} = \mathbf{w}_{y^{(i)}} + \mathbf{x}^{(i)}$
  Until all patterns are properly classified
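A minimal sketch of this multi-class Perceptron, assuming a bias feature in each $\mathbf{x}^{(i)}$ and integer labels $0, \dots, c-1$ (the data layout is an assumption of this example).

```python
import numpy as np

def multiclass_perceptron(X, y, num_classes, max_epochs=100):
    """Multi-class Perceptron: predict argmax_i w_i^T x; on a mistake, reward the
    true class's weight vector and penalize the predicted one."""
    N, d = X.shape
    W = np.zeros((num_classes, d))            # one weight vector per class (rows)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(N):
            y_hat = int(np.argmax(W @ X[i]))  # predicted class
            if y_hat != y[i]:
                W[y_hat] -= X[i]              # penalize the wrongly predicted class
                W[y[i]] += X[i]               # reward the true class
                mistakes += 1
        if mistakes == 0:                     # all patterns properly classified
            break
    return W
```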
Resources
C. Bishop, “Pattern Recognition and Machine Learning”,
Chapter 4.1.