Lec3: Linear Discriminant Functions 1
Linear Discriminant
Functions
Prof. Daniel Yeung
School of Computer Science and Engineering
South China University of Technology
Pattern Recognition, Lecture 3
Lec3: Linear Discriminant Functions 2
Outline
• Linear Discriminant Function (5.2)
• Generalized Linear Discriminant Function (5.3)
• Gradient Descent (5.4)
• Criterion Function (5.5)
• Sum-of-squared-error function (5.8)
• Least-Mean-Squared (5.8.4)
• Relationship: MSE vs Bayes (5.8.3)
Lec3: Linear Discriminant Functions 3
Review
• Chapter 1 – Introduced pattern recognition systems and their main concepts (data collection, features, models, cost, decision making, classifiers, training and learning, performance evaluation).
• Chapter 2 – Bayes decision theory: the ideal case, where the probability structure underlying the classification categories is known perfectly. One can then design the optimal (Bayes) classifier and even predict its error.
• Chapter 3 – The probability structure is not known, but the general form of the distributions is. We then only need to estimate the parameters, e.g. by the maximum-likelihood technique, to achieve the best categorization.
Lec3: Linear Discriminant Functions 4
Review
• Chapter 4 – Moves a further step away from the Bayes model and makes no assumption about the underlying probability structure; we rely on the information provided by the training samples alone. Examples: the nearest-neighbor algorithm and potential functions.
• Chapter 5 – We know a "nearly" linear form for the discriminant functions, and use samples to estimate the values of the parameters of the classifier.
• Chapter 6 – Extends some of the linear-discriminant ideas to train multilayer neural networks.
Lec3: Linear Discriminant Functions 5
Parametric vs Non-parametric
• Parametric methods
  • Assume the form of the sample distribution (pdf) is known
  • Training samples are used to estimate the distribution parameters
    • E.g. µ and σ of a Gaussian
  • Accurate if the distribution assumption is correct; otherwise the result may be very poor
  • E.g. Bayes rule, etc.
Lec3: Linear Discriminant Functions 6
Parametric vs Non-parametric
• Non-parametric methods
  • Do not make assumptions about the form of the sample distribution (pdf)
  • Instead, a suitable form for the discriminant function is assumed (e.g., linear, neural network, SVM)
  • Training samples are used to estimate the values of the parameters of the classifier
  • Sub-optimal, but simple to use
Lec3: Linear Discriminant Functions 7
Linear Discriminant Function (LDF)
• Definition
  • A linear combination of the components of x (the vector representing the object to be classified):
    g(x) = wᵗx + w₀
    w : the weight vector
    w₀ : the bias or threshold weight
• In general, there are c functions g₁(x), g₂(x), …, g_c(x), where c is the number of classes
Lec3: Linear Discriminant Functions 8
Linear Discriminant Function: 2-Class Problem
• Decision rule
  • If g(x) > 0, decide ω1
  • If g(x) < 0, decide ω2
• Or, equivalently
  • If wᵗx > −w₀, decide ω1
  • Otherwise, decide ω2
• If g(x) = 0, the situation is ambiguous and x may be assigned to either class
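A minimal sketch of this decision rule in Python (NumPy assumed; the weights below are purely illustrative and not taken from the slides):

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant function g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

def decide(x, w, w0):
    """2-class rule: omega_1 if g(x) > 0, omega_2 if g(x) < 0, ambiguous if g(x) = 0."""
    value = g(x, w, w0)
    if value > 0:
        return "omega_1"
    if value < 0:
        return "omega_2"
    return "ambiguous: assign to either class"

# Illustrative weights: decision boundary 4*x1 + 2*x2 - 11 = 0
w, w0 = np.array([4.0, 2.0]), -11.0
print(decide(np.array([3.0, 1.0]), w, w0))  # g = 3 > 0  -> omega_1
print(decide(np.array([1.0, 2.0]), w, w0))  # g = -3 < 0 -> omega_2
```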
Lec3: Linear Discriminant Functions 9
Linear Discriminant Function: 2-Class Problem
• Each unit is shown as having inputs and outputs.
• The input units simply output the same values as their inputs (except the bias unit, which outputs a constant 1).
• The output unit emits "+1" if the sum of its weighted inputs is greater than zero and "−1" otherwise.
Figure: network view of the LDF (bias unit, feature-vector inputs, weights, summation, output).
Lec3: Linear Discriminant Functions 10
Linear Discriminant Function: 2-Class Problem
• The decision surface is defined by g(x) = 0
  • When g(x) is linear, this decision surface is a hyperplane
Figure: the hyperplane g(x) = 0 separating the two decision regions.
Lec3: Linear Discriminant Functions 11
Linear Discriminant Function
Multi-Class Problem: 1-against-All
Figure: four one-against-all hyperplanes H1, H2, H3, H4 partition the feature space.
For the highlighted region the four classifiers output:
H1 : not ω1, H2 : ω2, H3 : not ω3, H4 : not ω4
1 Yes, 3 No, so the region is assigned to ω2.
Lec3: Linear Discriminant Functions 12
Linear Discriminant Function
Multi-Class Problem: 1-against-All
Figure: the same four one-against-all hyperplanes H1, H2, H3, H4.
For this region the classifiers output:
H1 : not ω1, H2 : not ω2, H3 : not ω3, H4 : not ω4
0 Yes, 4 No, so no classifier claims the region (ambiguous).
Lec3: Linear Discriminant Functions 13
Linear Discriminant Function
Multi-Class Problem: 1-against-All
Figure: the same four one-against-all hyperplanes H1, H2, H3, H4.
For this region the classifiers output:
H1 : ω1, H2 : not ω2, H3 : ω3, H4 : not ω4
• Two classifiers think this region belongs to them
• This problem can be solved by comparing the outputs of the classifiers, e.g.
  • if H1 > H3, class 1
  • otherwise, class 3
  (see the sketch below)
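A sketch of one-against-all with this output-comparison tie-break (NumPy assumed; W and w0, which stack the c weight vectors and biases, are hypothetical names, not from the slides):

```python
import numpy as np

def one_vs_all_predict(x, W, w0):
    """One-against-all: classifier H_i claims x when g_i(x) = W[i] @ x + w0[i] > 0.
    If exactly one classifier claims x, that class wins; if several (or none) do,
    break the tie by comparing outputs, e.g. H1 > H3 -> class 1."""
    g = W @ x + w0                     # outputs g_1(x), ..., g_c(x)
    claimed = np.flatnonzero(g > 0)    # indices of classifiers saying "yes"
    if claimed.size == 1:
        return int(claimed[0])
    return int(np.argmax(g))           # ambiguous region: largest output wins
```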
Lec3: Linear Discriminant Functions 14
Linear Discriminant Function
Multi-Class Problem: 1-against-1
2. One against One
• Consider every possible pair of classes
• c(c−1)/2 classifiers are needed
• For example, a 3-class problem:
  1 vs 2 | 2 vs 3 | 1 vs 3
  If  Y  |  Y/N  |  Y , then class 1
  If  N  |   Y   | Y/N, then class 2
  If Y/N |   N   |  N , then class 3
  (Y: the first class of the pair wins; N: the second; Y/N: either outcome.)
� Any problem?
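Before looking at the problem, here is a minimal sketch of the one-against-one voting just described (Python; `pairwise` is a hypothetical mapping from a class pair to a trained binary classifier that returns the winning class):

```python
from collections import Counter

def one_vs_one_predict(x, pairwise):
    """One-against-one: run all c(c-1)/2 pairwise classifiers and take a
    majority vote over their winners. A tied vote is reported as ambiguous,
    which corresponds to the '?' regions shown a few slides later."""
    votes = Counter(clf(x) for clf in pairwise.values())
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None                    # tied vote: ambiguous region
    return ranked[0][0]                # class with the most votes
```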
Lec3: Linear Discriminant Functions 15
Linear Discriminant Function
Multi-Class Problem: 1-against-1
Figure: the six pairwise hyperplanes H12, H13, H14, H23, H24, H34.
For the highlighted region the pairwise classifiers output:
H12 : ω2, H13 : ω3, H14 : ω4, H21 : ω2, H23 : ω2, H24 : ω2,
H31 : ω4, H32 : ω2, H34 : ω4, H41 : ω4, H42 : ω2, H43 : ω4
ω2 receives the most votes, so the region is assigned to ω2.
Lec3: Linear Discriminant Functions 16
Linear Discriminant Function
Multi-Class Problem: 1-against-1
Figure: the same six pairwise hyperplanes.
For this region the pairwise classifiers output:
H12 : ω1, H13 : ω3, H14 : ω1, H21 : ω1, H23 : ω3, H24 : ω4,
H31 : ω3, H32 : ω3, H34 : ω3, H41 : ω1, H42 : ω4, H43 : ω4
ω3 receives the most votes, so the region is assigned to ω3.
Lec3: Linear Discriminant Functions 17
Linear Discriminant Function
Multi-Class Problem: 1-against-1
Figure: the same six pairwise hyperplanes.
For this region the pairwise classifiers output:
H12 : ω1, H13 : ω1, H14 : ω1, H21 : ω1, H23 : ω2, H24 : ω4,
H31 : ω1, H32 : ω2, H34 : ω3, H41 : ω1, H42 : ω4, H43 : ω3
Among ω2, ω3 and ω4 the pairwise preferences are cyclic (P(ω2) > P(ω3), P(ω3) > P(ω4), P(ω4) > P(ω2)), but ω1 still receives the most votes.
Although there is a conflict between some classifiers, we can still reach a result: ω1.
Lec3: Linear Discriminant Functions 18
Linear Discriminant Function
Multi-Class Problem: 1-against-1
Figure: the same six pairwise hyperplanes; some regions are marked "?".
For a "?" region the pairwise classifiers output:
H12 : ω2, H13 : ω3, H14 : ω1, H21 : ω2, H23 : ω2, H24 : ω4,
H31 : ω3, H32 : ω2, H34 : ω3, H41 : ω4, H42 : ω4, H43 : ω3
The vote is tied, so the region remains ambiguous: 1-against-1 reduces, but does not eliminate, the ambiguous regions.
Lec3: Linear Discriminant Functions 19
Linear Discriminant Function
Multi-Class Problem
• Which approach is better?
  • 1-against-All needs fewer classifiers (c vs c(c−1)/2)
  • 1-against-1 has a smaller ambiguous region
• Any other choice?
Figure: ambiguous regions of 1-against-1 vs 1-against-All.
Lec3: Linear Discriminant Functions 20
Linear Discriminant Function
Linear Machine for the Multi-Class Problem
• How can we avoid the ambiguous regions caused by LDFs (1-against-1 or 1-against-All)?
• One possible solution: use a Linear Machine. For each class,
  gᵢ(x) = wᵢᵗx + wᵢ₀ , i = 1, ..., c
• x belongs to ωᵢ (region Rᵢ) if gᵢ(x) > gⱼ(x) for all j ≠ i
• Undefined if gᵢ(x) = gⱼ(x)
• If Rᵢ and Rⱼ are contiguous, the boundary between them is a portion of the hyperplane Hᵢⱼ defined by
  gᵢ(x) = gⱼ(x), or
  (wᵢ − wⱼ)ᵗx + (wᵢ₀ − wⱼ₀) = 0
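A small sketch of a linear machine (NumPy assumed; W and w0 are hypothetical names for the stacked class weight vectors and biases):

```python
import numpy as np

def linear_machine_predict(x, W, w0):
    """Linear machine: g_i(x) = w_i^t x + w_i0; assign x to the class whose
    discriminant is largest, so no ambiguous regions remain (ties g_i = g_j
    lie exactly on the boundaries and are left undefined, as on the slide)."""
    return int(np.argmax(W @ x + w0))

def boundary_ij(i, j, W, w0):
    """Hyperplane H_ij between contiguous regions R_i and R_j:
    (w_i - w_j)^t x + (w_i0 - w_j0) = 0, returned as (normal vector, offset)."""
    return W[i] - W[j], w0[i] - w0[j]
```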
Lec3: Linear Discriminant Functions 21
Linear Discriminant Function
Linear Machine for the Multi-Class Problem
• Advantages:
  • Avoids ambiguous regions
  • Every decision region is singly connected
  • Low complexity: fewer than c(c−1)/2 classifiers are needed
Figure: decision regions of a linear machine for a 3-class problem and a 5-class problem.
Lec3: Linear Discriminant Functions 22
Linear Discriminant Function (LDF)
• In practice, a problem is seldom linearly separable
• Can we handle a problem that is not linearly separable with an LDF?
  • Use a mapping to convert a non-linearly separable problem into a linearly separable one, i.e., map the points from a lower-dimensional space (not linearly separable) to a higher-dimensional space (linearly separable)
Figure: low-dimensional space (not linearly separable) mapped to a high-dimensional space (linearly separable).
Lec3: Linear Discriminant Functions 23
Generalized Linear Discriminant Function
• Linear discriminant function:
  $g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i$
• Quadratic discriminant function:
  $g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij}\, x_i x_j$
• Continue to add higher-order terms:
  $\sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d} w_{ijk}\, x_i x_j x_k$ ,  $\sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d}\sum_{l=1}^{d} w_{ijkl}\, x_i x_j x_k x_l$ , …
Lec3: Linear Discriminant Functions 24
Generalized Linear Discriminant Function
• Generalized linear discriminant function:
  $g(\mathbf{x}) = \sum_{i=1}^{\hat d} a_i\, y_i(\mathbf{x})$  or  $g(\mathbf{x}) = \mathbf{a}^t \mathbf{y}(\mathbf{x})$
  g is not linear in x, but it is linear in y, where
  • a is a $\hat d$-dimensional weight vector
  • the yᵢ(x) are called φ-functions
    • arbitrary functions of x
    • they map points in the low d-dimensional space to points in the high $\hat d$-dimensional space
• The homogeneous discriminant function aᵗy separates points by a hyperplane that passes through the origin of the transformed space
Lec3: Linear Discriminant Functions 25
Generalized Linear Discriminant Function
Example: Quadratic as Generalized LDF
• The quadratic discriminant function:
  $g(x) = a_1 + a_2 x + a_3 x^2$
• As a generalized LDF:
  $g(x) = \sum_{i=1}^{3} a_i\, y_i(x)$ , where $y_1(x) = 1$, $y_2(x) = x$, $y_3(x) = x^2$
• The 3-dimensional vector y:
  $\mathbf{y}(x) = (1,\; x,\; x^2)^t$
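A small sketch of this mapping in Python (NumPy assumed; the weight vector a below is an illustrative choice, not taken from the slides):

```python
import numpy as np

def y(x):
    """phi-functions for the quadratic example: y(x) = (1, x, x^2)^t."""
    return np.array([1.0, x, x**2])

def g(x, a):
    """Generalized LDF g(x) = a^t y(x): quadratic in x but linear in y."""
    return a @ y(x)

# Illustrative weights: g(x) = x^2 - 1 is positive only for |x| > 1, so the
# positive decision region in x is not simply connected (see the next slide).
a = np.array([-1.0, 0.0, 1.0])
print(g(0.5, a) > 0, g(2.0, a) > 0, g(-2.0, a) > 0)  # False True True
```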
Lec3: Linear Discriminant Functions 26
Generalized Linear Discriminant Function
Example: Quadratic as Generalized LDF
• A plane splits the resulting y-space into two regions, one per category, which gives a decision region in the 1-dimensional x-space that is not simply connected
• The mapping y = (1, x, x²)ᵗ takes a line and transforms it into a parabola in 3 dimensions
Figure: points below the separating plane belong to region ℜ2 and those above it belong to ℜ1 (not linearly separable in x, linearly separable in y).
Lec3: Linear Discriminant Functions 27
Generalized Linear Discriminant Function
Another Example
$g(\mathbf{x}) = -x_1 + x_2 + 2 x_1 x_2$
• A linear discriminant in this transformed space is a hyperplane Ĥ which cuts the surface
• Points on the positive side of Ĥ (above it) correspond to ω1, and those beneath it correspond to ω2
Lec3: Linear Discriminant Functions 28
Two-Class Linearly Separable Case
• Given
  • A set of n samples y1, y2, …, yn
  • Some belong to ω1 and some to ω2
• Objective
  • Use the samples to determine a weight vector a such that
    • aᵗyᵢ > 0 implies ω1
    • aᵗyᵢ < 0 implies ω2
• If such a weight vector a exists, the samples are said to be linearly separable
Lec3: Linear Discriminant Functions 29
Two-Class Linearly Separable Case
• Normalization
  • Simplifies the setting
  • Replace all samples labeled ω2 by their "negatives", i.e. yᵢ ← −yᵢ for every yᵢ in ω2
  • If the problem is linearly separable, all samples then satisfy aᵗyᵢ > 0
  • The labels can be ignored (see the sketch below)
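A one-step sketch of this normalization (NumPy assumed; `labels` is a hypothetical array of 1s and 2s):

```python
import numpy as np

def normalize(Y, labels):
    """Replace every sample labeled omega_2 by its negative so that a separating
    weight vector a must satisfy a^t y_i > 0 for ALL samples; afterwards the
    class labels can be ignored."""
    Y = np.array(Y, dtype=float)
    labels = np.asarray(labels)
    Y[labels == 2] *= -1.0
    return Y
```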
Lec3: Linear Discriminant Functions 30
Two-Class Linearly Separable Case
Figure: raw data (left) vs normalized data (right).
• Raw data: the solution vector separates red from black (aᵗy < 0 for red, aᵗy > 0 for black); a is at right angles to the red dotted line.
• Normalized data: the ω2 samples are negated, so the solution vector places all vectors on the same side of the plane.
• The solution region is the intersection of n half-spaces; there are infinitely many solutions.
Lec3: Linear Discriminant Functions 31
Additional Requirements
• The solution vector (the decision plane) is not unique. Which one is the best?
• Additional requirements are imposed to constrain the solution vector
• Two possible approaches:
  1. Seek a hyperplane with a larger margin (discussed in the next slides)
  2. Seek a smoother and simpler decision hyperplane (regularization, discussed later)
Lec3: Linear Discriminant Functions 32
Margin
• Seek the minimum-length weight vector a satisfying
  aᵗyᵢ ≥ b > 0
  where b is called the margin
• The new solution region is insulated from the old boundaries aᵗyᵢ = 0 by a distance b / ||yᵢ||
Figure: with no margin (b = 0) the solution region is bounded by the old boundaries; a margin b > 0 shrinks the solution region by b / ||yᵢ||, like a buffer zone, producing new boundaries.
Lec3: Linear Discriminant Functions 33
Gradient Descent (GD)
• Recall that after normalization (slide 30), sample yᵢ is classified correctly if aᵗyᵢ > 0
• To find a solution to the set of inequalities aᵗyᵢ > 0:
  • Define a criterion function J(a) that is minimized when a is a solution vector
  • The function J will be defined later
Lec3: Linear Discriminant Functions 34
Gradient Descent (GD)
• Algorithm
  • Start with an arbitrarily chosen weight vector a(1)
  • Compute the gradient vector ∇J(a(1))
  • The next value a(2) is determined by moving some distance from a(1) in the direction of steepest descent, i.e., along the negative of the gradient
  • In general,
    $a(k+1) = a(k) - \eta(k)\,\nabla J(a(k))$ , where $\nabla J = \partial J / \partial a$
  • η(k) is the learning rate, which controls the size of each step (a generic sketch follows below)
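A generic sketch of this update rule (NumPy assumed; the gradient of the criterion is passed in, and a constant learning rate is used for simplicity, which is an assumption here):

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, theta=1e-6, max_iter=10000):
    """Basic gradient descent: a(k+1) = a(k) - eta(k) * grad J(a(k)).
    Here eta(k) is held constant; iteration stops once the correction
    ||eta * grad J(a)|| drops below the threshold theta (see the next slide)."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(a)
        a = a - step
        if np.linalg.norm(step) < theta:
            break
    return a
```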
Lec3: Linear Discriminant Functions 35
Gradient Descent (GD)
• Algorithm (basic gradient descent, shown as a procedure box on the slide)
  • Note: in step 4 of the procedure the stopping test should use the absolute value, i.e. |η(k)∇J(a(k))| < θ
Figure: successive weight vectors a(0), a(1), a(2), a(3), each obtained by a step of size η(k)∇J, converging toward the minimum a*.
Lec3: Linear Discriminant Functions 36
Gradient Descent (GD)
• Related issues:
  • Size of the learning rate (η)
    • Too small: convergence is needlessly slow
    • Too large: the correction process may overshoot and can even diverge
  • Sub-optimal solution
    • The search can become trapped in a local minimum
Lec3: Linear Discriminant Functions 37
Criterion Function (J)
• Recall that all correctly classified samples should satisfy aᵗyᵢ > 0
• We would like to find the a that yields the smallest error
• Examples of criterion functions J (here Y denotes the set of misclassified samples):
  • Perceptron criterion function:
    $J_p(\mathbf{a}) = \sum_{\mathbf{y}\in Y} (-\mathbf{a}^t\mathbf{y})$
  • Squared error function:
    $J_q(\mathbf{a}) = \sum_{\mathbf{y}\in Y} (\mathbf{a}^t\mathbf{y})^2$
• If all samples are classified correctly (i.e. Y is empty), J equals 0
• If some samples are wrong (i.e. aᵗyᵢ < 0 for those samples), J is a positive value
Lec3: Linear Discriminant Functions 38
Criterion Function (J)
Perceptron criterion function:
  $J_p(\mathbf{a}) = \sum_{\mathbf{y}\in Y} (-\mathbf{a}^t\mathbf{y})$
  • Gradient: $\nabla J_p = \sum_{\mathbf{y}\in Y} (-\mathbf{y})$
  • Update rule: $\mathbf{a}(k+1) = \mathbf{a}(k) + \eta(k) \sum_{\mathbf{y}\in Y} \mathbf{y}$
Squared error function:
  $J_q(\mathbf{a}) = \sum_{\mathbf{y}\in Y} (\mathbf{a}^t\mathbf{y})^2$
  • Gradient: $\nabla J_q = \sum_{\mathbf{y}\in Y} 2(\mathbf{a}^t\mathbf{y})\,\mathbf{y}$
  • Update rule: $\mathbf{a}(k+1) = \mathbf{a}(k) - \eta(k) \sum_{\mathbf{y}\in Y} 2(\mathbf{a}^t\mathbf{y})\,\mathbf{y}$
(Y is the set of misclassified samples.)
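A batch-perceptron sketch of the first update rule (NumPy assumed; Y must already be normalized as on the earlier slide, so a sample is misclassified exactly when aᵗy ≤ 0):

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Minimize J_p(a) = sum over misclassified y of (-a^t y) by gradient descent:
    a(k+1) = a(k) + eta(k) * (sum of currently misclassified samples)."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= 0]            # currently misclassified samples
        if len(mis) == 0:              # J_p(a) = 0: everything is correct
            break
        a = a + eta * mis.sum(axis=0)
    return a
```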
Lec3: Linear Discriminant Functions 39
Criterion Function (J)
Sum-of-squared-error Function
• The perceptron criterion function and the squared error function
  • focus only on the misclassified samples
  • hope that all of the inner products aᵗyᵢ become positive
• The sum-of-squared-error function instead handles the margin situation aᵗyᵢ = bᵢ (slide 32):
  $J_s(\mathbf{a}) = \sum_{i=1}^{n} (\mathbf{a}^t\mathbf{y}_i - b_i)^2$
  where the bᵢ are some arbitrarily specified positive constants
• The problem becomes the more stringent but better understood one of solving a set of linear equations (rather than a set of linear inequalities)
Lec3: Linear Discriminant Functions 40
Criterion Function (J)
Sum-of-squared-error Function
• For all samples y1, y2, …, yn we want to find a weight vector a such that
  aᵗyᵢ = bᵢ
  where the bᵢ are some arbitrarily specified positive constants
• Matrix notation:
$$
\begin{pmatrix}
y_{10} & y_{11} & \cdots & y_{1d} \\
y_{20} & y_{21} & \cdots & y_{2d} \\
\vdots &        &        & \vdots \\
y_{n0} & y_{n1} & \cdots & y_{nd}
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_d \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}
\quad\text{or}\quad Y\mathbf{a} = \mathbf{b}
$$
Lec3: Linear Discriminant Functions 41
Criterion Function (J)
Sum-of-squared-error Function
• Since Ya = b, a could be found by taking the inverse if Y were nonsingular: a = Y⁻¹b
• However, Y is usually rectangular, with more rows than columns
  • more samples than features, i.e. more equations than unknowns
  • a is over-determined
  • no exact solution exists in general
Lec3: Linear Discriminant Functions 42
Criterion Function (J)
Sum-of-squared-error Function: Pseudoinverse Method
• Error vector: $\mathbf{e} = Y\mathbf{a} - \mathbf{b}$
• Sum-of-squared-error function:
  $J_s(\mathbf{a}) = \|Y\mathbf{a} - \mathbf{b}\|^2 = \sum_{i=1}^{n} (\mathbf{a}^t\mathbf{y}_i - b_i)^2$
• The gradient:
  $\nabla J_s = \sum_{i=1}^{n} 2(\mathbf{a}^t\mathbf{y}_i - b_i)\,\mathbf{y}_i = 2Y^t(Y\mathbf{a} - \mathbf{b})$
• Necessary condition: $Y^tY\mathbf{a} = Y^t\mathbf{b}$
• a can then be solved for uniquely:
  $\mathbf{a} = (Y^tY)^{-1}Y^t\mathbf{b} = Y'\mathbf{b}$ , where $Y' = (Y^tY)^{-1}Y^t$ is the pseudoinverse of Y
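A sketch of the pseudoinverse solution in Python (NumPy assumed; np.linalg.pinv computes a pseudoinverse that also covers the singular case handled by the limit definition on the next slide):

```python
import numpy as np

def mse_weights(Y, b):
    """MSE solution of Ya = b: a = Y'b with Y' = (Y^t Y)^{-1} Y^t when Y^t Y
    is nonsingular; np.linalg.pinv degrades gracefully when it is not."""
    return np.linalg.pinv(Y) @ b
```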
Lec3: Linear Discriminant Functions 43
Criterion Function (J)
Sum-of-squared-error Function
• Remark
  • YᵗY is not always nonsingular
    • Y' should then be defined more generally by
      $Y' = \lim_{\varepsilon \to 0} (Y^tY + \varepsilon I)^{-1} Y^t$
  • The solution depends on b
Lec3: Linear Discriminant Functions 44
CF: Sum-of-squared-error function (example: find a by the pseudoinverse)
• Given the samples
  x1  x2  ω
  1   2   1
  2   0   1
  3   1   2
  2   3   2
• y and b are defined as
  $\mathbf{y} = (1\;\; x_1\;\; x_2)^t$ ,  $\mathbf{b} = (1\;\; 1\;\; 1\;\; 1)^t$
• Normalizing the class-2 samples (refer to slide 28):
$$
Y = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}
$$
• Hence
$$
Y' = (Y^tY)^{-1}Y^t =
\begin{pmatrix}
5/4 & 13/12 & 3/4 & 7/12 \\
-1/2 & -1/6 & -1/2 & -1/6 \\
0 & -1/3 & 0 & -1/3
\end{pmatrix},
\qquad
\mathbf{a} = Y'\mathbf{b} = \begin{pmatrix} 11/3 \\ -4/3 \\ -2/3 \end{pmatrix}
$$
• The decision boundary is $\mathbf{a}^t(1\;\; x_1\;\; x_2)^t = 0$, i.e. $4x_1 + 2x_2 = 11$
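The example can be checked numerically with the pseudoinverse sketch above (NumPy assumed):

```python
import numpy as np

# The four samples mapped to y = (1, x1, x2)^t, with the class-2 rows negated.
Y = np.array([[ 1,  1,  2],
              [ 1,  2,  0],
              [-1, -3, -1],
              [-1, -2, -3]], dtype=float)
b = np.ones(4)

a = np.linalg.pinv(Y) @ b
print(a)        # approx [ 3.667 -1.333 -0.667 ], i.e. (11/3, -4/3, -2/3)
print(Y @ a)    # [1. 1. 1. 1.]: this small system happens to be solved exactly
```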
Lec3: Linear Discriminant Functions 45
CF: Sum-of-squared-error function
Least-Mean-Squared (LMS)
• Besides the pseudoinverse method, gradient descent can also be used
• The resulting algorithm is called Least-Mean-Squared (LMS)
• Advantages over the pseudoinverse:
  • avoids the problem that arises when YᵗY is singular
  • avoids the need to work with large matrices
  • the computation is a feedback scheme that copes with round-off and truncation errors
• Disadvantage
  • longer training time
Lec3: Linear Discriminant Functions 46
CF: Sum-of-squared-error function
Least-Mean-Squared (LMS)
• Recall
  $J_s(\mathbf{a}) = \sum_{i=1}^{n} (\mathbf{a}^t\mathbf{y}_i - b_i)^2$
• Gradient function:
  $\nabla J_s(\mathbf{a}) = \sum_{i=1}^{n} 2(\mathbf{a}^t\mathbf{y}_i - b_i)\,\mathbf{y}_i$
• Update rule (one sample at a time):
  $\mathbf{a}(k+1) = \mathbf{a}(k) + \eta(k)\,\bigl(b_i - \mathbf{a}^t\mathbf{y}_i\bigr)\,\mathbf{y}_i$
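A single-sample LMS sketch of this update (NumPy assumed; the decreasing learning rate η(k) = η(1)/k and the cyclic sample order are common choices and assumptions here, not prescribed by the slides):

```python
import numpy as np

def lms(Y, b, eta1=0.1, max_iter=5000):
    """Widrow-Hoff / LMS: cycle through the samples and apply
    a(k+1) = a(k) + eta(k) * (b_i - a^t y_i) * y_i  with  eta(k) = eta1 / k.
    Minimizes ||Ya - b||^2 without forming or inverting any matrix."""
    n, d = Y.shape
    a = np.zeros(d)
    for k in range(1, max_iter + 1):
        i = (k - 1) % n                          # pick samples cyclically
        eta = eta1 / k
        a = a + eta * (b[i] - a @ Y[i]) * Y[i]
    return a
```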
Lec3: Linear Discriminant Functions 47
CF: Sum-of-squared-error function
Least-Mean-Squared (LMS)
• LMS need not converge to a separating hyperplane (one that separates the samples perfectly), even if such a hyperplane exists
• It minimizes the sum of the squared distances of the training points to the hyperplane
• In this example, the LMS plane differs from the separating hyperplane
Lec3: Linear Discriminant Functions 48
Relationship: MSE vs Bayes
Figure: class-conditional densities, posteriors, the Bayes discriminant function (grey line) and the MSE solution (dotted line); the MSE solution is the best approximation in the region of the data points, i.e. where g(x) approximates g0(x).
• If b = 1 for every sample, the MSE solution approaches a minimum mean-squared-error approximation to the Bayes discriminant function g0(x)
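For reference, a sketch of the standard result behind this slide (section 5.8.3 referenced in the outline), stated without proof: with margins bᵢ = 1 for every (normalized) sample, minimizing Js is, in the limit of many samples and up to a term independent of a, the same as a mean-squared fit to the Bayes discriminant g0(x) = P(ω1|x) − P(ω2|x):

```latex
% As n -> infinity, (1/n) J_s(a) converges and, up to a term independent of a,
% minimizing it is equivalent to minimizing the mean-squared error to g_0:
\varepsilon^{2}(\mathbf{a})
  \;=\; \int \bigl(\mathbf{a}^{t}\mathbf{y}(\mathbf{x}) - g_{0}(\mathbf{x})\bigr)^{2}\, p(\mathbf{x})\,d\mathbf{x},
\qquad
g_{0}(\mathbf{x}) \;=\; P(\omega_{1}\mid\mathbf{x}) - P(\omega_{2}\mid\mathbf{x}).
```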