Linear Discriminant Functions (vision.psych.umn.edu/users/schrater/schrater_lab/...)
TRANSCRIPT
Linear Discriminant Functions
1
Linear discriminant functions and decision surfaces
• Definition: a linear discriminant function is a function that is a linear combination of the components of x
g(x) = w^t x + w_0 (1), where w is the weight vector and w_0 the bias
• A two-category classifier with a discriminant function of the form (1) uses the following rule: Decide ω1 if g(x) > 0 and ω2 if g(x) < 0
⇔ Decide ω1 if w^t x > -w_0 and ω2 otherwise. If g(x) = 0, x is assigned to either class
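The two-category rule above can be sketched in a few lines of Python; the weights below are hypothetical, chosen only for illustration:

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return float(np.dot(w, x) + w0)

def decide(x, w, w0):
    """Return 1 (omega_1) if g(x) > 0, 2 (omega_2) if g(x) < 0, None on the boundary."""
    val = g(x, w, w0)
    if val > 0:
        return 1
    if val < 0:
        return 2
    return None  # g(x) = 0: the point may be assigned to either class

# Hypothetical weights for illustration
w, w0 = np.array([1.0, -2.0]), 0.5
print(decide(np.array([3.0, 1.0]), w, w0))  # g = 3 - 2 + 0.5 = 1.5 > 0 -> 1
```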
2
3
– The equation g(x) = 0 defines the decision surface that separates points assigned to the category ω1 from points assigned to the category ω2
– When g(x) is linear, the decision surface is a hyperplane
– Algebraic measure of the distance from x to the hyperplane (interesting result!)
4
5
– In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface
– The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias
$$x = x_p + r\,\frac{w}{\|w\|}$$
since $g(x_p) = 0$ and $w^t w = \|w\|^2$,
$$g(x) = w^t x + w_0 = w^t\!\left(x_p + r\,\frac{w}{\|w\|}\right) + w_0 = g(x_p) + r\,\|w\| \;\Rightarrow\; r = \frac{g(x)}{\|w\|}$$
in particular, $d([0,0], H) = \frac{|w_0|}{\|w\|}$
[Figure: point x at distance r from its projection x_p onto the hyperplane H, measured along the normal w]
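The distance result r = g(x)/||w|| can be checked numerically; the hyperplane below is made up for illustration:

```python
import numpy as np

def signed_distance(x, w, w0):
    """r = g(x) / ||w||: positive on the side w points to, negative on the other."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5
w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([0.0, 0.0]), w, w0))  # d([0,0], H) = |w0|/||w|| = 1, with sign -1
```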
6
– The multi-category case
• We define c linear discriminant functions
$$g_i(x) = w_i^t x + w_{i0}, \quad i = 1, \ldots, c$$
and assign x to ωi if gi(x) > gj(x) ∀ j ≠ i; in case of ties, the classification is undefined
• In this case, the classifier is a "linear machine"
• A linear machine divides the feature space into c decision regions, with gi(x) being the largest discriminant if x is in the region Ri
• For two contiguous regions Ri and Rj, the boundary that separates them is a portion of the hyperplane Hij defined by:
gi(x) = gj(x) ⇔ (wi – wj)^t x + (wi0 – wj0) = 0
• wi – wj is normal to Hij and
$$d(x, H_{ij}) = \frac{g_i(x) - g_j(x)}{\|w_i - w_j\|}$$
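The linear machine's argmax rule can be sketched as follows (the 3-class weights are made up; classes are indexed 0 to c-1):

```python
import numpy as np

def linear_machine(x, W, w0):
    """Evaluate g_i(x) = w_i^t x + w_i0 for all c classes and return the argmax index."""
    return int(np.argmax(W @ x + w0))

# Hypothetical 3-class machine in a 2-D feature space: rows of W are the w_i
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(linear_machine(np.array([2.0, 1.0]), W, w0))  # g = (2, 1, -2.5) -> class 0
```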
7
8
Generalized Linear Discriminant Functions
• Decision boundaries which separate between classes may not always be linear
• The complexity of the boundaries may sometimes require the use of highly non-linear surfaces
• A popular approach to generalize the concept of linear decision functions is to consider a generalized decision function as:
g(x) = w1 f1(x) + w2 f2(x) + … + wN fN(x) + wN+1 (1)
where fi(x), 1 ≤ i ≤ N, are scalar functions of the pattern x, x ∈ Rn (Euclidean space)
9
• Introducing fN+1(x) = 1 we get:
$$g(x) = \sum_{i=1}^{N+1} w_i f_i(x) = w^T y$$
$$\text{where } w = (w_1, w_2, \ldots, w_N, w_{N+1})^T \text{ and } y = (f_1(x), f_2(x), \ldots, f_N(x), f_{N+1}(x))^T$$
• This latter representation of g(x) implies that any decision function defined by equation (1) can be treated as linear in the (N + 1)-dimensional space (N + 1 > n)
• g(x) maintains its non-linearity characteristics in Rn
10
• The most commonly used generalized decision function is g(x) for which fi(x) (1 ≤ i ≤ N) are polynomials:
$$g(x) = w_0 + \sum_{i=1}^{N} w_i x_i + \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij} x_i x_j + \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=1}^{N} w_{ijk} x_i x_j x_k + \ldots$$
• Quadratic decision functions for a 2-dimensional feature space:
$$g(x) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2 + w_6$$
$$\text{here } w = (w_1, w_2, \ldots, w_6)^T \text{ and } y = (x_1^2,\ x_1 x_2,\ x_2^2,\ x_1,\ x_2,\ 1)^T$$
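The 2-D quadratic case amounts to mapping x into the feature vector y and taking a dot product; a minimal sketch, with a made-up weight vector that encodes the unit circle as the boundary:

```python
import numpy as np

def quadratic_features(x):
    """y = (x1^2, x1*x2, x2^2, x1, x2, 1) for a 2-D pattern x."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2, x1, x2, 1.0])

def g(x, w):
    """g(x) = w^T y, linear in the lifted 6-D space."""
    return float(np.dot(w, quadratic_features(x)))

# Hypothetical weights: g(x) = x1^2 + x2^2 - 1, a circular decision boundary
w_circle = np.array([1.0, 0.0, 1.0, 0.0, 0.0, -1.0])
print(g((0.0, 0.0), w_circle))  # -1.0: inside the circle
print(g((2.0, 0.0), w_circle))  # 3.0: outside the circle
```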
11
• For patterns x ∈ Rn, the most general quadratic decision function is given by:
$$g(x) = \sum_{i=1}^{n} w_{ii} x_i^2 + \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij} x_i x_j + \sum_{i=1}^{n} w_i x_i + w_{n+1} \quad (2)$$
• The number of terms on the right-hand side is:
$$l = N + 1 = n + \frac{n(n-1)}{2} + n + 1 = \frac{(n+1)(n+2)}{2}$$
This is the total number of weights, which are the free parameters of the problem
– n = 3 => 10-dimensional
– n = 10 => 66-dimensional
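The parameter count (n+1)(n+2)/2 is easy to check in code, which confirms the two examples (10 weights for n = 3, 66 for n = 10):

```python
def num_quadratic_weights(n):
    """l = N + 1 = n + n(n-1)/2 + n + 1 = (n+1)(n+2)/2 free parameters."""
    return (n + 1) * (n + 2) // 2

print(num_quadratic_weights(3))   # 10
print(num_quadratic_weights(10))  # 66
```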
12
• In the case of polynomial decision functions of order m, a typical fi(x) is given by:
$$f_i(x) = x_{i_1}^{e_1} x_{i_2}^{e_2} \cdots x_{i_m}^{e_m}$$
$$\text{where } 1 \le i_1, i_2, \ldots, i_m \le n \text{ and each } e_i,\ 1 \le i \le m, \text{ is 0 or 1}$$
– It is a polynomial with a degree between 0 and m. To avoid repetitions, i1 ≤ i2 ≤ … ≤ im
$$g^m(x) = \sum_{i_1=1}^{n} \sum_{i_2=i_1}^{n} \cdots \sum_{i_m=i_{m-1}}^{n} w_{i_1 i_2 \ldots i_m}\, x_{i_1} x_{i_2} \cdots x_{i_m} + g^{m-1}(x)$$
where gm(x) is the most general polynomial decision function of order m
13
Example 1: Let n = 3 and m = 2, then:
$$g^2(x) = \sum_{i_1=1}^{3} \sum_{i_2=i_1}^{3} w_{i_1 i_2}\, x_{i_1} x_{i_2} + g^1(x)$$
$$= w_{11} x_1^2 + w_{12} x_1 x_2 + w_{13} x_1 x_3 + w_{22} x_2^2 + w_{23} x_2 x_3 + w_{33} x_3^2 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4$$
Example 2: Let n = 2 and m = 3, then:
$$g^3(x) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} \sum_{i_3=i_2}^{2} w_{i_1 i_2 i_3}\, x_{i_1} x_{i_2} x_{i_3} + g^2(x)$$
$$= w_{111} x_1^3 + w_{112} x_1^2 x_2 + w_{122} x_1 x_2^2 + w_{222} x_2^3 + g^2(x)$$
$$\text{where } g^2(x) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} w_{i_1 i_2}\, x_{i_1} x_{i_2} + g^1(x) = w_{11} x_1^2 + w_{12} x_1 x_2 + w_{22} x_2^2 + w_1 x_1 + w_2 x_2 + w_3$$
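The non-repeating index tuples i1 ≤ i2 ≤ … ≤ im used in these sums are exactly what `itertools.combinations_with_replacement` enumerates, which makes the term counts in both examples easy to verify:

```python
from itertools import combinations_with_replacement

def monomial_indices(n, m):
    """All index tuples with 1 <= i1 <= i2 <= ... <= im <= n (degree-m terms, no repeats)."""
    return list(combinations_with_replacement(range(1, n + 1), m))

# Example 1 (n = 3, m = 2): x1^2, x1x2, x1x3, x2^2, x2x3, x3^2 -> 6 tuples
print(monomial_indices(3, 2))
# Example 2 (n = 2, m = 3): x1^3, x1^2 x2, x1 x2^2, x2^3 -> 4 tuples
print(monomial_indices(2, 3))
```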
14
15
17
• Augmentation: append a constant component 1 to x, so the bias folds into the weight vector
• Incorporate labels into data
• Margin
18
Learning Linear Classifiers
20
Basic Ideas
Directly "fit" a linear boundary in feature space.
• 1) Define an error function for classification
– Number of errors?
– Total distance of mislabeled points from boundary?
– Least squares error
– Within/between class scatter
• 2) Optimize the error function on training data
– Gradient descent
– Modified gradient descent
– Various search procedures
– How much data is needed?
21
Two-category case
• Given x1, x2, …, xn sample points, with true category labels: α1, α2, …, αn
• Decisions are made according to:
$$\text{if } a^t x_i > 0,\ \text{class } \omega_1 \text{ is chosen; if } a^t x_i < 0,\ \text{class } \omega_2 \text{ is chosen}$$
$$\alpha_i = +1 \text{ if point } x_i \text{ is from class } \omega_1, \qquad \alpha_i = -1 \text{ if point } x_i \text{ is from class } \omega_2$$
• Now these decisions are wrong when a^t xi is negative and xi belongs to class ω1. Let yi = αi xi. Then a^t yi > 0 when xi is correctly labelled, negative otherwise.
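The label-folding step yi = αi xi is a one-liner; a minimal sketch with made-up samples:

```python
import numpy as np

def label_normalize(X, alpha):
    """y_i = alpha_i * x_i, so a separating a satisfies a^t y_i > 0 for every sample."""
    return X * np.asarray(alpha, dtype=float)[:, None]

# Hypothetical samples (rows) and their +/-1 labels
X = np.array([[1.0, 2.0], [-3.0, 0.5]])
alpha = [1, -1]   # x_1 from omega_1, x_2 from omega_2
ys = label_normalize(X, alpha)
```

After this normalization, looking for a separating vector reduces to a one-sided condition: every correctly classified sample gives a positive dot product with a.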
22
• Separating vector (solution vector)
– Find a such that a^t yi > 0 for all i
• a^t yi = 0 defines a hyperplane with a as normal
23
• Problem: the solution is not unique
• Add additional constraints: a margin, the distance away from the boundary: a^t yi > b
24
Margins in data space
Larger margins promote uniqueness for underconstrained problems
25
Finding a solution vector by minimizing an error criterion
What error function? How do we weight the points? All the points or only error points?
Only error points:
Perceptron Criterion
$$J_P(a) = \sum_{y_i \in Y} (-a^t y_i), \qquad Y = \{y_i \mid a^t y_i < 0\}$$
Perceptron Criterion with Margin
$$J_P(a) = \sum_{y_i \in Y} (-a^t y_i), \qquad Y = \{y_i \mid a^t y_i < b\}$$
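The Perceptron criterion is straightforward to evaluate; a minimal sketch with hypothetical normalized samples:

```python
import numpy as np

def perceptron_criterion(a, ys, b=0.0):
    """J_P(a): sum of -a^t y_i over the misclassified set Y = {y_i | a^t y_i < b}."""
    vals = ys @ a
    return float(np.sum(-vals[vals < b]))

# Hypothetical normalized samples (rows are y_i) and a weight vector
ys = np.array([[1.0, 0.0], [-2.0, 0.0], [0.0, 1.0]])
a = np.array([1.0, 0.0])
print(perceptron_criterion(a, ys))  # only the second sample is misclassified: J_P = 2.0
```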
26
27
Minimizing Error via Gradient Descent
28
• The min (max) problem:
$$\min_x f(x)$$
• But we learned in calculus how to solve that kind of question!
29
Motivation
• Not exactly.
• Functions: $f : \mathbb{R}^n \to \mathbb{R}$
• High order polynomials:
$$x - \frac{1}{6}x^3 + \frac{1}{120}x^5 - \frac{1}{5040}x^7$$
• What about functions that don't have an analytic representation: a "black box"
30
$$f(x, y) := \cos\!\left(\tfrac{1}{2}x\right)\cos\!\left(\tfrac{1}{2}y\right)x$$
[Figure: surface plot of f]
31
Directional Derivatives: first, the one-dimensional derivative
32
Directional Derivatives: Along the Axes…
$$\frac{\partial f(x,y)}{\partial x}, \qquad \frac{\partial f(x,y)}{\partial y}$$
33
Directional Derivatives: In a general direction…
$$\frac{\partial f(x,y)}{\partial v}, \qquad v \in \mathbb{R}^2, \quad \|v\| = 1$$
34
Directional Derivatives
$$\frac{\partial f(x,y)}{\partial x}, \qquad \frac{\partial f(x,y)}{\partial y}$$
35
The Gradient: Definition in the plane $\mathbb{R}^2$
$$f : \mathbb{R}^2 \to \mathbb{R}, \qquad \nabla f(x,y) := \left(\frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y}\right)$$
36
The Gradient: Definition
$$f : \mathbb{R}^n \to \mathbb{R}, \qquad \nabla f(x_1, \ldots, x_n) := \left(\frac{\partial f}{\partial x_1},\ \ldots,\ \frac{\partial f}{\partial x_n}\right)$$
37
The Gradient Properties
• The gradient defines a (hyper)plane approximating the function infinitesimally:
$$\Delta z = \frac{\partial f}{\partial x}\,\Delta x + \frac{\partial f}{\partial y}\,\Delta y$$
38
The Gradient Properties
• Computing directional derivatives
• By the chain rule (important for later use), for $\|v\| = 1$:
$$\frac{\partial f}{\partial v}(p) = \langle (\nabla f)_p,\ v \rangle$$
(the rate of change along v, at a point p)
39
The Gradient Properties
$\frac{\partial f}{\partial v}\big|_p$ is maximal choosing $v = \frac{(\nabla f)_p}{\|(\nabla f)_p\|}$ and minimal choosing $v = -\frac{(\nabla f)_p}{\|(\nabla f)_p\|}$
(intuition: the gradient points in the direction of greatest change)
40
The Gradient Properties
• Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function around p; if f has a local minimum (maximum) at p, then
$$(\nabla f)_p = \vec{0}$$
(necessary for a local min (max))
41
The Gradient Properties
Intuition
42
The Gradient Properties
Formally: for any $v \in \mathbb{R}^n \setminus \{0\}$ we get:
$$0 = \frac{d f(p + t\,v)}{dt}\bigg|_{t=0} = \langle (\nabla f)_p,\ v \rangle \;\Rightarrow\; (\nabla f)_p = \vec{0}$$
43
The Gradient Properties
• We found the best INFINITESIMAL DIRECTION at each point
• Looking for the minimum using a "blind man" procedure
• How can we get to the minimum using this knowledge?
44
Steepest Descent
• Algorithm
Initialization: $x_0 \in \mathbb{R}^n$
loop while $\|\nabla f(x_i)\| > \varepsilon$
    compute search direction $h_i = \nabla f(x_i)$
    update $x_{i+1} = x_i - \lambda\, h_i$
end loop
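The steepest-descent loop is only a few lines; a minimal sketch assuming a fixed learning rate λ, applied to a made-up quadratic bowl whose minimum is known:

```python
import numpy as np

def steepest_descent(grad_f, x0, lam=0.1, eps=1e-6, max_iter=10_000):
    """Loop while ||grad f(x_i)|| > eps: h_i = grad f(x_i), x_{i+1} = x_i - lam * h_i."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = grad_f(x)                 # search direction (the gradient)
        if np.linalg.norm(h) <= eps:  # stopping criterion
            break
        x = x - lam * h               # update step
    return x

# Hypothetical example: f(x) = ||x - c||^2 has gradient 2(x - c) and minimum at c
c = np.array([1.0, -2.0])
x_min = steepest_descent(lambda x: 2.0 * (x - c), np.zeros(2))
```

A fixed λ works here because the bowl is well conditioned; the next slide's point is precisely that the rate can be set more intelligently.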
45
Setting the Learning Rate Intelligently
Minimizing this approximation:
46
Minimizing Perceptron Criterion
Perceptron Criterion
$$J_P(a) = \sum_{y_i \in Y} -a^t y_i, \qquad Y = \{y_i \mid a^t y_i < 0\}$$
$$\nabla_a J_P(a) = \sum_{y_i \in Y} -y_i$$
Which gives the update rule:
$$a(k+1) = a(k) + \eta_k \sum_{y_i \in Y_k} y_i$$
where $Y_k$ is the misclassified set at step k
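The update rule translates directly into a batch perceptron loop; a minimal sketch assuming a fixed η and samples that are already augmented and label-normalized (y_i = α_i x_i):

```python
import numpy as np

def batch_perceptron(ys, eta=1.0, max_epochs=1_000):
    """a(k+1) = a(k) + eta * sum of misclassified y_i; stops when no errors remain."""
    a = np.zeros(ys.shape[1])
    for _ in range(max_epochs):
        mis = ys[ys @ a <= 0.0]  # misclassified set Y_k (<= so the a = 0 start makes progress)
        if len(mis) == 0:        # no errors: a is a separating vector
            return a
        a = a + eta * mis.sum(axis=0)
    return a

# Hypothetical label-normalized samples y_i (rows); this set is linearly separable
ys = np.array([[1.0, 0.2], [-1.0, 0.5], [0.0, 1.0]])
a = batch_perceptron(ys)
```

On separable data like this, the classical convergence result guarantees the loop terminates with a^t y_i > 0 for all i; on non-separable data it would run until max_epochs.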
47
Batch Perceptron Update
48
49
Simplest Case
50
When does the perceptron rule converge?
51
52
Different Error Functions
Four learning criteria as a function of weights in a linear classifier.
At the upper left is the total number of patterns misclassified, which is piecewise constant and hence unacceptable for gradient descent procedures.
At the upper right is the Perceptron criterion (Eq. 16), which is piecewise linear and acceptable for gradient descent.
The lower left is squared error (Eq. 32), which has nice analytic properties and is useful even when the patterns are not linearly separable.
The lower right is the squared error with margin. A designer may adjust the margin b in order to force the solution vector to lie toward the middle of the b = 0 solution region, in hopes of improving generalization of the resulting classifier.
53
Three Different Issues
1) Class boundary expressiveness
2) Error definition
3) Learning method
54
Fisher's Linear Discriminant
55
Fisher's Linear Discriminant: Intuition
56
Write this in terms of w
We are looking for a good projection
So,
Next,
57
Define the scatter matrix:
Then,
Thus the denominator can be written:
58
Rayleigh Quotient
Maximizing is equivalent to solving:
With solution:
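The equations on these slides did not survive transcription, but the classical closed-form Fisher solution, w proportional to S_W^{-1}(m1 - m2) with S_W the within-class scatter, can be sketched directly (the 2-D sample arrays below are made up for illustration):

```python
import numpy as np

def fisher_direction(X1, X2):
    """w proportional to S_W^{-1}(m1 - m2): the classical Fisher linear discriminant."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)   # scatter matrix of class 1
    S2 = (X2 - m2).T @ (X2 - m2)   # scatter matrix of class 2
    Sw = S1 + S2                   # within-class scatter
    return np.linalg.solve(Sw, m1 - m2)

# Made-up 2-D samples, one class per array (rows are points)
X1 = np.array([[2.0, 0.0], [3.0, 1.0], [3.0, -1.0], [2.0, 1.0]])
X2 = np.array([[-2.0, 0.0], [-3.0, 1.0], [-3.0, -1.0], [-2.0, 1.0]])
w = fisher_direction(X1, X2)
```

Projecting the data onto this w separates the class means while keeping each class's projected scatter small, which is exactly what maximizing the Rayleigh quotient achieves.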
59
60
61
62