lecture 2: linear and quadratic classifiers

Introduction to Pattern AnalysisRicardo Gutierrez-OsunaTexas A&M University

1

LECTURE 2: Linear and quadratic classifiers

g Part 1: Bayesian Decision Theoryn The Likelihood Ratio Testn Maximum A Posteriori and Maximum Likelihoodn Discriminant functions

g Part 2: Quadratic classifiersn Bayes classifiers for normally distributed classesn Euclidean and Mahalanobis distance classifiersn Numerical example

g Part 3: Linear classifiersn Gradient descentn The perceptron rulen The pseudo-inverse solutionn Least mean squares


2

Part 1: Bayesian Decision Theory


3

The Likelihood Ratio Test (1)g Assume we are to classify an object based on the

evidence provided by a measurement (or feature vector) x

g Would you agree that a reasonable decision rule would be the following?

n "Choose the class that is most ‘probable’ given the observed feature vector x”

n More formally: Evaluate the posterior probability of each class P(ωi|x) and choose the class with largest P(ωi|x)


4

The Likelihood Ratio Test (2)g Let us examine this decision rule for a two-class

problemn In this case the decision rule becomes

g Or, in a more compact form

n Applying Bayes theorem

2

121

ωchooseelseωchoose)x|ω(P)x|ω(Pif >

)x|ω(P)x|ω(P 2

ω

ω1

1

2

<>

)x(P)ω(P)ω|x(P

)x(P)ω(P)ω|x(P 22ω

ω

111

2

<>


5

The Likelihood Ratio Test (3)n P(x) does not affect the decision rule so it can be eliminated*.

Rearranging the previous expression

n The term Λ(x) is called the likelihood ratio, and the decision rule is known as the likelihood ratio test

)ω(P)ω(P

)ω|x(P)ω|x(P)x(Λ

1

2ω

ω2

11

2

<>=


6

Likelihood Ratio Test: an example (1)g Given a classification problem with the following

class conditional densities:

g Derive a classification rule based on the Likelihood Ratio Test (assume equal priors)

2

2

10)(x21

2

4)(x21

1

e2π1)ω|P(x

e2π1)ω|P(x

−−

−−

=

=

x

P(x|ω1) P(x|ω2)

4 10


7

Likelihood Ratio Test: an example (2)g Solution

n Substituting the given likelihoods and priors into the LRT expression:

n Simplifying, changing signs and taking logs:

n Which yields:

n This LRT result makes intuitive sense since the likelihoods are identical and differ only in their mean value

11

e

eΛ(x)

1

2

2

2 ω

ω10)(x

21

2π1

4)(x21

2π1

<>

=−−

−−

0)10x()4x(1

2

ω

ω

22

><

−−−

7x1

2

ω

ω><

R1: say ω1

x

R2: say ω2

P(x|ω1) P(x|ω2)

4 10


8

The probability of error

g The probability of error is “the probability of assigning x to the wrong class”n For a two-class problem, P[error|x] is simply

g It makes sense that the classification rule be designed to minimize the average prob. of error P[error] across all possible values of x

g To minimize P(error) we minimize the integrand P(error|x) at each x: choose the class with maximum posterior P(ωi|x)

n This is called the MAXIMUM A POSTERIORI (MAP) RULE

=12

21

ωdecideweifx)|P(ωωdecideweifx)|P(ω

x)|P(error

∫∫+∞

∞−

+∞

∞−== x)P(x)dx|P(errorx)dxP(error,P(error)


9

Minimizing probability of errorg We “prove” the optimality of the

MAP rule graphicallyn The right plot shows the posterior

for each of the two classesn The bottom plots shows the

P(error) for the MAP rule and an alternative decision rule

n Which one has lower P(error) (color-filled area) ? x

P(w

i|x)

ChooseRED

ChooseBLUE

ChooseRED

THE MAP RULEChoose

REDChooseBLUE

ChooseRED

THE “OTHER” RULE


10

The Bayes Risk (1)

g So far we have assumed that the penalty of misclassifying ω1 as ω2 is the same as the reciprocaln In general, this is not the casen Misclassifications in the fish sorting lead to different costs n Medical diagnostics errors are very asymmetric

g We capture this concept with a cost function Cij

n Cij represents the cost of choosing class ωi when class ωj is the true class

g And define the Bayes Risk as the expected value of the cost

∑∑∑∑= == =

⋅∈⋅=∈⋅==ℜ2

1i

2

1jjjiij

2

1i

2

1jjiij ]ω[P]ω|Rx[PC]ωxandωchoose[PC]C[E


11

The Bayes Risk (2)

g What is the decision rule that minimizes the Bayes Risk?n It can be shown* that the minimum risk can be achieved by

using the following decision rule:

n *For an intuitive proof visit my lecture notes at TAMU

g Notice any similarities with the LRT?

]ω[P]ω[P

)CC()CC(

)ω|x(P)ω|x(P

1

2

1121

2212

ω

ω2

1

1

2

−−

<>


12

The Bayes Risk: an example (1)

g Consider a classification problem with two classes defined by the following likelihood functions

g What is the decision rule that minimizes P[error]?n Assume P[ω1]=P[ω2]=0.5, C11=C22=0, C12=1 and C21=31/2

2

2

2)(x21

2

3x

21

1

e2π1)ω|P(x

e32π

1)ω|P(x

−−

−

=

=

-6 -4 -2 0 2 4 60

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

x

likel

ihoo

d


13

The Bayes Risk: an example (2)

1.274.73,x01212x2x

02)(x21

3x

21

1e

e

31

e2π

e32πΛ(x)

1

2

1

2

1

2

2

2

1

2

2

2

ω

ω

2

ω

ω

22

ω

ω2)(x

21

3x

21

ω

ω2)(x

21

3x

21

=⇒<>

+−

⇒<>

−+−

⇒<>

⇒<>

=

−−

−

−−

−

1

1

-6 -4 -2 0 2 4 60

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

x

R1R2R1


14

g The LRT that minimizes the Bayes Risk is called the Bayes Criterion

g Maximum A Posteriori Criterionn Sometimes we will be interested in minimizing P[error], which is

a special case of the Bayes Criterion if we use a zero-one cost function

Variations of the LRT

]ω[P]ω[P

)CC()CC(

)ω|x(P)ω|x(P)x(Λ

1

2

1121

2212

ω

ω2

1

1

2

−−

<>

=

1)x|ω(P)x|ω(P

)ω(P)ω(P

)ω|x(P)ω|x(P)x(Λ

ji1ji0

C1

2

1

2

ω

ω2

1

1

2

ω

ω2

1ij <

>⇔

<>

=⇒

≠=

=


15

g Maximum Likelihoodn Finally, the simplest for of the LRT is obtained for the case of

equal priors P[ωi]=1/2 and zero-one cost function:

n When would you want to use an ML criterion?

Variations of the LRT

1)ω|P(x)ω|P(xΛ(x)

iC1)P(ω

ji1ji0

C 1

2

ω

ω2

1

i

ij

<>

=⇒

∀=

≠=

=


16

Multi-class problems

g The previous decision rules were derived for two-class problems, but generalize gracefully for multiple classes:

n To minimize P[error] choose the class ωi with highest P[ωi|x]

n To minimize Bayes risk choose the class ωi with lowest ℜ[ωi|x]

∑=≤≤≤≤

=ℜ=C

1jjij

Ci1j

Ci1i x)|P(ωCargminx)|(ωargminω

x)|P(ωargmaxω iCi1

i≤≤

=


17

Discriminant functions (1)

g All these decision rules have the same structuren At each point x in feature space choose class ωi which

maximizes (or minimizes) some measure gi(x)n This structure can be formalized with a set of discriminant

functions gi(x), i=1..C, and the following decision rule

n We can then express the three basic decision rules (Bayes, MAP and ML) in terms of Discriminant Functions:

i"j(x)g(x)gifωclasstoxassign" jii ≠∀>

Criterion Discriminant FunctionBayes gi(x)=-ℜ(αi|x)MAP gi(x)=P(ωi|x)ML gi(x)=P(x|ωi)


18

Discriminant functions (2)

g Therefore, we can visualize the decision rule as a network that computes C discriminant functions and selects the class corresponding to the largest discriminant

x2x2 x3

x3 xdxd

g1(x)g1(x)

x1x1

g2(x)g2(x) gC(x)gC(x)

Select maxSelect max

CostsCosts

Class assignment

Discriminant functions

Features


19

Recapping…

g The LRT is a theoretical result that can only be applied if we have complete knowledge of the likelihoods P[x|ωi]n P[x|ωi] generally unknown, but can be estimated from data

n If the form of the likelihood is known (e.g., Gaussian) the problem is simplified b/c we only need to estimate the parameters of the model (e.g., mean and covariance)g This leads to a classifier known as QUADRATIC, which we cover

next

n If the form of the likelihood is unknown, the problem becomes much harder, and requires a technique known as non-parametric density estimationg This technique is covered in lecture 3


20

Part 2: Quadratic classifiers


21

Bayes classifier for Gaussian classes (1)

g For Normally distributed classes, the discriminant functions reduce to very simple expressionsn The (multivariate) Gaussian density can be defined as

n Using the Bayes rule, the MAP discriminant function can be written as

−∑−−

∑= − µ)(xµ)(x

21exp

π)(21p(x) 1T

1/2n/2

P(x)1)P()µ(x)µ(x

21exp

π)(21

P(x)))P(|P(xx)|P((x)g

iiT

i1/2n/2

iiii

ω

ωωω

−∑−−

∑=

===

−1i

i


22

Bayes classifier for Gaussian classes (2)n Eliminating constant terms

n Taking logs

n This is known as a QUADRATIC discriminant function (because it is a function of the square of x)

n In the next few slides we will analyze what happens to this expression under different assumptions for the covariance

( ) ( ))ωP(loglog21-)µ(x)µ(x

21(x)g iii

1i

Tii +∑−∑−−= −

)ωP()µ(x)µ(x21exp(x)g ii

1i

Ti

1/2-ii

−∑−−∑= −


23

Case 1: Σi=σ2I (1)g This situation occurs when the features are

statistically independent, and have the same variance for all classesn In this case, the quadratic discriminant function becomes

n Assuming equal priors and dropping constant terms

n This is called an Euclidean-distance or nearest mean classifier

( ) ( ) ( ) ( ))P(ωlog)µ(x)µ(x2σ

1)P(ωlogIσlog21-)µ(xIσ)µ(x

21(x)g ii

Ti2i

2i

-12Tii +−−−=+−−−=

( )∑=

−=−−−=DIM

1i

2iii

Tii µx-)µ(x)µ(x(x)g

From [Schalkoff, 1992]


24

Case 1: Σi=σ2I (1)g This is probably the simplest statistical classifier

that you can build:n “Assign an unknown example to the class whose center is

the closest using the Euclidean distance”

µ1

Min

imum

Sel

ecto

r

µ2

µC

class

xEuclidean Distance

Euclidean Distance

Euclidean Distance

µ1

Min

imum

Sel

ecto

r

µ2

µC

class

xEuclidean Distance

Euclidean Distance

Euclidean Distance


25

Case 1: Σi=σ2I, example

[ ] [ ] [ ]

=

=

=

===

2002

Σ2002

Σ2002

Σ

52µ47µ23µ

321

T3

T2

T1


26

Case 2: Σi=Σ (Σ non-diagonal)g All the classes have the same covariance matrix, but

the matrix is not diagonaln In this case, the quadratic discriminant becomes

n Assuming equal priors and eliminating constant terms

n This is known as a Mahalanobis-distance classifier

( ) ( ))P(ωloglog21-)µ(x)µ(x

21(x)g ii

1Tii +∑−∑−−= −

)µ(x)µ(x(x)g i-1T

ii −Σ−−=

µ1

Min

imum

Sel

ecto

r

µ2

µC

class

xMahalanobis

Distance

Mahalanobis Distance


Σµ1

Min

imum

Sel

ecto

r

µ2

µC

class

xMahalanobis

Distance



Σ


27

The Mahalanobis distance

g The quadratic term is called the Mahalanobis distance, a very important metric in SPRn The Mahalanobis metric is a vector distance that uses a ∑-1 normn ∑-1 can be thought of as a “stretching factor” on the spacen Note that for an identity covariance matrix (∑=I), the Mahalanobis

distance becomes the familiar Euclidean distance

µx2

x1

Κ µ-x 2i =

K µ-x 2i 1 =−∑


28

[ ] [ ] [ ]

=

=

=

===

27.07.01

Σ27.07.01

Σ27.07.01

Σ

52µ45µ23µ

321

T3

T2

T1

Case 2: Σi=Σ (Σ non-diagonal), example


29

[ ] [ ] [ ]

=

−

−=

−

−=

===

35.05.05.0

Σ7111

Σ2111

Σ

52µ45µ23µ

321

T3

T2

T1

Case 3: Σi≠Σj general case, example

Zoomout


30

Numerical example (1)

g Derive a linear discriminant function for the two-class 3D classification problem defined by

g Anybody would dare to sketch the likelihood densities and decision boundary for this problem?

[ ] [ ] ( ) ( )1221T

2T

1 ω2pωp;1/40001/40001/4

ΣΣ ;111µ;000µ =

====


31


g Solution

( ) ( ) )logP(ω-µ-zµ-yµ-x

400040004

µ-zµ-yµ-x

21)logP(ωΣlog

21µ-xΣµ-x

21(x)g i

z

y

xT

z

y

x

ii1T

ii +

−∝+−−= −

32log

1-z1-y1-x

400040004

1-z1-y1-x

21(x)g

31log

0-z0-y0-x

400040004

0-z0-y0-x

21(x)g

T

2

T

1

+

−=

+

−= ;


32


g Solution (continued)

n Classify the test example xu=[0.1 0.7 0.8]T

( ) ( ) ( ) ( )( )32log1z1y1x2-

31logzyx2-

(x)g(x)g

222

ω

ω

222

2

ω

ω

1

2

1

2

1

+−+−+−<>

+++

⇒<>

1.324log26 zyx

1

2

ω

ω

=−

<>

++

2u

ω

ω

ωx1.321.6 0.80.70.11

2

∈⇒<>

=++


33

Conclusions

g The Euclidean distance classifier is Bayes-optimal* ifn Gaussian classes and equal covariance matrices proportional to

the identity matrix and equal priors

g The Mahalanobis dist. classifier is Bayes-optimal ifn Gaussian classes and equal covariance matrices and equal

priors

*Bayes optimal means that the classifier yields the minimum P[error], which is the best ANY classifier can achieve


34

Part 3: Linear classifiers


35

Linear Discriminant Functions (1)

g The objective of this section is to present methods for learning linear discriminant functions of the form

n where w is the weight vectorand w0 is the threshold or bias

n Similar discriminant functions were derived in the previous section as a special case of the quadratic classifier

n In this chapter, the discriminant functions will be derived in a non-parametric fashion, this is, no assumptions will be made about the underlying densities

x1

x2

wx

wTx+w0>0

wTx+w0<0

x(1

x(2d

x1

x2

wx

wTx+w0>0

wTx+w0<0

x(1

x(2d

( ) ( )( )

∈<∈>

⇔+=2

10

T

ωx0xgωx0xg

wxwxg


36

Linear Discriminant Functions (2)

g For convenience, we will focus on binary classificationn Extension to the multicategory case can be easily achieved by

g Using ωi/not ωi dichotomiesg Using ωi/ωi dichotomies


37

Gradient descent (1)

g Gradient descent is a general method for function minimization n From basic calculus, we know that the minimum of a function

J(x) is defined by the zeros of the gradient

n Only in very special cases this minimization function has a closed form solution

n In some other cases, a closed form solution may exist, but is numerically ill-posed or impractical (e.g., memory requirements)

[ ] 0J(x)J(x)argminx* xx

=⇒∇=∀


38

Gradient descent (2)g Gradient descent finds the

minimum in an iterative fashion by moving in the direction of steepest descent

n where η is a learning rate

1. Start with an arbitrary solution x(0)2. Compute the gradient ∇xJ(x(k))3. Move in the direction of steepest descent:

4. Go to 1 (until convergence)

1. Start with an arbitrary solution x(0)2. Compute the gradient ∇xJ(x(k))3. Move in the direction of steepest descent:

4. Go to 1 (until convergence)( ) ( ) ( )( )kxJηkx1kx x∇−=+

-2 0 2-2

0

2

x1

x 2

Initialguess Global

minimum

Localminimum

-2 0 2-2

0

2

x1

x 2

Initialguess Global

minimum

Localminimum

J(w)

w

∇J<0∆w>0

∇J>0∆w<0

J(w)

w

∇J<0∆w>0

∇J>0∆w<0


39

Perceptron learning (1)

g Let’s now consider the problem of solving a binary classification problem with a linear discriminantn As usual, assume we have a dataset X={x(1,x(2,…x(N} containing

examples from the two classesn For convenience, we will absorb the intercept w0 by augmenting

the feature vector x with an additional constant dimension:

[ ] yax1

wwwxw TT00

T =

=+

From [Duda, Hart and Stork, 2001]


40

Perceptron learning (2)n Keep in mind that our objective is to find a vector a such that

n To simplify the derivation, we will “normalize” the training set by replacing all examples from class ω2 by their negative

g This allows us to ignore class labels and look for a weight vector such that

y0yaT ∀>

[ ] 2ωyyy ∈∀−←


( )

∈<∈>

=2

1T

ωx0ωx0

yaxg


41


g To find this solution we must first define an objective function J(a)n A good choice is what is known as the Perceptron criterion

g where YM is the set of examples misclassified by ag Note that JP(a) is non-negative since aTy<0 for misclassified

samples

( ) ( )∑∈

−=MΥy

TP yaaJ


42

Perceptron learning (4)g To find the minimum of JP(a), we use gradient

descentn The gradient is defined by

n And the gradient descent update rule becomes

n This is known as the perceptron batch update rule. g The weight vector may also be updated in an “on-line” fashion, this

is, after the presentation of each individual example

n where y(i is an example that has been misclassified by a(k)

( ) ( )∑∈

−=∇MΥy

Pa yaJ

( ) ( )( )

∑∈

+=+kΥy M

yηka1ka

( ) ( ) (iηyka1ka +=+ Perceptron rule


43


g If classes are linearly separable, the perceptron rule is guaranteed to converge to a valid solution

g However, if the two classes are not linearly separable, the perceptron rule will not convergen Since no weight vector a can correctly classify every sample in a

non-separable dataset, the corrections in the perceptron rule will never cease

n One ad-hoc solution to this problem is to enforce convergence by using variable learning rates η(k) that approach zero as kapproaches infinite


44

Perceptron learning example

g Consider the following classification problemn class ω1 defined by feature vectors x={[0, 0]T, [0, 1]T}; n class ω2 defined by feature vectors x={[1, 0]T, [1, 1]T}.

g Apply the perceptron algorithm to build a vector ‘a’that separates both classes. n Use learning rate η=1 and a(0) =[1, -1, -1]T.n Update the vector ‘a’ on a per-example basisn Present examples in the order in which they were given above.n Draw a scatterplot of the data, and the separating line you found

with the perceptron rule.


45

Minimum Squared Error solution (1)

g The classical Minimum Squared Error (MSE) criterion provides an alternative to the perceptron rulen The perceptron rule seeks a weight vector aT that satisfies the

inequality aTy(i>0g The perceptron rule only considers misclassified samples, since

these are the only ones that violate the above inequalityn Instead, the MSE criterion looks for a solution to the equality

aTy(i=b(i, where b(i are some pre-specified target values (e.g., class labels)

g As a result, the MSE solution uses ALL of the samples in the training set



46


g The system of equations solved by MSE is

n where a is the weight vector, each row in Y is a training example, and each row in b is the corresponding class label

g For consistency, we will continue assuming that examples from class ω2 have been replaced by their negative vector, although this is not a requirement for the MSE solution

bYa

b

bb

a

aa

yyy

yyyyyy

(N

(2

(1

D

1

0

N(D

N(1

N(0

(2D

2(1

2(0

(1D

(11

(10

=⇔

=

M

MM

L

MMM

MMM

L

L



47


g An exact solution to Ya=b can sometimes be found n If the number of (independent) equations (N) is equal to the

number of unknowns (D+1), the exact solution is defined by

g In practice, however, Y will be singular so its inverse Y-1 does not existn Y will commonly have more rows (examples) than columns

(unknown), which yields an over-determined system, for which an exact solution cannot be found

bYa 1−=


48


g The solution in this case is to find a weight vector that minimizes some function of the error between the model (aY) and the desired output (b)n In particular, MSE seeks to Minimize the sum of the Squares of

these Errors:

n which, as usual, can be found by setting its gradient to zero

( ) ( ) 2N

1i

2(i(iTMSE b-YabyaaJ =−=∑

=


49

The pseudo-inverse solution (1)

g The gradient of the objective function is

n with zeros defined by

n Notice that YTY is now a square matrix!

g If YTY is nonsingular, the MSE solution becomes

n where the matrix Y†=(YTY)-1YT is known as the pseudo-inverseof Y (Y†Y=I)

g Note that, in general, YY†≠I in general

( ) ( ) ( ) 0bYa2Yybya2aJ TN

1i

(i(i(iTMSEa =−=−=∇ ∑

=

bYYaY TT =

( ) bYbYYYa †T-1T == Pseudo-inverse solution


50

Least-mean-squares solution (1)

g The objective function JMSE(a)=||Ya-b||2 can also be found using a gradient descent proceduren This avoids the problems that arise when YTY is singularn In addition, it also avoids the need for working with large

matrices

g Looking at the expression of the gradient, the obvious update rule is

n It can be shown that if η(k)=η(1)/k, where η(1) is any positive constant, this rule generates a sequence of vectors that converge to a solution to YT(Ya-b)=0

( ) ( ) ( ) ( )( )kYa-bYkηka1ka T+=+



51

Least-mean-squares solution (2)

g The storage requirements of this algorithm can be reduced by considering each sample sequentially

n This is known as the Widrow-Hoff, least-mean-squares (LMS) or delta rule [Mitchell, 1997]

( ) ( ) ( ) ( )( )kYa-bYkηka1ka T+=+

( ) ( ) ( ) ( )( ) (i(i(i ykay-bkηka1ka +=+ LMS rule


lecture 2: linear and quadratic classifiers

Documents