ECE 8443 – Pattern Recognition
LECTURE 03: GAUSSIAN CLASSIFIERS


TRANSCRIPT

Page 1: ECE 8443, Lecture 03, Slide 1

ECE 8443 – Pattern Recognition

• URL: .../publications/courses/ece_8443/lectures/current/lecture_03.ppt

LECTURE 03: GAUSSIAN CLASSIFIERS

• Objectives: Normal Distributions, Whitening Transformations, Linear Discriminants

• Resources: D.H.S.: Chapter 2 (Part 3); K.F.: Intro to PR; X.Z.: PR Course; M.B.: Gaussian Discriminants; E.M.: Linear Discriminants

Page 2: ECE 8443, Lecture 03, Slide 2

• Define a set of discriminant functions: g_i(x), i = 1, …, c

• Define a decision rule:

choose ω_i if: g_i(x) > g_j(x) for all j ≠ i

• For a Bayes classifier, let g_i(x) = −R(α_i|x), because the maximum discriminant function will then correspond to the minimum conditional risk.

• For the minimum error rate case, let g_i(x) = P(ω_i|x), so that the maximum discriminant function corresponds to the maximum posterior probability.

• Choice of discriminant function is not unique:

Multiply by the same positive constant or add the same constant to every g_i(x).

Replace every g_i(x) with a monotonically increasing function of it, f(g_i(x)).

Multicategory Decision Surfaces
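The decision rule above is just an argmax over the discriminant functions. Below is a minimal sketch in Python/NumPy (an editorial illustration, not from the original slides); the function name classify and the two hand-picked discriminants are hypothetical stand-ins for g_i(x) ∝ p(x|ω_i)P(ω_i).

```python
import numpy as np

def classify(x, discriminants):
    """Decision rule from the slide: choose the class whose g_i(x) is largest."""
    scores = np.array([g(x) for g in discriminants])
    return int(np.argmax(scores))

# Two hand-picked discriminants, each proportional to p(x|w_i)P(w_i):
g = [lambda x: np.exp(-0.5 * (x - 0.0) ** 2),
     lambda x: np.exp(-0.5 * (x - 2.0) ** 2)]

print(classify(0.3, g))  # -> 0, i.e. choose class w_1, since g_1(0.3) > g_2(0.3)
```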

Page 3: ECE 8443, Lecture 03, Slide 3

• A classifier can be visualized as a connected graph with arcs and weights:

• What are the advantages of this type of visualization?

Network Representation of a Classifier

Page 4: ECE 8443, Lecture 03, Slide 4

• Some monotonically increasing functions can simplify calculations considerably:

$$ (1)\quad g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\,P(\omega_j)} $$

$$ (2)\quad g_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)\,P(\omega_i) $$

$$ (3)\quad g_i(\mathbf{x}) = f(g_i(\mathbf{x})) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i) $$

• What are some of the reasons (3) is particularly useful?

Computational complexity (e.g., Gaussian)

Numerical accuracy (e.g., probabilities tend to zero)

Decomposition (e.g., likelihood and prior are separated and can be weighted differently)

Normalization (e.g., likelihoods are channel dependent).

Log Probabilities
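To make the numerical-accuracy point concrete, here is a small Python/NumPy sketch (an editorial illustration, not from the slides): a product of many small likelihood values underflows in floating point, while the equivalent sum of logs stays finite and comparable across classes.

```python
import numpy as np

# Many small likelihood values: their product underflows in float64,
# but the equivalent sum of logs stays finite and usable as a discriminant.
rng = np.random.default_rng(0)
likelihoods = rng.uniform(1e-6, 1e-3, size=500)

direct = np.prod(likelihoods)          # underflows to 0.0
log_sum = np.sum(np.log(likelihoods))  # finite (roughly -4000), still comparable

print(direct, log_sum)
```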

Page 5: ECE 8443, Lecture 03, Slide 5

• We can visualize our decision rule several ways:

choose ω_i if: g_i(x) > g_j(x) for all j ≠ i

Decision Surfaces

Page 6: ECE 8443, Lecture 03, Slide 6

• A classifier that places a pattern in one of two classes is often referred to as a dichotomizer.

• We can reshape the decision rule:

$$ g(\mathbf{x}) \equiv g_1(\mathbf{x}) - g_2(\mathbf{x});\qquad \text{decide } \omega_1 \text{ if } g(\mathbf{x}) > 0,\ \text{otherwise decide } \omega_2 $$

• If we use the log of the posterior probabilities:

$$ g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x}) $$

$$ g(\mathbf{x}) = f(g(\mathbf{x})) = \ln\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)} $$

• A dichotomizer can be viewed as a machine that computes a single discriminant function and classifies x according to the sign (e.g., support vector machines).

Two-Category Case
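A minimal dichotomizer sketch in Python (an editorial illustration, not from the slides), assuming SciPy is available and using illustrative class-conditional Gaussians and priors: the single discriminant is the log-likelihood ratio plus the log-prior ratio, and the sign of g(x) decides the class.

```python
import numpy as np
from scipy.stats import norm

# Illustrative univariate class-conditionals p(x|w_i) and priors P(w_i).
mu1, mu2, sigma = 0.0, 2.0, 1.0
P1, P2 = 0.7, 0.3

def g(x):
    """Single discriminant: log-likelihood ratio plus log-prior ratio."""
    return (norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu2, sigma)
            + np.log(P1 / P2))

x = 1.2
print("decide w1" if g(x) > 0 else "decide w2")  # the sign of g(x) is the decision
```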

Page 7: ECE 8443, Lecture 03, Slide 7

• Mean and covariance:

$$ \boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x} $$

$$ \boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^t\, p(\mathbf{x})\, d\mathbf{x} $$

• Statistical independence?

• Higher-order moments? Occam's Razor?

• Entropy?

• Linear combinations of normal random variables?

• Central Limit Theorem?

• Recall the definition of a normal distribution (Gaussian):

$$ p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right] $$

• Why is this distribution so important in engineering?

Normal Distributions

Page 8: ECE 8443, Lecture 03, Slide 8

• The normal, or Gaussian, density is a powerful model for continuous-valued feature vectors corrupted by noise, largely because of its analytical tractability.

• Univariate normal distribution:

$$ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] $$

where the mean and variance are defined by:

$$ \mu = E[x] = \int x\, p(x)\, dx $$

$$ \sigma^{2} = E[(x-\mu)^{2}] = \int (x-\mu)^{2}\, p(x)\, dx $$

• The entropy of a univariate normal distribution is given by:

$$ H(p(x)) = -\int p(x)\,\ln p(x)\, dx = \frac{1}{2}\log\!\left(2\pi e \sigma^{2}\right) $$

Univariate Normal Distribution
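A quick numerical check of the univariate density and its entropy (an editorial sketch in Python/NumPy, not part of the original slides): integrating −p(x) ln p(x) on a fine grid reproduces the closed-form value ½ ln(2πeσ²).

```python
import numpy as np

mu, sigma = 1.0, 2.0

def p(x):
    """Univariate normal density from the slide."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Riemann-sum approximation of H = -integral of p(x) ln p(x) dx.
xs = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = xs[1] - xs[0]
numerical = -np.sum(p(xs) * np.log(p(xs))) * dx
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

print(numerical, closed_form)  # both approximately 2.112 nats
```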

Page 9: ECE 8443, Lecture 03, Slide 9

• A normal distribution is completely specified by its mean and variance:

• A normal distribution achieves the maximum entropy of all distributions having a given mean and variance.

• Central Limit Theorem: the distribution of the sum of a large number of small, independent random variables tends toward a Gaussian.

• The peak of the density occurs at the mean:

$$ p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma} $$

• Approximately 68% of the area lies within one σ of the mean; about 95% lies within two σ; about 99.7% lies within three σ.

Mean and Variance

Page 10: ECE 8443, Lecture 03, Slide 10

• A multivariate normal distribution is defined as:

$$ p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right] $$

where μ represents the mean (vector) and Σ represents the covariance (matrix).

• Note the exponent term is really a dot product or weighted Euclidean distance.

• The covariance is always symmetric and positive semidefinite.

• How does the shape vary as a function of the covariance?

Multivariate Normal Distributions
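The density above is straightforward to evaluate directly. Below is an editorial Python/NumPy sketch (the function name mvn_pdf and the example numbers are my own); it implements exactly the formula on this slide.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density: the formula on this slide."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    mahal_sq = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^t Sigma^{-1} (x-mu)
    return norm_const * np.exp(-0.5 * mahal_sq)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, -0.5]), mu, Sigma))
```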

Page 11: ECE 8443, Lecture 03, Slide 11

• A support region is obtained by intersecting a Gaussian distribution with a plane.

• For a horizontal plane, this generates an ellipse whose points are of equal probability density.

• The shape of the support region is defined by the covariance matrix.

Support Regions

Page 12: ECE 8443, Lecture 03, Slide 12

Derivation

Page 13: ECE 8443, Lecture 03, Slide 13

Identity Covariance

Page 14: ECE 8443, Lecture 03, Slide 14

Unequal Variances

Page 15: ECE 8443, Lecture 03, Slide 15

Nonzero Off-Diagonal Elements

Page 16: ECE 8443, Lecture 03, Slide 16

Unconstrained or “Full” Covariance

Page 17: ECE 8443, Lecture 03, Slide 17

• Why is it convenient to convert an arbitrary distribution into a spherical one? (Hint: Euclidean distance)

• Consider the whitening transformation:

$$ \mathbf{A}_w = \boldsymbol{\Phi}\,\boldsymbol{\Lambda}^{-1/2} $$

where Φ is the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ is the diagonal matrix of its eigenvalues (so that Σ = ΦΛΦ^t). Note that Φ is unitary.

• What is the covariance of y = A_w^t x (for zero-mean x)?

$$ E[\mathbf{y}\mathbf{y}^t] = E[(\mathbf{A}_w^t \mathbf{x})(\mathbf{A}_w^t \mathbf{x})^t] = \boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Phi}^t\, E[\mathbf{x}\mathbf{x}^t]\, \boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2} $$

$$ = \boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Phi}^t\,\boldsymbol{\Sigma}\,\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Phi}^t(\boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Phi}^t)\boldsymbol{\Phi}\boldsymbol{\Lambda}^{-1/2} $$

$$ = \boldsymbol{\Lambda}^{-1/2}\,\boldsymbol{\Lambda}\,\boldsymbol{\Lambda}^{-1/2} = \mathbf{I} $$

Coordinate Transformations
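A small Python/NumPy sketch of the whitening transformation (an editorial illustration with synthetic data, not from the slides): build A_w = ΦΛ^(-1/2) from the eigendecomposition of Σ and verify that the transformed, zero-mean data have approximately identity covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)

eigvals, Phi = np.linalg.eigh(Sigma)     # Lambda (as a vector) and Phi
A_w = Phi @ np.diag(eigvals ** -0.5)     # A_w = Phi Lambda^{-1/2}
Y = X @ A_w                              # each row is y^t = x^t A_w, i.e. y = A_w^t x

print(np.cov(Y, rowvar=False))           # approximately the identity matrix
```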

Page 18: ECE 8443, Lecture 03, Slide 18

• The weighted Euclidean distance:

$$ (\mathbf{x} - \boldsymbol{\mu})^t\, \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) $$

is known as the Mahalanobis distance, and represents a statistically normalized distance calculation that results from our whitening transformation.

• Consider an example using our Java Applet.

Mahalanobis Distance
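An editorial Python/NumPy sketch of the Mahalanobis distance (the function name and numbers are my own), contrasted with the plain squared Euclidean distance it generalizes.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x-mu)^t Sigma^{-1} (x-mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([2.0, 1.0])

print(mahalanobis_sq(x, mu, Sigma))   # statistically normalized distance
print(float((x - mu) @ (x - mu)))     # plain squared Euclidean distance, for contrast
```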

Page 19: ECE 8443, Lecture 03, Slide 19

• Recall our discriminant function for minimum error rate classification:

$$ g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i) $$

• For a multivariate normal distribution:

$$ g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i) $$

• Consider the case Σ_i = σ²I (statistical independence, equal variance, class-independent variance):

$$ \boldsymbol{\Sigma}_i = \sigma^{2}\mathbf{I} = \begin{bmatrix} \sigma^{2} & 0 & \cdots & 0 \\ 0 & \sigma^{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^{2} \end{bmatrix},\qquad |\boldsymbol{\Sigma}_i| = \sigma^{2d},\qquad \boldsymbol{\Sigma}_i^{-1} = \frac{1}{\sigma^{2}}\,\mathbf{I} $$

Here |Σ_i| and d are independent of i, so the corresponding terms in g_i(x) are additive constants that can be ignored.

Discriminant Functions
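An editorial Python/NumPy sketch of this discriminant (the function name and example numbers are my own): it evaluates the multivariate-normal g_i(x) above and picks the class with the largest score, here for the spherical-covariance case Σ_i = σ²I just introduced.

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) for a multivariate normal class-conditional (slide formula)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Illustrative two-class problem with Sigma_i = sigma^2 I:
sigma2 = 1.5
classes = [(np.array([0.0, 0.0]), sigma2 * np.eye(2), 0.5),
           (np.array([3.0, 3.0]), sigma2 * np.eye(2), 0.5)]

x = np.array([1.0, 1.2])
scores = [gaussian_discriminant(x, m, S, P) for (m, S, P) in classes]
print(int(np.argmax(scores)))   # 0: x is closer to the first mean
```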

Page 20: ECE 8443, Lecture 03, Slide 20

$$ g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i) $$

• Since the |Σ_i| and (d/2) ln 2π terms are constant with respect to the maximization, the discriminant function can be reduced to:

$$ g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i) = -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^{2}}{2\sigma^{2}} + \ln P(\omega_i) $$

• We can expand this:

$$ g_i(\mathbf{x}) = -\frac{1}{2\sigma^{2}}\left[\mathbf{x}^t\mathbf{x} - 2\boldsymbol{\mu}_i^t\mathbf{x} + \boldsymbol{\mu}_i^t\boldsymbol{\mu}_i\right] + \ln P(\omega_i) $$

• The term x^t x is constant with respect to i (it is the same for every class) and can be dropped, and μ_i^t μ_i is a constant that can be precomputed.

Gaussian Classifiers

Page 21: ECE 8443, Lecture 03, Slide 21

• We can use an equivalent linear discriminant function:

$$ g_i(\mathbf{x}) = \mathbf{w}_i^t\mathbf{x} + w_{i0},\qquad \mathbf{w}_i = \frac{1}{\sigma^{2}}\boldsymbol{\mu}_i,\qquad w_{i0} = -\frac{1}{2\sigma^{2}}\boldsymbol{\mu}_i^t\boldsymbol{\mu}_i + \ln P(\omega_i) $$

• w_i0 is called the threshold or bias for the ith category.

• A classifier that uses linear discriminant functions is called a linear machine.

• The decision surfaces are defined by the equation g_i(x) − g_j(x) = 0:

$$ \mathbf{w}^t(\mathbf{x} - \mathbf{x}_0) = 0,\qquad \mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j $$

$$ \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^{2}}{\lVert \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \rVert^{2}} \ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) $$

Linear Machines
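An editorial Python/NumPy sketch of the linear machine for Σ_i = σ²I (all names and numbers are my own): it builds w_i and w_i0 as defined above and uses them to classify a point; because the dropped x^t x term is common to all classes, the decision agrees with the quadratic discriminant.

```python
import numpy as np

sigma2 = 1.5
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
priors = [0.7, 0.3]

def linear_g(x, mu, prior):
    """Linear discriminant g_i(x) = w_i^t x + w_i0 for Sigma_i = sigma^2 I."""
    w = mu / sigma2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)
    return w @ x + w0

# x is equidistant from both means, so the larger prior decides the class.
x = np.array([1.4, 1.6])
scores = [linear_g(x, m, P) for m, P in zip(means, priors)]
print(int(np.argmax(scores)))   # 0: the class with prior 0.7 wins
```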

Page 22: ECE 8443, Lecture 03, Slide 22

• This has a simple geometric interpretation:

$$ \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^{2}}{\lVert \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \rVert^{2}} \ln\frac{P(\omega_i)}{P(\omega_j)}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) $$

• When the priors are equal and the support regions are spherical, the decision boundary passes exactly halfway between the means (a plain Euclidean-distance comparison).

Threshold Decoding
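A short editorial Python/NumPy check of the boundary point x_0 (names and numbers are my own): with equal priors it is the midpoint of the two means, and an unequal prior shifts it toward the mean of the less probable class.

```python
import numpy as np

sigma2 = 1.0
mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 0.0])

def boundary_point(P_i, P_j):
    """x_0 = 0.5(mu_i + mu_j) - sigma^2/||mu_i - mu_j||^2 * ln(P_i/P_j) * (mu_i - mu_j)."""
    diff = mu_i - mu_j
    return 0.5 * (mu_i + mu_j) - sigma2 / (diff @ diff) * np.log(P_i / P_j) * diff

print(boundary_point(0.5, 0.5))   # [2. 0.]: halfway between the means
print(boundary_point(0.8, 0.2))   # shifted toward mu_j, the mean of the less probable class
```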

Page 23: ECE 8443, Lecture 03, Slide 23

Summary

• Decision Surfaces: geometric interpretation of a Bayesian classifier.

• Gaussian Distributions: how is the shape of the distribution influenced by the mean and covariance?

• Bayesian classifiers for Gaussian distributions: how does the decision surface change as a function of the mean and covariance?