Object Orie’d Data Analysis, Last Time
• Classification / Discrimination
  – Try to Separate Classes +1 & -1
  – Statistics & EECS viewpoints
  – Introduced Simple Methods
    • Mean Difference
    • Naïve Bayes
    • Fisher Linear Discrimination (nonparametric view)
    • Gaussian Likelihood Ratio
  – Started Comparing
Classification - Discrimination
Important Distinction:
Classification vs. Clustering
Useful terminology:
Classification: supervised learning
Clustering: unsupervised learning
Fisher Linear Discrimination
Graphical Introduction (non-Gaussian):
Classical Discrimination: FLD for Tilted Point Clouds – Works well
Classical Discrimination: GLR for Tilted Point Clouds – Works well
Classical Discrimination: FLD for Donut – Poor, no plane can work
Classical Discrimination: GLR for Donut – Works well (good quadratic)
Classical Discrimination: FLD for X – Poor, no plane can work
Classical Discrimination: GLR for X – Better, but not great
Classical Discrimination: Summary of FLD vs. GLR
• Tilted Point Clouds Data
  – FLD good
  – GLR good
• Donut Data
  – FLD bad
  – GLR good
• X Data
  – FLD bad
  – GLR OK, not great
Classical Conclusion: GLR generally better
(will see a different answer for HDLSS data)
Classical Discrimination: FLD Generalization II (Gen. I was GLR)
Different prior probabilities
Main idea: Give different weights to the 2 classes
• I.e. assume not a priori equally likely
• Development is “straightforward”
• Modified likelihood
• Change intercept in FLD (a worked version is sketched below)
• Won’t explore further here
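As one hedged illustration of the intercept change (not worked out on the slides): for the Gaussian likelihood version with common covariance $\Sigma$ and prior probabilities $\pi_{+1}$ and $\pi_{-1}$, the rule "classify to +1" becomes
\[
  w^{t}\left(x - \tfrac{1}{2}\left(\mu_{+1} + \mu_{-1}\right)\right) \;>\; \log\frac{\pi_{-1}}{\pi_{+1}},
  \qquad w = \Sigma^{-1}\left(\mu_{+1} - \mu_{-1}\right),
\]
so unequal priors shift only the intercept (the right-hand side), leaving the direction $w$ unchanged; equal priors recover the usual cutoff at the midpoint of the class means.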
Classical Discrimination: FLD Generalization III
Principal Discriminant Analysis
• Idea: FLD-like approach to > two classes
• Assumption: Class covariance matrices are the same (similar)
  (but not Gaussian, same situation as for FLD)
• Main idea: Quantify “location of classes” by their means $\mu_1, \mu_2, \ldots, \mu_k$
Classical Discrimination: Principal Discriminant Analysis (cont.)
Simple way to find “interesting directions” among the means:
PCA on the set of means, i.e. eigen-analysis of the “between class covariance matrix”
  $\hat{\Sigma}_B = \frac{1}{k} M M^t$
where $M$ is the $d \times k$ matrix whose columns are the (centered) class means $\mu_1, \ldots, \mu_k$.
Aside: can show that the overall covariance splits into between-class and within-class pieces $\hat{\Sigma}_B$ and $\hat{\Sigma}_w$ (weighted by $k$ and $n_w$).
Classical Discrimination: Principal Discriminant Analysis (cont.)
But PCA only works like Mean Difference,
Expect can improve by taking covariance into account.
Blind application of above ideas suggests eigen-analysis of: $\hat{\Sigma}_w^{-1} \hat{\Sigma}_B$
Classical Discrimination: Principal Discriminant Analysis (cont.)
There are:
• smarter ways to compute (“generalized eigenvalue”; a code sketch follows below)
• other representations (this solves optimization prob’s)
Special case: 2 classes, reduces to standard FLD
Good reference for more: Section 3.8 of Duda, Hart & Stork (2001)
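A minimal sketch of Principal Discriminant Analysis as described above, assuming the pooled within-class covariance is invertible; the function and variable names (pda_directions, X, y) are illustrative, not from the slides. The generalized eigenproblem call is one of the "smarter ways to compute" mentioned above.

```python
import numpy as np
from scipy.linalg import eigh

def pda_directions(X, y):
    """X: (n, d) data matrix; y: (n,) integer class labels.
    Returns discriminant directions ordered by decreasing eigenvalue."""
    classes = np.unique(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sigma_w = np.zeros((d, d))                 # within-class covariance
    Sigma_B = np.zeros((d, d))                 # between-class covariance
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sigma_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        Sigma_B += diff @ diff.T
    Sigma_w /= (len(X) - len(classes))
    Sigma_B /= len(classes)
    # Generalized eigenproblem Sigma_B v = lambda Sigma_w v, equivalent to
    # eigen-analysis of Sigma_w^{-1} Sigma_B when Sigma_w is invertible.
    eigvals, eigvecs = eigh(Sigma_B, Sigma_w)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order], eigvals[order]
```

With 2 classes only one eigenvalue is nonzero, and the leading direction reduces to the standard FLD direction (up to scale).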
Classical Discrimination: Summary of Classical Ideas
• Among “Simple Methods”
  – MD and FLD sometimes similar
  – Sometimes FLD better
  – So FLD is preferred
• Among Complicated Methods
  – GLR is best
  – So always use that
• Caution:
  – Story changes for HDLSS settings
HDLSS Discrimination
Recall main HDLSS issues:
• Sample Size, n < Dimension, d
• Singular covariance matrix
• So can’t use matrix inverse
• I.e. can’t standardize (sphere) the data (requires root inverse covariance)
• Can’t do classical multivariate analysis
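A minimal numerical illustration of the singular-covariance issue above (the sample sizes and dimension here are arbitrary, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                       # HDLSS: sample size below dimension
X = rng.standard_normal((n, d))
S = np.cov(X, rowvar=False)         # d x d sample covariance matrix
print(np.linalg.matrix_rank(S))     # at most n - 1 = 19, so S is singular
# np.linalg.inv(S) is therefore unusable, and there is no root inverse
# covariance with which to sphere the data.
```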
HDLSS Discrimination: An approach to non-invertible covariances
• Replace by generalized inverses
• Sometimes called pseudo inverses
• Note: there are several
• Here use Moore-Penrose inverse
• As used by Matlab (pinv.m)
• Often provides useful results (but not always)
Recall Linear Algebra Review…
Recall Linear Algebra
Eigenvalue Decomposition:
For a (symmetric) square matrix $X_{d \times d}$
Find a diagonal matrix $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$
And an orthonormal matrix $B_{d \times d}$
(i.e. $B B^t = B^t B = I_{d \times d}$)
So that: $X B = B D$, i.e. $X = B D B^t$
Recall Linear Algebra (Cont.)
Eigenvalue Decomp. solves matrix problems:
• Inversion: $X^{-1} = B \, \mathrm{diag}(\lambda_1^{-1}, \ldots, \lambda_d^{-1}) \, B^t$
• Square Root: $X^{1/2} = B \, \mathrm{diag}(\lambda_1^{1/2}, \ldots, \lambda_d^{1/2}) \, B^t$
• $X$ is positive (nonnegative, i.e. semi) definite $\iff$ all $\lambda_i > 0$ ($\geq 0$)
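A brief sketch of the two matrix problems above, computed directly from the eigenvalue decomposition (assuming a symmetric positive definite X; names are illustrative):

```python
import numpy as np

def eig_inverse(X):
    lam, B = np.linalg.eigh(X)                   # X = B diag(lam) B^t
    return B @ np.diag(1.0 / lam) @ B.T

def eig_sqrt(X):
    lam, B = np.linalg.eigh(X)
    return B @ np.diag(np.sqrt(lam)) @ B.T

# Quick check on a random positive definite matrix:
A = np.random.default_rng(1).standard_normal((5, 5))
X = A @ A.T + 5.0 * np.eye(5)
print(np.allclose(eig_inverse(X), np.linalg.inv(X)))    # True
print(np.allclose(eig_sqrt(X) @ eig_sqrt(X), X))        # True
```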
Recall Linear Algebra (Cont.)
Moore-Penrose Generalized Inverse:
For $X = B D B^t$ with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_r > 0 = \lambda_{r+1} = \cdots = \lambda_d$, define
$X^{-} = B \, \mathrm{diag}(\lambda_1^{-1}, \ldots, \lambda_r^{-1}, 0, \ldots, 0) \, B^t$
Recall Linear Algebra (Cont.)
Easy to see this satisfies the definition of
Generalized (Pseudo) Inverse
• $X X^{-} X = X$
• $X^{-} X X^{-} = X^{-}$
• $X X^{-}$ symmetric
• $X^{-} X$ symmetric
Recall Linear Algebra (Cont.)
Moore-Penrose Generalized Inverse:
Idea: matrix inverse on non-null space of linear transformation
Reduces to ordinary inverse in the full rank case, i.e. for r = d, so could just always use this
Tricky aspect: “>0 vs. = 0” & floating point arithmetic (see the sketch below)
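A sketch of the Moore-Penrose inverse built from the eigenvalue decomposition, for a symmetric nonnegative definite matrix; the tolerance cutoff handles the ">0 vs. = 0" issue noted above. numpy.linalg.pinv plays the role of Matlab's pinv.m here; the function name and tolerance value are illustrative.

```python
import numpy as np

def moore_penrose_sym(X, tol=1e-10):
    lam, B = np.linalg.eigh(X)
    keep = lam > tol * lam.max()        # ">0 vs. = 0" decided by a tolerance
    inv_lam = np.zeros_like(lam)
    inv_lam[keep] = 1.0 / lam[keep]     # invert only the nonzero eigenvalues
    return B @ np.diag(inv_lam) @ B.T

# Agrees with the library pseudo inverse (up to the choice of tolerance):
S = np.cov(np.random.default_rng(2).standard_normal((10, 30)), rowvar=False)
print(np.allclose(moore_penrose_sym(S), np.linalg.pinv(S)))
```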
HDLSS Discrimination: Application of Generalized Inverse to FLD
Direction (Normal) Vector: $n_{FLD} = \hat{\Sigma}_w^{-} \left(\bar{X}^{(1)} - \bar{X}^{(2)}\right)$
Intercept: $\frac{1}{2}\left(\bar{X}^{(1)} + \bar{X}^{(2)}\right)$
Have replaced $\hat{\Sigma}_w^{-1}$ by $\hat{\Sigma}_w^{-}$
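A minimal sketch of the formulas above, with the Moore-Penrose inverse (numpy's pinv) in place of the ordinary inverse; function and variable names are illustrative:

```python
import numpy as np

def fld_pinv(X_plus, X_minus):
    """X_plus: (n_+, d) class +1 training vectors; X_minus: (n_-, d) class -1."""
    m1, m2 = X_plus.mean(axis=0), X_minus.mean(axis=0)
    # Pooled within-class covariance (singular whenever n_+ + n_- - 2 < d)
    Sigma_w = ((X_plus - m1).T @ (X_plus - m1)
               + (X_minus - m2).T @ (X_minus - m2)) / (len(X_plus) + len(X_minus) - 2)
    direction = np.linalg.pinv(Sigma_w) @ (m1 - m2)    # generalized inverse replaces inverse
    intercept = 0.5 * (m1 + m2)                        # midpoint of the class means
    return direction, intercept

def classify(x, direction, intercept):
    return np.sign(direction @ (x - intercept))        # +1 or -1
```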
HDLSS Discrimination: Toy Example – Increasing Dimension
$n_{+1} = n_{-1} = 20$ data vectors:
• Entry 1: Class +1: $N(2.2, 1)$; Class –1: $N(-2.2, 1)$
• Other Entries: $N(0, 1)$
• All Entries Independent
Look through dimensions $d = 1, 2, \ldots, 1000$
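A sketch of the toy example above, assuming the class –1 shift is -2.2 (symmetric to class +1). For a given dimension d it generates the two classes and reports the angle between the generalized-inverse FLD direction and the optimal direction, which is the first coordinate axis; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_data(d, n=20, shift=2.2):
    """Two classes of n vectors in dimension d; only entry 1 carries signal."""
    X_plus = rng.standard_normal((n, d));  X_plus[:, 0] += shift
    X_minus = rng.standard_normal((n, d)); X_minus[:, 0] -= shift
    return X_plus, X_minus

def fld_angle_to_optimal(d):
    X_plus, X_minus = toy_data(d)
    m1, m2 = X_plus.mean(axis=0), X_minus.mean(axis=0)
    Sigma_w = ((X_plus - m1).T @ (X_plus - m1)
               + (X_minus - m2).T @ (X_minus - m2)) / (len(X_plus) + len(X_minus) - 2)
    direction = np.linalg.pinv(Sigma_w) @ (m1 - m2)   # FLD with generalized inverse
    optimal = np.zeros(d); optimal[0] = 1.0           # true separating direction
    cos = abs(direction @ optimal) / np.linalg.norm(direction)
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

for d in (2, 10, 38, 39, 100, 1000):
    print(d, round(fld_angle_to_optimal(d), 1))
```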
HDLSS Discrimination: Increasing Dimension Example
(Plot: projections on the optimal direction, on the FLD direction, and on both directions)
HDLSS Discrimination: Add a 2nd Dimension (noise)
(Plot: same projection on the optimal direction; axes are the two directions, now showing 2 dimensions)
HDLSS Discrimination: Add a 3rd Dimension (noise)
(Plot: projection onto the 2-d subspace generated by the optimal direction and the FLD direction)
HDLSS Discrimination: Movie Through Increasing Dimensions
HDLSS Discrimination: FLD in Increasing Dimensions
• Low Dimensions (d = 2–9):
  – Visually good separation
  – Small angle between FLD and Optimal
  – Good generalizability
• Medium Dimensions (d = 10–26):
  – Visual separation too good?!?
  – Larger angle between FLD and Optimal
  – Worse generalizability
  – Feel effect of sampling noise
HDLSS Discrimination: FLD in Increasing Dimensions
• High Dimensions (d = 27–37):
  – Much worse angle
  – Very poor generalizability
  – But very small within-class variation
  – Poor separation between classes
  – Large separation / variation ratio
HDLSS Discrimination: FLD in Increasing Dimensions
• At HDLSS Boundary (d = 38):
  – 38 = degrees of freedom, n – 2 = 40 – 2 (need to estimate the 2 class means)
  – Within-class variation = 0 ?!?
  – Data pile up on just two points
  – Perfect separation / variation ratio?
  – But only feels microscopic noise aspects, so likely not generalizable
  – Angle to optimal very large
HDLSS Discrimination: FLD in Increasing Dimensions
• Just Beyond HDLSS Boundary (d = 39–70):
  – Improves with higher dimension?!?
  – Angle gets better
  – Improving generalizability?
  – More noise helps classification?!?
HDLSS Discrimination: FLD in Increasing Dimensions
• Far Beyond HDLSS Boundary (d = 70–1000):
  – Quality degrades
  – Projections look terrible (populations overlap)
  – And generalizability falls apart as well
  – Math’s worked out by Bickel & Levina (2004)
  – Problem is estimation of the d × d covariance matrix