Object Orie’d Data Analysis, Last Time
• Classification / Discrimination
  – Try to Separate Classes +1 & -1
  – Statistics & EECS viewpoints
  – Introduced Simple Methods
    • Mean Difference
    • Naïve Bayes
    • Fisher Linear Discrimination (nonparametric view)
    • Gaussian Likelihood Ratio
  – Started Comparing
Classification - Discrimination
Important Distinction:
Classification vs. Clustering
Useful terminology:
Classification: supervised learning
Clustering: unsupervised learning
Fisher Linear Discrimination
Graphical Introduction (non-Gaussian):
Classical Discrimination: FLD for Tilted Point Clouds – Works well
Classical Discrimination: GLR for Tilted Point Clouds – Works well
Classical Discrimination: FLD for Donut – Poor, no plane can work
Classical Discrimination: GLR for Donut – Works well (good quadratic)
Classical Discrimination: FLD for X – Poor, no plane can work
Classical Discrimination: GLR for X – Better, but not great
Classical Discrimination: Summary of FLD vs. GLR
• Tilted Point Clouds Data
  – FLD good
  – GLR good
• Donut Data
  – FLD bad
  – GLR good
• X Data
  – FLD bad
  – GLR OK, not great
Classical Conclusion: GLR generally better
(will see a different answer for HDLSS data)
Classical Discrimination: FLD Generalization II (Gen. I was GLR)
Different prior probabilities
Main idea: Give different weights to the 2 classes
• I.e. assume not a priori equally likely
• Development is “straightforward”
• Modified likelihood
• Change intercept in FLD (a worked version is sketched below)
• Won’t explore further here
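As one hedged illustration of the intercept change (not worked out on the slides): for the Gaussian likelihood version with common covariance $\Sigma$ and prior probabilities $\pi_{+1}$ and $\pi_{-1}$, the rule "classify to +1" becomes
\[
  w^{t}\left(x - \tfrac{1}{2}\left(\mu_{+1} + \mu_{-1}\right)\right) \;>\; \log\frac{\pi_{-1}}{\pi_{+1}},
  \qquad w = \Sigma^{-1}\left(\mu_{+1} - \mu_{-1}\right),
\]
so unequal priors shift only the intercept (the right-hand side), leaving the direction $w$ unchanged; equal priors recover the usual cutoff at the midpoint of the class means.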
Classical Discrimination: FLD Generalization III
Principal Discriminant Analysis
• Idea: FLD-like approach to > two classes
• Assumption: Class covariance matrices are the same (similar)
  (but not Gaussian, same situation as for FLD)
• Main idea: Quantify “location of classes” by their means $\mu_1, \mu_2, \ldots, \mu_k$
Classical Discrimination: Principal Discriminant Analysis (cont.)
Simple way to find “interesting directions” among the means:
PCA on the set of means, i.e. eigen-analysis of the “between class covariance matrix”
  $\hat{\Sigma}_B = \frac{1}{k} M M^t$
where $M$ is the $d \times k$ matrix whose columns are the (centered) class means $\mu_1, \ldots, \mu_k$.
Aside: can show that the overall covariance splits into between-class and within-class pieces $\hat{\Sigma}_B$ and $\hat{\Sigma}_w$ (weighted by $k$ and $n_w$).
Classical Discrimination: Principal Discriminant Analysis (cont.)
But PCA only works like Mean Difference,
Expect can improve by taking covariance into account.
Blind application of above ideas suggests eigen-analysis of: $\hat{\Sigma}_w^{-1} \hat{\Sigma}_B$
Classical Discrimination: Principal Discriminant Analysis (cont.)
There are:
• smarter ways to compute (“generalized eigenvalue”; a code sketch follows below)
• other representations (this solves optimization prob’s)
Special case: 2 classes, reduces to standard FLD
Good reference for more: Section 3.8 of Duda, Hart & Stork (2001)
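A minimal sketch of Principal Discriminant Analysis as described above, assuming the pooled within-class covariance is invertible; the function and variable names (pda_directions, X, y) are illustrative, not from the slides. The generalized eigenproblem call is one of the "smarter ways to compute" mentioned above.

```python
import numpy as np
from scipy.linalg import eigh

def pda_directions(X, y):
    """X: (n, d) data matrix; y: (n,) integer class labels.
    Returns discriminant directions ordered by decreasing eigenvalue."""
    classes = np.unique(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sigma_w = np.zeros((d, d))                 # within-class covariance
    Sigma_B = np.zeros((d, d))                 # between-class covariance
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sigma_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        Sigma_B += diff @ diff.T
    Sigma_w /= (len(X) - len(classes))
    Sigma_B /= len(classes)
    # Generalized eigenproblem Sigma_B v = lambda Sigma_w v, equivalent to
    # eigen-analysis of Sigma_w^{-1} Sigma_B when Sigma_w is invertible.
    eigvals, eigvecs = eigh(Sigma_B, Sigma_w)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order], eigvals[order]
```

With 2 classes only one eigenvalue is nonzero, and the leading direction reduces to the standard FLD direction (up to scale).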
Classical Discrimination: Summary of Classical Ideas
• Among “Simple Methods”
  – MD and FLD sometimes similar
  – Sometimes FLD better
  – So FLD is preferred
• Among Complicated Methods
  – GLR is best
  – So always use that
• Caution:
  – Story changes for HDLSS settings
HDLSS Discrimination
Recall main HDLSS issues:
• Sample Size, n < Dimension, d
• Singular covariance matrix
• So can’t use matrix inverse
• I.e. can’t standardize (sphere) the data (requires root inverse covariance)
• Can’t do classical multivariate analysis
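A minimal numerical illustration of the singular-covariance issue above (the sample sizes and dimension here are arbitrary, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                       # HDLSS: sample size below dimension
X = rng.standard_normal((n, d))
S = np.cov(X, rowvar=False)         # d x d sample covariance matrix
print(np.linalg.matrix_rank(S))     # at most n - 1 = 19, so S is singular
# np.linalg.inv(S) is therefore unusable, and there is no root inverse
# covariance with which to sphere the data.
```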
HDLSS Discrimination: An approach to non-invertible covariances
• Replace by generalized inverses
• Sometimes called pseudo inverses
• Note: there are several
• Here use Moore-Penrose inverse
• As used by Matlab (pinv.m)
• Often provides useful results (but not always)
Recall Linear Algebra Review…
Recall Linear Algebra
Eigenvalue Decomposition:
For a (symmetric) square matrix $X_{d \times d}$
Find a diagonal matrix $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$
And an orthonormal matrix $B_{d \times d}$
(i.e. $B B^t = B^t B = I_{d \times d}$)
So that: $X B = B D$, i.e. $X = B D B^t$
Recall Linear Algebra (Cont.)
Eigenvalue Decomp. solves matrix problems:
• Inversion: $X^{-1} = B \, \mathrm{diag}(\lambda_1^{-1}, \ldots, \lambda_d^{-1}) \, B^t$
• Square Root: $X^{1/2} = B \, \mathrm{diag}(\lambda_1^{1/2}, \ldots, \lambda_d^{1/2}) \, B^t$
• $X$ is positive (nonnegative, i.e. semi) definite $\iff$ all $\lambda_i > 0$ ($\geq 0$)
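A brief sketch of the two matrix problems above, computed directly from the eigenvalue decomposition (assuming a symmetric positive definite X; names are illustrative):

```python
import numpy as np

def eig_inverse(X):
    lam, B = np.linalg.eigh(X)                   # X = B diag(lam) B^t
    return B @ np.diag(1.0 / lam) @ B.T

def eig_sqrt(X):
    lam, B = np.linalg.eigh(X)
    return B @ np.diag(np.sqrt(lam)) @ B.T

# Quick check on a random positive definite matrix:
A = np.random.default_rng(1).standard_normal((5, 5))
X = A @ A.T + 5.0 * np.eye(5)
print(np.allclose(eig_inverse(X), np.linalg.inv(X)))    # True
print(np.allclose(eig_sqrt(X) @ eig_sqrt(X), X))        # True
```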
Recall Linear Algebra (Cont.)
Moore-Penrose Generalized Inverse:
For $X = B D B^t$ with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_r > 0 = \lambda_{r+1} = \cdots = \lambda_d$, define
$X^{-} = B \, \mathrm{diag}(\lambda_1^{-1}, \ldots, \lambda_r^{-1}, 0, \ldots, 0) \, B^t$
Recall Linear Algebra (Cont.)
Easy to see this satisfies the definition of
Generalized (Pseudo) Inverse
• $X X^{-} X = X$
• $X^{-} X X^{-} = X^{-}$
• $X X^{-}$ symmetric
• $X^{-} X$ symmetric
Recall Linear Algebra (Cont.)
Moore-Penrose Generalized Inverse:
Idea: matrix inverse on non-null space of linear transformation
Reduces to ordinary inverse in the full rank case, i.e. for r = d, so could just always use this
Tricky aspect: “>0 vs. = 0” & floating point arithmetic (see the sketch below)
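A sketch of the Moore-Penrose inverse built from the eigenvalue decomposition, for a symmetric nonnegative definite matrix; the tolerance cutoff handles the ">0 vs. = 0" issue noted above. numpy.linalg.pinv plays the role of Matlab's pinv.m here; the function name and tolerance value are illustrative.

```python
import numpy as np

def moore_penrose_sym(X, tol=1e-10):
    lam, B = np.linalg.eigh(X)
    keep = lam > tol * lam.max()        # ">0 vs. = 0" decided by a tolerance
    inv_lam = np.zeros_like(lam)
    inv_lam[keep] = 1.0 / lam[keep]     # invert only the nonzero eigenvalues
    return B @ np.diag(inv_lam) @ B.T

# Agrees with the library pseudo inverse (up to the choice of tolerance):
S = np.cov(np.random.default_rng(2).standard_normal((10, 30)), rowvar=False)
print(np.allclose(moore_penrose_sym(S), np.linalg.pinv(S)))
```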
HDLSS Discrimination: Application of Generalized Inverse to FLD
Direction (Normal) Vector: $n_{FLD} = \hat{\Sigma}_w^{-} \left(\bar{X}^{(1)} - \bar{X}^{(2)}\right)$
Intercept: $\frac{1}{2}\left(\bar{X}^{(1)} + \bar{X}^{(2)}\right)$
Have replaced $\hat{\Sigma}_w^{-1}$ by $\hat{\Sigma}_w^{-}$
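A minimal sketch of the formulas above, with the Moore-Penrose inverse (numpy's pinv) in place of the ordinary inverse; function and variable names are illustrative:

```python
import numpy as np

def fld_pinv(X_plus, X_minus):
    """X_plus: (n_+, d) class +1 training vectors; X_minus: (n_-, d) class -1."""
    m1, m2 = X_plus.mean(axis=0), X_minus.mean(axis=0)
    # Pooled within-class covariance (singular whenever n_+ + n_- - 2 < d)
    Sigma_w = ((X_plus - m1).T @ (X_plus - m1)
               + (X_minus - m2).T @ (X_minus - m2)) / (len(X_plus) + len(X_minus) - 2)
    direction = np.linalg.pinv(Sigma_w) @ (m1 - m2)    # generalized inverse replaces inverse
    intercept = 0.5 * (m1 + m2)                        # midpoint of the class means
    return direction, intercept

def classify(x, direction, intercept):
    return np.sign(direction @ (x - intercept))        # +1 or -1
```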
HDLSS Discrimination: Toy Example – Increasing Dimension
$n_{+1} = n_{-1} = 20$ data vectors:
• Entry 1: Class +1: $N(2.2, 1)$; Class –1: $N(-2.2, 1)$
• Other Entries: $N(0, 1)$
• All Entries Independent
Look through dimensions $d = 1, 2, \ldots, 1000$
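A sketch of the toy example above, assuming the class –1 shift is -2.2 (symmetric to class +1). For a given dimension d it generates the two classes and reports the angle between the generalized-inverse FLD direction and the optimal direction, which is the first coordinate axis; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_data(d, n=20, shift=2.2):
    """Two classes of n vectors in dimension d; only entry 1 carries signal."""
    X_plus = rng.standard_normal((n, d));  X_plus[:, 0] += shift
    X_minus = rng.standard_normal((n, d)); X_minus[:, 0] -= shift
    return X_plus, X_minus

def fld_angle_to_optimal(d):
    X_plus, X_minus = toy_data(d)
    m1, m2 = X_plus.mean(axis=0), X_minus.mean(axis=0)
    Sigma_w = ((X_plus - m1).T @ (X_plus - m1)
               + (X_minus - m2).T @ (X_minus - m2)) / (len(X_plus) + len(X_minus) - 2)
    direction = np.linalg.pinv(Sigma_w) @ (m1 - m2)   # FLD with generalized inverse
    optimal = np.zeros(d); optimal[0] = 1.0           # true separating direction
    cos = abs(direction @ optimal) / np.linalg.norm(direction)
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

for d in (2, 10, 38, 39, 100, 1000):
    print(d, round(fld_angle_to_optimal(d), 1))
```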
HDLSS Discrimination: Increasing Dimension Example
(Plot: projections on the optimal direction, on the FLD direction, and on both directions)
HDLSS Discrimination: Add a 2nd Dimension (noise)
(Plot: same projection on the optimal direction; axes are the two directions, now showing 2 dimensions)
HDLSS Discrimination: Add a 3rd Dimension (noise)
(Plot: projection onto the 2-d subspace generated by the optimal direction and the FLD direction)
HDLSS Discrimination: Movie Through Increasing Dimensions
HDLSS Discrimination: FLD in Increasing Dimensions
• Low Dimensions (d = 2–9):
  – Visually good separation
  – Small angle between FLD and Optimal
  – Good generalizability
• Medium Dimensions (d = 10–26):
  – Visual separation too good?!?
  – Larger angle between FLD and Optimal
  – Worse generalizability
  – Feel effect of sampling noise
HDLSS Discrimination: FLD in Increasing Dimensions
• High Dimensions (d = 27–37):
  – Much worse angle
  – Very poor generalizability
  – But very small within-class variation
  – Poor separation between classes
  – Large separation / variation ratio
HDLSS Discrimination: FLD in Increasing Dimensions
• At HDLSS Boundary (d = 38):
  – 38 = degrees of freedom, n – 2 = 40 – 2 (need to estimate the 2 class means)
  – Within-class variation = 0 ?!?
  – Data pile up on just two points
  – Perfect separation / variation ratio?
  – But only feels microscopic noise aspects, so likely not generalizable
  – Angle to optimal very large
HDLSS Discrimination: FLD in Increasing Dimensions
• Just Beyond HDLSS Boundary (d = 39–70):
  – Improves with higher dimension?!?
  – Angle gets better
  – Improving generalizability?
  – More noise helps classification?!?
HDLSS Discrimination: FLD in Increasing Dimensions
• Far Beyond HDLSS Boundary (d = 70–1000):
  – Quality degrades
  – Projections look terrible (populations overlap)
  – And generalizability falls apart as well
  – Math’s worked out by Bickel & Levina (2004)
  – Problem is estimation of the d × d covariance matrix