High Dimensional Discriminant Analysis - mistis.inrialpes.fr/docs/presentation_hdda.pdf
TRANSCRIPT
High Dimensional Discriminant Analysis
Charles Bouveyron, LMC-IMAG & INRIA Rhône-Alpes
Joint work with S. Girard and C. Schmid
High Dimensional Discriminant Analysis - Lear seminar – p.1/43
Introduction

High-dimensional data:
many scientific domains need to analyze data that are increasingly complex,
modern data are made up of many variables: imagery (MRI, vision), biology (DNA micro-arrays), ...

Classification is very difficult in high-dimensional spaces:
many learning methods suffer from the curse of dimensionality [Bel61],
since the number n of data points is generally not sufficient to learn high-dimensional models.

The empty space phenomenon [ST83] allows us to assume that the data live in subspaces of lower dimensionality.
Introduction

Classification:
supervised classification (discriminant analysis) requires labeled examples of the classes,
unsupervised classification (clustering) aims to organize data into homogeneous classes.

Two families of methods:
generative methods: QDA, LDA, GMM,
discriminative methods: logistic regression and SVM.

Generative models can be used in both supervised and unsupervised classification.
Outline of the talk

Discriminant analysis framework
New modeling of high-dimensional data
High Dimensional Discriminant Analysis (HDDA):
construction of the decision rule,
a posteriori probability and reformulation.
Particular rules
Estimators and intrinsic dimension estimation
Numerical results:
application to image categorization,
application to object recognition.
Extension to unsupervised classification
Part 1
Discriminant analysis framework
Discriminant analysis framework

Discriminant analysis is the supervised part of classification, i.e. it requires a teacher!

Discriminant analysis goals:
descriptive aspect: find a data representation which allows one to interpret the groups using the explanatory variables,
decisional aspect: the main goal is to find the correct class membership of a new observation x.

Of course, HDDA favours the decisional aspect!
Discrimination problem

The basic problem:
assign an observation x = (x1, ..., xp) ∈ R^p with unknown class membership to one of k classes C1, ..., Ck known a priori.

We have a learning dataset A:

A = {(x1, y1), ..., (xn, yn) | xj ∈ R^p and yj ∈ {1, ..., k}},

where the vector xj contains the p explanatory variables and yj indicates the index of the class of xj.

We have to construct a decision rule δ:

δ : R^p → {1, ..., k}
x ↦ y.
Bayes decision rule

The optimal decision rule δ*, called the Bayes decision rule, is:

δ* : x ∈ Ci* if i* = argmax_{i=1,...,k} { p(Ci|x) },

or, equivalently,

δ* : x ∈ Ci* if i* = argmin_{i=1,...,k} { −2 log(πi fi(x)) },

where πi is the a priori probability of the class Ci and fi(x) denotes the class conditional density of x.
Generative methods usually assume that the class distributions are Gaussian N(µi, Σi).
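As a minimal sketch of the rule above, assuming Gaussian class-conditional densities (all parameter values below are made-up toy numbers, purely for illustration), the costs −2 log(πi fi(x)) can be computed and minimized as follows:

```python
import numpy as np

def gaussian_log_density(x, mu, cov):
    """Log-density of a multivariate Gaussian N(mu, cov) at x."""
    p = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # Mahalanobis distance squared
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)

def bayes_rule(x, priors, means, covs):
    """Index i* minimizing -2 log(pi_i f_i(x))."""
    costs = [-2 * (np.log(pi) + gaussian_log_density(x, mu, cov))
             for pi, mu, cov in zip(priors, means, covs)]
    return int(np.argmin(costs))

# Toy two-class example in R^2 (illustrative values only).
priors = [0.5, 0.5]
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
```

A point near one class mean is then assigned to that class by `bayes_rule`.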
Classical discriminant analysis methods

Quadratic discriminant analysis (QDA):

i* = argmin_{i=1,...,k} { (x − µi)^t Σi^{-1} (x − µi) + log(det Σi) − 2 log(πi) }.

Linear discriminant analysis (LDA), with the assumption that ∀i, Σi = Σ:

i* = argmin_{i=1,...,k} { µi^t Σ^{-1} µi − 2 µi^t Σ^{-1} x − 2 log(πi) }.

QDA and LDA behave disappointingly when the size n of the training dataset is small compared to the number p of variables.
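To see the link between the two rules above, here is a quick numerical check (with arbitrary toy parameters) that, under a shared covariance Σ, the QDA cost and the linear LDA cost select the same class: they differ only by the class-independent terms x^t Σ^{-1} x and log det Σ.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 5, 3
Sigma = np.eye(p) + 0.3 * np.ones((p, p))   # shared covariance (toy values)
Sinv = np.linalg.inv(Sigma)
means = [rng.normal(size=p) for _ in range(k)]
priors = [0.2, 0.3, 0.5]

def qda_cost(x, mu, pi):
    # Full QDA cost with Sigma_i = Sigma for every class.
    d = x - mu
    return d @ Sinv @ d + np.linalg.slogdet(Sigma)[1] - 2 * np.log(pi)

def lda_cost(x, mu, pi):
    # Linear LDA cost: the quadratic term in x has been dropped.
    return mu @ Sinv @ mu - 2 * mu @ Sinv @ x - 2 * np.log(pi)

x = rng.normal(size=p)
i_qda = int(np.argmin([qda_cost(x, m, pi) for m, pi in zip(means, priors)]))
i_lda = int(np.argmin([lda_cost(x, m, pi) for m, pi in zip(means, priors)]))
```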
Discriminant analysis regularization

Dimension reduction: PCA, FDA, feature selection.
Fisher discriminant analysis (FDA) combines:
a dimension reduction step (projection on the k − 1 discriminant axes)
with one of the previous methods (usually LDA).

Parsimonious models:
Regularized discriminant analysis (RDA, [Fri89]) is an intermediate classifier between QDA and LDA,
Eigenvalue decomposition discriminant analysis (EDDA, [BC96]) is based on a re-parametrization of the class covariance matrices:

Σi = λi Di Ai Di^t.
Dimension reduction for classification
[Figure: the same high-dimensional dataset projected on the PCA axes (left) and on the discriminant axes (right).]

Fig. 1 - High-dimensional data whose classes live in different subspaces of lower dimensionality.
Part 2
New modeling
New modeling

The empty space phenomenon enables us to assume that high-dimensional data live in subspaces of low dimensionality.

The main idea of the new modeling is:
each class is decomposed on two subspaces of low dimensionality,
and the classes are assumed spherical within these subspaces.
New modeling

We assume that the class conditional densities are Gaussian N(µi, Σi), with means µi and covariance matrices Σi.
Let Qi be the orthogonal matrix of eigenvectors of the covariance matrix Σi, and let Bi be the basis of R^p made of the eigenvectors of Σi.
The class conditional covariance matrix ∆i is defined in the basis Bi by:

∆i = Qi^t Σi Qi.
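Expressed in its eigenbasis, the covariance matrix becomes diagonal, which is what ∆i captures. A quick numpy check on a toy covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T                      # a toy symmetric positive-definite covariance
eigvals, Q = np.linalg.eigh(Sigma)   # columns of Q: orthonormal eigenvectors
Delta = Q.T @ Sigma @ Q              # covariance expressed in the eigenbasis
```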
New modeling

We assume in addition that ∆i contains only two different eigenvalues ai > bi.
Let Ei be the affine space generated by the eigenvectors associated with the eigenvalue ai and such that µi ∈ Ei.
We also define E⊥i such that Ei ⊕ E⊥i = R^p and µi ∈ E⊥i.
Let Pi and P⊥i be the projection operators on Ei and E⊥i.
New modeling

Thus, we assume that ∆i has the following form:

∆i = diag(ai, ..., ai, bi, ..., bi),

where the eigenvalue ai is repeated di times and the eigenvalue bi is repeated (p − di) times.
New modeling: illustration
Part 3
High Dimensional DiscriminantAnalysis
High Dimensional Discriminant Analysis

Under the preceding assumptions, the Bayes decision rule yields a new decision rule δ+:

Theorem 1: The new decision rule δ+ consists in classifying x to the class Ci* if:

i* = argmin_{i=1,...,k} { (1/ai) ‖µi − Pi(x)‖² + (1/bi) ‖x − Pi(x)‖² + di log(ai) + (p − di) log(bi) − 2 log(πi) }.
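The rule in Theorem 1 only needs, for each class, the mean, the di leading eigenvectors, ai, bi, di and πi. A minimal sketch (toy parameter values; Q is assumed here to hold the di leading eigenvectors as columns):

```python
import numpy as np

def hdda_cost(x, mu, Q, a, b, d, p, prior):
    """K_i(x) from Theorem 1; Q (p x d) spans the class subspace E_i."""
    Px = mu + Q @ (Q.T @ (x - mu))    # projection of x on the affine subspace E_i
    return (np.sum((mu - Px) ** 2) / a
            + np.sum((x - Px) ** 2) / b
            + d * np.log(a) + (p - d) * np.log(b)
            - 2 * np.log(prior))

# Toy example: two classes in R^3 whose subspaces are the e1 and e2 axes.
p, d, a, b = 3, 1, 4.0, 0.1
Q0 = np.array([[1.0], [0.0], [0.0]])
Q1 = np.array([[0.0], [1.0], [0.0]])
mu = np.zeros(p)
x = np.array([2.0, 0.1, 0.0])         # close to the e1 axis
c0 = hdda_cost(x, mu, Q0, a, b, d, p, 0.5)
c1 = hdda_cost(x, mu, Q1, a, b, d, p, 0.5)
```

Since x lies almost on the subspace of the first class, its cost c0 is the smaller of the two.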
HDDA: illustration

Ki(x) = (1/ai) ‖µi − Pi(x)‖² + (1/bi) ‖x − Pi(x)‖² + di log(ai) + (p − di) log(bi) − 2 log(πi).
HDDA: a posteriori probability

In many applications, it is useful to have the a posteriori probability p(Ci|x) that x belongs to Ci.
The Bayes formula yields:

p(Ci|x) = exp(−Ki(x)/2) / Σ_{j=1}^{k} exp(−Kj(x)/2),

where Ki is the cost function of δ+ conditionally on the class Ci:

Ki(x) = (1/ai) ‖µi − Pi(x)‖² + (1/bi) ‖x − Pi(x)‖² + di log(ai) + (p − di) log(bi) − 2 log(πi).
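With the costs Ki in hand, the posterior probabilities follow from the formula above. Subtracting the smallest cost before exponentiating (a standard numerical-stability trick, not in the original slides) avoids underflow when the costs are large, without changing the result:

```python
import numpy as np

def posterior_probabilities(costs):
    """p(C_i|x) = exp(-K_i(x)/2) / sum_j exp(-K_j(x)/2)."""
    costs = np.asarray(costs, dtype=float)
    w = np.exp(-(costs - costs.min()) / 2)   # shift by the min cost for stability
    return w / w.sum()

probs = posterior_probabilities([3.0, 5.0, 10.0])  # toy costs
```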
HDDA: reformulation

In order to interpret the decision rule δ+ more easily, we introduce αi and σi:

ai = σi²/αi and bi = σi²/(1 − αi), with αi ∈ ]0, 1[ and σi > 0.

Thus, the decision rule δ+ consists in classifying x to the class Ci* if:

i* = argmin_{i=1,...,k} { (1/σi²) ( αi ‖µi − Pi(x)‖² + (1 − αi) ‖x − Pi(x)‖² ) + 2p log(σi) + di log((1 − αi)/αi) − p log(1 − αi) − 2 log(πi) }.

Notation: HDDA is the model [ai bi Qi di] or [αi σi Qi di].
Part 4
Particular rules
Particular rules

By allowing some but not all of the HDDA parameters to vary between classes, we obtain 24 particular rules:
which correspond to different regularizations,
some of which are easily interpretable geometrically,
and 9 of which have explicit formulations.

HDDA can be interpreted as a classical discriminant analysis in particular cases:
if ∀i, αi = 1/2: δ+ is QDA with spherical classes,
if in addition ∀i, σi = σ: δ+ is LDA with spherical classes.
Links with classical methods
[Diagram: EDDA (Σi = λi Di Ai Di^t) and HDDA (Σi = Qi ∆i Qi^t) as general models, linked by constraints such as Σi = λDAD^t, Ai = Id, αi = 1/2, Σi = σi² Id, σi = σ and πi = π to QDA, LDA, their spherical variants QDAs and LDAs, and LDA géo.]
Model [α σ Qi di]

The decision rule δ+ consists in classifying x to the class Ci* if:

i* = argmin_{i=1,...,k} { α ‖µi − Pi(x)‖² + (1 − α) ‖x − Pi(x)‖² }.
Part 5
Estimation
HDDA estimators

Estimators are computed by maximum likelihood from the learning set A.
Common estimators:

πi = ni/n, where ni = #(Ci),

µi = (1/ni) Σ_{xj ∈ Ci} xj,

Σi = (1/ni) Σ_{xj ∈ Ci} (xj − µi)(xj − µi)^t.
Estimators of the model [ai bi Qi di]

Assuming di is known, the ML estimators are:
Qi is made of the eigenvectors associated with the ordered eigenvalues of Σi,
ai is the mean of the largest di eigenvalues of Σi:

ai = (1/di) Σ_{l=1}^{di} λil,

bi is the mean of the smallest (p − di) eigenvalues of Σi:

bi = (1/(p − di)) Σ_{l=di+1}^{p} λil.
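These two estimators can be sketched directly in numpy (di assumed known, toy covariance values):

```python
import numpy as np

def hdda_ab(Sigma_hat, d):
    """ML estimators of a_i and b_i from the class covariance estimate:
    a_i = mean of the d largest eigenvalues, b_i = mean of the rest."""
    eigvals = np.linalg.eigvalsh(Sigma_hat)[::-1]  # descending order
    return eigvals[:d].mean(), eigvals[d:].mean()

# Toy check: a diagonal covariance with eigenvalues 4, 4, 1, 1, 1 and d = 2.
a, b = hdda_ab(np.diag([4.0, 4.0, 1.0, 1.0, 1.0]), d=2)
```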
Estimation trick

The decision rule δ+ does not require computing the last (p − di) eigenvectors of Σi.
Thus, in order to minimize the number of parameters to estimate, we use the following relation:

Σ_{l=di+1}^{p} λil = tr(Σi) − Σ_{l=1}^{di} λil.
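Thanks to this relation, bi can be obtained from the trace and the di leading eigenvalues alone. A quick numerical check on a toy covariance (the explicit tail sum is computed here only to validate the identity):

```python
import numpy as np

rng = np.random.default_rng(2)
p, d = 10, 3
A = rng.normal(size=(p, p))
Sigma = A @ A.T                        # toy covariance estimate
lam = np.linalg.eigvalsh(Sigma)[::-1]  # eigenvalues, descending

# Sum of the trailing (p - d) eigenvalues via the trace relation:
tail_sum = np.trace(Sigma) - lam[:d].sum()
b_hat = tail_sum / (p - d)
```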
Number of parameters to estimate with p = 100, di = 10 and k = 4:
QDA: 20 603,
HDDA: 4 323.
Intrinsic dimension estimation

We base our approach for choosing the values of di on the eigenvalues of Σi.
We use two empirical methods:

common thresholding of the cumulative variance:

di = min { d ∈ {1, ..., p − 1} : Σ_{j=1}^{d} λj / Σ_{j=1}^{p} λj ≥ s },

scree-test of Cattell: analyzes the differences between successive eigenvalues in order to find a break in the scree of eigenvalues.
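Both heuristics can be sketched in a few lines. The scree variant below (the relative-gap rule and its 0.2 default) is our own simple illustrative choice, not the exact procedure from the slides:

```python
import numpy as np

def dim_by_threshold(eigvals, s=0.95):
    """Smallest d whose leading eigenvalues explain a fraction >= s
    of the total variance (eigvals in descending order)."""
    ratios = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratios, s) + 1)

def dim_by_scree(eigvals, rel_gap=0.2):
    """Simple scree heuristic: keep dimensions up to the last gap between
    successive eigenvalues that exceeds rel_gap times the largest gap."""
    gaps = -np.diff(eigvals)
    big = np.flatnonzero(gaps >= rel_gap * gaps.max())
    return int(big[-1] + 1)

lam = np.array([5.0, 4.0, 3.0, 0.1, 0.08, 0.05])  # toy eigenvalue scree
d_thr = dim_by_threshold(lam, s=0.95)
d_scree = dim_by_scree(lam)
```

On this toy scree, both heuristics recover an intrinsic dimension of 3.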
Intrinsic dimension estimation
[Figure: for one class, the ordered eigenvalues of Σi together with their cumulative sums (common thresholding, left) and the differences between successive eigenvalues (scree-test of Cattell, right).]
Part 6
Numerical results
Results: artificial data
Method                 Classification rate
HDDA ([ai bi Qi di])   0.958
HDDA ([ai bi Qi d])    0.964
LDA                    0.512
FDA                    0.51
SVM                    0.478

3 Gaussian densities in R^15, with d1 = 3, d2 = 4 and d3 = 5.
In addition, the proportions are very different: π1 = 1/2, π2 = 1/3 and π3 = 1/6.
Results: image categorization
A recent study [LBGGDH03] proposes an approach based on human perception to categorize natural images.
An image is represented by a vector of 49 dimensions; each of these 49 components is the response of the image to a Gabor filter.
Results: image categorization

Data: 328 descriptors in 49 dimensions.
Results:

Method                 Classification rate
HDDA ([ai bi Qi di])   0.857
HDDA ([ai b Qi d])     0.881
QDA                    0.849
LDA                    0.775
FDA (d = k − 1)        0.79
SVM                    0.839

Classification results for the image categorization experiment (leave-one-out).
Results: object recognition
Our approach uses local descriptors (Harris-Laplace + SIFT).
We consider 3 object classes (wheels, seat and handlebars) and 1 background class.
The dataset is made of 1000 descriptors in 128 dimensions:
learning dataset: 500, test dataset: 500.
Results: object recognition
[Figure: ROC curves (true positives vs. false positives) for the SVM, HDDA, FDA and LDA classifiers (left), and for HDDA with error probability < 10^-5 and < 10^-10 (right).]

Classification results for the object recognition experiment.
Results: object recognition
Recognition using HDDA
Recognition using SVM
Part 7
Unsupervised classification
Extension to unsupervised classification

Unsupervised classification aims to organize data into homogeneous classes.
Gaussian mixture models (GMM) are an efficient tool for unsupervised classification:

in Gaussian mixture models, the density of the mixture is:

f(x; θ) = Σ_{i=1}^{k} πi fi(x; µi, Σi),

where θ = {π1, ..., πk, µ1, ..., µk, Σ1, ..., Σk},
the parameter estimation is generally done with the EM algorithm.
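A minimal evaluation of this mixture density (with toy parameters; the single-component check below reduces to the standard normal density):

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(mu, cov) evaluated at x."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def mixture_density(x, priors, means, covs):
    """f(x; theta) = sum_i pi_i f_i(x; mu_i, Sigma_i)."""
    return sum(pi * gaussian_pdf(x, mu, cov)
               for pi, mu, cov in zip(priors, means, covs))

# One standard Gaussian component in dimension 1: f(0) = 1/sqrt(2*pi).
val = mixture_density(np.zeros(1), [1.0], [np.zeros(1)], [np.eye(1)])
```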
Extension to unsupervised classification

Using our model for high-dimensional data, the two main steps of the EM algorithm are:

E step: compute t_ij^(q) = t_i^(q)(xj):

t_ij^(q) = exp(−K_i^(q)(xj)/2) / Σ_{l=1}^{k} exp(−K_l^(q)(xj)/2),

where K_i^(q)(xj) = (1/a_i^(q)) ‖µ_i^(q) − P_i^(q)(xj)‖² + (1/b_i^(q)) ‖xj − P_i^(q)(xj)‖² + d_i^(q) log(a_i^(q)) + (p − d_i^(q)) log(b_i^(q)) − 2 log(π_i^(q)).

M step: classical estimation of πi, µi and Σi; the estimators of ai, bi and Qi are the same as those of HDDA.
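The two steps above can be sketched as follows (responsibilities computed from arbitrary toy costs; only the classical π and µ updates are shown for the M step):

```python
import numpy as np

def e_step(costs):
    """Responsibilities t_ij from the costs K_i(x_j); costs has shape (k, n).
    Subtracting the column-wise minimum keeps the softmax stable."""
    w = np.exp(-(costs - costs.min(axis=0)) / 2)
    return w / w.sum(axis=0)

def m_step_pi_mu(T, X):
    """Classical M-step updates of the proportions and the means.
    T: responsibilities (k, n); X: data (n, p)."""
    nk = T.sum(axis=1)                   # soft class counts
    return nk / T.shape[1], (T @ X) / nk[:, None]

costs = np.array([[0.0, 10.0],           # class 0 fits point 0 well
                  [10.0, 0.0]])          # class 1 fits point 1 well
T = e_step(costs)
pi, mu = m_step_pi_mu(T, np.array([[0.0, 0.0], [2.0, 2.0]]))
```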
References

[BC96] H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91:1743-1748, 1996.
[Bel61] R. Bellman. Adaptive Control Processes. Princeton University Press, 1961.
[Fri89] J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84:165-175, 1989.
[LBGGDH03] H. Le Borgne, N. Guyader, A. Guérin-Dugué, and J. Hérault. Classification of images: ICA filters vs human perception. In 7th International Symposium on Signal Processing and its Applications, number 2, pages 251-254, 2003.
[ST83] D. Scott and J. Thompson. Probability density estimation in higher dimensions. In Proceedings of the Fifteenth Symposium on the Interface, pages 173-179. North Holland-Elsevier Science Publishers, 1983.