
Linear Discriminant Analysis (LDA)

Yang Xiaozhou

March 18, 2020

Industrial Systems Engineering and Management, NUS


Table of contents

1. LDA

2. Reduced-rank LDA

3. Fisher’s LDA

4. Flexible Discriminant Analysis


LDA


LDA and its applications

LDA is used as a tool for classification:

• Bankruptcy prediction: Edward Altman's 1968 model

• Face recognition: the learnt features are called Fisherfaces

• Biomedical studies: discriminating different stages of a disease

• and many more

It has a strong empirical track record:

• Among the top three classifiers for 11 of the 22 datasets studied in the STATLOG project (Michie et al., 1994)


Classification by discriminant analysis

Consider a generic classification problem:

• $K$ groups: $G = 1, \ldots, K$, each with a density $f_k(x)$ on $\mathbb{R}^p$.

• A discriminant rule divides the space into $K$ disjoint regions $R_1, \ldots, R_K$ and allocates $x$ to $\Pi_j$ if $x \in R_j$.

• Maximum likelihood rule: allocate $x$ to $\Pi_j$ if $j = \arg\max_i f_i(x)$.

• Bayesian rule with class priors $\pi_1, \ldots, \pi_K$: allocate $x$ to $\Pi_j$ if $j = \arg\max_i \pi_i f_i(x)$.
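The Bayesian rule is direct to implement once the class densities are specified. Below is a minimal sketch for Gaussian class densities; the means, covariances, and priors are made-up illustrative values, not taken from the slides.

```python
# A minimal sketch of the Bayesian discriminant rule with Gaussian class
# densities. Means, covariances, and priors below are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal

means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # mu_k (assumed)
covs   = [np.eye(2), np.eye(2) * 2.0]                   # Sigma_k (assumed)
priors = [0.7, 0.3]                                     # pi_k (assumed)

def allocate(x):
    """Allocate x to the class j maximizing pi_j * f_j(x)."""
    scores = [p * multivariate_normal(m, c).pdf(x)
              for p, m, c in zip(priors, means, covs)]
    return int(np.argmax(scores))

print(allocate(np.array([2.0, 2.0])))  # index of the allocated class
```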


Gaussian as class density

If we assume the data come from a Gaussian distribution:

$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\}$$

• The parameters $\pi_k$, $\mu_k$, $\Sigma_k$ are estimated from the training data.

• Looking at the log-likelihood: allocate $x$ to $\Pi_j$ if $j = \arg\max_i \delta_i(x)$, where $\delta_i(x) = \log f_i(x) + \log \pi_i$ is called the discriminant function.

Assuming equal covariance among the $K$ classes gives LDA:

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$

Without that assumption on the class covariances, we get QDA.
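The LDA discriminant function can be computed directly from estimated class means, priors, and a pooled covariance. A minimal sketch on made-up two-class data (all names and values are illustrative assumptions):

```python
# A minimal sketch of the LDA discriminant function delta_k(x) with a pooled
# covariance estimate. The two-class toy data are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)

classes = np.unique(y)
priors = np.array([np.mean(y == k) for k in classes])        # pi_k
means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k
# Pooled within-class covariance (the equal-covariance assumption of LDA):
Sigma = sum(np.cov(X[y == k].T) * (np.sum(y == k) - 1) for k in classes)
Sigma /= (len(y) - len(classes))
Sigma_inv = np.linalg.inv(Sigma)

def delta(x):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k"""
    return np.array([x @ Sigma_inv @ m - 0.5 * m @ Sigma_inv @ m + np.log(p)
                     for m, p in zip(means, priors)])

print(np.argmax(delta(np.array([1.5, 1.5]))))  # predicted class index
```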


Decision boundary: LDA vs QDA

• Between any pair of classes $k$ and $\ell$, the decision boundary is $\{x : \delta_k(x) = \delta_\ell(x)\}$.

• LDA gives a linear boundary; QDA gives a quadratic boundary.

• The number of parameters to estimate rises quickly in QDA:

• LDA: $(K - 1)(p + 1)$

• QDA: $(K - 1)\{p(p + 3)/2 + 1\}$

For example, with $K = 3$ and $p = 10$: LDA needs $2 \times 11 = 22$ parameters, while QDA needs $2 \times 66 = 132$.


Reduced-rank LDA


Reduced-rank LDA

Computation for LDA:

• Sphere the data: $x^* \leftarrow D^{-1/2} U^T x$, where $\hat{\Sigma} = U D U^T$.

• Classify $x$ to the closest centroid (modulo priors) in the transformed space:

$$\delta_k(x^*) = x^{*T} \mu_k - \frac{1}{2} \mu_k^T \mu_k + \log \pi_k,$$

with $\mu_k$ denoting the class centroid in the transformed space.

Inherent dimension reduction in LDA:

• The $K$ centroids lie in a subspace of dimension at most $K - 1$:

$$H_{K-1} = \mu_1 \oplus \mathrm{span}\{\mu_i - \mu_1,\ 2 \leq i \leq K\}$$

• Classification is done by distance comparison in $H_{K-1}$.

• This is a $p \to K - 1$ dimension reduction, assuming $p > K$.
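Sphering is a one-line transformation once the eigendecomposition $\hat\Sigma = UDU^T$ is available. A minimal sketch on toy data, using the overall covariance as a stand-in for the pooled within-class estimate (names and data are illustrative assumptions):

```python
# A minimal sketch of sphering via the eigendecomposition Sigma = U D U^T.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.5]], size=200)

Sigma = np.cov(X.T)                  # stand-in for the pooled covariance
D, U = np.linalg.eigh(Sigma)         # Sigma = U diag(D) U^T
X_star = X @ U @ np.diag(D ** -0.5)  # x* = D^{-1/2} U^T x, applied row-wise

print(np.cov(X_star.T).round(6))     # approximately the identity matrix
```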


Reduced-rank LDA

We can look for an even smaller subspace $H_L \subseteq H_{K-1}$:

• Rule: the class centroids of the sphered data should have maximum separation in this subspace in terms of variance.

PCA on the class centroids gives the coordinates of $H_L$ (see the sketch below):

1. Find the class means and the pooled within-class covariance: $M$, $W$.

2. Sphere the centroids: $M^* = M W^{-1/2}$.

3. Obtain the eigenvectors $v_\ell^*$ in $V^*$ of $\mathrm{cov}(M^*) = V^* D_B V^{*T}$.

4. Obtain the new (discriminant) variables $Z_\ell = (W^{-1/2} v_\ell^*)^T X$, $\ell = 1, \ldots, L$.

Dimension reduction: $X_{N \times p} \to Z_{N \times L}$.
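The four steps can be traced in code. A minimal sketch on toy data, taking $W^{-1/2}$ as the symmetric inverse square root (data and names are illustrative assumptions):

```python
# A minimal sketch of the four reduced-rank LDA steps on toy data.
import numpy as np

rng = np.random.default_rng(2)
K, p, n = 3, 4, 50
mus = rng.normal(0, 3, (K, p))
X = np.vstack([rng.normal(mu, 1.0, (n, p)) for mu in mus])
y = np.repeat(np.arange(K), n)

# 1. Class means M and pooled within-class covariance W
M = np.array([X[y == k].mean(axis=0) for k in range(K)])
W = sum(np.cov(X[y == k].T) * (n - 1) for k in range(K)) / (len(y) - K)

# 2. Sphere the centroids: M* = M W^{-1/2}
d, U = np.linalg.eigh(W)
W_inv_sqrt = U @ np.diag(d ** -0.5) @ U.T
M_star = M @ W_inv_sqrt

# 3. Eigenvectors of the covariance of the sphered centroids
_, V_star = np.linalg.eigh(np.cov(M_star.T))
V_star = V_star[:, ::-1]              # largest eigenvalues first

# 4. Discriminant variables Z_l = (W^{-1/2} v_l*)^T X, l = 1, ..., L
L = K - 1
Z = X @ (W_inv_sqrt @ V_star[:, :L])
print(Z.shape)                        # (150, 2): X_{N x p} -> Z_{N x L}
```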


Reduced-rank LDA vs PCA

Wine dataset: 13 variables to distinguish three types of wine.

[Figure: left, "LDA for Wine dataset" — the data projected onto Discriminant Coordinates 1 and 2; right, "PCA for Wine dataset" — the data projected onto PC 1 and PC 2.]
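The comparison can be reproduced with scikit-learn's bundled Wine data; the plotting details below are assumptions about how the original figure was drawn, not a copy of it.

```python
# A hedged sketch of the LDA-vs-PCA comparison on the Wine dataset
# (13 variables, 3 wine types).
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)

Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
Z_pca = PCA(n_components=2).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(Z_lda[:, 0], Z_lda[:, 1], c=y)
ax1.set(title="LDA for Wine dataset", xlabel="Discriminant Coordinate 1",
        ylabel="Discriminant Coordinate 2")
ax2.scatter(Z_pca[:, 0], Z_pca[:, 1], c=y)
ax2.set(title="PCA for Wine dataset", xlabel="PC 1", ylabel="PC 2")
plt.show()
```

The discriminant coordinates separate the three classes cleanly, while the leading principal components, computed without class labels, are dominated by the variables with the largest raw variance.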


Fisher’s LDA


Fisher’s LDA

The rule on the previous slide was in fact proposed by Fisher:

• Find the linear combination $Z = a^T X$ that has maximum between-class variance relative to its within-class variance:

$$a = \arg\max_a \frac{a^T B a}{a^T W a}.$$

• The optimization is solved by a generalized eigenvalue problem:

$$W^{-1} B a = \lambda a.$$

• The eigenvectors $a_\ell$ of $W^{-1} B$ are the same as the $W^{-1/2} v_\ell^*$ above. Fisher arrives at this without the Gaussian assumption.
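Fisher's directions come straight out of a generalized symmetric eigensolver. A minimal sketch on toy data (the scatter matrices and data are illustrative assumptions):

```python
# A minimal sketch: Fisher's directions from the generalized eigenproblem
# B a = lambda W a, solved with scipy.linalg.eigh.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
K, p, n = 3, 4, 50
mus = rng.normal(0, 3, (K, p))
X = np.vstack([rng.normal(mu, 1.0, (n, p)) for mu in mus])
y = np.repeat(np.arange(K), n)

xbar = X.mean(axis=0)
M = np.array([X[y == k].mean(axis=0) for k in range(K)])
W = sum((X[y == k] - M[k]).T @ (X[y == k] - M[k]) for k in range(K))
B = sum(n * np.outer(M[k] - xbar, M[k] - xbar) for k in range(K))

# eigh(B, W) solves B a = lambda W a, i.e. W^{-1} B a = lambda a
lam, A = eigh(B, W)
a = A[:, -1]          # direction maximizing a^T B a / a^T W a
print(lam[-1], a)     # the maximal ratio and its argmax direction
```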


Fisher’s LDA

Digits dataset: 64 variables to distinguish 10 handwritten digits.

• The top four of Fisher’s discriminant variables are shown.

• For example, coordinate 1 contrasts the 4’s with the 2’s and 3’s.

[Figure: the digits projected onto Fisher's discriminant coordinates — left: Discriminant Coordinate 1 vs 2; right: Discriminant Coordinate 3 vs 4.]


Summary of LDA

Virtues of LDA:

1. Simple prototype classifier: easy to interpret.

2. The decision boundary is linear: easy to describe and implement.

3. Dimension reduction: provides an informative low-dimensional view of the data.

Shortcomings of LDA:

1. Linear decision boundaries may not adequately separate the classes; support for more general boundaries is desired.

2. In high-dimensional settings, LDA estimates too many parameters; a regularized version of LDA is desired.


Flexible Discriminant Analysis


Beyond linear boundaries: FDA

Flexible discriminant analysis (FDA) can tackle the first shortcoming.

[Figure: three-class data (y = 1, 2, 3) in the (X1, X2) plane — left: LDA Decision Boundaries; right: QDA Decision Boundaries.]

Idea: recast LDA as a regression problem, then apply the same techniques used to generalize linear regression.


LDA as a regression problem

We can recast LDA as a regression problem via optimal scoring.

Set-up:

• The response $G$ falls into one of $K$ classes, $\mathcal{G} = \{1, \ldots, K\}$.

• $X$ is the $p$-dimensional feature vector.

Suppose a scoring function $\theta: \mathcal{G} \mapsto \mathbb{R}^1$ such that the scores are optimally predicted by regression on $X$, e.g. by a linear map $\eta(X) = X^T \beta$.

In general, we select $L \leq K - 1$ such scoring functions and find the optimal {score, linear map} pairs that minimize the average squared residual:

$$\mathrm{ASR} = \frac{1}{N} \sum_{\ell=1}^{L} \left[ \sum_{i=1}^{N} \left( \theta_\ell(g_i) - x_i^T \beta_\ell \right)^2 \right]$$


LDA via optimal scoring

The procedure for LDA via optimal scoring (sketched in code below):

1. Initialize. Build the response indicator matrix $Y_{N \times K}$, where $Y_{ij} = 1$ if the $i$-th sample comes from the $j$-th class, and 0 otherwise.

2. Multivariate regression. Regress $Y$ on $X$ under ASR to get the fit $\hat{Y} = P_X Y$ and the regression coefficients $B$.

3. Optimal scores. Obtain the $L$ largest eigenvectors $\Theta$ of $Y^T P_X Y$.

4. Update. Update the coefficients: $B \leftarrow B \Theta$.

• The optimal linear map is $\eta(X) = B^T X$.

• The columns of $B$, namely $\beta_1, \ldots, \beta_L$, are the same as the $a_\ell$'s in LDA up to a constant.

This equivalence with a regression problem provides the starting point for generalizing LDA to a more flexible, nonparametric version.
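A rough sketch of the four steps with plain least squares, following the slide literally; the score normalization used in the references is omitted, and the data and names are illustrative assumptions.

```python
# A minimal sketch of LDA via optimal scoring with ordinary least squares.
import numpy as np

rng = np.random.default_rng(4)
K, p, n = 3, 4, 50
mus = rng.normal(0, 3, (K, p))
X = np.vstack([rng.normal(mu, 1.0, (n, p)) for mu in mus])
g = np.repeat(np.arange(K), n)
X = X - X.mean(axis=0)            # center the features (no intercept needed)
N = len(g)

# 1. Initialize: indicator matrix Y_{N x K}
Y = np.zeros((N, K))
Y[np.arange(N), g] = 1.0

# 2. Multivariate regression: B_hat minimizes ||Y - X B||^2, so Yhat = P_X Y
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ B_hat

# 3. Optimal scores: L largest eigenvectors Theta of Y^T P_X Y = Y^T Yhat.
#    With centered X the trivial constant score has eigenvalue 0, so the
#    top L eigenvectors are automatically the non-trivial ones.
L = K - 1
w, V = np.linalg.eigh(Y.T @ Y_hat)
Theta = V[:, ::-1][:, :L]

# 4. Update the coefficients: B <- B Theta
B = B_hat @ Theta
eta = X @ B                       # optimal linear map eta(X) = B^T X
print(eta.shape)                  # (150, 2)
```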


From LDA to FDA

Extend LDA by generalizing the linear map $\eta(X) = B^T X$ to

$$\eta(X) = B^T h(X),$$

where $h$ can come from:

• Generalized additive fits

• Spline functions

• MARS models

• Projection pursuit

• Neural networks

The idea behind FDA: LDA in an enlarged space (see the sketch below).
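The "LDA in an enlarged space" idea can be imitated with any fixed basis expansion $h$. A hedged sketch using a quadratic polynomial basis, which is an illustrative choice, not the smoothing splines used later:

```python
# A hedged sketch of LDA in an enlarged space: expand the inputs with a
# quadratic basis h(X), then run ordinary LDA there. The result is a
# quadratic decision boundary in the original coordinates.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_wine(return_X_y=True)

fda_like = make_pipeline(StandardScaler(),
                         PolynomialFeatures(degree=2, include_bias=False),
                         LinearDiscriminantAnalysis())
print(fda_like.fit(X, y).score(X, y))   # training accuracy
```

A linear boundary in the enlarged space maps back to a nonlinear boundary in the original space, which is exactly how FDA moves beyond LDA's first shortcoming.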


FDA via optimal scoring

The procedure for FDA is the same as for LDA via optimal scoring, with one change:

• Replace $P_X$ with $S_{h(X)}$, the nonparametric regression operator.

Initialize → Multivariate regression → Optimal scores → Update.

• Optimal fit: $\eta(X)$.

• Fitted class centroids: $\bar{\eta}_k = \sum_{g_i = k} \eta(x_i) / N_k$.

A new observation $x$ is classified to the class $k$ that minimizes

$$\delta(x, k) = \left\| D \left( \eta(x) - \bar{\eta}_k \right) \right\|^2,$$

where $D$ is the constant factor linking the optimal fits to the LDA coordinates.
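A minimal sketch of this classification rule, taking $D$ as the identity for simplicity (an assumption) and treating $\eta$ as any fitted map from inputs to discriminant fits, e.g. the one sketched earlier:

```python
# A minimal sketch of the nearest-fitted-centroid rule, with D = identity.
import numpy as np

def classify(eta_x, eta_train, g_train):
    """Assign a new fit eta(x) to the class with the nearest fitted centroid."""
    classes = np.unique(g_train)
    centroids = np.array([eta_train[g_train == k].mean(axis=0)
                          for k in classes])           # eta_bar_k
    dists = np.sum((centroids - eta_x) ** 2, axis=1)   # ||eta(x) - eta_bar_k||^2
    return classes[np.argmin(dists)]

# Toy usage with random fits (illustrative only):
rng = np.random.default_rng(5)
eta_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
g_train = np.repeat([0, 1], 30)
print(classify(np.array([2.5, 2.9]), eta_train, g_train))  # likely class 1
```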


LDA vs FDA

Data: three classes with mixture-Gaussian densities.

FDA uses an additive model based on smoothing splines of the form

$$\alpha + \sum_{j=1}^{p} f_j(X_j).$$

[Figure: the same three-class data in the (X1, X2) plane — left: LDA Decision Boundaries; right: FDA Decision Boundaries.]


References

1. Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning. New York: Springer Series in Statistics.

2. Hastie, T., Tibshirani, R., & Buja, A. (1994). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428), 1255–1270.

3. Hastie, T., Tibshirani, R., & Buja, A. (1995). Flexible discriminant and mixture models.

4. Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate Analysis. Academic Press.


Questions?

Contact: [email protected]
