
Page 1: The Many Flavors of Penalized Linear Discriminant Analysis

Daniela M. Witten
Assistant Professor of Biostatistics
University of Washington

May 9, 2011
Fourth Erich L. Lehmann Symposium
Rice University

Page 2: Overview

- There has been a great deal of interest in the past 15+ years in penalized regression,

      minimize_β { ||y − Xβ||^2 + P(β) },

  especially in the setting where the number of features p exceeds the number of observations n.
- P is a penalty function. It could be chosen to promote:
  - sparsity: e.g. the lasso, P(β) = ||β||_1 (see the lasso sketch below)
  - smoothness
  - piecewise constancy
  - ...
- How can we extend the concepts developed for regression when p > n to other problems?
- A case study: penalized linear discriminant analysis.
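To make the p > n setting concrete, here is a minimal lasso illustration on simulated data; scikit-learn's Lasso is one standard solver, and all variable names here are illustrative, not from the talk.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                       # more features than observations
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                  # only 5 features truly matter
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# sklearn minimizes (1/2n)||y - X b||^2 + alpha * ||b||_1
fit = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", np.sum(fit.coef_ != 0))
```

With an ℓ1 penalty, most coefficients are driven exactly to zero, which is the sparsity property the rest of the talk tries to carry over to LDA.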

Pages 3–4: The classification problem

- The set-up:
  - We are given n training observations x_1, ..., x_n ∈ R^p, each of which falls into one of K classes.
  - Let y ∈ {1, ..., K}^n contain class memberships for the training observations.
  - Let X be the n × p matrix whose rows are x_1^T, ..., x_n^T.
  - Each column of X (feature) is centered to have mean zero.
- The goal:
  - We wish to develop a classifier, based on the training observations x_1, ..., x_n ∈ R^p, that we can use to classify a test observation x* ∈ R^p.
- A classical approach: linear discriminant analysis.


Page 5: Linear discriminant analysis

Page 6: LDA via the normal model

- Fit a simple normal model to the data:

      x_i | y_i = k ∼ N(μ_k, Σ_w)

- Apply Bayes' theorem to obtain a classifier: assign x* to the class for which δ_k(x*) is largest, where

      δ_k(x*) = x*^T Σ_w^{-1} μ_k − (1/2) μ_k^T Σ_w^{-1} μ_k + log π_k.

  (A plug-in sketch of this rule follows below.)
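A numpy sketch of the resulting rule with plug-in estimates (class means, pooled within-class covariance, class proportions). This assumes Σ̂_w is invertible, i.e. n > p; the function names are illustrative.

```python
import numpy as np

def lda_fit(X, y, K):
    """Plug-in estimates for the normal model: class means mu_k,
    pooled within-class covariance Sigma_w, and priors pi_k."""
    n, p = X.shape
    mus = np.array([X[y == k].mean(axis=0) for k in range(K)])      # K x p
    Sw = sum((X[y == k] - mus[k]).T @ (X[y == k] - mus[k]) for k in range(K)) / n
    priors = np.array([np.mean(y == k) for k in range(K)])
    return mus, Sw, priors

def lda_predict(Xstar, mus, Sw, priors):
    """delta_k(x) = x^T Sw^{-1} mu_k - 0.5 mu_k^T Sw^{-1} mu_k + log pi_k;
    assign each row of Xstar to the class with the largest score."""
    Sw_inv_mu = np.linalg.solve(Sw, mus.T)                          # p x K
    scores = Xstar @ Sw_inv_mu \
        - 0.5 * np.sum(mus.T * Sw_inv_mu, axis=0) + np.log(priors)
    return scores.argmax(axis=1)
```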

Pages 7–10: Fisher's discriminant

A geometric perspective: project the data to achieve good classification.

[Figure-only slide sequence.]


Page 11: Fisher's discriminant and the associated criterion

Look for the discriminant vector β ∈ R^p that maximizes

    β^T Σ̂_b β subject to β^T Σ̂_w β ≤ 1.

- Σ̂_b is an estimate of the between-class covariance matrix.
- Σ̂_w is an estimate of the within-class covariance matrix.
- This is a generalized eigenproblem; one can obtain multiple discriminant vectors.
- To classify, multiply the data by the discriminant vectors and perform nearest-centroid classification in this reduced space.
- If we use K − 1 discriminant vectors, we get the LDA classification rule. If we use fewer than K − 1, we get reduced-rank LDA. (A generalized-eigenproblem sketch follows below.)
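In the classical n > p setting, the criterion can be solved directly as the generalized eigenproblem Σ̂_b v = λ Σ̂_w v; scipy.linalg.eigh accepts a second matrix argument for exactly this. A sketch under that assumption, with illustrative estimates (Σ̂_b below uses the fact that the columns of X are centered):

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, K, n_comp):
    """Top discriminant vectors from the generalized eigenproblem
    Sb v = lambda Sw v (requires Sw nonsingular, i.e. n > p)."""
    n, p = X.shape
    mus = np.array([X[y == k].mean(axis=0) for k in range(K)])
    Sw = sum((X[y == k] - mus[k]).T @ (X[y == k] - mus[k]) for k in range(K)) / n
    Sb = sum(np.mean(y == k) * np.outer(mus[k], mus[k]) for k in range(K))
    # eigh returns eigenvalues in ascending order, with eigenvectors
    # normalized so that v^T Sw v = 1, matching the constraint above
    evals, evecs = eigh(Sb, Sw)
    return evecs[:, ::-1][:, :n_comp]
```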

Page 12: LDA via optimal scoring

- Classification is such a bother. Isn't regression so much nicer?
- It wouldn't make sense to solve

      minimize_β { ||y − Xβ||^2 },

  since y is a vector of categorical class labels, not a quantitative response.
- But can we formulate classification as a regression problem in some other way?

Page 13: LDA via optimal scoring

- Let Y be the n × K matrix of dummy variables, Y_ik = 1_{y_i = k}, and solve

      minimize_{β,θ} { ||Yθ − Xβ||^2 } subject to θ^T Y^T Y θ = 1.

- We are choosing the optimal scoring θ of the class labels in order to recast the classification problem as a regression problem.
- The resulting β is proportional to the discriminant vector in Fisher's discriminant problem.
- Can obtain the LDA classification rule, or reduced-rank LDA. (An alternating sketch follows below.)
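One simple way to solve this is to alternate over the two blocks of variables: for fixed θ, β is an ordinary least-squares fit of Yθ on X; for fixed β, the optimal θ is a least-squares fit in the other direction, rescaled to satisfy the constraint. A sketch for a single score vector (it ignores the orthogonality constraints used in practice to exclude the trivial constant score and to extract further scores); names are illustrative.

```python
import numpy as np

def optimal_scoring(X, Y, n_iter=50):
    """Alternate: beta = argmin ||Y@theta - X@beta||^2 (least squares),
    then theta = argmin over theta subject to theta^T Y^T Y theta = 1."""
    n, K = Y.shape
    YtY = Y.T @ Y
    theta = np.random.default_rng(1).standard_normal(K)
    theta /= np.sqrt(theta @ YtY @ theta)
    for _ in range(n_iter):
        beta = np.linalg.lstsq(X, Y @ theta, rcond=None)[0]
        t = np.linalg.solve(YtY, Y.T @ (X @ beta))   # unnormalized optimal scores
        theta = t / np.sqrt(t @ YtY @ t)             # rescale onto the constraint
    return beta, theta
```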

Page 14: Linear discriminant analysis

Page 15: LDA when p ≫ n

When p ≫ n, we cannot apply LDA directly, because the within-class covariance matrix is singular.

There is also an interpretability issue:

- All p features are involved in the classification rule.
- We want an interpretable classifier: for instance, a classification rule that is a sparse, smooth, or piecewise-constant linear combination of the features.

Page 16: Penalized LDA

- We could extend LDA to the high-dimensional setting by applying (convex) penalties, in order to obtain an interpretable classifier.
- For concreteness, in this talk we will use ℓ1 penalties in order to obtain a sparse classifier.
- Which version of LDA should we penalize, and does it matter?

Page 17: Penalized LDA via the normal model

- The classification rule for LDA is

      x*^T Σ̂_w^{-1} μ̂_k − (1/2) μ̂_k^T Σ̂_w^{-1} μ̂_k,

  where Σ̂_w and μ̂_k denote MLEs for Σ_w and μ_k.
- When p ≫ n, we cannot invert Σ̂_w.
- Can use a regularized estimate of Σ_w, such as the diagonal estimate

      Σ^D_w = diag(σ̂_1^2, σ̂_2^2, ..., σ̂_p^2).

  (A sketch of the resulting diagonal rule follows below.)
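With the diagonal estimate, "inverting" Σ^D_w is just elementwise division by the pooled per-feature within-class variances, so the rule stays well-defined even when p ≫ n. A minimal sketch with illustrative names (equal priors assumed for brevity):

```python
import numpy as np

def diag_lda_scores(Xstar, X, y, K):
    """Diagonal-LDA discriminant scores: replace Sigma_w by
    diag(sigma_1^2, ..., sigma_p^2), the pooled per-feature variances."""
    n, p = X.shape
    mus = np.array([X[y == k].mean(axis=0) for k in range(K)])            # K x p
    s2 = sum(((X[y == k] - mus[k]) ** 2).sum(axis=0) for k in range(K)) / n
    # delta_k(x) = x^T D^{-1} mu_k - 0.5 mu_k^T D^{-1} mu_k
    return Xstar @ (mus / s2).T - 0.5 * np.sum(mus ** 2 / s2, axis=1)
```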

Page 18: Interpretable class centroids in the normal model

- For a sparse classifier, we need zeros in the estimate of Σ_w^{-1} μ_k.
- An interpretable classifier: use Σ^D_w, and estimate μ_k according to

      minimize_{μ_k} { Σ_{j=1}^p Σ_{i: y_i = k} (X_ij − μ_kj)^2 / σ_j^2 + λ ||μ_k||_1 }.

- Apply Bayes' theorem to obtain a classification rule.
- This is the nearest shrunken centroids proposal, which yields a sparse classifier because we are using a diagonal estimate of the within-class covariance matrix and a sparse estimate of the class mean vectors. (A soft-thresholding sketch follows below.)

Citation: Tibshirani et al. 2003, Stat Sinica
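The criterion above separates across coordinates, and each coordinate-wise problem has a closed-form solution: soft-threshold the class mean. The threshold λ σ_j^2 / (2 n_k) below is my derivation from the stated criterion (set the subgradient 2 n_k (μ − x̄_kj)/σ_j^2 + λ sign(μ) to zero); names are illustrative.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def shrunken_centroids(X, y, K, lam):
    """Sparse class means: coordinate-wise, the criterion
    sum_{i: y_i=k} (X_ij - mu_kj)^2 / sigma_j^2 + lam * |mu_kj|
    is solved by soft-thresholding the class mean at lam * sigma_j^2 / (2 n_k)."""
    n, p = X.shape
    mus = np.array([X[y == k].mean(axis=0) for k in range(K)])
    s2 = sum(((X[y == k] - mus[k]) ** 2).sum(axis=0) for k in range(K)) / n
    nk = np.array([np.sum(y == k) for k in range(K)])
    return soft(mus, lam * s2 / (2 * nk[:, None]))
```

Features whose means are shrunk to zero in every class drop out of the classification rule entirely, which is exactly the sparsity claimed on the slide.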

Page 19: Penalized LDA via optimal scoring

- We can easily extend the optimal scoring criterion:

      minimize_{β,θ} { (1/n) ||Yθ − Xβ||^2 + λ||β||_1 } subject to θ^T Y^T Y θ = 1.

- An efficient iterative algorithm will find a local optimum. (A sketch follows below.)
- We get sparse discriminant vectors, and hence classification using a subset of the features.

Citation: Clemmensen, Hastie, Witten and Ersboll 2011, Submitted
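The alternating scheme from the unpenalized sketch carries over, with the least-squares β-update replaced by a lasso fit. A sketch for a single score vector, with the same caveats as before (the exact algorithm in Clemmensen et al. differs in details such as an elastic-net β-step):

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_optimal_scoring(X, Y, lam, n_iter=20):
    """Alternate a lasso fit of Y@theta on X with the normalized theta-update;
    converges to a local optimum of the penalized criterion."""
    n, K = Y.shape
    YtY = Y.T @ Y
    theta = np.random.default_rng(1).standard_normal(K)
    theta /= np.sqrt(theta @ YtY @ theta)
    # sklearn's Lasso minimizes (1/2n)||y - Xb||^2 + alpha||b||_1,
    # so alpha = lam/2 matches the (1/n)||.||^2 + lam||.||_1 criterion
    lasso = Lasso(alpha=lam / 2, fit_intercept=False)
    for _ in range(n_iter):
        beta = lasso.fit(X, Y @ theta).coef_
        t = np.linalg.solve(YtY, Y.T @ (X @ beta))
        nrm = np.sqrt(t @ YtY @ t)
        if nrm == 0:                 # beta was shrunk entirely to zero
            break
        theta = t / nrm
    return beta, theta
```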

Page 20: Penalized LDA via Fisher's discriminant problem

- A simple formulation:

      maximize_β { β^T Σ̂_b β − λ||β||_1 } subject to β^T Σ̃_w β ≤ 1,

  where Σ̃_w is some full-rank estimate of Σ_w.
- A non-convex problem, because β^T Σ̂_b β isn't concave in β.
- Can we find a local optimum?

Citation: Witten and Tibshirani 2011, JRSSB

Pages 21–28: Maximizing a function via minorization

[Figure-only slide sequence illustrating the minorize-maximize iterations.]


Page 29: Minorization

- Key point: choose a minorizing function that is easy to maximize.
- Minorization allows us to efficiently find a local optimum for Fisher's discriminant problem with any convex penalty. (A sketch follows below.)
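A sketch of one such minorize-maximize scheme for the penalized Fisher problem, assuming a diagonal full-rank estimate Σ̃_w. Since Σ̂_b is positive semidefinite, β^T Σ̂_b β lies above its tangent 2 β^T Σ̂_b β_old − β_old^T Σ̂_b β_old, so maximizing the tangent minus the ℓ1 penalty over the ellipsoid increases the objective; that subproblem has a soft-thresholding solution. This update is my simplified reconstruction of the minorization idea, not necessarily the exact algorithm of Witten and Tibshirani (2011).

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def penalized_fisher_mm(Sb, sw_diag, lam, n_iter=100):
    """MM for: maximize beta^T Sb beta - lam*||beta||_1
    subject to sum_j sw_diag[j] * beta_j^2 <= 1 (diagonal Sigma_w estimate).
    Each step maximizes the tangent minorant 2*beta^T Sb beta_old - lam*||beta||_1;
    by the KKT conditions its solution is a soft-threshold, divided by the
    ellipsoid weights, then rescaled onto the constraint boundary."""
    p = Sb.shape[0]
    beta = np.ones(p) / np.sqrt(np.sum(sw_diag))   # feasible starting point
    for _ in range(n_iter):
        d = Sb @ beta                              # gradient of the quadratic at beta_old
        s = soft(2 * d, lam) / sw_diag             # per-coordinate maximizer, up to scale
        nrm = np.sqrt(np.sum(sw_diag * s ** 2))
        if nrm == 0:
            return np.zeros(p)                     # penalty kills every coordinate
        beta = s / nrm
    return beta
```

Each iteration can only increase the (non-convex) objective, so the sequence converges to a local optimum, which is exactly the guarantee claimed on the slide.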

Page 30: Connections between flavors of penalized LDA

Page 31: Connections between flavors of penalized LDA

1. Normal model + ℓ1: use a diagonal estimate for Σ_w and then apply an ℓ1 penalty to the class mean vectors.
2. Optimal scoring + ℓ1: apply an ℓ1 penalty to the discriminant vectors.
3. Fisher's discriminant problem + ℓ1: apply an ℓ1 penalty to the discriminant vectors.

So are (1) and (3) different? And are (2) and (3) the same?

Page 32: Normal Model + ℓ1 and Fisher's + ℓ1

Page 33: Normal Model + ℓ1 and Fisher's + ℓ1

[The matrix referred to below is shown in a figure on the original slide.]

- Normal model + ℓ1 penalizes the elements of this matrix.
- Fisher's + ℓ1 penalizes the left singular vectors.
- Clearly these are different...
- ...but if K = 2, then they are (essentially) the same.

Page 34: Normal Model + ℓ1 and Fisher's + ℓ1

[Figure-only slide.]

Page 35: Fisher's + ℓ1 and Optimal Scoring + ℓ1

Both problems involve "penalizing the discriminant vectors," so they must be the same, right?

Page 36: Fisher's + ℓ1 and Optimal Scoring + ℓ1

Theorem: For any value of the tuning parameter for FD + ℓ1, there exists some tuning parameter for OS + ℓ1 such that the solution to one problem is a critical point of the other.

- In other words, there is a correspondence between the critical points, though not necessarily between the solutions.
- So the resulting "sparse discriminant vectors" may be different!

Page 37: Connections

Page 38: Pros and Cons

Penalized LDA via the normal model:

- (+) In the case of a diagonal estimate for Σ_w and ℓ1 penalties on the mean vectors, it is well-motivated and simple.
- (−) No obvious extension to non-diagonal estimates of Σ_w.
- (−) Cannot obtain a "low-rank" classifier.

Penalized LDA via Fisher's discriminant problem:

- (+) Any convex penalties can be applied to the discriminant vectors.
- (+) Can use any full-rank estimate of Σ_w.
- (+) Can obtain a "low-rank" classifier.

Penalized LDA via optimal scoring:

- (+) An extension of regression.
- (+) Any convex penalties can be applied to the discriminant vectors.
- (+) Can obtain a "low-rank" classifier.
- (−) Cannot use an arbitrary estimate of Σ_w.
- (−) The usual motivation for OS is that it yields the same discriminant vectors as Fisher's problem. This is not true when penalized!

Page 39: Conclusions

- A sensible way to regularize regression when p ≫ n:

      minimize_β { ||y − Xβ||^2 + P(β) }.

- One could argue that this is the way to penalize regression.
- But as soon as we step away from regression, even to a closely related problem like LDA, the situation becomes much more complex: there is no longer a single way to approach the problem.
- And the situation only becomes more complex for more complex statistical methods!
- We need a principled framework to determine which penalized extension of an established statistical method is "best."

Page 40: References

- Witten and Tibshirani (2011). Penalized classification using Fisher's linear discriminant. To appear in Journal of the Royal Statistical Society, Series B.
- Clemmensen, Hastie, Witten, and Ersboll (2011). Sparse discriminant analysis. Submitted.