Molecular diagnosis

Florian Markowetz
Max Planck Institute for Molecular Genetics, Computational Diagnostics Group, Berlin, Germany
Berlin Center for Genome Based Bioinformatics

IPM workshop, Tehran, April 2005


Personalized medicine

Which disease does the patient have?

Which treatment should the patient get?

Will the patient develop side effects?

These questions

1. refer to individuals,

2. address predictive problems,

3. link directly to decisions.

We are interested in individuals, not in gene function (that is functional genomics).

Florian Markowetz, Molecular diagnosis, 2005 April 1


DNA → RNA → Protein


Microarray data

www.affymetrix.com


Why use microarrays?

Two major advantages:

1. Bird's-eye view: Microarrays allow screening thousands of genes without a priori knowledge of which genes might be involved.

2. Multivariate signatures: A group of genes together may be a more accurate and robust indicator of patient outcome than a single gene.


Overview

1. Classification in high dimensions → a fight against overfitting

2. Discriminant Analysis → Gaussian assumption, feature selection

3. Support vector machines → Maximal margin hyperplanes, non-linear similarity measures

4. Model selection and assessment → Traps and pitfalls, or: How to cheat.

5. Interpretation of results → what do classifiers teach us about biology?


Molecular diagnosis = a classification problem

We measure $p$ genes on $N$ patients. Each microarray is a profile $x^{(i)} \in \mathbb{R}^p$. With each profile comes a label $y_i \in K = \{+1, -1\}$.

Assume a data-generating distribution $\Pr(X, Y)$, which is unknown!

What we have are samples from $\Pr$, called a training set:

$$D = \{(x^{(1)}, y_1), \ldots, (x^{(N)}, y_N)\}$$

A classification rule $c : \mathbb{R}^p \to K$ splits $\mathbb{R}^p$ into one region for each class.

Challenge: Find a rule $c$ that classifies future patients well.


How to measure success

A loss function quantifies the loss incurred by assigning the label $c(x)$ to $x$ when the true label is $y$:

$$l(x, c(x), y) : \mathbb{R}^p \times K \times K \longrightarrow [0, \infty)$$

Risk is the expected loss over the whole population:

$$R[c] = E\, l(X, c(X), Y) = \int l(x, c(x), y)\, d\Pr(x, y)$$


0/1-loss

The simplest loss function for classification is 0/1-loss:

$$l(x, c(x), y) = \begin{cases} 0 & \text{if } c(x) = y \\ 1 & \text{if } c(x) \neq y \end{cases}$$

With this loss function we get

$$R[c] = \Pr(c(x) \neq y)$$


A first estimate for the risk

The empirical risk (aka training error) approximates $\Pr(X, Y)$ by the empirical distribution $\widehat{\Pr}(X, Y)$ of the training set:

$$\int l(x, c(x), y)\, d\Pr(x, y) \approx \frac{1}{N} \sum_{i=1}^{N} l(x^{(i)}, c(x^{(i)}), y_i) =: R_{emp}[c]$$

First idea: Find a classifier $\hat{c}$ minimizing the empirical risk!

$$\hat{c} = \arg\min_c R_{emp}[c]$$
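Under 0/1-loss the empirical risk is simply the fraction of misclassified training profiles. A minimal Python sketch of that computation follows; the threshold rule and the four toy profiles are hypothetical and only serve as an illustration.

```python
from typing import Callable, Sequence

Profile = Sequence[float]


def zero_one_loss(prediction: int, label: int) -> int:
    """0/1-loss: 0 for a correct prediction, 1 for a wrong one."""
    return 0 if prediction == label else 1


def empirical_risk(
    classifier: Callable[[Profile], int],
    profiles: Sequence[Profile],
    labels: Sequence[int],
) -> float:
    """Average 0/1-loss of `classifier` over the training set."""
    losses = [zero_one_loss(classifier(x), y) for x, y in zip(profiles, labels)]
    return sum(losses) / len(losses)


def threshold_classifier(x: Profile) -> int:
    """Hypothetical rule: predict +1 if the first gene is above 0.5."""
    return 1 if x[0] > 0.5 else -1


# Toy training set: four patients, two genes each (illustrative values only).
profiles = [(0.9, 0.1), (0.7, 0.3), (0.2, 0.8), (0.6, 0.4)]
labels = [1, 1, -1, -1]

risk = empirical_risk(threshold_classifier, profiles, labels)
print(risk)  # one of four patients is misclassified
```

Minimizing this quantity over a set of candidate classifiers is the "first idea" of the slide.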


The trivial solution

A trivial classifier with zero empirical risk is

$$c_{triv}(x) = \begin{cases} y_i & \text{if } x = x^{(i)} \text{ for some } (x^{(i)}, y_i) \in D \\ 1 & \text{else.} \end{cases}$$

OK, this is a bit artificial.

But still: in small-sample situations, learning single data points instead of general features of the data is the main problem. This is called overfitting.
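The trivial classifier can be written down directly, which makes the gap between training error and error on new patients visible. A minimal sketch, assuming made-up toy data in which the true label is the sign of the first gene:

```python
from typing import Dict, Sequence, Tuple

Profile = Tuple[float, ...]


def fit_trivial(profiles: Sequence[Profile], labels: Sequence[int]) -> Dict[Profile, int]:
    """'Training' is just storing every profile together with its label."""
    return dict(zip(profiles, labels))


def predict_trivial(memory: Dict[Profile, int], x: Profile) -> int:
    """Return the stored label for known profiles and +1 for everything else."""
    return memory.get(x, 1)


def error_rate(memory, profiles, labels) -> float:
    """Fraction of profiles whose predicted label differs from the true one."""
    wrong = sum(predict_trivial(memory, x) != y for x, y in zip(profiles, labels))
    return wrong / len(labels)


# Toy data (illustrative): the true label is the sign of the first gene.
train_x = [(-2.0, 0.3), (-1.0, 1.2), (1.0, -0.4), (2.0, 0.9)]
train_y = [-1, -1, 1, 1]
test_x = [(-1.5, 0.1), (-0.5, 0.7), (0.5, 0.2), (1.5, -1.0)]
test_y = [-1, -1, 1, 1]

memory = fit_trivial(train_x, train_y)
train_error = error_rate(memory, train_x, train_y)
test_error = error_rate(memory, test_x, test_y)
print(train_error, test_error)
```

The stored profiles are reproduced perfectly, while every unseen profile falls back to +1, so the rule has learned nothing about the relation between expression and label.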


From Under- to Over-fitting

Overfitting: Perfect separation of training data may not generalize

well to future patients.


Bias-variance trade-off

Figure from [4].


How to measure model complexity

We have to restrict the set of functions to one whose capacity (or complexity) suits the amount of available training data.

A very prominent capacity concept [7, 8]: the Vapnik-Chervonenkis (VC) dimension.

Shattering points: With labels in $\{+1, -1\}$ and $N$ points, there are at most $2^N$ different labelings. A rich function class may be able to realize all of them. It is then said to shatter the $N$ points.

The VC dimension is defined as the largest $N$ such that there exists a set of $N$ points the function class can shatter, and as $\infty$ if there is no largest such $N$.


Shattering p + 1 points in p dimensions

Curse of dimensionality: if $p \gg N$, even linear methods are too complex.
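That hyperplanes can shatter $p + 1$ points in $p$ dimensions can be checked by brute force. The sketch below uses one particular point set (the origin plus the $p$ unit vectors) and an explicit weight construction; both are choices made for illustration, not taken from the slide.

```python
from itertools import product
from typing import List, Sequence, Tuple


def simplex_points(p: int) -> List[Tuple[float, ...]]:
    """The origin plus the p unit vectors: p + 1 points in R^p."""
    origin = tuple(0.0 for _ in range(p))
    units = [tuple(1.0 if j == i else 0.0 for j in range(p)) for i in range(p)]
    return [origin] + units


def hyperplane_for(labels: Sequence[int]) -> Tuple[List[float], float]:
    """Weights w and offset b such that sign(w.x + b) reproduces the labels.

    labels[0] belongs to the origin, labels[j + 1] to the j-th unit vector.
    """
    b = float(labels[0])                    # f(origin) = b
    w = [2.0 * y - b for y in labels[1:]]   # f(e_j) = w_j + b = 2 * y_j
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1


def can_shatter(p: int) -> bool:
    """Check that every one of the 2^(p+1) labelings is realized."""
    points = simplex_points(p)
    for labels in product([-1, 1], repeat=p + 1):
        w, b = hyperplane_for(labels)
        if any(predict(w, b, x) != y for x, y in zip(points, labels)):
            return False
    return True


shattered = can_shatter(3)   # 16 labelings of 4 points in R^3
print(shattered)
```

With thousands of genes and a few dozen patients, a linear rule can therefore reproduce any labeling of such a training set, so a perfect fit on the training data carries no evidence by itself.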


Means to fight overfitting in high dimensions

1. Dimension reduction: e.g. Principal Components Analysis: find the directions with highest variance in the data.

2. Feature selection: gene-wise filtering or shrinkage.

3. Regularization: introduce additional constraints into the objective function.


Two roads to classification

1. Model class probabilities → Gaussian assumption leads to Discriminant Analysis.

2. Model class boundaries directly → Optimal Separating Hyperplanes → SVM


Discriminant Analysis


Bayes classifier

Imagine we knew the data-generating distribution.

To minimize the risk with 0/1-loss we would assign a new point to the most likely class:

$$c(x) = \arg\max_k \Pr(Y = k \mid X = x)$$

This is known as the Bayes classifier. Its error rate is called the Bayes rate.

In real-world problems, we do not know the data-generating distribution. But we can still make an educated guess . . .
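For a distribution that is fully known, both the Bayes classifier and the Bayes rate can be computed exactly. The sketch below assumes a hypothetical one-gene example with two equally likely classes and unit-variance Normal densities centered at +1 and -1; none of these numbers come from the talk.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x: float, mean: float, sd: float) -> float:
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Assumed (hypothetical) known distribution: one gene, two classes.
PRIOR = {+1: 0.5, -1: 0.5}
MEAN = {+1: 1.0, -1: -1.0}
SD = 1.0


def bayes_classifier(x: float) -> int:
    """Pick the class with the largest posterior Pr(Y = k | X = x).

    The posterior is proportional to density times prior, so the common
    normalizing constant can be ignored.
    """
    return max(PRIOR, key=lambda k: normal_pdf(x, MEAN[k], SD) * PRIOR[k])


# With equal priors and variances the decision threshold is x = 0, and the
# Bayes rate is the probability mass each class puts on the wrong side.
bayes_rate = (
    PRIOR[+1] * normal_cdf(0.0, MEAN[+1], SD)
    + PRIOR[-1] * (1.0 - normal_cdf(0.0, MEAN[-1], SD))
)
print(bayes_classifier(0.3), bayes_classifier(-0.3), round(bayes_rate, 4))
```

Even this optimal rule misclassifies a fixed share of patients because the two class densities overlap; no classifier can do better on this distribution.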


Comparing Gaussian likelihoods

Assumption: each group of patients is well described by a Normal

density.

Training: estimate mean and covariance matrix for each group.

Prediction: assign new patient to group with higher likelihood.

Constraints on covariance structure lead to different forms of

discriminant analysis.


Gaussian likelihoods

Model each class density as a multivariate Gaussian [2]:

$$f_k(x) = |2\pi \Sigma_k|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\}.$$

In comparing two classes $k$ and $l$, we look at the log-ratio

$$\log \frac{\Pr(Y = k \mid X = x)}{\Pr(Y = l \mid X = x)} = \log \frac{\Pr(X = x \mid Y = k) \Pr(Y = k)}{\Pr(X = x \mid Y = l) \Pr(Y = l)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l}.$$


Quadratic and Linear Discriminant Analysis

1. Unrestricted $\{\Sigma_k\}$ lead to Quadratic discriminant analysis.

2. The special case $\Sigma_k = \Sigma$, $\forall k$, leads to convenient cancellations in the log-ratio:

$$\log \frac{\Pr(Y = k \mid X = x)}{\Pr(Y = l \mid X = x)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l} = x^T \Sigma^{-1} (\mu_k - \mu_l) - \tfrac{1}{2} (\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + \log \frac{\pi_k}{\pi_l}.$$

The quadratic parts vanish, so the decision boundary is linear.
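The cancellation can be spelled out step by step. The following derivation sketch uses only the Gaussian density and the assumption of a shared covariance matrix $\Sigma$; the last line also relies on $\Sigma^{-1}$ being symmetric.

```latex
\begin{align*}
\log\frac{f_k(x)}{f_l(x)}
  &= -\tfrac12 (x-\mu_k)^T \Sigma^{-1} (x-\mu_k)
     + \tfrac12 (x-\mu_l)^T \Sigma^{-1} (x-\mu_l)
     && \text{(the factors } |2\pi\Sigma|^{-1/2} \text{ cancel)} \\
  &= x^T \Sigma^{-1} \mu_k - \tfrac12 \mu_k^T \Sigma^{-1} \mu_k
     - x^T \Sigma^{-1} \mu_l + \tfrac12 \mu_l^T \Sigma^{-1} \mu_l
     && \text{(the terms } x^T \Sigma^{-1} x \text{ cancel)} \\
  &= x^T \Sigma^{-1} (\mu_k - \mu_l)
     - \tfrac12 (\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l).
\end{align*}
```

With class-specific covariance matrices neither cancellation happens, which is why the unrestricted case stays quadratic in $x$.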


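Discriminant functions

Equivalent description of the decision rule: $c(x) = \arg\max_k \delta_k(x)$.

Quadratic discriminant analysis

$$\delta_k^{QDA}(x) = -\tfrac{1}{2} \log |\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$

Linear discriminant analysis

$$\delta_k^{LDA}(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$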
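Both discriminant functions are short enough to evaluate by hand. The sketch below restricts itself to two genes so that the matrix inverse can be written out without extra libraries; the class means, covariance matrix, priors and test point are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def det2(m: Matrix) -> float:
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m: Matrix) -> List[List[float]]:
    """Inverse of a 2x2 matrix (two genes only, to stay dependency-free)."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def quad_form(u: Vector, m: Matrix, v: Vector) -> float:
    """u^T m v for two-dimensional u and v."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


def delta_qda(x: Vector, mu: Vector, sigma: Matrix, prior: float) -> float:
    diff = [x[0] - mu[0], x[1] - mu[1]]
    return (-0.5 * math.log(det2(sigma))
            - 0.5 * quad_form(diff, inv2(sigma), diff)
            + math.log(prior))


def delta_lda(x: Vector, mu: Vector, sigma: Matrix, prior: float) -> float:
    s_inv = inv2(sigma)
    return quad_form(x, s_inv, mu) - 0.5 * quad_form(mu, s_inv, mu) + math.log(prior)


def classify(deltas: Dict[int, float]) -> int:
    """c(x) = argmax_k delta_k(x)."""
    return max(deltas, key=deltas.get)


# Hypothetical parameters: two classes sharing one covariance matrix.
mu = {+1: (1.0, 0.0), -1: (-1.0, 0.0)}
sigma = [[1.0, 0.0], [0.0, 1.0]]
prior = {+1: 0.5, -1: 0.5}
x = (0.5, 2.0)

lda = {k: delta_lda(x, mu[k], sigma, prior[k]) for k in mu}
qda = {k: delta_qda(x, mu[k], sigma, prior[k]) for k in mu}
print(classify(lda), classify(qda))
```

When all classes share the same covariance matrix, the two scores differ only by a term that is the same for every class, so they select the same class.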


More constraints on Σ

Diagonal discriminant analysis constrains $\Sigma_k$ to diagonal form. This means: genes/features are assumed to be independent. Again there exist a linear and a quadratic form.

Nearest centroids classification requires $\Sigma_k = \sigma_k^2 I$, where $I$ is the identity matrix. Not only are genes now independent, they also have the same variance (per class).

We will use both in the linear form, i.e. $\Sigma_k = \Sigma$, $\forall k$.


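Estimation from data

Prior: $\pi_k = N_k / N$

Class means:

$$\mu_k = \frac{1}{N_k} \sum_{\{i : y_i = k\}} x^{(i)}$$

Covariance matrix:

$$\Sigma = \frac{1}{N - 2} \sum_{k=1}^{2} \sum_{\{i : y_i = k\}} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^T$$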
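The three estimates translate into a few lines of code. A minimal sketch in plain Python, with a hypothetical function name and toy data chosen so that the result can be checked by hand:

```python
from typing import Dict, List, Sequence, Tuple

Profile = Sequence[float]


def estimate_lda_parameters(
    profiles: Sequence[Profile], labels: Sequence[int]
) -> Tuple[Dict[int, float], Dict[int, List[float]], List[List[float]]]:
    """Plug-in estimates of priors, class means and pooled covariance."""
    n, p = len(profiles), len(profiles[0])
    classes = sorted(set(labels))

    priors: Dict[int, float] = {}
    means: Dict[int, List[float]] = {}
    for k in classes:
        members = [x for x, y in zip(profiles, labels) if y == k]
        priors[k] = len(members) / n
        means[k] = [sum(x[j] for x in members) / len(members) for j in range(p)]

    # Pooled within-class scatter, divided by N minus the number of classes;
    # with two classes this is the N - 2 of the formula above.
    pooled = [[0.0] * p for _ in range(p)]
    for x, y in zip(profiles, labels):
        d = [x[j] - means[y][j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                pooled[a][b] += d[a] * d[b]
    denom = n - len(classes)
    sigma = [[v / denom for v in row] for row in pooled]
    return priors, means, sigma


# Toy data (illustrative): four patients, two genes.
profiles = [(1.0, 2.0), (3.0, 4.0), (0.0, 0.0), (2.0, 0.0)]
labels = [1, 1, -1, -1]
priors, means, sigma = estimate_lda_parameters(profiles, labels)
print(priors, means, sigma)
```

For real microarray data with $p \gg N$ the pooled covariance estimate is singular, which is one motivation for the diagonal and nearest-centroid restrictions.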


Discriminant analysis in a nutshell

[Figure: four panels labeled QDA, LDA, DLDA, and Nearest Centroid.]

Characterize each class by mean and covariance structure.

• Quadratic D.A.: different COVs.

• Linear D.A.: requires same COVs.

• Diagonal linear D.A.: same diagonal COVs.

• Nearest centroids: forces COVs to $\sigma^2 I$.


Why does discriminant analysis work?

Is it because the Gaussian assumption is always fulfilled? Not likely!

The reason is more pragmatic:

1. The data can only support simple decision rules,

2. estimates from the Gaussian model are stable.

But we still work in very high dimensions.

Next simplification: Base the classification on only a small number of genes.

Feature selection: Find the most discriminative genes.


Single feature ranking

Idea: compare the difference in group means, scaled by the variance within the groups.

[Figure: two panels; x-axis: gene expression, y-axis: frequency.]


Correlation scores

Three implementations of the mean/variance comparison:

t-statistic:

$$t = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$$

Fisher:

$$f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$$

Golub [1]:

$$g = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2}$$

We rank the genes by one of these scores, use the top $k$ for further analysis, and discard the rest.
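The three scores and the top-k filter can be sketched as follows. Two assumptions go beyond the slide: the variances are taken to be sample variances (denominator n - 1), and genes are ranked by the absolute value of the score. Function names and data are illustrative.

```python
import math
from statistics import mean, variance
from typing import Dict, List, Sequence


def gene_scores(group1: Sequence[float], group2: Sequence[float]) -> Dict[str, float]:
    """t, Fisher and Golub scores for one gene.

    Assumption: sigma^2 is the sample variance (denominator n - 1).
    """
    m1, m2 = mean(group1), mean(group2)
    v1, v2 = variance(group1), variance(group2)
    n1, n2 = len(group1), len(group2)
    return {
        "t": (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2),
        "fisher": (m1 - m2) ** 2 / (v1 + v2),
        "golub": (m1 - m2) / (math.sqrt(v1) + math.sqrt(v2)),
    }


def top_k_genes(expr1, expr2, k: int, score: str = "t") -> List[int]:
    """Indices of the k genes with the largest absolute score.

    expr1[g] and expr2[g] hold the expression values of gene g in each group.
    """
    scores = [abs(gene_scores(a, b)[score]) for a, b in zip(expr1, expr2)]
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]


# Toy data (illustrative): three genes, three patients per group.
group1 = [[1.0, 2.0, 3.0], [5.0, 5.5, 6.0], [2.0, 4.0, 6.0]]
group2 = [[4.0, 5.0, 6.0], [5.2, 5.4, 6.1], [3.0, 4.0, 8.0]]

s = gene_scores(group1[0], group2[0])
ranking = top_k_genes(group1, group2, k=1)
print(s, ranking)
```

Only the first toy gene shows a mean difference that is large relative to its spread, so it is the one the filter keeps.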


From filters to shrinkage

Filtering involves an arbitrary hard threshold. Gene $k + 1$ is discarded, even if it bears no less information than gene $k$.

We address this with shrinkage: continuously shrink the genes' contributions until only a few genes influence the classification.

Example: Nearest Shrunken Centroids (NSC).


Nearest Shrunken Centroids


NSC: global and class centroids

For gene $i$: how far is the class centroid $\bar{x}_{ik}$ from the overall centroid $\bar{x}_i$, measured in units of standard deviation?

$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \cdot s_i},$$

where $s_i$ is the pooled within-class standard deviation for gene $i$ and $m_k = \sqrt{1/N_k - 1/N}$.

Solving for the class centroid gives

$$\bar{x}_{ik} = \bar{x}_i + m_k \cdot s_i \cdot d_{ik}.$$


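NSC: Shrinkage

Noisy and uninformative $\bar{x}_{ik}$ will be close to the overall mean $\bar{x}_i$.

Shrink each $d_{ik}$ toward zero by soft thresholding [5, 6]:

$$d'_{ik} = \operatorname{sign}(d_{ik}) \, (|d_{ik}| - \Delta)_+$$

where $(t)_+ = \max(t, 0)$. This gives new class prototypes

$$\bar{x}'_{ik} = \bar{x}_i + m_k \cdot s_i \cdot d'_{ik}.$$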
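Soft thresholding and the resulting class prototypes for a single gene can be sketched as below. The function names and the numeric example (two classes of ten patients each, pooled standard deviation 1) are assumptions made for illustration.

```python
import math
from typing import Dict


def soft_threshold(d: float, delta: float) -> float:
    """sign(d) * (|d| - delta)_+ : move d toward zero by delta, stopping at zero."""
    shrunk = max(abs(d) - delta, 0.0)
    return shrunk if d >= 0 else -shrunk


def shrunken_centroids(
    class_centroids: Dict[int, float],
    overall_centroid: float,
    pooled_sd: float,
    class_sizes: Dict[int, int],
    delta: float,
) -> Dict[int, float]:
    """Shrunken class centroids of a single gene."""
    n = sum(class_sizes.values())
    result = {}
    for k, centroid in class_centroids.items():
        m_k = math.sqrt(1.0 / class_sizes[k] - 1.0 / n)
        d_k = (centroid - overall_centroid) / (m_k * pooled_sd)
        result[k] = overall_centroid + m_k * pooled_sd * soft_threshold(d_k, delta)
    return result


# Hypothetical gene: class centroids at +1 and -1, overall centroid 0.
centroids = {+1: 1.0, -1: -1.0}
sizes = {+1: 10, -1: 10}

unshrunk = shrunken_centroids(centroids, 0.0, 1.0, sizes, delta=0.0)
partly = shrunken_centroids(centroids, 0.0, 1.0, sizes, delta=2.0)
removed = shrunken_centroids(centroids, 0.0, 1.0, sizes, delta=5.0)
print(unshrunk, partly, removed)
```

For this gene the standardized distances are about ±4.47, so a threshold of 2 pulls both centroids toward the overall centroid, and a threshold of 5 makes them coincide with it.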


NSC: how genes vanish from the model

If the shrinkage parameter $\Delta$ is large enough, genes are eliminated from class prediction:

If $\Delta$ causes $d_{ik}$ to shrink to zero for all classes $k$, then every class centroid of gene $i$ coincides with the overall centroid.

Gene $i$ then does not contribute to the nearest-centroid computation.


Shrunken Centroids

[Figure: overall centroid, first class centroid, and second class centroid, shown for gene A and gene B.]


Shrunken Centroids

[Figure: scatter plot; x-axis: expression of gene 1, y-axis: expression of gene 2.]


NSC: Discriminant scores

The discriminant function for nearest shrunken centroid classification is

$$\delta_k^{NSC}(x) = x^T \Sigma^{-1} \bar{x}'_k - \tfrac{1}{2} (\bar{x}'_k)^T \Sigma^{-1} \bar{x}'_k + \log \pi_k,$$

which looks exactly like $\delta_k^{LDA}(x)$ except for three differences:

1. a diagonal within-class covariance matrix $\Sigma$,

2. shrunken centroids $\bar{x}'_k$ rather than centroids $\bar{x}_k \equiv \mu_k$,

3. as $\Delta$ increases, more and more genes lose discriminatory power.


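Class probabilities from discriminant scores

Using the $\delta_k(x)$ we can construct estimates of the class probabilities $\Pr(Y = k \mid X = x)$:

$$p_k(x) = \frac{\exp\left(-\tfrac{1}{2} \delta_k(x)\right)}{\sum_{l=1}^{K} \exp\left(-\tfrac{1}{2} \delta_l(x)\right)}$$

In this formula $\delta_k(x)$ is a distance-type score as in [5], for which smaller values indicate a better fit; in terms of the discriminant function $\delta_k^{NSC}(x)$ it corresponds to $-2\,\delta_k^{NSC}(x)$ up to a term that does not depend on $k$.

The monotone transformation $\log[p/(1 - p)]$ is called the logit transformation.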
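The conversion from scores to probabilities can be sketched as below. The sketch assumes distance-type scores, where a smaller $\delta_k$ means a better fit (the convention under which the factor $-\tfrac{1}{2}$ in the exponent gives the best class the highest probability); the two score values are hypothetical.

```python
import math
from typing import Dict


def class_probabilities(distance_scores: Dict[int, float]) -> Dict[int, float]:
    """p_k = exp(-delta_k / 2) / sum_l exp(-delta_l / 2).

    Assumes distance-type scores: a smaller delta_k means a better fit.
    """
    # Subtracting the smallest score first leaves the ratios unchanged
    # but avoids underflow when all scores are large.
    best = min(distance_scores.values())
    weights = {k: math.exp(-0.5 * (d - best)) for k, d in distance_scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def logit(p: float) -> float:
    """The logit transformation log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))


probs = class_probabilities({+1: 2.0, -1: 6.0})   # hypothetical scores
print(probs, logit(probs[+1]))
```

With two classes the logit of the estimated probability is half the difference of the two scores, so it is linear in the scores.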


Shortcomings of filter and shrinkage methods

1. Highly correlated genes get similar scores but offer no new information. But see Jaeger et al. [3] for a cure.

2. Filter and shrinkage methods work only on single genes. They do not find interactions between groups of genes.

3. Filter and shrinkage methods are only heuristics. An exhaustive search for the best subset is infeasible for more than about 30 genes.
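The infeasibility of exhaustive subset search follows from simple counting. A short sketch of the arithmetic, with hypothetical helper names:

```python
from math import comb


def number_of_subsets(p: int) -> int:
    """Number of non-empty gene subsets among p genes: 2^p - 1."""
    return 2 ** p - 1


def number_of_subsets_of_size(p: int, k: int) -> int:
    """Number of ways to choose exactly k genes out of p."""
    return comb(p, k)


for p in (10, 20, 30):
    print(p, number_of_subsets(p))
```

At 30 genes there are already more than a billion candidate subsets, each of which would need its own classifier fit and evaluation, and microarrays measure thousands of genes.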


Differential genes may not be predictive!

[Figure: two panels; x-axis: gene expression, y-axis: frequency.]

The gene in the upper panel is differential and predictive; the gene in the lower panel is also differential, but not predictive.


Predictive genes may not be differential!


A first summary

1. Molecular Diagnosis from microarray data is a classification

problem.

2. From training data, find a classifier working well for future patients.

3. Curse of dimensionality leads to easy overfitting.

4. Thus, bias the models to be simple!

5. One example: Gaussian model with restricted covariance and gene

selection.


What’s to come

Part II will deal with

1. Support vector machines → Maximal margin hyperplanes, non-linear similarity measures

2. Model selection and assessment → Traps and pitfalls, or: How to cheat.

3. Interpretation of results → what do classifiers teach us about biology?

Thank you! Questions?


Acknowledgements

Thanks to MIT Press and the authors for making the figures from

Learning with Kernels available at

http://www.learning-with-kernels.org.

Thanks to Springer and the authors for making the figures from The

Elements of Statistical Learning available at

http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/.


References

[1] TR. Golub, DK. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, JP. Mesirov, H. Coller, ML. Loh, JR. Downing, MA. Caligiuri, CD. Bloomfield, and ES. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–7, Oct 1999.

[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2001.

[3] J Jaeger, R Sengupta, and WL Ruzzo. Improved gene selection for classification of microarrays. Pac Symp Biocomput, pages 53–64, 2003.

[4] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.

[5] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A, 99(10):6567–72, May 2002.

[6] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statist. Sci., 18(1):104–117, 2003.

[7] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.

[8] Vladimir Vapnik. Statistical Learning Theory. Wiley, N.Y., 1998.
