Sp’10 Bafna/Ideker Classification (SVMs / Kernel method)

LP versus Quadratic programming

LP:
$$\min\ c^T x \quad \text{s.t.}\quad Ax \le b,\ x \ge 0$$

QP:
$$\min\ x^T Q x + c^T x \quad \text{s.t.}\quad Ax \le b,\ x \ge 0$$

• LP: linear constraints, linear objective function.

• LP can be solved in polynomial time.

• In QP, the objective function contains a quadratic form.

• For positive semidefinite Q, the QP can also be solved in polynomial time (a small numerical sketch follows).
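As a concrete illustration of the QP form above, here is a minimal sketch that solves a tiny, made-up instance with SciPy's general-purpose solver; the matrices Q, c, A, b are hypothetical values chosen only for this example, and any off-the-shelf QP solver would do.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny hypothetical instance of  min x^T Q x + c^T x  s.t.  A x <= b,  x >= 0
Q = np.array([[2.0, 0.0],
              [0.0, 1.0]])        # positive semidefinite
c = np.array([-2.0, -3.0])
A = np.array([[1.0, 1.0]])
b = np.array([2.0])

objective = lambda x: x @ Q @ x + c @ x
constraints = [{"type": "ineq", "fun": lambda x: b - A @ x}]   # b - Ax >= 0

res = minimize(objective, x0=np.zeros(2), bounds=[(0, None), (0, None)],
               constraints=constraints, method="SLSQP")
print(res.x)   # approximate minimizer
```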


Margin of separation

• Suppose we find a separating hyperplane $(\beta, \beta_0)$ such that:
– for all positive points, $\beta^T x - \beta_0 \ge 1$
– for all negative points, $\beta^T x - \beta_0 \le -1$

• What is the margin of separation?

[Figure: the parallel hyperplanes $\beta^T x - \beta_0 = 0$, $\beta^T x - \beta_0 = 1$, and $\beta^T x - \beta_0 = -1$]


Separating by a wider margin

• Solutions with a wider margin are better.

$$\text{Maximize } \frac{2}{\|\beta\|_2}, \quad\text{or equivalently minimize } \frac{\|\beta\|_2^2}{2}$$


Separating via misclassification

• In general, the data are not linearly separable.

• What if we also want to minimize the number of misclassified points?

• Recall that each sample $x_i$ in our training set has a label $y_i \in \{-1, +1\}$.

• For each point $i$, $y_i(\beta^T x_i - \beta_0)$ should be positive.

• Define $\xi_i \ge \max\{0,\ 1 - y_i(\beta^T x_i - \beta_0)\}$.

• If point $i$ is correctly classified with margin ($y_i(\beta^T x_i - \beta_0) \ge 1$), then $\xi_i = 0$.

• If point $i$ is misclassified, or lies close to the boundary, then $\xi_i > 0$.

• We must minimize $\sum_i \xi_i$ (a small numerical sketch of these slacks follows).
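A minimal sketch of the slack computation above, assuming the data are already in a NumPy array; the variable names (X, y, beta, beta0) are illustrative, not from the original slides.

```python
import numpy as np

def slacks(X, y, beta, beta0):
    """xi_i = max(0, 1 - y_i (beta^T x_i - beta_0)) for every training point."""
    margins = y * (X @ beta - beta0)          # y_i (beta^T x_i - beta_0)
    return np.maximum(0.0, 1.0 - margins)     # zero for well-classified points

# Toy example: two well-separated points and one violator
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.2, 0.0]])
y = np.array([1.0, -1.0, -1.0])
print(slacks(X, y, beta=np.array([1.0, 0.0]), beta0=0.0))   # [0., 0., 1.2]
```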


Support Vector Machines (wide margin and misclassification)

• Maximize the margin while minimizing misclassification.

• Solved using non-linear optimization techniques (an example with an off-the-shelf solver follows).

• The problem can be reformulated exclusively in terms of dot products of the data points, which allows us to employ the kernel method.

• This gives a lot of power to the method.

$$\min_{\beta,\,\beta_0,\,\xi}\ \frac{\|\beta\|^2}{2} + C\sum_i \xi_i$$
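In practice this soft-margin problem is rarely solved by hand; as a hedged illustration, scikit-learn's SVC exposes the same trade-off through its C parameter (assuming scikit-learn is installed; the data below are made up).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D training data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0], [0.3, -0.2]])
y = np.array([1, 1, -1, -1, -1])

# Larger C penalizes slack (misclassification) more heavily; smaller C widens the margin.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)   # coef_ plays the role of beta; intercept_ is -beta_0 in the slides' notation
```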


Reformulating the optimization

$$\min_{\beta,\,\beta_0,\,\xi}\ \frac{\|\beta\|^2}{2} + C\sum_i \xi_i$$
$$\xi_i \ge 0$$
$$\xi_i \ge 1 - y_i\left(\beta^T x_i - \beta_0\right)$$


Lagrangian relaxation

• Goal:
$$\min_{\beta,\,\beta_0,\,\xi}\ \frac{\|\beta\|^2}{2} + C\sum_i \xi_i$$

• s.t.
$$\xi_i \ge 0, \qquad \xi_i \ge 1 - y_i\left(\beta^T x_i - \beta_0\right)$$

• We minimize the Lagrangian:
$$L = \frac{\|\beta\|^2}{2} + C\sum_i \xi_i - \sum_i \alpha_i\left(\xi_i - 1 + y_i\left(\beta^T x_i - \beta_0\right)\right) - \sum_i \lambda_i \xi_i$$


Simplifying

$$L = \frac{\|\beta\|^2}{2} + C\sum_i \xi_i - \sum_i \alpha_i\left(\xi_i - 1 + y_i\left(\beta^T x_i - \beta_0\right)\right) - \sum_i \lambda_i \xi_i$$
$$\;\;= \frac{\beta^T\beta}{2} - \beta^T\left(\sum_i \alpha_i y_i x_i\right) + \sum_i \left(C - \alpha_i - \lambda_i\right)\xi_i + \beta_0\sum_i \alpha_i y_i + \sum_i \alpha_i$$

• For fixed α ≥ 0, λ ≥ 0, we minimize the Lagrangian:
$$\frac{\partial L}{\partial \beta} = \beta - \sum_i y_i \alpha_i x_i = 0 \qquad (1)$$
$$\frac{\partial L}{\partial \beta_0} = \sum_i y_i \alpha_i = 0 \qquad (2)$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \lambda_i = 0 \qquad (3)$$


Substituting

• Substituting (1):
$$L = \frac{\|\beta\|^2}{2} + C\sum_i \xi_i - \sum_i \alpha_i\left(\xi_i - 1 + y_i\left(\beta^T x_i - \beta_0\right)\right) - \sum_i \lambda_i \xi_i$$
$$\;\;= \frac{\beta^T\beta}{2} - \beta^T\left(\sum_i \alpha_i y_i x_i\right) + \sum_i \left(C - \alpha_i - \lambda_i\right)\xi_i + \beta_0\sum_i \alpha_i y_i + \sum_i \alpha_i$$
$$L = -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j + \sum_i \left(C - \alpha_i - \lambda_i\right)\xi_i + \beta_0\sum_i \alpha_i y_i + \sum_i \alpha_i$$


• Substituting (2) and (3), the ξ and β0 terms vanish from

$$L = -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j + \sum_i \left(C - \alpha_i - \lambda_i\right)\xi_i + \beta_0\sum_i \alpha_i y_i + \sum_i \alpha_i$$

and we are left with a quadratic program in α alone (the dual problem):

$$\max_{\alpha}\ \ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{s.t.}\quad \sum_i y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C$$
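To make the dual concrete, here is a minimal sketch that solves it numerically for a tiny, made-up data set with SciPy's SLSQP solver; dedicated SVM solvers (SMO/LIBSVM-style) are what one would use in practice.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny hypothetical training set
X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0
K = X @ X.T                                   # matrix of dot products x_i^T x_j

def neg_dual(a):                              # maximize W(a)  <=>  minimize -W(a)
    return 0.5 * (a * y) @ K @ (a * y) - a.sum()

res = minimize(neg_dual, np.zeros(len(y)),
               bounds=[(0.0, C)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               method="SLSQP")
alpha = res.x
beta = (alpha * y) @ X                        # beta = sum_i alpha_i y_i x_i, from (1)
print(alpha, beta)
```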


Classification using SVMs

• Under these conditions, the problem is a quadratic programming problem and can be solved using known techniques

• Quiz: When we have solved this QP, how do we classify a point x?

$$f(x) = \beta^T x - \beta_0 = \sum_i y_i \alpha_i\, x_i^T x - \beta_0$$
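Given the α from the dual (for instance the toy solution above), the classification rule follows directly; this is a sketch with illustrative names, and β0 is taken as given here (it is usually recovered from a support vector with 0 < α_i < C).

```python
import numpy as np

def decision(x, alpha, y, X, beta0):
    """f(x) = sum_i y_i alpha_i x_i^T x - beta_0; classify by the sign of f(x)."""
    return (alpha * y) @ (X @ x) - beta0

# label = +1 if decision(x, alpha, y, X, beta0) > 0 else -1
```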


The kernel method

• The SVM formulation can be solved using QP on dot products.

• As these are wide-margin classifiers, they provide a more robust solution.

• However, the true power of the SVM approach comes from the 'kernel method', which allows us to work in higher-dimensional (and non-linear) feature spaces.


Kernel

• Let X be a set of objects.
– Ex: X = the set of samples in a micro-array study; each object x ∈ X is a vector of gene expression values.

• k: X × X → R is a positive semidefinite kernel if
– k is symmetric: $k(x, x') = k(x', x)$
– k is positive semidefinite: for any finite set of p objects, the p × p matrix k of pairwise values satisfies $c^T k c \ge 0\ \ \forall c \in R^p$


Kernels as dot-product

• Quiz: Suppose the objects x are all real vectors (as in gene expression)

• Define $k_L(x, x') = x^T x'$.

• Is $k_L$ a kernel? It is symmetric, but is it positive semidefinite?


Linear kernel is +ve semidefinite

• Write X as a matrix in which each column is a sample: X = [x1 x2 …].

• By definition, the linear kernel (Gram) matrix is $k_L = X^T X$.

• For any c (a quick numerical check follows below):
$$c^T k_L c = c^T X^T X c = \|Xc\|^2 \ge 0$$
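A quick numerical sanity check of this argument, with a made-up expression matrix: the eigenvalues of X^T X should all be non-negative (up to floating-point error).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))          # 5 hypothetical "genes" x 4 samples (columns)
K = X.T @ X                          # linear kernel / Gram matrix between samples

eigvals = np.linalg.eigvalsh(K)      # K is symmetric, so eigvalsh is appropriate
print(eigvals)                       # all >= 0 up to numerical round-off
```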


Generalizing kernels

• Any object can be represented by a feature vector in real space.

$$\phi : X \to R^p$$
$$k(x, x') = \phi(x)^T \phi(x')$$


Generalizing

• Note that the feature mapping could actually be non-linear.

• Conversely, every kernel can be represented as a dot product in some high-dimensional feature space.

• Sometimes the kernel is easier to define than the mapping itself.


The kernel trick

• If an algorithm for vectorial data is expressed exclusively in terms of dot products, it can be turned into an algorithm over an arbitrary kernel:
– simply replace each dot product by the kernel.


Kernel trick example

• Consider a kernel k defined via a mapping φ: $k(x, x') = \phi(x)^T \phi(x')$.

• It could be that φ is very difficult to compute explicitly, but k is easy to compute.

• Suppose we define a distance function between two objects as $d(x, x') = \|\phi(x) - \phi(x')\|$.

• How do we compute this distance without ever computing φ? (A sketch follows below.)

$$d(x, x')^2 = \|\phi(x) - \phi(x')\|^2 = \phi(x)^T\phi(x) + \phi(x')^T\phi(x') - 2\,\phi(x)^T\phi(x')$$
$$\phantom{d(x, x')^2} = k(x, x) + k(x', x') - 2\,k(x, x')$$
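A minimal sketch of that computation, using the linear kernel only so the example stays self-contained; any kernel function with the same signature could be substituted.

```python
import numpy as np

def linear_kernel(x, xp):
    return float(x @ xp)

def kernel_distance(x, xp, k=linear_kernel):
    """d(x, x') = sqrt(k(x,x) + k(x',x') - 2 k(x,x')) -- never touches phi explicitly."""
    return np.sqrt(k(x, x) + k(xp, xp) - 2.0 * k(x, xp))

x, xp = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(kernel_distance(x, xp))        # equals the ordinary Euclidean distance here
```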


Kernels and SVMs

• Recall that SVM-based classification reduces to the dual problem:

$$\max_{\alpha}\ \ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{s.t.}\quad \sum_i y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C$$


Kernels and SVMs

• Applying the kernel trick, the data enter only through $k(x_i, x_j)$:

$$\max_{\alpha}\ \ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$$
$$\text{s.t.}\quad \sum_i y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C$$

• We can try kernels that are biologically relevant.
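As a hedged practical note, most SVM libraries accept an arbitrary kernel directly; for example, scikit-learn's SVC can take a precomputed Gram matrix, so any biologically motivated kernel that yields a valid matrix can be plugged in (the data below are made up).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical objects summarized only by a precomputed kernel (Gram) matrix
X = np.array([[1.0, 0.5], [0.8, 1.2], [-1.0, -0.7], [-1.2, -0.4]])
y = np.array([1, 1, -1, -1])
K_train = X @ X.T                    # stand-in for any valid kernel matrix

clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y)

# To classify new points, supply their kernel values against the training objects.
X_new = np.array([[0.9, 0.9]])
K_new = X_new @ X.T                  # shape (n_new, n_train)
print(clf.predict(K_new))
```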


Examples of kernels for vectors

$$\text{linear kernel:}\quad k_L(x, x') = x^T x'$$
$$\text{polynomial kernel:}\quad k_p(x, x') = \left(x^T x' + c\right)^d$$
$$\text{Gaussian RBF kernel:}\quad k_G(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$
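The same three kernels written out in NumPy; a quick sketch, with parameter values chosen arbitrarily for illustration.

```python
import numpy as np

def linear_kernel(x, xp):
    return float(x @ xp)

def poly_kernel(x, xp, c=1.0, d=3):
    return (float(x @ xp) + c) ** d

def rbf_kernel(x, xp, sigma=1.0):
    diff = x - xp
    return float(np.exp(-(diff @ diff) / (2.0 * sigma ** 2)))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), poly_kernel(x, xp), rbf_kernel(x, xp))
```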


String kernel

• Consider a string s = s1 s2 …

• Define an index set I as a subset of positions in s (in increasing order).

• s[I] is the subsequence of s restricted to those positions.

• l(I) = the span of I (last index − first index + 1).

• W(I) = c^l(I), with c < 1:
– the weight decreases as the span increases.

• For any string u of length k:
$$\phi_u(s) = \sum_{I\,:\,s[I] = u} c^{\,l(I)}$$


String Kernel

• Map every string to a $|\Sigma|^n$-dimensional space, indexed by all strings u of length up to n (Σ is the alphabet); the coordinate for u is $\phi_u(s)$.

• The mapping itself is expensive, but given two strings s and t, the dot-product kernel $k(s, t) = \phi(s)^T \phi(t)$ can be computed in O(n |s| |t|) time by dynamic programming (a naive reference implementation is sketched below).
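This is a naive, exponential-time sketch of the gap-weighted subsequence features described above, useful only for very short strings; it enumerates index sets explicitly rather than using the O(n|s||t|) dynamic program, and the decay constant c = 0.5 is arbitrary.

```python
from itertools import combinations
from collections import defaultdict

def subsequence_features(s, k, c=0.5):
    """phi_u(s): for every k-element index set I, add c**l(I) to the coordinate u = s[I]."""
    phi = defaultdict(float)
    for I in combinations(range(len(s)), k):
        u = "".join(s[i] for i in I)
        span = I[-1] - I[0] + 1          # l(I)
        phi[u] += c ** span              # W(I) = c^l(I)
    return phi

def string_kernel(s, t, k, c=0.5):
    """k(s, t) = phi(s) . phi(t), computed by brute force."""
    ps, pt = subsequence_features(s, k, c), subsequence_features(t, k, c)
    return sum(v * pt[u] for u, v in ps.items() if u in pt)

print(string_kernel("cat", "cart", k=2))
```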


SVM conclusion

• SVMs are a generic scheme for classifying data with wide margins and few misclassifications.

• For data that are not easily represented as vectors, the kernel trick provides a standard recipe for classification:
– define a meaningful kernel, and solve using an SVM.

• Many standard kernels are available (linear, polynomial, RBF, string).


Classification review

• We started out by treating the classification problem as one of separating points in high dimensional space

• Obvious for gene expression data, but applicable to any kind of data

• Question of separability, linear separation.

• Algorithms for classification:
– Perceptron
– Linear discriminant
– Maximum likelihood
– Linear programming
– SVMs
– Kernel methods & SVMs


Classification review

• Recall that we considered 3 problems:
– group samples together in an unsupervised fashion (clustering);
– classify based on training data (often by learning a hyperplane that separates the classes);
– select marker genes that are diagnostic for the class. All other genes can be discarded, leading to lower dimensionality.


Dimensionality reduction

• Many genes have highly correlated expression profiles.

• By discarding some of the genes, we can greatly reduce the dimensionality of the problem.

• There are other, more principled ways to do such dimensionality reduction.


Why is high dimensionality bad?

• With a high enough dimensionality, all points can be linearly separated.

• Recall that a point $x_i$ is misclassified if
– it is positive, but $\beta^T x_i - \beta_0 \le 0$, or
– it is negative, but $\beta^T x_i - \beta_0 > 0$.

• In the first case, add a new coordinate $\delta_i$ for point $i$ and choose it so that $\beta^T x_i - \beta_0 + \delta_i > 0$.

• By adding a dimension for each misclassified point, we obtain a higher-dimensional hyperplane that perfectly separates all of the points!


Principal Components Analysis

• PCA recovers the intrinsic dimensionality of a data set.


Principal Components Analysis

• Consider the expression values of 2 genes over 6 samples.

• Clearly, the expression of the two genes is highly correlated.

• Projecting all of the points onto a single line could explain most of the data.

• This is a generalization of “discarding the gene”.


Projecting

• Consider the mean m of all points, and a direction vector β emanating from the mean.

• Algebraically, projection onto β means that each sample x can be represented by a single value $\beta^T(x - m)$.

[Figure: a point x, the mean m, the vector x − m, and its projection β^T(x − m) onto β]


Higher dimensions

• Consider a set of 2 (in general, k) orthonormal vectors β1, β2, …

• Once projected, each sample x can be represented by a 2- (in general, k-) dimensional vector $\left(\beta_1^T(x - m),\ \beta_2^T(x - m), \dots\right)$.

[Figure: the point x, the mean m, and the projections of x − m onto β1 and β2]


How to project

• The generic scheme allows us to project points from an m-dimensional space into a k-dimensional one.

• How do we select the k ‘best’ dimensions?

• The strategy used by PCA is one that maximizes the variance of the projected points around the mean


PCA

• Suppose all of the data were to be reduced by projecting to a single line from the mean.

• How do we select the line β?


PCA cont’d

• Let each point $x_k$ map to $x'_k = m + a_k\beta$. We want to minimize the total error
$$\sum_k \|x_k - x'_k\|^2$$

• Observation 1: the error is minimized when $a_k = \beta^T(x_k - m)$, i.e., each point $x_k$ maps to $x'_k = m + \left(\beta^T(x_k - m)\right)\beta$.


Proof of Observation 1

$$\min_{a_k} \|x_k - x'_k\|^2 = \min_{a_k} \|x_k - m + m - x'_k\|^2$$
$$= \min_{a_k}\ \|x_k - m\|^2 + \|m - x'_k\|^2 - 2(x'_k - m)^T(x_k - m)$$
$$= \min_{a_k}\ \|x_k - m\|^2 + a_k^2\,\beta^T\beta - 2a_k\,\beta^T(x_k - m)$$
$$= \min_{a_k}\ \|x_k - m\|^2 + a_k^2 - 2a_k\,\beta^T(x_k - m) \qquad (\text{since } \beta^T\beta = 1)$$

Differentiating w.r.t. $a_k$ and setting the derivative to 0:
$$2a_k - 2\beta^T(x_k - m) = 0 \;\Rightarrow\; a_k = \beta^T(x_k - m) \;\Rightarrow\; a_k^2 = a_k\,\beta^T(x_k - m)$$
$$\Rightarrow\; \|x_k - x'_k\|^2 = \|x_k - m\|^2 - \beta^T(x_k - m)(x_k - m)^T\beta$$


Minimizing PCA Error

• To minimize the error, we must maximize $\beta^T S\beta$ (see below).

• By definition, $S\beta = \lambda\beta$ means that λ is an eigenvalue of S and β the corresponding eigenvector; for such a unit vector, $\beta^T S\beta = \lambda$.

• Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.

$$\sum_k \|x_k - x'_k\|^2 = C - \sum_k \beta^T(x_k - m)(x_k - m)^T\beta = C - \beta^T S\beta$$

where $C = \sum_k \|x_k - m\|^2$ does not depend on β, and $S = \sum_k (x_k - m)(x_k - m)^T$ is the scatter (covariance) matrix.


PCA steps

• X = starting matrix with n columns (one per sample $x_j$) and m rows.

1. $m = \frac{1}{n}\sum_{j=1}^{n} x_j$
2. $h^T = [1\ 1\ \cdots\ 1]$
3. $M = X - m h^T$
4. $S = M M^T = \sum_{j=1}^{n} (x_j - m)(x_j - m)^T$
5. $B^T S B = \Lambda$ (eigendecomposition: the columns of B are the eigenvectors of S)
6. Return $B^T M$
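A minimal NumPy sketch of these steps on made-up data; the column convention follows the slide (samples in columns), and np.linalg.eigh returns eigenvalues in ascending order, so the columns of B are reordered to put the largest first.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 10))                 # 4 "genes" (rows) x 10 samples (columns)

m = X.mean(axis=1, keepdims=True)            # step 1: mean column vector
M = X - m                                    # steps 2-3: subtract the mean from every column
S = M @ M.T                                  # step 4: scatter matrix

eigvals, B = np.linalg.eigh(S)               # step 5: S = B Lambda B^T (ascending order)
order = np.argsort(eigvals)[::-1]
B = B[:, order]

Y = B.T @ M                                  # step 6: coordinates along the principal axes
Y_reduced = Y[:2, :]                         # keep only the top-2 principal components
print(Y_reduced.shape)                       # (2, 10)
```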

End of Lecture


ALL-AML classification

• The two leukemias require different therapeutic regimens.

• They are usually distinguished through hematopathology.

• Can gene expression be used for a more definitive test?
– 38 bone marrow samples
– Total mRNA was hybridized against probes for 6817 genes
– Q: are these classes separable?


Neighborhood analysis (cont’d)

• Each gene is represented by an expression vector v(g) = (e1,e2,…,en)

• Choose an idealized expression vector as center.

• Discriminating genes will be ‘closer’ to the center (any distance measure can be used).



Neighborhood analysis

• Q: Are there genes whose expression correlates with one of the two classes?

• A: For each class, create an idealized vector c.
– Compute the number of genes Nc whose expression 'matches' the idealized expression vector.
– Is Nc significantly larger than Nc* for a random c*?


Neighborhood test

• Distance measure used:
– For any binary vector c, let the 1 entries denote class 1 and the 0 entries denote class 2.
– For each gene g, compute the mean and std. dev. [μ1(g), σ1(g)] of its expression in class 1, and likewise [μ2(g), σ2(g)] in class 2.
– P(g,c) = [μ1(g) − μ2(g)] / [σ1(g) + σ2(g)] (sketched in code below)
– N1(c, r) = {g | P(g,c) = r}
– A high density of genes at some r is indicative of correlation with the class distinction.
– The neighborhood is significant if a random center does not produce the same density.
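A minimal sketch of the P(g,c) signal-to-noise statistic on a made-up expression matrix; the gene-by-sample orientation and the variable names are illustrative.

```python
import numpy as np

def signal_to_noise(E, c):
    """P(g, c) = (mu1 - mu2) / (sigma1 + sigma2) for every gene (row of E).

    E: genes x samples expression matrix; c: binary vector, 1 = class 1, 0 = class 2."""
    c = np.asarray(c, dtype=bool)
    mu1, mu2 = E[:, c].mean(axis=1), E[:, ~c].mean(axis=1)
    s1, s2 = E[:, c].std(axis=1), E[:, ~c].std(axis=1)
    return (mu1 - mu2) / (s1 + s2)

rng = np.random.default_rng(2)
E = rng.normal(size=(100, 8))                    # 100 hypothetical genes, 8 samples
c = np.array([1, 1, 1, 1, 0, 0, 0, 0])           # idealized class vector
P = signal_to_noise(E, c)
print((P > 0.3).sum())                            # genes "close" to the class distinction
```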


Neighborhood analysis

• #{g | P(g,c) > 0.3} > 709 (ALL) vs. 173 by chance.

• Class prediction should be possible using micro-array expression values.


Class prediction

• Choose a fixed set of informative genes (based on their correlation with the class distinction).
– The predictor is uniquely defined by the samples and the subset of informative genes.

• For each informative gene g, define (wg, bg):
– wg = P(g,c) (when is this positive?)
– bg = [μ1(g) + μ2(g)] / 2

• Given a new sample X:
– xg is the normalized expression value at g
– the vote of gene g is wg(xg − bg); a positive value is a vote for class 1, a negative one for class 2 (a sketch of this voting scheme follows).
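A minimal sketch of the weighted-voting rule, together with the prediction strength PS defined on the next slide; all inputs are assumed to be precomputed per-gene statistics, and the values are illustrative.

```python
import numpy as np

def weighted_vote(x, w, b):
    """Golub-style voting: gene g casts vote v_g = w_g (x_g - b_g).

    Positive votes favor class 1, negative votes favor class 2."""
    v = w * (x - b)
    V1, V2 = v[v > 0].sum(), -v[v < 0].sum()           # total vote for each class
    winner = 1 if V1 > V2 else 2
    ps = (max(V1, V2) - min(V1, V2)) / (V1 + V2)       # prediction strength (next slide)
    return winner, ps

# Hypothetical per-gene statistics for 5 informative genes and one new sample
w = np.array([0.8, -0.5, 0.6, -0.7, 0.4])              # w_g = P(g, c)
b = np.array([0.0, 0.1, -0.2, 0.3, 0.0])               # b_g = (mu1 + mu2) / 2
x = np.array([0.9, -0.4, 0.1, 1.0, 0.2])               # normalized expression values
print(weighted_vote(x, w, b))
```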


Prediction Strength

• PS = [Vwin − Vlose] / [Vwin + Vlose]
– reflects the margin of victory.

• A 50-gene predictor was correct on 36/38 samples (cross-validation).

• Prediction accuracy on other samples was 100% (predictions were made for 29/34 samples).

• Median PS = 0.73.

• Other predictors, with between 10 and 200 genes, all worked well.


Performance


Differentially expressed genes?

• Do the predictive genes reveal any biology?

• The initial expectation was that most genes would be of a hematopoietic lineage.

• However, many of the informative genes are related to:
– cell cycle progression
– chromatin remodelling
– transcription
– known oncogenes
– known leukemia drug targets (etoposide)


Relationship between ML and the Golub predictor

• Maximum-likelihood (ML) classification, when the covariance matrix is diagonal and identical for the different classes, is similar to Golub's classifier.

$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)\right)$$
$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$

Compute $\arg\max_i g_i(x)$. With a diagonal Σ (variances $\sigma_j^2$) shared by both classes, up to constants:
$$g_i(x) = -\frac{1}{2}\sum_{j=1}^{p} \frac{\left(x_j - \mu_{ij}\right)^2}{\sigma_j^2}$$
$$g_1(x) - g_2(x) = \sum_j \frac{\mu_{1j} - \mu_{2j}}{\sigma_j^2}\left(x_j - \frac{\mu_{1j} + \mu_{2j}}{2}\right)$$

• Each gene j thus contributes a term of the form $w_j(x_j - b_j)$, much like the weighted voting of the Golub predictor.


Automatic class discovery

• The classification of the different cancers is the result of years of hypothesis-driven research.

• Suppose you were given unlabeled samples of ALL/AML. Would you be able to distinguish the two classes?


Self Organizing Maps

• SOMs were applied to group the 38 samples.

• Class A1 contained 24/25 ALL samples and 3/13 AML samples.

• How can we validate this grouping?

• Use the labels to do supervised classification via cross-validation.

• A 20-gene predictor gave 34 accurate predictions, 1 error, and 2 of 3 uncertains.


Comparing various error models


Conclusion