
Sparse statistical modelling

Tom Bartlett


Introduction

‘A sparse statistical model is one having only a small number of nonzero parameters or weights.’ [1]

The number of features or variables measured on a person or object can be very large (e.g., expression levels of ∼30000 genes)

These measurements are often highly correlated, i.e., contain much redundant information

This scenario is particularly relevant in the age of ‘big-data’

[1] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.


Outline

Sparse linear models

Sparse PCA

Sparse SVD

Sparse CCA

Sparse LDA

Sparse clustering


Sparse linear models

A linear model can be written as

y_i = α + ∑_{j=1}^{p} x_{ij} β_j + ε_i = α + x_i^⊤β + ε_i,   i = 1, ..., n

Hence, the model can be fit by minimising the objective function

minimise_{α,β} { ∑_{i=1}^{N} (y_i − α − x_i^⊤β)² }

Adding a penalisation term to the objective function makes the solution more sparse:

minimise_{α,β} { (1/2N) ∑_{i=1}^{N} (y_i − α − x_i^⊤β)² + λ‖β‖_q^q },  where q = 1 or 2


Sparse linear models

The penalty term λ‖β‖_q^q means that only the bare minimum is used of all the information available in the p predictor variables x_{ij}, j = 1, ..., p.

minimise_{α,β} { (1/2N) ∑_{i=1}^{N} (y_i − α − x_i^⊤β)² + λ‖β‖_q^q }

q is typically chosen as q = 1 or q = 2, because these give convex problems and hence are computationally much nicer!

q = 1 is called the ‘lasso’; it tends to set as many elements of β as possible to zero

q = 2 is called ‘ridge regression’, and it tends to minimise the size of all the elements of β

Penalisation is equally applicable to other types of linear models: logistic regression, generalised linear models, etc.; a code sketch follows below
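To make this concrete, here is a minimal sketch using the R package glmnet (used again later in these slides); the simulated data and parameter values are illustrative assumptions, not taken from the slides. In glmnet, alpha = 1 gives the lasso penalty and alpha = 0 gives ridge.

library(glmnet)

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)          # simulated predictors
beta <- c(3, -2, 1.5, rep(0, p - 3))     # sparse true coefficient vector
y <- drop(x %*% beta) + rnorm(n)         # response with noise

fit_lasso <- glmnet(x, y, alpha = 1)     # lasso: many coefficients become exactly zero
fit_ridge <- glmnet(x, y, alpha = 0)     # ridge: coefficients shrunk, but not zeroed
plot(fit_lasso, xvar = "norm")           # coefficient paths against ||beta||_1
plot(fit_ridge, xvar = "norm")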


Sparse linear models - simple example

[Figure: coefficient paths for the lasso (left) and ridge regression (right), showing the coefficients of the five predictors (hs, college, college4, not-hs, funding) against the scaled penalty bounds ‖β‖₁/max‖β‖₁ and ‖β‖₂/max‖β‖₂ respectively.]

Crime-rate modelled according to 5 predictors: annual police funding in dollars per resident (funding), percent of people 25 years and older with four years of high school (hs), percent of 16- to 19-year-olds not in high school and not high school graduates (not-hs), percent of 18- to 24-year-olds in college (college), and percent of people 25 years and older with at least four years of college (college4).


Sparse linear models - genomics example

Gene expression data, for p = 17280 genes, for nc = 530 cancer samples + nh = 61 healthy tissue samples

Fit logistic (i.e., 2-class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation (a code sketch of this workflow appears after the table below)

Out of 17280 possible genes for prediction, lasso chooses just these 25 (shown with their fitted model coefficients)

ADAMTS5  -0.0666    HPD      -0.00679   NUP210    0.00582
ADH4     -0.165     HS3ST4   -0.0863    PAFAH1B3  0.297
CA4      -0.151     IGSF10   -0.356     TACC3     0.128
CCDC36   -0.335     LRRTM2   -0.0711    TESC     -0.0568
CDH12    -0.253     LRRC3B   -0.211     TRPM3    -1.24
CES1     -0.302     MEG3     -0.022     TSLP     -0.0841
COL10A1   0.747     MMP11     0.22      WDR51A    0.0722
DPP6     -0.107     NUAK2     0.0354    WISP1     0.14
HHATL    -0.0665

Caveat: these are not necessarily the only ‘predictive’ genes. If we removed these genes from the data-set and fitted the model again, lasso would choose an entirely new set of genes which might be almost as good at predicting!
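A hedged sketch of the workflow above (the object names expr and status are illustrative assumptions; the slides give no code):

library(glmnet)
# expr: samples x genes matrix; status: factor with levels "healthy"/"cancer"
cvfit <- cv.glmnet(expr, status, family = "binomial")  # selects lambda by cross-validation
coefs <- as.matrix(coef(cvfit, s = "lambda.min"))      # fitted coefficients at chosen lambda
selected <- rownames(coefs)[coefs != 0]                # the genes the lasso retains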


Sparse PCA

Ordinary PCA finds v by carrying out the optimisation:

maximise_{‖v‖₂=1} { v^⊤ (X^⊤X/n) v },

with X ∈ R^{n×p} (i.e., n samples and p variables).

With p ≫ n, the eigenvectors of the sample covariance matrix X^⊤X/n are not necessarily close to those of the population covariance matrix [2].

Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of v are encouraged to be zero, by finding v by carrying out the optimisation:

maximise_{‖v‖₂=1} { v^⊤X^⊤Xv },  subject to: ‖v‖₁ ≤ t.

In effect this discards some variables such that p is closer to n.

[2] Iain M. Johnstone. “On the distribution of the largest eigenvalue in principal components analysis”. Annals of Statistics (2001), pp. 295–327.


Sparse SVD

The SVD of a matrix X ∈ R^{n×p}, with n > p, can be expressed as X = UDV^⊤, where U ∈ R^{n×p} has orthonormal columns, V ∈ R^{p×p} is orthogonal, and D ∈ R^{p×p} is diagonal. The SVD can hence be found by carrying out the optimisation:

minimise_{U∈R^{n×p}, V∈R^{p×p}, D∈R^{p×p}} ‖X − UDV^⊤‖².

Hence, a sparse SVD with rank r can be obtained by carrying out the optimisation:

minimise_{U∈R^{n×r}, V∈R^{p×r}, D∈R^{r×r}} { ‖X − UDV^⊤‖² + λ₁‖U‖₁ + λ₂‖V‖₁ }.

This allows SVD to be applied to the p > n scenario.


Sparse PCA and SVD - an algorithm

SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem

The sparse PCA problem can thus be re-formulated as:

maximise_{‖u‖₂=‖v‖₂=1} { u^⊤Xv },  subject to: ‖v‖₁ ≤ t,

which is biconvex in u and v and can be solved by alternating between the updates:

u ← Xv / ‖Xv‖₂,  and  v ← S_λ(X^⊤u) / ‖S_λ(X^⊤u)‖₂,   (1)

where S_λ is the soft-thresholding operator S_λ(x) = sign(x)(|x| − λ)₊.
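A minimal R sketch of update (1) for a single sparse component, assuming a fixed threshold λ; the PMA implementation described on the next slides instead tunes the threshold at each iteration to satisfy an ℓ₁ bound:

soft <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)  # S_lambda

sparse_rank1 <- function(X, lambda, n_iter = 100) {
  v <- svd(X)$v[, 1]                      # initialise at the leading singular vector
  for (iter in seq_len(n_iter)) {
    u <- X %*% v
    u <- u / sqrt(sum(u^2))               # u <- Xv / ||Xv||_2
    v <- soft(crossprod(X, u), lambda)    # v <- S_lambda(X'u), then normalise;
    v <- v / sqrt(sum(v^2))               # lambda must be small enough that v != 0
  }
  list(u = u, v = v, d = as.numeric(t(u) %*% X %*% v))  # d: the singular value
}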


Sparse PCA - simulation study

Define Σ as a p × p block-diagonal matrix, with p = 200 and 10 blocks of 1s of size 20 × 20.

Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.

Generate n samples x ∼ Normal(0, Σ)

Estimate Σ̂ = (1/n) ∑_i (x_i − x̄)(x_i − x̄)^⊤

Correlate eigenvectors of Σ̂ with eigenvectors of Σ

Repeat 100 times for each different value of n (a code sketch of one repetition follows below)
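A hedged sketch of one repetition in R, using MASS::mvrnorm to sample from the singular Σ; pairing each true eigenvector with its best-correlated estimate is one reasonable convention, assumed here:

library(MASS)

p <- 200; n <- 100                                    # e.g. n/p = 0.5
Sigma <- kronecker(diag(10), matrix(1, 20, 20))       # 10 diagonal blocks of 1s
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)        # x ~ Normal(0, Sigma)
Sigma_hat <- crossprod(sweep(X, 2, colMeans(X))) / n  # sample covariance estimate
true_vecs <- eigen(Sigma)$vectors[, 1:10]
est_vecs  <- eigen(Sigma_hat)$vectors[, 1:10]
cors <- apply(abs(cor(true_vecs, est_vecs)), 1, max)  # best match per true eigenvector
mean(cors)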

[Figure: ‘Top 10 PCs’ — eigenvector correlation (y-axis) against n/p (x-axis). The plot shows the means of these correlations over the 100 repetitions for different values of n.]


Sparse PCA - simulation study

An implementation of sparse PCA is available in the R package PMA, as the function SPC. It proceeds similarly to the algorithm described earlier, which is presented in more detail by Witten, Tibshirani and Hastie [3].

I applied this function to the same simulation as described in the previous slide.

The scale of the penalisation is set via the bound on ‖u‖₁: ‖u‖₁ = √p gives the minimum penalisation (no sparsity) and ‖u‖₁ = 1 the maximum permissible penalisation. A usage sketch follows below.
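A hedged usage sketch, assuming the data matrix X has samples in rows and variables in columns; in PMA the ℓ₁ bound on the sparse loading vector is the sumabsv argument, which must lie between 1 and √p:

library(PMA)
out <- SPC(X, sumabsv = sqrt(ncol(X)), K = 10)  # bound = sqrt(p): minimal penalisation
sparse_loadings <- out$v                        # p x 10 matrix of sparse loadings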

[Figure: ‘Top 10 PCs’ — eigenvector correlation against n/p, as before. The plot shows the result with ‖u‖₁ = √p.]

[3] Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis”. Biostatistics (2009), kxp008.


Sparse PCA - simulation study

[Figure: ‘Top 10 PCs’ — eigenvector correlation against n/p. The plot shows the result with ‖u‖₁ = √p/2.]

[Figure: ‘Top 10 PCs’ — eigenvector correlation against n/p. The plot shows the result with ‖u‖₁ = √p/3.]


Sparse PCA - real data example

I carried out PCA on expression levels of 10138 genes in individual cells from developing brains

There are many different cell types in the data - some mature, some immature, and some in between

Different cell-types are characterised by different gene expression profiles

We would therefore expect to be able to visualise some separation of the cell-types by dimensionality reduction to three dimensions (the standard PCA computation is sketched below)
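For reference, the standard PCA baseline here can be computed with base R; a sketch, assuming the expression matrix X has cells in rows and genes in columns:

pc <- prcomp(X)          # standard (non-sparse) PCA
scores <- pc$x[, 1:3]    # top three PC scores, one row per cell
pairs(scores)            # pairwise views of the three components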

[Figure: the cells plotted in terms of the top three (standard) PCA components.]


Sparse PCA - real data example

[Figure: the cells plotted in terms of the top three sparse PCA components, with ‖u‖₁ = 0.1√p (i.e., a high level of regularisation).]

[Figure: the cells plotted in terms of the top three sparse PCA components, with ‖u‖₁ = 0.8√p (i.e., a low level of regularisation).]


Sparse CCA

In CCA, the aim is to find coefficient vectors u ∈ R^p and v ∈ R^q which project the data-matrices X ∈ R^{n×p} and Y ∈ R^{n×q} so as to maximise the correlations between these projections.

Whereas PCA aims to find the ‘direction’ of maximum variance in a single data-matrix, CCA aims to find the ‘directions’ in the two data-matrices in which the variances best explain each other.

The CCA problem can be solved by carrying out the optimisation:

maximise_{u∈R^p, v∈R^q} Cor(Xu, Yv)

This problem is not well posed for n < max(p, q), in which case u and v can be found which trivially give Cor(Xu, Yv) = 1.

Sparse CCA solves this problem by carrying out the optimisation:

maximise_{u∈R^p, v∈R^q} Cor(Xu, Yv),  subject to ‖u‖₁ < t₁ and ‖v‖₁ < t₂.


Sparse CCA - real data example

‘Cell cycle’ is a biological process involved in the replication of cells

Cell-cycle can be thought of as a latent process which is not directly observable in genomics data

It is driven by a small set of genes (particularly cyclins and cyclin-dependent kinases) from which it may be inferred

It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes

Used CCA here as an exploratory tool, with Y the data for the cell cycle genes, and X the data for all the other genes.


Sparse LDA

LDA assigns item i to a group G based on a corresponding data-vector x_i, according to the posterior probability:

P(G = k | x_i) = π_k f_k(x_i) / ∑_{l=1}^{K} π_l f_l(x_i),  with

f_k(x_i) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp{ −½ (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) },

with prior π_k and mean μ_k for group k, and covariance Σ.

This assignment takes place by constructing ‘decision boundaries’ between classes k and l:

log [ P(G = k | x_i) / P(G = l | x_i) ] = log(π_k/π_l) + x_i^⊤ Σ^{−1}(μ_k − μ_l) − ½ (μ_k + μ_l)^⊤ Σ^{−1}(μ_k − μ_l)

Because this boundary is linear in x_i, we get the name LDA.


Sparse LDA

The decision boundary

log [ P(G = k | x_i) / P(G = l | x_i) ] = log(π_k/π_l) + x_i^⊤ Σ^{−1}(μ_k − μ_l) − ½ (μ_k + μ_l)^⊤ Σ^{−1}(μ_k − μ_l)

then naturally leads to the decision rule:

G(x_i) = argmax_k { log π_k + x_i^⊤ Σ^{−1} μ_k − ½ μ_k^⊤ Σ^{−1} μ_k }.

By assuming Σ is diagonal, i.e., there is no covariance between the p dimensions, this decision rule can be reduced to the nearest centroids classifier:

G(x_i) = argmin_k { ∑_{j=1}^{p} (x_{ij} − μ_{jk})² / σ_j² − 2 log π_k }.

Typically, Σ (or the σ_j) are estimated from the data as Σ̂ (or σ̂_j), and the μ_k are estimated as μ̂_k whilst training the classifier.


Sparse LDA

The nearest centroids classifier

G(x_i) = argmin_k { ∑_{j=1}^{p} (x_{ij} − μ_{jk})² / σ_j² − 2 log π_k }

will typically use all p variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue.

Define μ_k = x̄ + α_k, where x̄ is the data-mean across all classes, and α_k is the class-specific deviation of the mean from x̄. Then, the nearest shrunken centroids classifier proceeds with the optimisation:

minimise_{α_k∈R^p, k∈{1,...,K}} { (1/2n) ∑_{k=1}^{K} ∑_{i∈C_k} ∑_{j=1}^{p} (x_{ij} − x̄_j − α_{jk})² / σ_j² + λ ∑_{k=1}^{K} ∑_{j=1}^{p} √(n_k/σ_j²) |α_{jk}| },

where C_k and n_k are the set and number of samples in group k.


Sparse LDA

Hence, the α̂_k estimated from the optimisation

minimise_{α_k∈R^p, k∈{1,...,K}} { (1/2n) ∑_{k=1}^{K} ∑_{i∈C_k} ∑_{j=1}^{p} (x_{ij} − x̄_j − α_{jk})² / σ_j² + λ ∑_{k=1}^{K} ∑_{j=1}^{p} √(n_k/σ_j²) |α_{jk}| }

can be used to estimate the shrunken centroids μ̂_k = x̄ + α̂_k, thus training the classifier:

G(x_i) = argmin_k { ∑_{j=1}^{p} (x_{ij} − μ̂_{jk})² / σ̂_j² − 2 log π̂_k }.


Sparse LDA - real data example

I applied nearest (shrunken) centroids to expression data for 14349 genes, for 347 cells of different types: leukocytes (54); lymphoblastic cells (88); fetal brain cells (16wk, 26; 21wk, 24); fibroblasts (37); ductal carcinoma (22); keratinocytes (40); B lymphoblasts (17); iPS cells (24); neural progenitors (15).

Used R packages MASS, and pamr [4] (a usage sketch follows below). Carried out 100 repetitions of 3-fold CV. Plots show normalised mutual information (NMI), adjusted Rand index (ARI) and prediction accuracy.
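A hedged sketch of the pamr workflow (expr is assumed to be a genes × cells matrix and celltype a vector of class labels, matching pamr's convention of features in rows):

library(pamr)
d <- list(x = expr, y = celltype)
fit <- pamr.train(d)                            # fit shrunken centroids
cvfit <- pamr.cv(fit, d)                        # CV error across shrinkage thresholds
pamr.plotcv(cvfit)                              # inspect to choose a threshold
pred <- pamr.predict(fit, expr, threshold = 2)  # predicted classes at that threshold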

[Figure: three panels showing NMI, ARI and prediction accuracy against the sparsity threshold, with quantile bands (100%, 75%, 50%, 25%, 0%, over 300 predictions) for sparse LDA, and the corresponding quantiles for regular LDA for comparison.]

[4] Robert Tibshirani et al. “Class prediction by nearest shrunken centroids, with applications to DNA microarrays”. Statistical Science (2003), pp. 104–117.


Sparse clustering

Many clustering methods, such as hierarchical clustering, are based on a dissimilarity measure D_{i,i′} = ∑_{j=1}^{p} d_{i,i′,j} between samples i and i′.

One popular choice of dissimilarity measure is the Euclidean distance.

In high dimensions, it is often unnecessary to use information from all of the p dimensions.

A weighted dissimilarity measure D_{i,i′} = ∑_{j=1}^{p} w_j d_{i,i′,j} can be a useful approach to this problem. This can be obtained by the sparse matrix decomposition:

maximise_{u∈R^{n²}, w∈R^p} u^⊤Δw,  subject to ‖u‖₂ ≤ 1, ‖w‖₂ ≤ 1, ‖w‖₁ ≤ t, and w_j ≥ 0, j ∈ {1, ..., p},

where w is the vector of the weights w_j, j ∈ {1, ..., p}, and Δ ∈ R^{n²×p} holds the dissimilarity components, arranged such that each row of Δ corresponds to the d_{i,i′,j}, j ∈ {1, ..., p}, for a pair of samples i, i′.

This weighted dissimilarity measure can then be used for sparse clustering, such as sparse hierarchical clustering.


Sparse clustering

Some clustering methods, such as K-means, need a slightly modified approach.

K-means seeks to minimise the within-cluster sum of squares

∑_{k=1}^{K} ∑_{i∈C_k} ‖x_i − x̄_k‖₂² = ∑_{k=1}^{K} (1/(2n_k)) ∑_{i,i′∈C_k} ‖x_i − x_i′‖₂²

where C_k is the set of samples in cluster k and x̄_k is the corresponding centroid.

Hence, a weighted K-means could proceed according to the optimisation:

minimise_{w∈R^p} ∑_{j=1}^{p} w_j ∑_{k=1}^{K} (1/n_k) ∑_{i,i′∈C_k} d_{i,i′,j},

where d_{i,i′,j} = (x_{ij} − x_{i′j})², and n_k is the number of samples in cluster k.


Sparse clustering

However, for the optimisation

minimise_{w∈R^p} ∑_{j=1}^{p} w_j ∑_{k=1}^{K} (1/n_k) ∑_{i,i′∈C_k} d_{i,i′,j},

it is not possible to choose a set of constraints which guarantee a non-pathological solution as well as convexity.

Instead, the between-cluster sum of squares can be maximised:

maximise_{w∈R^p} ∑_{j=1}^{p} w_j { (1/n) ∑_{i=1}^{n} ∑_{i′=1}^{n} d_{i,i′,j} − ∑_{k=1}^{K} (1/n_k) ∑_{i,i′∈C_k} d_{i,i′,j} }

subject to ‖w‖₂ ≤ 1, ‖w‖₁ ≤ t, and w_j ≥ 0, j ∈ {1, ..., p}.


Sparse clustering - real data examples

Applied (sparse) hierarchical clustering to the same benchmark expression data-set (14349 genes, for 347 cells of different types).

Used R package sparcl [5] for the sparse clustering (a usage sketch follows below). Plots show normalised mutual information (NMI) and adjusted Rand index (ARI) comparing sparse with standard clustering.
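A hedged sketch with sparcl (x assumed to be a cells × genes matrix); wbound is the ℓ₁ bound on the feature weights w, and sparcl's permutation approach can be used to choose it:

library(sparcl)
perm <- HierarchicalSparseCluster.permute(x, wbounds = c(2, 5, 10, 50))
shc <- HierarchicalSparseCluster(x, wbound = perm$bestw, method = "complete")
plot(shc$hc)   # dendrogram built from the weighted dissimilarity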

[Figure: NMI and ARI against the L1 bound, for sparse hierarchical clustering versus standard hierarchical clustering.]

[5] Daniela M. Witten and Robert Tibshirani. “A framework for feature selection in clustering”. Journal of the American Statistical Association (2012).


Sparse clustering - real data examples

Applied (sparse) k-means to the same benchmark expression data-set (14349 genes, for 347 cells of different types).

Used R package sparcl for the sparse clustering (a usage sketch follows below). Plots show normalised mutual information (NMI) and adjusted Rand index (ARI) comparing sparse with standard clustering.
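A hedged sketch of sparse K-means with sparcl (x as before; K = 10 is an illustrative assumption matching the ten cell populations listed earlier; wbounds are candidate ℓ₁ bounds):

library(sparcl)
perm <- KMeansSparseCluster.permute(x, K = 10, wbounds = seq(2, 20, by = 2))
skm <- KMeansSparseCluster(x, K = 10, wbounds = perm$bestw)
table(skm[[1]]$Cs)   # cluster assignments at the chosen bound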

[Figure: NMI and ARI against the L1 bound, for sparse k-means versus standard k-means.]


Sparse clustering - real data examples

Spectral clustering essentially uses k-means clustering (or similar) in dimensionally-reduced (e.g., PCA) space.

Applied standard k-means in sparse-PCA space to the same benchmark expression data-set (14349 genes, for 347 cells of different types).

Offers computational advantages, running in 9 seconds on a 2.8GHz MacBook, compared with 19 seconds for standard k-means, and 35 seconds for sparse k-means.

[Figure: NMI and ARI against the L1 bound divided by √n, for sparse spectral k-means versus standard k-means.]
