Sparse statistical modelling · 2016-11-30
TRANSCRIPT
Sparse statistical modelling
Tom Bartlett
Sparse statistical modelling Tom Bartlett 1 / 28
Introduction
‘A sparse statistical model is one having only a small number of nonzero parameters or weights.’ [1]
The number of features or variables measured on a person or object can be very large (e.g., expression levels of ∼30000 genes)
These measurements are often highly correlated, i.e., contain much redundant information
This scenario is particularly relevant in the age of ‘big-data’
1Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity:the lasso and generalizations. CRC Press, 2015
Outline
Sparse linear models
Sparse PCA
Sparse SVD
Sparse CCA
Sparse LDA
Sparse clustering
Sparse linear models
A linear model can be written as
y_i = α + ∑_{j=1}^p x_{ij} β_j + ε_i = α + x_i^⊤ β + ε_i ,   i = 1, ..., n
Hence, the model can be fit by minimising the objective function
minimise_{α,β}  { ∑_{i=1}^n (y_i − α − x_i^⊤ β)^2 }
Adding a penalisation term to the objective function makes the solution more sparse:

minimise_{α,β}  { (1/2n) ∑_{i=1}^n (y_i − α − x_i^⊤ β)^2 + λ‖β‖_q^q } ,   where q = 1 or 2
Sparse linear models
The penalty term λ‖β‖_q^q means that only the bare minimum is used of all the information available in the p predictor variables x_{ij}, j = 1, ..., p.
minimise_{α,β}  { (1/2n) ∑_{i=1}^n (y_i − α − x_i^⊤ β)^2 + λ‖β‖_q^q }
q is typically chosen as q = 1 or q = 2, because these produce convex optimisation problems and hence are computationally much nicer!
q = 1 is called the ‘lasso’; it tends to set as many elements of β as possible to zero
q = 2 is called ‘ridge regression’, and it tends to minimise the size of all the elements of β
Penalisation is equally applicable to other types of linear models: logistic regression, generalised linear models, etc.
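To make the contrast between the two penalties concrete, here is a small simulated sketch in Python with scikit-learn (the examples later in these slides use R instead); the penalty strength alpha = 0.5 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]               # only 2 of the 10 true coefficients are nonzero
y = X @ beta + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)   # q = 1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)   # q = 2 penalty

# The lasso sets most coefficients exactly to zero;
# ridge only shrinks all of them towards zero.
print("lasso nonzeros:", np.count_nonzero(lasso.coef_))
print("ridge nonzeros:", np.count_nonzero(ridge.coef_))
```

The lasso recovers the sparse support of β, while ridge returns a dense coefficient vector with small entries.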
Sparse linear models - simple example
[Figure: coefficient paths for the lasso (left panel) and ridge regression (right panel), showing the fitted coefficients of the five predictors (hs, college, college4, not-hs, funding) as the penalisation is relaxed, plotted against the normalised coefficient norm ‖β‖₁/max‖β‖₁ (lasso) and ‖β‖₂/max‖β‖₂ (ridge).]
Crime-rate modelled according to 5 predictors: annual police funding in dollars per resident (funding), percent of people 25 years and older with four years of high school (hs), percent of 16- to 19-year-olds not in high school and not high school graduates (not-hs), percent of 18- to 24-year-olds in college (college), and percent of people 25 years and older with at least four years of college (college4).
Sparse linear models - genomics example
Gene expression data, for p = 17280 genes, for n_c = 530 cancer samples + n_h = 61 healthy tissue samples
Fit a logistic (i.e., 2-class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation
Out of 17280 possible genes for prediction, the lasso chooses just these 25 (shown with their fitted model coefficients)
ADAMTS5 −0.0666   HPD −0.00679   NUP210 0.00582
ADH4 −0.165   HS3ST4 −0.0863   PAFAH1B3 0.297
CA4 −0.151   IGSF10 −0.356   TACC3 0.128
CCDC36 −0.335   LRRTM2 −0.0711   TESC −0.0568
CDH12 −0.253   LRRC3B −0.211   TRPM3 −1.24
CES1 −0.302   MEG3 −0.022   TSLP −0.0841
COL10A1 0.747   MMP11 0.22   WDR51A 0.0722
DPP6 −0.107   NUAK2 0.0354   WISP1 0.14
HHATL −0.0665
Caveat: these are not necessarily the only ‘predictive’ genes. If we removed these genes from the data-set and fitted the model again, the lasso would choose an entirely new set of genes which might be almost as good at predicting!
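The R workflow above (cv.glmnet with family = "binomial") has a rough scikit-learn analogue; the sketch below uses simulated data, with sample sizes and signal strengths chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
n, p = 200, 50                        # far more features than truly matter
X = rng.standard_normal((n, p))
# class label driven by just three of the fifty features
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 2]
y = (logits + 0.5 * rng.standard_normal(n) > 0).astype(int)

# L1-penalised logistic regression, with the penalty strength
# (C = 1/lambda, over a grid of 10 values) chosen by 5-fold cross-validation
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1",
                             solver="liblinear").fit(X, y)

selected = np.flatnonzero(model.coef_[0])
print("features kept by the lasso:", selected)
```

As in the genomics example, the fitted model concentrates its nonzero coefficients on a small subset of the available features.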
Sparse PCA
Ordinary PCA finds v by carrying out the optimisation:
maximise_{‖v‖₂=1}  { v^⊤ (X^⊤X / n) v } ,
with X ∈ Rn×p (i.e., n samples and p variables).
With p ≫ n, the eigenvectors of the sample covariance matrix X^⊤X/n are not necessarily close to those of the population covariance matrix [2].
Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of v are encouraged to be zero, by finding v via the optimisation:

maximise_{‖v‖₂=1}  { v^⊤ X^⊤X v } ,   subject to: ‖v‖₁ ≤ t.
In effect this discards some variables such that p is closer to n.

2 Iain M Johnstone. “On the distribution of the largest eigenvalue in principal components analysis”. In: Annals of Statistics (2001), pp. 295–327
Sparse SVD
The SVD of a matrix X ∈ R^{n×p}, with n > p, can be expressed as X = UDV^⊤, where U ∈ R^{n×p} and V ∈ R^{p×p} are orthogonal and D ∈ R^{p×p} is diagonal. The SVD can hence be found by carrying out the optimisation:

minimise_{U ∈ R^{n×p}, V ∈ R^{p×p}, D ∈ R^{p×p}}  ‖X − UDV^⊤‖² .

Hence, a sparse SVD with rank r can be obtained by carrying out the optimisation:

minimise_{U ∈ R^{n×r}, V ∈ R^{p×r}, D ∈ R^{r×r}}  { ‖X − UDV^⊤‖² + λ₁‖U‖₁ + λ₂‖V‖₁ } .
This allows SVD to be applied to the p > n scenario.
Sparse PCA and SVD - an algorithm
SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem
The sparse PCA problem can thus be re-formulated as:

maximise_{‖u‖₂=‖v‖₂=1}  { u^⊤ X v } ,   subject to: ‖v‖₁ ≤ t,

which is biconvex in u and v and can be solved by alternating between the updates:

u ← Xv / ‖Xv‖₂ ,   and   v ← S_λ(X^⊤u) / ‖S_λ(X^⊤u)‖₂ ,   (1)

where S_λ is the soft-thresholding operator S_λ(x) = sign(x) (|x| − λ)₊.
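The alternating updates in (1) are simple to implement; below is a minimal numpy sketch for a single sparse component, with a fixed λ and simulated data (production implementations such as the PMA package add refinements, e.g., choosing λ to meet an ℓ1 bound):

```python
import numpy as np

def soft_threshold(x, lam):
    # S_lambda(x) = sign(x) * (|x| - lambda)_+
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_rank1(X, lam, n_iter=100, seed=0):
    # Alternate the updates in (1):
    #   u <- Xv / ||Xv||_2,   v <- S_lam(X^T u) / ||S_lam(X^T u)||_2
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = X @ v
        u /= np.linalg.norm(u)
        v = soft_threshold(X.T @ u, lam)
        v /= np.linalg.norm(v)
    return u, v

# Simulated data whose leading component loads only on the first 5 of 50 variables
rng = np.random.default_rng(1)
scores = rng.standard_normal((100, 1))
loadings = np.zeros((1, 50))
loadings[0, :5] = 1.0
X = scores @ loadings + 0.1 * rng.standard_normal((100, 50))

u, v = sparse_rank1(X, lam=1.0)
print("nonzero loadings:", np.flatnonzero(v))
```

The soft-thresholding step zeroes the loadings of the 45 noise variables, so v recovers the sparse support of the true component.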
Sparse PCA - simulation study
Define Σ as a p × p block-diagonal matrix, with p = 200 and 10 blocks of 1s of size 20 × 20.
Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.
Generate n samples x ∼ Normal(0, Σ)
Estimate Σ̂ = ∑ (x − x̄)(x − x̄)^⊤ / n
Correlate eigenvectors of Σ̂ with eigenvectors of Σ
Repeat 100 times for each different value of n
[Figure: “Top 10 PCs” - eigenvector correlation plotted against n/p.] The plot shows the means of these correlations over the 100 repetitions for different values of n.
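The simulation is easy to reproduce. The numpy sketch below uses the same Σ but, to keep the check simple and deterministic, reports the relative Frobenius error of Σ̂ rather than eigenvector correlations; the qualitative message is the same, namely that the sample covariance matrix is unreliable when n is small relative to p:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_blocks, block = 200, 10, 20
# Block-diagonal Sigma: 10 blocks of 1s of size 20 x 20
B = np.kron(np.eye(n_blocks), np.ones((1, block)))   # 10 x 200 block indicators
Sigma = B.T @ B

def cov_error(n):
    # x ~ Normal(0, Sigma): one shared N(0,1) score per block of variables
    X = rng.standard_normal((n, n_blocks)) @ B
    Xc = X - X.mean(axis=0)
    Sigma_hat = Xc.T @ Xc / n
    return np.linalg.norm(Sigma_hat - Sigma) / np.linalg.norm(Sigma)

err_small, err_large = cov_error(40), cov_error(4000)
print(err_small, err_large)   # n/p = 0.2 versus n/p = 20
```

With n/p = 0.2 the estimate Σ̂ is far from Σ, while with n ≫ p the error is small.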
Sparse PCA - simulation study
An implementation of sparse PCA is available in the R package PMA as the function spca. It proceeds similarly to the algorithm described earlier, which is presented in more detail by Witten, Tibshirani and Hastie [3].
I applied this function to the same simulation as described in the previous slide.
The scale of the penalisation is in terms of ‖u‖₁, with ‖u‖₁ = √p the minimum and ‖u‖₁ = 1 the maximum permissible level of penalisation.
[Figure: “Top 10 PCs” - eigenvector correlation plotted against n/p.] The plot shows the result with ‖u‖₁ = √p.
3 Daniela M Witten, Robert Tibshirani, and Trevor Hastie. “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis”. In: Biostatistics (2009), kxp008
Sparse PCA - simulation study
[Figure: “Top 10 PCs” - eigenvector correlation plotted against n/p.] The plot shows the result with ‖u‖₁ = √p/2.

[Figure: “Top 10 PCs” - eigenvector correlation plotted against n/p.] The plot shows the result with ‖u‖₁ = √p/3.
Sparse PCA - real data example
I carried out PCA on expression levels of 10138 genes in individual cells from developing brains
There are many different cell types in the data - some mature, some immature, and some in between
Different cell-types are characterised by different gene expression profiles
We would therefore expect to be able to visualise some separation of the cell-types by dimensionality reduction to three dimensions
The plot shows the cells plotted in terms of the top three (standard) PCA components.
Sparse PCA - real data example
The plot shows the cells in terms of the top three sparse PCA components, with ‖u‖₁ = 0.1√p (i.e., a high level of regularisation).

The plot shows the cells in terms of the top three sparse PCA components, with ‖u‖₁ = 0.8√p (i.e., a low level of regularisation).
Sparse CCA
In CCA, the aim is to find coefficient vectors u ∈ R^p and v ∈ R^q which project the data-matrices X ∈ R^{n×p} and Y ∈ R^{n×q} so as to maximise the correlation between these projections.
Whereas PCA aims to find the ‘direction’ of maximum variance in a single data-matrix, CCA aims to find the ‘directions’ in the two data-matrices in which the variances best explain each other.
The CCA problem can be solved by carrying out the optimisation:

maximise_{u ∈ R^p, v ∈ R^q}  Cor(Xu, Yv)

This problem is not well posed for n < max(p, q), in which case u and v can be found which trivially give Cor(Xu, Yv) = 1.
Sparse CCA solves this problem by carrying out the optimisation:

maximise_{u ∈ R^p, v ∈ R^q}  Cor(Xu, Yv) ,   subject to ‖u‖₁ ≤ t₁ and ‖v‖₁ ≤ t₂.
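The degeneracy for n < max(p, q) is easy to demonstrate: with n < p, the linear system Xu = Yv is underdetermined, so for any v a vector u achieving perfect correlation can be found, even when X and Y are pure noise. A short numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 20, 50, 5               # n < p
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))   # pure noise: no real X-Y relationship

v = rng.standard_normal(q)        # any v will do
# minimum-norm u solving the underdetermined system Xu = Yv exactly
u, *_ = np.linalg.lstsq(X, Y @ v, rcond=None)

c = np.corrcoef(X @ u, Y @ v)[0, 1]
print(c)   # a 'perfect' canonical correlation, despite pure-noise data
```

The ℓ1 constraints in sparse CCA rule out such trivial solutions.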
Sparse CCA - real data example
‘Cell cycle’ is a biological process involved in the replication of cells
Cell-cycle can be thought of as a latent process which is not directly observable in genomics data
It is driven by a small set of genes (particularly cyclins and cyclin-dependent kinases) from which it may be inferred
It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes
Used CCA here as an exploratory tool, with Y the data for the cell-cycle genes, and X the data for all the other genes.
Sparse LDA
LDA assigns item i to a group G based on a corresponding data-vector x_i, according to the posterior probability:

P(G = k | x_i) = π_k f_k(x_i) / ∑_{l=1}^K π_l f_l(x_i) ,   with

f_k(x_i) = 1 / ((2π)^{p/2} |Σ|^{1/2}) · exp{ −(1/2) (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) } ,

with prior π_k and mean μ_k for group k, and covariance Σ.
This assignment takes place by constructing ‘decision boundaries’ between classes k and l:

log [ P(G = k | x_i) / P(G = l | x_i) ] = log(π_k / π_l) + x_i^⊤ Σ^{−1} (μ_k − μ_l) − (1/2) (μ_k + μ_l)^⊤ Σ^{−1} (μ_k − μ_l)

Because this boundary is linear in x_i, we get the name LDA.
Sparse LDA

The decision boundary

log [ P(G = k | x_i) / P(G = l | x_i) ] = log(π_k / π_l) + x_i^⊤ Σ^{−1} (μ_k − μ_l) − (1/2) (μ_k + μ_l)^⊤ Σ^{−1} (μ_k − μ_l)

then naturally leads to the decision rule:

G(x_i) = argmax_k { log π_k + x_i^⊤ Σ^{−1} μ_k − (1/2) μ_k^⊤ Σ^{−1} μ_k } .

By assuming Σ is diagonal, i.e., that there is no covariance between the p dimensions, this decision rule can be reduced to the nearest centroids classifier:

G(x_i) = argmin_k { ∑_{j=1}^p (x_{ij} − μ_{jk})² / σ_j² − log π_k } .
Typically, Σ (or σ) is estimated from the data as Σ̂ (or σ̂), and the μ_k are estimated as μ̂_k whilst training the classifier.
Sparse LDA

The nearest centroids classifier

G(x_i) = argmin_k { ∑_{j=1}^p (x_{ij} − μ_{jk})² / σ_j² − log π_k }

will typically use all p variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue.
Define μ_k = x̄ + α_k, where x̄ is the data-mean across all classes, and α_k is the class-specific deviation of the mean from x̄. Then, the nearest shrunken centroids classifier proceeds with the optimisation:

minimise_{α_k ∈ R^p, k ∈ {1,...,K}}  { (1/2n) ∑_{k=1}^K ∑_{i∈C_k} ∑_{j=1}^p (x_{ij} − x̄_j − α_{jk})² / σ_j² + λ ∑_{k=1}^K ∑_{j=1}^p (√n_k / σ_j) |α_{jk}| } ,

where C_k and n_k are the set and number of samples in group k.
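The minimiser of this criterion is available in closed form: each α̂_jk is a soft-thresholded version of the observed class deviation. The Python sketch below implements that idea, simplified relative to the pamr package (e.g., it uses the per-feature overall standard deviation for σ_j, uniform priors, and arbitrary simulated data):

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def shrunken_centroids(X, y, lam):
    # Soft-threshold the class deviations alpha_k = mean_k - overall mean;
    # the per-class threshold follows the penalty weight sqrt(n_k)/sigma_j
    xbar = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-8
    centroids = {}
    for k in np.unique(y):
        Xk = X[y == k]
        alpha = Xk.mean(axis=0) - xbar
        centroids[k] = xbar + soft(alpha, lam * sigma / np.sqrt(len(Xk)))
    return centroids, sigma

def predict(X, centroids, sigma):
    # G(x) = argmin_k sum_j (x_j - mu_jk)^2 / sigma_j^2  (uniform priors,
    # so the -log pi_k term is the same for every class and drops out)
    classes = sorted(centroids)
    scores = np.stack([np.sum((X - centroids[k])**2 / sigma**2, axis=1)
                       for k in classes], axis=1)
    return np.array(classes)[np.argmin(scores, axis=1)]

# Two classes differing in only the first 3 of 30 features
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.standard_normal((100, 30))
X[y == 1, :3] += 3.0

centroids, sigma = shrunken_centroids(X, y, lam=4.0)
pred = predict(X, centroids, sigma)
print("train accuracy:", np.mean(pred == y))
print("features with differing centroids:",
      np.flatnonzero(centroids[0] != centroids[1]))
```

The shrinkage collapses the centroids of the 27 uninformative features onto the overall mean, so only the informative features contribute to the classification.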
Sparse LDA
Hence, the α̂_k estimated from the optimisation

minimise_{α_k ∈ R^p, k ∈ {1,...,K}}  { (1/2n) ∑_{k=1}^K ∑_{i∈C_k} ∑_{j=1}^p (x_{ij} − x̄_j − α_{jk})² / σ_j² + λ ∑_{k=1}^K ∑_{j=1}^p (√n_k / σ_j) |α_{jk}| }

can be used to estimate the shrunken centroids μ̂_k = x̄ + α̂_k, thus training the classifier:

G(x_i) = argmin_k { ∑_{j=1}^p (x_{ij} − μ̂_{jk})² / σ_j² − log π_k } .
Sparse LDA - real data example
I applied nearest (shrunken) centroids to expression data for 14349 genes, for 347 cells of different types: leukocytes (54); lymphoblastic cells (88); fetal brain cells (16wk, 26; 21wk, 24); fibroblasts (37); ductal carcinoma (22); keratinocytes (40); B lymphoblasts (17); iPS cells (24); neural progenitors (15).
Used R packages MASS and pamr [4]. Carried out 100 repetitions of 3-fold CV. Plots show normalised mutual information (NMI), adjusted Rand index (ARI) and prediction accuracy.
[Figure: NMI, ARI and accuracy plotted against the sparsity threshold, showing the 100%, 75%, 50%, 25% and 0% quantiles over the 300 predictions for sparse LDA, with the corresponding quantiles for regular LDA for comparison.]
4 Robert Tibshirani et al. “Class prediction by nearest shrunken centroids, with applications to DNA microarrays”. In: Statistical Science (2003), pp. 104–117
Sparse clustering
Many clustering methods, such as hierarchical clustering, are based on a dissimilarity measure D_{i,i′} = ∑_{j=1}^p d_{i,i′,j} between samples i and i′.
One popular choice of dissimilarity measure is the Euclidean distance.
In high dimensions, it is often unnecessary to use information from all of the p dimensions.
A weighted dissimilarity measure D_{i,i′} = ∑_{j=1}^p w_j d_{i,i′,j} can be a useful approach to this problem. This can be obtained by the sparse matrix decomposition:

maximise_{u ∈ R^{n²}, w ∈ R^p}  u^⊤ Δ w ,   subject to ‖u‖₂ ≤ 1, ‖w‖₂ ≤ 1, ‖w‖₁ ≤ t, and w_j ≥ 0, j ∈ {1, ..., p},

where w is the vector of the weights w_j, j ∈ {1, ..., p}, and Δ ∈ R^{n²×p} contains the dissimilarity components, arranged such that each row of Δ corresponds to the d_{i,i′,j}, j ∈ {1, ..., p} for a pair of samples i, i′.
This weighted dissimilarity measure can then be used for sparse clustering, such as sparse hierarchical clustering.
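As an illustration of how the weighted measure is then used, the Python sketch below builds D_{i,i′} = ∑_j w_j d_{i,i′,j} and feeds it to off-the-shelf hierarchical clustering. The weights here are fixed by hand purely for illustration; in the sparcl package they would come from the penalised matrix decomposition above:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
labels_true = np.repeat([0, 1], 25)
X = rng.standard_normal((50, 20))
X[labels_true == 1, :2] += 4.0        # clusters differ in 2 of 20 features

# per-feature dissimilarity components d_{i,i',j} = (x_ij - x_i'j)^2
d = (X[:, None, :] - X[None, :, :])**2

# hypothetical weights concentrated on the informative features;
# sparcl would instead estimate w via the penalised matrix decomposition
w = np.zeros(20)
w[:2] = 1.0 / np.sqrt(2.0)

D = (d * w).sum(axis=2)               # weighted dissimilarity matrix D_{i,i'}
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust") - 1

# agreement with the true grouping, up to label switching
agree = max(np.mean(labels == labels_true), np.mean(labels != labels_true))
print("clustering agreement:", agree)
```

With the weight concentrated on the two informative features, the 18 noise features do not dilute the dissimilarities, and the two groups separate cleanly.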
Sparse clustering
Some clustering methods, such as K-means, need a slightly modified approach.
K-means seeks to minimise the within-cluster sum of squares

∑_{k=1}^K ∑_{i∈C_k} ‖x_i − x̄_k‖₂² = ∑_{k=1}^K (1/(2 n_k)) ∑_{i,i′∈C_k} ‖x_i − x_i′‖₂² ,

where C_k is the set of samples in cluster k, n_k is the number of samples in cluster k, and x̄_k is the corresponding centroid.
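The identity between the centroid form and the pairwise form follows from expanding the squares (the cross terms cancel); it can be checked numerically for a single cluster:

```python
import numpy as np

rng = np.random.default_rng(0)
Xk = rng.standard_normal((7, 3))      # one cluster of n_k = 7 points
n_k = len(Xk)

# centroid form: sum_i ||x_i - xbar_k||^2
lhs = np.sum((Xk - Xk.mean(axis=0))**2)

# pairwise form: (1/(2 n_k)) * sum over all ordered pairs ||x_i - x_i'||^2
diffs = Xk[:, None, :] - Xk[None, :, :]
rhs = np.sum(diffs**2) / (2 * n_k)

print(lhs, rhs)   # equal
```

The pairwise form is what makes the weighted, per-feature decomposition on the next slides possible, since it removes the centroid from the criterion.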
Hence, a weighted K-means could proceed according to the optimisation:

minimise_{w ∈ R^p}  ∑_{j=1}^p w_j { ∑_{k=1}^K (1/n_k) ∑_{i,i′∈C_k} d_{i,i′,j} } ,

where d_{i,i′,j} = (x_{ij} − x_{i′j})², and n_k is the number of samples in cluster k.
Sparse clustering
However, for the optimisation

minimise_{w ∈ R^p}  ∑_{j=1}^p w_j { ∑_{k=1}^K (1/n_k) ∑_{i,i′∈C_k} d_{i,i′,j} } ,

it is not possible to choose a set of constraints which guarantee a non-pathological solution as well as convexity.
Instead, the between-cluster sum of squares can be maximised:

maximise_{w ∈ R^p}  ∑_{j=1}^p w_j { (1/n) ∑_{i=1}^n ∑_{i′=1}^n d_{i,i′,j} − ∑_{k=1}^K (1/n_k) ∑_{i,i′∈C_k} d_{i,i′,j} } ,

subject to ‖w‖₂ ≤ 1, ‖w‖₁ ≤ t, and w_j ≥ 0, j ∈ {1, ..., p}.
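For fixed clusters, the maximisation over w has a simple form: w is a normalised, soft-thresholded version of the positive part of the per-feature between-cluster sums b_j, with the threshold chosen (e.g., by bisection) so that the ℓ1 constraint holds. A numpy sketch, with simulated data and an arbitrary bound t:

```python
import numpy as np

def between_cluster_sums(X, labels):
    # b_j = (1/n) sum_{i,i'} d_{i,i',j} - sum_k (1/n_k) sum_{i,i' in C_k} d_{i,i',j},
    # with d_{i,i',j} = (x_ij - x_i'j)^2, computed per feature via
    # sum_{i,i'} (z_i - z_i')^2 = 2m * sum z^2 - 2 (sum z)^2
    def pair_sums(Z):
        m = len(Z)
        return 2 * m * (Z**2).sum(axis=0) - 2 * Z.sum(axis=0)**2
    total = pair_sums(X) / len(X)
    within = sum(pair_sums(X[labels == k]) / np.sum(labels == k)
                 for k in np.unique(labels))
    return total - within

def weight_update(b, t, tol=1e-6):
    # maximise w.b  subject to  ||w||_2 <= 1, ||w||_1 <= t, w >= 0
    bp = np.maximum(b, 0.0)
    def w_of(delta):
        s = np.maximum(bp - delta, 0.0)
        nrm = np.linalg.norm(s)
        return s / nrm if nrm > 0 else s
    if np.sum(w_of(0.0)) <= t:            # l1 norm, since w >= 0
        return w_of(0.0)
    lo, hi = 0.0, bp.max()
    while hi - lo > tol:                  # bisection on the threshold delta
        mid = (lo + hi) / 2
        if np.sum(w_of(mid)) > t:
            lo = mid
        else:
            hi = mid
    return w_of(hi)

# Two clusters separated only in the first 2 of 10 features
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
X = rng.standard_normal((100, 10))
X[labels == 1, :2] += 4.0

w = weight_update(between_cluster_sums(X, labels), t=1.5)
print("feature weights:", np.round(w, 2))
```

Sparse K-means (as in sparcl) alternates this w-update with re-clustering the samples under the weighted dissimilarity; here the weights concentrate on the two features that actually separate the clusters.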
Sparse clustering - real data examples
Applied (sparse) hierarchical clustering to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
Used R package sparcl [5] for the sparse clustering. Plots show normalised mutual information (NMI) and adjusted Rand index (ARI), comparing sparse with standard clustering.
[Figure: NMI and ARI plotted against the L1 bound, for sparse hierarchical clustering versus standard hierarchical clustering.]
5 Daniela M Witten and Robert Tibshirani. “A framework for feature selection in clustering”. In: Journal of the American Statistical Association (2012)
Sparse clustering - real data examples
Applied (sparse) k-means to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
Used R package sparcl for the sparse clustering. Plots show normalised mutual information (NMI) and adjusted Rand index (ARI), comparing sparse with standard clustering.
[Figure: NMI and ARI plotted against the L1 bound, for sparse k-means versus standard k-means.]
Sparse clustering - real data examples

Spectral clustering essentially uses k-means clustering (or similar) in a dimensionally-reduced (e.g., PCA) space.
Applied standard k-means in sparse-PCA space to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
This offers computational advantages, running in 9 seconds on a 2.8GHz Macbook, compared with 19 seconds for standard k-means, and 35 seconds for sparse k-means.
[Figure: NMI and ARI plotted against the L1 bound divided by √n, for sparse spectral k-means versus standard k-means.]