Sparse statistical modelling · 2016-11-30
TRANSCRIPT
Sparse statistical modelling
Tom Bartlett
Sparse statistical modelling Tom Bartlett 1 / 28
Introduction
‘A sparse statistical model is one having only a small number of nonzero parameters or weights.’ [1]
The number of features or variables measured on a person or object can be very large (e.g., expression levels of ∼30000 genes)
These measurements are often highly correlated, i.e., contain much redundant information
This scenario is particularly relevant in the age of ‘big-data’
1Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity:the lasso and generalizations. CRC Press, 2015
Outline
Sparse linear models
Sparse PCA
Sparse SVD
Sparse CCA
Sparse LDA
Sparse clustering
Sparse linear models
A linear model can be written as
y_i = α + ∑_{j=1}^p x_{ij} β_j + ε_i = α + x_i^⊤ β + ε_i ,   i = 1, ..., n
Hence, the model can be fit by minimising the objective function
minimise_{α,β}  { ∑_{i=1}^n (y_i − α − x_i^⊤ β)^2 }
Adding a penalisation term to the objective function makes the solution more sparse:

minimise_{α,β}  { (1/2n) ∑_{i=1}^n (y_i − α − x_i^⊤ β)^2 + λ‖β‖_q^q } ,   where q = 1 or 2
Sparse linear models
The penalty term λ‖β‖_q^q means that only the bare minimum is used of all the information available in the p predictor variables x_{ij}, j = 1, ..., p.
minimise_{α,β}  { (1/2n) ∑_{i=1}^n (y_i − α − x_i^⊤ β)^2 + λ‖β‖_q^q }
q is typically chosen as q = 1 or q = 2, because these produce convex optimisation problems and hence are computationally much nicer!
q = 1 is called the ‘lasso’; it tends to set as many elements of β as possible to zero
q = 2 is called ‘ridge regression’, and it tends to minimise the size of all the elements of β
Penalisation is equally applicable to other types of linear models: logistic regression, generalised linear models, etc.
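To make the contrast between the two penalties concrete, here is a small simulated sketch in Python with scikit-learn (the examples later in these slides use R instead); the penalty strength alpha = 0.5 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]               # only 2 of the 10 true coefficients are nonzero
y = X @ beta + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)   # q = 1 penalty
ridge = Ridge(alpha=0.5).fit(X, y)   # q = 2 penalty

# The lasso sets most coefficients exactly to zero;
# ridge only shrinks all of them towards zero.
print("lasso nonzeros:", np.count_nonzero(lasso.coef_))
print("ridge nonzeros:", np.count_nonzero(ridge.coef_))
```

The lasso recovers the sparse support of β, while ridge returns a dense coefficient vector with small entries.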
Sparse linear models - simple example
[Figure: coefficient paths for the lasso (left panel) and ridge regression (right panel), showing the fitted coefficients of the five predictors (hs, college, college4, not-hs, funding) as the penalisation is relaxed, plotted against the normalised coefficient norm ‖β‖₁/max‖β‖₁ (lasso) and ‖β‖₂/max‖β‖₂ (ridge).]
Crime-rate modelled according to 5 predictors: annual police funding in dollars per resident (funding), percent of people 25 years and older with four years of high school (hs), percent of 16- to 19-year-olds not in high school and not high school graduates (not-hs), percent of 18- to 24-year-olds in college (college), and percent of people 25 years and older with at least four years of college (college4).
Sparse linear models - genomics example
Gene expression data, for p = 17280 genes, for n_c = 530 cancer samples + n_h = 61 healthy tissue samples
Fit a logistic (i.e., 2-class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation
Out of 17280 possible genes for prediction, the lasso chooses just these 25 (shown with their fitted model coefficients)
ADAMTS5 −0.0666   HPD −0.00679   NUP210 0.00582
ADH4 −0.165   HS3ST4 −0.0863   PAFAH1B3 0.297
CA4 −0.151   IGSF10 −0.356   TACC3 0.128
CCDC36 −0.335   LRRTM2 −0.0711   TESC −0.0568
CDH12 −0.253   LRRC3B −0.211   TRPM3 −1.24
CES1 −0.302   MEG3 −0.022   TSLP −0.0841
COL10A1 0.747   MMP11 0.22   WDR51A 0.0722
DPP6 −0.107   NUAK2 0.0354   WISP1 0.14
HHATL −0.0665
Caveat: these are not necessarily the only ‘predictive’ genes. If we removed these genes from the data-set and fitted the model again, the lasso would choose an entirely new set of genes which might be almost as good at predicting!
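The R workflow above (cv.glmnet with family = "binomial") has a rough scikit-learn analogue; the sketch below uses simulated data, with sample sizes and signal strengths chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
n, p = 200, 50                        # far more features than truly matter
X = rng.standard_normal((n, p))
# class label driven by just three of the fifty features
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 2]
y = (logits + 0.5 * rng.standard_normal(n) > 0).astype(int)

# L1-penalised logistic regression, with the penalty strength
# (C = 1/lambda, over a grid of 10 values) chosen by 5-fold cross-validation
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1",
                             solver="liblinear").fit(X, y)

selected = np.flatnonzero(model.coef_[0])
print("features kept by the lasso:", selected)
```

As in the genomics example, the fitted model concentrates its nonzero coefficients on a small subset of the available features.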
Sparse PCA
Ordinary PCA finds v by carrying out the optimisation:
maximise_{‖v‖₂=1}  { v^⊤ (X^⊤X / n) v } ,
with X ∈ Rn×p (i.e., n samples and p variables).
With p ≫ n, the eigenvectors of the sample covariance matrix X^⊤X/n are not necessarily close to those of the population covariance matrix [2].
Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of v are encouraged to be zero, by finding v via the optimisation:

maximise_{‖v‖₂=1}  { v^⊤ X^⊤X v } ,   subject to: ‖v‖₁ ≤ t.
In effect this discards some variables such that p is closer to n.

2 Iain M Johnstone. “On the distribution of the largest eigenvalue in principal components analysis”. In: Annals of Statistics (2001), pp. 295–327
Sparse SVD
The SVD of a matrix X ∈ R^{n×p}, with n > p, can be expressed as X = UDV^⊤, where U ∈ R^{n×p} and V ∈ R^{p×p} are orthogonal and D ∈ R^{p×p} is diagonal. The SVD can hence be found by carrying out the optimisation:

minimise_{U ∈ R^{n×p}, V ∈ R^{p×p}, D ∈ R^{p×p}}  ‖X − UDV^⊤‖² .

Hence, a sparse SVD with rank r can be obtained by carrying out the optimisation:

minimise_{U ∈ R^{n×r}, V ∈ R^{p×r}, D ∈ R^{r×r}}  { ‖X − UDV^⊤‖² + λ₁‖U‖₁ + λ₂‖V‖₁ } .
This allows SVD to be applied to the p > n scenario.
Sparse PCA and SVD - an algorithm
SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem
The sparse PCA problem can thus be re-formulated as:

maximise_{‖u‖₂=‖v‖₂=1}  { u^⊤ X v } ,   subject to: ‖v‖₁ ≤ t,

which is biconvex in u and v and can be solved by alternating between the updates:

u ← Xv / ‖Xv‖₂ ,   and   v ← S_λ(X^⊤u) / ‖S_λ(X^⊤u)‖₂ ,   (1)

where S_λ is the soft-thresholding operator S_λ(x) = sign(x) (|x| − λ)₊.
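The alternating updates in (1) are simple to implement; below is a minimal numpy sketch for a single sparse component, with a fixed λ and simulated data (production implementations such as the PMA package add refinements, e.g., choosing λ to meet an ℓ1 bound):

```python
import numpy as np

def soft_threshold(x, lam):
    # S_lambda(x) = sign(x) * (|x| - lambda)_+
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_rank1(X, lam, n_iter=100, seed=0):
    # Alternate the updates in (1):
    #   u <- Xv / ||Xv||_2,   v <- S_lam(X^T u) / ||S_lam(X^T u)||_2
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = X @ v
        u /= np.linalg.norm(u)
        v = soft_threshold(X.T @ u, lam)
        v /= np.linalg.norm(v)
    return u, v

# Simulated data whose leading component loads only on the first 5 of 50 variables
rng = np.random.default_rng(1)
scores = rng.standard_normal((100, 1))
loadings = np.zeros((1, 50))
loadings[0, :5] = 1.0
X = scores @ loadings + 0.1 * rng.standard_normal((100, 50))

u, v = sparse_rank1(X, lam=1.0)
print("nonzero loadings:", np.flatnonzero(v))
```

The soft-thresholding step zeroes the loadings of the 45 noise variables, so v recovers the sparse support of the true component.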
Sparse PCA - simulation study
Define Σ as a p × p block-diagonal matrix, with p = 200 and 10 blocks of 1s of size 20 × 20.
Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.
Generate n samples x ∼ Normal(0, Σ)
Estimate Σ̂ = ∑ (x − x̄)(x − x̄)^⊤ / n
Correlate eigenvectors of Σ̂ with eigenvectors of Σ
Repeat 100 times for each different value of n
[Figure: “Top 10 PCs” - eigenvector correlation plotted against n/p.] The plot shows the means of these correlations over the 100 repetitions for different values of n.
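The simulation is easy to reproduce. The numpy sketch below uses the same Σ but, to keep the check simple and deterministic, reports the relative Frobenius error of Σ̂ rather than eigenvector correlations; the qualitative message is the same, namely that the sample covariance matrix is unreliable when n is small relative to p:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_blocks, block = 200, 10, 20
# Block-diagonal Sigma: 10 blocks of 1s of size 20 x 20
B = np.kron(np.eye(n_blocks), np.ones((1, block)))   # 10 x 200 block indicators
Sigma = B.T @ B

def cov_error(n):
    # x ~ Normal(0, Sigma): one shared N(0,1) score per block of variables
    X = rng.standard_normal((n, n_blocks)) @ B
    Xc = X - X.mean(axis=0)
    Sigma_hat = Xc.T @ Xc / n
    return np.linalg.norm(Sigma_hat - Sigma) / np.linalg.norm(Sigma)

err_small, err_large = cov_error(40), cov_error(4000)
print(err_small, err_large)   # n/p = 0.2 versus n/p = 20
```

With n/p = 0.2 the estimate Σ̂ is far from Σ, while with n ≫ p the error is small.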
Sparse PCA - simulation study
An implementation of sparse PCA is available in the R package PMA as the function spca. It proceeds similarly to the algorithm described earlier, which is presented in more detail by Witten, Tibshirani and Hastie [3].
I applied this function to the same simulation as described in the previous slide.
The scale of the penalisation is in terms of ‖u‖₁, with ‖u‖₁ = √p the minimum and ‖u‖₁ = 1 the maximum permissible level of penalisation.
[Figure: “Top 10 PCs” - eigenvector correlation plotted against n/p.] The plot shows the result with ‖u‖₁ = √p.
3 Daniela M Witten, Robert Tibshirani, and Trevor Hastie. “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis”. In: Biostatistics (2009), kxp008
Sparse PCA - simulation study
[Figure: “Top 10 PCs” - eigenvector correlation plotted against n/p.] The plot shows the result with ‖u‖₁ = √p/2.

[Figure: “Top 10 PCs” - eigenvector correlation plotted against n/p.] The plot shows the result with ‖u‖₁ = √p/3.
Sparse PCA - real data example
I carried out PCA on expression levels of 10138 genes in individual cells from developing brains
There are many different cell types in the data - some mature, some immature, and some in between
Different cell-types are characterised by different gene expression profiles
We would therefore expect to be able to visualise some separation of the cell-types by dimensionality reduction to three dimensions
The plot shows the cells plotted in terms of the top three (standard) PCA components.
Sparse PCA - real data example
The plot shows the cells in terms of the top three sparse PCA components, with ‖u‖₁ = 0.1√p (i.e., a high level of regularisation).

The plot shows the cells in terms of the top three sparse PCA components, with ‖u‖₁ = 0.8√p (i.e., a low level of regularisation).
Sparse CCA
In CCA, the aim is to find coefficient vectors u ∈ R^p and v ∈ R^q which project the data-matrices X ∈ R^{n×p} and Y ∈ R^{n×q} so as to maximise the correlation between these projections.
Whereas PCA aims to find the ‘direction’ of maximum variance in a single data-matrix, CCA aims to find the ‘directions’ in the two data-matrices in which the variances best explain each other.
The CCA problem can be solved by carrying out the optimisation:

maximise_{u ∈ R^p, v ∈ R^q}  Cor(Xu, Yv)

This problem is not well posed for n < max(p, q), in which case u and v can be found which trivially give Cor(Xu, Yv) = 1.
Sparse CCA solves this problem by carrying out the optimisation:

maximise_{u ∈ R^p, v ∈ R^q}  Cor(Xu, Yv) ,   subject to ‖u‖₁ ≤ t₁ and ‖v‖₁ ≤ t₂.
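The degeneracy for n < max(p, q) is easy to demonstrate: with n < p, the linear system Xu = Yv is underdetermined, so for any v a vector u achieving perfect correlation can be found, even when X and Y are pure noise. A short numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 20, 50, 5               # n < p
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))   # pure noise: no real X-Y relationship

v = rng.standard_normal(q)        # any v will do
# minimum-norm u solving the underdetermined system Xu = Yv exactly
u, *_ = np.linalg.lstsq(X, Y @ v, rcond=None)

c = np.corrcoef(X @ u, Y @ v)[0, 1]
print(c)   # a 'perfect' canonical correlation, despite pure-noise data
```

The ℓ1 constraints in sparse CCA rule out such trivial solutions.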
Sparse CCA - real data example
‘Cell cycle’ is a biological process involved in the replication of cells
Cell-cycle can be thought of as a latent process which is not directly observable in genomics data
It is driven by a small set of genes (particularly cyclins and cyclin-dependent kinases) from which it may be inferred
It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes
Used CCA here as an exploratory tool, with Y the data for the cell-cycle genes, and X the data for all the other genes.
Sparse LDA
LDA assigns item i to a group G based on a corresponding data-vector x_i, according to the posterior probability:

P(G = k | x_i) = π_k f_k(x_i) / ∑_{l=1}^K π_l f_l(x_i) ,   with

f_k(x_i) = 1 / ((2π)^{p/2} |Σ|^{1/2}) · exp{ −(1/2) (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) } ,

with prior π_k and mean μ_k for group k, and covariance Σ.
This assignment takes place by constructing ‘decision boundaries’ between classes k and l:

log [ P(G = k | x_i) / P(G = l | x_i) ] = log(π_k / π_l) + x_i^⊤ Σ^{−1} (μ_k − μ_l) − (1/2) (μ_k + μ_l)^⊤ Σ^{−1} (μ_k − μ_l)

Because this boundary is linear in x_i, we get the name LDA.
Sparse LDA

The decision boundary

log [ P(G = k | x_i) / P(G = l | x_i) ] = log(π_k / π_l) + x_i^⊤ Σ^{−1} (μ_k − μ_l) − (1/2) (μ_k + μ_l)^⊤ Σ^{−1} (μ_k − μ_l)

then naturally leads to the decision rule:

G(x_i) = argmax_k { log π_k + x_i^⊤ Σ^{−1} μ_k − (1/2) μ_k^⊤ Σ^{−1} μ_k } .

By assuming Σ is diagonal, i.e., that there is no covariance between the p dimensions, this decision rule can be reduced to the nearest centroids classifier:

G(x_i) = argmin_k { ∑_{j=1}^p (x_{ij} − μ_{jk})² / σ_j² − log π_k } .
Typically, Σ (or σ) is estimated from the data as Σ̂ (or σ̂), and the μ_k are estimated as μ̂_k whilst training the classifier.
Sparse LDA

The nearest centroids classifier

G(x_i) = argmin_k { ∑_{j=1}^p (x_{ij} − μ_{jk})² / σ_j² − log π_k }

will typically use all p variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue.
Define μ_k = x̄ + α_k, where x̄ is the data-mean across all classes, and α_k is the class-specific deviation of the mean from x̄. Then, the nearest shrunken centroids classifier proceeds with the optimisation:

minimise_{α_k ∈ R^p, k ∈ {1,...,K}}  { (1/2n) ∑_{k=1}^K ∑_{i∈C_k} ∑_{j=1}^p (x_{ij} − x̄_j − α_{jk})² / σ_j² + λ ∑_{k=1}^K ∑_{j=1}^p (√n_k / σ_j) |α_{jk}| } ,

where C_k and n_k are the set and number of samples in group k.
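The minimiser of this criterion is available in closed form: each α̂_jk is a soft-thresholded version of the observed class deviation. The Python sketch below implements that idea, simplified relative to the pamr package (e.g., it uses the per-feature overall standard deviation for σ_j, uniform priors, and arbitrary simulated data):

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def shrunken_centroids(X, y, lam):
    # Soft-threshold the class deviations alpha_k = mean_k - overall mean;
    # the per-class threshold follows the penalty weight sqrt(n_k)/sigma_j
    xbar = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-8
    centroids = {}
    for k in np.unique(y):
        Xk = X[y == k]
        alpha = Xk.mean(axis=0) - xbar
        centroids[k] = xbar + soft(alpha, lam * sigma / np.sqrt(len(Xk)))
    return centroids, sigma

def predict(X, centroids, sigma):
    # G(x) = argmin_k sum_j (x_j - mu_jk)^2 / sigma_j^2  (uniform priors,
    # so the -log pi_k term is the same for every class and drops out)
    classes = sorted(centroids)
    scores = np.stack([np.sum((X - centroids[k])**2 / sigma**2, axis=1)
                       for k in classes], axis=1)
    return np.array(classes)[np.argmin(scores, axis=1)]

# Two classes differing in only the first 3 of 30 features
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.standard_normal((100, 30))
X[y == 1, :3] += 3.0

centroids, sigma = shrunken_centroids(X, y, lam=4.0)
pred = predict(X, centroids, sigma)
print("train accuracy:", np.mean(pred == y))
print("features with differing centroids:",
      np.flatnonzero(centroids[0] != centroids[1]))
```

The shrinkage collapses the centroids of the 27 uninformative features onto the overall mean, so only the informative features contribute to the classification.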
Sparse LDA
Hence, the α̂_k estimated from the optimisation

minimise_{α_k ∈ R^p, k ∈ {1,...,K}}  { (1/2n) ∑_{k=1}^K ∑_{i∈C_k} ∑_{j=1}^p (x_{ij} − x̄_j − α_{jk})² / σ_j² + λ ∑_{k=1}^K ∑_{j=1}^p (√n_k / σ_j) |α_{jk}| }

can be used to estimate the shrunken centroids μ̂_k = x̄ + α̂_k, thus training the classifier:

G(x_i) = argmin_k { ∑_{j=1}^p (x_{ij} − μ̂_{jk})² / σ_j² − log π_k } .
Sparse LDA - real data example
I applied nearest (shrunken) centroids to expression data for 14349 genes, for 347 cells of different types: leukocytes (54); lymphoblastic cells (88); fetal brain cells (16wk, 26; 21wk, 24); fibroblasts (37); ductal carcinoma (22); keratinocytes (40); B lymphoblasts (17); iPS cells (24); neural progenitors (15).
Used R packages MASS and pamr [4]. Carried out 100 repetitions of 3-fold CV. Plots show normalised mutual information (NMI), adjusted Rand index (ARI) and prediction accuracy.
[Figure: NMI, ARI and accuracy plotted against the sparsity threshold, showing the 100%, 75%, 50%, 25% and 0% quantiles over the 300 predictions for sparse LDA, with the corresponding quantiles for regular LDA for comparison.]
4 Robert Tibshirani et al. “Class prediction by nearest shrunken centroids, with applications to DNA microarrays”. In: Statistical Science (2003), pp. 104–117
Sparse clustering
Many clustering methods, such as hierarchical clustering, are based on a dissimilarity measure D_{i,i′} = ∑_{j=1}^p d_{i,i′,j} between samples i and i′.
One popular choice of dissimilarity measure is the Euclidean distance.
In high dimensions, it is often unnecessary to use information from all of the p dimensions.
A weighted dissimilarity measure D_{i,i′} = ∑_{j=1}^p w_j d_{i,i′,j} can be a useful approach to this problem. This can be obtained by the sparse matrix decomposition:

maximise_{u ∈ R^{n²}, w ∈ R^p}  u^⊤ Δ w ,   subject to ‖u‖₂ ≤ 1, ‖w‖₂ ≤ 1, ‖w‖₁ ≤ t, and w_j ≥ 0, j ∈ {1, ..., p},

where w is the vector of the weights w_j, j ∈ {1, ..., p}, and Δ ∈ R^{n²×p} contains the dissimilarity components, arranged such that each row of Δ corresponds to the d_{i,i′,j}, j ∈ {1, ..., p} for a pair of samples i, i′.
This weighted dissimilarity measure can then be used for sparse clustering, such as sparse hierarchical clustering.
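As an illustration of how the weighted measure is then used, the Python sketch below builds D_{i,i′} = ∑_j w_j d_{i,i′,j} and feeds it to off-the-shelf hierarchical clustering. The weights here are fixed by hand purely for illustration; in the sparcl package they would come from the penalised matrix decomposition above:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
labels_true = np.repeat([0, 1], 25)
X = rng.standard_normal((50, 20))
X[labels_true == 1, :2] += 4.0        # clusters differ in 2 of 20 features

# per-feature dissimilarity components d_{i,i',j} = (x_ij - x_i'j)^2
d = (X[:, None, :] - X[None, :, :])**2

# hypothetical weights concentrated on the informative features;
# sparcl would instead estimate w via the penalised matrix decomposition
w = np.zeros(20)
w[:2] = 1.0 / np.sqrt(2.0)

D = (d * w).sum(axis=2)               # weighted dissimilarity matrix D_{i,i'}
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust") - 1

# agreement with the true grouping, up to label switching
agree = max(np.mean(labels == labels_true), np.mean(labels != labels_true))
print("clustering agreement:", agree)
```

With the weight concentrated on the two informative features, the 18 noise features do not dilute the dissimilarities, and the two groups separate cleanly.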
Sparse clustering
Some clustering methods, such as K-means, need a slightly modified approach.
K-means seeks to minimise the within-cluster sum of squares

∑_{k=1}^K ∑_{i∈C_k} ‖x_i − x̄_k‖₂² = ∑_{k=1}^K (1/(2 n_k)) ∑_{i,i′∈C_k} ‖x_i − x_i′‖₂² ,

where C_k is the set of samples in cluster k, n_k is the number of samples in cluster k, and x̄_k is the corresponding centroid.
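The identity between the centroid form and the pairwise form follows from expanding the squares (the cross terms cancel); it can be checked numerically for a single cluster:

```python
import numpy as np

rng = np.random.default_rng(0)
Xk = rng.standard_normal((7, 3))      # one cluster of n_k = 7 points
n_k = len(Xk)

# centroid form: sum_i ||x_i - xbar_k||^2
lhs = np.sum((Xk - Xk.mean(axis=0))**2)

# pairwise form: (1/(2 n_k)) * sum over all ordered pairs ||x_i - x_i'||^2
diffs = Xk[:, None, :] - Xk[None, :, :]
rhs = np.sum(diffs**2) / (2 * n_k)

print(lhs, rhs)   # equal
```

The pairwise form is what makes the weighted, per-feature decomposition on the next slides possible, since it removes the centroid from the criterion.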
Hence, a weighted K-means could proceed according to the optimisation:

minimise_{w ∈ R^p}  ∑_{j=1}^p w_j { ∑_{k=1}^K (1/n_k) ∑_{i,i′∈C_k} d_{i,i′,j} } ,

where d_{i,i′,j} = (x_{ij} − x_{i′j})², and n_k is the number of samples in cluster k.
Sparse clustering
However, for the optimisation

minimise_{w ∈ R^p}  ∑_{j=1}^p w_j { ∑_{k=1}^K (1/n_k) ∑_{i,i′∈C_k} d_{i,i′,j} } ,

it is not possible to choose a set of constraints which guarantee a non-pathological solution as well as convexity.
Instead, the between-cluster sum of squares can be maximised:

maximise_{w ∈ R^p}  ∑_{j=1}^p w_j { (1/n) ∑_{i=1}^n ∑_{i′=1}^n d_{i,i′,j} − ∑_{k=1}^K (1/n_k) ∑_{i,i′∈C_k} d_{i,i′,j} } ,

subject to ‖w‖₂ ≤ 1, ‖w‖₁ ≤ t, and w_j ≥ 0, j ∈ {1, ..., p}.
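For fixed clusters, the maximisation over w has a simple form: w is a normalised, soft-thresholded version of the positive part of the per-feature between-cluster sums b_j, with the threshold chosen (e.g., by bisection) so that the ℓ1 constraint holds. A numpy sketch, with simulated data and an arbitrary bound t:

```python
import numpy as np

def between_cluster_sums(X, labels):
    # b_j = (1/n) sum_{i,i'} d_{i,i',j} - sum_k (1/n_k) sum_{i,i' in C_k} d_{i,i',j},
    # with d_{i,i',j} = (x_ij - x_i'j)^2, computed per feature via
    # sum_{i,i'} (z_i - z_i')^2 = 2m * sum z^2 - 2 (sum z)^2
    def pair_sums(Z):
        m = len(Z)
        return 2 * m * (Z**2).sum(axis=0) - 2 * Z.sum(axis=0)**2
    total = pair_sums(X) / len(X)
    within = sum(pair_sums(X[labels == k]) / np.sum(labels == k)
                 for k in np.unique(labels))
    return total - within

def weight_update(b, t, tol=1e-6):
    # maximise w.b  subject to  ||w||_2 <= 1, ||w||_1 <= t, w >= 0
    bp = np.maximum(b, 0.0)
    def w_of(delta):
        s = np.maximum(bp - delta, 0.0)
        nrm = np.linalg.norm(s)
        return s / nrm if nrm > 0 else s
    if np.sum(w_of(0.0)) <= t:            # l1 norm, since w >= 0
        return w_of(0.0)
    lo, hi = 0.0, bp.max()
    while hi - lo > tol:                  # bisection on the threshold delta
        mid = (lo + hi) / 2
        if np.sum(w_of(mid)) > t:
            lo = mid
        else:
            hi = mid
    return w_of(hi)

# Two clusters separated only in the first 2 of 10 features
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
X = rng.standard_normal((100, 10))
X[labels == 1, :2] += 4.0

w = weight_update(between_cluster_sums(X, labels), t=1.5)
print("feature weights:", np.round(w, 2))
```

Sparse K-means (as in sparcl) alternates this w-update with re-clustering the samples under the weighted dissimilarity; here the weights concentrate on the two features that actually separate the clusters.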
Sparse clustering - real data examples
Applied (sparse) hierarchical clustering to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
Used R package sparcl [5] for the sparse clustering. Plots show normalised mutual information (NMI) and adjusted Rand index (ARI), comparing sparse with standard clustering.
[Figure: NMI and ARI plotted against the L1 bound, for sparse hierarchical clustering versus standard hierarchical clustering.]
5 Daniela M Witten and Robert Tibshirani. “A framework for feature selection in clustering”. In: Journal of the American Statistical Association (2012)
Sparse clustering - real data examples
Applied (sparse) k-means to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
Used R package sparcl for the sparse clustering. Plots show normalised mutual information (NMI) and adjusted Rand index (ARI), comparing sparse with standard clustering.
[Figure: NMI and ARI plotted against the L1 bound, for sparse k-means versus standard k-means.]
Sparse clustering - real data examples

Spectral clustering essentially uses k-means clustering (or similar) in a dimensionally-reduced (e.g., PCA) space.
Applied standard k-means in sparse-PCA space to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
This offers computational advantages, running in 9 seconds on a 2.8GHz Macbook, compared with 19 seconds for standard k-means, and 35 seconds for sparse k-means.
[Figure: NMI and ARI plotted against the L1 bound divided by √n, for sparse spectral k-means versus standard k-means.]