HAL Id: cea-01883278
https://hal-cea.archives-ouvertes.fr/cea-01883278
Submitted on 27 Sep 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

To cite this version: Amicie de Pierrefeu, Tommy Lofstedt, Fouad Hadj-Selem, Mathieu Dubois, Renaud Jardri, et al. Structured Sparse Principal Components Analysis With the TV-Elastic Net Penalty. IEEE Transactions on Medical Imaging, Institute of Electrical and Electronics Engineers, 2018, 37 (2), pp. 396-407. 10.1109/tmi.2017.2749140. cea-01883278



0278-0062 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2017.2749140, IEEE Transactions on Medical Imaging.

IEEE TRANSACTIONS ON MEDICAL IMAGING 1

Structured Sparse Principal Components Analysis with the TV-Elastic Net Penalty

Amicie de Pierrefeu, Tommy Lofstedt, Fouad Hadj-Selem, Mathieu Dubois, Renaud Jardri, Thomas Fovet, Philippe Ciuciu, Senior Member, IEEE, Vincent Frouin, and Edouard Duchesnay

Abstract: Principal component analysis (PCA) is an exploratory tool widely used in data analysis to uncover dominant patterns of variability within a population. Despite its ability to represent a data set in a low-dimensional space, PCA's interpretability remains limited. Indeed, the components produced by PCA are often noisy or exhibit no visually meaningful patterns. Furthermore, the fact that the components are usually non-sparse may also impede interpretation, unless arbitrary thresholding is applied. However, in neuroimaging, it is essential to uncover clinically interpretable phenotypic markers that would account for the main variability in the brain images of a population. Recently, some alternatives to the standard PCA approach, such as sparse PCA, have been proposed, their aim being to limit the density of the components. Nonetheless, sparsity alone does not entirely solve the interpretability problem in neuroimaging, since it may yield scattered and unstable components. We hypothesized that the incorporation of prior information regarding the structure of the data may lead to improved relevance and interpretability of brain patterns. We therefore present a simple extension of the popular PCA framework that adds structured sparsity penalties on the loading vectors in order to identify the few stable regions in the brain images that capture most of the variability. Such structured sparsity can be obtained by combining, e.g., $\ell_1$ and total variation (TV) penalties, where the TV regularization encodes information on the underlying structure of the data. This paper presents the structured sparse PCA (denoted SPCA-TV) optimization framework and its resolution. We demonstrate SPCA-TV's effectiveness and versatility on three different data sets. It can be applied to any kind of structured data, such as, e.g., $N$-dimensional array images or meshes of cortical surfaces. The gains of SPCA-TV over unstructured approaches (such as sparse PCA and ElasticNet PCA) or structured approaches (such as GraphNet PCA) are significant, since SPCA-TV reveals the variability within a data set in the form of intelligible brain patterns that are easier to interpret and more stable across different samples.

Keywords: MRI, unsupervised machine learning, PCA, total variation

I. INTRODUCTION

Principal components analysis (PCA) is an unsupervised statistical procedure whose aim is to capture dominant patterns of variability in order to provide an optimal representation of a

A. de Pierrefeu, E. Duchesnay, P. Ciuciu, M. Dubois, and V. Frouin are with NeuroSpin, CEA, Paris-Saclay, Gif-sur-Yvette, France.

F. Hadj-Selem is with the Energy Transition Institute, VeDeCoM, France.

T. Lofstedt is with the Department of Radiation Sciences, Umea University, Umea, Sweden.

R. Jardri and T. Fovet are with Univ. Lille, CNRS UMR 9193, SCALab, CHU Lille, Pole de Psychiatrie (unit CURE), Lille, France.

data set in a lower-dimensional space defined by the principal components (PCs). Given a data set $X \in \mathbb{R}^{N \times P}$ of $N$ samples and $P$ centered variables, PCA aims to find the most accurate rank-$K$ approximation of the data:

$$\min_{U, D, V} \; \left\| X - U D V^\top \right\|_F^2 \qquad (1)$$

$$\text{s.t.} \quad U^\top U = I, \quad V^\top V = I, \quad d_1 \ge \cdots \ge d_K > 0,$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix, $V = [v_1, \cdots, v_K] \in \mathbb{R}^{P \times K}$ holds the $K$ loading vectors (right singular vectors) that define the new coordinate system in which the original features are uncorrelated, $D$ is the diagonal matrix of the $K$ singular values, and $U = [u_1, \cdots, u_K] \in \mathbb{R}^{N \times K}$ holds the $K$ projections of the original samples in the new coordinate system (called principal components (PCs), or left singular vectors). Using $K = \operatorname{rank}(X)$ components leads to the singular value decomposition (SVD). A vast majority of neuroimaging problems involve high-dimensional feature spaces ($\approx 10^5$ features, i.e., voxels or mesh nodes over the cortical surface) with a relatively limited sample size ($\approx 10^2$ participants). With such "large $P$, small $N$" problems, the SVD formulation based on the data matrix is much more efficient than an eigenvalue decomposition of the large $P \times P$ covariance matrix.
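As a concrete illustration of this efficiency argument, the following sketch (assuming NumPy; the sizes and data are arbitrary toy values, far smaller than real imaging data) computes the rank-$K$ loadings both from the SVD of $X$ and from an eigendecomposition of the $P \times P$ scatter matrix $X^\top X$, and checks that both routes yield the same principal directions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 20, 500, 3            # "large P, small N" toy setting
X = rng.standard_normal((N, P))
X -= X.mean(axis=0)             # center the P variables

# Rank-K PCA via the SVD of X itself: no P x P matrix is ever formed.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V_svd = Vt[:K].T                # loading vectors, shape (P, K)

# Same loadings via the P x P scatter matrix X^T X: O(P^3) eigensolve.
evals, evecs = np.linalg.eigh(X.T @ X)
V_eig = evecs[:, ::-1][:, :K]   # top-K eigenvectors (eigh sorts ascending)

# Both routes span the same principal directions, up to sign flips.
for k in range(K):
    err = min(np.linalg.norm(V_svd[:, k] - V_eig[:, k]),
              np.linalg.norm(V_svd[:, k] + V_eig[:, k]))
    assert err < 1e-6
```

With realistic sizes ($P \approx 10^5$), only the SVD route remains tractable, which is why the data-matrix formulation is used throughout.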

In a neuroimaging context, our goal is to discover the phenotypic markers accounting for the main variability in a population's brain images. For example, when considering structural images of patients that will convert to Alzheimer's disease (AD), we are interested in revealing the brain patterns of atrophy explaining the variability in this population. This provides indications of possible stratifications of the cohort into homogeneous sub-groups that may be clinically similar, but with different patterns of atrophy. This could suggest different sub-types of patients with AD, or some other etiologies, such as dementia with Lewy bodies. Clustering methods might be natural approaches to address such situations; however, they cannot reveal subtle differences that go beyond a global and trivial pattern of atrophy. Such patterns are usually captured by the first component of PCA which, after being removed, offers the possibility to identify spatial patterns on the subsequent components.

However, PCA provides dense loading vectors (patterns) that cannot be used to identify brain markers without arbitrary thresholding.

Recently, several alternatives have been proposed to add sparsity to this matrix factorization problem [33], [36], [43]. The sparse dictionary learning framework proposed by [36] provides a sparse coding (rows of U) of the samples through a sparse

Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.


linear combination of dense basis elements (columns of V). However, the identification of biomarkers requires a sparse dictionary (columns of V). This is precisely the objective of sparse PCA (SPCA), proposed in [30], [51], [14], [49], [31], which adds a sparsity-inducing penalty on the columns of V. Imposing such sparsity constraints on the loading coefficients is a procedure that has been used in fMRI to produce sparse representations of brain functional networks [20], [45].

However, sparse PCA is limited by the fact that it ignores the inherent spatial correlation in the data, which leads to scattered patterns that are difficult to interpret. Furthermore, constraining only the number of features included in the PCs might not always be fully relevant, since most data sets are expected to have a spatial structure. For instance, MRI data are naturally encoded on a grid: some voxels are neighbors, while others are not.

We hypothesize that brain patterns are organized into distributed regions across the brain ([11], [21], [41]). Recent studies have tried to overcome this limitation by encoding prior information concerning the spatial structure of the data (see [29], [24], [48]). However, they used methods that are difficult to plug into the optimization scheme (e.g., spline smoothing, wavelet smoothing) and incorporated prior information that may sometimes be difficult to define. One simple solution is the use of a GraphNet penalty ([23], [32], [40], [18], [38]). It promotes local smoothness of the weight map by simply forcing adjacent voxels to have similar weights, using an $\ell_2$ penalty on the gradient of the weight map. Nonetheless, we hypothesize that GraphNet provides smooth solutions rather than clearly identified regions. In data classification problems, when extracting structured and sparse predictive maps, the goals are largely aligned with those of PCA. Some classification studies have obtained stable and interpretable results by adding a total variation (TV) penalty to the sparsity constraint (see [19]). TV is widely used as a tool in image denoising and restoration. It accounts for the spatial structure of images by encoding piecewise smoothness and enabling the recovery of homogeneous regions separated by sharp boundaries.

For simplicity, rather than solving Eq. (1), we solve a slightly different criterion, which results from using the Lagrange form rather than the bound form of the constraints on V. Then, we extend the Lagrangian form by adding penalties ($\ell_1$, $\ell_2$, and TV) to the minimization problem:

$$\min_{U, D, V} \; \frac{1}{N}\left\| X - U D V^\top \right\|_F^2 + \sum_{k=1}^{K}\left[ \lambda_2 \|v_k\|_2^2 + \lambda_1 \|v_k\|_1 + \lambda \sum_{g \in G} \|A_g v_k\|_2 \right] \qquad (2)$$

$$\text{s.t.} \quad \|u_k\|_2^2 = 1, \quad \forall k = 1, \cdots, K,$$

where $\lambda_1$, $\lambda_2$, and $\lambda$ are hyper-parameters controlling the relative strength of each penalty. We further propose a generic optimization framework that can combine any differentiable convex (penalized) loss function with (i) penalties whose proximal operator is known (here $\|\cdot\|_1$) and (ii) a large range of complex non-smooth convex structured penalties that can be formulated as a $\|\cdot\|_{2,1}$-norm defined over a set of groups $G$. Such group penalties cover, e.g., total variation and overlapping group lasso.

This new problem aims at finding a linear combination of the original variables that points in directions explaining as much variance as possible in the data, while enforcing sparsity and structure (piecewise smoothness for TV) of the loadings.

To achieve this, it is necessary to sacrifice some of the explained variance, as well as the orthogonality of both the loadings and the principal components. Most existing SPCA algorithms [51], [14], [49], [31] do not impose orthogonal loading directions either. While we forced the components to have unit norm for visualization purposes, we do not, in this formulation, enforce $\|v_k\|_2 = 1$. Instead, the value of $\|v\|_2$ is controlled by the hyper-parameter $\lambda_2$. This penalty on the loadings, together with the unit-norm constraint on the components, prevents us from obtaining trivial solutions. The optional $1/N$ factor conveniently normalizes the loss to account for the number of samples, in order to simplify the setting of the hyper-parameters $\lambda_1$, $\lambda_2$, $\lambda$.

This paper presents an extension of the popular PCA framework that adds structured sparsity-inducing penalties on the loading vectors in order to identify the few stable regions in the brain images accounting for most of the variability. The addition of a prior that reflects the data's structure within the learning process gives the paper a scope that goes beyond sparse PCA. To our knowledge, very few papers ([1], [24], [29], [48]) have addressed the use of structural constraints in PCA. The study [29] proposes a norm that induces structured sparsity (called SSPCA) by restraining the support of the solution to be sparse within a certain set of groups of variables. Possible supports include sets of variables forming rectangles when arranged on a grid. Only one study has recently used the total variation prior [1], in a context of multi-subject dictionary learning, based on a different optimization scheme [5].

Section II presents our main contribution: a simple optimization algorithm that combines well-known methods (a deflation scheme and alternating minimization) with an original continuation algorithm based on Nesterov's smoothing technique. Our proposed algorithm has the ability to include the TV penalty, but many other non-smooth penalties, such as, e.g., overlapping group lasso, could also be used. This versatile mathematical framework is an essential feature in neuroimaging. Indeed, it enables a straightforward application to all kinds of data with known structure, such as $N$-dimensional images (of voxels) or meshes of (cortical) surfaces. Section III demonstrates the relevance of structured sparsity on both simulated and experimental data, for structural and functional MRI (fMRI) acquisitions. SPCA-TV achieved a higher reconstruction accuracy and more stable solutions than ElasticNet PCA, sparse PCA, GraphNet PCA, and SSPCA (from [29]). More importantly, SPCA-TV yields more interpretable loading vectors than the other methods.

II. METHOD

A common approach to solving the PCA problem (see [14], [31], [49]) is to compute a rank-1 approximation of the data matrix, and then to repeat this on the deflated matrix [34], where the influence of the PCs is successively extracted and


discarded. We first detail the notation for estimating a single component (Section II-A) and its solution using an alternating minimization pipeline (Section II-B). Then, we develop the TV regularization framework (Sections II-C and II-D). Last, we discuss the algorithm used to solve the minimization problem and its ability to converge toward stable pairs of component/loading vectors (Sections II-E and II-F).

A. Single component computation

Given a pair of loading/component vectors $u \in \mathbb{R}^N$, $v \in \mathbb{R}^P$, the best rank-1 approximation of the problem given in Eq. (2) is equivalent [49] to

$$\min_{u, v} \; f \equiv \underbrace{-\frac{1}{N} u^\top X v + \lambda_2 \|v\|_2^2}_{l(v)\ \text{(smooth)}} + \underbrace{\lambda_1 \|v\|_1}_{h(v)\ \text{(non-smooth)}} + \underbrace{\lambda \sum_{g \in G} \|A_g v\|_2}_{s(v)\ \text{(non-smooth)}} \qquad (3)$$

$$\text{s.t.} \quad \|u\|_2^2 \le 1,$$

where $l(v)$ is the penalized smooth (i.e., differentiable) loss, $h(v)$ is a sparsity-inducing penalty whose proximal operator is known, and $s(v)$ is a complex penalty on the structure of the input variables, with an unknown proximal operator.

This problem is convex in $u$ and in $v$, but not in $(u, v)$.

B. Alternating minimization of the bi-convex problem

The objective function to minimize is bi-convex [9]. The most common approach to solve a bi-convex optimization problem (which does not guarantee global optimality of the solution) is to alternately update $u$ and $v$, fixing one of them at a time and solving the corresponding convex optimization problem for the other parameter vector.

On the one hand, when $v$ is fixed, the problem to solve is

$$\min_{u \in \mathbb{R}^N} \; -\frac{1}{N} u^\top X v \quad \text{s.t.} \quad \|u\|_2^2 \le 1, \qquad (4)$$

with the associated explicit solution

$$u^*(v) = \frac{X v}{\|X v\|_2}. \qquad (5)$$

On the other hand, solving the equation with respect to $v$ with a fixed $u$ presents a higher level of difficulty, which will be discussed in Section II-E.
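To make the alternating scheme concrete, the following toy sketch (NumPy, synthetic data; not the paper's full solver) sets all penalties to zero, in which case the $u$-step is exactly Eq. (5) and the unpenalized $v$-step has the symmetric closed form $v^*(u) = X^\top u / \|X^\top u\|_2$; the alternation then reduces to power iteration and converges to the leading singular pair:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 30, 80
X = rng.standard_normal((N, P))

v = rng.standard_normal(P)
v /= np.linalg.norm(v)
for _ in range(500):
    u = X @ v
    u /= np.linalg.norm(u)      # closed-form u-step of Eq. (5)
    v = X.T @ u
    v /= np.linalg.norm(v)      # unpenalized v-step (CONESTA's role in SPCA-TV)

# The fixed point is the leading singular pair of X.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
assert min(np.linalg.norm(v - Vt[0]), np.linalg.norm(v + Vt[0])) < 1e-8
```

With the penalties switched on, only the $v$-step changes: it becomes the penalized convex problem solved by CONESTA below.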

C. Reformulating TV as a linear operator

Before discussing the minimization with respect to $v$, we provide details on the encoding of the spatial structure within the $s(v)$ penalty.

It is essential to note that the algorithm is independent of the spatial structure of the data. All the structural information is encoded in a linear operator $A$ that is computed outside of the algorithm. Thus, the algorithm has the ability to address various structured data and, most importantly, other penalties than just the TV penalty. The algorithm requires the setting of two parameters: (i) the linear operator $A$, and (ii) a projection function, detailed in Eq. (12).

This section presents the formulation and the design of $A$ in the specific case of a TV penalty applied to the loading vector $v$, measured on a 3-dimensional (3D) image or a 2D mesh of the cortical surface.

1) 3D image: The brain mask is used to establish a mapping $g(i, j, k)$ between the coordinates $(i, j, k)$ in the 3D grid and an index $g \in [\![1, P]\!]$ in the collapsed image. We extract the spatial neighborhood of $g$, of size $\le 4$, corresponding to voxel $g$ and its 3 neighboring voxels within the mask in the $i$, $j$, and $k$ directions. By definition, we have

$$\mathrm{TV}(v) \equiv \sum_{g=1}^{P} \left\| \nabla\left( v_{g(i,j,k)} \right) \right\|_2. \qquad (6)$$

The first-order approximation of the spatial gradient $\nabla(v_{g(i,j,k)})$ is computed by applying the linear operator $A'_g \in \mathbb{R}^{3 \times 4}$ to the loading vector $v_g$ in the spatial neighborhood of $g$, i.e.,

$$\nabla\left( v_{g(i,j,k)} \right) = \underbrace{\begin{bmatrix} -1 & 1 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{bmatrix}}_{A'_g} \underbrace{\begin{bmatrix} v_{g(i,j,k)} \\ v_{g(i+1,j,k)} \\ v_{g(i,j+1,k)} \\ v_{g(i,j,k+1)} \end{bmatrix}}_{v_g}, \qquad (7)$$

where $v_{g(i,j,k)}$ is the loading coefficient at index $g$ in the collapsed image, corresponding to voxel $(i, j, k)$ in the 3D image. Then, $A'_g$ is extended with zeros to a large but very sparse matrix $A_g \in \mathbb{R}^{3 \times P}$, in order to be directly applied to the full vector $v$. If some neighbors lie outside the mask, the corresponding rows in $A_g$ are removed. Noticing that, for TV, there is one group per voxel in the mask ($G = [\![1, P]\!]$), we can reformulate TV from Eq. (6) using a general expression:

$$\mathrm{TV}(v) = \sum_{g \in G} \|A_g v\|_2. \qquad (8)$$

Finally, with a vertical concatenation of all the $A_g$ matrices, we obtain the full linear operator $A \in \mathbb{R}^{3P \times P}$ that will be used in Section II-E.
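A sketch of how such an operator could be assembled in practice (using scipy.sparse; the function name is ours, and the row layout groups rows per gradient direction rather than per voxel, which is equivalent up to a row permutation and yields the same TV value):

```python
import numpy as np
import scipy.sparse as sp

def tv_linear_operator(mask):
    """Build A in R^{3P x P}: forward differences in i, j, k inside `mask`.

    Rows whose neighbor falls outside the mask are left at zero, which
    matches removing them from A_g (zero contribution to ||A_g v||_2)."""
    P = int(mask.sum())
    idx = -np.ones(mask.shape, dtype=int)
    idx[mask] = np.arange(P)          # mapping g(i, j, k) -> column index
    rows, cols, vals = [], [], []
    for axis in range(3):             # one gradient component per direction
        for g, (i, j, k) in enumerate(zip(*np.nonzero(mask))):
            nb = [i, j, k]
            nb[axis] += 1             # forward neighbor along this axis
            if nb[axis] < mask.shape[axis] and mask[tuple(nb)]:
                r = axis * P + g      # row inside the stacked operator
                rows += [r, r]
                cols += [g, idx[tuple(nb)]]
                vals += [-1.0, 1.0]
    return sp.csr_matrix((vals, (rows, cols)), shape=(3 * P, P))

mask = np.ones((2, 2, 2), dtype=bool)
A = tv_linear_operator(mask)
v = np.arange(8, dtype=float)         # loading vector on the 8 voxels
grad = (A @ v).reshape(3, -1)         # the 3 gradient components per voxel
tv = np.sum(np.linalg.norm(grad, axis=0))   # TV(v) = sum_g ||A_g v||_2
```

Because $A$ is built once, outside the solver, swapping the image grid for a cortical mesh only changes this construction step.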

2) Mesh of cortical surface: The linear operator $A'_g$ used to compute a first-order approximation of the spatial gradient can be obtained by examining the neighboring vertices of each vertex $g$. With common triangle-tessellated surfaces, the neighborhood size is $\le 7$ (including $g$). In this setting, we have $A'_g \in \mathbb{R}^{3 \times 7}$, which can be extended and concatenated to obtain the full linear operator $A$.

D. Nesterov's smoothing of the structured penalty

We consider the convex non-smooth minimization of Eq. (3) with respect to $v$, where $u$ is thus fixed. This problem includes a general structured penalty $s(\cdot)$ that covers the specific case of TV. A widely used approach when dealing with non-smooth problems is to use methods based on the proximal operator of the penalties. For the $\ell_1$ penalty alone, the proximal operator is analytically known, and efficient iterative algorithms such as ISTA and FISTA are available (see [4]). However, since the proximal operator of the TV+$\ell_1$ penalty is not closed-form, a standard implementation of those algorithms is not suitable. In order to overcome this barrier, we used Nesterov's smoothing technique [39]. It consists of approximating the non-smooth penalties for which the proximal operator is unknown (e.g., TV) with a smooth function (of which the gradient is known). Non-smooth penalties with known proximal operators (e.g., $\ell_1$) are not affected. Hence, as described in [50], it allows the use of an exact accelerated proximal gradient algorithm. Thus, we can solve the PCA problem penalized by TV and elastic net, where an exact $\ell_1$ penalty is used.

Using the dual norm of the $\ell_2$-norm (which happens to be the $\ell_2$-norm too), Eq. (8) can be reformulated as

$$s(v) = \sum_{g \in G} \|A_g v\|_2 = \sum_{g \in G} \max_{\|\alpha_g\|_2 \le 1} \alpha_g^\top A_g v, \qquad (9)$$

where $\alpha_g \in K_g = \{\alpha_g \in \mathbb{R}^3 : \|\alpha_g\|_2 \le 1\}$ is a vector of auxiliary variables in the $\ell_2$ unit ball, associated with $A_g v$. As with $A \in \mathbb{R}^{3P \times P}$, which is the vertical concatenation of all the $A_g$, we concatenate all the $\alpha_g$ to form $\alpha = [\alpha_1^\top, \cdots, \alpha_P^\top]^\top \in K = \{[\alpha_1^\top, \cdots, \alpha_P^\top]^\top : \alpha_g \in K_g\} \subset \mathbb{R}^{3P}$. $K$ is the Cartesian product of 3D unit balls in Euclidean space and, therefore, a compact convex set. Eq. (9) can further be written as

$$s(v) = \max_{\alpha \in K} \alpha^\top A v. \qquad (10)$$

Given this formulation of $s(v)$, we can apply Nesterov's smoothing. For a given smoothing parameter $\mu > 0$, the function $s(v)$ is approximated by the smooth function

$$s_\mu(v) = \max_{\alpha \in K} \left\{ \alpha^\top A v - \frac{\mu}{2} \|\alpha\|_2^2 \right\}, \qquad (11)$$

for which $\lim_{\mu \to 0} s_\mu(v) = s(v)$. Nesterov [39] demonstrates this convergence using the inequality in Eq. (15). The value of $\alpha_\mu^*(v) = [\alpha_{\mu,1}^{*\top}, \cdots, \alpha_{\mu,g}^{*\top}, \cdots, \alpha_{\mu,P}^{*\top}]^\top$ that maximizes Eq. (11) is the concatenation of the projections of the vectors $A_g v \in \mathbb{R}^3$ onto the $\ell_2$ ball $K_g$, i.e., $\alpha_{\mu,g}^*(v) = \mathrm{proj}_{K_g}\!\left( \frac{A_g v}{\mu} \right)$, where

$$\mathrm{proj}_{K_g}(x) = \begin{cases} x & \text{if } \|x\|_2 \le 1, \\ \dfrac{x}{\|x\|_2} & \text{otherwise.} \end{cases} \qquad (12)$$

The function $s_\mu$, i.e., the Nesterov smooth transform of $s$, is convex and differentiable. Its gradient, given by [39],

$$\nabla s_\mu(v) = A^\top \alpha_\mu^*(v), \qquad (13)$$

is Lipschitz-continuous with constant

$$L\big(\nabla s_\mu\big) = \frac{\|A\|_2^2}{\mu}, \qquad (14)$$

where $\|A\|_2$ is the matrix spectral norm of $A$. Moreover, Nesterov [39] provides the following inequality relating $s_\mu$ and $s$:

$$s_\mu(v) \le s(v) \le s_\mu(v) + \mu M, \quad \forall v \in \mathbb{R}^P, \qquad (15)$$

where $M = \max_{\alpha \in K} \frac{\|\alpha\|_2^2}{2} = \frac{P}{2}$.
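The following sketch (NumPy; function and variable names are ours) evaluates $s_\mu(v)$ and its gradient via the per-group projections of Eq. (12), and checks the sandwich bound of Eq. (15); note that $M$ equals half the number of groups, which is $P/2$ in the paper since there is one group per voxel:

```python
import numpy as np

def smoothed_group_l2(v, A, mu, group_size=3):
    """Nesterov-smoothed s(v) = sum_g ||A_g v||_2 and its gradient.

    A stacks the A_g vertically, so each run of `group_size` consecutive
    rows of A @ v forms one group (Eqs. (11)-(13))."""
    Av = (A @ v).reshape(-1, group_size)       # one row per group g
    z = Av / mu
    norms = np.linalg.norm(z, axis=1)
    scale = np.where(norms > 1.0, 1.0 / np.maximum(norms, 1e-300), 1.0)
    alpha = z * scale[:, None]                 # alpha*_g = proj_Kg(A_g v / mu)
    s_mu = np.sum(alpha * Av) - 0.5 * mu * np.sum(alpha ** 2)   # Eq. (11)
    grad = A.T @ alpha.ravel()                                  # Eq. (13)
    return s_mu, grad

rng = np.random.default_rng(2)
A = rng.standard_normal((9, 5))   # 3 groups of size 3, P = 5 (toy sizes)
v = rng.standard_normal(5)
mu = 1e-3
s_exact = np.sum(np.linalg.norm((A @ v).reshape(-1, 3), axis=1))
s_mu, grad = smoothed_group_l2(v, A, mu)

# Eq. (15): s_mu(v) <= s(v) <= s_mu(v) + mu * M, with M = (#groups) / 2.
M = (A.shape[0] // 3) / 2.0
assert s_mu <= s_exact <= s_mu + mu * M + 1e-12
```

The gradient returned here is exactly the quantity $A^\top \alpha_\mu^*(v)$ that enters Eq. (18) below.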

Thus, a new (smoothed) optimization problem, closely related to Eq. (3) (with fixed $u$), arises from this regularization as

$$\min_{v} \; \underbrace{-\frac{1}{n} u^\top X v + \lambda_2 \|v\|_2^2}_{l(v)\ \text{(smooth)}} + \lambda \underbrace{\left\{ \alpha_\mu^*(v)^\top A v - \frac{\mu}{2} \|\alpha_\mu^*\|_2^2 \right\}}_{s_\mu(v)\ \text{(smooth)}} + \lambda_1 \underbrace{\|v\|_1}_{h(v)\ \text{(non-smooth)}}. \qquad (16)$$

Since we are now able to explicitly compute the gradient of the smooth part, $\nabla(l + \lambda s_\mu)$ (Eq. (18)), its Lipschitz constant (Eq. (19)), and also the proximal operator of the non-smooth part, we have all the ingredients necessary to solve this minimization problem using accelerated proximal gradient methods [4]. Given a starting point $v^0$ and a smoothing parameter $\mu$, FISTA (Algorithm 1) minimizes the smoothed problem and reaches a prescribed precision $\varepsilon_\mu$.

However, in order to control the convergence of the algorithm (presented in Section II-E1), we introduce the Fenchel dual function and the corresponding duality gap of the objective function. Fenchel duality requires the loss to be strongly convex, which is why we further reformulate Eq. (16) slightly: all penalty terms are divided by $\lambda_2$ and, using the following equivalent formulation for the loss, we obtain the minimization problem

$$\min_{v} \; f_\mu \equiv \frac{1}{2}\left\| v - \frac{X^\top u}{n \lambda_2} \right\|_2^2 + \frac{1}{2}\|v\|_2^2 + \frac{\lambda}{\lambda_2}\, s_\mu(v) + \frac{\lambda_1}{\lambda_2} \|v\|_1, \qquad (17)$$

with $s_\mu(v) = \alpha_\mu^*(v)^\top A v - \frac{\mu}{2}\|\alpha_\mu^*\|_2^2$, and where we denote the squared loss $L(v) = \frac{1}{2}\| v - X^\top u / (n \lambda_2) \|_2^2$, the smooth loss $l(v) = L(v) + \frac{1}{2}\|v\|_2^2$, the $\ell_1$ part $h(v) = \|v\|_1$, and the full penalty $\psi_\mu(v) = \frac{1}{2}\|v\|_2^2 + \frac{\lambda}{\lambda_2} s_\mu(v) + \frac{\lambda_1}{\lambda_2} h(v)$.

This new formulation of the smoothed objective function (denoted $f_\mu$) preserves the decomposition of $f_\mu$ into a sum of a smooth term, $l + \frac{\lambda}{\lambda_2} s_\mu$, and a non-smooth term, $h$. Such a decomposition is required for the application of FISTA with Nesterov's smoothing. Moreover, this formulation provides a decomposition of $f_\mu$ into a sum of a smooth loss $L$ and a penalty term $\psi_\mu$, required for the calculation of the gap presented in Section II-E1.

We provide all the required quantities to minimize Eq. (17) using Algorithm 1. Using Eq. (13), we compute the gradient of the smooth part as

$$\nabla\left( l + \frac{\lambda}{\lambda_2} s_\mu \right) = \nabla l + \frac{\lambda}{\lambda_2} \nabla s_\mu = \left( 2v - \frac{X^\top u}{n \lambda_2} \right) + \frac{\lambda}{\lambda_2} A^\top \alpha_\mu^*(v^k), \qquad (18)$$

and its Lipschitz constant (using Eq. (14)):

$$L\left( \nabla\left( l + \frac{\lambda}{\lambda_2} s_\mu \right) \right) = 2 + \frac{\lambda}{\lambda_2} \frac{\|A\|_2^2}{\mu}. \qquad (19)$$


Algorithm 1 FISTA$\big(X^\top u,\ v^0,\ \varepsilon_\mu,\ \mu,\ A,\ \lambda,\ L(\nabla(g))\big)$

1: $v^1 = v^0$; $k = 2$
2: Compute the gradient of the smooth part, $\nabla(g + \lambda s_\mu)$ (Eq. (18)), and its Lipschitz constant, $L_\mu$ (Eq. (19))
3: Compute the step size $t_\mu = L_\mu^{-1}$
4: repeat
5: $\quad z = v^{k-1} + \frac{k-2}{k+1}\left( v^{k-1} - v^{k-2} \right)$
6: $\quad v^k = \mathrm{prox}_h\left( z - t_\mu \nabla(g + \lambda s_\mu)(z) \right)$
7: until $\mathrm{GAP}_\mu(v^k) \le \varepsilon_\mu$
8: return $v^k$
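A minimal, self-contained sketch of Algorithm 1 (Python; the duality-gap stopping test of line 7 is replaced here by a fixed iteration budget, and the TV term is dropped so that the smooth part is a simple quadratic, standing in for $l + \frac{\lambda}{\lambda_2} s_\mu$):

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (the prox_h step of Algorithm 1)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista(grad, prox, lipschitz, v0, n_iter=500):
    """FISTA (Algorithm 1) with a fixed iteration budget for this sketch."""
    t = 1.0 / lipschitz                                # step size t = L^{-1}
    v_prev = v0.copy()
    v = v0.copy()
    for k in range(2, n_iter + 2):
        z = v + (k - 2.0) / (k + 1.0) * (v - v_prev)   # line 5: momentum
        v_prev, v = v, prox(z - t * grad(z), t)        # line 6: prox step
    return v

# Toy smooth-plus-l1 instance: min_v 1/2 v^T Q v - b^T v + l1 ||v||_1.
rng = np.random.default_rng(3)
q = np.linspace(1.0, 3.0, 10)          # diagonal of Q, so Lipschitz L = 3
b = rng.standard_normal(10)
l1 = 0.2
v_hat = fista(lambda v: q * v - b,
              lambda x, t: soft_threshold(x, l1 * t),
              lipschitz=3.0, v0=np.zeros(10))

# Check the optimality conditions: |Qv - b| <= l1 where v = 0,
# and Qv - b = -l1 * sign(v) elsewhere.
r = q * v_hat - b
assert np.all(np.abs(r[v_hat == 0.0]) <= l1 + 1e-8)
assert np.allclose(r[v_hat != 0.0], -l1 * np.sign(v_hat[v_hat != 0.0]), atol=1e-6)
```

In the full algorithm, `grad` is Eq. (18), `lipschitz` is Eq. (19), and the loop stops on the duality gap rather than an iteration count.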

E. Minimization of the loading vectors with CONESTA

The step size $t_\mu$, computed in Line 3 of Algorithm 1, depends on the smoothing parameter $\mu$ (see Eq. (19)). Hence, there is a trade-off between speed and precision. Indeed, high precision, with a small $\mu$, will lead to slow convergence (small $t_\mu$). Conversely, poor precision (large $\mu$) will lead to rapid convergence (large $t_\mu$). Thus, we propose a continuation approach (Algorithm 2), which decreases the smoothing parameter with respect to the distance to the minimum. On the one hand, when we are far from $v^*$ (the minimum of Eq. (17)), we can use a large $\mu$ to rapidly decrease the objective function. On the other hand, when we are close to $v^*$, we need a small $\mu$ in order to obtain an accurate approximation of the original objective function.

1) Duality gap: The distance to the unknown $f(v^*)$ is estimated using the duality gap. Duality formulations are often used to control the achieved precision level when minimizing convex functions. They provide an estimation of the error $f(v^k) - f(v^*)$, for any $v^k$, when the minimum is unknown. The duality gap is the cornerstone of the CONESTA algorithm. Indeed, it is used three times:

(i) As the stopping criterion in the inner FISTA loop (Line 7 of Algorithm 1). FISTA will stop as soon as the current precision is achieved using the current smoothing parameter $\mu$. This prevents unnecessary convergence toward the approximated (smoothed) objective function.

(ii) In the $i$-th CONESTA iteration, as a way to estimate the current error $f(v^i) - f(v^*)$ (Line 7 of Algorithm 2). The error is estimated using the gap of the smoothed problem, $\mathrm{GAP}_{\mu = \mu^i}(v^{i+1})$, which avoids unnecessary computation, since it has already been computed during the last iteration of FISTA. The inequality in Eq. (15) is used to obtain the gap $\varepsilon^i$ to the original non-smoothed problem. The next desired precision, $\varepsilon^{i+1}$, and the smoothing parameter, $\mu^{i+1}$, are derived from this value.

(iii) Finally, as the global stopping criterion within CONESTA (Line 10 of Algorithm 2). This guarantees that the obtained approximation of the minimum, $v^i$, at convergence, satisfies $f(v^i) - f(v^*) < \varepsilon$.

Based on Eq. (17), which decomposes the smoothed objective function into a sum of a strongly convex loss and a penalty,

$$f_\mu(v) = L(v) + \psi_\mu(v),$$

we compute the duality gap that provides an upper-bound estimation of the error to the optimum. At any step $k$ of the algorithm, given the current primal $v^k$ and dual $\sigma(v^k) \equiv \nabla L(v^k)$ variables [8], we can compute the duality gap using the Fenchel duality rules [35]:

$$\mathrm{GAP}(v^k) \equiv f_\mu(v^k) + L^*\big( \sigma(v^k) \big) + \psi_\mu^*\big( -\sigma(v^k) \big), \qquad (20)$$

where $L^*$ and $\psi_\mu^*$ are, respectively, the Fenchel conjugates of $L$ and $\psi_\mu$. Denoting by $v^*$ the minimum of $f_\mu$ (the solution of Eq. (17)), the interest of the duality gap is that it provides an upper bound for the difference with the optimal value of the function. Moreover, it vanishes at the minimum:

$$\mathrm{GAP}(v^k) \ge f(v^k) - f(v^*) \ge 0, \qquad \mathrm{GAP}(v^*) = 0. \qquad (21)$$

The dual variable is

$$\sigma(v^k) \equiv \nabla L(v^k) = v^k - \frac{X^\top u}{n \lambda_2}, \qquad (22)$$

and the Fenchel conjugate of the squared loss $L(v^k)$ is

$$L^*\big( \sigma(v^k) \big) = \frac{1}{2} \|\sigma(v^k)\|_2^2 + \sigma(v^k)^\top \frac{X^\top u}{n \lambda_2}. \qquad (23)$$

In [25], the authors provide the expression of the Fenchel conjugate of the penalty $\psi_\mu(v^k)$:

$$\psi_\mu^*\big( -\sigma(v^k) \big) = \frac{1}{2} \sum_{j=1}^{P} \left( \left[ \left| -\sigma(v^k)_j - \frac{\lambda}{\lambda_2} \big( A^\top \alpha_\mu^*(v^k) \big)_j \right| - \frac{\lambda_1}{\lambda_2} \right]_+^2 \right) + \frac{\lambda \mu}{2 \lambda_2} \left\| \alpha_\mu^*(v^k) \right\|_2^2, \qquad (24)$$

where $[\cdot]_+ = \max(0, \cdot)$.

The expression of the duality gap in Eq. (20) provides an estimation of the distance to the minimum. This distance is geometrically decreased by a factor $\tau = 0.5$ at the end of each continuation, and the decreased value defines the precision that should be reached by the next iteration (Line 8 of Algorithm 2). Thus, the algorithm dynamically generates a sequence of decreasing prescribed precisions $\varepsilon^i$. Such a scheme ensures convergence [25] towards a globally desired final precision, $\varepsilon$, which is the only parameter that the user needs to provide.
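In the TV-free special case ($\lambda = 0$), Eqs. (20)-(24) reduce to quantities that are simple to code and verify. The following sketch (NumPy; names are ours, with `c` standing for $X^\top u / (n \lambda_2)$) checks that the gap is non-negative everywhere and vanishes at the analytic minimizer of this instance:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def duality_gap(v, c, l1_ratio):
    """Duality gap of Eq. (20) for the TV-free case (lambda = 0), where
    f(v) = 1/2 ||v - c||^2 + 1/2 ||v||^2 + l1_ratio * ||v||_1."""
    f = (0.5 * np.sum((v - c) ** 2) + 0.5 * np.sum(v ** 2)
         + l1_ratio * np.sum(np.abs(v)))
    sigma = v - c                                    # dual variable, Eq. (22)
    L_star = 0.5 * np.sum(sigma ** 2) + sigma @ c    # Fenchel conj., Eq. (23)
    psi_star = 0.5 * np.sum(np.maximum(np.abs(-sigma) - l1_ratio, 0.0) ** 2)
    return f + L_star + psi_star                     # Eq. (20) with lambda = 0

rng = np.random.default_rng(4)
c = rng.standard_normal(20)
l1_ratio = 0.3

# The gap is non-negative everywhere and vanishes at the minimizer,
# which for this instance is soft_threshold(c, l1_ratio) / 2.
v_rand = rng.standard_normal(20)
v_star = soft_threshold(c, l1_ratio) / 2.0
assert duality_gap(v_rand, c, l1_ratio) >= 0.0
assert abs(duality_gap(v_star, c, l1_ratio)) < 1e-12
```

With $\lambda > 0$, the only changes are the extra $A^\top \alpha_\mu^*$ shift inside the bracket of Eq. (24) and its $\frac{\lambda \mu}{2 \lambda_2} \|\alpha_\mu^*\|_2^2$ correction term.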

2) Determining the optimal smoothing parameter: Given the current prescribed precision $\varepsilon^i$, we need to compute an optimal smoothing parameter $\mu_{opt}(\varepsilon^i)$ (Line 9 of Algorithm 2) that minimizes the number of FISTA iterations needed to achieve such precision when minimizing Eq. (3) (with fixed $u$) via Eq. (17) (i.e., such that $f(v^k) - f(v^*) < \varepsilon^i$).

In [25], the authors provide the expression of this optimal smoothing parameter:

$$\mu_{opt}(\varepsilon^i) = \frac{-\lambda M \|A\|_2^2 + \sqrt{\left( \lambda M \|A\|_2^2 \right)^2 + M L(\nabla(l))\, \|A\|_2^2\, \varepsilon^i}}{M L(\nabla(l))}, \qquad (25)$$

where $M = P/2$ (Eq. (15)) and $L(\nabla(l)) = 2$ is the Lipschitz constant of the gradient of $l$, as defined in Eq. (17).
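Eq. (25) is a one-liner in code. In the sketch below (Python; the plug-in value $\|A\|_2^2 = 12$ is the standard upper bound on the spectral norm of a 3D forward-difference TV operator, used here only as an illustrative constant), $\mu_{opt}$ shrinks as the prescribed precision tightens, matching the intuition that loose precisions tolerate more smoothing:

```python
import numpy as np

def mu_opt(eps, lam, norm_A_sq, M, lip_l=2.0):
    """Optimal smoothing parameter of Eq. (25)."""
    a = lam * M * norm_A_sq
    return (-a + np.sqrt(a ** 2 + M * lip_l * norm_A_sq * eps)) / (M * lip_l)

P = 1000
M = P / 2.0                         # from Eq. (15)
mus = [mu_opt(e, lam=0.1, norm_A_sq=12.0, M=M) for e in (1.0, 1e-2, 1e-4)]
assert mus[0] > mus[1] > mus[2] > 0.0
```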

0278-0062 (c) 2017 IEEE Personal use is permitted but republicationredistribution requires IEEE permission See httpwwwieeeorgpublications_standardspublicationsrightsindexhtml for more information

This article has been accepted for publication in a future issue of this journal but has not been fully edited Content may change prior to final publication Citation information DOI 101109TMI20172749140 IEEETransactions on Medical Imaging

IEEE TRANSACTIONS ON MEDICAL IMAGING

We call the resulting algorithm CONESTA (short for COntinuation with NEsterov smoothing in a Shrinkage-Thresholding Algorithm). It is presented in detail, with convergence proofs, in [25].

Let K be the total number of FISTA loops used in CONESTA; we have experimentally verified that the convergence rate to the solution of Eq. (16) is O(1/K²) (which is the optimal convergence rate for first-order methods). Also, the algorithm works even if some of the weights λ_1 or λ are zero, which thus allows us to solve, e.g., the elastic net using CONESTA. Note that it has been rigorously proved that the continuation technique improves the convergence rate compared to simple smoothing with a single value of μ. Indeed, it has been demonstrated in [6] (see also [50]) that the convergence rate obtained with a single value of μ, even optimised, is O(1/K²) + O(1/K). However, it has recently been proved in [25] that the CONESTA algorithm achieves O(1/K) for general convex functions.

We note that CONESTA could easily be adapted to many other penalties. For example, to add a group lasso (GL) constraint to our structure, we just have to design a specific linear operator A_GL and concatenate it to the current linear operator A.
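As a minimal sketch of this operator concatenation (using `scipy.sparse`; the 1D first-difference TV operator and the two illustrative groups below are our own toy choices, not the paper's operators):

```python
import numpy as np
import scipy.sparse as sp

p = 6  # number of variables in this toy example

# A_tv: 1D total-variation operator (first-order differences), one row per
# adjacent pair of variables.
A_tv = sp.diags([-np.ones(p - 1), np.ones(p - 1)], [0, 1],
                shape=(p - 1, p)).tocsr()

# A_gl: hypothetical group-lasso operator selecting two groups of variables;
# each group contributes a block of rows picking out its members.
groups = [[0, 1, 2], [3, 4, 5]]
A_gl = sp.vstack([
    sp.csr_matrix((np.ones(len(g)), (list(range(len(g))), g)),
                  shape=(len(g), p))
    for g in groups
])

# Concatenated structure operator: stacking the blocks is all that is needed
# to combine the penalties.
A = sp.vstack([A_tv, A_gl])
```

Applying `A` to a loading vector yields the TV differences followed by the group blocks, so the same Nesterov-smoothing machinery can handle both penalties at once.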

Algorithm 2 CONESTA(X^T u, ε)
1: Initialize v^0 ∈ R^P
2: ε^0 = τ · GAP_{μ=10⁻⁸}(v^0)
3: μ^0 = μ_opt(ε^0)
4: repeat
5:   ε^i_μ = ε^i − μ^i γ_M
6:   v^{i+1} = FISTA(X^T u, v^i, ε^i_μ)
7:   ε^i = GAP_{μ=μ^i}(v^{i+1}) + μ^i γ_M
8:   ε^{i+1} = τ · ε^i
9:   μ^{i+1} = μ_opt(ε^{i+1})
10: until ε^i ≤ ε
11: return v^{i+1}
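The inner solver called on line 6 is a standard FISTA. As a self-contained illustration of that building block alone (here for a plain ℓ1-penalized least squares problem, without the Nesterov-smoothed TV term, so this is a sketch of the inner loop, not of CONESTA itself):

```python
import numpy as np

def fista_lasso(X, y, lam, max_iter=500, tol=1e-8):
    """Minimal FISTA for (1/2)||Xv - y||^2 + lam*||v||_1 (Beck & Teboulle).

    Stand-in for the inner solver of Algorithm 2: gradient step on the
    smooth part, proximal (soft-thresholding) step on the l1 part, plus
    Nesterov momentum.
    """
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    v = np.zeros(p)
    z = v.copy()
    t = 1.0
    for _ in range(max_iter):
        grad = X.T @ (X @ z - y)
        w = z - grad / L
        v_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # prox of l1
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = v_new + ((t - 1) / t_new) * (v_new - v)                # momentum
        if np.linalg.norm(v_new - v) < tol:
            v = v_new
            break
        v, t = v_new, t_new
    return v
```

CONESTA repeatedly calls such a solver with decreasing prescribed precisions ε^i_μ and decreasing smoothing μ^i.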

F. The algorithm for the SPCA-TV problem

The computation of a single component through SPCA-TV can be achieved by combining CONESTA and Eq. (5) within an alternating minimization loop. Mackey [34] demonstrated that further components can be efficiently obtained by incorporating this single-unit procedure in a deflation scheme, as done in, e.g., [14], [31]. The stopping criterion is defined as

STOPPINGCRITERION = ( ‖X_k − u^{i+1}(v^{i+1})^T‖_F − ‖X_k − u^i(v^i)^T‖_F ) / ‖X_k − u^{i+1}(v^{i+1})^T‖_F    (26)
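Eq. (26) translates directly to code; a small numpy sketch, with the rank-one reconstructions written as outer products:

```python
import numpy as np

def stopping_criterion(Xk, u_new, v_new, u_old, v_old):
    """Relative change of the Frobenius reconstruction error, Eq. (26)."""
    err_new = np.linalg.norm(Xk - np.outer(u_new, v_new), "fro")
    err_old = np.linalg.norm(Xk - np.outer(u_old, v_old), "fro")
    return (err_new - err_old) / err_new
```

A value close to zero means the alternating updates no longer improve the rank-one fit, so the loop can stop.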

All the presented building blocks were combined into Algorithm 3 to solve the SPCA-TV problem.

Algorithm 3 SPCA-TV(X, ε)
1: X_0 = X
2: for all k = 0, ..., K do  ▷ Components
3:   Initialize u^0 ∈ R^N
4:   repeat  ▷ Alternating minimization
5:     v^{i+1} = CONESTA(X_k^T u^i, ε)
6:     u^{i+1} = X_k v^{i+1} / ‖X_k v^{i+1}‖_2
7:   until STOPPINGCRITERION ≤ ε
8:   v_{k+1} = v^{i+1}
9:   u_{k+1} = u^{i+1}
10:  X_{k+1} = X_k − u_{k+1} v_{k+1}^T  ▷ Deflation
11: end for
12: return U = [u_1, ..., u_K], V = [v_1, ..., v_K]

III. EXPERIMENTS

We evaluated the performance of SPCA-TV using three experiments: one simulation study carried out on a synthetic
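As a toy illustration of Algorithm 3's structure (alternating minimization followed by deflation), the sketch below replaces the CONESTA v-step by plain ℓ1 soft-thresholding; it illustrates the loop, not the full TV-penalized method:

```python
import numpy as np

def sparse_pca_deflation(X, n_components=3, lam=0.1, n_iter=50):
    """Toy version of Algorithm 3: alternate a sparse v-step and a
    normalized u-step for each component, then deflate X.

    The inner CONESTA solver is replaced by a one-shot soft-thresholding
    update (l1 penalty only, no TV) to keep the sketch self-contained.
    """
    Xk = X.copy()
    U, V = [], []
    for _ in range(n_components):
        u = np.random.RandomState(0).randn(X.shape[0])
        u /= np.linalg.norm(u)
        for _ in range(n_iter):
            w = Xk.T @ u                                   # v-step: prox of l1
            v = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
            if np.linalg.norm(Xk @ v) == 0:
                break
            u = Xk @ v / np.linalg.norm(Xk @ v)            # u-step: normalize
        U.append(u)
        V.append(v)
        Xk = Xk - np.outer(u, v)                           # deflation
    return np.array(U).T, np.array(V).T
```

Swapping the v-step for a CONESTA call (with the TV-Elastic Net penalty) recovers the structure of the full algorithm.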

data set and two on neuroimaging data sets. In order to compare the performance of SPCA-TV with existing sparse PCA models, we also included results obtained with Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA from [29]. We used the scikit-learn implementation [42] for Sparse PCA, while we used the Parsimony package (https://github.com/neurospin/pylearn-parsimony) for the ElasticNet, GraphNet PCA and SPCA-TV methods. Concerning SSPCA, we used the MATLAB implementation provided in [29].

The number of parameters to set differs between methods. For Sparse PCA, the λ_1 parameter selects its optimal value from the range {0.1, 1.0, 5.0, 10.0}. ElasticNet PCA requires setting the λ_1 and λ_2 penalty weights. Meanwhile, GraphNet PCA and SPCA-TV require the setting of an additional parameter, namely the spatial constraint penalty λ. We re-parametrized these penalty weights as ratios. A global parameter α ∈ {0.01, 0.1, 1.0} controls the weight attributed to the whole penalty term, including the spatial and the ℓ_1 regularization. Individual constraints are expressed in terms of ratios: the ℓ_1 ratio λ_1/(λ_1 + λ_2 + λ) ∈ {0.1, 0.5, 0.8} and the ℓ_TV (or ℓ_GN for GraphNet) ratio λ/(λ_1 + λ_2 + λ) ∈ {0.1, 0.5, 0.8}. For ElasticNet, we explored the grid of parameters composed of the Cartesian product of the α and ℓ_1 ratio subsets. For GraphNet PCA and SPCA-TV, we performed a parameter search on a grid given by the Cartesian product of the (α, ℓ_1, ℓ_GN) and (α, ℓ_1, ℓ_TV) subsets, respectively. Concerning the SSPCA method, the regularization parameter selects its optimal value in the range {10⁻⁸, ..., 10⁸}.
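The ratio re-parametrization can be made concrete with a short enumeration of the grid. The sketch below uses the ranges quoted above; the constraint r1 + rtv < 1 (our assumption for this illustration) leaves the remaining mass for the ℓ_2 term:

```python
from itertools import product

# Grid ranges matching the ones quoted in the text.
alphas = [0.01, 0.1, 1.0]
l1_ratios = [0.1, 0.5, 0.8]
tv_ratios = [0.1, 0.5, 0.8]

grid = []
for alpha, r1, rtv in product(alphas, l1_ratios, tv_ratios):
    if r1 + rtv >= 1.0:                 # no mass left for the l2 term: skip
        continue
    lam1 = alpha * r1                   # l1 weight
    lam_tv = alpha * rtv                # TV (or GraphNet) weight
    lam2 = alpha * (1.0 - r1 - rtv)     # remaining mass goes to l2
    grid.append((lam1, lam2, lam_tv))
```

Each triple (λ_1, λ_2, λ) sums back to its global weight α, so the search explores the overall penalty strength and its split independently.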

However, in order to ensure that the extracted components have a minimum amount of sparsity, we also included a criterion controlling sparsity: at least half of the features of the components have to be zero. For both real neuroimaging experiments, performance was evaluated through a 5-fold × 5-fold double cross-validation pipeline. The double cross-validation process consists of two nested cross-validation loops, referred to as the internal and external cross-validation loops. In the outer (external) loop, all samples are randomly split into subsets referred to as training and test sets. The test sets are exclusively used for model assessment, while the train sets are used in the inner (internal) loop for model fitting


and selection. The inner folds select the set of parameters minimizing the reconstruction error on the outer fold. For the synthetic data, we used 50 different purposely-generated data sets and 5 inner folds for parameter selection. In order to evaluate the reconstruction accuracy of the methods, we report the mean Frobenius norm of the reconstruction error across the folds/data sets on independent test data. The hypothesis we wanted to test was whether there was a substantial decrease in the reconstruction error on independent data when using SPCA-TV compared to when using Sparse PCA, ElasticNet PCA, GraphNet PCA or SSPCA. It was tested through a paired two-sample t-test. This choice to compare the methods' performance on independent test data was motivated by the fact that the optimal reconstruction of the training set is necessarily hindered by the spatial and sparsity constraints. We therefore expect SPCA-TV to perform worse on train data than other, less constrained methods. However, the TV penalty has a more important purpose than just to minimize the reconstruction error: the estimation of coherent and reproducible loadings. Indeed, clinicians expect that if images from other patients with comparable clinical conditions had been used, the extracted loading vectors would have turned out to be similar. Therefore, since the ultimate goal of SPCA-TV is to yield stable and reproducible weight maps, it is more relevant to evaluate methods on independent test data.
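The nested loop structure can be sketched compactly. In the snippet below, `rank1_test_error` is a hypothetical stand-in for fitting a penalized PCA with a given weight and scoring it on held-out data (here a plain first-singular-vector projection that ignores the penalty), so only the double cross-validation skeleton should be read as faithful:

```python
import numpy as np

def kfold(n, k, seed=0):
    """Index pairs (train, test) for k folds over n samples."""
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]

def rank1_test_error(train, test, lam):
    """Placeholder model fit/score: error of projecting test data on the
    first right-singular vector of the training data (lam is ignored)."""
    v = np.linalg.svd(train, full_matrices=False)[2][0]
    return float(np.linalg.norm(test - np.outer(test @ v, v)))

def double_cv(X, lambdas, k_outer=5, k_inner=5):
    """Double (nested) CV: inner folds pick the parameter minimizing the
    reconstruction error; the outer fold scores the chosen parameter."""
    scores = []
    for tr, te in kfold(len(X), k_outer):
        best = min(lambdas, key=lambda lam: np.mean(
            [rank1_test_error(X[tr][itr], X[tr][ite], lam)
             for itr, ite in kfold(len(tr), k_inner, seed=1)]))
        scores.append(rank1_test_error(X[tr], X[te], best))
    return float(np.mean(scores))
```

The outer-loop test sets never influence parameter selection, which is what makes the reported reconstruction error an unbiased assessment.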

The stability of the loading vectors obtained across various training data sets (variation in the learning samples) was assessed through a similarity measure: the pairwise Dice index between loading vectors obtained with different folds/data sets [16]. We tested whether the pairwise Dice indices are significantly higher with SPCA-TV than with the other methods. Testing this hypothesis is equivalent to testing the sign of the difference of pairwise Dice indices between methods. However, since the pairwise Dice indices are not independent of one another (the folds share many of their learning samples), direct significance measures are biased. We therefore used permutation testing to estimate empirical p-values. The null hypothesis was tested by simulating samples from the null distribution: we generated 1,000 random permutations of the sign of the difference of pairwise Dice indices between the PCA methods under comparison, and the statistics computed on the true data were then compared with those obtained on the reshuffled data to obtain empirical p-values.
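Both ingredients are short to write down. A sketch (the support binarization threshold `tol` is our own assumption; the sign-flip scheme is the permutation test described above):

```python
import numpy as np
from itertools import combinations

def pairwise_dice(loadings, tol=1e-8):
    """Pairwise Dice indices 2|A∩B|/(|A|+|B|) between the supports of
    loading vectors estimated on different folds/data sets."""
    supports = [np.abs(v) > tol for v in loadings]
    return [2.0 * np.sum(a & b) / (np.sum(a) + np.sum(b))
            for a, b in combinations(supports, 2)]

def sign_flip_pvalue(diffs, n_perm=1000, seed=0):
    """One-sided empirical p-value for paired differences (e.g. Dice index
    differences between two methods) via random sign flips: under H0 the
    differences are symmetric around zero."""
    rng = np.random.default_rng(seed)
    obs = diffs.mean()
    null = np.array([(diffs * rng.choice([-1.0, 1.0], diffs.size)).mean()
                     for _ in range(n_perm)])
    return (np.sum(null >= obs) + 1) / (n_perm + 1)
```

Because the observed statistic is included in the null count, the smallest attainable p-value is 1/(n_perm + 1).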

For each experiment, we made the initial choice to retrieve the first ten components. However, given the length constraint, we only present the weight maps associated with the top three components for Sparse PCA and SPCA-TV in this paper. The weight maps of ElasticNet PCA, GraphNet PCA and SSPCA are presented in the supplementary materials (available in the supplementary files/multimedia tab).

A. Simulation study

We generated 50 sets of synthetic data, each composed of 500 images of size 100 × 100 pixels. Images are generated using the following noisy linear system:

u_1 V_1 + u_2 V_2 + u_3 V_3 + ε ∈ R^{10 000}    (27)

where V = [V_1, V_2, V_3] ∈ R^{10 000×3} are sparse and structured loading vectors, illustrated in Fig. 1. The support of V_1 defines the two upper dots, the support of V_2 defines the two lower dots, while the support of V_3 delineates the middle dot. The coefficients u = [u_1, u_2, u_3] that linearly combine the components of V are generated according to a centered Gaussian distribution. The elements of the noise vector ε are independent and identically distributed according to a centered Gaussian distribution, with a 0.1 signal-to-noise ratio (SNR). This SNR was selected by a previous calibration pipeline, where we tested the efficiency of data reconstruction at multiple SNR values ranging from 0 to 0.5. We decided to work with a 0.1 SNR because it is located in the range of values where standard PCA starts being less efficient in the recovery process.
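A sketch of this generative process (the exact dot positions and sizes are illustrative assumptions, not the paper's; the SNR is implemented here as the ratio of signal to noise norms):

```python
import numpy as np

def make_synthetic(n_samples=500, shape=(100, 100), snr=0.1, seed=0):
    """Three sparse, structured loadings on a 100x100 grid: V1 = two upper
    dots, V2 = two lower dots, V3 = one middle dot; Gaussian mixing
    coefficients and Gaussian noise scaled to the requested SNR."""
    rng = np.random.default_rng(seed)
    p = shape[0] * shape[1]
    V = np.zeros((p, 3))
    dots = [[(20, 25), (20, 75)],   # V1: two upper dots
            [(80, 25), (80, 75)],   # V2: two lower dots
            [(50, 50)]]             # V3: one middle dot
    for k, centers in enumerate(dots):
        img = np.zeros(shape)
        for r, c in centers:
            img[r - 5:r + 5, c - 5:c + 5] = 1.0   # 10x10 square "dot"
        V[:, k] = img.ravel()
    U = rng.standard_normal((n_samples, 3))
    signal = U @ V.T
    noise = rng.standard_normal((n_samples, p))
    noise *= np.linalg.norm(signal) / (snr * np.linalg.norm(noise))
    return signal + noise, V
```

With snr = 0.1, the noise norm is ten times the signal norm, which is the regime where plain PCA starts failing to recover the supports.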

Fig. 1. Loading vectors V = [V_1, V_2, V_3] ∈ R^{10 000×3} used to generate the images (panels: component 1, component 2, component 3).

We split the 500 artificial images into a test and a training set, with 250 images in each set, and learned the decomposition on the training set.

Fig. 2. Loading vectors recovered from 250 images using Sparse PCA and SPCA-TV (three components each).

Fig. 2 represents the loading vectors extracted with one data set. Please note that the sign is arbitrary: indeed, if we consider the loss of Eq. (3), u^T and v can both be multiplied by −1


TABLE I: Scores averaged across the 50 independent data sets. We tested whether the scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: ***: p ≤ 10⁻³.

Methods          Test Data Reconstruction Error   MSE    Dice Index
Sparse PCA       157.60                           0.91   0.28
ElasticNet PCA   157.24                           0.83   0.43
GraphNet PCA     157.08                           0.83   0.30
SSPCA            157.19                           1.54   0.07
SPCA-TV          157.01                           0.64   0.52

without changing anything. We observe that Sparse PCA yields very scattered loading vectors. The loading vectors of SPCA-TV, on the other hand, are sparse but also organized in clear regions. SPCA-TV provides loading vectors that closely match the ground truth. The reconstruction error is evaluated on the test sets (Tab. I), with its value over the 50 data sets being significantly lower with SPCA-TV than with the Sparse PCA (T = 9.45, p = 3.9·10⁻⁵⁷), ElasticNet PCA (T = 33.2, p = 2.7·10⁻³⁵), GraphNet PCA (T = 12.7, p = 3.6·10⁻¹⁷) and SSPCA [29] (T = 18.9, p = 3.9·10⁻²⁴) methods. Additional details concerning the reconstruction accuracy on both the train and test data are presented in Figure 1 of the supplementary materials (available in the supplementary files/multimedia tab).

A different way of quantifying the reconstruction accuracy for each method is to evaluate how closely the extracted loadings match the known ground truth of the simulated data set. We computed the mean squared error (MSE) between the ground truth and the estimated loadings. The results are presented in Tab. I. We note that the MSE is significantly lower with SPCA-TV than with Sparse PCA (T = 6.9, p = 8.0·10⁻⁹), ElasticNet PCA (T = 6.2, p = 1.1·10⁻⁷), GraphNet PCA (T = 4.1, p = 1.4·10⁻⁴) and SSPCA (T = 22.6, p = 1.5·10⁻²⁷).

Moreover, when evaluating the stability of the loading vectors across resampling, we found a significantly higher mean Dice index when using SPCA-TV compared to the other methods (p < 0.001). The results are presented in Tab. I. They indicate that SPCA-TV is more robust to variation in the learning samples than the other sparse methods: SPCA-TV yields reproducible loading vectors across data sets.

These results indicate that the SPCA-TV loadings are not only more stable across resampling, but also achieve a better recovery of the underlying variability in independent data than the Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA methods.

One of the issues linked to biconvex optimization is the risk of falling into local minima. Conscious of this potential risk, we set up an experiment in which we ran the optimization of the same problem 50 times, with a different starting point at each run. We then compared the resulting loading vectors obtained at each run and computed a similarity measure, the Dice index, which quantifies the proximity between the independently-run solutions obtained from different starting points. We obtained a Dice index of 0.99 on the 1st component, 0.99 on the 2nd component and 0.72 on the 3rd component. On the strength of these indices, we are confident of the algorithm's robustness and its ability to converge toward the same stable solution independently of the choice of the starting point.

B. 3D images of functional MRI of patients with schizophrenia

We then applied the methods to 3D images of BOLD functional MRI (fMRI) acquired with the same scanner and pulse sequence. Imaging was performed on a 1.5 T scanner using a standard head coil. For all functional scans, the field of view was 206×206×153 mm, with a resolution close to 3.5 mm in all directions. The parameters of the PRESTO sequence were: TE = 9.6 ms, TR = 19.25 ms, EPI factor = 15, flip angle = 9°. Each fMRI run consisted of 900 collected volumes. The cohort is composed of 23 patients with schizophrenia (average age = 34.96 years; 8 females / 15 males). Brain activation was measured while subjects experienced multimodal hallucinations. The fMRI data was pre-processed using SPM12 (Wellcome Department of Imaging Neuroscience, London, UK). Data preprocessing consisted of motion correction (realignment), coregistration of the individual anatomical T1 image to the functional images, and spatial normalization to MNI space using DARTEL, based on segmented T1 scans.

We considered each set of consecutive images under a pre-hallucination state as a block. Since most of the patients hallucinated more than once during the scanning session, we have more blocks than patients (83 blocks). The activation maps are computed from these blocks. Based on the general linear model approach, we regressed, for each block, the fMRI signal time course on a linear ramp function. Indeed, we hypothesized that activation in some regions presents a ramp-like increase during the time preceding the onset of hallucinations (see the example of regression in Figure 3 of the supplementary materials, available in the supplementary files/multimedia tab). The activation maps that we used as input to the SPCA-TV method are the statistical parametric maps associated with the coefficients of the block regression (see one example in Figure 4 of the supplementary materials). We obtained a data set of n = 83 maps and p = 63,966 features. We hypothesized that the principal components extracted with SPCA-TV from these activation maps could uncover major trends of variability within pre-hallucination patterns. Thus, they might reveal the existence of subgroups of patients according to the sensory modality (e.g., vision or audition) involved during hallucinations.

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, the SSPCA method [29] could not be applied to this specific example, since eligible data sets have to be full arrays of closed cubic form, without any holes: SSPCA does not support masked data such as the one used here.

The loading vectors extracted from the activation maps of pre-hallucination scans with Sparse PCA and SPCA-TV are presented in Fig. 3. We observe a similar behavior as in the synthetic example, namely that the loading vectors of


Sparse PCA tend to be scattered and produce irregular patterns. However, SPCA-TV seems to yield structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCA-TV loading vectors are not redundant and reveal different patterns.

Indeed, the loading vectors obtained by SPCA-TV are of great interest because they reveal insightful patterns of variability in the data: the second loading is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also of vision-processing areas such as the occipital fusiform gyrus and the parietal operculum cortex. The third loading reveals important weights in the middle temporal gyrus, the parietal operculum cortex and the frontal pole. The first loading vector encompasses all features of the brain. One might see this first component as a global variability affecting the whole brain, such as the overarching effect of age. SPCA-TV selects this dense configuration in spite of the sparsity constraint. It is highly desirable to remove any sort of global effect first, in order to start identifying, in the next components, local patterns that are not impacted by a global variability of this kind. We can identify a widespread set of dysfunctional language-related or vision-related areas that present increasing activity during the time preceding the onset of hallucinations. The regions extracted by SPCA-TV are found to be pertinent according to the existing literature on the topic ([28], [27], [7]).

These results seem to indicate the possible existence of subgroups of patients according to the hallucination modalities involved. An interesting application would be to use the scores of the second component extracted by SPCA-TV in order to distinguish patients with visual hallucinations from those suffering mainly from auditory hallucinations.

The reconstruction error is significantly lower with SPCA-TV than with Sparse PCA (T = 13.9, p = 1.5·10⁻⁴), ElasticNet PCA (T = 7.1, p = 2.1·10⁻³) and GraphNet PCA (T = 4.6, p = 1.0·10⁻²). Moreover, when assessing the stability of the loading vectors across the folds, we found a significantly higher mean Dice index with SPCA-TV compared to Sparse PCA (p = 4.0·10⁻³), ElasticNet PCA (p = 4.0·10⁻³) and GraphNet PCA (p = 2.0·10⁻³), as presented in Tab. II. Additional details regarding the reconstruction accuracy on both the train and test sets and the Dice index are presented in Figure 5 of the supplementary materials (available in the supplementary files/multimedia tab).

TABLE II: Scores of the fMRI data averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: ***: p ≤ 10⁻³; **: p ≤ 10⁻².

Methods          Test Data Reconstruction Error   Dice Index
Sparse PCA       151.52                           0.34
ElasticNet PCA   148.27                           0.32
GraphNet PCA     142.81                           0.58
SPCA-TV          141.40                           0.63

In conclusion, SPCA-TV significantly outperforms Sparse PCA, ElasticNet PCA and GraphNet PCA in terms of the reconstruction error on independent test data, and in the sense that its loading vectors are both more clinically interpretable and more stable.

We also evaluated the convergence speed of Sparse PCA, Mini-Batch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA, GraphNet PCA and SPCA-TV on this functional MRI data set of n = 83 samples and p = 63,966 features. We compared the execution time required for each algorithm to achieve a given level of precision in Tab. III. Sparse PCA and ElasticNet PCA are similar in terms of convergence time, while Mini-Batch Sparse PCA is much faster but does not converge to high precision. As expected, the structured methods (GraphNet PCA and SPCA-TV) take longer than the other sparse methods because of the inclusion

Fig. 3. Loading vectors recovered from the 83 activation maps using Sparse PCA and SPCA-TV (three components each).


TABLE III: Comparison of the execution time required for each sparse method to reach the same precision. Times reported in seconds.

Time (s) to reach a given precision
Methods                  10       1        10⁻¹     10⁻²      10⁻³
Mini-Batch Sparse PCA    5.32     -        -        -         -
Sparse PCA               15.80    23.12    34.43    38.68     45.01
ElasticNet PCA           12.37    13.81    30.27    39.64     40.63
GraphNet PCA             30.19    52.16    81.31    88.14     88.84
SPCA-TV                  42.77    295.86   809.30   1381.34   1445.99

of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but its convergence time is still reasonable for an fMRI data set with 65,000 voxels.

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients with a diagnosis of mild cognitive impairment (MCI) from the ADNI database who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13] and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, we could not apply the SSPCA method to this data set due to some intrinsic limitations of the method: SSPCA's application is restricted to N-dimensional array images. It does not support meshes of cortical surfaces such as the data set used here.

The loading vectors obtained from the data set with Sparse PCA and SPCA-TV are presented in Fig. 4. As expected, the Sparse PCA loadings are not easily interpretable, because the patterns are irregular and dispersed throughout the brain surface. In contrast, SPCA-TV reveals structured and smooth clusters in relevant regions.

The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between patients resulting from a global cortical atrophy, as often observed in AD patients. The second loading vector includes variability in the entorhinal cortex, hippocampus and temporal regions. Last, the third loading vector might be related to the atrophy of the frontal lobe, and also captures variability in the precuneus. Thus, SPCA-TV provides a smooth map that closely matches the well-known brain regions involved in Alzheimer's disease [22].

Indeed, it is well-documented that cortical atrophy progresses over three main stages in Alzheimer's disease ([10], [15]). The cortical structures are sequentially affected because of the accumulation of amyloid plaques. Cortical atrophy is first observed in the mild stage of the disease in regions surrounding the hippocampus ([26], [44], [47]) and the entorhinal cortex ([12]), as seen in the second component. This is consistent with early memory deficits. Then, the disease progresses to a moderate stage, where atrophy gradually extends to the prefrontal association cortex, as revealed in the third component ([37]). In the severe stage of the disease, the whole cortex is affected by atrophy ([15]), as revealed in the first component. In order to assess the clinical significance of these weight maps, we tested the correlation between the scores corresponding to the three components and performance on a clinical test, ADAS. The Alzheimer's Disease Assessment Scale-Cognitive subscale is the most widely used general cognitive measure in AD. ADAS is scored in terms of errors, so a high score indicates poor performance. We obtained significant correlations between ADAS test performance and the components' scores (Fig. 5): r = −0.34, p = 4.2·10⁻¹¹ for the first component; r = −0.26, p = 3.6·10⁻⁷ for the second component; and r = −0.35, p = 4.0·10⁻¹² for the third component. The same behavior is observable for all three components: the ADAS score grows proportionally to the level to which a patient is affected and to the severity of the atrophy he presents (in the temporal pole, the prefrontal region, and also globally). Conversely, control subjects score low on the ADAS metric and present a low level of cortical atrophy. Therefore, SPCA-TV provides us with clear biomarkers that are highly relevant to the study of Alzheimer's disease progression.

The reconstruction error is significantly lower with SPCA-TV than with Sparse PCA (T = 12.7, p = 2.1·10⁻⁴), ElasticNet PCA (T = 6.8, p = 2.3·10⁻³) and GraphNet PCA (T = 2.83, p = 4.7·10⁻²). The results are presented in Tab. IV. Moreover, when assessing the stability of the loading vectors across the folds, the mean Dice index is significantly higher with SPCA-TV than with the other methods. Additional details regarding the reconstruction accuracy on both the train and test sets and the Dice index are presented in Figure 7 of the supplementary materials (available in the supplementary files/multimedia tab).

IV. CONCLUSION

We proposed an extension of Sparse PCA that takes into account the spatial structure of the data. The optimization


TABLE IV: Scores averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: ***: p ≤ 10⁻³; **: p ≤ 10⁻²; *: p ≤ 10⁻¹.

Methods          Test Data Reconstruction Error   Dice Index
Sparse PCA       299.18                           0.44
ElasticNet PCA   283.26                           0.43
GraphNet PCA     281.36                           0.62
SPCA-TV          279.50                           0.65

scheme is able to minimize any combination of the ℓ_1, ℓ_2 and TV penalties while preserving the exact ℓ_1 penalty. We observe that SPCA-TV, in contrast to other existing sparse PCA methods, yields clinically interpretable results and reveals major sources of variability in data by highlighting structured clusters of interest in the loading vectors. Furthermore, SPCA-TV's loading vectors were more stable across the learning samples compared to the other methods. SPCA-TV was validated, and its applicability demonstrated, on three distinct data sets; we may reach the conclusion that SPCA-TV can be used on any kind of structured configuration and is able to reveal structure within the data.

Fig. 4. Loading vectors recovered from the 133 MCI patients using Sparse PCA and SPCA-TV (three components each).

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python library:
• Url: https://github.com/neurospin/pylearn-parsimony
• Description: ParsimonY is a Python library for structured and sparse machine learning. ParsimonY is open-source (BSD License) and compliant with the scikit-learn API.

b) Data sets and scripts:
• Url: ftp://ftp.cea.fr/pub/unati/brainomics/papers/pcatv
• Description: This url provides the simulation data set and the Python script used to create Fig. 2 of the paper.

REFERENCES

[1] A. Abraham, E. Dohmatob, B. Thirion, D. Samaras, and G. Varoquaux. Extracting brain regions from rest fMRI with total-variation constrained dictionary learning. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2013), 16th International Conference, Nagoya, Japan, Proceedings, Part II, pages 607-615, 2013.

[2] B. Fischl, M. Sereno, and A. Dale. Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195-207, 1999.

[3] A. Bakkour, J. C. Morris, and B. C. Dickerson. The cortical signature of prodromal AD: regional thinning predicts mild AD dementia. Neurology, 72:1048-1055, 2009.

[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

Fig. 5. Correlation of the components' scores with ADAS test performance.


[5] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Processing, 18(11):2419-2434, 2009.

[6] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557-580, 2012.

[7] L. Bentaleb, M. Beauregard, P. Liddle, and E. Stip. Cerebral activity associated with auditory verbal hallucinations: a functional magnetic resonance imaging case study. Journal of Psychiatry & Neuroscience: JPN, 27(2):110, 2002.

[8] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2006.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[10] H. Braak and E. Braak. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathologica, 82(4):239-259, 1991.

[11] K. Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. 1909.

[12] V. A. Cardenas, L. L. Chao, C. Studholme, K. Yaffe, B. L. Miller, C. Madison, S. T. Buckley, D. Mungas, N. Schuff, and M. W. Weiner. Brain atrophy associated with baseline and longitudinal measures of cognition. Neurobiology of Aging, 32(4):572-580, 2011.

[13] A. Dale, B. Fischl, and M. I. Sereno. Cortical surface-based analysis I: Segmentation and surface reconstruction. NeuroImage, 9(2):179-194, 1999.

[14] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

[15] A. Delacourte, J. P. David, N. Sergeant, L. Buee, A. Wattez, P. Vermersch, F. Ghozali, C. Fallet-Bianco, F. Pasquier, F. Lebert, et al. The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer's disease. Neurology, 52(6):1158-1158, 1999.

[16] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297-302, 1945.

[17] B. C. Dickerson, E. Feczko, J. C. Augustinack, J. Pacheco, J. C. Morris, and B. Fischl. Differential effects of aging and Alzheimer's disease on medial temporal lobe cortical thickness and surface area. Neurobiology of Aging, 30:432-440, 2009.

[18] E. Dohmatob, M. Eickenberg, B. Thirion, and G. Varoquaux. Speeding-up model-selection in GraphNet via early-stopping and univariate feature-screening. June 2015.

[19] M. Dubois, F. Hadj-Selem, T. Lofstedt, M. Perrot, C. Fischer, V. Frouin, and E. Duchesnay. Predictive support recovery with TV-Elastic Net penalties and logistic regression: an application to structural MRI. In Proceedings of the Fourth International Workshop on Pattern Recognition in Neuroimaging (PRNI 2014), 2014.

[20] H. Eavani, T. D. Satterthwaite, R. E. Filipovych, R. C. Gur, and C. Davatzikos. Identifying sparse connectivity patterns in the brain using resting-state fMRI. NeuroImage, 105:286-299, 2015.

[21] D. Felleman and D. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex (New York, N.Y.: 1991), 1(1):1-47, 1991.

[22] G. B. Frisoni, N. C. Fox, C. R. Jack, P. Scheltens, and P. M. Thompson. The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol., 6(2):67-77, 2010.

[23] L. Grosenick, B. Klingenberg, K. Katovich, B. Knutson, and J. Taylor. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage, 72:304-321, May 2013.

[24] R. Guo, M. Ahn, H. Zhu, and the Alzheimer's Disease Neuroimaging Initiative. Spatially weighted principal component analysis for imaging classification. Journal of Computational and Graphical Statistics, 24:274-296, 2015.

[25] F. Hadj-Selem, T. Lofstedt, V. Frouin, V. Guillemot, and E. Duchesnay. An iterative smoothing algorithm for regression with structured sparsity. arXiv:1605.09658 [stat], 2016.

[26] C. R. Jack, M. M. Shiung, J. L. Gunter, P. C. O'Brien, S. D. Weigand, D. S. Knopman, B. F. Boeve, R. J. Ivnik, G. E. Smith, R. H. Cha, et al. Comparison of different MRI brain atrophy rate measures with clinical disease progression in AD. Neurology, 62(4):591-600, 2004.

[27] R. Jardri, A. Pouchet, D. Pins, and P. Thomas. Cortical activations during auditory verbal hallucinations in schizophrenia: a coordinate-based meta-analysis. American Journal of Psychiatry, 168(1):73-81, 2011.

[28] R. Jardri, P. Thomas, C. Delmaire, P. Delion, and D. Pins. The neurodynamic organization of modality-dependent hallucinations. Cerebral Cortex, pages 1108-1117, 2013.

[29] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[30] I. Jolliffe, N. Trendafilov, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.

[31] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res., 11:517-553, 2010.

[32] B. Kandel, D. Wolk, J. Gee, and B. Avants. Predicting cognitive data from medical images using sparse linear regression. Information Processing in Medical Imaging: Proceedings of the Conference, 23:86-97, 2013.

[33] M. Li, Y. Liu, F. Chen, and D. Hu. Including signal intensity increases the performance of blind source separation on brain imaging data. IEEE Transactions on Medical Imaging, 34(2):551-563, 2015.

[34] L. W. Mackey. Deflation methods for sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017-1024. Curran Associates, Inc., 2009.

[35] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, Ecole Normale Superieure de Cachan, 2010.

[36] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19-60, 2010.

[37] C. R. McDonald, L. McEvoy, L. Gharapetian, C. Fennema-Notestine, D. J. Hagler, D. Holland, A. Koyama, J. B. Brewer, A. M. Dale, Alzheimer's Disease Neuroimaging Initiative, et al. Regional rates of neocortical atrophy from normal aging to early Alzheimer disease. Neurology, 73(6):457-465, 2009.

[38] H. Mohr, U. Wolfensteller, S. Frimmel, and H. Ruge. Sparse regularization techniques provide novel insights into outcome integration processes. NeuroImage, 104:163-176, January 2015.

[39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

[40] B. Ng, A. Vahdat, G. Hamarneh, and R. Abugharbieh. Generalized sparse classifiers for decoding cognitive states in fMRI. In SpringerLink, pages 108-115, Beijing, China, September 2012. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-15948-0_14.

[41] R. Nieuwenhuys. The myeloarchitectonic studies on the human cerebral cortex of the Vogt-Vogt school, and their significance for the interpretation of functional neuroimaging data. Brain Structure and Function, 218(2):303-352, 2013.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[43] M. Ramezani, K. Marble, H. Trang, I. S. Johnsrude, and P. Abolmaesumi. Joint sparse representation of brain activity patterns in multi-task fMRI data. IEEE Transactions on Medical Imaging, 34(1):2-12, 2015.

[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al. Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567-574, 2008.

[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang, and D. Hu. Making group inferences using sparse representation of resting-state functional MRI data with application to sleep deprivation. Human Brain Mapping, 38(9):4671-4689, 2017.

[46] J. G. Sled, A. P. Zijdenbos, and A. C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17:87-97, 1998.

[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell, and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754-1766, 2004.

[48] W.-T. Wang and H.-C. Huang. Regularized principal component analysis for spatial data. ArXiv e-prints, 2015.

[49] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, 2009.

[50] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719-752, 2012.

[51] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.


Structured Sparse Principal Components Analysis with the TV-Elastic Net Penalty

Amicie de Pierrefeu, Tommy Lofstedt, Fouad Hadj-Selem, Mathieu Dubois, Renaud Jardri, Thomas Fovet, Philippe Ciuciu, Senior Member, Vincent Frouin, and Edouard Duchesnay

Abstract—Principal component analysis (PCA) is an exploratory tool widely used in data analysis to uncover dominant patterns of variability within a population. Despite its ability to represent a data set in a low-dimensional space, PCA's interpretability remains limited. Indeed, the components produced by PCA are often noisy or exhibit no visually meaningful patterns. Furthermore, the fact that the components are usually non-sparse may also impede interpretation, unless arbitrary thresholding is applied. However, in neuroimaging, it is essential to uncover clinically interpretable phenotypic markers that would account for the main variability in the brain images of a population. Recently, some alternatives to the standard PCA approach, such as sparse PCA, have been proposed, their aim being to limit the density of the components. Nonetheless, sparsity alone does not entirely solve the interpretability problem in neuroimaging, since it may yield scattered and unstable components. We hypothesized that the incorporation of prior information regarding the structure of the data may lead to improved relevance and interpretability of brain patterns. We therefore present a simple extension of the popular PCA framework that adds structured sparsity penalties on the loading vectors in order to identify the few stable regions in the brain images that capture most of the variability. Such structured sparsity can be obtained by combining, e.g., ℓ1 and total variation (TV) penalties, where the TV regularization encodes information on the underlying structure of the data. This paper presents the structured sparse PCA (denoted SPCA-TV) optimization framework and its resolution. We demonstrate SPCA-TV's effectiveness and versatility on three different data sets. It can be applied to any kind of structured data, such as, e.g., N-dimensional array images or meshes of cortical surfaces. The gains of SPCA-TV over unstructured approaches (such as sparse PCA and ElasticNet PCA) or structured approaches (such as GraphNet PCA) are significant, since SPCA-TV reveals the variability within a data set in the form of intelligible brain patterns that are easier to interpret and more stable across different samples.

Keywords—MRI, unsupervised machine learning, PCA, total variation

I INTRODUCTION

Principal components analysis (PCA) is an unsupervised statistical procedure whose aim is to capture dominant patterns of variability in order to provide an optimal representation of a

A. de Pierrefeu, E. Duchesnay, P. Ciuciu, M. Dubois, and V. Frouin are with NeuroSpin, CEA, Paris-Saclay, Gif-sur-Yvette, France.

F. Hadj-Selem is with the Energy Transition Institute VeDeCoM, France.

T. Lofstedt is with the Department of Radiation Sciences, Umea University, Umea, Sweden.

R. Jardri and T. Fovet are with Univ. Lille, CNRS UMR 9193, SCALab, CHU Lille, Pole de Psychiatrie (unit CURE), Lille, France.

data set in a lower-dimensional space defined by the principal components (PCs). Given a data set $X \in \mathbb{R}^{N \times P}$ of $N$ samples and $P$ centered variables, PCA aims to find the most accurate rank-$K$ approximation of the data:

$$\min_{U,D,V} \left\| X - UDV^\top \right\|_F^2 \quad (1)$$
$$\text{s.t. } U^\top U = I,\; V^\top V = I,\; d_1 \ge \dots \ge d_K > 0,$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix, $V = [v_1, \dots, v_K] \in \mathbb{R}^{P \times K}$ contains the $K$ loading vectors (right singular vectors) that define the new coordinate system, in which the original features are uncorrelated, $D$ is the diagonal matrix of the $K$ singular values, and $U = [u_1, \dots, u_K] \in \mathbb{R}^{N \times K}$ contains the $K$ projections of the original samples in the new coordinate system (called principal components (PCs), or left singular vectors). Using $K = \operatorname{rank}(X)$ components leads to the singular value decomposition (SVD). A vast majority of neuroimaging problems involve high-dimensional feature spaces ($\approx 10^5$ features, i.e., voxels or mesh nodes over the cortical surface) with a relatively limited sample size ($\approx 10^2$ participants). With such "large P, small N" problems, the SVD formulation based on the data matrix is much more efficient than an eigenvalue decomposition of the large $P \times P$ covariance matrix.
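As a concrete illustration of the formulation in Eq. (1), a rank-K PCA can be obtained from the thin SVD of the centered data matrix in a few lines of NumPy. This is a toy sketch with arbitrary dimensions, not the paper's code; it simply shows why, for N << P, factorizing the N x P matrix is preferable to forming the P x P covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 20, 500, 3            # toy "large P, small N" setting
X = rng.standard_normal((N, P))
X -= X.mean(axis=0)             # center the variables

# Rank-K PCA via the thin SVD of X, never forming the P x P covariance.
U_full, d, Vt = np.linalg.svd(X, full_matrices=False)
U, D, V = U_full[:, :K], np.diag(d[:K]), Vt[:K].T

X_hat = U @ D @ V.T             # best rank-K approximation in Frobenius norm

# Loadings are orthonormal and singular values are sorted decreasingly.
assert np.allclose(V.T @ V, np.eye(K), atol=1e-10)
assert np.all(np.diff(d) <= 1e-10)
```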

In a neuroimaging context, our goal is to discover the phenotypic markers accounting for the main variability in a population's brain images. For example, when considering structural images of patients that will convert to Alzheimer's disease (AD), we are interested in revealing the brain patterns of atrophy explaining the variability in this population. This provides indications of a possible stratification of the cohort into homogeneous sub-groups that may be clinically similar, but with different patterns of atrophy. This could suggest different sub-types of patients with AD, or some other etiologies, such as dementia with Lewy bodies. Clustering methods might be natural approaches to address such situations; however, they cannot reveal subtle differences that go beyond a global and trivial pattern of atrophy. Such patterns are usually captured by the first component of PCA which, after being removed, offers the possibility to identify spatial patterns on the subsequent components.

However, PCA provides dense loading vectors (patterns) that cannot be used to identify brain markers without arbitrary thresholding.

Recently, some alternatives have proposed to add sparsity to this matrix factorization problem [33], [36], [43]. The sparse dictionary learning framework proposed by [36] provides a sparse coding (rows of U) of samples through a sparse linear combination of dense basis elements (columns of V). However, the identification of biomarkers requires a sparse dictionary (columns of V). This is precisely the objective of sparse PCA (SPCA), proposed in [30], [51], [14], [49], [31], which adds a sparsity-inducing penalty on the columns of V. Imposing such sparsity constraints on the loading coefficients is a procedure that has been used in fMRI to produce sparse representations of brain functional networks [20], [45].

Copyright © 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

However, sparse PCA is limited by the fact that it ignores the inherent spatial correlation in the data. It leads to scattered patterns that are difficult to interpret. Furthermore, constraining only the number of features included in the PCs might not always be fully relevant, since most data sets are expected to have a spatial structure. For instance, MRI data are naturally encoded on a grid: some voxels are neighbors, while others are not.

We hypothesize that brain patterns are organized into distributed regions across the brain [11], [21], [41]. Recent studies tried to overcome this limitation by encoding prior information concerning the spatial structure of the data (see [29], [24], [48]). However, they used methods that are difficult to plug into the optimization scheme (e.g., spline smoothing, wavelet smoothing) and incorporated prior information that sometimes may be difficult to define. One simple solution is the use of a GraphNet penalty ([23], [32], [40], [18], [38]). It promotes local smoothness of the weight map by simply forcing adjacent voxels to have similar weights, using an ℓ2 penalty on the gradient of the weight map. Nonetheless, we hypothesized that GraphNet provides smooth solutions rather than clearly identified regions. In data classification problems, when extracting structured and sparse predictive maps, the goals are largely aligned with those of PCA. Some classification studies have revealed stable and interpretable results by adding a total variation (TV) penalty to the sparsity constraint (see [19]). TV is widely used as a tool in image denoising and restoration. It accounts for the spatial structure of images by encoding piecewise smoothness and enabling the recovery of homogeneous regions separated by sharp boundaries.

For simplicity, rather than solving Eq. (1), we solve a slightly different criterion, which results from using the Lagrange form rather than the bound form of the constraints on V. Then, we extend the Lagrangian form by adding penalties (ℓ1, ℓ2 and TV) to the minimization problem:

$$\min_{U,D,V} \frac{1}{N}\left\|X - UDV^\top\right\|_F^2 + \sum_{k=1}^{K}\left(\lambda_2\|v_k\|_2^2 + \lambda_1\|v_k\|_1 + \lambda\sum_{g\in G}\|A_g v_k\|_2\right) \quad (2)$$
$$\text{s.t. } \|u_k\|_2^2 = 1, \quad \forall k = 1,\dots,K,$$

where $\lambda_1$, $\lambda_2$ and $\lambda$ are hyper-parameters controlling the relative strength of each penalty. We further propose a generic optimization framework that can combine any differentiable convex (penalized) loss function with (i) penalties whose proximal operator is known (here $\|\cdot\|_1$) and (ii) a large range of complex non-smooth convex structured penalties that can be formulated as an $\ell_{2,1}$-norm defined over a set of groups $G$. Such group penalties cover, e.g., total variation and overlapping group lasso.

This new problem aims at finding a linear combination of original variables that points in directions explaining as much variance as possible in the data, while enforcing sparsity and structure (piecewise smoothness for TV) of the loadings.

To achieve this, it is necessary to sacrifice some of the explained variance, as well as the orthogonality of both the loadings and the principal components. Most existing SPCA algorithms [51], [14], [49], [31] do not impose orthogonal loading directions either. While we forced the components to have unit norm for visualization purposes, we do not, in this formulation, enforce $\|v_k\|_2 = 1$. Instead, the value of $\|v\|_2$ is controlled by the hyper-parameter $\lambda_2$. This penalty on the loadings, together with the unit norm constraint on the components, prevents us from obtaining trivial solutions. The optional $\frac{1}{N}$ factor conveniently normalizes the loss to account for the number of samples, in order to simplify the setting of the hyper-parameters $\lambda_1$, $\lambda_2$, $\lambda$.

This paper presents an extension of the popular PCA framework by adding structured sparsity-inducing penalties on the loading vectors, in order to identify the few stable regions in the brain images accounting for most of the variability. The addition of a prior that reflects the data's structure within the learning process gives the paper a scope that goes beyond sparse PCA. To our knowledge, very few papers ([1], [24], [29], [48]) have addressed the use of structural constraints in PCA. The study [29] proposes a norm that induces structured sparsity (called SSPCA) by restricting the support of the solution to a certain set of groups of variables. Possible supports include sets of variables forming rectangles when arranged on a grid. Only one study [1] recently used the total variation prior, in a context of multi-subject dictionary learning, based on a different optimization scheme [5].

Section II presents our main contribution: a simple optimization algorithm that combines well-known methods (a deflation scheme and alternating minimization) with an original continuation algorithm based on Nesterov's smoothing technique. Our proposed algorithm has the ability to include the TV penalty, but many other non-smooth penalties, such as, e.g., overlapping group lasso, could also be used. This versatile mathematical framework is an essential feature in neuroimaging. Indeed, it enables a straightforward application to all kinds of data with known structure, such as N-dimensional images (of voxels) or meshes of (cortical) surfaces. Section III demonstrates the relevance of structured sparsity on both simulated and experimental data, for structural and functional MRI (fMRI) acquisitions. SPCA-TV achieved a higher reconstruction accuracy and more stable solutions than ElasticNet PCA, sparse PCA, GraphNet PCA and SSPCA (from [29]). More importantly, SPCA-TV yields more interpretable loading vectors than the other methods.

II METHOD

A common approach to solve the PCA problem (see [14], [31], [49]) is to compute a rank-1 approximation of the data matrix, and then repeat this on the deflated matrix [34], where the influence of the PCs is successively extracted and discarded. We first detail the notation for estimating a single component (Section II-A) and its solution using an alternating minimization pipeline (Section II-B). Then, we develop the TV regularization framework (Section II-C and Section II-D). Last, we discuss the algorithm used to solve the minimization problem and its ability to converge toward stable pairs of components/loading vectors (Section II-E and Section II-F).
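The rank-1-plus-deflation scheme can be sketched as follows. Here a plain power method stands in for the penalized single-component solver developed in Sections II-A to II-E, so the recovered triplets coincide with the ordinary SVD; this is a toy sketch, not the authors' implementation.

```python
import numpy as np

def rank1_approx(X, n_iter=200):
    """Leading singular triplet (u, d, v) of X via the power method;
    a plain-SVD stand-in for the penalized rank-1 solver."""
    v = np.ones(X.shape[1]) / np.sqrt(X.shape[1])
    for _ in range(n_iter):
        u = X @ v
        u /= np.linalg.norm(u)
        v = X.T @ u
        d = np.linalg.norm(v)
        v /= d
    return u, d, v

rng = np.random.default_rng(1)
X = rng.standard_normal((15, 40))
Xk = X.copy()
comps = []
for k in range(3):
    u, d, v = rank1_approx(Xk)
    comps.append((u, d, v))
    Xk = Xk - d * np.outer(u, v)   # deflation: remove the extracted component

# The successively extracted singular values match np.linalg.svd.
d_ref = np.linalg.svd(X, compute_uv=False)[:3]
assert np.allclose([c[1] for c in comps], d_ref, rtol=1e-3)
```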

A. Single component computation

Given a pair of loading/component vectors $u \in \mathbb{R}^N$, $v \in \mathbb{R}^P$, the best rank-1 approximation of the problem given in Eq. (2) is equivalent [49] to:

$$\min_{u,v} f \equiv \overbrace{\underbrace{-\frac{1}{N}u^\top X v + \lambda_2\|v\|_2^2}_{l(v)}}^{\text{smooth}} + \overbrace{\underbrace{\lambda_1\|v\|_1}_{h(v)}}^{\text{non-smooth}} + \underbrace{\lambda\sum_{g\in G}\|A_g v\|_2}_{s(v)} \quad (3)$$
$$\text{s.t. } \|u\|_2^2 \le 1,$$

where $l(v)$ is the penalized smooth (i.e., differentiable) loss, $h(v)$ is a sparsity-inducing penalty whose proximal operator is known, and $s(v)$ is a complex penalty on the structure of the input variables, with an unknown proximal operator.

This problem is convex in $u$ and in $v$, but not in $(u, v)$.

B. Alternating minimization of the bi-convex problem

The objective function to minimize is bi-convex [9]. The most common approach to solve a bi-convex optimization problem (which does not guarantee global optimality of the solution) is to alternately update $u$ and $v$, by fixing one of them at a time and solving the corresponding convex optimization problem on the other parameter vector.

On the one hand, when $v$ is fixed, the problem to solve is

$$\min_{u\in\mathbb{R}^N} -\frac{1}{N}u^\top X v \quad \text{s.t. } \|u\|_2^2 \le 1, \quad (4)$$

with the associated explicit solution

$$u^*(v) = \frac{Xv}{\|Xv\|_2}. \quad (5)$$

On the other hand, solving the equation with respect to $v$, with a fixed $u$, presents a higher level of difficulty, which will be discussed in Section II-E.
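The closed-form update of Eq. (5) can be checked numerically: by the Cauchy-Schwarz inequality, $u^* = Xv/\|Xv\|_2$ attains the maximum of $u^\top Xv$ over the unit ball. A hypothetical sketch with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 10, 30
X = rng.standard_normal((N, P))
v = rng.standard_normal(P)

u_star = X @ v / np.linalg.norm(X @ v)   # closed-form update of Eq. (5)
best = u_star @ X @ v                    # equals ||Xv||_2

# No random feasible u (||u||_2 = 1) does better than u_star.
for _ in range(1000):
    u = rng.standard_normal(N)
    u /= np.linalg.norm(u)
    assert u @ X @ v <= best + 1e-12
```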

C. Reformulating TV as a linear operator

Before discussing the minimization with respect to $v$, we provide details on the encoding of the spatial structure within the $s(v)$ penalty.

It is essential to note that the algorithm is independent of the spatial structure of the data. All the structural information is encoded in a linear operator $A$ that is computed outside of the algorithm. Thus, the algorithm has the ability to address various structured data and, most importantly, other penalties than just the TV penalty. The algorithm requires the setting of two parameters: (i) the linear operator $A$, and (ii) a projection function, detailed in Eq. (12).

This section presents the formulation and the design of $A$ in the specific case of a TV penalty applied to the loading vector $v$, measured on a 3-dimensional (3D) image or a 2D mesh of the cortical surface.

1) 3D image: The brain mask is used to establish a mapping $g(i, j, k)$ between the coordinates $(i, j, k)$ in the 3D grid and an index $g \in [\![1, P]\!]$ in the collapsed image. We extract the spatial neighborhood of $g$, of size $\le 4$, corresponding to voxel $g$ and its 3 neighboring voxels within the mask in the $i$, $j$ and $k$ directions. By definition, we have

$$TV(v) \equiv \sum_{g=1}^{P}\left\|\nabla\left(v_{g(i,j,k)}\right)\right\|_2. \quad (6)$$

The first-order approximation of the spatial gradient $\nabla(v_{g(i,j,k)})$ is computed by applying the linear operator $A'_g \in \mathbb{R}^{3\times 4}$ to the loading vector $v_g$ in the spatial neighborhood of $g$, i.e.,

$$\nabla\left(v_{g(i,j,k)}\right) = \underbrace{\begin{bmatrix} -1 & 1 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{bmatrix}}_{A'_g} \underbrace{\begin{bmatrix} v_{g(i,j,k)} \\ v_{g(i+1,j,k)} \\ v_{g(i,j+1,k)} \\ v_{g(i,j,k+1)} \end{bmatrix}}_{v_g}, \quad (7)$$

where $v_{g(i,j,k)}$ is the loading coefficient at index $g$ in the collapsed image, corresponding to voxel $(i, j, k)$ in the 3D image. Then, $A'_g$ is extended, using zeros, to a large but very sparse matrix $A_g \in \mathbb{R}^{3\times P}$, in order to be directly applicable to the full vector $v$. If some neighbors lie outside the mask, the corresponding rows in $A_g$ are removed. Noticing that, for TV, there is one group per voxel in the mask ($G = [\![1, P]\!]$), we can reformulate TV from Eq. (6) using a general expression:

$$TV(v) = \sum_{g\in G}\|A_g v\|_2. \quad (8)$$

Finally, with a vertical concatenation of all the $A_g$ matrices, we obtain the full linear operator $A \in \mathbb{R}^{3P\times P}$ that will be used in Section II-E.
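The construction above can be sketched with SciPy sparse matrices. The helper name `tv_linear_operator` is hypothetical, and, for simplicity, rows whose neighbor falls outside the mask are kept as all-zero rows rather than removed (which leaves the TV value unchanged); this is an illustrative sketch, not the authors' code.

```python
import numpy as np
import scipy.sparse as sp

def tv_linear_operator(mask):
    """Sparse forward-difference operator A (3P x P), stacking one 3-row
    block A_g per in-mask voxel, following Eq. (7)."""
    idx = -np.ones(mask.shape, dtype=int)
    idx[mask] = np.arange(mask.sum())          # the g(i, j, k) mapping
    P = int(mask.sum())
    rows, cols, vals = [], [], []
    r = 0
    for i, j, k in zip(*np.where(mask)):
        for d, nb in enumerate([(i + 1, j, k), (i, j + 1, k), (i, j, k + 1)]):
            if all(c < s for c, s in zip(nb, mask.shape)) and mask[nb]:
                rows += [r + d, r + d]
                cols += [idx[i, j, k], idx[nb]]
                vals += [-1.0, 1.0]
        r += 3                                  # 3 gradient rows per voxel
    return sp.csr_matrix((vals, (rows, cols)), shape=(3 * P, P))

mask = np.ones((4, 4, 4), dtype=bool)
A = tv_linear_operator(mask)
assert A.shape == (3 * 64, 64)
assert np.allclose(A @ np.ones(64), 0.0)        # constant image: zero gradient

# TV(v) from Eq. (8): sum of per-voxel gradient norms.
v = np.random.default_rng(3).standard_normal(64)
tv = np.linalg.norm((A @ v).reshape(-1, 3), axis=1).sum()
```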

2) Mesh of cortical surface: The linear operator $A'_g$ used to compute a first-order approximation of the spatial gradient can be obtained by examining the neighboring vertices of each vertex $g$. With common triangle-tessellated surfaces, the neighborhood size is $\le 7$ (including $g$). In this setting, we have $A'_g \in \mathbb{R}^{3\times 7}$, which can be extended and concatenated to obtain the full linear operator $A$.

D. Nesterov's smoothing of the structured penalty

We consider the convex non-smooth minimization of Eq. (3) with respect to $v$, where $u$ is thus fixed. This problem includes a general structured penalty $s(\cdot)$ that covers the specific case of TV. A widely used approach when dealing with non-smooth problems is to use methods based on the proximal operator of the penalties. For the $\ell_1$ penalty alone, the proximal operator is analytically known, and efficient iterative algorithms, such as ISTA and FISTA, are available (see [4]). However, since the proximal operator of the TV+$\ell_1$ penalty has no closed form, a standard implementation of those algorithms is not suitable. In order to overcome this barrier, we used Nesterov's smoothing technique [39]. It consists of approximating the non-smooth penalties for which the proximal operator is unknown (e.g., TV) with a smooth function (of which the gradient is known). Non-smooth penalties with known proximal operators (e.g., $\ell_1$) are not affected. Hence, as described in [50], it allows the use of an exact accelerated proximal gradient algorithm. Thus, we can solve the PCA problem penalized by TV and elastic net, where an exact $\ell_1$ penalty is used.

Using the dual norm of the $\ell_2$-norm (which happens to be the $\ell_2$-norm too), Eq. (8) can be reformulated as

$$s(v) = \sum_{g\in G}\|A_g v\|_2 = \sum_{g\in G}\max_{\|\alpha_g\|_2\le 1}\alpha_g^\top A_g v, \quad (9)$$

where $\alpha_g \in K_g = \{\alpha_g \in \mathbb{R}^3 : \|\alpha_g\|_2 \le 1\}$ is a vector of auxiliary variables in the $\ell_2$ unit ball, associated with $A_g v$. As with $A \in \mathbb{R}^{3P\times P}$, which is the vertical concatenation of all the $A_g$, we concatenate all the $\alpha_g$ to form $\alpha = [\alpha_1^\top, \dots, \alpha_P^\top]^\top \in K = \{[\alpha_1^\top, \dots, \alpha_P^\top]^\top : \alpha_g \in K_g\} \subset \mathbb{R}^{3P}$. $K$ is the Cartesian product of 3D unit balls in Euclidean space and, therefore, a compact convex set. Eq. (9) can further be written as

$$s(v) = \max_{\alpha\in K}\alpha^\top A v. \quad (10)$$

Given this formulation of $s(v)$, we can apply Nesterov's smoothing. For a given smoothing parameter $\mu > 0$, the function $s(v)$ is approximated by the smooth function

$$s_\mu(v) = \max_{\alpha\in K}\left\{\alpha^\top A v - \frac{\mu}{2}\|\alpha\|_2^2\right\}, \quad (11)$$

for which $\lim_{\mu\to 0} s_\mu(v) = s(v)$. Nesterov [39] demonstrates this convergence using the inequality in Eq. (15). The value of $\alpha^*_\mu(v) = [\alpha^{*\top}_{\mu,1}, \dots, \alpha^{*\top}_{\mu,g}, \dots, \alpha^{*\top}_{\mu,P}]^\top$ that maximizes Eq. (11) is the concatenation of the projections of the vectors $A_g v \in \mathbb{R}^3$ onto the $\ell_2$ ball $K_g$: $\alpha^*_{\mu,g}(v) = \operatorname{proj}_{K_g}\!\left(\frac{A_g v}{\mu}\right)$, where

$$\operatorname{proj}_{K_g}(x) = \begin{cases} x & \text{if } \|x\|_2 \le 1, \\ \frac{x}{\|x\|_2} & \text{otherwise.} \end{cases} \quad (12)$$

The function $s_\mu$, i.e., the Nesterov smooth transform of $s$, is convex and differentiable. Its gradient, given by [39],

$$\nabla(s_\mu)(v) = A^\top \alpha^*_\mu(v), \quad (13)$$

is Lipschitz-continuous, with constant

$$L\left(\nabla(s_\mu)\right) = \frac{\|A\|_2^2}{\mu}, \quad (14)$$

where $\|A\|_2$ is the matrix spectral norm of $A$. Moreover, Nesterov [39] provides the following inequality relating $s_\mu$ and $s$:

$$s_\mu(v) \le s(v) \le s_\mu(v) + \mu M, \quad \forall v \in \mathbb{R}^P, \quad (15)$$

where $M = \max_{\alpha\in K}\frac{\|\alpha\|_2^2}{2} = \frac{P}{2}$.

Thus, a new (smoothed) optimization problem, closely related to Eq. (3) (with fixed $u$), arises from this regularization as

$$\min_v \overbrace{\underbrace{-\frac{1}{N}u^\top X v + \lambda_2\|v\|_2^2}_{l(v)} + \lambda\underbrace{\left\{\alpha^*_\mu(v)^\top A v - \frac{\mu}{2}\|\alpha^*_\mu(v)\|_2^2\right\}}_{s_\mu(v)}}^{\text{smooth}} + \lambda_1\overbrace{\underbrace{\|v\|_1}_{h(v)}}^{\text{non-smooth}}. \quad (16)$$

Since we are now able to explicitly compute the gradient of the smooth part, $\nabla(l + \lambda s_\mu)$ (Eq. (18)), its Lipschitz constant (Eq. (19)), and also the proximal operator of the non-smooth part, we have all the ingredients necessary to solve this minimization problem using accelerated proximal gradient methods [4]. Given a starting point $v^0$ and a smoothing parameter $\mu$, FISTA (Algorithm 1) minimizes the smoothed problem and reaches a prescribed precision $\varepsilon_\mu$.

However, in order to control the convergence of the algorithm (presented in Section II-E1), we introduce the Fenchel dual function and the corresponding dual gap of the objective function. The Fenchel duality requires the loss to be strongly convex, which is why we further reformulate Eq. (16) slightly. All penalty terms are divided by $\lambda_2$ and, by using the following equivalent formulation for the loss, we obtain the minimization problem

$$\min_v f_\mu \equiv \underbrace{\frac{1}{2}\left\|v - \frac{X^\top u}{N\lambda_2}\right\|_2^2}_{L(v)} + \underbrace{\frac{1}{2}\|v\|_2^2 + \frac{\lambda}{\lambda_2}\underbrace{\left\{\alpha^*_\mu(v)^\top A v - \frac{\mu}{2}\|\alpha^*_\mu(v)\|_2^2\right\}}_{s_\mu(v)} + \frac{\lambda_1}{\lambda_2}\underbrace{\|v\|_1}_{h(v)}}_{\psi_\mu(v)}. \quad (17)$$

This new formulation of the smoothed objective function (noted $f_\mu$) preserves the decomposition of $f_\mu$ into a sum of a smooth term, $l + \frac{\lambda}{\lambda_2}s_\mu$, and a non-smooth term, $h$. Such a decomposition is required for the application of FISTA with Nesterov's smoothing. Moreover, this formulation provides a decomposition of $f_\mu$ into a sum of a smooth loss $L$ and a penalty term $\psi_\mu$, required for the calculation of the gap presented in Section II-E1.

We provide all the required quantities to minimize Eq. (17) using Algorithm 1. Using Eq. (13), we compute the gradient of the smooth part as

$$\nabla\left(l + \frac{\lambda}{\lambda_2}s_\mu\right) = \nabla(l) + \frac{\lambda}{\lambda_2}\nabla(s_\mu) = \left(2v - \frac{X^\top u}{N\lambda_2}\right) + \frac{\lambda}{\lambda_2}A^\top\alpha^*_\mu(v^k), \quad (18)$$

and its Lipschitz constant (using Eq. (14)):

$$L\left(\nabla\left(l + \frac{\lambda}{\lambda_2}s_\mu\right)\right) = 2 + \frac{\lambda}{\lambda_2}\frac{\|A\|_2^2}{\mu}. \quad (19)$$


Algorithm 1 FISTA$\big(X^\top u, v^0, \varepsilon_\mu, \mu, A, \lambda, L(\nabla(g))\big)$
1: $v^1 = v^0$, $k = 2$
2: Compute the gradient of the smooth part, $\nabla(g + \lambda s_\mu)$ (Eq. (18)), and its Lipschitz constant, $L_\mu$ (Eq. (19))
3: Compute the step size $t_\mu = L_\mu^{-1}$
4: repeat
5:  $z = v^{k-1} + \frac{k-2}{k+1}\left(v^{k-1} - v^{k-2}\right)$
6:  $v^k = \operatorname{prox}_h\!\big(z - t_\mu \nabla(g + \lambda s_\mu)(z)\big)$
7: until $\mathrm{GAP}_\mu(v^k) \le \varepsilon_\mu$
8: return $v^k$
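As an illustration, Algorithm 1 can be sketched in Python for a one-dimensional TV operator, in which case the projection defining $\alpha_\mu^*(v)$ reduces to an elementwise clipping of $Av/\mu$ to $[-1, 1]$. This is our own sketch, not a released implementation: it starts from $v^0 = 0$ and uses a fixed iteration budget in place of the duality-gap stopping criterion of line 7.

```python
import numpy as np

def fista_smoothed(c, A, lam, lam1, lam2, mu, n_iter=500):
    """FISTA (Algorithm 1) on the smoothed problem of Eq. (17),
    with c = X^T u / (n lambda_2) and A a 1D TV operator."""
    L_mu = 2.0 + (lam / lam2) * np.linalg.norm(A, 2) ** 2 / mu  # Eq. (19)
    t = 1.0 / L_mu                                              # step size
    v_old = v = np.zeros(A.shape[1])
    for k in range(2, n_iter + 2):
        z = v + (k - 2.0) / (k + 1.0) * (v - v_old)             # momentum step
        alpha = np.clip(A @ z / mu, -1.0, 1.0)                  # alpha*_mu(z), 1D TV case
        grad = (2.0 * z - c) + (lam / lam2) * (A.T @ alpha)     # Eq. (18)
        w = z - t * grad
        # prox of (lambda_1 / lambda_2) ||.||_1: soft-thresholding
        v_old, v = v, np.sign(w) * np.maximum(np.abs(w) - t * lam1 / lam2, 0.0)
    return v

# Toy example: denoise a piecewise-constant target under l1 + TV.
P = 30
A = np.eye(P, k=1)[:-1] - np.eye(P)[:-1]   # (P-1) x P finite-difference operator
c = np.concatenate([np.zeros(10), np.ones(10), np.zeros(10)])
v = fista_smoothed(c, A, lam=0.1, lam1=0.05, lam2=1.0, mu=1e-3)
```

With the small penalty weights chosen here, the iterate approaches the unconstrained minimizer of the two quadratic terms while staying piecewise constant.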

E. Minimization of the loading vectors with CONESTA

The step size $t_\mu$ computed in Line 3 of Algorithm 1 depends on the smoothing parameter $\mu$ (see Eq. (19)). Hence, there is a trade-off between speed and precision. Indeed, high precision, with a small $\mu$, will lead to slow convergence (small $t_\mu$). Conversely, poor precision (large $\mu$) will lead to rapid convergence (large $t_\mu$). Thus, we propose a continuation approach (Algorithm 2), which decreases the smoothing parameter with respect to the distance to the minimum. On the one hand, when we are far from $v^*$ (the minimum of Eq. (17)), we can use a large $\mu$ to rapidly decrease the objective function. On the other hand, when we are close to $v^*$, we need a small $\mu$ in order to obtain an accurate approximation of the original objective function.

1) Duality gap: The distance to the unknown $f(v^*)$ is estimated using the duality gap. Duality formulations are often used to control the achieved precision level when minimizing convex functions. They provide an estimation of the error $f(v^k) - f(v^*)$ for any $v^k$, even when the minimum is unknown. The duality gap is the cornerstone of the CONESTA algorithm. Indeed, it is used three times:

(i) As the stopping criterion in the inner FISTA loop (Line 7 in Algorithm 1). FISTA will stop as soon as the current precision is achieved using the current smoothing parameter $\mu$. This prevents unnecessary convergence toward the approximated (smoothed) objective function.

(ii) In the $i$-th CONESTA iteration, as a way to estimate the current error $f(v^i) - f(v^*)$ (Line 7 in Algorithm 2). The error is estimated using the gap of the smoothed problem, $\mathrm{GAP}_{\mu=\mu^i}(v^{i+1})$, which avoids unnecessary computation since it has already been computed during the last iteration of FISTA. The inequality in Eq. (15) is used to obtain the gap $\varepsilon^i$ to the original non-smoothed problem. The next desired precision, $\varepsilon^{i+1}$, and the smoothing parameter, $\mu^{i+1}$, are derived from this value.

(iii) Finally, as the global stopping criterion within CONESTA (Line 10 in Algorithm 2). This guarantees that the obtained approximation of the minimum, $v^i$, at convergence, satisfies $f(v^i) - f(v^*) < \varepsilon$.

Based on Eq. (17), which decomposes the smoothed objective function as the sum of a strongly convex loss and a penalty,

$$
f_\mu(v) = L(v) + \psi_\mu(v),
$$

we compute the duality gap that provides an upper bound estimation of the error to the optimum. At any step $k$ of the algorithm, given the current primal $v^k$ and dual $\sigma(v^k) \equiv \nabla L(v^k)$ variables [8], we can compute the duality gap using the Fenchel duality rules [35]:

$$
\mathrm{GAP}(v^k) \equiv f_\mu(v^k) + L^*\big(\sigma(v^k)\big) + \psi_\mu^*\big({-}\sigma(v^k)\big),
\tag{20}
$$

where $L^*$ and $\psi_\mu^*$ are the Fenchel conjugates of $L$ and $\psi_\mu$, respectively. Denoting by $v^*$ the minimum of $f_\mu$ (the solution of Eq. (17)), the interest of the duality gap is that it provides an upper bound for the difference with the optimal value of the function. Moreover, it vanishes at the minimum:

$$
\mathrm{GAP}(v^k) \ge f(v^k) - f(v^*) \ge 0, \qquad \mathrm{GAP}(v^*) = 0.
\tag{21}
$$

The dual variable is

$$
\sigma(v^k) \equiv \nabla L(v^k) = v^k - \frac{X^\top u}{n\lambda_2},
\tag{22}
$$

and the Fenchel conjugate of the squared loss $L(v^k)$ is

$$
L^*\big(\sigma(v^k)\big) = \frac{1}{2}\left\|\sigma(v^k)\right\|_2^2 + \sigma(v^k)^\top \frac{X^\top u}{n\lambda_2}.
\tag{23}
$$

In [25], the authors provide the expression of the Fenchel conjugate of the penalty $\psi_\mu(v^k)$:

$$
\psi_\mu^*\big({-}\sigma(v^k)\big) = \frac{1}{2}\sum_{j=1}^{P}\left[\,\left|-\sigma(v^k)_j - \frac{\lambda}{\lambda_2}\big(A^\top\alpha_\mu^*(v^k)\big)_j\right| - \frac{\lambda_1}{\lambda_2}\right]_+^2 + \frac{\lambda\mu}{2\lambda_2}\left\|\alpha_\mu^*(v^k)\right\|_2^2,
\tag{24}
$$

where $[\cdot]_+ = \max(0, \cdot)$.

The expression of the duality gap in Eq. (20) provides an estimation of the distance to the minimum. This distance is geometrically decreased by a factor $\tau = 0.5$ at the end of each continuation, and the decreased value defines the precision that should be reached by the next iteration (Line 8 of Algorithm 2). Thus, the algorithm dynamically generates a sequence of decreasing prescribed precisions $\varepsilon^i$. Such a scheme ensures convergence [25] towards a globally desired final precision, $\varepsilon$, which is the only parameter that the user needs to provide.
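The gap of Eq. (20) assembles Eqs. (17) and (22)-(24). The sketch below (our own function, assuming a one-dimensional TV operator so that $\alpha_\mu^*(v)$ is an elementwise clipping of $Av/\mu$) makes the computation explicit:

```python
import numpy as np

def duality_gap(v, c, A, lam, lam1, lam2, mu):
    """Duality gap of Eq. (20) for the smoothed problem of Eq. (17).
    c = X^T u / (n lambda_2); A is a 1D TV operator."""
    alpha = np.clip(A @ v / mu, -1.0, 1.0)                # alpha*_mu(v)
    s_mu = alpha @ (A @ v) - 0.5 * mu * (alpha @ alpha)   # smoothed TV term
    f_mu = (0.5 * np.sum((v - c) ** 2) + 0.5 * (v @ v)    # Eq. (17)
            + (lam / lam2) * s_mu + (lam1 / lam2) * np.sum(np.abs(v)))
    sigma = v - c                                         # Eq. (22)
    L_star = 0.5 * (sigma @ sigma) + sigma @ c            # Eq. (23)
    resid = np.abs(-sigma - (lam / lam2) * (A.T @ alpha)) - lam1 / lam2
    psi_star = (0.5 * np.sum(np.maximum(resid, 0.0) ** 2) # Eq. (24)
                + lam * mu / (2.0 * lam2) * (alpha @ alpha))
    return f_mu + L_star + psi_star                       # Eq. (20)
```

At $v = 0$ the loss and its conjugate cancel, and the gap reduces to $\frac{1}{2}\sum_j\left[\,|c_j| - \lambda_1/\lambda_2\right]_+^2$, which gives a quick sanity check.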

2) Determining the optimal smoothing parameter: Given the current prescribed precision $\varepsilon^i$, we need to compute an optimal smoothing parameter $\mu_{\mathrm{opt}}(\varepsilon^i)$ (Line 9 in Algorithm 2) that minimizes the number of FISTA iterations needed to achieve such precision when minimizing Eq. (3) (with fixed $u$) via Eq. (17) (i.e., such that $f(v^{(k)}) - f(v^*) < \varepsilon^i$).

In [25], the authors provide the expression of this optimal smoothing parameter:

$$
\mu_{\mathrm{opt}}(\varepsilon^i) = \frac{-\lambda M \|A\|_2^2 + \sqrt{\big(\lambda M \|A\|_2^2\big)^2 + M\,L(\nabla(l))\,\|A\|_2^2\,\varepsilon^i}}{M\,L(\nabla(l))},
\tag{25}
$$

where $M = P/2$ (Eq. (15)) and $L(\nabla(l)) = 2$ is the Lipschitz constant of the gradient of $l$, as defined in Eq. (17).
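Eq. (25) translates directly into code. The function below is our own sketch (names are ours), with $M = P/2$ and the default $L(\nabla(l)) = 2$ hard-wired as in the text:

```python
import numpy as np

def mu_opt(eps, lam, norm_A_sq, P, L_grad_l=2.0):
    """Optimal smoothing parameter of Eq. (25).
    norm_A_sq = ||A||_2^2; M = P/2 (Eq. (15)); L(grad l) = 2 (Eq. (17))."""
    M = P / 2.0
    a = lam * M * norm_A_sq
    return (-a + np.sqrt(a ** 2 + M * L_grad_l * norm_A_sq * eps)) / (M * L_grad_l)
```

As expected for the continuation scheme, the returned $\mu$ shrinks together with the prescribed precision: coarse precisions get a large, fast smoothing, fine precisions a small, accurate one.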


We call the resulting algorithm CONESTA (short for COntinuation with NEsterov smoothing in a Shrinkage-Thresholding Algorithm). It is presented in detail, with convergence proofs, in [25].

Let $K$ be the total number of FISTA loops used in CONESTA; we have experimentally verified that the convergence rate to the solution of Eq. (16) is $O(1/K^2)$ (which is the optimal convergence rate for first-order methods). Also, the algorithm works even if some of the weights $\lambda_1$ or $\lambda$ are zero, which thus allows us to solve, e.g., the elastic net using CONESTA. Note that it has been rigorously proved that the continuation technique improves the convergence rate compared to simple smoothing using a single value of $\mu$. Indeed, it has been demonstrated in [6] (see also [50]) that the convergence rate obtained with a single value of $\mu$, even optimised, is $O(1/K^2) + O(1/K)$. However, it has recently been proved in [25] that the CONESTA algorithm achieves $O(1/K)$ for general convex functions.

We note that CONESTA could easily be adapted to many other penalties. For example, to add a group lasso (GL) constraint to our structure, we would just have to design a specific linear operator $A_{GL}$ and concatenate it to the actual linear operator $A$.

Algorithm 2 CONESTA$\big(X^\top u, \varepsilon\big)$
1: Initialize $v^0 \in \mathbb{R}^P$
2: $\varepsilon^0 = \tau \cdot \mathrm{GAP}_{\mu=10^{-8}}(v^0)$
3: $\mu^0 = \mu_{\mathrm{opt}}(\varepsilon^0)$
4: repeat
5:  $\varepsilon^i_\mu = \varepsilon^i - \mu^i\gamma M$
6:  $v^{i+1} = \mathrm{FISTA}(X^\top u, v^i, \varepsilon^i_\mu, \ldots)$
7:  $\varepsilon^i = \mathrm{GAP}_{\mu=\mu^i}(v^{i+1}) + \mu^i\gamma M$
8:  $\varepsilon^{i+1} = \tau \cdot \varepsilon^i$
9:  $\mu^{i+1} = \mu_{\mathrm{opt}}(\varepsilon^{i+1})$
10: until $\varepsilon^i \le \varepsilon$
11: return $v^{i+1}$
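The control flow of Algorithm 2 can be mirrored in Python. In the skeleton below (our own sketch), the inner solver, the gap and $\mu_{\mathrm{opt}}$ are injected as callables standing in for Algorithm 1, Eq. (20) and Eq. (25), so only the continuation logic is shown:

```python
def conesta(v0, eps, tau, gamma, M, fista, gap, mu_opt):
    """Skeleton of Algorithm 2 (CONESTA).
    fista(v, eps_mu, mu) -> v; gap(v, mu) -> float; mu_opt(eps) -> float."""
    v = v0
    eps_i = tau * gap(v, 1e-8)               # line 2
    mu = mu_opt(eps_i)                       # line 3
    while True:
        eps_mu = eps_i - mu * gamma * M      # line 5: precision for the smoothed problem
        v = fista(v, eps_mu, mu)             # line 6
        eps_i = gap(v, mu) + mu * gamma * M  # line 7: gap to the non-smoothed problem
        if eps_i <= eps:                     # line 10 (checked before updating the targets)
            return v
        eps_i = tau * eps_i                  # line 8
        mu = mu_opt(eps_i)                   # line 9
```

Because the solver is injected, the skeleton can be exercised with trivial stubs, which is a convenient way to test that the $\varepsilon^i$ sequence and the stopping test behave as in the pseudocode.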

F. The algorithm for the SPCA-TV problem

The computation of a single component through SPCA-TV can be achieved by combining CONESTA and Eq. (5) within an alternating minimization loop. Mackey [34] demonstrated that further components can be efficiently obtained by incorporating this single-unit procedure in a deflation scheme, as done in, e.g., [14], [31]. The stopping criterion is defined as

$$
\mathrm{STOPPING\;CRITERION} =
\frac{\left\|X_k - u^{i+1}{v^{i+1}}^\top\right\|_F - \left\|X_k - u^{i}{v^{i}}^\top\right\|_F}
{\left\|X_k - u^{i+1}{v^{i+1}}^\top\right\|_F}.
\tag{26}
$$
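Eq. (26) is simply the relative change of the rank-one reconstruction error between two successive iterates. A direct sketch (function and argument names are ours):

```python
import numpy as np

def stopping_criterion(X, u_new, v_new, u_old, v_old):
    """Relative change of the rank-one reconstruction error, Eq. (26)."""
    err_new = np.linalg.norm(X - np.outer(u_new, v_new), "fro")
    err_old = np.linalg.norm(X - np.outer(u_old, v_old), "fro")
    return (err_new - err_old) / err_new
```

The value is zero when the iterates have stopped moving, and negative while the alternating minimization is still improving the fit.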

All the presented building blocks were combined into Algorithm 3 to solve the SPCA-TV problem.
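As an illustration, that combination can be sketched in Python. In this sketch of ours, the `solve_v` callable stands in for the CONESTA call (with the identity map it degenerates to plain power iteration), and convergence of $u$ replaces the Eq. (26) criterion for brevity:

```python
import numpy as np

def spca_tv_skeleton(X, K, solve_v, eps=1e-6, max_iter=100, seed=0):
    """Sketch of Algorithm 3: alternating minimization plus deflation.
    solve_v maps X_k^T u to a loading vector v (stand-in for CONESTA)."""
    rng = np.random.default_rng(seed)
    Xk = X.copy()
    U, V = [], []
    for _ in range(K):
        u = rng.standard_normal(Xk.shape[0])
        u /= np.linalg.norm(u)                 # initialize u^0
        for _ in range(max_iter):              # alternating minimization
            v = solve_v(Xk.T @ u)              # v-update (CONESTA in the paper)
            u_new = Xk @ v
            u_new /= np.linalg.norm(u_new)     # u-update
            converged = np.linalg.norm(u_new - u) < eps
            u = u_new
            if converged:
                break
        U.append(u)
        V.append(v)
        Xk = Xk - np.outer(u, v)               # deflation
    return np.array(U).T, np.array(V).T
```

On a rank-one matrix with the identity as `solve_v`, a single component reconstructs the input exactly, which checks the deflation bookkeeping.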

III EXPERIMENTS

We evaluated the performance of SPCA-TV using three experiments: one simulation study carried out on a synthetic

Algorithm 3 SPCA-TV$(X, \varepsilon)$
1: $X_0 = X$
2: for all $k = 0, \ldots, K$ do  ▷ Components
3:  Initialize $u^0 \in \mathbb{R}^N$
4:  repeat  ▷ Alternating minimization
5:   $v^{i+1} = \mathrm{CONESTA}(X_k^\top u^i, \varepsilon)$
6:   $u^{i+1} = X_k v^{i+1} / \left\|X_k v^{i+1}\right\|_2$
7:  until STOPPING CRITERION $\le \varepsilon$
8:  $v_{k+1} = v^{i+1}$
9:  $u_{k+1} = u^{i+1}$
10:  $X_{k+1} = X_k - u_{k+1}v_{k+1}^\top$  ▷ Deflation
11: end for
12: return $U = [u_1, \cdots, u_K]$, $V = [v_1, \cdots, v_K]$

data set, and two on neuroimaging data sets. In order to compare the performance of SPCA-TV with existing sparse PCA models, we also included results obtained with Sparse PCA, ElasticNet PCA, GraphNet PCA, and SSPCA from [29]. We used the scikit-learn implementation [42] for Sparse PCA, while we used the Parsimony package (https://github.com/neurospin/pylearn-parsimony) for the ElasticNet, GraphNet PCA and SPCA-TV methods. Concerning SSPCA, we used the MATLAB implementation provided in [29].

The number of parameters to set differs for each method. For Sparse PCA, the $\lambda_1$ parameter selects its optimal value from the range $\{0.1, 1.0, 5.0, 10.0\}$. ElasticNet PCA requires the setting of the $\lambda_1$ and $\lambda_2$ penalty weights. Meanwhile, GraphNet PCA and SPCA-TV require the setting of an additional parameter, namely the spatial constraint penalty $\lambda$. We operated a re-parametrization of these penalty weights in ratios. A global parameter $\alpha \in \{0.01, 0.1, 1.0\}$ controls the weight attributed to the whole penalty term, including the spatial and the $\ell_1$ regularization. Individual constraints are expressed in terms of ratios: the $\ell_1$ ratio, $\lambda_1/(\lambda_1 + \lambda_2 + \lambda) \in \{0.1, 0.5, 0.8\}$, and the $\ell_{TV}$ (or $\ell_{GN}$ for GraphNet) ratio, $\lambda/(\lambda_1 + \lambda_2 + \lambda) \in \{0.1, 0.5, 0.8\}$. For ElasticNet, we explored the grid of parameters composed of the Cartesian product of the $\alpha$ and $\ell_1$-ratio subsets. For GraphNet PCA and SPCA-TV, we performed a parameter search on a grid given by the Cartesian product of, respectively, the $(\alpha, \ell_1, \ell_{GN})$ subsets and the $(\alpha, \ell_1, \ell_{TV})$ subsets. Concerning the SSPCA method, the regularization parameter selects its optimal value in the range $\{10^{-8}, \ldots, 10^{8}\}$.
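The re-parametrization above can be made concrete. The sketch below (our own mapping, consistent with the ratio definitions in the text but not taken from any released code) converts each $(\alpha, \ell_1\text{-ratio}, \mathrm{TV}\text{-ratio})$ triple back into $(\lambda_1, \lambda_2, \lambda)$, discarding combinations where the two ratios leave no weight for $\lambda_2$:

```python
from itertools import product

alphas    = [0.01, 0.1, 1.0]   # global penalty weight alpha
l1_ratios = [0.1, 0.5, 0.8]    # lambda_1 / (lambda_1 + lambda_2 + lambda)
tv_ratios = [0.1, 0.5, 0.8]    # lambda   / (lambda_1 + lambda_2 + lambda)

grid = []
for alpha, r1, rtv in product(alphas, l1_ratios, tv_ratios):
    if r1 + rtv >= 1.0:        # lambda_2 must remain strictly positive
        continue
    grid.append((alpha * r1,                 # lambda_1
                 alpha * (1.0 - r1 - rtv),   # lambda_2
                 alpha * rtv))               # lambda (TV or GraphNet)
```

Each triple sums back to its $\alpha$, so the global parameter really does control the total penalty weight.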

However, in order to ensure that the extracted components have a minimum amount of sparsity, we also included a criterion controlling sparsity: at least half of the features of the components have to be zero. For both real neuroimaging experiments, performance was evaluated through a 5-fold by 5-fold double cross-validation pipeline. The double cross-validation process consists of two nested cross-validation loops, which are referred to as the internal and external cross-validation loops. In the outer (external) loop, all samples are randomly split into subsets referred to as training and test sets. The test sets are exclusively used for model assessment, while the training sets are used in the inner (internal) loop for model fitting


and selection. The inner folds select the set of parameters minimizing the reconstruction error on the outer fold. For the synthetic data, we used 50 different purposely-generated data sets and 5 inner folds for parameter selection. In order to evaluate the reconstruction accuracy of the methods, we report the mean Frobenius norm of the reconstruction error across the folds/data sets on independent test data. The hypothesis we wanted to test was whether there was a substantial decrease in the reconstruction error of independent data when using SPCA-TV compared to when using Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA. It was tested through a related two-samples t-test. This choice to compare method performance on independent test data was motivated by the fact that the optimal reconstruction of the training set is necessarily hindered by the spatial and sparsity constraints. We therefore expect SPCA-TV to perform worse on training data than other, less constrained, methods. However, the TV penalty has a more important purpose than just to minimize the reconstruction error: the estimation of coherent and reproducible loadings. Indeed, clinicians expect that if images from other patients with comparable clinical conditions had been used, the extracted loading vectors would have turned out to be similar. Therefore, since the ultimate goal of SPCA-TV is to yield stable and reproducible weight maps, it is more relevant to evaluate methods on independent test data.

The stability of the loading vectors obtained across various training data sets (variation in the learning samples) was assessed through a similarity measure: the pairwise Dice index between loading vectors obtained with different folds/data sets [16]. We tested whether the pairwise Dice indices are significantly higher for SPCA-TV compared to the other methods. Testing this hypothesis is equivalent to testing the sign of the difference of the pairwise Dice indices between methods. However, since the pairwise Dice indices are not independent of one another (the folds share many of their learning samples), direct significance measures are biased. We therefore used permutation testing to estimate empirical p-values. The null hypothesis was tested by simulating samples from the null distribution: we generated 1,000 random permutations of the sign of the difference of the pairwise Dice indices between the PCA methods under comparison, and then the statistics on the true data were compared with the ones obtained on the reshuffled data to obtain empirical p-values.
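The sign-flip permutation scheme described above can be sketched as follows. This is a generic paired sign-permutation test of our own writing; the function name and the one-sided alternative (SPCA-TV higher) are our choices:

```python
import numpy as np

def sign_permutation_pvalue(diffs, n_perm=1000, seed=0):
    """One-sided sign-flip permutation test on paired differences,
    e.g. Dice(SPCA-TV) - Dice(other method) over pairs of folds."""
    rng = np.random.default_rng(seed)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)   # null: differences symmetric about 0
    return (np.sum(null >= observed) + 1.0) / (n_perm + 1.0)
```

The `+ 1` terms keep the empirical p-value away from zero, so with 1,000 permutations the smallest reportable value is about $10^{-3}$, matching the resolution used in the paper.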

For each experiment, we made the initial choice to retrieve the first ten components. However, given the length constraint, we only present the weight maps associated with the top three components for Sparse PCA and SPCA-TV in this paper. The weight maps of ElasticNet PCA, GraphNet PCA and SSPCA are presented in the supplementary materials (available in the supplementary files/multimedia tab).

A. Simulation study

We generated 50 sets of synthetic data, each composed of 500 images of size $100 \times 100$ pixels. Images are generated using the following noisy linear system:

$$
u_1 V^1 + u_2 V^2 + u_3 V^3 + \varepsilon \;\in\; \mathbb{R}^{10\,000},
\tag{27}
$$

where $V = [V^1, V^2, V^3] \in \mathbb{R}^{10\,000\times 3}$ are the sparse and structured loading vectors illustrated in Fig. 1. The support of $V^1$ defines the two upper dots, the support of $V^2$ defines the two lower dots, while the support of $V^3$ delineates the middle dot. The coefficients $u = [u_1, u_2, u_3]$ that linearly combine the components of $V$ are generated according to a centered Gaussian distribution. The elements of the noise vector $\varepsilon$ are independent and identically distributed according to a centered Gaussian distribution, with a 0.1 signal-to-noise ratio (SNR). This SNR was selected by a previous calibration pipeline in which we tested the efficiency of the data reconstruction at multiple SNR values ranging from 0 to 0.5. We decided to work with an SNR of 0.1 because it lies in the range of values where standard PCA starts to be less efficient in the recovery process.
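The generative scheme can be sketched as follows. The disk-shaped supports and the SNR definition (ratio of signal to noise Frobenius norms) are our own assumptions standing in for the exact construction of Fig. 1:

```python
import numpy as np

rng = np.random.default_rng(42)
n, side = 500, 100
P = side * side

def disk(cx, cy, r):
    """Binary dot-shaped support on the side x side grid, flattened."""
    yy, xx = np.mgrid[:side, :side]
    return (((xx - cx) ** 2 + (yy - cy) ** 2) <= r ** 2).astype(float).ravel()

V = np.stack([disk(25, 25, 8) + disk(75, 25, 8),   # V1: two upper dots
              disk(25, 75, 8) + disk(75, 75, 8),   # V2: two lower dots
              disk(50, 50, 8)], axis=1)            # V3: middle dot
U = rng.standard_normal((n, 3))                    # centered Gaussian coefficients
signal = U @ V.T                                   # noiseless part of Eq. (27)
noise = rng.standard_normal((n, P))                # centered Gaussian noise
snr = 0.1
X = snr * (signal / np.linalg.norm(signal)) * np.linalg.norm(noise) + noise
```

Scaling the signal globally rather than per image keeps the relative energy of the three components intact while fixing the overall SNR.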

Fig. 1. Loading vectors $V = [V^1, V^2, V^3] \in \mathbb{R}^{10\,000\times 3}$ used to generate the images (panels: components 1-3).

We split the 500 artificial images into a test set and a training set, with 250 images in each, and learned the decomposition on the training set.

Fig. 2. Loading vectors recovered from 250 images using Sparse PCA and SPCA-TV (panels: components 1-3 for each method).

Fig. 2 presents the loading vectors extracted from one data set. Please note that the sign is arbitrary: indeed, if we consider the loss of Eq. (3), $u^\top$ and $v$ can both be multiplied by $-1$


TABLE I: Scores are averaged across the 50 independent data sets. We tested whether the scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: ***: $p \le 10^{-3}$.

Methods        | Test Data Reconstruction Error | MSE  | Dice Index
Sparse PCA     | 1576.0                         | 0.91 | 0.28
ElasticNet PCA | 1572.4                         | 0.83 | 0.43
GraphNet PCA   | 1570.8                         | 0.83 | 0.30
SSPCA          | 1571.9                         | 1.54 | 0.07
SPCA-TV        | 1570.1                         | 0.64 | 0.52

without changing anything. We observe that Sparse PCA yields very scattered loading vectors. The loading vectors of SPCA-TV, on the other hand, are sparse but also organized into clear regions. SPCA-TV provides loading vectors that closely match the ground truth. The reconstruction error is evaluated on the test sets (Tab. I), with its value over the 50 data sets being significantly lower for SPCA-TV than for the Sparse PCA ($T = 94.5$, $p = 3.9\cdot10^{-57}$), ElasticNet PCA ($T = 33.2$, $p = 2.7\cdot10^{-35}$), GraphNet PCA ($T = 12.7$, $p = 3.6\cdot10^{-17}$) and SSPCA [29] ($T = 18.9$, $p = 3.9\cdot10^{-24}$) methods. Additional details concerning the reconstruction accuracy on both the training and test data are presented in Figure 1 of the supplementary materials (available in the supplementary files/multimedia tab).

A different way of quantifying the reconstruction accuracy of each method is to evaluate how closely the extracted loadings match the known ground truth of the simulated data set. We computed the mean squared error (MSE) between the ground truth and the estimated loadings. The results are presented in Tab. I. We note that the MSE is significantly lower with SPCA-TV than with Sparse PCA ($T = 6.9$, $p = 8.0\cdot10^{-9}$), ElasticNet PCA ($T = 6.2$, $p = 1.1\cdot10^{-7}$), GraphNet PCA ($T = 4.1$, $p = 1.4\cdot10^{-4}$) and SSPCA ($T = 22.6$, $p = 1.5\cdot10^{-27}$).

Moreover, when evaluating the stability of the loading vectors across resampling, we found a significantly higher mean Dice index when using SPCA-TV compared to the other methods ($p < 0.001$). The results are presented in Tab. I. They indicate that SPCA-TV is more robust than the other sparse methods to variation in the learning samples: SPCA-TV yields reproducible loading vectors across data sets.

These results indicate that the SPCA-TV loadings are not only more stable across resampling, but also achieve a better recovery of the underlying variability in independent data than the Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA methods.

One of the issues linked to biconvex optimization is the risk of falling into local minima. Conscious of this potential risk, we set up an experiment in which we ran the optimization of the same problem 50 times, with a different starting point at each run. We then compared the resulting loading vectors obtained at each run and computed a similarity measure, the Dice index, which quantifies the proximity between the solutions obtained from the independent runs. We obtained a Dice index of 0.99 on the 1st component, 0.99 on the 2nd component and 0.72 on the 3rd component. On the strength of these indices, we are confident in the robustness of the algorithm and in its ability to converge toward the same stable solution independently of the choice of the starting point.

B. 3D images of functional MRI of patients with schizophrenia

We then applied the methods to 3D images of BOLD functional MRI (fMRI) acquired with the same scanner and pulse sequence. Imaging was performed on a 1.5 T scanner using a standard head-coil. For all functional scans, the field-of-view was $206 \times 206 \times 153$ mm, with a resolution close to 3.5 mm in all directions. The parameters of the PRESTO sequence were: TE = 9.6 ms, TR = 19.25 ms, EPI factor = 15, flip angle = 9°. Each fMRI run consisted of 900 collected volumes. The cohort is composed of 23 patients with schizophrenia (average age = 34.96 years; 8 females, 15 males). Brain activation was measured while subjects experienced multimodal hallucinations. The fMRI data was pre-processed using SPM12 (Wellcome Department of Imaging Neuroscience, London, UK). Data preprocessing consisted of motion correction (realignment), coregistration of the individual anatomical T1 image to the functional images, and spatial normalization to MNI space using DARTEL, based on segmented T1 scans.

We considered each set of consecutive images under the pre-hallucination state as a block. Since most of the patients hallucinated more than once during the scanning session, we have more blocks than patients (83 blocks). The activation maps are computed from these blocks. Based on the general linear model approach, we regressed, for each block, the fMRI signal time course on a linear ramp function. Indeed, we hypothesized that activation in some regions presents a ramp-like increase during the time preceding the onset of hallucinations (see the example of a regression in Figure 3 of the supplementary materials, available in the supplementary files/multimedia tab). The activation maps that we used as input to the SPCA-TV method are the statistical parametric maps associated with the coefficients of the block regression (see an example in Figure 4 of the supplementary materials, available in the supplementary files/multimedia tab). We obtained a data set of $n = 83$ maps and $p = 63{,}966$ features. We hypothesized that the principal components extracted from these activation maps with SPCA-TV could uncover major trends of variability within pre-hallucination patterns. Thus, they might reveal the existence of subgroups of patients according to the sensory modality (e.g., vision or audition) involved during hallucinations.

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, the SSPCA method [29] could not be applied to this specific example, since data sets have to be constituted of closed cubic forms, without any holes, to be eligible for SSPCA: it does not support masked data such as the ones used here.

The loading vectors extracted from the activation maps of the pre-hallucination scans with Sparse PCA and SPCA-TV are presented in Fig. 3. We observe a similar behavior as in the synthetic example, namely that the loading vectors of


Sparse PCA tend to be scattered and produce irregular patterns. SPCA-TV, however, seems to yield structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCA-TV loading vectors are not redundant and reveal distinct patterns.

Indeed, the loading vectors obtained by SPCA-TV are of great interest because they reveal insightful patterns of variability in the data. The second loading is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also of areas related to vision processing, such as the occipital fusiform gyrus and the parietal operculum cortex. The third loading reveals important weights in the middle temporal gyrus, the parietal operculum cortex and the frontal pole. The first loading vector encompasses all features of the brain; one might see this first component as a global variability affecting the whole brain, such as the overarching effect of age. SPCA-TV selects this dense configuration in spite of the sparsity constraint. It is highly desirable to remove any sort of global effect first, in order to start identifying, in the next components, local patterns that are not impacted by a global variability of this kind. We can identify a widespread set of dysfunctional language-related or vision-related areas that present increasing activity during the time preceding the onset of hallucinations. The regions extracted by SPCA-TV are found to be pertinent according to the existing literature on the topic ([28], [27], [7]).

These results seem to indicate the possible existence of subgroups of patients according to the hallucination modalities involved. An interesting application would be to use the scores of the second component extracted by SPCA-TV in order to distinguish patients with visual hallucinations from those suffering mainly from auditory hallucinations.

The reconstruction error is significantly lower for SPCA-TV than for Sparse PCA ($T = 13.9$, $p = 1.5\cdot10^{-4}$), ElasticNet PCA ($T = 7.1$, $p = 2.1\cdot10^{-3}$) and GraphNet PCA ($T = 4.6$, $p = 1.0\cdot10^{-2}$). Moreover, when assessing the stability of the loading vectors across the folds, we found a significantly higher mean Dice index for SPCA-TV compared to Sparse PCA ($p = 4.0\cdot10^{-3}$), ElasticNet PCA ($p = 4.0\cdot10^{-3}$) and GraphNet PCA ($p = 2.0\cdot10^{-3}$), as presented in Tab. II. Additional details regarding the reconstruction accuracy on both the training and test sets, and the Dice index, are presented in Figure 5 of the supplementary materials (available in the supplementary files/multimedia tab).

TABLE II: Scores on the fMRI data are averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notations: ***: $p \le 10^{-3}$, **: $p \le 10^{-2}$.

Methods        | Test Data Reconstruction Error | Dice Index
Sparse PCA     | 1515.2                         | 0.34
ElasticNet PCA | 1482.7                         | 0.32
GraphNet PCA   | 1428.1                         | 0.58
SPCA-TV        | 1414.0                         | 0.63

In conclusion, SPCA-TV significantly outperforms Sparse, ElasticNet and GraphNet PCA in terms of the reconstruction error on independent test data, and in the sense that its loading vectors are both more clinically interpretable and more stable.

We also evaluated the convergence speed of Sparse PCA, Mini-Batch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA, GraphNet PCA and SPCA-TV on this functional MRI data set of $n = 83$ samples and $p = 63{,}966$ features. We compared the execution time required for each algorithm to achieve a given level of precision in Tab. III. Sparse PCA and ElasticNet PCA are similar in terms of convergence time, while Mini-Batch Sparse PCA is much faster but does not converge to high precision. As expected, the structured methods (GraphNet PCA and SPCA-TV) take longer than the other sparse methods because of the inclusion

Fig. 3. Loading vectors recovered from the 83 activation maps using Sparse PCA and SPCA-TV (components 1-3 for each method).


TABLE III: Comparison of the execution time required for each sparse method to reach the same precision. Times are reported in seconds.

Time to reach a given precision (in seconds)
Methods               | $10$  | $1$    | $10^{-1}$ | $10^{-2}$ | $10^{-3}$
Mini-Batch Sparse PCA | 53.2  | -      | -         | -         | -
Sparse PCA            | 158.0 | 231.2  | 344.3     | 386.8     | 450.1
ElasticNet PCA        | 123.7 | 138.1  | 302.7     | 396.4     | 406.3
GraphNet PCA          | 301.9 | 521.6  | 813.1     | 881.4     | 888.4
SPCA-TV               | 427.7 | 2958.6 | 8093.0    | 13813.4   | 14459.9

of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but the convergence time is still reasonable for an fMRI data set with 65,000 voxels.

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients with a diagnosis of mild cognitive impairment (MCI) from the ADNI database who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups, which may be clinically similar but present different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy and is thus a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in the cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13] and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, we could not apply the SSPCA method to this data set due to intrinsic limitations of the method: SSPCA's application is restricted to $N$-dimensional array images. It does not support meshes of cortical surfaces, such as the data set used here.

The loading vectors obtained from the data set with Sparse PCA and SPCA-TV are presented in Fig. 4. As expected, the Sparse PCA loadings are not easily interpretable, because the patterns are irregular and dispersed throughout the brain

surface. In contrast, SPCA-TV reveals structured and smooth clusters in relevant regions.

The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between patients resulting from global cortical atrophy, as often observed in AD patients. The second loading vector includes variability in the entorhinal cortex, the hippocampus and temporal regions. Last, the third loading vector might be related to atrophy of the frontal lobe, and also captures variability in the precuneus. Thus, SPCA-TV provides smooth maps that closely match the well-known brain regions involved in Alzheimer's disease [22].

Indeed, it is well documented that cortical atrophy progresses over three main stages in Alzheimer's disease ([10], [15]). The cortical structures are sequentially affected by the accumulation of amyloid plaques. Cortical atrophy is first observed, in the mild stage of the disease, in regions surrounding the hippocampus ([26], [44], [47]) and in the entorhinal cortex ([12]), as seen in the second component. This is consistent with early memory deficits. Then, the disease progresses to a moderate stage, where atrophy gradually extends to the prefrontal association cortex, as revealed in the third component ([37]). In the severe stage of the disease, the whole cortex is affected by atrophy ([15]), as revealed in the first component. In order to assess the clinical significance of these weight maps, we tested the correlation between the scores corresponding to the three components and performance on a clinical test, the ADAS. The Alzheimer's Disease Assessment Scale-Cognitive subscale is the most widely used general cognitive measure in AD. ADAS is scored in terms of errors, so a high score indicates poor performance. We obtained significant correlations between ADAS test performance and the components' scores (Fig. 5): $r = -0.34$, $p = 4.2\cdot10^{-11}$ for the first component; $r = -0.26$, $p = 3.6\cdot10^{-7}$ for the second component; and $r = -0.35$, $p = 4.0\cdot10^{-12}$ for the third component. The same behavior is observed for all three components: the ADAS score grows in proportion to the degree to which a patient is affected and to the severity of the atrophy presented (in the temporal pole, in the prefrontal region, and also globally). Conversely, control subjects score low on the ADAS metric and present low levels of cortical atrophy. Therefore, SPCA-TV provides us with clear biomarkers that are perfectly relevant to the scope of Alzheimer's disease progression.

The reconstruction error is significantly lower with SPCA-TV than with Sparse PCA (T = 12.7, p = 2.1 · 10⁻⁴), ElasticNet PCA (T = 6.8, p = 2.3 · 10⁻³) and GraphNet PCA (T = 2.83, p = 4.7 · 10⁻²). The results are presented in Tab. IV. Moreover, when assessing the stability of the loading vectors across the folds, the mean Dice index is significantly higher with SPCA-TV than with the other methods. Additional details regarding the reconstruction accuracy on both the train and test sets, and the Dice index, are presented in Figure 7 of the supplementary materials (available in the supplementary files/multimedia tab).
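As a side note, the Dice index [16] used here to quantify loading-vector stability measures the overlap between the supports of two sparse weight maps, Dice(S1, S2) = 2|S1 ∩ S2| / (|S1| + |S2|). A minimal sketch (the thresholding convention and function name are ours; the paper's exact fold-pairing scheme is not reproduced):

```python
import numpy as np

def dice_index(v1, v2, tol=1e-8):
    """Dice overlap between the supports (non-zero patterns) of two
    loading vectors: 2 |S1 inter S2| / (|S1| + |S2|)."""
    s1 = np.abs(v1) > tol          # support of the first weight map
    s2 = np.abs(v2) > tol          # support of the second weight map
    denom = s1.sum() + s2.sum()
    if denom == 0:
        return 1.0                 # two empty maps: define the overlap as perfect
    return 2.0 * np.logical_and(s1, s2).sum() / denom

# two toy loading vectors sharing one of their two selected features
a = np.array([0.5, 0.0, 0.2, 0.0])
b = np.array([0.4, 0.1, 0.0, 0.0])
print(dice_index(a, b))            # 2*1 / (2+2) = 0.5
```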

IV CONCLUSION

We proposed an extension of Sparse PCA that takes into account the spatial structure of the data. The optimization

0278-0062 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2017.2749140, IEEE Transactions on Medical Imaging.

IEEE TRANSACTIONS ON MEDICAL IMAGING 11

TABLE IV: Scores are averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods differ significantly from the scores obtained with SPCA-TV. Significance notations: ***: p ≤ 10⁻³, **: p ≤ 10⁻², *: p ≤ 10⁻¹.

Methods           Test Data Reconstruction Error   Dice Index
Sparse PCA                  29918                      0.44
ElasticNet PCA              28326                      0.43
GraphNet PCA                28136                      0.62
SPCA-TV                     27950                      0.65

scheme is able to minimize any combination of the ℓ1, ℓ2 and TV penalties, while preserving the exact ℓ1 penalty. We observe that SPCA-TV, in contrast to other existing sparse PCA methods, yields clinically interpretable results and reveals major sources of variability in the data, by highlighting structured clusters of interest in the loading vectors. Furthermore, SPCA-TV's loading vectors were more stable across the learning samples compared to the other methods. SPCA-TV was validated, and its applicability was demonstrated, on three distinct data sets; we may reach the conclusion that SPCA-TV can be used on any kind of structured configuration and is able to present the structure within the data.

Fig. 4: Loading vectors (components 1 to 3) recovered from the 133 MCI patients using Sparse PCA and SPCA-TV.

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python library:
• URL: https://github.com/neurospin/pylearn-parsimony
• Description: ParsimonY is a Python library for structured and sparse machine learning. ParsimonY is open-source (BSD License) and compliant with the scikit-learn API.

b) Data sets and scripts:
• URL: ftp://ftp.cea.fr/pub/unati/brainomics/papers/pcatv
• Description: This URL provides the simulation data set and the Python script used to create Fig. 2 of the paper.

REFERENCES

[1] A. Abraham, E. Dohmatob, B. Thirion, D. Samaras, and G. Varoquaux. Extracting brain regions from rest fMRI with total-variation constrained dictionary learning. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2013), 16th International Conference, Nagoya, Japan, 2013, Proceedings, Part II, pages 607-615, 2013.

[2] B. Fischl, M. Sereno, and A. Dale. Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195-207, 1999.

[3] A. Bakkour, J.C. Morris, and B.C. Dickerson. The cortical signature of prodromal AD: regional thinning predicts mild AD dementia. Neurology, 72:1048-1055, 2009.

[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

Fig. 5: Correlation of component scores with ADAS test performance.


[5] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Processing, 18(11):2419-2434, 2009.

[6] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557-580, 2012.

[7] L. Bentaleb, M. Beauregard, P. Liddle, and E. Stip. Cerebral activity associated with auditory verbal hallucinations: a functional magnetic resonance imaging case study. Journal of Psychiatry & Neuroscience: JPN, 27(2):110, 2002.

[8] J.M. Borwein and A.S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2006.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[10] H. Braak and E. Braak. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathologica, 82(4):239-259, 1991.

[11] K. Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues, 1909.

[12] V.A. Cardenas, L.L. Chao, C. Studholme, K. Yaffe, B.L. Miller, C. Madison, S.T. Buckley, D. Mungas, N. Schuff, and M.W. Weiner. Brain atrophy associated with baseline and longitudinal measures of cognition. Neurobiology of Aging, 32(4):572-580, 2011.

[13] A. Dale, B. Fischl, and M.I. Sereno. Cortical surface-based analysis I: Segmentation and surface reconstruction. NeuroImage, 9(2):179-194, 1999.

[14] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

[15] A. Delacourte, J.P. David, N. Sergeant, L. Buee, A. Wattez, P. Vermersch, F. Ghozali, C. Fallet-Bianco, F. Pasquier, F. Lebert, et al. The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer's disease. Neurology, 52(6):1158-1158, 1999.

[16] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297-302, 1945.

[17] B.C. Dickerson, E. Feczko, J.C. Augustinack, J. Pacheco, J.C. Morris, and B. Fischl. Differential effects of aging and Alzheimer's disease on medial temporal lobe cortical thickness and surface area. Neurobiology of Aging, 30:432-440, 2009.

[18] E. Dohmatob, M. Eickenberg, B. Thirion, and G. Varoquaux. Speeding-up model-selection in GraphNet via early-stopping and univariate feature-screening. June 2015.

[19] M. Dubois, F. Hadj-Selem, T. Lofstedt, M. Perrot, C. Fischer, V. Frouin, and E. Duchesnay. Predictive support recovery with TV-Elastic Net penalties and logistic regression: an application to structural MRI. In Proceedings of the Fourth International Workshop on Pattern Recognition in Neuroimaging (PRNI 2014), 2014.

[20] H. Eavani, T.D. Satterthwaite, R.E. Filipovych, R.C. Gur, and C. Davatzikos. Identifying sparse connectivity patterns in the brain using resting-state fMRI. NeuroImage, 105:286-299, 2015.

[21] D. Felleman and D. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1-47, 1991.

[22] G.B. Frisoni, N.C. Fox, C.R. Jack, P. Scheltens, and P.M. Thompson. The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol., 6(2):67-77, 2010.

[23] L. Grosenick, B. Klingenberg, K. Katovich, B. Knutson, and J. Taylor. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage, 72:304-321, May 2013.

[24] R. Guo, M. Ahn, H. Zhu, and the Alzheimer's Disease Neuroimaging Initiative. Spatially weighted principal component analysis for imaging classification. Journal of Computational and Graphical Statistics, 24:274-296, 2015.

[25] F. Hadj-Selem, T. Lofstedt, V. Frouin, V. Guillemot, and E. Duchesnay. An iterative smoothing algorithm for regression with structured sparsity. arXiv:1605.09658 [stat], 2016.

[26] C.R. Jack, M.M. Shiung, J.L. Gunter, P.C. O'Brien, S.D. Weigand, D.S. Knopman, B.F. Boeve, R.J. Ivnik, G.E. Smith, R.H. Cha, et al. Comparison of different MRI brain atrophy rate measures with clinical disease progression in AD. Neurology, 62(4):591-600, 2004.

[27] R. Jardri, A. Pouchet, D. Pins, and P. Thomas. Cortical activations during auditory verbal hallucinations in schizophrenia: a coordinate-based meta-analysis. American Journal of Psychiatry, 168(1):73-81, 2011.

[28] R. Jardri, P. Thomas, C. Delmaire, P. Delion, and D. Pins. The neurodynamic organization of modality-dependent hallucinations. Cerebral Cortex, pages 1108-1117, 2013.

[29] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[30] I. Jolliffe, N. Trendafilov, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.

[31] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res., 11:517-553, 2010.

[32] B. Kandel, D. Wolk, J. Gee, and B. Avants. Predicting cognitive data from medical images using sparse linear regression. Information Processing in Medical Imaging: proceedings of the conference, 23:86-97, 2013.

[33] M. Li, Y. Liu, F. Chen, and D. Hu. Including signal intensity increases the performance of blind source separation on brain imaging data. IEEE Transactions on Medical Imaging, 34(2):551-563, 2015.

[34] L.W. Mackey. Deflation methods for sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017-1024. Curran Associates, Inc., 2009.

[35] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, Ecole Normale Supérieure de Cachan, 2010.

[36] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19-60, 2010.

[37] C.R. McDonald, L. McEvoy, L. Gharapetian, C. Fennema-Notestine, D.J. Hagler, D. Holland, A. Koyama, J.B. Brewer, A.M. Dale, Alzheimer's Disease Neuroimaging Initiative, et al. Regional rates of neocortical atrophy from normal aging to early Alzheimer disease. Neurology, 73(6):457-465, 2009.

[38] H. Mohr, U. Wolfensteller, S. Frimmel, and H. Ruge. Sparse regularization techniques provide novel insights into outcome integration processes. NeuroImage, 104:163-176, January 2015.

[39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

[40] B. Ng, A. Vahdat, G. Hamarneh, and R. Abugharbieh. Generalized sparse classifiers for decoding cognitive states in fMRI. Pages 108-115, Beijing, China, September 2012. Springer Berlin Heidelberg. DOI 10.1007/978-3-642-15948-0_14.

[41] R. Nieuwenhuys. The myeloarchitectonic studies on the human cerebral cortex of the Vogt-Vogt school, and their significance for the interpretation of functional neuroimaging data. Brain Structure and Function, 218(2):303-352, 2013.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[43] M. Ramezani, K. Marble, H. Trang, I.S. Johnsrude, and P. Abolmaesumi. Joint sparse representation of brain activity patterns in multi-task fMRI data. IEEE Transactions on Medical Imaging, 34(1):2-12, 2015.

[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al. Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567-574, 2008.

[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang, and D. Hu. Making group inferences using sparse representation of resting-state functional MRI data with application to sleep deprivation. Human Brain Mapping, 38(9):4671-4689, 2017.

[46] J.G. Sled, A.P. Zijdenbos, and A.C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17:87-97, 1998.

[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell, and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754-1766, 2004.

[48] W.-T. Wang and H.-C. Huang. Regularized principal component analysis for spatial data. ArXiv e-prints, 2015.

[49] D.M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, 2009.

[50] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719-752, 2012.

[51] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.


linear combination of dense basis elements (columns of V). However, the identification of biomarkers requires a sparse dictionary (columns of V). This is precisely the objective of Sparse PCA (SPCA), proposed in [30], [51], [14], [49], [31], which adds a sparsity-inducing penalty on the columns of V. Imposing such sparsity constraints on the loading coefficients is a procedure that has been used in fMRI to produce sparse representations of brain functional networks [20], [45].

However, sparse PCA is limited by the fact that it ignores the inherent spatial correlation in the data. It leads to scattered patterns that are difficult to interpret. Furthermore, constraining only the number of features included in the PCs might not always be fully relevant, since most data sets are expected to have a spatial structure. For instance, MRI data are naturally encoded on a grid: some voxels are neighbors, while others are not.

We hypothesize that brain patterns are organized into distributed regions across the brain ([11], [21], [41]). Recent studies tried to overcome this limitation by encoding prior information concerning the spatial structure of the data (see [29], [24], [48]). However, they used methods that are difficult to plug into the optimization scheme (e.g., spline smoothing, wavelet smoothing) and incorporated prior information that may sometimes be difficult to define. One simple solution is the use of a GraphNet penalty ([23], [32], [40], [18], [38]): it promotes local smoothness of the weight map by simply forcing adjacent voxels to have similar weights, using an ℓ2 penalty on the gradient of the weight map. Nonetheless, we hypothesized that GraphNet would provide smooth solutions rather than clearly identified regions. In data classification problems, when extracting structured and sparse predictive maps, the goals are largely aligned with those of PCA. Some classification studies have obtained stable and interpretable results by adding a total variation (TV) penalty to the sparsity constraint (see [19]). TV is widely used as a tool in image denoising and restoration: it accounts for the spatial structure of images by encoding piecewise smoothness and enabling the recovery of homogeneous regions separated by sharp boundaries.

For simplicity, rather than solving Eq. (1), we solve a slightly different criterion, which results from using the Lagrange form rather than the bound form of the constraints on V. Then we extend the Lagrangian form by adding penalties (ℓ1, ℓ2 and TV) to the minimization problem:

\[
\min_{U,D,V} \frac{1}{N}\,\big\|X - UDV^\top\big\|_F^2 + \sum_{k=1}^{K}\Big(\lambda_2\|v_k\|_2^2 + \lambda_1\|v_k\|_1 + \lambda\sum_{g\in G}\|A_g v_k\|_2\Big) \qquad (2)
\]
\[
\text{s.t.}\quad \|u_k\|_2^2 = 1, \quad \forall k = 1,\dots,K,
\]

where λ1, λ2 and λ are hyper-parameters controlling the relative strength of each penalty. We further propose a generic optimization framework that can combine any differentiable convex (penalized) loss function with (i) penalties whose proximal operator is known (here ‖·‖1) and (ii) a large range of complex non-smooth convex structured penalties that can be formulated as a ‖·‖21-norm defined over a set of groups G. Such group-penalties cover, e.g., total variation and overlapping group lasso.

This new problem aims at finding a linear combination of the original variables that points in directions explaining as much variance as possible in the data, while enforcing sparsity and structure (piecewise smoothness for TV) of the loadings.

To achieve this, it is necessary to sacrifice some of the explained variance, as well as the orthogonality of both the loadings and the principal components. Most existing SPCA algorithms [51], [14], [49], [31] do not impose orthogonal loading directions either. While we forced the components to have unit norm for visualization purposes, we do not, in this formulation, enforce ‖v_k‖2 = 1. Instead, the value of ‖v‖2 is controlled by the hyper-parameter λ2. This penalty on the loadings, together with the unit norm constraint on the components, prevents us from obtaining trivial solutions. The optional 1/N factor conveniently normalizes the loss to account for the number of samples, in order to simplify the settings of the hyper-parameters λ1, λ2, λ.

This paper presents an extension of the popular PCA framework by adding structured sparsity-inducing penalties on the loading vectors, in order to identify the few stable regions in the brain images accounting for most of the variability. The addition of a prior that reflects the data's structure within the learning process gives the paper a scope that goes beyond Sparse PCA. To our knowledge, very few papers ([1], [24], [29], [48]) have addressed the use of structural constraints in PCA. The study [29] proposes a norm that induces structured sparsity (called SSPCA), by restricting the support of the solution to a certain set of groups of variables. Possible supports include sets of variables forming rectangles when arranged on a grid. Only one study recently used the total variation prior [1], in a context of multi-subject dictionary learning, based on a different optimization scheme [5].

Section II presents our main contribution: a simple optimization algorithm that combines well-known methods (deflation scheme and alternate minimization) with an original continuation algorithm based on Nesterov's smoothing technique. Our proposed algorithm has the ability to include the TV penalty, but many other non-smooth penalties, such as, e.g., overlapping group lasso, could also be used. This versatile mathematical framework is an essential feature in neuroimaging. Indeed, it enables a straightforward application to all kinds of data with known structure, such as N-dimensional images (of voxels) or meshes of (cortical) surfaces. Section III demonstrates the relevance of structured sparsity on both simulated and experimental data, for structural and functional MRI (fMRI) acquisitions. SPCA-TV achieved a higher reconstruction accuracy and more stable solutions than ElasticNet PCA, Sparse PCA, GraphNet PCA and SSPCA (from [29]). More importantly, SPCA-TV yields more interpretable loading vectors than the other methods.

II METHOD

A common approach to solve the PCA problem (see [14], [31], [49]) is to compute a rank-1 approximation of the data matrix, and then to repeat this on the deflated matrix [34], whereby the influence of the PCs is successively extracted and discarded. We first detail the notation for estimating a single component (Section II-A) and its solution using an alternating minimization pipeline (Section II-B). Then, we develop the TV regularization framework (Sections II-C and II-D). Last, we discuss the algorithm used to solve the minimization problem, and its ability to converge toward stable pairs of component/loading vectors (Sections II-E and II-F).

A. Single component computation

Given a pair of loading/component vectors u ∈ R^N, v ∈ R^P, the best rank-1 approximation of the problem given in Eq. (2) is equivalent [49] to

\[
\min_{u,v} f \equiv \overbrace{\underbrace{-\frac{1}{N}\,u^\top X v + \lambda_2\|v\|_2^2}_{l(v)}}^{\text{smooth}} \;+\; \overbrace{\underbrace{\lambda_1\|v\|_1}_{h(v)}}^{\text{non-smooth}} \;+\; \underbrace{\lambda\sum_{g\in G}\|A_g v\|_2}_{s(v)} \qquad (3)
\]
\[
\text{s.t.}\quad \|u\|_2^2 \le 1,
\]

where l(v) is the penalized smooth (i.e., differentiable) loss, h(v) is a sparsity-inducing penalty whose proximal operator is known, and s(v) is a complex penalty on the structure of the input variables, with an unknown proximal operator.

This problem is convex in u and in v, but not in (u, v).

B. Alternating minimization of the bi-convex problem

The objective function to minimize is bi-convex [9]. The most common approach to solve a bi-convex optimization problem (which does not guarantee global optimality of the solution) is to alternately update u and v, by fixing one of them at a time and solving the corresponding convex optimization problem on the other parameter vector.

On the one hand, when v is fixed, the problem to solve is
\[
\min_{u\in\mathbb{R}^N} -\frac{1}{N}\,u^\top X v \qquad (4)
\]
\[
\text{s.t.}\quad \|u\|_2^2 \le 1,
\]
with the associated explicit solution
\[
u^*(v) = \frac{Xv}{\|Xv\|_2}. \qquad (5)
\]

On the other hand, solving the problem with respect to v, with a fixed u, presents a higher level of difficulty, which will be discussed in Section II-E.
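The alternating scheme can be sketched as follows; the u-update implements the closed-form solution of Eq. (5), while the v-update shown here is a plain ridge placeholder (our simplification, solving only the smooth part of Eq. (3)), standing in for the ℓ1+TV solver of Section II-E:

```python
import numpy as np

def u_update(X, v):
    """Closed-form u-step of Eq. (5): u*(v) = Xv / ||Xv||_2."""
    Xv = X @ v
    return Xv / np.linalg.norm(Xv)

def v_update_ridge(X, u, lam2, N):
    """Placeholder v-step: the minimizer of -(1/N) u^T X v + lam2 ||v||_2^2,
    i.e. v = X^T u / (2 N lam2). SPCA-TV replaces this closed form by the
    l1+TV solver (FISTA/CONESTA) of Section II-E."""
    return (X.T @ u) / (2.0 * N * lam2)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
N = X.shape[0]
v = rng.standard_normal(50)
for _ in range(10):                 # alternate until convergence in practice
    u = u_update(X, v)
    v = v_update_ridge(X, u, lam2=0.1, N=N)
print(np.linalg.norm(u))            # u stays on the unit sphere: approx. 1.0
```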

C. Reformulating TV as a linear operator

Before discussing the minimization with respect to v, we provide details on the encoding of the spatial structure within the s(v) penalty.

It is essential to note that the algorithm is independent of the spatial structure of the data. All the structural information is encoded in a linear operator A, which is computed outside of the algorithm. Thus, the algorithm has the ability to address various structured data and, most importantly, other penalties than just the TV penalty. The algorithm requires the setting of two parameters: (i) the linear operator A, and (ii) a projection function, detailed in Eq. (12).

This section presents the formulation and the design of A in the specific case of a TV penalty applied to the loading vector v, measured on a 3-dimensional (3D) image or a 2D mesh of the cortical surface.

1) 3D image: The brain mask is used to establish a mapping g(i, j, k) between the coordinates (i, j, k) in the 3D grid and an index g ∈ [[1, P]] in the collapsed image. We extract the spatial neighborhood of g, of size ≤ 4, corresponding to voxel g and its 3 neighboring voxels within the mask in the i, j and k directions. By definition, we have
\[
\mathrm{TV}(v) \equiv \sum_{g=1}^{P}\big\|\nabla\big(v_{g(i,j,k)}\big)\big\|_2. \qquad (6)
\]

The first order approximation of the spatial gradient ∇(v_{g(i,j,k)}) is computed by applying the linear operator A′_g ∈ R^{3×4} to the loading vector v_g in the spatial neighborhood of g, i.e.,
\[
\nabla\big(v_{g(i,j,k)}\big) = \underbrace{\begin{bmatrix} -1 & 1 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{bmatrix}}_{A'_g} \underbrace{\begin{bmatrix} v_{g(i,j,k)} \\ v_{g(i+1,j,k)} \\ v_{g(i,j+1,k)} \\ v_{g(i,j,k+1)} \end{bmatrix}}_{\mathbf{v}_g} \qquad (7)
\]

where v_{g(i,j,k)} is the loading coefficient at index g in the collapsed image, corresponding to voxel (i, j, k) in the 3D image. Then A′_g is extended, using zeros, to a large but very sparse matrix A_g ∈ R^{3×P}, in order to be directly applicable to the full vector v. If some neighbors lie outside the mask, the corresponding rows in A_g are removed. Noticing that, for TV, there is one group per voxel in the mask (G = [[1, P]]), we can reformulate TV from Eq. (6) using a general expression:
\[
\mathrm{TV}(v) = \sum_{g\in G}\|A_g v\|_2. \qquad (8)
\]
Finally, with a vertical concatenation of all the A_g matrices, we obtain the full linear operator A ∈ R^{3P×P} that will be used in Section II-E.
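As an illustration, the construction of A for a 3D mask can be sketched with scipy.sparse (function and variable names are ours; the paper removes out-of-mask rows, whereas this sketch leaves them as zero rows, which yields the same TV value):

```python
import numpy as np
import scipy.sparse as sparse

def tv_linear_operator(mask):
    """Build A in R^{3P x P}: for each in-mask voxel g, three rows encode
    the first-order differences towards the i, j and k neighbors (Eq. (7)).
    Rows whose neighbor falls outside the mask are left as zero rows here."""
    idx = -np.ones(mask.shape, dtype=int)
    idx[mask] = np.arange(mask.sum())          # the mapping g(i, j, k)
    P = int(mask.sum())
    rows, cols, vals = [], [], []
    for (i, j, k) in zip(*np.nonzero(mask)):
        for d, nb in enumerate([(i + 1, j, k), (i, j + 1, k), (i, j, k + 1)]):
            if all(c < s for c, s in zip(nb, mask.shape)) and mask[nb]:
                row = 3 * idx[i, j, k] + d
                rows += [row, row]
                cols += [idx[i, j, k], idx[nb]]
                vals += [-1.0, 1.0]
    return sparse.csr_matrix((vals, (rows, cols)), shape=(3 * P, P))

mask = np.ones((2, 2, 1), dtype=bool)          # tiny 2x2 single-slice mask
A = tv_linear_operator(mask)
v = np.array([0.0, 1.0, 0.0, 1.0])             # a loading vector on the 4 voxels
tv = sum(np.linalg.norm((A @ v)[3 * g:3 * g + 3]) for g in range(4))
print(tv)                                      # TV(v) = sum_g ||A_g v||_2 = 2.0
```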

2) Mesh of cortical surface: The linear operator A′_g used to compute a first order approximation of the spatial gradient can be obtained by examining the neighboring vertices of each vertex g. With common triangle-tessellated surfaces, the neighborhood size is ≤ 7 (including g). In this setting, we have A′_g ∈ R^{3×7}, which can be extended and concatenated to obtain the full linear operator A.

D. Nesterov's smoothing of the structured penalty

We consider the convex non-smooth minimization of Eq. (3) with respect to v, where u is thus fixed. This problem includes a general structured penalty s(·), which covers the specific case of TV. A widely used approach when dealing with non-smooth problems is to use methods based on the proximal operator of the penalties. For the ℓ1 penalty alone, the proximal operator is analytically known, and efficient iterative algorithms such as ISTA and FISTA are available (see [4]). However, since the proximal operator of the TV+ℓ1 penalty has no closed form, a standard implementation of those algorithms is not suitable. In order to overcome this barrier, we used Nesterov's smoothing technique [39]. It consists of approximating the non-smooth penalties for which the proximal operator is unknown (e.g., TV) with a smooth function (of which the gradient is known). Non-smooth penalties with known proximal operators (e.g., ℓ1) are not affected. Hence, as described in [50], it allows the use of an exact accelerated proximal gradient algorithm. Thus, we can solve the PCA problem penalized by TV and elastic net, where an exact ℓ1 penalty is used.
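For reference, the analytically known proximal operator of the ℓ1 penalty is elementwise soft-thresholding; a minimal sketch (names are ours):

```python
import numpy as np

def prox_l1(z, t):
    """Proximal operator of t * ||.||_1: elementwise soft-thresholding,
    [prox(z)]_j = sign(z_j) * max(|z_j| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# entries below the threshold are zeroed, the others shrink towards zero
print(prox_l1(np.array([1.5, -0.2, 0.7]), 0.5))   # values 1.0, 0.0, 0.2
```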

Using the dual norm of the ℓ2-norm (which happens to be the ℓ2-norm as well), Eq. (8) can be reformulated as
\[
s(v) = \sum_{g\in G}\|A_g v\|_2 = \sum_{g\in G}\;\max_{\|\alpha_g\|_2\le 1} \alpha_g^\top A_g v, \qquad (9)
\]
where α_g ∈ K_g = {α_g ∈ R³ : ‖α_g‖₂ ≤ 1} is a vector of auxiliary variables in the ℓ2 unit ball associated with A_g v. As with A ∈ R^{3P×P}, which is the vertical concatenation of all the A_g, we concatenate all the α_g to form α ∈ K = {[α₁ᵀ, ..., α_Pᵀ]ᵀ : α_g ∈ K_g} ⊂ R^{3P}. K is the Cartesian product of 3D unit balls in Euclidean space, and therefore a compact convex set. Eq. (9) can further be written as
\[
s(v) = \max_{\alpha\in K} \alpha^\top A v. \qquad (10)
\]

Given this formulation of s(v), we can apply Nesterov's smoothing. For a given smoothing parameter μ > 0, the function s(v) is approximated by the smooth function
\[
s_\mu(v) = \max_{\alpha\in K}\Big\{\alpha^\top A v - \frac{\mu}{2}\|\alpha\|_2^2\Big\}, \qquad (11)
\]

for which lim_{μ→0} s_μ(v) = s(v). Nesterov [39] demonstrates this convergence using the inequality in Eq. (15). The value of α*_μ(v) = [α*ᵀ_{μ,1}, ..., α*ᵀ_{μ,g}, ..., α*ᵀ_{μ,P}]ᵀ that maximizes Eq. (11) is the concatenation of the projections of the vectors A_g v ∈ R³ onto the ℓ2 ball K_g, α*_{μ,g}(v) = proj_{K_g}(A_g v / μ), where
\[
\mathrm{proj}_{K_g}(x) = \begin{cases} x & \text{if } \|x\|_2 \le 1, \\ \dfrac{x}{\|x\|_2} & \text{otherwise.} \end{cases} \qquad (12)
\]

The function s_μ, i.e., the Nesterov smooth transform of s, is convex and differentiable. Its gradient, given by [39],
\[
\nabla(s_\mu)(v) = A^\top \alpha^*_\mu(v), \qquad (13)
\]
is Lipschitz-continuous, with constant
\[
L\big(\nabla(s_\mu)\big) = \frac{\|A\|_2^2}{\mu}, \qquad (14)
\]

where ‖A‖₂ is the matrix spectral norm of A. Moreover, Nesterov [39] provides the following inequality relating s_μ and s:
\[
s_\mu(v) \le s(v) \le s_\mu(v) + \mu M, \qquad \forall v \in \mathbb{R}^P, \qquad (15)
\]
where M = max_{α∈K} ‖α‖₂²/2 = P/2.
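These quantities are cheap to compute once A is available. A small numeric sketch of Eqs. (11)-(15) on random data (group size 3, as in the 3D case; names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)
P, mu = 10, 0.1
A = rng.standard_normal((3 * P, P))      # stand-in for the concatenated A_g
v = rng.standard_normal(P)

def proj_ball(x):
    """Projection onto the l2 unit ball, Eq. (12)."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

# alpha*_mu(v): group-wise projections of A_g v / mu (the maximizer of Eq. (11))
Av = (A @ v).reshape(P, 3)               # row g holds A_g v
alpha = np.concatenate([proj_ball(a / mu) for a in Av])

s = np.linalg.norm(Av, axis=1).sum()                 # s(v), Eq. (8)
s_mu = alpha @ (A @ v) - 0.5 * mu * (alpha @ alpha)  # s_mu(v), Eq. (11)
M = P / 2.0                                          # M = max_K ||alpha||^2 / 2
assert s_mu - 1e-9 <= s <= s_mu + mu * M + 1e-9      # Eq. (15), up to rounding
grad_s_mu = A.T @ alpha                              # gradient of s_mu, Eq. (13)
```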

Thus, a new (smoothed) optimization problem, closely related to Eq. (3) (with fixed u), arises from this regularization as
\[
\min_{v} \overbrace{\underbrace{-\frac{1}{N}\,u^\top X v + \lambda_2\|v\|_2^2}_{l(v)} + \lambda\underbrace{\Big(\alpha^*_\mu(v)^\top A v - \frac{\mu}{2}\|\alpha^*_\mu\|_2^2\Big)}_{s_\mu(v)}}^{\text{smooth}} + \lambda_1\overbrace{\underbrace{\|v\|_1}_{h(v)}}^{\text{non-smooth}}. \qquad (16)
\]

Since we are now able to explicitly compute the gradient of the smooth part, ∇(l + λs_μ) (Eq. (18)), its Lipschitz constant (Eq. (19)), and also the proximal operator of the non-smooth part, we have all the ingredients necessary to solve this minimization problem using an accelerated proximal gradient method [4]. Given a starting point v⁰ and a smoothing parameter μ, FISTA (Algorithm 1) minimizes the smoothed problem and reaches a prescribed precision ε_μ.

However, in order to control the convergence of the algorithm (presented in Section II-E1), we introduce the Fenchel dual function and the corresponding duality gap of the objective function. The Fenchel duality requires the loss to be strongly convex, which is why we further reformulate Eq. (16) slightly: all penalty terms are divided by λ2, and, by using the following equivalent formulation for the loss, we obtain the minimization problem

\[
\min_v f_\mu \equiv \underbrace{\frac{1}{2}\Big\|v - \frac{X^\top u}{N\lambda_2}\Big\|_2^2}_{L(v)} + \underbrace{\frac{1}{2}\|v\|_2^2 + \frac{\lambda}{\lambda_2}\underbrace{\Big(\alpha^*_\mu(v)^\top A v - \frac{\mu}{2}\|\alpha^*_\mu\|_2^2\Big)}_{s_\mu(v)} + \frac{\lambda_1}{\lambda_2}\underbrace{\|v\|_1}_{h(v)}}_{\psi_\mu(v)} \qquad (17)
\]

This new formulation of the smoothed objective function (noted f_μ) preserves the decomposition of f_μ into a sum of a smooth term, l + (λ/λ2)s_μ, and a non-smooth term, h. Such a decomposition is required for the application of FISTA with Nesterov's smoothing. Moreover, this formulation provides a decomposition of f_μ into a sum of a smooth loss L and a penalty term ψ_μ, required for the calculation of the gap presented in Section II-E1.

We provide all the required quantities to minimize Eq. (17) using Algorithm 1. Using Eq. (13), we compute the gradient of the smooth part as
\[
\nabla\Big(l + \frac{\lambda}{\lambda_2}s_\mu\Big) = \nabla(l) + \frac{\lambda}{\lambda_2}\nabla(s_\mu) = \Big(2v - \frac{X^\top u}{N\lambda_2}\Big) + \frac{\lambda}{\lambda_2}A^\top\alpha^*_\mu(v^k), \qquad (18)
\]

and its Lipschitz constant (using Eq. (14)):
\[
L\Big(\nabla\Big(l + \frac{\lambda}{\lambda_2}s_\mu\Big)\Big) = 2 + \frac{\lambda}{\lambda_2}\,\frac{\|A\|_2^2}{\mu}. \qquad (19)
\]


Algorithm 1 FISTA(Xᵀu, v⁰, ε_μ, μ, A, λ, L(∇(g)))
1: v¹ = v⁰, k = 2
2: Compute the gradient of the smooth part, ∇(g + λs_μ) (Eq. (18)), and its Lipschitz constant L_μ (Eq. (19))
3: Compute the step size t_μ = L_μ⁻¹
4: repeat
5:     z = v^(k−1) + ((k − 2)/(k + 1)) (v^(k−1) − v^(k−2))
6:     v^k = prox_h(z − t_μ ∇(g + λs_μ)(z))
7: until GAP_μ(v^k) ≤ ε_μ
8: return v^k
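Stripped of the smoothing and duality-gap machinery, the skeleton of Algorithm 1 is standard FISTA [4]; the sketch below applies it to a toy lasso problem (our illustration, not Eq. (17); a fixed iteration budget replaces the gap-based stopping rule):

```python
import numpy as np

def fista(grad, prox, lipschitz, v0, n_iter=200):
    """Accelerated proximal gradient, following Lines 1-6 of Algorithm 1,
    with a fixed iteration budget instead of the duality-gap stopping rule."""
    t = 1.0 / lipschitz                       # step size t = L^{-1} (Line 3)
    v_prev, v = v0.copy(), v0.copy()          # v^1 = v^0 (Line 1)
    for k in range(2, n_iter + 2):
        z = v + (k - 2.0) / (k + 1.0) * (v - v_prev)   # momentum step (Line 5)
        v_prev, v = v, prox(z - t * grad(z), t)        # proximal step (Line 6)
    return v

# Toy lasso problem: min_v 0.5 ||Xv - y||_2^2 + lam ||v||_1
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
y = X @ np.array([2.0] + [0.0] * 9) + 0.01 * rng.standard_normal(30)
lam = 5.0
grad = lambda v: X.T @ (X @ v - y)            # gradient of the smooth part
prox = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - lam * t, 0.0)
L = np.linalg.norm(X, 2) ** 2                 # Lipschitz constant of grad
v = fista(grad, prox, L, np.zeros(10))
print(v[0])   # the true non-zero feature is recovered (shrunk by the l1 penalty)
```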

E. Minimization of the loading vectors with CONESTA

The step size t_μ computed in Line 3 of Algorithm 1 depends on the smoothing parameter μ (see Eq. (19)). Hence, there is a trade-off between speed and precision. Indeed, high precision, with a small μ, will lead to slow convergence (small t_μ). Conversely, poor precision (large μ) will lead to rapid convergence (large t_μ). Thus, we propose a continuation approach (Algorithm 2), which decreases the smoothing parameter with respect to the distance to the minimum. On the one hand, when we are far from v* (the minimum of Eq. (17)), we can use a large μ to rapidly decrease the objective function. On the other hand, when we are close to v*, we need a small μ in order to obtain an accurate approximation of the original objective function.

1) Duality gap: The distance to the unknown f(v*) is estimated using the duality gap. Duality formulations are often used to control the achieved precision level when minimizing convex functions. They provide an estimation of the error f(v^k) − f(v*) for any v, when the minimum is unknown. The duality gap is the cornerstone of the CONESTA algorithm. Indeed, it is used three times:

i. As the stopping criterion in the inner FISTA loop (Line 7 in Algorithm 1). FISTA will stop as soon as the current precision is achieved using the current smoothing parameter μ. This prevents unnecessary convergence toward the approximated (smoothed) objective function.

ii. In the i-th CONESTA iteration, as a way to estimate the current error f(v^i) − f(v*) (Line 7 in Algorithm 2). The error is estimated using the gap of the smoothed problem, GAP_{μ=μ^i}(v^{i+1}), which avoids unnecessary computation since it has already been computed during the last iteration of FISTA. The inequality in Eq. (15) is used to obtain the gap ε^i to the original non-smoothed problem. The next desired precision, ε^{i+1}, and the smoothing parameter, μ^{i+1}, are derived from this value.

iii. Finally, as the global stopping criterion within CONESTA (Line 10 in Algorithm 2). This guarantees that the obtained approximation of the minimum, v^i, at convergence satisfies f(v^i) − f(v*) < ε.

Based on Eq. (17), which decomposes the smoothed objective function as a sum of a strongly convex loss and the penalty,

f_\mu(v) = L(v) + \psi_\mu(v),

we compute the duality gap that provides an upper bound estimation of the error to the optimum. At any step k of the algorithm, given the current primal v^k and dual σ(v^k) ≡ ∇L(v^k) variables [8], we can compute the duality gap using the Fenchel duality rules [35]:

\mathrm{GAP}(v^k) \equiv f_\mu(v^k) + L^*\big(\sigma(v^k)\big) + \psi^*_\mu\big(-\sigma(v^k)\big)    (20)

where L* and ψ*_μ are, respectively, the Fenchel conjugates of L and ψ_μ. Denoting by v* the minimum of f_μ (the solution of Eq. (17)), the interest of the duality gap is that it provides an upper bound for the difference with the optimal value of the function. Moreover, it vanishes at the minimum:

\mathrm{GAP}(v^k) \ge f(v^k) - f(v^*) \ge 0, \qquad \mathrm{GAP}(v^*) = 0    (21)

The dual variable is

\sigma(v^k) \equiv \nabla L(v^k) = v^k - \frac{X^\top u}{n\lambda_2},    (22)

and the Fenchel conjugate of the squared loss L(v^k) is

L^*\big(\sigma(v^k)\big) = \frac{1}{2}\big\|\sigma(v^k)\big\|_2^2 + \sigma(v^k)^\top \frac{X^\top u}{n\lambda_2}.    (23)

In [25], the authors provide the expression of the Fenchel conjugate of the penalty ψ_μ(v^k):

\psi^*_\mu\big(-\sigma(v^k)\big) = \frac{1}{2}\sum_{j=1}^{P}\left[\Big|-\sigma(v^k)_j - \frac{\lambda}{\lambda_2}\big(A^\top\alpha^*_\mu(v^k)\big)_j\Big| - \frac{\lambda_1}{\lambda_2}\right]_+^2 + \frac{\lambda\mu}{2\lambda_2}\big\|\alpha^*_\mu(v^k)\big\|_2^2    (24)

where [·]₊ = max(0, ·).

The expression of the duality gap in Eq. (20) provides an estimation of the distance to the minimum. This distance is geometrically decreased by a factor τ = 0.5 at the end of each continuation, and the decreased value defines the precision that should be reached by the next iteration (Line 8 of Algorithm 2). Thus, the algorithm dynamically generates a sequence of decreasing prescribed precisions ε^i. Such a scheme ensures the convergence [25] towards a globally desired final precision, ε, which is the only parameter that the user needs to provide.
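As a concrete illustration of Eq. (20)-(21), the toy check below evaluates the gap for a pure ℓ1-penalized quadratic problem. This is our own setup, not the paper's: with no TV term, ψ* reduces to the indicator of the ℓ∞-ball of radius λ₁ (a degenerate case of Eq. (24)), and the minimizer is the closed-form soft-thresholding solution, at which the gap should vanish.

```python
import numpy as np

b = np.array([1.0, -0.3, 0.02])    # stands in for X^T u / (n * lambda2)
lam1 = 0.05                         # l1 weight

def f(v):
    """Primal objective: L(v) + psi(v) with L(v) = 0.5 * ||v - b||^2."""
    return 0.5 * np.sum((v - b) ** 2) + lam1 * np.sum(np.abs(v))

def gap(v):
    """Duality gap of Eq. (20): f(v) + L*(sigma) + psi*(-sigma)."""
    sigma = v - b                                            # dual variable, cf. Eq. (22)
    L_star = 0.5 * np.sum(sigma ** 2) + sigma @ b            # cf. Eq. (23)
    # psi* is 0 inside the l-infinity ball of radius lam1, +inf outside
    psi_star = 0.0 if np.max(np.abs(sigma)) <= lam1 + 1e-12 else np.inf
    return f(v) + L_star + psi_star

v_star = np.sign(b) * np.maximum(np.abs(b) - lam1, 0.0)      # soft-thresholding solution
print(gap(v_star))                                           # ~0 at the optimum, cf. Eq. (21)
```

For any other v, the gap is a valid (possibly infinite) upper bound on f(v) − f(v*), which is exactly the property CONESTA exploits as a stopping criterion.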

2) Determining the optimal smoothing parameter: Given the current prescribed precision ε^i, we need to compute an optimal smoothing parameter μ_opt(ε^i) (Line 9 in Algorithm 2) that minimizes the number of FISTA iterations needed to achieve such precision when minimizing Eq. (3) (with fixed u) via Eq. (17) (i.e., such that f(v^k) − f(v*) < ε^i).

In [25], the authors provide the expression of this optimal smoothing parameter:

\mu_{\mathrm{opt}}(\varepsilon^i) = \frac{-\lambda M\|A\|_2^2 + \sqrt{(\lambda M\|A\|_2^2)^2 + M\,L(\nabla(l))\,\|A\|_2^2\,\varepsilon^i}}{M\,L(\nabla(l))}    (25)

where M = P/2 (Eq. (15)) and L(∇(l)) = 2 is the Lipschitz constant of the gradient of l, as defined in Eq. (17).
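A direct transcription of Eq. (25) can look as follows (the function and argument names are ours):

```python
import numpy as np

def mu_opt(eps_i, lam, norm_A2_sq, P):
    """Optimal smoothing parameter of Eq. (25).

    eps_i      : precision prescribed for the current continuation step
    lam        : weight of the TV penalty
    norm_A2_sq : ||A||_2^2, squared spectral norm of the linear operator A
    P          : number of features, so that M = P / 2
    """
    M = P / 2.0
    L_grad_l = 2.0                          # Lipschitz constant of grad(l), cf. Eq. (25)
    a = lam * M * norm_A2_sq
    return (-a + np.sqrt(a ** 2 + M * L_grad_l * norm_A2_sq * eps_i)) / (M * L_grad_l)
```

Note that μ_opt(ε) is positive and increases with ε, which matches the continuation logic: a loose precision target allows a coarser (larger) smoothing, hence a larger step size.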


We call the resulting algorithm CONESTA (short for COntinuation with NEsterov smoothing in a Shrinkage-Thresholding Algorithm). It is presented in detail, with convergence proofs, in [25].

Let K be the total number of FISTA loops used in CONESTA; then we have experimentally verified that the convergence rate to the solution of Eq. (16) is O(1/K²) (which is the optimal convergence rate for first-order methods). Also, the algorithm works even if some of the weights λ₁ or λ are zero, which thus allows us to solve, e.g., the elastic net using CONESTA. Note that it has been rigorously proved that the continuation technique improves the convergence rate compared to simple smoothing using a single value of μ. Indeed, it has been demonstrated in [6] (see also [50]) that the convergence rate obtained with a single value of μ, even optimized, is O(1/K²) + O(1/K). However, it has recently been proved in [25] that the CONESTA algorithm achieves O(1/K) for general convex functions.

We note that CONESTA could easily be adapted to many other penalties. For example, to add the group lasso (GL) constraint to our structure, we just have to design a specific linear operator, A_GL, and concatenate it to the actual linear operator A.

Algorithm 2 CONESTA(Xᵀu, ε)
1: Initialize v⁰ ∈ R^P
2: ε⁰ = τ · GAP_{μ=10⁻⁸}(v⁰)
3: μ⁰ = μ_opt(ε⁰)
4: repeat
5:   ε^i_μ = ε^i − μ^i γM
6:   v^{i+1} = FISTA(Xᵀu, v^i, ε^i_μ, ...)
7:   ε^i = GAP_{μ=μ^i}(v^{i+1}) + μ^i γM
8:   ε^{i+1} = τ · ε^i
9:   μ^{i+1} = μ_opt(ε^{i+1})
10: until ε^i ≤ ε
11: return v^{i+1}
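The continuation principle of Algorithm 2 can be illustrated on a toy 1D TV-denoising problem. This sketch is ours, not the paper's code: the loss is ½‖v − b‖² (Lipschitz constant 1 instead of 2), the precision and smoothing schedules are a fixed geometric decrease standing in for the duality gap and for μ_opt, and there is no ℓ1/ℓ2 term.

```python
import numpy as np

def make_diff_op(p):
    """1D forward-difference operator A, so that ||A v||_1 = TV(v)."""
    A = np.zeros((p - 1, p))
    for j in range(p - 1):
        A[j, j], A[j, j + 1] = -1.0, 1.0
    return A

def smoothed_tv_grad(v, A, mu):
    """Gradient of the Nesterov-smoothed TV: A^T alpha*_mu(v), where
    alpha*_mu(v) is the projection of A v / mu onto [-1, 1]^(p-1)."""
    alpha = np.clip(A @ v / mu, -1.0, 1.0)
    return A.T @ alpha

def conesta_sketch(b, lam, eps=1e-6, tau=0.5, n_inner=200):
    """Toy continuation loop in the spirit of Algorithm 2, minimizing
    0.5 * ||v - b||^2 + lam * TV(v). Our loss has Lipschitz constant 1,
    so the smoothed Lipschitz constant is 1 + lam * ||A||_2^2 / mu
    (the analogue of Eq. (19)); ||A||_2^2 <= 4 for the 1D difference op."""
    A = make_diff_op(b.size)
    v = np.zeros(b.size)
    v_prev = v.copy()
    eps_i, mu = 1.0, 1.0
    while eps_i > eps:
        t = 1.0 / (1.0 + lam * 4.0 / mu)           # step size for current mu
        for k in range(2, n_inner + 2):            # inner FISTA (Algorithm 1), h = 0
            z = v + (k - 2.0) / (k + 1.0) * (v - v_prev)
            v_prev = v
            v = z - t * ((z - b) + lam * smoothed_tv_grad(z, A, mu))
        eps_i *= tau       # geometric precision schedule (stand-in for the gap)
        mu *= tau          # tighter smoothing as we approach the solution
    return v

# Piecewise-constant signal plus noise: TV should recover near-flat plateaus
rng = np.random.RandomState(0)
b = np.concatenate([np.zeros(20), np.ones(20)]) + 0.05 * rng.randn(40)
v = conesta_sketch(b, lam=0.5)
```

The warm start across continuations is what makes the schedule cheap: each stage only has to correct the O(μ) perturbation introduced by tightening the smoothing.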

F. The algorithm for the SPCA-TV problem

The computation of a single component through SPCA-TV can be achieved by combining CONESTA and Eq. (5) within an alternating minimization loop. Mackey [34] demonstrated that further components can be efficiently obtained by incorporating this single-unit procedure in a deflation scheme, as done in, e.g., [14], [31]. The stopping criterion is defined as

\mathrm{STOPPING\ CRITERION} = \frac{\big\|X_k - u^{i+1}v^{i+1\top}\big\|_F - \big\|X_k - u^{i}v^{i\top}\big\|_F}{\big\|X_k - u^{i+1}v^{i+1\top}\big\|_F}    (26)

All the presented building blocks were combined into Algorithm 3 to solve the SPCA-TV problem.

III. EXPERIMENTS

We evaluated the performance of SPCA-TV using three experiments: one simulation study, carried out on a synthetic

Algorithm 3 SPCA-TV(X, ε)
1: X⁰ = X
2: for all k = 0, ..., K do    ▷ Components
3:   Initialize u⁰ ∈ R^N
4:   repeat    ▷ Alternating minimization
5:     v^{i+1} = CONESTA(X_kᵀ u^i, ε)
6:     u^{i+1} = X_k v^{i+1} / ‖X_k v^{i+1}‖₂
7:   until STOPPING CRITERION ≤ ε
8:   v^{k+1} = v^{i+1}
9:   u^{k+1} = u^{i+1}
10:  X^{k+1} = X^k − u^{k+1} v^{k+1ᵀ}    ▷ Deflation
11: end for
12: return U = [u¹, ..., u^K], V = [v¹, ..., v^K]
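The control flow of Algorithm 3 (alternating minimization plus deflation) can be sketched as below. The `solve_v` argument stands in for the CONESTA call of Line 5; the default unpenalized update is our own stand-in, which reduces the scheme to plain power-iteration PCA and is only meant to sanity-check the skeleton.

```python
import numpy as np

def spca_tv_skeleton(X, n_components=2, solve_v=None, eps=1e-6, max_alt=500):
    """Skeleton of Algorithm 3: alternating minimization + deflation.

    solve_v : callable mapping X_k^T u to the loading vector v
              (CONESTA in the paper; here, the identity by default).
    """
    if solve_v is None:
        solve_v = lambda Xtu: Xtu          # unpenalized stand-in for CONESTA
    n = X.shape[0]
    Xk = X.copy()
    U, V = [], []
    for k in range(n_components):
        u = np.random.RandomState(k).randn(n)
        u /= np.linalg.norm(u)             # Line 3: random unit start
        err_prev = np.inf
        for _ in range(max_alt):
            v = solve_v(Xk.T @ u)          # Line 5 (CONESTA in the paper)
            u = Xk @ v
            u /= np.linalg.norm(u)         # Line 6
            err = np.linalg.norm(Xk - np.outer(u, v))
            if np.isfinite(err_prev) and abs(err_prev - err) / err <= eps:
                break                      # relative change, cf. Eq. (26)
            err_prev = err
        U.append(u)
        V.append(v)
        Xk = Xk - np.outer(u, v)           # Line 10: deflation
    return np.array(U).T, np.array(V).T

# Sanity check on a nearly rank-one matrix: the first loading should align
# with the dominant right singular direction.
rng = np.random.RandomState(7)
a, c = rng.randn(30), rng.randn(10)
X = np.outer(a, c) + 0.01 * rng.randn(30, 10)
U, V = spca_tv_skeleton(X, n_components=1)
```

Swapping `solve_v` for a penalized solver is exactly where the sparsity and TV structure enter in the paper's method; everything else in the loop stays the same.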

data set, and two on neuroimaging data sets. In order to compare the performance of SPCA-TV with existing sparse PCA models, we also included results obtained with Sparse PCA, ElasticNet PCA, GraphNet PCA, and SSPCA from [29]. We used the scikit-learn implementation [42] for Sparse PCA, while we used the Parsimony package (https://github.com/neurospin/pylearn-parsimony) for the ElasticNet, GraphNet PCA, and SPCA-TV methods. Concerning SSPCA, we used the MATLAB implementation provided in [29].

The number of parameters to set differs between the methods. For Sparse PCA, the optimal λ₁ is selected from the range {0.1, 1.0, 5.0, 10.0}. ElasticNet PCA requires setting the λ₁ and λ₂ penalty weights. Meanwhile, GraphNet PCA and SPCA-TV require the setting of an additional parameter, namely the spatial constraint penalty λ. We re-parametrized these penalty weights in terms of ratios: a global parameter α ∈ {0.01, 0.1, 1.0} controls the weight attributed to the whole penalty term, including the spatial and the ℓ1 regularization, while individual constraints are expressed as ratios, the ℓ1 ratio λ₁/(λ₁ + λ₂ + λ) ∈ {0.1, 0.5, 0.8} and the ℓTV (or ℓGN for GraphNet) ratio λ/(λ₁ + λ₂ + λ) ∈ {0.1, 0.5, 0.8}. For ElasticNet, we explored the grid of parameters composed of the Cartesian product of the α and ℓ1-ratio subsets. For GraphNet PCA and SPCA-TV, we performed a parameter search on a grid given by the Cartesian product of the (α, ℓ1, ℓGN) and (α, ℓ1, ℓTV) subsets, respectively. Concerning the SSPCA method, the regularization parameter was selected from the range [10⁻⁸, 10⁸].
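The re-parametrization described above can be sketched as follows. The grid values and the rule that the ℓ2 weight takes the remaining fraction of α are our reading of the text, not code from the paper:

```python
from itertools import product

# Hypothetical re-parametrization: a global weight alpha and ratios for the
# l1 and TV terms; whatever fraction is left over goes to the l2 (ridge) term.
alphas = [0.01, 0.1, 1.0]
l1_ratios = [0.1, 0.5, 0.8]
tv_ratios = [0.1, 0.5, 0.8]

grid = []
for alpha, r1, rtv in product(alphas, l1_ratios, tv_ratios):
    if r1 + rtv >= 1.0:                 # the ratios must leave room for the l2 term
        continue
    lam1 = alpha * r1                   # l1 weight: lambda1/(lambda1+lambda2+lambda) = r1
    lam_tv = alpha * rtv                # TV (or GraphNet) weight
    lam2 = alpha * (1.0 - r1 - rtv)     # remaining l2 weight
    grid.append((lam1, lam2, lam_tv))
```

Each admissible (α, r1, rTV) triple then yields absolute penalty weights that sum to α, which is what makes α act as a single global regularization strength.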

However, in order to ensure that the extracted components have a minimum amount of sparsity, we also included a criterion controlling sparsity: at least half of the features of the components have to be zero. For both real neuroimaging experiments, performance was evaluated through a 5-fold × 5-fold double cross-validation pipeline. The double cross-validation process consists of two nested cross-validation loops, referred to as the internal and external cross-validation loops. In the outer (external) loop, all samples are randomly split into subsets referred to as training and test sets. The test sets are exclusively used for model assessment, while the train sets are used in the inner (internal) loop for model fitting


and selection. The inner folds select the set of parameters minimizing the reconstruction error on the outer fold. For the synthetic data, we used 50 different purposely-generated data sets and 5 inner folds for parameter selection. In order to evaluate the reconstruction accuracy of the methods, we report the mean Frobenius norm of the reconstruction error across the folds/data sets on independent test data. The hypothesis we wanted to test was whether there was a substantial decrease in the reconstruction error on independent data when using SPCA-TV compared to when using Sparse PCA, ElasticNet PCA, GraphNet PCA, and SSPCA. It was tested through a related two-sample t-test. This choice to compare the methods' performance on independent test data was motivated by the fact that the optimal reconstruction of the training set is necessarily hindered by the spatial and sparsity constraints; we therefore expect SPCA-TV to perform worse on train data than other, less constrained methods. However, the TV penalty has a more important purpose than just to minimize the reconstruction error: the estimation of coherent and reproducible loadings. Indeed, clinicians expect that if images from other patients with comparable clinical conditions had been used, the extracted loading vectors would have turned out to be similar. Therefore, since the ultimate goal of SPCA-TV is to yield stable and reproducible weight maps, it is more relevant to evaluate the methods on independent test data.
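The double cross-validation scheme can be sketched as follows: a generic skeleton with hypothetical `fit`/`score` callables, not the paper's actual pipeline.

```python
import numpy as np

def double_cv(X, fit, score, param_grid, n_outer=5, n_inner=5, seed=0):
    """Sketch of 5x5 double cross-validation: the inner loop picks the
    parameters minimizing the validation score (reconstruction error in
    the paper); the outer loop assesses that choice on held-out data."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(X.shape[0])
    outer_folds = np.array_split(idx, n_outer)
    outer_scores = []
    for i in range(n_outer):
        test = outer_folds[i]
        train = np.concatenate([f for j, f in enumerate(outer_folds) if j != i])
        inner_folds = np.array_split(train, n_inner)
        best, best_err = None, np.inf
        for params in param_grid:                       # inner loop: model selection
            errs = []
            for j in range(n_inner):
                val = inner_folds[j]
                fit_idx = np.concatenate(
                    [f for l, f in enumerate(inner_folds) if l != j])
                model = fit(X[fit_idx], params)
                errs.append(score(model, X[val]))
            if np.mean(errs) < best_err:
                best, best_err = params, np.mean(errs)
        model = fit(X[train], best)                     # refit with the selected params
        outer_scores.append(score(model, X[test]))      # outer loop: model assessment
    return float(np.mean(outer_scores))
```

The key property is that the test folds of the outer loop never influence parameter selection, so the returned score is an unbiased estimate of generalization performance.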

The stability of the loading vectors obtained across the various training data sets (variation in the learning samples) was assessed through a similarity measure: the pairwise Dice index between loading vectors obtained with different folds/data sets [16]. We tested whether the pairwise Dice indices are significantly higher with SPCA-TV than with the other methods. Testing this hypothesis is equivalent to testing the sign of the difference of pairwise Dice indices between methods. However, since the pairwise Dice indices are not independent from one another (the folds share many of their learning samples), direct significance measures are biased. We therefore used permutation testing to estimate empirical p-values: the null hypothesis was tested by simulating samples from the null distribution. We generated 1,000 random permutations of the sign of the difference of the pairwise Dice indices between the PCA methods under comparison, and the statistics on the true data were then compared with the ones obtained on the reshuffled data to obtain empirical p-values.
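The sign-permutation test described above can be sketched as follows (our own minimal implementation; the paper's exact test statistic may differ):

```python
import numpy as np

def sign_flip_test(diffs, n_perm=1000, seed=0):
    """Sign-permutation test for paired differences (here: pairwise Dice
    differences between two PCA methods). Returns an empirical one-sided
    p-value for mean(diffs) > 0 under random sign flips of the differences."""
    rng = np.random.RandomState(seed)
    diffs = np.asarray(diffs, dtype=float)
    observed = diffs.mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diffs.size)
        null[i] = (signs * diffs).mean()        # statistic under the null
    # add-one correction so the p-value is never exactly zero
    return (np.sum(null >= observed) + 1.0) / (n_perm + 1.0)
```

Flipping signs rather than reshuffling raw values is what makes the test valid for paired, non-independent quantities such as fold-wise Dice differences.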

For each experiment, we made the initial choice to retrieve the first ten components. However, given the length constraint, we only present the weight maps associated with the top three components for Sparse PCA and SPCA-TV in this paper. The weight maps of ElasticNet PCA, GraphNet PCA, and SSPCA for the experiments are presented in the supplementary materials (available in the supplementary files/multimedia tab).

A. Simulation study

We generated 50 sets of synthetic data, each composed of 500 images of size 100 × 100 pixels. The images are generated using the following noisy linear system:

u_1 V_1 + u_2 V_2 + u_3 V_3 + \varepsilon \in \mathbb{R}^{10\,000}    (27)

where V = [V₁, V₂, V₃] ∈ R^{10,000×3} are the sparse and structured loading vectors illustrated in Fig. 1. The support of V₁ defines the two upper dots, the support of V₂ defines the two lower dots, while V₃'s support delineates the middle dot. The coefficients u = [u₁, u₂, u₃] that linearly combine the components of V are generated according to a centered Gaussian distribution. The elements of the noise vector ε are independent and identically distributed according to a centered Gaussian distribution, with a 0.1 signal-to-noise ratio (SNR). This SNR was selected by a previous calibration pipeline, in which we tested the efficiency of data reconstruction at multiple SNR values ranging from 0 to 0.5. We decided to work with an SNR of 0.1 because it is located in the range of values where standard PCA starts being less efficient in the recovery process.
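A minimal sketch of the generative model of Eq. (27) follows. The dot positions, radii, and noise scale below are our own placeholders for the supports of Fig. 1, not the paper's exact values:

```python
import numpy as np

rng = np.random.RandomState(42)
P, n = 100 * 100, 500

def dot_support(cx, cy, r=8):
    """Binary disc of radius r centered at (cx, cy) in a 100x100 image."""
    yy, xx = np.mgrid[0:100, 0:100]
    return (((xx - cx) ** 2 + (yy - cy) ** 2) <= r ** 2).ravel().astype(float)

# Three sparse, structured loading vectors, one per column (cf. Fig. 1)
V = np.stack([dot_support(25, 25) + dot_support(75, 25),   # two upper dots
              dot_support(25, 75) + dot_support(75, 75),   # two lower dots
              dot_support(50, 50)], axis=1)                # middle dot

U = rng.randn(n, 3)                  # centered Gaussian mixing coefficients u
noise = rng.randn(n, P)
X = U @ V.T + 10.0 * noise           # arbitrary large noise scale => low SNR
```

Each row of X is one synthetic image: a Gaussian-weighted combination of the three structured components plus dense noise, which is exactly the setting in which structured penalties should help the recovery.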

Fig. 1: Loading vectors V = [V₁, V₂, V₃] ∈ R^{10,000×3} used to generate the images (component 1, component 2, component 3).

We split the 500 artificial images into a test set and a training set, with 250 images in each, and learned the decomposition on the training set.

Fig. 2: Loading vectors recovered from 250 images using Sparse PCA and SPCA-TV (components 1-3 for each method).

Fig. 2 represents the loading vectors extracted with one data set. Please note that the sign is arbitrary: indeed, if we consider the loss of Eq. (3), uᵀ and v can both be multiplied by −1


TABLE I: Scores averaged across the 50 independent data sets. We tested whether the scores obtained with the existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: p ≤ 10⁻³.

Methods         | Test data reconstruction error | MSE  | Dice index
Sparse PCA      | 1576.0                         | 0.91 | 0.28
ElasticNet PCA  | 1572.4                         | 0.83 | 0.43
GraphNet PCA    | 1570.8                         | 0.83 | 0.30
SSPCA           | 1571.9                         | 1.54 | 0.07
SPCA-TV         | 1570.1                         | 0.64 | 0.52

without changing anything. We observe that Sparse PCA yields very scattered loading vectors. The loading vectors of SPCA-TV, on the other hand, are sparse but also organized in clear regions: SPCA-TV provides loading vectors that closely match the ground truth. The reconstruction error is evaluated on the test sets (Tab. I), with its value over the 50 data sets being significantly lower with SPCA-TV than with the Sparse PCA (T = 94.5, p = 3.9·10⁻⁵⁷), ElasticNet PCA (T = 33.2, p = 2.7·10⁻³⁵), GraphNet PCA (T = 12.7, p = 3.6·10⁻¹⁷), and SSPCA [29] (T = 18.9, p = 3.9·10⁻²⁴) methods. Additional details concerning the reconstruction accuracy on both the train and test data are presented in Figure 1 of the supplementary materials (available in the supplementary files/multimedia tab).

A different way of quantifying the reconstruction accuracy for each method is to evaluate how closely the extracted loadings match the known ground truth of the simulated data set. We computed the mean squared error (MSE) between the ground truth and the estimated loadings. The results are presented in Tab. I. We note that the MSE is significantly lower with SPCA-TV than with Sparse PCA (T = 6.9, p = 8.0·10⁻⁹), ElasticNet PCA (T = 6.2, p = 1.1·10⁻⁷), GraphNet PCA (T = 4.1, p = 1.4·10⁻⁴), and SSPCA (T = 22.6, p = 1.5·10⁻²⁷).

Moreover, when evaluating the stability of the loading vectors across resampling, we found a significantly higher mean Dice index when using SPCA-TV compared to the other methods (p < 0.001). The results are presented in Tab. I. They indicate that SPCA-TV is more robust to variation in the learning samples than the other sparse methods: SPCA-TV yields reproducible loading vectors across data sets.
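The pairwise Dice index used as the stability measure can be computed as follows (our own minimal implementation, with a support threshold that we chose):

```python
import numpy as np

def dice(v1, v2, thresh=1e-8):
    """Dice index between the supports of two loading vectors:
    2 |S1 & S2| / (|S1| + |S2|), with S the set of non-zero features."""
    s1, s2 = np.abs(v1) > thresh, np.abs(v2) > thresh
    return 2.0 * np.sum(s1 & s2) / (np.sum(s1) + np.sum(s2))

def mean_pairwise_dice(loadings):
    """Mean Dice over all pairs of loading vectors (one per fold/data set)."""
    k = len(loadings)
    vals = [dice(loadings[i], loadings[j])
            for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(vals))
```

A value of 1 means the supports selected across folds are identical; a value near 0 means the selected features barely overlap, i.e., the loadings are unstable.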

These results indicate that the SPCA-TV loadings are not only more stable across resampling, but also achieve a better recovery of the underlying variability in independent data than the Sparse PCA, ElasticNet PCA, GraphNet PCA, and SSPCA methods.

One of the issues linked to biconvex optimization is the risk of falling into local minima. Conscious of this potential risk, we set up an experiment in which we ran the optimization of the same problem 50 times, with a different starting point at each run. We then compared the resulting loading vectors obtained at each run and computed a similarity measure, the Dice index, which quantifies the proximity between the independently-run solutions obtained from different starting points. We obtained a Dice index of 0.99 on the 1st component, 0.99 on the 2nd component, and 0.72 on the 3rd component. On the strength of these indices, we are confident in the robustness of the algorithm and in its ability to converge toward the same stable solution independently of the choice of starting point.

B. 3D images of functional MRI of patients with schizophrenia

We then applied the methods to 3D images of BOLD functional MRI (fMRI), acquired with the same scanner and pulse sequence. Imaging was performed on a 1.5 T scanner using a standard head coil. For all functional scans, the field of view was 206 × 206 × 153 mm, with a resolution close to 3.5 mm in all directions. The parameters of the PRESTO sequence were: TE = 9.6 ms, TR = 19.25 ms, EPI factor = 15, flip angle = 9°. Each fMRI run consisted of 900 collected volumes. The cohort is composed of 23 patients with schizophrenia (average age = 34.96 years; 8 females / 15 males). Brain activation was measured while the subjects experienced multimodal hallucinations. The fMRI data was pre-processed using SPM12 (Wellcome Department of Imaging Neuroscience, London, UK). Data preprocessing consisted of motion correction (realignment), coregistration of the individual anatomical T1 image to the functional images, and spatial normalization to MNI space using DARTEL, based on segmented T1 scans.

We considered each set of consecutive images under the pre-hallucination state as a block. Since most of the patients hallucinate more than once during the scanning session, we have more blocks than patients (83 blocks). The activation maps are computed from these blocks. Based on the general linear model approach, we regressed, for each block, the fMRI signal time course on a linear ramp function. Indeed, we hypothesized that activation in some regions presents a ramp-like increase during the time preceding the onset of hallucinations (see the example of regression in Figure 3 of the supplementary materials, available in the supplementary files/multimedia tab). The activation maps that we used as input to the SPCA-TV method are the statistical parametric maps associated with the coefficients of the block regression (see one example in Figure 4 of the supplementary materials, available in the supplementary files/multimedia tab). We obtained a data set of n = 83 maps and p = 63,966 features. We hypothesized that the principal components extracted with SPCA-TV from these activation maps could uncover major trends of variability within pre-hallucination patterns. Thus, they might reveal the existence of subgroups of patients according to the sensory modality (e.g., vision or audition) involved during hallucinations.

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, the SSPCA method [29] could not be applied to this specific example, since data sets have to be constituted of closed cubic forms, without any holes, to be eligible for SSPCA: it does not support masked data such as the data used here.

The loading vectors extracted from the activation maps of pre-hallucination scans with Sparse PCA and SPCA-TV are presented in Fig. 3. We observe a similar behavior as in the synthetic example, namely that the loading vectors of


Sparse PCA tend to be scattered and produce irregular patterns. However, SPCA-TV seems to yield structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCA-TV loading vectors are not redundant and revealed different patterns.

Indeed, the loading vectors obtained by SPCA-TV are of great interest because they revealed insightful patterns of variability in the data. The second loading is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also of vision-processing areas such as the occipital fusiform gyrus and the parietal operculum cortex regions. The third loading reveals important weights in the middle temporal gyrus, the parietal operculum cortex, and the frontal pole. The first loading vector encompasses all features of the brain. One might see this first component as a global variability affecting the whole brain, such as the overarching effect of age; SPCA-TV selects this dense configuration in spite of the sparsity constraint. It is highly desirable to remove any sort of global effect first, in order to start identifying, in the next components, local patterns that are not impacted by a global variability of this kind. We can identify a widespread set of dysfunctional language-related or vision-related areas that present increasing activity during the time preceding the onset of hallucinations. The regions extracted by SPCA-TV are found to be pertinent according to the existing literature on the topic ([28], [27], [7]).

These results seem to indicate the possible existence of subgroups of patients according to the hallucination modalities involved. An interesting application would be to use the scores of the second component extracted by SPCA-TV in order to distinguish patients with visual hallucinations from those suffering mainly from auditory hallucinations.

The reconstruction error is significantly lower with SPCA-TV than with Sparse PCA (T = 13.9, p = 1.5·10⁻⁴), ElasticNet PCA (T = 7.1, p = 2.1·10⁻³), and GraphNet PCA (T = 4.6, p = 1.0·10⁻²). Moreover, when assessing the stability of the loading vectors across the folds, we found a significantly higher mean Dice index with SPCA-TV compared to Sparse PCA (p = 4.0·10⁻³), ElasticNet PCA (p = 4.0·10⁻³), and GraphNet PCA (p = 2.0·10⁻³), as presented in Tab. II. Additional details regarding the reconstruction accuracy on both the train and test sets and the Dice index are presented in Figure 5 of the supplementary materials (available in the supplementary files/multimedia tab).

TABLE II: Scores on the fMRI data, averaged across the 5 folds. We tested whether the averaged scores obtained with the existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notations: p ≤ 10⁻³, p ≤ 10⁻².

Methods         | Test data reconstruction error | Dice index
Sparse PCA      | 1515.2                         | 0.34
ElasticNet PCA  | 1482.7                         | 0.32
GraphNet PCA    | 1428.1                         | 0.58
SPCA-TV         | 1414.0                         | 0.63

In conclusion, SPCA-TV significantly outperforms Sparse, ElasticNet, and GraphNet PCA in terms of the reconstruction error on independent test data, and in the sense that its loading vectors are both more clinically interpretable and more stable.

We also evaluated the convergence speed of Sparse PCA, Mini-Batch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA, GraphNet PCA, and SPCA-TV on this functional MRI data set of n = 83 samples and p = 63,966 features. We compared the execution time required for each algorithm to achieve a given level of precision in Tab. III. Sparse PCA and ElasticNet PCA are similar in terms of convergence time, while Mini-Batch Sparse PCA is much faster but does not converge to high precision. As expected, the structured methods (GraphNet PCA and SPCA-TV) take longer than the other sparse methods because of the inclusion

Fig. 3: Loading vectors recovered from the 83 activation maps using Sparse PCA and SPCA-TV (components 1-3 for each method).


TABLE III: Comparison of the execution time required for each sparse method to reach the same precision. Times reported in seconds.

Methods               | 10    | 1      | 10⁻¹   | 10⁻²    | 10⁻³
Mini-batch Sparse PCA | 53.2  | -      | -      | -       | -
Sparse PCA            | 158.0 | 231.2  | 344.3  | 386.8   | 450.1
ElasticNet PCA        | 123.7 | 138.1  | 302.7  | 396.4   | 406.3
GraphNet PCA          | 301.9 | 521.6  | 813.1  | 881.4   | 888.4
SPCA-TV               | 427.7 | 2958.6 | 8093.0 | 13813.4 | 14459.9

of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but the convergence time is still reasonable for an fMRI data set with 65,000 voxels.

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients from the ADNI database with a diagnosis of mild cognitive impairment (MCI) who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13], and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, we could not apply the SSPCA method to this data set due to some intrinsic limitations of the method: SSPCA's application is restricted to N-dimensional array images. It does not support meshes of cortical surfaces such as the data set used here.

The loading vectors obtained from the data set with Sparse PCA and SPCA-TV are presented in Fig. 4. As expected, the Sparse PCA loadings are not easily interpretable, because the patterns are irregular and dispersed throughout the brain

surface. In contrast, SPCA-TV reveals structured and smooth clusters in relevant regions.

The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between patients resulting from a global cortical atrophy, as often observed in AD patients. The second loading vector includes variability in the entorhinal cortex, the hippocampus, and temporal regions. Last, the third loading vector might be related to the atrophy of the frontal lobe, and also captures variability in the precuneus. Thus, SPCA-TV provides a smooth map that closely matches the well-known brain regions involved in Alzheimer's disease [22].

Indeed, it is well documented that cortical atrophy progresses over three main stages in Alzheimer's disease ([10], [15]). The cortical structures are sequentially affected because of the accumulation of amyloid plaques. Cortical atrophy is first observed in the mild stage of the disease, in regions surrounding the hippocampus ([26], [44], [47]) and the entorhinal cortex ([12]), as seen in the second component. This is consistent with early memory deficits. Then, the disease progresses to a moderate stage, where atrophy gradually extends to the prefrontal association cortex, as revealed in the third component ([37]). In the severe stage of the disease, the whole cortex is affected by atrophy ([15]), as revealed in the first component. In order to assess the clinical significance of these weight maps, we tested the correlation between the scores corresponding to the three components and performance on a clinical test, ADAS. The Alzheimer's Disease Assessment Scale-Cognitive subscale is the most widely used general cognitive measure in AD; ADAS is scored in terms of errors, so a high score indicates poor performance. We obtained significant correlations between ADAS test performance and the components' scores (Fig. 5): r = −0.34, p = 4.2·10⁻¹¹ for the first component; r = −0.26, p = 3.6·10⁻⁷ for the second component; and r = −0.35, p = 4.0·10⁻¹² for the third component. The same behavior is observable for all three components: the ADAS score grows proportionately to the level to which a patient is affected and to the severity of the atrophy he presents (in the temporal pole, the prefrontal region, and also globally). Conversely, control subjects score low on the ADAS metric and present a low level of cortical atrophy. Therefore, SPCA-TV provides us with clear biomarkers that are perfectly relevant to the scope of Alzheimer's disease progression.

The reconstruction error is significantly lower with SPCA-TV than with Sparse PCA (T = 12.7, p = 2.1·10⁻⁴), ElasticNet PCA (T = 6.8, p = 2.3·10⁻³), and GraphNet PCA (T = 2.83, p = 4.7·10⁻²). The results are presented in Tab. IV. Moreover, when assessing the stability of the loading vectors across the folds, the mean Dice index is significantly higher with SPCA-TV than with the other methods. Additional details regarding the reconstruction accuracy on both the train and test sets and the Dice index are presented in Figure 7 of the supplementary materials (available in the supplementary files/multimedia tab).

IV. CONCLUSION

We proposed an extension of Sparse PCA that takes into account the spatial structure of the data. The optimization


TABLE IV: Scores are averaged across the 5 folds. We tested whether the average scores obtained with the existing PCA methods differ significantly from the scores obtained with SPCA-TV. Significance notation: ***: p ≤ 10⁻³, **: p ≤ 10⁻², *: p ≤ 10⁻¹.

Methods          Test-data reconstruction error   Dice index
Sparse PCA       29918                            0.44
ElasticNet PCA   28326                            0.43
GraphNet PCA     28136                            0.62
SPCA-TV          27950                            0.65

scheme is able to minimize any combination of the ℓ1, ℓ2 and TV penalties while preserving the exact ℓ1 penalty. We observe that SPCA-TV, in contrast to other existing sparse PCA methods, yields clinically interpretable results and reveals major sources of variability in the data, by highlighting structured clusters of interest in the loading vectors. Furthermore, SPCA-TV's loading vectors were more stable across the learning samples compared to the other methods. SPCA-TV was validated, and its applicability demonstrated, on three distinct data sets; we may reach the conclusion that SPCA-TV can be used on any kind of structured configuration and is able to present the structure within the data.

Fig. 4. Loading vectors (components 1 to 3) recovered from the 133 MCI patients using Sparse PCA and SPCA-TV.

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python library:
• URL: https://github.com/neurospin/pylearn-parsimony
• Description: ParsimonY is a Python library for structured and sparse machine learning. ParsimonY is open-source (BSD License) and compliant with the scikit-learn API.

b) Data sets and scripts:
• URL: ftp://ftp.cea.fr/pub/unati/brainomics/papers/pcatv
• Description: This URL provides the simulation data set and the Python script used to create Fig. 2 of the paper.

REFERENCES

[1] A. Abraham, E. Dohmatob, B. Thirion, D. Samaras, and G. Varoquaux. Extracting brain regions from rest fMRI with total-variation constrained dictionary learning. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2013), Nagoya, Japan, Proceedings Part II, pages 607–615, 2013.

[2] B. Fischl, M. Sereno, and A. Dale. Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195–207, 1999.

[3] A. Bakkour, J. C. Morris, and B. C. Dickerson. The cortical signature of prodromal AD: regional thinning predicts mild AD dementia. Neurology, 72:1048–1055, 2009.

[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Fig. 5. Correlation of component scores with ADAS test performance.


[5] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Processing, 18(11):2419–2434, 2009.

[6] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012.

[7] L. Bentaleb, M. Beauregard, P. Liddle, and E. Stip. Cerebral activity associated with auditory verbal hallucinations: a functional magnetic resonance imaging case study. Journal of Psychiatry & Neuroscience, 27(2):110, 2002.

[8] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2006.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[10] H. Braak and E. Braak. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathologica, 82(4):239–259, 1991.

[11] K. Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. 1909.

[12] V. A. Cardenas, L. L. Chao, C. Studholme, K. Yaffe, B. L. Miller, C. Madison, S. T. Buckley, D. Mungas, N. Schuff, and M. W. Weiner. Brain atrophy associated with baseline and longitudinal measures of cognition. Neurobiology of Aging, 32(4):572–580, 2011.

[13] A. Dale, B. Fischl, and M. I. Sereno. Cortical surface-based analysis I: Segmentation and surface reconstruction. NeuroImage, 9(2):179–194, 1999.

[14] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.

[15] A. Delacourte, J. P. David, N. Sergeant, L. Buee, A. Wattez, P. Vermersch, F. Ghozali, C. Fallet-Bianco, F. Pasquier, F. Lebert, et al. The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer's disease. Neurology, 52(6):1158–1158, 1999.

[16] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297–302, 1945.

[17] B. C. Dickerson, E. Feczko, J. C. Augustinack, J. Pacheco, J. C. Morris, and B. Fischl. Differential effects of aging and Alzheimer's disease on medial temporal lobe cortical thickness and surface area. Neurobiology of Aging, 30:432–440, 2009.

[18] E. Dohmatob, M. Eickenberg, B. Thirion, and G. Varoquaux. Speeding-up model-selection in GraphNet via early-stopping and univariate feature-screening. June 2015.

[19] M. Dubois, F. Hadj-Selem, T. Lofstedt, M. Perrot, C. Fischer, V. Frouin, and E. Duchesnay. Predictive support recovery with TV-Elastic Net penalties and logistic regression: an application to structural MRI. In Proceedings of the Fourth International Workshop on Pattern Recognition in Neuroimaging (PRNI 2014), 2014.

[20] H. Eavani, T. D. Satterthwaite, R. E. Filipovych, R. C. Gur, and C. Davatzikos. Identifying sparse connectivity patterns in the brain using resting-state fMRI. NeuroImage, 105:286–299, 2015.

[21] D. Felleman and D. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47, 1991.

[22] G. B. Frisoni, N. C. Fox, C. R. Jack, P. Scheltens, and P. M. Thompson. The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol., 6(2):67–77, 2010.

[23] L. Grosenick, B. Klingenberg, K. Katovich, B. Knutson, and J. Taylor. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage, 72:304–321, May 2013.

[24] R. Guo, M. Ahn, H. Zhu, and the Alzheimer's Disease Neuroimaging Initiative. Spatially weighted principal component analysis for imaging classification. Journal of Computational and Graphical Statistics, 24:274–296, 2015.

[25] F. Hadj-Selem, T. Lofstedt, V. Frouin, V. Guillemot, and E. Duchesnay. An iterative smoothing algorithm for regression with structured sparsity. arXiv:1605.09658 [stat], 2016.

[26] C. R. Jack, M. M. Shiung, J. L. Gunter, P. C. O'Brien, S. D. Weigand, D. S. Knopman, B. F. Boeve, R. J. Ivnik, G. E. Smith, R. H. Cha, et al. Comparison of different MRI brain atrophy rate measures with clinical disease progression in AD. Neurology, 62(4):591–600, 2004.

[27] R. Jardri, A. Pouchet, D. Pins, and P. Thomas. Cortical activations during auditory verbal hallucinations in schizophrenia: a coordinate-based meta-analysis. American Journal of Psychiatry, 168(1):73–81, 2011.

[28] R. Jardri, P. Thomas, C. Delmaire, P. Delion, and D. Pins. The neurodynamic organization of modality-dependent hallucinations. Cerebral Cortex, pages 1108–1117, 2013.

[29] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[30] I. Jolliffe, N. Trendafilov, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.

[31] M. Journee, Y. Nesterov, P. Richtarik, and R. Sepulchre. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res., 11:517–553, 2010.

[32] B. Kandel, D. Wolk, J. Gee, and B. Avants. Predicting cognitive data from medical images using sparse linear regression. Information Processing in Medical Imaging: proceedings of the conference, 23:86–97, 2013.

[33] M. Li, Y. Liu, F. Chen, and D. Hu. Including signal intensity increases the performance of blind source separation on brain imaging data. IEEE Transactions on Medical Imaging, 34(2):551–563, 2015.

[34] L. W. Mackey. Deflation methods for sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017–1024. Curran Associates, Inc., 2009.

[35] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, Ecole Normale Superieure de Cachan, 2010.

[36] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19–60, 2010.

[37] C. R. McDonald, L. McEvoy, L. Gharapetian, C. Fennema-Notestine, D. J. Hagler, D. Holland, A. Koyama, J. B. Brewer, A. M. Dale, Alzheimer's Disease Neuroimaging Initiative, et al. Regional rates of neocortical atrophy from normal aging to early Alzheimer disease. Neurology, 73(6):457–465, 2009.

[38] H. Mohr, U. Wolfensteller, S. Frimmel, and H. Ruge. Sparse regularization techniques provide novel insights into outcome integration processes. NeuroImage, 104:163–176, January 2015.

[39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[40] B. Ng, A. Vahdat, G. Hamarneh, and R. Abugharbieh. Generalized sparse classifiers for decoding cognitive states in fMRI. In SpringerLink, pages 108–115, Beijing, China, September 2012. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-15948-0_14.

[41] R. Nieuwenhuys. The myeloarchitectonic studies on the human cerebral cortex of the Vogt–Vogt school, and their significance for the interpretation of functional neuroimaging data. Brain Structure and Function, 218(2):303–352, 2013.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[43] M. Ramezani, K. Marble, H. Trang, I. S. Johnsrude, and P. Abolmaesumi. Joint sparse representation of brain activity patterns in multi-task fMRI data. IEEE Transactions on Medical Imaging, 34(1):2–12, 2015.

[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al. Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567–574, 2008.

[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang, and D. Hu. Making group inferences using sparse representation of resting-state functional mri data with application to sleep deprivation. Human Brain Mapping, 38(9):4671–4689, 2017.

[46] J. G. Sled, A. P. Zijdenbos, and A. C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17:87–97, 1998.

[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell, and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754–1766, 2004.

[48] W.-T. Wang and H.-C. Huang. Regularized principal component analysis for spatial data. ArXiv e-prints, 2015.

[49] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

[50] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719–752, 2012.

[51] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.


discarded. We first detail the notation for estimating a single component (Section II-A) and its solution using an alternating minimization pipeline (Section II-B). Then, we develop the TV regularization framework (Sections II-C and II-D). Last, we discuss the algorithm used to solve the minimization problem and its ability to converge toward stable pairs of components/loading vectors (Sections II-E and II-F).

A. Single component computation

Given a pair of loading/component vectors u ∈ R^N, v ∈ R^P, the best rank-1 approximation of the problem given in Eq. (2) is equivalent [49] to

$$\min_{u,v} f \equiv \overbrace{\underbrace{-\tfrac{1}{N}\,u^\top X v + \lambda_2\|v\|_2^2}_{l(v)}}^{\text{smooth}} + \overbrace{\underbrace{\lambda_1\|v\|_1}_{h(v)}}^{\text{non-smooth}} + \underbrace{\lambda\sum_{g\in G}\|A_g v\|_2}_{s(v)} \quad (3)$$

$$\text{s.t. } \|u\|_2^2 \le 1,$$

where l(v) is the penalized smooth (i.e., differentiable) loss, h(v) is a sparsity-inducing penalty whose proximal operator is known, and s(v) is a complex penalty on the structure of the input variables, with an unknown proximal operator.

This problem is convex in u and in v, but not in (u, v).

B. Alternating minimization of the bi-convex problem

The objective function to minimize is bi-convex [9]. The most common approach to solve a bi-convex optimization problem (which does not guarantee global optimality of the solution) is to alternately update u and v by fixing one of them at a time and solving the corresponding convex optimization problem on the other parameter vector.

On the one hand, when v is fixed, the problem to solve is

$$\min_{u\in\mathbb{R}^N} -\tfrac{1}{N}\,u^\top X v \quad \text{s.t. } \|u\|_2^2 \le 1, \quad (4)$$

with the associated explicit solution

$$u^*(v) = \frac{Xv}{\|Xv\|_2}. \quad (5)$$

On the other hand, solving the equation with respect to v with a fixed u presents a higher level of difficulty, which will be discussed in Section II-E.
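The u-update of Eq. (5) is a single matrix-vector product followed by a normalization; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def update_u(X, v):
    """Closed-form u-update of Eq. (5): u*(v) = Xv / ||Xv||_2."""
    Xv = X @ v
    return Xv / np.linalg.norm(Xv)

# toy usage: the returned u has unit norm, so ||u||_2^2 <= 1 holds
rng = np.random.default_rng(0)
u = update_u(rng.standard_normal((5, 8)), rng.standard_normal(8))
```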

C. Reformulating TV as a linear operator

Before discussing the minimization with respect to v, we provide details on the encoding of the spatial structure within the s(v) penalty.

It is essential to note that the algorithm is independent of the spatial structure of the data. All the structural information is encoded in a linear operator A that is computed outside of the algorithm. Thus, the algorithm has the ability to address various structured data and, most importantly, other penalties than just the TV penalty. The algorithm requires the setting of two parameters: (i) the linear operator A; (ii) a projection function, detailed in Eq. (12).

This section presents the formulation and the design of A in the specific case of a TV penalty applied to the loading vector v, measured on a 3-dimensional (3D) image or a 2D mesh of the cortical surface.

1) 3D image: The brain mask is used to establish a mapping g(i, j, k) between the coordinates (i, j, k) in the 3D grid and an index g ∈ [[1, P]] in the collapsed image. We extract the spatial neighborhood of g, of size ≤ 4, corresponding to voxel g and its 3 neighboring voxels within the mask in the i, j and k directions. By definition we have

$$\mathrm{TV}(v) \equiv \sum_{g=1}^{P} \left\|\nabla\left(v_{g(i,j,k)}\right)\right\|_2. \quad (6)$$

The first order approximation of the spatial gradient ∇(v_{g(i,j,k)}) is computed by applying the linear operator A′_g ∈ R^{3×4} to the loading vector v_g in the spatial neighborhood of g, i.e.,

$$\nabla\left(v_{g(i,j,k)}\right) = \underbrace{\begin{bmatrix} -1 & 1 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{bmatrix}}_{A'_g} \underbrace{\begin{bmatrix} v_{g(i,j,k)} \\ v_{g(i+1,j,k)} \\ v_{g(i,j+1,k)} \\ v_{g(i,j,k+1)} \end{bmatrix}}_{v_g}, \quad (7)$$

where v_{g(i,j,k)} is the loading coefficient at index g in the collapsed image, corresponding to voxel (i, j, k) in the 3D image. Then, A′_g is extended using zeros to a large but very sparse matrix A_g ∈ R^{3×P}, in order to be directly applied to the full vector v. If some neighbors lie outside the mask, the corresponding rows in A_g are removed. Noticing that for TV there is one group per voxel in the mask (G = [[1, P]]), we can reformulate TV from Eq. (6) using a general expression:

$$\mathrm{TV}(v) = \sum_{g\in G}\|A_g v\|_2. \quad (8)$$

Finally, with a vertical concatenation of all the A_g matrices, we obtain the full linear operator A ∈ R^{3P×P} that will be used in Section II-E.
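As a sketch (not the ParsimonY implementation), the operator A for a 3D brain mask can be assembled as a sparse matrix by stacking, for each in-mask voxel, the forward differences toward its i, j and k neighbors; out-of-mask neighbors are left as all-zero rows rather than removed, which does not change Eq. (8):

```python
import numpy as np
import scipy.sparse as sp

def tv_linear_operator(mask):
    """Assemble A in R^{3P x P} (Eq. (8)): rows 3g..3g+2 hold the forward
    differences of voxel g toward its i, j, k neighbors inside the mask."""
    idx = -np.ones(mask.shape, dtype=int)
    idx[mask] = np.arange(mask.sum())            # voxel (i,j,k) -> index g
    P = int(mask.sum())
    rows, cols, vals = [], [], []
    for i, j, k in zip(*np.nonzero(mask)):
        g = idx[i, j, k]
        for d, nb in enumerate(((i + 1, j, k), (i, j + 1, k), (i, j, k + 1))):
            inside = all(c < s for c, s in zip(nb, mask.shape))
            if inside and mask[nb]:
                rows += [3 * g + d, 3 * g + d]
                cols += [idx[nb], g]
                vals += [1.0, -1.0]              # v_neighbor - v_g
    return sp.csr_matrix((vals, (rows, cols)), shape=(3 * P, P))
```

With this row ordering, TV(v) is recovered as the sum of the ℓ2 norms of consecutive triplets of A v, and a spatially constant image yields TV(v) = 0.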

2) Mesh of cortical surface: The linear operator A′_g used to compute a first order approximation of the spatial gradient can be obtained by examining the neighboring vertices of each vertex g. With common triangle-tessellated surfaces, the neighborhood size is ≤ 7 (including g). In this setting, we have A′_g ∈ R^{3×7}, which can be extended and concatenated to obtain the full linear operator A.

D. Nesterov's smoothing of the structured penalty

We consider the convex non-smooth minimization of Eq. (3) with respect to v, where u is thus fixed. This problem includes a general structured penalty s(·) that covers the specific case of TV. A widely used approach when dealing with non-smooth problems is to use methods based on the proximal operator of the penalties. For the ℓ1 penalty alone, the proximal operator is analytically known, and efficient iterative algorithms such as ISTA and FISTA are available (see [4]). However, since the proximal operator of the TV+ℓ1 penalty is not closed-form, a standard implementation of those algorithms is not suitable. In order to overcome this barrier, we used Nesterov's smoothing technique [39]. It consists of approximating the non-smooth penalties for which the proximal operator is unknown (e.g., TV) with a smooth function (of which the gradient is known). Non-smooth penalties with known proximal operators (e.g., ℓ1) are not affected. Hence, as described in [50], it allows us to use an exact accelerated proximal gradient algorithm. Thus, we can solve the PCA problem penalized by TV and elastic net, where an exact ℓ1 penalty is used.

Using the dual norm of the ℓ2-norm (which happens to be the ℓ2-norm too), Eq. (8) can be reformulated as

$$s(v) = \sum_{g\in G}\|A_g v\|_2 = \sum_{g\in G}\ \max_{\|\alpha_g\|_2\le 1} \alpha_g^\top A_g v, \quad (9)$$

where α_g ∈ K_g = {α_g ∈ R³ : ‖α_g‖₂ ≤ 1} is a vector of auxiliary variables in the ℓ2 unit ball, associated with A_g v. As with A ∈ R^{3P×P}, which is the vertical concatenation of all the A_g, we concatenate all the α_g to form α ∈ K = {[α₁ᵀ, …, α_Pᵀ]ᵀ : α_g ∈ K_g} ⊂ R^{3P}. K is the Cartesian product of 3D unit balls in Euclidean space and therefore a compact convex set. Eq. (9) can further be written as

$$s(v) = \max_{\alpha\in K} \alpha^\top A v. \quad (10)$$

Given this formulation of s(v), we can apply Nesterov's smoothing. For a given smoothing parameter μ > 0, the s(v) function is approximated by the smooth function

$$s_\mu(v) = \max_{\alpha\in K}\left\{\alpha^\top A v - \frac{\mu}{2}\|\alpha\|_2^2\right\}, \quad (11)$$

for which lim_{μ→0} s_μ(v) = s(v). Nesterov [39] demonstrates this convergence using the inequality in Eq. (15). The value of α*_μ(v) = [α*ᵀ_{μ,1}, …, α*ᵀ_{μ,g}, …, α*ᵀ_{μ,P}]ᵀ that maximizes Eq. (11) is the concatenation of projections of the vectors A_g v ∈ R³ onto the ℓ2 ball K_g: α*_{μ,g}(v) = proj_{K_g}(A_g v / μ), where

$$\mathrm{proj}_{K_g}(x) = \begin{cases} x & \text{if } \|x\|_2 \le 1,\\ \dfrac{x}{\|x\|_2} & \text{otherwise.} \end{cases} \quad (12)$$

The function s_μ, i.e., the Nesterov smooth transform of s, is convex and differentiable. Its gradient, given by [39]

$$\nabla(s_\mu)(v) = A^\top\alpha^*_\mu(v), \quad (13)$$

is Lipschitz-continuous with constant

$$L\big(\nabla(s_\mu)\big) = \frac{\|A\|_2^2}{\mu}, \quad (14)$$

where ‖A‖₂ is the matrix spectral norm of A. Moreover, Nesterov [39] provides the following inequality relating s_μ and s:

$$s_\mu(v) \le s(v) \le s_\mu(v) + \mu M, \quad \forall v\in\mathbb{R}^P, \quad (15)$$

where M = max_{α∈K} ‖α‖₂²/2 = P/2.

Thus, a new (smoothed) optimization problem, closely related to Eq. (3) (with fixed u), arises from this regularization as

$$\min_v \overbrace{\underbrace{-\tfrac{1}{n}\,u^\top X v + \lambda_2\|v\|_2^2}_{l(v)} + \lambda\underbrace{\Big(\alpha^*_\mu(v)^\top A v - \tfrac{\mu}{2}\|\alpha^*\|_2^2\Big)}_{s_\mu(v)}}^{\text{smooth}} + \lambda_1\overbrace{\underbrace{\|v\|_1}_{h(v)}}^{\text{non-smooth}}. \quad (16)$$

Since we are now able to explicitly compute the gradient of the smooth part, ∇(l + λs_μ) (Eq. (18)), its Lipschitz constant (Eq. (19)), and also the proximal operator of the non-smooth part, we have all the ingredients necessary to solve this minimization problem using an accelerated proximal gradient method [4]. Given a starting point v⁰ and a smoothing parameter μ, FISTA (Algorithm 1) minimizes the smoothed problem and reaches a prescribed precision ε_μ.

However, in order to control the convergence of the algorithm (presented in Section II-E1), we introduce the Fenchel dual function and the corresponding dual gap of the objective function. The Fenchel duality requires the loss to be strongly convex, which is why we further reformulate Eq. (16) slightly: all penalty terms are divided by λ₂ and, using the following equivalent formulation for the loss, we obtain the minimization problem

$$\min_v f_\mu \equiv \frac{1}{2}\left\|v - \frac{X^\top u}{n\lambda_2}\right\|_2^2 + \frac{1}{2}\|v\|_2^2 + \frac{\lambda}{\lambda_2}\Big(\alpha^*_\mu(v)^\top A v - \frac{\mu}{2}\|\alpha^*\|_2^2\Big) + \frac{\lambda_1}{\lambda_2}\|v\|_1, \quad (17)$$

where L(v) denotes the first (squared loss) term, l(v) = L(v) + ½‖v‖₂² is the smooth loss, the third term is (λ/λ₂)s_μ(v), the last term is (λ₁/λ₂)h(v), and ψ_μ(v) = ½‖v‖₂² + (λ/λ₂)s_μ(v) + (λ₁/λ₂)‖v‖₁ gathers all the penalty terms.

This new formulation of the smoothed objective function (noted f_μ) preserves the decomposition of f_μ into a sum of a smooth term l + (λ/λ₂)s_μ and a non-smooth term h. Such a decomposition is required for the application of FISTA with Nesterov's smoothing. Moreover, this formulation provides a decomposition of f_μ into a sum of a smooth loss L and a penalty term ψ_μ, required for the calculation of the gap presented in Section II-E1.

We provide all the required quantities to minimize Eq. (17) using Algorithm 1. Using Eq. (13), we compute the gradient of the smooth part as

$$\nabla\Big(l + \frac{\lambda}{\lambda_2}s_\mu\Big) = \nabla(l) + \frac{\lambda}{\lambda_2}\nabla(s_\mu) = \Big(2v - \frac{X^\top u}{n\lambda_2}\Big) + \frac{\lambda}{\lambda_2}A^\top\alpha^*_\mu(v^k), \quad (18)$$

and its Lipschitz constant (using Eq. (14)):

$$L\Big(\nabla\Big(l + \frac{\lambda}{\lambda_2}s_\mu\Big)\Big) = 2 + \frac{\lambda}{\lambda_2}\frac{\|A\|_2^2}{\mu}. \quad (19)$$


Algorithm 1 FISTA(Xᵀu, v⁰, ε_μ, μ, A, λ, L(∇(g)))
1: v¹ = v⁰, k = 2
2: Compute the gradient of the smooth part, ∇(g + λs_μ) (Eq. (18)), and its Lipschitz constant L_μ (Eq. (19))
3: Compute the step size t_μ = L_μ⁻¹
4: repeat
5:   z = v^{k−1} + ((k−2)/(k+1))(v^{k−1} − v^{k−2})
6:   v^k = prox_h(z − t_μ ∇(g + λs_μ)(z))
7: until GAP_μ(v^k) ≤ ε_μ
8: return v^k
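The iteration of Algorithm 1 can be sketched generically; here h is the ℓ1 norm, whose proximal operator is soft-thresholding, and a fixed iteration budget stands in for the duality-gap stopping test of Line 7 (names and the toy problem are ours):

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista(grad_smooth, lipschitz, lam1, v0, n_iter=100):
    """FISTA (Algorithm 1) with step t = 1/L and the l1 prox;
    grad_smooth(v) returns the gradient of the smooth part."""
    t = 1.0 / lipschitz
    v_prev, v = v0.copy(), v0.copy()
    for k in range(2, n_iter + 2):
        z = v + (k - 2) / (k + 1) * (v - v_prev)          # momentum step (Line 5)
        v_prev, v = v, prox_l1(z - t * grad_smooth(z), t * lam1)  # Line 6
    return v

# usage on a tiny lasso-like problem: min_v 0.5||v - b||^2 + lam1 ||v||_1,
# whose solution is the soft-thresholding of b
b = np.array([3.0, -0.2, 0.5])
v = fista(lambda v: v - b, lipschitz=1.0, lam1=1.0, v0=np.zeros(3))
```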

E. Minimization of the loading vectors with CONESTA

The step size t_μ computed in Line 3 of Algorithm 1 depends on the smoothing parameter μ (see Eq. (19)). Hence, there is a trade-off between speed and precision. Indeed, high precision with a small μ will lead to slow convergence (small t_μ). Conversely, poor precision (large μ) will lead to rapid convergence (large t_μ). Thus, we propose a continuation approach (Algorithm 2), which decreases the smoothing parameter with respect to the distance to the minimum. On the one hand, when we are far from v* (the minimum of Eq. (17)), we can use a large μ to rapidly decrease the objective function. On the other hand, when we are close to v*, we need a small μ in order to obtain an accurate approximation of the original objective function.

1) Duality gap: The distance to the unknown f(v*) is estimated using the duality gap. Duality formulations are often used to control the achieved precision level when minimizing convex functions. They provide an estimation of the error f(v^k) − f(v*) for any v, when the minimum is unknown. The duality gap is the cornerstone of the CONESTA algorithm. Indeed, it is used three times:

(i) As the stopping criterion in the inner FISTA loop (Line 7 in Algorithm 1). FISTA will stop as soon as the current precision is achieved using the current smoothing parameter μ. This prevents unnecessary convergence toward the approximated (smoothed) objective function.

(ii) In the i-th CONESTA iteration, as a way to estimate the current error f(v^i) − f(v*) (Line 7 in Algorithm 2). The error is estimated using the gap of the smoothed problem, GAP_{μ=μ^i}(v^{i+1}), which avoids unnecessary computation since it has already been computed during the last iteration of FISTA. The inequality in Eq. (15) is used to obtain the gap ε^i to the original non-smoothed problem. The next desired precision, ε^{i+1}, and the smoothing parameter, μ^{i+1}, are derived from this value.

(iii) Finally, as the global stopping criterion within CONESTA (Line 10 in Algorithm 2). This guarantees that the obtained approximation of the minimum, v^i, at convergence satisfies f(v^i) − f(v*) < ε.

Based on Eq. (17), which decomposes the smoothed objective function into a sum of a strongly convex loss and a penalty,

$$f_\mu(v) = L(v) + \psi_\mu(v),$$

we compute the duality gap that provides an upper bound estimation of the error to the optimum. At any step k of the algorithm, given the current primal v^k and dual σ(v^k) ≡ ∇L(v^k) variables [8], we can compute the duality gap using the Fenchel duality rules [35]:

$$\mathrm{GAP}(v^k) \equiv f_\mu(v^k) + L^*\big(\sigma(v^k)\big) + \psi^*_\mu\big(-\sigma(v^k)\big), \quad (20)$$

where L* and ψ*_μ are, respectively, the Fenchel conjugates of L and ψ_μ. Denoting by v* the minimum of f_μ (the solution of Eq. (17)), the interest of the duality gap is that it provides an upper bound for the difference with the optimal value of the function. Moreover, it vanishes at the minimum:

$$\mathrm{GAP}(v^k) \ge f(v^k) - f(v^*) \ge 0, \qquad \mathrm{GAP}(v^*) = 0. \quad (21)$$

The dual variable is

$$\sigma(v^k) \equiv \nabla L(v^k) = v - \frac{X^\top u}{n\lambda_2}, \quad (22)$$

and the Fenchel conjugate of the squared loss L(v^k) is

$$L^*(\sigma(v^k)) = \frac{1}{2}\|\sigma(v^k)\|_2^2 + \sigma(v^k)^\top\frac{X^\top u}{n\lambda_2}. \quad (23)$$

In [25], the authors provide the expression of the Fenchel conjugate of the penalty ψ_μ(v^k):

$$\psi^*_\mu(-\sigma(v^k)) = \frac{1}{2}\sum_{j=1}^{P}\left[\,\Big|-\sigma(v^k)_j - \frac{\lambda}{\lambda_2}\big(A^\top\alpha^*_\mu(v^k)\big)_j\Big| - \frac{\lambda_1}{\lambda_2}\right]_+^2 + \frac{\lambda\mu}{2\lambda_2}\left\|\alpha^*_\mu(v^k)\right\|_2^2, \quad (24)$$

where [·]₊ = max(0, ·).

The expression of the duality gap in Eq. (20) provides an estimation of the distance to the minimum. This distance is geometrically decreased by a factor τ = 0.5 at the end of each continuation, and the decreased value defines the precision that should be reached by the next iteration (Line 8 of Algorithm 2). Thus, the algorithm dynamically generates a sequence of decreasing prescribed precisions ε^i. Such a scheme ensures convergence [25] towards a globally desired final precision, ε, which is the only parameter that the user needs to provide.

2) Determining the optimal smoothing parameter: Given the current prescribed precision ε^i, we need to compute an optimal smoothing parameter μ_opt(ε^i) (Line 9 in Algorithm 2) that minimizes the number of FISTA iterations needed to achieve such precision when minimizing Eq. (3) (with fixed u) via Eq. (17) (i.e., such that f(v^{(k)}) − f(v*) < ε^i).

In [25], the authors provide the expression of this optimal smoothing parameter:

$$\mu_{\mathrm{opt}}(\varepsilon^i) = \frac{-\lambda M\|A\|_2^2 + \sqrt{(\lambda M\|A\|_2^2)^2 + M\,L(\nabla(l))\,\|A\|_2^2\,\varepsilon^i}}{M\,L(\nabla(l))}, \quad (25)$$

where M = P/2 (Eq. (15)) and L(∇(l)) = 2 is the Lipschitz constant of the gradient of l, as defined in Eq. (17).
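Eq. (25) is a one-line computation once ‖A‖₂² is known; a sketch (function and argument names are ours):

```python
import numpy as np

def mu_opt(eps, lam, norm_A_sq, P, L_grad_l=2.0):
    """Optimal smoothing parameter of Eq. (25); norm_A_sq is ||A||_2^2,
    M = P/2 (Eq. (15)) and L(grad l) = 2 for the loss of Eq. (17)."""
    M = P / 2.0
    a = lam * M * norm_A_sq
    return (-a + np.sqrt(a ** 2 + M * L_grad_l * norm_A_sq * eps)) / (M * L_grad_l)
```

As expected from the continuation scheme, μ_opt is positive and shrinks as the prescribed precision ε^i tightens.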


We call the resulting algorithm CONESTA (short for COntinuation with NEsterov smoothing in a Shrinkage-Thresholding Algorithm). It is presented in detail, with convergence proofs, in [25].

Let K be the total number of FISTA loops used in CONESTA; then we have experimentally verified that the convergence rate to the solution of Eq. (16) is O(1/K²) (which is the optimal convergence rate for first-order methods). Also, the algorithm works even if some of the weights λ₁ or λ are zero, which thus allows us to solve, e.g., the elastic net using CONESTA. Note that it has been rigorously proved that the continuation technique improves the convergence rate compared to simple smoothing using a single value of μ. Indeed, it has been demonstrated in [6] (see also [50]) that the convergence rate obtained with a single value of μ, even optimised, is O(1/K²) + O(1/K). However, it has recently been proved in [25] that the CONESTA algorithm achieves O(1/K) for general convex functions.

We note that CONESTA could easily be adapted to many other penalties. For example, to add the group lasso (GL) constraint to our structure, we just have to design a specific linear operator A_GL and concatenate it to the actual linear operator A.

Algorithm 2 CONESTA(Xᵀu, ε)
1: Initialize v⁰ ∈ R^P
2: ε⁰ = τ · GAP_{μ=10⁻⁸}(v⁰)
3: μ⁰ = μ_opt(ε⁰)
4: repeat
5:   ε^i_μ = ε^i − μ^i γM
6:   v^{i+1} = FISTA(Xᵀu, v^i, ε^i_μ, …)
7:   ε^i = GAP_{μ=μ^i}(v^{i+1}) + μ^i γM
8:   ε^{i+1} = τ · ε^i
9:   μ^{i+1} = μ_opt(ε^{i+1})
10: until ε^i ≤ ε
11: return v^{i+1}

F. The algorithm for the SPCA-TV problem

The computation of a single component through SPCA-TV can be achieved by combining CONESTA and Eq. (5) within an alternating minimization loop. Mackey [34] demonstrated that further components can be efficiently obtained by incorporating this single-unit procedure in a deflation scheme, as done in, e.g., [14], [31]. The stopping criterion is defined as

$$\mathrm{STOPPINGCRITERION} = \frac{\left\|X_k - u^{i+1}v^{i+1\top}\right\|_F - \left\|X_k - u^{i}v^{i\top}\right\|_F}{\left\|X_k - u^{i+1}v^{i+1\top}\right\|_F}. \quad (26)$$

All the presented building blocks were combined into Algorithm 3 to solve the SPCA-TV problem.

III. EXPERIMENTS

We evaluated the performance of SPCA-TV using three experiments: one simulation study carried out on a synthetic

Algorithm 3 SPCA-TV(X, ε)
1: X⁰ = X
2: for all k = 0, …, K do  ▷ Components
3:   Initialize u⁰ ∈ R^N
4:   repeat  ▷ Alternating minimization
5:     v^{i+1} = CONESTA(X_kᵀu^i, ε)
6:     u^{i+1} = X_k v^{i+1} / ‖X_k v^{i+1}‖₂
7:   until STOPPINGCRITERION ≤ ε
8:   v^{k+1} = v^{i+1}
9:   u^{k+1} = u^{i+1}
10:  X^{k+1} = X^k − u^{k+1}v^{k+1ᵀ}  ▷ Deflation
11: end for
12: return U = [u¹, …, u^K], V = [v¹, …, v^K]

data set, and two on neuroimaging data sets. In order to compare the performance of SPCA-TV with existing sparse PCA models, we also included results obtained with Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA from [29]. We used the scikit-learn implementation [42] for Sparse PCA, while we used the ParsimonY package (https://github.com/neurospin/pylearn-parsimony) for the ElasticNet, GraphNet PCA and SPCA-TV methods. Concerning SSPCA, we used the MATLAB implementation provided in [29].

The number of parameters to set for each method is different. For Sparse PCA, the λ₁ parameter selects its optimal value from the range {0.1, 1.0, 5.0, 10.0}. ElasticNet PCA requires the setting of the λ₁ and λ₂ penalty weights. Meanwhile, GraphNet PCA and SPCA-TV require the setting of an additional parameter, namely the spatial constraint penalty λ. We operated a re-parametrization of these penalty weights in ratios: a global parameter α ∈ {0.01, 0.1, 1.0} controls the weight attributed to the whole penalty term, including the spatial and the ℓ1 regularization. Individual constraints are expressed in terms of ratios: the ℓ1 ratio, λ₁/(λ₁ + λ₂ + λ) ∈ {0.1, 0.5, 0.8}, and the ℓTV (or ℓGN for GraphNet) ratio, λ/(λ₁ + λ₂ + λ) ∈ {0.1, 0.5, 0.8}. For ElasticNet, we explore the grid of parameters composed of the Cartesian product of the α and ℓ1-ratio subsets. For GraphNet PCA and SPCA-TV, we perform a parameter search on a grid given by the Cartesian product of, respectively, the (α, ℓ1, ℓGN) subsets and the (α, ℓ1, ℓTV) subsets. Concerning the SSPCA method, the regularization parameter selects its optimal value in the range [10⁻⁸, 10⁸].
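To make the re-parametrization concrete, here is one way such a grid of (λ₁, λ₂, λ) triplets could be generated; the ratio sets come from the text, while the feasibility filter ℓ1 + ℓTV ≤ 1 (so that the ℓ2 ratio takes the remainder) is our assumption:

```python
from itertools import product

alphas = [0.01, 0.1, 1.0]       # global penalty weight alpha
l1_ratios = [0.1, 0.5, 0.8]     # lambda1 / (lambda1 + lambda2 + lambda)
tv_ratios = [0.1, 0.5, 0.8]     # lambda  / (lambda1 + lambda2 + lambda)

grid = []
for alpha, l1, tv in product(alphas, l1_ratios, tv_ratios):
    if l1 + tv <= 1.0:                      # assumed feasibility constraint
        grid.append((alpha * l1,            # lambda1
                     alpha * (1.0 - l1 - tv),  # lambda2 (remaining ratio)
                     alpha * tv))           # lambda (TV or GraphNet)
```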

However, in order to ensure that the extracted components have a minimum amount of sparsity, we also included a criterion controlling sparsity: at least half of the features of the components have to be zero. For both real neuroimaging experiments, performance was evaluated through a 5-fold × 5-fold double cross-validation pipeline. The double cross-validation process consists of two nested cross-validation loops, referred to as the internal and external cross-validation loops. In the outer (external) loop, all samples are randomly split into subsets referred to as training and test sets. The test sets are used exclusively for model assessment, while the training sets are used in the inner (internal) loop for model fitting

0278-0062 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2017.2749140, IEEE Transactions on Medical Imaging.

IEEE TRANSACTIONS ON MEDICAL IMAGING 7

and selection. The inner folds select the set of parameters minimizing the reconstruction error on the outer fold. For the synthetic data, we used 50 different purposely-generated data sets and 5 inner folds for parameter selection. In order to evaluate the reconstruction accuracy of the methods, we report the mean Frobenius norm of the reconstruction error across the folds/data sets on independent test data. The hypothesis we wanted to test was whether there was a substantial decrease in the reconstruction error on independent data when using SPCA-TV compared to Sparse PCA, ElasticNet PCA, GraphNet PCA, and SSPCA. It was tested through a related two-sample t-test. The choice to compare method performance on independent test data was motivated by the fact that the optimal reconstruction of the training set is necessarily hindered by the spatial and sparsity constraints. We therefore expect SPCA-TV to perform worse on training data than other, less constrained methods. However, the TV penalty has a more important purpose than just minimizing the reconstruction error: the estimation of coherent and reproducible loadings. Indeed, clinicians expect that if images from other patients with comparable clinical conditions had been used, the extracted loading vectors would have turned out to be similar. Therefore, since the ultimate goal of SPCA-TV is to yield stable and reproducible weight maps, it is more relevant to evaluate methods on independent test data.
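The test-set metric described above can be sketched as follows. Since sparse/TV-penalized loadings are generally not orthonormal, a least-squares projection is used to compute the scores on held-out data before measuring the Frobenius residual; the function name is ours and this is an illustration, not the paper's exact evaluation code.

```python
import numpy as np

def reconstruction_error(X_test, V):
    """Frobenius norm of the held-out reconstruction error for a
    loading matrix V (p x K): project each test sample onto span(V)
    with least squares, then measure the residual."""
    scores, *_ = np.linalg.lstsq(V, X_test.T, rcond=None)  # K x n scores
    X_hat = (V @ scores).T                                 # n x p reconstruction
    return np.linalg.norm(X_test - X_hat, ord="fro")
```

A full-rank V reconstructs the test data exactly (error ≈ 0), while fewer components leave a positive residual bounded by ||X_test||_F.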

The stability of the loading vectors obtained across various training data sets (variation in the learning samples) was assessed through a similarity measure: the pairwise Dice index between loading vectors obtained with different folds/data sets [16]. We tested whether pairwise Dice indices are significantly higher for SPCA-TV compared to the other methods. Testing this hypothesis is equivalent to testing the sign of the difference of pairwise Dice indices between methods. However, since the pairwise Dice indices are not independent from one another (the folds share many of their learning samples), direct significance measures are biased. We therefore used permutation testing to estimate empirical p-values. The null hypothesis was tested by simulating samples from the null distribution: we generated 1,000 random permutations of the sign of the difference of pairwise Dice indices between the PCA methods under comparison, and the statistics on the true data were then compared with those obtained on the reshuffled data to obtain empirical p-values.
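The Dice index and the sign-permutation test above can be sketched as follows; this is a minimal illustration (function names are ours), assuming the Dice index is computed between the non-zero supports of two loading vectors and that the permutation test flips the sign of each paired difference.

```python
import numpy as np

def dice(v1, v2):
    """Dice index between the supports (non-zero patterns) of two
    loading vectors."""
    s1, s2 = v1 != 0, v2 != 0
    denom = s1.sum() + s2.sum()
    return 2.0 * (s1 & s2).sum() / denom if denom else 1.0

def sign_flip_pvalue(diffs, n_perm=1000, seed=0):
    """One-sided permutation test on the mean of paired differences of
    Dice indices: randomly flip the sign of each difference to sample
    the null distribution of 'no method effect'."""
    rng = np.random.default_rng(seed)
    obs = np.mean(diffs)
    null = [np.mean(diffs * rng.choice([-1, 1], size=len(diffs)))
            for _ in range(n_perm)]
    return (1 + sum(n >= obs for n in null)) / (n_perm + 1)
```

Identical supports give a Dice index of 1, disjoint supports give 0, and consistently positive paired differences yield a small empirical p-value.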

For each experiment, we made the initial choice to retrieve the first ten components. However, given the length constraint, we only present the weight maps associated with the top three components for Sparse PCA and SPCA-TV in this paper. The weight maps of ElasticNet PCA, GraphNet PCA, and SSPCA are presented in the supplementary materials (available in the supplementary files/multimedia tab).

A. Simulation study

We generated 50 sets of synthetic data, each composed of 500 images of size 100 × 100 pixels. Images were generated using the following noisy linear system:

u1 V1 + u2 V2 + u3 V3 + ε ∈ R^10,000,   (27)

where V = [V1, V2, V3] ∈ R^{10,000×3} are the sparse and structured loading vectors illustrated in Fig. 1. The support of V1 defines the two upper dots, the support of V2 defines the two lower dots, while V3's support delineates the middle dot. The coefficients u = [u1, u2, u3] that linearly combine the components of V are generated according to a centered Gaussian distribution. The elements of the noise vector ε are independent and identically distributed according to a centered Gaussian distribution, with a 0.1 signal-to-noise ratio (SNR). This SNR was selected by a previous calibration pipeline in which we tested the efficiency of data reconstruction at multiple SNR values ranging from 0 to 0.5. We decided to work with a 0.1 SNR because it is located in the range of values where standard PCA starts being less efficient in the recovery process.
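A generator following Eq. (27) can be sketched as below. The dot positions and the SNR convention (ratio of Frobenius norms of signal to noise) are our assumptions for illustration; the paper's exact supports and calibration may differ.

```python
import numpy as np

def make_synthetic(n=500, side=100, snr=0.1, seed=42):
    """Generate one synthetic data set following Eq. (27): each image
    is a Gaussian mixture of three sparse 'dot' loadings plus Gaussian
    noise.  Dot positions are illustrative, not the paper's exact
    supports; the SNR convention is an assumption."""
    rng = np.random.default_rng(seed)
    p = side * side
    grid = np.arange(p).reshape(side, side)
    V = np.zeros((p, 3))
    q = side // 10
    V[grid[q:3 * q, q:3 * q].ravel(), 0] = 1.0          # upper dot
    V[grid[7 * q:9 * q, 7 * q:9 * q].ravel(), 1] = 1.0  # lower dot
    V[grid[4 * q:6 * q, 4 * q:6 * q].ravel(), 2] = 1.0  # middle dot
    U = rng.standard_normal((n, 3))                     # mixing coefficients u
    signal = U @ V.T
    noise = rng.standard_normal((n, p))                 # epsilon
    scale = snr * np.linalg.norm(noise) / np.linalg.norm(signal)
    return scale * signal + noise, V
```

Calling `make_synthetic()` with the defaults reproduces the paper's dimensions (500 images of 10,000 pixels); smaller values are convenient for quick checks.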

Fig. 1: Loading vectors V = [V1, V2, V3] ∈ R^{10,000×3} used to generate the images (components 1-3).

We split the 500 artificial images into a test and a training set, with 250 images in each set, and learned the decomposition on the training set.

Fig. 2: Loading vectors recovered from 250 images using Sparse PCA and SPCA-TV (components 1-3 for each method).

Fig. 2 represents the loading vectors extracted with one data set. Please note that the sign is arbitrary: indeed, if we consider the loss of Eq. (3), u^T and v can both be multiplied by -1


TABLE I: Scores are averaged across the 50 independent data sets. We tested whether the scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: *** p ≤ 10^-3.

Methods          Test Data Reconstruction Error   MSE    Dice Index
Sparse PCA       1576.0                           0.91   0.28
ElasticNet PCA   1572.4                           0.83   0.43
GraphNet PCA     1570.8                           0.83   0.30
SSPCA            1571.9                           1.54   0.07
SPCA-TV          1570.1                           0.64   0.52

without changing anything. We observe that Sparse PCA yields very scattered loading vectors. The loading vectors of SPCA-TV, on the other hand, are sparse but also organized in clear regions: SPCA-TV provides loading vectors that closely match the ground truth. The reconstruction error is evaluated on the test sets (Tab. I), with its value over the 50 data sets being significantly lower for SPCA-TV than for the Sparse PCA (T = 94.5, p = 3.9·10^-57), ElasticNet PCA (T = 33.2, p = 2.7·10^-35), GraphNet PCA (T = 12.7, p = 3.6·10^-17), and SSPCA [29] (T = 18.9, p = 3.9·10^-24) methods. Additional details concerning the reconstruction accuracy on both the training and test data are presented in Figure 1 of the supplementary materials (available in the supplementary files/multimedia tab).

A different way of quantifying the reconstruction accuracy of each method is to evaluate how closely the extracted loadings match the known ground truth of the simulated data set. We computed the mean squared error (MSE) between the ground truth and the estimated loadings. The results are presented in Tab. I. We note that the MSE is significantly lower with SPCA-TV than with Sparse PCA (T = 6.9, p = 8.0·10^-9), ElasticNet PCA (T = 6.2, p = 1.1·10^-7), GraphNet PCA (T = 4.1, p = 1.4·10^-4), and SSPCA (T = 22.6, p = 1.5·10^-27).
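Because of the sign ambiguity noted above (u and v can both be multiplied by -1 without changing the reconstruction), a loading-versus-ground-truth MSE should be computed after resolving the sign. A minimal sketch, with a function name of our choosing:

```python
import numpy as np

def aligned_mse(v_true, v_est):
    """MSE between a ground-truth loading and an estimate, after
    resolving PCA's sign ambiguity by choosing the sign of the
    estimate that best matches the ground truth."""
    sign = 1.0 if v_true @ v_est >= 0 else -1.0
    return float(np.mean((v_true - sign * v_est) ** 2))
```

With this alignment, an estimate equal to the ground truth up to a global sign flip scores an MSE of exactly zero.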

Moreover, when evaluating the stability of the loading vectors across resampling, we found a significantly higher mean Dice index when using SPCA-TV compared to the other methods (p < 0.001). The results are presented in Tab. I. They indicate that SPCA-TV is more robust to variation in the learning samples than the other sparse methods: SPCA-TV yields reproducible loading vectors across data sets.

These results indicate that the SPCA-TV loadings are not only more stable across resampling, but also achieve a better recovery of the underlying variability in independent data than the Sparse PCA, ElasticNet PCA, GraphNet PCA, and SSPCA methods.

One of the issues linked to biconvex optimization is the risk of falling into local minima. Conscious of this potential risk, we set up an experiment in which we ran the optimization of the same problem 50 times, with a different starting point at each run. We then compared the resulting loading vectors obtained at each run and computed a similarity measure, the Dice index, which quantifies the proximity between the independently-run solutions obtained from different starting points. We obtained a Dice index of 0.99 on the 1st component, 0.99 on the 2nd component, and 0.72 on the 3rd component. On the strength of these indices, we are confident in the algorithm's robustness and its ability to converge toward the same stable solution independently of the choice of the starting point.

B. 3D images of functional MRI of patients with schizophrenia

We then applied the methods to 3D images of BOLD functional MRI (fMRI) acquired with the same scanner and pulse sequence. Imaging was performed on a 1.5 T scanner using a standard head coil. For all functional scans, the field of view was 206×206×153 mm, with a resolution close to 3.5 mm in all directions. The parameters of the PRESTO sequence were: TE = 9.6 ms, TR = 19.25 ms, EPI factor = 15, flip angle = 9°. Each fMRI run consisted of 900 collected volumes. The cohort is composed of 23 patients with schizophrenia (average age = 34.96 years; 8 females / 15 males). Brain activation was measured while subjects experienced multimodal hallucinations. The fMRI data was pre-processed using SPM12 (Wellcome Department of Imaging Neuroscience, London, UK). Data preprocessing consisted of motion correction (realignment), coregistration of the individual anatomical T1 image to the functional images, and spatial normalization to MNI space using DARTEL based on segmented T1 scans.

We considered each set of consecutive images under a pre-hallucination state as a block. Since most of the patients hallucinated more than once during the scanning session, we have more blocks than patients (83 blocks). The activation maps are computed from these blocks. Based on the general linear model approach, we regressed, for each block, the fMRI signal time course on a linear ramp function. Indeed, we hypothesized that activation in some regions presents a ramp-like increase during the time preceding the onset of hallucinations (see an example of this regression in Figure 3 of the supplementary materials, available in the supplementary files/multimedia tab). The activation maps that we used as input to the SPCA-TV method are the statistical parametric maps associated with the coefficients of the block regression (see one example in Figure 4 of the supplementary materials, available in the supplementary files/multimedia tab). We obtained a data set of n = 83 maps and p = 63,966 features. We hypothesized that the principal components extracted from these activation maps with SPCA-TV could uncover major trends of variability within pre-hallucination patterns. Thus, they might reveal the existence of subgroups of patients according to the sensory modality (e.g., vision or audition) involved during hallucinations.
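The per-block ramp regression can be sketched as a one-regressor GLM. This is an illustration under our own simplifying assumptions (a single centered ramp regressor plus an intercept; the actual design may include nuisance covariates), and the function name is ours.

```python
import numpy as np

def ramp_coefficient(signal):
    """Fit one block's fMRI time course with an intercept plus a
    linear ramp, and return the ramp coefficient; a positive value
    indicates a ramp-like activity increase before hallucination
    onset."""
    t = np.arange(len(signal), dtype=float)
    ramp = (t - t.mean()) / (t.max() - t.min())   # centered, unit-range ramp
    design = np.column_stack([np.ones_like(t), ramp])
    coef, *_ = np.linalg.lstsq(design, np.asarray(signal, dtype=float),
                               rcond=None)
    return coef[1]
```

Applied voxel-wise to each block, such coefficients (or their associated statistical parametric maps) form the n × p activation matrix fed to SPCA-TV.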

We applied all PCA methods under study to this data set, except SSPCA. Indeed, the SSPCA method [29] could not be applied to this specific example, since data sets have to be constituted of closed cubic forms, without any holes, to be eligible for SSPCA; it does not support masked data such as the data used here.

The loading vectors extracted with Sparse PCA and SPCA-TV from the activation maps of pre-hallucination scans are presented in Fig. 3. We observe a similar behavior as in the synthetic example, namely that the loading vectors of


Sparse PCA tend to be scattered and produce irregular patterns. In contrast, SPCA-TV yields structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCA-TV loading vectors are not redundant and reveal different patterns.

Indeed, the loading vectors obtained by SPCA-TV are of great interest because they reveal insightful patterns of variability in the data. The second loading is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also of vision-processing areas such as the occipital fusiform gyrus and the parietal operculum cortex. The third loading reveals important weights in the middle temporal gyrus, the parietal operculum cortex, and the frontal pole. The first loading vector encompasses all features of the brain. One might see this first component as a global variability affecting the whole brain, such as the overarching effect of age. SPCA-TV selects this dense configuration in spite of the sparsity constraint. It is highly desirable to remove any sort of global effect first, in order to identify, in the next components, local patterns that are not impacted by a global variability of this kind. We can identify a widespread set of dysfunctional language-related or vision-related areas that present increasing activity during the time preceding the onset of hallucinations. The regions extracted by SPCA-TV are pertinent according to the existing literature on the topic ([28], [27], [7]).

These results seem to indicate the possible existence of subgroups of patients according to the hallucination modalities involved. An interesting application would be to use the scores of the second component extracted by SPCA-TV in order to distinguish patients with visual hallucinations from those suffering mainly from auditory hallucinations.

The reconstruction error is significantly lower for SPCA-TV than for Sparse PCA (T = 13.9, p = 1.5·10^-4), ElasticNet PCA (T = 7.1, p = 2.1·10^-3), and GraphNet PCA (T = 4.6, p = 1.0·10^-2). Moreover, when assessing the stability of the loading vectors across the folds, we found a significantly higher mean Dice index for SPCA-TV compared to Sparse PCA (p = 4.0·10^-3), ElasticNet PCA (p = 4.0·10^-3), and GraphNet PCA (p = 2.0·10^-3), as presented in Tab. II. Additional details regarding the reconstruction accuracy on both the training and test sets and the Dice index are presented in Figure 5 of the supplementary materials (available in the supplementary files/multimedia tab).

TABLE II: Scores on the fMRI data, averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: *** p ≤ 10^-3, ** p ≤ 10^-2.

Methods          Test Data Reconstruction Error   Dice Index
Sparse PCA       1515.2                           0.34
ElasticNet PCA   1482.7                           0.32
GraphNet PCA     1428.1                           0.58
SPCA-TV          1414.0                           0.63

In conclusion, SPCA-TV significantly outperforms Sparse, ElasticNet, and GraphNet PCA, both in terms of the reconstruction error on independent test data and in the sense that its loading vectors are more clinically interpretable and more stable.

We also evaluated the convergence speed of Sparse PCA, Mini-Batch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA, GraphNet PCA, and SPCA-TV on this functional MRI data set of n = 83 samples and p = 63,966 features. We compared the execution time required for each algorithm to achieve a given level of precision in Tab. III. Sparse PCA and ElasticNet PCA are similar in terms of convergence time, while Mini-Batch Sparse PCA is much faster but does not converge to high precision. As expected, the structured methods (GraphNet PCA and SPCA-TV) take longer than the other sparse methods because of the inclusion

Fig. 3: Loading vectors recovered from the 83 activation maps using Sparse PCA and SPCA-TV (components 1-3 for each method).


TABLE III: Comparison of the execution time required for each sparse method to reach the same precision. Times are reported in seconds.

                         Time to reach a given precision (s)
Methods                  10      1        10^-1    10^-2     10^-3
Mini-Batch Sparse PCA    53.2    -        -        -         -
Sparse PCA               158.0   231.2    344.3    386.8     450.1
ElasticNet PCA           123.7   138.1    302.7    396.4     406.3
GraphNet PCA             301.9   521.6    813.1    881.4     888.4
SPCA-TV                  427.7   2958.6   8093.0   13813.4   14459.9

of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but its convergence time is still reasonable for an fMRI data set with 65,000 voxels.

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients from the ADNI database with a diagnosis of mild cognitive impairment (MCI) who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13], and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).

We applied all PCA methods under study to this data set, except SSPCA. Indeed, we could not apply the SSPCA method to this data set due to an intrinsic limitation of the method: SSPCA's application is restricted to N-dimensional array images, and it does not support meshes of cortical surfaces such as the data set used here.

The loading vectors obtained from the data set with Sparse PCA and SPCA-TV are presented in Fig. 4. As expected, Sparse PCA loadings are not easily interpretable, because the patterns are irregular and dispersed throughout the brain

surface. In contrast, SPCA-TV reveals structured and smooth clusters in relevant regions.

The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between patients resulting from a global cortical atrophy, as often observed in AD patients. The second loading vector includes variability in the entorhinal cortex, the hippocampus, and temporal regions. Last, the third loading vector might be related to the atrophy of the frontal lobe, and captures variability in the precuneus too. Thus, SPCA-TV provides a smooth map that closely matches the well-known brain regions involved in Alzheimer's disease [22].

Indeed, it is well documented that cortical atrophy progresses over three main stages in Alzheimer's disease ([10], [15]). The cortical structures are sequentially affected by the accumulation of amyloid plaques. Cortical atrophy is first observed in the mild stage of the disease in regions surrounding the hippocampus ([26], [44], [47]) and the entorhinal cortex ([12]), as seen in the second component. This is consistent with early memory deficits. The disease then progresses to a moderate stage, where atrophy gradually extends to the prefrontal association cortex, as revealed in the third component ([37]). In the severe stage of the disease, the whole cortex is affected by atrophy ([15]), as revealed in the first component. In order to assess the clinical significance of these weight maps, we tested the correlation between the scores corresponding to the three components and performance on a clinical test, ADAS. The Alzheimer's Disease Assessment Scale-Cognitive subscale is the most widely used general cognitive measure in AD. ADAS is scored in terms of errors, so a high score indicates poor performance. We obtained significant correlations between ADAS test performance and component scores (Fig. 5): r = -0.34, p = 4.2·10^-11 for the first component; r = -0.26, p = 3.6·10^-7 for the second component; and r = -0.35, p = 4.0·10^-12 for the third component. The same behavior is observable for all three components: the ADAS score grows proportionately to the level to which a patient is affected and to the severity of the atrophy presented (in the temporal pole, the prefrontal region, and also globally). Conversely, control subjects score low on the ADAS metric and present low levels of cortical atrophy. Therefore, SPCA-TV provides us with clear biomarkers that are perfectly relevant to the scope of Alzheimer's disease progression.

The reconstruction error is significantly lower for SPCA-TV than for Sparse PCA (T = 12.7, p = 2.1·10^-4), ElasticNet PCA (T = 6.8, p = 2.3·10^-3), and GraphNet PCA (T = 2.83, p = 4.7·10^-2). The results are presented in Tab. IV. Moreover, when assessing the stability of the loading vectors across the folds, the mean Dice index is significantly higher for SPCA-TV than for the other methods. Additional details regarding the reconstruction accuracy on both the training and test sets and the Dice index are presented in Figure 7 of the supplementary materials (available in the supplementary files/multimedia tab).

IV. CONCLUSION

We proposed an extension of Sparse PCA that takes into account the spatial structure of the data. The optimization


TABLE IV: Scores are averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: *** p ≤ 10^-3, ** p ≤ 10^-2, * p ≤ 10^-1.

Methods          Test Data Reconstruction Error   Dice Index
Sparse PCA       2991.8                           0.44
ElasticNet PCA   2832.6                           0.43
GraphNet PCA     2813.6                           0.62
SPCA-TV          2795.0                           0.65

scheme is able to minimize any combination of the ℓ1, ℓ2, and TV penalties, while preserving the exact ℓ1 penalty. We observe that SPCA-TV, in contrast to the other existing sparse PCA methods, yields clinically interpretable results and reveals major sources of variability in data, by highlighting structured clusters of interest in the loading vectors. Furthermore, SPCA-TV's loading vectors were more stable across the learning samples compared to the other methods. SPCA-TV was validated, and its applicability demonstrated, on three distinct data sets; we may reach the conclusion that SPCA-TV can be used on any kind of structured configuration, and is able to reveal the structure within the data.

Fig. 4: Loading vectors recovered from the 133 MCI patients using Sparse PCA and SPCA-TV (components 1-3 for each method).

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python library:
- Url: https://github.com/neurospin/pylearn-parsimony
- Description: ParsimonY is a Python library for structured and sparse machine learning. ParsimonY is open-source (BSD License) and compliant with the scikit-learn API.

b) Data sets and scripts:
- Url: ftp://ftp.cea.fr/pub/unati/brainomics/papers/pcatv
- Description: This url provides the simulation data set and the Python script used to create Fig. 2 of the paper.

REFERENCES

[1] A Abraham E Dohmatob B Thirion D Samaras and G VaroquauxExtracting brain regions from rest fmri with total-variation constraineddictionary learning In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2013 - 16th International ConferenceNagoya Japan 2013 Proceedings Part II pages 607ndash615 2013

[2] B. Fischl, M. Sereno, and A. Dale. Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195-207, 1999.

[3] A Bakkour JC Morris and BC Dickerson The cortical signature ofprodromal ad regional thinning predicts mild ad dementia Neurology721048ndash1055 2009

[4] A Beck and M Teboulle A Fast Iterative Shrinkage-ThresholdingAlgorithm for Linear Inverse Problems SIAM Journal on ImagingSciences 2(1)183ndash202 2009

Fig. 5: Correlation of component scores with ADAS test performance.


[5] A Beck and M Teboulle Fast gradient-based algorithms for con-strained total variation image denoising and deblurring problems IEEETrans Image Processing 18(11)2419ndash2434 2009

[6] A Beck and M Teboulle Smoothing and first order methods A unifiedframework SIAM Journal on Optimization 22(2)557ndash580 2012

[7] L Bentaleb M Beauregard P Liddle and E Stip Cerebral activityassociated with auditory verbal hallucinations a functional magneticresonance imaging case study Journal of psychiatry amp neuroscienceJPN 27(2)110 2002

[8] JM Borwein and AS Lewis Convex Analysis and Nonlinear Opti-mization Theory and Examples CMS Books in Mathematics Springer2006

[9] S Boyd and L Vandenberghe Convex Optimization CambridgeUniversity Press New York NY USA 2004

[10] H Braak and E Braak Neuropathological stageing of alzheimer-relatedchanges Acta Neuropathologica 82(4)239ndash259 1991

[11] K Brodmann Vergleichende lokalisationslehre der grosshirnrinde inihren prinzipien dargestellt auf grund des zellenbaues 1909

[12] VA Cardenas LL Chao C Studholme K Yaffe BL Miller C MadisonST Buckley D Mungas N Schuff and MW Weiner Brain atrophyassociated with baseline and longitudinal measures of cognition Neu-robiology of aging 32(4)572ndash580 2011

[13] A. Dale, B. Fischl, and M. I. Sereno. Cortical surface-based analysis I: Segmentation and surface reconstruction. NeuroImage, 9(2):179-194, 1999.

[14] A drsquoAspremont L El Ghaoui M Jordan and G Lanckriet A DirectFormulation for Sparse PCA Using Semidefinite Programming SIAMReview 49(3)434ndash448 2007

[15] A Delacourte JP David N Sergeant L Buee A Wattez P VermerschF Ghozali C Fallet-Bianco F Pasquier F Lebert et al The biochemicalpathway of neurofibrillary degeneration in aging and alzheimers diseaseNeurology 52(6)1158ndash1158 1999

[16] L Dice Measures of the amount of ecologic association betweenspecies Ecology 26297ndash302 1945

[17] BC Dickerson E Feczko JC Augustinack J Pacheco JC Morrisand B Fischl Differential effects of aging and alzheimerrsquos disease onmedial temporal lobe cortical thickness and surface area Neurobiologyof aging 30432ndash440 2009

[18] E Dohmatob M Eickenberg B Thirion and G Varoquaux Speeding-up model-selection in GraphNet via early-stopping and univariatefeature-screening June 2015

[19] M Dubois F Hadj-Selem T Lofstedt M Perrot C Fischer V Frouinand E Duchesnay Predictive support recovery with TV-Elastic Netpenalties and logistic regression an application to structural MRI InProceedings of the fourth International Workshop on Pattern Recogni-tion in Neuroimaging (PRNI 2014) 2014

[20] H Eavani TD Satterthwaite RE Filipovych RC Gur and C Da-vatzikos Identifying sparse connectivity patterns in the brain usingresting-state fmri Neuroimage 105286ndash299 2015

[21] D Felleman and D Van Essen Distributed hierarchical processing inthe primate cerebral cortex Cerebral cortex (New York NY 1991)1(1)1ndash47 1991

[22] GB Frisoni NC Fox CR Jack P Scheltens and PM Thompson Theclinical use of structural MRI in Alzheimer disease Nat Rev Neurol6(2)67ndash77 2010

[23] L Grosenick B Klingenberg K Katovich B Knutson and J TaylorInterpretable whole-brain prediction analysis with GraphNet NeuroIm-age 72304ndash321 May 2013

[24] R Guo M Ahn H Zhu and the Alzheimers Disease Neuroimag-ing Initiative Spatially weighted principal component analysis for imag-ing classification Journal of Computational and Graphical Statistics24274ndash296 2015

[25] F Hadj-Selem T Lofstedt V Frouin V Guillemot and E DuchesnayAn Iterative Smoothing Algorithm for Regression with StructuredSparsity arXiv160509658 [stat] 2016 arXiv 160509658

[26] CR Jack MM Shiung JL Gunter PC Obrien SD Weigand David SKnopman Bradley F Boeve Robert J Ivnik Glenn E Smith RH Chaet al Comparison of different mri brain atrophy rate measures withclinical disease progression in ad Neurology 62(4)591ndash600 2004

[27] R Jardri A Pouchet D Pins and P Thomas Cortical activationsduring auditory verbal hallucinations in schizophrenia a coordinate-based meta-analysis American Journal of Psychiatry 168(1)73ndash812011

[28] R Jardri P Thomas C Delmaire P Delion and D Pins The neuro-dynamic organization of modality-dependent hallucinations CerebralCortex pages 1108ndash1117 2013

[29] R Jenatton G Obozinski and F Bach Structured sparse principalcomponent analysis In International Conference on Artificial Intelli-gence and Statistics (AISTATS) 2010

[30] I Jolliffe N Trendafilov and M Uddin A Modified PrincipalComponent Technique Based on the LASSO Journal of Computationaland Graphical Statistics 12(3)531ndash547 2003

[31] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized Power Method for Sparse Principal Component Analysis. J. Mach. Learn. Res., 11:517-553, 2010.

[32] B Kandel D Wolk J Gee and B Avants Predicting Cognitive Datafrom Medical Images Using Sparse Linear Regression Informationprocessing in medical imaging proceedings of the conference2386ndash97 2013

[33] M Li Y Liu F Chen and D Hu Including signal intensity increasesthe performance of blind source separation on brain imaging data IEEEtransactions on medical imaging 34(2)551ndash563 2015

[34] Lester W Mackey Deflation Methods for Sparse PCA In D KollerD Schuurmans Y Bengio and L Bottou editors Advances in NeuralInformation Processing Systems 21 pages 1017ndash1024 Curran Asso-ciates Inc 2009

[35] J Mairal Sparse coding for machine learning image processing andcomputer vision PhD thesis Ecole normale superieure Cachan 2010

[36] J Mairal F Bach Jean J Ponce and G Sapiro Online Learning forMatrix Factorization and Sparse Coding J Mach Learn Res 1119ndash60 2010

[37] CR McDonald L McEvoy L Gharapetian C Fennema-Notestine DJHagler D Holland A Koyama JB Brewer AM Dale AlzheimersDisease Neuroimaging Initiative et al Regional rates of neocorticalatrophy from normal aging to early alzheimer disease Neurology73(6)457ndash465 2009

[38] H Mohr U Wolfensteller S Frimmel and H Ruge Sparse regu-larization techniques provide novel insights into outcome integrationprocesses NeuroImage 104163ndash176 January 2015

[39] Y Nesterov Smooth minimization of non-smooth functions Mathe-matical Programming 103(1)127ndash152 2005

[40] B Ng A Vahdat G Hamarneh and R Abugharbieh GeneralizedSparse Classifiers for Decoding Cognitive States in fMRI In Springer-Link pages 108ndash115 Beijing China September 2012 Springer BerlinHeidelberg DOI 101007978-3-642-15948-0 14

[41] R Nieuwenhuys The myeloarchitectonic studies on the human cerebralcortex of the vogtndashvogt school and their significance for the interpre-tation of functional neuroimaging data Brain Structure and Function218(2)303ndash352 2013

[42] F Pedregosa G Varoquaux A Gramfort V Michel B ThirionO Grisel M Blondel P Prettenhofer R Weiss V Dubourg J Van-derplas A Passos D Cournapeau M Brucher M Perrot and EDuchesnay Scikit-learn Machine learning in Python Journal ofMachine Learning Research 122825ndash2830 2011

[43] M Ramezani K Marble HTrang and P Abolmaesumi IS JohnsrudeJoint sparse representation of brain activity patterns in multi-task fmridata IEEE transactions on medical imaging 34(1)2ndash12 2015

[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al. Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567–574, 2008.

0278-0062 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2017.2749140, IEEE Transactions on Medical Imaging.

IEEE TRANSACTIONS ON MEDICAL IMAGING

[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang, and D. Hu. Making group inferences using sparse representation of resting-state functional MRI data with application to sleep deprivation. Human Brain Mapping, 38(9):4671–4689, 2017.

[46] J.G. Sled, A.P. Zijdenbos, and A.C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17:87–97, 1998.

[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell, and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754–1766, 2004.

[48] W.-T. Wang and H.-C. Huang. Regularized principal component analysis for spatial data. ArXiv e-prints, 2015.

[49] D.M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

[50] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719–752, 2012.

[51] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.


the penalties. For the $\ell_1$ penalty alone, the proximal operator is known analytically, and efficient iterative algorithms such as ISTA and FISTA are available (see [4]). However, since the proximal operator of the TV+$\ell_1$ penalty has no closed form, a standard implementation of those algorithms is not suitable. In order to overcome this barrier, we used Nesterov's smoothing technique [39]. It consists of approximating the non-smooth penalties whose proximal operator is unknown (e.g., TV) with a smooth function (of which the gradient is known). Non-smooth penalties with known proximal operators (e.g., $\ell_1$) are not affected. Hence, as described in [50], it allows the use of an exact accelerated proximal gradient algorithm. Thus, we can solve the PCA problem penalized by TV and elastic net, where an exact $\ell_1$ penalty is used.

Using the dual norm of the $\ell_2$-norm (which happens to be the $\ell_2$-norm too), Eq. (8) can be reformulated as

$$s(v) = \sum_{g \in G} \|A_g v\|_2 = \sum_{g \in G} \max_{\|\alpha_g\|_2 \le 1} \alpha_g^\top A_g v, \quad (9)$$

where $\alpha_g \in K_g = \{\alpha_g \in \mathbb{R}^3 : \|\alpha_g\|_2 \le 1\}$ is a vector of auxiliary variables in the $\ell_2$ unit ball associated with $A_g v$. As with $A \in \mathbb{R}^{3P \times P}$, which is the vertical concatenation of all the $A_g$, we concatenate all the $\alpha_g$ to form $\alpha \in K = \{[\alpha_1^\top, \dots, \alpha_P^\top]^\top : \alpha_g \in K_g\} \subset \mathbb{R}^{3P}$. $K$ is the Cartesian product of 3D unit balls in Euclidean space, and therefore a compact convex set. Eq. (9) can further be written as

$$s(v) = \max_{\alpha \in K} \alpha^\top A v. \quad (10)$$

Given this formulation of $s(v)$, we can apply Nesterov's smoothing. For a given smoothing parameter $\mu > 0$, the function $s(v)$ is approximated by the smooth function

$$s_\mu(v) = \max_{\alpha \in K} \left\{ \alpha^\top A v - \frac{\mu}{2}\|\alpha\|_2^2 \right\}, \quad (11)$$

for which $\lim_{\mu \to 0} s_\mu(v) = s(v)$. Nesterov [39] demonstrates this convergence using the inequality in Eq. (15). The value of $\alpha^*_\mu(v) = [\alpha^{*\top}_{\mu,1}, \dots, \alpha^{*\top}_{\mu,g}, \dots, \alpha^{*\top}_{\mu,P}]^\top$ that maximizes Eq. (11) is the concatenation of the projections of the vectors $A_g v \in \mathbb{R}^3$ onto the $\ell_2$ ball ($K_g$): $\alpha^*_{\mu,g}(v) = \mathrm{proj}_{K_g}\!\left(\frac{A_g v}{\mu}\right)$, where

$$\mathrm{proj}_{K_g}(x) = \begin{cases} x & \text{if } \|x\|_2 \le 1, \\ \dfrac{x}{\|x\|_2} & \text{otherwise.} \end{cases} \quad (12)$$

The function $s_\mu$, i.e., the Nesterov smooth transform of $s$, is convex and differentiable. Its gradient, given by [39],

$$\nabla s_\mu(v) = A^\top \alpha^*_\mu(v), \quad (13)$$

is Lipschitz-continuous with constant

$$L(\nabla s_\mu) = \frac{\|A\|_2^2}{\mu}, \quad (14)$$

where $\|A\|_2$ is the matrix spectral norm of $A$. Moreover, Nesterov [39] provides the following inequality relating $s_\mu$ and $s$:

$$s_\mu(v) \le s(v) \le s_\mu(v) + \mu M, \quad \forall v \in \mathbb{R}^P, \quad (15)$$

where $M = \max_{\alpha \in K} \frac{\|\alpha\|_2^2}{2} = \frac{P}{2}$.
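As a concrete illustration, the smoothed penalty of Eq. (11) and its gradient (Eq. (13)) can be evaluated directly from the group-wise projections of Eq. (12). The sketch below is a simplified, hypothetical implementation (the group operators are passed as a list of small dense matrices, whereas in practice $A$ is a large sparse matrix):

```python
import numpy as np

def smoothed_group_norm(A_groups, v, mu):
    """Nesterov smoothing of s(v) = sum_g ||A_g v||_2 (illustrative sketch).

    Returns s_mu(v) and its gradient A^T alpha*_mu(v), where each
    alpha*_g is the projection of A_g v / mu onto the l2 unit ball.
    """
    s_mu, grad = 0.0, np.zeros_like(v, dtype=float)
    for A_g in A_groups:
        x = A_g @ v / mu
        norm = np.linalg.norm(x)
        alpha_g = x if norm <= 1.0 else x / norm   # projection of Eq. (12)
        s_mu += alpha_g @ (A_g @ v) - 0.5 * mu * (alpha_g @ alpha_g)  # Eq. (11)
        grad += A_g.T @ alpha_g                    # Eq. (13), accumulated per group
    return s_mu, grad
```

On a single group with $A_g = I$ and $\mu = 1$, one can check the bound of Eq. (15): $s_\mu(v) \le s(v) \le s_\mu(v) + \mu M$ with $M$ equal to half the number of groups.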

Thus, a new (smoothed) optimization problem, closely related to Eq. (3) (with fixed $u$), arises from this regularization as

$$\min_v \; \underbrace{-\frac{1}{n} u^\top X v + \lambda_2 \|v\|_2^2}_{l(v)} \; + \; \lambda \underbrace{\left( \alpha^*_\mu(v)^\top A v - \frac{\mu}{2}\|\alpha^*_\mu(v)\|_2^2 \right)}_{s_\mu(v)} \; + \; \lambda_1 \underbrace{\|v\|_1}_{h(v)}, \quad (16)$$

where $l + \lambda s_\mu$ is the smooth part and $h$ the non-smooth part.

Since we are now able to explicitly compute the gradient of the smooth part, $\nabla(l + \lambda s_\mu)$ (Eq. (18)), its Lipschitz constant (Eq. (19)), and also the proximal operator of the non-smooth part, we have all the ingredients necessary to solve this minimization problem using an accelerated proximal gradient method [4]. Given a starting point $v^0$ and a smoothing parameter $\mu$, FISTA (Algorithm 1) minimizes the smoothed problem until it reaches a prescribed precision $\varepsilon_\mu$.

However, in order to control the convergence of the algorithm (presented in Section II-E1), we introduce the Fenchel dual function and the corresponding dual gap of the objective function. Fenchel duality requires the loss to be strongly convex, which is why we further reformulate Eq. (16) slightly: all penalty terms are divided by $\lambda_2$ and, using the following equivalent formulation for the loss, we obtain the minimization problem

$$\min_v f_\mu \equiv \underbrace{\frac{1}{2}\left\| v - \frac{X^\top u}{n\lambda_2} \right\|_2^2}_{L(v)} + \underbrace{\frac{1}{2}\|v\|_2^2 + \frac{\lambda}{\lambda_2} \underbrace{\left( \alpha^*_\mu(v)^\top A v - \frac{\mu}{2}\|\alpha^*_\mu(v)\|_2^2 \right)}_{s_\mu(v)} + \frac{\lambda_1}{\lambda_2} \underbrace{\|v\|_1}_{h(v)}}_{\psi_\mu(v)}. \quad (17)$$

This new formulation of the smoothed objective function (denoted $f_\mu$) preserves the decomposition of $f_\mu$ into a sum of a smooth term, $l + \frac{\lambda}{\lambda_2} s_\mu$, and a non-smooth term, $h$. Such a decomposition is required for the application of FISTA with Nesterov's smoothing. Moreover, this formulation provides a decomposition of $f_\mu$ into a sum of a smooth loss $L$ and a penalty term $\psi_\mu$, required for the calculation of the gap presented in Section II-E1.

We provide all the required quantities to minimize Eq. (17) using Algorithm 1. Using Eq. (13), we compute the gradient of the smooth part as

$$\nabla\!\left(l + \frac{\lambda}{\lambda_2} s_\mu\right) = \nabla(l) + \frac{\lambda}{\lambda_2}\nabla(s_\mu) = \left(2v - \frac{X^\top u}{n\lambda_2}\right) + \frac{\lambda}{\lambda_2} A^\top \alpha^*_\mu(v^k), \quad (18)$$

and its Lipschitz constant (using Eq. (14)):

$$L\!\left(\nabla\!\left(l + \frac{\lambda}{\lambda_2} s_\mu\right)\right) = 2 + \frac{\lambda}{\lambda_2}\frac{\|A\|_2^2}{\mu}. \quad (19)$$


Algorithm 1 FISTA$\left(X^\top u,\; v^0,\; \varepsilon_\mu,\; \mu,\; A,\; \lambda,\; L(\nabla(g))\right)$
1: $v^1 = v^0$, $k = 2$
2: Compute the gradient of the smooth part, $\nabla(g + \lambda s_\mu)$ (Eq. (18)), and its Lipschitz constant, $L_\mu$ (Eq. (19))
3: Compute the step size $t_\mu = L_\mu^{-1}$
4: repeat
5:   $z = v^{k-1} + \frac{k-2}{k+1}\left(v^{k-1} - v^{k-2}\right)$
6:   $v^k = \mathrm{prox}_h\!\left(z - t_\mu \nabla(g + \lambda s_\mu)(z)\right)$
7: until $\mathrm{GAP}_\mu(v^k) \le \varepsilon_\mu$
8: return $v^k$
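For illustration, the structure of Algorithm 1 can be sketched as below. This is a simplified stand-in, not the authors' implementation: the gradient and proximal operator are generic callables, and the duality-gap stopping rule is replaced by a step-change surrogate:

```python
import numpy as np

def fista(grad_smooth, lipschitz, prox_h, v0, max_iter=1000, tol=1e-10):
    """Minimal FISTA sketch (assumed interface).

    grad_smooth: gradient of the smooth part (loss + smoothed penalty).
    prox_h: proximal operator of the non-smooth part, called as prox_h(z, t).
    """
    t = 1.0 / lipschitz                       # step size t_mu = L_mu^{-1}
    v_prev, v = v0.copy(), v0.copy()
    for k in range(2, max_iter + 2):
        z = v + (k - 2.0) / (k + 1.0) * (v - v_prev)     # momentum step
        v_prev, v = v, prox_h(z - t * grad_smooth(z), t)  # gradient + prox step
        if np.linalg.norm(v - v_prev) < tol:              # surrogate stopping rule
            break
    return v

def soft_threshold(z, t, lam=1.0):
    """Proximal operator of t * lam * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)
```

As a sanity check, minimizing $\frac{1}{2}\|v - b\|_2^2 + \|v\|_1$ (Lipschitz constant 1) with this routine recovers the soft-thresholded solution.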

E. Minimization of the loading vectors with CONESTA

The step size $t_\mu$, computed in Line 3 of Algorithm 1, depends on the smoothing parameter $\mu$ (see Eq. (19)). Hence, there is a trade-off between speed and precision. Indeed, high precision, with a small $\mu$, will lead to slow convergence (small $t_\mu$). Conversely, poor precision (large $\mu$) will lead to rapid convergence (large $t_\mu$). Thus, we propose a continuation approach (Algorithm 2), which decreases the smoothing parameter with respect to the distance to the minimum. On the one hand, when we are far from $v^*$ (the minimum of Eq. (17)), we can use a large $\mu$ to rapidly decrease the objective function. On the other hand, when we are close to $v^*$, we need a small $\mu$ in order to obtain an accurate approximation of the original objective function.

1) Duality gap: The distance to the unknown $f(v^*)$ is estimated using the duality gap. Duality formulations are often used to control the achieved precision level when minimizing convex functions. They provide an estimation of the error $f(v^k) - f(v^*)$, for any $v^k$, when the minimum is unknown. The duality gap is the cornerstone of the CONESTA algorithm. Indeed, it is used three times:

i. As the stopping criterion in the inner FISTA loop (Line 7 in Algorithm 1). FISTA will stop as soon as the current precision is achieved using the current smoothing parameter $\mu$. This prevents unnecessary convergence toward the approximated (smoothed) objective function.

ii. In the $i$th CONESTA iteration, as a way to estimate the current error $f(v^i) - f(v^*)$ (Line 7 in Algorithm 2). The error is estimated using the gap of the smoothed problem, $\mathrm{GAP}_{\mu=\mu^i}(v^{i+1})$, which avoids unnecessary computation since it has already been computed during the last iteration of FISTA. The inequality in Eq. (15) is used to obtain the gap $\varepsilon^i$ to the original non-smoothed problem. The next desired precision, $\varepsilon^{i+1}$, and the smoothing parameter, $\mu^{i+1}$, are derived from this value.

iii. Finally, as the global stopping criterion within CONESTA (Line 10 in Algorithm 2). This guarantees that the obtained approximation of the minimum, $v^i$, satisfies $f(v^i) - f(v^*) < \varepsilon$ at convergence.

Based on Eq. (17), which decomposes the smoothed objective function into a sum of a strongly convex loss and the penalty,

$$f_\mu(v) = L(v) + \psi_\mu(v),$$

we compute the duality gap that provides an upper bound estimation of the error to the optimum. At any step $k$ of the algorithm, given the current primal variable $v^k$ and dual variable $\sigma(v^k) \equiv \nabla L(v^k)$ [8], we can compute the duality gap using the Fenchel duality rules [35]:

$$\mathrm{GAP}(v^k) \equiv f_\mu(v^k) + L^*\!\left(\sigma(v^k)\right) + \psi_\mu^*\!\left(-\sigma(v^k)\right), \quad (20)$$

where $L^*$ and $\psi_\mu^*$ are the Fenchel conjugates of $L$ and $\psi_\mu$, respectively. Denoting by $v^*$ the minimum of $f_\mu$ (the solution of Eq. (17)), the interest of the duality gap is that it provides an upper bound for the difference with the optimal value of the function. Moreover, it vanishes at the minimum:

$$\mathrm{GAP}(v^k) \ge f(v^k) - f(v^*) \ge 0, \qquad \mathrm{GAP}(v^*) = 0. \quad (21)$$

The dual variable is

$$\sigma(v^k) \equiv \nabla L(v^k) = v^k - \frac{X^\top u}{n\lambda_2}, \quad (22)$$

and the Fenchel conjugate of the squared loss $L(v^k)$ is

$$L^*\!\left(\sigma(v^k)\right) = \frac{1}{2}\left\|\sigma(v^k)\right\|_2^2 + \sigma(v^k)^\top \frac{X^\top u}{n\lambda_2}. \quad (23)$$

In [25], the authors provide the expression of the Fenchel conjugate of the penalty $\psi_\mu(v^k)$:

$$\psi_\mu^*\!\left(-\sigma(v^k)\right) = \frac{1}{2} \sum_{j=1}^{P} \left( \left[ \left| -\sigma(v^k)_j - \frac{\lambda}{\lambda_2}\left(A^\top \alpha^*_\mu(v^k)\right)_j \right| - \frac{\lambda_1}{\lambda_2} \right]_+ \right)^2 + \frac{\lambda\mu}{2\lambda_2} \left\| \alpha^*_\mu(v^k) \right\|_2^2, \quad (24)$$

where $[\cdot]_+ = \max(0, \cdot)$.

The expression of the duality gap in Eq. (20) provides an estimation of the distance to the minimum. This distance is geometrically decreased by a factor $\tau = 0.5$ at the end of each continuation, and the decreased value defines the precision that should be reached by the next iteration (Line 8 of Algorithm 2). Thus, the algorithm dynamically generates a sequence of decreasing prescribed precisions, $\varepsilon^i$. Such a scheme ensures convergence [25] towards a globally desired final precision, $\varepsilon$, which is the only parameter that the user needs to provide.

2) Determining the optimal smoothing parameter: Given the current prescribed precision $\varepsilon^i$, we need to compute an optimal smoothing parameter $\mu_{\mathrm{opt}}(\varepsilon^i)$ (Line 9 in Algorithm 2) that minimizes the number of FISTA iterations needed to achieve such precision when minimizing Eq. (3) (with fixed $u$) via Eq. (17) (i.e., such that $f(v^{(k)}) - f(v^*) < \varepsilon^i$).

In [25], the authors provide the expression of this optimal smoothing parameter:

$$\mu_{\mathrm{opt}}(\varepsilon^i) = \frac{-\lambda M \|A\|_2^2 + \sqrt{\left(\lambda M \|A\|_2^2\right)^2 + M\, L(\nabla(l))\, \|A\|_2^2\, \varepsilon^i}}{M\, L(\nabla(l))}, \quad (25)$$

where $M = P/2$ (Eq. (15)) and $L(\nabla(l)) = 2$ is the Lipschitz constant of the gradient of $l$, as defined in Eq. (17).
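Eq. (25) translates directly into code. The sketch below is a plain transcription under the stated definitions ($M = P/2$, $L(\nabla(l)) = 2$); the function name and argument order are our own:

```python
import numpy as np

def mu_opt(eps, lam, norm_A2_sq, M, L_grad_l=2.0):
    """Optimal smoothing parameter of Eq. (25) (direct transcription).

    eps: prescribed precision epsilon^i; lam: TV penalty weight lambda;
    norm_A2_sq: ||A||_2^2, squared spectral norm of A; M = P/2;
    L_grad_l: Lipschitz constant of the gradient of l (2 per Eq. (17)).
    """
    a = lam * M * norm_A2_sq
    return (-a + np.sqrt(a**2 + M * L_grad_l * norm_A2_sq * eps)) / (M * L_grad_l)
```

As expected from the continuation scheme, $\mu_{\mathrm{opt}}$ vanishes as $\varepsilon \to 0$ and grows monotonically with $\varepsilon$.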


We call the resulting algorithm CONESTA (short for COntinuation with NEsterov smoothing in a Shrinkage-Thresholding Algorithm). It is presented in detail, with convergence proofs, in [25].

Let $K$ be the total number of FISTA loops used in CONESTA; we have experimentally verified that the convergence rate to the solution of Eq. (16) is $O(1/K^2)$ (which is the optimal convergence rate for first-order methods). Also, the algorithm works even if some of the weights $\lambda_1$ or $\lambda$ are zero, which thus allows us to solve, e.g., the elastic net using CONESTA. Note that it has been rigorously proved that the continuation technique improves the convergence rate compared to simple smoothing with a single value of $\mu$. Indeed, it has been demonstrated in [6] (see also [50]) that the convergence rate obtained with a single value of $\mu$, even optimised, is $O(1/K^2) + O(1/K)$. However, it has recently been proved in [25] that the CONESTA algorithm achieves $O(1/K)$ for general convex functions.

We note that CONESTA could easily be adapted to many other penalties. For example, to add a group lasso (GL) constraint to our structure, we would only have to design a specific linear operator, $A_{GL}$, and concatenate it to the actual linear operator $A$.

Algorithm 2 CONESTA$\left(X^\top u,\; \varepsilon\right)$
1: Initialize $v^0 \in \mathbb{R}^P$
2: $\varepsilon^0 = \tau \cdot \mathrm{GAP}_{\mu=10^{-8}}(v^0)$
3: $\mu^0 = \mu_{\mathrm{opt}}(\varepsilon^0)$
4: repeat
5:   $\varepsilon^i_\mu = \varepsilon^i - \mu^i \gamma M$
6:   $v^{i+1} = \mathrm{FISTA}(X^\top u, v^i, \varepsilon^i_\mu, \dots)$
7:   $\varepsilon^i = \mathrm{GAP}_{\mu=\mu^i}(v^{i+1}) + \mu^i \gamma M$
8:   $\varepsilon^{i+1} = \tau \cdot \varepsilon^i$
9:   $\mu^{i+1} = \mu_{\mathrm{opt}}(\varepsilon^{i+1})$
10: until $\varepsilon^i \le \varepsilon$
11: return $v^{i+1}$

F. The algorithm for the SPCA-TV problem

The computation of a single component through SPCA-TV can be achieved by combining CONESTA and Eq. (5) within an alternating minimization loop. Mackey [34] demonstrated that further components can be efficiently obtained by incorporating this single-unit procedure into a deflation scheme, as done in, e.g., [14], [31]. The stopping criterion is defined as

$$\mathrm{STOPPING\ CRITERION} = \frac{ \left\| X_k - u^{i+1} v^{i+1\top} \right\|_F - \left\| X_k - u^i v^{i\top} \right\|_F }{ \left\| X_k - u^{i+1} v^{i+1\top} \right\|_F }. \quad (26)$$

All the presented building blocks were combined into Algorithm 3 to solve the SPCA-TV problem.

III. EXPERIMENTS

We evaluated the performance of SPCA-TV using three experiments: one simulation study carried out on a synthetic

Algorithm 3 SPCA-TV$(X, \varepsilon)$
1: $X_0 = X$
2: for all $k = 0, \dots, K$ do  ▷ Components
3:   Initialize $u^0 \in \mathbb{R}^N$
4:   repeat  ▷ Alternating minimization
5:     $v^{i+1} = \mathrm{CONESTA}(X_k^\top u^i, \varepsilon)$
6:     $u^{i+1} = X_k v^{i+1} / \left\|X_k v^{i+1}\right\|_2$
7:   until STOPPING CRITERION $\le \varepsilon$
8:   $v^{k+1} = v^{i+1}$
9:   $u^{k+1} = u^{i+1}$
10:  $X_{k+1} = X_k - u^{k+1} v^{k+1\top}$  ▷ Deflation
11: end for
12: return $U = [u^1, \dots, u^K]$, $V = [v^1, \dots, v^K]$
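The outer structure of Algorithm 3 (alternating minimization plus deflation) can be sketched as below. Note that this is only a control-flow illustration under a strong simplification: the CONESTA solve of $v$ is replaced by an unpenalized update $v = X_k^\top u$, so each component reduces to a plain power iteration; only the alternating loop and the deflation step are faithful to the algorithm:

```python
import numpy as np

def spca_deflation(X, n_components=3, n_iter=50):
    """Sketch of Algorithm 3's outer loop (assumed simplification, no penalties)."""
    Xk = X.astype(float).copy()
    U, V = [], []
    for _ in range(n_components):
        u = Xk[:, 0] / (np.linalg.norm(Xk[:, 0]) + 1e-12)  # crude initialization
        for _ in range(n_iter):                 # alternating minimization
            v = Xk.T @ u                        # stand-in for the CONESTA solve
            u = Xk @ v
            u /= np.linalg.norm(u) + 1e-12      # u^{i+1} = X_k v / ||X_k v||_2
        U.append(u)
        V.append(v)
        Xk = Xk - np.outer(u, v)                # deflation: X_{k+1} = X_k - u v^T
    return np.array(U).T, np.array(V).T
```

On a rank-one matrix, a single component recovers the matrix exactly, and the deflated residual vanishes.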

data set, and two on neuroimaging data sets. In order to compare the performance of SPCA-TV with existing sparse PCA models, we also included results obtained with Sparse PCA, ElasticNet PCA, GraphNet PCA, and the SSPCA method from [29]. We used the scikit-learn implementation [42] for Sparse PCA, while we used the Parsimony package (https://github.com/neurospin/pylearn-parsimony) for the ElasticNet PCA, GraphNet PCA, and SPCA-TV methods. Concerning SSPCA, we used the MATLAB implementation provided in [29].

The number of parameters to set differs between the methods. For Sparse PCA, the $\lambda_1$ parameter selects its optimal value from the range $\{0.1, 1.0, 5.0, 10.0\}$. ElasticNet PCA requires setting the $\lambda_1$ and $\lambda_2$ penalty weights. Meanwhile, GraphNet PCA and SPCA-TV require the setting of an additional parameter, namely the spatial constraint penalty, $\lambda$. We re-parametrized these penalty weights as ratios: a global parameter, $\alpha \in \{0.01, 0.1, 1.0\}$, controls the weight attributed to the whole penalty term, including the spatial and the $\ell_1$ regularization. Individual constraints are expressed in terms of ratios: the $\ell_1$ ratio, $\lambda_1/(\lambda_1 + \lambda_2 + \lambda) \in \{0.1, 0.5, 0.8\}$, and the $\ell_{TV}$ (or $\ell_{GN}$ for GraphNet) ratio, $\lambda/(\lambda_1 + \lambda_2 + \lambda) \in \{0.1, 0.5, 0.8\}$. For ElasticNet PCA, we explored the grid of parameters composed of the Cartesian product of the $\alpha$ and $\ell_1$-ratio subsets. For GraphNet PCA and SPCA-TV, we performed a parameter search on grids given by the Cartesian products of the $(\alpha, \ell_1, \ell_{GN})$ and $(\alpha, \ell_1, \ell_{TV})$ subsets, respectively. Concerning the SSPCA method, the regularization parameter selects its optimal value in the range $[10^{-8}, 10^{8}]$.

However, in order to ensure that the extracted components have a minimum amount of sparsity, we also included a criterion controlling sparsity: at least half of the features of each component have to be zero. For both real neuroimaging experiments, performance was evaluated through a 5-fold × 5-fold double cross-validation pipeline. The double cross-validation process consists of two nested cross-validation loops, referred to as the internal and external cross-validation loops. In the outer (external) loop, all samples are randomly split into subsets, referred to as training and test sets. The test sets are used exclusively for model assessment, while the training sets are used in the inner (internal) loop for model fitting


and selection. The inner folds select the set of parameters minimizing the reconstruction error on the outer fold. For the synthetic data, we used 50 different purposely generated data sets and 5 inner folds for parameter selection. In order to evaluate the reconstruction accuracy of the methods, we report the mean Frobenius norm of the reconstruction error across the folds/data sets on independent test data. The hypothesis we wanted to test was whether there was a substantial decrease in the reconstruction error of independent data when using SPCA-TV compared to when using Sparse PCA, ElasticNet PCA, GraphNet PCA, or SSPCA. It was tested through a related two-sample t-test. This choice to compare the methods' performance on independent test data was motivated by the fact that the optimal reconstruction of the training set is necessarily hindered by the spatial and sparsity constraints. We therefore expect SPCA-TV to perform worse on training data than other, less constrained, methods. However, the TV penalty has a more important purpose than just minimizing the reconstruction error: the estimation of coherent and reproducible loadings. Indeed, clinicians expect that, if images from other patients with comparable clinical conditions had been used, the extracted loading vectors would have turned out to be similar. Therefore, since the ultimate goal of SPCA-TV is to yield stable and reproducible weight maps, it is more relevant to evaluate the methods on independent test data.

The stability of the loading vectors obtained across the various training data sets (variation in the learning samples) was assessed through a similarity measure: the pairwise Dice index between loading vectors obtained with different folds/data sets [16]. We tested whether the pairwise Dice indices are significantly higher with SPCA-TV than with the other methods. Testing this hypothesis is equivalent to testing the sign of the difference of pairwise Dice indices between methods. However, since the pairwise Dice indices are not independent of one another (the folds share many of their learning samples), direct significance measures are biased. We therefore used permutation testing to estimate empirical p-values. The null hypothesis was tested by simulating samples from the null distribution: we generated 1,000 random permutations of the sign of the difference of the pairwise Dice indices between the PCA methods under comparison, and the statistics obtained on the true data were then compared with those obtained on the reshuffled data to derive empirical p-values.
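For reference, a pairwise Dice index between the supports of two loading vectors can be computed as follows. The binarization rule (entries with $|v_j|$ above a threshold count as the support) is our assumption, not a detail taken from the paper:

```python
import numpy as np

def pairwise_dice(loadings_a, loadings_b, threshold=0.0):
    """Dice index between the supports of two loading vectors.

    Supports are defined here as entries with |v| > threshold
    (an illustrative binarization choice).
    """
    sa = np.abs(loadings_a) > threshold
    sb = np.abs(loadings_b) > threshold
    inter = np.logical_and(sa, sb).sum()
    denom = sa.sum() + sb.sum()
    # 2|A ∩ B| / (|A| + |B|); two empty supports count as identical
    return 2.0 * inter / denom if denom else 1.0
```

The index is 1 for identical supports, 0 for disjoint ones, and intermediate for partial overlap.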

For each experiment, we made the initial choice to retrieve the first ten components. However, given the length constraint, we only present the weight maps associated with the top three components for Sparse PCA and SPCA-TV in this paper. The weight maps of ElasticNet PCA, GraphNet PCA, and SSPCA for the experiments are presented in the supplementary materials (available in the supplementary files/multimedia tab).

A. Simulation study

We generated 50 sets of synthetic data, each composed of 500 images of size $100 \times 100$ pixels. Images are generated using the following noisy linear system:

$$u_1 V^1 + u_2 V^2 + u_3 V^3 + \varepsilon \;\in\; \mathbb{R}^{10\,000}, \quad (27)$$

where $V = [V^1, V^2, V^3] \in \mathbb{R}^{10\,000 \times 3}$ are sparse and structured loading vectors, illustrated in Fig. 1. The support of $V^1$ defines the two upper dots, the support of $V^2$ defines the two lower dots, while the support of $V^3$ delineates the middle dot. The coefficients $u = [u_1, u_2, u_3]$ that linearly combine the components of $V$ are generated according to a centered Gaussian distribution. The elements of the noise vector $\varepsilon$ are independent and identically distributed according to a centered Gaussian distribution, with a 0.1 signal-to-noise ratio (SNR). This SNR was selected by a previous calibration pipeline, in which we tested the efficiency of data reconstruction at multiple SNR values ranging from 0 to 0.5. We decided to work with an SNR of 0.1 because it is located in the range of values where standard PCA starts being less efficient in the recovery process.
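A hypothetical generator in the spirit of Eq. (27) is sketched below; the dot positions, dot radii, and the exact SNR convention (noise rescaled so that the signal-to-noise standard-deviation ratio equals `snr`) are our assumptions, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic(n=500, shape=(100, 100), snr=0.1):
    """Generate n images from three sparse dot-shaped loadings (sketch of Eq. (27))."""
    p = shape[0] * shape[1]
    V = np.zeros((p, 3))
    grid = np.indices(shape).reshape(2, -1).T      # (p, 2) pixel coordinates
    centers = {0: [(25, 25), (25, 75)],            # two upper dots -> V^1
               1: [(75, 25), (75, 75)],            # two lower dots -> V^2
               2: [(50, 50)]}                      # middle dot     -> V^3
    for k, dot_centers in centers.items():
        for c in dot_centers:
            V[np.linalg.norm(grid - c, axis=1) < 10, k] = 1.0
    U = rng.standard_normal((n, 3))                # centered Gaussian coefficients
    signal = U @ V.T
    noise = rng.standard_normal((n, p))
    noise *= signal.std() / (snr * noise.std())    # enforce the target SNR
    return signal + noise, V
```

The three supports are disjoint by construction, matching Fig. 1's layout of non-overlapping dots.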

Fig. 1: Loading vectors $V = [V^1, V^2, V^3] \in \mathbb{R}^{10\,000 \times 3}$ used to generate the images (components 1, 2, and 3).

We split the 500 artificial images into a test and a training set, with 250 images in each set, and learned the decomposition on the training set.

Fig. 2: Loading vectors recovered from 250 images using Sparse PCA and SPCA-TV (components 1–3 for each method).

Fig. 2 shows the loading vectors extracted from one data set. Please note that the sign is arbitrary: indeed, in the loss of Eq. (3), $u^\top$ and $v$ can both be multiplied by $-1$


TABLE I: Scores averaged across the 50 independent data sets. We tested whether the scores obtained with the existing PCA methods are significantly different from the scores obtained with SPCA-TV (significance: $p \le 10^{-3}$).

Methods          Test data reconstruction error    MSE     Dice index
Sparse PCA       1576.0                            0.91    0.28
ElasticNet PCA   1572.4                            0.83    0.43
GraphNet PCA     1570.8                            0.83    0.30
SSPCA            1571.9                            1.54    0.07
SPCA-TV          1570.1                            0.64    0.52

without changing anything. We observe that Sparse PCA yields very scattered loading vectors. The loading vectors of SPCA-TV, on the other hand, are sparse but also organized into clear regions. SPCA-TV provides loading vectors that closely match the ground truth. The reconstruction error is evaluated on the test sets (Tab. I), with its value over the 50 data sets being significantly lower with SPCA-TV than with the Sparse PCA ($T = 94.5$, $p = 3.9 \cdot 10^{-57}$), ElasticNet PCA ($T = 33.2$, $p = 2.7 \cdot 10^{-35}$), GraphNet PCA ($T = 12.7$, $p = 3.6 \cdot 10^{-17}$), and SSPCA [29] ($T = 18.9$, $p = 3.9 \cdot 10^{-24}$) methods. Additional details concerning the reconstruction accuracy on both the training and test data are presented in Figure 1 of the supplementary materials (available in the supplementary files/multimedia tab).

A different way of quantifying the reconstruction accuracy of each method is to evaluate how closely the extracted loadings match the known ground truth of the simulated data set. We computed the mean squared error (MSE) between the ground truth and the estimated loadings. The results are presented in Tab. I. We note that the MSE is significantly lower with SPCA-TV than with Sparse PCA ($T = 6.9$, $p = 8.0 \cdot 10^{-9}$), ElasticNet PCA ($T = 6.2$, $p = 1.1 \cdot 10^{-7}$), GraphNet PCA ($T = 4.1$, $p = 1.4 \cdot 10^{-4}$), and SSPCA ($T = 22.6$, $p = 1.5 \cdot 10^{-27}$).

Moreover, when evaluating the stability of the loading vectors across resampling, we found a statistically significantly higher mean Dice index when using SPCA-TV compared to the other methods ($p < 0.001$). The results are presented in Tab. I. They indicate that SPCA-TV is more robust to variation in the learning samples than the other sparse methods: SPCA-TV yields reproducible loading vectors across data sets.

These results indicate that the SPCA-TV loadings are not only more stable across resampling, but also achieve a better recovery of the underlying variability in independent data than the Sparse PCA, ElasticNet PCA, GraphNet PCA, and SSPCA methods.

One of the issues linked to biconvex optimization is the risk of falling into local minima. Conscious of this potential risk, we set up an experiment in which we ran the optimization of the same problem 50 times, with a different starting point at each run. We then compared the loading vectors obtained at each run by computing a similarity measure, the Dice index, which quantifies the proximity between the independently-run solutions obtained with different starting points. We obtained a Dice index of 0.99 on the 1st component, 0.99 on the 2nd component, and 0.72 on the 3rd component. On the strength of these indices, we are confident in the robustness of this algorithm and in its ability to converge toward the same stable solution independently of the choice of starting point.

B. 3D images of functional MRI of patients with schizophrenia

We then applied the methods to 3D images of BOLD functional MRI (fMRI), acquired with the same scanner and pulse sequence. Imaging was performed on a 1.5 T scanner using a standard head coil. For all functional scans, the field of view was $206 \times 206 \times 153$ mm, with a resolution close to 3.5 mm in all directions. The parameters of the PRESTO sequence were: TE = 9.6 ms, TR = 19.25 ms, EPI factor = 15, flip angle = 9°. Each fMRI run consisted of 900 collected volumes. The cohort is composed of 23 patients with schizophrenia (average age = 34.96 years; 8 females, 15 males). Brain activation was measured while subjects experienced multimodal hallucinations. The fMRI data were pre-processed using SPM12 (Wellcome Department of Imaging Neuroscience, London, UK). Data preprocessing consisted of motion correction (realignment), coregistration of the individual anatomical T1 image to the functional images, and spatial normalization to MNI space using DARTEL, based on segmented T1 scans.

We considered each set of consecutive images under a pre-hallucination state as a block. Since most of the patients hallucinated more than once during the scanning session, we have more blocks than patients (83 blocks). The activation maps are computed from these blocks. Based on the general linear model approach, we regressed, for each block, the fMRI signal time course on a linear ramp function. Indeed, we hypothesized that activation in some regions presents a ramp-like increase during the time preceding the onset of hallucinations (see the example of regression in Figure 3 of the supplementary materials, available in the supplementary files/multimedia tab). The activation maps that we used as input to the SPCA-TV method are the statistical parametric maps associated with the coefficients of the block regression (see one example in Figure 4 of the supplementary materials, available in the supplementary files/multimedia tab). We obtained a data set of $n = 83$ maps and $p = 63\,966$ features. We hypothesized that the principal components extracted from these activation maps with SPCA-TV could uncover major trends of variability within pre-hallucination patterns. Thus, they might reveal the existence of subgroups of patients according to the sensory modality (e.g., vision or audition) involved during hallucinations.

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, the SSPCA method [29] could not be applied to this specific example, since data sets have to be laid out as closed cubic forms, without any holes, to be eligible for SSPCA; it does not support masked data such as those used here.

The loading vectors extracted with Sparse PCA and SPCA-TV from the activation maps of pre-hallucination scans are presented in Fig. 3. We observe a behavior similar to the synthetic example, namely that the loading vectors of


Sparse PCA tend to be scattered and produce irregular patterns. However, SPCA-TV seems to yield structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCA-TV loading vectors are not redundant and reveal different patterns.

Indeed, the loading vectors obtained by SPCA-TV are of great interest because they reveal insightful patterns of variability in the data: the second loading is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also of vision-processing areas such as the occipital fusiform gyrus and the parietal operculum cortex. The third loading reveals important weights in the middle temporal gyrus, the parietal operculum cortex, and the frontal pole. The first loading vector encompasses all features of the brain. One might see this first component as a global variability affecting the whole brain, such as the overarching effect of age. SPCA-TV selects this dense configuration in spite of the sparsity constraint. It is highly desirable to remove any sort of global effect first, in order to start identifying, in the next components, local patterns that are not impacted by a global variability of this kind. We can identify a widespread set of dysfunctional language-related or vision-related areas that present increasing activity during the time preceding the onset of hallucinations. The regions extracted by SPCA-TV are found to be pertinent according to the existing literature on the topic ([28], [27], [7]).

These results seem to indicate the possible existence of subgroups of patients according to the hallucination modalities involved. An interesting application would be to use the score of the second component extracted by SPCA-TV in order to distinguish patients with visual hallucinations from those suffering mainly from auditory hallucinations.

The reconstruction error is significantly lower in SPCA-TV than in Sparse PCA (T = 13.9, p = 1.5 · 10^-4), ElasticNet PCA (T = 7.1, p = 2.1 · 10^-3), and GraphNet PCA (T = 4.6, p = 1.0 · 10^-2). Moreover, when assessing the stability of the loading vectors across the folds, we found a significantly higher mean Dice index in SPCA-TV compared to Sparse PCA (p = 4.0 · 10^-3), ElasticNet PCA (p = 4.0 · 10^-3), and GraphNet PCA (p = 2.0 · 10^-3), as presented in Tab. II. Additional details regarding the reconstruction accuracy on both the train and test sets and the Dice index are presented in Figure 5 of the supplementary materials (available in the supplementary files/multimedia tab).
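The Dice index [16] used here measures the overlap between the supports (non-zero patterns) of loading vectors estimated on different folds. A minimal sketch, with our own function name and toy vectors:

```python
import numpy as np

def dice_index(v1, v2, tol=1e-8):
    """Dice overlap between the supports of two loading vectors:
    2|A ∩ B| / (|A| + |B|), where A and B are the non-zero index sets."""
    a = np.abs(v1) > tol
    b = np.abs(v2) > tol
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0   # both supports empty: perfect agreement by convention
    return 2.0 * np.logical_and(a, b).sum() / denom

# toy loading vectors from two folds: supports {1,2,4} and {1,4}
v_fold1 = np.array([0.0, 0.5, 0.7, 0.0, 0.1])
v_fold2 = np.array([0.0, 0.4, 0.0, 0.0, 0.2])
score = dice_index(v_fold1, v_fold2)   # 2*2/(3+2) = 0.8
```

A Dice index close to 1 across folds indicates that a method consistently selects the same voxels.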

TABLE II: Scores for the fMRI data, averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notations: **: p <= 10^-2, ***: p <= 10^-3.

Methods        | Test Data Reconstruction Error | Dice Index
Sparse PCA     | 15152                          | 0.34
ElasticNet PCA | 14827                          | 0.32
GraphNet PCA   | 14281                          | 0.58
SPCA-TV        | 14140                          | 0.63

In conclusion, SPCA-TV significantly outperforms Sparse, ElasticNet, and GraphNet PCA in terms of the reconstruction error on independent test data, and in the sense that its loading vectors are both more clinically interpretable and more stable.

We also evaluated the convergence speed of Sparse PCA, Mini-Batch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA, GraphNet PCA, and SPCA-TV for this functional MRI data set of n = 83 samples and p = 63,966 features. We compared the execution time required for each algorithm to achieve a given level of precision in Tab. III. Sparse PCA and ElasticNet PCA are similar in terms of convergence time, while Mini-Batch Sparse PCA is much faster but does not converge to high precision. As expected, the structured methods (GraphNet PCA and SPCA-TV) take longer than the other sparse methods because of the inclusion

Fig. 3. Loading vectors (components 1-3) recovered from the 83 activation maps using Sparse PCA and SPCA-TV.


TABLE III: Comparison of the execution time required for each sparse method to reach the same precision. Times reported in seconds.

Time to reach a given precision (in seconds):

Methods               | 10   | 1     | 10^-1 | 10^-2  | 10^-3
Mini-batch Sparse PCA | 532  | -     | -     | -      | -
Sparse PCA            | 1580 | 2312  | 3443  | 3868   | 4501
ElasticNet PCA        | 1237 | 1381  | 3027  | 3964   | 4063
GraphNet PCA          | 3019 | 5216  | 8131  | 8814   | 8884
SPCA-TV               | 4277 | 29586 | 80930 | 138134 | 144599

of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but the convergence time is still reasonable for an fMRI data set with 65,000 voxels.

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients with a diagnosis of mild cognitive impairment (MCI) from the ADNI database who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy, and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13], and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).

We applied all PCA methods under study to this data set, except SSPCA. Indeed, we could not apply the SSPCA method to this data set due to some intrinsic limitations of the method: SSPCA's application is restricted to N-dimensional array images. It does not support meshes of cortical surfaces such as the data set used here.

The loading vectors obtained from the data set with Sparse PCA and SPCA-TV are presented in Fig. 4. As expected, Sparse PCA loadings are not easily interpretable, because the patterns are irregular and dispersed throughout the brain surface. In contrast, SPCA-TV reveals structured and smooth clusters in relevant regions.

The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between patients resulting from a global cortical atrophy, as often observed in AD patients. The second loading vector includes variability in the entorhinal cortex, the hippocampus, and temporal regions. Last, the third loading vector might be related to the atrophy of the frontal lobe, and captures variability in the precuneus too. Thus, SPCA-TV provides a smooth map that closely matches the well-known brain regions involved in Alzheimer's disease [22].

Indeed, it is well documented that cortical atrophy progresses over three main stages in Alzheimer's disease ([10], [15]). The cortical structures are sequentially affected because of the accumulation of amyloid plaques. Cortical atrophy is first observed in the mild stage of the disease, in regions surrounding the hippocampus ([26], [44], [47]) and the entorhinal cortex ([12]), as seen in the second component. This is consistent with early memory deficits. Then, the disease progresses to a moderate stage, where atrophy gradually extends to the prefrontal association cortex, as revealed in the third component ([37]). In the severe stage of the disease, the whole cortex is affected by atrophy ([15]), as revealed in the first component. In order to assess the clinical significance of these weight maps, we tested the correlation between the scores corresponding to the three components and performance on a clinical test, the ADAS. The Alzheimer's Disease Assessment Scale-Cognitive subscale is the most widely used general cognitive measure in AD. ADAS is scored in terms of errors, so a high score indicates poor performance. We obtained significant correlations between ADAS test performance and the components' scores in Fig. 5 (r = -0.34, p = 4.2 · 10^-11 for the first component; r = -0.26, p = 3.6 · 10^-7 for the second component; and r = -0.35, p = 4.0 · 10^-12 for the third component). The same behavior is observable for all three components: the ADAS score grows proportionately to the level to which a patient is affected and to the severity of the atrophy he presents (in the temporal pole, in the prefrontal region, and also globally). Conversely, control subjects score low on the ADAS metric and present low levels of cortical atrophy. Therefore, SPCA-TV provides us with clear biomarkers that are perfectly relevant to the scope of Alzheimer's disease progression.

The reconstruction error is significantly lower in SPCA-TV than in Sparse PCA (T = 12.7, p = 2.1 · 10^-4), ElasticNet PCA (T = 6.8, p = 2.3 · 10^-3), and GraphNet PCA (T = 2.83, p = 4.7 · 10^-2). The results are presented in Tab. IV. Moreover, when assessing the stability of the loading vectors across the folds, the mean Dice index is significantly higher in SPCA-TV than in the other methods. Additional details regarding the reconstruction accuracy on both the train and test sets and the Dice index are presented in Figure 7 of the supplementary materials (available in the supplementary files/multimedia tab).

IV CONCLUSION

We proposed an extension of Sparse PCA that takes into account the spatial structure of the data. The optimization


TABLE IV: Scores are averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly lower than the scores obtained with SPCA-TV. Significance notations: *: p <= 10^-1, **: p <= 10^-2, ***: p <= 10^-3.

Methods        | Test Data Reconstruction Error | Dice Index
Sparse PCA     | 29918                          | 0.44
ElasticNet PCA | 28326                          | 0.43
GraphNet PCA   | 28136                          | 0.62
SPCA-TV        | 27950                          | 0.65

scheme is able to minimize any combination of the l1, l2, and TV penalties, while preserving the exact l1 penalty. We observe that SPCA-TV, in contrast to other existing sparse PCA methods, yields clinically interpretable results and reveals major sources of variability in the data by highlighting structured clusters of interest in the loading vectors. Furthermore, SPCA-TV's loading vectors were more stable across the learning samples compared to the other methods. SPCA-TV was validated, and its applicability was demonstrated, on three distinct data sets; we may reach the conclusion that SPCA-TV can be used on any kind of structured configuration and is able to uncover structure within the data.

Fig. 4. Loading vectors (components 1-3) recovered from the 133 MCI patients using Sparse PCA and SPCA-TV.

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python library:
• URL: https://github.com/neurospin/pylearn-parsimony
• Description: ParsimonY is a Python library for structured and sparse machine learning. ParsimonY is open-source (BSD License) and compliant with the scikit-learn API.

b) Data sets and scripts:
• URL: ftp://ftp.cea.fr/pub/unati/brainomics/papers/pcatv
• Description: This URL provides the simulation data set and the Python script used to create Fig. 2 of the paper.

REFERENCES

[1] A. Abraham, E. Dohmatob, B. Thirion, D. Samaras, and G. Varoquaux. Extracting brain regions from rest fMRI with total-variation constrained dictionary learning. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2013), 16th International Conference, Nagoya, Japan, 2013, Proceedings, Part II, pages 607-615, 2013.

[2] B. Fischl, M. Sereno, and A. Dale. Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195-207, 1999.

[3] A. Bakkour, J.C. Morris, and B.C. Dickerson. The cortical signature of prodromal AD: regional thinning predicts mild AD dementia. Neurology, 72:1048-1055, 2009.

[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

Fig. 5. Correlation of component scores with ADAS test performance.


[5] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Processing, 18(11):2419-2434, 2009.

[6] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557-580, 2012.

[7] L. Bentaleb, M. Beauregard, P. Liddle, and E. Stip. Cerebral activity associated with auditory verbal hallucinations: a functional magnetic resonance imaging case study. Journal of Psychiatry & Neuroscience: JPN, 27(2):110, 2002.

[8] J.M. Borwein and A.S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2006.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[10] H. Braak and E. Braak. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathologica, 82(4):239-259, 1991.

[11] K. Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. 1909.

[12] V.A. Cardenas, L.L. Chao, C. Studholme, K. Yaffe, B.L. Miller, C. Madison, S.T. Buckley, D. Mungas, N. Schuff, and M.W. Weiner. Brain atrophy associated with baseline and longitudinal measures of cognition. Neurobiology of Aging, 32(4):572-580, 2011.

[13] A. Dale, B. Fischl, and M.I. Sereno. Cortical surface-based analysis I: Segmentation and surface reconstruction. NeuroImage, 9(2):179-194, 1999.

[14] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

[15] A. Delacourte, J.P. David, N. Sergeant, L. Buee, A. Wattez, P. Vermersch, F. Ghozali, C. Fallet-Bianco, F. Pasquier, F. Lebert, et al. The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer's disease. Neurology, 52(6):1158-1158, 1999.

[16] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297-302, 1945.

[17] B.C. Dickerson, E. Feczko, J.C. Augustinack, J. Pacheco, J.C. Morris, and B. Fischl. Differential effects of aging and Alzheimer's disease on medial temporal lobe cortical thickness and surface area. Neurobiology of Aging, 30:432-440, 2009.

[18] E. Dohmatob, M. Eickenberg, B. Thirion, and G. Varoquaux. Speeding-up model-selection in GraphNet via early-stopping and univariate feature-screening. June 2015.

[19] M. Dubois, F. Hadj-Selem, T. Lofstedt, M. Perrot, C. Fischer, V. Frouin, and E. Duchesnay. Predictive support recovery with TV-Elastic Net penalties and logistic regression: an application to structural MRI. In Proceedings of the Fourth International Workshop on Pattern Recognition in Neuroimaging (PRNI 2014), 2014.

[20] H. Eavani, T.D. Satterthwaite, R.E. Filipovych, R.C. Gur, and C. Davatzikos. Identifying sparse connectivity patterns in the brain using resting-state fMRI. NeuroImage, 105:286-299, 2015.

[21] D. Felleman and D. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex (New York, NY: 1991), 1(1):1-47, 1991.

[22] G.B. Frisoni, N.C. Fox, C.R. Jack, P. Scheltens, and P.M. Thompson. The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol., 6(2):67-77, 2010.

[23] L. Grosenick, B. Klingenberg, K. Katovich, B. Knutson, and J. Taylor. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage, 72:304-321, May 2013.

[24] R. Guo, M. Ahn, H. Zhu, and the Alzheimer's Disease Neuroimaging Initiative. Spatially weighted principal component analysis for imaging classification. Journal of Computational and Graphical Statistics, 24:274-296, 2015.

[25] F. Hadj-Selem, T. Lofstedt, V. Frouin, V. Guillemot, and E. Duchesnay. An iterative smoothing algorithm for regression with structured sparsity. arXiv:1605.09658 [stat], 2016.

[26] C.R. Jack, M.M. Shiung, J.L. Gunter, P.C. O'Brien, S.D. Weigand, D.S. Knopman, B.F. Boeve, R.J. Ivnik, G.E. Smith, R.H. Cha, et al. Comparison of different MRI brain atrophy rate measures with clinical disease progression in AD. Neurology, 62(4):591-600, 2004.

[27] R. Jardri, A. Pouchet, D. Pins, and P. Thomas. Cortical activations during auditory verbal hallucinations in schizophrenia: a coordinate-based meta-analysis. American Journal of Psychiatry, 168(1):73-81, 2011.

[28] R. Jardri, P. Thomas, C. Delmaire, P. Delion, and D. Pins. The neurodynamic organization of modality-dependent hallucinations. Cerebral Cortex, pages 1108-1117, 2013.

[29] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[30] I. Jolliffe, N. Trendafilov, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.

[31] M. Journee, Y. Nesterov, P. Richtarik, and R. Sepulchre. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res., 11:517-553, 2010.

[32] B. Kandel, D. Wolk, J. Gee, and B. Avants. Predicting cognitive data from medical images using sparse linear regression. Information Processing in Medical Imaging: Proceedings of the Conference, 23:86-97, 2013.

[33] M. Li, Y. Liu, F. Chen, and D. Hu. Including signal intensity increases the performance of blind source separation on brain imaging data. IEEE Transactions on Medical Imaging, 34(2):551-563, 2015.

[34] L.W. Mackey. Deflation methods for sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017-1024. Curran Associates, Inc., 2009.

[35] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, Ecole Normale Superieure de Cachan, 2010.

[36] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19-60, 2010.

[37] C.R. McDonald, L. McEvoy, L. Gharapetian, C. Fennema-Notestine, D.J. Hagler, D. Holland, A. Koyama, J.B. Brewer, A.M. Dale, Alzheimer's Disease Neuroimaging Initiative, et al. Regional rates of neocortical atrophy from normal aging to early Alzheimer disease. Neurology, 73(6):457-465, 2009.

[38] H. Mohr, U. Wolfensteller, S. Frimmel, and H. Ruge. Sparse regularization techniques provide novel insights into outcome integration processes. NeuroImage, 104:163-176, January 2015.

[39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

[40] B. Ng, A. Vahdat, G. Hamarneh, and R. Abugharbieh. Generalized sparse classifiers for decoding cognitive states in fMRI. In SpringerLink, pages 108-115, Beijing, China, September 2012. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-15948-0_14.

[41] R. Nieuwenhuys. The myeloarchitectonic studies on the human cerebral cortex of the Vogt-Vogt school, and their significance for the interpretation of functional neuroimaging data. Brain Structure and Function, 218(2):303-352, 2013.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[43] M. Ramezani, K. Marble, H. Trang, I.S. Johnsrude, and P. Abolmaesumi. Joint sparse representation of brain activity patterns in multi-task fMRI data. IEEE Transactions on Medical Imaging, 34(1):2-12, 2015.

[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al.


Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567-574, 2008.

[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang, and D. Hu. Making group inferences using sparse representation of resting-state functional MRI data with application to sleep deprivation. Human Brain Mapping, 38(9):4671-4689, 2017.

[46] J.G. Sled, A.P. Zijdenbos, and A.C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17:87-97, 1998.

[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell, and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754-1766, 2004.

[48] W.-T. Wang and H.-C. Huang. Regularized principal component analysis for spatial data. ArXiv e-prints, 2015.

[49] D.M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, 2009.

[50] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719-752, 2012.

[51] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.



Algorithm 1 FISTA(X^T u, v^0, ε_μ, μ, A, λ, L(∇g))
1: v^1 = v^0, k = 2
2: Compute the gradient of the smooth part, ∇(g + λ s_μ) (Eq. (18)), and its Lipschitz constant L_μ (Eq. (19))
3: Compute the step size t_μ = L_μ^(-1)
4: repeat
5:   z = v^(k-1) + ((k - 2)/(k + 1)) (v^(k-1) - v^(k-2))
6:   v^k = prox_h(z - t_μ ∇(g + λ s_μ)(z))
7: until GAP_μ(v^k) <= ε_μ
8: return v^k
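The structure of Algorithm 1 can be sketched with a generic FISTA loop. This is illustrative only: the paper's smooth part g + λs_μ and prox_h come from Eq. (17), whereas the toy below uses a quadratic loss with an l1 prox (soft-thresholding), and a fixed iteration budget in place of the duality-gap test of Line 7.

```python
import numpy as np

def fista(grad, prox, L, v0, max_iter=200):
    """Generic FISTA loop mirroring Algorithm 1: a gradient step on the
    smooth part, a prox step on h, and the (k-2)/(k+1) momentum of Line 5."""
    t = 1.0 / L                       # step size t = 1/L (Line 3)
    v_prev, v = v0.copy(), v0.copy()
    for k in range(2, max_iter + 2):
        z = v + ((k - 2.0) / (k + 1.0)) * (v - v_prev)   # Line 5
        v_prev, v = v, prox(z - t * grad(z), t)          # Line 6
    return v

# toy problem: min_v 0.5*||v - b||^2 + lam*||v||_1, solved in closed
# form by soft-thresholding, so the answer is easy to check
b = np.array([3.0, -0.2, 0.05])
lam = 0.5
grad = lambda v: v - b
prox = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - lam * t, 0.0)
v_star = fista(grad, prox, L=1.0, v0=np.zeros(3))        # -> [2.5, 0, 0]
```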

E Minimization of the loading vectors with CONESTA

The step size t_μ computed in Line 3 of Algorithm 1 depends on the smoothing parameter μ (see Eq. (19)). Hence, there is a trade-off between speed and precision. Indeed, high precision with a small μ will lead to slow convergence (small t_μ). Conversely, poor precision (large μ) will lead to rapid convergence (large t_μ). Thus, we propose a continuation approach (Algorithm 2), which decreases the smoothing parameter with respect to the distance to the minimum. On the one hand, when we are far from v* (the minimum of Eq. (17)), we can use a large μ to rapidly decrease the objective function. On the other hand, when we are close to v*, we need a small μ in order to obtain an accurate approximation of the original objective function.

1) Duality gap: The distance to the unknown f(v*) is estimated using the duality gap. Duality formulations are often used to control the achieved precision level when minimizing convex functions. They provide an estimation of the error f(v^k) - f(v*) for any v, when the minimum is unknown. The duality gap is the cornerstone of the CONESTA algorithm. Indeed, it is used three times:

i. As the stopping criterion in the inner FISTA loop (Line 7 in Algorithm 1). FISTA will stop as soon as the current precision is achieved using the current smoothing parameter μ. This prevents unnecessary convergence toward the approximated (smoothed) objective function.

ii. In the i-th CONESTA iteration, as a way to estimate the current error f(v^i) - f(v*) (Line 7 in Algorithm 2). The error is estimated using the gap of the smoothed problem, GAP_{μ=μ^i}(v^(i+1)), which avoids unnecessary computation since it has already been computed during the last iteration of FISTA. The inequality in Eq. (15) is used to obtain the gap ε^i to the original non-smoothed problem. The next desired precision, ε^(i+1), and the smoothing parameter, μ^(i+1), are derived from this value.

iii. Finally, as the global stopping criterion within CONESTA (Line 10 in Algorithm 2). This guarantees that the obtained approximation of the minimum, v^i, at convergence satisfies f(v^i) - f(v*) < ε.

Based on Eq. (17), which decomposes the smoothed objective function as the sum of a strongly convex loss and the penalty,

f_μ(v) = L(v) + ψ_μ(v),

we compute the duality gap that provides an upper bound estimation of the error to the optimum. At any step k of the algorithm, given the current primal v^k and dual σ(v^k) ≡ ∇L(v^k) variables [8], we can compute the duality gap using the Fenchel duality rules [35]:

GAP(v^k) ≡ f_μ(v^k) + L*(σ(v^k)) + ψ*_μ(-σ(v^k)),   (20)

where L* and ψ*_μ are, respectively, the Fenchel conjugates of L and ψ_μ. Denoting by v* the minimum of f_μ (the solution of Eq. (17)), the interest of the duality gap is that it provides an upper bound for the difference with the optimal value of the function. Moreover, it vanishes at the minimum:

GAP(v^k) >= f(v^k) - f(v*) >= 0,   GAP(v*) = 0.   (21)

The dual variable is

σ(v^k) ≡ ∇L(v^k) = v^k - X^T u / (n λ2),   (22)

and the Fenchel conjugate of the squared loss L(v^k) is

L*(σ(v^k)) = (1/2) ‖σ(v^k)‖₂² + σ(v^k)^T X^T u / (n λ2).   (23)

In [25], the authors provide the expression of the Fenchel conjugate of the penalty ψ_μ(v^k):

ψ*_μ(-σ(v^k)) = (1/2) Σ_{j=1}^{P} ( [ |-σ(v^k)_j - (λ/λ2) (A^T α*_μ(v^k))_j| - λ1/λ2 ]_+ )² + (λμ / (2 λ2)) ‖α*_μ(v^k)‖₂²,   (24)

where [·]_+ = max(0, ·).

The expression of the duality gap in Eq. (20) provides an estimation of the distance to the minimum. This distance is geometrically decreased by a factor τ = 0.5 at the end of each continuation, and the decreased value defines the precision that should be reached by the next iteration (Line 8 of Algorithm 2). Thus, the algorithm dynamically generates a sequence of decreasing prescribed precisions ε^i. Such a scheme ensures convergence [25] towards a globally desired final precision, ε, which is the only parameter that the user needs to provide.
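The mechanics of Eqs. (20)-(21) can be made concrete on a much simpler problem than the paper's smoothed TV functional. The sketch below computes the Fenchel gap for plain l1-regularized least squares, 0.5‖v - b‖² + λ‖v‖₁, where both conjugates are available in closed form; it is our illustration, not the paper's gap.

```python
import numpy as np

lam = 0.5
b = np.array([2.0, -1.0])

def f(v):
    """Primal objective: strongly convex loss plus l1 penalty."""
    return 0.5 * np.sum((v - b) ** 2) + lam * np.sum(np.abs(v))

def gap(v):
    """Fenchel duality gap GAP(v) = f(v) + L*(sigma) + psi*(-sigma), with
    sigma = grad L(v) = v - b, L*(s) = 0.5||s||^2 + s.b, and
    psi*(-s) = 0 whenever ||s||_inf <= lam (else +inf)."""
    sigma = v - b
    if np.max(np.abs(sigma)) > lam + 1e-12:
        return np.inf                       # -sigma outside dom(psi*)
    return f(v) + 0.5 * np.sum(sigma ** 2) + sigma @ b

# closed-form minimizer: soft-thresholding of b
v_star = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
```

As Eq. (21) states, the gap vanishes at v* and upper-bounds f(v) - f(v*) elsewhere, which is exactly the property CONESTA exploits as a computable stopping criterion.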

2) Determining the optimal smoothing parameter: Given the current prescribed precision ε^i, we need to compute an optimal smoothing parameter μ_opt(ε^i) (Line 9 in Algorithm 2) that minimizes the number of FISTA iterations needed to achieve such precision when minimizing Eq. (3) (with fixed u) via Eq. (17), i.e., such that f(v^k) - f(v*) < ε^i.

In [25], the authors provide the expression of this optimal smoothing parameter:

μ_opt(ε^i) = ( -λ M ‖A‖₂² + sqrt( (λ M ‖A‖₂²)² + M L(∇l) ‖A‖₂² ε^i ) ) / ( M L(∇l) ),   (25)

where M = P/2 (Eq. (15)) and L(∇l) = 2 is the Lipschitz constant of the gradient of l, as defined in Eq. (17).
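A direct transcription of Eq. (25), as reconstructed above, is a one-liner. The function name and the example inputs are ours; note how μ_opt shrinks to 0 together with the prescribed precision ε^i, which is the whole point of the continuation.

```python
import math

def mu_opt(eps_i, lam, norm_A2, P, L_grad_l=2.0):
    """Optimal smoothing parameter of Eq. (25).

    eps_i    : precision prescribed for the current continuation step
    lam      : weight of the smoothed (TV) penalty
    norm_A2  : ||A||_2^2, squared spectral norm of the linear operator A
    P        : number of features, so that M = P / 2 as in Eq. (15)
    L_grad_l : Lipschitz constant of the gradient of l (2 in the paper)
    """
    M = P / 2.0
    a = lam * M * norm_A2
    return (-a + math.sqrt(a * a + M * L_grad_l * norm_A2 * eps_i)) \
           / (M * L_grad_l)
```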


We call the resulting algorithm CONESTA (short for COntinuation with NEsterov smoothing in a Shrinkage-Thresholding Algorithm). It is presented in detail, with convergence proofs, in [25].

Let K be the total number of FISTA loops used in CONESTA; then we have experimentally verified that the convergence rate to the solution of Eq. (16) is O(1/K²) (which is the optimal convergence rate for first-order methods). Also, the algorithm works even if some of the weights λ1 or λ are zero, which thus allows us to solve, e.g., the elastic net using CONESTA. Note that it has been rigorously proved that the continuation technique improves the convergence rate compared to simple smoothing using a single value of μ. Indeed, it has been demonstrated in [6] (see also [50]) that the convergence rate obtained with a single value of μ, even optimised, is O(1/K²) + O(1/K). However, it has recently been proved in [25] that the CONESTA algorithm achieves O(1/K) for general convex functions.

We note that CONESTA could easily be adapted to many other penalties. For example, to add the group lasso (GL) constraint to our structure, we just have to design a specific linear operator A_GL and concatenate it to the actual linear operator A.

Algorithm 2 CONESTA(X^T u, ε)
1: Initialize v^0 ∈ R^P
2: ε^0 = τ · GAP_{μ=10^-8}(v^0)
3: μ^0 = μ_opt(ε^0)
4: repeat
5:   ε^i_μ = ε^i - μ^i γ M
6:   v^(i+1) = FISTA(X^T u, v^i, ε^i_μ, μ^i, A, λ, L(∇g))
7:   ε^i = GAP_{μ=μ^i}(v^(i+1)) + μ^i γ M
8:   ε^(i+1) = τ · ε^i
9:   μ^(i+1) = μ_opt(ε^(i+1))
10: until ε^i <= ε
11: return v^(i+1)
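The continuation skeleton of Algorithm 2 can be sketched as follows. This is a simplified illustration: the μγM correction terms of Lines 5 and 7 are omitted, and the gap, inner solver, and μ schedule are injected as callables; the toy instantiation below (a quadratic with a known minimum, a stand-in μ schedule) is ours, not the paper's problem.

```python
import numpy as np

def conesta(fista_solve, gap, mu_opt, v0, eps=1e-6, tau=0.5):
    """Continuation loop: run the inner solver at smoothing mu_i until the
    current prescribed precision eps_i is met, then geometrically decrease
    the precision (tau = 0.5) and recompute mu_i, until eps_i <= eps."""
    v = v0
    eps_i = tau * gap(v, 1e-8)           # Line 2: initial gap estimate
    while eps_i > eps:                   # Line 10: global stopping test
        mu_i = mu_opt(eps_i)             # Lines 3 / 9
        v = fista_solve(v, mu_i, eps_i)  # Line 6: inner FISTA call
        eps_i = tau * gap(v, mu_i)       # Lines 7-8: next precision
    return v

# toy instantiation: f(v) = 0.5*||v - b||^2, so gap(v) = f(v) - f(b) = f(v)
b = np.array([1.0, -2.0])
f = lambda v: 0.5 * np.sum((v - b) ** 2)
gap = lambda v, mu: f(v)                 # exact gap for this toy problem
mu_opt = lambda eps_i: np.sqrt(eps_i)    # stand-in smoothing schedule
def fista_solve(v, mu_i, eps_i):         # gradient steps until gap <= eps_i
    while f(v) > eps_i:
        v = v - 0.5 * (v - b)
    return v

v_hat = conesta(fista_solve, gap, mu_opt, np.zeros(2), eps=1e-6)
```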

F. The algorithm for the SPCA-TV problem

The computation of a single component through SPCA-TV can be achieved by combining CONESTA and Eq. (5) within an alternating minimization loop. Mackey [34] demonstrated that further components can be efficiently obtained by incorporating this single-unit procedure in a deflation scheme, as done in, e.g., [14], [31]. The stopping criterion is defined as

STOPPING CRITERION = ( ‖X_k - u^(i+1) v^(i+1)T‖_F - ‖X_k - u^i v^iT‖_F ) / ‖X_k - u^(i+1) v^(i+1)T‖_F.   (26)
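A direct transcription of the criterion, with one caveat: Eq. (26) is printed without an absolute value, so we take |·| here (our choice) so that the criterion stays nonnegative even if an iterate transiently increases the error.

```python
import numpy as np

def stopping_criterion(X_k, u_new, v_new, u_old, v_old):
    """Relative change of the rank-one reconstruction error between two
    alternating-minimization iterates, as in Eq. (26)."""
    err_new = np.linalg.norm(X_k - np.outer(u_new, v_new), "fro")
    err_old = np.linalg.norm(X_k - np.outer(u_old, v_old), "fro")
    return abs(err_new - err_old) / err_new
```

The alternating loop terminates when this relative change drops below the user-supplied ε.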

All the presented building blocks were combined into Algorithm 3 to solve the SPCA-TV problem.

III EXPERIMENTS

We evaluated the performance of SPCA-TV using three experiments: one simulation study, carried out on a synthetic

Algorithm 3 SPCA-TV(X, ε)
1: X^0 = X
2: for all k = 0, ..., K do                              ▷ Components
3:   Initialize u^0 ∈ R^N
4:   repeat                                              ▷ Alternating minimization
5:     v^(i+1) = CONESTA(X_k^T u^i, ε)
6:     u^(i+1) = X_k v^(i+1) / ‖X_k v^(i+1)‖₂
7:   until STOPPING CRITERION <= ε
8:   v^(k+1) = v^(i+1)
9:   u^(k+1) = u^(i+1)
10:  X^(k+1) = X^k - u^(k+1) v^(k+1)T                    ▷ Deflation
11: end for
12: return U = [u^1, ..., u^K], V = [v^1, ..., v^K]
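The outer structure of Algorithm 3 (alternate u/v updates, then deflate) can be sketched as below. The paper solves the v-update with CONESTA; here a soft-thresholded least-squares step stands in, so this sketch reduces to (sparse) power iteration with rank-one deflation. Function name, initialization, and toy data are ours.

```python
import numpy as np

def sparse_pca_deflation(X, n_components=2, n_iter=50, soft=0.0):
    """Sketch of Algorithm 3: alternate u and v updates for one component,
    deflate X by the fitted rank-one term (Line 10), and repeat."""
    X_k = X.astype(float).copy()
    U, V = [], []
    for _ in range(n_components):
        u = X_k @ np.ones(X_k.shape[1])          # crude initialization of u
        u /= np.linalg.norm(u) + 1e-12
        for _ in range(n_iter):
            v = X_k.T @ u                        # v-update (CONESTA in the paper)
            v = np.sign(v) * np.maximum(np.abs(v) - soft, 0.0)
            if np.linalg.norm(v) == 0.0:
                break
            u = X_k @ v                          # u-update, Line 6
            u /= np.linalg.norm(u) + 1e-12
        U.append(u)
        V.append(v)
        X_k = X_k - np.outer(u, v)               # deflation, Line 10
    return np.column_stack(U), np.column_stack(V)

X = np.diag([3.0, 1.0, 0.5])                     # toy data with a clear spectrum
U, V = sparse_pca_deflation(X, n_components=2)   # captures the top-2 directions
```

With soft = 0 this recovers the two leading rank-one terms; a positive soft threshold zeroes out small entries of v, mimicking the sparsity of the l1 penalty.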

data set, and two on neuroimaging data sets. In order to compare the performance of SPCA-TV with existing sparse PCA models, we also included results obtained with Sparse PCA, ElasticNet PCA, GraphNet PCA and the SSPCA method from [29]. We used the scikit-learn implementation [42] for Sparse PCA, while we used the Parsimony package (https://github.com/neurospin/pylearn-parsimony) for the ElasticNet PCA, GraphNet PCA and SPCA-TV methods. Concerning SSPCA, we used the MATLAB implementation provided in [29].

The number of parameters to set differs between methods. For Sparse PCA, the λ1 parameter selects its optimal value from the range {0.1, 1.0, 5.0, 10.0}. ElasticNet PCA requires setting the λ1 and λ2 penalty weights. Meanwhile, GraphNet PCA and SPCA-TV require the setting of an additional parameter, namely the spatial constraint penalty λ. We re-parametrized these penalty weights as ratios: a global parameter α ∈ {0.01, 0.1, 1.0} controls the weight attributed to the whole penalty term, including the spatial and the ℓ1 regularization, while individual constraints are expressed in terms of ratios: the ℓ1 ratio λ1/(λ1 + λ2 + λ) ∈ {0.1, 0.5, 0.8} and the ℓTV (or ℓGN for GraphNet) ratio λ/(λ1 + λ2 + λ) ∈ {0.1, 0.5, 0.8}. For ElasticNet, we explore the grid of parameters composed of the Cartesian product of the α and ℓ1-ratio subsets. For GraphNet PCA and SPCA-TV, we perform a parameter search on a grid given by the Cartesian product of, respectively, the (α, ℓ1, ℓGN) subsets and the (α, ℓ1, ℓTV) subsets. Concerning the SSPCA method, the regularization parameter selects its optimal value in the range [10⁻⁸, 10⁸].
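The re-parametrization described above can be made explicit with a small helper (a hypothetical function of mine, not part of the Parsimony API) that maps the global weight α and the two ratios back to the individual penalty weights:

```python
def penalties_from_ratios(alpha, l1_ratio, tv_ratio):
    """Map (alpha, l1_ratio, tv_ratio) to (lambda1, lambda2, lambda_tv), where
    l1_ratio = lambda1 / (lambda1 + lambda2 + lambda_tv) and
    tv_ratio = lambda_tv / (lambda1 + lambda2 + lambda_tv),
    with alpha = lambda1 + lambda2 + lambda_tv the total penalty weight."""
    assert 0.0 <= l1_ratio + tv_ratio <= 1.0
    lam1 = alpha * l1_ratio
    lam_tv = alpha * tv_ratio
    lam2 = alpha * (1.0 - l1_ratio - tv_ratio)  # l2 gets the remaining mass
    return lam1, lam2, lam_tv
```

The grid search then iterates over the Cartesian product of the α and ratio subsets, converting each triplet into concrete penalty weights.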

However, in order to ensure that the extracted components have a minimum amount of sparsity, we also included a criterion controlling sparsity: at least half of the features of the components have to be zero. For both real neuroimaging experiments, performance was evaluated through a 5-fold × 5-fold double cross-validation pipeline. The double cross-validation process consists of two nested cross-validation loops, referred to as the internal and external cross-validation loops. In the outer (external) loop, all samples are randomly split into subsets, referred to as training and test sets. The test sets are exclusively used for model assessment, while the train sets are used in the inner (internal) loop for model fitting

0278-0062 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2017.2749140, IEEE Transactions on Medical Imaging.


and selection. The inner folds select the set of parameters minimizing the reconstruction error on the outer fold. For the synthetic data, we used 50 different purposely-generated data sets and 5 inner folds for parameter selection. In order to evaluate the reconstruction accuracy of the methods, we report the mean Frobenius norm of the reconstruction error across the folds/data sets, on independent test data. The hypothesis we wanted to test was whether there was a substantial decrease in the reconstruction error of independent data when using SPCA-TV compared to when using Sparse PCA, ElasticNet PCA, GraphNet PCA or SSPCA. It was tested through a paired two-sample t-test. This choice to compare the methods' performance on independent test data was motivated by the fact that the optimal reconstruction of the training set is necessarily hindered by the spatial and sparsity constraints. We therefore expect SPCA-TV to perform worse on train data than other, less constrained methods. However, the TV penalty has a more important purpose than just minimizing the reconstruction error: the estimation of coherent and reproducible loadings. Indeed, clinicians expect that if images from other patients with comparable clinical conditions had been used, the extracted loading vectors would have turned out to be similar. Therefore, since the ultimate goal of SPCA-TV is to yield stable and reproducible weight maps, it is more relevant to evaluate the methods on independent test data.
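The 5 × 5 double cross-validation described above can be sketched generically. The skeleton below is numpy-only and framework-agnostic; `fit_score` is a hypothetical stand-in for "fit the model on the train split with these parameters and return the reconstruction error on the other split" (in the paper this would wrap a Parsimony SPCA-TV fit).

```python
import numpy as np

def kfold_indices(n_samples, n_folds, rng):
    """Random K-fold split: yields (train_idx, test_idx) pairs."""
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    for i in range(n_folds):
        yield np.concatenate(folds[:i] + folds[i + 1:]), folds[i]

def double_cv(X, grid, fit_score, n_outer=5, n_inner=5, seed=0):
    """Double (nested) cross-validation: the inner loop picks the parameter
    with the lowest mean inner-test score; the outer test folds are used
    only for the final assessment."""
    rng = np.random.default_rng(seed)
    outer_scores = []
    for tr, te in kfold_indices(len(X), n_outer, rng):
        inner_folds = list(kfold_indices(len(tr), n_inner, rng))
        inner_means = []
        for param in grid:
            scores = [fit_score(X[tr][itr], X[tr][ite], param)
                      for itr, ite in inner_folds]
            inner_means.append(np.mean(scores))
        best = grid[int(np.argmin(inner_means))]       # model selection
        outer_scores.append(fit_score(X[tr], X[te], best))  # model assessment
    return float(np.mean(outer_scores))
```

The key property is that the outer test samples never influence parameter selection, which happens entirely inside the inner loop.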

The stability of the loading vectors obtained across various training data sets (variation in the learning samples) was assessed through a similarity measure: the pairwise Dice index between loading vectors obtained with different folds/data sets [16]. We tested whether the pairwise Dice indices are significantly higher for SPCA-TV than for the other methods. Testing this hypothesis is equivalent to testing the sign of the difference of pairwise Dice indices between methods. However, since the pairwise Dice indices are not independent of one another (the folds share many of their learning samples), direct significance measures are biased. We therefore used permutation testing to estimate empirical p-values. The null hypothesis was tested by simulating samples from the null distribution: we generated 1,000 random permutations of the sign of the difference of pairwise Dice indices between the PCA methods under comparison, and the statistics on the true data were then compared with the ones obtained on the reshuffled data to obtain empirical p-values.
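Both ingredients of this stability analysis are simple to state in code. The sketch below makes two assumptions of mine: the Dice index is computed on the supports (non-zero patterns) of the loading vectors, and the permutation test is a one-sided sign-flip test on the paired differences of Dice indices.

```python
import numpy as np

def dice_index(v1, v2):
    """Dice similarity between the supports (non-zero patterns) of two loadings."""
    s1, s2 = v1 != 0, v2 != 0
    denom = s1.sum() + s2.sum()
    return 2.0 * np.logical_and(s1, s2).sum() / denom if denom else 1.0

def sign_permutation_pvalue(diffs, n_perm=1000, seed=0):
    """One-sided sign-flip permutation test for mean(diffs) > 0.  Used because
    the paired Dice differences are not independent, so a parametric test
    on them would be biased."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)    # statistic under random sign flips
    return float((null >= observed).mean())
```

A small empirical p-value indicates that the observed Dice advantage of one method over another is unlikely under random sign flips of the paired differences.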

For each experiment, we made the initial choice to retrieve the first ten components. However, given the length constraint, we only present the weight maps associated with the top three components for Sparse PCA and SPCA-TV in this paper. The weight maps of ElasticNet PCA, GraphNet PCA and SSPCA are presented in the supplementary materials (available in the supplementary files/multimedia tab).

A. Simulation study

We generated 50 sets of synthetic data, each composed of 500 images of size 100 × 100 pixels. Images are generated using the following noisy linear system:

u1·V1 + u2·V2 + u3·V3 + ε ∈ R^10,000        (27)

where V = [V1, V2, V3] ∈ R^(10,000×3) are sparse and structured loading vectors, illustrated in Fig. 1. The support of V1 defines the two upper dots, the support of V2 defines the two lower dots, while V3's support delineates the middle dot. The coefficients u = [u1, u2, u3] that linearly combine the components of V are generated according to a centered Gaussian distribution. The elements of the noise vector ε are independent and identically distributed according to a centered Gaussian distribution, with a 0.1 signal-to-noise ratio (SNR). This SNR was selected by a previous calibration pipeline, where we tested the efficiency of data reconstruction at multiple SNR values ranging from 0 to 0.5. We decided to work with a 0.1 SNR because it is located in the range of values where standard PCA starts being less efficient in the recovery process.
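A generator following Eq. (27) can be sketched as below. The dot positions and radii are illustrative guesses of mine; the paper's exact layout is shown in Fig. 1, and the actual generation script is distributed at the FTP address given in the supplementary material.

```python
import numpy as np

def disk_support(shape, center, radius):
    """Boolean mask of a filled disk inside an image of the given shape."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    return ((yy - center[0]) ** 2 + (xx - center[1]) ** 2) <= radius ** 2

def make_synthetic(n_images=500, shape=(100, 100), snr=0.1, seed=0):
    """Images = u1*V1 + u2*V2 + u3*V3 + noise, as in Eq. (27).
    Dot centers/radii are illustrative, not the paper's exact layout."""
    rng = np.random.default_rng(seed)
    p = shape[0] * shape[1]
    V = np.zeros((p, 3))
    # V1: two upper dots, V2: two lower dots, V3: one middle dot.
    V[:, 0] = (disk_support(shape, (20, 30), 8) | disk_support(shape, (20, 70), 8)).ravel()
    V[:, 1] = (disk_support(shape, (80, 30), 8) | disk_support(shape, (80, 70), 8)).ravel()
    V[:, 2] = disk_support(shape, (50, 50), 8).ravel()
    U = rng.standard_normal((n_images, 3))          # centered Gaussian coefficients
    signal = U @ V.T
    noise = rng.standard_normal((n_images, p))
    # Scale the noise so that ||signal|| / ||noise|| equals the requested SNR.
    noise *= np.linalg.norm(signal) / (snr * np.linalg.norm(noise))
    return signal + noise, V
```

At SNR = 0.1 the noise dominates by an order of magnitude, which is exactly the regime where the structured penalty is expected to help the recovery.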

Fig. 1: Loading vectors V = [V1, V2, V3] ∈ R^(10,000×3) used to generate the images (components 1, 2 and 3).

We split the 500 artificial images into a test and a training set, with 250 images in each set, and learned the decomposition on the training set.

Fig. 2: Loading vectors (components 1, 2 and 3) recovered from 250 images using Sparse PCA and SPCA-TV.

Fig. 2 presents the loading vectors extracted from one data set. Please note that the sign is arbitrary: considering the loss of Eq. (3), uᵀ and v can both be multiplied by −1


TABLE I: Scores averaged across the 50 independent data sets. We tested whether the scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: ***: p ≤ 10⁻³.

Methods         | Test Data Reconstruction Error | MSE     | Dice Index
Sparse PCA      | 1576.0***                      | 0.91*** | 0.28***
ElasticNet PCA  | 1572.4***                      | 0.83*** | 0.43***
GraphNet PCA    | 1570.8***                      | 0.83*** | 0.30***
SSPCA           | 1571.9***                      | 1.54*** | 0.07***
SPCA-TV         | 1570.1                         | 0.64    | 0.52

without changing anything. We observe that Sparse PCA yields very scattered loading vectors. The loading vectors of SPCA-TV, on the other hand, are sparse but also organized in clear regions: SPCA-TV provides loading vectors that closely match the ground truth. The reconstruction error is evaluated on the test sets (Tab. I), with its value over the 50 data sets being significantly lower for SPCA-TV than for the Sparse PCA (T = 9.45, p = 3.9·10⁻⁵⁷), ElasticNet PCA (T = 33.2, p = 2.7·10⁻³⁵), GraphNet PCA (T = 12.7, p = 3.6·10⁻¹⁷) and SSPCA [29] (T = 18.9, p = 3.9·10⁻²⁴) methods. Additional details concerning the reconstruction accuracy on both the train and test data are presented in Figure 1 of the supplementary materials (available in the supplementary files/multimedia tab).

A different way of quantifying the reconstruction accuracy of each method is to evaluate how closely the extracted loadings match the known ground truth of the simulated data set. We computed the mean squared error (MSE) between the ground truth and the estimated loadings. The results are presented in Tab. I. We note that the MSE is significantly lower with SPCA-TV than with Sparse PCA (T = 6.9, p = 8.0·10⁻⁹), ElasticNet PCA (T = 6.2, p = 1.1·10⁻⁷), GraphNet PCA (T = 4.1, p = 1.4·10⁻⁴) and SSPCA (T = 22.6, p = 1.5·10⁻²⁷).
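Because the sign of each (u, v) pair is arbitrary (as noted for Fig. 2), a ground-truth comparison should align signs before computing the MSE. A minimal helper (my own convention, not taken from the paper's code):

```python
import numpy as np

def mse_up_to_sign(v_true, v_est):
    """MSE between a true and an estimated loading vector, after flipping the
    estimate's sign to best match the ground truth (the sign is arbitrary)."""
    if np.dot(v_true, v_est) < 0:
        v_est = -v_est
    return float(np.mean((v_true - v_est) ** 2))
```

Without this alignment, a perfectly recovered loading with a flipped sign would report a large, spurious error.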

Moreover, when evaluating the stability of the loading vectors across resampling, we found a significantly higher mean Dice index when using SPCA-TV compared to the other methods (p < 0.001). The results are presented in Tab. I. They indicate that SPCA-TV is more robust to variation in the learning samples than the other sparse methods: SPCA-TV yields reproducible loading vectors across data sets.

These results indicate that the SPCA-TV loadings are not only more stable across resampling, but also achieve a better recovery of the underlying variability in independent data than the Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA methods.

One of the issues linked to biconvex optimization is the risk of falling into local minima. Conscious of this potential risk, we set up an experiment in which we ran the optimization of the same problem 50 times, with a different starting point at each run. We then compared the loading vectors obtained at each run and computed a similarity measure, the Dice index, which quantifies the proximity between the solutions of the independent runs. We obtained a Dice index of 0.99 on the 1st component, 0.99 on the 2nd component and 0.72 on the 3rd component. On the strength of these indices, we are confident in the algorithm's robustness and its ability to converge toward the same stable solution independently of the choice of the starting point.

B. 3D images of functional MRI of patients with schizophrenia

We then applied the methods to 3D images of BOLD functional MRI (fMRI), acquired with the same scanner and pulse sequence. Imaging was performed on a 1.5 T scanner using a standard head coil. For all functional scans, the field of view was 206 × 206 × 153 mm, with a resolution close to 3.5 mm in all directions. The parameters of the PRESTO sequence were: TE = 9.6 ms, TR = 19.25 ms, EPI factor = 15, flip angle = 9°. Each fMRI run consisted of 900 collected volumes. The cohort is composed of 23 patients with schizophrenia (average age = 34.96 years; 8 females / 15 males). Brain activation was measured while subjects experienced multimodal hallucinations. The fMRI data was pre-processed using SPM12 (Wellcome Department of Imaging Neuroscience, London, UK). Data preprocessing consisted of motion correction (realignment), coregistration of the individual anatomical T1 image to the functional images, and spatial normalization to MNI space using DARTEL, based on segmented T1 scans.

We considered each set of consecutive images under a pre-hallucination state as a block. Since most of the patients hallucinate more than once during the scanning session, we have more blocks than patients (83 blocks). The activation maps are computed from these blocks. Based on the general linear model approach, we regressed, for each block, the fMRI signal time course on a linear ramp function. Indeed, we hypothesized that activation in some regions presents a ramp-like increase during the time preceding the onset of hallucinations (see an example of regression in Figure 3 of the supplementary materials, available in the supplementary files/multimedia tab). The activation maps that we used as input to the SPCA-TV method are the statistical parametric maps associated with the coefficients of the block regression (see an example in Figure 4 of the supplementary materials, available in the supplementary files/multimedia tab). We obtained a data set of n = 83 maps and p = 63,966 features. We hypothesized that the principal components extracted with SPCA-TV from these activation maps could uncover major trends of variability within pre-hallucination patterns. Thus, they might reveal the existence of subgroups of patients according to the sensory modality (e.g., vision or audition) involved during hallucinations.

We applied all PCA methods under study to this data set, except SSPCA. Indeed, the SSPCA method [29] could not be applied to this specific example, since data sets have to be constituted of closed cubic forms, without any holes, to be eligible for SSPCA application. It does not support masked data such as the ones used here.

The loading vectors extracted from the activation maps of pre-hallucination scans with Sparse PCA and SPCA-TV are presented in Fig. 3. We observe a similar behavior as in the synthetic example, namely that the loading vectors of


Sparse PCA tend to be scattered and produce irregular patterns. However, SPCA-TV seems to yield structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCA-TV loading vectors are not redundant and reveal different patterns.

Indeed, the loading vectors obtained by SPCA-TV are of great interest because they reveal insightful patterns of variability in the data: the second loading is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also of vision-processing areas such as the occipital fusiform gyrus and the parietal operculum cortex. The third loading reveals important weights in the middle temporal gyrus, the parietal operculum cortex and the frontal pole. The first loading vector encompasses all features of the brain. One might see this first component as a global variability affecting the whole brain, such as the overarching effect of age; SPCA-TV selects this dense configuration in spite of the sparsity constraint. It is highly desirable to remove any sort of global effect first, in order to start identifying, in the next components, local patterns that are not impacted by a global variability of this kind. We can identify a widespread set of dysfunctional language-related or vision-related areas that present increasing activity during the time preceding the onset of hallucinations. The regions extracted by SPCA-TV are found to be pertinent according to the existing literature on the topic ([28], [27], [7]).

These results seem to indicate the possible existence of subgroups of patients according to the hallucination modalities involved. An interesting application would be to use the score of the second component extracted by SPCA-TV in order to distinguish patients with visual hallucinations from those suffering mainly from auditory hallucinations.

The reconstruction error is significantly lower with SPCA-TV than with Sparse PCA (T = 13.9, p = 1.5·10⁻⁴), ElasticNet PCA (T = 7.1, p = 2.1·10⁻³) and GraphNet PCA (T = 4.6, p = 1.0·10⁻²). Moreover, when assessing the stability of the loading vectors across the folds, we found a significantly higher mean Dice index for SPCA-TV compared to Sparse PCA (p = 4.0·10⁻³), ElasticNet PCA (p = 4.0·10⁻³) and GraphNet PCA (p = 2.0·10⁻³), as presented in Tab. II. Additional details regarding the reconstruction accuracy on both the train and test sets and the Dice index are presented in Figure 5 of the supplementary materials (available in the supplementary files/multimedia tab).

TABLE II: Scores of the fMRI data, averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notations: ***: p ≤ 10⁻³, **: p ≤ 10⁻².

Methods         | Test Data Reconstruction Error | Dice Index
Sparse PCA      | 1515.2***                      | 0.34**
ElasticNet PCA  | 1482.7**                       | 0.32**
GraphNet PCA    | 1428.1**                       | 0.58**
SPCA-TV         | 1414.0                         | 0.63

In conclusion, SPCA-TV significantly outperforms Sparse, ElasticNet and GraphNet PCA in terms of the reconstruction error on independent test data, and in the sense that its loading vectors are both more clinically interpretable and more stable.

We also evaluated the convergence speed of Sparse PCA, Mini-Batch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA, GraphNet PCA and SPCA-TV for this functional MRI data set of n = 83 samples and p = 63,966 features. We compared the execution time required for each algorithm to achieve a given level of precision in Tab. III. Sparse PCA and ElasticNet PCA are similar in terms of convergence time, while Mini-Batch Sparse PCA is much faster but does not converge to high precision. As expected, the structured methods (GraphNet PCA and SPCA-TV) take longer than the other sparse methods because of the inclusion

Fig. 3: Loading vectors (components 1, 2 and 3) recovered from the 83 activation maps using Sparse PCA and SPCA-TV.


TABLE III: Comparison of the execution time required for each sparse method to reach the same precision. Times reported in seconds.

Methods               | 10    | 1      | 10⁻¹   | 10⁻²    | 10⁻³
Mini-batch Sparse PCA | 53.2  | –      | –      | –       | –
Sparse PCA            | 158.0 | 231.2  | 344.3  | 386.8   | 450.1
ElasticNet PCA        | 123.7 | 138.1  | 302.7  | 396.4   | 406.3
GraphNet PCA          | 301.9 | 521.6  | 813.1  | 881.4   | 888.4
SPCA-TV               | 427.7 | 2958.6 | 8093.0 | 13813.4 | 14459.9

of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but the convergence time is still reasonable for an fMRI data set with 65,000 voxels.

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients from the ADNI database with a diagnosis of mild cognitive impairment (MCI) who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy, and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13] and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).

We applied all PCA methods under study to this data set, except SSPCA. Indeed, we could not apply the SSPCA method to this data set due to an intrinsic limitation of the method: SSPCA's application is restricted to N-dimensional array images. It does not support meshes of cortical surfaces, such as the data set used here.

The loading vectors obtained from the data set with Sparse PCA and SPCA-TV are presented in Fig. 4. As expected, the Sparse PCA loadings are not easily interpretable, because the patterns are irregular and dispersed throughout the brain

surface. In contrast, SPCA-TV reveals structured and smooth clusters in relevant regions.

The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between patients resulting from a global cortical atrophy, as often observed in AD patients. The second loading vector includes variability in the entorhinal cortex, the hippocampus and temporal regions. Last, the third loading vector might be related to the atrophy of the frontal lobe, and also captures variability in the precuneus. Thus, SPCA-TV provides a smooth map that closely matches the well-known brain regions involved in Alzheimer's disease [22].

Indeed, it is well documented that cortical atrophy progresses over three main stages in Alzheimer's disease ([10], [15]). The cortical structures are sequentially affected because of the accumulation of amyloid plaques. Cortical atrophy is first observed in the mild stage of the disease, in regions surrounding the hippocampus ([26], [44], [47]) and the entorhinal cortex ([12]), as seen in the second component. This is consistent with early memory deficits. The disease then progresses to a moderate stage, where atrophy gradually extends to the prefrontal association cortex, as revealed in the third component ([37]). In the severe stage of the disease, the whole cortex is affected by atrophy ([15]), as revealed in the first component. In order to assess the clinical significance of these weight maps, we tested the correlation between the scores corresponding to the three components and performance on a clinical test, ADAS. The Alzheimer's Disease Assessment Scale-Cognitive subscale is the most widely used general cognitive measure in AD; ADAS is scored in terms of errors, so a high score indicates poor performance. We obtained significant correlations between ADAS test performance and the components' scores (Fig. 5): r = −0.34, p = 4.2·10⁻¹¹ for the first component; r = −0.26, p = 3.6·10⁻⁷ for the second component; and r = −0.35, p = 4.0·10⁻¹² for the third component. The same behavior is observable for all three components: the ADAS score grows proportionately to the level to which a patient is affected and to the severity of the atrophy they present (in the temporal pole, the prefrontal region, and also globally). Conversely, control subjects score low on the ADAS metric and present a low level of cortical atrophy. Therefore, SPCA-TV provides us with clear biomarkers that are perfectly relevant to the scope of Alzheimer's disease progression.
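The score-vs-ADAS analysis reduces to projecting each subject's data onto a loading vector and correlating the resulting scores with the clinical measure. A numpy-only sketch (helper names are mine; the paper's reported p-values additionally require a significance test on r, omitted here):

```python
import numpy as np

def component_scores(X, v):
    """Subjects' scores on a component: projection of the centered data on
    the loading vector v (one row of X per subject)."""
    Xc = X - X.mean(axis=0)
    return Xc @ v

def pearson_r(a, b):
    # Pearson correlation coefficient between two score/measure vectors.
    return float(np.corrcoef(a, b)[0, 1])
```

A strongly negative r, as reported above, means that subjects with lower component scores tend to have higher (worse) ADAS error counts.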

The reconstruction error is significantly lower with SPCA-TV than with Sparse PCA (T = 12.7, p = 2.1·10⁻⁴), ElasticNet PCA (T = 6.8, p = 2.3·10⁻³) and GraphNet PCA (T = 2.83, p = 4.7·10⁻²). The results are presented in Tab. IV. Moreover, when assessing the stability of the loading vectors across the folds, the mean Dice index is significantly higher with SPCA-TV than with the other methods. Additional details regarding the reconstruction accuracy on both the train and test sets and the Dice index are presented in Figure 7 of the supplementary materials (available in the supplementary files/multimedia tab).

IV. CONCLUSION

We proposed an extension of Sparse PCA that takes into account the spatial structure of the data. The optimization


TABLE IV: Scores averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly lower than the scores obtained with SPCA-TV. Significance notations: ***: p ≤ 10⁻³, **: p ≤ 10⁻², *: p ≤ 10⁻¹.

Methods         | Test Data Reconstruction Error | Dice Index
Sparse PCA      | 2991.8***                      | 0.44
ElasticNet PCA  | 2832.6**                       | 0.43
GraphNet PCA    | 2813.6*                        | 0.62
SPCA-TV         | 2795.0                         | 0.65

scheme is able to minimize any combination of the ℓ1, ℓ2 and TV penalties, while preserving the exact ℓ1 penalty. We observe that SPCA-TV, in contrast to other existing sparse PCA methods, yields clinically interpretable results and reveals major sources of variability in data, by highlighting structured clusters of interest in the loading vectors. Furthermore, SPCA-TV's loading vectors were more stable across the learning samples compared to other methods. SPCA-TV was validated, and its applicability was demonstrated, on three distinct data sets; we may reach the conclusion that SPCA-TV can be used on any kind of structured configuration and is able to uncover structure within the data.

Fig. 4: Loading vectors (components 1, 2 and 3) recovered from the 133 MCI patients using Sparse PCA and SPCA-TV.

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python library:
• Url: https://github.com/neurospin/pylearn-parsimony
• Description: ParsimonY is a Python library for structured and sparse machine learning. ParsimonY is open-source (BSD License) and compliant with the scikit-learn API.

b) Data sets and scripts:
• Url: ftp://ftp.cea.fr/pub/unati/brainomics/papers/pcatv
• Description: This url provides the simulation data set and the Python script used to create Fig. 2 of the paper.

REFERENCES

[1] A. Abraham, E. Dohmatob, B. Thirion, D. Samaras and G. Varoquaux. Extracting brain regions from rest fMRI with total-variation constrained dictionary learning. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013 – 16th International Conference, Nagoya, Japan, 2013, Proceedings, Part II, pages 607–615, 2013.
[2] B. Fischl, M. Sereno and A. Dale. Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195–207, 1999.
[3] A. Bakkour, J.C. Morris and B.C. Dickerson. The cortical signature of prodromal AD: regional thinning predicts mild AD dementia. Neurology, 72:1048–1055, 2009.
[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Fig. 5: Correlation of component scores with ADAS test performance.


[5] A Beck and M Teboulle Fast gradient-based algorithms for con-strained total variation image denoising and deblurring problems IEEETrans Image Processing 18(11)2419ndash2434 2009

[6] A Beck and M Teboulle Smoothing and first order methods A unifiedframework SIAM Journal on Optimization 22(2)557ndash580 2012

[7] L Bentaleb M Beauregard P Liddle and E Stip Cerebral activityassociated with auditory verbal hallucinations a functional magneticresonance imaging case study Journal of psychiatry amp neuroscienceJPN 27(2)110 2002

[8] JM Borwein and AS Lewis Convex Analysis and Nonlinear Opti-mization Theory and Examples CMS Books in Mathematics Springer2006

[9] S Boyd and L Vandenberghe Convex Optimization CambridgeUniversity Press New York NY USA 2004

[10] H Braak and E Braak Neuropathological stageing of alzheimer-relatedchanges Acta Neuropathologica 82(4)239ndash259 1991

[11] K Brodmann Vergleichende lokalisationslehre der grosshirnrinde inihren prinzipien dargestellt auf grund des zellenbaues 1909

[12] VA Cardenas LL Chao C Studholme K Yaffe BL Miller C MadisonST Buckley D Mungas N Schuff and MW Weiner Brain atrophyassociated with baseline and longitudinal measures of cognition Neu-robiology of aging 32(4)572ndash580 2011

[13] Anders Dale Bruce Fischl and Martin I Sereno Cortical surface-based analysis I segmentation and surface reconstruction NeuroImage9(2)179 ndash 194 1999

[14] A drsquoAspremont L El Ghaoui M Jordan and G Lanckriet A DirectFormulation for Sparse PCA Using Semidefinite Programming SIAMReview 49(3)434ndash448 2007

[15] A Delacourte JP David N Sergeant L Buee A Wattez P VermerschF Ghozali C Fallet-Bianco F Pasquier F Lebert et al The biochemicalpathway of neurofibrillary degeneration in aging and alzheimers diseaseNeurology 52(6)1158ndash1158 1999

[16] L Dice Measures of the amount of ecologic association betweenspecies Ecology 26297ndash302 1945

[17] BC Dickerson E Feczko JC Augustinack J Pacheco JC Morrisand B Fischl Differential effects of aging and alzheimerrsquos disease onmedial temporal lobe cortical thickness and surface area Neurobiologyof aging 30432ndash440 2009

[18] E Dohmatob M Eickenberg B Thirion and G Varoquaux Speeding-up model-selection in GraphNet via early-stopping and univariatefeature-screening June 2015

[19] M Dubois F Hadj-Selem T Lofstedt M Perrot C Fischer V Frouinand E Duchesnay Predictive support recovery with TV-Elastic Netpenalties and logistic regression an application to structural MRI InProceedings of the fourth International Workshop on Pattern Recogni-tion in Neuroimaging (PRNI 2014) 2014

[20] H Eavani TD Satterthwaite RE Filipovych RC Gur and C Da-vatzikos Identifying sparse connectivity patterns in the brain usingresting-state fmri Neuroimage 105286ndash299 2015

[21] D Felleman and D Van Essen Distributed hierarchical processing inthe primate cerebral cortex Cerebral cortex (New York NY 1991)1(1)1ndash47 1991

[22] GB Frisoni NC Fox CR Jack P Scheltens and PM Thompson Theclinical use of structural MRI in Alzheimer disease Nat Rev Neurol6(2)67ndash77 2010

[23] L Grosenick B Klingenberg K Katovich B Knutson and J TaylorInterpretable whole-brain prediction analysis with GraphNet NeuroIm-age 72304ndash321 May 2013

[24] R Guo M Ahn H Zhu and the Alzheimers Disease Neuroimag-ing Initiative Spatially weighted principal component analysis for imag-ing classification Journal of Computational and Graphical Statistics24274ndash296 2015

[25] F. Hadj-Selem, T. Lofstedt, V. Frouin, V. Guillemot, and E. Duchesnay. An Iterative Smoothing Algorithm for Regression with Structured Sparsity. arXiv:1605.09658 [stat], 2016.

[26] C.R. Jack, M.M. Shiung, J.L. Gunter, P.C. O'Brien, S.D. Weigand, D.S. Knopman, B.F. Boeve, R.J. Ivnik, G.E. Smith, R.H. Cha, et al. Comparison of different MRI brain atrophy rate measures with clinical disease progression in AD. Neurology, 62(4):591–600, 2004.

[27] R. Jardri, A. Pouchet, D. Pins, and P. Thomas. Cortical activations during auditory verbal hallucinations in schizophrenia: a coordinate-based meta-analysis. American Journal of Psychiatry, 168(1):73–81, 2011.

[28] R. Jardri, P. Thomas, C. Delmaire, P. Delion, and D. Pins. The neurodynamic organization of modality-dependent hallucinations. Cerebral Cortex, pages 1108–1117, 2013.

[29] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[30] I. Jolliffe, N. Trendafilov, and M. Uddin. A Modified Principal Component Technique Based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.

[31] M. Journee, Y. Nesterov, P. Richtarik, and R. Sepulchre. Generalized Power Method for Sparse Principal Component Analysis. J. Mach. Learn. Res., 11:517–553, 2010.

[32] B. Kandel, D. Wolk, J. Gee, and B. Avants. Predicting Cognitive Data from Medical Images Using Sparse Linear Regression. Information Processing in Medical Imaging: proceedings of the conference, 23:86–97, 2013.

[33] M. Li, Y. Liu, F. Chen, and D. Hu. Including signal intensity increases the performance of blind source separation on brain imaging data. IEEE Transactions on Medical Imaging, 34(2):551–563, 2015.

[34] L.W. Mackey. Deflation Methods for Sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017–1024. Curran Associates, Inc., 2009.

[35] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, Ecole normale superieure, Cachan, 2010.

[36] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online Learning for Matrix Factorization and Sparse Coding. J. Mach. Learn. Res., 11:19–60, 2010.

[37] C.R. McDonald, L. McEvoy, L. Gharapetian, C. Fennema-Notestine, D.J. Hagler, D. Holland, A. Koyama, J.B. Brewer, A.M. Dale, Alzheimer's Disease Neuroimaging Initiative, et al. Regional rates of neocortical atrophy from normal aging to early Alzheimer disease. Neurology, 73(6):457–465, 2009.

[38] H. Mohr, U. Wolfensteller, S. Frimmel, and H. Ruge. Sparse regularization techniques provide novel insights into outcome integration processes. NeuroImage, 104:163–176, January 2015.

[39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[40] B. Ng, A. Vahdat, G. Hamarneh, and R. Abugharbieh. Generalized Sparse Classifiers for Decoding Cognitive States in fMRI. Pages 108–115, Beijing, China, September 2012. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-15948-0_14.

[41] R. Nieuwenhuys. The myeloarchitectonic studies on the human cerebral cortex of the Vogt–Vogt school, and their significance for the interpretation of functional neuroimaging data. Brain Structure and Function, 218(2):303–352, 2013.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[43] M. Ramezani, K. Marble, H. Trang, P. Abolmaesumi, and I.S. Johnsrude. Joint sparse representation of brain activity patterns in multi-task fMRI data. IEEE Transactions on Medical Imaging, 34(1):2–12, 2015.

[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al. Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567–574, 2008.

0278-0062 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2017.2749140, IEEE Transactions on Medical Imaging.

[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang, and D. Hu. Making group inferences using sparse representation of resting-state functional MRI data with application to sleep deprivation. Human Brain Mapping, 38(9):4671–4689, 2017.

[46] J.G. Sled, A.P. Zijdenbos, and A.C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17:87–97, 1998.

[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell, and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754–1766, 2004.

[48] W.-T. Wang and H.-C. Huang. Regularized Principal Component Analysis for Spatial Data. ArXiv e-prints, 2015.

[49] D.M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

[50] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E.P. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719–752, 2012.

[51] H. Zou, T. Hastie, and R. Tibshirani. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.


We call the resulting algorithm CONESTA (short for COntinuation with NEsterov smoothing in a Shrinkage-Thresholding Algorithm). It is presented in detail, with convergence proofs, in [25].

Let K be the total number of FISTA loops used in CONESTA; we have experimentally verified that the convergence rate to the solution of Eq. (16) is O(1/K²) (which is the optimal convergence rate for first-order methods). Also, the algorithm works even if some of the weights λ1 or λ are zero, which thus allows us to solve e.g. the elastic net using CONESTA. Note that it has been rigorously proved that the continuation technique improves the convergence rate compared to simple smoothing using a single value of μ. Indeed, it has been demonstrated in [6] (see also [50]) that the convergence rate obtained with a single value of μ, even optimised, is O(1/K²) + O(1/K). However, it has recently been proved in [25] that the CONESTA algorithm achieves O(1/K) for general convex functions.

We note that CONESTA could easily be adapted to many other penalties. For example, to add the group lasso (GL) constraint to our structure, we just have to design a specific linear operator A_GL and concatenate it to the actual linear operator A.
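To make the operator-concatenation idea concrete, here is a small numpy sketch (not the Parsimony implementation; `tv_operator_1d` and `group_operator` are illustrative helpers of ours): a 1-D total-variation operator and a toy group operator are built as matrices and simply stacked to form the combined operator A.

```python
import numpy as np

def tv_operator_1d(p):
    # A_TV in R^{(p-1) x p}: (A_TV v)_j = v_{j+1} - v_j
    A = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    A[idx, idx] = -1.0
    A[idx, idx + 1] = 1.0
    return A

def group_operator(groups, p):
    # A_GL: one identity row per (group, feature) pair, so that the
    # group-lasso norm is recovered as sum_g ||A_g v||_2
    blocks = []
    for g in groups:
        block = np.zeros((len(g), p))
        block[np.arange(len(g)), g] = 1.0
        blocks.append(block)
    return np.vstack(blocks)

p = 6
A_tv = tv_operator_1d(p)
A_gl = group_operator([[0, 1, 2], [3, 4, 5]], p)
A = np.vstack([A_tv, A_gl])          # combined operator handed to the solver
v = np.arange(p, dtype=float)
tv_norm = np.abs(A_tv @ v).sum()     # TV(v) = sum_j |v_{j+1} - v_j|
```

Stacking operators this way leaves the smoothing machinery untouched: each penalty only contributes its own rows to A.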

Algorithm 2 CONESTA(Xᵀu, ε)
1: Initialize v⁰ ∈ R^P
2: ε⁰ = τ · GAP_{μ=10⁻⁸}(v⁰)
3: μ⁰ = μ_opt(ε⁰)
4: repeat
5:   ε^i_μ = ε^i − μ^i γM
6:   v^{i+1} = FISTA(Xᵀu, v^i, ε^i_μ)
7:   ε^i = GAP_{μ=μ^i}(v^{i+1}) + μ^i γM
8:   ε^{i+1} = τ · ε^i
9:   μ^{i+1} = μ_opt(ε^{i+1})
10: until ε^i ≤ ε
11: return v^{i+1}

F. The algorithm for the SPCA-TV problem

The computation of a single component through SPCA-TV can be achieved by combining CONESTA and Eq. (5) within an alternating minimization loop. Mackey [34] demonstrated that further components can be efficiently obtained by incorporating this single-unit procedure in a deflation scheme, as done in e.g. [14], [31]. The stopping criterion is defined as

STOPPINGCRITERION = ( ‖X_k − u^{i+1} (v^{i+1})ᵀ‖_F − ‖X_k − u^i (v^i)ᵀ‖_F ) / ‖X_k − u^{i+1} (v^{i+1})ᵀ‖_F   (26)

All the presented building blocks were combined into Algorithm 3 to solve the SPCA-TV problem.

Algorithm 3 SPCA-TV(X, ε)
1: X₀ = X
2: for k = 0, ..., K do          ▷ Components
3:   Initialize u⁰ ∈ R^N
4:   repeat                      ▷ Alternating minimization
5:     v^{i+1} = CONESTA(X_kᵀ u^i, ε)
6:     u^{i+1} = X_k v^{i+1} / ‖X_k v^{i+1}‖₂
7:   until STOPPINGCRITERION ≤ ε
8:   v^{k+1} = v^{i+1}
9:   u^{k+1} = u^{i+1}
10:  X_{k+1} = X_k − u^{k+1} (v^{k+1})ᵀ    ▷ Deflation
11: end for
12: return U = [u¹, ..., u^K], V = [v¹, ..., v^K]

III. EXPERIMENTS

We evaluated the performance of SPCA-TV using three experiments: one simulation study carried out on a synthetic

data set, and two on neuroimaging data sets. In order to compare the performance of SPCA-TV with existing sparse PCA models, we also included results obtained with Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA from [29]. We used the scikit-learn implementation [42] for Sparse PCA, while we used the Parsimony package (https://github.com/neurospin/pylearn-parsimony) for the ElasticNet, GraphNet PCA and SPCA-TV methods. Concerning SSPCA, we used the MATLAB implementation provided in [29].

The number of parameters to set differs for each method. For Sparse PCA, the λ1 parameter selects its optimal value from the range {0.1, 1.0, 5.0, 10.0}. ElasticNet PCA requires the setting of the λ1 and λ2 penalty weights. Meanwhile, GraphNet PCA and SPCA-TV require the setting of an additional parameter, namely the spatial-constraint penalty λ. We re-parametrized these penalty weights as ratios: a global parameter α ∈ {0.01, 0.1, 1.0} controls the weight attributed to the whole penalty term, including the spatial and the ℓ1 regularization, while individual constraints are expressed in terms of ratios: the ℓ1 ratio λ1/(λ1 + λ2 + λ) ∈ {0.1, 0.5, 0.8} and the ℓTV (or ℓGN for GraphNet) ratio λ/(λ1 + λ2 + λ) ∈ {0.1, 0.5, 0.8}. For ElasticNet, we explore the grid of parameters composed of the Cartesian product of the α and ℓ1-ratio subsets. For GraphNet PCA and SPCA-TV, we perform a parameter search on a grid given by the Cartesian product of, respectively, the (α, ℓ1, ℓGN) and (α, ℓ1, ℓTV) subsets. Concerning the SSPCA method, the regularization parameter selects its optimal value in the range [10⁻⁸, 10⁸].
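The re-parametrization above can be sketched as follows; the helper `ratios_to_lambdas` is hypothetical (not the Parsimony API) and splits a global weight α into λ1, λ2 and λTV, with the ℓ2 term taking the remainder:

```python
from itertools import product

def ratios_to_lambdas(alpha, l1_ratio, tv_ratio):
    # alpha is the total penalty weight; the l1 and TV ratios carve it up
    lam1 = alpha * l1_ratio
    lam_tv = alpha * tv_ratio
    lam2 = alpha * (1.0 - l1_ratio - tv_ratio)   # remainder goes to l2
    return lam1, lam2, lam_tv

# Cartesian-product grid, keeping only splits where the ratios sum to <= 1
grid = [ratios_to_lambdas(a, r1, rtv)
        for a, r1, rtv in product([0.01, 0.1, 1.0],   # alpha
                                  [0.1, 0.5, 0.8],    # l1 ratio
                                  [0.1, 0.5, 0.8])    # TV (or GN) ratio
        if r1 + rtv <= 1.0]
```

By construction λ1 + λ2 + λTV = α for every grid point, so the ratios directly control how the global weight is distributed among the penalties.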

However, in order to ensure that the extracted components have a minimum amount of sparsity, we also included a criterion controlling sparsity: at least half of the features of each component have to be zero. For both real neuroimaging experiments, performance was evaluated through a 5-fold × 5-fold double cross-validation pipeline. The double cross-validation process consists of two nested cross-validation loops, referred to as the internal and external cross-validation loops. In the outer (external) loop, all samples are randomly split into subsets referred to as training and test sets. The test sets are exclusively used for model assessment, while the train sets are used in the inner (internal) loop for model fitting and selection. The inner folds select the set of parameters minimizing the reconstruction error on the outer fold. For the synthetic data, we used 50 different purposely-generated data sets and 5 inner folds for parameter selection. In order to evaluate the reconstruction accuracy of the methods, we report the mean Frobenius norm of the reconstruction error across the folds/data sets on independent test data. The hypothesis we wanted to test was whether there was a substantial decrease in the reconstruction error of independent data when using SPCA-TV compared to Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA. It was tested through a related two-samples t-test. This choice to compare methods' performance on independent test data was motivated by the fact that the optimal reconstruction of the training set is necessarily hindered by the spatial and sparsity constraints; we therefore expect SPCA-TV to perform worse on train data than other, less constrained, methods. However, the TV penalty has a more important purpose than just minimizing the reconstruction error: the estimation of coherent and reproducible loadings. Indeed, clinicians expect that if images from other patients with comparable clinical conditions had been used, the extracted loading vectors would have turned out to be similar. Therefore, since the ultimate goal of SPCA-TV is to yield stable and reproducible weight maps, it is more relevant to evaluate the methods on independent test data.
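The 5×5 double cross-validation can be sketched as follows. The `reconstruction_error` "model" is a deliberately simplified stand-in (a rank-1 SVD fit scaled by a hypothetical shrinkage parameter), since the point here is only the nesting of the two loops:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # shuffled k-fold split: list of (train_idx, test_idx) pairs
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    return [(np.concatenate([f for j, f in enumerate(folds) if j != i]), folds[i])
            for i in range(k)]

def reconstruction_error(X_train, X_test, shrink):
    # toy "model": rank-1 PCA fit on the train set, scaled by a
    # hypothetical shrinkage parameter standing in for the penalty weights
    v = np.linalg.svd(X_train, full_matrices=False)[2][0]
    return np.linalg.norm(X_test - shrink * (X_test @ np.outer(v, v)))

grid = [0.5, 1.0]
X = np.random.default_rng(1).standard_normal((50, 20))
outer_scores = []
for tr, te in kfold_indices(len(X), 5):
    # inner loop: pick the parameter minimizing the inner-fold test error
    inner = {p: np.mean([reconstruction_error(X[tr][itr], X[tr][ite], p)
                         for itr, ite in kfold_indices(len(tr), 5, seed=1)])
             for p in grid}
    best = min(inner, key=inner.get)
    # outer loop: assess the selected model on held-out data only
    outer_scores.append(reconstruction_error(X[tr], X[te], best))
```

The outer test folds never influence parameter selection, which is what makes the reported reconstruction error an unbiased estimate.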

The stability of the loading vectors obtained across various training data sets (variation in the learning samples) was assessed through a similarity measure: the pairwise Dice index between loading vectors obtained with different folds/data sets [16]. We tested whether the pairwise Dice indices are significantly higher in SPCA-TV compared to the other methods. Testing this hypothesis is equivalent to testing the sign of the difference of pairwise Dice indices between methods. However, since the pairwise Dice indices are not independent of one another (the folds share many of their learning samples), direct significance measures are biased. We therefore used permutation testing to estimate empirical p-values. The null hypothesis was tested by simulating samples from the null distribution: we generated 1,000 random permutations of the sign of the difference of pairwise Dice indices between the PCA methods under comparison, and the statistics on the true data were then compared with those obtained on the reshuffled data to obtain empirical p-values.
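A sketch of both stability measures, assuming loadings are compared by the supports of their nonzero entries (the helper names are ours):

```python
import numpy as np
from itertools import combinations

def dice(v1, v2, tol=1e-8):
    # Dice index between the supports (nonzero patterns) of two loadings
    s1, s2 = np.abs(v1) > tol, np.abs(v2) > tol
    denom = s1.sum() + s2.sum()
    return 2.0 * np.logical_and(s1, s2).sum() / denom if denom else 1.0

def pairwise_dice(loadings):
    # all pairwise Dice indices across folds / data sets
    return np.array([dice(a, b) for a, b in combinations(loadings, 2)])

def sign_flip_pvalue(diff, n_perm=1000, seed=0):
    # one-sided sign-flip permutation test on paired Dice differences
    rng = np.random.default_rng(seed)
    obs = diff.mean()
    flips = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    perm_means = (flips * diff).mean(axis=1)
    return (np.sum(perm_means >= obs) + 1) / (n_perm + 1)
```

Flipping signs rather than reshuffling raw indices respects the paired, non-independent structure of the Dice comparisons.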

For each experiment, we made the initial choice to retrieve the first ten components. However, given the length constraint, we only present the weight maps associated with the top three components for Sparse PCA and SPCA-TV in this paper. The weight maps of ElasticNet PCA, GraphNet PCA and SSPCA are presented in the supplementary materials (available in the supplementary files / multimedia tab).

A. Simulation study

We generated 50 sets of synthetic data, each composed of 500 images of size 100 × 100 pixels. Images are generated using the following noisy linear system:

x = u₁V^1 + u₂V^2 + u₃V^3 + ε ∈ R^{10 000},   (27)

where V = [V^1, V^2, V^3] ∈ R^{10 000×3} are sparse and structured loading vectors, illustrated in Fig. 1. The support of V^1 defines the two upper dots, the support of V^2 defines the two lower dots, while V^3's support delineates the middle dot. The coefficients u = [u₁, u₂, u₃] that linearly combine the components of V are generated according to a centered Gaussian distribution. The elements of the noise vector ε are independent and identically distributed according to a centered Gaussian distribution, with a 0.1 signal-to-noise ratio (SNR). This SNR was selected by a previous calibration pipeline, where we tested the efficiency of the data reconstruction at multiple SNR values ranging from 0 to 0.5. We decided to work with an SNR of 0.1 because it is located in the range of values where standard PCA starts being less efficient in the recovery process.
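A hedged reconstruction of this simulation protocol (the dot positions and radii below are illustrative, not the exact shapes of Fig. 1):

```python
import numpy as np

def make_loading(centers, radius=8, shape=(100, 100)):
    # binary disc(s) on a 100x100 grid, flattened to a 10,000-d loading
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    img = np.zeros(shape)
    for cy, cx in centers:
        img[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 1.0
    return img.ravel()

rng = np.random.default_rng(42)
V = np.stack([make_loading([(20, 30), (20, 70)]),   # two upper dots (V^1)
              make_loading([(80, 30), (80, 70)]),   # two lower dots (V^2)
              make_loading([(50, 50)])], axis=1)    # middle dot (V^3)

n = 500
U = rng.standard_normal((n, 3))                     # Gaussian coefficients u
signal = U @ V.T
noise = rng.standard_normal(signal.shape)
snr = 0.1
noise *= np.linalg.norm(signal) / (snr * np.linalg.norm(noise))  # fix the SNR
X = signal + noise                                  # 500 noisy 10,000-px images
```

Rescaling the noise to a fixed ‖signal‖/‖noise‖ ratio is one common way to realize the stated SNR of 0.1.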

Fig. 1. Loading vectors V = [V^1, V^2, V^3] ∈ R^{10 000×3} used to generate the images (panels: component 1, component 2, component 3).

We split the 500 artificial images into a test set and a training set, with 250 images in each set, and learned the decomposition on the training set.

Fig. 2. Loading vectors recovered from 250 images using Sparse PCA and SPCA-TV (components 1–3 shown for each method).

Fig. 2 presents the loading vectors extracted with one data set. Please note that the sign is arbitrary: considering the loss of Eq. (3), uᵀ and v can both be multiplied by −1 without changing anything. We observe that Sparse PCA yields very scattered loading vectors. The loading vectors of SPCA-TV, on the other hand, are sparse but also organized in clear regions: SPCA-TV provides loading vectors that closely match the ground truth. The reconstruction error is evaluated on the test sets (Tab. I), with its value over the 50 data sets being significantly lower in SPCA-TV than in the Sparse PCA (T = 94.5, p = 3.9·10⁻⁵⁷), ElasticNet PCA (T = 33.2, p = 2.7·10⁻³⁵), GraphNet PCA (T = 12.7, p = 3.6·10⁻¹⁷) and SSPCA [29] (T = 18.9, p = 3.9·10⁻²⁴) methods. Additional details concerning the reconstruction accuracy on both the train and test data are presented in Figure 1 of the supplementary materials (available in the supplementary files / multimedia tab).

TABLE I: Scores averaged across the 50 independent data sets. We tested whether the scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: *** p ≤ 10⁻³.

Methods          Test Data Reconstruction Error   MSE       Dice Index
Sparse PCA       1576.0***                        0.91***   0.28***
ElasticNet PCA   1572.4***                        0.83***   0.43***
GraphNet PCA     1570.8***                        0.83***   0.30***
SSPCA            1571.9***                        1.54***   0.07***
SPCA-TV          1570.1                           0.64      0.52

A different way of quantifying the reconstruction accuracy of each method is to evaluate how closely the extracted loadings match the known ground truth of the simulated data set. We computed the mean squared error (MSE) between the ground truth and the estimated loadings; the results are presented in Tab. I. We note that the MSE is significantly lower with SPCA-TV than with Sparse PCA (T = 6.9, p = 8.0·10⁻⁹), ElasticNet PCA (T = 6.2, p = 1.1·10⁻⁷), GraphNet PCA (T = 4.1, p = 1.4·10⁻⁴) and SSPCA (T = 22.6, p = 1.5·10⁻²⁷).
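Because the sign of each component is arbitrary (as noted above for Eq. (3)), a fair MSE against the ground truth should align signs first; a minimal sketch:

```python
import numpy as np

def loading_mse(V_true, V_est):
    # MSE between loadings, taking for each component the best of the
    # two possible signs (PCA components are sign-invariant)
    mses = []
    for j in range(V_true.shape[1]):
        diffs = [np.mean((V_true[:, j] - s * V_est[:, j]) ** 2)
                 for s in (1.0, -1.0)]
        mses.append(min(diffs))
    return float(np.mean(mses))

rng = np.random.default_rng(3)
Vt = rng.standard_normal((100, 3))
err_flip = loading_mse(Vt, -Vt)   # a pure sign flip should not count as error
```

Whether the published MSE values were computed with such an alignment is not stated in the text; this sketch simply shows one defensible convention.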

Moreover, when evaluating the stability of the loading vectors across resampling, we found a significantly higher mean Dice index when using SPCA-TV compared to the other methods (p < 0.001). The results are presented in Tab. I. They indicate that SPCA-TV is more robust to variation in the learning samples than the other sparse methods: SPCA-TV yields reproducible loading vectors across data sets.

These results indicate that the SPCA-TV loadings are not only more stable across resampling, but also achieve a better recovery of the underlying variability in independent data than the Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA methods.

One of the issues linked to biconvex optimization is the risk of falling into local minima. Conscious of this potential risk, we set up an experiment in which we ran the optimization of the same problem 50 times, with a different starting point at each run. We then compared the loading vectors obtained at each run and computed a similarity measure, the Dice index, which quantifies the proximity between the solutions of independent runs with different starting points. We obtained a Dice index of 0.99 on the 1st component, 0.99 on the 2nd component and 0.72 on the 3rd component. On the strength of these indices, we are confident in the algorithm's robustness and its ability to converge toward the same stable solution independently of the choice of the starting point.

B. 3D images of functional MRI of patients with schizophrenia

We then applied the methods to 3D images of BOLD functional MRI (fMRI), acquired with the same scanner and pulse sequence. Imaging was performed on a 1.5 T scanner using a standard head coil. For all functional scans, the field-of-view was 206 × 206 × 153 mm, with a resolution close to 3.5 mm in all directions. The parameters of the PRESTO sequence were: TE = 9.6 ms, TR = 19.25 ms, EPI factor = 15, flip angle = 9°. Each fMRI run consisted of 900 collected volumes. The cohort is composed of 23 patients with schizophrenia (average age = 34.96 years; 8 females / 15 males). Brain activation was measured while subjects experienced multimodal hallucinations. The fMRI data was pre-processed using SPM12 (WELLCOME Department of Imaging Neuroscience, London, UK). Data preprocessing consisted of motion correction (realignment), coregistration of the individual anatomical T1 image to the functional images, and spatial normalization to MNI space using DARTEL based on segmented T1 scans.

We considered each set of consecutive images under a pre-hallucination state as a block. Since most of the patients hallucinate more than once during the scanning session, we have more blocks than patients (83 blocks). The activation maps are computed from these blocks. Based on the general linear model approach, we regressed, for each block, the fMRI signal time course on a linear ramp function. Indeed, we hypothesized that activation in some regions presents a ramp-like increase during the time preceding the onset of hallucinations (see an example of this regression in Figure 3 of the supplementary materials, available in the supplementary files / multimedia tab). The activation maps that we used as input to the SPCA-TV method are the statistical parametric maps associated with the coefficients of the block regression (see one example in Figure 4 of the supplementary materials, available in the supplementary files / multimedia tab). We obtained a data set of n = 83 maps and p = 63,966 features. We hypothesized that the principal components extracted with SPCA-TV from these activation maps could uncover major trends of variability within pre-hallucination patterns. Thus, they might reveal the existence of subgroups of patients according to the sensory modality (e.g. vision or audition) involved during hallucinations.

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, the SSPCA method [29] could not be applied to this specific example, since data sets have to be constituted of closed cubic forms, without any holes, to be eligible for SSPCA: it does not support masked data such as the data used here.

The loading vectors extracted from the activation maps of pre-hallucination scans with Sparse PCA and SPCA-TV are presented in Fig. 3. We observe a behavior similar to that of the synthetic example, namely that the loading vectors of Sparse PCA tend to be scattered and produce irregular patterns. However, SPCA-TV seems to yield structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCA-TV loading vectors are not redundant and reveal different patterns.

Indeed, the loading vectors obtained by SPCA-TV are of great interest because they reveal insightful patterns of variability in the data. The second loading vector is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also of vision-processing-related areas such as the occipital fusiform gyrus and the parietal operculum cortex. The third loading vector reveals important weights in the middle temporal gyrus, the parietal operculum cortex and the frontal pole. The first loading vector encompasses all features of the brain; one might see this first component as a global variability affecting the whole brain, such as the overarching effect of age. SPCA-TV selects this dense configuration in spite of the sparsity constraint. It is highly desirable to remove any sort of global effect first, in order to start identifying, in the next components, local patterns that are not impacted by a global variability of this kind. We can identify a widespread set of dysfunctional language-related or vision-related areas that present increasing activity during the time preceding the onset of hallucinations. The regions extracted by SPCA-TV are found to be pertinent according to the existing literature on the topic ([28], [27], [7]).

These results seem to indicate the possible existence ofsubgroups of patients according to the hallucination modalitiesinvolved An interesting application would be to use the scoreof the second component extracted by SPCA-TV in orderto distinguish patients with visual hallucinations from thosesuffering mainly from auditory hallucinations

The reconstruction error is significantly lower in SPCA-TV than in Sparse PCA (T = 13.9, p = 1.5·10⁻⁴), ElasticNet PCA (T = 7.1, p = 2.1·10⁻³) and GraphNet PCA (T = 4.6, p = 1.0·10⁻²). Moreover, when assessing the stability of the loading vectors across the folds, we found a significantly higher mean Dice index in SPCA-TV compared to Sparse PCA (p = 4.0·10⁻³), ElasticNet PCA (p = 4.0·10⁻³) and GraphNet PCA (p = 2.0·10⁻³), as presented in Tab. II. Additional details regarding the reconstruction accuracy on both the train and test sets, and the Dice index, are presented in Figure 5 of the supplementary materials (available in the supplementary files / multimedia tab).

TABLE II: Scores of the fMRI data averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notations: *** p ≤ 10⁻³, ** p ≤ 10⁻².

Methods          Test Data Reconstruction Error   Dice Index
Sparse PCA       1515.2***                        0.34**
ElasticNet PCA   1482.7**                         0.32**
GraphNet PCA     1428.1**                         0.58**
SPCA-TV          1414.0                           0.63

In conclusion, SPCA-TV significantly outperforms Sparse PCA, ElasticNet PCA and GraphNet PCA in terms of the reconstruction error on independent test data, and in the sense that its loading vectors are both more clinically interpretable and more stable.

We also evaluated the convergence speed of Sparse PCA, Mini-Batch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA, GraphNet PCA and SPCA-TV on this functional MRI data set of n = 83 samples and p = 63,966 features. Tab. III compares the execution time required for each algorithm to reach a given level of precision. Sparse PCA and ElasticNet PCA are similar in terms of convergence time, while Mini-Batch Sparse PCA is much faster but does not converge to high precision. As expected, the structured methods (GraphNet PCA and SPCA-TV) take longer than the other sparse methods because of the inclusion of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but its convergence time is still reasonable for an fMRI data set with 65,000 voxels.

TABLE III: Comparison of the execution time (in seconds) required for each sparse method to reach the same precision.

Methods                 10       1        10⁻¹      10⁻²      10⁻³
Mini-batch Sparse PCA   53.2     -        -         -         -
Sparse PCA              158.0    231.2    344.3     386.8     450.1
ElasticNet PCA          123.7    138.1    302.7     396.4     406.3
GraphNet PCA            301.9    521.6    813.1     881.4     888.4
SPCA-TV                 427.7    2958.6   8093.0    13813.4   14459.9

Fig. 3. Loading vectors recovered from the 83 activation maps using Sparse PCA and SPCA-TV (components 1–3 shown for each method).

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients with a diagnosis of mild cognitive impairment (MCI) from the ADNI database who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy, and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13] and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, we could not apply the SSPCA method to this data set due to intrinsic limitations of the method: SSPCA's application is restricted to N-dimensional array images, and it does not support meshes of cortical surfaces such as the data set used here.

The loading vectors obtained from the data set with Sparse PCA and SPCA-TV are presented in Fig. 4. As expected, the Sparse PCA loadings are not easily interpretable, because the patterns are irregular and dispersed throughout the brain surface. In contrast, SPCA-TV reveals structured and smooth clusters in relevant regions.

The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between patients resulting from a global cortical atrophy, as often observed in AD patients. The second loading vector includes variability in the entorhinal cortex, the hippocampus and temporal regions. Lastly, the third loading vector might be related to the atrophy of the frontal lobe, and also captures variability in the precuneus. Thus, SPCA-TV provides a smooth map that closely matches the well-known brain regions involved in Alzheimer's disease [22].

Indeed, it is well-documented that cortical atrophy progresses over three main stages in Alzheimer's disease ([10], [15]). The cortical structures are sequentially affected because of the accumulation of amyloid plaques. Cortical atrophy is first observed in the mild stage of the disease, in regions surrounding the hippocampus ([26], [44], [47]) and the entorhinal cortex ([12]), as seen in the second component. This is consistent with early memory deficits. Then, the disease progresses to a moderate stage, where atrophy gradually extends to the prefrontal association cortex, as revealed in the third component ([37]). In the severe stage of the disease, the whole cortex is affected by atrophy ([15]), as revealed in the first component. In order to assess the clinical significance of these weight maps, we tested the correlation between the scores corresponding to the three components and performance on a clinical test, ADAS. The Alzheimer's Disease Assessment Scale-Cognitive subscale is the most widely used general cognitive measure in AD; ADAS is scored in terms of errors, so a high score indicates poor performance. We obtained significant correlations between ADAS test performance and the components' scores (Fig. 5): r = −0.34, p = 4.2·10⁻¹¹ for the first component; r = −0.26, p = 3.6·10⁻⁷ for the second component; and r = −0.35, p = 4.0·10⁻¹² for the third component. The same behavior is observable for all three components: the ADAS score grows proportionally to the degree to which a patient is affected and to the severity of the atrophy they present (in the temporal pole, the prefrontal region, and also globally). Conversely, control subjects score low on the ADAS metric and present low levels of cortical atrophy. Therefore, SPCA-TV provides us with clear biomarkers that are perfectly relevant to the scope of Alzheimer's disease progression.

The reconstruction error is significantly lower in SPCA-TV than in Sparse PCA (T = 12.7, p = 2.1 · 10^-4), ElasticNet PCA (T = 6.8, p = 2.3 · 10^-3) and GraphNet PCA (T = 2.83, p = 4.7 · 10^-2). The results are presented in Tab. IV. Moreover, when assessing the stability of the loading vectors across the folds, the mean Dice index is significantly higher in SPCA-TV than in the other methods. Additional details regarding the reconstruction accuracy on both the train and test sets, and the Dice index, are presented in Figure 7 of the supplementary materials (available in the supplementary files/multimedia tab).

IV. CONCLUSION

We proposed an extension of Sparse PCA that takes into account the spatial structure of the data. The optimization

0278-0062 (c) 2017 IEEE Personal use is permitted but republicationredistribution requires IEEE permission See httpwwwieeeorgpublications_standardspublicationsrightsindexhtml for more information

This article has been accepted for publication in a future issue of this journal but has not been fully edited Content may change prior to final publication Citation information DOI 101109TMI20172749140 IEEETransactions on Medical Imaging

IEEE TRANSACTIONS ON MEDICAL IMAGING 11

TABLE IV: Scores are averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notations: ***: p <= 10^-3, **: p <= 10^-2, *: p <= 10^-1.

Methods          Test data reconstruction error   Dice index
Sparse PCA       2991.8                           0.44
ElasticNet PCA   2832.6                           0.43
GraphNet PCA     2813.6                           0.62
SPCA-TV          2795.0                           0.65

scheme is able to minimize any combination of the l1, l2 and TV penalties, while preserving the exact l1 penalty. We observed that SPCA-TV, in contrast to other existing sparse PCA methods, yields clinically interpretable results and reveals major sources of variability in data, by highlighting structured clusters of interest in the loading vectors. Furthermore, SPCA-TV's loading vectors were more stable across the learning samples compared to other methods. SPCA-TV was validated, and its applicability was demonstrated, on three distinct data sets: we may reach the conclusion that SPCA-TV can be used on any kind of structured configuration, and is able to uncover the structure within the data.

Fig. 4: Loading vectors recovered from the 133 MCI patients using Sparse PCA and SPCA-TV.

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python library:
- Url: https://github.com/neurospin/pylearn-parsimony
- Description: ParsimonY is a Python library for structured and sparse machine learning. ParsimonY is open-source (BSD License) and compliant with the scikit-learn API.

b) Data sets and scripts:
- Url: ftp://ftp.cea.fr/pub/unati/brainomics/papers/pcatv
- Description: This URL provides the simulation data set and the Python script used to create Fig. 2 of the paper.

REFERENCES

[1] A. Abraham, E. Dohmatob, B. Thirion, D. Samaras and G. Varoquaux. Extracting brain regions from rest fMRI with total-variation constrained dictionary learning. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2013), 16th International Conference, Nagoya, Japan, Proceedings, Part II, pages 607-615, 2013.

[2] B. Fischl, M. Sereno and A. Dale. Cortical surface-based analysis II: inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195-207, 1999.

[3] A. Bakkour, J.C. Morris and B.C. Dickerson. The cortical signature of prodromal AD: regional thinning predicts mild AD dementia. Neurology, 72:1048-1055, 2009.

[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

Fig. 5: Correlation of component scores with ADAS test performance.




[5] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Processing, 18(11):2419-2434, 2009.

[6] A. Beck and M. Teboulle. Smoothing and first order methods: a unified framework. SIAM Journal on Optimization, 22(2):557-580, 2012.

[7] L. Bentaleb, M. Beauregard, P. Liddle and E. Stip. Cerebral activity associated with auditory verbal hallucinations: a functional magnetic resonance imaging case study. Journal of Psychiatry & Neuroscience (JPN), 27(2):110, 2002.

[8] J.M. Borwein and A.S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2006.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[10] H. Braak and E. Braak. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathologica, 82(4):239-259, 1991.

[11] K. Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. 1909.

[12] V.A. Cardenas, L.L. Chao, C. Studholme, K. Yaffe, B.L. Miller, C. Madison, S.T. Buckley, D. Mungas, N. Schuff and M.W. Weiner. Brain atrophy associated with baseline and longitudinal measures of cognition. Neurobiology of Aging, 32(4):572-580, 2011.

[13] A. Dale, B. Fischl and M.I. Sereno. Cortical surface-based analysis I: segmentation and surface reconstruction. NeuroImage, 9(2):179-194, 1999.

[14] A. d'Aspremont, L. El Ghaoui, M. Jordan and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

[15] A. Delacourte, J.P. David, N. Sergeant, L. Buee, A. Wattez, P. Vermersch, F. Ghozali, C. Fallet-Bianco, F. Pasquier, F. Lebert, et al. The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer's disease. Neurology, 52(6):1158-1158, 1999.

[16] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297-302, 1945.

[17] B.C. Dickerson, E. Feczko, J.C. Augustinack, J. Pacheco, J.C. Morris and B. Fischl. Differential effects of aging and Alzheimer's disease on medial temporal lobe cortical thickness and surface area. Neurobiology of Aging, 30:432-440, 2009.

[18] E. Dohmatob, M. Eickenberg, B. Thirion and G. Varoquaux. Speeding-up model-selection in GraphNet via early-stopping and univariate feature-screening. June 2015.

[19] M. Dubois, F. Hadj-Selem, T. Lofstedt, M. Perrot, C. Fischer, V. Frouin and E. Duchesnay. Predictive support recovery with TV-Elastic Net penalties and logistic regression: an application to structural MRI. In Proceedings of the Fourth International Workshop on Pattern Recognition in Neuroimaging (PRNI 2014), 2014.

[20] H. Eavani, T.D. Satterthwaite, R.E. Filipovych, R.C. Gur and C. Davatzikos. Identifying sparse connectivity patterns in the brain using resting-state fMRI. NeuroImage, 105:286-299, 2015.

[21] D. Felleman and D. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1-47, 1991.

[22] G.B. Frisoni, N.C. Fox, C.R. Jack, P. Scheltens and P.M. Thompson. The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol., 6(2):67-77, 2010.

[23] L. Grosenick, B. Klingenberg, K. Katovich, B. Knutson and J. Taylor. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage, 72:304-321, May 2013.

[24] R. Guo, M. Ahn, H. Zhu and the Alzheimer's Disease Neuroimaging Initiative. Spatially weighted principal component analysis for imaging classification. Journal of Computational and Graphical Statistics, 24:274-296, 2015.

[25] F. Hadj-Selem, T. Lofstedt, V. Frouin, V. Guillemot and E. Duchesnay. An iterative smoothing algorithm for regression with structured sparsity. arXiv:1605.09658 [stat], 2016.

[26] C.R. Jack, M.M. Shiung, J.L. Gunter, P.C. O'Brien, S.D. Weigand, D.S. Knopman, B.F. Boeve, R.J. Ivnik, G.E. Smith, R.H. Cha, et al. Comparison of different MRI brain atrophy rate measures with clinical disease progression in AD. Neurology, 62(4):591-600, 2004.

[27] R. Jardri, A. Pouchet, D. Pins and P. Thomas. Cortical activations during auditory verbal hallucinations in schizophrenia: a coordinate-based meta-analysis. American Journal of Psychiatry, 168(1):73-81, 2011.

[28] R. Jardri, P. Thomas, C. Delmaire, P. Delion and D. Pins. The neurodynamic organization of modality-dependent hallucinations. Cerebral Cortex, pages 1108-1117, 2013.

[29] R. Jenatton, G. Obozinski and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[30] I. Jolliffe, N. Trendafilov and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.

[31] M. Journee, Y. Nesterov, P. Richtarik and R. Sepulchre. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res., 11:517-553, 2010.

[32] B. Kandel, D. Wolk, J. Gee and B. Avants. Predicting cognitive data from medical images using sparse linear regression. Information Processing in Medical Imaging: proceedings of the conference, 23:86-97, 2013.

[33] M. Li, Y. Liu, F. Chen and D. Hu. Including signal intensity increases the performance of blind source separation on brain imaging data. IEEE Transactions on Medical Imaging, 34(2):551-563, 2015.

[34] L.W. Mackey. Deflation methods for sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017-1024. Curran Associates, Inc., 2009.

[35] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, Ecole Normale Superieure de Cachan, 2010.

[36] J. Mairal, F. Bach, J. Ponce and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19-60, 2010.

[37] C.R. McDonald, L. McEvoy, L. Gharapetian, C. Fennema-Notestine, D.J. Hagler, D. Holland, A. Koyama, J.B. Brewer, A.M. Dale, Alzheimer's Disease Neuroimaging Initiative, et al. Regional rates of neocortical atrophy from normal aging to early Alzheimer disease. Neurology, 73(6):457-465, 2009.

[38] H. Mohr, U. Wolfensteller, S. Frimmel and H. Ruge. Sparse regularization techniques provide novel insights into outcome integration processes. NeuroImage, 104:163-176, January 2015.

[39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

[40] B. Ng, A. Vahdat, G. Hamarneh and R. Abugharbieh. Generalized sparse classifiers for decoding cognitive states in fMRI. Pages 108-115, Beijing, China, September 2012. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-15948-0_14.

[41] R. Nieuwenhuys. The myeloarchitectonic studies on the human cerebral cortex of the Vogt-Vogt school, and their significance for the interpretation of functional neuroimaging data. Brain Structure and Function, 218(2):303-352, 2013.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[43] M. Ramezani, K. Marble, H. Trang, I.S. Johnsrude and P. Abolmaesumi. Joint sparse representation of brain activity patterns in multi-task fMRI data. IEEE Transactions on Medical Imaging, 34(1):2-12, 2015.

[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al.




Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567-574, 2008.

[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang and D. Hu. Making group inferences using sparse representation of resting-state functional MRI data with application to sleep deprivation. Human Brain Mapping, 38(9):4671-4689, 2017.

[46] J.G. Sled, A.P. Zijdenbos and A.C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17:87-97, 1998.

[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754-1766, 2004.

[48] W.-T. Wang and H.-C. Huang. Regularized principal component analysis for spatial data. ArXiv e-prints, 2015.

[49] D.M. Witten, R. Tibshirani and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, 2009.

[50] X. Chen, Q. Lin, S. Kim, J. Carbonell and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719-752, 2012.

[51] H. Zou, T. Hastie and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.





and selection. The inner folds select the set of parameters minimizing the reconstruction error on the outer fold. For the synthetic data, we used 50 different purposely-generated data sets and 5 inner folds for parameter selection. In order to evaluate the reconstruction accuracy of the methods, we reported the mean Frobenius norm of the reconstruction error across the folds/data sets, on independent test data. The hypothesis we wanted to test was whether there was a substantial decrease in the reconstruction error of independent data when using SPCA-TV compared to when using Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA. It was tested through a related two-samples t-test. This choice to compare method performance on independent test data was motivated by the fact that the optimal reconstruction of the training set is necessarily hindered by the spatial and sparsity constraints. We therefore expect SPCA-TV to perform worse on train data than other, less constrained, methods. However, the TV penalty has a more important purpose than just to minimize the reconstruction error: the estimation of coherent and reproducible loadings. Indeed, clinicians expect that if images from other patients with comparable clinical conditions had been used, the extracted loading vectors would have turned out to be similar. Therefore, since the ultimate goal of SPCA-TV is to yield stable and reproducible weight maps, it is more relevant to evaluate methods on independent test data.

The stability of the loading vectors obtained across various training data sets (variation in the learning samples) was assessed through a similarity measure: the pairwise Dice index between loading vectors obtained with different folds/data sets [16]. We tested whether the pairwise Dice indices are significantly higher in SPCA-TV compared to other methods. Testing this hypothesis is equivalent to testing the sign of the difference of pairwise Dice indices between methods. However, since the pairwise Dice indices are not independent from one another (the folds share many of their learning samples), the direct significance measures are biased. We therefore used permutation testing to estimate empirical p-values. The null hypothesis was tested by simulating samples from the null distribution: we generated 1,000 random permutations of the sign of the difference of pairwise Dice indices between the PCA methods under comparison, and then the statistics on the true data were compared with the ones obtained on the reshuffled data to obtain empirical p-values.
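The sign-flip permutation scheme above can be sketched as follows (a minimal sketch; the Dice-difference values below are made up for illustration):

```python
import numpy as np

def sign_flip_pvalue(diffs, n_perm=1000, seed=0):
    """One-sided empirical p-value for the mean paired difference
    (e.g. Dice(SPCA-TV) - Dice(other method)) under the null that the
    differences are symmetrically distributed around zero."""
    rng = np.random.default_rng(seed)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)   # means under random sign flips
    # add-one correction so the p-value is never exactly zero
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

# Hypothetical pairwise Dice-index differences across folds
diffs = np.array([0.22, 0.18, 0.25, 0.19, 0.31, 0.20, 0.17, 0.24, 0.28, 0.21])
p_value = sign_flip_pvalue(diffs)         # small: stability gain is significant
```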

For each experiment, we made the initial choice to retrieve the first ten components. However, given the length constraint, we only present the weight maps associated with the top three components for Sparse PCA and SPCA-TV in this paper. The weight maps of ElasticNet PCA, GraphNet PCA and SSPCA for each experiment are presented in the supplementary materials (available in the supplementary files/multimedia tab).

A. Simulation study

We generated 50 sets of synthetic data, each composed of 500 images of size 100 x 100 pixels. Images are generated using the following noisy linear system:

u1 V1 + u2 V2 + u3 V3 + eps ∈ R^10,000,   (27)

where V = [V1, V2, V3] ∈ R^(10,000x3) are sparse and structured loading vectors, illustrated in Fig. 1. The support of V1 defines the two upper dots, the support of V2 defines the two lower dots, while V3's support delineates the middle dot. The coefficients u = [u1, u2, u3] that linearly combine the components of V are generated according to a centered Gaussian distribution. The elements of the noise vector eps are independent and identically distributed according to a centered Gaussian distribution, with a 0.1 signal-to-noise ratio (SNR). This SNR was selected by a previous calibration pipeline, where we tested the efficiency of data reconstruction at multiple SNR values ranging from 0 to 0.5. We decided to work with a 0.1 SNR because it is located in the range of values where standard PCA starts being less efficient in the recovery process.
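The generation process above can be sketched as follows (a minimal numpy sketch; the exact dot positions and the norm-ratio definition of the SNR are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 500, 3

# Hypothetical stand-ins for the three structured loading vectors of
# Fig. 1: each V^j has a small square "dot" of active pixels.
img = np.zeros((100, 100, k))
img[20:30, 20:30, 0] = 1.0   # upper dot(s), support of V1
img[70:80, 20:30, 1] = 1.0   # lower dot(s), support of V2
img[45:55, 45:55, 2] = 1.0   # middle dot, support of V3
V = img.reshape(-1, k)       # p x k, p = 10,000

U = rng.normal(size=(n, k))  # centered Gaussian mixing coefficients
signal = U @ V.T             # noiseless images, one per row (n x p)

# Scale i.i.d. Gaussian noise so that ||signal|| / ||noise|| = 0.1
# (assumed SNR definition).
noise = rng.normal(size=signal.shape)
noise *= np.linalg.norm(signal) / (0.1 * np.linalg.norm(noise))
X = signal + noise           # Eq. (27): u1 V1 + u2 V2 + u3 V3 + eps
```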

Fig. 1: Loading vectors V = [V1, V2, V3] ∈ R^(10,000x3) used to generate the images.

We split the 500 artificial images into a test and a training set, with 250 images in each set, and learned the decomposition on the training set.

Fig. 2: Loading vectors recovered from 250 images using Sparse PCA and SPCA-TV.

Fig. 2 represents the loading vectors extracted with one data set. Please note that the sign is arbitrary: indeed, if we consider the loss of Eq. (3), u^T and v can both be multiplied by -1




TABLE I: Scores are averaged across the 50 independent data sets. We tested whether the scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: ***: p <= 10^-3.

Methods          Test data reconstruction error   MSE    Dice index
Sparse PCA       1576.0                           0.91   0.28
ElasticNet PCA   1572.4                           0.83   0.43
GraphNet PCA     1570.8                           0.83   0.30
SSPCA            1571.9                           1.54   0.07
SPCA-TV          1570.1                           0.64   0.52

without changing anything. We observe that Sparse PCA yields very scattered loading vectors. The loading vectors of SPCA-TV, on the other hand, are sparse but also organized in clear regions: SPCA-TV provides loading vectors that closely match the ground truth. The reconstruction error is evaluated on the test sets (Tab. I), with its value over the 50 data sets being significantly lower in SPCA-TV than in Sparse PCA (T = 94.5, p = 3.9 · 10^-57), ElasticNet PCA (T = 33.2, p = 2.7 · 10^-35), GraphNet PCA (T = 12.7, p = 3.6 · 10^-17) and SSPCA from [29] (T = 18.9, p = 3.9 · 10^-24). Additional details concerning the reconstruction accuracy on both the train and test data are presented in Figure 1 of the supplementary materials (available in the supplementary files/multimedia tab).

A different way of quantifying the reconstruction accuracy for each method is to evaluate how closely the extracted loadings match the known ground truth of the simulated data set. We computed the mean squared error (MSE) between the ground truth and the estimated loadings. The results are presented in Tab. I. We note that the MSE is significantly lower with SPCA-TV than with Sparse PCA (T = 6.9, p = 8.0 · 10^-9), ElasticNet PCA (T = 6.2, p = 1.1 · 10^-7), GraphNet PCA (T = 4.1, p = 1.4 · 10^-4) and SSPCA (T = 22.6, p = 1.5 · 10^-27).
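Since each loading vector is recovered only up to sign, comparisons against the ground truth (MSE here, or the Dice index on supports) should first align signs. A minimal numpy sketch with made-up vectors:

```python
import numpy as np

def align_sign(v_est, v_ref):
    """Flip the estimated loading vector if it anti-correlates with the
    reference: (u, v) and (-u, -v) yield the same loss, so the sign is
    arbitrary and must be fixed before computing MSE or Dice."""
    return v_est if float(v_est @ v_ref) >= 0.0 else -v_est

v_ref = np.array([0.0, 1.0, 1.0, 0.0])      # hypothetical ground truth
v_est = np.array([0.1, -0.9, -1.1, 0.0])    # recovered with flipped sign
v_aligned = align_sign(v_est, v_ref)
mse = float(np.mean((v_aligned - v_ref) ** 2))
```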

Moreover, when evaluating the stability of the loading vectors across resampling, we found a significantly higher mean Dice index when using SPCA-TV compared to the other methods (p < 0.001). The results are presented in Tab. I. They indicate that SPCA-TV is more robust to variation in the learning samples than the other sparse methods: SPCA-TV yields reproducible loading vectors across data sets.

These results indicate that the SPCA-TV loadings are not only more stable across resampling, but also achieve a better recovery of the underlying variability in independent data than the Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA methods.

One of the issues linked to biconvex optimization is the risk of falling into local minima. Conscious of this potential risk, we set up an experiment in which we ran the optimization of the same problem 50 times, with a different starting point at each run. We then compared the resulting loading vectors obtained at each run and computed a similarity measure, the Dice index. It quantifies the proximity between each

pair of independently-run solutions with different starting points. We obtained a Dice index of 0.99 on the 1st component, 0.99 on the 2nd component, and 0.72 on the 3rd component. On the strength of these indices, we are confident in the algorithm's robustness and its ability to converge toward the same stable solution, independently of the choice of starting point.

B. 3D images of functional MRI of patients with schizophrenia

We then applied the methods to 3D images of BOLD functional MRI (fMRI), acquired with the same scanner and pulse sequence. Imaging was performed on a 1.5 T scanner using a standard head-coil. For all functional scans, the field-of-view was 206 x 206 x 153 mm, with a resolution close to 3.5 mm in all directions. The parameters of the PRESTO sequence were: TE = 9.6 ms, TR = 19.25 ms, EPI-factor = 15, flip angle = 9 degrees. Each fMRI run consisted of 900 collected volumes. The cohort is composed of 23 patients with schizophrenia (average age = 34.96 years, 8 females / 15 males). Brain activation was measured while subjects experienced multimodal hallucinations. The fMRI data was pre-processed using SPM12 (WELLCOME Department of Imaging Neuroscience, London, UK). Data preprocessing consisted of motion correction (realignment), coregistration of the individual anatomical T1 image to the functional images, and spatial normalization to MNI space using DARTEL, based on segmented T1 scans.

We considered each set of consecutive images under a pre-hallucination state as a block. Since most of the patients hallucinate more than once during the scanning session, we have more blocks than patients (83 blocks). The activation maps are computed from these blocks. Based on the general linear model approach, we regressed, for each block, the fMRI signal time course on a linear ramp function. Indeed, we hypothesized that activation in some regions presents a ramp-like increase during the time preceding the onset of hallucinations (see an example of regression in Figure 3 of the supplementary materials, available in the supplementary files/multimedia tab). The activation maps that we used as an input to the SPCA-TV method are the statistical parametric maps associated with the coefficients of the block regression (see one example in Figure 4 of the supplementary materials, available in the supplementary files/multimedia tab). We obtained a data set of n = 83 maps and p = 63,966 features. We hypothesized that the principal components extracted from these activation maps with SPCA-TV could uncover major trends of variability within pre-hallucination patterns. Thus, they might reveal the existence of subgroups of patients according to the sensory modality (e.g., vision or audition) involved during hallucinations.
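The block regression can be sketched as an ordinary least-squares GLM with an intercept and a linear ramp regressor (a toy, single-voxel sketch on simulated data; the ramp length and noise level are made up, not the study's values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated voxel time course over one pre-hallucination block: the
# hypothesis is a ramp-like signal increase before hallucination onset.
n_scans = 40
ramp = np.arange(n_scans, dtype=float)     # linear ramp regressor
signal = 0.5 * ramp + rng.normal(scale=2.0, size=n_scans)

# GLM: intercept + ramp. The ramp coefficient beta[1] is the per-block
# statistic whose parametric map would feed the PCA methods.
X = np.column_stack([np.ones(n_scans), ramp])
beta, *_ = np.linalg.lstsq(X, signal, rcond=None)
```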

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, the SSPCA method ([29]) could not be applied to this specific example, since data sets have to be constituted of closed cubic forms, without any holes, to be eligible for SSPCA: it does not support masked data such as the one used here.

The loading vectors extracted from the activation maps of pre-hallucination scans with Sparse PCA and SPCA-TV are presented in Fig. 3. We observe a similar behavior as in the synthetic example, namely that the loading vectors of




Sparse PCA tend to be scattered and produce irregular patterns. However, SPCA-TV seems to yield structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCA-TV loading vectors are not redundant, and reveal different patterns.

Indeed, the loading vectors obtained by SPCA-TV are of great interest, because they reveal insightful patterns of variability in the data: the second loading is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also vision-processing areas such as the occipital fusiform gyrus and the parietal operculum cortex regions. The third loading reveals important weights in the middle temporal gyrus, the parietal operculum cortex and the frontal pole. The first loading vector encompasses all features of the brain. One might see this first component as a global variability affecting the whole brain, such as the overarching effect of age. SPCA-TV selects this dense configuration in spite of the sparsity constraint. It is highly desirable to remove any sort of global effect first, in order to start identifying, in the next components, local patterns that are not impacted by a global variability of this kind. We can identify a widespread set of dysfunctional language-related or vision-related areas that present increasing activity during the time preceding the onset of hallucinations. The regions extracted by SPCA-TV are found to be pertinent according to the existing literature on the topic ([28], [27], [7]).

These results seem to indicate the possible existence of subgroups of patients according to the hallucination modalities involved. An interesting application would be to use the scores of the second component extracted by SPCA-TV in order to distinguish patients with visual hallucinations from those suffering mainly from auditory hallucinations.

The reconstruction error is significantly lower in SPCA-TV than in Sparse PCA (T = 13.9, p = 1.5 · 10^-4), ElasticNet PCA (T = 7.1, p = 2.1 · 10^-3) and GraphNet PCA (T = 4.6, p = 1.0 · 10^-2). Moreover, when assessing the stability of the loading vectors across the folds, we found a significantly higher mean Dice index in SPCA-TV compared to Sparse PCA (p = 4.0 · 10^-3), ElasticNet PCA (p = 4.0 · 10^-3) and GraphNet PCA (p = 2.0 · 10^-3), as presented in Tab. II. Additional details regarding the reconstruction accuracy on both the train and test sets, and the Dice index, are presented in Figure 5 of the supplementary materials (available in the supplementary files/multimedia tab).

TABLE II: Scores for the fMRI data are averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notations: ***: p <= 10^-3, **: p <= 10^-2.

Methods          Test data reconstruction error   Dice index
Sparse PCA       1515.2                           0.34
ElasticNet PCA   1482.7                           0.32
GraphNet PCA     1428.1                           0.58
SPCA-TV          1414.0                           0.63

In conclusion, SPCA-TV significantly outperforms Sparse, ElasticNet and GraphNet PCA in terms of the reconstruction error on independent test data, and in the sense that its loading vectors are both more clinically interpretable and more stable.

We also evaluated the convergence speed of Sparse PCA, Mini-Batch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA, GraphNet PCA and SPCA-TV for this functional MRI data set of n = 83 samples and p = 63,966 features. We compared the time of execution required for each algorithm to achieve a given level of precision in Tab. III. Sparse PCA and ElasticNet PCA are similar in terms of convergence time, while Mini-Batch Sparse PCA is much faster but does not converge to high precision. As expected, the structured methods (GraphNet PCA and SPCA-TV) take longer than the other sparse methods, because of the inclusion

Fig. 3: Loading vectors recovered from the 83 activation maps using Sparse PCA and SPCA-TV.




TABLE III: Comparison of the execution time required for each sparse method to reach the same precision. Times are reported in seconds.

Time to reach a given precision (seconds)

Methods                 10       1        10^-1    10^-2     10^-3
Mini-Batch Sparse PCA   53.2     -        -        -         -
Sparse PCA              158.0    231.2    344.3    386.8     450.1
ElasticNet PCA          123.7    138.1    302.7    396.4     406.3
GraphNet PCA            301.9    521.6    813.1    881.4     888.4
SPCA-TV                 427.7    2958.6   8093.0   13813.4   14459.9

of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but its convergence time is still reasonable for an fMRI data set with 65,000 voxels.

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients from the ADNI database with a diagnosis of mild cognitive impairment (MCI) who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy, and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13] and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).

We applied all PCA methods under study to this data setexcept SSPCA Indeed we could not applied SSPCA methodto this data set due to some intrinsic limitations of the methodSSPCA rsquos application is restricted to N -dimensional arrayimages It does not support meshes of cortical surfaces suchas the data set used here

The loading vectors obtained from the data set with sparsePCA and SPCA-TV are presented in Fig 4 As expectedSparse PCA loadings are not easily interpretable becausethe patterns are irregular and dispersed throughout the brain

surface In contrast SPCA-TV reveals structured and smoothclusters in relevant regions

The first loading vector which maps the whole surface ofthe brain can be interpreted as the variability between patientsresulting from a global cortical atrophy as often observed inAD patients The second loading vector includes variability inthe entorhinal cortex hippocampus and in temporal regionsLast the third loading vector might be related to the atrophyof the frontal lobe and captures variability in the precuneustoo Thus SPCA-TV provides a smooth map that closelymatches the well-known brain regions involved in Alzheimerrsquosdisease[22]

Indeed it is well-documented that cortical atrophy pro-gresses over three main stages in Alzheimer disease([10][15]) The cortical structures are sequentially being affectedbecause of the accumulation of amyloid plaques Corticalatrophy is first observed in the mild stage of the disease inregions surrounding the hippocampus ([26] [44] [47]) andthe enthorinal cortex ([12]) as seen in the second componentThis is consistent with early memory deficits Then the dis-ease progresses to a moderate stage where atrophy graduallyextends to the prefrontal association cortex as revealed in thethird component ([37]) In the severe stage of the disease thewhole cortex is affected by atrophy ([15]) (as revealed in thefirst component) In order to assess the clinical significanceof these weight maps we tested the correlation between thescores corresponding to the three components and performanceon a clinical test ADAS The Alzheimerrsquos Disease AssessmentScale-Cognitive subscale is the most widely used generalcognitive measure in AD ADAS is scored in terms of errorsso a high score indicates poor performance We obtainedsignificant correlations between ADAS test performance andcomponents rsquoscores in Fig 5 r = minus034 p = 42 middot 10minus11for the first component r = minus026 p = 36 middot 10minus7 for thesecond component and r = minus035 p = 40 middot 45minus12 for thethird component) The same behavior is observable for allthree components The ADAS score grows proportionately tothe level to which a patient is affected and to the severityof atrophy he presents (in temporal pole prefrontal regionand also globally) Conversely controls subjects score low onthe ADAS metric and present low level of cortical atrophyTherefore SPCA-TV provides us with clear biomarkers thatare perfectly relevant to the scope of Alzheimerrsquos diseaseprogression

The reconstruction error is significantly lower in SPCA-TVthan in Sparse PCA (T = 127 p = 21 middot 10minus4) ElasticNetPCA (T = 68 p = 23 middot10minus3) and GraphNet PCA (T = 283p = 47middot10minus2) The results are presented in Tab IV Moreoverwhen assessing the stability of the loading vectors across thefolds the mean Dice index is significantly higher in SPCA-TV than in other methods Additional details regarding thereconstruction accuracy on both the train and test sets and theDice index is presented in Figure 7 of supplementary materials(available in the supplementary files multimedia tab)

IV CONCLUSION

We proposed an extension of Sparse PCA that takes intoaccount the spatial structure of the data The optimization

0278-0062 (c) 2017 IEEE Personal use is permitted but republicationredistribution requires IEEE permission See httpwwwieeeorgpublications_standardspublicationsrightsindexhtml for more information

This article has been accepted for publication in a future issue of this journal but has not been fully edited Content may change prior to final publication Citation information DOI 101109TMI20172749140 IEEETransactions on Medical Imaging

IEEE TRANSACTIONS ON MEDICAL IMAGING 11

TABLE IV Scores are averaged across the 5 folds We testedwhether the averaged scores obtained with existing PCA meth-ods are significantly lower from scores obtained with SPCA-TV Significance notations p le 10minus3 p le 10minus2 p le 10minus1

Scores

Methods Test Data Reconstruction Error Dice Index

Sparse PCA 29918 044ElasticNet PCA 28326 043GraphNet PCA 28136 062SPCA-TV 27950 065

scheme is able to minimize any combination of the `1 `2and TV penalties while preserving the exact `1 penalty Weobserve that SPCA-TV in contrast to other existing sparsePCA methods yields clinically interpretable results and revealsmajor sources of variability in data by highlighting structuredclusters of interest in the loading vectors Furthermore SPCA-TV rsquos loading vectors were more stable across the learningsamples compared to other methods SPCA-TV was validatedand its applicability was demonstrated on three distinct datasets we may reach the conclusion that SPCA-TV can be usedon any kind of structured configurations and is able to presentstructure within the data

[Fig. 4 appears here: two panels (Sparse PCA and SPCA-TV), each showing components 1-3 on the cortical surface with signed color scales.]

Fig. 4: Loading vectors recovered from the 133 MCI patients using Sparse PCA and SPCA-TV.

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python library:
- Url: https://github.com/neurospin/pylearn-parsimony
- Description: ParsimonY is a Python library for structured and sparse machine learning. ParsimonY is open-source (BSD License) and compliant with the scikit-learn API.

b) Data sets and scripts:
- Url: ftp://ftp.cea.fr/pub/unati/brainomics/papers/pcatv
- Description: This URL provides the simulation data set and the Python script used to create Fig. 2 of the paper.

REFERENCES

[1] A. Abraham, E. Dohmatob, B. Thirion, D. Samaras, and G. Varoquaux. Extracting brain regions from rest fMRI with total-variation constrained dictionary learning. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2013 - 16th International Conference, Nagoya, Japan, 2013, Proceedings, Part II, pages 607-615, 2013.

[2] B. Fischl, M. Sereno, and A. Dale. Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195-207, 1999.

[3] A. Bakkour, J.C. Morris, and B.C. Dickerson. The cortical signature of prodromal AD: regional thinning predicts mild AD dementia. Neurology, 72:1048-1055, 2009.

[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

Fig. 5: Correlation of component scores with ADAS test performance.

0278-0062 (c) 2017 IEEE Personal use is permitted but republicationredistribution requires IEEE permission See httpwwwieeeorgpublications_standardspublicationsrightsindexhtml for more information

This article has been accepted for publication in a future issue of this journal but has not been fully edited Content may change prior to final publication Citation information DOI 101109TMI20172749140 IEEETransactions on Medical Imaging

IEEE TRANSACTIONS ON MEDICAL IMAGING 12

[5] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Processing, 18(11):2419-2434, 2009.

[6] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557-580, 2012.

[7] L. Bentaleb, M. Beauregard, P. Liddle, and E. Stip. Cerebral activity associated with auditory verbal hallucinations: a functional magnetic resonance imaging case study. Journal of Psychiatry & Neuroscience: JPN, 27(2):110, 2002.

[8] J.M. Borwein and A.S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2006.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[10] H. Braak and E. Braak. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathologica, 82(4):239-259, 1991.

[11] K. Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. 1909.

[12] V.A. Cardenas, L.L. Chao, C. Studholme, K. Yaffe, B.L. Miller, C. Madison, S.T. Buckley, D. Mungas, N. Schuff, and M.W. Weiner. Brain atrophy associated with baseline and longitudinal measures of cognition. Neurobiology of Aging, 32(4):572-580, 2011.

[13] A. Dale, B. Fischl, and M.I. Sereno. Cortical surface-based analysis I: Segmentation and surface reconstruction. NeuroImage, 9(2):179-194, 1999.

[14] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

[15] A. Delacourte, J.P. David, N. Sergeant, L. Buee, A. Wattez, P. Vermersch, F. Ghozali, C. Fallet-Bianco, F. Pasquier, F. Lebert, et al. The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer's disease. Neurology, 52(6):1158-1158, 1999.

[16] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297-302, 1945.

[17] B.C. Dickerson, E. Feczko, J.C. Augustinack, J. Pacheco, J.C. Morris, and B. Fischl. Differential effects of aging and Alzheimer's disease on medial temporal lobe cortical thickness and surface area. Neurobiology of Aging, 30:432-440, 2009.

[18] E. Dohmatob, M. Eickenberg, B. Thirion, and G. Varoquaux. Speeding-up model-selection in GraphNet via early-stopping and univariate feature-screening. June 2015.

[19] M. Dubois, F. Hadj-Selem, T. Lofstedt, M. Perrot, C. Fischer, V. Frouin, and E. Duchesnay. Predictive support recovery with TV-Elastic Net penalties and logistic regression: an application to structural MRI. In Proceedings of the Fourth International Workshop on Pattern Recognition in Neuroimaging (PRNI 2014), 2014.

[20] H. Eavani, T.D. Satterthwaite, R.E. Filipovych, R.C. Gur, and C. Davatzikos. Identifying sparse connectivity patterns in the brain using resting-state fMRI. NeuroImage, 105:286-299, 2015.

[21] D. Felleman and D. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1-47, 1991.

[22] G.B. Frisoni, N.C. Fox, C.R. Jack, P. Scheltens, and P.M. Thompson. The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol., 6(2):67-77, 2010.

[23] L. Grosenick, B. Klingenberg, K. Katovich, B. Knutson, and J. Taylor. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage, 72:304-321, May 2013.

[24] R. Guo, M. Ahn, H. Zhu, and the Alzheimer's Disease Neuroimaging Initiative. Spatially weighted principal component analysis for imaging classification. Journal of Computational and Graphical Statistics, 24:274-296, 2015.

[25] F. Hadj-Selem, T. Lofstedt, V. Frouin, V. Guillemot, and E. Duchesnay. An iterative smoothing algorithm for regression with structured sparsity. arXiv:1605.09658 [stat], 2016.

[26] C.R. Jack, M.M. Shiung, J.L. Gunter, P.C. O'Brien, S.D. Weigand, D.S. Knopman, B.F. Boeve, R.J. Ivnik, G.E. Smith, R.H. Cha, et al. Comparison of different MRI brain atrophy rate measures with clinical disease progression in AD. Neurology, 62(4):591-600, 2004.

[27] R. Jardri, A. Pouchet, D. Pins, and P. Thomas. Cortical activations during auditory verbal hallucinations in schizophrenia: a coordinate-based meta-analysis. American Journal of Psychiatry, 168(1):73-81, 2011.

[28] R. Jardri, P. Thomas, C. Delmaire, P. Delion, and D. Pins. The neurodynamic organization of modality-dependent hallucinations. Cerebral Cortex, pages 1108-1117, 2013.

[29] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[30] I. Jolliffe, N. Trendafilov, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.

[31] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res., 11:517-553, 2010.

[32] B. Kandel, D. Wolk, J. Gee, and B. Avants. Predicting cognitive data from medical images using sparse linear regression. Information Processing in Medical Imaging: proceedings of the conference, 23:86-97, 2013.

[33] M. Li, Y. Liu, F. Chen, and D. Hu. Including signal intensity increases the performance of blind source separation on brain imaging data. IEEE Transactions on Medical Imaging, 34(2):551-563, 2015.

[34] L.W. Mackey. Deflation methods for sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017-1024. Curran Associates, Inc., 2009.

[35] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, École Normale Supérieure de Cachan, 2010.

[36] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19-60, 2010.

[37] C.R. McDonald, L. McEvoy, L. Gharapetian, C. Fennema-Notestine, D.J. Hagler, D. Holland, A. Koyama, J.B. Brewer, A.M. Dale, Alzheimer's Disease Neuroimaging Initiative, et al. Regional rates of neocortical atrophy from normal aging to early Alzheimer disease. Neurology, 73(6):457-465, 2009.

[38] H. Mohr, U. Wolfensteller, S. Frimmel, and H. Ruge. Sparse regularization techniques provide novel insights into outcome integration processes. NeuroImage, 104:163-176, January 2015.

[39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

[40] B. Ng, A. Vahdat, G. Hamarneh, and R. Abugharbieh. Generalized sparse classifiers for decoding cognitive states in fMRI. In SpringerLink, pages 108-115, Beijing, China, September 2012. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-15948-0_14.

[41] R. Nieuwenhuys. The myeloarchitectonic studies on the human cerebral cortex of the Vogt-Vogt school, and their significance for the interpretation of functional neuroimaging data. Brain Structure and Function, 218(2):303-352, 2013.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[43] M. Ramezani, K. Marble, H. Trang, I.S. Johnsrude, and P. Abolmaesumi. Joint sparse representation of brain activity patterns in multi-task fMRI data. IEEE Transactions on Medical Imaging, 34(1):2-12, 2015.

[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al. Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567-574, 2008.

[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang, and D. Hu. Making group inferences using sparse representation of resting-state functional MRI data with application to sleep deprivation. Human Brain Mapping, 38(9):4671-4689, 2017.

[46] J.G. Sled, A.P. Zijdenbos, and A.C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17:87-97, 1998.

[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell, and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754-1766, 2004.

[48] W.-T. Wang and H.-C. Huang. Regularized principal component analysis for spatial data. ArXiv e-prints, 2015.

[49] D.M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, 2009.

[50] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719-752, 2012.

[51] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.


TABLE I: Scores are averaged across the 50 independent data sets. We tested whether the scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notation: ***: p <= 10^-3.

Methods          Test Data Reconstruction Error   MSE    Dice Index
Sparse PCA       157.60                           0.91   0.28
ElasticNet PCA   157.24                           0.83   0.43
GraphNet PCA     157.08                           0.83   0.30
SSPCA            157.19                           1.54   0.07
SPCA-TV          157.01                           0.64   0.52

without changing anything. We observe that Sparse PCA yields very scattered loading vectors. The loading vectors of SPCA-TV, on the other hand, are sparse but also organized in clear regions. SPCA-TV provides loading vectors that closely match the ground truth. The reconstruction error is evaluated on the test sets (Tab. I), with its value over the 50 data sets being significantly lower in SPCA-TV than in the Sparse PCA (T = 94.5, p = 3.9 · 10^-57), ElasticNet PCA (T = 33.2, p = 2.7 · 10^-35), GraphNet PCA (T = 12.7, p = 3.6 · 10^-17) and SSPCA [29] (T = 18.9, p = 3.9 · 10^-24) methods. Additional details concerning the reconstruction accuracy of both the train and test data are presented in Figure 1 of the supplementary materials (available in the supplementary files / multimedia tab).
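The test reconstruction error used above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; in particular, sparse loadings are generally not orthonormal, so the sketch computes least-squares scores with a pseudo-inverse, which is an assumption about the scoring step.

```python
import numpy as np

def reconstruction_error(X_test, V):
    """Frobenius reconstruction error of X_test from a loading matrix V (p x k).

    Sparse loadings need not be orthonormal, so scores are obtained by
    least squares (via the pseudo-inverse) rather than by V.T alone.
    """
    U = X_test @ np.linalg.pinv(V).T       # least-squares scores (n x k)
    return np.linalg.norm(X_test - U @ V.T)

rng = np.random.default_rng(0)
V = rng.normal(size=(10, 2))               # toy loadings
X = rng.normal(size=(5, 2)) @ V.T          # data lying exactly in span of V
print(round(reconstruction_error(X, V), 6))  # → 0.0
```

Data that lies exactly in the span of the loadings is reconstructed perfectly; held-out data generally is not, and the residual norm is the score reported in Tab. I.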

A different way of quantifying the reconstruction accuracy of each method is to evaluate how closely the extracted loadings match the known ground truth of the simulated data set. We computed the mean squared error (MSE) between the ground truth and the estimated loadings. The results are presented in Tab. I. We note that the MSE is significantly lower with SPCA-TV than with Sparse PCA (T = 6.9, p = 8.0 · 10^-9), ElasticNet PCA (T = 6.2, p = 1.1 · 10^-7), GraphNet PCA (T = 4.1, p = 1.4 · 10^-4) and SSPCA (T = 22.6, p = 1.5 · 10^-27).
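The loading-recovery metric can be sketched as an MSE between ground-truth and estimated loading vectors. One caveat: PCA loadings are defined only up to sign, so the sketch below flips the estimate to the best-matching sign first; that alignment step is an assumption, not something stated in the text.

```python
import numpy as np

def loading_mse(v_true, v_hat):
    """MSE between a ground-truth loading vector and its estimate.

    Loadings are sign-indeterminate, so flip v_hat to the sign that
    best matches v_true before averaging the squared differences.
    """
    if np.dot(v_true, v_hat) < 0:
        v_hat = -v_hat
    return np.mean((v_true - v_hat) ** 2)

v_true = np.array([0.0, 0.6, 0.8, 0.0])
v_hat  = np.array([0.0, -0.6, -0.8, 0.1])     # recovered with opposite sign
print(round(loading_mse(v_true, v_hat), 6))   # → 0.0025
```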

Moreover, when evaluating the stability of the loading vectors across resampling, we found a significantly higher mean Dice index when using SPCA-TV compared to the other methods (p < 0.001). The results are presented in Tab. I. They indicate that SPCA-TV is more robust to variation in the learning samples than the other sparse methods: SPCA-TV yields reproducible loading vectors across data sets.

These results indicate that the SPCA-TV loadings are not only more stable across resampling, but also achieve a better recovery of the underlying variability in independent data than the Sparse PCA, ElasticNet PCA, GraphNet PCA and SSPCA methods.

One of the issues linked to biconvex optimization is the risk of falling into local minima. Aware of this potential risk, we set up an experiment in which we ran the optimization of the same problem 50 times, with a different starting point at each run. We then compared the resulting loading vectors obtained at each run and computed a similarity measure, the Dice index, which quantifies the proximity between the solutions of independently-run optimizations with different starting points. We obtained a Dice index of 0.99 on the 1st component, 0.99 on the 2nd component and 0.72 on the 3rd component. On the strength of these indices, we are confident in the algorithm's robustness and its ability to converge toward the same stable solution independently of the choice of the starting point.
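The pairwise Dice comparison described above can be sketched as follows. This is a minimal illustration: the support threshold and the averaging over all pairs of runs are assumptions about how the comparison is set up, since the text only names the index.

```python
import numpy as np

def dice(u, v, tol=1e-8):
    """Dice index between the non-zero supports of two loading vectors."""
    su, sv = np.abs(u) > tol, np.abs(v) > tol
    inter = np.logical_and(su, sv).sum()
    denom = su.sum() + sv.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def mean_pairwise_dice(loadings):
    """Average Dice over all pairs of solutions (one loading vector per run)."""
    n = len(loadings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean([dice(loadings[i], loadings[j]) for i, j in pairs]))

# three runs of the same component with different starting points
runs = [np.array([0.5, 0.0, 0.3, 0.0]),
        np.array([0.4, 0.0, 0.2, 0.0]),
        np.array([0.6, 0.1, 0.0, 0.0])]
print(mean_pairwise_dice(runs))
```

A Dice index of 1 means identical supports; values near 0 mean the runs selected disjoint sets of voxels.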

B. 3D images of functional MRI of patients with schizophrenia

We then applied the methods on 3D images of BOLD functional MRI (fMRI) acquired with the same scanner and pulse sequence. Imaging was performed on a 1.5 T scanner using a standard head coil. For all functional scans, the field of view was 206 × 206 × 153 mm, with a resolution close to 3.5 mm in all directions. The parameters of the PRESTO sequence were: TE = 9.6 ms, TR = 19.25 ms, EPI factor = 15, flip angle = 9°. Each fMRI run consisted of 900 collected volumes. The cohort is composed of 23 patients with schizophrenia (average age = 34.96 years; 8 females, 15 males). Brain activation was measured while subjects experienced multimodal hallucinations. The fMRI data was pre-processed using SPM12 (Wellcome Department of Imaging Neuroscience, London, UK). Data preprocessing consisted of motion correction (realignment), coregistration of the individual anatomical T1 image to the functional images, and spatial normalization to MNI space using DARTEL, based on segmented T1 scans.

We considered each set of consecutive images under a pre-hallucination state as a block. Since most of the patients hallucinate more than once during the scanning session, we have more blocks than patients (83 blocks). The activation maps are computed from these blocks. Based on the general linear model approach, we regressed, for each block, the fMRI signal time course on a linear ramp function. Indeed, we hypothesized that activation in some regions presents a ramp-like increase during the time preceding the onset of hallucinations (see an example of regression in Figure 3 of the supplementary materials, available in the supplementary files / multimedia tab). The activation maps that we used as an input to the SPCA-TV method are the statistical parametric maps associated with the coefficients of the block regression (see an example in Figure 4 of the supplementary materials, available in the supplementary files / multimedia tab). We obtained a data set of n = 83 maps and p = 63,966 features. We hypothesized that the principal components extracted with SPCA-TV from these activation maps could uncover major trends of variability within pre-hallucination patterns. Thus, they might reveal the existence of subgroups of patients according to the sensory modality (e.g., vision or audition) involved during hallucinations.
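The per-block ramp regression can be sketched as an ordinary least-squares fit of each voxel's time course on an intercept plus a linear ramp. This is a hedged illustration: the regressor scaling and variable names are assumptions, not the exact SPM design used in the study.

```python
import numpy as np

def ramp_coefficients(block):
    """OLS fit of each voxel's time course on [intercept, linear ramp].

    block: (n_timepoints, n_voxels) array of fMRI signal for one
    pre-hallucination block. Returns the ramp coefficient per voxel;
    the map of these coefficients is the kind of input fed to SPCA-TV.
    """
    t = block.shape[0]
    ramp = np.linspace(0.0, 1.0, t)                 # ramp-like increase
    X = np.column_stack([np.ones(t), ramp])         # design matrix
    beta, *_ = np.linalg.lstsq(X, block, rcond=None)
    return beta[1]                                  # ramp slope per voxel

# toy block: voxel 0 ramps up, voxel 1 stays flat
t = 20
block = np.column_stack([np.linspace(0.0, 2.0, t), np.full(t, 0.7)])
print(ramp_coefficients(block))  # ≈ [2.0, 0.0]
```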

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, the SSPCA method ([29]) could not be applied to this specific example, since data sets have to be constituted of closed cubic forms, without any holes, to be eligible for SSPCA; it does not support masked data such as the data used here.

The loading vectors extracted from the activation maps of pre-hallucination scans with Sparse PCA and SPCA-TV are presented in Fig. 3. We observe a similar behavior as in the synthetic example, namely that the loading vectors of


Sparse PCA tend to be scattered and produce irregular patterns, whereas SPCA-TV yields structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCA-TV loading vectors are not redundant and reveal different patterns.

Indeed, the loading vectors obtained by SPCA-TV are of great interest because they reveal insightful patterns of variability in the data: the second loading is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also vision-processing areas such as the occipital fusiform gyrus and the parietal operculum cortex. The third loading reveals important weights in the middle temporal gyrus, the parietal operculum cortex and the frontal pole. The first loading vector encompasses all features of the brain. One might see this first component as a global variability affecting the whole brain, such as the overarching effect of age. SPCA-TV selects this dense configuration in spite of the sparsity constraint. It is highly desirable to remove any sort of global effect first, in order to start identifying, in the next components, local patterns that are not impacted by a global variability of this kind. We can identify a widespread set of dysfunctional language-related or vision-related areas that present increasing activity during the time preceding the onset of hallucinations. The regions extracted by SPCA-TV are found to be pertinent according to the existing literature on the topic ([28], [27], [7]).

These results seem to indicate the possible existence of subgroups of patients according to the hallucination modalities involved. An interesting application would be to use the scores of the second component extracted by SPCA-TV in order to distinguish patients with visual hallucinations from those suffering mainly from auditory hallucinations.

The reconstruction error is significantly lower in SPCA-TV than in Sparse PCA (T = 13.9, p = 1.5 · 10^-4), ElasticNet PCA (T = 7.1, p = 2.1 · 10^-3) and GraphNet PCA (T = 4.6, p = 1.0 · 10^-2). Moreover, when assessing the stability of the loading vectors across the folds, we found a significantly higher mean Dice index in SPCA-TV compared to Sparse PCA (p = 4.0 · 10^-3), ElasticNet PCA (p = 4.0 · 10^-3) and GraphNet PCA (p = 2.0 · 10^-3), as presented in Tab. II. Additional details regarding the reconstruction accuracy on both the train and test sets and the Dice index are presented in Figure 5 of the supplementary materials (available in the supplementary files / multimedia tab).

TABLE II: Scores of the fMRI data are averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notations: ***: p <= 10^-3, **: p <= 10^-2.

Methods          Test Data Reconstruction Error   Dice Index
Sparse PCA       151.52                           0.34
ElasticNet PCA   148.27                           0.32
GraphNet PCA     142.81                           0.58
SPCA-TV          141.40                           0.63

In conclusion, SPCA-TV significantly outperforms Sparse PCA, ElasticNet PCA and GraphNet PCA in terms of the reconstruction error on independent test data, and in the sense that its loading vectors are both more clinically interpretable and more stable.

We also evaluated the convergence speed of Sparse PCA, Mini-Batch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA, GraphNet PCA and SPCA-TV on this functional MRI data set of n = 83 samples and p = 63,966 features. We compared the execution time required for each algorithm to achieve a given level of precision in Tab. III. Sparse PCA and ElasticNet PCA are similar in terms of convergence time, while Mini-Batch Sparse PCA is much faster but does not converge to high precision. As expected, the structured methods (GraphNet PCA and SPCA-TV) take longer than the other sparse methods because of the inclusion

[Fig. 3 appears here: two panels (Sparse PCA and SPCA-TV), each showing components 1-3 as brain maps with signed color scales.]

Fig. 3: Loading vectors recovered from the 83 activation maps using Sparse PCA and SPCA-TV.


TABLE III: Comparison of the execution time required for each sparse method to reach the same precision. Times reported in seconds.

Time to reach a given precision (in seconds):

Methods                 10      1        10^-1    10^-2     10^-3
Mini-batch Sparse PCA   5.32    -        -        -         -
Sparse PCA              15.80   23.12    34.43    38.68     45.01
ElasticNet PCA          12.37   13.81    30.27    39.64     40.63
GraphNet PCA            30.19   52.16    81.31    88.14     88.84
SPCA-TV                 42.77   295.86   809.30   1381.34   1445.99

of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but the convergence time is still reasonable for an fMRI data set with 65,000 voxels.

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients with a diagnosis of mild cognitive impairment (MCI) from the ADNI database who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13] and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, we could not apply the SSPCA method to this data set due to an intrinsic limitation of the method: SSPCA's application is restricted to N-dimensional array images. It does not support meshes of cortical surfaces such as the data set used here.

The loading vectors obtained from the data set with Sparse PCA and SPCA-TV are presented in Fig. 4. As expected, Sparse PCA loadings are not easily interpretable, because the patterns are irregular and dispersed throughout the brain surface. In contrast, SPCA-TV reveals structured and smooth clusters in relevant regions.

The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between patients resulting from a global cortical atrophy, as often observed in AD patients. The second loading vector includes variability in the entorhinal cortex, the hippocampus and temporal regions. Last, the third loading vector might be related to the atrophy of the frontal lobe, and also captures variability in the precuneus. Thus, SPCA-TV provides smooth maps that closely match the well-known brain regions involved in Alzheimer's disease [22].

Indeed, it is well documented that cortical atrophy progresses over three main stages in Alzheimer's disease ([10], [15]). The cortical structures are sequentially affected because of the accumulation of amyloid plaques. Cortical atrophy is first observed, in the mild stage of the disease, in regions surrounding the hippocampus ([26], [44], [47]) and the entorhinal cortex ([12]), as seen in the second component. This is consistent with early memory deficits. The disease then progresses to a moderate stage, where atrophy gradually extends to the prefrontal association cortex, as revealed in the third component ([37]). In the severe stage of the disease, the whole cortex is affected by atrophy ([15]), as revealed in the first component. In order to assess the clinical significance of these weight maps, we tested the correlation between the scores corresponding to the three components and performance on a clinical test, the ADAS. The Alzheimer's Disease Assessment Scale-Cognitive subscale is the most widely used general cognitive measure in AD. ADAS is scored in terms of errors, so a high score indicates poor performance. We obtained significant correlations between ADAS test performance and component scores (Fig. 5): r = -0.34, p = 4.2 · 10^-11 for the first component; r = -0.26, p = 3.6 · 10^-7 for the second component; and r = -0.35, p = 4.0 · 10^-12 for the third component. The same behavior is observable for all three components: the ADAS score grows with the level to which a patient is affected and with the severity of the atrophy they present (in the temporal pole, the prefrontal region, and also globally). Conversely, control subjects score low on the ADAS metric and present low levels of cortical atrophy. Therefore, SPCA-TV provides us with clear biomarkers that are perfectly relevant to the scope of Alzheimer's disease progression.
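The clinical-association test above amounts to a Pearson correlation between each component's subject scores and ADAS performance. A minimal sketch follows; the toy data and variable names are illustrative assumptions, not the study's values.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two score vectors."""
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

# toy data: higher component score paired with fewer ADAS errors
component_scores = np.array([2.0, 1.0, 0.0, -1.0, -2.0])
adas_errors      = np.array([10.0, 14.0, 15.0, 18.0, 22.0])  # high = poor
print(pearson_r(component_scores, adas_errors))  # negative, as in the study
```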

The reconstruction error is significantly lower in SPCA-TVthan in Sparse PCA (T = 127 p = 21 middot 10minus4) ElasticNetPCA (T = 68 p = 23 middot10minus3) and GraphNet PCA (T = 283p = 47middot10minus2) The results are presented in Tab IV Moreoverwhen assessing the stability of the loading vectors across thefolds the mean Dice index is significantly higher in SPCA-TV than in other methods Additional details regarding thereconstruction accuracy on both the train and test sets and theDice index is presented in Figure 7 of supplementary materials(available in the supplementary files multimedia tab)

IV CONCLUSION

We proposed an extension of Sparse PCA that takes intoaccount the spatial structure of the data The optimization

0278-0062 (c) 2017 IEEE Personal use is permitted but republicationredistribution requires IEEE permission See httpwwwieeeorgpublications_standardspublicationsrightsindexhtml for more information

This article has been accepted for publication in a future issue of this journal but has not been fully edited Content may change prior to final publication Citation information DOI 101109TMI20172749140 IEEETransactions on Medical Imaging

IEEE TRANSACTIONS ON MEDICAL IMAGING 11

TABLE IV Scores are averaged across the 5 folds We testedwhether the averaged scores obtained with existing PCA meth-ods are significantly lower from scores obtained with SPCA-TV Significance notations p le 10minus3 p le 10minus2 p le 10minus1

Scores

Methods Test Data Reconstruction Error Dice Index

Sparse PCA 29918 044ElasticNet PCA 28326 043GraphNet PCA 28136 062SPCA-TV 27950 065

scheme is able to minimize any combination of the `1 `2and TV penalties while preserving the exact `1 penalty Weobserve that SPCA-TV in contrast to other existing sparsePCA methods yields clinically interpretable results and revealsmajor sources of variability in data by highlighting structuredclusters of interest in the loading vectors Furthermore SPCA-TV rsquos loading vectors were more stable across the learningsamples compared to other methods SPCA-TV was validatedand its applicability was demonstrated on three distinct datasets we may reach the conclusion that SPCA-TV can be usedon any kind of structured configurations and is able to presentstructure within the data

Sparse PCA

SPCA - TV+025

0

+012

-012

+009

-009

com

ponent

1co

mp

onent

2co

mp

onent

3

+45

-45

+19

-19

+36

-36

com

ponent

1co

mp

onent

2co

mp

onent

3

Fig 4 Loading vectors recovered from the 133 MCI patientsusing Sparse PCA and SPCA-TV

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python librarybull Url httpsgithubcomneurospinpylearn-parsimonybull Description ParsimonY is Python library for structured

and sparse machine learning ParsimonY is open-source(BSD License) and compliant with scikit-learn API

b) Data sets and scriptsbull Url ftpftpceafrpubunatibrainomicspaperspcatvbull Description This url provides the simulation data set

and the Python script used to create Fig2 for the paper

REFERENCES

[1] A Abraham E Dohmatob B Thirion D Samaras and G VaroquauxExtracting brain regions from rest fmri with total-variation constraineddictionary learning In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2013 - 16th International ConferenceNagoya Japan 2013 Proceedings Part II pages 607ndash615 2013

[2] Bruce B Fischl M Sereno and A Dale Cortical surface-basedanalysis Ii Inflation flattening and a surface-based coordinate systemNeuroImage 9(2)195 ndash 207 1999

[3] A Bakkour JC Morris and BC Dickerson The cortical signature ofprodromal ad regional thinning predicts mild ad dementia Neurology721048ndash1055 2009

[4] A Beck and M Teboulle A Fast Iterative Shrinkage-ThresholdingAlgorithm for Linear Inverse Problems SIAM Journal on ImagingSciences 2(1)183ndash202 2009

Fig 5 Correlation of components scores with ADAS testperformance

0278-0062 (c) 2017 IEEE Personal use is permitted but republicationredistribution requires IEEE permission See httpwwwieeeorgpublications_standardspublicationsrightsindexhtml for more information

This article has been accepted for publication in a future issue of this journal but has not been fully edited Content may change prior to final publication Citation information DOI 101109TMI20172749140 IEEETransactions on Medical Imaging

IEEE TRANSACTIONS ON MEDICAL IMAGING 12

[5] A Beck and M Teboulle Fast gradient-based algorithms for con-strained total variation image denoising and deblurring problems IEEETrans Image Processing 18(11)2419ndash2434 2009

[6] A Beck and M Teboulle Smoothing and first order methods A unifiedframework SIAM Journal on Optimization 22(2)557ndash580 2012

[7] L Bentaleb M Beauregard P Liddle and E Stip Cerebral activityassociated with auditory verbal hallucinations a functional magneticresonance imaging case study Journal of psychiatry amp neuroscienceJPN 27(2)110 2002

[8] JM Borwein and AS Lewis Convex Analysis and Nonlinear Opti-mization Theory and Examples CMS Books in Mathematics Springer2006

[9] S Boyd and L Vandenberghe Convex Optimization CambridgeUniversity Press New York NY USA 2004

[10] H Braak and E Braak Neuropathological stageing of alzheimer-relatedchanges Acta Neuropathologica 82(4)239ndash259 1991

[11] K Brodmann Vergleichende lokalisationslehre der grosshirnrinde inihren prinzipien dargestellt auf grund des zellenbaues 1909

[12] VA Cardenas LL Chao C Studholme K Yaffe BL Miller C MadisonST Buckley D Mungas N Schuff and MW Weiner Brain atrophyassociated with baseline and longitudinal measures of cognition Neu-robiology of aging 32(4)572ndash580 2011

[13] Anders Dale Bruce Fischl and Martin I Sereno Cortical surface-based analysis I segmentation and surface reconstruction NeuroImage9(2)179 ndash 194 1999

[14] A drsquoAspremont L El Ghaoui M Jordan and G Lanckriet A DirectFormulation for Sparse PCA Using Semidefinite Programming SIAMReview 49(3)434ndash448 2007

[15] A Delacourte JP David N Sergeant L Buee A Wattez P VermerschF Ghozali C Fallet-Bianco F Pasquier F Lebert et al The biochemicalpathway of neurofibrillary degeneration in aging and alzheimers diseaseNeurology 52(6)1158ndash1158 1999

[16] L Dice Measures of the amount of ecologic association betweenspecies Ecology 26297ndash302 1945

[17] BC Dickerson E Feczko JC Augustinack J Pacheco JC Morrisand B Fischl Differential effects of aging and alzheimerrsquos disease onmedial temporal lobe cortical thickness and surface area Neurobiologyof aging 30432ndash440 2009

[18] E Dohmatob M Eickenberg B Thirion and G Varoquaux Speeding-up model-selection in GraphNet via early-stopping and univariatefeature-screening June 2015

[19] M Dubois F Hadj-Selem T Lofstedt M Perrot C Fischer V Frouinand E Duchesnay Predictive support recovery with TV-Elastic Netpenalties and logistic regression an application to structural MRI InProceedings of the fourth International Workshop on Pattern Recogni-tion in Neuroimaging (PRNI 2014) 2014

[20] H Eavani TD Satterthwaite RE Filipovych RC Gur and C Da-vatzikos Identifying sparse connectivity patterns in the brain usingresting-state fmri Neuroimage 105286ndash299 2015

[21] D Felleman and D Van Essen Distributed hierarchical processing inthe primate cerebral cortex Cerebral cortex (New York NY 1991)1(1)1ndash47 1991

[22] GB Frisoni NC Fox CR Jack P Scheltens and PM Thompson Theclinical use of structural MRI in Alzheimer disease Nat Rev Neurol6(2)67ndash77 2010

[23] L Grosenick B Klingenberg K Katovich B Knutson and J TaylorInterpretable whole-brain prediction analysis with GraphNet NeuroIm-age 72304ndash321 May 2013

[24] R Guo M Ahn H Zhu and the Alzheimers Disease Neuroimag-ing Initiative Spatially weighted principal component analysis for imag-ing classification Journal of Computational and Graphical Statistics24274ndash296 2015

[25] F Hadj-Selem T Lofstedt V Frouin V Guillemot and E DuchesnayAn Iterative Smoothing Algorithm for Regression with StructuredSparsity arXiv160509658 [stat] 2016 arXiv 160509658

[26] CR Jack MM Shiung JL Gunter PC Obrien SD Weigand David SKnopman Bradley F Boeve Robert J Ivnik Glenn E Smith RH Chaet al Comparison of different mri brain atrophy rate measures withclinical disease progression in ad Neurology 62(4)591ndash600 2004

[27] R Jardri A Pouchet D Pins and P Thomas Cortical activationsduring auditory verbal hallucinations in schizophrenia a coordinate-based meta-analysis American Journal of Psychiatry 168(1)73ndash812011

[28] R Jardri P Thomas C Delmaire P Delion and D Pins The neuro-dynamic organization of modality-dependent hallucinations CerebralCortex pages 1108ndash1117 2013

[29] R Jenatton G Obozinski and F Bach Structured sparse principalcomponent analysis In International Conference on Artificial Intelli-gence and Statistics (AISTATS) 2010

[30] I Jolliffe N Trendafilov and M Uddin A Modified PrincipalComponent Technique Based on the LASSO Journal of Computationaland Graphical Statistics 12(3)531ndash547 2003

[31] M Journe Y Nesterov P Richtrik and R Sepulchre GeneralizedPower Method for Sparse Principal Component Analysis J MachLearn Res 11517ndash553 2010

[32] B Kandel D Wolk J Gee and B Avants Predicting Cognitive Datafrom Medical Images Using Sparse Linear Regression Informationprocessing in medical imaging proceedings of the conference2386ndash97 2013

[33] M Li Y Liu F Chen and D Hu Including signal intensity increasesthe performance of blind source separation on brain imaging data IEEEtransactions on medical imaging 34(2)551ndash563 2015

[34] Lester W Mackey Deflation Methods for Sparse PCA In D KollerD Schuurmans Y Bengio and L Bottou editors Advances in NeuralInformation Processing Systems 21 pages 1017ndash1024 Curran Asso-ciates Inc 2009

[35] J Mairal Sparse coding for machine learning image processing andcomputer vision PhD thesis Ecole normale superieure Cachan 2010

[36] J Mairal F Bach Jean J Ponce and G Sapiro Online Learning forMatrix Factorization and Sparse Coding J Mach Learn Res 1119ndash60 2010

[37] CR McDonald L McEvoy L Gharapetian C Fennema-Notestine DJHagler D Holland A Koyama JB Brewer AM Dale AlzheimersDisease Neuroimaging Initiative et al Regional rates of neocorticalatrophy from normal aging to early alzheimer disease Neurology73(6)457ndash465 2009

[38] H Mohr U Wolfensteller S Frimmel and H Ruge Sparse regu-larization techniques provide novel insights into outcome integrationprocesses NeuroImage 104163ndash176 January 2015

[39] Y Nesterov Smooth minimization of non-smooth functions Mathe-matical Programming 103(1)127ndash152 2005

[40] B Ng A Vahdat G Hamarneh and R Abugharbieh GeneralizedSparse Classifiers for Decoding Cognitive States in fMRI In Springer-Link pages 108ndash115 Beijing China September 2012 Springer BerlinHeidelberg DOI 101007978-3-642-15948-0 14

[41] R Nieuwenhuys The myeloarchitectonic studies on the human cerebralcortex of the vogtndashvogt school and their significance for the interpre-tation of functional neuroimaging data Brain Structure and Function218(2)303ndash352 2013

[42] F Pedregosa G Varoquaux A Gramfort V Michel B ThirionO Grisel M Blondel P Prettenhofer R Weiss V Dubourg J Van-derplas A Passos D Cournapeau M Brucher M Perrot and EDuchesnay Scikit-learn Machine learning in Python Journal ofMachine Learning Research 122825ndash2830 2011

[43] M Ramezani K Marble HTrang and P Abolmaesumi IS JohnsrudeJoint sparse representation of brain activity patterns in multi-task fmridata IEEE transactions on medical imaging 34(1)2ndash12 2015

[44] B Ridha V Anderson J Barnes R Boyes S Price M RossorJ Whitwell L Jenkins R Black Michae M Grundman et al

0278-0062 (c) 2017 IEEE Personal use is permitted but republicationredistribution requires IEEE permission See httpwwwieeeorgpublications_standardspublicationsrightsindexhtml for more information

This article has been accepted for publication in a future issue of this journal but has not been fully edited Content may change prior to final publication Citation information DOI 101109TMI20172749140 IEEETransactions on Medical Imaging

IEEE TRANSACTIONS ON MEDICAL IMAGING 13

Volumetric mri and cognitive measures in alzheimer disease Journalof neurology 255(4)567ndash574 2008

[45] H Shen H Xu L Wang Y Lei L Yang P Zhang J Qin L ZengZ Zhou Z Yang and D Hu Making group inferences using sparserepresentation of resting-state functional mri data with application tosleep deprivation Human Brain Mapping 38(9)4671ndash4689 2017

[46] JG Sled AP Zijdenbos and AC Evans A nonparametric method forautomatic correction of intensity nonuniformity in mri data IEEE TransMed Imaging 1787ndash97 1998

[47] P Thompson K Hayashi G de Zubicaray A Janke S Rose J Sem-ple M Hong D Herman D Gravano D Doddrell and A TogaMapping hippocampal and ventricular change in alzheimer diseaseNeuroImage 22(4)1754 ndash 1766 2004

[48] W-T Wang and H-C Huang Regularized Principal ComponentAnalysis for Spatial Data ArXiv e-prints 2015

[49] D M Witten R Tibshirani and T Hastie A penalized matrixdecomposition with applications to sparse principal components andcanonical correlation analysis Biostatistics 10(3)515ndash534 2009

[50] XChen L Qihang K Seyoung J Carbonell and E Xing Smoothingproximal gradient method for general structured sparse regression TheAnnals of Applied Statistics 6(2)719ndash752 2012

[51] H Zou T Hastie and R Tibshirani Sparse Principal Component Anal-ysis Journal of Computational and Graphical Statistics 15(2)265ndash286 2006

Page 10: Structured Sparse Principal Components Analysis With the ... · Structured Sparse Principal Components Analysis With the TV-Elastic Net Penalty. IEEE Transac-tions on Medical Imaging,

0278-0062 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2017.2749140, IEEE Transactions on Medical Imaging.

IEEE TRANSACTIONS ON MEDICAL IMAGING

Sparse PCA loadings tend to be scattered and produce irregular patterns. However, SPCA-TV seems to yield structured and smooth sources of variability, which can be interpreted clinically. Furthermore, the SPCA-TV loading vectors are not redundant and revealed different patterns.

Indeed, the loading vectors obtained by SPCA-TV are of great interest because they revealed insightful patterns of variability in the data: the second loading is composed of interesting areas such as the precuneus cortex and the cingulate gyrus, but also vision-processing areas such as the occipital fusiform gyrus and the parietal operculum cortex regions. The third loading reveals important weights in the middle temporal gyrus, the parietal operculum cortex and the frontal pole. The first loading vector encompasses all features of the brain. One might see this first component as a global variability affecting the whole brain, such as the overarching effect of age. SPCA-TV selects this dense configuration in spite of the sparsity constraint. It is highly desirable to remove any sort of global effect first, in order to start identifying, in the next components, local patterns that are not impacted by a global variability of this kind. We can identify a widespread set of dysfunctional language-related or vision-related areas that present increasing activity during the time preceding the onset of hallucinations. The regions extracted by SPCA-TV are found to be pertinent according to the existing literature on the topic ([28], [27], [7]).

These results seem to indicate the possible existence of subgroups of patients according to the hallucination modalities involved. An interesting application would be to use the score of the second component extracted by SPCA-TV in order to distinguish patients with visual hallucinations from those suffering mainly from auditory hallucinations.

The reconstruction error is significantly lower with SPCA-TV than with Sparse PCA (T = 13.9, p = 1.5 · 10⁻⁴), ElasticNet PCA (T = 7.1, p = 2.1 · 10⁻³) and GraphNet PCA (T = 4.6, p = 1.0 · 10⁻²). Moreover, when assessing the stability of the loading vectors across the folds, we found a significantly higher mean Dice index with SPCA-TV compared to Sparse PCA (p = 4.0 · 10⁻³), ElasticNet PCA (p = 4.0 · 10⁻³) and GraphNet PCA (p = 2.0 · 10⁻³), as presented in Tab. II. Additional details regarding the reconstruction accuracy on both the train and test sets, and the Dice index, are presented in Figure 5 of the supplementary materials (available in the supplementary files / multimedia tab).
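The fold-level comparisons above rest on paired tests over the cross-validation folds. A minimal sketch of such a test, using made-up per-fold reconstruction errors (the real per-fold numbers are in the supplementary materials):

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-fold test reconstruction errors over the 5 CV folds
err_sparse_pca = np.array([153.1, 150.8, 152.0, 151.2, 150.5])
err_spca_tv = np.array([142.0, 140.9, 141.6, 141.3, 141.2])

# Paired test: the same folds are scored by both methods
t, p = ttest_rel(err_sparse_pca, err_spca_tv)
print(f"T = {t:.1f}, p = {p:.2e}")
```

With the paired design, the test statistic is computed on the per-fold differences, which removes the fold-to-fold variability shared by both methods.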

TABLE II: Scores on the fMRI data, averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notations: ***: p ≤ 10⁻³, **: p ≤ 10⁻².

Methods          Test Data Reconstruction Error   Dice Index
Sparse PCA       151.52                           0.34
ElasticNet PCA   148.27                           0.32
GraphNet PCA     142.81                           0.58
SPCA-TV          141.40                           0.63

In conclusion, SPCA-TV significantly outperforms Sparse, ElasticNet and GraphNet PCA in terms of the reconstruction error on independent test data, and in the sense that its loading vectors are both more clinically interpretable and more stable.
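The stability scores above use the Dice index [16], which measures the overlap between the supports of two loading vectors estimated on different folds. A minimal sketch (the threshold defining "nonzero" is an illustrative assumption):

```python
import numpy as np

def dice_index(u, v, thresh=1e-6):
    """Dice overlap between the nonzero supports of two loading vectors."""
    a = np.abs(u) > thresh
    b = np.abs(v) > thresh
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Loadings of one component estimated on two different CV folds (toy values)
u = np.array([0.0, 0.5, 0.3, 0.0, 0.2])
v = np.array([0.0, 0.4, 0.0, 0.1, 0.3])
print(dice_index(u, v))  # 2 * 2 / (3 + 3) ≈ 0.667
```

A Dice index of 1 means the two folds selected exactly the same features; values near 0 indicate unstable, non-overlapping supports.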

We also evaluated the convergence speed of Sparse PCA, Mini-Batch Sparse PCA (a variant of Sparse PCA that is faster but less accurate), ElasticNet PCA, GraphNet PCA and SPCA-TV for this functional MRI data set of n = 83 samples and p = 63,966 features. We compared the time of execution required for each algorithm to achieve a given level of precision in Tab. III. Sparse PCA and ElasticNet PCA are similar in terms of convergence time, while Mini-Batch Sparse PCA is much faster but does not converge to high precision. As expected, the structured methods (GraphNet PCA and SPCA-TV) take longer than the other sparse methods because of the inclusion of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but the convergence time is still reasonable for an fMRI data set with 65,000 voxels.

Fig. 3: Loading vectors recovered from the 83 activation maps using Sparse PCA and SPCA-TV.

TABLE III: Comparison of the execution time required for each sparse method to reach the same precision. Times reported in seconds.

                         Time to reach a given precision (seconds)
Methods                  10       1        10⁻¹     10⁻²     10⁻³
Mini-Batch Sparse PCA    5.32     -        -        -        -
Sparse PCA               15.80    23.12    34.43    38.68    45.01
ElasticNet PCA           12.37    13.81    30.27    39.64    40.63
GraphNet PCA             30.19    52.16    81.31    88.14    88.84
SPCA-TV                  42.77    295.86   809.30   1381.34  1445.99
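A rough wall-clock comparison in the spirit of Tab. III can be sketched with scikit-learn's Sparse PCA variants. This is only a sketch: the paper's structured methods come from the ParsimonY library instead, the data here is a small random stand-in, and we time a full fit rather than time-to-precision:

```python
import time
import numpy as np
from sklearn.decomposition import SparsePCA, MiniBatchSparsePCA

rng = np.random.RandomState(0)
X = rng.randn(83, 500)  # stand-in for the n = 83 maps (far fewer than 63,966 features)

models = [
    ("Sparse PCA", SparsePCA(n_components=3, alpha=1.0, random_state=0)),
    ("Mini-Batch Sparse PCA", MiniBatchSparsePCA(n_components=3, alpha=1.0, random_state=0)),
]
for name, model in models:
    t0 = time.perf_counter()
    model.fit(X)  # learns 3 sparse components
    print(f"{name}: {time.perf_counter() - t0:.2f} s")
```

The mini-batch variant trades accuracy for speed by updating the dictionary on random subsets of samples, which is consistent with its behavior in Tab. III (fast, but not converging to high precision).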

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients with a diagnosis of mild cognitive impairment (MCI) from the ADNI database who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy, and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13] and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, we could not apply the SSPCA method to this data set due to some intrinsic limitations of the method: SSPCA's application is restricted to N-dimensional array images. It does not support meshes of cortical surfaces such as the data set used here.

The loading vectors obtained from the data set with Sparse PCA and SPCA-TV are presented in Fig. 4. As expected, the Sparse PCA loadings are not easily interpretable, because the patterns are irregular and dispersed throughout the brain surface. In contrast, SPCA-TV reveals structured and smooth clusters in relevant regions.

The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between patients resulting from a global cortical atrophy, as often observed in AD patients. The second loading vector includes variability in the entorhinal cortex, hippocampus and temporal regions. Last, the third loading vector might be related to the atrophy of the frontal lobe, and also captures variability in the precuneus. Thus, SPCA-TV provides a smooth map that closely matches the well-known brain regions involved in Alzheimer's disease [22].

Indeed, it is well-documented that cortical atrophy progresses over three main stages in Alzheimer's disease ([10], [15]). The cortical structures are sequentially affected because of the accumulation of amyloid plaques. Cortical atrophy is first observed in the mild stage of the disease in regions surrounding the hippocampus ([26], [44], [47]) and the entorhinal cortex ([12]), as seen in the second component. This is consistent with early memory deficits. Then, the disease progresses to a moderate stage, where atrophy gradually extends to the prefrontal association cortex, as revealed in the third component ([37]). In the severe stage of the disease, the whole cortex is affected by atrophy ([15]), as revealed in the first component. In order to assess the clinical significance of these weight maps, we tested the correlation between the scores corresponding to the three components and performance on a clinical test, ADAS. The Alzheimer's Disease Assessment Scale-Cognitive subscale is the most widely used general cognitive measure in AD. ADAS is scored in terms of errors, so a high score indicates poor performance. We obtained significant correlations between ADAS test performance and the components' scores in Fig. 5 (r = −0.34, p = 4.2 · 10⁻¹¹ for the first component, r = −0.26, p = 3.6 · 10⁻⁷ for the second component, and r = −0.35, p = 4.0 · 10⁻¹² for the third component). The same behavior is observable for all three components. The ADAS score grows proportionately to the level to which a patient is affected and to the severity of the atrophy they present (in the temporal pole, prefrontal region, and also globally). Conversely, control subjects score low on the ADAS metric and present low levels of cortical atrophy. Therefore, SPCA-TV provides us with clear biomarkers that are perfectly relevant to the scope of Alzheimer's disease progression.
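The component-score vs. ADAS correlations above are plain Pearson correlations between each subject's projection onto a component and their clinical score. A minimal sketch on simulated stand-in data (all variables here are hypothetical; the negative sign is built into the simulation to mimic the reported direction of the effect):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.RandomState(42)
n = 133  # number of MCI patients in the study

# Hypothetical latent atrophy level per subject
atrophy = rng.randn(n)
# Component score decreases with atrophy; ADAS error count increases with it
component_score = -0.6 * atrophy + 0.4 * rng.randn(n)
adas = 20.0 + 5.0 * atrophy + rng.randn(n)

r, p = pearsonr(component_score, adas)
print(f"r = {r:.2f}, p = {p:.1e}")  # a clearly negative correlation
```

With n = 133 subjects, even a moderate correlation of this kind yields very small p-values, which is consistent with the magnitudes reported above.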

The reconstruction error is significantly lower with SPCA-TV than with Sparse PCA (T = 12.7, p = 2.1 · 10⁻⁴), ElasticNet PCA (T = 6.8, p = 2.3 · 10⁻³) and GraphNet PCA (T = 2.83, p = 4.7 · 10⁻²). The results are presented in Tab. IV. Moreover, when assessing the stability of the loading vectors across the folds, the mean Dice index is significantly higher with SPCA-TV than with the other methods. Additional details regarding the reconstruction accuracy on both the train and test sets, and the Dice index, are presented in Figure 7 of the supplementary materials (available in the supplementary files / multimedia tab).

TABLE IV: Scores are averaged across the 5 folds. We tested whether the averaged scores obtained with existing PCA methods are significantly different from the scores obtained with SPCA-TV. Significance notations: ***: p ≤ 10⁻³, **: p ≤ 10⁻², *: p ≤ 10⁻¹.

Methods          Test Data Reconstruction Error   Dice Index
Sparse PCA       299.18                           0.44
ElasticNet PCA   283.26                           0.43
GraphNet PCA     281.36                           0.62
SPCA-TV          279.50                           0.65

IV. CONCLUSION

We proposed an extension of Sparse PCA that takes into account the spatial structure of the data. The optimization scheme is able to minimize any combination of the ℓ1, ℓ2 and TV penalties while preserving the exact ℓ1 penalty. We observe that SPCA-TV, in contrast to other existing sparse PCA methods, yields clinically interpretable results and reveals major sources of variability in data by highlighting structured clusters of interest in the loading vectors. Furthermore, SPCA-TV's loading vectors were more stable across the learning samples compared to the other methods. SPCA-TV was validated, and its applicability demonstrated, on three distinct data sets: we may reach the conclusion that SPCA-TV can be used on any kind of structured configuration and is able to reveal structure within the data.
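The combined penalty that the optimization scheme minimizes can be written down directly. Below is a minimal sketch for a loading vector laid out on a 2D pixel grid; the penalty weights and the forward-difference discretization of TV are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def penalty(w, shape, l1=0.1, l2=0.1, tv=0.1):
    """l1 * ||w||_1 + (l2 / 2) * ||w||_2^2 + tv * TV(w).

    TV(w) sums, over pixels, the Euclidean norm of the forward-difference
    spatial gradient (isotropic TV), with replicated borders.
    """
    img = w.reshape(shape)
    gx = np.diff(img, axis=0, append=img[-1:, :])  # vertical differences
    gy = np.diff(img, axis=1, append=img[:, -1:])  # horizontal differences
    tv_term = np.sqrt(gx ** 2 + gy ** 2).sum()
    return l1 * np.abs(w).sum() + 0.5 * l2 * (w ** 2).sum() + tv * tv_term

# Same sparsity, different structure: a compact square "cluster" incurs a much
# smaller TV term than the same number of scattered pixels
compact = np.zeros((8, 8)); compact[2:5, 2:5] = 1.0
scattered = np.zeros((8, 8)); scattered[::3, ::3] = 1.0
print(penalty(compact.ravel(), (8, 8)), penalty(scattered.ravel(), (8, 8)))
```

Both loadings have identical ℓ1 and ℓ2 terms (nine unit weights), so the difference comes entirely from TV, which is exactly the mechanism by which SPCA-TV favors the structured, clustered loadings discussed above.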

Fig. 4: Loading vectors recovered from the 133 MCI patients using Sparse PCA and SPCA-TV.

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python library:
• Url: https://github.com/neurospin/pylearn-parsimony
• Description: ParsimonY is a Python library for structured and sparse machine learning. ParsimonY is open-source (BSD License) and compliant with the scikit-learn API.

b) Data sets and scripts:
• Url: ftp://ftp.cea.fr/pub/unati/brainomics/papers/pcatv
• Description: This url provides the simulation data set and the Python script used to create Fig. 2 of the paper.

REFERENCES

[1] A. Abraham, E. Dohmatob, B. Thirion, D. Samaras, and G. Varoquaux. Extracting brain regions from rest fMRI with total-variation constrained dictionary learning. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013 – 16th International Conference, Nagoya, Japan, 2013, Proceedings, Part II, pages 607–615, 2013.
[2] B. Fischl, M. Sereno, and A. Dale. Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195–207, 1999.
[3] A. Bakkour, J.C. Morris, and B.C. Dickerson. The cortical signature of prodromal AD: regional thinning predicts mild AD dementia. Neurology, 72:1048–1055, 2009.
[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Fig. 5: Correlation of component scores with ADAS test performance.


[5] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Processing, 18(11):2419–2434, 2009.
[6] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012.
[7] L. Bentaleb, M. Beauregard, P. Liddle, and E. Stip. Cerebral activity associated with auditory verbal hallucinations: a functional magnetic resonance imaging case study. Journal of Psychiatry & Neuroscience: JPN, 27(2):110, 2002.
[8] J.M. Borwein and A.S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2006.
[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
[10] H. Braak and E. Braak. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathologica, 82(4):239–259, 1991.
[11] K. Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. 1909.
[12] V.A. Cardenas, L.L. Chao, C. Studholme, K. Yaffe, B.L. Miller, C. Madison, S.T. Buckley, D. Mungas, N. Schuff, and M.W. Weiner. Brain atrophy associated with baseline and longitudinal measures of cognition. Neurobiology of Aging, 32(4):572–580, 2011.
[13] A. Dale, B. Fischl, and M.I. Sereno. Cortical surface-based analysis I: Segmentation and surface reconstruction. NeuroImage, 9(2):179–194, 1999.
[14] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
[15] A. Delacourte, J.P. David, N. Sergeant, L. Buee, A. Wattez, P. Vermersch, F. Ghozali, C. Fallet-Bianco, F. Pasquier, F. Lebert, et al. The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer's disease. Neurology, 52(6):1158–1158, 1999.
[16] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297–302, 1945.
[17] B.C. Dickerson, E. Feczko, J.C. Augustinack, J. Pacheco, J.C. Morris, and B. Fischl. Differential effects of aging and Alzheimer's disease on medial temporal lobe cortical thickness and surface area. Neurobiology of Aging, 30:432–440, 2009.
[18] E. Dohmatob, M. Eickenberg, B. Thirion, and G. Varoquaux. Speeding-up model-selection in GraphNet via early-stopping and univariate feature-screening. June 2015.
[19] M. Dubois, F. Hadj-Selem, T. Lofstedt, M. Perrot, C. Fischer, V. Frouin, and E. Duchesnay. Predictive support recovery with TV-Elastic Net penalties and logistic regression: an application to structural MRI. In Proceedings of the Fourth International Workshop on Pattern Recognition in Neuroimaging (PRNI 2014), 2014.
[20] H. Eavani, T.D. Satterthwaite, R.E. Filipovych, R.C. Gur, and C. Davatzikos. Identifying sparse connectivity patterns in the brain using resting-state fMRI. NeuroImage, 105:286–299, 2015.
[21] D. Felleman and D. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47, 1991.
[22] G.B. Frisoni, N.C. Fox, C.R. Jack, P. Scheltens, and P.M. Thompson. The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol., 6(2):67–77, 2010.
[23] L. Grosenick, B. Klingenberg, K. Katovich, B. Knutson, and J. Taylor. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage, 72:304–321, May 2013.
[24] R. Guo, M. Ahn, H. Zhu, and the Alzheimer's Disease Neuroimaging Initiative. Spatially weighted principal component analysis for imaging classification. Journal of Computational and Graphical Statistics, 24:274–296, 2015.
[25] F. Hadj-Selem, T. Lofstedt, V. Frouin, V. Guillemot, and E. Duchesnay. An iterative smoothing algorithm for regression with structured sparsity. arXiv:1605.09658 [stat], 2016.
[26] C.R. Jack, M.M. Shiung, J.L. Gunter, P.C. O'Brien, S.D. Weigand, D.S. Knopman, B.F. Boeve, R.J. Ivnik, G.E. Smith, R.H. Cha, et al. Comparison of different MRI brain atrophy rate measures with clinical disease progression in AD. Neurology, 62(4):591–600, 2004.
[27] R. Jardri, A. Pouchet, D. Pins, and P. Thomas. Cortical activations during auditory verbal hallucinations in schizophrenia: a coordinate-based meta-analysis. American Journal of Psychiatry, 168(1):73–81, 2011.
[28] R. Jardri, P. Thomas, C. Delmaire, P. Delion, and D. Pins. The neurodynamic organization of modality-dependent hallucinations. Cerebral Cortex, pages 1108–1117, 2013.
[29] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[30] I. Jolliffe, N. Trendafilov, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.
[31] M. Journee, Y. Nesterov, P. Richtarik, and R. Sepulchre. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res., 11:517–553, 2010.
[32] B. Kandel, D. Wolk, J. Gee, and B. Avants. Predicting cognitive data from medical images using sparse linear regression. Information Processing in Medical Imaging: Proceedings of the Conference, 23:86–97, 2013.
[33] M. Li, Y. Liu, F. Chen, and D. Hu. Including signal intensity increases the performance of blind source separation on brain imaging data. IEEE Transactions on Medical Imaging, 34(2):551–563, 2015.
[34] L.W. Mackey. Deflation methods for sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017–1024. Curran Associates, Inc., 2009.
[35] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, Ecole Normale Superieure de Cachan, 2010.
[36] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19–60, 2010.
[37] C.R. McDonald, L. McEvoy, L. Gharapetian, C. Fennema-Notestine, D.J. Hagler, D. Holland, A. Koyama, J.B. Brewer, A.M. Dale, Alzheimer's Disease Neuroimaging Initiative, et al. Regional rates of neocortical atrophy from normal aging to early Alzheimer disease. Neurology, 73(6):457–465, 2009.
[38] H. Mohr, U. Wolfensteller, S. Frimmel, and H. Ruge. Sparse regularization techniques provide novel insights into outcome integration processes. NeuroImage, 104:163–176, January 2015.
[39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[40] B. Ng, A. Vahdat, G. Hamarneh, and R. Abugharbieh. Generalized sparse classifiers for decoding cognitive states in fMRI. Pages 108–115, Beijing, China, September 2012. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-15948-0_14.
[41] R. Nieuwenhuys. The myeloarchitectonic studies on the human cerebral cortex of the Vogt–Vogt school, and their significance for the interpretation of functional neuroimaging data. Brain Structure and Function, 218(2):303–352, 2013.
[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[43] M. Ramezani, K. Marble, H. Trang, I.S. Johnsrude, and P. Abolmaesumi. Joint sparse representation of brain activity patterns in multi-task fMRI data. IEEE Transactions on Medical Imaging, 34(1):2–12, 2015.
[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al. Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567–574, 2008.
[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang, and D. Hu. Making group inferences using sparse representation of resting-state functional MRI data with application to sleep deprivation. Human Brain Mapping, 38(9):4671–4689, 2017.
[46] J.G. Sled, A.P. Zijdenbos, and A.C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17:87–97, 1998.
[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell, and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754–1766, 2004.
[48] W.-T. Wang and H.-C. Huang. Regularized principal component analysis for spatial data. ArXiv e-prints, 2015.
[49] D.M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
[50] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719–752, 2012.
[51] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.

[37] CR McDonald L McEvoy L Gharapetian C Fennema-Notestine DJHagler D Holland A Koyama JB Brewer AM Dale AlzheimersDisease Neuroimaging Initiative et al Regional rates of neocorticalatrophy from normal aging to early alzheimer disease Neurology73(6)457ndash465 2009

[38] H Mohr U Wolfensteller S Frimmel and H Ruge Sparse regu-larization techniques provide novel insights into outcome integrationprocesses NeuroImage 104163ndash176 January 2015

[39] Y Nesterov Smooth minimization of non-smooth functions Mathe-matical Programming 103(1)127ndash152 2005

[40] B Ng A Vahdat G Hamarneh and R Abugharbieh GeneralizedSparse Classifiers for Decoding Cognitive States in fMRI In Springer-Link pages 108ndash115 Beijing China September 2012 Springer BerlinHeidelberg DOI 101007978-3-642-15948-0 14

[41] R Nieuwenhuys The myeloarchitectonic studies on the human cerebralcortex of the vogtndashvogt school and their significance for the interpre-tation of functional neuroimaging data Brain Structure and Function218(2)303ndash352 2013

[42] F Pedregosa G Varoquaux A Gramfort V Michel B ThirionO Grisel M Blondel P Prettenhofer R Weiss V Dubourg J Van-derplas A Passos D Cournapeau M Brucher M Perrot and EDuchesnay Scikit-learn Machine learning in Python Journal ofMachine Learning Research 122825ndash2830 2011

[43] M Ramezani K Marble HTrang and P Abolmaesumi IS JohnsrudeJoint sparse representation of brain activity patterns in multi-task fmridata IEEE transactions on medical imaging 34(1)2ndash12 2015

[44] B Ridha V Anderson J Barnes R Boyes S Price M RossorJ Whitwell L Jenkins R Black Michae M Grundman et al

0278-0062 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2017.2749140, IEEE Transactions on Medical Imaging.



TABLE III: Comparison of the execution time required for each sparse method to reach the same precision. Times reported in seconds.

Time to reach a given precision (in seconds):

Methods                  10      1       10^-1    10^-2    10^-3
Mini-batch Sparse PCA    532     -       -        -        -
Sparse PCA               1580    2312    3443     3868     4501
ElasticNet PCA           1237    1381    3027     3964     4063
GraphNet PCA             3019    5216    8131     8814     8884
SPCA-TV                  4277    29586   80930    138134   144599

of spatial constraints. In particular, SPCA-TV takes much longer than the other methods, but its convergence time is still reasonable for an fMRI data set with 65,000 voxels.
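The off-the-shelf sparse baselines in Table III correspond to scikit-learn estimators [42]. A minimal timing sketch on synthetic data (the data sizes and penalty value here are illustrative, not those used in the paper):

```python
import time

import numpy as np
from sklearn.decomposition import MiniBatchSparsePCA, SparsePCA

rng = np.random.RandomState(0)
X = rng.randn(30, 100)  # synthetic data matrix: 30 samples, 100 features

estimators = {
    "Sparse PCA": SparsePCA(n_components=3, alpha=1.0, random_state=0),
    "Mini-batch Sparse PCA": MiniBatchSparsePCA(n_components=3, alpha=1.0, random_state=0),
}
for name, est in estimators.items():
    t0 = time.time()
    est.fit(X)  # components_ holds the sparse loading vectors, one per row
    frac_zero = np.mean(est.components_ == 0)
    print(f"{name}: {time.time() - t0:.2f} s, {frac_zero:.0%} zero loadings")
```

Wall-clock comparisons of this kind depend heavily on the convergence tolerance, which is why the table reports time-to-precision rather than time-to-default-stopping.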

C. Surface meshes of cortical thickness in Alzheimer's disease

Finally, SPCA-TV was applied to whole-brain anatomical MRI from the ADNI database, the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu). The MR scans are T1-weighted MR images acquired at 1.5 T according to the ADNI acquisition protocol. We selected 133 patients with a diagnosis of mild cognitive impairment (MCI) from the ADNI database who converted to AD within two years during the follow-up period. We used PCA to reveal patterns of atrophy explaining the variability in this population. This could provide an indication of a possible stratification of the population into more homogeneous subgroups that may be clinically similar, but with different brain patterns. In order to demonstrate the relevance of using SPCA-TV to reveal variability in any kind of imaging data, we worked on meshes of cortical thickness. The 317,379 features are the cortical thickness values at each vertex of the cortical surface. Cortical thickness represents a direct index of atrophy and thus is a potentially powerful candidate to assist in the diagnosis of Alzheimer's disease ([3], [17]). Therefore, we hypothesized that applying SPCA-TV to the ADNI data set would reveal important sources of variability in cortical thickness measurements. Cortical thickness measures were performed with the FreeSurfer image analysis suite (Massachusetts General Hospital, Boston, MA, USA), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). The technical details of this procedure are described in [46], [13] and [2]. All the cortical thickness maps were registered onto the FreeSurfer common template (fsaverage).
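In practice, the input matrix for any of the PCA variants is built by stacking one thickness vector per subject, all registered to the same template so that a given vertex index means the same cortical location for every subject. A sketch with synthetic per-subject arrays standing in for FreeSurfer output (with real data, each vector could be read with, e.g., nibabel.freesurfer.read_morph_data; the sizes below are illustrative, not the paper's 133 x 317,379):

```python
import numpy as np

n_subjects, n_vertices = 10, 1000
rng = np.random.RandomState(0)

# One cortical-thickness map per subject (mm), already resampled to a
# common template such as fsaverage.
thickness_maps = [2.5 + 0.3 * rng.randn(n_vertices) for _ in range(n_subjects)]

X = np.vstack(thickness_maps)   # subjects x vertices data matrix
X -= X.mean(axis=0)             # center each vertex before PCA

print(X.shape)                  # (10, 1000)
```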

We applied all the PCA methods under study to this data set, except SSPCA. Indeed, we could not apply the SSPCA method to this data set due to an intrinsic limitation of the method: SSPCA's application is restricted to N-dimensional array images, and it does not support meshes of cortical surfaces such as the data set used here.

The loading vectors obtained from the data set with Sparse PCA and SPCA-TV are presented in Fig. 4. As expected, the Sparse PCA loadings are not easily interpretable, because the patterns are irregular and dispersed throughout the brain surface. In contrast, SPCA-TV reveals structured and smooth clusters in relevant regions.

The first loading vector, which maps the whole surface of the brain, can be interpreted as the variability between patients resulting from a global cortical atrophy, as often observed in AD patients. The second loading vector includes variability in the entorhinal cortex, the hippocampus, and temporal regions. Last, the third loading vector might be related to the atrophy of the frontal lobe, and it also captures variability in the precuneus. Thus, SPCA-TV provides a smooth map that closely matches the well-known brain regions involved in Alzheimer's disease [22].

Indeed, it is well documented that cortical atrophy progresses over three main stages in Alzheimer disease ([10], [15]). The cortical structures are sequentially affected because of the accumulation of amyloid plaques. Cortical atrophy is first observed in the mild stage of the disease in regions surrounding the hippocampus ([26], [44], [47]) and the entorhinal cortex ([12]), as seen in the second component. This is consistent with early memory deficits. Then, the disease progresses to a moderate stage, where atrophy gradually extends to the prefrontal association cortex, as revealed in the third component ([37]). In the severe stage of the disease, the whole cortex is affected by atrophy ([15]), as revealed in the first component. In order to assess the clinical significance of these weight maps, we tested the correlation between the scores corresponding to the three components and performance on a clinical test, the ADAS. The Alzheimer's Disease Assessment Scale-Cognitive subscale is the most widely used general cognitive measure in AD. ADAS is scored in terms of errors, so a high score indicates poor performance. We obtained significant correlations between ADAS test performance and the components' scores in Fig. 5 (r = -0.34, p = 4.2 x 10^-11 for the first component; r = -0.26, p = 3.6 x 10^-7 for the second component; and r = -0.35, p = 4.0 x 10^-12 for the third component). The same behavior is observable for all three components: the ADAS score grows proportionately to the level to which a patient is affected and to the severity of the atrophy they present (in the temporal pole, in the prefrontal region, and also globally). Conversely, control subjects score low on the ADAS metric and present a low level of cortical atrophy. Therefore, SPCA-TV provides us with clear biomarkers that are perfectly relevant to the scope of Alzheimer's disease progression.
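The reported correlations are between per-subject component scores (the projection of each subject's data onto a loading vector) and ADAS scores. A minimal sketch on synthetic data (the loading vector, effect size, and noise level here are simulated, not the paper's):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.RandomState(0)
n_subjects, n_features = 133, 500

X = rng.randn(n_subjects, n_features)     # centered imaging data
v = rng.randn(n_features)
v /= np.linalg.norm(v)                    # a (dense stand-in for a) loading vector

scores = X @ v                            # one component score per subject
adas = -0.5 * scores + rng.randn(n_subjects)  # simulated clinical scores

r, p = pearsonr(scores, adas)
print(f"r = {r:.2f}, p = {p:.1e}")        # negative r by construction here
```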

The reconstruction error is significantly lower with SPCA-TV than with Sparse PCA (T = 12.7, p = 2.1 x 10^-4), ElasticNet PCA (T = 6.8, p = 2.3 x 10^-3), and GraphNet PCA (T = 2.83, p = 4.7 x 10^-2). The results are presented in Table IV. Moreover, when assessing the stability of the loading vectors across the folds, the mean Dice index is significantly higher with SPCA-TV than with the other methods. Additional details regarding the reconstruction accuracy on both the train and test sets, and the Dice index, are presented in Figure 7 of the supplementary materials (available in the supplementary files / multimedia tab).
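The stability metric is the Dice index [16] computed between the supports of loading vectors estimated on different cross-validation folds. A small self-contained sketch (the support threshold is an illustrative choice):

```python
import numpy as np

def dice(u, v, thresh=1e-8):
    """Dice index between the supports of two loading vectors [16]."""
    a = np.abs(u) > thresh          # support of the first vector
    b = np.abs(v) > thresh          # support of the second vector
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Two loadings sharing 2 of their 3 nonzero features:
u = np.array([0.0, 0.5, 0.3, 0.0, 0.1])
v = np.array([0.0, 0.4, 0.0, 0.2, 0.3])
print(round(dice(u, v), 3))  # 2*2 / (3+3) = 0.667
```

A Dice index of 1 means identical supports across folds; values near 0 mean the selected features barely overlap.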

TABLE IV: Scores are averaged across the 5 folds. We tested whether the averaged scores obtained with the existing PCA methods differ significantly from the scores obtained with SPCA-TV. Significance notation: ***: p <= 10^-3; **: p <= 10^-2; *: p <= 10^-1.

Scores:

Methods           Test data reconstruction error    Dice index
Sparse PCA        29918                             0.44
ElasticNet PCA    28326                             0.43
GraphNet PCA      28136                             0.62
SPCA-TV           27950                             0.65

IV. CONCLUSION

We proposed an extension of Sparse PCA that takes into account the spatial structure of the data. The optimization scheme is able to minimize any combination of the ℓ1, ℓ2, and TV penalties while preserving the exact ℓ1 penalty. We observe that SPCA-TV, in contrast to the other existing sparse PCA methods, yields clinically interpretable results and reveals major sources of variability in the data by highlighting structured clusters of interest in the loading vectors. Furthermore, SPCA-TV's loading vectors were more stable across the learning samples compared to the other methods. SPCA-TV was validated, and its applicability demonstrated, on three distinct data sets; we may reach the conclusion that SPCA-TV can be used on any kind of structured configuration and is able to uncover structure within the data.
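For reference, the per-component problem behind this scheme combines the penalties named above in a form along these lines (a schematic reconstruction from the penalty names in this section, not the paper's exact display):

```latex
\min_{u,\,v}\ \tfrac{1}{2}\,\| X - u v^{\top} \|_F^2
  \;+\; \lambda_1 \|v\|_1
  \;+\; \tfrac{\lambda_2}{2}\, \|v\|_2^2
  \;+\; \lambda_{\mathrm{TV}}\, \mathrm{TV}(v)
\quad \text{subject to } \|u\|_2 \le 1,
```

where TV(v) sums the norms of the spatial gradient of v over neighboring voxels or mesh vertices, which is what encourages the smooth, structured clusters in the loading vectors.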

Fig. 4: Loading vectors recovered from the 133 MCI patients using Sparse PCA and SPCA-TV (surface maps of components 1-3 for each method).

SUPPLEMENTARY MATERIAL

a) The ParsimonY Python library:
- URL: https://github.com/neurospin/pylearn-parsimony
- Description: ParsimonY is a Python library for structured and sparse machine learning. ParsimonY is open-source (BSD License) and compliant with the scikit-learn API.

b) Data sets and scripts:
- URL: ftp://ftp.cea.fr/pub/unati/brainomics/papers/pcatv
- Description: This URL provides the simulation data set and the Python script used to create Fig. 2 of the paper.

Fig. 5: Correlation of the components' scores with ADAS test performance.

REFERENCES

[1] A. Abraham, E. Dohmatob, B. Thirion, D. Samaras, and G. Varoquaux. Extracting brain regions from rest fMRI with total-variation constrained dictionary learning. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2013), Nagoya, Japan, Proceedings, Part II, pages 607-615, 2013.

[2] B. Fischl, M. Sereno, and A. Dale. Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195-207, 1999.

[3] A. Bakkour, J.C. Morris, and B.C. Dickerson. The cortical signature of prodromal AD: regional thinning predicts mild AD dementia. Neurology, 72:1048-1055, 2009.

[4] A. Beck and M. Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

[5] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Processing, 18(11):2419-2434, 2009.

[6] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557-580, 2012.

[7] L. Bentaleb, M. Beauregard, P. Liddle, and E. Stip. Cerebral activity associated with auditory verbal hallucinations: a functional magnetic resonance imaging case study. Journal of Psychiatry & Neuroscience, 27(2):110, 2002.

[8] J.M. Borwein and A.S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2006.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[10] H. Braak and E. Braak. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathologica, 82(4):239-259, 1991.

[11] K. Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. 1909.

[12] V.A. Cardenas, L.L. Chao, C. Studholme, K. Yaffe, B.L. Miller, C. Madison, S.T. Buckley, D. Mungas, N. Schuff, and M.W. Weiner. Brain atrophy associated with baseline and longitudinal measures of cognition. Neurobiology of Aging, 32(4):572-580, 2011.

[13] A. Dale, B. Fischl, and M.I. Sereno. Cortical surface-based analysis I: Segmentation and surface reconstruction. NeuroImage, 9(2):179-194, 1999.

[14] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Review, 49(3):434-448, 2007.

[15] A. Delacourte, J.P. David, N. Sergeant, L. Buee, A. Wattez, P. Vermersch, F. Ghozali, C. Fallet-Bianco, F. Pasquier, F. Lebert, et al. The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer's disease. Neurology, 52(6):1158-1158, 1999.

[16] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297-302, 1945.

[17] B.C. Dickerson, E. Feczko, J.C. Augustinack, J. Pacheco, J.C. Morris, and B. Fischl. Differential effects of aging and Alzheimer's disease on medial temporal lobe cortical thickness and surface area. Neurobiology of Aging, 30:432-440, 2009.

[18] E. Dohmatob, M. Eickenberg, B. Thirion, and G. Varoquaux. Speeding-up model-selection in GraphNet via early-stopping and univariate feature-screening. June 2015.

[19] M. Dubois, F. Hadj-Selem, T. Lofstedt, M. Perrot, C. Fischer, V. Frouin, and E. Duchesnay. Predictive support recovery with TV-Elastic Net penalties and logistic regression: an application to structural MRI. In Proceedings of the Fourth International Workshop on Pattern Recognition in Neuroimaging (PRNI 2014), 2014.

[20] H. Eavani, T.D. Satterthwaite, R.E. Filipovych, R.C. Gur, and C. Davatzikos. Identifying sparse connectivity patterns in the brain using resting-state fMRI. NeuroImage, 105:286-299, 2015.

[21] D. Felleman and D. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1-47, 1991.

[22] G.B. Frisoni, N.C. Fox, C.R. Jack, P. Scheltens, and P.M. Thompson. The clinical use of structural MRI in Alzheimer disease. Nat. Rev. Neurol., 6(2):67-77, 2010.

[23] L. Grosenick, B. Klingenberg, K. Katovich, B. Knutson, and J. Taylor. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage, 72:304-321, May 2013.

[24] R. Guo, M. Ahn, H. Zhu, and the Alzheimer's Disease Neuroimaging Initiative. Spatially weighted principal component analysis for imaging classification. Journal of Computational and Graphical Statistics, 24:274-296, 2015.

[25] F. Hadj-Selem, T. Lofstedt, V. Frouin, V. Guillemot, and E. Duchesnay. An Iterative Smoothing Algorithm for Regression with Structured Sparsity. arXiv:1605.09658 [stat], 2016.

[26] C.R. Jack, M.M. Shiung, J.L. Gunter, P.C. O'Brien, S.D. Weigand, D.S. Knopman, B.F. Boeve, R.J. Ivnik, G.E. Smith, R.H. Cha, et al. Comparison of different MRI brain atrophy rate measures with clinical disease progression in AD. Neurology, 62(4):591-600, 2004.

[27] R. Jardri, A. Pouchet, D. Pins, and P. Thomas. Cortical activations during auditory verbal hallucinations in schizophrenia: a coordinate-based meta-analysis. American Journal of Psychiatry, 168(1):73-81, 2011.

[28] R. Jardri, P. Thomas, C. Delmaire, P. Delion, and D. Pins. The neurodynamic organization of modality-dependent hallucinations. Cerebral Cortex, pages 1108-1117, 2013.

[29] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[30] I. Jolliffe, N. Trendafilov, and M. Uddin. A Modified Principal Component Technique Based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.

[31] M. Journee, Y. Nesterov, P. Richtarik, and R. Sepulchre. Generalized Power Method for Sparse Principal Component Analysis. J. Mach. Learn. Res., 11:517-553, 2010.

[32] B. Kandel, D. Wolk, J. Gee, and B. Avants. Predicting Cognitive Data from Medical Images Using Sparse Linear Regression. Information Processing in Medical Imaging: Proceedings of the Conference, 23:86-97, 2013.

[33] M. Li, Y. Liu, F. Chen, and D. Hu. Including signal intensity increases the performance of blind source separation on brain imaging data. IEEE Transactions on Medical Imaging, 34(2):551-563, 2015.

[34] L.W. Mackey. Deflation Methods for Sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017-1024. Curran Associates, Inc., 2009.

[35] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, Ecole Normale Superieure de Cachan, 2010.

[36] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online Learning for Matrix Factorization and Sparse Coding. J. Mach. Learn. Res., 11:19-60, 2010.

[37] C.R. McDonald, L. McEvoy, L. Gharapetian, C. Fennema-Notestine, D.J. Hagler, D. Holland, A. Koyama, J.B. Brewer, A.M. Dale, Alzheimer's Disease Neuroimaging Initiative, et al. Regional rates of neocortical atrophy from normal aging to early Alzheimer disease. Neurology, 73(6):457-465, 2009.

[38] H. Mohr, U. Wolfensteller, S. Frimmel, and H. Ruge. Sparse regularization techniques provide novel insights into outcome integration processes. NeuroImage, 104:163-176, January 2015.

[39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.

[40] B. Ng, A. Vahdat, G. Hamarneh, and R. Abugharbieh. Generalized Sparse Classifiers for Decoding Cognitive States in fMRI. In SpringerLink, pages 108-115, Beijing, China, September 2012. Springer Berlin Heidelberg. DOI 10.1007/978-3-642-15948-0_14.

[41] R. Nieuwenhuys. The myeloarchitectonic studies on the human cerebral cortex of the Vogt-Vogt school, and their significance for the interpretation of functional neuroimaging data. Brain Structure and Function, 218(2):303-352, 2013.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

[43] M. Ramezani, K. Marble, H. Trang, I.S. Johnsrude, and P. Abolmaesumi. Joint sparse representation of brain activity patterns in multi-task fMRI data. IEEE Transactions on Medical Imaging, 34(1):2-12, 2015.

[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al. Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567-574, 2008.

[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang, and D. Hu. Making group inferences using sparse representation of resting-state functional MRI data with application to sleep deprivation. Human Brain Mapping, 38(9):4671-4689, 2017.

[46] J.G. Sled, A.P. Zijdenbos, and A.C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging, 17:87-97, 1998.

[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell, and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754-1766, 2004.

[48] W.-T. Wang and H.-C. Huang. Regularized Principal Component Analysis for Spatial Data. ArXiv e-prints, 2015.

[49] D.M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, 2009.

[50] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719-752, 2012.

[51] H. Zou, T. Hastie, and R. Tibshirani. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.


[38] H Mohr U Wolfensteller S Frimmel and H Ruge Sparse regu-larization techniques provide novel insights into outcome integrationprocesses NeuroImage 104163ndash176 January 2015

[39] Y Nesterov Smooth minimization of non-smooth functions Mathe-matical Programming 103(1)127ndash152 2005

[40] B Ng A Vahdat G Hamarneh and R Abugharbieh GeneralizedSparse Classifiers for Decoding Cognitive States in fMRI In Springer-Link pages 108ndash115 Beijing China September 2012 Springer BerlinHeidelberg DOI 101007978-3-642-15948-0 14

[41] R Nieuwenhuys The myeloarchitectonic studies on the human cerebralcortex of the vogtndashvogt school and their significance for the interpre-tation of functional neuroimaging data Brain Structure and Function218(2)303ndash352 2013

[42] F Pedregosa G Varoquaux A Gramfort V Michel B ThirionO Grisel M Blondel P Prettenhofer R Weiss V Dubourg J Van-derplas A Passos D Cournapeau M Brucher M Perrot and EDuchesnay Scikit-learn Machine learning in Python Journal ofMachine Learning Research 122825ndash2830 2011

[43] M Ramezani K Marble HTrang and P Abolmaesumi IS JohnsrudeJoint sparse representation of brain activity patterns in multi-task fmridata IEEE transactions on medical imaging 34(1)2ndash12 2015

[44] B Ridha V Anderson J Barnes R Boyes S Price M RossorJ Whitwell L Jenkins R Black Michae M Grundman et al

0278-0062 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2017.2749140, IEEE Transactions on Medical Imaging.


[5] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434, 2009.

[6] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012.

[7] L. Bentaleb, M. Beauregard, P. Liddle, and E. Stip. Cerebral activity associated with auditory verbal hallucinations: a functional magnetic resonance imaging case study. Journal of Psychiatry & Neuroscience, 27(2):110, 2002.

[8] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer, 2006.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[10] H. Braak and E. Braak. Neuropathological stageing of Alzheimer-related changes. Acta Neuropathologica, 82(4):239–259, 1991.

[11] K. Brodmann. Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. 1909.

[12] V. A. Cardenas, L. L. Chao, C. Studholme, K. Yaffe, B. L. Miller, C. Madison, S. T. Buckley, D. Mungas, N. Schuff, and M. W. Weiner. Brain atrophy associated with baseline and longitudinal measures of cognition. Neurobiology of Aging, 32(4):572–580, 2011.

[13] A. Dale, B. Fischl, and M. I. Sereno. Cortical surface-based analysis I: Segmentation and surface reconstruction. NeuroImage, 9(2):179–194, 1999.

[14] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Review, 49(3):434–448, 2007.

[15] A. Delacourte, J. P. David, N. Sergeant, L. Buee, A. Wattez, P. Vermersch, F. Ghozali, C. Fallet-Bianco, F. Pasquier, F. Lebert, et al. The biochemical pathway of neurofibrillary degeneration in aging and Alzheimer's disease. Neurology, 52(6):1158–1158, 1999.

[16] L. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297–302, 1945.

[17] B. C. Dickerson, E. Feczko, J. C. Augustinack, J. Pacheco, J. C. Morris, and B. Fischl. Differential effects of aging and Alzheimer's disease on medial temporal lobe cortical thickness and surface area. Neurobiology of Aging, 30:432–440, 2009.

[18] E. Dohmatob, M. Eickenberg, B. Thirion, and G. Varoquaux. Speeding-up model-selection in GraphNet via early-stopping and univariate feature-screening. June 2015.

[19] M. Dubois, F. Hadj-Selem, T. Lofstedt, M. Perrot, C. Fischer, V. Frouin, and E. Duchesnay. Predictive support recovery with TV-Elastic Net penalties and logistic regression: an application to structural MRI. In Proceedings of the Fourth International Workshop on Pattern Recognition in Neuroimaging (PRNI 2014), 2014.

[20] H. Eavani, T. D. Satterthwaite, R. E. Filipovych, R. C. Gur, and C. Davatzikos. Identifying sparse connectivity patterns in the brain using resting-state fMRI. NeuroImage, 105:286–299, 2015.

[21] D. Felleman and D. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47, 1991.

[22] G. B. Frisoni, N. C. Fox, C. R. Jack, P. Scheltens, and P. M. Thompson. The clinical use of structural MRI in Alzheimer disease. Nature Reviews Neurology, 6(2):67–77, 2010.

[23] L. Grosenick, B. Klingenberg, K. Katovich, B. Knutson, and J. Taylor. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage, 72:304–321, May 2013.

[24] R. Guo, M. Ahn, H. Zhu, and the Alzheimer's Disease Neuroimaging Initiative. Spatially weighted principal component analysis for imaging classification. Journal of Computational and Graphical Statistics, 24:274–296, 2015.

[25] F. Hadj-Selem, T. Lofstedt, V. Frouin, V. Guillemot, and E. Duchesnay. An Iterative Smoothing Algorithm for Regression with Structured Sparsity. arXiv:1605.09658 [stat], 2016.

[26] C. R. Jack, M. M. Shiung, J. L. Gunter, P. C. O'Brien, S. D. Weigand, D. S. Knopman, B. F. Boeve, R. J. Ivnik, G. E. Smith, R. H. Cha, et al. Comparison of different MRI brain atrophy rate measures with clinical disease progression in AD. Neurology, 62(4):591–600, 2004.

[27] R. Jardri, A. Pouchet, D. Pins, and P. Thomas. Cortical activations during auditory verbal hallucinations in schizophrenia: a coordinate-based meta-analysis. American Journal of Psychiatry, 168(1):73–81, 2011.

[28] R. Jardri, P. Thomas, C. Delmaire, P. Delion, and D. Pins. The neurodynamic organization of modality-dependent hallucinations. Cerebral Cortex, pages 1108–1117, 2013.

[29] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[30] I. Jolliffe, N. Trendafilov, and M. Uddin. A Modified Principal Component Technique Based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.

[31] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized Power Method for Sparse Principal Component Analysis. Journal of Machine Learning Research, 11:517–553, 2010.

[32] B. Kandel, D. Wolk, J. Gee, and B. Avants. Predicting Cognitive Data from Medical Images Using Sparse Linear Regression. Information Processing in Medical Imaging: Proceedings of the Conference, 23:86–97, 2013.

[33] M. Li, Y. Liu, F. Chen, and D. Hu. Including signal intensity increases the performance of blind source separation on brain imaging data. IEEE Transactions on Medical Imaging, 34(2):551–563, 2015.

[34] L. W. Mackey. Deflation Methods for Sparse PCA. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1017–1024. Curran Associates, Inc., 2009.

[35] J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, École normale supérieure de Cachan, 2010.

[36] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online Learning for Matrix Factorization and Sparse Coding. Journal of Machine Learning Research, 11:19–60, 2010.

[37] C. R. McDonald, L. McEvoy, L. Gharapetian, C. Fennema-Notestine, D. J. Hagler, D. Holland, A. Koyama, J. B. Brewer, A. M. Dale, Alzheimer's Disease Neuroimaging Initiative, et al. Regional rates of neocortical atrophy from normal aging to early Alzheimer disease. Neurology, 73(6):457–465, 2009.

[38] H. Mohr, U. Wolfensteller, S. Frimmel, and H. Ruge. Sparse regularization techniques provide novel insights into outcome integration processes. NeuroImage, 104:163–176, January 2015.

[39] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[40] B. Ng, A. Vahdat, G. Hamarneh, and R. Abugharbieh. Generalized Sparse Classifiers for Decoding Cognitive States in fMRI. Pages 108–115, Beijing, China, September 2012. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-15948-0_14.

[41] R. Nieuwenhuys. The myeloarchitectonic studies on the human cerebral cortex of the Vogt–Vogt school, and their significance for the interpretation of functional neuroimaging data. Brain Structure and Function, 218(2):303–352, 2013.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[43] M. Ramezani, K. Marble, H. Trang, I. S. Johnsrude, and P. Abolmaesumi. Joint sparse representation of brain activity patterns in multi-task fMRI data. IEEE Transactions on Medical Imaging, 34(1):2–12, 2015.

[44] B. Ridha, V. Anderson, J. Barnes, R. Boyes, S. Price, M. Rossor, J. Whitwell, L. Jenkins, R. Black, M. Grundman, et al. Volumetric MRI and cognitive measures in Alzheimer disease. Journal of Neurology, 255(4):567–574, 2008.

[45] H. Shen, H. Xu, L. Wang, Y. Lei, L. Yang, P. Zhang, J. Qin, L. Zeng, Z. Zhou, Z. Yang, and D. Hu. Making group inferences using sparse representation of resting-state functional MRI data with application to sleep deprivation. Human Brain Mapping, 38(9):4671–4689, 2017.

[46] J. G. Sled, A. P. Zijdenbos, and A. C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Transactions on Medical Imaging, 17:87–97, 1998.

[47] P. Thompson, K. Hayashi, G. de Zubicaray, A. Janke, S. Rose, J. Semple, M. Hong, D. Herman, D. Gravano, D. Doddrell, and A. Toga. Mapping hippocampal and ventricular change in Alzheimer disease. NeuroImage, 22(4):1754–1766, 2004.

[48] W.-T. Wang and H.-C. Huang. Regularized Principal Component Analysis for Spatial Data. arXiv e-prints, 2015.

[49] D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

[50] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719–752, 2012.

[51] H. Zou, T. Hastie, and R. Tibshirani. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.
