Published in Bioinformatics (Oxford University Press), 2018, 34(3), pp. 485–493. DOI: 10.1093/bioinformatics/btx571. Preprint deposited on HAL: hal-01587360, https://hal.archives-ouvertes.fr/hal-01587360, submitted on 14 Sep 2017.

High Dimensional Classification with combined Adaptive Sparse PLS and Logistic Regression

G. Durif 1,2,∗, L. Modolo 1,3,4, J. Michaelsson 4, J. E. Mold 4, S. Lambert-Lacroix 5 and F. Picard 1

1 LBBE, UMR CNRS 5558, Université Lyon 1, F-69622 Villeurbanne, France; 2 INRIA Grenoble Alpes, THOTH team, F-38330 Montbonnot, France; 3 LBMC UMR 5239 CNRS/ENS Lyon, F-69007 Lyon, France; 4 Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden; 5 UMR 5525 Université Grenoble Alpes/CNRS/TIMC-IMAG, F-38041 Grenoble, France.

∗To whom correspondence should be addressed.

Abstract

Motivation: The high dimensionality of genomic data calls for the development of specific classification methodologies, especially to prevent over-optimistic predictions. This challenge can be tackled by compression and variable selection, which combined constitute a powerful framework for classification, as well as data visualization and interpretation. However, currently proposed combinations lead to unstable and non-convergent methods due to inappropriate computational frameworks. We hereby propose a computationally stable and convergent approach for classification in high dimension based on sparse Partial Least Squares (sparse PLS).
Results: We start by proposing a new solution for the sparse PLS problem that is based on proximal operators for the case of univariate responses. Then we develop an adaptive version of the sparse PLS for classification, called logit-SPLS, which combines iterative optimization of logistic regression and sparse PLS to ensure computational convergence and stability. Our results are confirmed on synthetic and experimental data. In particular, we show how crucial convergence and stability can be when cross-validation is involved for calibration purposes. Using gene expression data, we explore the prediction of breast cancer relapse. We also propose a multicategorical version of our method, used to predict cell types based on single-cell expression data.
Availability: Our approach is implemented in the plsgenomics R-package.
Contact: [email protected]
Supplementary information: Supplementary materials are available at Bioinformatics online.

1 Introduction

Molecular classification is at the core of many recent studies based on Next-Generation Sequencing data. For instance, the genomic characterization of diseases based on genomic signatures has been one Grail for many studies to predict patient outcome, survival or relapse (Guedj et al., 2012). Moreover, following the recent advances of sequencing technologies, it is now possible to isolate and sequence the genetic material from a single cell (Stegle et al., 2015). Single-cell data give the opportunity to characterize the genomic diversity between the individual cells of a specific population. However, in both cases, the specific context of high dimensionality constitutes a major challenge for the development of new statistical methodologies (Marimont and Shapiro, 1979; Donoho, 2000). Indeed, the number of recorded variables p (such as gene expression) being far larger than the sample size n, classical regression or classification methods are inappropriate (Aggarwal et al., 2001; Hastie et al., 2009), due to spurious dependencies between variables that lead to singularities in the optimization processes, with neither unique nor stable solutions.

This challenge calls for the development of specific statistical tools, such as the following dimension reduction approaches: (i) compression methods, which search for a representation of the data in a lower dimensional space, and (ii) variable selection methods, based on a parsimony hypothesis, i.e., among all recorded variables, many are supposed to be uninformative and can be considered as noise to be removed from the model. For instance, Partial Least Squares (PLS) regression (Wold, 1975; Wold et al., 1983) is a compression approach appropriate for linear regression, especially with highly correlated covariates, that constructs new components, i.e. latent directions, explaining the response. An example of a sparsity-based approach is the Lasso (Tibshirani, 1996), where coefficients of less relevant variables are shrunk to zero thanks to an ℓ1 penalty in the optimization procedure. Eventually, sparse PLS (SPLS) regression (Lê Cao et al., 2008; Chun and Keles, 2010) combines both compression and variable selection to reduce dimension. It introduces a selection step based on the Lasso in the PLS framework, constructing new components as sparse linear combinations of predictors. It turns out that combining compression and a sparsity-inducing approach improves the efficiency of prediction and the accuracy of selection. Such an association (compression and selection) is also relevant for data visualization, a crucial challenge when considering high dimensional data. Existing SPLS methods are based on resolutions of approximations of the associated optimization problem. In this work, we first propose a new formulation of the sparse PLS optimization problem with a simple exact resolution, derived from proximal operators (Bach et al., 2012). We also introduce an adaptive sparsity-inducing penalty, inspired from the adaptive Lasso (Zou, 2006), to improve the variable selection accuracy.

SPLS has shown excellent performance for regression with a continuous response, but its adaptation to classification is not straightforward. Chung and Keles (2010) or Lê Cao et al. (2011) proposed to use sparse PLS as a preliminary dimension reduction step before a standard classification method, such as discriminant analysis (SPLS-DA) or logistic regression, following previous approaches using classical PLS for molecular classification (Nguyen and Rocke, 2002; Boulesteix, 2004). Their approach gives interesting results in SNP data analysis (Lê Cao et al., 2011) or in tumor classification (Chung and Keles, 2010).

Another method for classification consists in using logistic regression (binary or multicategorical) (McCullagh and Nelder, 1989), for which optimization is achieved via the Iteratively Reweighted Least Squares (IRLS) algorithm (Green, 1984). However, its convergence is not guaranteed, especially in the high dimensional case. Computational convergence is a crucial issue when estimating parameters, as non-convergent methods may lead to unstable and inconsistent estimations, impacting analysis interpretation and reproducibility, especially when tuning hyper-parameters by cross-validation.

The combination of logistic regression and (sparse) PLS could lead to a classification method performing dimension reduction based on lower-space representation and variable selection. However, the combination of such iterative algorithms is not necessarily straightforward, due to convergence issues. Performing compression with SPLS on the categorical response as a first step before logistic regression remains counter-intuitive, because SPLS was designed to handle a continuous response within homoskedastic models. Based on the generalized PLS by Marx (1996) or Ding and Gentleman (2005), Chung and Keles (2010) proposed to use sparse PLS within the IRLS iterations to solve the reweighted least squares problem at each step; however, we will see that convergence issues remain. Fort and Lambert-Lacroix (2005) proposed to use a Ridge regularization (Eilers et al., 2001) to ensure the convergence of the IRLS algorithm, and to use the classical PLS to estimate predictor coefficients from a continuous pseudo-response generated by the IRLS algorithm. We will develop a similar approach based on sparse PLS.

Our new SPLS-based approach, called logit-SPLS, combines compression and variable selection in a GLM framework. We show the accuracy, the computational stability and the convergence of our method, compared with other state-of-the-art approaches on simulations. In particular, we show that compression increases variable selection accuracy, and that our method is more stable regarding the choice of hyper-parameters by cross-validation, contrary to other methods performing classification with sparse PLS. Thus, our method is the only one that performs correctly with respect to all criteria (prediction, selection, stability), whereas all the other approaches present a weak spot. Our simulations illustrate the interest of combining selection and compression over selection or compression only. Our work was implemented in the existing R-package plsgenomics, available on the CRAN.

We will first introduce our adaptive sparse PLS approach. Then, we will develop and discuss our classification framework based on Ridge IRLS and adaptive sparse PLS for logistic regression. We will finish with a comparative study and eventually two applications of our method: (i) binary classification to predict breast cancer relapse after 5 years based on gene expression data, with an illustration of data visualization through compression, (ii) prediction of cell types with multinomial classification based on single-cell expression profiles. To do so, we extend our approach to the multi-group case, based on a "one-class vs a reference" type of multi-classification. One strength of our approach is to propose a sparse PLS that admits a closed-form solution in both binary and multi-group classifications. This leads to computationally efficient procedures in both cases, contrary to sparse PLS-DA approaches for instance, which are based on a multivariate-response sparse PLS algorithm in the multi-group case, for which there is no closed-form solution (c.f. Chung and Keles, 2010; Lê Cao et al., 2011).

2 Compression and selection in the GLM framework

We first define the sparse PLS and introduce a new formulation of the associated optimization problem, based on proximal operators. Contrary to existing approaches, this formulation provides a simple resolution of the covariance maximization problem associated with sparse PLS. Then, we propose an adaptive version of the sparse PLS selection step. Eventually, we develop our approach to combine sparse PLS and logistic regression.

2.1 Proximal sparse PLS

Let (x_i, ξ_i)_{i=1,…,n} be an n-sample, with ξ = (ξ_1, …, ξ_n)^T ∈ R^n a continuous response and x_i ∈ R^p a set of p covariates, gathered in the matrix X_{n×p} = [x_1^T, …, x_n^T]^T. The PLS solves a linear regression problem. We consider centered data ξ_c and X_c to neglect the intercept, and the model ξ_c = X_c β\0 + ε, with the coefficients β\0 ∈ R^p. The metric in the observation space R^n is weighted by the matrix V_{n×n}.

In the univariate response case, the PLS (Boulesteix and Strimmer, 2007) consists in constructing K components t_k ∈ R^n that explain the response, defined as linear combinations of the predictors, i.e. t_k = X w_k with weight vectors w_k ∈ R^p (k = 1, …, K). These weights w_k are defined to maximize the empirical covariance of the corresponding components t_k with the response ξ_c. Other PLS algorithms consider the maximization of the squared covariance, however both definitions are equivalent in the univariate response case (De Jong, 1993; Boulesteix and Strimmer, 2007). To exclude the inherent noise induced by non-pertinent covariates in the model, the sparse PLS (Lê Cao et al., 2008; Chun and Keles, 2010) introduces a variable selection step into the PLS framework. It constructs "sparse" weight vectors, whose coordinates are required to be null for covariates that are irrelevant to explain the response. Following the Lasso principle (Tibshirani, 1996), the shrinkage to zero is achieved with an ℓ1 norm penalty in the covariance maximization problem:

w(λ_s) = argmin_{w ∈ R^p} { − Cov(X_c w, ξ_c) + λ_s ‖w‖_1 },    (1)

under the constraints ‖w‖_2 = 1 and orthogonality between components, with the sparsity parameter λ_s > 0. The empirical covariance between ξ_c and t = X_c w is explicitly Cov(X_c w, ξ_c) = w^T c, where c = X_c^T V ξ_c ∈ R^p is the empirical covariance Cov(X_c, ξ_c), depending on the metric weighted by V (c is a vector because the response is univariate).

Different methodologies (Lê Cao et al., 2008; Chun and Keles, 2010) have been proposed to solve the optimization problem (1). However, both approaches give an approximate solution. We propose a new approach to exactly solve this problem in the univariate response case. In the standard PLS algorithm, w is proven to be the dominant singular vector of the empirical covariance c. In the univariate response case (PLS1 algorithm), c is a vector and w ∝ c. Thus, we introduce the following equivalent formulation of the penalized problem (1):

w(λ_s) = argmin_{w ∈ R^p} { (1/2) ‖c − w‖_2^2 + λ_s ‖w‖_1 },    (2)

under the constraints ‖w‖_2 = 1 and orthogonality between components (the equivalence between (1) and (2) is shown in the Supp. Mat.). We consider a range of values for λ_s so that the problem (2) admits a solution.
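For completeness, the equivalence between (1) and (2) under the unit-norm constraint can be seen from a one-line expansion (a sketch of the argument detailed in the Supp. Mat.):

```latex
% Under the constraint \|w\|_2 = 1, both \tfrac{1}{2}\|c\|_2^2 and \tfrac{1}{2}\|w\|_2^2
% are constants in w, and Cov(X_c w, \xi_c) = w^T c, so the two objectives coincide
% up to an additive constant:
\begin{aligned}
\tfrac{1}{2}\|c - w\|_2^2 + \lambda_s \|w\|_1
  &= \tfrac{1}{2}\|c\|_2^2 - w^T c + \tfrac{1}{2}\|w\|_2^2 + \lambda_s \|w\|_1 \\
  &= \mathrm{const} - \mathrm{Cov}(X_c w, \xi_c) + \lambda_s \|w\|_1
  \qquad \text{when } \|w\|_2 = 1 .
\end{aligned}
```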

Resolution. Applying the method of Lagrange multipliers, the problem (2) becomes (µ > 0):

argmin_{w ∈ R^p, µ > 0} { (1/2) ‖c − w‖_2^2 + λ_s ‖w‖_1 + µ (‖w‖_2^2 − 1) }.    (3)

The method of Lagrange multipliers was proposed by Witten et al. (2009) or Tenenhaus et al. (2014) for different decomposition problems. The objective is continuous and convex, thus strong duality holds and the solutions of the primal (2) and dual (3) problems are equivalent. The resolution of the dual problem is based on proximal (or proximity) operators, defined as the solution of the following problem (Bach et al., 2012):

argmin_{w ∈ R^p} { (1/2) ‖c − w‖_2^2 + f(w) },    (4)

for any fixed c ∈ R^p and any function f : R^p → R. It is denoted by prox_f(c). When f(·) corresponds to the Elastic Net penalty (combination of ℓ1 and ℓ2 penalties), i.e. f(w) = (µ/2) Σ_{j=1,…,p} |w_j|^2 + λ Σ_{j=1,…,p} |w_j| (with λ > 0 and µ > 0), the closed-form solution of problem (4) is explicitly given by the proximal operator prox_{(µ/2)‖·‖_2^2 + λ‖·‖_1}(c), whose coordinates are defined by (Yu, 2013, Theo. 4):

prox_{(µ/2)‖·‖_2^2 + λ‖·‖_1}(c) = ( (1/(1+µ)) sgn(c_j) (|c_j| − λ)_+ )_{j=1,…,p}.    (5)

This corresponds to the normalized soft-thresholding operator applied to the covariance vector c. When choosing µ = µ* so that w* = prox_{(µ*/2)‖·‖_2^2 + λ‖·‖_1}(c) has a unitary norm, the pair (w*, µ*) given by the proximal operator (5) with λ = λ_s is a candidate point and thus a solution (by convexity) of the dual problem (3). Hence, the SPLS weights used to construct the SPLS components are given by w* ∈ R^p. This new resolution of the sparse PLS problem is a general result, which also applies in the case of the standard homoskedastic linear model by replacing V by the n × n identity matrix. It is consistent with the derivation of the sparse PLS by Chun and Keles (2010), but provides a more direct resolution framework. In addition, λ_s is renormalized to lie in [0, 1] (c.f. Chun and Keles, 2010).
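In practice, since the factor 1/(1+µ) in (5) only rescales the vector, the unit-norm solution can be obtained by soft-thresholding c and normalizing the result. A minimal R sketch follows (the function names and the renormalization of λ_s to [0, 1] through the largest |c_j| are our own illustrative conventions, not the plsgenomics implementation):

```r
# Soft-thresholding operator: sign(c_j) * (|c_j| - thr)_+
soft_threshold <- function(c, thr) {
  sign(c) * pmax(abs(c) - thr, 0)
}

# Sparse PLS weight vector for a univariate response (Eq. (5)):
# threshold the empirical covariance vector c, then rescale to unit norm.
# 'lambda' is the sparsity level renormalized to [0, 1]: the effective
# threshold is lambda * max(|c_j|), so lambda = 0 keeps all variables.
spls_weight <- function(c, lambda) {
  w <- soft_threshold(c, lambda * max(abs(c)))
  if (sum(w^2) == 0) return(w)  # everything was shrunk to zero
  w / sqrt(sum(w^2))            # normalized soft-thresholding
}
```

For instance, spls_weight(c, 0.9) would keep only the covariates whose covariance with the response exceeds 90% of the largest one, under this renormalization convention.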

The resolution of problem (2) allows to compute w_1 and construct the first component t_1. At step k > 1, w_k is computed by solving Eq. (2), using a "deflated" version of X_c and ξ_c, i.e. the residuals of the respective regressions of X_c and ξ_c onto the previous components [t_ℓ]_{ℓ=1,…,k−1}, guaranteeing the orthogonality between components. The active set of selected variables up to component K is a subset of {1, …, p}, defined as the variables with non-null weights in [w_k]_{k=1,…,K}, and denoted by A_K = ∪_{k=1,…,K} {j, w_{jk} ≠ 0}. Eventually, the estimation β̂SPLS\0 of β\0 in the model ξ_c = X_c β\0 + ε is given by the weighted PLS regression of ξ_c onto the selected variables in the active set A_K. The coefficient β̂SPLS_j is set to zero if the predictor j ∈ {1, …, p} is not in A_K. Indeed, following the definition of the SPLS regression, the sparse structure of the weight vectors [w_k]_{k=1,…,K} directly induces the sparse structure of β̂SPLS\0. The variables selected to construct the new components [t_k]_{k=1,…,K} are the ones that contribute the most to the response and correspond to those with non-null entries in the true vector β\0.
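The following R sketch assembles the weight computation and the deflation step into the component-construction loop (a simplified illustration with the identity metric V = I and no adaptive penalty; it stops at the components and active set, leaving out the final weighted PLS regression of the coefficients; all names are our own):

```r
# Sparse PLS for a univariate response: K components with deflation.
# X: centered n x p predictor matrix; xi: centered response; lambda in [0, 1].
spls_univariate <- function(X, xi, K, lambda) {
  soft_threshold <- function(c, thr) sign(c) * pmax(abs(c) - thr, 0)
  Xk <- X; xik <- xi
  W  <- matrix(0, ncol(X), K)                    # weight vectors w_k
  Tc <- matrix(0, nrow(X), K)                    # components t_k
  for (k in seq_len(K)) {
    c_k <- drop(crossprod(Xk, xik))              # empirical covariance vector
    w <- soft_threshold(c_k, lambda * max(abs(c_k)))
    if (sum(w^2) == 0) break                     # nothing passes the threshold
    w   <- w / sqrt(sum(w^2))                    # normalized soft-thresholding
    t_k <- drop(Xk %*% w)                        # new component
    # Deflation: keep the residuals of Xk and xik regressed on t_k,
    # which guarantees orthogonality between successive components.
    denom <- sum(t_k^2)
    Xk  <- Xk - tcrossprod(t_k, drop(crossprod(Xk, t_k))) / denom
    xik <- xik - t_k * sum(t_k * xik) / denom
    W[, k] <- w; Tc[, k] <- t_k
  }
  active <- which(rowSums(abs(W)) > 0)           # active set A_K
  list(weights = W, components = Tc, active = active)
}
```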

2.2 Adaptive sparse PLS

We also propose to adjust the ℓ1 constraint to further penalize the less significant variables, which can lead to a more accurate selection process. Such an approach is inspired by component-wise penalization as in the adaptive Lasso (Zou, 2006). We use the weights wPLS ∈ R^p from the classical PLS (without sparsity constraint) to adapt the ℓ1 penalty on the weight vector wSPLS. The ℓ1 penalty in problem (2) becomes Pen_ada(w) = λ_s Σ_{j=1,…,p} γ_j |w_j|, with γ_j = 1/|wPLS_j| to account for the significance of predictor j (higher weights in absolute value correspond to more important variables). The closed-form solution accounts for the adaptive penalty and remains the soft-thresholding operator applied to c, but with parameter λ_s × γ_j for the j-th predictor (c.f. Supp. Mat.). We call this method adaptive sparse PLS.
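A minimal sketch of the adaptive variant, reusing the soft-thresholding idea above (how λ_s is renormalized in the adaptive case is detailed in the Supp. Mat.; here the raw per-variable threshold λ_s × γ_j is used for simplicity and the function name is ours):

```r
# Adaptive sparse PLS weights: each covariate j has its own threshold
# lambda_s * gamma_j, with gamma_j = 1 / |w_j^PLS| taken from a plain
# (non-sparse) PLS fit: a large |w_j^PLS| means a weaker penalty on variable j.
adaptive_spls_weight <- function(c, w_pls, lambda_s) {
  gamma <- 1 / abs(w_pls)
  thr <- ifelse(is.finite(gamma), lambda_s * gamma, Inf)  # variables with w_j^PLS = 0 are dropped
  w <- sign(c) * pmax(abs(c) - thr, 0)                    # component-wise soft-thresholding
  if (sum(w^2) == 0) return(w)
  w / sqrt(sum(w^2))                                      # unit-norm weight vector
}
```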

2.3 Ridge-based logistic regression and sparse PLS

We now present our approach based on sparse PLS for logistic regression.

The Logistic Regression model. We now consider an n-sample (x_i, y_i)_{i=1,…,n} with y_i being a label variable in {0, 1}, gathered in y = (y_1, …, y_n)^T. We use the Generalized Linear Models (GLM) framework (McCullagh and Nelder, 1989) to relate the predictors to the random response variable Y_i, using the logistic link function, such that logit(π_i) = β_0 + x_i^T β\0, with π_i = E[Y_i], logit(x) = log(x/(1 − x)), and β = (β_0, β_1, …, β_p)^T = {β_0, β\0}. In the sequel, Z = [(1, …, 1)^T, X]. With η_i = z_i^T β, the log-likelihood of the model is defined by log L(β) = Σ_{i=1,…,n} [y_i η_i − log{1 + exp(η_i)}], and the coefficients β ∈ R^{p+1} are estimated by maximum likelihood (MLE).
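For reference, this log-likelihood translates directly into a few lines of R (a plain illustration with names of our own choosing):

```r
# Log-likelihood of the logistic model: sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
# Z is the n x (p+1) design matrix with a leading column of ones, beta in R^(p+1).
logistic_loglik <- function(Z, y, beta) {
  eta <- drop(Z %*% beta)            # linear predictor eta_i = z_i' beta
  sum(y * eta - log1p(exp(eta)))     # log1p improves accuracy for small exp(eta)
}
```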

The Ridge IRLS algorithm. Optimization relies on a Newton-Raphson iterative procedure (McCullagh and Nelder, 1989) to construct a sequence (β^(t))_{t≥1}, whose limit β^∞ ∈ R^{p+1} (if it exists) is the estimation of β. The Iteratively Reweighted Least Squares (IRLS) algorithm (Green, 1984) explicitly defines (β^(t))_{t≥1} as the solutions of successive weighted least squares regressions of a pseudo-response ξ^(t) ∈ R^n onto the predictors at each iteration t. The pseudo-response is linearly generated from the predictors based on previous iterations, c.f. Eq. (6). However, when p > n, the matrix Z is singular, which leads to optimization issues. Le Cessie and Van Houwelingen (1992) proposed to optimize a Ridge-penalized log-likelihood, i.e. log L(β) − (λ_R/2) β^T Σ β, with Σ the diagonal empirical variance matrix of Z and λ_R > 0 the Ridge parameter. A unique solution of this regularized problem always exists and is computed by the Ridge IRLS (RIRLS) algorithm (Eilers et al., 2001), where the weighted regression at each IRLS iteration is replaced by a Ridge weighted regression, hence:

β^(t+1) = (Z^T V^(t) Z + λ_R Σ)^{-1} Z^T V^(t) ξ^(t),
ξ^(t+1) = Z β^(t+1) + (V^(t+1))^{-1} [y − π^(t+1)],    (6)

with the estimated probabilities π^(t) = (π^(t)_i)_{i=1,…,n}, i.e. π^(t)_i = logit^{-1}(z_i^T β^(t)) for each Y_i, and V^(t) = diag(π^(t)_i (1 − π^(t)_i))_{i=1,…,n} the diagonal empirical variance matrix of (Y_i)_{i=1,…,n} at step t.

Following the definition of ξ^(t), the (R)IRLS algorithm produces a pseudo-response ξ^∞ as the limit of the sequence (ξ^(t))_{t≥1}, verifying ξ^∞ = Z β^∞ + ε, where β^∞ is the solution of the likelihood optimization, and ε is a noise vector with covariance matrix (V^∞)^{-1}, with V^∞ the limit of the matrix sequence (V^(t))_{t≥1}.
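A compact R sketch of the RIRLS iteration in Eq. (6) follows (a simplified illustration, not the plsgenomics implementation; convergence is monitored on the change in β, and the variances are bounded away from zero for numerical safety):

```r
# Ridge IRLS: returns the regularized estimate of beta, the pseudo-response
# xi_inf and the weight matrix V_inf later used as a metric by the SPLS step.
rirls <- function(X, y, lambda_R, max_iter = 100, tol = 1e-6) {
  p <- ncol(X)
  Z <- cbind(1, X)                               # design matrix with intercept column
  Sigma <- diag(c(0, apply(X, 2, var)))          # diagonal variance of Z (intercept unpenalized)
  beta <- rep(0, p + 1)
  for (t in seq_len(max_iter)) {
    eta  <- drop(Z %*% beta)
    pi_t <- 1 / (1 + exp(-eta))                  # pi_i^(t) = logit^{-1}(z_i' beta^(t))
    v    <- pmax(pi_t * (1 - pi_t), 1e-10)       # diagonal of V^(t)
    xi   <- eta + (y - pi_t) / v                 # pseudo-response xi^(t)
    beta_new <- solve(crossprod(Z, Z * v) + lambda_R * Sigma,
                      crossprod(Z, v * xi))      # Ridge weighted least squares
    converged <- max(abs(beta_new - beta)) < tol
    beta <- drop(beta_new)
    if (converged) break
  }
  # Recompute the pseudo-response and weights at the final estimate.
  eta  <- drop(Z %*% beta)
  pi_t <- 1 / (1 + exp(-eta))
  v    <- pmax(pi_t * (1 - pi_t), 1e-10)
  list(beta = beta, xi = eta + (y - pi_t) / v, V = diag(v))
}
```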


Sparse PLS regression. The pseudo-response ξ^∞ produced by Ridge IRLS depends on the predictors through a linear model. Following the approach by Fort and Lambert-Lacroix (2005), we propose to use the sparse PLS regression on ξ^∞ to process dimension reduction and estimate β ∈ R^{p+1} in the logistic model E[Y_i] = logit^{-1}(β_0 + x_i^T β\0). In this case, the ℓ2 metric (in the observation space) is weighted by the empirical inverse covariance matrix V^∞, to account for the heteroskedasticity of the noise ε. To neglect the intercept in the SPLS step, we consider the centered versions of X and ξ^∞, regarding the metric weighted by V^∞, denoted by X_c and ξ^∞_c. The intercept β_0 will be estimated later.

The estimates β̂SPLS\0 ∈ R^p are renormalized to correspond to the non-centered and non-scaled data, i.e. β̂\0 = Σ^{-1/2} β̂SPLS\0, giving the estimation β̂\0 in the original logistic model. The intercept β_0 is estimated by β̂_0 = ξ̄^∞ − x̄^T β̂\0, where ξ̄^∞ and x̄ are respectively the sample average of the pseudo-response and the sample average vector of the predictors regarding the metric weighted by V^∞. Our method can be summarized as follows:

1. (ξ^∞, V^∞) ← RIRLS(X, y, λ_R)
2. Center X and ξ^∞ regarding the scalar product weighted by V^∞
3. (β̂SPLS\0, A_K, [t_k]_{k=1,…,K}) ← SPLS(X, ξ^∞, K, λ_s, V^∞)
4. Renormalization of β̂ = {β̂_0, β̂\0}

The label y_new of a new observation x_new ∈ R^p (non-centered and non-scaled) is predicted through the logit function thanks to the estimates β̂ = {β̂_0, β̂\0}. Note that x_new does not need to be centered nor scaled, thanks to the intercept parameter β̂_0 and to the renormalization of the coefficient estimates in the algorithm.
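In R, this prediction step is a one-liner once the renormalized estimates are available (b0 and b are hypothetical placeholders for the estimated intercept and coefficient vector):

```r
# Predict the label of a new (non-centered, non-scaled) observation x_new
# from the estimated intercept b0 and coefficient vector b.
predict_logit_spls <- function(x_new, b0, b) {
  p_new <- 1 / (1 + exp(-(b0 + sum(x_new * b))))  # logit^{-1}(b0 + x_new' b)
  as.integer(p_new > 0.5)                         # predicted label in {0, 1}
}
```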

Our method estimates the predictor coefficients β in the logistic model by sparse PLS regression of a pseudo-response, considered as continuous and therefore in accordance with the conceptual framework of PLS, while performing compression and variable selection simultaneously. An additional interest is that the iterative optimization in the RIRLS algorithm does not depend on the number of components K nor on the sparsity parameter λ_s. Consequently, the convergence of our method is robust to the choice of K and λ_s by definition, contrary to other approaches for logistic regression based on sparse PLS (c.f. Supp. Mat. section A.3). Our approach will be called logit-SPLS in the following, while the method by Fort and Lambert-Lacroix (2005) will be called logit-PLS.

2.4 Tuning sparsity by stability selection

Our logit-SPLS approach depends on three hyper-parameters: the sparsity parameter λ_s, the Ridge parameter λ_R and the number of components K. We first propose to tune all the parameters by 10-fold cross-validation (to reduce the sampling dependence). Details about the choice of the grid of candidate values for (λ_s, λ_R, K) are given in Supp. Mat. (c.f. sections A.5.1 and A.6.1).

In addition, we propose to adapt the stability selection method developed by Meinshausen and Bühlmann (2010) to the sparse PLS framework. The interest of this approach is to avoid choosing a value for the sparsity parameter λ_s to find the degree of sparsity in the model, i.e. to select the relevant predictors. In this framework, the grid of all candidate values for (λ_s, λ_R, K) is denoted by Λ. The principle consists in fitting the model for all points ℓ ∈ Λ, then estimating the probability p_{ℓj} for each covariate j to be selected over 100 resamplings of size n/2 depending on ℓ, i.e. the probability for predictor j to be in the set S_ℓ = {j, β̂_j(ℓ) ≠ 0}, where β̂(ℓ) ∈ R^p are the corresponding estimated coefficients. Finally, the procedure retains the predictors that are in the set S_stable of stable selected variables, defined as {j, max_{ℓ∈Λ} p_{ℓj} ≥ π_thr}, where π_thr is a threshold value. This means that predictors with a high selection probability are kept and predictors with a low selection probability are discarded.

The average number of selected variables over the entire grid Λ is denoted by q_Λ and defined as q_Λ = E[#{∪_{ℓ∈Λ} S_ℓ}]. Meinshausen and Bühlmann (2010, Theo. 1) provided a bound on the expected number of wrongly stable selected variables (equivalent to false positives) in S_stable, depending on the threshold π_thr, the expectation q_Λ and the number p of covariates:

E[FP] ≤ (1 / (2 π_thr − 1)) × q_Λ^2 / p,    (7)

where FP is the number of false positives, i.e. FP = #{S_0^c ∩ S_stable}, and S_0 is the unknown set of truly relevant variables. This result is derived under some reasonable conditions that are discussed in Supp. Mat. (section A.2). Following the recommendation of Meinshausen and Bühlmann (2010, p. 424), we use Eq. (7) to determine the range of the parameter grid Λ to avoid too many false positives (corresponding to a weak ℓ1 penalization). Indeed, since the number of false positives is controlled by q_Λ, we automatically exclude candidate points ℓ = (λ_s, λ_R, K) corresponding to small λ_s (near 0), for which there is no selection and all variables contribute to the model, so that we can control q_Λ. Without removing these points, q_Λ and the number of false positives are too high. For instance, when the threshold probability π_thr is set to 0.9, Λ is defined as a subset of the parameter grid so that q_Λ = sqrt(0.8 p ρ_error). In practice, q_Λ is unknown but can be estimated by the empirical average number of selected variables over all ℓ ∈ Λ. In this context, the expected number of false positives will be lower than ρ_error (in practice, we set ρ_error = 10). Details about the candidate values for (λ_s, λ_R, K) are given in Supp. Mat. (section A.6.2).

A clear interest here is that we do not have to choose a specific value for the hyper-parameters; instead we retain the variables that are selected by most of the models when exploring the grid of candidate values for the hyper-parameters (including K).
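The stability selection machinery is simple to sketch in R. The snippet below shows the resampling-based selection probabilities for one grid point, using a hypothetical fit_and_select(X, y, params) that returns the indices of the selected variables (any SPLS fit could play this role), together with the bound (7):

```r
# Estimated selection probabilities over B subsamples of size n/2,
# for one candidate point of the hyper-parameter grid.
selection_probability <- function(X, y, params, fit_and_select, B = 100) {
  n <- nrow(X); p <- ncol(X)
  counts <- numeric(p)
  for (b in seq_len(B)) {
    sub <- sample(n, floor(n / 2))                   # resampling of size n/2
    sel <- fit_and_select(X[sub, , drop = FALSE], y[sub], params)
    counts[sel] <- counts[sel] + 1
  }
  counts / B                                         # p_lj for j = 1..p
}
# Stable set: variables j whose maximum of p_lj over the grid is >= pi_thr.

# Upper bound (7) on the expected number of falsely stable variables.
expected_false_positives <- function(q_lambda, p, pi_thr) {
  q_lambda^2 / ((2 * pi_thr - 1) * p)
}

# Example: with p = 5000 genes, pi_thr = 0.9 and a target of rho_error = 10
# false positives, the grid must satisfy q_Lambda <= sqrt(0.8 * 5000 * 10).
sqrt(0.8 * 5000 * 10)   # = 200
```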

3 Simulation study

We assess the performance of our approach for prediction, compression and variable selection compared to the state-of-the-art methods that were previously introduced. We also use a "baseline" method, called GLMNET (Friedman et al., 2010), that performs variable selection by solving the GLM likelihood maximization with an ℓ1 norm penalty for selection and an ℓ2 norm penalty for regularization, also known as the Elastic Net approach (Zou and Hastie, 2005). We compare different approaches based on (sparse) PLS for classification (c.f. Tab. 1 and Supp. Mat. sections A.3 and A.4 for details).

Simulation design. Our simulated data are constructed to assess the interest of compression and variable selection for prediction performance. The simulations are inspired from Zou et al. (2006), Shen and Huang (2008) or Chung and Keles (2010). The purpose is to control the redundancy within predictors, and the relevance of each predictor to explain the response. We consider a predictor matrix X of dimension n × p, with n = 100 fixed, and p = 100, 500, 1000, 2000, so that we examine different high dimensional models. The true vector of coefficients β* is generated to be sparse, the sparsity structure being thus known. Hence, it is possible to assess whether a method selects the relevant predictors or not. The response variable Y_i for observation i is a Bernoulli variable with parameter π*_i = logit^{-1}(x_i^T β*). The pattern of data simulation and the tuning of hyper-parameters are detailed in Supp. Mat. (section A.5). Regarding the other methods, we use the range of parameters recommended by their respective authors and the cross-validation procedures supplied in the corresponding packages.
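As an illustration of this design, a bare-bones simulation of one data set could look as follows in R (the correlation structure of X, the number of relevant predictors and the magnitude of the non-zero coefficients are simplified placeholders; the actual pattern follows Supp. Mat. section A.5):

```r
set.seed(1)
n <- 100; p <- 2000; n_relevant <- 50          # n_relevant is a placeholder value

# Predictor matrix (here i.i.d. Gaussian; the paper controls redundancy
# between predictors, which is omitted in this simplified sketch).
X <- matrix(rnorm(n * p), n, p)

# Sparse true coefficient vector: only the first n_relevant entries are non-zero.
beta_star <- c(rep(1, n_relevant), rep(0, p - n_relevant))

# Bernoulli response with parameter pi_i = logit^{-1}(x_i' beta_star).
pi_star <- 1 / (1 + exp(-drop(X %*% beta_star)))
y <- rbinom(n, size = 1, prob = pi_star)
```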


Table 1. The different algorithms to process dimension reduction by (sparse) PLS in the framework of the logistic regression.

Method       Algorithm                                                  Sparse?   Reference
GPLS         (S)PLS inside the IRLS algorithm                           no        Ding and Gentleman (2005)
SGPLS        (S)PLS inside the IRLS algorithm                           yes       Chung and Keles (2010)
PLS-log      (S)PLS before logistic regression                          no        Wang et al. (1999), Nguyen and Rocke (2002)
SPLS-log     (S)PLS before logistic regression                          yes       Chung and Keles (2010)
logit-PLS    (S)PLS on the pseudo-response after the RIRLS algorithm    no        Fort and Lambert-Lacroix (2005)
logit-SPLS   (S)PLS on the pseudo-response after the RIRLS algorithm    yes       Our algorithm

Ridge penalty ensures convergence. Convergence is crucial when combining the PLS and IRLS algorithms, as pointed out by Fort and Lambert-Lacroix (2005). With the analysis of high dimensional data and the use of selection in the estimation process, it becomes even more essential to ensure the convergence of the optimization algorithms, otherwise the output estimates may not be relevant. Our simulations show that the Ridge regularization systematically ensures the convergence of the IRLS algorithm in our method (logit-SPLS), for any configuration of simulation: p = n, p > n, high or low sparsity, high or low redundancy (see Tabs. A.3 and A.2 in Supp. Mat.). On the contrary, approaches that use (sparse) PLS before or within the IRLS algorithm (resp. SPLS-log and (S)GPLS) encounter severe convergence issues.

Whereas the SPLS-log and (S)GPLS approaches were designed to overcome convergence issues, it appears that they do not, which questions the reliability of the results supplied by these methods. This confirms the interest of the Ridge regularization to ensure the convergence of the IRLS algorithm. Moreover, this convergence appears to be fast (around 15 iterations even when p = 2000), which is an interesting outcome for computational time. For instance, the tuning of three parameters in the logit-SPLS approach is less costly thanks to the fast convergence of the algorithm. Although both the SGPLS and SPLS-log methods are based on two parameters, they iterate further (until the limit set by the user), which is less computationally efficient, especially with high dimensional data. On this matter, details regarding computation times are given in Supp. Mat. (section A.5.2).

Adaptive selection improves cross-validation stability. A cross-validation procedure would be expected to be stable under multiple runs, i.e. the chosen values must not vary when running the procedure many times on the same sample. Otherwise, selection and prediction become uncertain and not suitable for experiment reproducibility. We quantified the standard deviation of the sparsity parameter λ_s chosen by cross-validation for the three sparse PLS methods (SGPLS, SPLS-log and our logit-SPLS) when repeating the procedure on the same samples. The standard deviation (all three methods consider the same range of values for λ_s) is smaller for our approach (c.f. Tab. 2) than for the other methods. Thus, the cross-validation procedure in our adaptive method is more stable than in the other SPLS approaches. A similar comment can be made regarding the choice of the number K of components (c.f. Fig. A.1 in Supp. Mat.). This behavior can be linked to the convergence of the different approaches. The methods with convergence issues (SGPLS and SPLS-log) present a higher cross-validation instability, whereas our method (logit-SPLS) converges efficiently and shows a better cross-validation stability. Similarly, the variable selection accuracy, defined as the proportion of rightly selected and rightly non-selected variables (Chong and Jun, 2005), is also influenced by the cross-validation stability and the convergence of the method. Indeed, the standard deviation of the selection accuracy (computed across multiple runs) is higher for the less stable and less convergent methods (SGPLS and SPLS-log) compared to our logit-SPLS approach (c.f. Tab. 2).

Table 2. Comparing computational stability between sparse PLS approaches. σ(λ_s) stands for the estimated standard deviation of the tuned hyper-parameter λ_s (over repetitions on the same simulated data set), which measures the stability of the hyper-parameter tuning by cross-validation. σ(acc.) stands for the estimated standard deviation of the accuracy in variable selection, which measures the stability of the selection step. The results are presented for different model dimensions (p).

Method       p = 100               p = 2000
             σ(λ_s)   σ(acc.)      σ(λ_s)   σ(acc.)
logit-spls   0.09     0.11         0.11     0.09
sgpls        0.17     0.14         0.15     0.12
spls-log     0.23     0.12         0.21     0.17

Compression and selection increase prediction accuracy. We now assess the importance of compression and variable selection for prediction performance. We consider the prediction accuracy, evaluated through the prediction error rate. A first interesting point is that the prediction performance of compression methods is improved by the addition of a selection step: logit-SPLS, SGPLS and SPLS-DA perform better than logit-PLS, GPLS and PLS-DA respectively (c.f. Tab. 3). In addition, sparse PLS approaches also present a lower classification error rate than the GLMNET method, which performs variable selection only. These two points support our claim that compression and selection should in any case both be considered for prediction. Similar results are observed for other configurations of simulated data (c.f. Supp. Mat. section A.5.2). All the different SPLS-based approaches show similar prediction performance, even the methods that are not converging (SPLS-log or SGPLS) compared to our adaptive approach logit-SPLS. Thus, checking prediction accuracy only may not be a sufficient criterion to assess the relevance of a method. The GPLS method is a good example of a non-convergent method (c.f. Tab. 3 and Tab. A.2 in Supp. Mat.) that presents high variability and poor performance regarding prediction.

Actually, the combination of Ridge IRLS and sparse PLS in our method ensures convergence and provides good prediction performance (prediction error rate at 10% on average) even in the most difficult configuration n = 100 and p = 2000, which makes it an appropriate framework for classification.

Compression increases selection accuracy. A sparse model will be useful if characterized by good prediction performance, but also if the selected covariates are the genuinely important predictors that explain the response. To assess the selection accuracy, we compare the selected predictors returned by each sparse method to the set of relevant ones used to construct the response, i.e. those with a non-zero coefficient β*_j in our simulation model. We consider sensitivity and specificity (Chong and Jun, 2005), respectively the proportions of true positives and true negatives regarding the selected variables.

Table 3. Prediction error and selection sensitivity/specificity (if relevant) when p = 2000, for non-sparse or sparse approaches (delimited by the line). Results for other values of p are given in Supp. Mat. (section A.5.2).

Method       Prediction error   Selection sensitivity   Selection specificity   Selection accuracy
gpls         0.49 ± 0.31        /                       /                       /
pls-da       0.20 ± 0.07        /                       /                       /
logit-pls    0.17 ± 0.07        /                       /                       /
-------------------------------------------------------------------------------------------------
glmnet       0.16 ± 0.07        0.27                    0.98                    0.74
logit-spls   0.11 ± 0.06        0.63                    0.86                    0.79
sgpls        0.11 ± 0.05        0.80                    0.75                    0.81
spls-da      0.12 ± 0.06        0.82                    0.74                    0.81
spls-log     0.12 ± 0.05        0.83                    0.75                    0.81

A first striking point is that, in our simulations (see Tab. 3 and Tabs. A.4, A.5, A.6 in Supp. Mat.), the baseline GLMNET presents a very low sensitivity and a very high specificity (low true positive and low false positive rates), meaning that it selects a small number of predictors (which are relevant), which leads to a lower accuracy compared to SPLS-based approaches. Thus, using approaches that combine compression and variable selection, such as sparse PLS, has a real impact on selection accuracy, compared to a "selection-only" approach such as GLMNET.

Then, we focus on the comparison of the different sparse PLS approaches. On the one hand, our method logit-SPLS selects fewer irrelevant predictors since its false positive rate is lower (higher specificity), compared to the other SPLS approaches. On the other hand, SGPLS, SPLS-log and SPLS-DA select more true positives (higher sensitivity). Since all methods achieve a similar level of accuracy, this result clearly illustrates a difference of strategy regarding variable selection. The balance between sensitivity and specificity indicates that our method logit-SPLS selects predictors which are more likely to be relevant, discarding most of the non-pertinent predictors, while the other approaches tend to select more predictors with a higher false positive rate. With high dimensional data sets (large p), we are generally interested in highly sparse models, thus it is an advantage to have a sharper control on the false positive rate, as in our method. In addition, the relatively good sensitivity of the other sparse PLS approaches (SGPLS and SPLS-log) is also balanced by a selection process that is less stable than ours, as the standard deviation of the accuracy is higher over simulations (as previously mentioned, see Tab. 2).

4 Classification of breast tumors using adaptive sparse PLS for logistic regression

We consider a publicly available data set on breast cancer (Guedj et al., 2012) containing the expression levels of 54613 genes for 294 patients affected by breast cancer. We focus on the relapse after 5 years, considering a {0, 1}-valued response indicating whether the relapse occurred or not. There were 214 patients without relapse and 80 with a relapse. We reduce the number of genes by considering the top 5000 most differentially expressed genes, using a standard t-test with a Benjamini-Hochberg correction. Computation details (resamplings, cross-validation, stability selection, training and test set definition) are given in Supp. Mat. (c.f. section A.6).
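For illustration, this pre-filtering step could be done along the following lines in R (expr is a hypothetical genes × patients expression matrix and relapse the {0, 1} response; the paper's exact pre-processing is described in the Supp. Mat.):

```r
# Rank genes with a two-sample t-test (relapse vs no relapse), adjust the
# p-values with Benjamini-Hochberg, and keep the top 5000 genes.
pvals <- apply(expr, 1, function(g) t.test(g[relapse == 1], g[relapse == 0])$p.value)
padj  <- p.adjust(pvals, method = "BH")
top_genes <- order(padj)[1:5000]
X <- t(expr[top_genes, ])   # patients x genes matrix used for classification
```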

Convergence and stability with Ridge IRLS and adaptive sparse PLS. The Ridge IRLS algorithm confirms its usual convergence (see Tab. 4). Other approaches based on SPLS (SGPLS and SPLS-log) again encounter severe issues and almost never converge. Following a similar pattern, our adaptive selection is far more stable under the tuning of the sparsity parameter λ_s by cross-validation than any other approach using sparse PLS (Tab. 4), as the precision on this hyper-parameter value is the highest for our method, illustrating less variability in the tuning over repetitions.

Table 4. Averaged prediction error, convergence percentage over 100 resamplings and standard deviation of the cross-validated λ_s.

Method          Prediction error   Conv. perc.   s.d. λ_s
glmnet          0.27 ± 0.04        /             /
logit-pls       0.26 ± 0.05        100%          /
logit-spls      0.23 ± 0.06        100%          0.15
logit-spls-ad   0.19 ± 0.04        100%          0.15
sgpls           0.5 ± 0.21         5%            0.18
spls-log        0.18 ± 0.04        1%            0.19

Interest of adaptive selection for prediction and selection. Regarding prediction performance, the adaptive version of our algorithm logit-SPLS gives better results (c.f. Tab. 4), which highlights the interest of adaptive selection. It can also be noted that our approach performs better on prediction than both logit-PLS (compression only) and GLMNET (selection only), which again supports the interest of using both compression and variable selection. The SGPLS method does not confirm its performance observed on our simulations, with poor and highly variable results, illustrating the potential lack of stability of non-convergent methods. Only the SPLS-log method achieves a classification that is as good as our adaptive method. However, this point is counterbalanced by its assessment over the other criteria in the following.

Regarding variable selection, the stability selection analysis (see Fig. 1) shows that, when the number of false positives is bounded (on average), our approach logit-SPLS selects more genes than any other approach (SGPLS, SPLS-log and GLMNET). Hence, we discover more true positives (because the number of false positives is bounded), unraveling more relevant genes than the other approaches. This again illustrates the good performance of our method for selection. More generally, approaches that use sparse PLS, i.e. performing selection and compression, select more variables than GLMNET with the same false positive rate, thus retrieving more true positives than GLMNET, which performs only selection. This again supports our previously developed idea that compression and selection are both very suitable for high dimensional data analysis. We recall that the curves in Fig. 1 correspond to the number of variables that are selected by most models when exploring the grid of candidate values for the hyper-parameters (including K). Additional results regarding the overlap between the genes selected by the different methods and the list of selected genes with their score (i.e. the maximum estimated probability of selection) are given in Supp. Mat. (section A.6.2).

Fig. 1. Number of variables in the set of stable selected variables versus the threshold π_thr, when forcing the average number of false positives to be smaller than ρ_error = 10. Methods: glmnet, logit-spls-adapt, sgpls, spls-log (one curve per method). Note: here, all hyper-parameters (including K) vary across the grid of candidate values Λ (c.f. Supp. Mat. section A.6.2).

Efficient compression to discriminate the response. To assess the interest of our approach for data visualization, we represent the scores of the observations on the first two components, i.e. the point cloud (t_{i1}, t_{i2})_{i=1,…,n}. The points are colored according to their Y-labels. An efficient compression technique would separate the Y-classes with fewer components. We fit the different compression-based approaches (with the number of components set to K = 2). We use PCA as a reference for compression and data visualization, being based on unsupervised learning contrary to the other compared approaches. Fig. 2 represents the first two components computed by logit-PLS, logit-SPLS, SGPLS, SPLS-log and PCA. It appears that the first two components from our logit-SPLS are sufficient to easily separate the two Y-classes. On the contrary, the other sparse PLS approaches do not achieve a similar efficiency in the compression process. Thus, our method turns out to be very efficient for data visualization, especially compared to principal component analysis.

Fig. 2. Observation scores on the first two components for the different methods (panels: logit-spls-ad, sgpls, spls-log, pca). The points are shaped according to the value of the response: 0 (circles) and 1 (triangles). The scores are normalized for comparison.

5 Characterization of T lymphocyte types based on single-cell data

We generalized our approach to the multicategorical case and developed a new method, called multinomial-SPLS (or MSPLS), which we applied to the prediction of cell types using single-cell expression data (Stegle et al., 2015; Gawad et al., 2016). Our approach (detailed in Supp. Mat., section A.7) is based on a direct extension of the logistic model. It is specifically a "one-class vs a reference" type of multi-classification, in which the membership probabilities of each class (except the reference) are estimated based on linear combinations of the predictors. The membership probability of the reference class is then deduced from the rest. The resolution is derived from our logit-SPLS method. One interest is that our multi-group classification approach uses a univariate-response sparse PLS algorithm (which admits a closed-form solution, c.f. section 2), contrary to sparse multi-group PLS-DA for instance (c.f. Supp. Mat. section A.3).

Understanding the mechanisms of an adaptive immune response is of great interest for the creation of new vaccines. This response is made possible thanks to antigen-specific "effector" T cells capable of recognizing and killing infected cells, and to the long-lasting "memory" T cells that will constitute a repertoire for later secondary immune responses. These two types of T cells have been described as 4 sub-groups: CM and TSCM ("Memory"), TEMRA and EM ("Effector Memory"). Generally speaking, CM and TSCM can be considered as "Memory" cells and TEMRA and EM can be considered as "Effectors", as CM/TSCM and EM/TEMRA share significant functional overlap with each other (Willinger et al., 2005a; Gattinoni et al., 2011). Understanding the transcriptomic diversity of T cells constitutes a new challenge to better characterize the short- and long-term vaccinal responses, as T cells are increasingly recognized as being highly heterogeneous populations (Newell et al., 2012). However, these investigations have been limited by current practices that consist in defining those 4 cell types by drawing non-overlapping gates in the 2D-space defined by two surface markers only: CCR7 and CD45RA (Sallusto et al., 1999). Consequently, this rule leads to the selection of a fraction of cells that only correspond to cells with the most extreme values of the markers, which ignores the complexity of a T cell population sampled from real blood.

We developed an SPLS-based multi-categorical classification to better characterize the transcriptomic diversity that supports the 4 different cell types of T cells. This approach aims at classifying more cells, and at inferring the type of the non-identified cells. To do so, we considered the measurements of 11 surface markers (CCR7, CD45RA, CD27, IL7R, FAS, CD49F, PD1, CD57, CD3E, CD8A), along with the expression of the corresponding genes. All these measurements were available on a single-cell basis. We will show that even in this low dimensional case, the use of variable selection helps to improve the accuracy of the results. In the following, hyper-parameters (including K) were tuned by cross-validation. Details about the candidate values for (λ_s, λ_R, K) are given in Supp. Mat. (section A.8.1).

We developed the following two-step analysis. We started by considering the measures of the 11 surface markers and the expression of the 11 associated genes. The multinomial-SPLS was trained on a subset of cells that were tagged manually, and used to predict the types of the unknown cells (136 annotated over 943 cells). On this training set of 136 cells, including 44 CM and 28 TSCM cells (i.e. 72 "Memory" cells), and 30 EM and 34 TEMRA cells (i.e. 64 "Effector" cells), a 5-fold cross-validation procedure (with 50 repetitions) was used to tune the hyper-parameters. The cross-validation prediction error over the resamplings was ∼6%. Fig. A.3 in Supp. Mat. shows that the cells in the training set are well discriminated in this first step. In addition, our SPLS procedure selected the proteins CCR7 and CD45RA in 100% of the runs, which is coherent with the manual annotation of the cells based on these two markers.

In a second step, we wanted to enrich the set of genes that discriminate the cell types. To proceed, we considered the expression of all genes of these predicted cell types, and performed a differential analysis from which we retained 61 differentially expressed genes (corresponding to a 5% FDR). By adding these 61 genes to the first 22 markers considered in the first prediction step, we performed the MSPLS-based prediction on the complete data set annotated by our first prediction. Our method selected 8 new biologically relevant genes (more details in Supp. Mat. section A.8.2), with a cross-validation prediction error rate over re-samplings (again 5-fold cross-validation) of ∼16% (on the whole data set, not only considering the most extreme phenotypes). The main interests of this two-step procedure were to be computationally efficient and to narrow the list of potential genes of interest. This was conclusive, since the second prediction greatly improved the biological relevance of the predicted cell types by accounting for more information than the one contained in classical markers like CCR7, and provided us with new insight to better understand the T cell immune response.

Fig. 3 illustrates the representation of the cells in the latent dimensional space computed by the multinomial PLS in the second step of prediction. The reference class is "CM". The SPLS computes latent directions discriminating each other class ("EM", "TEMRA" and "TSCM" respectively) versus the reference class (c.f. Supp. Mat.). The cells are represented on the first two components for the three different pairs: "CM versus EM", "CM versus TEMRA" and "CM versus TSCM". The latent components clearly discriminate the groups of cells in the three different cases, which confirms the result of the second prediction based both on markers and differentially expressed genes. The different groups are clearly identified but there is no gap between them, contrary to the representation of the cells in the training set for the first prediction (c.f. Supp. Mat.). This indicates that the multinomial-SPLS was able to predict the type of the common cells that are lost by the gating rule, based on the most extreme phenotypes.

Fig. 3. Cell scores on the first two PLS components in the latent space that discriminate between the reference class ("CM") and each other class separately ("EM", "TEMRA" and "TSCM" respectively, from left to right). T cells are identified by their predicted types after the second prediction step.

This application highlights the interest of dimension reduction by compression and variable selection, even when dealing with low dimensional data. It can also be noted that, even when using sparse approaches, a pre-selection step is always useful, especially in the analysis of single-cell expression data, which are very noisy compared to standard RNA-seq data because of the important inter-cellular diversity.

6 Conclusion

We have introduced a new formulation of sparse PLS and proposed an adaptive version of our algorithm to improve the selection process. Using proximal operators, we provide an explicit resolution framework with a closed-form solution based on soft-thresholding operators.

In addition, we developed a method that performs compression and variable selection suitable for classification. It combines the Ridge-regularized Iteratively Reweighted Least Squares algorithm and sparse PLS in the logistic regression context. It is particularly appropriate for the case of high dimensional data, which appears to be a crucial issue nowadays, for instance in genomics. Our main consideration was to ensure the convergence of the IRLS algorithm, which is a critical point in logistic regression. Another concern was to properly incorporate a dimension reduction approach such as sparse PLS into the GLM framework.

Ridge regularization ensures the convergence of the IRLS algorithm, which is confirmed in our simulations and tests on experimental data sets. Applying adaptive sparse PLS as a second step on the pseudo-response produced by IRLS respects the definition of PLS regression for a continuous response. Moreover, the combination of compression and variable selection increases the prediction performance and selection accuracy of our method, which turns out to be more efficient than state-of-the-art approaches that do not use both dimension reduction techniques. Such a combination also improves the compression process, as illustrated by the efficiency of our method for data visualization compared to standard supervised or unsupervised approaches. Furthermore, it appears that previous procedures using sparse PLS with logistic regression encounter convergence issues linked to a lack of stability in the cross-validation parameter tuning process, highlighting the crucial importance of convergence when dealing with iterative algorithms.

It can be noted that our approach can be used to include additional covariates in the model. For example, we used a combination of surface marker levels and gene expression levels in the single-cell data analysis. On this matter, an interesting research direction would be to work on a Least Squares-Partial Least Squares (LS-PLS) approach, in which some of the predictors are compressed into PLS components and some others are not. There have been recent advances regarding LS-PLS for logistic regression (see Bazzoli and Lambert-Lacroix, 2016). However, to our knowledge, there is no work on a potential LS-SPLS method, even in the regression case.

In addition, an interesting extension of our work would be to investigate the theoretical properties of the sparse PLS regression (especially regarding its consistency or any oracle properties). Deriving such properties would be an opportunity to assess the underlying statistical properties of our method and remains an open question.

Funding

This work was supported by the French National Research Agency (ANR) as part of the "Algorithmics, Bioinformatics and Statistics for Next Generation Sequencing data analysis" (ABS4NGS) ANR project [grant number ANR-11-BINF-0001-06] and as part of the "MACARON" ANR project [grant number ANR-14-CE23-0003]. It was performed using the computing facilities of the computing center LBBE/PRABI.

References

Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pages 420–434. Springer.
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. (2012). Optimization with Sparsity-Inducing Penalties. Found. Trends Mach. Learn., 4(1), 1–106.
Barker, M. and Rayens, W. (2003). Partial least squares for discrimination. Journal of Chemometrics, 17(3), 166–173.
Bazzoli, C. and Lambert-Lacroix, S. (2016). Classification using LS-PLS with logistic regression based on both clinical and gene expression variables. Preprint.
Boulesteix, A.-L. (2004). PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 3(1), 1–30.
Boulesteix, A.-L. and Strimmer, K. (2007). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8(1), 32–44.
Chong, I.-G. and Jun, C.-H. (2005). Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems, 78(1), 103–112.
Chun, H. and Keles, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1), 3–25.
Chung, D. and Keles, S. (2010). Sparse Partial Least Squares Classification for High Dimensional Data. Statistical Applications in Genetics and Molecular Biology, 9(1).
De Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3), 251–263.
Ding, B. and Gentleman, R. (2005). Classification Using Generalized Partial Least Squares. Journal of Computational and Graphical Statistics, 14(2), 280–298.
Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, pages 1–32.
Eilers, P. H., Boer, J. M., van Ommen, G.-J., and van Houwelingen, H. C. (2001). Classification of microarray data with penalized logistic regression. In BiOS 2001 The International Symposium on Biomedical Optics, pages 187–198. International Society for Optics and Photonics.
Eksioglu, E. M. (2011). Sparsity regularised recursive least squares adaptive filtering. IET Signal Processing, 5(5), 480–487.
Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38.


Fort, G. and Lambert-Lacroix, S. (2005). Classification using partial least squares with penalized logistic regression. Bioinformatics, 21(7), 1104–1111.
Fort, G., Lambert-Lacroix, S., and Peyre, J. (2005). Réduction de dimension dans les modèles linéaires généralisés : application à la classification supervisée de données issues des biopuces (in French). Journal de la Société Française de Statistique, 146(1-2), 117–152.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1.
Gattinoni, L., Lugli, E., Ji, Y., Pos, Z., Paulos, C. M., Quigley, M. F., Almeida, J. R., Gostick, E., Yu, Z., Carpenito, C., Wang, E., Douek, D. C., Price, D. A., June, C. H., Marincola, F. M., Roederer, M., and Restifo, N. P. (2011). A human memory T cell subset with stem cell-like properties. Nat. Med., 17(10), 1290–1297.
Gawad, C., Koh, W., and Quake, S. R. (2016). Single-cell genome sequencing: Current state of the science. Nature Reviews Genetics, 17(3), 175–188.
Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society. Series B (Methodological), pages 149–192.
Guedj, M., Marisa, L., de Reynies, A., Orsetti, B., Schiappa, R., Bibeau, F., MacGrogan, G., Lerebours, F., Finetti, P., Longy, M., Bertheau, P., Bertrand, F., Bonnet, F., Martin, A. L., Feugeas, J. P., Bièche, I., Lehmann-Che, J., Lidereau, R., Birnbaum, D., Bertucci, F., de Thé, H., and Theillet, C. (2012). A refined molecular taxonomy of breast cancer. Oncogene, 31(9), 1196–1206.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York, New York, NY, second edition.
Lê Cao, K.-A., Rossouw, D., Robert-Granié, C., and Besse, P. (2008). A sparse PLS for variable selection when integrating omics data. Statistical Applications in Genetics and Molecular Biology, 7(1).
Lê Cao, K.-A., Boitard, S., and Besse, P. (2011). Sparse PLS discriminant analysis: Biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics, 12, 253.
Le Cessie, S. and Van Houwelingen, J. C. (1992). Ridge estimators in logistic regression. Applied Statistics, pages 191–201.
Marimont, R. B. and Shapiro, M. B. (1979). Nearest Neighbour Searches and the Curse of Dimensionality. IMA Journal of Applied Mathematics, 24(1), 59–70.
Marx, B. D. (1996). Iteratively reweighted partial least squares estimation for generalized linear regression. Technometrics, 38(4), 374–381.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, Second Edition. CRC Press.
Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417–473.
Newell, E. W., Sigal, N., Bendall, S. C., Nolan, G. P., and Davis, M. M. (2012). Cytometry by time-of-flight shows combinatorial cytokine expression and virus-specific cell niches within a continuum of CD8+ T cell phenotypes. Immunity, 36(1), 142–152.
Nguyen, D. V. and Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18(1), 39–50.
Sallusto, F., Lenig, D., Forster, R., Lipp, M., and Lanzavecchia, A. (1999). Two subsets of memory T lymphocytes with distinct homing potentials and effector functions. Nature, 401(6754), 708–712.
Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6), 1015–1034.
Stegle, O., Teichmann, S. A., and Marioni, J. C. (2015). Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics, 16(3), 133–145.
Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K.-A., Grill, J., and Frouin, V. (2014). Variable selection for generalized canonical correlation analysis. Biostatistics, 15(3), 569–583.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
Wang, C.-Y., Chen, C.-T., Chiang, C.-P., Young, S.-T., Chow, S.-N., and Chiang, H. K. (1999). A Probability-based Multivariate Statistical Algorithm for Autofluorescence Spectroscopic Identification of Oral Carcinogenesis. Photochemistry and Photobiology, 69(4), 471–477.
Wherry, E. J., Ha, S.-J., Kaech, S. M., Haining, W. N., Sarkar, S., Kalia, V., Subramaniam, S., Blattman, J. N., Barber, D. L., and Ahmed, R. (2007). Molecular Signature of CD8+ T Cell Exhaustion during Chronic Viral Infection. Immunity, 27(4), 670–684.
Willinger, T., Freeman, T., Hasegawa, H., McMichael, A. J., and Callan, M. F. (2005a). Molecular signatures distinguish human central memory from effector memory CD8 T cell subsets. J. Immunol., 175(9), 5895–5903.
Willinger, T., Freeman, T., Hasegawa, H., McMichael, A. J., and Callan, M. F. C. (2005b). Molecular Signatures Distinguish Human Central Memory from Effector Memory CD8 T Cell Subsets. The Journal of Immunology, 175(9), 5895–5903.
Witten, D. M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515–534.
Wold, H. (1975). Soft Modeling by Latent Variables; the Nonlinear Iterative Partial Least Squares Approach. Perspectives in Probability and Statistics. Papers in Honour of M. S. Bartlett.
Wold, S., Martens, H., and Wold, H. (1983). The multivariate calibration problem in chemistry solved by the PLS method. In Matrix Pencils, pages 286–293. Springer.
Yu, Y.-L. (2013). On decomposing the proximal map. In Advances in Neural Information Processing Systems, pages 91–99.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265–286.


Supplementary Information

A.1 Optimization in sparse PLS

A.1.1 Reformulation of the sparse PLS problem

As previously introduced, the sparse PLS constructs components as sparse linear combinations of the covariates. When considering the first component, i.e. $t_1 = X w_1$, the weight vector $w_1 \in \mathbb{R}^p$ is defined to maximize the empirical covariance between the component and the response, i.e. $\mathrm{Cov}(Xw, \xi) \propto w^T X_c^T \xi_c$ (centered $X$ and $\xi$), with a penalty on the $\ell_1$-norm of $w_1$ to enforce sparsity in the weights. Thus, the weight vector $w_1$ is computed as the solution of the following optimization problem:
\[
\operatorname*{argmin}_{w \in \mathbb{R}^p} \Big\{ -w^T X_c^T \xi_c + \lambda_s \sum_j |w_j| \Big\}, \qquad \|w\|_2 = 1 \ \text{(additional constraint)}, \tag{A.1}
\]
with $\lambda_s > 0$. The problem (A.1) is equivalent to the following, when denoting the standard scalar product by $\langle \cdot, \cdot \rangle$:
\[
\operatorname*{argmin}_{w \in \mathbb{R}^p} \Big\{ -2\,\big\langle w,\, X_c^T \xi_c \big\rangle + \|w\|_2^2 + 2\lambda_s \sum_j |w_j| \Big\}, \qquad \|w\|_2 = 1,
\]
because the term $\|w\|_2^2$ is constant thanks to the additional constraint. This new problem remains equivalent to the following:
\[
\operatorname*{argmin}_{w \in \mathbb{R}^p} \Big\{ \|X_c^T \xi_c\|_2^2 - 2\,\big\langle w,\, X_c^T \xi_c \big\rangle + \|w\|_2^2 + 2\lambda_s \sum_j |w_j| \Big\}, \qquad \|w\|_2 = 1,
\]
since the norm of the empirical covariance $\|X_c^T \xi_c\|_2^2$ is constant. Then, thanks to the properties of the Euclidean norm, it can be rewritten as:
\[
\operatorname*{argmin}_{w \in \mathbb{R}^p} \Big\{ \tfrac{1}{2} \|c - w\|_2^2 + \lambda_s \|w\|_1 \Big\}, \qquad \|w\|_2 = 1, \tag{A.2}
\]
with $c = X_c^T \xi_c$ (halving the objective only rescales the penalty constant). Actually, in the case of a univariate response, the formulation (A.2) is natural. Indeed, in the standard (non-sparse) PLS, the optimal weight vector $w$ is the normalized dominant singular vector of the covariance matrix $X^T \xi$. However, when the response is univariate, the matrix $X^T \xi$ is a vector and the solution for $w$ is the vector $X^T \xi$ normalized to 1. This corresponds exactly to the solution of the problem:
\[
\operatorname*{argmin}_{w \in \mathbb{R}^p} \|c - w\|_2^2, \qquad \|w\|_2 = 1,
\]
(without the $\ell_1$ penalty). The solution of the penalized problem (A.2) defines the first component ($k = 1$) of the sparse PLS. We use deflated predictors and response to construct the following components ($k > 1$).

A.1.2 Resolution of the sparse PLS problem

Applying the method of Lagrange multipliers, the problem (A.2) becomes:
\[
\operatorname*{argmin}_{w \in \mathbb{R}^p} \Big\{ \tfrac{1}{2} \|c - w\|_2^2 + \lambda_s \|w\|_1 + \mu \big( \|w\|_2^2 - 1 \big) \Big\}, \tag{A.3}
\]
with $\mu > 0$. The objective is continuous and convex, thus strong duality holds and the solutions of the primal and dual problems are equivalent.

To solve the problem (A.3), we use the proximity operator (also called proximal operator), defined as the solution of the following problem (Bach et al., 2012):
\[
\operatorname*{argmin}_{w \in \mathbb{R}^p} \Big\{ \tfrac{1}{2} \|c - w\|_2^2 + f(w) \Big\}, \tag{A.4}
\]
for any fixed $c \in \mathbb{R}^p$ and any function $f : \mathbb{R}^p \to \mathbb{R}$. It is denoted by $\operatorname{prox}_f(c)$. When $f(\cdot)$ corresponds to the Elastic Net penalty (combination of $\ell_1$ and $\ell_2$ penalties), i.e. when considering the problem (with $\lambda > 0$ and $\mu > 0$):
\[
\operatorname*{argmin}_{w \in \mathbb{R}^p} \Big\{ \tfrac{1}{2} \|c - w\|_2^2 + \frac{\mu}{2} \sum_{j=1}^{p} (w_j)^2 + \lambda \sum_{j=1}^{p} |w_j| \Big\}, \tag{A.5}
\]
the closed-form solution is explicitly given by the proximal operator $\operatorname{prox}_{\frac{\mu}{2}\|\cdot\|_2^2 + \lambda\|\cdot\|_1}$, which is in particular the composition of $\operatorname{prox}_{\frac{\mu}{2}\|\cdot\|_2^2}$ and $\operatorname{prox}_{\lambda\|\cdot\|_1}$ (Yu, 2013, Theo. 4), i.e.
\[
\operatorname{prox}_{\frac{\mu}{2}\|\cdot\|_2^2 + \lambda\|\cdot\|_1}(c) \;=\; \operatorname{prox}_{\frac{\mu}{2}\|\cdot\|_2^2} \circ \operatorname{prox}_{\lambda\|\cdot\|_1}(c)\,.
\]
Both proximal operators $\operatorname{prox}_{\lambda\|\cdot\|_1}$ and $\operatorname{prox}_{\frac{\mu}{2}\|\cdot\|_2^2}$ are known (Bach et al., 2012), respectively being:
\[
\operatorname{prox}_{\lambda\|\cdot\|_1}(c) = \Big( \operatorname{sgn}(c_j)\,\big(|c_j| - \lambda\big)_+ \Big)_{j=1:p}\,, \qquad
\operatorname{prox}_{\frac{\mu}{2}\|\cdot\|_2^2}(c) = \frac{1}{1+\mu}\, c\,,
\]
where $\operatorname{sgn}(\cdot)\,(|\cdot| - \lambda)_+$ is the soft-thresholding operator. Eventually, the coordinates of the solution are:
\[
\operatorname{prox}_{\frac{\mu}{2}\|\cdot\|_2^2 + \lambda\|\cdot\|_1}(c) = \Big( \frac{1}{1+\mu}\,\operatorname{sgn}(c_j)\,\big(|c_j| - \lambda\big)_+ \Big)_{j=1:p}\,, \tag{A.6}
\]
which corresponds to the normalized soft-thresholding operator applied to the vector $c = X_c^T \xi_c$.

We use the solution (A.6) of the Elastic Net problem (A.5), where $\lambda = \lambda_s$ and $\mu$ is chosen so that the solution has a unitary norm, to find a candidate point and then the solution (by convexity) of the dual problem (A.3).

Finally, we have reformulated the problem defining the sparse PLS as a least squares problem with an Elastic Net penalty, and we have shown that the solution of this problem is the (normalized) soft-thresholding operator.
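For illustration, the closed-form solution above can be coded directly. The following R snippet is a minimal sketch (not the plsgenomics implementation): the helper names soft_threshold and spls_weight are ours, Xc and xi_c stand for the centered predictors and pseudo-response, and lambda is treated as an absolute threshold (the sparsity parameter of the main text may be scaled differently in the package).

```r
# Minimal sketch: first sparse PLS weight vector as the normalized
# soft-thresholding of c = Xc^T xi_c, following Eq. (A.6).
soft_threshold <- function(c_vec, lambda) {
  sign(c_vec) * pmax(abs(c_vec) - lambda, 0)
}

spls_weight <- function(Xc, xi_c, lambda) {
  c_vec <- as.numeric(crossprod(Xc, xi_c))  # c = Xc^T xi_c (empirical covariance, up to a constant)
  w <- soft_threshold(c_vec, lambda)        # proximal operator of the l1 penalty
  nrm <- sqrt(sum(w^2))
  if (nrm == 0) return(w)                   # lambda too large: every coordinate is thresholded out
  w / nrm                                   # the normalization plays the role of the 1/(1+mu) factor
}
```

Dividing the thresholded vector by its Euclidean norm corresponds to choosing µ such that the solution has unit norm, as discussed above.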

A.1.3 Adaptive penalty

When considering an adaptive penalty, the optimization problem associated with the sparse PLS can similarly be rewritten as:
\[
\operatorname*{argmin}_{w \in \mathbb{R}^p} \Big\{ \tfrac{1}{2} \|c - w\|_2^2 + \sum_{j=1}^{p} \lambda_j |w_j| \Big\}, \qquad \|w\|_2 = 1, \tag{A.7}
\]
with the penalty constants $\lambda_j = \lambda\,\gamma_j$ (c.f. main text). By a similar reasoning (continuity and convexity), it is possible to use a Lagrange multiplier to solve the problem (A.7).

In order to explicitly derive the solution, we will use the proximal operator that is the solution of the following problem:
\[
\operatorname*{argmin}_{w \in \mathbb{R}^p} \Big\{ \tfrac{1}{2} \|c - w\|_2^2 + \frac{\mu}{2} \sum_{j=1}^{p} (w_j)^2 + \sum_{j=1}^{p} \lambda_j |w_j| \Big\}, \tag{A.8}
\]
with $\mu > 0$.


If $f_1(w) = \sum_j \lambda_j |w_j|$, it can be shown that the solution of the problem (A.4) when considering $f = f_1$ is given by:
\[
\operatorname{prox}_{f_1}(c) = \Big( \operatorname{sgn}(c_j)\,\big(|c_j| - \lambda_j\big)_+ \Big)_{j=1:p}\,,
\]
because the subgradient of $f_1$ is given by $\nabla_s f_1(w) = \big(\lambda_j \operatorname{sgn}(w_j)\big)_{j=1}^{p}$ (Eksioglu, 2011).

The link between subgradient and proximal operator is described in Bach et al. (2012). In particular, $w^* = \operatorname{prox}_f(c)$ if and only if $c - w^* \in \partial f(w^*)$ for any couple $(c, w^*) \in \mathbb{R}^p \times \mathbb{R}^p$, where $\partial f(w^*)$ is the subdifferential of $f$ at point $w^*$, i.e. the set of all subgradients $\nabla_s f(w^*)$ of $f$ at point $w^*$. If $f$ is differentiable at $w^*$, then the only subgradient is the gradient $\nabla f(w^*)$.

The proximal operator corresponding to the function $f_2(w) = \frac{\mu}{2} \sum_j (w_j)^2$ is known (c.f. previously):
\[
\operatorname{prox}_{f_2}(c) = \frac{1}{1+\mu}\, c\,.
\]
Eventually, thanks to Theorem 4 in Yu (2013), the solution of problem (A.8) is explicitly defined as the composition of $\operatorname{prox}_{f_1}$ and $\operatorname{prox}_{f_2}$:
\[
\operatorname{prox}_{f_1 + f_2}(c) = \operatorname{prox}_{f_2} \circ \operatorname{prox}_{f_1}(c)\,,
\]
for any $c \in \mathbb{R}^p$. Thus, the solution of the problem (A.8) is given by:
\[
\operatorname{prox}_{f_1 + f_2}(c) = \Big( \frac{1}{1+\mu}\,\operatorname{sgn}(c_j)\,\big(|c_j| - \lambda_j\big)_+ \Big)_{j=1:p}\,,
\]
with $c = X_c^T \xi_c$.

We choose $\mu$ so that the norm of the solution is unitary, to find a candidate point and thus the solution (by convexity) of the adaptive problem (A.7).

A.2 Conditions for stability selection

The result by Meinshausen and Bühlmann (2010) regarding the expected number of wrongly selected variables is derived for $\Lambda \subset \mathbb{R}^+$ under two conditions: (i) the indicators $\big(\mathbb{1}_{\{j \in S^{\ell}\}}\big)_{j \in S_0^{c}}$ are exchangeable for any $\ell \in \Lambda$; (ii) the original selection procedure is not worse than random guessing. The first assumption means that the considered method does not “prefer” to select some covariates rather than others within the set of non-pertinent predictors. This hypothesis seems reasonable in our SPLS framework. The second one is verified according to the results of our simulations (c.f. Section 3). Moreover, in the method we consider, the grid of hyper-parameters lies in $(\mathbb{R}^+)^3$; however, the parameter that truly influences the sparsity of the estimation is the parameter $\lambda_s \in \mathbb{R}^+$. Therefore, the sparse PLS appears to be a reasonable framework to apply the concept of stability selection.
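As an illustration of the stability selection principle, the following R sketch estimates selection probabilities by refitting on random subsamples; fit_and_select is a hypothetical wrapper around the sparse PLS fit that returns the indices of the selected variables for one point of the hyper-parameter grid, and the subsample size and threshold follow Meinshausen and Bühlmann (2010) rather than the exact code used for the paper.

```r
# Illustrative sketch of stability selection: selection probabilities are
# estimated by the selection frequency over random subsamples of size n/2,
# for each point of the hyper-parameter grid Lambda.
stability_scores <- function(X, y, hyper_grid, fit_and_select, n_resample = 100) {
  n <- nrow(X); p <- ncol(X)
  counts <- matrix(0, nrow = nrow(hyper_grid), ncol = p)
  for (b in seq_len(n_resample)) {
    sub <- sample.int(n, size = floor(n / 2))
    for (l in seq_len(nrow(hyper_grid))) {
      sel <- fit_and_select(X[sub, , drop = FALSE], y[sub], hyper_grid[l, ])
      counts[l, sel] <- counts[l, sel] + 1
    }
  }
  counts / n_resample   # estimated selection probabilities over the grid
}
# A variable j is declared "stable" if its maximum selection probability over
# the (reduced) grid exceeds the threshold pi_thr, e.g. 0.75.
```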

A.3 Comparison with state-of-the-art approaches

In the literature, other methodologies have been proposed to adapt (sparse) PLS for binary classification. We detail here different approaches based on (sparse) PLS and GLMs, especially regarding the potential issues raised by the combination of two optimization frameworks.

PLS and GLMs. To overcome the convergence issue in the IRLS algorithm, Marx (1996) proposed to solve the weighted least squares problem at each IRLS step with a PLS regression, i.e. β^(t+1) is computed by weighted PLS regression of the pseudo-response ξ^(t) onto the predictors X. However, such an iterative scheme does not correspond to the optimization of an objective function. Hence, the convergence of the procedure cannot be guaranteed and the potential solution is not clearly defined.

Alternatively, Wang et al. (1999) and Nguyen and Rocke (2002) proposed to achieve the dimension reduction before the logistic regression. Their algorithm uses the PLS regression as a preliminary compression step. The components [tk]k=1:K in the subspace of dimension K are then used in the logistic regression instead of the predictors. Therefore, the IRLS algorithm does not deal with high dimensional data (as K < p). In this context, the PLS algorithm treats the discrete response as continuous. Such an approach seems counter-intuitive, as it ignores that PLS is defined to solve a linear regression problem and neglects the inherent heteroskedastic context. This algorithm is called PLS-log in the following. It can be noted that Nguyen and Rocke (2002) or Boulesteix (2004) also proposed to use discriminant analysis as a classifier after the PLS step. This method, known as PLS-DA, is not directly linked to the GLM framework, but we cite it as an alternative for classification with PLS-based approaches. It can be noted that Barker and Rayens (2003) proposed a slightly different implementation of PLS-DA, which is however equivalent to Boulesteix's approach in the binary response case, since they both rely on equivalent univariate response PLS algorithms (De Jong, 1993; Boulesteix and Strimmer, 2007).

Then, Ding and Gentleman (2005) proposed the GPLS method. They introduced a modification in Marx's algorithm based on the Firth procedure (Firth, 1993), in order to avoid the non-convergence and the potentially infinite parameter estimates in logistic regression. However, this approach is also characterized by the absence of an explicit optimization criterion. Eventually, as introduced previously, Fort and Lambert-Lacroix (2005) proposed to integrate the PLS dimension reduction step after a Ridge-regularized IRLS algorithm. We presented the adaptation of such a methodology in the context of sparse PLS in the previous section.

Sparse PLS and GLMs. More recently, based on the SPLS algorithm by Chun and Keles (2010), Chung and Keles (2010) presented two different approaches. The first one, called SGPLS, is a direct extension of the GPLS algorithm by Ding and Gentleman (2005). It solves the successive weighted least squares problems of IRLS using a sparse PLS regression, with the idea that variable selection reduces the model complexity and helps to overcome numerical singularities. Unfortunately, our simulations will show that convergence issues remain. Indeed, the use of SPLS does not resolve the issue linked to the absence of an associated optimization problem. The second approach is a generalization of the PLS-log algorithm and uses sparse PLS to reduce the dimension before running the logistic regression on the SPLS components. This method will be called SPLS-log. In both cases, i.e. in SGPLS and SPLS-log, the iterative optimization in the (modified) IRLS algorithm depends on the number K of components and on the sparsity parameter λs. Thus, the convergence of the algorithm is potentially affected by the choice of the hyper-parameters.

Eventually, we cite the SPLS-DA method developed by Chung and Keles (2010) or Lê Cao et al. (2011). Generalizing the approach from Boulesteix (2004), they used sparse PLS as a preliminary dimension reduction step before a discriminant analysis. In the binary response case, thanks to the equivalence between the works of Boulesteix (2004) and Barker and Rayens (2003), the sparse extension of Barker and Rayens' PLS-DA for binary classification corresponds to the work of Chung and Keles (2010) or Lê Cao et al. (2011). A disadvantage of sparse PLS-DA approaches is that, in the multi-group classification case, they both rely on multivariate response sparse PLS algorithms, which do not admit a closed-form solution. On the contrary, our approach uses a univariate response sparse PLS algorithm (which admits a closed-form solution, c.f. main text) in both binary and multi-group classifications, being computationally efficient in both cases.


A.4 Performance evaluation

In order to assess the performance of our method, we compare it to other state-of-the-art approaches accounting for sparsity and/or performing compression. We also use a “reference” method, called GLMNET (Friedman et al., 2010), that performs variable selection by solving the GLM likelihood maximization penalized by an ℓ1-norm penalty for selection and an ℓ2-norm penalty for regularization, also known as the Elastic Net approach (Zou and Hastie, 2005). Computations were performed using the software environment for statistics R. The GPLS approach used in our computations comes from the archive of the former R-package gpls, the methods logit-PLS and PLS-DA from the package plsgenomics, SGPLS, SPLS-log and SPLS-DA from the package spls, and GLMNET from the package glmnet.
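For reference, a typical call to the GLMNET baseline in R looks as follows; x, y and x_test are placeholders for the training predictors, binary response and test predictors, and the value alpha = 0.5 mixing the ℓ1 and ℓ2 penalties is only an example (it is not necessarily the setting used in our experiments).

```r
library(glmnet)

# Elastic Net penalized logistic regression; the lambda grid is chosen
# internally by cv.glmnet, as recommended by the glmnet authors.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)

# Class prediction on a test set and set of selected (non-zero) coefficients.
pred <- predict(cvfit, newx = x_test, s = "lambda.min", type = "class")
coefs <- as.matrix(coef(cvfit, s = "lambda.min"))
selected <- which(coefs[-1, 1] != 0)   # drop the intercept row
```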

A.5 Complements on the simulation study

A.5.1 Simulation design

We consider a predictor matrix X of dimension n × p, with n = 100 fixed, and p = 100, 500, 1000, 2000, so that we examine low and high dimensional models. To simulate redundancy within predictors, X is partitioned into k* blocks (10 or 50 in practice), denoted by Gk for block k. Then, for each predictor j ∈ Gk, Xij is generated depending on a latent variable Hk as Xij = Hik + Fij, with Hik ∼ N(0, σH²) and some noise Fij ∼ N(0, σF²). The correlation between the blocks is regulated by σH²: the higher σH², the less dependency. In the following, we consider σH/σF = 2 or 1/3.

The true vector of predictor coefficients β* is structured according to the blocks Gk in X. Actually, ℓ* blocks in β* are randomly chosen among the k* ones to be associated with non-null coefficients (with ℓ* = 1 or k*/2). All coefficients within the ℓ* designated blocks are constant (with value 1/2). In our model, the relevant predictors contributing to the response will be those with non-zero coefficients, and our purpose will be to retrieve them via selection. The response variable Yi for observation i is sampled as a Bernoulli variable, with parameter πi* that follows a logistic model: πi* = logit⁻¹(xiᵀβ*).
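The generative model above can be reproduced with a few lines of R; the sketch below is only an illustration of the design (block membership, latent variables and logistic response), with function and argument names that are ours and not taken from the simulation code used for the paper.

```r
# Sketch of the simulation design: block-structured predictors and a
# Bernoulli response following a logistic model.
simulate_data <- function(n = 100, p = 2000, k_star = 10, l_star = 1,
                          sigma_H = 2, sigma_F = 1, beta_val = 0.5) {
  block <- rep(seq_len(k_star), length.out = p)               # block membership G_k of each predictor
  H <- matrix(rnorm(n * k_star, sd = sigma_H), n, k_star)     # latent variables H_ik
  X <- H[, block] + matrix(rnorm(n * p, sd = sigma_F), n, p)  # X_ij = H_ik + F_ij
  relevant <- sample(seq_len(k_star), l_star)                 # l* blocks carry the signal
  beta <- ifelse(block %in% relevant, beta_val, 0)            # constant coefficients within blocks
  prob <- as.vector(1 / (1 + exp(-(X %*% beta))))             # pi_i = logit^{-1}(x_i^T beta)
  y <- rbinom(n, size = 1, prob = prob)
  list(X = X, y = y, beta = beta)
}
```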

For our method, the parameter values that are tuned by cross-validation are the following: the number of components K varies from 1 to 10; candidate values for the Ridge parameter λR in RIRLS are 31 points that are log10-linearly spaced in the range [10⁻²; 10³]; candidate values for the sparsity parameter λs are 10 points that are linearly spaced in the range [0.05; 0.95]. Other SPLS approaches (SGPLS and SPLS-log) only depend on the hyper-parameters (λs, K), for which candidate values are the same as for our method. Regarding GLMNET, we let the procedure choose the grid of hyper-parameters by itself, as recommended by the authors in the documentation.

A.5.2 Additional simulation results

Convergence. Tab. A.1 summarizes the convergence of the different methods (logit-SPLS, SGPLS and SPLS-log) during the cross-validation procedure (including the tuning of K) on the simulations, depending on the number of predictors p. Our approach logit-SPLS always converges, contrary to the other SPLS approaches for logistic regression.

Tab. A.2 summarizes the convergence of the different methods on the simulations when fitting the model after tuning the hyper-parameters (including K) by cross-validation, depending on the number p of predictors. Again, our approach logit-SPLS always converges, contrary to the other SPLS approaches for logistic regression.

In addition, Tab. A.3 shows the percentage of convergence for the different SPLS approaches across repeated cross-validation runs. We see a similar pattern as in Tab. A.2: only our method logit-SPLS almost certainly converges.

Table A.1. Averaged percentage of models that converged during cross-validation tuning of the hyper-parameters, for different values of p.

Method        p = 100   p = 500   p = 1000   p = 2000
sgpls            37        34        33         33
spls-log         44        67        71         74
logit-spls      100       100       100        100

Table A.2. Averaged percentage of model fits that converged over 75 simulations, for different values of p. Hyper-parameters are tuned by cross-validation.

Method        p = 100   p = 500   p = 1000   p = 2000
gpls             66        59        61         56
sgpls            33        23        23         23
spls-log         84        52        39         32
logit-spls      100       100       100        100

Table A.3. Averaged percentage of runs that converged across repeated cross-validations (tuning of all hyper-parameters, including K).

Method        p = 100   p = 500   p = 1000   p = 2000
sgpls            37        34        33         33
spls-log         44        67        71         74
logit-spls      100       100       100        100


Cross-validation stability. Fig. A.1 illustrates the stability of the cross-validation procedure for the different SPLS approaches regarding the number of components. Our approach logit-SPLS always chooses K = 1, while the other SPLS approaches mostly return K = 1. A first comment concerns the stability of the cross-validation procedure: our approach is more stable regarding the choice of K than the other SPLS methods. A second comment is that, as explained in the manuscript, the stability of the cross-validation is directly linked to the convergence of the method (c.f. Tab. A.2). Our method always converges on our simulations and is thus more stable regarding cross-validation than the other SPLS approaches, which do not converge most of the time and are less stable when tuning hyper-parameters. In addition, based on these results, we decided to set K = 1 and to tune only the sparsity parameter λs and the Ridge parameter λR when evaluating the performance of the different approaches (Tab. 3 in the manuscript), to save computation time.

Prediction and selection. Tabs. A.4, A.5 and A.6 collect the results regarding performance in prediction and selection (sensitivity, specificity, accuracy) for the different approaches compared in the simulation study, for data simulated with p = 100, 500 and 1000 respectively. These results are consistent with the case p = 2000 presented in the manuscript. In detail, approaches that combine compression and variable selection (sparse PLS) achieve better prediction performance than compression-only (PLS) or selection-only (GLMNET) approaches. Regarding selection, sparse PLS is generally better in terms of selection sensitivity (true positive rate) than GLMNET, which is too conservative. However, our approach logit-SPLS seems to select fewer false positives than the other SPLS approaches, since its specificity is higher for a similar accuracy level.


[Figure A.1: bar plots showing, for each method (logit.spls.adapt, sgpls, spls.log) and each dimension (p = 100, 500, 1000, 2000), the proportion of cross-validation repetitions choosing each number of components K ∈ {1, …, 10}.]

Fig. A.1. Values chosen for K by cross-validation over repetitions, for the different SPLS approaches and for different values of p (with n = 100) on simulated data.

Table A.4. Prediction error and selection sensitivity/specificity (if relevant) when p = 100, for non-sparse or sparse approaches (delimited by the line).

Method        Prediction error   Selection sensitivity   Selection specificity   Selection accuracy
gpls          0.51 ± 0.30        /                        /                       /
pls-da        0.22 ± 0.08        /                        /                       /
logit-pls     0.20 ± 0.08        /                        /                       /
----------------------------------------------------------------------------------------------------
glmnet        0.17 ± 0.08        0.77                     0.76                    0.71
logit-spls    0.14 ± 0.07        0.78                     0.86                    0.83
sgpls         0.14 ± 0.07        0.86                     0.77                    0.83
spls-da       0.15 ± 0.07        0.88                     0.75                    0.83
spls-log      0.12 ± 0.07        0.87                     0.75                    0.82

Table A.5. Prediction error and selection sensitivity/specificity (if relevant) when p = 500, for non-sparse or sparse approaches (delimited by the line).

Method        Prediction error   Selection sensitivity   Selection specificity   Selection accuracy
gpls          0.47 ± 0.31        /                        /                       /
pls-da        0.22 ± 0.08        /                        /                       /
logit-pls     0.19 ± 0.07        /                        /                       /
----------------------------------------------------------------------------------------------------
glmnet        0.18 ± 0.07        0.49                     0.93                    0.74
logit-spls    0.13 ± 0.07        0.69                     0.85                    0.80
sgpls         0.12 ± 0.06        0.81                     0.76                    0.81
spls-da       0.14 ± 0.07        0.82                     0.75                    0.81
spls-log      0.13 ± 0.06        0.83                     0.77                    0.81

Computation time. Tab. A.7 shows the averaged computation time for the cross-validation runs of the different approaches on simulated data where n = 100 and p = 100, 500, 1000, 2000. Each run was performed on the cluster grid of the LBBE, equipped with standard multi-core CPUs with frequencies between 2 and 2.5 GHz. For each method, each cross-validation run used a single core of a single CPU, for two reasons: (i) we performed massive simultaneous runs on the cluster; (ii) it was a fair basis for comparison, because the different packages that we used propose different degrees of parallelization in their implementations. It is important to note that our approach logit-SPLS can run on multi-core architectures, which improves on the results presented below.

Table A.6. Prediction error and selection sensitivity/specificity (if relevant) when p = 1000, for non-sparse or sparse approaches (delimited by the line).

Method        Prediction error   Selection sensitivity   Selection specificity   Selection accuracy
gpls          0.48 ± 0.31        /                        /                       /
pls-da        0.21 ± 0.07        /                        /                       /
logit-pls     0.18 ± 0.07        /                        /                       /
----------------------------------------------------------------------------------------------------
glmnet        0.17 ± 0.06        0.37                     0.96                    0.74
logit-spls    0.13 ± 0.06        0.66                     0.85                    0.80
sgpls         0.12 ± 0.06        0.80                     0.77                    0.81
spls-da       0.13 ± 0.06        0.82                     0.75                    0.81
spls-log      0.13 ± 0.06        0.83                     0.75                    0.81

Table A.7. Averaged computation time (in seconds) of cross-validation runs on a single core of a standard CPU, when considering simulated data where n = 100 and p = 100, 500, 1000, 2000.

Method        p = 100   p = 500   p = 1000   p = 2000
glmnet           4.69      4.85       5.39       6.59
logit-spls      72.98    223.13     452.21     706.86
sgpls           79.41    284.62     541.86    1103.32
spls-log         3.63     11.17      20.74      37.30

GLMNET is the most efficient method because its implementation relies on Fortran and C code interfaced with R. SPLS-log is also quite efficient (less than a minute in all cases). Indeed, it uses the glm function from R, which is coded in C. However, as mentioned earlier and in the paper, this function encountered convergence issues in many cases. Our method logit-SPLS is slower, since the cross-validation takes between ∼1 min. (when p = 100) and ∼11 min. (when p = 2000) on average. We can make two comments here: (i) our approach needs to calibrate an additional hyper-parameter λR, but this additional cost is reasonable (a few minutes); (ii) the fast convergence of our approach ensures a lower computation time compared to the SGPLS approach, despite the additional hyper-parameter.

In addition, it can be noted that we are currently working on a C++ implementation of our algorithm, which is expected to speed up the computations compared to the R implementation.

Finally, Tab. A.8 presents the averaged computation time to fit a single model for the different approaches on simulated data where n = 100 and p = 100, 500, 1000, 2000. Each run was performed on the cluster grid of the LBBE, equipped with standard multi-core CPUs with frequencies between 2 and 2.5 GHz. For each method, each model-fitting run used a single core of a single CPU.

All methods are computationally efficient at fitting a single model, except for SGPLS. The non-convergence of this approach requires the algorithm to iterate further, until the limit set by the user. It can be noted that the cost of additional iterations in the case of SPLS-log is counter-balanced by the efficient use of the glm function. However, this does not guarantee its convergence (c.f. previously).


Table A.8. Averaged computation time (in seconds) of a single fit run on a single core of a standard CPU, when considering simulated data where n = 100 and p = 100, 500, 1000, 2000.

Method        p = 100   p = 500   p = 1000   p = 2000
glmnet           0.01      0.03       0.06       0.11
logit-spls       0.05      0.17       0.35       0.60
sgpls            0.89      3.70       7.86      17.92
spls-log         0.03      0.09       0.19       0.40

A.6 Complements on the breast cancer data analysis

A.6.1 Computation details

We applied the methods GLMNET, logit-PLS, logit-SPLS (adaptive or not), SGPLS and SPLS-log to our data set. We fit each model over a hundred resamplings, where observations are randomly split into training and test sets with a 70%/30% ratio. For the prediction task, on each resampling, the parameter values of each method are tuned by 10-fold cross-validation on the training set, respecting the following grid (for our method logit-SPLS): K ∈ {1, …, 8}; candidate values for the Ridge parameter λR in RIRLS are 31 points that are log10-linearly spaced in the range [10⁻²; 10³]; candidate values for the sparsity parameter λs are 10 points that are linearly spaced in the range [0.05; 0.95]. Other SPLS approaches (SGPLS and SPLS-log) only depend on the hyper-parameters (λs, K), for which candidate values are the same as for our method. Regarding GLMNET, we let the procedure choose the grid of hyper-parameters by itself, as recommended by the authors in the documentation.

A.6.2 Stability selection

Hyper-parameter grid. In the study of the stability selection on the breast cancer data set, regarding our approach logit-SPLS, we use as a basis the same grid Λ of candidate values for (λs, λR, K) as in the cross-validation case (c.f. Section A.6.1). As stated in the manuscript, the grid is then reduced to control the expected number of false positives. For the other SPLS approaches, we apply the same procedure, based on the grid for (λs, K). Regarding GLMNET, the grid of candidate values for the penalty parameter is chosen by the procedure itself, but we then apply the same framework to extract the set of stable selected variables (as detailed in the manuscript, Section 2.4).

Selected genes. The overlap between the genes selected by the different approaches based on the stability selection procedure (for a threshold πthr = 0.75) is given in Fig. A.2. We can make two comments: (i) the 28 genes selected by GLMNET are all retrieved by our approach logit-SPLS (over 133 selected genes). In addition, the genes with the highest selection scores (i.e. maximum estimated probability to be selected) are the same between the two methods (c.f. Tabs. A.9 and A.10). Thus, the selection procedure based on our logit-SPLS method is consistent with our baseline GLMNET. (ii) Genes selected by the other SPLS-based approaches (SPLS-log and SGPLS) are consistent neither with the ones selected by GLMNET nor with those selected by logit-SPLS. On the contrary, these two methods select 50 common genes over respectively 58 and 70 selected genes for SPLS-log and SGPLS. However, the reliability of these results is questionable because of the non-convergence of these two methods (c.f. Tab. 4 in the manuscript). It can be noted that similar observations (consistency between logit-SPLS and GLMNET; SPLS-log and SGPLS differ) can be made for other levels of the probability threshold πthr.

Tabs. A.9 and A.10 give the lists of genes that were selected respectively by GLMNET and logit-SPLS thanks to the stability selection procedure applied to the breast cancer data set (in particular to the 5000 most differentially expressed genes between the two conditions, relapse or not). Genes are identified by their ProbeID on the Affymetrix U133-Plus 2.0 chip (c.f. Guedj et al., 2012). Gene identification (Symbol, Entrezid and Name) is recovered thanks to the annotate and hgu133plus2.db R-packages, which are available on Bioconductor (https://www.bioconductor.org). Some ProbeIDs were not identified and correspond to blank lines. On the contrary, other ProbeIDs seem to correspond to two genes and are present twice.

[Figure A.2: pairwise Venn diagrams of the sets of genes (out of the 5000 candidates) selected by logit.spls, glmnet, sgpls and spls.log.]

Fig. A.2. Overlap between the genes selected by the different methods thanks to the stability selection procedure, when taking a threshold πthr = 0.75.


A.7 Sparse PLS for multi-group classification

We generalize our approach to a multi-categorical response. This problem is known as multinomial logistic regression or polytomous regression (McCullagh and Nelder, 1989) and will be called multinomial sparse PLS in the sequel.

A.7.1 Multinomial logistic regression

The response $y_i$ takes its values in a discrete set $\{0, \dots, G\}$ corresponding to $G+1$ groups or classes of observations. The associated variable $Y_i$ ($i = 1, \dots, n$) follows a multi-categorical distribution where $\mathbb{P}(Y_i = g \,|\, x_i) = \pi_{ig}$ for any class $g$. Based on a direct generalization of the logistic model, a class of reference is set (generally the class 0) and, for each class $g \neq 0$, the probability $\pi_{ig}$ that $Y_i = g$ depends on a linear combination of the predictors such that:
\[
\log\!\left(\frac{\pi_{ig}}{\pi_{i0}}\right) = z_i^T \beta_g\,, \tag{A.9}
\]
with a specific vector of coefficients $\beta_g \in \mathbb{R}^{p+1}$ for each class $g = 1, \dots, G$. Indeed, the probabilities $(\pi_{ig})_{g=1:G}$ determine the probability $\pi_{i0}$ since $\sum_{g=0}^{G} \pi_{ig} = 1$. A column of 1s is added to the matrix $Z$ to incorporate the intercept in the linear combination $z_i^T \beta_g$.


Table A.9. List of the 28 genes selected by GLMNET thanks to the stability selection procedure (at threshold πthr = 0.75) on the breast cancer data set. Genes are sorted by selection score (maximum estimated probability to be selected). Genes are identified by their ProbeID on the Affymetrix U133-Plus 2.0 chip.

PROBEID        SYMBOL          ENTREZID     GENE NAME                                                      Selection score
217048_at                                                                                                   0.99
1553561_at     TAS2R50         259296       taste 2 receptor member 50                                      0.97
233227_at      KIAA1109        84162        KIAA1109                                                        0.97
218307_at      RSAD1           55316        radical S-adenosyl methionine domain containing 1               0.97
241034_at      GLS             2744         glutaminase                                                     0.97
211870_s_at    PCDHA3          56145        protocadherin alpha 3                                           0.95
211870_s_at    PCDHA2          56146        protocadherin alpha 2                                           0.95
1561665_at     LOC100421171    100421171    thyroid hormone receptor interactor 11 pseudogene               0.95
216738_at                                                                                                   0.92
236899_at                                                                                                   0.91
227240_at      NGEF            25791        neuronal guanine nucleotide exchange factor                     0.91
234739_at                                                                                                   0.91
1560522_at     DLGAP1-AS3      201477       DLGAP1 antisense RNA 3                                          0.89
1554988_at     SLC9C2          284525       solute carrier family 9 member C2 (putative)                    0.86
217360_x_at    IGHA1           3493         immunoglobulin heavy constant alpha 1                           0.85
217360_x_at    IGHG1           3500         immunoglobulin heavy constant gamma 1 (G1m marker)              0.85
217360_x_at    IGHG3           3502         immunoglobulin heavy constant gamma 3 (G3m marker)              0.85
217360_x_at    IGHM            3507         immunoglobulin heavy constant mu                                0.85
217360_x_at    IGHV4-31        28396        immunoglobulin heavy variable 4-31                              0.85
229485_x_at    SHISA3          152573       shisa family member 3                                           0.85
239052_at                                                                                                   0.85
242870_at                                                                                                   0.84
228776_at      GJC1            10052        gap junction protein gamma 1                                    0.83
244849_at      SEMA3A          10371        semaphorin 3A                                                   0.83
217391_x_at                                                                                                 0.82
229215_at      ASCL2           430          achaete-scute family bHLH transcription factor 2                0.82
235945_at                                                                                                   0.82
1554708_s_at   SPATA6L         55064        spermatogenesis associated 6 like                               0.79
208777_s_at    PSMD11          5717         proteasome 26S subunit, non-ATPase 11                           0.79
213651_at      INPP5J          27124        inositol polyphosphate-5-phosphatase J                          0.78
225792_at      HOOK1           51361        hook microtubule tethering protein 1                            0.78
1570136_at                                                                                                  0.76
1560692_at     VSTM2A-OT1      285878       VSTM2A overlapping transcript 1                                 0.75
1560692_at     VSTM2A          222008       V-set and transmembrane domain containing 2A                    0.75

The log-likelihood can be explicitly formulated:
\[
\log L(\beta) = \sum_{i=1}^{n} \left[\, \sum_{g=1}^{G} y_{ig}\, z_i^T \beta_g \;-\; \log\!\Big( 1 + \sum_{g=1}^{G} \exp(z_i^T \beta_g) \Big) \right], \tag{A.10}
\]
where the binary variable $y_{ig} = \mathbb{1}_{\{y_i = g\}}$ indicates the class of the observation $i$ ($\mathbb{1}_{\{A\}}$ is the indicator function valued in $\{0,1\}$, indicating whether the statement $A$ is true (1) or false (0)).

It is possible to rearrange the data in order to formulate a vectorized version of the loss (A.10), and to express the multinomial logistic regression as a logistic regression of a binary response $\mathcal{Y} \in \{0,1\}^{nG}$ against a matrix of rearranged covariates $\mathcal{Z} \in \mathbb{R}^{nG \times (p+1)G}$. The response vector $\mathcal{Y}$ of length $nG$ is defined as follows:
\[
\mathcal{Y} = \big( (y_{1g})_{g=1:G},\ \dots,\ (y_{ig})_{g=1:G},\ \dots,\ (y_{ng})_{g=1:G} \big)^T,
\]
where $y_{ig} = \mathbb{1}_{\{y_i = g\}}$ as previously mentioned. The new covariate matrix $\mathcal{Z}$ of dimension $nG \times (p+1)G$ is defined by blocks as:
\[
\mathcal{Z} = \big[ \mathcal{Z}_1^T, \dots, \mathcal{Z}_i^T, \dots, \mathcal{Z}_n^T \big]^T,
\]
where each block $\mathcal{Z}_i$ is constructed by $G$ diagonal repetitions of the row $z_i^T = (1, x_{i1}, \dots, x_{ip})$ from the original covariate matrix $Z$, i.e.
\[
\mathcal{Z}_i = \begin{pmatrix} z_i^T & & 0 \\ & \ddots & \\ 0 & & z_i^T \end{pmatrix}
\quad \text{($G$ diagonal repeats of the row $z_i^T$)}.
\]
The coefficient vectors $\beta_g \in \mathbb{R}^{p+1}$ (for $g = 1, \dots, G$) are also reorganized in the vector $\mathcal{B} \in \mathbb{R}^{(p+1)G}$ as:
\[
\mathcal{B} = \big( (\beta_{0g})_{g=1:G},\ \dots,\ (\beta_{jg})_{g=1:G},\ \dots,\ (\beta_{pg})_{g=1:G} \big)^T,
\]
where $(\beta_{jg})_{j=0:p}$ are the coordinates of $\beta_g$, so that the response $\mathcal{Y}$ depends on the linear combination $\mathcal{Z}\mathcal{B}$.

Thanks to this reformulation, it is possible to adapt the Ridge IRLS algorithm to estimate the coefficients $\mathcal{B}$ and to infer the probabilities $\pi_{ig}$ that observation $y_i$ belongs to class $g$. The algorithm, that we call MRIRLS, is detailed in Fort et al. (2005).
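The rearrangement can be written compactly; the R sketch below (with hypothetical function names, not taken from plsgenomics) builds the vectorized response and the block design matrix from the original design Z = [1, X] and the labels y ∈ {0, …, G}. Note that the columns of each block are here ordered class by class, matching the displayed block-diagonal structure; this is simply a permutation of the covariate-major ordering used for the coefficient vector described above.

```r
# Sketch of the multinomial rearrangement: response of length nG and
# block design matrix of dimension nG x (p+1)G.
expand_multinomial <- function(X, y, G) {
  n <- nrow(X)
  Z <- cbind(1, X)                                        # original design with intercept column
  # Stacked indicators (y_i1, ..., y_iG) for i = 1, ..., n.
  Y_vec <- as.vector(vapply(seq_len(n),
                            function(i) as.numeric(y[i] == seq_len(G)),
                            numeric(G)))
  # Each block is G diagonal repetitions of the row z_i^T.
  blocks <- lapply(seq_len(n), function(i) kronecker(diag(G), t(Z[i, ])))
  list(Y = Y_vec, Z = do.call(rbind, blocks))
}
```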

A.7.2 Multinomial SPLS

The vectorized formulation of the MIRLS algorithm allows us to use our SPLS-based dimension reduction approach. As in the binary case, the MIRLS algorithm (penalized by Ridge) produces, at convergence, a continuous pseudo-response that is suitable for the sparse PLS regression. Thus, our approach, called multinomial-SPLS, directly extends our algorithm logit-SPLS to the multinomial logistic regression. It estimates the linear coefficients $\mathcal{B}$ by sparse PLS, processing compression and variable selection simultaneously. Then, these estimated coefficients are used to get an estimation of the probabilities $\pi_{ig}$. Our procedure is directly inspired by the approach of Fort et al. (2005), which extended the algorithm logit-PLS (Fort and Lambert-Lacroix, 2005) to the multi-categorical case.


Table A.10. List of the top 50 genes (over 133) selected by logit-SPLS thanks to the stability selection procedure (at threshold πthr = 0.75) on the breast cancer data set. Genes are sorted by selection score (maximum estimated probability to be selected). Genes are identified by their ProbeID on the Affymetrix U133-Plus 2.0 chip.

PROBEID        SYMBOL          ENTREZID     GENE NAME                                                        Selection score
1553561_at     TAS2R50         259296       taste 2 receptor member 50                                       1.00
218307_at      RSAD1           55316        radical S-adenosyl methionine domain containing 1                1.00
1560522_at     DLGAP1-AS3      201477       DLGAP1 antisense RNA 3                                           0.99
217048_at                                                                                                    0.99
211870_s_at    PCDHA3          56145        protocadherin alpha 3                                            0.99
211870_s_at    PCDHA2          56146        protocadherin alpha 2                                            0.99
233227_at      KIAA1109        84162        KIAA1109                                                         0.99
220098_at      HYDIN           54768        HYDIN, axonemal central pair apparatus protein                   0.98
220098_at      HYDIN2          100288805    HYDIN2, axonemal central pair apparatus protein (pseudogene)     0.98
234739_at                                                                                                    0.98
1561665_at     LOC100421171    100421171    thyroid hormone receptor interactor 11 pseudogene                0.97
216738_at                                                                                                    0.97
241034_at      GLS             2744         glutaminase                                                      0.97
227240_at      NGEF            25791        neuronal guanine nucleotide exchange factor                      0.97
1554988_at     SLC9C2          284525       solute carrier family 9 member C2 (putative)                     0.95
1560692_at     VSTM2A-OT1      285878       VSTM2A overlapping transcript 1                                  0.95
1560692_at     VSTM2A          222008       V-set and transmembrane domain containing 2A                     0.95
217360_x_at    IGHA1           3493         immunoglobulin heavy constant alpha 1                            0.95
217360_x_at    IGHG1           3500         immunoglobulin heavy constant gamma 1 (G1m marker)               0.95
217360_x_at    IGHG3           3502         immunoglobulin heavy constant gamma 3 (G3m marker)               0.95
217360_x_at    IGHM            3507         immunoglobulin heavy constant mu                                 0.95
217360_x_at    IGHV4-31        28396        immunoglobulin heavy variable 4-31                               0.95
236899_at                                                                                                    0.95
239052_at                                                                                                    0.95
242870_at                                                                                                    0.95
228507_at      PDE3A           5139         phosphodiesterase 3A                                             0.95
229081_at      SLC25A13        10165        solute carrier family 25 member 13                               0.95
1562030_at     LOC284898       284898       uncharacterized LOC284898                                        0.94
227379_at      MBOAT1          154141       membrane bound O-acyltransferase domain containing 1             0.94
225792_at      HOOK1           51361        hook microtubule tethering protein 1                             0.93
234792_x_at    IGHA1           3493         immunoglobulin heavy constant alpha 1                            0.93
234792_x_at    IGHV4-31        28396        immunoglobulin heavy variable 4-31                               0.93
244849_at      SEMA3A          10371        semaphorin 3A                                                    0.93
217697_at                                                                                                    0.93
1556937_at                                                                                                   0.92
1569126_at     CCNC            892          cyclin C                                                         0.92
228776_at      GJC1            10052        gap junction protein gamma 1                                     0.92
229485_x_at    SHISA3          152573       shisa family member 3                                            0.92
232920_at      KIAA1656        85371        KIAA1656 protein                                                 0.92
232920_at      CCDC157         550631       coiled-coil domain containing 157                                0.92
1563057_at                                                                                                   0.91
1568666_at     PLIN5           440503       perilipin 5                                                      0.91
1570116_at                                                                                                   0.91
238824_at      RPS29           6235         ribosomal protein S29                                            0.91
243583_at                                                                                                    0.91
206349_at      LGI1            9211         leucine rich glioma inactivated 1                                0.91
211064_at      ZNF493          284443       zinc finger protein 493                                          0.91
231913_s_at    BRCC3           79184        BRCA1/BRCA2-containing complex subunit 3                         0.91
1570136_at                                                                                                   0.89
206202_at      MEOX2           4223         mesenchyme homeobox 2                                            0.89


In this context, the SPLS step considers: i) the pseudo-response $\xi \in \mathbb{R}^{nG}$ constructed from the reformulated response $\mathcal{Y}$; ii) the centered version $\mathcal{X}_c$ of the modified covariate matrix $\mathcal{X}$ defined by:
\[
\mathcal{X} = \big[ \mathcal{X}_1^T, \dots, \mathcal{X}_i^T, \dots, \mathcal{X}_n^T \big]^T,
\]


where each block $\mathcal{X}_i$ is constructed by $G$ diagonal repetitions of the row $x_i^T = (x_{i1}, \dots, x_{ip})$ from the original covariate matrix $X$, i.e.
\[
\mathcal{X}_i = \begin{pmatrix} x_i^T & & 0 \\ & \ddots & \\ 0 & & x_i^T \end{pmatrix}
\quad \text{($G$ diagonal repeats of the row $x_i^T$)}.
\]
It corresponds to the matrix $\mathcal{Z}$ where the columns of 1s corresponding to the intercepts have been removed. Thus, the coefficients $(\beta_{0g})_{g=1:G}$ are estimated afterward. These coefficients are ultimately used to compute the class membership probabilities for each observation, following the model (A.9). In the prediction task, an observation is assigned to the class with the highest predicted probability.
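Once the coefficients have been estimated, the class membership probabilities of model (A.9) and the predicted labels can be computed as in the following sketch; beta0 (the intercepts) and Beta (a p × G matrix with one column per non-reference class) are hypothetical names for the estimated coefficients, not objects returned by plsgenomics.

```r
# Sketch of the prediction step: class probabilities from model (A.9) and
# assignment to the most probable class (class 0 is the reference).
predict_multinomial <- function(X_new, beta0, Beta) {
  eta <- sweep(X_new %*% Beta, 2, beta0, "+")   # linear scores x_i^T beta_g + beta_0g
  denom <- 1 + rowSums(exp(eta))
  probs <- cbind(1, exp(eta)) / denom           # columns: class 0, 1, ..., G
  max.col(probs) - 1                            # predicted label in {0, ..., G}
}
```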

At this point, we mention that the error rate that we consider in this case (especially for the tuning of hyper-parameters by V-fold cross-validation, with V = 5 or 10) is the standard error rate, i.e. the proportion of overall mismatches, as previously used in the plsgenomics R-package for multi-class PLS classification.

A.7.3 SPLS components

Since the sparse PLS is applied to the modified covariate matrix $\mathcal{X} \in \mathbb{R}^{nG \times pG}$, the constructed SPLS components represent the matrix $\mathcal{X}$ in a lower dimensional subspace, and not the original matrix $X$. However, it is possible to obtain a low dimensional representation of the original covariates. Indeed, thanks to the construction of the matrix $\mathcal{X}$, the SPLS weight vectors $w_k \in \mathbb{R}^{pG}$ are partitioned as follows:
\[
w_k = \big( (w^k_{j1})_{j=1:p},\ \dots,\ (w^k_{jg})_{j=1:p},\ \dots,\ (w^k_{jG})_{j=1:p} \big)^T,
\]
for $k = 1, \dots, K$. Thus, when multiplying the original predictor matrix $X$ by the weight matrix $\big[(w^k_{jg})_{j=1:p}\big]_{k=1:K} \in \mathbb{R}^{p \times K}$ (for a given class $g$), we obtain a representation of the observations in a lower dimensional space of dimension $K$, as a matrix $T_g \in \mathbb{R}^{n \times K}$. The matrix $T_g$ represents the directions that discriminate the class $g$ versus the reference class 0.

A.7.4 State-of-the-art

It can be noted that Ding and Gentleman (2005) presented a version of the GPLS method suitable for multinomial logistic regression, i.e. the linear regressions inside the iterations of the MIRLS algorithm are processed by weighted PLS regression. Chung and Keles (2010) introduced a similar algorithm based on sparse PLS (an extension of the SGPLS algorithm). However, we used exclusively our multinomial SPLS algorithm in the data analysis. Indeed, based on the conclusions from the binary case, our approach showed better results regarding prediction performance on an experimental data set. Moreover, the dimension of the data is drastically increased by the rearrangement, since the number of observations becomes nG and the number of covariates becomes pG. It is therefore necessary to account for the computational cost and to give priority to computationally efficient methods. In particular, thanks to the Ridge penalty, we showed that our approach converges quickly, hence reducing the computation time.

A.8 Complements on the single T cell data analysis

A.8.1 Computation details

On each resampling, the parameter values of each method are tuned by 10-fold cross-validation on the training set, respecting the following grid: K ∈ {1, …, 4}; candidate values for the Ridge parameter λR in RIRLS are 10 points that are log10-linearly spaced in the range [10⁻²; 10³]; candidate values for the sparsity parameter λs are 10 points that are linearly spaced in the range [0.05; 0.95].

A.8.2 Additional results on the single cell data analysis

Training in the first step of prediction. The manual identification of cells is mainly based on the level of the CCR7 marker. The identified cells mostly correspond to the most extreme values of the CCR7 level. The set of manually identified cells constitutes the training set for the first step of prediction based on multinomial sparse PLS. Fig. A.3 illustrates the representation of the cells in the training set according to the first two PLS components. The distinction between the reference class (“CM”) and both classes from the group of “Effector” cells (“EM” and “TEMRA”) is clearly apparent in the latent subspace, since there is an important gap between the different groups of cells. It confirms that the cells in the training set correspond to the most extreme phenotypes, which appear clearly different.

Gene selection by sparse PLS. The genes that are selected by the multinomial-SPLS during the second round of prediction (as explained in the manuscript) are the following: “CCL4”, “CCR7”, “CST7”, “GNLY”, “GZMB”, “KLRD1”, “LTB”, “S100A4”. These genes have been identified as genes involved in the phenotype (“Effector” or “Memory”) of T cells (Wherry et al., 2007; Willinger et al., 2005b). In particular, “CCR7” and “LTB” are associated with “Memory” cells, while “CCL4”, “CST7”, “GNLY”, “GZMB” and “KLRD1” characterize “Effector” cells.


[Figure A.3: scatter plots of cell scores on the first two PLS components (comp1 vs. comp2), one panel per comparison: “CM vs EM”, “CM vs TEMRA”, “CM vs TSCM”.]

Fig. A.3. Cell scores on the first two PLS components in the latent space that discriminates between the reference class (“CM”) and each other class separately (“EM”, “TEMRA” and “TSCM” respectively, from left to right). Restriction to the T cells in the training set before the first prediction step.