
arXiv: math.st

Multiple Testing and Variable Selection along Least Angle Regression's path

Jean-Marc Azaïs• and Yohann De Castro?

•Institut de Mathématiques de Toulouse, Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse, France

?CERMICS, École des Ponts ParisTech, 77455 Marne la Vallée, France

Abstract: In this article we investigate the outcomes of the standard Least Angle Regression (LAR) algorithm in high dimensions under the Gaussian noise assumption. We give the exact law of the sequence of knots conditional on the sequence of variables entering the model, i.e., the "post-selection" law of the knots of the LAR. Based on this result, we prove an exact control of the False Discovery Rate (FDR) in the orthogonal design case and an exact control of the existence of false negatives in the general design case.

First, we build a sequence of testing procedures on the variables entering the model and we give an exact control of the FDR in the orthogonal design case when the noise level can be unknown. Second, we introduce a new exact testing procedure on the existence of false negatives when the noise level can be unknown. This testing procedure can be deployed after any support selection procedure that produces an estimation of the support (i.e., the indexes of the nonzero coefficients), for any design. The type I error of the test can be exactly controlled as long as the selection procedure satisfies some elementary hypotheses, referred to as "admissible selection procedures". These support selection procedures are such that the estimation of the support is given by the first k variables entering the model, where the random variable k is a stopping time. Monte-Carlo simulations and a real data experiment are provided to illustrate our results.

MSC 2010 subject classifications: Primary 62E15, 62F03, 60G15, 62H10, 62H15; secondary 60E05, 60G10, 62J05, 94A08.
Keywords and phrases: Multiple Testing, False Discovery Rate, High-Dimensional Statistics, Post-Selection Inference.

Preprint version of June 27, 2019

1. Introduction

Parsimonious models have become a ubiquitous tool to tackle high-dimensional representations with a small budget of observations. Successful applications in signal processing (see for instance the pioneering works [13, 12] and references therein), biology (see for instance [3] or [11, Chapter 1.4] and references therein), and beyond have shown that the existence of an (almost) sparse representation in some well chosen basis is a reasonable assumption in practice. These important successes have put the focus on High-Dimensional Statistics in the past decades, and they may be attributed to the deployment of tractable algorithms with strong theoretical guarantees. Among the large panoply of methods, ℓ1-regularization has emerged as striking a fine balance between tractability and performance. Nowadays, sparse regression techniques based on ℓ1-regularization are a common and powerful tool in high-dimensional settings. Popular estimators, among which one may point to the LASSO [27] or SLOPE [10], are known to achieve the minimax rate of prediction and to satisfy sharp oracle inequalities under conditions on the design such as the Restricted Eigenvalue condition [7, 4] or Compatibility [11, 30]. Recent advances have focused on a deeper understanding of these techniques, looking for instance at confidence intervals and testing procedures (see [30, Chapter 6] and references therein) or at false discovery rate control (e.g., [3]). These new results aim at describing the (asymptotic or non-asymptotic) law of the outcomes of ℓ1-minimization regression. This line of work addresses important issues encountered in practice. Assessing the uncertainty of popular estimators gives strong guarantees on the estimation produced, e.g., the false discovery rate is controlled or a confidence interval on linear statistics of the estimator can be given.


1.1. Least Angle Regression algorithm, Support Selection and FDR

The Least Angle Regression (LAR) algorithm has been introduced in the seminal article [14]. This forward procedure produces a sequence of knots λ1, λ2, . . . based on a control of the residuals in ℓ∞-norm. This sequence of knots is closely related to the sequence of knots of the LASSO [27], as they differ by only one rule: "Only in the LASSO case, if a nonzero coefficient crosses zero before the next variable enters, drop it from the active set and recompute the current joint least-squares direction", as mentioned in [28, Page 120] for instance. This paper focuses on the LAR algorithm and presents three useful equivalent formulations of the LAR algorithm, see Section A.1. As far as we know, the last formulation given by Algorithm 1, based on a recursive formulation, is new.

One specific task is to estimate the support of the target sparse vector, namely identify the true positives. In particular, one may take the support of a LASSO (or SLOPE) solution as an estimate of the support. This strategy has been intensively studied in the literature; one may consider [32, 10, 30, 4] and references therein. Support selection has been studied under the so-called "Irrepresentable Condition" (IC), as presented for instance in the books [30, Page 53] and [11, Sec. 7.5.1] and also referred to as the "Mutual Incoherence Condition" [32]. Under the so-called "Beta-Min Condition", one may prove [11, 30] that the LASSO asymptotically returns the true support. In this article, we investigate the existence of false negatives and we present exact non-asymptotic testing procedures to handle this issue, see Section 2.4. Our procedure does not require IC but a much weaker condition, referred to as the "Empirical Irrepresentable Check" (see Section 2.1.3), that can be checked in polynomial time.

Another recent issue is to control the False Discovery Rate (FDR) in high-dimensional settings, as for instance in [3] and references therein, or the Joint family-wise Error Rate as in [8] and references therein. In this paper, we investigate the consecutive spacings of the knots of the LAR as testing statistics and we prove an exact FDR control using a Benjamini–Hochberg procedure [5] in the orthogonal design case, see Section 2.3. Our proof (see Appendix C.7) is based on the Weak Positive Regression Dependency (WPRDS) property, the reader may consult [9] or the survey [23], and on the Knothe–Rosenblatt transport, see for instance [24, Sec. 2.3, P. 67] or [31, P. 20], which is based on conditional quantile transforms.

1.2. Post-Selection Inference in High-Dimensions

This paper presents a class of new tests issued from ℓ1-minimization regression in high dimensions. Following the original idea of [20], we study tests based on the knots of the LAR's path. Note that, conditional on the sequence of indexes selected by the LAR, the law of two consecutive knots has been studied for the first time by [20], leading to the test referred to as the Spacing test [29]. Later, the article [1] proved that this test is unbiased and introduced a studentized version of this test. On the same note, inference after model selection has been studied in several papers, as for instance [15, 25] or [26] for selective inference and a joint estimation of the noise level. This line of work studies a single test on a linear statistic, while one may ask for a simultaneous control of several tests as in the multiple testing framework. To the best of our knowledge, this paper is the first to study the joint law of multiple spacing tests of the LAR's knots in a non-asymptotic framework, see Sections 2.2 and 2.3.

One may point to other approaches for building confidence intervals and/or testing procedures in high-dimensional settings as follows. Simultaneous control of confidence intervals independently of the selection procedure has been studied through the concept of post-selection constants as introduced in [6] and studied for instance in [2]. Asymptotic confidence intervals can be built using the de-sparsified LASSO, the reader may refer to [30, Chapter 5] and references therein. We also point to a recent study [18] of the problem of FDR control as the sample size tends to infinity using the debiased LASSO.


Algorithm 1: LAR algorithm ("recursive" formulation)
Data: Correlations vector Z and variance-covariance matrix R.
Result: Sequence ((λk, ık, εk))k≥1 where λ1 ≥ λ2 ≥ . . . > 0 are the knots, and ı1, ı2, . . . are the variables that enter the model with signs ε1, ε2, . . . (εk = ±1).

/* Define the recursive function Rec() that will be applied repeatedly. The inputs of Rec() are a vector Z, an SDP matrix R and a vector T. */
Function Rec(R, Z, T):
    Compute
        λ = max_{j : Tj < 1} { Zj / (1 − Tj) }   and   i = argmax_{j : Tj < 1} { Zj / (1 − Tj) } .
    /* The following recursions are given by (24), (25) and (26). */
    Update
        x = R_i / R_{ii}
        R ← R − x R_i^T
        Z ← Z − x Z_i
        T ← T + x (1 − T_i)
    return (R, Z, T, λ, i)

1  Set k = 0, T = 0, Z̄ = (Z, −Z) and R̄ as in (10).
/* Use the recursion function to compute the LAR path. */
2  Update k ← k + 1 and compute
        (R̄, Z̄, T, λk, ı̄k) = Rec(R̄, Z̄, T) .
   Set ık = ı̄k mod p and εk = 1 − 2(ı̄k − ık)/p ∈ {±1}.
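To make the recursion concrete, here is a minimal NumPy sketch of Algorithm 1 run on the "signed" representation of (10). It is an illustration only, not the reference implementation available at https://github.com/ydecastro/lar_testing; the function names and the zero-based indexing are ours.

import numpy as np

def rec(R, Z, T):
    # One call to Rec(): pick the knot and update (R, Z, T) by a rank-one recursion.
    active = np.where(T < 1)[0]
    ratios = Z[active] / (1.0 - T[active])
    i = active[np.argmax(ratios)]          # signed index entering the model
    lam = ratios.max()                     # current knot
    x = R[:, i] / R[i, i]
    R = R - np.outer(x, R[i, :])
    Z = Z - x * Z[i]
    T = T + x * (1.0 - T[i])
    return R, Z, T, lam, i

def lar_path(Z, R, n_steps):
    # Run the first n_steps of the LAR path; Z and R are the p-dimensional statistics of (9).
    p = Z.shape[0]
    Zb = np.concatenate([Z, -Z])           # signed vector (Z, -Z)
    Rb = np.block([[R, -R], [-R, R]])      # signed covariance as in (10)
    T = np.zeros(2 * p)
    knots, signed_idx = [], []
    for _ in range(n_steps):
        Rb, Zb, T, lam, jbar = rec(Rb, Zb, T)
        knots.append(lam)
        signed_idx.append(jbar)
    variables = [j % p for j in signed_idx]                          # i_k (zero-based)
    signs = [1 - 2 * ((j - i) // p) for j, i in zip(signed_idx, variables)]
    return np.array(knots), np.array(variables), np.array(signs)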

1.3. Outline of the paper

In Section 2, we introduce our method along with the framework and the notion of "Empirical Irrepresentable Check" (EIC), a condition that can be checked in polynomial time, see Section 2.1.3. We divide our contributions into three parts: Section 2.2 presents the joint law of the knots of the LAR under EIC; Section 2.3 gives an exact control of the FDR in the orthogonal design case; and Section 2.4 presents a general framework to exactly detect false negatives for general designs and some selection procedures referred to as "admissible".

Detailed mathematical statements are given in Section 3, including a method to "studentize" all the tests by an independent estimation of the variance.

Section 4 presents practical applications including an illustration on real data. The Appendix gives details on the different algorithms to compute the LAR, proofs of the statements and a list of notation.

2. Exact Controls using Least Angle Regression, an introductory presentation

2.1. Notation, LAR formulations and Assumptions

2.1.1. Least Angle Regression (LAR)

Consider the linear model in high dimensions where the number of observations n may be less than the number of predictors p. Consider the Least Angle Regression (LAR) algorithm where we denote by (λk)k≥1 the sequence of knots and by (ık, εk)k≥1 the sequence of variables ık ∈ [p] and signs εk ∈ {±1} that enter the model along the LAR path, see for instance [28, Chapter 5.6] or [14] for a standard description of this algorithm. We recall this algorithm in Algorithm 2 and we present equivalent formulations in Algorithm 3 (using orthogonal projections) and Algorithm 1 (using a recursion). The interested reader may find their analysis in Appendices A and B.

In particular, we present here Algorithm 1, which consists in three lines, applying the same function recursively. We introduce some notation that will be useful throughout this paper.


We denote by (ı̄1, . . . , ı̄k) ∈ [2p]^k the "signed" variables that enter the model along the LAR path, with the convention that

    ı̄ℓ := ıℓ + p (1 − εℓ)/2 ,    (1)

so that ı̄ℓ ∈ [2p] is a useful way of encoding both the variable ıℓ ∈ [p] and its sign εℓ = ±1, as used in Algorithm 1. We denote by Z the vector such that Zk is the scalar product between the k-th predictor and the response variable, and we denote by R its correlation matrix, see (9) and (10) for further details.
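As a simple illustration of the encoding (1): with p = 5, the variable ıℓ = 2 entering with sign εℓ = −1 is encoded as ı̄ℓ = 2 + 5 (1 − (−1))/2 = 7; conversely, ıℓ = 7 mod 5 = 2 and εℓ = 1 − 2(7 − 2)/5 = −1, which is the decoding rule used in the last line of Algorithm 1.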

2.1.2. Nested Models

Our analysis is performed conditionally on (ı1, . . . , ın) and in this spirit it can be referred to as a "Post-Selection" procedure. The selected model S is chosen among the family of nested models

    S1 := {ı1} ⊂ S2 := {ı1, ı2} ⊂ · · · ⊂ Sk := {ı1, ı2, . . . , ık} ⊂ · · · ⊂ Sn := {ı1, ı2, . . . , ın} .    (2)

Respectively, denote by

    E1 ⊂ · · · ⊂ Ek := Span(Xı1 , . . . , Xık) ⊂ · · · ⊂ En = R^n ,

the corresponding family of nested subspaces of R^n. In the sequel, denote by Pk(Y) (resp. P⊥k(Y)) the orthogonal projection of the observation Y onto Ek (resp. onto the orthogonal of Ek) for all k ≥ 1.

2.1.3. Empirical Irrepresentable Check

In the sequel we will build our testing statistics on the joint law of the first K knots of the LAR. A detailed discussion on the choice of K is given in Section 2.5; in particular, K will be such that the so-called "Empirical Irrepresentable Check of order K" holds.

This property is defined and studied in this section. Note that the "Irrepresentable Condition" is a standard condition, as presented for instance in the books [30, Page 53] and [11, Sec. 7.5.1], and is also referred to as the "Mutual Incoherence Condition" [32].

Definition 1 (Irrepresentable Condition of order K). The design matrix X satisfies the Irrepresentable Condition of order K if and only if

    ∀S ⊂ [p] s.t. #S ≤ K,   max_{j ∈ [p]\S}  max_{‖v‖∞ ≤ 1}  X_j^T X_S (X_S^T X_S)^{−1} v < 1 ,    (AIrr.)

where X_j denotes the j-th column of X and X_S the submatrix of X obtained by keeping the columns indexed by S.

Remark 1. This condition has been intensively studied in the literature and it is now well established that some random matrix models satisfy it with high probability. For instance, one may refer to the article [32] where it is shown that a design matrix X ∈ R^{n×p} whose rows are drawn independently from a centered Gaussian distribution with a variance-covariance matrix satisfying (AIrr.) (for instance the identity matrix) satisfies (AIrr.) with high probability when n ≳ K log(p − K), where ≳ denotes an inequality up to some multiplicative constant.

In practice, the Irrepresentable Condition (AIrr.) is a strong requirement on the design X and, additionally, this condition cannot be checked in polynomial time. One important feature of our result, see Theorem 3, is that we do not require the Irrepresentable Condition but a much weaker requirement referred to as the "Empirical Irrepresentable Check" of order K.


Figure 1. Observed empirical "phase transition" over 1,000 Monte-Carlo repetitions on the Empirical Irrepresentable Check of order K with ρ := K/n and δ := n/p for different values of p. We considered a design X ∈ R^{n×p} with independent column vectors uniformly distributed on the sphere and an independent y ∈ R^n with i.i.d. standard Gaussian entries, and we computed the variables (ı1, . . . , ın) entering the model with the LAR. The plot represents the value K defined as the largest order for which (AIrr.) holds with respect to (ı1, . . . , ın).

Definition 2 (Empirical Irrepresentable Check of order K w.r.t. (ı1, . . . , ıK)). The design matrix X satisfies the Empirical Irrepresentable Check of order K with respect to (ı1, . . . , ıK) if and only if, for all the supports (Sk)_{k∈[K]} and all the signs (εk)_{k∈[K]} which enter the model at steps k ∈ [K],

    ∀k ∈ [K] , ∀j ∉ Sk := {ı1, . . . , ık} ,   X_j^T X_{Sk} (X_{Sk}^T X_{Sk})^{−1} ε_k < 1 ,    (AIrr.)

where Sk is the selected support at step k (see (2)) and ε_k := (ε1, . . . , εk) is the vector of corresponding signs of the selected variables.

Remark 2. Observe that the Irrepresentable Condition (AIrr.) of order K implies the Empirical Irrepresentable Check (AIrr.) of order K with respect to any possible (ı1, . . . , ıK).

Remark 3. Note that this condition can be checked in O(n × p × K + n × K^{cX,inv}) time, where O(n × K^{cX,inv}) accounts for the computational time of the K operations X_S (X_S^T X_S)^{−1} e, where e ∈ R^k is a fixed vector and S ⊆ [p] has size k, for k = 1, . . . , K. A standard bound for this last term would be cX,inv ≤ 6.3728639, see for instance [19].

Remark 4. When computing the LAR path, one has to compute the values X_j^T X_{Sk} (X_{Sk}^T X_{Sk})^{−1} ε_k; see for instance Algorithm 2 or Algorithm 3, where these values are given by θ as shown by Proposition 2. It means that, in practice, one gets for free the maximal order K for which the Empirical Irrepresentable Check holds. Furthermore, this latter remark shows that the maximal order K for which the Empirical Irrepresentable Check holds is a statistic of the variables ık ∈ [p] and signs εk ∈ {±1} that enter the model along the LAR path.
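For readers who prefer code, the check of Definition 2 can also be carried out directly on the selected variables and their signs; the following sketch is illustrative only (the function name and the zero-based indexing are ours, not the repository's API).

import numpy as np

def empirical_irrepresentable_order(X, selected, signs):
    # Largest k such that the Empirical Irrepresentable Check of Definition 2 holds
    # w.r.t. the first k selected variables (zero-based indices) and their signs.
    K = 0
    for k in range(1, len(selected) + 1):
        S = list(selected[:k])
        eps = np.asarray(signs[:k], dtype=float)
        XS = X[:, S]
        w = np.linalg.solve(XS.T @ XS, eps)        # (X_S^T X_S)^{-1} eps
        vals = X.T @ (XS @ w)                      # X_j^T X_S (X_S^T X_S)^{-1} eps, all j
        outside = np.setdiff1d(np.arange(X.shape[1]), S)
        if np.max(vals[outside]) < 1.0:
            K = k
        else:
            break
    return K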

In Figure 1, we consider the particular case where the design matrix X ∈ R^{n×p} is constructed from independent column vectors uniformly distributed on the sphere and an independent y ∈ R^n with i.i.d. standard Gaussian entries, and we compute the maximal order K for which the Empirical Irrepresentable Check holds. Due to the symmetry of the law of the design and of the independent observation y, note that the probability to observe a given path P{ı1 = i1, . . . , ın = in} is almost the same for all possible paths, namely ≃ 1/(n!). Then we can view the plot in Figure 1 as the largest value K for which (AIrr.) holds when the support {ı1 = i1, . . . , ıK = iK} is uniformly distributed.

In Figure 1, we chose p = 250, 500, 1000, 2000 and different values of n, and we plot a shaded blue line representing 95% of the values ρ (among 1,000 repetitions). Interestingly, these values are concentrated around the solid blue line, referred to as the "empirical phase transition" for the Gaussian model.

Example 5. For example, we found that for p = 1,000 and n = 300 (δ = n/p = 0.3), the Empirical Irrepresentable Check (AIrr.) of order K holds when K is about K ≃ ρ × 300 = 0.13 × 300 = 39.

For every couple (δ = n/p, ρ = K/n) below this curve, we observe that (AIrr.) of order K holds with overwhelming probability when p is sufficiently large. We did not pursue this issue in the present paper but we would like to emphasize that (AIrr.) holds empirically for values K of the order of 10% of n at least. In addition, on the real data of Section 4.2, with about p = 200 variables and n = 700 observations, we found K of the order of 30.

2.2. Result one: Mixture of Gaussian Order Statistics in LAR

Recall that this article concerns the linear model in high dimensions where the number of predictors p can be much greater than the number of samples n. We denote by Y ∈ R^n the response variable and we assume that

    Y = X0 β0 + η ∼ N_n(X0 β0, σ²Σ) ,    (3)

where η ∼ N_n(0, σ²Σ) is some Gaussian noise, the matrix Σ ≻ 0 is some known positive definite matrix, the noise level σ > 0 may be known or may have to be estimated depending on the context, and X0 ∈ R^{n×p}. In this paper, we are interested in selecting the true support S0 of β0, where the support is defined by

    S0 := {k ∈ [p] : β0_k ≠ 0} .

Provided that Σ is known, one can always consider the homoscedastic-independent version of the responses, namely

    Σ^{−1/2} Y = X β0 + Σ^{−1/2} η ∼ N_n(X β0, σ² Id_n) ,

where the design X is given by X := Σ^{−1/2} X0. From the observation Σ^{−1/2} Y and the design X (or equivalently from (Z, R) defined in (9)), one can compute the Least Angle Regression (LAR) knots (λk)k≥1 and the sequence (ık, εk)k≥1 of variables ık ∈ [p] and signs εk ∈ {±1}. In the sequel, the law (resp. the conditional law) of a random vector V (resp. conditionally to W) is denoted by L(V) (resp. L(V | W)). One of the main discoveries of this paper (see Theorem 3) is the following:

Under the Empirical Irrepresentable Check (AIrr.) of order K, the LAR knots (λ1, . . . , λK) behave as a mixture of Gaussian order statistics with heterogeneous variances, namely it holds

    L(λ1, . . . , λK | λK+1) = Σ_{(i1,...,iK) ∈ [2p]^K}  π_{(i1,...,iK,λK+1)}  Z^{−1}_{(i1,...,iK,λK+1)} [ ⊗_{k=1}^{K} γ_{mk, v²k} ] 1{λ1 ≥ · · · ≥ λK ≥ λK+1} ,    (4)

where each summand is L(λ1, . . . , λK | ı1 = i1, . . . , ıK = iK, λK+1), π_{(i1,...,iK,λK+1)} = P{ı1 = i1, . . . , ıK = iK | λK+1} can be interpreted as mixing probabilities, Z_{(i1,...,iK,λK+1)} is a normalizing constant, and γ_{mk, v²k} is a short notation for the Gaussian law such that

• its mean mk is given by (15) and depends only on the covariance σ²Σ, the selected support (ı1, . . . , ık) at step k, and the orthogonal projection of µ0 := X^T X β0 given by (5);
• its variance v²k := σ²ρ²k is given by (13) and depends only on the covariance σ²Σ and the selected support (ı1, . . . , ık) at step k.


As presented in the sequel, this result is the key step to prove an exact control of the FDR, see Section 2.3, and an exact control of the false negatives of some selection procedures, see Section 2.4. These controls are obtained from the p-values αa,b,c described in Theorem 5 and (19). We will see that the p-value αabc detects abnormally large values of λb conditional on (λa, λc) and that, in the orthogonal design case, Theorem 6 shows that the test based on αa,a+1,K+1 is uniformly more powerful than the tests based on αx,y,z with a ≤ x < y < z ≤ K + 1.

2.3. Result two: Control of False Discovery Rate in the Orthogonal Design case

2.3.1. Presentation in the general case

Assume that the Empirical Irrepresentable Check (AIrr.) of order K holds. The sequence of testing procedures under consideration is given by the sequence of p-values (αk−1,k,k+1)k∈[K] that are referred to as "Spacing tests" [29] in the literature; they are recalled in Remark 10 below. These p-values account for "abnormally large" values of λk conditional on (λk−1, λk+1). Invoking (4), this conditional law of λk depends on two unobserved values: the mean mk, given by (15), and the variance σ² (in the case where the variance may be unknown). As discussed in Section 2.4.1, the variance σ² will be estimated on residuals as described by the heuristic of Remark 6. It entails that one may efficiently estimate the variance and, as we will see in Section 3.4, one can plug this independent estimator of the variance in so as to get a studentized version of the testing procedures described here. For the sake of readability, we will assume that σ is known and we refer to Section 3.4 for details on how to studentize our procedure. We understand that the laws of the testing statistics are parametrized by the hypotheses (mk)k∈[K], where mk is given by (15).

We recall that we denote µ0 = X^T X β0 and µ0_i its i-th coordinate. Assuming that the predictors are normalised, in the general case this quantity is the sum of the β0_j's whose predictors Xj are highly correlated with the predictor Xi. Now, given ı1, . . . , ık ∈ [p] and signs (ε1, . . . , εk) ∈ {±1}^k, we denote by (P⊥_{ı1,...,ık−1}(µ0))_{ık} the orthogonal projection given by

    (P⊥_{ı1,...,ık−1}(µ0))_{ık} := εk X_{ık}^T [ Id_n − X_{Sk−1} diag(ε_{Sk−1}) (X_{Sk−1}^T X_{Sk−1})^{−1} diag(ε_{Sk−1}) X_{Sk−1}^T ] X β0 ,    (5)

where diag(ε_{Sk−1}) := diag(ε1, . . . , εk−1).

The tested hypotheses are conditional on the sequence of variables (ı1, . . . , ıK) ∈ [p]^K and signs (ε1, . . . , εK) ∈ {±1}^K entering the model. The p-values under consideration here are given by

    ◦ p1 := α0,1,2 is the p-value testing H0,1 : "m1 = 0", namely µ0_{ı1} = 0;
    ◦ p2 := α1,2,3 is the p-value testing H0,2 : "m2 = 0", namely (P⊥1(µ0))_{ı2} = 0;
    ◦ p3 := α2,3,4 is the p-value testing H0,3 : "m3 = 0", namely (P⊥2(µ0))_{ı3} = 0;    (6)
    ◦ and so on...

We write I0 for the set

    I0 := {k ∈ [K] : H0,k is true} .

Given a subset R ⊆ [K] of hypotheses that we consider as rejected, we call false positives (FP) and true positives (TP) the quantities FP = card(R ∩ I0) and TP = card(R \ I0).

Denote by p(1) ≤ . . . ≤ p(K) the p-values ranked in nondecreasing order. Let α ∈ (0, 1) and consider the Benjamini–Hochberg procedure, see for instance [5], defined by a rejection set R ⊆ [K] such that R = ∅ when {k ∈ [K] : p(k) ≤ αk/K} = ∅ and

    R = {k ∈ [K] : pk ≤ α k̂/K}   where   k̂ = max{k ∈ [K] : p(k) ≤ αk/K} .    (7)


Recall the definition of the FDR as the mean of the False Discovery Proportion (FDP), namely

    FDR := E[ FP / (FP + TP) · 1{FP + TP ≥ 1} ] ,

where the quantity inside the expectation is the FDP, the expectation is unconditional on the sequence of variables entering the model, while the hypotheses that are being tested are conditional on the sequence of variables entering the model. This FDR can be understood invoking the following decomposition

    FDR = Σ_{(i1,...,iK) ∈ [p]^K}  π_{(i1,...,iK)}  E[ FDP | ı1 = i1, . . . , ıK = iK ] ,

where π_{(i1,...,iK)} = P{ı1 = i1, . . . , ıK = iK}.

2.3.2. FDR control of Benjamini–Hochberg procedure in the orthogonal design case

We now consider the orthogonal design case where X^T X = Id_p, together with the set of p-values given by (6). Note that I0 is simply the set of null coordinates of β0. Remark also that the Irrepresentable Condition (AIrr.) of order p holds and so does the Empirical Irrepresentable Check (AIrr.), see Proposition 2. It implies that one can consider any value K ∈ [p] in the following result.

Theorem 1. Assume that the design is orthogonal, namely it holds X^T X = Id_p, and let K ∈ [p]. Let (ı1, . . . , ıK) be the first variables entering along the LAR's path. Consider the p-values given by (6) and the set R given by (7). Then

    E[ FDP | ı1 = i1, . . . , ıK = iK ] ≤ α ,

and so the FDR is upper bounded by α.

The proof of this result is given in Appendix C.7. A post-selection interpretation may be given as follows: if one looks at all the experiments giving the same sequence of variables entering the model {ı1 = i1, . . . , ıK = iK} and if one considers the Benjamini–Hochberg procedure for the hypotheses described in Section 2.3.1, then the FDR is exactly controlled by α.
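Theorem 1 can be illustrated numerically in the orthogonal case, where the knots are the ordered |Zi| and the consecutive spacing p-values of Remark 10 have closed form. The following Monte-Carlo sketch is our own illustration (the signal strength, seed and sizes are arbitrary); it empirically checks that the Benjamini–Hochberg procedure applied to these p-values keeps the FDR below the target level.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, alpha, n_rep = 100, 0.2, 500
beta0 = np.zeros(p)
beta0[:5] = 3.0                                      # five true signals, sigma = 1, X^T X = Id
fdp = []
for _ in range(n_rep):
    Z = beta0 + rng.standard_normal(p)
    order = np.argsort(-np.abs(Z))                   # variables in their order of entry
    lam = np.concatenate([[np.inf], np.abs(Z)[order]])   # lam[0] = +infinity by convention
    K = p - 1
    # consecutive spacing p-values alpha_{k-1,k,k+1} for k = 1, ..., K (Remark 10)
    pv = (norm.cdf(lam[:K]) - norm.cdf(lam[1:K + 1])) / (norm.cdf(lam[:K]) - norm.cdf(lam[2:K + 2]))
    below = np.sort(pv) <= alpha * np.arange(1, K + 1) / K
    if below.any():
        khat = np.max(np.where(below)[0]) + 1
        rejected = np.where(pv <= alpha * khat / K)[0]
        false = np.sum(beta0[order[rejected]] == 0.0)    # H_{0,k} is true iff the k-th variable is null
        fdp.append(false / rejected.size)
    else:
        fdp.append(0.0)
print("empirical FDR:", np.mean(fdp), "(target", alpha, ")")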

2.4. Result three: Exact Testing Procedure on False Negatives

From the result of Section 2.2, one can present a method to select a model and propose an exact test on false negatives in the general design case. More precisely, we assume here for simplicity that the design X has full row rank n. In theory we are able to compute the LAR knots up to λn. Of course this is often out of reach due to numerical issues and one has to stop earlier to avoid bad conditioning, but we ignore this possibility to avoid heavy notation. One can easily adapt the following text to take this possibility into account. Let us compute the LAR knots (λ1, . . . , λn) and the "signed" variables (ı̄1, . . . , ı̄n) ∈ [2p]^n entering the model as in (1).

2.4.1. Admissible Support Selection Procedures

In order to select S, one may be interested in an estimation σselect of the noise level since, in practice, one usually does not know it. Our procedure is valid as long as the following property (P1) is satisfied.

    Noise Level Estimation for Selection (P1): The estimated noise level σselect is a measurable function of P⊥_{nselect}(Y), where nselect may depend only on (ı1, . . . , ın).


In practice, one can choose a fixed nselect = n − B, giving a fixed budget of B ≥ 1 independent observations to estimate the variance.¹ Obviously, if the noise level is known, one can simply take nselect = n and σselect = σ.

Remark 6. Observe that if the true support S0 is included in {ı1, ı2, . . . , ık} then P⊥k(Y) is a centered Gaussian vector with known covariance (up to σ²) and it is standard to compute an estimator of the noise level σ. In particular, the information contained in P⊥k(Y) does not depend on the target β0, since P⊥k(X0β0) = 0, and is purely due to the "noise" part η, as defined in (3).

Note that, for large enough nselect, the heuristic described in Remark 6 may be true and it entails that one may efficiently estimate the variance on P⊥_{nselect}(Y).

Now, note that choosing a model S is equivalent to choosing a model size m ∈ [n] so that S = {ı1, ı2, . . . , ım}.

Our procedure is flexible on this point and it allows any choice of m as long as the following property (P2) is satisfied.

    Stopping Rule (P2): The estimated model size m is a "stopping time", i.e. 1{m ≤ k} is a measurable function of (Pk(Y), σselect) for all k ∈ [nselect].

In other words, the decision to select a model of size m = k depends only on the part of the observation Y explained by the predictors Xı1 , . . . , Xık. In the sequel, we will present some examples of such "stopping time" procedures, see Section 3.5. We would like to emphasize that our analysis reveals the following conditional independence

    L( (P_{nselect}(Y), σselect) | (ı1, . . . , ın) ) = L( P_{nselect}(Y) | (ı1, . . . , ın) ) ⊗ L( σselect | (ı1, . . . , ın) ) ,    (8)

which can be advantageously invoked to build Student-type testing statistics to define m.

2.4.2. Exact Testing Procedure on False Negatives

Once one has selected a model of size m, one may be willing to test whether S contains the true support S0 by considering the null hypothesis H0 : "S0 ⊆ S", namely that there are no false negatives. Equivalently, one aims at testing the null hypothesis

    H0 : "X β0 ∈ Em" ,

at an exact significance level α ∈ (0, 1).

In this article, we introduce a new exact testing procedure that can be deployed when (P1)

and (P2) hold, namely when an "admissible" selection procedure is used to build S. The variance estimator σselect may have been used in the definition of m and our method forbids using it again to "studentize" our testing procedure. We need to estimate the variance again to preserve some "independence" in the spirit of (8). As in the previous variance estimation, we allocate a "budget" to this task, namely we use the "strata" (ntest, nselect] (as in the vocabulary of analysis of variance) to build σ²test, an estimation of the variance. More precisely, we use the orthogonal projection P_{(ntest,nselect]}(Y), which is the orthogonal projection onto the space F defined by

    E_{nselect} = E_{ntest} ⊕ F ,   where the sum is orthogonal.

Note that this space is generated by Xı_{ntest+1}, . . . , Xı_{nselect}. The resulting test is based on (λm/σtest, . . . , λ_{ntest}/σtest). More precisely, we know the law of λ_{mselect+1}/σtest conditional on m, λm, λ_{ntest} under H0.

¹ One can also look at the conditioning of the variance-covariance matrix of P⊥k(Y) for k = n, n − 1, . . . to choose nselect. A more detailed discussion on this issue can be found in Section 2.5.


Figure 2. The testing procedure computes the significance of knot λ_{1+m} w.r.t. knots λm and λK (see Step 4 above), where K is a compromise between the numerical limitation m + ∆num coming from the cost of integrating a function on [0, 1]^{∆num−2} and the statistical limitations of the design as depicted by the Empirical Irrepresentable Check.

2.5. Practical Framework

We start by presenting the meta-algorithm in which our testing procedure can be deployed. First, we compute the sequence of LAR knots (λ1, . . . , λn) and "signed" variables (ı̄1, . . . , ı̄n) ∈ [2p]^n entering the model as in (1). Conditionally on the sequence ı1, . . . , ın, the quantities displayed in Figure 2 are introduced in the following order.

• Compute KIrr., which is the maximal value such that (AIrr.) holds true.
• Fixing a budget B for the variance estimation σselect, we fix nselect = n − B. In most of the applications, n is large and consequently nselect is also large.
• From the estimation σselect we deduce mselect. Here, the practitioner is free to use any selection procedure as soon as it satisfies (P1) and (P2).
• Multivariate integration programs have some dimension limitation; this implies that we can only handle ∆num knots simultaneously. As a consequence, K must be smaller than or equal to mselect + ∆num.
• In most cases the quantity K = min(KIrr., mselect + ∆num) is convenient because K < nselect and there is enough budget in the strata (K, nselect] to build the second estimator of the variance, namely σtest.
• Otherwise we have to make a trade-off: diminishing K to obtain a sufficient budget of degrees of freedom in the strata (K, nselect] to estimate the variance, while keeping K large to gain power.

An example of values of these parameters is

    n = 200, p = 300, KIrr. = 35 (for a Gaussian design), and ∆num = 10 .

This framework has been deployed on real data in Section 4. Examples of admissible procedures satisfying (P1) and (P2) are presented in Section 3.5.


3. Main Results

Consider the correlations vector Z of the homoscedastic-independent observation Σ^{−1/2} Y with the design matrix X, so that

    Z := X0^T Σ^{−1} Y = R β0 + X0^T Σ^{−1} η ∼ N_p(R β0, σ²R)    (9)

    and    R := X^T X = X0^T Σ^{−1} X0 .
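In code, the pair (Z, R) of (9) can be obtained from the raw design and the noise covariance by whitening; the sketch below uses a Cholesky factor of Σ (any square root of Σ gives the same Z and R) and is illustrative only, with names of our own choosing.

import numpy as np

def correlation_statistics(Y, X0, Sigma):
    # Compute Z = X0^T Sigma^{-1} Y and R = X0^T Sigma^{-1} X0 as in (9).
    L = np.linalg.cholesky(Sigma)
    Xw = np.linalg.solve(L, X0)        # whitened design, Xw^T Xw = X0^T Sigma^{-1} X0
    Yw = np.linalg.solve(L, Y)         # whitened observation
    return Xw.T @ Yw, Xw.T @ Xw        # (Z, R)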

From Z and R we can compute the Least Angle Regression (LAR) knots (λk)k≥1 and the sequence (ık, εk)k≥1 of variables ık and signs εk that enter the model along the LAR path. We will use these statistics as testing statistics. For the sake of presentation, we may consider the 2p-vector Z̄ := (Z, −Z), whose mean is given by µ̄0 := (Rβ0, −Rβ0) = (X^T X β0, −X^T X β0) = (µ0, −µ0) and whose variance-covariance matrix is σ²R̄ with

    R̄ = [  R   −R ]  =  [  X^T X   −X^T X ]
        [ −R    R ]      [ −X^T X    X^T X ] .    (10)

3.1. Problem reformulation and Assumptions

Remark that Z̄ = (Z̄i)i is a Gaussian vector of size 2p such that Z̄i = −Z̄_{i+p}, where the indices are considered modulo 2p. Note also that we know the covariance of Z̄ up to some multiplicative constant σ², namely the variance-covariance matrix of Z̄ is given by σ²R̄, where R̄ ⪰ 0 is known and σ² is some parameter that may be known or not.

Now, we can define

    ∀(i1, . . . , ik) ∈ [2p]^k ,    θ_j(i1, . . . , ik) := ( R̄_{j,i1} · · · R̄_{j,ik} ) M^{−1}_{i1,...,ik} (1, . . . , 1)^T ,    (11)

where (1, . . . , 1)^T is the column vector of size k whose entries are all equal to one, σ² M_{i1,...,ik} is the variance-covariance matrix of the vector (Z̄_{i1}, · · · , Z̄_{ik}), and ( R̄_{j,i1} · · · R̄_{j,ik} ) is a row vector of size k. Note that M_{i1,...,ik} is the submatrix of R̄ obtained by keeping the columns and the rows indexed by {i1, . . . , ik}. Remark that

    θ_j(i1, . . . , ik) = E( Z̄_j | Z̄_{i1} = 1, . . . , Z̄_{ik} = 1 ) ,

when E Z̄ = 0. In our context, the Irrepresentable Condition of order K is given by

    ∀k ≤ K , ∀(i1, . . . , ik) ∈ [2p]^k , ∀j ∉ {i1, . . . , ik} ,    θ_j(i1, . . . , ik) < 1 ,    (12)

where θ_j(i1, . . . , ik) is given by (11). In Proposition 2 we show that (12) is equivalent to the Irrepresentable Condition (AIrr.) on the design matrix X. A proof can be found in Appendix C.2.

Proposition 2. Assume X and R̄ satisfy (10); then the following assumptions are equivalent:

• the design matrix X satisfies (AIrr.) of order K,
• the variance-covariance matrix R̄ satisfies (12) of order K.

Furthermore, if one of these two assumptions holds then the "Empirical Irrepresentable Check" of order K holds, namely

    max[ max_{j ≠ ı1} θ_j(ı1) , . . . , max_{j ≠ ı1,...,ıK} θ_j(ı1, . . . , ıK) ] < 1 ,

which is an equivalent formulation of (AIrr.) of Definition 2.

Remark 7. One may require that the design is "normalized" so that R_{i,i} = 1, namely its columns have unit Euclidean norm. Under this normalization, one can check that R satisfies (AIrr.) of order K = 1. Hence, up to some normalization, one can always assume (AIrr.) of order K = 1.


3.2. Conditional Joint Law of the Knots

In this section, we are interested in the joint law of the knots (λ1, . . . , λK) of the LAR conditional on λK+1. Let (ı1, . . . , ıK) be the first variables entering along the LAR path. Define the first correlation by ρ1 := √(R̄_{ı1,ı1}) and the others by

    ρℓ := √( R̄_{ıℓ,ıℓ} − ( R̄_{ıℓ,ı1} · · · R̄_{ıℓ,ıℓ−1} ) M^{−1}_{ı1,...,ıℓ−1} ( R̄_{ıℓ,ı1} , · · · , R̄_{ıℓ,ıℓ−1} )^T ) / ( 1 − θ^{ℓ−1}_{ıℓ} ) ,   for ℓ ≥ 2 ,    (13)

where

    θ^{ℓ−1} := θ(ı1, . . . , ıℓ−1) ,   for ℓ ≥ 2 ,

is defined by (11) and M_{ı1,...,ıℓ−1} is the submatrix of R̄ obtained by keeping the columns and the rows indexed by {ı1, . . . , ıℓ−1}. Furthermore, denote

    F_i := Φ_i(λ_i)   and   P_{i,j} := Φ_i ◦ Φ_j^{−1} ,   for i, j ∈ [K + 1] ,    (14)

where Φ_k(·) := Φ(·/(σρ_k)) is the CDF of the centered Gaussian law with variance σ²ρ²_k for k ≥ 1, and λ0 = ∞ and F0 = 1 by convention. The main discovery of this paper is the following theorem.

Theorem 3 (Conditional Joint Law of the LAR Knots). Let (λ1, . . . , λK, λK+1) be the first knots and let (ı1, . . . , ıK) be the first variables entering along the LAR path. If (AIrr.) holds then, conditional on {ı1, . . . , ıK, λK+1}, the vector (λ1, . . . , λK) has a law with density (w.r.t. the Lebesgue measure)

    Z^{−1}_{(ı1,...,ıK,λK+1)} ( ∏_{k=1}^{K} φ_{mk, v²k}(ℓk) ) 1{ℓ1 ≥ ℓ2 ≥ · · · ≥ ℓK ≥ λK+1}   at point (ℓ1, ℓ2, . . . , ℓK) ,

where φ_{mk, v²k} is the Gaussian density with mean

    mk := ( µ̄0_{ık} − ( R̄_{ık,ı1} · · · R̄_{ık,ık−1} ) M^{−1}_{ı1,...,ık−1} ( µ̄0_{ı1} , · · · , µ̄0_{ık−1} )^T ) / ( 1 − θ^{k−1}_{ık} )   with µ̄0 := (µ0, −µ0) ,    (15)

and variance v²k := σ²ρ²k.

A useful corollary of this theorem is the following. It shows that one can explicitly describe the joint law of the LAR's knots after having selected a support of size m with any procedure satisfying (P1) and (P2).

Corollary 4. Under the conditions of Theorem 3, let m be chosen according to a procedure satisfying (P1) and (P2) with nselect ≥ K. Then under the null hypothesis, namely

    H0 : "X β0 ∈ Em" ,

and conditional on the selection event {m = a, Fa, FK+1, ı1, . . . , ıK, ıK+1} with 0 ≤ a ≤ K − 1, the vector (Fa+1, . . . , FK) is uniformly distributed on

    D_{a+1,K} := { (f_{a+1}, . . . , f_K) ∈ R^{K−a} :  P_{a+1,a}(F_a) ≥ f_{a+1} ≥ P_{a+1,a+2}(f_{a+2}) ≥ · · · ≥ P_{a+1,K}(f_K) ≥ P_{a+1,K+1}(F_{K+1}) } ,

where the P_{i,j} are described in (14).

Remark 8. The previous statement is consistent with the case a = 0 corresponding to the global null hypothesis H0 : "X β0 = 0" (or equivalently E Z̄ = 0). Therefore, if Z̄ is centered then, conditional on FK+1, the vector (F1, . . . , FK) is uniformly distributed on

    D_{1,K} := { (f1, . . . , fK) ∈ R^K :  1 ≥ f1 ≥ P_{1,2}(f2) ≥ · · · ≥ P_{1,K}(fK) ≥ P_{1,K+1}(F_{K+1}) } .
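Numerically, the transforms F_i and P_{i,j} of (14) only require the Gaussian CDF and quantile function. A small sketch follows (our own helper; the indices are zero-based and the input rho collects the correlations of (13)).

from scipy.stats import norm

def make_transforms(sigma, rho):
    # Phi_k(t) = Phi(t / (sigma * rho_k)) and P_{i,j} = Phi_i o Phi_j^{-1}, as in (14).
    def Phi(k, t):
        return norm.cdf(t / (sigma * rho[k]))
    def P(i, j, f):
        return Phi(i, sigma * rho[j] * norm.ppf(f))   # Phi_j^{-1}(f) = sigma * rho_j * Phi^{-1}(f)
    return Phi, P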


Remark 9. In the orthogonal case where R = Id, note that θ_j(i1, . . . , iℓ) = 0 for all ℓ ≥ 1 and all i1, . . . , iℓ ≠ j, that ρ_j = 1, and that P_{i,j}(f) = f. We recover that D_{1,K} is the set of order statistics

    1 ≥ f1 ≥ f2 ≥ . . . ≥ fK ≥ Φ(λK+1) .

In this case, the knots λ_i are Gaussian order statistics λ1 = Z̄_{ı1} ≥ λ2 = Z̄_{ı2} ≥ . . . ≥ λK = Z̄_{ıK} ≥ λK+1 of the vector Z̄.

3.3. Testing Procedures

From Theorem 3, we deduce several testing statistics. To this end, we introduce some notation. First, define

    I_{ab}(s, t) := ∫_{P_{a+1,b}(t)}^{P_{a+1,a}(s)} df_{a+1} ∫_{P_{a+2,b}(t)}^{P_{a+2,a+1}(f_{a+1})} df_{a+2} ∫_{P_{a+3,b}(t)}^{P_{a+3,a+2}(f_{a+2})} df_{a+3} · · · ∫_{P_{b−1,b}(t)}^{P_{b−1,b−2}(f_{b−2})} df_{b−1}

for 0 ≤ a < b and s, t ∈ R, with the convention that I_{ab} = 1 when b = a + 1, and also

    F_{abc}(t) := 1{λc ≤ t ≤ λa} ∫_{Φ_b(λc)}^{Φ_b(t)} I_{ab}(F_a, f_b) I_{bc}(f_b, F_c) df_b    (16)

for 0 ≤ a < b < c ≤ K + 1 and t ∈ R, where F_a = Φ_a(λ_a) and F_c = Φ_c(λ_c).

Then, a simple integration shows that Corollary 4 implies that

    P[ λb ≤ t | λa, λc, ı1, ı2, . . . , ıc−1 ] = F_{abc}(t) / F_{abc}(λa) ,    (17)

whenever E Z̄ = 0. Actually, this result can be refined to the case when E Z̄ ≠ 0 by the next result.

Theorem 5. Let (λ1, . . . , λK, λK+1) be the first knots and let (ı1, . . . , ıK) be the first variables entering along the LAR path. If (AIrr.) holds and m is chosen according to a procedure satisfying (P1) and (P2) with nselect ≥ K, then under the null hypothesis, namely

    H0 : "X β0 ∈ Em" ,

and conditional on the selection event {m = a} with a ≤ K − 1, and for any integers b, c such that a < b < c ≤ K + 1, it holds that

    α_{abc} := 1 − F_{abc}(λb) / F_{abc}(λa) ∼ U(0, 1) ,    (18)

namely, it is uniformly distributed over (0, 1).

This testing statistic generalizes previous testing statistics that appeared in "Spacing Tests", as presented in [29, Chapter 5] for instance, and will be referred to as the Generalized Spacing test.

Remark 10. If one considers a = 0, b = 1 and c = 2 then one gets

    α_{012} = 1 − ( Φ1(λ1) − Φ1(λ2) ) / ( Φ1(λ0) − Φ1(λ2) ) = ( 1 − Φ1(λ1) ) / ( 1 − Φ1(λ2) ) .

Similarly, taking b = a + 1 and c = a + 2 one gets

    α_{a(a+1)(a+2)} = ( Φ_{a+1}(λ_{a+1}) − Φ_{a+1}(λ_a) ) / ( Φ_{a+1}(λ_{a+2}) − Φ_{a+1}(λ_a) ) ,

which is the spacing test as presented in [29, Chapter 5].
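In the consecutive case the p-value only involves the Gaussian CDF and is straightforward to compute. The sketch below is our own illustration (in the orthogonal design case one can take ρ_{a+1} = 1, and λ_a = +∞ for a = 0), followed by a small Monte-Carlo check of exactness under the global null.

import numpy as np
from scipy.stats import norm

def consecutive_spacing_pvalue(lam_a, lam_b, lam_c, sigma, rho_b):
    # alpha_{a,a+1,a+2} of Remark 10 for consecutive knots lam_a >= lam_b >= lam_c,
    # where rho_b is the correlation rho_{a+1} of (13).
    Phi = lambda t: norm.cdf(t / (sigma * rho_b))
    return (Phi(lam_a) - Phi(lam_b)) / (Phi(lam_a) - Phi(lam_c))

# Sanity check under the global null with an orthogonal design (sigma = 1, rho = 1):
rng = np.random.default_rng(0)
pvals = []
for _ in range(2000):
    lam = np.sort(np.abs(rng.standard_normal(50)))[::-1]       # knots = ordered |Z_i|
    pvals.append(consecutive_spacing_pvalue(np.inf, lam[0], lam[1], 1.0, 1.0))
print(np.mean(np.array(pvals) <= 0.05))                        # close to 0.05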


Figure 3. CDF of the p-values α_{abc} over 3,000 Monte-Carlo iterations with n = 200, p = 300 and a random design matrix X given by 300 independent column vectors uniformly distributed on the Euclidean sphere S^{299}. The central panel represents an alternative composed of a 2-sparse vector, the right panel an alternative composed of a 2-sparse vector 5 times larger, while the left panel corresponds to the null.

Given α ∈ (0, 1), one can consider the following testing procedures

    S_{abc} := 1{α_{abc} ≤ α} ,    (19)

that reject if the p-value α_{abc} is less than the level α of the test. Now, recall (17) to witness that the smaller α_{abc}, the bigger λb conditionally to (λa, λc). So the p-value α_{abc} detects abnormally large values of λb conditional on (λa, λc).

One may investigate the power of these tests to detect false negatives, namely to detect the alternative given by: there exists k ∈ S0 such that k ∉ {ı1, . . . , ıa}. In particular, which is the most powerful test among these testing procedures? A comprehensive study of the orthogonal case is given in the following theorem.

Theorem 6. Assume that the design is orthogonal, namely R = Id_p. Let m be chosen according to a procedure satisfying (P1) and (P2) with nselect ≥ K. Then under the null hypothesis, namely

    H0 : "X β0 ∈ Em" ,

and conditional on the selection event {m = a0} with 1 ≤ a0 ≤ K − 1, it holds that the test S_{a0,a0+1,K+1} is uniformly more powerful than any of the tests S_{a,b,c} for a0 ≤ a < b < c ≤ K + 1.

The proof of this result is given in Appendix C.6. It shows that the best choice among the tests S_{a,b,c} is the test S_{a0,a0+1,K+1}, with the smallest a and the largest c.

3.4. “Studentization”: Adapting to Unknown Noise Level

The previous results can be further refined to the case where σ² is unknown. Let K be chosen following the considerations of Section 2.5; the key property is to build an estimation σ²test of the variance on the strata (ntest, nselect] which is independent from

    (λm, . . . , λK+1, σselect) ,

conditional on the selection event {ı1 = i1, . . . , ın = in} and under H0 : "X β0 ∈ Em".

Because of our hypotheses, rank(P_{(ntest,nselect]}) is maximal, equal to m := nselect − ntest. Our variance estimator

    σ̂² := σ²test = ‖ P_{(ntest,nselect]}(Z) ‖²₂ / m

follows a σ²χ²(m)/m =: σ²W² law.
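A sketch of this variance estimator is given below; we write it for the whitened observation, whose projections onto the nested spaces E_k were introduced in Section 2.1.2, and the helper name, its arguments and the QR-based projections are our own illustration.

import numpy as np

def strata_variance_estimate(Yw, Xw, selected, n_test, n_select):
    # sigma^2_test: squared norm of the projection of the whitened observation onto the
    # strata F, the orthogonal complement of E_{n_test} inside E_{n_select}, over its dimension.
    def project(k):
        Q, _ = np.linalg.qr(Xw[:, selected[:k]])   # orthonormal basis of E_k
        return Q @ (Q.T @ Yw)
    residual = project(n_select) - project(n_test) # P_{(n_test, n_select]}(Yw)
    return float(residual @ residual) / (n_select - n_test)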


We set λ̄_i := λ_i/σ̂ = λ_i/(σW). Let us consider a = m < b < c ≤ K + 1 and consider, under H0, the distribution conditional on

    { m = a, λa, λK+1, ı1, . . . , ıK, ıK+1 } .

Then the variables (λa, . . . , λc) and W are independent. Moreover, the distribution of λb is given by (17). We have

    P{ λ̄b ≤ t | λa, λc, W = w, ı1, . . . , ıK+1 } = P{ λb/σ ≤ wt | λa, λc, W = w, ı1, . . . , ıK+1 } .

Then we can use Theorem 5 in the case σ = 1 to get that the expression above is equal to

    F̄_{abc}(tw) / F̄_{abc}(λ̄a w) ,

where F̄_{abc} is the function given by (17) in the case σ = 1. De-conditioning in w, we get

    P{ λ̄b ≤ t | λa, λc, i1, . . . , iK+1 } = ∫_R ( F̄_{abc}(tw) / F̄_{abc}(λ̄a w) ) dP_W(w) .

As a consequence, the p-value of the test is now given by

    β_{abc} = 1 − ∫_R ( F̄_{abc}(λ̄b w) / F̄_{abc}(λ̄a w) ) dP_W(w) .

3.5. Examples of Admissible procedures

We are now able to present an admissible procedure to build an estimate S of the support that satisfies the properties (P1) and (P2). We choose a level α′ and we define a slight modification of the Student test

    T_{abc} = 1{β_{abc} ≤ α′} ,

obtained by replacing σ²test by σ²select. The number of degrees of freedom of the χ² distribution is now m := n − nselect. By a small abuse of notation, we still set

    λ̄_i := λ_i / σselect .

We limit our attention to consecutive a, b, c. In such a case it is easy to see that

    β_{a(a+1)(a+2)} = ( T(λ̄_{a+1}/ρ_{a+1}) − T(λ̄_a/ρ_{a+1}) ) / ( T(λ̄_{a+2}/ρ_{a+1}) − T(λ̄_a/ρ_{a+1}) ) ,

where T is the cumulative distribution function of the Student T(m) law, so the quantity above is easy to compute. We are now in a position to present our algorithm, which we also sketch in code right after the following description.

• Begin with a = 0;
• at each step, perform the test T_{a,a+1,a+2} at the level α′;
• if the test is significant, we set a = a + 1 and keep going;
• if it is not significant, we stop and set m = a + 2.
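A minimal sketch of this stopping rule follows; the knots, the correlations ρ and the independent estimate σselect are assumed to be available, and by convention lams[0] = +∞ (the knot λ0). The function names, the indexing and the level are illustrative, not the repository's API.

import numpy as np
from scipy.stats import t as student

def consecutive_beta(lams, a, sigma_select, rhos, dof):
    # beta_{a,a+1,a+2} of Section 3.5: Student CDF applied to consecutive knots
    # rescaled by the independent noise estimate sigma_select (dof = n - n_select).
    T = lambda x: student.cdf(x / (sigma_select * rhos[a + 1]), df=dof)
    return (T(lams[a + 1]) - T(lams[a])) / (T(lams[a + 2]) - T(lams[a]))

def select_model_size(lams, sigma_select, rhos, dof, alpha_prime):
    # Sequential stopping rule: keep adding variables while the consecutive test is
    # significant; stop at the first non-significant test and set m = a + 2.
    a = 0
    while a + 2 < len(lams):
        if consecutive_beta(lams, a, sigma_select, rhos, dof) <= alpha_prime:
            a += 1
        else:
            return a + 2
    return a + 2   # fallback if every available test was significant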

Recall that the possible selected supports along the LAR's path are nested models of the form (2). Denote by k0 ≥ 1 the smallest integer k such that the true support S0 is contained in Sk, so that

    S0 ⊆ S_{k0}   and   S0 ⊄ S_{k0−1} .

We understand that admissible procedures depend on the sequence of false positives appearing along the LAR's path. If the experimenter believes that there are no more than γFP consecutive false positives in S0, a possible admissible procedure would be the following.


• Begin with a = 0;
• at each step, perform the test T_{a,a+1,a+2} at the level α′;
• if the test is significant, we set a = a + 1 and keep going;
• if the γFP consecutive tests T_{a,a+1,a+2}, T_{a+1,a+2,a+3}, . . . , T_{a+γFP−1,a+γFP,a+γFP+1} are all non-significant, we stop and set m = a + γFP + 1.

This method has been deployed on real data in Section 4.2 with γFP = 3.

4. Numerical Experiments

4.1. Monte-Carlo experiment

To study the relative power in a non-orthogonal design case we have built a Monte-Carlo experiment with 3,000 repetitions. We have considered a model with n = 200, p = 300 and a random design matrix X given by 300 independent column vectors uniformly distributed on the Euclidean sphere S^{299}. The computation of the function F_{abc} given by (16) is an important issue that demands multivariate integration tools, see Appendix E for a solution using cubature of integrals by lattice rules. This has led to some limitations, namely ∆num ≤ 4, which implies c ≤ 5 when a = 1 in our experimental framework.

A Python notebook and codes are given at https://github.com/ydecastro/lar_testing. The base function is observed_significance_CBC(lars, sigma, start, end, middle) in the file multiple_spacing_tests.py. It gives the p-value of T(start)(middle)(end) for the knots and indexes given by lars and an estimation of (or the true) standard deviation given by sigma. We have run 3,000 repetitions of this function to get the laws displayed in Figure 3. It presents the CDF of the p-value α_{abc} under the null and under two 2-sparse alternatives, one with low signal and one with 5 times more signal. The results show, in our particular case, that all the tests are exact and that the test S125 is the most powerful. More precisely, it holds, as in the orthogonal case, that

• S123 ≤ S124 ≤ S125;
• S234 ≤ S245.

4.2. Real data

A detailed presentation in a Python notebook is available at https://github.com/ydecastro/lar_testing/blob/master/multiple_spacing_tests.ipynb.

We consider a data set about HIV drug resistance extracted from [3] and [22]. The experiment consists in identifying mutations on the genes of the HIV virus that are involved in drug resistance. The data set contains about p = 200 variables and n = 700 observations. Since some protocol is used to remove some genes or some individuals, the exact numbers depend on the considered drug.

We used a procedure referred to as the "spacing-BH" procedure, which is a Benjamini–Hochberg procedure based on the sequence of spacing tests

    β012, β123, . . . , β_{a(a+1)(a+2)}, . . .

as described in Section 3.4, with α = 0.2. The results for the Knockoff of [3] and for the Benjamini–Hochberg procedure on the coefficients of linear regression (BHq) are those of the R-vignette knockoff of the dedicated Stanford web page. All results are evaluated using the TSM data base that gives, in some sense, the list of true positives. A comparison of our results with those of [3] is displayed in Figure 4. Our procedure is a bit more conservative but, in most of the cases, gives a better control of the FDP.

In addition, we have performed on the same dataset a false negative detection as in Sections 2.4 and 3.5 with γFP = 3. We refer to the aforementioned Python notebook for further details.


Figure 4. Comparison of the numbers of true and false positives for three procedures: Spacing-BH; Knockoff [3]; and the Benjamini–Hochberg procedure on the coefficients of linear regression (BHq). For each drug we indicate the number of true positives in blue and the number of false positives in orange. For the three procedures the target FDR is α = 20%.


References

[1] J.-M. Azaïs, Y. De Castro, and S. Mourareau. Power of the spacing test for least-angle regression. Bernoulli, 24(1):465–492, 2018.

[2] F. Bachoc, G. Blanchard, P. Neuvial, et al. On the post selection inference constant under restricted isometry properties. Electronic Journal of Statistics, 12(2):3736–3757, 2018.

[3] R. F. Barber, E. J. Candès, et al. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.

[4] P. C. Bellec, G. Lecué, A. B. Tsybakov, et al. Slope meets Lasso: improved oracle bounds and optimality. The Annals of Statistics, 46(6B):3603–3642, 2018.

[5] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.

[6] R. Berk, L. Brown, A. Buja, K. Zhang, L. Zhao, et al. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.

[7] P. J. Bickel, Y. Ritov, A. B. Tsybakov, et al. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[8] G. Blanchard, P. Neuvial, and E. Roquain. Post hoc inference via joint family-wise error rate control. arXiv preprint arXiv:1703.02307, 2017.

[9] G. Blanchard, E. Roquain, et al. Two simple sufficient conditions for FDR control. Electronic Journal of Statistics, 2:963–992, 2008.

[10] M. Bogdan, E. Van Den Berg, C. Sabatti, W. Su, and E. J. Candès. SLOPE—adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3):1103, 2015.

[11] P. Bühlmann and S. van de Geer. Statistics for high-dimensional data. Springer Series in Statistics. Springer, Heidelberg, 2011. Methods, theory and applications.

[12] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory, 52(2):489–509, 2006.

[13] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61 (electronic), 1998.

[14] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

[15] W. Fithian, D. Sun, and J. Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.

[16] A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1(2):141–149, 1992.

[17] C. Giraud. Introduction to high-dimensional statistics. Chapman and Hall/CRC, 2014.

[18] A. Javanmard, H. Javadi, et al. False discovery rate control via debiased lasso. Electronic Journal of Statistics, 13(1):1212–1253, 2019.

[19] F. Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, ISSAC '14, pages 296–303, New York, NY, USA, 2014. ACM.

[20] R. Lockhart, J. Taylor, R. J. Tibshirani, and R. Tibshirani. A significance test for the lasso. Annals of Statistics, 42(2):413, 2014.

[21] D. Nuyens and R. Cools. Fast algorithms for component-by-component construction of rank-1 lattice rules in shift-invariant reproducing kernel Hilbert spaces. Mathematics of Computation, 75(254):903–920, 2006.

[22] S.-Y. Rhee, J. Taylor, G. Wadhera, A. Ben-Hur, D. L. Brutlag, and R. W. Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355–17360, 2006.

[23] E. Roquain. Type I error rate control for testing many hypotheses: a survey with proofs. Journal de la Société Française de Statistique, 152(2):3–38, 2011.

[24] F. Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, 55:58–63, 2015.



[25] J. Taylor and R. J. Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.

[26] X. Tian, J. R. Loftus, and J. E. Taylor. Selective inference with unknown variance via the square-root lasso. Biometrika, 105(4):755–768, 2018.

[27] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[28] R. Tibshirani, M. Wainwright, and T. Hastie. Statistical Learning with Sparsity: The Lasso and Generalizations. Monographs on Statistics & Applied Probability. Chapman and Hall/CRC Press, 2015.

[29] R. J. Tibshirani, J. Taylor, R. Lockhart, and R. Tibshirani. Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620, 2016.

[30] S. van de Geer. Estimation and testing under sparsity. Lecture Notes in Mathematics, 2159, 2016.

[31] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[32] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.



Appendix A: Representing the LAR Knots

A.1. The equivalent formulations of the LAR algorithm

We present here three equivalent formulations of the LAR that are a consequence of the analysis provided in Appendices A and B. One formulation is given by Algorithm 1.

Algorithm 2: LAR algorithm (standard formulation)
Data: Correlations vector Z and variance-covariance matrix R.
Result: Sequence ((λ_k, ı_k, ε_k))_{k≥1} where λ_1 ≥ λ_2 ≥ ... > 0 are the knots, and ı_1, ı_2, ... are the variables that enter the model with signs ε_1, ε_2, ... (ε_k = ±1).

/* Initialize computing (λ_1, ı_1, ε_1) and defining a "residual" N^{(1)}. */
1  Set k = 1, λ_1 := max |Z|, ı_1 := arg max |Z|, ε_1 := Z_{ı_1}/λ_1 ∈ {±1}, and N^{(1)} := Z.

/* Note that ((λ_ℓ, ı_ℓ, ε_ℓ))_{1≤ℓ≤k−1} and N^{(k−1)} have been defined at the previous step. */
2  Set k ← k + 1 and compute the least-squares fit

       θ_j := (R_{j,ı_1} ··· R_{j,ı_{k−1}}) M^{−1}_{ı_1,...,ı_{k−1}} (ε_1, . . . , ε_{k−1}),   j = 1, . . . , p,

   where M_{ı_1,...,ı_{k−1}} is the submatrix of R keeping the columns and the rows indexed by {ı_1, . . . , ı_{k−1}}.

3  For 0 < λ ≤ λ_{k−1} compute the "residuals" N^{(k)}(λ) = (N^{(k)}_1(λ), . . . , N^{(k)}_p(λ)) given by

       N^{(k)}_j(λ) := N^{(k−1)}_j − (λ_{k−1} − λ) θ_j,   j = 1, . . . , p,

   and pick

       λ_k := max{β > 0 : ∃ j ∉ {ı_1, . . . , ı_{k−1}} s.t. |N^{(k)}_j(β)| = β},
       ı_k := arg max_{j ∉ {ı_1,...,ı_{k−1}}} |N^{(k)}_j(λ_k)|,
       ε_k := N^{(k)}_{ı_k}(λ_k)/λ_k ∈ {±1}   and   N^{(k)} := N^{(k)}(λ_k).

   Then, iterate from 2.
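For readers who prefer code, a minimal numerical sketch of Algorithm 2 is given below. This is an illustration of ours, not the paper's reference implementation (see the companion notebook for the actual experiments); in particular the function name and the brute-force search over candidate knots are assumptions made for readability.

```python
import numpy as np

def lar_knots(Z, R, n_knots):
    """Minimal sketch of Algorithm 2: knots, entering variables and signs
    computed from the correlation vector Z and its covariance R."""
    p = len(Z)
    lam = float(np.max(np.abs(Z)))
    i = int(np.argmax(np.abs(Z)))
    knots, idx, signs = [lam], [i], [float(np.sign(Z[i]))]
    N = Z.astype(float).copy()                      # the residual N^(1) = Z
    for _ in range(1, n_knots):
        M = R[np.ix_(idx, idx)]
        theta = R[:, idx] @ np.linalg.solve(M, np.array(signs))
        best_lam, best_j = -np.inf, None
        for j in set(range(p)) - set(idx):
            # solve |N_j - (lam - beta) * theta_j| = beta for beta in (0, lam]
            for num, den in ((N[j] - lam * theta[j], 1.0 - theta[j]),
                             (lam * theta[j] - N[j], 1.0 + theta[j])):
                if den > 0 and 0.0 < num / den <= lam and num / den > best_lam:
                    best_lam, best_j = num / den, j
        N = N - (lam - best_lam) * theta            # residual N^(k) at the new knot
        lam = best_lam
        knots.append(lam); idx.append(best_j); signs.append(float(np.sign(N[best_j])))
    return knots, idx, signs
```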

Algorithm 3: LAR algorithm ("projected" formulation)
Data: Correlations vector Z and variance-covariance matrix R.
Result: Sequence ((λ_k, ı_k, ε_k))_{k≥1} where λ_1 ≥ λ_2 ≥ ... > 0 are the knots, and ı_1, ı_2, ... are the variables that enter the model with signs ε_1, ε_2, ... (ε_k = ±1).

/* Initialize computing (λ_1, ı_1, ε_1). */
1  Consider the symmetrized vector Z = (Z, −Z) of size 2p and its variance-covariance matrix R as in (10). Set k = 1, λ_1 := max Z and ı_1 := arg max Z ∈ [2p]; writing ı_1 = j + p(1 − ε_1)/2 with j ∈ [p] and ε_1 ∈ {±1} as in (1), the entering variable is j with sign ε_1.

/* Note that ((λ_ℓ, ı_ℓ))_{1≤ℓ≤k−1} have been defined at the previous step/loop. */
2  Set k ← k + 1 and compute

       λ_k = max_{j : θ_j(ı_1,...,ı_{k−1}) < 1} { (Z_j − P_{ı_1,...,ı_{k−1}}(Z_j)) / (1 − θ_j(ı_1, . . . , ı_{k−1})) },
       ı_k = arg max_{j : θ_j(ı_1,...,ı_{k−1}) < 1} { (Z_j − P_{ı_1,...,ı_{k−1}}(Z_j)) / (1 − θ_j(ı_1, . . . , ı_{k−1})) },

   where

       P_{ı_1,...,ı_{k−1}}(Z_j) := (R_{j,ı_1} ··· R_{j,ı_{k−1}}) M^{−1}_{ı_1,...,ı_{k−1}} (Z_{ı_1}, . . . , Z_{ı_{k−1}}),
       θ_j(ı_1, . . . , ı_{k−1}) := (R_{j,ı_1} ··· R_{j,ı_{k−1}}) M^{−1}_{ı_1,...,ı_{k−1}} (1, . . . , 1),

   and recover the entering variable and its sign from ı_k ∈ [2p] as in (1). Then, iterate from 2.



A.2. Initialization: First Knot

The first step of the LAR algorithm (Step 1 in Algorithm 2) seeks the predictor most correlated with the observation. In our formulation, introduce the first residual N^{(1)} := Z; as for Z, we freely identify N^{(1)} with its symmetrized version (N^{(1)}, −N^{(1)}) of size 2p. We define the first knot λ_1 > 0 as

    λ_1 = max Z   and   ı_1 = arg max Z ∈ [2p].

One may see that this definition is consistent with λ_1 in Algorithm 2, and note that ı_1 ∈ [2p] and the index/sign pair of Algorithm 2 are related as in (1).

The LAR algorithm is a forward algorithm that selects a new variable and maintains a residual at each step. We also define

    N^{(2)}(λ) = N^{(1)} − (λ_1 − λ) θ(ı_1),   0 < λ ≤ λ_1,   (20)

and one can check that the first p coordinates of N^{(2)}(λ) coincide with the residual N^{(2)}(λ) defined in Algorithm 2 (the last p coordinates being its opposite). It is clear that the coordinate ı_1 of N^{(2)}(λ) is equal to λ. On the other hand, N^{(1)} = Z attains its maximum at the single point ı_1. By continuity this last property is kept for λ in a left neighborhood of λ_1. We search for the first value of λ at which this property fails, i.e. the largest value of λ such that

    ∃ j ≠ ı_1 such that N^{(2)}_j(λ) = λ,

as in Step 3 of Algorithm 2. We call this value λ_2 and one may check that this definition is consistent with λ_2 in Algorithm 2.

Now, we can be more explicit about the expression of λ_2. Indeed, we may argue according to the values of θ_j(ı_1).

• If θ_j(ı_1) ≥ 1, since N^{(1)}_j < N^{(1)}_{ı_1} for j ≠ ı_1, there is no hope to achieve equality between N^{(2)}_j(λ) and N^{(2)}_{ı_1}(λ) = λ for 0 < λ ≤ λ_1, in view of (20).
• Thus we limit our attention to the j's such that θ_j(ı_1) < 1. We have the equality N^{(2)}_j(λ) = λ when

    λ = (N^{(1)}_j − λ_1 θ_j(ı_1)) / (1 − θ_j(ı_1)).

So we can also define the second knot λ_2 of the LAR as

    λ_2 = max_{j : θ_j(ı_1) < 1} { (Z_j − P_{ı_1}(Z_j)) / (1 − θ_j(ı_1)) },

where P_{i_1}(Z_j) := Z_{i_1} θ_j(i_1). Remark that P_{i_1}(Z_j) = E(Z_j | Z_{i_1}) is the regression of Z_j on Z_{i_1} when EZ = 0.

A.3. Recursion: Next Knots

The loop (Steps 2 and 3) in Algorithm 2 builds iteratively the knots λ_1, λ_2, . . . of the LAR algorithm and the "residuals" N^{(1)}, N^{(2)}, . . . defined in Step 3. We present here an equivalent formulation of these knots.

Assume that k ≥ 2 and that we have built λ_1, . . . , λ_{k−1} and selected the "signed" variables ı_1, . . . , ı_{k−1} ∈ [2p]. Consider the symmetrized residual N^{(k−1)} (of size 2p, as in Appendix A.2) and define

    N^{(k)}(λ) = N^{(k−1)} − (λ_{k−1} − λ) θ(ı_1, . . . , ı_{k−1}),   0 < λ ≤ λ_{k−1}.

Check that the 2p-dimensional vector (θ_j(ı_1, . . . , ı_{k−1}))_{j∈[2p]} stacks the vector (θ_j)_{j∈[p]} and its opposite, where we recall that

    θ_j := (R_{j,ı_1} ··· R_{j,ı_{k−1}}) M^{−1}_{ı_1,...,ı_{k−1}} (ε_1, . . . , ε_{k−1}),   j = 1, . . . , p,



as in Step 2 of Algorithm 2, and recall that ı_ℓ ∈ [2p] and the corresponding index/sign pair (ı_ℓ, ε_ℓ) are related as in (1). From this equality, we deduce that the symmetrized residual N^{(k)}(λ) stacks the residual of Algorithm 2 and its opposite. One may also check that the coordinates ı_1, . . . , ı_{k−1} of N^{(k)}(λ) are equal to λ.

Again, if we want to solve N^{(k)}_j(λ) = λ for some j, we have to limit our attention to the j's such that θ_j(ı_1, . . . , ı_{k−1}) < 1. Solving this equality yields

    λ_k = max_{j : θ_j(ı_1,...,ı_{k−1}) < 1} { (N^{(k−1)}_j − λ_{k−1} θ_j(ı_1, . . . , ı_{k−1})) / (1 − θ_j(ı_1, . . . , ı_{k−1})) }.

This expression is consistent with λ_k in Algorithm 2.

Now, we can give another expression of λ_k that will be useful in the proofs of our main theorems. Note that the residuals satisfy the relation

    N^{(k)} = N^{(k−1)} − (λ_{k−1} − λ_k) θ(ı_1, . . . , ı_{k−1}),   (21)

and that N^{(k−1)}_j = λ_{k−1} for j = ı_1, . . . , ı_{k−1}. The following lemma permits a drastic simplification of the expression of the knots. Its proof is given in Appendix C.3.

Lemma 7. It holds that

    N^{(k−1)} − λ_{k−1} θ(ı_1, . . . , ı_{k−1}) = Z − P_{ı_1,...,ı_{k−1}}(Z),

where we denote P_{i_1,...,i_{k−1}}(Z) = (P_{i_1,...,i_{k−1}}(Z_1), . . . , P_{i_1,...,i_{k−1}}(Z_{2p})) and, for all j ∈ [2p],

    P_{i_1,...,i_{k−1}}(Z_j) = (R_{j,i_1} ··· R_{j,i_{k−1}}) M^{−1}_{i_1,...,i_{k−1}} (Z_{i_1}, . . . , Z_{i_{k−1}}).

Using Lemma 7 we deduce that λ_k in Algorithm 2 is consistent with

    λ_k = max_{j : θ_j(ı_1,...,ı_{k−1}) < 1} { (Z_j − P_{ı_1,...,ı_{k−1}}(Z_j)) / (1 − θ_j(ı_1, . . . , ı_{k−1})) },

where P_{ı_1,...,ı_{k−1}}(Z_j) = (R_{j,ı_1} ··· R_{j,ı_{k−1}}) M^{−1}_{ı_1,...,ı_{k−1}} (Z_{ı_1}, . . . , Z_{ı_{k−1}}). When EZ = 0, one may remark that P_{i_1,...,i_{k−1}}(Z_j) is the regression of Z_j on the vector (Z_{i_1}, ··· , Z_{i_{k−1}}), whose variance-covariance matrix is M_{i_1,...,i_{k−1}}. This analysis leads to an equivalent formulation of the LAR algorithm (Algorithm 2). We present this formulation in Algorithm 3.
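The projected formulation translates directly into code. The sketch below computes one knot of Algorithm 3 from the symmetrized vector (Z, −Z) and its covariance; the helper name and call signature are ours, for illustration only.

```python
import numpy as np

def next_knot_projected(Zsym, Rsym, selected):
    """One step of Algorithm 3: `Zsym` is the symmetrized vector (Z, -Z) of size 2p,
    `Rsym` its variance-covariance matrix, `selected` the signed indices already selected.
    Returns (lambda_k, signed index entering). Illustrative sketch only."""
    M = Rsym[np.ix_(selected, selected)]
    rhs = np.column_stack([Zsym[selected], np.ones(len(selected))])
    coef = np.linalg.solve(M, rhs)
    P = Rsym[:, selected] @ coef[:, 0]      # P_{i_1,...,i_{k-1}}(Z_j)
    theta = Rsym[:, selected] @ coef[:, 1]  # theta_j(i_1,...,i_{k-1})
    ratio = np.full(len(Zsym), -np.inf)
    admissible = theta < 1.0
    admissible[selected] = False
    ratio[admissible] = (Zsym[admissible] - P[admissible]) / (1.0 - theta[admissible])
    j = int(np.argmax(ratio))
    return ratio[j], j
```

The returned signed index is then reduced modulo p, as in (1), to recover the entering variable and its sign.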

Remark 11. Note that Algorithm 2 implies that ı_1, . . . , ı_k are pairwise different, but also that they differ modulo p. In the rest of the paper we will limit our attention to such sequences.

Appendix B: First Steps to Derive the Joint Law of the LAR Knots

Given fixed i_1, . . . , i_k ∈ [2p], one may define Z^{(i_1,...,i_k)}_j := Z_j − P_{i_1,...,i_k}(Z_j) for the indices j such that θ_j(i_1, . . . , i_k) = 1, and

    ∀ j s.t. θ_j(i_1, . . . , i_k) ≠ 1,   Z^{(i_1,...,i_k)}_j := (Z_j − P_{i_1,...,i_k}(Z_j)) / (1 − θ_j(i_1, . . . , i_k)),

where we recall that P_{i_1,...,i_k}(Z_j) = (R_{j,i_1} ··· R_{j,i_k}) M^{−1}_{i_1,...,i_k} (Z_{i_1}, . . . , Z_{i_k}). From this point, we can introduce

    λ^{(i_1,...,i_k)}_{k+1} := max_{j : θ_j(i_1,...,i_k) < 1} Z^{(i_1,...,i_k)}_j,

and remark that λ_{k+1} = λ^{(ı_1,...,ı_k)}_{k+1}.



B.1. Law of the First Knot

One has the following lemma governing the law of λ1.

Lemma 8. It holds that:

• Z_{i_1} is independent of (Z^{(i_1)}_j)_{j≠i_1};
• if θ_j(i_1) < 1 for all j ≠ i_1, then {ı_1 = i_1} = {λ^{(i_1)}_2 ≤ Z_{i_1}};
• if θ_j(i_1) < 1 for all j ≠ i_1 then, conditionally on {ı_1 = i_1} and λ_2, the knot λ_1 is a truncated Gaussian random variable with mean E(Z_{i_1}) and variance ρ_1² := R_{i_1,i_1}, conditioned to be greater than λ_2.

Proof. The first point is clear and standard. Now, observe that

    {λ^{(i_1)}_2 ≤ Z_{i_1}} ⇔ {∀ j ≠ i_1, (Z_j − Z_{i_1} θ_j(i_1)) / (1 − θ_j(i_1)) ≤ Z_{i_1}}
                           ⇔ {∀ j ≠ i_1, Z_j − Z_{i_1} θ_j(i_1) ≤ Z_{i_1} − Z_{i_1} θ_j(i_1)}
                           ⇔ {∀ j ≠ i_1, Z_j ≤ Z_{i_1}}
                           ⇔ {ı_1 = i_1},

as claimed. The last statement is a consequence of the two previous points.
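The last point of Lemma 8 can be illustrated numerically. The toy check below (orthogonal design, p = 3, µ0 = 0, σ = 1, all choices arbitrary and ours, not the paper's experiments) compares the conditional mean of λ_1 given λ_2 ≈ t with the mean of a standard Gaussian truncated at t; the two printed values should be close.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
A = np.abs(rng.standard_normal((400_000, 3)))   # |Z_j| in a toy orthogonal design, mu0 = 0
sel = A.argmax(axis=1) == 0                     # condition on the first variable entering
lam1 = A[sel].max(axis=1)
lam2 = np.sort(A[sel], axis=1)[:, -2]
t = 1.0
slab = np.abs(lam2 - t) < 0.02                  # thin slice where lambda_2 is close to t
# empirical conditional mean of lambda_1 vs. the mean of N(0,1) truncated to (t, +inf)
print(lam1[slab].mean(), norm.pdf(t) / norm.sf(t))
```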

B.2. Recursive Formulation of the LAR

One has the following proposition, whose proof can be found in Section C.4. As we will see in this section, this intermediate result has a deep consequence: the LAR algorithm can be stated in a recursive way, applying the same function repeatedly, as presented in Algorithm 1.

Proposition 9. Set

    τ_{j,i_k} := [ R_{j,i_k} − (R_{j,i_1} ··· R_{j,i_{k−1}}) M^{−1}_{i_1,...,i_{k−1}} (R_{i_k,i_1}, ··· , R_{i_k,i_{k−1}}) ] / [ (1 − θ_j(i_1, . . . , i_{k−1})) (1 − θ_{i_k}(i_1, . . . , i_{k−1})) ],

and observe that τ_{j,i_k} is the covariance between Z^{(i_1,...,i_{k−1})}_j and Z^{(i_1,...,i_{k−1})}_{i_k}. Furthermore, it holds that

    τ_{j,i_k} / τ_{i_k,i_k} = 1 − (1 − θ_j(i_1, . . . , i_k)) / (1 − θ_j(i_1, . . . , i_{k−1}))   (22)

and

    ∀ j ≠ i_1, . . . , i_k,   Z^{(i_1,...,i_k)}_j = [ Z^{(i_1,...,i_{k−1})}_j − Z^{(i_1,...,i_{k−1})}_{i_k} τ_{j,i_k}/τ_{i_k,i_k} ] / [ 1 − τ_{j,i_k}/τ_{i_k,i_k} ].   (23)

Now, we present Algorithm 1. Define R(0) := R, Z(0) := Z and T(0) := 0. For k ≥ 1 and fixed i_1, . . . , i_k ∈ [2p], introduce

    R(k) := ( R_{j,ℓ} − (R_{j,i_1} ··· R_{j,i_k}) M^{−1}_{i_1,...,i_k} (R_{ℓ,i_1}, ··· , R_{ℓ,i_k}) )_{j,ℓ},
    Z(k) := Z − P_{i_1,...,i_k}(Z),
    T(k) := (θ_j(i_1, . . . , i_k))_j,

and note that R(k) is the variance-covariance matrix of the Gaussian vector Z(k). The key property is the following. Let v_1, . . . , v_k be k linearly independent vectors of a Euclidean space and let u be any vector of this space. Set

    v := P⊥_{(v_1,...,v_{k−1})} v_k,



the projection of v_k orthogonally to v_1, . . . , v_{k−1}. Then

    P⊥_{(v_1,...,v_k)} u = P⊥_v P⊥_{(v_1,...,v_{k−1})} u.

Using this result we deduce that

    Z(k) = P⊥_{i_1,...,i_k}(Z) = P⊥_{i_k}( P⊥_{i_1,...,i_{k−1}}(Z) ) = P⊥_{i_k}( Z(k−1) ) = Z(k−1) − P_{i_k}(Z(k−1)) = Z(k−1) − x(k) Z_{i_k}(k−1),   (24)

where x(k) := R_{i_k}(k−1) / R_{i_k,i_k}(k−1), the i_k-th column of R(k−1) divided by its i_k-th diagonal entry. It yields that

    R(k) = R(k−1) − x(k) R_{i_k}(k−1)^T.   (25)

Using (22) (or (31)), remark that

    T(k) = T(k−1) + x(k) (1 − T_{i_k}(k−1)).   (26)

These relations give a recursive formulation of the LAR as presented in Algorithm 1.
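In code, one pass of this recursion amounts to three rank-one updates. The sketch below (our naming, not the paper's reference implementation) applies (24)-(26) to the triplet (Z(k−1), R(k−1), T(k−1)); the sign in the T update follows (31).

```python
import numpy as np

def lar_recursion_step(Z_prev, R_prev, T_prev, ik):
    """One step of the recursive formulation: rank-one updates (24)-(26).
    Z_prev, R_prev, T_prev stand for Z(k-1), R(k-1), T(k-1); ik is the index
    selected at step k. Illustrative sketch only."""
    x = R_prev[:, ik] / R_prev[ik, ik]               # x(k) = R_{ik}(k-1) / R_{ik,ik}(k-1)
    Z_next = Z_prev - x * Z_prev[ik]                 # (24): project out coordinate ik
    R_next = R_prev - np.outer(x, R_prev[:, ik])     # (25): covariance of Z(k)
    T_next = T_prev + x * (1.0 - T_prev[ik])         # (26), as derived from (31)
    return Z_next, R_next, T_next
```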

B.3. Joint Law

Regarding the joint law of the knots, one has the following proposition, whose proof can be found in Section C.5.

Proposition 10. One has the following for any fixed i_1, . . . , i_{k+1} ∈ [2p].

• The random variables

    (Z^{(i_1,...,i_{k+1})}_j)_{j≠i_1,...,i_{k+1}},  Z^{(i_1,...,i_k)}_{i_{k+1}},  Z^{(i_1,...,i_{k−1})}_{i_k},  . . . ,  Z^{(i_1)}_{i_2},  Z_{i_1}

  are mutually independent.

• If (AIrr.) of order k holds then

    {ı_1 = i_1, . . . , ı_{k+1} = i_{k+1}}
      = { λ^{(i_1,...,i_k)}_{k+1} = Z^{(i_1,...,i_k)}_{i_{k+1}} ≤ Z^{(i_1,...,i_{k−1})}_{i_k} ≤ · · · ≤ Z^{(i_1)}_{i_2} ≤ Z_{i_1} }
      = { λ^{(i_1,...,i_k)}_{k+1} = Z^{(i_1,...,i_k)}_{i_{k+1}} ≤ Z^{(i_1,...,i_{k−1})}_{i_k} ≤ · · · ≤ Z^{(i_1,...,i_a)}_{i_{a+1}} ≤ Z^{(i_1,...,i_{a−1})}_{i_a} = λ^{(i_1,...,i_{a−1})}_a }
        ∩ { ı_1 = i_1, . . . , ı_a = i_a, ı_{k+1} = i_{k+1} }
        ∩ { λ_{k+1} = Z^{(i_1,...,i_k)}_{i_{k+1}}, . . . , λ_a = Z^{(i_1,...,i_{a−1})}_{i_a} },

  for any 0 ≤ a ≤ k − 1, with the convention λ_0 = ∞.



Appendix C: Proofs

C.1. Proof of Theorem 3

Step 1. We prove by induction on k the following equality:

    {ı_1 = i_1, . . . , ı_{k+1} = i_{k+1}} = { λ^{(i_1,...,i_k)}_{k+1} ≤ [ λ^{(i_1,...,i_{k−1})}_k or Z^{(i_1,...,i_{k−1})}_{i_k} ] ≤ · · · ≤ [ λ_1 or Z_{i_1} ] },

where each bracket [· or ·] means that, at that position, one can choose indifferently either of the two quantities and obtain an equivalent event.

Using Lemma 8 we have the equivalences

    {ı_1 = i_1} = {λ_1 = Z_{i_1}} = {λ^{(i_1)}_2 ≤ Z_{i_1}} = {λ^{(i_1)}_2 ≤ λ_1},

which give the property for k = 1.

Using the induction hypothesis and Proposition 9, we can pass from k to k + 1 and get that

    {ı_1 = i_1, . . . , ı_{k+1} = i_{k+1}} = { λ^{(i_1,...,i_{k+1})}_{k+2} ≤ λ^{(i_1,...,i_k)}_{k+1} ≤ [ λ^{(i_1,...,i_{k−1})}_k or Z^{(i_1,...,i_{k−1})}_{i_k} ] ≤ · · · ≤ [ λ_1 or Z_{i_1} ] }.

The 2^k equivalent events above obviously imply the following 2^k events:

    { λ^{(i_1,...,i_{k+1})}_{k+2} ≤ Z^{(i_1,...,i_k)}_{i_{k+1}} ≤ [ λ^{(i_1,...,i_{k−1})}_k or Z^{(i_1,...,i_{k−1})}_{i_k} ] ≤ · · · ≤ [ λ_1 or Z_{i_1} ] }.

To get the implication in the other direction, we remark that

    λ^{(i_1,...,i_{k+1})}_{k+2} ≤ Z^{(i_1,...,i_k)}_{i_{k+1}}
    ⇔ ∀ j ≠ i_1, . . . , i_{k+1},  Z^{(i_1,...,i_{k+1})}_j ≤ Z^{(i_1,...,i_k)}_{i_{k+1}}
    ⇔ ∀ j ≠ i_1, . . . , i_{k+1},  Z^{(i_1,...,i_k)}_j − Z^{(i_1,...,i_k)}_{i_{k+1}} τ_{j,i_{k+1}}/τ_{i_{k+1},i_{k+1}} ≤ Z^{(i_1,...,i_k)}_{i_{k+1}} − Z^{(i_1,...,i_k)}_{i_{k+1}} τ_{j,i_{k+1}}/τ_{i_{k+1},i_{k+1}}
    ⇔ ∀ j ≠ i_1, . . . , i_{k+1},  Z^{(i_1,...,i_k)}_j ≤ Z^{(i_1,...,i_k)}_{i_{k+1}}   (27)
    ⇔ λ^{(i_1,...,i_k)}_{k+1} = Z^{(i_1,...,i_k)}_{i_{k+1}},

using (23) and the fact that 1 − τ_{j,i_{k+1}}/τ_{i_{k+1},i_{k+1}} > 0 (a consequence of (22) and (AIrr.)) in (27).

Step 2. Using the formulation of Algorithm 3, we deduce directly the expressions of the expectation and of the variance of Z^{(i_1,...,i_{ℓ−1})}_{i_ℓ} given by (13) and (15) (changing k into ℓ). From this same formulation we get the independence of the Z^{(i_1,...,i_{ℓ−1})}_{i_ℓ} when ℓ varies.

Now, suppose that we condition on (a, λ_a, λ_{K+1}, ı_1, . . . , ı_{K+1}); then standard properties of order statistics imply that the density of (λ_{a+1}, . . . , λ_K), taken at (ℓ_{a+1}, . . . , ℓ_K), is proportional to

    ( ∏_{k=a+1}^{K} φ_{m_k, v_k²}(ℓ_k) ) 1_{{ℓ_{a+1} ≥ ℓ_{a+2} ≥ · · · ≥ ℓ_K ≥ λ_{K+1}}}.   (28)



Step 3. Suppose now that we condition on {m = a, λ_a, λ_{K+1}, ı_1, . . . , ı_K, ı_{K+1}} with 0 ≤ a ≤ K − 1. Because of P2, λ_{a+1}, . . . , λ_K are independent of the condition m = a, so their conditional distribution is equal to the unconditional one, given by (28).

C.2. Proof of Proposition 2

Let S = {j_1, . . . , j_k} ⊂ [p] and j ∈ [2p] \ S. Let v = (v_1, . . . , v_k) ∈ {−1, 1}^k and define i_ℓ = j_ℓ + p(1 − v_ℓ)/2 for ℓ ∈ [k]. Note that

    θ_j(i_1, . . . , i_k) = (R_{j,i_1} ··· R_{j,i_k}) M^{−1}_{i_1,...,i_k} (1, . . . , 1) = [ X_j^T X_S Diag(v) ] M^{−1}_{i_1,...,i_k} (1, . . . , 1)
                        = [ X_j^T X_S Diag(v) ] M^{−1}_{i_1,...,i_k} [ Diag(v) v ] = X_j^T X_S [ Diag(v) M^{−1}_{i_1,...,i_k} Diag(v) ] v
                        = X_j^T X_S (X_S^T X_S)^{−1} v.

Now, observe that

    max_{‖v‖_∞ ≤ 1} v^T (X_S^T X_S)^{−1} X_S^T X_j = max_{v ∈ {−1,1}^k} X_j^T X_S (X_S^T X_S)^{−1} v,

showing the equivalence between the two assumptions.
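The identity used above (the maximum over the cube equals the maximum over sign vectors, namely an ℓ1 norm) can be checked numerically on a toy design; sizes, seed and index choices below are arbitrary and only illustrative.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 6))
X /= np.linalg.norm(X, axis=0)                      # unit-norm columns, toy design
S, j = [0, 2, 4], 5
w = np.linalg.solve(X[:, S].T @ X[:, S], X[:, S].T @ X[:, j])
max_over_cube = np.abs(w).sum()                     # max over ||v||_inf <= 1 of v' w
max_over_signs = max(float(np.dot(v, w)) for v in product([-1.0, 1.0], repeat=len(S)))
print(np.isclose(max_over_cube, max_over_signs))    # the two assumptions coincide
```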

C.3. Proof of Lemma 7

The proof works by induction. Let us check the relation for k = 2, namely

    N^{(1)} − λ_1 θ(ı_1) = Z − Z_{ı_1} θ(ı_1) = Z − P_{ı_1}(Z).

Now, let k ≥ 3. First, the three-perpendiculars theorem implies that, for every j, i_1, . . . , i_{k−1},

    θ_j(i_1, . . . , i_{k−2}) = (R_{j,i_1} ··· R_{j,i_{k−1}}) M^{−1}_{i_1,...,i_{k−1}} (θ_{i_1}(i_1, . . . , i_{k−2}), . . . , θ_{i_{k−1}}(i_1, . . . , i_{k−2})),

and

    P_{i_1,...,i_{k−2}}(Z_j) = (R_{j,i_1} ··· R_{j,i_{k−1}}) M^{−1}_{i_1,...,i_{k−1}} (P_{i_1,...,i_{k−2}}(Z_{i_1}), . . . , P_{i_1,...,i_{k−2}}(Z_{i_{k−1}})).

By induction, using (21), we get that

    N^{(k−1)} = N^{(k−2)} − (λ_{k−2} − λ_{k−1}) θ(ı_1, . . . , ı_{k−2})
             = (N^{(k−2)} − λ_{k−2} θ(ı_1, . . . , ı_{k−2})) + λ_{k−1} θ(ı_1, . . . , ı_{k−2})
             = Z − P_{ı_1,...,ı_{k−2}}(Z) + λ_{k−1} θ(ı_1, . . . , ı_{k−2}).   (29)

Then, recall that N^{(k−1)}_j = λ_{k−1} for j = ı_1, . . . , ı_{k−1} and remark that

    λ_{k−1} θ_j(ı_1, . . . , ı_{k−1}) = (R_{j,ı_1} ··· R_{j,ı_{k−1}}) M^{−1}_{ı_1,...,ı_{k−1}} (N^{(k−1)}_{ı_1}, . . . , N^{(k−1)}_{ı_{k−1}}).

Using (29) at the indices j = ı_1, . . . , ı_{k−1}, we deduce that

    λ_{k−1} θ_j(ı_1, . . . , ı_{k−1})
      = (R_{j,ı_1} ··· R_{j,ı_{k−1}}) M^{−1}_{ı_1,...,ı_{k−1}} (N^{(k−1)}_{ı_1}, . . . , N^{(k−1)}_{ı_{k−1}})
      = (R_{j,ı_1} ··· R_{j,ı_{k−1}}) M^{−1}_{ı_1,...,ı_{k−1}} (Z_{ı_1}, . . . , Z_{ı_{k−1}})
        − (R_{j,ı_1} ··· R_{j,ı_{k−1}}) M^{−1}_{ı_1,...,ı_{k−1}} (P_{ı_1,...,ı_{k−2}}(Z_{ı_1}), . . . , P_{ı_1,...,ı_{k−2}}(Z_{ı_{k−1}}))
        + λ_{k−1} (R_{j,ı_1} ··· R_{j,ı_{k−1}}) M^{−1}_{ı_1,...,ı_{k−1}} (θ_{ı_1}(ı_1, . . . , ı_{k−2}), . . . , θ_{ı_{k−1}}(ı_1, . . . , ı_{k−2}))
      = P_{ı_1,...,ı_{k−1}}(Z_j) − P_{ı_1,...,ı_{k−2}}(Z_j) + λ_{k−1} θ_j(ı_1, . . . , ı_{k−2}).



Namely,

    P_{ı_1,...,ı_{k−1}}(Z) − λ_{k−1} θ(ı_1, . . . , ı_{k−1}) = P_{ı_1,...,ı_{k−2}}(Z) − λ_{k−1} θ(ı_1, . . . , ı_{k−2}).

Using again (29) we get that

    N^{(k−1)} = Z − P_{ı_1,...,ı_{k−2}}(Z) + λ_{k−1} θ(ı_1, . . . , ı_{k−2})
             = Z − P_{ı_1,...,ı_{k−1}}(Z) + λ_{k−1} θ(ı_1, . . . , ı_{k−1}),

as claimed.

C.4. Proof of Proposition 9

We denote

    R_j := (R_{j,i_1}, . . . , R_{j,i_{k−1}}),
    R_{i_k} := (R_{i_k,i_1}, . . . , R_{i_k,i_{k−1}}),
    M := M_{i_1,...,i_{k−1}},
    M̄ := M_{i_1,...,i_k} = [ M  R_{i_k} ; R_{i_k}^T  R_{i_k,i_k} ],
    R̄_j := (R_{j,i_1}, . . . , R_{j,i_k}),
    x := [ (1 − θ_j(i_1, . . . , i_{k−1})) / (1 − θ_{i_k}(i_1, . . . , i_{k−1})) ] · τ_{j,i_k}/τ_{i_k,i_k},

and observe that

    x = (R_{j,i_k} − R_j^T M^{−1} R_{i_k}) / (R_{i_k,i_k} − R_{i_k}^T M^{−1} R_{i_k}),

    M̄^{−1} = [ Id_{k−1}  −M^{−1} R_{i_k} ; 0  1 ] [ M^{−1}  0 ; 0  (R_{i_k,i_k} − R_{i_k}^T M^{−1} R_{i_k})^{−1} ] [ Id_{k−1}  0 ; −R_{i_k}^T M^{−1}  1 ],

    M̄^{−1} R̄_j = [ M^{−1}(R_j − x R_{i_k}) ; x ],   (30)

using the Schur complement of the block M of the matrix M̄ and an LU decomposition. Note also that

    [ Z^{(i_1,...,i_{k−1})}_j − Z^{(i_1,...,i_{k−1})}_{i_k} τ_{j,i_k}/τ_{i_k,i_k} ] / [ 1 − τ_{j,i_k}/τ_{i_k,i_k} ]
      = [ Z_j − P_{i_1,...,i_{k−1}}(Z_j) − x (Z_{i_k} − P_{i_1,...,i_{k−1}}(Z_{i_k})) ] / [ 1 − θ_j(i_1, . . . , i_{k−1}) − x (1 − θ_{i_k}(i_1, . . . , i_{k−1})) ].

To prove (23), it suffices to show that the right-hand side above is equal to

    Z^{(i_1,...,i_k)}_j = (Z_j − P_{i_1,...,i_k}(Z_j)) / (1 − θ_j(i_1, . . . , i_k)).

We will prove that the numerators are equal and that the denominators are equal. For the denominators, we use that

    1 − θ_j(i_1, . . . , i_{k−1}) − x (1 − θ_{i_k}(i_1, . . . , i_{k−1}))
      = 1 − θ_j(i_1, . . . , i_{k−1}) − x + x θ_{i_k}(i_1, . . . , i_{k−1})
      = 1 − (1 ··· 1) [ M^{−1}(R_j − x R_{i_k}) ; x ]
      = 1 − θ_j(i_1, . . . , i_k),   (31)



using (30). Furthermore, it proves (22). For the numerators, we use that

    Z_j − P_{i_1,...,i_{k−1}}(Z_j) − x (Z_{i_k} − P_{i_1,...,i_{k−1}}(Z_{i_k}))
      = Z_j − P_{i_1,...,i_{k−1}}(Z_j) − x Z_{i_k} + x P_{i_1,...,i_{k−1}}(Z_{i_k})
      = Z_j − (Z_{i_1} ··· Z_{i_k}) [ M^{−1}(R_j − x R_{i_k}) ; x ]
      = Z_j − P_{i_1,...,i_k}(Z_j),

using (30).

C.5. Proof of Proposition 10

The proof of the first point can be led by induction. The initialization is given by the first point of Lemma 8. Now, observe that Z^{(i_1,...,i_{k−1})}_{i_k}, ··· , Z^{(i_1)}_{i_2}, Z_{i_1} are measurable functions of (Z_{i_1}, . . . , Z_{i_k}) and one may check that the vector (Z_{i_1}, . . . , Z_{i_k}) is independent of the vector (Z^{(i_1,...,i_k)}_{i_{k+1}}, (Z^{(i_1,...,i_{k+1})}_j)_{j≠i_1,...,i_{k+1}}). We deduce that Z^{(i_1,...,i_{k−1})}_{i_k}, ··· , Z^{(i_1)}_{i_2}, Z_{i_1} are independent of (Z^{(i_1,...,i_k)}_{i_{k+1}}, (Z^{(i_1,...,i_{k+1})}_j)_{j≠i_1,...,i_{k+1}}). One can also check that Z^{(i_1,...,i_k)}_{i_{k+1}} is independent of (Z^{(i_1,...,i_{k+1})}_j)_{j≠i_1,...,i_{k+1}}.

The second point also works by induction. The initialization is given by the second point of Lemma 8. We will use Proposition 9 to prove the second point. Now, we have

    λ^{(i_1,...,i_k)}_{k+1} ≤ Z^{(i_1,...,i_{k−1})}_{i_k}
    ⇔ ∀ j ≠ i_1, . . . , i_k,  Z^{(i_1,...,i_k)}_j ≤ Z^{(i_1,...,i_{k−1})}_{i_k}
    ⇔ ∀ j ≠ i_1, . . . , i_k,  Z^{(i_1,...,i_{k−1})}_j − Z^{(i_1,...,i_{k−1})}_{i_k} τ_{j,i_k}/τ_{i_k,i_k} ≤ Z^{(i_1,...,i_{k−1})}_{i_k} − Z^{(i_1,...,i_{k−1})}_{i_k} τ_{j,i_k}/τ_{i_k,i_k}
    ⇔ ∀ j ≠ i_1, . . . , i_k,  Z^{(i_1,...,i_{k−1})}_j ≤ Z^{(i_1,...,i_{k−1})}_{i_k}   (32)
    ⇔ λ^{(i_1,...,i_{k−1})}_k = Z^{(i_1,...,i_{k−1})}_{i_k},

using (23) and the fact that 1 − τ_{j,i_k}/τ_{i_k,i_k} > 0 (a consequence of (22) and (AIrr.)) in (32). By induction and using (32), it holds that

    { λ^{(i_1,...,i_k)}_{k+1} ≤ Z^{(i_1,...,i_{k−1})}_{i_k} ≤ Z^{(i_1,...,i_{k−2})}_{i_{k−1}} ≤ . . . ≤ Z^{(i_1)}_{i_2} ≤ Z_{i_1} }
    ⇔ { ∀ j ≠ i_1, . . . , i_k,  Z^{(i_1,...,i_{k−1})}_j ≤ Z^{(i_1,...,i_{k−1})}_{i_k} ≤ Z^{(i_1,...,i_{k−2})}_{i_{k−1}} ≤ . . . ≤ Z^{(i_1)}_{i_2} ≤ Z_{i_1} }
    ⇔ { ∀ j ≠ i_1, . . . , i_{k−1},  Z^{(i_1,...,i_{k−1})}_j ≤ Z^{(i_1,...,i_{k−2})}_{i_{k−1}} ≤ . . . ≤ Z^{(i_1)}_{i_2} ≤ Z_{i_1}
        and ∀ j ≠ i_1, . . . , i_k,  Z^{(i_1,...,i_{k−1})}_j ≤ Z^{(i_1,...,i_{k−1})}_{i_k} }
    ⇔ { λ^{(i_1,...,i_{k−1})}_k ≤ Z^{(i_1,...,i_{k−2})}_{i_{k−1}} ≤ . . . ≤ Z^{(i_1)}_{i_2} ≤ Z_{i_1}  and  λ^{(i_1,...,i_{k−1})}_k = Z^{(i_1,...,i_{k−1})}_{i_k} }
    ...
    ⇔ { λ^{(i_1,...,i_{a−1})}_a ≤ Z^{(i_1,...,i_{a−2})}_{i_{a−1}} ≤ . . . ≤ Z^{(i_1)}_{i_2} ≤ Z_{i_1}
        and  λ^{(i_1,...,i_{k−1})}_k = Z^{(i_1,...,i_{k−1})}_{i_k} ≤ . . . ≤ λ^{(i_1,...,i_{a−1})}_a = Z^{(i_1,...,i_{a−1})}_{i_a} }   (s_a)
    ...
    ⇔ { ı_1 = i_1, . . . , ı_k = i_k }.

Now, observe that i_{k+1} is the (unique) arg max of λ^{(i_1,...,i_k)}_{k+1} on the event {ı_1 = i_1, . . . , ı_k = i_k}. It yields that

    {ı_1 = i_1, . . . , ı_{k+1} = i_{k+1}} = { λ^{(i_1,...,i_k)}_{k+1} = Z^{(i_1,...,i_k)}_{i_{k+1}} ≤ Z^{(i_1,...,i_{k−1})}_{i_k} ≤ · · · ≤ Z^{(i_1)}_{i_2} ≤ Z_{i_1} },



Figure 5. Rejection domains associated to the different comparison sets appearing in steps of the proof of Theorem 6.

as claimed. Stopping at a as in (sa) gives the second part of the statement.

C.6. Orthogonal Case: Proof of Theorem 6

Let I be the set of admissible indexes

    I := {(a, b, c) : a_0 ≤ a < b < c ≤ K + 1}.

◦ Step 1: We prove that, when the considered indexes belong to I and are such that c + 1 ≤ K + 1, S_{a,b,c+1} is more powerful than S_{a,b,c}. Our proof is conditional on F_a = f_a and F_{c+1} = f_{c+1}. Note that (F_{a+1}, . . . , F_c) is uniformly distributed on the simplex

    S := {f_a > F_{a+1} > · · · > F_c > f_{c+1}}.

This implies by a direct calculation that

    I_{ab}(s, t) = (s − t)^{b−a−1} / (b − a − 1)!

and that

    F_{abc}(λ_b) / F_{abc}(λ_a) = F^{β}_{(b−a),(c−b)} ( (F_b − F_c) / (f_a − F_c) ),   (33)

where F^{β} is the cumulative distribution function of the Beta distribution with the indicated parameters. Using the monotonicity of this function, the test S_{abc} has rejection region

    (F_b − F_c) ≥ z_1 (f_a − F_c) ⇔ F_b ≥ z_1 f_a + (1 − z_1) F_c,   (34)

where z_1 is some threshold, depending on α, that belongs to (0, 1). Similarly, S_{a,b,(c+1)} has rejection region

    F_b ≥ z_2 (f_a − f_{c+1}) + f_{c+1},   (35)

where z_2 is some other threshold belonging to (0, 1). We use the following lemmas.

Lemma 11. Let c ≤ K. The density h_{µ0} of (F_1, . . . , F_c), conditional on F_{c+1}, with respect to the Lebesgue measure under the alternative is coordinate-wise non-decreasing and given by (36).

Proof. Observe that it suffices to prove the result when σ = 1. Note that

    λ^{(i_1,...,i_c)}_{c+1} = max_{j∈[p], j≠i_1,...,i_c} |Z_j|.

Thus its density p_{µ0,i_1,...,i_c} does not depend on µ0_{i_1}, . . . , µ0_{i_c}. As a consequence the following variables all have the same distribution: λ^{(i_1+ε_1 p, ..., i_c+ε_c p)}_{c+1}, where ε_1, . . . , ε_c take the value 0 or 1 and indices are taken modulo p.


Because of the independence of the different variables, the joint density under the alternative of (λ_1, . . . , λ_{c+1}), taken at (ℓ_1, . . . , ℓ_{c+1}) on the domain {ℓ_1 > · · · > ℓ_{c+1}}, takes the value

    (Const) Σ′ (φ(ℓ_1 − µ0_{j_1}) + φ(ℓ_1 + µ0_{j_1})) ··· (φ(ℓ_c − µ0_{j_c}) + φ(ℓ_c + µ0_{j_c})) p_{µ0,j_1,...,j_c}(ℓ_{c+1}).

Here the sum Σ′ is taken over all pairwise distinct j_1, . . . , j_c belonging to ⟦1, p⟧. Then the density, conditional on F_{c+1} = f_{c+1}, of (F_1, . . . , F_c) at (f_1, . . . , f_c) takes the value

    (const) Σ′ cosh(µ_{j_1} f_1) ··· cosh(µ_{j_c} f_c) 1_{f_1 > ··· > f_c > F_{c+1}},   (36)

implying that this density is coordinate-wise non-decreasing.

Lemma 12. Let ν_0 be the image on the plane (F_b, F_c) of the uniform probability on S: it is the distribution under the null of (F_b, F_c). The two rejection regions R_1, associated with (34), and R_2, associated with (35), have of course the same probability α under ν_0. Let η_{µ0} be the density w.r.t. ν_0 of the distribution of (F_b, F_c) under the alternative. Then η_{µ0} is coordinate-wise non-decreasing.

Proof. Integration yields that the density of ν_0 w.r.t. the Lebesgue measure, taken at a point (f_b, f_c), is

    (f_a − f_b)^{b−a−1} (f_b − f_c)^{c−b−1} / [ (b − a − 1)! (c − b − 1)! ].

The density of ν_{µ0} w.r.t. the Lebesgue measure is

    ∫_{f_b}^{f_a} df_{a+1} ··· ∫_{f_b}^{f_{b−2}} df_{b−1} ∫_{f_c}^{f_b} df_{b+1} ··· ∫_{f_c}^{f_{c−2}} df_{c−1}  h_{µ0}(f_a, . . . , f_c).   (37)

Thus η_{µ0}, which is the quotient of these two quantities, is just a mean value of h_{µ0} on the domain of integration D_{f_b,f_c} in (37).

Suppose that f_b and f_c increase; then all the bounds of the domain D_{f_b,f_c} increase as well. By Lemma 11 the mean value increases.

We now finish the proof of Step 1. For a given level α, consider the two rejection regions R_{a,b,c} and R_{a,b,(c+1)} of the two considered tests in the plane (F_b, F_c) and set

    A := R_{a,b,c} \ R_{a,b,(c+1)}   and   B := R_{a,b,(c+1)} \ R_{a,b,c},

see Figure 5. These two regions have the same ν_0 measure. By elementary geometry there exists a point K = (K_b, K_c) in the plane such that

• for every point of A, F_b ≤ K_b and F_c ≤ K_c;
• for every point of B, F_b ≥ K_b and F_c ≥ K_c.

By transport of measure there exists a transport map T that preserves the measure ν_0 and is one-to-one from A to B. As a consequence, transporting by T improves the probability under the alternative: the power of S_{a,b,c+1} is larger than that of S_{a,b,c}.

◦ Step 2: We prove that, when the considered indexes belong to I and are such that a < b − 1, S_{a,(b−1),c} is more powerful than S_{a,b,c}. Our proof is conditional on F_a = f_a, F_b = f_b and takes place in the plane (F_{b−1}, F_c).
The rejection region R_{a,b,c} takes the form F_c ≤ (1/(1 − z_1)) f_b − (z_1/(1 − z_1)) f_a for some threshold z_1 belonging to (0, 1).
The rejection region R_{a,(b−1),c} takes the form F_c ≤ (1/(1 − z_2)) F_{b−1} − (z_2/(1 − z_2)) f_a for some other threshold z_2 belonging to (0, 1).
These regions, as well as the regions A and B and the point K, are indicated in Figure 5.



Transport of measure and the corresponding modification of Lemma 12 imply that the power of S_{a,(b−1),c} is greater than or equal to that of S_{a,b,c}.

◦ Step 3: We prove that, when the considered indexes belong to I and are such that a + 1 < b, S_{a,b,c} is more powerful than S_{(a+1),b,c}. Our proof is conditional on F_a = f_a, F_c = f_c and takes place in the plane (F_{a+1}, F_b).
The rejection region R_{a,b,c} takes the form F_b ≥ z_1 f_a + (1 − z_1) f_c for some threshold z_1 belonging to (0, 1).
The rejection region R_{a+1,b,c} takes the form F_b ≥ z_2 F_{a+1} + (1 − z_2) f_c for some other threshold z_2 belonging to (0, 1).
These regions, as well as the regions A and B and the point K, are indicated in Figure 5.

Transport of measure and the corresponding modification of Lemma 12 imply that the power of S_{a,b,c} is greater than or equal to that of S_{(a+1),b,c}.

Considering the three cases above, we get the desired result.

C.7. Proof of Theorem 1

We rely on the Weak Positive Regression Dependency (WPRDS) property to prove the result; one may consult [17, Page 173] for instance. We say that a function g : [0, 1]^K → R_+ is nondecreasing if for any p, q ∈ [0, 1]^K such that p_k ≥ q_k for all k = 1, . . . , K, we have g(p) ≥ g(q). We say that a Borel set Γ ⊆ [0, 1]^K is nondecreasing if g = 1_Γ is nondecreasing; in other words, if y ∈ Γ and z ≥ 0, then y + z ∈ Γ. We say that the p-values (p_1 = α_{0,1,K+1}, . . . , p_K = α_{K−1,K,K+1}) satisfy the WPRDS property if for any nondecreasing set Γ and for all k_0 ∈ I_0, the function

    u ↦ P_{µ0}[ (p_1, . . . , p_K) ∈ Γ | p_{k_0} ≤ u ] is nondecreasing,

where µ0 = β0 in our orthogonal design case, and we recall that

    I_0 = { k ∈ [K] : H_{0,k} is true }.

To prove Theorem 1, note that it is sufficient [17, Chapter 8] to prove that

    u ↦ P[ (p_1, . . . , p_K) ∈ Γ | p_{k_0} ≤ u ] is nondecreasing,   (38)

where E, P denote expectations and probabilities conditional on {ı_1, . . . , ı_K, λ_{K+1}} and under the hypothesis that µ0 = X^T X β0. Note that one can integrate in λ_{K+1} to get the statement of Theorem 1.

◦ Step 1: We start by giving the joint law of the LAR's knots under the alternative in the orthogonal design case. Lemma 11 and (36) show that, conditional on {ı_1, . . . , ı_K, λ_{K+1}}, the vector (λ_1, . . . , λ_K) is distributed on the set {λ_1 ≥ λ_2 ≥ · · · ≥ λ_K ≥ λ_{K+1}} and has a coordinate-wise nondecreasing density. Now we can assume without loss of generality that σ² = 1; in addition, because of orthogonality, ρ_k² = 1, implying that F_k = Φ(λ_k) and P_{i,j} = Φ_i ∘ Φ_j^{−1} = Id. We deduce that, conditional on {ı_1, . . . , ı_K, F_{K+1}}, the vector (F_1, . . . , F_K) is distributed on the set

    { (f_1, . . . , f_K) ∈ R^K : 1 ≥ f_1 ≥ f_2 ≥ · · · ≥ f_K ≥ F_{K+1} },

it has an explicit density given by (36), and we denote it by h_{µ0}. By the change of variables G_k := (F_k − F_{K+1}) / (F_{k−1} − F_{K+1}) one obtains that the distribution of (G_1, . . . , G_K) is supported on [0, 1]^K. More precisely, define

    ψ(f_1, . . . , f_K) := (g_1, . . . , g_K) := ( (f_1 − F_{K+1}) / (1 − F_{K+1}), . . . , (f_K − F_{K+1}) / (f_{K−1} − F_{K+1}) ),
    ψ^{−1}(g_1, . . . , g_K) := ( (1 − F_{K+1}) g_1 + F_{K+1}, . . . , (1 − F_{K+1}) g_1 g_2 ··· g_K + F_{K+1} ),



whose inverse Jacobian determinant is

    det[ ∂ψ/∂f_1 ··· ∂ψ/∂f_K ]^{−1} = ∏_{k=1}^{K} (f_{k−1} − F_{K+1}) = (1 − F_{K+1})^K ∏_{k=1}^{K} g_k^{K−k}.

We deduce that the density of (G_1, . . . , G_K) | {ı_1, . . . , ı_K, F_{K+1}} at a point g, with respect to the Lebesgue measure, is

    p(g) := (const) 1_{g∈(0,1)^K} ∏_{k=1}^{K} g_k^{K−k} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) ∏_{ℓ=1}^{k} g_ℓ + F_{K+1} ) ],   (39)

where we have used (36). From (18) and (33), one has

    p_k = 1 − F^{β}_{(1,K−k+1)} ( (F_k − F_{K+1}) / (F_{k−1} − F_{K+1}) ) = 1 − F^{β}_{(1,K−k+1)}(G_k),   (40)

where F^{β} is the cumulative distribution function of the Beta distribution with the indicated parameters. We deduce that, for any v ∈ (0, 1) and for any k ∈ [K],

    p_k = v ⇔ (G_1, . . . , G_K) ∈ [0, 1]^K ∩ { G_k = F^{−1}_{β(1,K−k+1)}(1 − v) },

so that

    P[ (p_1, . . . , p_K) ∈ Γ | p_{k_0} ≤ u ] = P[ (G_1, . . . , G_K) ∈ Γ | G_{k_0} ≥ F^{−1}_{β(1,K−k_0+1)}(1 − u) ],   (41)

where Γ, viewed in the G variables, can be proved to be a nonincreasing Borel set from (40).
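As a side illustration of (40), in the orthogonal case with σ known the p-values p_k can be computed directly from the observed knots. The helper below is a sketch of ours (not the paper's reference code), assuming ρ_k = 1 and a known σ.

```python
import numpy as np
from scipy.stats import beta, norm

def spacing_pvalues(knots, sigma=1.0):
    """Sketch of the p-values in (40), orthogonal design, rho_k = 1, sigma known.
    `knots` = (lambda_1, ..., lambda_{K+1}); returns (p_1, ..., p_K)."""
    lam = np.asarray(knots, dtype=float)
    K = len(lam) - 1
    F = norm.cdf(lam / sigma)                         # F_k = Phi(lambda_k / sigma)
    F_prev = np.concatenate(([1.0], F[:K - 1]))       # F_0 = 1 (convention lambda_0 = infinity)
    G = (F[:K] - F[K]) / (F_prev - F[K])              # G_k = (F_k - F_{K+1}) / (F_{k-1} - F_{K+1})
    return np.array([1.0 - beta.cdf(G[k], 1, K - k) for k in range(K)])
```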

◦ Step 2: Let 0 < x < y < 1 and denote by µ_x the conditional law

    µ_x := law[ (G_1, . . . , G_K) | {ı_1, . . . , ı_K, F_{K+1}, G_{k_0} ≥ x} ].

Remark that if there exists a measurable map T : [0, 1]^K → [0, 1]^K such that

• T is nondecreasing, meaning that for any g ∈ [0, 1]^K, T(g) ≥ g;
• the push-forward of µ_x by T gives µ_y, namely T#µ_x = µ_y;

then it holds that

• 1_{T(g)∈Γ} ≤ 1_{g∈Γ};
• law[ T(G) | {ı_1, . . . , ı_K, F_{K+1}, G_{k_0} ≥ x} ] = law[ G | {ı_1, . . . , ı_K, F_{K+1}, G_{k_0} ≥ y} ], where G = (G_1, . . . , G_K).

In this case, we deduce that

    P[ G ∈ Γ | G_{k_0} ≥ x ] ≥ P[ T(G) ∈ Γ | G_{k_0} ≥ x ] = P[ G ∈ Γ | G_{k_0} ≥ y ].

If one can prove that such a map T exists for any 0 < x < y < 1, it proves that

    x ↦ P[ G ∈ Γ | G_{k_0} ≥ x ] is nonincreasing,

and, in view of (41), it proves (38). Proving that such a map T exists is done in the next step.

◦ Step 3: Let 0 < x < y < 1. Consider the Knothe–Rosenblatt transport map T from µ_x toward µ_y following the order

    k_0 → k_0 + 1 → · · · → K → k_0 − 1 → k_0 − 2 → · · · → 1.



It is based on a sequence of conditional quantile transforms defined following the ordering above. Its construction is presented for instance in [24, Sec. 2.3, p. 67] or [31, p. 20]. The transport T is defined as follows. Given z, z′ ∈ [0, 1]^K such that z′ = T(z), it holds that

    z′_{k_0} = T^{(k_0)}(z_{k_0});
    z′_{k_0+1} = T^{(k_0+1)}(z_{k_0+1}, z′_{k_0});
    ...
    z′_K = T^{(K)}(z_K, z′_{K−1}, . . . , z′_{k_0});
    z′_{k_0−1} = T^{(k_0−1)}(z_{k_0−1}, z′_K, . . . , z′_{k_0});
    ...
    z′_1 = T^{(1)}(z_1, z′_2, . . . , z′_{k_0−1}, z′_K, . . . , z′_{k_0});

where T^{(k_0)}, T^{(k_0+1)}, . . . , T^{(K)}, T^{(k_0−1)}, . . . , T^{(1)} will be built in the sequel, and where we drop their dependencies on the z′_k's to ease notation. It remains to prove that

• T is nondecreasing, meaning that for any g ∈ [0, 1]^K, T(g) ≥ g;
• the push-forward of µ_x by T gives µ_y, namely T#µ_x = µ_y;

to conclude. The last point is a property of the Knothe–Rosenblatt transport map. Proving the first point will be done in the rest of the proof.
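Each coordinate map T^{(k)} above is a conditional quantile transform. The small sketch below illustrates, on a one-dimensional empirical example of ours (arbitrary thresholds, not taken from the paper), both the construction F_y^{-1} ∘ F_x and the monotonicity T(t) ≥ t that is used repeatedly in Steps 3.1 to 3.4.

```python
import numpy as np

def quantile_transport(x_samples, y_samples, t):
    """Empirical quantile transform T = F_Y^{-1} o F_X sending law(X) onto law(Y)."""
    Fx = (np.searchsorted(np.sort(x_samples), t) + 0.5) / (len(x_samples) + 1)
    return np.quantile(y_samples, Fx)

rng = np.random.default_rng(0)
g = rng.standard_normal(500_000)
X, Y = g[g > 0.2], g[g > 0.5]        # two truncated Gaussians, Y truncated further to the right
t = np.linspace(0.3, 2.5, 20)
print(np.all(quantile_transport(X, Y, t) >= t))   # the transport only pushes points upward
```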

◦ Step 3.1: We start with the first transport map T^{(k_0)} : [0, 1] → [0, 1]. Denote by µ^{(k_0)}_x the conditional law

    µ^{(k_0)}_x := law[ G_{k_0} | {ı_1, . . . , ı_K, F_{K+1}, G_{k_0} ≥ x} ],

and by F^{(k_0)}_x its cdf. Note that the Knothe–Rosenblatt construction gives T^{(k_0)} = (F^{(k_0)}_y)^{−1} ∘ F^{(k_0)}_x. We would like to prove that T^{(k_0)}(t) ≥ t for all t ∈ (0, 1). This is equivalent to proving that F^{(k_0)}_x ≥ F^{(k_0)}_y. For t ≤ y, F^{(k_0)}_y(t) = 0, which implies F^{(k_0)}_x(t) ≥ F^{(k_0)}_y(t). Let t > y; using the conditional density p defined in (39), note that

    F^{(k_0)}_x(t) ≥ F^{(k_0)}_y(t) ⇔ (∫_x^t p) / (∫_x^1 p) ≥ (∫_y^t p) / (∫_y^1 p) ⇔ ∫_x^t ∫_y^1 p ⊗ p ≥ ∫_y^t ∫_x^1 p ⊗ p,

where, for example, ∫_x^t means the integral over the hyper-rectangle

    [x, t] := { (g_1, . . . , g_K) ∈ [0, 1]^K : x ≤ g_{k_0} ≤ t }.

A simple calculation (see also Figure 6) gives that

    ∫_x^t ∫_y^1 p ⊗ p = ∫_y^t ∫_x^1 p ⊗ p + ∫_{[x,y]×[t,1]} p ⊗ p,

and it proves that F^{(k_0)}_x ≥ F^{(k_0)}_y.

◦ Step 3.2: We continue with the second transport map in the Knothe–Rosenblatt construction. Let z_{k_0} ∈ (x, 1) and denote by µ^{(k_0+1)}_{z_{k_0}} the conditional law

    µ^{(k_0+1)}_{z_{k_0}} := law[ G_{k_0+1} | {ı_1, . . . , ı_K, F_{K+1}, G_{k_0} = z_{k_0}} ],



Figure 6. Note that, by symmetry, the two boxed regions have the same p ⊗ p measure. The blue region is [x, t] × [y, 1]; its measure is the measure of the red region (namely [y, t] × [x, 1]) plus that of the upper-left corner (namely [x, y] × [t, 1]).

and by F^{(k_0+1)}_{z_{k_0}} its cdf. Let z′_{k_0} := T^{(k_0)}(z_{k_0}) and denote by µ^{(k_0+1)}_{z′_{k_0}} the conditional law

    µ^{(k_0+1)}_{z′_{k_0}} := law[ G_{k_0+1} | {ı_1, . . . , ı_K, F_{K+1}, G_{k_0} = z′_{k_0}} ],

and by F^{(k_0+1)}_{z′_{k_0}} its cdf. Note that x < z_{k_0} ≤ z′_{k_0} = T^{(k_0)}(z_{k_0}) ≤ 1. Again, we would like to prove that F^{(k_0+1)}_{z_{k_0}} ≥ F^{(k_0+1)}_{z′_{k_0}}, which implies that the transport map T^{(k_0+1)} := (F^{(k_0+1)}_{z′_{k_0}})^{−1} ∘ F^{(k_0+1)}_{z_{k_0}} satisfies T^{(k_0+1)}(u) ≥ u for all u ∈ (0, 1).

Recall that the conditional density p of G | {ı_1, . . . , ı_K, F_{K+1}} is given by (39) and recall that k_0 ∈ I_0. Observe that µ0_{ı_{k_0}} = 0, so that the conditional density of G | {ı_1, . . . , ı_K, F_{K+1}, G_{k_0} = z} is

    (const) 1_{g∈(0,1)^K} 1_{g_{k_0}=z} ∏_{k<k_0} g_k^{K−k} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) ∏_{ℓ=1}^{k} g_ℓ + F_{K+1} ) ]
      × ∏_{k>k_0} g_k^{K−k} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) z ∏_{1≤ℓ≤k, ℓ≠k_0} g_ℓ + F_{K+1} ) ].   (42)

Set τ := z′_{k_0}/z_{k_0} ≥ 1 and G′_{k_0+1} := τ G_{k_0+1}, so that z′_{k_0} G_{k_0+1} = z_{k_0} G′_{k_0+1}. Denote

    G′ := (G_1, . . . , G_{k_0−1}, G′_{k_0+1}, G_{k_0+2}, . . . , G_K) ∈ (0, 1)^{k_0−1} × (0, τ) × (0, 1)^{K−k_0−1},

and note that the conditional density of G′ | {ı_1, . . . , ı_K, F_{K+1}, G_{k_0} = τz} is

    (const) 1_{g∈(0,1)^{k_0−1}×(0,τ)×(0,1)^{K−k_0−1}} ∏_{k<k_0} g_k^{K−k} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) ∏_{ℓ=1}^{k} g_ℓ + F_{K+1} ) ]
      × ∏_{k>k_0} g_k^{K−k} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) z ∏_{1≤ℓ≤k, ℓ≠k_0} g_ℓ + F_{K+1} ) ],

which, up to a normalizing constant, is the same expression as (42), up to the change of support

    1_{g∈(0,1)^K} ↔ 1_{g′∈(0,1)^{k_0−1}×(0,τ)×(0,1)^{K−k_0−1}}.

By an abuse of notation, we denote by p this function, namely

    p(g) = ∏_{k<k_0} g_k^{K−k} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) ∏_{ℓ=1}^{k} g_ℓ + F_{K+1} ) ] × ∏_{k>k_0} g_k^{K−k} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) z ∏_{1≤ℓ≤k, ℓ≠k_0} g_ℓ + F_{K+1} ) ].



Figure 7. The two boxed rectangles have the same Lebesgue measure, namely τt. The p ⊗ p measure of the grey box is greater than the p ⊗ p measure of the white box.

We deduce that

    F^{(k_0+1)}_{z_{k_0}}(t) ≥ F^{(k_0+1)}_{z′_{k_0}}(t) ⇔ P(G_{k_0+1} ≤ t | G_{k_0} = z_{k_0}) ≥ P(G_{k_0+1} ≤ t | G_{k_0} = z′_{k_0})
      ⇔ P(G_{k_0+1} ≤ t | G_{k_0} = z_{k_0}) ≥ P(G′_{k_0+1} ≤ τt | G_{k_0} = τ z_{k_0})
      ⇔ (∫_{D(t)} p) / (∫_{D(1)} p) ≥ (∫_{D(τt)} p) / (∫_{D(τ)} p)
      ⇔ ∫_{D(t)×D(τ)} p ⊗ p ≥ ∫_{D(τt)×D(1)} p ⊗ p,   (43)

where

    D(s) := { (g_1, . . . , g_{k_0−1}, g_{k_0+1}, . . . , g_K) ∈ (0, 1)^{K−1} : 0 < g_{k_0+1} ≤ s }.

We now present an inequality on these quantities to conclude. Observe that we are integrating on the domains depicted in Figure 7. The two boxes have the same area for the uniform measure and we would like to compare their respective measures for the p ⊗ p measure. We start with the next lemma, whose proof is omitted.

Lemma 13. Let a, b ≥ 0. The function

    z ↦ cosh(a z + b) × cosh(a/z + b)

is non-decreasing on the domain [1, ∞).
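A quick numerical sanity check of Lemma 13 (the parameter values are arbitrary, for illustration only):

```python
import numpy as np

a, b = 0.7, 0.3                                   # any a, b >= 0
z = np.linspace(1.0, 10.0, 1000)
f = np.cosh(a * z + b) * np.cosh(a / z + b)
print(np.all(np.diff(f) >= 0))                    # non-decreasing on [1, infinity)
```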

Now, let (g_1, . . . , g_{k_0−1}, g_{k_0+2}, . . . , g_K) ∈ (0, 1)^{K−2} be fixed in the integrals (43). We are then looking at the weights of the domains (h_1, h_2) ∈ (0, t) × (0, τ) and (h_3, h_4) ∈ (0, τt) × (0, 1) for the weight function w given by

    w(h_1, h_2) = C_1 h_1^{K−k_0−1} cosh[ µ0_{ı_{k_0+1}} ( (1 − F_{K+1}) z_{k_0} ∏_{1≤ℓ<k_0} g_ℓ × h_1 + F_{K+1} ) ] ∏_{k>k_0+1} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) z_{k_0} ∏_{1≤ℓ≤k, ℓ≠k_0,k_0+1} g_ℓ × h_1 + F_{K+1} ) ]
      × h_2^{K−k_0−1} cosh[ µ0_{ı_{k_0+1}} ( (1 − F_{K+1}) z_{k_0} ∏_{1≤ℓ<k_0} g_ℓ × h_2 + F_{K+1} ) ] ∏_{k>k_0+1} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) z_{k_0} ∏_{1≤ℓ≤k, ℓ≠k_0,k_0+1} g_ℓ × h_2 + F_{K+1} ) ],

where the constant C_1 depends on (g_1, . . . , g_{k_0−1}, g_{k_0+2}, . . . , g_K). By the change of variables h′_1 = h_3/t and h′_2 = t h_4, the right-hand term of (43) is given by the integration on the



domain (h′_1, h′_2) ∈ (0, t) × (0, τ) of the weight function w′ given by

    w′(h′_1, h′_2) = C_1 (h′_1)^{K−k_0−1} cosh[ µ0_{ı_{k_0+1}} ( (1 − F_{K+1}) z_{k_0} ∏_{1≤ℓ<k_0} g_ℓ × t h′_1 + F_{K+1} ) ] ∏_{k>k_0+1} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) z_{k_0} ∏_{1≤ℓ≤k, ℓ≠k_0,k_0+1} g_ℓ × t h′_1 + F_{K+1} ) ]
      × (h′_2)^{K−k_0−1} cosh[ µ0_{ı_{k_0+1}} ( (1 − F_{K+1}) z_{k_0} ∏_{1≤ℓ<k_0} g_ℓ × h′_2/t + F_{K+1} ) ] ∏_{k>k_0+1} cosh[ µ0_{ı_k} ( (1 − F_{K+1}) z_{k_0} ∏_{1≤ℓ≤k, ℓ≠k_0,k_0+1} g_ℓ × h′_2/t + F_{K+1} ) ].

Now, invoke Lemma 13 with

    a = µ0_{ı_k} (1 − F_{K+1}) z_{k_0} ∏_{1≤ℓ<k_0} g_ℓ × h,   b = µ0_{ı_k} F_{K+1},   z = t,

where h stands for h_1 or h_2, to get that w′ ≥ w and so

    ∫_{D(t)×D(τ)} p ⊗ p ≥ ∫_{D(τt)×D(1)} p ⊗ p,

which concludes this part of the proof.

◦ Step 3.3: We continue by induction with the other transport maps of the Knothe–Rosenblatt construction. Assume that we have built z′ := (z′_k, . . . , z′_{k_0}) and z := (z_k, . . . , z_{k_0}) for some k > k_0. Denote by µ^{(k+1)}_z the conditional law

    µ^{(k+1)}_z := law[ G_{k+1} | {ı_1, . . . , ı_K, F_{K+1}, G_k = z_k, . . . , G_{k_0} = z_{k_0}} ],

a condition that we denote by G_{[k,k_0]} = z, and by F^{(k+1)}_z its cdf. Denote by µ^{(k+1)}_{z′} the conditional law

    µ^{(k+1)}_{z′} := law[ G_{k+1} | {ı_1, . . . , ı_K, F_{K+1}, G_{[k,k_0]} = z′} ],

and by F^{(k+1)}_{z′} its cdf. Note that z ≤ z′ = T^{(k)}(z) ≤ 1. Again, we would like to prove that F^{(k+1)}_z ≥ F^{(k+1)}_{z′}, which implies that the transport map T^{(k+1)} := (F^{(k+1)}_{z′})^{−1} ∘ F^{(k+1)}_z satisfies T^{(k+1)}(u) ≥ u for all u ∈ (0, 1).

For z ∈ (0, 1)^{k−k_0} × (x, 1), the conditional density of G | {ı_1, . . . , ı_K, F_{K+1}, G_{[k,k_0]} = z} is

0 × (x, 1), the conditional density of G|{ı1, . . . , ıK , FK+1, G[k,k0] = z} is

(const)1g∈(0,1)K1g[k,k0]=z

∏m<k0

gK−mm cosh[µ0ım((1− FK+1)

m∏`=1

g` + FK+1)]]

×∏

k0≤m≤k

zK−mm cosh[µ0ım((1− FK+1)

∏1≤`<k0

g`

m∏n=k0

zn + FK+1)]

×∏k<m

gK−mm cosh[µ0ım((1− FK+1)

∏1≤`<k0

g`

k∏n=k0

zn∏

k<`≤m

g` + FK+1)].



Set τ := ∏_{n=k_0}^{k} z′_n / ∏_{n=k_0}^{k} z_n ≥ 1 and G′_{k+1} := τ G_{k+1}, so that

    [ ∏_{n=k_0}^{k} z′_n ] G_{k+1} = [ ∏_{n=k_0}^{k} z_n ] G′_{k+1}.

Then the proof follows the same ideas as in Step 3.2 and we do not detail it here.

◦ Step 3.4: This is the last step of the proof. Assume that we have built z′ := (z′_K, . . . , z′_{k_0}) and z := (z_K, . . . , z_{k_0}). Denote by µ^{(k_0−1)}_z the conditional law

    µ^{(k_0−1)}_z := law[ G_{k_0−1} | {ı_1, . . . , ı_K, F_{K+1}, G_{[K,k_0]} = z} ],

and by F^{(k_0−1)}_z its cdf. Denote by µ^{(k_0−1)}_{z′} the conditional law

    µ^{(k_0−1)}_{z′} := law[ G_{k_0−1} | {ı_1, . . . , ı_K, F_{K+1}, G_{[K,k_0]} = z′} ],

and by F^{(k_0−1)}_{z′} its cdf. Note that z ≤ z′ = T^{(K)}(z) ≤ 1. Again, we would like to prove that F^{(k_0−1)}_z ≥ F^{(k_0−1)}_{z′}, which implies that the transport map T^{(k_0−1)} := (F^{(k_0−1)}_{z′})^{−1} ∘ F^{(k_0−1)}_z satisfies T^{(k_0−1)}(u) ≥ u for all u ∈ (0, 1).

For z ∈ (0, 1)^{K−k_0} × (x, 1), the conditional density of G | {ı_1, . . . , ı_K, F_{K+1}, G_{[K,k_0]} = z} is

    (const) 1_{g∈(0,1)^K} 1_{g_{[K,k_0]}=z} ∏_{m<k_0} g_m^{K−m} cosh[ µ0_{ı_m} ( (1 − F_{K+1}) ∏_{ℓ=1}^{m} g_ℓ + F_{K+1} ) ]
      × ∏_{k_0≤m≤K} z_m^{K−m} cosh[ µ0_{ı_m} ( (1 − F_{K+1}) ∏_{1≤ℓ<k_0} g_ℓ ∏_{n=k_0}^{m} z_n + F_{K+1} ) ].

Now, let (g_1, . . . , g_{k_0−2}) ∈ (0, 1)^{k_0−2} be fixed and denote, for all g ∈ (0, 1),

    w_z(g) := g^{K−k_0+1} cosh[ µ0_{ı_{k_0−1}} ( (1 − F_{K+1}) ∏_{ℓ=1}^{k_0−2} g_ℓ × g + F_{K+1} ) ]
      × ∏_{k_0≤m≤K} z_m^{K−m} cosh[ µ0_{ı_m} ( (1 − F_{K+1}) ∏_{n=k_0}^{m} z_n ∏_{ℓ=1}^{k_0−2} g_ℓ × g + F_{K+1} ) ],

and, substituting z by z′, define w_{z′} as well. Let t ∈ (0, 1). Following the idea of Step 3.2, one can check that it is sufficient to prove that

    ∫_0^t ( ∫_0^1 w_z(g) w_{z′}(g′) dg′ ) dg ≥ ∫_0^1 ( ∫_0^t w_z(g) w_{z′}(g′) dg′ ) dg.

Subtracting

    ∫_0^t ( ∫_0^t w_z(g) w_{z′}(g′) dg′ ) dg

from both sides, one is reduced to proving that

    ∫_0^t ( ∫_t^1 w_z(g) w_{z′}(g′) dg′ ) dg ≥ ∫_0^t ( ∫_t^1 w_{z′}(g) w_z(g′) dg′ ) dg.

Observe that g ≤ g′ in the last two integrals. Now, we have the following lemma, whose proof is omitted.


Lemma 14. Let 0 < a ≤ a′ and b > 0. The function

    z ↦ cosh(a z + b) / cosh(a′ z + b)

is non-increasing on the domain (0, ∞).

Let g ≤ g′. From Lemma 14, we deduce that cosh(ag + b) cosh(a′g′ + b) ≥ cosh(ag′ + b) cosh(a′g + b), proving that w_z(g) w_{z′}(g′) ≥ w_{z′}(g) w_z(g′). This proves that T^{(k_0−1)}(u) ≥ u for all u ∈ (0, 1).
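As for Lemma 13, a quick numerical sanity check of Lemma 14 (parameter values arbitrary, for illustration only):

```python
import numpy as np

a, a_prime, b = 0.5, 1.5, 0.3                     # any 0 < a <= a', b > 0
z = np.linspace(1e-3, 10.0, 1000)
g = np.cosh(a * z + b) / np.cosh(a_prime * z + b)
print(np.all(np.diff(g) <= 0))                    # non-increasing on (0, infinity)
```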

We then proceed by induction for k_0 − 1 → k_0 − 2 → · · · → 1. The proof follows the same lines as in Step 3.4 above.

Proof of Theorem 5

From Corollary 4 we deduce that, under H_0 and conditionally on the selection event {m = a} with a ≤ K − 1, (17) remains true, giving the conditional cumulative distribution function of λ_b. As a consequence, and under the same conditioning,

    F_{abc}(λ_b) / F_{abc}(λ_a) ∼ U(0, 1).

Since this conditional distribution does not depend on the condition, it is also the unconditional distribution. Finally, considerations on the distribution under the alternative show that, to obtain a p-value, one must take the complement to 1 of the quantity above.

Appendix D: Main notation

[a]                  the set of integers {1, . . . , a},
X                    a full rank n × p design matrix,
σ²                   the variance of the errors,
ı_k; ε_k             the indexes and the signs of the variables that enter along the LAR path,
ı_k ∈ [2p]           the coding of both the index and the sign, see (1),
S_k                  {ı_1, . . . , ı_k}, a possible selected support,
S_0                  the true support,
E_k                  Span(X_{ı_1}, . . . , X_{ı_k}),
P_k (P⊥_k)           the projector onto E_k (onto its orthogonal complement),
γ_{m_k,v_k²} (φ_{m_k,v_k²})   the normal distribution (density) with mean m_k given by (13) and variance v_k² = σ² ρ_k² given by (15),
Z                    the symmetrized vector of correlations, of size 2p, obtained from the correlation vector Z defined by (9),
R                    the variance-covariance matrix of the symmetrized Z, see (10),
K                    the number of knots that are considered, see Figure 2 and Section 2.5,
K_select             see Figure 2,
m                    the chosen size,
S                    the chosen set of variables: S = S_m,
ρ_k²                 v_k²/σ², see (13),
θ_j(i_1, . . . , i_k)   see (11),
θ_ℓ                  θ(ı_1, . . . , ı_ℓ),
M_{i_1,...,i_ℓ}      the submatrix of R obtained by keeping the columns and the rows indexed by {i_1, . . . , i_ℓ},
F_i; P_{ij}          F_i := Φ_i(λ_i) := Φ(λ_i/(σρ_i)) and P_{ij} is given by (14),
F_{abc}(t)           see (16),
α_{abc}              the p-value of the generalized spacing test, see (18),
S_{abc}              1_{{α_{abc} ≤ α}}, the generalized spacing test, see (19).



Appendix E: Cubature of integral by lattice rule

Our goal is to compute the integral of some function f on the hypercube of dimension d, namely

    I := ∫_{[0,1]^d} f(x) dx.

We want to approximate it by a finite sum over n points,

    I_n := (1/n) Σ_{i=1}^{n} f(x^{(i)}).

A convenient way of constructing the sequence x^{(i)}, i = 1, . . . , n, is the so-called lattice rule: from the first point x^{(1)} we deduce the others by

    x^{(i)} = { i · x^{(1)} },

where the braces mean that we take the fractional part coordinate by coordinate. In such a case the error, given by

    E(f, n, x^{(1)}) = I − I_n,

is a function, in particular, of the starting point x^{(1)}.

The fast component-by-component algorithm of [21] finds, component by component and as a function of the prime n, the coordinates of x^{(1)} that minimize the maximal error when f varies in the unit ball E of some RKHS, namely a tensor product of Korobov spaces. In addition it gives an expression of this minimax error, namely

    max_{f∈E} E(f, n, x^{(1)}).

In practice, very few properties of the function f are known, so the result above is not directly applicable. Nevertheless, for many functions f it happens that the convergence of I_n to I is "fast": typically of order 1/n, while the Monte-Carlo method (choosing the x^{(i)} at random) converges at rate 1/√n.

A reliable estimation of the error is obtained by adding a Monte-Carlo layer as in [16], for instance. This can be done as follows. Let U be a uniform random variable on [0, 1]^d; we define

    x^{(i)}_U := { i · x^{(1)} + U },   I_{n,U} := (1/n) Σ_{i=1}^{n} f(x^{(i)}_U).

Classical computations show that I_{n,U} is an unbiased estimator of I. In a final step, we perform N (in practice 15 to 20) independent repetitions of the experiment above and we compute the usual asymptotic confidence intervals for independent observations.
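The whole procedure (rank-1 lattice points plus a random-shift Monte-Carlo layer) fits in a few lines. The sketch below is ours and uses a generating vector chosen arbitrarily for illustration, not one produced by the fast component-by-component construction of [21].

```python
import numpy as np

def shifted_lattice_estimate(f, gen, n, n_shifts=16, rng=None):
    """Randomly shifted rank-1 lattice rule: x^(i) = {i * gen / n}, shifted by U ~ U([0,1]^d).
    Returns the mean over the shifts and a rough 95% half-width. Illustrative sketch."""
    rng = np.random.default_rng(rng)
    gen = np.asarray(gen, dtype=float)
    base = np.outer(np.arange(n), gen / n) % 1.0        # the n lattice points {i . x^(1)}
    estimates = []
    for _ in range(n_shifts):
        U = rng.random(gen.size)
        estimates.append(np.mean(f((base + U) % 1.0)))  # one unbiased estimate I_{n,U}
    estimates = np.asarray(estimates)
    return estimates.mean(), 1.96 * estimates.std(ddof=1) / np.sqrt(n_shifts)

# toy usage: integrate prod_k (1 + (x_k - 1/2)) over [0,1]^3 (exact value 1)
est, half_width = shifted_lattice_estimate(
    lambda x: np.prod(1.0 + (x - 0.5), axis=1), gen=[1, 433, 307], n=1021)
print(est, half_width)
```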
