a first order free lunch for sqrt-lasso∗

A First Order Free Lunch for SQRT-Lasso∗

Xingguo Li, Jarvis Haupt, Raman Arora, Han Liu, Mingyi Hong, and Tuo Zhao

Abstract

Many statistical machine learning techniques sacrifice convenient computational structures

to gain estimation robustness and modeling flexibility. In this paper, we study this fundamental

tradeoff through a SQRT-Lasso problem for sparse linear regression and sparse precision matrix

estimation in high dimensions. We explain how novel optimization techniques help address these

computational challenges. Particularly, we propose a pathwise iterative smoothing shrinkage

thresholding algorithm for solving the SQRT-Lasso optimization problem. We further provide a

novel model-based perspective for analyzing the smoothing optimization framework, which allows

us to establish a nearly linear convergence (R-linear convergence) guarantee for our proposed

algorithm. This implies that solving the SQRT-Lasso optimization is almost as easy as solving

the Lasso optimization. Moreover, we show that our proposed algorithm can also be applied

to sparse precision matrix estimation, and enjoys good computational properties. Numerical

experiments are provided to support our theory.

1 Introduction

Given a design matrix X ∈ Rn×d and a response vector y ∈ Rn, we consider a linear model

y = Xθ∗ + ε, where θ∗ ∈ Rd is an unknown coefficient vector, and ε ∈ Rn is a random noise vector

with i.i.d. sub-Gaussian entries, E[εi] = 0 and E[ε2i ] = σ2 for all i = 1, . . . , n. We are interested in

estimating θ∗ in high dimensions where n/d→ 0. A popular assumption in high dimensions is that

only a small subset of variables are relevant in modeling, i.e., many entries of θ∗ are zero. To get

such a sparse estimator, Tibshirani (1996) proposed Lasso, which solves

θ = argminθ

1

n‖y −Xθ‖22 + λLasso‖θ‖1, (1.1)

where λLasso is the regularization parameter and encourages the solution sparsity. The statistical

properties of Lasso have been established in Zhang and Huang (2008); Zhang (2009); Bickel et al.

∗Xingguo Li and Jarvis Haupt are affiliated with Department of Electrical and Computer Engineering at University

of Minnesota, Minneapolis, MN, 55455, USA; Raman Arora and Tuo Zhao is affiliated with Department of Computer Sci-

ence at Johns Hopkins University Baltimore, MD, 21210, USA; Han Liu is affiliated with Department of Operations Re-

search and Financial Engineering at Princeton University, Princeton, NJ 08544, USA; Mingyi Hong is affiliated with De-

partment of Industrial and Manufacturing Systems Engineering at Iowa State University. Emails: [email protected],

[email protected], [email protected], [email protected], [email protected], [email protected]

1

arX

iv:1

605.

0795

0v1

[cs

.LG

] 2

5 M

ay 2

016

(2009); Negahban et al. (2012). In particular, given λLasso σ√

log d/n, the Lasso estimator in (1.1)

attains the minimax optimal rates of convergence in parameter estimation1,

‖θ − θ∗‖2 = OP(σ√s∗ log d/n

), (1.2)

where s∗ denotes the number of nonzero entires in θ∗ (Ye and Zhang, 2010; Raskutti et al., 2011).

Despite these favorable properties, the Lasso approach has a significant drawback: The selected

regularization parameter parameter λLasso linearly scales with the unknown quantity σ. Therefore,

we need to carefully tune λLasso over a wide range of potential values in order to get a good

finite-sample performance. To overcome this drawback, Belloni et al. (2011) proposed SQRT-Lasso,

which solves

θ = argminθ∈Rd

1√n‖y −Xθ‖2 + λSQRT‖θ‖1. (1.3)

They further show that SQRT-Lasso require no prior knowledge of σ. Choosing λSQRT √

log d/n,

the SQRT-Lasso estimator in (1.3) attains the same optimal statistical rate of convergence in

parameter estimation as (1.2). This means that the regularization selection for SQRT-Lasso does

not scale with σ. We can easily specify a smaller range of potential values for tuning λSQRT than

Lasso.

Besides estimating θ∗, SQRT-Lasso can also estimate σ, which further makes it applicable to

sparse precision matrix estimation; this is not the case with Lasso. Specifically, given n observations

i.i.d. sampled from a d-variate normal distribution with mean 0 and a sparse precision matrix Θ∗,

Liu and Wang (2012) proposed an estimator based on SQRT-Lasso (See more details in §4), and

showed that it attains the minimax optimal statistical rate of convergence in parameter estimation

‖Θ−Θ∗‖2 = OP(‖Θ∗‖2 · s∗

√log d/n

),

where ‖Θ∗‖2 denotes the spectral norm of Θ∗ (i.e., the largest singular value of Θ∗), and s∗ denotes

the maximum number of nonzero entries in each column of Θ∗ (i.e. max` 1(Θj` 6= 0) ≤ s∗).Though SQRT-Lasso simplifies us tuning efforts and achieves the optimal statistical properties

for both sparse linear regression and sparse precision matrix estimation in high dimensions, the

optimization problem in (1.3) is computationally more challenging than (1.1) for Lasso, because the

`2 loss in SQRT-Lasso does not have the same nice computational structures as the least square loss

in Lasso. For example, the `2 loss can be nondifferentiable, and does not have a Lipschitz continuous

gradient. Belloni et al. (2011) converted (1.3) to a second order cone optimization problem, and

further solved it by an interior point method; Li et al. (2015) then solved (1.3) by an ADMM

algorithm. Neither of them, however, can scale to large problems. In contrast, Xiao and Zhang

(2013) proposed an efficient pathwise iterative shrinkage thresholding algorithm (PISTA) for solving

(1.1), which attains a linear convergence to the unique sparse global optimum with high probability.

To address this computational challenge, we propose a pathwise iterative smoothing shrinkage

thresholding algorithm (PIS2TA) to solve (1.3). Specifically, we first apply the conjugate dual

1The notation OP (·) is defined in Line 84 on Page 3

2

smoothing approach to the nonsmooth `2 loss (Nesterov, 2005; Beck and Teboulle, 2012), and obtain

a smooth surrogate denoted by ‖y−Xθ‖µ, where µ > 0 is a smoothing parameter (See more details

in §2). We then apply PISTA to solve the partially smoothed optimization problem as follows:

θ = argminθ∈Rd

1√n‖y −Xθ‖µ + λSQRT‖θ‖1. (1.4)

Existing computational theory guarantees that our proposed PIS2TA algorithm attains a sublinear

convergence to the global optimum in term of the objective value (Nesterov, 2005). However, our

numerical experiments show that PIS2TA achieves far better empirical computational performance

(better than sublinear convergence) for solving SQRT-Lasso, and is significantly more efficient than

other competing algorithms, and nearly as efficient as Xiao and Zhang (2013) for solving Lasso.

This is because the existing computational analyses of the conjugate dual smoothing approach

do not take certain specific modeling structures into consideration. For example: (I) The `2 loss

is only nonsmooth when all residuals are equal to zero (significantly overfitted). But this is very

unlikely to happen because we are solving (1.3) with a sufficiently large regularization; (II) Although

the smoothed `2 loss is not strongly convex, if we restrict the solution to a sparse domain, the

smoothed `2 loss can behave like strongly convex functions over a neighborhood of θ∗.

Motivated by these observations, we establish a new computational theory for PIS2TA, which

exploits the above modeling structures. Particularly, we show that PIS2TA achieves a nearly linear

convergence (R-linear convergence) to the unique sparse global optimum for solving (1.3) with

high probability, and also gives us a well fitted model. There are two implications: (I) We can

solve the SQRT-Lasso optimization as nearly efficiently as solving the Lasso optimization; (II) We

pay almost no price in optimization accuracy when using the smoothing approach for solving the

SQRT-Lasso optimization, because (1.4) and (1.3) share the same unique sparse global optimum

with high probability.

As an extension of our theory for the SQRT-Lasso optimization, we further analyze the com-

putational properties of our proposed PIS2TA algorithm for sparse precision matrix estimation in

high dimensions. We show that PIS2TA also achieves an R-linear convergence to the unique sparse

global optimum with high probability. We provide numerical experiments on simulated and real

data to support our theory. All proofs of our analysis are deferred to the supplementary material.

Notations: Given a vector v = (v1, . . . , vd)> ∈ Rd, we define vector norms: ‖v‖1 =

∑j |vj |,

‖v‖22 =∑

j v2j , and ‖v‖∞ = maxj |vj |. We denote the number of nonzero entries in v as ‖v‖0 =∑

j 1(vj 6= 0). We denote v\j = (v1, . . . , vj−1, vj+1, . . . , vd)> ∈ Rd−1 as the subvector of v with the

j-th entry removed. Let A ⊆ 1, ..., d be an index set. We use A to denote the complementary

set to A, i.e. A = j | j ∈ 1, ..., d, j /∈ A. We use vA to denote a subvector of v by extracting

all entries of v with indices in A. Given a matrix A ∈ Rd×d, we use A∗j = (A1j , ...,Adj)> to

denote the j-th column of A, and Ak∗ = (Ak1, ...,Akd)> to denote the k-th row of A. Let Λmax(A)

and Λmin(A) be the largest and smallest eigenvalues of A. We define ‖A‖2F =∑

j ‖A∗j‖22 and

‖A‖2 =√

Λmax(A>A). We denote A\i\j as the submatrix of A with the i-th row and the j-th

column removed. We denote Ai\j as the i-th row of A with its j-th entry removed. Let A ⊆ 1, ..., d

3

be an index set. We use AAA to denote a submatrix of A by extracting all entries of A with

both row and column indices in A. We denote A 0 if A is a positive-definite matrix. Given

two real sequences An, an, An = O(an) (or An = Ω(an)) if and only if ∃M ∈ R+ and N ∈ Nsuch that |An| ≤M |an| (or |An| ≥M |an|) for all n ≥ N . An an if An = O(an) and An = Ω(an)

simultaneously. An = OP (an) if ∀δ ∈ (0, 1), ∃M ∈ R+ and Nδ ∈ N such that P[|An| > M |an|] < δ

for all n ≥ Nδ. An = o(an) if ∀δ > 0, ∃Nδ ∈ N such that |An| ≤ δ|an| for all n ≥ Nδ, i.e.,

limn→∞An/an = 0. Given a vector x ∈ Rd and a real value λ > 0, we denote the soft thresholding

operator Sλ(x) = [sign(xj) max|xj | − λ, 0]dj=1.

2 Algorithm

Our proposed algorithm consists of three components: (I) Conjugate Dual Smoothing, (II) Iterative

Shrinkage Thresholding Algorithm (ISTA), and (III) Pathwise Optimization.

(I) The Conjugate Dual Smoothing approach is adopted to obtain a smooth surrogate of `2

loss (Nesterov, 2005; Beck and Teboulle, 2012). We denote the smoothed `2 loss function as

‖y −Xθ‖µ = max‖z‖2≤1

z>(y −Xθ)− µ

2‖z‖22. (2.1)

The optimization problem in (2.1) admits a closed form solution:

Lµ(θ) =1√n‖y −Xθ‖µ =

1

2µ√n‖y −Xθ‖22, if ‖y −Xθ‖2 < µ

1√n‖y −Xθ‖2 − µ

2 , o.w..

We present several two-dimensional examples of the smoothed `2 norm using different µ’s in Figure

1. A larger µ introduces a larger approximation error, but makes the approximation smoother. We

then consider the following partially smoothed optimization problem,

θ = argminθ∈Rd

Fµ,λ(θ), where Fµ,λ(θ) = Lµ(θ) + λ‖θ‖1. (2.2)

(a) µ = 0 (b) µ = 0.1 (c) µ = 0.5 (d) µ = 1

Figure 1: Examples of ‖x‖2 (µ = 0) and ‖x‖µ with µ = 0.1, 0.5, and 1 respectively for x ∈ R2.

(II) The ISTA Algorithm is applied to solve (2.2) (Nesterov, 2013). Particularly, given θ(t) at

t-th iteration, we consider the quadratic approximation of Fµ,λ(θ) at θ = θ(t),

Qµ,λ(θ,θ(t)) = Lµ(θ(t)) +∇Lµ(θ(t))>(θ − θ(t)) +L(t)

2‖θ − θ(t)‖22 + λ‖θ‖1, (2.3)

4

where L(t) is a step size parameter determined by the backtracking line search. We then take

θ(t+1) = argminθQµ,λ(θ,θ(t)) = Sλ/L(t)(θ(t) −∇Lµ(θ(t))/L(t)),

For simplicity, we denote θ(t+1) = TL(t+1),λ(θ(t)). Given a pre-specified precision ε, we terminate

the iterations when the approximate KKT condition holds:

ωλ(θ(t)) = ming∈∂‖θ(t)‖1

‖∇Lµ(θ(t)) + λg‖∞ ≤ ε.

(III) The Pathwise Optimization is essentially a multistage optimization scheme for boosting

computational performance. We solve (2.2) using a geometrically decreasing sequence of regular-

ization parameters λ1 > . . . > λN , where λN = λSQRT. This yields a sequence of output solutions

θ[1], . . . , θ[N ] from sparse to dense.

Particularly, at the K-th optimization stage, we choose θ[K−1] (the output solution of the

(K − 1)-th stage) as the initial solution, i.e., θ(0)[K] = θ[K−1], and solve (2.2) with λ = λK using the

ISTA algorithm. This is also referred as the warm start initialization in existing literature. We

summarize our approach in Algorithm 1.

3 Computational and Statistical Analysis

We first define the locally restricted strong convexity and smoothness.

Definition 3.1. Given a constant r ∈ R+, let Br = θ ∈ Rd : ‖θ − θ∗‖22 ≤ r. For any v,w ∈ Br,which satisfies ‖v−w‖0 ≤ s, Lµ is locally restricted strongly convex (LRSC) and smooth (LRSS) on

Br at sparsity level s if there exist universal constants ρ−s , ρ+s ∈ (0,∞) such that

ρ−s2‖v −w‖22 ≤ Lµ(v)− Lµ(w)−∇Lµ(w)>(v −w) ≤ ρ+

s

2‖v −w‖22, (3.1)

We define the locally restricted condition number at sparsity level s as κs = ρ+s /ρ

−s .

The LRSC and LRSS properties are locally constrained variants of restricted strong convexity

and smoothness (Agarwal et al., 2010; Xiao and Zhang, 2013) with respect to a neighborhood of the

true model parameter θ∗, which are keys to establishing the strong convergence guarantees of our

proposed algorithm in high dimensions.

Next, we introduce two key assumptions for establishing our computational theory.

Assumption 3.2. The sequence of the regularization parameters satisfies λN ≥ 6‖∇Lµ(θ∗)‖∞.

Assumption 3.2 requires that λN is sufficiently large such that the irrelevant variables can be

eliminated (Bickel et al., 2009; Negahban et al., 2012).

Assumption 3.3. Lµ satisfies LRSC and LRSS properties on Br, where r ≥ s∗(

8λN1/ρ−s∗+s

)2for

some N1 < N , N1 ∈ Z+, and λN1 > λN . Specifically, (3.1) holds with ρ+s∗+2s, ρ

−s∗+2s ∈ (0,∞), where

s = C1s∗ > (196κ2

s∗+2s + 144κs∗+2s)s∗, C1 ∈ R+ is a constant and κs∗+2s = ρ+

s∗+2s/ρ−s∗+2s.

Assumption 3.3 guarantees that Lµ satisfies LRSC and LRSS properties as long as the estimation

error satisfies ‖θ − θ∗‖22 ≤ r and the number of irrelevant coordinates of solutions is bounded by s.

5

Algorithm 1: Pathwise Iterative Smoothing Shrinkage Thresholding Algorithm (PIS2TA) for

solving the SQRT-Lasso optimization (1.4). θ[K] denotes the output solution corresponding

to λK ; θ(t)[K] denotes the solution at the t-th iteration of the K-th optimization stage; εK is a

pre-specified precision for the K-th optimization stage. The line search procedure is describe

in Algorithm2.

Input: y, X, N , λN , εN , Lmax > 0

Initialize: θ[0] ← 0, λ0 ← ‖∇Lµ(0)‖∞, η ← (λN/λ0)1/N

For: K = 1, . . . , N

t← 0, λK ← ηλK−1, θ(0)[K] ← θ[K−1], L

(0)[K] ← Lmax

Repeat:

t← t+ 1

L(t)[K] ← min2L(t)

[K], Lmax, where L(t)[K] ← LineSearch

(λK ,θ

(t−1)[K] , L

(t−1)[K]

)

θ(t)[K] ← TL(t)

[K],λK

(θ(t−1)[K] )

Until: ωλK (θ(t)[K]) ≤ εK

θ[K] ← θ(t)[K]

End For

Return: θ[N ]

Algorithm 2: Line search of PIS2TA for SQRT-Lasso.

Input: λK , θ(t−1)[K] , L

(t−1)[K]

Initialize: L(t)[K] = L

(t−1)[K]

Repeat:

θ(t)[K] = T

L(t)[K]

,λK(θ

(t−1)[K] )

If Fµ,λK (θ(t)[K]) < Qµ,λK (θ

(t)[K],θ

(t−1)[K] )

L(t)[K] = L

(t)[K]/2

End If

Until: Fµ,λK (θ(t)[K]) ≥ Qµ,λK (θ

(t)[K],θ

(t−1)[K] )

Return: L(t)[K]

3.1 Computational Theory

Our analysis consists of two phases depending on the estimation error and sparsity of the solution

θ along the path. Specifically, we denote Bs∗+sr = Br ∩ θ ∈ Rd : ‖θ − θ∗‖0 ≤ s∗ + s. Let

N1 ∈ 1, . . . , N be a cut-off between Phase I and Phase II. We can show that Phase I corresponds

to the first N1 stages of pathwise optimization, in which we cannot guarantee θ /∈ Bs∗+sr . Thus we

only establish a sublinear convergence for Phase I. But Phase I is still computationally efficient,

6

since we can choose reasonably large εK for K = 1, . . . , N1 to facilitate early stopping; Phase II

corresponds to the consequent (N −N1) stages, in which we guarantee θ ∈ Bs∗+sr . Thus LRSC and

LRSS hold, and a linear convergence can be established accordingly.

Theorem 3.4. Suppose Assumptions 3.2 and 3.3 hold. Let θ[K] be the output solution that satisfies

ωλK (θ[K]) ≤ εK of the K-th stage respectively for all K = 1, . . . , N . We denote S∗ = j | θ∗j 6= 0,S∗ = j | θ∗j = 0 and s∗ = |S∗|. Recall that η is the decaying ratio of the geometrically decreasing

regularization sequence. Given µ ≤√nσ4 , λ0 = ‖∇Lµ(0)‖∞, and η = (λN/λ0)1/N ∈ (5/6, 1), there

exists N1 ∈ 1, 2, . . . , N such that the following results hold:

Phase I: Let R = maxK≤N1supθ ‖θ − θ[K]‖2 : Fµ,λK (θ) ≤ Fµ,λK (θ(0)[K]). At the K-th stage,

where K = 1, . . . , N1, we need at most TK = O(‖X‖22RεKµ√n

)iterations to guarantee

ωλK (θ(t)[K]) ≤ εK and Fµ,λK (θ[K])−Fµ,λK (θ[K]) = O

(‖X‖22R2

TKµ√n

),

where θ[K] is a global optimum to (1.4). Moreover, we have θ[N1] ∈ Br and ‖[θ[N1]]S∗‖0 ≤ s∗ + s.

Phase II: Let α = 1 − 18κs∗+2s

. At the K-th stage, where K = N1 + 1, . . . , N , we have

sparse solutions throughout all iterations, i.e., ‖[θ(t)[K]]Sc‖0 ≤ s∗ + s. Moreover, we need at most

TK = O(κs∗+2s log

(κ3s∗+2s

s∗λ2K

ε2K

))iterations to guarantee ωλK (θ

(t)[K]) ≤ εK ,

Fµ,λK (θ[K])−Fµ,λK (θ[K]) = O(αTKεKλKs

∗) , and ‖θ[K] − θ[K]‖22 = O(αTKεKλKs

∗) ,

where θ[K] is the unique sparse global optimum to (1.4) with λK satisfying ‖[θ[K]]S∗‖0 ≤ s∗ + s.

Theorem 3.4 guarantees that PIS2TA achieves an R-linear convergence to the unique sparse

global optimum to (1.4) in terms of both objective value and solution parameter, which is as nearly

efficient as Xiao and Zhang (2013) for solving Lasso with much less turning effort (since λN is

independent of σ). This further explains why PIS2TA is much more efficient than other competing

algorithm for solving SQRT-Lasso such as ADMM and SOCP + interior point method.

A geometric interpretation of Theorem 3.4 is provided in Figure 2. The first N − 1 stages serve

as intermediate processes to facilitate fast convergence to θ[N ], which do not require high precision

solutions. Thus we choose εK = λK/4 εN for K = 1, . . . , N − 1 such that Phase I is efficient, as

shown in Figure 3, and only high precision for the last stage, e.g. εN = 10−5 (because only the last

regularization parameter is of our interest). The total number of iterations for computing the entire

solution path is at most

O(N1‖X‖22RλN1µ

√n

+ κs∗+2s(N −N1) log(κs∗+2ss∗λN/εN )

).

Moreover, given properly chosen µ and λN , we guarantee that none of the linear convergence region,

true model parameter θ∗ and output solution θ[N ] fall into the smooth region. This implies that

(1.3) and (1.4) share the same global optimum in Phase II. A formal claim is presented in §3.2.

7

Initial SolutionRegion of

Sublinear Conv.

Region ofLinear Conv.

Neighborhood of :

Phase I:

Phase II:

8>>><>>>:

8>>>><>>>>:

...

...

Output Solution for

Output Solution forOutput Solution for

Output Solution for

Bs+sr

SmoothedRegion

(Overfitted Model)

b[N1+1] b[N ]

b[N1]

b[0] = 0

b[2]

b[1]

Output Solution for 1

2

N1+1

N1

N

Figure 2: A geometric interpretation of two phases of convergence. Phase I (yellow region): Sublinear

convergence, and Phase II (orange region): Linear convergence. The linear convergence region, the

true parameter θ∗ and output solution θ[N ] do not fall into the smoothed region.

2 4 6 8 10 12 14 16 1810-14

10-12

10-10

10-8

10-6

10-4

F µ,

K(

(t)

[K])F µ

,K(

[K])

t : # of Iterations in Each Stage

1, . . . ,N1

N

Figure 3: Plots of the objective gaps for all iterations t of each path following stage (only a few

stages are demonstrated for clarity).

3.2 Statistical Theory

To analyze the statistical properties of our estimator obtained via PIS2TA, we assume that the

design matrix X satisfies the restricted eigenvalue condition as follows.

Assumption 3.5. The design matrix X satisfies the Restricted Eigenvalue (RE) condition, i.e.,

there exist constants ψmin, ψmax, ϕmin, ϕmax ∈ (0,∞), which do not scale with (s∗, n, d), such that

ψmin‖v‖22 − ϕminlog d

n‖v‖21 ≤

‖Xv‖22n

≤ ψmax‖v‖22 + ϕmaxlog d

n‖v‖21, (3.2)

A wide family of examples satisfy the RE condition, such as the correlated sub-Gaussian random

design (Rudelson and Zhou, 2013), which has been extensively studied for sparse recovery (Candes

and Tao, 2005; Bickel et al., 2009; Raskutti et al., 2010).

We next verify Assumption 3.2 and 3.3 based on the RE condition in the following lemma.

8

Lemma 3.6. Suppose Assumption 3.5 holds. Given µ ≤√nσ4 and λN = 24

√log dn , λN ≥

6‖∇Lµ(θ∗)‖∞ with high probability. Moreover, for large enough n, Lµ(θ) satisfies LRSC and

LRSS properties on Br with high probability, where r = σ2

8ψmax. Specifically, (3.1) holds with

ρ+s∗+2s ≤

8ψmax

σ and ρ−s∗+2s ≥ψmin8σ , where s = C2s

∗ > (196κ2s∗+2s + 144κs∗+2s)s

∗, C2 ∈ R+ is a

generic constant and κs∗+2s ≤ 64ψmax/ψmin.

Lemma 3.6 guarantees that Assumption 3.2 holds given properly chosen µ and λN , and Assump-

tion 3.3 holds given the design X satisfying RE condition, both with high probability. Therefore, by

Theorem 3.4, PIS2TA achieves an R-linear convergence to the unique sparse global optimum. In the

next theorem, we characterize the statistical rate of convergence of PIS2TA.

Theorem 3.7. Suppose Assumption 3.5 holds. Let the output solution θ[N ] satisfy ωλN (θ[N ]) ≤ εNfor a small enough εN . Given µ ≤

√nσ4 , λN = 24

√log dn and a large enough n, we have:

‖θ[N ] − θ∗‖2 = OP(σ√s∗ log d/n

)and ‖θ[N ] − θ∗‖1 = OP

(σs∗√

log d/n).

Moreover, let σ =‖y−Xθ[N ]‖2√

nbe the estimation of σ. Then we have |σ − σ| = OP (σs∗ log d/n).

Theorem 3.7 guarantees that the output solution θ[N ] obtained by PIS2TA achieves the minimax

optimal rate of convergence in parameter estimation (Ye and Zhang, 2010; Raskutti et al., 2011).

The next proposition shows that (1.3) and (1.4) share the same global optimum, which corresponds

to a well fitted model. This implies that neither the unique sparse global optimum θ[N ] nor the

linear convergence region falls into the smooth region, as shown in Figure 2.

Proposition 3.8. Under the same assumptions as Theorem 3.7, for all λK ’s, where K = N1 +

1, . . . , N , (1.3) and (1.4) share the same unique sparse global optimum with high probability.

4 Extension to Sparse Precision Matrix Estimation

We consider the TIGER approach proposed in Liu and Wang (2012) for estimating sparse precision

matrix. To be clear, we emphasize that we use v1,v2, . . . (without [·] for subscript) to index vectors.

Let X = [x>1 , . . . ,x>n ]> ∈ Rn×d be n observed data points from a d-variate Gaussian distribution

Nd (0,Σ). Our goal is to estimate the sparse precision matrix Θ = Σ−1. Let Z = XΓ−1/2 =

[z>1 , . . . , z>n ]> ∈ Rn×d be the standardized data matrix, where Γ = diag(Σ11, . . . , Σdd) is a diagonal

matrix and Σ = 1nX>X. Then, we can write zi = Z?,\iθ∗i + Γ

−1/2ii εi for all i = 1, . . . , d, where

θ∗i = Γ−1/2ii Γ

1/2\i,\i(Σ\i,\i)

−1Σ\i,i and εi ∼ Nn(0, σ2i In) with σ2

i = Σii − Σ>\i,i(Σ\i,\i)−1Σ\i,i. We

denote τ2i = σ2

i Γ−1ii , and solve:

θi = argminθi∈Rd−1

Lµ,i(θi) + λ‖θi‖1 and τ i = ‖zi − Z?,\iθi‖µ, (4.1)

9

for all i = 1, . . . , d, where Lµ,i(θi) = 1√n‖zi − Z?,\iθi‖µ. Then the i-th column of precision matrix

Θ is estimated by: Θii = τ−2i Γ−1

ii , and Θ\i,i = −τ−2i Γ

−1/2ii Γ

−1/2\i,\i θi.

Here we solve (4.1) by PIS2TA. We then introduce a few mild technical assumptions:

Assumption 4.1. Suppose the true covariance matrix Σ∗ and precision matrix Θ∗ satisfy: (A1)

Θ∗ ∈M(κΘ, s∗) =

Θ ∈ Rd×d : Θ 0, Λmax(Θ)/Λmin(Θ) ≤ κΘ, maxi

∑j 1(Θij 6= 0) ≤ s∗

; (A2)

(s∗)2 log d = o(n); (A3) lim supn→∞maxi(Σ∗ii)

2 log d/n < 1/4.

We first verify the assumptions required by our computational theory by the following lemma.

Lemma 4.2. Suppose Assumption 4.1 holds. Given µ ≤ 15

√nκΘ

and λN = 6√

5 log dn , we have

λN ≥ 6 maxi∈1,...,d ‖∇Lµ,i(θ∗i )‖∞ with high probability. Moreover, Lµ,i(θi) satisfies LRSC and

LRSS properties on Bri for all i = 1, . . . , d with high probability, where ri =σ2i

12κΘ. Specifically,

for all i = 1, . . . , d, (3.1) holds with ρ+s∗+2s ≤

12κΘσi

and ρ−s∗+2s ≥ 112κΘσi

, where s = C3s∗ >

(196κ2s∗+2s + 144κs∗+2s)s

∗ for a generic constant C3, and κs∗+2s ≤ 144κ2Θ.

Lemma 4.2 guarantees that Assumption 3.2 and 3.3 holds with high probability given properly

chosen µ and λN . Thus, by Theorem 3.4, PIS2TA achieves an R-linear convergence to the unique

sparse global optimum of (4.1) with high probability for all columns of Θ. The next theorem

characterizes the statistical rate of convergence of the obtained precision matrix estimator using

PIS2TA.

Theorem 4.3. Suppose Assumption 4.1 holds. Let Θ[N ] be the output solution for the regularization

parameter λN . Given µ ≤ 15

√nκΘ

and λN = 6√

5 log dn , we have

‖Θ[N ] −Θ∗‖2 = OP(s∗‖Θ∗‖2

√log d/n

).

Theorem 4.3 implies that our obtained precision matrix estimator attains the minimax optimal

rate of convergence in parameter estimation. Moreover, we guarantee that neither the linear

convergence region nor output solution Θ[N ] falls into the smooth region with high probability.

5 Numerical Experiments

We investigate the computational performance of the proposed PIS2TA algorithm through numerical

experiments over both simulated and real data example. All simulations are implemented in C with

double precision using a PC with an Intel 3.3GHz Core i5 CPU and 16GB memory.

For simulated data, we generate a training dataset of 200 samples, where each row of the design

matrix Xi,?, i = 1, . . . , 200, independently from a 2000-dimensional normal distribution N(0,Σ)

where Σjj = 1 and Σjk = 0.5 for all k 6= j. We set s∗ = 3 with θ∗1 = 3, θ∗2 = −2, and θ∗4 = 1.5, and

θ∗j = 0 for all j 6= 1, 2, 4. A validation set of 200 samples for the regularization parameter selection

and a testing set of 10, 000 samples are also generated to evaluate the prediction accuracy.

10

We set σ = 0.5, 1, 2, 4 respectively to illustrate the tuning insensitivity. The regularization

parameter of both Lasso and SQRT-Lasso is chosen over a geometrically decreasing sequence

λK50t=0with λ50 = σ

√log d/n/2 for Lasso and λ50 =

√log d/n/2 for SQRT-Lasso. The optimal

regularization parameter is determined by λopt = λN

as N = argminK∈0,...,50 ‖y− Xθ[K]‖22, where

θ[K] denotes the obtained estimate using the regularization parameter λK , and y and X denote the

response vector and design matrix of the validation set. For both Lasso and SQRT-Lasso, we set

the stopping precision εK = 10−5 for all K = 1, . . . , 30 and εK = 10−5 for all K = 31, . . . , 50. For

SQRT-Lasso, we set the smoothing parameter µ = 10−4.

First of all, we compare PIS2TA with ADMM proposed in Li et al. (2015)2. The backtracking line

search described in Algorithm 2 is adopted to accelerate both algorithms. We conduct 500 simulations

for all σ’s. The results are presented in Table 1. The PIS2TA and ADMM algorithms attain similar

objective values, but PIS2TA is about 20 times faster than ADMM. Both algorithms also achieve

similar estimation errors. Throughout all 500 simulations, we have ‖y −Xθ[N ]‖2

√nµ ≈ 0.0015.

This implies that all obtained optimal estimators are outside the smoothed region of the optimization

problem, i.e., the smoothing approach does not hurt the solution accuracy.

Next, we compare the computational and statistical performance between Lasso (solved by

PISTA Xiao and Zhang (2013)) and SQRT-Lasso (solved by PIS2TA). The results averaged over

500 simulations are summarized in Tables 1. In terms of statistical performance, Lasso and SQRT-

Lasso attains similar estimation and prediction error. In terms of computational performance, the

PIS2TA algorithm for solving SQRT-Lasso is as efficient as PISTA for Lasso, which matches our

computational analysis.

Moreover, we also examine the optimal regularization parameters for Lasso and SQRT-Lasso. We

visualize the distribution of all 500 selected λopt’s using the kernel density estimator. Particularly,

we adopt the Gaussian kernel, and the kernel bandwidth is selected based on the 10-fold cross

validation. Figure 4 illustrates the estimated density functions. The horizontal axis corresponds

to the rescaled regularization parameter λopt/√

log d/n. We see that the optimal regularization

parameters of Lasso significantly vary with different σ. In contrast, the optimal regularization

parameters of SQRT-Lasso are more concentrated. This is consistent with the claimed tuning

insensitivity.

Finally, we compare PIS2TA with ADMM over real data sets for precision matrix estimation.

Particularly, we use four real world biology data sets preprocessed by Li and Toh (2010): Estrogen

(d = 692), Arabidopsis (d = 834), Leukemia (d = 1, 225), Hereditary (d = 1, 869). We set three

different values for λN such that the obtained estimators achieve different levels of sparse recovery.

We set N = 50, and εK = 10−4 for all K’s. The timing performance is summarized in Table 2. As

can be seen, PIS2TA is 5 to 20 times faster than ADMM on all four data sets3.

2We do not have any results for the algorithm proposed in Belloni et al. (2011), because it failed to finish 500

simulations in 12 hours. The implementation was based on SDPT3.3We do not have any results for the algorithm proposed in Belloni et al. (2011), because it failed to finish the

experiments on all four data sets in 12 hours. The implementation was also based on SDPT3.

11

Table 1: Quantitative comparison between Lasso (PISTA) and SQRT-Lasso (PIS2TA and ADMM)

over 500 simulations. The estimation error is defined as ‖θ−θ∗‖2. The prediction error is defined as

‖y− Xθ[N ]‖2/√n. The residual is defined as ‖y−Xθ

[N ]‖2 for PIS2TA only. PIS2TA attains nearly

the same estimation and prediction errors as ADMM, but is significantly faster than ADMM over

all different settings. Besides, we also find that all obtained estimators are outside the smoothed

region throughout all 500 simulations.

Variance Est. Err. Pred. Err. Time (Second) Residual

of Noise Lasso PIS2TA Lasso PIS2TA Lasso PIS2TA ADMM PIS2TA

σ = 0.50.2761 0.2760 0.5403 0.5399 0.8817 0.9526 17.260 5.2505

(0.0651) (0.0537) (0.0172) (0.0143) (0.2824) (0.2646) (3.1723) (0.9591)

σ = 10.5271 0.5319 1.0722 1.0757 0.9146 1.0170 19.762 10.5209

(0.1174) (0.1065) (0.0303) (0.0280) (0.3185) (0.2959) (4.4387) (1.7760)

σ = 21.0962 1.1065 2.1551 2.1492 1.1772 1.1263 24.406 20.886

(0.2252) (0.2141) (0.0595) (0.0589) (0.4582) (0.4128) (4.2786) (3.6506)

σ = 42.1275 2.1356 4.2928 4.2963 1.2913 1.2544 26.101 45.7623

(0.4247) (0.4033) (0.1112) (0.1079) (0.4855) (0.5074) (5.7725) (5.6333)

Table 2: Timing comparison between PIS2TA and ADMM on biology data under different levels of

sparsity recovery. PIS2TA is significantly faster than ADMM over all settings and data sets.

Estrogen Arabidopsis Leukemia Hereditary

PIS2TA ADMM PIS2TA ADMM PIS2TA ADMM PIS2TA ADMM

Sparsity 1% 16.562 175.98 18.404 373.83 30.609 431.45 43.161 498.32

Sparsity 3% 70.622 338.96 81.557 707.52 86.406 812.69 141.65 895.09

Sparsity 10% 188.03 703.24 226.97 1378.1 257.23 1653.1 413.85 1921.6

0 2 4 6 80

0.5

1

1.5 = 0.5

= 1 = 2

= 4

(a) Lasso: Distribution of λopt/√

log d/n

0.5 1 1.5 2 2.5 30

0.5

1

1.5

2 = 0.5

= 1 = 2

= 4

(b) SQRT-Lasso: Distribution of λopt/√

log d/n

Figure 4: Estimated distributions of λopt/√

log d/n over different values of σ for Lasso and PIS2TA.

12

References

Agarwal, A., Negahban, S. and Wainwright, M. J. (2010). Fast global convergence rates of

gradient methods for high-dimensional statistical recovery. In Advances in Neural Information

Processing Systems.

Beck, A. and Teboulle, M. (2012). Smoothing and first order methods: A unified framework.

SIAM Journal on Optimization 22 557–580.

Belloni, A., Chernozhukov, V. and Wang, L. (2011). Square-root lasso: pivotal recovery of

sparse signals via conic programming. Biometrika 98 791–806.

Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and

dantzig selector. The Annals of Statistics 37 1705–1732.

Buhlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: methods, theory

and applications. Springer.

Candes, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Transactions on

Information Theory 51 4203–4215.

Johnstone, I. M. (2001). Chi-square oracle inequalities. Lecture Notes-Monograph Series 399–418.

Li, L. and Toh, K.-C. (2010). An inexact interior point method for l 1-regularized sparse covariance

selection. Mathematical Programming Computation 2 291–315.

Li, X., Zhao, T., Yuan, X. and Liu, H. (2015). The flare package for high dimensional linear

regression and precision matrix estimation in R. The Journal of Machine Learning Research 16

553–557.

Liu, H. and Wang, L. (2012). Tiger: A tuning-insensitive approach for optimally estimating

Gaussian graphical models. Tech. rep., Massachusett Institute of Technology.

Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework

for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science

27 538–557.

Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course, vol. 87.

Springer.

Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming

103 127–152.

Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical

Programming 140 125–161.

13

Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for

correlated Gaussian designs. The Journal of Machine Learning Research 11 2241–2259.

Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for

high-dimensional linear regression over-balls. Information Theory, IEEE Transactions on 57

6976–6994.

Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements.

Information Theory, IEEE Transactions on 59 3434–3447.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal

Statistical Society, Series B 58 267–288.

Wainwright, J. (2015). High-dimensional statistics: A non-asymptotic viewpoint. preparation.

University of California, Berkeley .

Xiao, L. and Zhang, T. (2013). A proximal-gradient homotopy method for the sparse least-squares

problem. SIAM Journal on Optimization 23 1062–1091.

Ye, F. and Zhang, C.-H. (2010). Rate minimaxity of the lasso and dantzig selector for the lq loss

in lr balls. The Journal of Machine Learning Research 11 3519–3540.

Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional

linear regression. The Annals of Statistics 36 1567–1594.

Zhang, T. (2009). Some sharp performance bounds for least squares regression with `1 regularization.

The Annals of Statistics 37 2109–2144.

14

A Intermediate Results of Theorem 3.4 and Theorem 3.7

We introduce some important implications of the proposed assumptions. Recall that S∗ = j :

θ∗j 6= 0 be the index set of non-zero entries of θ∗ with s∗ = |S∗| and S∗ = j : θ∗j = 0 be the

complement set. From Lemma 3.6, Assumption 3.3 implies RSC and RSS with parameter ρ−s∗+2s

and ρ+s∗+2s respectively. By Nesterov (2004), the following conditions are equivalent to RSC and

RSS, i.e., for any v,w ∈ Rd satisfying ‖v −w‖0 ≤ s∗ + 2s,

ρ−s∗+2s‖v −w‖22 ≤ (v −w)>∇Lµ(w) ≤ ρ+s∗+2s‖v −w‖22, (A.1)

1

ρ+s∗+2s

‖∇Lµ(v)−∇Lµ(w)‖22 ≤ (v −w)>∇Lµ(w) ≤ 1

ρ−s∗+2s

‖∇Lµ(v)−∇Lµ(w)‖22. (A.2)

From the convexity of `1 norm, we have

‖v‖1 − ‖w‖1 ≥ (v −w)>g, (A.3)

where g ∈ ∂‖w‖1. Combining and (A.1) and (A.3), we have for any v,w ∈ Rd satisfying ‖v−w‖0 ≤s∗ + 2s,

Fµ,λ(v)−Fµ,λ(w)− (v −w)>∇Fµ,λ(w) ≥ ρ−s∗+2s‖v −w‖22, (A.4)

Remark A.1. For any t and k, the line search satisfies

L(t)[K] ≤ L

(t)[K] ≤ Lmax, Lµ ≤ L(t)

[K] ≤ L(t)[K] ≤ 2Lµ and ρ+

s∗+2s ≤ L(t)[K] ≤ L

(t)[K] ≤ 2ρ+

s∗+2s, (A.5)

where Lµ = minL : ‖∇Lµ(v)−∇Lµ(w)‖2 ≤ L‖x− y‖2, ∀v,w ∈ Rd.We first show that when θ is sparse and the approximate KKT condition is satisfied, then both

estimation error (in `2 norm) and objective error, w.r.t. the true model parameter, are bounded.

This characterizes that the initial value θ(0)[K] of K-th path following stage has desirable statistical

properties if we initialize θ(0)[K] = θ[K−1]. This is formalized in Lemma A.2, and its proof is deferred

to Appendix N.1.

Lemma A.2. Suppose Assumption 3.2 and Assumption 3.3 hold, and λ ≥ λN . If θ satisfies

‖θS∗‖0 ≤ s and the approximate KKT condition

ming∈∂‖θ‖1

‖∇Lµ(θ) + λg‖∞ ≤ λ/2, (A.6)

then we have

‖(θ − θ∗)S∗‖1 ≤ 5‖(θ − θ∗)S∗‖1, (A.7)

‖θ − θ∗‖2 ≤2λ√s∗

ρ−s∗+2s

, (A.8)

‖θ − θ∗‖1 ≤12λs∗

ρ−s∗+2s

, (A.9)

Fµ,λ(θ)−Fµ,λ(θ∗) ≤ 6λ2s∗

ρ−s∗+2s

. (A.10)

15

Next, we show that if θ is sparse and the objective error is bounded, then the estimation error

is also bounded. This characterizes that within the K-th path following stage, good statistical

performance is preserved after each proximal-gradient update. This is formalized in Lemma A.3,

and its proof is deferred to Appendix N.2.


‖θS∗‖0 ≤ s and the objective satisfies


ρ−s∗+2s

then we have

‖θ − θ∗‖2 ≤4λ√

3s∗

ρ−s∗+2s

, (A.11)

‖θ − θ∗‖1 ≤24λs∗

ρ−s∗+2s

. (A.12)

We then show that if θ is sparse and the objective error is bounded, then each proximal-gradient

update preserves solution to be sparse. This demonstrates that within the K-th path following

stage, each update of θ(t)[K] is preserved to be sparse and has good statistical performance. This is

formalized in Lemma A.4, and its proof is deferred to Appendix N.3.


‖θS∗‖0 ≤ s, L satisfies L < 2ρ+s∗+2s, and the objective satisfies


ρ−s∗+2s

then we have

‖ (TL,λ(θ))S∗ ‖0 ≤ s. (A.13)

Moreover, we show that if θ satisfies the approximate KKT condition, then the objective has a

bounded error w.r.t. the regularization parameter λ. This characterizes the geometric decrease of

the objective error when we choose a geometrically decreasing sequence of regularization parameters.

This is formalized in Lemma A.5, and its proof is deferred to Appendix N.4.


ωλ(θ) ≤ λ/2,

For any λ ∈ [λN , λ], let θ = argminθ Fµ,λ(θ). Then we have

Fµ,λ

(θ)−Fµ,λ

(θ) ≤12(λ+ λ)

(ωλ(θ) + λ− λ

)s∗

ρ−s∗+2s

.

16

Furthermore, we show that each path following stage has a local linear convergence rate if the

initial value θ(0) is sparse and satisfies the approximate KKT condition with adequate precision.

Besides, the estimation after each proximal gradient update is also sparse. This is the key result

in demonstrating the overall geometric convergence rate of the algorithm. This is formalized in

Lemma A.6, and its proof is deferred to Appendix N.5.

Lemma A.6. Suppose Assumption 3.3 holds. If the initialization θ(0) for every stage with any λ

in Algorithm 1 satisfies

‖θ(0)‖0 ≤ s.

Then for any t = 1, 2, . . ., we have ‖θ(t)‖0 ≤ s,

Fµ,λ(θ(t))−Fµ,λ(θ) ≤(

1− 1

8κs∗+2s

)t (Fµ,λ(θ(0))−Fµ,λ(θ)

),

where θ = argminθ Fµ,λ(θ).

In addition, we provide the sublinear convergence rate when RSC does not hold based on a

refined analysis of the convergence rate for convex objective (not strongly convex) via proximal

gradient method with line search Nesterov (2013). Specifically, we provide a sub-linear rate of

convergence without the need to classify the distance of the initial objective to the optimal objective.

This characterizes the convergence behavior when ‖X(θ − θ∗)‖2 is large. We formalize this in

Lemma A.7, and provide the proof in Appendix N.6.

Lemma A.7 (Refined result of Theorem 4 in Nesterov (2013)). Given the initialization θ(0), if for

any θ ∈ Rd that satisfies Fµ,λ(θ) ≤ Fµ,λ(θ(0)), denote R as

‖θ − θ‖2 ≤ R.

Then for any t = 1, 2, . . ., we have

Fµ,λ(θ(t))−Fµ,λ(θ) ≤ 4‖X‖22R2

(t+ 2)µ√n, (A.14)

where θ = argminθ Fµ,λ(θ).

Finally, we introduce two results characterizing the proximal gradient mapping operation,

adapted from Nesterov (2013) and Xiao and Zhang (2013) without proof. The first lemma describes

sufficient descent of the objective by proximal gradient method.

Lemma A.8 (Adapted from Theorem 2 in Nesterov (2013)). For any L > 0,

Qµ,λ (TL,λ(θ),θ) ≤ Fµ,λ (θ)− L

2‖TL,λ(θ)− θ‖22.

17

Besides, if Lµ(θ) is convex, we have

Qµ,λ (TL,λ(θ),θ) ≤ minxFµ,λ (x) +

L

2‖x− θ‖22. (A.15)

Further, we have for any L ≥ Lµ,

Fµ,λ (TL,λ(θ)) ≤ Qµ,λ (TL,λ(θ),θ) ≤ Fµ,λ (θ)− L

2‖TL,λ(θ)− θ‖22. (A.16)

The next lemma provides an upper bound of the optimal residue ω(·).

Lemma A.9 (Adapted from Lemma 2 in Xiao and Zhang (2013)). For any L > 0, if Lµ is the

Lipschitz constant of ∇Lµ, then

ωλ (TL,λ(θ)) ≤ (L+ SL(θ)) ‖TL,λ(θ)− θ‖2 ≤ (L+ Lµ) ‖TL,λ(θ)− θ‖2,

where SL(θ) =‖∇Lµ(TL,λ(θ))−∇Lµ(θ)‖2

‖TL,λ(θ)−θ‖2 is a local Lipschitz constant, which satisfies SL(θ) ≤ Lµ.

B Proof of Theorem 3.4

We first demonstrate the sublinear rate for initial stages when the estimation error ‖θ − θ∗‖2 is

large due large regularization parameter λK . The proof is provided in Appendix F

Theorem B.1. For any K = 1, . . . , N and λK > 0, let θ[K] = argminθ Fµ,λK (θ) be the optimal

solution of K-th stage with regularization parameter λK . For any θ ∈ Rd that satisfies Fµ,λK (θ) ≤Fµ,λK (θ

(0)[K]), let ≥ ‖θ − θ[K]‖2. If the initial value θ

(0)[K] satisfies ωλK (θ

(0)[K]) ≤ λK/2, then within

K-th stage, for any t = 1, 2, . . ., we have

Fµ,λ(θ(t)[K])−Fµ,λ(θ[K]) ≤

4‖X‖22R2

(t+ 2)µ√n, (B.1)

To achieve the approximate KKT condition ωλK (θ(t)[K]) ≤ εK , the number of proximal gradient steps

is no more than

18‖X‖32RεKµ3/2n3/4

√Lµ− 3. (B.2)

Note that Lµ ‖X‖22/(µ√n). Then (B.2) can be simplified as O

(‖X‖22RεKµ√n

).

Next, we demonstrate the linear rate when the estimator satisfies θ ∈ Bs∗+sr . The proof is

provided in Appendix G.

Theorem B.2. Suppose Assumption 3.2 and Assumption 3.3 hold, and λK > 0 for any K =

N1, . . . , N . Let θ[K] = argminθ Fµ,λK (θ) be the optimal solution of K-th stage with regularization

18

parameter λK . If the initial value θ(0)[K] satisfies ωλK (θ

(0)[K]) ≤ λK/2 with ‖(θ(0)

[K])S∗‖0 ≤ s, then within

K-th stage, for any t = 1, 2, . . ., we have

‖(θ(t)[K])S∗‖0 ≤ s, ‖θ

(t)[K] − θ[K]‖22 ≤

(1− 1

8κs∗+2s

)t 24λKs∗ωλK (θ

(t)[K])

(ρ−s∗+2s)2

and

Fµ,λK (θ(t)[K])−Fµ,λK (θ[K]) ≤

(1− 1

8κs∗+2s


(t)[K])

ρ−s∗+2s

, (B.3)

(1) For K = N1, . . . , N − 1, to achieve the approximate KKT condition ωλK (θ(t)[K]) ≤ λK/4, the

number of proximal gradient steps is no more than

log(

1536 (1 + κs∗+2s)2 s∗κs∗+2s

)

log (8κs∗+2s/(8κs∗+2s − 1)). (B.4)

(2) For K = N , to achieve the approximate KKT condition ωλN (θ(t)[N ]) ≤ εN , the number of

proximal gradient steps is no more than

log(

96 (1 + κs∗+2s)2 λ2

Ns∗κs∗+2s/ε

2N

)

log (8κs∗+2s/(8κs∗+2s − 1)). (B.5)

From basic inequalities, since κs∗+2s ≥ 1, we have

log

(8κs∗+2s

8κs∗+2s − 1

)≥ log

(1 +

1

8κs∗+2s − 1

)≥ 1

8κs∗+2s.

Then (B.4) and (B.5) can be simplified asO(κs∗+2s

(log(κ3s∗+2ss

∗))) andO(κs∗+2s

(log(κ3s∗+2sλ

2Ns∗/ε2

N

)))

respectively.

As can be seen from Theorem B.2, when the initial value θ(0)[K] satisfies the approximate KKT

condition ωλK (θ(0)[K]) ≤ λK/2 with θ

(0)[K] ∈ Bs

∗+sr , then we can guarantee the geometric convergence

rate of the estimated objective value towards the minimal objective. Next, we show that if the the

optimal solution θ[K−1] from K− 1-th path following stage satisfies the approximate KKT condition

and the regularization parameter λK in the K-th path following stage is chosen properly, then θ[K−1]

satisfies the approximate KKT condition for λK with a slightly larger bound. This characterizes that

good computational properties are preserved by using the warm start θ(0)[K] = θ[K−1] and geometric

sequence of regularization parameters λK . We formalize this notion in Lemma B.3, and its proof is

deferred to Appendix H.

Lemma B.3. Let θ[K−1] be the approximate solution of K − 1-th path following state, which

satisfies the approximate KKT condition ωλK−1(θ[K−1]) ≤ λK−1/4. Then we have

ωλK (θ[K−1]) ≤ λK/2,

where λK = ηλK−1 with η ∈ (5/6, 1).

19

Combining Theorem (B.2) and Lemma (B.3), we can achieve the global convergence in terms

of the objective value using the path following proximal gradient method. We have the bounds of

iterations TK in Phase 2 directly from (B.4) and (B.5) of Theorem B.2.

Finally, we obtain the objective gap of N -th stage via analogous argument. Specifically, in the

N -th (final) path following stage, when the number of iterations for proximal method is large enough

such that ωλN (θ(t)[N ]) ≤ εN holds, then we obtain the result from Lemma A.5 with λ = λ = λN .

Finally, we need to show that there exists some N1 ∈ 1, . . . , N such that θ[N1] ∈ Bs∗+sr . We

demonstrate this result in Lemma B.4 and provide its proof in Appendix I.

Lemma B.4. Suppose Assumption 3.2 and Assumption 3.3 holds, and the approximate KKT

satisfies ωλ(θ) ≤ λ/4. If µ ≤ √nσ/4 andρ−s∗+s8

√rs∗ > λ > λN , then we have

‖θ − θ∗‖22 ≤ r and ‖θS∗‖0 ≤ s.

Lemma B.4 guarantees that there exits some N1 < N such that for some λN1 > λN , the

approximate solution θ[N1] satisfies the approximate KKT condition and θ[N1] ∈ Bs∗+sr , then we

enters the Phase 2 of strong linear convergence by Theorem B.2. Thus we finish the proof.

We further provide a bound of the total number of proximal gradient steps for each λK in the

following lemma for interested readers. The proof is provided in Appendix J.

Lemma B.5. For each λK , K = 1, . . . , N , if we restart the line search with a large enough Lmax, then

the total number of proximal gradient steps is no more than 2(TK+1)+max(log2 Lmax − log2 ρ+s∗+2s), 0.

C Proof of Lemma 3.6

Part 1. We first show that Assumption 3.2 holds. By y = Xθ∗ + ε and (C.4), we have

∇Lµ(θ∗) =X>(Xθ∗ − y)

max√nµ,√n‖y −Xθ∗‖2= − X>ε

max√nµ,√n‖ε‖2. (C.1)

Since ε has i.i.d. sub-Gaussian entries and E[εi] = 0 and E[ε2i ] = σ2 for all i = 1, . . . , n, then we

have from Wainwright (2015) that

P[‖ε‖22 ≤

1

4nσ2

]≤ exp

(− n

32

), (C.2)

By Negahban et al. (2012), we have the following result.

Lemma C.1. Assume X satisfies ‖xj‖2 ≤√n for all j = 1, . . . , d and ε has i.i.d. zero-mean

sub-Gaussian entries with E[w2i ] = σ2 for all i = 1, . . . , n, then we have

P

[1

n‖X>ε‖∞ ≥ 2σ

√log d

n

]≤ 2d−1.

20

Combining (C.1), (C.2) and Lemma C.1, we have with probability at least 1− 2d−1− exp(− n

32

),

‖∇Lµ(θ∗)‖∞ ≤4√

log d/n

max√

2µ/(√nσ), 1

.

Part 2. Next, we show that Assumption 3.3 holds. We divide the proof into two steps.

Step 1. When X satisfies the RE condition, i.e.

ψmin‖v‖22 − ϕminlog d

n‖v‖21 ≤

‖Xv‖22n

≤ ψmax‖v‖22 + ϕmaxlog d

n‖v‖21,

Denote s = s∗ + 2s. Since ‖v‖0 ≤ s, which implies ‖v‖21 ≤ s‖v‖22, then we have

(ψmin − ϕmin

s log d

n

)‖v‖22 ≤

‖Xv‖22n

≤(ψmax + ϕmax

s log d

n

)‖v‖22,

Then there exists a universal constant c1 such that if n ≥ c1s∗ log d, we have

1

2ψmin‖v‖22 ≤

‖Xv‖22n

≤ 2ψmax‖v‖22. (C.3)

Step 2. Conditioning on (C.3), we show that Lµ satisfies LRSC and LRSS with high probability.

The gradient of Lµ(θ) is

∇Lµ(θ) =1√n

((∂‖y −Xθ‖µ∂(y −Xθ)

)>(∂(y −Xθ)

∂θ

)>)>=

X>(Xθ − y)

max√nµ,√n‖y −Xθ‖2. (C.4)

The Hessian of Lµ(θ) is

∇2Lµ(θ) =1

n

∂(−X>z)

∂θ=

X>X√nµ, if ‖y −Xθ‖2 < µ

1√n‖y−Xθ‖2 X>

(I− (y−Xθ)(y−Xθ)>

‖y−Xθ‖22

)X, o.w.

(C.5)

For notational convenience, we define ∆ = v−w for any v,w ∈ B∗s . Also denote the residual of

the first order Taylor expansion as

δLµ(w + ∆,w) = Lµ(w + ∆)− Lµ(w)−∇Lµ(w)>∆.

Using the first order Taylor expansion of Lµ(θ) at w and the Hessian of Lµ(θ) in (C.5), we have

from mean value theorem that there exists some α ∈ [0, 1] such that

δLµ(w + ∆,w) =

∆>X>X∆√nµ

, if ‖ξ‖2 < µ

1√n‖ξ‖2 ∆>X>

(I− ξξ>

‖ξ‖22

)X∆, o.w.

where ξ = y−X(w+α∆). For notational simplicity, let‘s denote z = X(v−θ∗) and z = X(w−θ∗),which can be considered as two fixed vectors in Rn. Without loss of generality, assume ‖z‖2 ≤ ‖z‖2.

Then we have

‖z‖22 ≤ ‖z‖22 ≤ 2ψmaxn‖w − θ∗‖22 ≤nσ2

4.

21

Further, we have

ξ = y −X(w + α∆) = ε−X(w + α∆− θ∗) = ε− αz− (1− α)z, and X∆ = z− z.

We have from Wainwright (2015) that

P[‖ε‖22 ≤ nσ2(1− δ)

]≤ exp

(−nδ

2

16

), (C.6)

Then by taking δ = 1/3 in (C.6), we have with probability 1− exp(− n

144

),

‖ξ‖2 ≥ ‖ε‖2 − α‖z‖2 − (1− α)‖z‖2≥‖ε‖2 − ‖z‖2≥4

5

√nσ − 1

2

√nσ ≥ 1

4

√nσ. (C.7)

We first discuss the RSS property. From (C.7), we have ‖ξ‖2 ≥ µ, then from (C.7) we have

δLµ(w + ∆,w) =1√n‖ξ‖2

∆>X>(

I− ξξ>

‖ξ‖22

)X∆ =

1√n‖ξ‖2

(‖X∆‖22 −

(ξ>X∆)2

‖ξ‖22

)

≤ ‖X∆‖22√n‖ξ‖2

≤ 8ψmax

σ‖∆‖22

Next, we verify the RSC property. From (C.7), we have ‖ξ‖2 ≥ µ. We want to show that with

high probability, for any constant a ∈ (0, 1)∣∣∣∣ξ>

‖ξ‖2X∆

∣∣∣∣ ≤√

1− a‖X∆‖2. (C.8)

Consequently, we have

∆>X>(

I− ξξ>

‖ξ‖22

)X∆ = ‖X∆‖22 −

(ξ>

‖ξ‖2X∆

)2

≥ a‖X∆‖22.

This further implies

δLµ(w + ∆,w) =1√n‖ξ‖2

∆>X>(

I− ξξ>

‖ξ‖22

)X∆ ≥ aψmin

2‖ξ‖2/√n‖∆‖22. (C.9)

Since ‖z‖2 ≤ ‖z‖2, then for any real constant a ∈ (0, 1),

P[∣∣∣∣ξ>

‖ξ‖2X∆

∣∣∣∣ ≤√

1− a‖X∆‖2]

= P[∣∣∣∣

(ε− αz− (1− α)z)>

‖ε− αz− (1− α)z‖2(z− z)

∣∣∣∣ ≤√

1− a‖z− z‖2]

(i)

≥ P[∣∣∣∣

(ε− z)>(z− z)

‖ε− z‖2

∣∣∣∣ ≤√

1− a‖z− z‖2]

= P[(ε>(z− z)− z>(z− z)

)2≤ (1− a)‖ε− z‖22‖z− z‖22

]

(ii)= P

[∣∣∣∣∣

(ε>(z− z)

‖z− z‖2

)2

+ ‖z‖22 − 2ε>z

∣∣∣∣∣ ≤ (1− a)(‖ε‖22 + ‖z‖22 − 2ε>z)

], (C.10)

22

where (ii) is from dividing both sides by ‖v‖22, and (i) is from a geometric inspection and the

randomness of ε, i.e., for any α ∈ [0, 1] and ‖z‖2 ≤ ‖z‖2,

∣∣∣∣−z>

‖ − z‖2(z− z)

∣∣∣∣ ≤∣∣∣∣

(−αz− (1− α)z)>

‖ − αz− (1− α)z‖2(z− z)

∣∣∣∣ .

The random vector ε with i.i.d. entries does not affect the inequality above. Let‘s first discuss one

side of the probability in (C.10), i.e.,

P

[(ε>(z− z)

‖z− z‖2

)2

+ ‖z‖22 − 2ε>z ≤ (1− a)(‖ε‖22 + ‖z‖22 − 2ε>z)

]

= P

[(1− a)‖ε‖22 ≥

(ε>(z− z)

‖z− z‖2

)2

+ a(‖z‖22 − 2ε>z)

]. (C.11)

Since ε has i.i.d. sub-Gaussian entries with E[εi] = 0 and E[ε2i ] = σ2 for all i = 1, . . . , n, thenε>(z−z)‖z−z‖2 and ε>z are also zero-mean sub-Gaussians with variances σ2 and σ2‖z‖22 respectively. We

have from Wainwright (2015) that

P[‖ε‖22 ≤ nσ2(1− δ)

]≤ exp

(−nδ

2

16

), (C.12)

P

[(ε>(z− z)

‖z− z‖2

)2

≥ nσ2δ2

]≤ exp

(−nδ

2

2

), (C.13)

P[ε>z ≤ −nσ2δ

]≤ exp

(−n

2σ2δ2

2‖z‖22

). (C.14)

Combining (C.12) – (C.14) with ‖z‖22 ≤ nσ2/4, we have from union bound that with probability

at least 1− exp(− n

144

)− exp

(− n

128

)− exp

(− n

128

)≥ 1− 3 exp

(− n

144

),

‖ε‖22 ≥2

3nσ2,

(ε>(z− z)

‖z− z‖2

)2

≤ 1

64nσ2, − ε>z ≤ 1

16nσ2.

This implies for a ≤ 3/5, we have

ξ>

‖ξ‖2X∆ ≤

√1− a‖X∆‖2.

For the other side of the probability in (C.10), we have

P

[(ε>(z− z)

‖z− z‖2

)2

+ ‖z‖22 − 2ε>z ≥ −(1− a)(‖ε‖22 + ‖z‖22 − 2ε>z)

]

= P

[(1− a)‖ε‖22 ≥ −

(ε>(z− z)

‖z− z‖2

)2

− (2− a)(‖z‖22 − 2ε>z)

]

≥ P

[(1− a)‖ε‖22 ≥

(ε>(z− z)

‖z− z‖2

)2

+ a(‖z‖22 − 2ε>z)

]. (C.15)

23

Combining (C.10), (C.11) and (C.15), we have (C.8) holds with high probability, i.e., for any

r > 0,

P[∣∣∣∣ξ>

‖ξ‖2X∆

∣∣∣∣ ≤√

1− a‖X∆‖2]≥ 1− 6 exp

(− n

144

).

Now wo bound ‖ξ‖2 to obtain the desired result. From Wainwright (2015), we have

P[‖ε‖22 ≥ nσ2(1 + δ)

]≤ exp

(−nδ

2

18

)= exp

(− n

72

), (C.16)

where we take δ = 1/2. From ξ = ε− αz− (1− α)z, we have

‖ξ‖2 ≤ ‖ε‖2 + α‖z‖2 + (1− α)‖z‖2(i)

≤ ‖ε‖2 + ‖z‖2(ii)

≤√

3n

2σ +

1

2

√nσ. (C.17)

where (i) is from ‖z‖2 ≤ ‖z‖2 and (ii) is from (C.16) and ‖z‖22 ≤ nσ2/4. Then by the union bound

setting a = 1/2, with probability at least 1− 7 exp(− n

144

), we have

δLµ(w + ∆,w) ≥ ψmin

8σ‖∆‖22.

Moreover, we also have r = σ2

8ψmax> s∗ (64σλN1/ψmin)2 ≥ s∗

(8λN1/ρ

−s∗+s

)2for large enough

n ≥ c1s∗ log d, where λN1 ≥ 2λN = 48

√log d/n. The choice of the constant “2” in λN1 ≥ 2λN is

somewhat arbitrary, which can be any fixed constant larger than 1/η such that the existence of λN1

is guaranteed.

D Proof of Theorem 3.7

Part 1. We first show that estimation errors are as claimed. Since θ[K] is the approximate solution

of K-th path following stage, it satisfies ωλK (θ[K]) ≤ λK/4 ≤ λK+1/2 for t ∈ [N1 + 1, T − 1], then

we have from Lemma B.3 that

ωλK+1(θ

(0)[K+1]) ≤ λK+1/2.

By Theorem B.2, we have for any t = 1, 2, . . .,

‖(θ(t)[K+1])S∗‖0 ≤ s.

Applying Lemma A.2 recursively, we have

‖θ[N ] − θ∗‖2 ≤2λN√s∗

ρ−s∗+2s

and ‖θ[N ] − θ∗‖1 ≤12λNs

∗

ρ−s∗+2s

.

Applying Lemma 3.6 with λN = 24√

log d/n and ρ−s∗+2s = ψmin8σ , then by union bound, with

probability at least 1− 8 exp(− n

144

)− 2d−1, we have

‖θ[N ] − θ∗‖2 ≤384σ

√s∗ log d/n

ψmin,

‖θ[N ] − θ∗‖1 ≤2304σs∗

√log d/n

ψmin.

24

Part 2. Next, we demonstrate the result of the estimation of variance. Let θ[N ] = argminθ Fµ,λN (θ)

be the optimal solution of K-th stage. Apply the argument in Part recursively, we have

‖θ[N ] − θ∗‖1 ≤2304σs∗

√log d/n

ψmin. (D.1)

Denote c1, c2, . . . as positive universal constants. Then we have

Lµ(θ[N ])− Lµ(θ∗) ≤ λN (‖θ∗‖1 − ‖θ[N ]‖1) ≤ λN (‖θ∗S∗‖1 − ‖(θ[N ])S∗‖1 − ‖(θ[N ])S∗‖1)

≤ λN‖(θ[N ] − θ∗)S∗‖1 ≤ λN‖θ[N ] − θ∗‖1(ii)

≤ c1σs∗ log d

n, (D.2)

where (i) is from the value of λN and `1 error bound in (D.1).

On the other hand, from the convexity of Lµ(θ), we have

Lµ(θ[N ])− Lµ(θ∗) ≥ (θ[N ] − θ∗)>∇Lµ(θ∗) ≥ −‖∇Lµ(θ∗)‖∞‖θ[N ] − θ‖1(i)

≥ −c2λN‖θ[N ] − θ‖1(ii)

≥ −c3σs∗ log d

n, (D.3)

where (i) is from Assumption 3.2 and (ii) value of λN and `1 error bound in (D.1).

For our choice of µ and n, we have Lµ(θ) = 1√n‖y −Xθ‖2 − µ

2 by Proposition 3.8, then

Lµ(θ[N ])− Lµ(θ∗) =‖y −Xθ[N ]‖2√

n− ‖ε‖2√

n. (D.4)

From Wainwright (2015), we have for any δ > 0,

P[∣∣∣∣‖ε‖22n− σ2

∣∣∣∣ ≥ σ2δ

]≤ 2 exp

(−nδ

2

18

). (D.5)

Combining (D.2), (D.3), (D.4) and (D.5) with δ2 = c3s∗ log dn , we have with high probability,

∣∣∣∣∣‖y −Xθ[N ]‖2√

n− σ

∣∣∣∣∣ = O(σs∗ log d

n

). (D.6)

From Part 1, for n ≥ c4s∗ log d, we have with high probability,

‖θ[N ] − θ∗‖2 ≤384σ

√s∗ log d/n

ψmin≤ σ

2√

2ψmax,

then θ[N ] ∈ Bs∗+sr and ‖θ[N ] − θ[N ]‖0 ≤ s∗ + 2s. Then from the analysis of Theorem B.2, we have

ωλK (θ(t+1)[K] ) ≤ (1 + κs∗+2s)

√4ρ+

s∗+2s

(Fµ,λK (θ

(t)[K])−Fµ,λK (θ[K])

)≤ εN .

This implies

Fµ,λK (θ(t)[K])−Fµ,λK (θ[K]) ≤

ε2N4ρ+

s∗+2s (1 + κs∗+2s)2 . (D.7)

25

On the other hand, from the LRSC property of Lµ, convexity of `1 norm and optimality of θ, we

have

Fµ,λK (θ(t)[K])−Fµ,λK (θ[K]) ≥ ρ−s∗+2s‖θ[N ] − θ[N ]‖22. (D.8)

Combining (D.7), (D.8) and Assumption 3.5, we have

‖X(θ[N ] − θ[N ])‖2√n

≤

√8ρ+

s∗+2s

σ‖θ[N ] − θ∗‖2 ≤

√2

σρ−s∗+2s

εN(1 + κs∗+2s)

≤ 4εN(1 + κs∗+2s)

√ψmin

. (D.9)

Combining (D.6) and (D.9), we have

∣∣∣∣∣‖y −Xθ[N ]‖2√

n

∣∣∣∣∣ ≤∣∣∣∣∣‖y −Xθ[N ]‖2√

n

∣∣∣∣∣+‖X(θ[N ] − θ[N ])‖2√

n

≤∣∣∣∣∣‖y −Xθ[N ]‖2√

n

∣∣∣∣∣+4εN

(1 + κs∗+2s)√ψmin

.

If εN ≤ c5σs∗ log d

n for some constant c5, then we have the desired result.

E Proof of Proposition 3.8

Let θ[K] and θ[K] be the unique global optima of (1.3) and (1.4) respectively for all K = N1+1, . . . , N .

Then, we show θ[K] = θ[K] under the proposed conditions. Apply the argument of the proof of

Theorem 3.7 recursively, we have with probability at least 1− 8 exp(− n

144

)− 2d−1,

‖θ[K] − θ∗‖2 ≤384σ

√s∗ log d/n

ψmin.

By SE condition of X in Assumption 3.3, this implies

‖X(θ[K] − θ∗)‖2 ≤384σ

√2ψmaxs∗ log d

ψmin. (E.1)

On the other hand, we have

‖y −Xθ[K]‖2 = ‖X(θ[K] − θ∗) + ε‖2 ≥ ‖ε‖2 − ‖X(θ[K] − θ∗)‖2. (E.2)

Since ε has i.i.d. sub-Gaussian entries with E[εi] = 0 and E[ε2i ] = σ2 for all i = 1, . . . , n, we have

from Wainwright (2015) that

P[‖ε‖22 ≤

2

3nσ2

]≤ exp

(− n

144

), (E.3)

26

Combining (E.1) and (E.2), (E.3) and the condition on n ≥ c4s∗ log d for some constant c4, we have

with probability at least 1− 9 exp(− n

144

)− 2d−1,

‖y −Xθ[K]‖2 ≥√nσ

(√2

3− 384σ

√2ψmaxs∗ log d/n

ψmin

)>

√nσ

4≥ µ.

This implies Fµ,λ(θ) = Fµ(θ) + µ2 , thus argminθ Fµ,λ(θ) = argminθ Fµ(θ), i.e., θT = θ[K]. Besides

this also implies ‖y −Xθ∗‖2 >√nσ4 ≥ µ, i.e., θ∗ is not in the smoothed region.

Applying the same argument again to θ[K], we have that for large enough n, with high probability,

‖y −Xθ[K]‖2 >√nσ

4≥ µ,

and r = σ2

8ψmax> s∗ (64σλN1/ψmin)2 ≥ s∗

(8λN1/ρ

−s∗+s

)2is guaranteed, where λN1 ≥ 2λN =

48√

log d/n. By Lemma B.4, this implies the existence of the linear convergence region, which does

not fall into the smoothed region. Besides, θ[K] is not in the smoothed region.

F Proof of Theorem B.1

The sub-linear rate of convergence (B.1) follows directly from Lemma A.7. In terms of the optimal

residual, we have

ω2λK

(θ(t+1)[K] )

(i)

≤(L

(t)[K] + S

L(t)[K]

(θ(t)[K])

)2

‖θ(t+1)[K] − θ

(t)[K]‖

22

(ii)

≤(L

(t)[K] + ‖X‖22/(

√nµ)

)2‖θ(t+1)

[K] − θ(t)[K]‖

22

(iii)

≤2(L

(t)[K] + ‖X‖22/(

√nµ)

)2

(k −m+ 1)·

(∑ki=mFµ,λK (θ

(m)[K] )−Fµ,λK (θ

(m+1)[K] )

)

mini∈[m,...,k]L(i)[K]

(iv)

≤ 18‖X‖42(k −m+ 1)nµ2

·Fµ,λK (θ

(m)[K] )−Fµ,λK (θ

(t+1)[K] )

Lµ

≤ 18‖X‖42(k −m+ 1)nµ2

·Fµ,λK (θ

(m)[K] )−Fµ,λK (θ[K])

Lµ(v)

≤ 72R2‖X‖62Lµ(k −m+ 1)(m+ 2)n3/2µ3

(vi)

≤ 288R2‖X‖62Lµ(k + 3)2n3/2µ3

, (F.1)

where (i) is from Lemma A.9, (ii) is from SL

(t)[K]

(θ(t)[K]) ≤ Lµ ≤ ‖X‖22/(

√nµ) in Lemma A.9 and

Lemma A.7, (iii) is from (A.16) in Lemma A.8, (iv) is from Lµ ≤ L(t)[K] ≤ 2Lµ ≤ 2‖X‖22/(

√nµ) in

Remark A.1 and Lemma 3.6, (v) is from Lemma A.7 and (vi) is obtained by choosing m = bk/2c.To achieve the approximate KKT condition ωλK (θ

(t)[K]) ≤ εK , we require the R.H.S. of (F.1) to be

no greater than ε2K , then we have the desired result (B.2).

27

G Proof of Theorem B.2

Note that the RSS property implies that line search terminate when L(t)[K] satisfies

ρ+s∗+2s ≤ L

(t)[K] ≤ 2ρ+

s∗+2s. (G.1)

Since the initialization θ(0)[K] satisfies ωλK (θ

(0)[K]) ≤ λK/2 with ‖(θ(0)

[K])S∗‖0 ≤ s, then by Lemma A.2,

the objective satisfies

Fµ,λK (θ(0)[K])−Fµ,λ(θ∗) ≤ 6λ2

Ks∗

ρ−s∗+2s

.

Then by Lemma A.4, we have

‖(θ(1)[K])S∗‖0 ≤ s.

By monotone decrease of Fµ,λK (θ(t)[K]) from (A.16) in Lemma A.8 and recursively applying

Lemma A.4, ‖(θ(t)[K])S∗‖0 ≤ s holds in (B.3) for any t = 1, 2, . . ..

For the objective error, we have

Fµ,λK (θ(t)[K])−Fµ,λK (θ[K])

(i)

≤(

1− 1

8κs∗+2s

)t (Fµ,λK (θ

(0)[K])−Fµ,λK (θ[K])

)

(ii)

≤(

1− 1

8κs∗+2s


(t)[K]

ρ−s∗+2s

, (G.2)

where (i) is from Lemma A.6, and (ii) is from Lemma A.5 with λ = λ = λK and ωλK (θ(t+1)[K] ) ≤

λK/2 ≤ λK , which results in (B.3).

Combining (G.2) and (A.4), we have

‖θ(t)[K] − θ[K]‖22 ≤

1

ρ−s∗+2s

(Fµ,λK (θ

(t)[K])−Fµ,λK (θ[K])−∇Fµ,λK (θ[K])

)

=1

ρ−s∗+2s

(Fµ,λK (θ


)≤(

1− 1

8κs∗+2s


(t)[K])

(ρ−s∗+2s)2

28

For the optimal residue ωλK (θ(t+1)[K] ) of (t+ 1)-th iteration of K-th path following stage, we have

ωλK (θ(t+1)[K] )

(i)

≤(L

(t)[K] + S

L(t)[K]

(θ(t)[K])

)‖θ(t+1)

[K] − θ(t)[K]‖2

(ii)

≤(L

(t)[K] + ρ+

s∗+2s

)‖θ(t+1)

[K] − θ(t)[K]‖2

(iii)

≤ L(t)[K]

(1 +

ρ+s∗+2s

ρ−s∗+2s

)‖θ(t+1)

[K] − θ(t)[K]‖2

(iv)

≤ L(t)[K]

(1 +

ρ+s∗+2s

ρ−s∗+2s

)√√√√√

2(Fµ,λK (θ

(t)[K])−Fµ,λK (θ

(t+1)[K] )

)

L(t)[K]

(v)

≤ (1 + κs∗+2s)

√4ρ+

s∗+2s

(Fµ,λK (θ


)

(vi)

≤ (1 + κs∗+2s)

√96λ2

Ks∗κs∗+2s

(1− 1

8κs∗+2s

)t, (G.3)

where (i) is from Lemma A.9, (ii) is from SL

(t)[K]

(θ(t)[K]) ≤ ρ

+s∗+2s, (iii) is from ρ−s∗+2s ≤ L

(t)[K] in (G.1),

(iv) is from (A.16) in Lemma A.8, (v) is from L(t)[K] ≤ 2ρ+

s∗+2s in (G.1) and monotone decrease of

Fµ,λK (θ(t)[K]) from (A.16) in Lemma A.8, and (vi) is from (G.2) and κs∗+2s =

ρ+s∗+2s

ρ−s∗+2s

.

For K-th path following stage, K = 1, . . . , N − 1, to have ωλK (θ(t+1)[K] ) ≤ λK/4, we set the R.H.S.

of (G.3) to be no greater than λK/4, which is equivalent to require the number of iterations k to be

an upper bound of (B.4). For the last N -th path following stage, we need ωλN (θ[N ]) ≤ εN ≤ λN/4.

Set the R.H.S. of (G.3) to be no greater than εN , which is equivalent to require the number of

iterations k to be an upper bound of (B.5).

H Proof of Lemma B.3

Since ωλK−1(θ[K−1]) ≤ λK−1/4, there exists some subgradient g ∈ ∂‖θ[K−1]‖1 such that

‖∇Lµ(θ[K−1]) + λK−1g‖∞ ≤ λK−1/4. (H.1)

By the definition of ωλK (·), we have

ωλK (θ[K−1]) ≤ ‖∇Lµ(θ[K−1]) + λKg‖∞ = ‖∇Lµ(θ[K−1]) + λK−1g + (λK − λK−1)g‖∞

≤ ‖∇Lµ(θ[K−1]) + λK−1g‖∞ + |λK − λK−1| · ‖g‖∞(i)

≤ λK−1/4 + (1− η)λK−1

(ii)

≤ λK/2,

where (i) is from (H.1) and choice of λK , (ii) is from the condition on η.

29

I Proof of Lemma B.4

Part 1. We first show ‖θ − θ∗‖22 ≤ r by contradiction. Suppose ‖θ − θ∗‖2 >√r. Let α ∈ [0, 1]

such that θ = (1− α)θ + αθ∗ and

‖θ − θ∗‖2 =√r. (I.1)

Let g = argming∈∂‖θ‖1 ‖∇Lµ(θ) + λg‖∞ and ∆ = θ − θ∗, then we have

Fµ,λ(θ∗)(i)

≥ Fµ,λ(θ)− (∇Lµ(θ) + λg)>∆ ≥ Fµ,λ(θ)− ‖∇Lµ(θ) + λg‖∞‖∆‖1(ii)

≥ Fµ,λ(θ)− λ

4‖∆‖1, (I.2)

where (i) is from the convexity of Fµ,λ(θ) and (ii) is from the approximate KKT condition.

Denote ∆ = θ − θ∗. Combining (I.2) and (I.1), we have

Fµ,λ(θ)(i)

≤ (1− α)Fµ,λ(θ) + αFµ,λ(θ∗) ≤ (1− α)Fµ,λ(θ∗) +(1− α)λ

4‖∆‖1 + αFµ,λ(θ∗)

≤ Fµ,λ(θ∗) +λ

4‖(1− α)(θ − θ∗)‖1 = Fµ,λ(θ∗) +

λ

4‖(1− α)θ + αθ∗ − θ∗)‖1

= Fµ,λ(θ∗) +λ

4‖θ − θ∗‖1 = Fµ,λ(θ∗) +

λ

4‖∆‖1.

where (i) is from the convexity of Fµ,λ(θ). This indicates

Lµ(θ)− Lµ(θ∗) ≤ λ(‖θ∗‖1 − ‖θ‖1 +1

4‖∆‖1)

= λ(‖θ∗S∗‖1 − ‖θS∗‖1 − ‖θS∗‖1 +1

4‖∆S∗‖1 +

1

4‖∆S∗‖1)

≤ λ(‖θ∗S∗ − θS∗‖1 − ‖θS∗ − θ∗S∗‖1 +1

4‖∆S∗‖1 +

1

4‖∆S∗‖1)

=5λ

4‖∆S∗‖1 −

3λ

4‖∆S∗‖1. (I.3)


Lµ(θ)− Lµ(θ∗)(i)

≥ ∇Lµ(θ∗)∆ ≥ −‖∇Lµ(θ∗)‖∞‖∆‖1(ii)

≥ −λ6‖∆‖1

= −λ6‖∆S∗‖1 −

λ

6‖∆S∗‖1, (I.4)

where (i) is from the convexity of Lµ(θ), (ii) is from Assumption 3.2. Combining (I.3) and (I.4), we

have

‖∆S∗‖1 ≤5

2‖∆S∗‖1. (I.5)

30

Next, we consider the following sequence of sets:

S0 =

j ∈ S

∗:∑

m∈S∗1(θm ≥ θj) ≤ s

and

Si =

j ∈ S∗\

⋃

k<i

Sk :∑

m∈S∗\⋃k<i Sk

1(θm ≥ θj) ≤ s

for all i = 1, 2, . . . .

We introduce a result from Buhlmann and van de Geer (2011) with its proof provided therein.

Lemma I.1 (Adapted from Lemma 6.9 in Buhlmann and van de Geer (2011) by setting q = 2).

Let v = [v1, v2, . . .]> with v1 ≥ v2 ≥ . . . ≥ 0. For any s ∈ 1, 2, . . ., we have

∑

j≥s+1

v2j

1/2

≤∞∑

k=1

(k+1)s∑

j=ks+1

v2j

1/2

≤ ‖v‖1√s.

Denote A = S∗ ∪ S0. Then we have

∑

i≥1

‖∆Si‖2(i)

≤ 1√s‖∆S∗‖1

(ii)

≤ 5

2

√s∗

s‖∆S∗‖2 ≤

5

2

√s∗

s‖∆A‖2, (I.6)

where (i) is rom Lemma I.1 with s = s and (ii) is from (I.5). Let θ = (1 − β)θ + βθ∗ for any

β ∈ [0, 1]. Then we have

‖θ − θ∗‖2 = (1− β)‖θ − θ∗‖2 ≤√r,

which implies Lµ(θ) satisfies RSC/RSS for θ restricted on a sparse set by Assumption 3.3. Then we

have

|∆>A∇A,ALµ(θ)∆A| ≤∑

i≥1

|∆>Si∇Si,ALµ(θ)∆A| ≤ ρ+s∗+s‖∆A‖2

∑

i≥1

‖∆Si‖2

(i)

≤ 5

2

√s∗

sρ+s∗+s‖∆A‖22, (I.7)

where (i) is from (I.6). On the other hand, we have from RSC

∆>A∇A,ALµ(θ)∆A ≥ ρ−s∗+s‖∆A‖22. (I.8)

Then we have w.h.p.

∆∇Lµ(θ)∆ = ∆>A∇A,ALµ(θ)∆A + 2∆>A∇A,ALµ(θ)∆A + ∆>A∇A,ALµ(θ)∆A

≥ ∆>A∇A,ALµ(θ)∆A − 2|∆>A∇A,ALµ(θ)∆A|(i)

≥(ρ−s∗+s − 5

√s∗

sρ+s∗+s

)‖∆A‖22

(ii)

≥ 9

14ρ−s∗+s‖∆A‖22,

31

where (i) is from (I.7) and (I.8), (ii) is from Assumption 3.3. This implies

Lµ(θ)− Lµ(θ∗) = ∇Lµ(θ∗)>∆ +1

2∆∇Lµ(θ)∆ ≥ ∇Lµ(θ∗)>∆ +

9

28ρ−s∗+s‖∆A‖22

(i)

≥ 9

28ρ−s∗+s‖∆A‖22 −

λ

6‖∆S∗‖1 −

λ

6‖∆S∗‖1, (I.9)

where (i) is from Assumption 3.2. Combining (I.3) and (I.9), we have

ρ−s∗+s‖∆S∗‖22 ≤ ρ−s∗+s‖∆A‖22 ≤8

3λ‖∆S∗‖1 ≤

8

3λ√s∗‖∆S∗‖2 ≤

8

3λ√s∗‖∆A‖2.

This implies

‖∆S∗‖2 ≤ ‖∆A‖2 ≤8λ√s∗

3ρ−s∗+sand ‖∆S∗‖1 ≤

8λs∗

3ρ−s∗+s. (I.10)

Then we have

‖∆A‖2 ≤∑

i≥1

‖∆Si‖2(i)

≤ 1√s‖∆S∗‖1

(ii)

≤ 5

2

√1

s∗‖∆S∗‖1

(iii)

≤ 20λ√s∗

3ρ−s∗+s, (I.11)

where (i) is rom Lemma I.1 with s = s, (ii) is from (I.5) and s ≥ s∗ and (iii) is from (I.10).

Combining (I.10) and (I.11), we have

‖∆‖2 =

√‖∆A‖22 + ‖∆A‖22 ≤

8λ√s∗

ρ−s∗+s<√r.

This conflicts with (I.1), which indicates that ‖θ − θ∗‖2 ≤√r.

Part 2. We next demonstrate the sparsity of θ. From λ > λN ≥ 6‖∇Lµ(θ∗)‖∞, then we have∣∣∣∣i ∈ S∗ : |∇iLµ(θ∗)| ≥ λ

6

∣∣∣∣ = 0. (I.12)

Denote S1 =i ∈ S∗ : |∇iLµ(θ)−∇iLµ(θ∗)| ≥ 2λ

3

and s1 = |S1|. Then there exists some b ∈ Rd

such that ‖b‖∞ = 1, ‖b‖0 ≤ s1 and b>(∇Lµ(θ) − ∇Lµ(θ∗)) ≥ 2λs13 . Then by the mean value

theorem, we have for some θ = (1− α)θ + αθ∗ with α ∈ [0, 1], ∇Lµ(θ)−∇Lµ(θ∗) = ∇2Lµ(θ)∆,

where ∆ = θ − θ∗. Then we have

2λs1

3≤ b>∇2Lµ(θ)∆

(i)

≤√b>∇2Lµ(θ)b

√∆>∇2Lµ(θ)∆

(ii)

≤√s1ρ

+s1

√∆>(∇Lµ(θ)−∇Lµ(θ∗)), (I.13)

where (i) is from the generalized Cauchy-Schwarz inequality, (ii) is from the definition of RSS and

the fact that ‖b‖2 ≤√s1‖b‖∞ =

√s1. Let g achieve ming∈∂‖θ‖1 Fµ,λ(θ). Further, we have

∆>(∇Lµ(θ)−∇Lµ(θ∗)) ≤ ‖∆‖1‖∇Lµ(θ)−∇Lµ(θ∗)‖∞≤ ‖∆‖1(‖∇Lµ(θ∗)‖∞ + ‖∇Lµ(θ)‖∞)

≤ ‖∆‖1(‖∇Lµ(θ∗)‖∞ + ‖∇Lµ(θ) + λg‖∞ + λ‖g‖∞)

(i)

≤ 28λs∗

3ρ−s∗+s(λ

6+λ

4+ λ) ≤ 14λ2s∗

ρ−s∗+s, (I.14)

32

where (i) is from combining (I.5) and (I.10), condition on λ, approximate KKT condition and

‖g‖∞ ≤ 1. Combining (I.13) and (I.14), we have 2√s1

3 ≤√

14ρ+s1s∗

ρ−s∗+s

, which further implies

s1 ≤32ρ+

s1s∗

ρ−s∗+s≤ 32κs∗+2ss

∗ ≤ s. (I.15)

For any v ∈ Rd that satisfies ‖v‖0 ≤ 1, we have

S2 =

i ∈ S∗ :

∣∣∣∣∇iLµ(θ) +λ

4vi

∣∣∣∣ ≥5λ

6

⊆i ∈ S∗ : |∇iLµ(θ∗)| ≥ λ

6

⋃S1.

Then we have |S2| ≤ |S1| ≤ s. Since for any i ∈ S∗ and∣∣∇iLµ(θ) + λ

4vi∣∣ < 5λ

6 , we can find gi that

satisfies |gi| ≤ 1 such that ∇iLµ(θ) + λ4vi + λgi = 0 which implies θi = 0, then we have

∣∣∣∣i ∈ S∗ :

∣∣∣∣∇iLµ(θ) +λ

4vi

∣∣∣∣ <5λ

6

∣∣∣∣ = 0.

Therefore, we have ‖θS∗‖0 ≤ |S2| ≤ s.

J Proof of Lemma B.5

Let n(t) be the number of proximal gradient steps in t-th iteration of PIS2TA. Then we have

L(t+1) ≤ 2L(t)

(1

2

)n(t)−1

.

This indicates

n(t) ≤ 2 + log2

L(t)

L(t+1).

Then we have

t = 0

TKn(t) ≤ 2(TK + 1) + log2

L(0)

L(TK+1)

We obtain the desired result by L(0) = Lmax and L(TK+1) ≥ ρ+s∗+2s.

K Intermediate Results of Theorem 4.3

We start with some preliminaries. For any S ⊂ 1, . . . , d with |S| ≤ s∗, we denote the set of cone

CνS =

x ∈ Rd : ‖xS‖1 ≤ ν‖xS‖1

and Cνs∗ =⋃

S⊂1,...,d,|S|≤s∗CνS .

33

Besides, since Θ∗ = Σ∗−1 ∈M(κ, s∗), we have

Λmax(Θ∗)Λmin(Θ∗)

=Λmax(Σ∗−1)

Λmin(Σ∗−1)=

Λmax(Σ∗)Λmin(Σ∗)

≤ κ

We first introduce some important results on characterizing the data matrix X. These are

adapted from intermediate lemmas in Liu and Wang (2012), which we refer to interested readers for

detailed proofs.

The first lemma provides the bounds of entry-wise difference between sample and population

correlation matrices.

Lemma K.1. Let R and R be the sample and population correlation matrices. Then for event

E1 =

‖R−R‖∞ ≤ 18

√log d

n

,

we have P[E1] ≥ 1− d−1.

The second lemma provides the bounds of normalized model noise ε.

Lemma K.2. Let εi ∈ Rn follows εi ∼ Nn(0, σ2i In). Then for event

E2 =

max

i∈1,...,d‖εi‖22nσ2

i

≤ 1.4 and maxi∈1,...,d

∥∥∥∥‖εi‖22nσ2

i

− 1

∥∥∥∥ ≤ 3.5

√log d

n

,

we have P[E1] ≥ 1− d−1 − d exp(−100/n).

The third lemma provides the bounds of sample standard deviation of the marginal univariate

Gaussian random variables.

Lemma K.3. Let Σ be the sample covariance matrix. Suppose Assumption (A3) holds, then for

event

E3 =

1

2Λmin(Σ) ≤ min

i∈1,...,dΣii ≤ max

i∈1,...,dΣii ≤

3

2Λmax(Σ)

,

we have P[E1] ≥ 1− d−1 − d exp(−100/n).

Next, we verify Assumption 3.2 in the following lemma, and provide its proof in Appendix N.7.

Lemma K.4. Denote Lµ,i(θ∗i ) = ‖zi−Z?,\iθ∗i ‖µ/√n. Let λN =

6√

5 log d/n

max√

5/3µminiΓ1/2ii /(

√nσi),1

, then

for event

E4 =

λN ≥ 6 max

i∈1,...,d‖∇Lµ,i(θ∗i )‖∞

,

we have P[E1] ≥ 1− d exp(− n

25

)− d0.6√

0.6πa log d.

Then we provide the bound of the restricted eigenvalue of the sample correlation matrix.

34

Lemma K.5. Suppose E3 and Assumption 4.1 (A2) hold, then for event

E5 =

inf

θ∈Cνs∗

√s∗θ>Rθ‖R‖1

≥ 1

5(1 + ν)√κ

,

there exists constants c1 and c2 such that P[E5|E3] ≥ 1− c2 exp(−c2n).

Further, we provide the prediction error bound for the approximate solution. It follows directly

from Theorem 3.7.

Lemma K.6. Suppose E5 holds, then for event

E6 =

max

i∈1,...,d

‖Z?,\i(θi − θ∗i )‖τi

≤ c3

√s∗ log d

,

there exists constants c1, c2 and c3 such that P[E6|E5] ≥ 1− c1 exp(−c2n)− c3d−1.

L Proof of Lemma 4.2

Using the result in Liu and Wang (2012) (Lemma 12) and Agarwal et al. (2010) (Proposition 1),

we have with probability at least 1− c1 exp(−c2n) for some universal constants c1 and c2, for any

v ∈ Rd,

1Λmin(Σ)

2Λmax(Σ)‖v‖22 −

9Λmax(Σ)

Λmin(Σ)· log d

n‖v‖21 ≤

‖Zv‖22n

≤ 2Λmax(Σ)

Λmin(Σ)‖v‖22 +

9Λmax(Σ)

Λmin(Σ)· log d

n‖v‖21.

Applying the same analysis for Theorem3.4, we have for some constant c, ‖v‖21 ≤ c‖vS∗‖21 ≤cs∗‖vS∗‖22 ≤ cs∗‖v‖22 and we have

(1Λmin(Σ)

2Λmax(Σ)− 9Λmax(Σ)

Λmin(Σ)· s∗ log d

n

)‖v‖22 ≤

‖Zv‖22n

≤(

2Λmax(Σ)

Λmin(Σ)+

9Λmax(Σ)

Λmin(Σ)· s∗ log d

n

)‖v‖22.

Since n ≥ 54Λ2max(Σ)s∗ log dΛ2

min(Σ), we have

1Λmin(Σ)

3Λmax(Σ)‖v‖22 ≤

‖Zv‖22n

≤ 3Λmax(Σ)

Λmin(Σ)‖v‖22.

From Λmax(Σ) = 1/Λmin(Θ) and Λmin(Σ) = 1/Λmax(Θ), we have

1

3κΘ‖v‖22 ≤

‖Zv‖22n

≤ 3κΘ‖v‖22.

35

To satisfy the condition for the computational theory, we require µ ≤√nVar (Γ

−1/2ii εi)

4 for any

i ∈ 1, . . . , d. From σi = Θ−1/2ii and mini∈1,...,d Σii ≤ 3

2Λmax(Σ) with high probability in

Lemma K.3, we have√

Var (Γ−1/2ii εi) = Γ

−1/2ii σi =

1√ΘiiΣii

≥ 1√32ΘiiΛmax(Σ)

Since Σ = Θ−1, for any i ∈ 1, . . . , d, µ need to satisfy

µ ≤√n

4√

32 maxi ΘiiΛmax(Σ)

≤ 1

5

√n

Λmax(Θ)Λmax(Σ)=

1

5

√nΛmin(Θ)

Λmax(Θ)=

1

5

√n

κΘ.

By Lemma K.4, the condition on λN guarantees that Assumption 3.2 holds for each i = 1, . . . , d.

Besides, when n is large enough, it can be guaranteed that there exists N1 < N , N1 ∈ Z+, such

thatρ−s∗+s8

√rs∗ > λN1 ≥ 2λN ≥ 12‖∇Lµ,i(θi)‖∞.

M Proof of Theorem 4.3

The analysis here follows directly from our analysis in the linear model and the analysis in Liu and

Wang (2012). Let E = ∩6i=1Ei. Combining Lemma K.4 and our choice of µ, we have λN = 6

√5 log dn .

We first show that the estimation error of diagonal elements are bounded.

Lemma M.1 (Adapted from Lemma 14 in Liu and Wang (2012)). Suppose Assumption 4.1 and

the event E hold, the we have

maxi∈1,...,d

|Θii −Θ∗ii| ≤ c4‖Θ∗‖2log d

n.

Besides, we have the `1 norm error bounded for the estimation of off-diagonal elements each

column.

Lemma M.2 (Adapted from Lemma 15 in Liu and Wang (2012)). Suppose Assumption 4.1 and

the event E hold, the we have

maxi∈1,...,d

‖Θ\i,i −Θ∗\i,i‖1 ≤ c5(s∗‖Θ∗‖2 + ‖Θ∗‖1)log d

n.

Combining Lemma M.1 and Lemma M.2, we have

‖Θ−Θ∗‖1 = maxi∈1,...,d

‖Θ?,i −Θ∗?,i‖1 ≤ maxi∈1,...,d

|Θii −Θ∗ii|+ ‖Θ\i,i −Θ∗\i,i‖1

≤ c6(s∗‖Θ∗‖2 + ‖Θ∗‖1)log d

n

(i)

≤ c7(s∗‖Θ∗‖2)log d

n,

where (i) is from ‖Θ∗‖1 ≤ s∗‖Θ∗‖2. Then we finish the proof from

‖Θ−Θ∗‖2 ≤ ‖Θ−Θ∗‖1.

36

N Proofs of Intermediate Lemmas in Appendix A and Appendix K

N.1 Proof of Lemma A.2

We first bound the estimation error. From Assumption 3.3, we have the RSC property, which

indicates

Lµ(θ) ≥ Lµ(θ∗) + (θ − θ∗)>∇Lµ(θ∗) + (ρ−s∗+2s/2)‖θ − θ∗‖22, (N.1)

Lµ(θ∗) ≥ Lµ(θ) + (θ∗ − θ)>∇Lµ(θ) + (ρ−s∗+2s/2)‖θ − θ∗‖22, (N.2)

Adding (N.2) and (N.1), we have

(θ − θ∗)>∇Lµ(θ) ≥ (θ − θ∗)>∇Lµ(θ∗) + ρ−s∗+2s‖θ − θ∗‖22. (N.3)

Let g ∈ ∂‖θ‖1 be the subgradient that achieves the approximate KKT condition of the L.H.S of

(A.6), then we have

(θ − θ∗)> (∇Lµ(θ) + λg) ≤ ‖θ − θ∗‖1 ‖∇Lµ(θ) + λg‖∞ ≤1

2λ‖θ − θ∗‖1. (N.4)

On the other hand, we have from (N.3)

(θ − θ∗)> (∇Lµ(θ) + λg)≥(θ − θ∗)>∇Lµ(θ∗) + ρ−s∗+2s‖θ − θ∗‖22 + λg>(θ − θ∗), (N.5)

Since ‖θ − θ∗‖1 = ‖(θ − θ∗)S∗‖1 + ‖(θ − θ∗)S∗‖1, then

(θ − θ∗)>∇Lµ(θ∗) ≥ −‖(θ − θ∗)S∗‖1‖Lµ(θ∗)‖∞ − ‖(θ − θ∗)S∗‖1‖Lµ(θ∗)‖∞. (N.6)

Besides, we have

(θ − θ∗)>g = g>S∗(θ − θ∗)S∗ + g>S∗(θ − θ∗)S∗

(i)

≥ −‖gS∗‖∞‖(θ − θ∗)S∗‖1 + g>S∗θS∗

(ii)

≥ −‖(θ − θ∗)S∗‖1 + ‖gS∗‖1(iii)= −‖(θ − θ∗)S∗‖1 + ‖(θ − θ∗)S∗‖1, (N.7)

where (i) and (iii) is from θ∗S∗ = 0, (ii) is from ‖gS∗‖∞ ≤ 1 and g ∈ ∂‖θ‖1.

Combining (N.4), (N.5), (N.6) and (N.7), we have

1

2λ‖θ − θ∗‖1 =

1

2λ‖(θ − θ∗)S∗‖1 +

1

2λ‖(θ − θ∗)S∗‖1

≥ ρ−s∗+2s‖θ − θ∗‖22 − (λ+ ‖Lµ(θ∗)‖∞)‖(θ − θ∗)S∗‖1+ (λ− ‖Lµ(θ∗)‖∞)‖(θ − θ∗)S∗‖1.

This implies

ρ−s∗+2s‖θ − θ∗‖22 + (1

2λ− ‖Lµ(θ∗)‖∞)‖(θ − θ∗)S∗‖1

≤ (3

2λ+ ‖Lµ(θ∗)‖∞)‖(θ − θ∗)S∗‖1, (N.8)

37

which results in (A.7) from ρ−s∗+2s > 0 and Assumption 3.2 as

‖(θ − θ∗)S∗‖1 ≤32λ+ ‖Lµ(θ∗)‖∞12λ− ‖Lµ(θ∗)‖∞

‖(θ − θ∗)S∗‖1.

Combining 12λ− ‖Lµ(θ∗)‖∞ ≥ 0, 3

2λ+ ‖Lµ(θ∗)‖∞ ≤ 2λ and (N.8), we have estimation error bound

in (A.8) and (A.9) as

ρ−s∗+2s‖θ − θ∗‖22 ≤ 2λ‖(θ − θ∗)S∗‖1 ≤ 2λ√s∗‖θ − θ∗‖2.

‖θ − θ∗‖1 ≤ 6‖(θ − θ∗)S∗‖1 ≤ 6√s∗‖θ − θ∗‖2.

Next, we bound the objective error in (A.10). We have

Fµ,λ(θ)−Fµ,λ(θ∗)(i)

≤ −(∇Lµ(θ) + λg)>(θ∗ − θ) ≤ ‖∇Lµ(θ) + λg‖∞‖θ∗ − θ‖1

≤ 1

2λ‖θ∗ − θ‖1 =

1

2λ(‖(θ∗ − θ)S∗‖1 + ‖(θ∗ − θ)S∗‖1)

(ii)

≤ 3λ‖(θ∗ − θ)S∗‖1 ≤ 3λ√s∗‖(θ∗ − θ)S∗‖2

(iii)

≤ 6λ2s∗

ρ−s∗+2s

,

where (i) is from the convexity of Fµ,λ(θ) with ∇Lµ(θ) + λg as its subgradient, (ii) is from (A.7),

and (iii) is from (A.8).


Assumption Fµ,λ(θ)−Fµ,λ(θ∗) ≤ 6λ2s∗/ρ−s∗+2s implies

Lµ(θ)− Lµ(θ∗) + λ(‖θ‖1 − ‖θ∗‖1) ≤ 6λ2s∗

ρ−s∗+2s

. (N.9)

We have from the RSC property that

Lµ(θ) ≥ Lµ(θ∗) + (θ − θ∗)>∇Lµ(θ∗) +ρ−s∗+2s

2‖θ − θ∗‖22, (N.10)

Then we have (N.9) and (N.10),

ρ−s∗+2s

2‖θ − θ∗‖22≤

6λ2s∗

ρ−s∗+2s

− (θ − θ∗)>∇Lµ(θ∗) + λ(‖θ∗‖1 − ‖θ‖1). (N.11)

Besides, we have

(θ − θ∗)>∇Lµ(θ∗) ≥ −‖(θ − θ∗)S∗‖1‖Lµ(θ∗)‖∞ − ‖(θ − θ∗)S∗‖1‖Lµ(θ∗)‖∞, (N.12)

and

‖θ∗‖1 − ‖θ‖1 = ‖θ∗S∗‖1 − ‖θS∗‖1 − ‖(θ − θ∗)S∗‖1 ≤ ‖(θ − θ∗)S∗‖1 − ‖(θ − θ∗)S∗‖1. (N.13)

38

Combining (N.11), (N.12) and (N.13), we have

ρ−s∗+2s

2‖θ − θ∗‖22 ≤

6λ2s∗

ρ−s∗+2s

+ (‖∇Lµ(θ∗)‖∞ + λ)‖(θ − θ∗)S∗‖1

+ (‖∇Lµ(θ∗)‖∞ − λ)‖(θ − θ∗)S∗‖1. (N.14)

We discuss two cases as following:

Case 1. We first assume ‖θ − θ∗‖1 ≤ 12λs∗

ρ−s∗+2s

. Then (N.14) implies

ρ−s∗+2s

2‖θ − θ∗‖22

(i)

≤ 6λ2s∗

ρ−s∗+2s

+ (‖∇Lµ(θ∗)‖∞ + λ)‖(θ − θ∗)S∗‖1

(ii)

≤ 6λ2s∗

ρ−s∗+2s

+3

2λ‖(θ − θ∗)S∗‖1

≤ 6λ2s∗

ρ−s∗+2s

+18λ2s∗

ρ−s∗+2s

=24λ2s∗

ρ−s∗+2s

.

where (i) is from ‖∇Lµ(θ∗)‖∞ − λ ≤ 0 and (ii) is from ‖∇Lµ(θ∗)‖∞ + λ ≤ 32λ. This indicates

‖θ − θ∗‖2 ≤4√

3s∗λ

ρ−s∗+2s

. (N.15)

Case 2. Next, we assume ‖θ − θ∗‖1 > 12λs∗

ρ−s∗+2s

. Then (N.14) implies

ρ−s∗+2s

2‖θ − θ∗‖22

≤ (‖∇Lµ(θ∗)‖∞ + λ)‖(θ − θ∗)S∗‖1 + (‖∇Lµ(θ∗)‖∞ − λ)‖(θ − θ∗)S∗‖1 +1

2λ‖θ − θ∗‖1

= (‖∇Lµ(θ∗)‖∞ +3

2λ)‖(θ − θ∗)S∗‖1 + (‖∇Lµ(θ∗)‖∞ −

1

2λ)‖(θ − θ∗)S∗‖1

(i)

≤ 2λ‖(θ − θ∗)S∗‖1 ≤ 2√s∗λ‖(θ − θ∗)S∗‖2, (N.16)

where (i) is from ‖∇Lµ(θ∗)‖∞ + 32λ ≤ 2λ and ‖∇Lµ(θ∗)‖∞ − 1

2λ ≤ 0. This indicates

‖θ − θ∗‖2 ≤4√s∗λ

ρ−s∗+2s

. (N.17)

Besides, we have

‖θ − θ∗‖1(i)

≤ 6‖(θ − θ∗)S∗‖1 ≤ 6√s∗‖(θ − θ∗)S∗‖2 ≤

24λs∗

ρ−s∗+2s

, (N.18)

where (i) is from ‖∇Lµ(θ∗)‖∞ + 32λ ≤ 2λ and (N.16).

Combining (N.15) and (N.17), we have desired result (A.11). Combining the assumption in Case

1 and (N.18), we have desired result (A.12).

39


v> (∇Lµ(θ∗)−∇Lµ(θ)) ≤ ‖v‖2‖∇Lµ(θ∗)−∇Lµ(θ)‖2(i)

≤√|A| · ‖∇Lµ(θ∗)−∇Lµ(θ)‖2

(ii)

≤ ρ+s∗+2s

√|A| · ‖θ − θ∗‖2, (N.27)

where (i) is from ‖v‖2 ≤√|A|maxi : |Ai| ≤

√|A|, and (ii) is from (A.1) and (A.2).

Combining (N.26) and (N.27), we have

λ|A| ≤ 2ρ+s∗+2s

√|A| · ‖θ − θ∗‖2

(i)

≤ 8λκs∗+2s

√3s∗|A|

where (i) is from (A.11) in Lemma A.3 and definition of κs∗+2s = ρ+s∗+2s/ρ

−s∗+2s. Considering

A3 ⊆ A, this implies

|A3| ≤ |A| ≤ 196κ2s∗+2ss

∗. (N.28)

Now combining Even A1, A2, A3 and L ≤ 2ρ+s∗+2s in assumption, we close the proof as

‖ (TL,λ(θ))S∗ ‖0 ≤ |A1|+ |A2|+ |A3| ≤72Ls∗

ρ−s∗+2s

+ 196κ2s∗+2ss

∗ ≤ (144κs∗+2s + 196κ2s∗+2s)s

∗

≤ s.


Let g = argming∈∂‖θ‖1 Lµ + λ‖θ‖1, then ωλ = ‖∇Lµ + λg‖∞. By the optimality of θ and convexity

of Fµ,λ

, we have

Fµ,λ

(θ)−Fµ,λ

(θ) ≤(∇Lµ + λg

)>(θ − θ) ≤ ‖∇Lµ + λg‖∞‖θ − θ‖1

≤(ωλ(θ) + λ− λ

)‖θ − θ‖1. (N.29)

Besides, we have

‖θ − θ‖1 ≤ ‖θ − θ∗‖1 + ‖θ − θ∗‖1(i)

≤ 6(‖(θ − θ∗)S∗‖1 + ‖(θ − θ∗)S∗‖1

)

≤ 6√s∗(‖(θ − θ∗)S∗‖2 + ‖(θ − θ∗)S∗‖2

) (ii)

≤ 12(λ+ λ)s∗

ρ−s∗+2s

. (N.30)

where (i) and (ii) are from (A.7) and (A.8) in Lemma A.2 respectively. Combining (N.29) and

(N.30), we have desired result.

41


Our analysis has two steps. In the first step, we show that θ(t)∞t=0 converges to the unique limit

point θ. In the second step, we show that the proximal gradient method has linear convergence rate.

Step 1. Note that θ(t+1) = TLµ,λ(θ(t)). Since Fµ,λ(θ) is convex in θ (but not strongly convex),

the sub-level set θ : Fµ,λ(θ) ≤ Fµ,λ(θ(0)) is bounded. By the monotone decrease of Fµ,λ(θ(t))

from (A.16) in Lemma A.8, θ(t)∞t=0 is also bounded. By BolzanoWeierstrass theorem, it has a

convergent subsequence and we will show that θ is the unique accumulation point.

Since Fµ,λ(θ) is bounded below,

limk→∞

‖θ(t+1) − θ(t)‖2 ≤2

L(t)µ

· limk→∞

[Fµ,λ

(θ(t+1)

)−Fµ,λ

(θ(t))]

= 0.

By Lemma A.9, we have

limk→∞

ωλ(θ(t)) = 0,

This implies limk→∞ θ(t) satisfies the KKT condition, hence is an optimal solution.

Let θ be an accumulation point. Since θ = argminθ Fµ,λ(θ), then there exists some g ∈ ∂‖θ‖1such that

∇Fµ,λ(θ) = Lµ,λ(θ) + λg = 0. (N.31)

By Lemma A.4, every proximal update is sparse, hence ‖θS∗‖0 ≤ s. By RSC property in (3.1), if

‖θS∗‖0 ≤ s, i.e.,‖(θ − θ)S∗‖0 ≤ s , then we have

Lµ(θ)− Lµ(θ) ≥ (θ − θ)>∇Lµ(θ) +ρ−s∗+2s

2‖θ − θ‖22, (N.32)

From the convexity of ‖θ‖1 and g ∈ ∂‖θ‖1, we have

‖θ‖1 − ‖θ‖1 ≥ (θ − θ)>g. (N.33)

Combining (N.32) and (N.33), we have for any ‖θS∗‖0 ≤ s,

Fµ,λ(θ)−Fµ,λ(θ) = Lµ(θ) + λ‖θ‖1 −(Lµ(θ)− λ‖θ‖1

)

≥ (θ − θ)>(Lµ(θ) + λg

)+ρ−s∗+2s

2‖θ − θ‖22

(i)=ρ−s∗+2s

2‖θ − θ‖22 ≥ 0, (N.34)

where (i) is from (N.31). Therefore, θ is the unique accumulation point, i.e. limk→∞ θ(t) = θ.

Step 2. The objective Fµ,λ(θ(t+1)) satisfies

Fµ,λ(θ(t+1))(i)

≤ Qµ,λ(θ(t+1),θ(t)

)

(ii)= min

θLµ(θ(t)) +∇Lµ(θ(t))>(θ − θ(t)) +

L(t)λ

2‖θ − θ(t)‖22 + λ‖θ‖1. (N.35)

42

where (i) is from (A.16) in Lemma A.8, (ii) is from the definition of Oµ,λ in (2.3). To further bound

R.H.S. of (N.35), we consider the line segment

S(θ,θ(t)) = θ : θ = αθ + (1− α)θ(t), α ∈ [0, 1].

Then we restrict the minimization over the line segment S(θ,θ(t)),

Fµ,λ(θ(t+1)) ≤ minθ∈S(θ,θ(t))

Lµ(θ(t)) +∇Lµ(θ(t))>(θ − θ(t)) +L

(t)λ

2‖θ − θ(t)‖22 + λ‖θ‖1. (N.36)

Since ‖θS∗‖0 ≤ s and ‖θ(t)

S∗‖0 ≤ s, then for any θ ∈ S(θ,θ(t)), we have ‖θS∗‖0 ≤ s and ‖(θ −θ(t))S∗‖0 ≤ 2s. By RSC property, we have

Lµ(θ) ≥ Lµ(θ(t)) +∇Lµ(θ(t))>(θ − θ(t)) +ρ−s∗+2s

2‖θ − θ(t)‖22

≥ Lµ(θ(t)) +∇Lµ(θ(t))>(θ − θ(t)). (N.37)

Combining (N.36) and (N.37), we have

Fµ,λ(θ(t+1)) ≤ minθ∈S(θ,θ(t))

Lµ(θ) +L

(t)λ

2‖θ − θ(t)‖22 + λ‖θ‖1

= minθ∈S(θ,θ(t))

Fµ,λ(θ) +L

(t)λ

2‖θ − θ(t)‖22

= minα∈[0,1]

Fµ,λ(αθ + (1− α)θ(t)) +α2L

(t)λ

2‖θ − θ(t)‖22

(i)

≤ minα∈[0,1]

αFµ,λ(θ) + (1− α)Fµ,λ(θ(t)) +α2L

(t)λ

2‖θ − θ(t)‖22

(ii)

≤ minα∈[0,1]

Fµ,λ(θ(t))− α(Fµ,λ(θ(t))−Fµ,λ(θ)

)+α2L

(t)λ

ρ−s∗+2s

(Fµ,λ(θ(t))−Fµ,λ(θ)

)

= minα∈[0,1]

Fµ,λ(θ(t))− α(

1− αL(t)λ

ρ−s∗+2s

)(Fµ,λ(θ(t))−Fµ,λ(θ)

), (N.38)

where (i) is from the convexity of Fµ,λ and (ii) is from (N.34).

Minimize the R.H.S. of (N.38) w.r.t. α, the optimal value α =ρ−s∗+2s

2L(t)λ

results in

Fµ,λ(θ(t+1)) ≤ Fµ,λ(θ(t))−ρ−s∗+2s

4L(t)λ

(Fµ,λ(θ(t))−Fµ,λ(θ)

). (N.39)

Subtracting both sides of (N.39) by Fµ,λ(θ), we have

Fµ,λ(θ(t+1))−Fµ,λ(θ) ≤(

1−ρ−s∗+2s

4L(t)λ


)

(i)

≤(

1−ρ−s∗+2s

8ρ+s∗+2s


), (N.40)

43

where (i) is from Remark A.1. Apply (N.40) recursively, we have the desired result.


We first show an upper bound of Lµ. Recall from the analysis in Appendix C of Lemma 3.6, there

exists some α ∈ [0, 1] such that

∇2Lµ =

X>X√nµ, if ‖ξ‖2 < µ

1√n‖ξ‖2 X>

(I− ξξ>

‖ξ‖22

)X, o.w.

where ξ = y −X(w + α∆). We discuss two cases depending on ‖ξ‖2 < µ and ‖ξ‖2 ≥ µ.

Case 1. For ‖ξ‖2 <√nµ, we have from (??) that

Lµ ≤ ‖∇2Lµ‖2 =‖X‖22√nµ

=‖X‖22√nµ

.

Case 2. For ‖ξ‖2 ≥ µ, we have

Lµ ≤ ‖∇2Lµ‖2 =1√n‖ξ‖2

∣∣∣∣∣∣∣∣X>

(I− ξξ>

‖ξ‖22

)X

∣∣∣∣∣∣∣∣2

≤ ‖X‖22√n‖ξ‖2

=‖X‖22√n‖ξ‖2

≤ ‖X‖22√

nµ.

Combining the two cases, we have

Lµ ≤‖X‖22√nµ

. (N.41)

Applying the analogous argument in Step 1 of the proof of Lemma A.6, we have that θ(t)∞t=0

converges to the unique limit point θ. By the monotonicity of Fµ,λ(θ(t)) from (A.16) in Lemma A.8

and convexity of Fµ,λ(θ), we have ‖θ(t) − θ‖2 ≤ R for all t = 1, 2, . . .. Then we have

Fµ,λ(θ(t+1))(i)

≤ Qµ,λ(θ(t+1),θ(t))(ii)

≤ minθFµ,λ(θ) +

L(t)

2‖θ − θ(t)‖22

≤ minθ=αθ+(1−α)θ(t),α∈[0,1]

Fµ,λ(θ) +L(t)

2‖θ − θ(t)‖22

= minα∈[0,1]

Fµ,λ(αθ + (1− α)θ(t)

)+L(t)α2

2‖θ(t) − θ‖22

(iii)

≤ minα∈[0,1]

Fµ,λ(θ(t))− α(Fµ,λ(θ(t))−Fµ,λ(θ)

)+

2‖X‖22R2α2

2√nµ

, (N.42)

where (i) and (ii) are from (A.16) and (A.15) in Lemma A.8 respectively, (iii) is from the convexity

of Fµ,λ(θ), ‖θ(t) − θ‖2 ≤ R for all t = 1, 2, . . . and L(t) ≤ 2Lµ ≤ 2‖X‖22/(√nµ) in Remark A.1 and

Lemma 3.6. We discuss in two cases to provide an upper bound of R.H.S. (N.42).

Case 1: Suppose Fµ,λ(θ(0)) − Fµ,λ(θ) ≤ 2‖X‖22R2/(√nµ). Minimizing the R.H.S. of (N.42)

w.r.t. α, then the optimal value is

α =Fµ,λ(θ(t))−Fµ,λ(θ)

2‖X‖22R2/(√nµ)

≤ Fµ,λ(θ(0))−Fµ,λ(θ)

2‖X‖22R2/(√nµ)

≤ 1.

44

Then we have

Fµ,λ(θ(t+1)) ≤ Fµ,λ(θ(t))−(Fµ,λ(θ(t))−Fµ,λ(θ)

)2

4‖X‖22R2/(√nµ)

Equivalently, we have

Fµ,λ(θ(t+1))−Fµ,λ(θ) ≤ Fµ,λ(θ(t))−Fµ,λ(θ)−(Fµ,λ(θ(t))−Fµ,λ(θ)

)2

4‖X‖22R2/(√nµ)

Denote fk = Fµ,λ(θ(t))−Fµ,λ(θ). Then we have

1

fk+1≤ 1

fk− 1

4f2k‖X‖22R2/(

√nµ)

,

which results in

fk+1 ≥ fk +fk+1

4fk‖X‖22R2/(√nµ)

(i)

≥ fk +1

4‖X‖22R2/(√nµ)

, (N.43)

where (i) is from the monotonicity of Fµ,λ(θ(t)) (A.16) in Lemma A.8. Applying (N.43) recursively,

we have

fk ≥ f0 +k

4‖X‖22R2/(√nµ)

(i)

≥ t+ 2

4‖X‖22R2/(√nµ)

,

where (i) is from Fµ,λ(θ(0))−Fµ,λ(θ) < 2‖X‖22R2/(√nµ). Then we have the desired result (A.14).

Case 2: Suppose Fµ,λ(θ(0))−Fµ,λ(θ) > 2‖X‖22R2/(√nµ). Minimize the R.H.S. of (N.42) w.r.t.

α, then the optimal value is α = 1 and

Fµ,λ(θ(1))−Fµ,λ(θ) ≤ 2‖X‖22R2

2√nµ

.

We claim that for all t = 1, 2, . . .,

Fµ,λ(θ(t))−Fµ,λ(θ) = ct2‖X‖22R2/(√nµ),

where 1t+2 ≤ ct ≤ 2

t+2 . We prove the claim by induction.

This obviously holds when t = 1. Assume 1t+2 ≤ ct ≤ 2

t+2 holds when t = T . For t = T + 1,

minimize the R.H.S. of (N.42) w.r.t. α, then the optimal value is α = cT ≤ 1 and convergence rate

for T + 1-th iteration is cT+1 = cT − c2T /2. Since cT+1 is a increasing function of cT in cT ∈ [0, 1/2]

and ct is monotone decreasing, i.e., ct ≤ c1 for all k > 1, then we verifies the claim since

cT+1 ≤2

T + 2− 1

2

(2

T + 2

)2

≤ 2

T + 3, and

cT+1 ≥1

T + 2− 1

2

(1

T + 2

)2

≤ 1

T + 3.

Combining the two cases, we have the desired result (A.14).

45

N.7 Proof of Lemma K.4

Recall that the model is

zi = Z?,\iθ∗i + Γ

−1/2ii εi,

where zi = Γ−1/2ii xi, Z?,\i = X?,\iΓ

−1/2\i,\i and εi ∼ Nn(0, σ2

i In). Then we have

∇Lµ,i(θ∗i ) =Z>?,\i(Z?,\iθ

∗i − zi)

max√nµ,√n‖zi − Z?,\iθ∗i ‖2= −

Z>?,\iεi

max√nµΓ1/2ii ,√n‖εi‖2

.

Since ‖εi‖2nσ2

i∼ χ2

n, we have from Johnstone (2001) that for any δ ∈ [0, 1/2),

P[

maxi∈1,...,d

‖εi‖22nσ2

i

≤ 1− δ]≤ d exp

(−nδ

2

4

). (N.44)

Besides, Z>?,\iεi ∼ N (0, nσ2i ). Then we have from Liu and Wang (2012) that for any δ ∈ [0, 1/2)

and c > 2,

P[

maxi∈1,...,d

‖Z>?,\iεi‖∞ > σi√

2cn log d(1− δ)]≤ d2−c(1−δ)√πa log d(1− δ)

. (N.45)

Combining (N.44) and (N.45), we have with probability at least 1− d exp(−nδ2

4

)− d2−c(1−δ)√

πa log d(1−δ),

maxi∈1,...,d

‖Z>?,\iεi‖∞maxµΓ

1/2ii ,√n‖εi‖2

≤√

2c log d(1− δ)/nmaxmini∈1,...,d µΓ

1/2ii /(

√nσi),

√1− δ

.

Take δ = 2/5 and c = 7/3, then we have the desired result.

46

a first order free lunch for sqrt-lasso∗

Documents