
TOPICS IN HIGH-DIMENSIONAL REGRESSION AND NONPARAMETRIC MAXIMUM LIKELIHOOD METHODS

BY LONG FENG

A dissertation submitted to the

Graduate School—New Brunswick

Rutgers, The State University of New Jersey

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

Graduate Program in Statistics and Biostatistics

Written under the direction of

Cun-Hui Zhang & Lee H. Dicker

and approved by

New Brunswick, New Jersey

October, 2017


ABSTRACT OF THE DISSERTATION

Topics in high-dimensional regression and nonparametric

maximum likelihood methods

by LONG FENG

Dissertation Director: Cun-Hui Zhang & Lee H. Dicker

This thesis contains two parts. The first part, in Chapters 2-4, addresses three connected issues in penalized least-squares estimation for high-dimensional data. The second part, in Chapter 5, concerns nonparametric maximum likelihood methods for mixture models.

In the first part, we prove estimation, prediction and selection properties of concave penalized least-squares estimation (PLSE) under fully observed and noisy/missing designs, and validate an essential condition for PLSE: the restricted eigenvalue condition. In Chapter 2, we prove that the concave PLSE matches the oracle inequalities for prediction and coefficient estimation of the Lasso, based only on the restricted eigenvalue condition, one of the mildest conditions imposed on the design matrix. Furthermore, under a uniform signal strength assumption, selection consistency does not require any additional conditions for proper concave penalties such as the SCAD penalty and the MCP. A scaled version of the concave PLSE is also proposed to jointly estimate the regression coefficients and the noise level. Chapter 3 concerns high-dimensional regression when the design matrix is subject to missingness or noise. We extend the PLSE for fully observed designs to noisy or missing designs and prove that the same order of coefficient estimation error can be obtained, while requiring no additional condition. Moreover, we show that a linear combination of the $\ell_2$ norm of the regression coefficients and the noise level suffices as a penalty level when noise


or missingness is present. This sharpens commonly known results, in which the $\ell_1$ norm of the coefficients is required. Chapter 4 validates the restricted eigenvalue (RE) type conditions required in Chapters 2 and 3 and considers a more general groupwise version. We prove that the population version of the groupwise RE condition implies its sample version under a low-moment condition, given the usual sample size requirement. Our results include the ordinary RE condition as a special case.

In the second part, we consider nonparametric maximum likelihood (NPML) methods

for mixture models, a nonparametric empirical Bayes approach. We provide concrete

guidance on implementing multivariate NPML methods for mixture models, with theoretical

and empirical support; topics covered include identifying the support set of the mixing

distribution, and comparing algorithms (across a variety of metrics) for solving the

simple convex optimization problem at the core of the approximate NPML problem. In

addition, three diverse real data applications are provided to illustrate the performance of

nonparametric maximum likelihood methods.


Acknowledgements

I would like to express my deepest gratitude to my advisors, Prof. Cun-Hui Zhang and

Prof. Lee Dicker. I feel extremely fortunate to have the opportunity to work with them.

Prof. Zhang is more than an advisor to me. He is a great researcher, a dedicated educator, a devoted mentor, a trusted friend and a respected elder. He has provided me with helpful instruction and exceptional research training, as well as unwavering support and constant encouragement. More importantly, his devotion to research and to his students made me interested in exploring a career in academia and in becoming a researcher and teacher. Prof. Dicker is the junior professor I admire most. His brilliant ideas and excellent statistical intuition always deepen my understanding of research questions. He can always convey complex problems in plain language. I have enjoyed working with him immensely and have benefited greatly from each of our meetings.

Secondly, I would like to extend my gratitude to my dissertation committee, Prof. Pierre Bellec and Prof. Eitan Greenshtein, for the time they dedicated to reviewing my thesis and for their comments on the manuscript. Special thanks go to Prof. Greenshtein for his helpful

discussion on the topics of nonparametric maximum likelihood estimation in Chapter 5 of

this thesis.

In addition, I want to thank Professor John Kolassa for his support over the past five years, and Prof. Minge Xie for his advice and encouragement during my studies. I also want to thank the fellow students in our department and my friends at Rutgers for their suggestions and help. I feel very lucky to have met all these people during my graduate life and to have had such a happy and unforgettable journey.


Dedication

To my family


Table of Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1. High-dimensional regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2. Nonparametric maximum likelihood methods . . . . . . . . . . . . . . . . . . 3

2. Oracle properties of concave PLSE and its scaled version . . . . . . . . . 4

2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2. Statistical Properties of Concave PLSE methods . . . . . . . . . . . . . . . 7

2.2.1. Concave penalties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2.2. The restricted eigenvalue condition . . . . . . . . . . . . . . . . . . . 9

2.2.3. Properties of concave PLSE . . . . . . . . . . . . . . . . . . . . . . . 10

2.3. Smaller penalty levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.3.1. Smaller penalty levels . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.3.2. RE-type conditions for smaller penalty levels . . . . . . . . . . . . . 21

2.3.3. Prediction and estimation errors bounds at smaller penalty levels . . 23

2.4. Scaled concave PLSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4.1. Description of the scaled concave PLSE . . . . . . . . . . . . . . . . 27

2.4.2. Performance guarantees of scaled concave PLSE at universal penalty

levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


2.4.3. Performance bounds of scaled concave PLSE at smaller penalty levels 32

2.5. Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.5.1. No signal case: $\beta^* = 0$ . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.5.2. Effect of correlation: ranging over different ρ . . . . . . . . . . . . . 35

2.5.3. Effect of signal-to-noise ratio: ranging over different snr . . . . . . . 36

2.5.4. Effect of sparsity: ranging over different α . . . . . . . . . . . . . . . 37

2.6. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3. Penalized least-square estimation with noisy and missing data . . . . . 41

3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.2. Theoretical Analysis of PLSE . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.2.1. Restricted eigenvalue conditions . . . . . . . . . . . . . . . . . . . . 44

3.2.2. Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3. Theoretical penalty levels for missing/noisy data . . . . . . . . . . . . . . . 46

3.4. Scaled PLSE and Variance Estimation . . . . . . . . . . . . . . . . . . . . . 51

3.5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4. Group Lasso under Low-Moment Conditions on Random Designs . . . 56

4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2. A review of restricted eigenvalue type conditions . . . . . . . . . . . . . . . 60

4.3. The group transfer principle . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.4. Groupwise compatibility condition . . . . . . . . . . . . . . . . . . . . . . . 67

4.5. Groupwise restricted eigenvalue condition . . . . . . . . . . . . . . . . . . . 76

4.6. Convergence of the restricted eigenvalue . . . . . . . . . . . . . . . . . . . . 80

4.7. Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

4.8. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5. Nonparametric Maximum Likelihood for Mixture Models: A Convex

Optimization Approach to Fitting Arbitrary Multivariate Mixing

Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90


5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.2. NPMLEs for mixture models via convex optimization . . . . . . . . . . . . 92

5.2.1. NPMLEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.2.2. A simple finite-dimensional convex approximation . . . . . . . . . . 93

5.3. Choosing Λ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.4. Connections with finite mixtures . . . . . . . . . . . . . . . . . . . . . . . . 96

5.5. Implementation overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.6. Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.6.1. Comparing NPMLE algorithms . . . . . . . . . . . . . . . . . . . . . 99

5.6.2. Gaussian location scale mixtures: Other methods for estimating a

normal mean vector . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.7. Baseball data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.8. Two-dimensional NPMLE for cancer microarray classification . . . . . . . . 104

5.9. Continuous glucose monitoring . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.9.1. Linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.9.2. Kalman filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.9.3. Comments on results . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.10. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


List of Tables

2.1. Median bias of standard deviation estimates. No signal, $\sigma = 1$, $\rho = 0$, sample size $n = 100$. Minimum error besides the oracle is in bold for each analysis. . . . 35

5.1. Comparison of different NPMLE algorithms. Mean values (standard deviation in parentheses) reported from 100 independent datasets; $p = 1000$ throughout simulations. Mixing distribution 1 has constant $\sigma_j$; mixing distribution 2 has correlated $\mu_j$ and $\sigma_j$. . . . 110

5.2. Mean TSE for various estimators of $\mu \in \mathbb{R}^p$ based on 100 simulated datasets; $p = 1000$. $(q_1, q_2)$ indicates the grid points used to fit $G_\Lambda$. . . . 110

5.3. Baseball data. TSE relative to the naive estimator. Minimum error is in

bold for each analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5.4. Microarray data. Number of misclassification errors on test data. . . . . . 111

5.5. Blood glucose data. MSE relative to CGM. . . . . . . . . . . . . . . . . . . 111


List of Figures

2.1. Median standard deviation estimates over different levels of predictor correlation. $\sigma = 1$, $\alpha = 0.5$, $\mathrm{snr} = 1$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L(1), CV SCAD(2), SZ L(3), SZ L2(4), SZ MCP(5), SZ MCP2(6), SZ MCP3(7). . . . 36

2.2. Median standard deviation estimates over different levels of the signal-to-noise ratio. $\sigma = 1$, $\alpha = 0.5$, $\rho = 0$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L(1), CV SCAD(2), SZ L(3), SZ L2(4), SZ MCP(5), SZ MCP2(6), SZ MCP3(7). . . . 37

2.3. Median standard deviation estimates over different levels of sparsity. $\sigma = 1$, $\mathrm{snr} = 1$, $\rho = 0$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L(1), CV SCAD(2), SZ L(3), SZ L2(4), SZ MCP(5), SZ MCP2(6), SZ MCP3(7). . . . 38

2.4. Five $\lambda_0$'s as functions of $k$, $n = 100$, $p = 1000$. Line numbers refer to (1) $\lambda_0(k) = \{(2/n)\log(p)\}^{1/2}$, (2) $\lambda_0(k) = \{(2/n)\log(p/k)\}^{1/2}$, (3) $\lambda_0(k) = (2/n)^{1/2}L_1(k/p)$, (4) the adaptive $\lambda_0$ described in Section 2.5 with various $k$, assuming that the correlation between columns of $X$ is 0, and (5) the same as (4) except assuming that the correlation between columns of $X$ is 0.8. Here $k_1$ is the solution to (2.37) and $k_2$ is the solution to $2k = L_1^4(k/p) + 2L_1^2(k/p)$. . . . 39


5.1. (a) Histogram of 20,000 independent draws from the estimated distribution

of $(A_j, H_j/A_j)$, fitted with the Poisson-binomial NPMLE to all players in the

baseball dataset; (b) histogram of non-pitcher data from the baseball dataset;

(c) histogram of pitcher data from the baseball dataset. . . . . . . . . . . . 104


Chapter 1

Introduction

1.1 High-dimensional regression

The first part of this thesis addresses three issues in parameter estimation, prediction and

variable selection for high-dimensional regression: concave penalized least-square regression,

high-dimensional regression with noisy and missing data, and restricted eigenvalue-type

conditions for high-dimensional regression.

As modern technology generates massive amounts of data, high-dimensional data have been studied intensively in both statistics and computer science. In linear regression, a widely used approach to analyzing high-dimensional data is penalized least-squares estimation (PLSE). The Lasso, or $\ell_1$ penalization [71], and concave penalization, such as the SCAD [21] and the MCP [84], are the two mainstream methods in penalized least-squares estimation. It has been shown that the concave PLSE guarantees variable selection consistency under significantly weaker conditions than the Lasso; for example, the strong irrepresentable condition on the design matrix required by the Lasso can be replaced by a sparse Riesz condition. Moreover, the concave PLSE also enjoys rate-optimal error bounds in prediction and coefficient estimation. However, the error bounds for prediction and coefficient estimation in the literature still require significantly stronger conditions than the Lasso requires, for example, knowledge of the $\ell_1$ norm of the true coefficient vector or an upper sparse eigenvalue condition. Ideally, selection, prediction and estimation properties should depend only on lower sparse eigenvalue/restricted eigenvalue conditions. Is that achievable? In the second chapter, we give an affirmative answer to this question.

In Chapter 2, we prove that the concave PLSE matches the oracle inequalities for

prediction and $\ell_q$ coefficient estimation of the Lasso, with $1 \le q \le 2$, based only on

the restricted eigenvalue condition, which can be viewed as nearly the weakest available


condition on the design matrix. Furthermore, under a uniform signal strength assumption,

the selection consistency does not require any additional conditions for proper concave

penalties such as the SCAD penalty and the MCP. Our theorems apply to all local solutions computable by path-following algorithms starting from the origin. We also develop a scaled version of the concave PLSE that jointly estimates the regression coefficients and the noise level. The scaled concave PLSE is not a straightforward extension of the scaled Lasso because the joint penalized loss in the regression coefficients and the noise level is non-convex for concave penalties. The computational cost of the scaled concave PLSE is negligible beyond computing a continuous

solution path. All our consistency results apply to cases where the number of predictors p

is much larger than the sample size n.

In Chapter 3, we consider high-dimensional regression when the design matrices are not

fully observable. Two specifications are discussed: missing design and noisy design. We

extend the PLSE to noisy or missing designs and prove that the same order of coefficient estimation error can be obtained as for the fully observed design, while requiring no additional condition. Moreover, we prove that a linear combination of the noise level and the $\ell_2$ norm of the coefficients suffices as a penalty level when noisy or missing data are present. This sharpens existing results, in which an $\ell_1$ norm of the coefficients is required. We further extend the scaled version of the PLSE to the missing and noisy data case. Since cross-validation is time consuming and may be misleading for missing or noisy data, the proposed scaled solution is of great practical use.

As discussed before, restricted eigenvalue (RE) type conditions can be viewed as nearly the weakest available conditions on the design matrix to guarantee the prediction and estimation performance of the Lasso, concave penalized least-squares estimators and groupwise estimators in high-dimensional regression. In Chapter 4, we prove that the population version of the groupwise RE condition implies its sample version under (i) a second-moment uniform integrability assumption on linear combinations of the design variables, and (ii) a fourth-moment uniform boundedness assumption on the individual design variables together with an $m$-th moment assumption, $m > 2$, on linear combinations of the within-group design variables, given the usual sample size requirement. Moreover, the fourth and $m$-th moment assumptions can be removed given a slightly larger sample size.


In addition, the low-moment condition is also sufficient to guarantee the groupwise compatibility condition, an $\ell_1$ version of the RE condition. Our results include the ordinary RE condition as a special case. This study demonstrates a benefit of standardizing the design variables in penalized least-squares estimation for heavy-tailed random designs. It also indicates that the RE condition for a bootstrapped sample can be guaranteed given the corresponding sample RE condition.

1.2 Nonparametric maximum likelihood methods

The second part of this thesis considers two types of models using nonparametric maximum

likelihood (NPML) methods, a nonparametric empirical Bayes approach: NPML methods

for mixture models and NPML methods for linear models.

Nonparametric maximum likelihood (NPML) for mixture models is a technique for

estimating mixing distributions that has a long and rich history in statistics going back to the

1950s, and is closely related to empirical Bayes methods. Historically, NPML-based methods

have been considered to be relatively impractical because of computational and theoretical

obstacles. However, recent work focusing on approximate NPML methods suggests that

these methods may have great promise for a variety of modern applications. Building on this

recent work, we study a class of flexible, scalable, and easy to implement approximate NPML

methods for problems with multivariate mixing distributions. In Chapter 5, we provide

concrete guidance on implementing these methods, with theoretical and empirical support;

topics covered include identifying the support set of the mixing distribution, and comparing

algorithms (across a variety of metrics) for solving the simple convex optimization problem

at the core of the approximate NPML problem. Additionally, we illustrate the methods’

performance in three diverse real data applications: (i) A baseball data analysis (a classical

example for empirical Bayes methods, originally inspired by Efron & Morris), (ii) high-

dimensional microarray classification, and (iii) online prediction of blood-glucose density

for diabetes patients. Among other things, our empirical results clearly demonstrate the

relative effectiveness of using multivariate (as opposed to univariate) mixing distributions

for NPML-based approaches.
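To make the convex-optimization formulation concrete, the sketch below fits the approximate NPMLE of a univariate Gaussian location mixture on a fixed grid of candidate support points, using simple monotone EM-style multiplicative updates on the mixture weights. This is only a minimal illustration under assumed settings (known unit noise scale, a hand-picked grid standing in for $\Lambda$, and the helper name `npmle_weights` is ours); Chapter 5 treats the multivariate case, the choice of $\Lambda$, and comparisons with faster solvers.

```python
import numpy as np

def npmle_weights(x, grid, sigma=1.0, n_iter=500):
    """Weights of the approximate NPMLE for a Gaussian location mixture supported
    on a fixed grid: maximize sum_i log(sum_k w_k phi(x_i - t_k)) over the simplex,
    here via monotone EM-style multiplicative updates."""
    L = np.exp(-0.5 * ((x[:, None] - grid[None, :]) / sigma) ** 2)  # likelihood matrix
    L /= sigma * np.sqrt(2.0 * np.pi)
    w = np.full(grid.size, 1.0 / grid.size)       # start from uniform weights
    for _ in range(n_iter):
        mix = L @ w                               # mixture density at each observation
        w *= (L / mix[:, None]).mean(axis=0)      # multiplicative update; stays on the simplex
    return w

# Toy usage: data from a two-point mixing distribution.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])
grid = np.linspace(-5.0, 5.0, 101)                # candidate support points (Lambda)
w_hat = npmle_weights(x, grid)
print(grid[w_hat > 1e-3], w_hat[w_hat > 1e-3])    # estimated support and masses
```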


Chapter 2

Oracle properties of concave PLSE and its scaled version

2.1 Introduction

The purpose of this chapter is to study the prediction, coefficient estimation, and variable selection properties of the concave penalized least-squares estimator (PLSE) in linear regression under the restricted eigenvalue (RE) condition on the design matrix.

Consider the linear model
\[
y = X\beta^* + \varepsilon, \qquad (2.1)
\]
where $X = (x_1,\ldots,x_p) \in \mathbb{R}^{n\times p}$ is a design matrix, $y \in \mathbb{R}^n$ is a response vector, $\varepsilon \in \mathbb{R}^n$ is a noise vector, and $\beta^* \in \mathbb{R}^p$ is an unknown coefficient vector. For simplicity, we assume throughout the chapter that the design matrix is column normalized with $\|x_j\|_2^2 = n$.

We shall focus on penalized loss functions of the form
\[
L(\beta;\lambda) = \frac{1}{2n}\|y - X\beta\|_2^2 + \sum_{j=1}^{p}\rho(|\beta_j|;\lambda), \qquad (2.2)
\]
where the penalty function $\rho(t;\lambda)$, indexed by $\lambda \ge 0$, is concave in $t > 0$ with $\rho(0+;\lambda) = \rho(0;\lambda) = 0$, and the index $\lambda$ is taken as the penalty level $\lim_{t\to 0+}\rho(t;\lambda)/t$. Additional regularity conditions on $\rho(\cdot;\cdot)$ will be described in Section 2.2. The PLSE can be defined as a statistical choice among local minimizers of the penalized loss.

Among PLSE methods, the Lasso [71] with the absolute penalty $\rho(t;\lambda) = \lambda|t|$ is the

most widely used and extensively studied. The Lasso is relatively easy to compute as it is a

convex minimization problem, but it is well known that the Lasso is biased. A consequence

of this bias is the requirement of a neighborhood stability/strong irrepresentable condition


on the design matrix X for the selection consistency of the Lasso [51, 88, 72, 79]. Fan and

Li [21] proposed a concave penalty to remove the bias of the Lasso and proved an oracle

property for one of the local minimizers of the resulting penalized loss. Zhang [84] proposed

a path finding algorithm PLUS for concave PLSE and proved the selection consistency

of the PLUS-computed local minimizer under a rate optimal signal strength condition on

the coefficients and the sparse Riesz condition (SRC) [85] on the design. The SRC, which

requires bounds on both the lower and upper sparse eigenvalues of the Gram matrix and is

closely related to the restricted isometry property (RIP) [12], is substantially weaker than

the strong irrepresentable condition. This advantage of concave PLSE over the Lasso has

since become well understood.

For prediction and coefficient estimation, the existing literature somehow presents an

opposite story. Consider hard sparse coefficient vectors satisfying $|\mathrm{supp}(\beta^*)| \le s$ with small $(s/n)\log p$. Although rate minimax error bounds were proved under the RIP and the SRC respectively for the Dantzig selector and the Lasso in [11] and [85], Bickel et al. [6] sharpened their results by weakening the RIP and SRC to the RE condition, and van de Geer and Bühlmann [77] proved comparable prediction and $\ell_1$ estimation error bounds under an even weaker compatibility or $\ell_1$ RE condition. Meanwhile, rate minimax error bounds for the concave PLSE still require two-sided sparse eigenvalue conditions like the SRC [84, 87, 80, 22] or a proper known upper bound for the $\ell_1$ norm of the true coefficient vector [46]. It turns out that the difference between the SRC and RE conditions is quite significant, as Rudelson and Zhou [66] proved that the RE condition is a consequence of a lower sparse eigenvalue

condition alone. This seems to suggest a theoretical advantage of the Lasso, in addition to

its computational simplicity, compared with concave PLSE.

An interesting question is whether the RE condition alone on the design matrix is

also sufficient for the above-discussed results on concave penalized prediction, coefficient estimation and variable selection, provided proper conditions on the coefficient and noise vectors. An affirmative answer to this question, which we provide in this chapter, amounts to the removal of the upper sparse eigenvalue condition on the design matrix, and in fact also a relaxation of the lower sparse eigenvalue condition or the restricted strong convexity (RSC) condition [56] imposed in [46]. We also extend


the prediction and estimation error bounds to smaller penalty levels $\lambda$, which are more practical, and establish rate minimaxity in prediction and coefficient estimation when $(s/n)\log(p/s)$ is small.

The Lasso still enjoys computational advantages over concave PLSE. However, this

advantage may not be so drastic in many applications in view of the literature on

statistical and computational properties of iterative and path finding algorithms for concave

penalization [25, 89, 84, 87, 9, 1, 32, 56, 80, 46, 22]. In this chapter, we focus on statistical

properties of local solutions of concave PLSE computable by path finding algorithms as we

are also interested in adaptive choice of the penalty level λ in the solution path and the

estimation of the noise level. Exact solution paths of the PLSE can be computed by the

PLUS algorithm [84], while approximate solution paths can be computed by the gradient descent algorithm of Wang et al. [80] with a computational complexity guarantee.

Once a local solution path of the concave penalization problem is obtained, one still needs to choose an appropriate estimator in the solution path, or equivalently a proper penalty level. This problem, which we also study in this chapter, is equivalent to consistent

estimation of the noise level due to scale invariance.

Substantial effort has been made in scale-free estimation under the $\ell_1$ penalty. The idea

is to make the penalty level proportional to the noise level σ. Stadler et al. [67] proposed to

estimate $\beta$ and $\sigma$ by maximizing their joint log-likelihood with an $\ell_1$ penalty on $\beta/\sigma$ through

reparametrization. In the discussion of [67], Antoniadis [2] proposed to minimize Huber’s

[34] concomitant joint loss function with the $\ell_1$ penalty on $\beta$ without reparametrization,

and Sun and Zhang [68] considered a “naive” iteration between the estimation of β and

σ and proved the bias reduction property of one iteration from the joint estimator of [67].

Belloni et al. [5] introduced and studied a square-root Lasso for the estimation of β. It turns

out that for the $\ell_1$ penalty, Huber's concomitant joint loss, the equilibrium of the iterative

algorithm, and the square-root Lasso all produce the same estimator. Sun and Zhang [69]

proposed the iterative algorithm as scaled PLSE for joint estimation of β and σ under both

the $\ell_1$ and concave penalties and studied the scaled Lasso with the joint penalized loss of [2],

especially the consistency and asymptotic normality of the resulting noise level estimator.

However, a theoretical study of the scaled concave PLSE is noticeably missing.


A main reason for this absence of a theoretical study of scale-free concave PLSE is the loss of the scale-free property: in the joint likelihood, concomitant loss and square-root formulations, it is not proper to use concave penalty functions because they are not proportional to the penalty level. While the iterative approach is still scale free with concave penalties, concave regularization is more difficult to study than the Lasso due to the loss of its equivalence to joint convex minimization.

In this chapter, we find a much weaker condition under which local solutions of concave

PLSE enjoy desired properties in prediction, coefficients estimation, and variable selection

as well. Specifically, we prove that the concave PLSE achieves rate minimaxity in prediction

and coefficients estimation under the $\ell_0$ sparsity condition on $\beta$ and the RE condition on $X$.

Furthermore, the selection consistency can also be guaranteed under an additional uniform

signal strength condition on the nonzero coefficients. In addition, we prove that the same

properties hold for the scaled concave PLSE in the iterative algorithm formulation.

The rest of this chapter is organized as follows. In Section 2.2, we study concave PLSE

under the RE condition on the design. In Section 2.3 we study concave PLSE with smaller

penalty/threshold levels. In Section 2.4 we study theoretical properties of the scaled concave

PLSE. Section 2.5 presents results of an extensive simulation study for variance estimation.

Section 2.6 contains some discussion.

Notation: We denote by $\beta^*$ the true regression coefficient vector, $\Sigma = X^TX/n$ the sample Gram matrix, $S = \mathrm{supp}(\beta^*)$ the support set of the coefficient vector, $s = |S|$ the size of the support, and $\Phi(\cdot)$ the standard Gaussian cumulative distribution function. For vectors $v = (v_1,\ldots,v_p)$, we denote by $\|v\|_q = \big(\sum_j|v_j|^q\big)^{1/q}$ the $\ell_q$ norm, with $\|v\|_\infty = \max_j|v_j|$ and $\|v\|_0 = \#\{j : v_j \neq 0\}$. Moreover, $x_+ = \max(x,0)$.

2.2 Statistical Properties of Concave PLSE methods

In this section, we present our results for concave PLSE at a sufficiently high penalty level

to allow selection consistency. We first need to describe our assumptions on the penalty

function and design matrix.


2.2.1 Concave penalties

We study the class of concave penalties $\rho(t;\lambda)$ satisfying the following properties:

(i) $\rho(t;\lambda)$ is symmetric: $\rho(t;\lambda) = \rho(-t;\lambda)$;

(ii) $\rho(t;\lambda)$ is monotone: $\rho(t_1;\lambda) \le \rho(t_2;\lambda)$ for all $0 \le t_1 < t_2$;

(iii) $\rho(t;\lambda)$ is left- and right-differentiable in $t$ for all $t$;

(iv) $\rho(t;\lambda)$ has the selection property: $\dot\rho(0+;\lambda) = \lambda$;

(v) $|\dot\rho(t-;\lambda)| \vee |\dot\rho(t+;\lambda)| \le \lambda$ for all real $t$.

We write $\dot\rho(t;\lambda) = x$ when $x$ is between the left- and right-derivative of $\rho(t;\lambda)$ at $t$, including $t = 0$, where $\dot\rho(0;\lambda) = x$ means $|x| \le \lambda$. We use the following quantities to measure the concavity of penalty functions. For a given penalty function $\rho(\cdot;\lambda)$, define the maximum concavity at $t$ as
\[
\kappa(t;\rho,\lambda) = \sup_{t' > 0}\frac{\dot\rho(t';\lambda) - \dot\rho(t;\lambda)}{t - t'}, \qquad (2.3)
\]
where the supremum is taken over all possible choices of $\dot\rho(t;\lambda)$ and $\dot\rho(t';\lambda)$ between the left- and right-derivatives. Further, define the overall maximum concavity of $\rho(\cdot;\lambda)$ as
\[
\kappa(\rho) = \kappa(\rho,\lambda) = \max_{t \ge 0}\kappa(t;\rho,\lambda). \qquad (2.4)
\]

Many popular penalties satisfy conditions (i) to (v). We illustrate the SCAD (smoothly clipped absolute deviation) penalty and the MCP (minimax concave penalty) as examples. The SCAD penalty [21] is defined as
\[
\rho(t,\lambda) = \lambda\int_0^{|t|}\Big\{I(x \le \lambda) + \frac{(\gamma\lambda - x)_+}{(\gamma-1)\lambda}\,I(x > \lambda)\Big\}\,dx \qquad (2.5)
\]
with a fixed parameter $\gamma > 2$. A straightforward calculation yields $\kappa(0;\rho,\lambda) = 1/\gamma$ and $\kappa(\rho,\lambda) = 1/(\gamma-1)$ for the SCAD penalty. The MCP [84] is defined as
\[
\rho(t,\lambda) = \lambda\int_0^{|t|}\Big(1 - \frac{x}{\lambda\gamma}\Big)_+\,dx \qquad (2.6)
\]
with $\gamma > 0$ and $\kappa(\rho,\lambda) = \kappa(0;\rho,\lambda) = 1/\gamma$.
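For reference, the integrals in (2.5) and (2.6) have simple closed forms, which the NumPy sketch below evaluates (helper names and the default $\gamma$ values are ours, chosen only for illustration; the overall concavities $1/(\gamma-1)$ and $1/\gamma$ are noted in the comments).

```python
import numpy as np

def scad_penalty(t, lam, gamma=3.7):
    """SCAD penalty rho(t; lambda) of (2.5), gamma > 2; max concavity 1/(gamma-1)."""
    a = np.abs(t)
    quad = lam * a - (a - lam) ** 2 / (2.0 * (gamma - 1.0))   # lam < |t| <= gamma*lam
    flat = (gamma + 1.0) * lam ** 2 / 2.0                     # constant for |t| > gamma*lam
    return np.where(a <= lam, lam * a, np.where(a <= gamma * lam, quad, flat))

def mcp_penalty(t, lam, gamma=2.0):
    """MCP rho(t; lambda) of (2.6), gamma > 0; max concavity 1/gamma."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a ** 2 / (2.0 * gamma), gamma * lam ** 2 / 2.0)
```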

2.2.2 The restricted eigenvalue condition

We now consider conditions on the design matrix. The restricted eigenvalue (RE) condition,

proposed in [6], can be viewed as nearly the weakest available condition on the design to

guarantee rate optimal prediction and coefficients estimation performance of the Lasso. The

RE coefficient RE2pS, ηq for the `2 estimation loss can be defined as follows: For η P r0, 1q

and δ˚ P r0, 1s,

RE22pS; η, δ˚q “ inf

"

uTΣu

u22: p1´ ηquSc1 ď p1` δ˚ηquS1

*

. (2.7)

The RE condition refers to the property that $\mathrm{RE}_2(S;\eta,\delta_*)$ is no smaller than a certain positive constant for all $n$ and $p$. For prediction and $\ell_1$ estimation, an $\ell_1$ version of the RE can be employed. The following compatibility or $\ell_1$-RE coefficient [77] can be used,
\[
\mathrm{RE}_1^2(S;\eta,\delta_*) = \inf\Big\{\frac{u^T\Sigma u\,|S|}{\|u_S\|_1^2} : (1-\eta)\|u_{S^c}\|_1 \le (1+\delta_*\eta)\|u_S\|_1\Big\}. \qquad (2.8)
\]

We introduce a relaxed cone invertibility factor (RCIF) for prediction as
\[
\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega) = \inf\Big\{\frac{\|\Sigma u\|_\infty^2\,|S|}{u^T\Sigma u} : (1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S\Big\}, \qquad (2.9)
\]
where $\omega \in \mathbb{R}^p$, and an RCIF for the $\ell_q$ estimation, $1 \le q \le 2$, as
\[
\mathrm{RCIF}_{\mathrm{est},q}(S;\eta,\omega) = \inf\Big\{\frac{\|\Sigma u\|_\infty\,|S|^{1/q}}{\|u\|_q} : (1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S\Big\}. \qquad (2.10)
\]
The choices of $\delta_*$ and $\omega$ depend on the problem under consideration in the analysis, but typically we have $\|\omega\|_\infty \le 1 + \delta_*\eta$ so that the minimization in (2.9) and (2.10) is taken over a smaller cone. For example, one may take $\omega_S = 0$ for studying selection consistency. We will

use an RE condition to prove cone membership of the estimation error of the concave PLSE

and the RCIF to bound the prediction and coefficients estimation errors. The following

proposition shows that the RCIF may provide sharper bounds than the RE does.


Proposition 2.1. Let RE and RCIF be as in (2.7)-(2.10), $\eta \in (0,1)$, and $\xi = (1+\delta_*\eta)/(1-\eta)$. If $\|\omega_S\|_\infty \le 1 + \delta_*\eta$, then
\[
\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega) \ge \frac{\mathrm{RE}_1^2(S;\eta,\delta_*)}{(1+\xi)^2}, \quad
\mathrm{RCIF}_{\mathrm{est},1}(S;\eta,\omega) \ge \frac{\mathrm{RE}_1^2(S;\eta,\delta_*)}{(1+\xi)^2}, \quad
\mathrm{RCIF}_{\mathrm{est},2}(S;\eta,\omega) \ge \frac{\mathrm{RE}_1(S;\eta,\delta_*)\,\mathrm{RE}_2(S;\eta,\delta_*)}{1+\xi}. \qquad (2.11)
\]

Proof of Proposition 2.1. Since $\|\omega_S\|_\infty \le 1 + \delta_*\eta$, we have
\[
(1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S \le (1+\delta_*\eta)\|u_S\|_1.
\]
It then follows that
\[
\frac{\|\Sigma u\|_\infty^2\,|S|}{u^T\Sigma u} \ge \frac{u^T\Sigma u\,|S|}{\|u\|_1^2} \ge \frac{u^T\Sigma u\,|S|}{(1+\xi)^2\|u_S\|_1^2}.
\]
The first inequality of (2.11) is obtained by taking the infimum over the cone $\mathscr{C}(S;\eta,\delta_*) = \{u : (1-\eta)\|u_{S^c}\|_1 \le (1+\delta_*\eta)\|u_S\|_1\}$. Similarly,
\[
\frac{\|\Sigma u\|_\infty\,|S|}{\|u\|_1} \ge \frac{u^T\Sigma u\,|S|}{\|u\|_1^2} \ge \frac{u^T\Sigma u\,|S|}{(1+\xi)^2\|u_S\|_1^2}, \qquad
\frac{\|\Sigma u\|_\infty\,|S|^{1/2}}{\|u\|_2} \ge \frac{u^T\Sigma u\,|S|^{1/2}}{\|u\|_1\|u\|_2} \ge \frac{u^T\Sigma u\,|S|^{1/2}}{(1+\xi)\|u_S\|_1\|u\|_2}.
\]
The second and third inequalities of (2.11) can be obtained by taking the infimum over the cone $\mathscr{C}(S;\eta,\delta_*)$ in the above inequalities. $\square$

2.2.3 Properties of concave PLSE

As our analysis directly allows the penalty to depend on the index $j$, we consider the following generalization of the penalized loss (2.2),
\[
L(\beta;\lambda) = \frac{1}{2n}\|y - X\beta\|_2^2 + \sum_{j=1}^{p}\rho_j(\beta_j;\lambda).
\]


Given penalty functions $\rho_j(\cdot;\cdot)$ and a penalty level $\lambda$, a vector $\hat\beta \in \mathbb{R}^p$ is a critical point of the penalized loss (2.2) if the following local Karush-Kuhn-Tucker (KKT) condition is satisfied:
\[
x_j^T(y - X\hat\beta)/n = \dot\rho_j(\hat\beta_j;\lambda) \qquad (2.12)
\]
for a certain version of $\dot\rho_j(\hat\beta_j;\lambda)$ (between the left and right derivatives as in our convention) for every $j = 1,\ldots,p$. By property (v) of the penalty, (2.12) is well defined and $|\dot\rho_j(\hat\beta_j;\lambda)| \le \lambda$. When the penalized loss is convex, the local KKT condition (2.12) is necessary and sufficient for the global minimization of the penalized loss $L(\cdot;\lambda)$. In general, solutions of (2.12) include all local minimizers of $L(\cdot;\lambda)$.

For positive $\lambda_*$ and $\kappa_*$, consider the class of all penalty functions $\rho_j(\cdot;\lambda)$ with penalty level no smaller than $\lambda_*$ and concavity no greater than $\kappa_*$,
\[
\mathscr{P}(\lambda_*,\kappa_*) = \big\{\rho_j(\cdot;\lambda) : \lambda \ge \lambda_*,\ \kappa(\rho_j,\lambda) \le \kappa_*\big\}. \qquad (2.13)
\]
Among all local solutions for all such penalties $\rho_j(\cdot;\lambda)$ in $\mathscr{P}(\lambda_*,\kappa_*)$, we shall focus on the subclass $\mathscr{B}_0(\lambda_*,\kappa_*)$ of those connected to the origin through a continuous path of such solutions. Formally, let
\[
\mathscr{B} = \mathscr{B}(\lambda_*,\kappa_*) = \big\{\hat\beta : \text{(2.12) holds with some }\rho_j(\cdot;\lambda) \in \mathscr{P}(\lambda_*,\kappa_*)\big\}.
\]
The class $\mathscr{B}_0(\lambda_*,\kappa_*)$ can be written as
\[
\mathscr{B}_0(\lambda_*,\kappa_*) = \big\{\hat\beta : \hat\beta \text{ and } 0 \text{ are connected in } \mathscr{B}(\lambda_*,\kappa_*)\big\}. \qquad (2.14)
\]
As $\hat\beta = 0$ is the sparsest solution, $\mathscr{B}_0$ can be viewed as the sparse branch of the solution space $\mathscr{B}$.

By definition, $\mathscr{B}_0(\lambda_*,\kappa_*)$ is the set of all local solutions computable by path-following algorithms starting from the origin, with constraints $\lambda \ge \lambda_*$ and $\kappa(\rho_j,\lambda) \le \kappa_*$ on the penalty and concavity levels respectively. This is a large class of estimators as it includes all


local solutions connected to the origin regardless of the specific algorithm used to compute the solution, and different types of penalties can be used in a single solution path. For

example, the Lasso estimator belongs to the class as it is connected to the origin through

the LARS algorithm [58, 59, 20]. The SCAD and MCP solutions belong to the class if they

are computed by the PLUS algorithm [84] or by a path following algorithm from the Lasso

solution.
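As a rough illustration of such a path-following computation, the sketch below runs coordinate descent for the MCP along a decreasing grid of penalty levels, warm-starting each solution from the previous one and starting from the origin at the largest level. It is not the PLUS algorithm [84] or the gradient algorithm of [80]; it only assumes column-normalized designs ($\|x_j\|_2^2 = n$) and $\gamma > 1$, and the helper names are ours.

```python
import numpy as np

def mcp_threshold(z, lam, gamma):
    """Univariate MCP solution: argmin_b 0.5*(z-b)^2 + rho_MCP(|b|; lam, gamma), gamma > 1."""
    if abs(z) <= gamma * lam:
        s = np.sign(z) * max(abs(z) - lam, 0.0)   # soft threshold
        return s / (1.0 - 1.0 / gamma)
    return z                                      # no shrinkage beyond gamma*lam

def mcp_path(X, y, lambdas, gamma=3.0, n_sweeps=50):
    """Coordinate-descent approximation of a concave PLSE solution path,
    warm-started from the origin at the largest penalty level (a stand-in
    for the path-following solutions discussed in the text)."""
    n, p = X.shape                                # assumes ||x_j||_2^2 = n for every column
    beta = np.zeros(p)
    r = y.copy()                                  # residual y - X beta
    path = []
    for lam in np.sort(lambdas)[::-1]:            # decreasing penalty levels
        for _ in range(n_sweeps):
            for j in range(p):
                zj = X[:, j] @ r / n + beta[j]    # partial-residual correlation
                bj = mcp_threshold(zj, lam, gamma)
                r -= X[:, j] * (bj - beta[j])     # keep the residual in sync
                beta[j] = bj
        path.append((lam, beta.copy()))
    return path
```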

The following theorem studies the difference between solutions $\hat\beta \in \mathscr{B}_0(\lambda_*,\kappa_*)$ and an oracle coefficient vector $\beta^o$ satisfying $\mathrm{supp}(\beta^o) \subseteq S$ under the RE condition on the design matrix. The vector $\beta^o \in \mathbb{R}^p$ can be taken as the true regression coefficient vector $\beta^*$, so that Theorem 2.1 directly yields prediction and estimation error bounds under the RE condition. Alternatively, $\beta^o$ can be taken as the oracle LSE $\hat\beta^o$ given by
\[
\hat\beta^o_S = (X_S^TX_S)^{-1}X_S^Ty, \qquad \hat\beta^o_{S^c} = 0, \qquad (2.15)
\]
with $S = \mathrm{supp}(\beta^*)$, so that Theorem 2.1 directly yields sufficient conditions for selection consistency and, indirectly, sharper prediction and estimation error bounds, still under the RE condition.
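A minimal sketch of the oracle LSE (2.15), assuming the support $S$ is given as an index array (the helper name is ours):

```python
import numpy as np

def oracle_lse(X, y, S):
    """Oracle LSE (2.15): least squares on the true support S, zero elsewhere."""
    beta_o = np.zeros(X.shape[1])
    XS = X[:, S]
    beta_o[S] = np.linalg.solve(XS.T @ XS, XS.T @ y)
    return beta_o
```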

We consider here penalty levels no smaller than a certain $\lambda_*$ satisfying
\[
\|X_{S^c}^T(y - X\beta^o)/n\|_\infty < \eta\lambda_*, \qquad \|X_S^T(y - X\beta^o)/n\|_\infty \le \eta\delta_*\lambda_*, \qquad (2.16)
\]
where $\eta < 1$ and $\delta_* \le 1$. When $\varepsilon = y - X\beta^* \sim N(0,\sigma^2I_{n\times n})$ and
\[
\lambda_* = (\sigma/\eta)\sqrt{(2/n)\log p},
\]
(2.16) holds with probability at least $1 - \sqrt{2/(\pi\log p)}$, as the $x_j$ are normalized to $\|x_j\|_2 = \sqrt{n}$, provided that $\beta^o$ is either the true $\beta^*$ with $\delta_* = 1$ or the oracle LSE in (2.15) with $\delta_* = 0$. Smaller penalty levels will be considered in Section 2.3.
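The event (2.16) at this universal penalty level is easy to check by simulation; the sketch below (our own helper, assuming columns normalized to $\|x_j\|_2 = \sqrt{n}$ and $\beta^o = \beta^*$, so $\delta_* = 1$) estimates its probability and reports the lower bound quoted above for comparison.

```python
import numpy as np

def check_2_16(X, beta_star, sigma=1.0, eta=0.9, n_rep=1000, seed=0):
    """Monte Carlo frequency of the event (2.16) with beta^o = beta^* (delta_* = 1)
    at the universal penalty level lambda_* = (sigma/eta) * sqrt((2/n) log p)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    lam_star = (sigma / eta) * np.sqrt(2.0 * np.log(p) / n)
    S = np.flatnonzero(beta_star)
    Sc = np.setdiff1d(np.arange(p), S)
    hits = 0
    for _ in range(n_rep):
        z = X.T @ rng.normal(0.0, sigma, n) / n           # z = X^T (y - X beta^o) / n
        ok_sc = np.max(np.abs(z[Sc])) < eta * lam_star
        ok_s = S.size == 0 or np.max(np.abs(z[S])) <= eta * lam_star
        hits += ok_sc and ok_s
    return hits / n_rep, 1.0 - np.sqrt(2.0 / (np.pi * np.log(p)))  # empirical vs quoted bound
```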

We study the difference between solutions $\hat\beta \in \mathscr{B}_0(\lambda_*,\kappa_*)$ and the oracle coefficient vector $\beta^o$ via a random vector $\omega = \omega(\beta^o,\lambda)$ with elements
\[
w_j = \dot\rho_j(\beta^o_j;\lambda)/\lambda - x_j^T(y - X\beta^o)/(\lambda n). \qquad (2.17)
\]
The relevance of $\omega$ can be clearly seen from the definition of $\hat\beta$ in (2.12) as
\[
x_j^TX(\beta^o - \hat\beta)/n = \lambda w_j + \dot\rho_j(\hat\beta_j;\lambda) - \dot\rho_j(\beta^o_j;\lambda). \qquad (2.18)
\]
We may choose $\omega$ to satisfy $\mathrm{supp}(\omega) \subseteq S$ in our convention, as $\dot\rho_j(\beta^o_j;\lambda)$ is allowed to take any value in $[-\lambda,\lambda]$ for $\beta^o_j = 0$. However, this choice is not used in our analysis. Let $\varphi_{\min}(M)$ denote the minimum eigenvalue of a symmetric matrix $M$.

Theorem 2.1. Let $\hat\beta$ be a solution of (2.12) in $\mathscr{B}_0(\lambda_*,\kappa_*)$ with penalties $\rho_j(\cdot;\lambda)$ in the class $\mathscr{P}(\lambda_*,\kappa_*)$. Suppose $\mathrm{RE}_2^2(S;\eta,\delta_*) \ge \kappa_*$ and (2.16) holds for a certain $\beta^o \in \mathbb{R}^p$ and $S \supseteq \mathrm{supp}(\beta^o)$. Let $\omega$ be as in (2.17).

(i) With $\xi = (1+\delta_*\eta)/(1-\eta)$,
\[
\frac{\|X\hat\beta - X\beta^o\|_2^2}{n} \le \frac{(1+\eta)^2\lambda^2|S|}{\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega)} \le \frac{(1+\xi)^2(1+\eta)^2\lambda^2|S|}{\mathrm{RE}_1^2(S;\eta,\delta_*)}, \qquad (2.19)
\]
\[
\|\hat\beta - \beta^o\|_q \le
\begin{cases}
\dfrac{(1+\eta)\lambda|S|}{\mathrm{RCIF}_{\mathrm{est},1}(S;\eta,\omega)} \le \dfrac{(1+\xi)^2(1+\eta)\lambda|S|}{\mathrm{RE}_1^2(S;\eta,\delta_*)}, & q = 1, \\[1ex]
\dfrac{(1+\eta)\lambda|S|^{1/2}}{\mathrm{RCIF}_{\mathrm{est},2}(S;\eta,\omega)} \le \dfrac{(1+\xi)(1+\eta)\lambda|S|^{1/2}}{\mathrm{RE}_1(S;\eta,\delta_*)\,\mathrm{RE}_2(S;\eta,\delta_*)}, & q = 2, \\[1ex]
\dfrac{(1+\eta)\lambda|S|^{1/q}}{\mathrm{RCIF}_{\mathrm{est},q}(S;\eta,\omega)}, & q \ge 1.
\end{cases} \qquad (2.20)
\]

(ii) Suppose $\max_{j\le p}\kappa(\beta^o_j;\rho_j,\lambda) \le (1 - 1/C_0)\,\mathrm{RE}_2^2(S;\eta,\delta_*)$. Then,
\[
\frac{\|X\hat\beta - X\beta^o\|_2^2}{n} \le (C_0\lambda)^2\sup_{u\neq 0}\frac{\big[\omega_S^Tu_S - (1-\eta)\|u_{S^c}\|_1\big]_+^2}{u^T\Sigma u} \qquad (2.21)
\]
and for any seminorm $\|\cdot\|$ as a loss function
\[
\|\hat\beta - \beta^o\| \le C_0\lambda\sup_{u\neq 0}\frac{\|u\|\big[\omega_S^Tu_S - (1-\eta)\|u_{S^c}\|_1\big]}{u^T\Sigma u}. \qquad (2.22)
\]

(iii) Suppose $\beta^o$ is a solution of (2.12), or equivalently $\omega_S = 0$. Then,
\[
\hat\beta_{S^c} = 0 \quad\text{and}\quad \mathrm{sgn}(\hat\beta_j)\,\mathrm{sgn}(\beta^o_j) \ge 0 \ \ \forall\, j \in S. \qquad (2.23)
\]
If $\kappa(0;\rho_j,\lambda) < \varphi_{\min}(X_S^TX_S/n)$, then
\[
\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^o). \qquad (2.24)
\]
If $\max_{j\in S}\kappa(\beta^o_j;\rho_j,\lambda) < \varphi_{\min}(X_S^TX_S/n)$, then
\[
\hat\beta = \beta^o. \qquad (2.25)
\]

Remark 2.1. In the above theorem, one may also use a relaxed version of $\mathrm{RE}_1$ and $\mathrm{RE}_2$ with the constraint replaced by $(1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S$.

Corollary 2.1. Suppose $\dot\rho_j(t;\lambda) = 0$ for $|t| > \lambda\gamma$ and the conditions of Theorem 2.1 (ii) hold with $\beta^o = \hat\beta^o$ being the oracle estimator in (2.15). Then, (2.21) and (2.22) hold with $\|\omega\|_2^2 \le (1+\delta_*\eta)^2|S_1|$, where $S_1 = \{j \in S : |\hat\beta^o_j| \le \lambda\gamma\}$. Consequently, when $C_0^2/\mathrm{RE}_2^2(S;\eta,\delta_*) = O_P(1)$ and $\lambda \lesssim \sigma\sqrt{(\log p)/n}$,
\[
\|X\hat\beta - X\hat\beta^o\|_2^2/n + \|\hat\beta - \hat\beta^o\|_2^2 = O_P(\sigma^2/n)\,|S_1|\log p,
\]
implying $\hat\beta = \hat\beta^o$ when $|S_1| = 0$, and
\[
\|X\hat\beta - X\beta^*\|_2^2/n + \|\hat\beta - \beta^*\|_2^2 = O_P(\sigma^2/n)\big(|S_1|\log p + |S|\big).
\]

Theorem 2.1 gives a unified treatment of penalized least-squares methods, including the $\ell_1$ and concave penalties, under the RE condition on the design matrix and natural conditions on the penalty. For prediction and coefficient estimation, (2.19) and (2.20) match the state of the art for the Lasso in both the convergence rate and the regularity condition on the design, while (2.21), (2.22) and Corollary 2.1 demonstrate the advantages of concave penalization when $|S_1|$ is of smaller order than $|S|$. Moreover, the prediction and estimation error bounds in Theorem 2.1 (ii) and Corollary 2.1 directly and naturally provide selection consistency when $\omega_S = 0$ or $|S_1| = 0$. More precisely, for selection consistency, Theorem 2.1 (iii) requires only the RE condition for (2.23) and mild additional eigenvalue conditions for (2.24) and (2.25), provided the existence of an oracular solution $\beta^o$ with $\mathrm{supp}(\beta^o) \subseteq S$, or equivalently $\omega_S = 0$. Note that $\kappa(0;\rho_j,\lambda) \le \kappa_*$ and $\mathrm{RE}_2^2(S;0) \le \varphi_{\min}(X_S^TX_S/n)$ by definition. For concave penalties, the condition $\omega_S = 0$ can be fulfilled by the rate-optimal signal strength condition $\min_{j\in S}|\hat\beta^o_j| > \gamma\lambda$ as in Corollary 2.1. However, the condition $\omega_S = 0$ for the Lasso requires more restrictive $\ell_\infty$-type conditions such as the irrepresentable condition on the design. These RE-based results are new and significant, as the existing theory for concave penalization, which requires substantially stronger conditions on the design such as the sparse Riesz condition in [84], leaves a false impression that the Lasso has a technical advantage in prediction and parameter estimation under the RE condition on the design. Moreover, compared with existing analyses, the proof of Theorem 2.1 is much simpler.

For the Lasso, $\kappa_* = 0 \le \mathrm{RE}_2(S;\eta,\delta_*)$ always holds and $C_0 = 1$ in Theorem 2.1 (ii), which implies the following corollary due to $\|\omega_S\|_\infty \le 1 + \eta$.

Corollary 2.2. Let $\hat\beta$ be the Lasso estimator. If (2.16) holds for a coefficient vector $\beta^o \in \mathbb{R}^p$ with $S \supseteq \mathrm{supp}(\beta^o)$, then
\[
\frac{\|X\hat\beta - X\beta^o\|_2^2}{n(1+\eta)^2\lambda^2} \le \sup_{u\neq 0}\frac{\psi^2(u)}{u^T\Sigma u} \le \max\Big\{\frac{|S|(1 - 1/\xi)^2}{\mathrm{RE}_1^2(S;\eta,\delta_*)},\ \frac{|S|}{\mathrm{RE}_1^2(S;0)}\Big\}
\]
with $\psi(u) = \big[\|u_S\|_1 - \|u_{S^c}\|_1/\xi\big]_+$ and $\xi = (1+\eta)/(1-\eta)$, and
\[
\frac{\|\hat\beta - \beta^o\|_2}{(1+\eta)\lambda} \le \sup_{u\neq 0}\frac{\|u\|_2\,\psi(u)}{u^T\Sigma u} \le \max\Big\{\frac{|S|^{1/2}(1 - 1/\xi)}{\mathrm{RE}_{1,2}^2(S;\eta,\delta_*)},\ \frac{|S|^{1/2}}{\mathrm{RE}_{1,2}^2(S;0)}\Big\},
\]
where $\mathrm{RE}_{1,2}(S;\eta,\delta_*) = \{\mathrm{RE}_2(S;\eta,\delta_*)\,\mathrm{RE}_1(S;\eta,\delta_*)\}^{1/2} \ge \mathrm{RE}_2(S;\eta,\delta_*)$.

In fact, the sharper prediction and coefficient estimation error bounds in Corollary 2.2 are the sharpest possible based on the basic inequality $u^T\Sigma u \le (1+\eta)\psi(u)$ for $u = (\hat\beta - \beta^o)/\lambda$. For example, they are strictly sharper than the familiar prediction error bound in [77],
\[
\|X\hat\beta - X\beta^o\|_2^2/n \le (1+\eta)^2\lambda^2|S|/\mathrm{RE}_1^2(S;\eta,\delta_*),
\]
when $\mathrm{RE}_1^2(S;\eta,\delta_*) < \mathrm{RE}_1^2(S;0)$.

To prove Theorem 2.1, we first present the following lemma.

Lemma 2.1. Let $S \subset \{1,\ldots,p\}$, $\lambda > 0$, $\hat\beta$ be a solution of (2.12), and $\beta^o$ a coefficient vector satisfying $\mathrm{supp}(\beta^o) \subseteq S$. Let $h = \hat\beta - \beta^o$, $\omega$ be as in (2.17) and $z = X^T(y - X\beta^o)/n$. Then,
\[
h^T\Sigma h = \sum_j h_j\big\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda w_j\big\}
\le \sum_{j\in S^c}(z_jh_j - \lambda|h_j|) - \lambda\omega_S^Th_S + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2 \qquad (2.26)
\]
and $|w_j| \le 1 + |z_j|/\lambda$ for $j \in S$, where $\kappa(\beta^o_j;\rho_j,\lambda)$ is as in (2.3).

Proof of Lemma 2.1. Recall that $h = \hat\beta - \beta^o$ and $z = X^T(y - X\beta^o)/n$ with $\beta^o$ satisfying $\mathrm{supp}(\beta^o) \subseteq S$. For $j \in S^c$, $h_j = \hat\beta_j$, so that by (2.17)
\[
\begin{aligned}
h_j\big\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda w_j\big\}
&= \hat\beta_j\big\{z_j - \dot\rho_j(\hat\beta_j;\lambda)\big\} \\
&\le \hat\beta_jz_j - |\hat\beta_j|\,\dot\rho_j(|\hat\beta_j|;\lambda) \qquad (2.27) \\
&\le \hat\beta_jz_j - \lambda|\hat\beta_j| + (|\hat\beta_j| - 0)\big\{\dot\rho_j(0+;\lambda) - \dot\rho_j(|\hat\beta_j|;\lambda)\big\} \\
&\le z_jh_j - \lambda|h_j| + \kappa(0;\rho_j,\lambda)h_j^2.
\end{aligned}
\]
For $j \in S$,
\[
h_j\big\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda w_j\big\} \le -\lambda w_jh_j + \kappa(\beta^o_j;\rho_j,\lambda)h_j^2. \qquad (2.28)
\]
Summing the above inequalities over $j$, we find via (2.18) that (2.26) holds. Moreover, by the definition of $\omega$ in (2.17), $|w_j| \le 1 + |z_j|/\lambda$ for $j \in S$. $\square$


Proposition 2.2. Let $S \supseteq \mathrm{supp}(\beta^o)$, $\{\eta,\delta_*\}$ be as in (2.16),
\[
\mathscr{C}(S;\eta,\delta_*) = \big\{u : (1-\eta)\|u_{S^c}\|_1 \le (1+\delta_*\eta)\|u_S\|_1\big\},
\]
and $\mathscr{B}^*_0(\lambda_*,\kappa_*) = \mathscr{B}(\lambda_*,\kappa_*) \cap \{\beta^o + \mathscr{C}(S;\eta,\delta_*)\}$ be the set of all solutions $\hat\beta$ of (2.12) with penalties in $\mathscr{P}(\lambda_*,\kappa_*)$ and estimation error $\hat\beta - \beta^o$ in the cone $\mathscr{C}(S;\eta,\delta_*)$. Let $\hat\beta \in \mathscr{B}^*_0(\lambda_*,\kappa_*)$ with penalty level $\lambda$ and $\tilde\beta \in \mathscr{B}(\lambda_*,\kappa_*)$ with penalty level $\tilde\lambda$. Suppose $\mathrm{RE}_2^2(S;\eta,\delta_*) \ge \kappa_*$ and (2.16) holds. Let $\varepsilon_1 = (\eta - \|z_{S^c}\|_\infty/\lambda_*)/2$, $\varepsilon_2 = \varepsilon_1/(2\kappa_*)$ and $\varepsilon_0 = \min\{\varepsilon_2,\ \varepsilon_1\varepsilon_2/(1+\eta)\}$. Then,
\[
\big\|(\tilde\beta - \beta^o)/\tilde\lambda - (\hat\beta - \beta^o)/\lambda\big\|_1 \le \varepsilon_0 \ \Rightarrow\ \tilde\beta \in \mathscr{B}^*_0(\lambda_*,\kappa_*).
\]

Proposition 2.2 asserts that among general solutions $\hat\beta$ of (2.12) in $\mathscr{B}(\lambda_*,\kappa_*)$, those with the normalized error $(\hat\beta - \beta^o)/\lambda$ inside the cone $\mathscr{C}(S;\eta,\delta_*)$ and those outside the cone are separated by $\varepsilon_0$ in the $\ell_1$ distance of the normalized error. Thus, if $\hat\beta^{(t)}$ is a sequence of such solutions with penalty levels $\lambda^{(t)}$ such that the normalized errors $u^{(t)} = (\hat\beta^{(t)} - \beta^o)/\lambda^{(t)}$ have small $\ell_1$ increments, $\|u^{(t)} - u^{(t-1)}\|_1 \le \varepsilon_0$, then the $u^{(t)}$ are either all in the cone $\mathscr{C}(S;\eta,\delta_*)$ or all outside the cone. In particular, Proposition 2.2 implies that every solution $\hat\beta \in \mathscr{B}_0(\lambda_*,\kappa_*)$ has the cone property $\hat\beta - \beta^o \in \mathscr{C}(S;\eta,\delta_*)$, or equivalently $\mathscr{B}_0(\lambda_*,\kappa_*) \subseteq \beta^o + \mathscr{C}(S;\eta,\delta_*)$, as $(\lambda,\hat\beta)$ is connected to $(\lambda^{(0)},0)$ through a continuous path and the origin $0$ has the cone property.

Proof of Proposition 2.2. Let $u = (\hat\beta - \beta^o)/\lambda$ and $v = (\tilde\beta - \beta^o)/\tilde\lambda$. We want to prove that
\[
\|u - v\|_1 \le \varepsilon_0 \ \text{ and } \ u \in \mathscr{C}(S;\eta,\delta_*) \ \text{ imply } \ v \in \mathscr{C}(S;\eta,\delta_*). \qquad (2.29)
\]
By the definition of $\varepsilon_1$ and condition (2.16), we have $\varepsilon_1 > 0$. As $\kappa(\beta^o_j;\rho_j,\lambda) \le \kappa_*$ and $\|z_{S^c}\|_\infty/\lambda \le \|z_{S^c}\|_\infty/\lambda_* \le \eta - 2\varepsilon_1$, Lemma 2.1 implies that
\[
u^T\Sigma u + \big(1 - \eta + 2\varepsilon_1\big)\|u_{S^c}\|_1
\le -\omega_S^Tu_S + \big\{\max_j\kappa(\beta^o_j;\rho_j,\lambda)\big\}\|u\|_2^2
\le \big(1 + \delta_*\eta\big)\|u_S\|_1 + \kappa_*\|u\|_2^2 \qquad (2.30)
\]
and that the same inequalities also hold for $v$. Recall that $\varepsilon_2 = \varepsilon_1/(2\kappa_*)$. If $\|u\|_1 \le \varepsilon_2$ and $\|v - u\|_1 \le \varepsilon_2$, then $\|v\|_1 \le \varepsilon_1/\kappa_*$, so that the $v$-version of (2.30) implies
\[
\big(1 - \eta + 2\varepsilon_1\big)\|v_{S^c}\|_1 \le \big(1 + \delta_*\eta\big)\|v_S\|_1 + \kappa_*\|v\|_1^2 \le \big(1 + \delta_*\eta\big)\|v_S\|_1 + \varepsilon_1\|v\|_1,
\]
or equivalently
\[
\big(1 - \eta + \varepsilon_1\big)\|v_{S^c}\|_1 \le \big(1 + \delta_*\eta + \varepsilon_1\big)\|v_S\|_1,
\]
which then implies $v \in \mathscr{C}(S;\eta,\delta_*)$. Because $u \in \mathscr{C}(S;\eta,\delta_*)$, we have $\kappa_*\|u\|_2^2 \le u^T\Sigma u$ by the RE condition, so that (2.30) implies
\[
\big(1 + 2\varepsilon_1 - \eta\big)\|u_{S^c}\|_1 \le \big(1 + \delta_*\eta\big)\|u_S\|_1.
\]
Due to $(1 + \varepsilon_1 - \eta)(1 + \delta_*\eta) \le (1 + 2\varepsilon_1 - \eta)(1 - \varepsilon_1 + \delta_*\eta)$, we have
\[
\big(1 + \varepsilon_1 - \eta\big)\|u_{S^c}\|_1 \le \big(1 - \varepsilon_1 + \delta_*\eta\big)\|u_S\|_1.
\]
If $\|u\|_1 > \varepsilon_2$ and $\|v - u\|_1 \le \varepsilon_1\varepsilon_2/(1 + \delta_*\eta)$, then $v \in \mathscr{C}(S;\eta,\delta_*)$ follows from
\[
\begin{aligned}
\big(1 - \eta\big)\|v_{S^c}\|_1 - \big(1 + \delta_*\eta\big)\|v_S\|_1
&\le \big(1 - \eta\big)\|u_{S^c}\|_1 - \big(1 + \delta_*\eta\big)\|u_S\|_1 + (1 + \delta_*\eta)\|v - u\|_1 \\
&\le \big(1 - \eta\big)\|u_{S^c}\|_1 - \big(1 + \delta_*\eta\big)\|u_S\|_1 + \varepsilon_1\|u\|_1 \\
&\le 0.
\end{aligned}
\]
Thus, (2.29) holds with $\varepsilon_0 = \min\{\varepsilon_2,\ \varepsilon_1\varepsilon_2/(1+\eta)\}$. $\square$

Proof of Theorem 2.1. Let $h = \hat\beta - \beta^o$ and $u = h/\lambda$. It follows from Proposition 2.2 that $u \in \mathscr{C}(S;\eta,\delta_*)$, as $(\lambda,\hat\beta)$ is connected to $(\lambda^{(0)},0)$ through a continuous path and the origin has the cone property. Let $1 \le C_0 \le \infty$ satisfy the condition $\max_j\kappa(\beta^o_j;\rho_j,\lambda) \le (1 - 1/C_0)\,\mathrm{RE}_2^2(S;\eta,\delta_*)$. As $u \in \mathscr{C}(S;\eta,\delta_*)$, $u^T\Sigma u \ge \mathrm{RE}_2^2(S;\eta,\delta_*)\|u\|_2^2$, so that by (2.30)
\[
C_0^{-1}u^T\Sigma u + (1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S \le \|\omega_S\|_2\|u\|_2. \qquad (2.31)
\]
This immediately implies (2.21) and (2.22) with $u = (\hat\beta - \beta^o)/\lambda$. For (2.19) and (2.20), we set $C_0 = \infty$. By the definition of the RCIF,
\[
\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega)\,u^T\Sigma u \le \|\Sigma u\|_\infty^2|S|.
\]
Consequently, the first inequality in (2.19) follows from the fact that
\[
\|\Sigma u\|_\infty \le \|X^T(y - X\hat\beta)/n\|_\infty/\lambda + \|X^T(y - X\beta^o)/n\|_\infty/\lambda \le 1 + \eta,
\]
and the second inequality follows from the first inequality in (2.11). Similarly, the first inequality in (2.20) follows from
\[
\mathrm{RCIF}_{\mathrm{est},q}(S;\eta,\omega)\|u\|_q \le \|\Sigma u\|_\infty|S|^{1/q} \le (1+\eta)|S|^{1/q},
\]
and the second follows from the second and third inequalities in (2.11).

Finally we consider selection consistency under the assumption $\omega_S = 0$. In this case, $\beta^o$ is a solution of (2.12), and $\|h_{S^c}\|_1 = 0$ by (2.31). Moreover, because both $\beta^o$ and $\hat\beta$ are solutions of (2.12) with support in $S$,
\[
\kappa_*\|h_S\|_2^2 \le h_S^T\Sigma_{S,S}h_S = -\sum_{j\in S}h_j\big\{\dot\rho_j(\hat\beta_j;\lambda) - \dot\rho_j(\beta^o_j;\lambda)\big\} \le \sum_{j\in S}\kappa(\beta^o_j;\rho_j,\lambda)h_j^2.
\]
As $\kappa(\beta^o_j;\rho_j,\lambda) \le \kappa_*$, the maximum concavity is attained above at every $j \in S$, in the sense that $-\dot\rho_j(\hat\beta_j;\lambda) + \dot\rho_j(\beta^o_j;\lambda) = \kappa_*(\hat\beta_j - \beta^o_j)$ for all $j \in S$ with $h_j \neq 0$. This is possible only when $\mathrm{sgn}(\hat\beta_j)\,\mathrm{sgn}(\beta^o_j) \ge 0$ for all $j \in S$. Furthermore, $\mathrm{sgn}(\hat\beta_j)\,\mathrm{sgn}(\beta^o_j) > 0$ for all $j \in S$ when $\kappa(0;\rho_j,\lambda) < \varphi_{\min}(X_S^TX_S/n)$, and $h_S = 0$ when $\max_{j\in S}\kappa(\beta^o_j;\rho_j,\lambda) < \varphi_{\min}(X_S^TX_S/n)$. $\square$


2.3 Smaller penalty levels

We have studied in Section 2.2 exact solutions of (2.12) for penalty levels $\lambda \ge \lambda_*$ in the event where $\lambda_*$ is a strict upper bound for the supremum norm of the random vector $z = X^T(y - X\beta^o)/n$, as in (2.16). Such penalty or threshold levels are commonly used in the

quite conservative and often yields poor numerical results. In this section, we consider

smaller penalty levels under somewhat stronger RE conditions on the design.

2.3.1 Smaller penalty levels

We consider penalty levels $\lambda$ which control a sparse $\ell_2$ norm of a truncated $z = X^T(y - X\beta^o)/n$, instead of the larger $\ell_\infty$ norm of $z$. For $q \in [1,\infty]$ and $t > 0$, the sparse $\ell_q$ norm is defined as
\[
\|v\|_{(q,t)} = \max_{J\subset\{1,\ldots,p\},\ |J| < t+1}\|v_J\|_q.
\]

To control the effect of the noise, we consider penalty levels $\lambda \ge \lambda_*$ with a minimum penalty level $\lambda_*$ such that
\[
\big\|(|z| - \eta_0\lambda_*)_+\big\|_{(2,m)} = \sup_{|J| = m}\sqrt{\textstyle\sum_{j\in J}\big(|z_j| - \eta_0\lambda_*\big)_+^2} < \eta_1 m^{1/2}\lambda_* \qquad (2.32)
\]
holds with high probability for certain positive numbers $\eta_0$ and $\eta_1$ satisfying $\eta_0 + \eta_1 < 1$ and a positive integer $m$. It is clear that (2.16) implies (2.32) with $\eta = \eta_0 + \eta_1$ and $m = 1$. As properties of the Lasso have been studied in [70] under penalty levels $\lambda > \lambda_*$ with the smaller $\lambda_*$ in (2.32), the results in this subsection can be viewed as an extension of their results to general solutions of (2.12) in the set $\mathscr{B}_0(\lambda_*,\kappa_*)$ in (2.14).
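The left-hand side of (2.32) is simply the $\ell_2$ norm of the $m$ largest soft-truncated entries of $z$, so the event is straightforward to check numerically; a small sketch (helper names are ours):

```python
import numpy as np

def truncated_sparse_l2(z, eta0, lam_star, m):
    """|| (|z| - eta0*lambda_*)_+ ||_(2,m): the l2 norm of the m largest
    soft-truncated entries, i.e. the left-hand side of (2.32)."""
    t = np.maximum(np.abs(z) - eta0 * lam_star, 0.0)
    top_m = np.sort(t)[-m:]                      # the m largest truncated values
    return np.sqrt(np.sum(top_m ** 2))

def satisfies_2_32(z, eta0, eta1, lam_star, m):
    """Check whether the event (2.32) holds for a given z = X^T(y - X beta^o)/n."""
    return truncated_sparse_l2(z, eta0, lam_star, m) < eta1 * np.sqrt(m) * lam_star
```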

With $\eta = \eta_0 + \eta_1$ and $z = X^T(y - X\beta^o)/n$, define
\[
\bar S = \{j \in \{1,\ldots,p\} : |z_j| \ge \eta\lambda_*\} \qquad (2.33)
\]
to be the set of indices of the large $|z_j|$'s. The main consequences of (2.32) are
\[
|\bar S| < m, \qquad \textstyle\sum_{j\in\bar S}(|z_j| - \eta_0\lambda_*)_+^2 < m(\eta_1\lambda_*)^2, \qquad \|z_{\bar S^c}\|_\infty < \eta\lambda_*. \qquad (2.34)
\]
These properties can be used to prove
\[
\|X\hat\beta - X\beta^o\|_2^2/n \lesssim \big(\|\omega_S\|_2^2 + m\big)\lambda^2, \qquad (2.35)
\]
with $\|\omega_S\|_2^2 \lesssim |S|$ in the worst-case scenario, and parallel estimation error bounds under a certain RE-type condition. See [70] and Subsection 2.3.3.

Consider Gaussian noise $\varepsilon = y - X\beta^* \sim N(0,\sigma^2I_{n\times n})$. Let $L_1(t) = \Phi^{-1}(1-t)$ be the standard normal negative quantile function. Sun and Zhang [70] proved that when $\beta^o$ is the true coefficient vector, $\beta^o = \beta^*$, (2.32) holds with probability at least $1 - \epsilon$ under the conditions
\[
\eta_0\lambda_* = (\sigma/n^{1/2})L_1(k/p), \qquad
\frac{\eta_1}{\eta_0} > \Big(\frac{4k/m}{L_1^4(k/p) + 2L_1^2(k/p)}\Big)^{1/2} + \frac{L_1(\epsilon/p)}{L_1(k/p)}\Big(\frac{\kappa_+(m)}{m}\Big)^{1/2}, \qquad (2.36)
\]
where $\kappa_+(m) = \max\{u^T\Sigma u : \|u\|_0 = m,\ \|u\|_2 = 1\}$ is the upper sparse eigenvalue of $\Sigma$. A conservative choice of $k$ is to take
\[
k = L_1^4(k/p) + 2L_1^2(k/p) \qquad (2.37)
\]
as in [70], giving $m = O(1)$ in the prediction and estimation error bounds. However, by (2.35), a larger $k$ can be taken without changing the order of the error bounds as long as $m \lesssim \|\omega_S\|_2^2$.
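The conservative choice (2.37) is a one-dimensional fixed-point equation in $k$ and can be solved numerically; a minimal sketch using a root finder (assuming $L_1(t) = \Phi^{-1}(1-t)$ as above; helper names are ours):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def L1(t):
    """Standard normal negative quantile, L_1(t) = Phi^{-1}(1 - t)."""
    return norm.ppf(1.0 - t)

def conservative_k(p):
    """Solve the conservative choice (2.37), k = L_1^4(k/p) + 2 L_1^2(k/p), for k."""
    f = lambda k: k - (L1(k / p) ** 4 + 2.0 * L1(k / p) ** 2)
    return brentq(f, 1e-3, p / 2.0)   # f < 0 near 0 and f > 0 at k = p/2, where L_1(1/2) = 0

# The corresponding penalty level in (2.36) is then
# eta0 * lambda_* = (sigma / sqrt(n)) * L_1(k / p).
```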

2.3.2 RE-type conditions for smaller penalty levels

When a smaller penalty level is taken, a lower level of regularization is imposed on the estimator $\hat\beta$, so that the estimation error $h = \hat\beta - \beta^o$ may fail the condition $(1-\eta)\|h_{S^c}\|_1 \le (1+\eta)\|h_S\|_1$ in the definition of the restricted eigenvalue in (2.7). However, in the event


(2.32), we can still prove the membership of the error $h$ in the following larger cone,
\[
\mathscr{U}(S,\eta_0,\eta_1,m) = \Big\{u : \big(1-\eta\big)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\bar S}\|_2 - \|u_{\bar S}\|_1\big)\Big\}
\]
with $\eta = \eta_0 + \eta_1 < 1$ and the set $\bar S$ in (2.33). This will be verified in the proof of Theorem 2.2 but can also be vaguely seen from (2.34). Consequently, the restricted eigenvalue is defined over the larger cone as
\[
\overline{\mathrm{RE}}_2(S;\eta_0,\eta_1,m) = \inf\Big\{\frac{(u^T\Sigma u)^{1/2}}{\|u\|_2} : 0 \neq u \in \mathscr{U}(S,\eta_0,\eta_1,m)\Big\}. \qquad (2.38)
\]
When $m = 1$, $\bar S = \emptyset$ and the restricted eigenvalue (2.38) coincides with the original RE as defined in (2.7). Although (2.38) is a random variable due to its dependence on $\bar S$ (even for deterministic designs), it is no smaller than
\[
\overline{\mathrm{RE}}_{*,2}(S;\eta,m) = \min_{|T\setminus S| < m}\,\inf\Big\{\frac{(u^T\Sigma u)^{1/2}}{\|u\|_2} : \|u_{T^c}\|_1 < \xi|T|^{1/2}\|u_T\|_2\Big\} \qquad (2.39)
\]
due to $|\bar S| < m$ in (2.34), where $\xi = (1+\eta)/(1-\eta)$.

Similarly, we extend the relaxed cone invertibility factor (RCIF) as
\[
\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m) = \inf\Big\{\frac{\|\Sigma u\|_c^2\,s_*}{u^T\Sigma u} : u \in \mathscr{U}(S;\eta_0,\eta_1,m)\Big\}, \quad
\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m) = \inf\Big\{\frac{\|\Sigma u\|_c\,(s_*)^{1/q}}{\|u\|_q} : u \in \mathscr{U}(S,\eta_0,\eta_1,m)\Big\}, \qquad (2.40)
\]
where $s_* = \max\big\{|S|,\ |\bar S|\big\}$ represents a potentially lower level of sparsity due to the possible selection of variables outside $S$, and $\|\cdot\|_c$ is a combination of the $\ell_2$ norm on $\bar S$ and the $\ell_\infty$ norm on $\bar S^c$, defined as
\[
\|v\|_c = \max\big\{\|v_{\bar S}\|_2/m^{1/2},\ \|v_{\bar S^c}\|_\infty\big\}.
\]
These are the modified RCIFs for prediction and estimation respectively. When $m = 1$, the combination norm coincides with the $\ell_\infty$ norm and the modified RCIFs coincide with those in (2.9) and (2.10) respectively.

2.3.3 Prediction and estimation error bounds at smaller penalty levels

Theorem 2.2. Let $\hat\beta$ be a solution of (2.12) in $\mathscr{B}_0(\lambda_*,\kappa_*)$ with penalties $\rho_j(\cdot;\lambda) \in \mathscr{P}(\lambda_*,\kappa_*)$. Let $\eta = \eta_0 + \eta_1 < 1$ with positive $\eta_0$ and $\eta_1$, $m$ be a positive integer, $\bar S$ be as in (2.33) with a certain $\beta^o \in \mathbb{R}^p$, $S \supseteq \mathrm{supp}(\beta^o)$, and $s_* = \max\{|S|,\ |\bar S|\}$. Suppose $\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m) \ge \kappa_*$ and (2.32) holds. Then,
\[
\frac{\|X\hat\beta - X\beta^*\|_2^2}{n} \le \frac{\{(1+\eta)\lambda\}^2 s_*}{\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m)} \le \frac{\{(1+\eta)\xi_1\lambda\}^2 s_*}{\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)} \qquad (2.41)
\]
with $\xi_1 = \big[2(|S|/s_*)^{1/2} + (1-\eta_0)(m/s_*)^{1/2}\big]/\big(1-\eta\big)$, and
\[
\|\hat\beta - \beta^*\|_q \le
\begin{cases}
\dfrac{(1+\eta)\lambda(s_*)^{1/2}}{\overline{\mathrm{RCIF}}_{\mathrm{est},2}(S;\eta_0,\eta_1,m)} \le \dfrac{(1+\eta)\xi_1\lambda(s_*)^{1/2}}{\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)}, & q = 2, \\[1ex]
\dfrac{(1+\eta)\lambda(s_*)^{1/q}}{\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m)}, & \forall\, q \in [1,2].
\end{cases} \qquad (2.42)
\]

Remark 2.2. When $m \asymp s$, $k$ is of the same order, $k \asymp m$. The penalty level $\lambda_*$ in (2.37) is then of order $\sqrt{(2/n)\log(p/k)}$. Theorem 2.2 guarantees that the prediction and $\ell_2$ estimation errors are of order
\[
\|X\hat\beta - X\beta^*\|_2^2/n + \|\hat\beta - \beta^*\|_2^2 \asymp (m/n)\log(p/m) \asymp (s/n)\log(p/s).
\]
This matches the minimax prediction and $\ell_2$ estimation rate of the Slope in Bellec et al. [4].

As an extension of Theorem 2.1 (i), Theorem 2.2 provides prediction and estimation

error bounds in the same form for smaller penalty levels with somewhat smaller RCIF

and RE. However, the approach does not provide a full extension of Theorem 2.1 in several

aspects. Due to the use of the $\ell_2$ norm in condition (2.32), the $\ell_q$ estimation error bound can be extended only for $1 \le q \le 2$, and the compatibility coefficient cannot be used to bound the prediction and $\ell_1$ errors. In addition, solutions of (2.12) are not selection consistent at

the smaller penalty level due to a high likelihood of some false positive selection.


We have considered so far solutions of (2.12) in the main branch of the solution space $\mathscr{B}_0(\lambda_*,\kappa_*)$ in (2.14). Such solutions are computable by path-finding algorithms. In fact, as discussed below Proposition 2.2, our analysis is also applicable if $(\hat\beta - \beta^o)/\lambda$ is connected to a cone through a discrete sequence of such normalized errors in small $\ell_1$ increments. Statistical and computational properties of iterative discrete solution paths have been studied in [80] among others. However, compared with Theorems 2.1 and 2.2, [80] requires upper sparse eigenvalue conditions on $X$ and larger penalty levels satisfying (2.16).

Proof of Theorem 2.2. Let $h = \hat\beta - \beta^o$. Recall that Lemma 2.1 gives
\[
h^T\Sigma h \le \sum_{j\in S^c}(z_jh_j - \lambda|h_j|) - \lambda\omega_S^Th_S + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2
= h^Tz - \lambda\|h_{S^c}\|_1 - \sum_{j\in S}h_j\dot\rho_j(\beta^o_j;\lambda) + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2, \qquad (2.43)
\]
where $z = X^T(y - X\beta^o)/n$, $w_j = \dot\rho_j(\beta^o_j;\lambda)/\lambda - z_j/\lambda$, and $\kappa(t;\rho_j,\lambda)$ is as in (2.3). Let
\[
\varepsilon_1 = \min\big\{\eta - \|z_{\bar S^c}\|_\infty/\lambda_*,\ \eta_1 - \|(|z_{\bar S}| - \eta_0\lambda_*)_+\|_2/(m^{1/2}\lambda_*)\big\}
\]
and $T \supseteq \mathrm{supp}(z)$. By (2.34), $\varepsilon_1 > 0$ and
\[
\begin{aligned}
|h^Tz| &\le (\eta - \varepsilon_1)\lambda_*\|h_{T\setminus\bar S}\|_1 + (\eta_0 - \varepsilon_1)\lambda_*\|h_{\bar S}\|_1 + \sum_{j\in\bar S}|h_j|\big(|z_j| - (\eta_0 - \varepsilon_1)\lambda_*\big)_+ \qquad (2.44) \\
&\le \eta\lambda_*\|h_{T\setminus\bar S}\|_1 + \eta_0\lambda_*\|h_{\bar S}\|_1 - \varepsilon_1\lambda_*\|h_T\|_1 + \Big(\|(|z_{\bar S}| - \eta_0\lambda_*)_+\|_2 + \varepsilon_1m^{1/2}\lambda_*\Big)\|h_{\bar S}\|_2 \\
&\le (\eta - \varepsilon_1)\lambda_*\|h_T\|_1 + \eta_1\lambda_*\big(m^{1/2}\|h_{\bar S}\|_2 - \|h_{\bar S}\|_1\big).
\end{aligned}
\]
Let $u = h/\lambda$ with $\lambda \ge \lambda_*$. Combining (2.43) and (2.44), we have
\[
u^T\Sigma u + \varepsilon_1\|u\|_1 + (1-\eta)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\bar S}\|_2 - \|u_{\bar S}\|_1\big) + \kappa_*\|u\|_2^2. \qquad (2.45)
\]
The above inequality holds for all $u = (\hat\beta - \beta^o)/\lambda$ as long as $\hat\beta \in \mathscr{B}(\lambda_*,\kappa_*)$. As in the proof of Proposition 2.2, the condition $\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m) \ge \kappa_*$ and the above inequality imply that for all such $u$,
\[
(1-\eta)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\bar S}\|_2 - \|u_{\bar S}\|_1\big),
\]
so that $u \in \mathscr{U}(S,\eta_0,\eta_1,m)$.

By the definition of $\overline{\mathrm{RCIF}}$, we have
\[
\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m)\,h^T\Sigma h \le \|\Sigma h\|_c^2\,s_*, \qquad (2.46)
\]
and
\[
\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m)\,\|h\|_q \le \|\Sigma h\|_c\,(s_*)^{1/q}. \qquad (2.47)
\]
Moreover, we have
\[
\|(\Sigma h)_{\bar S^c}\|_\infty \le \|X_{\bar S^c}^T(y - X\hat\beta)/n\|_\infty + \|X_{\bar S^c}^T(y - X\beta^o)/n\|_\infty \le (1+\eta)\lambda,
\]
and
\[
\|(\Sigma h)_{\bar S}\|_2 \le \|X_{\bar S}^T(y - X\hat\beta)/n\|_2 + \|X_{\bar S}^T(y - X\beta^o)/n\|_2
\le \lambda m^{1/2} + \eta_0\lambda_*m^{1/2} + \eta_1m^{1/2}\lambda_* \le (1+\eta)m^{1/2}\lambda.
\]
Thus,
\[
\|\Sigma h\|_c = \max\big\{\|(\Sigma h)_{\bar S^c}\|_\infty,\ \|(\Sigma h)_{\bar S}\|_2/m^{1/2}\big\} \le (1+\eta)\lambda. \qquad (2.48)
\]
We establish the RCIF error bounds in (2.41) and (2.42) by inserting the above inequality into (2.46) and (2.47) respectively.

To compare the RCIF and the RE, we note that
\[
(1-\eta)\|u\|_1 \le 2|S|^{1/2}\|u_S\|_2 + \eta_1m^{1/2}\|u_{\bar S}\|_2
\]
for $u \in \mathscr{U}(S;\eta_0,\eta_1,m)$, so that for $\eta < 1$
\[
\begin{aligned}
u^T\Sigma u &= u_{\bar S^c}^T(\Sigma u)_{\bar S^c} + u_{\bar S}^T(\Sigma u)_{\bar S} \\
&\le \|u_{\bar S^c}\|_1\|\Sigma u\|_c + m^{1/2}\|u_{\bar S}\|_2\|\Sigma u\|_c \\
&\le \Big(\frac{2|S|^{1/2}\|u_S\|_2 + \eta_1m^{1/2}\|u_{\bar S}\|_2}{1-\eta} + m^{1/2}\|u_{\bar S}\|_2\Big)\|\Sigma u\|_c \\
&\le \frac{2|S|^{1/2} + (1-\eta_0)m^{1/2}}{1-\eta}\,\|u\|_2\|\Sigma u\|_c.
\end{aligned}
\]
It follows that
\[
\frac{u^T\Sigma u}{\|u\|_2^2} \le \Big[\frac{2(|S|/s_*)^{1/2} + (1-\eta_0)(m/s_*)^{1/2}}{1-\eta}\Big]^2\,\frac{\|\Sigma u\|_c^2\,s_*}{u^T\Sigma u},
\]
and
\[
\frac{u^T\Sigma u}{\|u\|_2^2} \le \Big[\frac{2(|S|/s_*)^{1/2} + (1-\eta_0)(m/s_*)^{1/2}}{1-\eta}\Big]\,\frac{\|\Sigma u\|_c\,(s_*)^{1/2}}{\|u\|_2}.
\]
Taking the infimum over the cone $\mathscr{U}(S;\eta_0,\eta_1,m)$ on both sides and noting that $\xi_1 = \big[2(|S|/s_*)^{1/2} + (1-\eta_0)(m/s_*)^{1/2}\big]/\big(1-\eta\big)$, we obtain
\[
\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m) \ge \overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)/\xi_1^2,
\]
and
\[
\overline{\mathrm{RCIF}}_{\mathrm{est},2}(S;\eta_0,\eta_1,m) \ge \overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)/\xi_1.
\]
This completes the proof. $\square$

2.4 Scaled concave PLSE

We have studied in previous sections the properties of all the local solutions in B0pλ˚, κ˚q.

Suppose that the local solution set B0pλ˚, κ˚q is obtained, one still needs to choose an

appropriate solution in the set or a proper penalty level. This problem, which will be

studied in this section, is essentially to estimate the noise level σ due to scale invariance.

Page 38: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

27

Numerous efforts have been devoted to scale free estimation under the `1 penalty. Stadler

et al. [67] proposed the minimizer of joint log-likelihood of regression coefficients and noise

level with an `1 penalty. The comment on this paper by Antoniadis [2] pointed out that

their estimator is equivalent to the joint minimization of Huber’s concomitant loss

ppβ, pσq “ arg minβ,σ

y ´Xβ222nσ

2` λ0β1. (2.49)

It turns out that (2.49) coincides with many other works on the scale free estimation under

the `1 penalty. For example, the square-root Lasso solution [5] and the equilibrium of the

iterative algorithm [69] are both equivalent to (2.49). However, all of these studies of the

scale free estimation are limited to the `1 penalty. The scaled concave PLSE is not an easy

extension of the scaled `1 penalization due to the loss of scale free property.

In fact, the concomitant loss or the square-root formulation fail for concave penalties.

To illustrate this, we take the MCP as an example. Denote σ˚ “ y ´Xβ˚2n12 as the

oracle noise level estimator given the true coefficients β˚. Under Gaussian assumption, this

is the maximum likelihood estimator for σ when β˚ is known and thus a natural estimation

target. For Minimax concave penalty ρpt, λq “ λş|t|0 p1´ xpλγqq`dx,

pσ2 “ arg minσ2

y ´Xβ˚222nσ

2`

1

σ

pÿ

j“1

ρp|β˚j |;λ0σq

“ tσ˚u2 ´ p1γqÿ

jPt|β˚j |ăλ0pσγu

tβ˚j u2,

pσ is expected to underestimate σ˚ unless there is no small β˚j such that t|β˚j | ă λ0pσγu.

This validates the argument that concomitant loss formulation fails for concave penalties.

In addition, the iterative algorithm becomes extremely difficult to analyze due to the loss

of its equivalence to joint convex minimization, compared with the Lasso.

2.4.1 Description of the scaled concave PLSE

Given a coefficients pβ, define the noise level estimator as

pσppβq “ y ´Xpβ2tn´ du12, (2.50)

Page 39: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

28

where d is a parameter provides an option to adjust degrees-of-freedom. Typically, we let

d “ p when p ă n and d “ 0 otherwise. Within the local solution set B0pλ˚, κ˚q, we search

for a subclass of scaled concave penalized least-square estimators B0,scalpλ0;λ˚, κ˚q, defined

as

B0,scalpλ0;λ˚, κ˚q “!

pβ P B0pλ˚, κ˚q : λ0pσppβq “ λ)

. (2.51)

Here, λ0 is a prefixed penalty level and independent of σ. For example, one may choose

λ0 “ Atp2nq log pu12 for universal penalty and λ0 “ An´12L1pkpq for smaller penalty,

with appropriate k and A. We derive the consistency results for noise level estimation for

different λ0 separately in the following analysis.

As discussed in Section 2.2, B0pλ˚, κ˚q is a large class of estimators that includes all

local solutions connected to the origin regardless of the specific algorithms used to compute

the solution. We here use the PLUS algorithm as an example to illustrate the computation

of the estimators in B0,scalpλ0;λ˚, κ˚q. The PLUS, indexed by x, is defined as

λpxq ‘ pβpxq”

$

&

%

a continuous path of solutions of (2.12) in R1`p

with pβp0q“ 0 and limxÑ8 λ

pxq “ 0.

(2.52)

Given a PLUS solution path, the scaled estimator can be defined as

pβscal

“ pβppxq, px “ mintx : λ0pσppβ

pxqq ě λpxqu. (2.53)

The “ ě ” in defining px in (2.53) can be changed to “ “ ” by the continuity of the PLUS

path. Under mild regularity conditions,we will prove that pβscal

P B0,scalpλ0;λ˚, κ˚q. See

next subsections for the proof. This also guarantees the non-emptiness of B0,scalpλ0;λ˚, κ˚q.

2.4.2 Performance guarantees of scaled concave PLSE at universal

penalty levels

In this subsection, we derive the consistency results for noise level estimation with

sufficiently large λ0. Since σ˚ “ y´Xβ˚2n12 is a natural target of noise level estimation,

Page 40: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

29

we aim to derive the convergence results of pσppβqσ˚ with pβ P B0,scalpλ0;λ˚, κ˚q in the

following theorem.

Theorem 2.3. Let β˚ be the true regression coefficients, pβscal

be in (2.53) and σ˚ “

y´Xβ˚2n12 be the oracle noise level estimator. Let 0 ă η ă 1 and ξ “ p1` ηqp1´ ηq.

Suppose κpρq ď κ˚ and RE22pS; η, 1q ě κ˚.

piq Let τ0 “ p1 ` ξqp1 ` ηqλ0s12RE1pS; η, 1q. When (2.16) holds with λ˚ “ λ0σ

˚p1 ´ τ0q

and δ˚ “ 1, we have pβscal

P B0,scalpλ0;λ˚, κ˚q. Moreover, for any pβ P B0,scalpλ0;λ˚, κ˚q,

max

˜

1´pσppβq

σ˚, 1´

σ˚

pσppβq

¸

ď τ0,Xpβ ´Xβ˚2

n12σ˚ď

τ0

1´ τ0. (2.54)

In particular, if we take λ0 “ Atp2nq log pu12 with A ą 1η and τ0 Ñ 0, then for all ε ą 0

Pβ˚,σp|pσppβqσ ´ 1| ą εq Ñ 0. (2.55)

piiq Let τ2˚ “ ηp1`ηqp1`ξq2λ2

0sRE1pS; η, 1q. When (2.16) holds with λ˚ “ λ0σ˚p1´τ2

˚q

and δ˚ “ 1, we have pβscal

P B0,scalpλ0;λ˚, κ˚q. Moreover, for any pβ P B0,scalpλ0;λ˚, κ˚q,

max

˜

1´pσppβq

σ˚, 1´

σ˚

pσppβq

¸

ď 3τ2˚ . (2.56)

If we take λ0 “ Atp2nq log pu12 with A ą 1η and τ2˚ ! n´12, then

n12ppσppβqσ ´ 1q Ñ N p0, 12q (2.57)

in distribution under Pβ˚,σ.

By proving pβscal

P B0,scalpλ0;λ˚, κ˚q, Theorem 2.3 guarantees the non-emptiness of

B0,scalpλ0;λ˚, κ˚q with appropriate λ˚ and λ0. Moreover, it provides the convergence

and asymptotic normality for the scaled concave estimation of noise level under only the

restricted eigenvalue conditions. In part (i), we achieve an error rate τ0 —a

psnq log p for

noise level estimation with pβ P B0,scalpλ0;λ˚, κ˚q. This matches the `1 penalized maximum

likelihood estimator in [67]. In part (ii), we provide sharper convergence rate and the

asymptotic normality results. The sharper rate τ2˚ is on the order of psnq log p, which

Page 41: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

30

essentially taking the square of the order in part (i). The asymptotic normality then follows

from the sharper rate under mild assumptions. The convergence rate in part (ii) matches

the rate of iterative algorithm formulation in Sun and Zhang [69].

Proof of Theorem 2.3. First prove (i). Denote z “ XT py ´ Xβ˚qn and hpxq “

pβpxq´ β˚. Consider penalty level λpx0q “ λ˚ “ λ0σ

˚p1 ´ τ0q for certain x0 in the PLUS

path. Since λpx0q “ λ˚ satisfies (2.16), it follows from Theorem 2.1 and the definition of τ0

that Xhpx0q2n12 ď σ˚τ0p1´ τ0q ď σ˚τ0. Then we have

λ0pσppβpx0qq “ λ0y ´Xpβ

px0q2n

12

ě λ0

ˇ

ˇ

ˇσ˚ ´ Xhpx0q2n

12ˇ

ˇ

ˇě λ0σ

˚p1´ τ0q “ λpx0q. (2.58)

By the definition of px, px ď x0. Since any penalty level λpxq ě λ˚ is a local solution

of (2.12) in the PLUS path, λpxq is a non-increasing function of x for λpxq ě λ˚. Thus,

λppxq ě λpx0q “ λ˚. It follows that pβscal

P B0,scalpλ0;λ˚, κ˚q.

Moreover, for any pβ P B0,scalpλ0;λ˚, κ˚q, with penalty λ ě λ˚, we have

pσppβq “ λλ0 ě λ˚λ0 “ σ˚p1´ τ0q. (2.59)

Furthermore, by Theorem 2.1 we have

ˇ

ˇ

ˇy ´Xpβ2n

12 ´ σ˚ˇ

ˇ

ˇď Xppβ ´ β˚q2n

12 ď τ0pσppβq. (2.60)

Thus,

pσppβq

σ˚“y ´Xpβ2

n12σ˚ďτ0pσppβq ` σ

˚

σ˚“ 1` τ0

pσppβq

σ˚, (2.61)

This implies pσppβq ď σ˚p1 ´ τ0q. Combing with (2.59), the first part of (2.54) holds. In

addition,

Xpβ ´Xβ˚2n12 ď τ0pσppβq ď σ˚τ0p1´ τ0q. (2.62)

Page 42: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

31

The second part of (2.54) holds. To prove (3.2), since for certain A,

Pβ,σ”

z8 ď Aσtp2nq log pu12ı

Ñ 1,

we have (3.2) follows from (2.54).

Now we prove (ii). By the KKT condition,

´pz8 ` λqh1 ď pXhqT!

y ´Xβ˚ ` y ´Xpβ)

n

ď pσ˚q2 ´ y ´Xpβ22n

ď pXhqT t2py ´Xβ˚q ´Xhun ď 2z8h1. (2.63)

We use above inequalities as lower and upper bounds for pσ˚q2 ´ y ´Xpβ22n.

Consider λpx1q “ λ˚ “ λ0σ˚p1´ τ2

˚q in the PLUS path. Since λpx1q “ λ˚ satisfies (2.16),

it follows from Theorem 2.1 that hpx1q1 ď p1` ξq2p1` ηqλpx1qsRE1pS; η, 1q. Combining

with z8 ă λ0ησ˚p1´ τ2

˚q, we have

λ20 pσ2ppβ

px1qq “ λ2

0 y ´Xpβpx1q22n

ě λ20

´

tσ˚u2 ´ 2z8hpx1q1

¯

ě λ20tσ

˚u2`

1´ 2τ2˚p1´ τ

2˚q

ě λ20tσ

˚u2p1´ τ2˚q

2 “ pλpx1qq2. (2.64)

The last inequality holds since τ2˚ ď 1. As in part (i), we find λppxq ě λpx1q “ λ˚ and

pβscal

P B0,scalpλ0;λ˚, κ˚q.

Similarly, for any pβ P B0,scalpλ0;λ˚, κ˚q with penalty λ ě λ˚ “ λ0σ˚p1 ´ τ2

˚q, we have

pσppβq “ λλ0 ě λ˚λ0 “ σ˚p1 ´ τ2˚q. On the other hand, recall that z8 ă λ0ησ

˚p1 ´ τ2˚q

and pβ ´ β˚1 ď p1` ξq2p1` ηqλ0pσppβqsRE1pS; η, 1q, we have

pσppβq2

tσ˚u2“

y ´Xpβ22ntσ˚u2

ď

tσ˚u2 `´

z8 ` λ0pσppβq¯

pβ ´ β˚1

tσ˚u2

ďtσ˚u2 ` τ2

˚p1´ τ2˚qpσp

pβqσ˚ ` p1ηqτ2˚pσ

2ppβq

tσ˚u2. (2.65)

Page 43: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

32

Solving above equation w.r.t pσppβqσ˚, we obtain pσppβqσ˚ ď p1 ` τ2˚qp1 ´ τ2

˚q. Thus 1 ´

σ˚pσppβq ď 3τ2˚ . This proves (3.3). Given (3.3), the proof of (2.57) follows the proof of

Theorem 2 (ii) in Sun and Zhang [69]. ˝

2.4.3 Performance bounds of scaled concave PLSE at smaller penalty

levels

In this subsection, we derive the consistency results for noise level estimation with smaller

λ0.

Theorem 2.4. Let β˚, pβscal

and σ˚ be as in Theorem 2.3 and ĎRE˚,2pS; ¨, ¨q be as in (2.39).

Let m be a positive integer, η “ η0`η1 with positive η0, η1 and ξ2 “ r2`p1´η0qpmsq12sp1´

ηq. Define τ1 “ p1`ηqλ0ξ2ps_mq12ĎRE˚,2pS; η,mq. Suppose κpρq ď κ˚ and RE2

2pS; η, 1q ě

κ˚. When (2.32) holds with λ˚ “ λ0σ˚p1´τ1q, we have pβ

scalP B0,scalpλ0;λ˚, κ˚q. Moreover,

for any pβ P B0,scalpλ0;λ˚, κ˚q,

max

˜

1´pσppβq

σ˚, 1´

σ˚

pσppβq

¸

ď τ1,Xpβ ´Xβ˚2

n12σ˚ď

τ1

1´ τ1. (2.66)

If we take λ0 “ An´12L1pkpq with k in (2.37), A ą 1η0 and τ1 Ñ 0, then for all ε ą 0

Pβ˚,σp|pσppβqσ ´ 1| ą εq Ñ 0. (2.67)

Similar as Theorem 2.3, Theorem 2.4 first guarantees the non-emptiness of

B0,scalpλ0;λ˚, κ˚q but with smaller λ0. Furthermore, it provides the convergence results

for noise level estimation at smaller penalties with nearly identical condition as in Theorem

2.3. Compared with existing literatures, Theorem 2.4 could be viewed as a generalization

of scaled Lasso with smaller penalties in Sun and Zhang [70].

Proof of Theorem 2.4. Consider penalty level λpx1q “ λ˚ “ λ0σ˚p1 ´ τ1q for certain

x1 ă 8 in the PLUS path. Since (2.32) holds for λpx1q, by Theorem 2.2, we have

Xhpx1q2n12 ď

p1` ηqξ1λps˚q12

ĎRE2pS; η0, η1,mqďp1` ηqξ2λ

px1qps_mq12

ĎRE˚,2pS; η,mqď σ˚τ1.

Page 44: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

33

Similar as (2.58),

λ0pσppβpx1qq “ λ0y ´Xpβ

px1q2n

12 ě λ0σ˚p1´ τ1q “ λpx1q.

As in the proof of Theorem 2.3, we find λppxq ě λpx1q and pβscal

P B0,scalpλ0;λ˚, κ˚q.

Moreover, (2.66) and (2.67) can be proved in the same way as Theorem 2.3. ˝

2.5 Simulation Study

In this section, we report the noise level estimation results of the scaled concave PLSE

and compare with several competing methods in a comprehensive simulation study. The

experimental settings follow Reid et al. [61] and are described with our notation as below.

The simulation aims to estimate noise level σ in a variety of settings. All simulations

are run at a sample size of n “ 100, the number of predictors is considered in four different

values: p “ 100, 200, 500, 1000. Elements of the design matrix X are generated randomly

as Xij „ N p0, 1q. Correlation between columns of X is set to be ρ. The true parameter

β˚ is generated as follows: the number of nonzero elements is set to be pnz “ rnαs, i.e., α

controls the sparsity of β˚: the higher the α; the less sparse of β˚. It ranges between 0 and

1. The indices corresponding to nonzero β˚ are selected randomly. Their value are set to

be random samples from a Laplacep0, 1q distribution. The elements of the resulting β˚ is

scaled such that the signal-to-noise ratio, defined as tβ˚uTΣβ˚L

σ2 is some predetermined

value, snr. Simulations were run over a grid of values for each of the parameters described

above. In particular,

• ρ “ 0, 0.2, 0.4, 0.6, 0.8

• α “ 0.1, 0.3, 0.5, 0.7, 0.9

• snr “ 0.5, 1, 2, 5, 10, 20.

We simulate B “ 200 independent datasets for each set of parameters. The competing

methods considered include:

Page 45: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

34

• The oracle estimator given the true coefficients β˚,

pσ2o “

1

n

nÿ

i“1

pyi ´XTi β

˚q2.

• The cross-validation based Lasso, denoted as CV L,

pσ2CV L “

1

n´ pspλCV L

nÿ

i“1

pyi ´XTipβpλCV L

q,

where pλCV L is selected by 10-fold cross validation and pspλCV L

“ pβpλCV L

0.

• The cross-validation based SCAD, denoted as CV SCAD,

pσ2SCAD “

1

n´ pspλSCAD

nÿ

i“1

pyi ´XTipβpλSCAD

q2,

where pλSCAD is selected by 10-fold cross validation and pspλSCAD

“ pβpλSCAD

0.

• Scaled Lasso in Sun and Zhang [69], with λ0 “a

p2nq log p, denoted as SZ L.

• Scaled Lasso with smaller smaller penalty level, λ0 “ p2nq12L1pkpq with k in (2.37),

denoted as SZ L2.

• Scaled MCP in (2.53) with universal penalty λ0 “a

p2nq log p, denoted as SZ MCP .

• Scaled MCP in (2.53) with smaller penalty λ0 “ p2nq12L1pkpq with k in (2.37),

denoted as SZ MCP2.

• Scaled MCP in (2.53), with adaptive penalty and denoted as SZ MCP3, is defined

as follows: We generate an error vector ε1 „ Np0, Inq and compute z “XTε1n. We

then order |z| and pick the k1 “ rks largest elements of |z|, denoted as |z|1, ...|z|k1 ,

with k in (2.37). Then, let λ1 “ n´12L1pk1pq. Finally, define Λ0 as

Λ0 “ λ1 `! 1

k1

k1ÿ

j“1

p|z|j ´ λ1q2`

)12

We repeat this procedure for 500 times, and take λ0 equal to the median value of all

the computed Λ0s.

Page 46: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

35

The true noise level/standard deviation is set to be σ “ 1 in all simulations. For the

concave penalties, we use default concavity γ in corresponding R packages. Specifically, we

use R package glmnet for the Lasso estimates, package ncvreg for SCAD estimates and

R package plus for MCP estimates.

2.5.1 No signal case: β˚ “ 0

We first consider the case where α “ ´8, forcing β˚ “ 0. It is obvious that snr is irrelevant

here, since there is no signal.

Methods p=100 p=200 p=500 p=1000

Oracle 0.0034 -0.0059 0.0019 -0.0064CV L -0.0256 -0.0215 -0.0340 -0.0510CV SCAD -0.0163 -0.0187 -0.0304 -0.0481SZ L 0.0004 -0.0141 -0.0093 -0.0078SZ L2 -0.0430 -0.0443 -0.0509 -0.0519SZ MCP -0.0018 -0.0181 -0.0098 -0.0103SZ MCP2 -0.0669 -0.0638 -0.0913 -0.0816SZ MCP3 -0.0344 -0.0513 -0.0913 -0.112

Table 2.1: Median bias of standard deviation estimates. No signal, σ “ 1, ρ “ 0, samplesize n “ 100. Minimum error besides the oracle is in bold for each analysis.

Table 1 shows the median bias of different standard deviation estimates when no signal

exists, with σ “ 1, ρ “ 0, n “ 100. Besides the oracle estimator, it is clear that the scaled

estimators with universal penalty SZ L and SZ MCP perform best in each analysis. All

other estimators are slightly downward biased. One possible reason is that the estimators

with larger penalty, e.g., SZ L or SZ MCP , tend to select fewer variables. When there

is no signal, choosing fewer variables may lead to better noise level estimation accuracy.

Comparatively, the scaled estimators with smaller penalty or the cross-validation based

estimators tend to include more variables. This may lead to the underestimation of noise

level.

2.5.2 Effect of correlation: ranging over different ρ

Now we consider the setting where correlation ρ of the design matrix ranges from 0 to

0.8. Figure 1 plots the median standard deviation estimates over different ρ, with fixed

Page 47: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

36

0.0 0.2 0.4 0.6 0.8

0.9

1.0

1.1

1.2

1.3

correlation

1 1 1 1 12 2 22 2

3 3 3

33

44

4 4 45 5 5

5 5

6 6

6 6

6

7 77 7

7

0.0 0.2 0.4 0.6 0.8

0.9

1.0

1.1

1.2

1.3

correlation

1 11 1 1

2 2 22 2

33

3

3

34

4

44

4

55 5 5

5

6 6

6 66

77 7 7

7

0.0 0.2 0.4 0.6 0.8

0.9

1.0

1.1

1.2

1.3

correlation

11

11 1

22 2 2 2

33

33

34 44

4

4

5

55

5

5

6

6 66

6

77

7 7

7

0.0 0.2 0.4 0.6 0.80.

91.

01.

11.

21.

3

correlation

11

11

12

22

2

2

33

33

3

4 4 4 4

4

55

55

56 6

6 66

77

77 7

Figure 2.1: Median standard deviation estimates over different levels of predictorcorrelation. σ “ 1, α “ 0.5, snr “ 1, sample size n “ 100, predictors p “ 100, 200, 500, 1000moving from left to right along rows. Plot number refer to CV L(1), CV SCAD(2),SZ L(3), SZ L2(4), SZ MCP (5), SZ MCP2(6), SZ MCP3(7).

α “ 0.5, snr “ 1. One clear trend is that the correlation between predictors becomes

the rescue of many variance estimators. This observation agrees with the results in Reid

et al. [61]. Among all estimators, scaled MCP with adaptive penalty SZ MCP3 performs

consistently well, even better than the cross-validation based methods in a large range of

correlations. This is because that SZ MCP3 is data dependent and count the correlation

between predictors. Comparatively, when correlation ρ goes large, even the scaled MCP with

smaller penalty SZ MCP2 chooses too large penalty, which then degrades its performance.

2.5.3 Effect of signal-to-noise ratio: ranging over different snr

We then consider the setting where signal-to-noise ratio ranges from 0.5 to 20. Figure 2 plots

the median standard deviation estimates over different levels of snr, with fixed α “ 0.5,

ρ “ 0. Similar with previous subsection, the performance of SZ MCP3 remains in the

Page 48: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

37

0 5 10 15 20

1.0

1.2

1.4

1.6

1.8

2.0

2.2

signal−to−noise ratio

1 1 1 1 1 12 2 2 2 2 2

33

33

3

3

4 4 44 4 45 5 5 5

55

6 6 6 6 6 67 7 7 7 7 7

0 5 10 15 20

1.0

1.2

1.4

1.6

1.8

2.0

2.2

signal−to−noise ratio

11 1 1 1 122 2 2 2 2

3 33

3

3

3

4 4 4 4 44

5 5 55 5 5

6 6 6 6 6 67 7 7 7 7 7

0 5 10 15 20

1.0

1.2

1.4

1.6

1.8

2.0

2.2

signal−to−noise ratio

1 1 1 11

12 2 2 2 2 2

33

3

3

33

4 44 4

4 4

55 5 5

5 5

6 6 6 6 6 67 7 7 7 7 7

0 5 10 15 201.

01.

21.

41.

61.

82.

02.

2

signal−to−noise ratio

1 1 1 1 1 12

2 2 2 2 2

33

3

3

3

3

4 44

44

4

5 5 5 55 5

6 6 6 6 6 67 7 7 7 7 7

Figure 2.2: Median standard deviation estimates over different levels of signal-to-noise level.σ “ 1, α “ 0.5, ρ “ 0, sample size n “ 100, predictors p “ 100, 200, 500, 1000 moving fromleft to right along rows. Plot number refer to CV L(1), CV SCAD(2), SZ L(3), SZ L2(4),SZ MCP (5), SZ MCP2(6), SZ MCP3(7).

top tier with CV L and CV SCAD in a large range of snr. Besides, the scaled MCP

with smaller penalty (SZ MCP2) also achieves very competitive estimation accuracy. On

the contrary, the performance of two scaled Lasso estimators (SZ L and SZ L2) degraded

significantly as the snr increase. We believe that the reason lies on the intrinsic bias of the

Lasso. Indeed, with fixed sparsity level, the higher the snr, the larger the per-element signal

strength. The impact of the bias of the Lasso becomes increasingly severe as individual

signal size goes large.

2.5.4 Effect of sparsity: ranging over different α

We finally consider the setting where sparsity level α ranges from 0.1 to 0.9. Figure 1 plots

the median standard deviation estimates over different α, with fixed ρ “ 0, snr “ 1. It is

clear that each of the estimators shows an upward bias trend. This trend appears because

Page 49: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

38

0.2 0.4 0.6 0.8

0.9

1.0

1.1

1.2

1.3

1.4

sparsity

1 1 11

1

2 22

2

2

3

3

3

3

3

44

4

4

4

55

5

5

5

6 66

6

6

7 77

7

7

0.2 0.4 0.6 0.8

0.9

1.0

1.1

1.2

1.3

1.4

sparsity

11

1 1

1

2 22

2

2

33

3

3

3

44

4

4

4

55

5

5

5

66

6

6

6

77

7

7

7

0.2 0.4 0.6 0.8

0.9

1.0

1.1

1.2

1.3

1.4

sparsity

11 1

1

1

2 2

2

2

2

3

3

33

3

44

4

4

4

5

5

5

5

5

6 6

6

6

6

7 7

7

7

7

0.2 0.4 0.6 0.80.

91.

01.

11.

21.

31.

4

sparsity

1 1

1

1

1

2 22

2

2

3

3

3

33

4 4

4

4

4

5

5

5

5

5

66

6

6

6

77

7

7

7

Figure 2.3: Median standard deviation estimates over different levels of sparsity. σ “ 1,snr “ 1, ρ “ 0, sample size n “ 100, predictors p “ 100, 200, 500, 1000 moving from leftto right along rows. Plot number refer to CV L(1), CV SCAD(2), SZ L(3), SZ L2(4),SZ MCP (5), SZ MCP2(6), SZ MCP3(7).

no estimator successfully select all important variables when more and more variables come

into the model. However, SZ MCP2, SZ MCP3 along with CV L and CV SCAD are

four estimators that perform stably over a large range of sparsities. Indeed, while the

superiority of cross-validation based methods has been demonstrated by Reid et al. [61],

SZ MCP2 and SZ MCP3 also proved their robustness toward different sparsities in our

experiments.

After all, we conclude that the scaled MCP with adaptive penalty SZ MCP3 performs

consistently well in most settings, regardless of sparsity, signal-to-noise ratio or the design

correlations. The performance of scaled MCP with smaller penalty SZ MCP2 is also

notable besides the case where the design matrix is highly correlated. Comparatively,

SZ MCP is not very competitive in several settings. This confirms again that a universal

penalty for scaled estimators may be too large. More intuitions behind the choice of λ0 will

be discussed in section 2.6.

Page 50: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

39

2.6 Discussion

In this chapter, we developed a new theory of concave penalized least-square estimator

and its scaled version under much weakened conditions. We prove that the concave PLSE

matches the oracle properties of prediction and coefficients estimation of the Lasso based

only on the RE-type conditions, one of the mildest conditions on the design matrix.

Moreover, to achieve selection consistency, our theorem does not require any additional

conditions for proper concave penalties such as the SCAD penalty and MCP. Furthermore,

the scaled version of the concave PLSE provides consistency and asymptotic normality for

noise level estimation. A comprehensive simulation study of variance estimation under

different levels of sparsity, signal-to-noise ratio and design correlation demonstrated the

superior performance of the scaled concave PLSE.

0 10 20 30 40

0.20

0.25

0.30

0.35

0.40

0.45

0.50

k

λ 0

1 1 1 1 1 1

k2 k1

2

22

22 2

3

3

33

33

4

44

44 4

5

5

55

5 5

Figure 2.4: Five λ0s as functions k, n=100, p=1000. Line numbers refer to (1) λ0pkq “tp2nq logppqu12, (2) λ0pkq “ tp2nq logppkqu12, (3) λ0pkq “ p2nq

12L1pkpq, (4) Adaptiveλ0 described in section 2.5 with various k, assuming that the correlation between columnsof X is 0. (5) Same as (4) except assuming that the correlation between columns of X is0.8. The k1 is the solution to (2.37), k2 is the solution to 2k “ L4

1pkpq ` 2L21pkpq.

In the simulation study, we considered three different λ0s to compute the scaled MCP.

All three λ0s are independent with the true sparsity level s “ β˚0. On the other

hand, the theoretical choice of λ0 for universal and smaller penalties are respectively

tp2nq logppsqu12 and p2nq12L1pspq. One in fact need to find some replacements, named

Page 51: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

40

k, for s. We plot different λ0s as functions of k in Figure 4 to find some more intuitions for

the choice of λ0.

It is clear from Figure 4 that the constant λ0 “ tp2nq logppqu12 (line 1) over-estimates

λ0 “ tp2nq logppkqu12 (line 2) in a significant amount. This partially explains the

impaired performance of SZ MCP1 and SZ L1 in the simulations. When the columns of

design matrix are uncorrelated, the adaptive λ0 (line 4) is quite close to the theoretical chose

of universal and smaller λ0 (line 2 and line 3) in a large range of k. On the other hand, when

the columns of design matrix are highly correlated, the adaptive λ0 (line 5) is clearly way

below line 2 and line 4. This explains the superior performance of SZ MCP3 compared to

SZ MCP2 for highly correlated design. Moreover, Sun and Zhang [70] proposed to estimate

k by solving C0k “ L41pkpq ` 2L2

1pkpq with C0 being certain constant. We take C0 “ 1

in the simulations by the suggestion of Sun and Zhang [70]. Different C0, (e.g. C0 “ 2,

corresponds to k2 in Figure 4) was also tried, but no dramatic change on the accuracy of

noise level estimation.

Page 52: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

41

Chapter 3

Penalized least-square estimation with noisy and missing

data

3.1 Introduction

In this chapter, we consider the high-dimensional linear model where the design matrix

subject to disturbance. Two types of disturbance will be discussed in this chapter:

(i) Covariates with noise: We observe Zij “ Xij `Wij , where W i “ pWi1, ...,WipqT is a

random vector with mean 0 and known covariance matrix Σw.

(ii) Covariates with missingness: Let ηij “ 1 if Xij is observed, ηij “ 0 otherwise. Define

Zij “ Xijηij , then the random matrix Z P Rnˆp is observed. Denote the probability

that Xij is observed as π. We suppose that ηij follows Bernullip1 ´ πq distribution

independently for i “ 1, ..., n, j “ 1, ..., p.

We assume the target coefficients vector is s-sparse (i.e., β˚0 ď s) and allow the

number of coefficients p to be much larger than the sample size n. When the design matrices

are fully observed, i.e., π “ 0 and Σw “ 0, the problem has been addressed using penalized

least square estimation (PLSE) in chapter 2. The PLSE for fully observed design, including

the Lasso [71], SCAD [21] and MCP [84], can all be viewed as the form

pβ P arg minβ

#

1

2βTΣβ ´ βTz `

pÿ

j“1

ρp|βj |;λq

+

(3.1)

with Σ “ XTXn, z “ XTyn and ρp¨;λq being the penalty function indexed by penalty

level λ.

When the designs are subject to disturbance, the program (3.1) cannot be implemented

due to unobtainable Σ and z. On the other hand, observe that Σ and z serve as the natural

Page 53: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

42

replacements of their unobserved population counterparts Σx “ EXTXn and Σxβ˚, Loh

and Wainwright [44] proposed to form other estimates, say pΓ and pγ, of Σx and Σxβ˚. The

program (3.1) then becomes

pβ P arg minβ

#

1

2βT pΓβ ´ βTpγ `

pÿ

j“1

ρp|βj |;λq

+

. (3.2)

With different types of disturbance, ppΓ, pγq may take different forms. For example, when

covariates are subject to noise, ppΓ, pγq can be taken as

ppΓ, pγq “ ppΓnoi, pγnoiq “

ˆ

1

nZTZ ´Σw,

1

nZTy

˙

. (3.3)

When covariates are subject to missingness, ppΓ, pγq can be taken as

ppΓ, pγq “´

pΓmis, pγmis

¯

˜

rZTrZ

n´ π diag

`

rZTrZ

n

˘

,1

nrZTy

¸

, (3.4)

where rZij “ Zijp1 ´ πq. It is easy to show that both ppΓnoi, pγnoiq and ppΓmis, pγmisq are

unbiased estimates of pΣx,Σxβ˚q. Moreover, they reduce to pΣ, zq when noise covariance

matrix Σw “ 0 or missing probability π “ 0.

One major issue of the optimization program (3.2) is its non-convexity. Indeed, even

for the convex `1 penalty, (3.2) may still be a non-convex optimization problem and has

multiple local solutions. This is because that neither pΓnoi nor pΓmis is positive semi-definite.

Substantial effort has been made to overcome this technique issue. Loh and Wainwright

[44] inserted a side constraint on the optimization problem and proposed the estimator,

pβ P arg minβ1ďb0

?s

#

1

2βT pΓβ ´ βTpγ `

pÿ

j“1

λ|βj |

+

. (3.5)

with b0 being a constant. Loh and Wainwright proved that both statistical and optimization

error bounds of coefficients estimation can be guaranteed with a properly chosen b0 and some

restricted eigenvalue conditions. Indeed, a properly chosen b0 is critical for their results.

The b0 cannot be too small since they require b0 ě β2, where β is the unknown true

coefficients. On the other hand, the b0 cannot be too large due to the imposed restricted

Page 54: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

43

eigenvalue conditions. Datta and Zou [14] further proposed to approximate (3.2) by a

convex objective function with the Lasso penalty and named CoCoLasso. The CoCoLasso

first aims to find a nearest semi-positive definite matrix to the non positive semi-definite

matrix pΓ,

pΓ` “ arg minKě0

K ´ pΓ8, (3.6)

and then estimate pβ with

pβ P arg minβ

#

1

2βT pΓ`β ´ β

Tpγ `

pÿ

j“1

λ|βj |

+

. (3.7)

Besides the penalized least square type estimators, we also note that the Dantzig Selector

[11] type estimator has also been proposed to deal with the disturbance-in-design problem.

Rosenbaum et al. [64, 65] proposed the matrix-uncertainty (MU) selector and its improved

version and proved their coefficients estimation property. The MU selector, denoted as our

notation, is

pβ P!

β1 P Rp : pγ ´ pΓβ8 ď µβ1 ` τ)

(3.8)

where µ ě 0, τ ě 0 are pre-specified constants. Note that the feasible set of the MU-selector

was formed by using the `1-norm of regression coefficients to bound pγ´ pΓβ8. We believe

that this may not be optimal as an `2-norm bound maybe sufficient. A detailed reasoning

may be seen in Section 4.

In this chapter, we study the penalized least square estimator (PLSE) (3.2) with general

concave penalties, including the Lasso, SCAD and MCP as special cases. We prove that the

PLSE subject to noise or missingness achieves the same scale of coefficients estimation error

as the full observed design, based only on the restricted eigenvalue condition. Compared to

Loh and Wainwright [45], we require no further side constraints or the knowledge of true

coefficients. Compared to Datta and Zou [14], we solve an exact solution of (3.2) instead of

an approximation. More importantly, our approach is not limited to the `1 penalty Lasso

and applies to general concave penalties. Furthermore, we prove that a linear combination

Page 55: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

44

of the `2 norm of coefficients and noise level is sufficient for penalty level when noise or

missingness exists. This sharpens the existing results which use an `1 norm of regression

coefficients for penalty level. Based on this, we extend the scaled PLSE to noisy and missing

designs. Since the cross-validation based technique may be misleading for missing or noisy

data, the proposed scaled solution is of great use. All our consistency results applies to the

case where the number of predictors p is much larger than the sample size n.

The rest of this chapter is organized as follows: In Section 3.2, we present the coefficients

estimation bounds of the PLSE for missing/noisy designs along with some definitions

and assumptions on the regularizers and design matrices. In Section 3.3, we discuss

the theoretical choice of penalty level. Section 3.4 extends the scaled PLSE and proves

consistency of noise level estimation. Section 3.5 contains discussion.

3.2 Theoretical Analysis of PLSE

In this section, we derive the coefficients estimation error bounds of the concave PLSE for

missing and noisy design. The class of penalty functions we studied in this chapter follows

that in Section 2.2.

3.2.1 Restricted eigenvalue conditions

In high-dimensional regression with fully observed design, the restricted eigenvalue condition

(RE, 6) can be viewed as the weakest available condition on the design matrix to guarantee

desired statistical properties. When the designs are subject to disturbance, the same type

of RE condition can also be applied. For a given (not necessarily semi-definite) matrix Γ

and η ă 1, the restricted eigenvalue (RE) can be defined as

RE22pΓ,S; ηq “ inf

"

uTΓu

u22: p1´ ηquSc1 ď p1` ηquS1

*

. (3.9)

The RE condition refers to the property that RE2pΓ,S; ηq is bounded away from certain

non-negative constant. Similarly, we define the compatibility coefficient, which can be

Page 56: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

45

viewed as the `1-RE [77],

RE21pΓ,S; ηq “ inf

"

uTΓu|S|uS

21

: p1´ ηquSc1 ď p1` ηquS1

*

. (3.10)

We note that RE2pΓ,S; ηq is aimed for `2 coefficients estimation error, while RE21pΓ,S; ηq

is for `1 estimation error. When the designs are fully observed, the RE1 and RE2 are

guaranteed to be non-negative due to the positive semi-definite Γ “ Σ “ XTXn. When

noise or missingness come into the design, neither Γ “ pΓnoi nor pΓmis is positive semi-definite.

Thus the RE1 and RE2 defined in (3.9) and (3.10) may be negative.

3.2.2 Main results

For a given penalty ρp¨; ¨q, the Karush-Kuhn-Tucker (KKT) type conditions for λ‘pβ P R1`p

to be a critical point of program (3.2) is

$

&

%

pγj ´ ppΓpβqj “ 9ρppβj ;λq, pβj ‰ 0

|ppγj ´ ppΓpβqj | ď λ, pβj “ 0.

(3.11)

For fully observed design and convex penalty, the local KKT condition is a necessary and

sufficient condition for a global minimizer. For general cases where missing or noisy data

exist, solution of (3.11) include all local minimizers of (3.1).

Similar as Chapter 2, we consider the class of all penalty functions ρp¨;λq with no smaller

penalty level than λ˚ and no greater concavity than κ˚.

Ppλ˚, κ˚q “ The set of all penalties satisfying (i) to (iv) in Chapter 2

with λ ě λ˚ and κpρλq ď κ˚

Then we define the local solution set that we consider here. Let

Bpλ˚, κ˚q “ The set of all solutions of (2.12) for some ρp¨;λq PPpλ˚, κ˚q.

be the class of all local solutions for penalty ρp¨;λq P Bpλ˚, κ˚q. The local solution set

Page 57: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

46

we considered here is the subclass of Bpλ˚, κ˚q that connected to the origin through a

continuous path. Formally, denote

B0pλ˚, κ˚q “ The set of all vectors connected to 0 in Bpλ˚, κ˚q.

The penalty level we considered here is no smaller than a certain λ˚ satisfying

pγ ´ pΓβ˚8 ă ηλ˚ (3.12)

for certain constant 0 ă η ă 1. We will provide specifications of λ˚ in different disturbance

scenarios, e.g. missing or noisy data, in Section (3.3).

Theorem 3.1. Let pβ be a local minimizer in B0pλ˚, κ˚q with a penalty ρp¨;λq PPpλ˚, κ˚q.

Suppose pΓ satisfies RE22ppΓ,S; ηq ě κ˚. Let ξ “ p1 ` ηqp1 ´ ηq, if ppΓ, pγq satisfies (3.12),

then

pβ ´ β˚q ď

$

&

%

4ξλ|S|p1´ ηqRE2

1pS; ηq, q “ 1

2ξλ|S|12

RE1pS; ηqRE2pS; ηq, q “ 2.

(3.13)

In terms of `1 and `2 coefficients estimation error bounds, (3.13) can be viewed as a

generalization of (2.20) as noisy or missing data is allowed in the design matrix. We can

see that (3.13) obtains the same form of coefficients estimation error bounds compared with

(2.20) with no additional assumption. Together with Theorem 2.1, we provide a unified

treatment of penalized least squares methods, including the `1 and concave penalties for

fully observed or missing/noisy data, under the RE condition on the design matrix and

natural conditions on the penalty.

3.3 Theoretical penalty levels for missing/noisy data

In Section 2, the universal penalty level λ˚ can be viewed as a probabilistic upper bound of

pγ ´ pΓβ8. When the design is fully observed, a well-known upper bound for pγ ´ pΓβ8 “

Page 58: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

47

XTεn8 would be Aσa

p2nq log p with A being certain constant. In this section, we

provide tight upper bound for pγ ´ pΓβ8 under missing or noisy data scenarios.

Theorem 3.2. Suppose that each row of X are iid zero-mean sub-Gaussian random vectors

with parameters pΣx, σ2xq, let σ be the noise level. Suppose n Á log p.

(i) Additive noise: Let Z “ X `W be the observed design with noise matrix W .

Suppose each row of W are iid zero mean sub-Gaussian random vectors with parameters

pΣw, σ2wq and independent with X. Let σ2

z “ σ2w ` σ

2x,

pµ1, µ2q “

˜

A

c

log p

nσzσw, A

c

log p

nσz

¸

where A is certain constant. Then

pγnoi ´ pΓnoiβ˚8 ď µ1β

˚2 ` µ2σ

with probability at least 1´ c1 expp´c2 log pq.

(ii) Missing data: Let Z “Xη be the missing data design matrix with missing probability

π. Let

pµ1, µ2q “

˜

A

c

log p

n

` σ2x

1´ π`

σ2x

p1´ πq2˘

, A

c

log p

n

σx1´ π

¸

where A is certain constant. Then

pγmis ´ pΓmisβ˚8 ď µ1β

˚2 ` µ2σ

with probability at least 1´ c1 expp´c2 log pq.

Remark 3.1. When the missing probabilities of each column of design matrix are different,

Theorem 3.2 can be extended by letting π “ πmax “ max1ďjďp πj, with πj being the missing

probability in column j.

Theorem 3.2 provides a guidance on choosing the penalty level when noise or missing

data appears. It proves that a linear combination of the `2 norm of coefficients and noise

level is large enough to bound pγ ´ pΓβ˚8 under both scenarios. Compared with fully

Page 59: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

48

observed design, an extra `2-norm of coefficients is required for penalty to compensate the

missingness or noise. Combing Theorem 3.2 and Theorem 3.1, we see that the coefficients

estimation bounds is on the order of

pβ ´ β˚2 —

c

s log p

npβ˚2 ` Cσq

pβ ´ β˚1 — s

c

log p

npβ˚2 ` Cσq (3.14)

for certain constant C.

The matrix-uncertainty (MU) selector in (3.8) [64, 65] is a Dantzig selector type

estimator for high-dimensional regression with noisy or missing data. In our notation,

the feasible set of the MU selector is pγ ´ pΓβ8 ď µβ1` τ for certain constants µ and τ .

Since for β ‰ 0, β2 is strictly smaller than β1, an MU selector with modified feasible

set pγ ´ pΓβ8 ď µβ2 ` τ may achieve shaper coefficients estimation error bounds.

Proof of Theorem 3.2 We first state the following lemmas, which will be used to prove

Theorem 3.2.

Lemma 3.1. Suppose X P Rnˆp1 and Y P Rnˆp2 are composed of independent rows of

covariates Xi,˚ and Y i1,˚ respectively, i “ 1, ..., p1, i1 “ 1, ..., p2. Assume Xi,˚ and Y i1,˚

are zero-mean sub-Gaussian vectors with parameters pΣx, σ2xq and pΣy, σ

2yq respectively. If

n Á log p, then

Y TX

n´ CovpY i,Xiq8 ě ε

˙

ď 6p1p2 exp

ˆ

´cnmin

"

ε2

pσxσyq2,

ε

σxσy

(3.15)

where c is certain constant.

Proof of Lemma 3.1. The proof of Lemma 3.1 can be seen in the supplementary material

of [44].

Now we can prove Theorem 3.2 (i). First note that the observed matrix Z “ X `W

has sub-Gaussian rows with parameters σ2x`σ

2w. This is because for any unit vector u P Rp,

E“

exppλuTZi,˚q‰

“ E“

exp`

λuT pXi,˚ `W i,˚q˘‰

ď exp`

p12qλ2pσ2x ` σ

2wq˘

. (3.16)

Page 60: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

49

Moreover, Wβ˚ is also sub-Gaussian vector with parameter σ2wβ

˚22 since for any unit

vector v P Rn,

E“

exppλvTWβ˚q‰

“ E

«

exp

˜

nÿ

i“1

λvjβ˚2pβ

˚β˚2qTW i,˚

¸ff

ď

j“1

exp`

λ2v2j β

˚22˘

“ exp`

λ2β˚22˘

. (3.17)

On the other hand,

pγnoi ´ pΓnoiβ˚8 “ ZTyn´ pZTZn´Σwqβ

˚8

ď ZT εn8 ` `

ZTW n´Σw

˘

β˚8 (3.18)

Thus, by the sub-Gaussianity Z, Wβ˚, ε and Lemma 3.1, we have

P

˜

ZTεn8 ě C

c

log p

nσzσ

¸

ď c1 expp´c2 log pq, (3.19)

and

P

˜

pZTW n´Σwqβ˚8 ě C

c

log p

nσzσwβ

˚2

¸

ď c1 expp´c2 log pq. (3.20)

Combing (3.18),(3.19) and (3.20), we have

P

˜

pγnoi ´ pΓnoiβ˚8 ě C

c

log p

npσzσ ` σzσwβ

˚2q

¸

ď c1 expp´c2 log pq. (3.21)

Similarly we can prove (ii). First note that Z “ Xη has sub-Gaussian rows with

parameter σx since for any unit vector,

E“

exppλuTZi,˚| missing values q‰

“ E“

exp`

λuTXi,˚

˘‰

ď exp`

p12qλ2σ2x

˘

.

Moreover, Xβ˚ and Zβ˚ are both sub-Gaussian vectors with parameter σ2xβ

˚2 by the

same argument of (3.17). On the other hand,

pγmis ´ pΓmisβ˚8

Page 61: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

50

ď pγmis ´Σxβ˚8 ` ppΓmis ´Σxqβ

˚8

“1

1´ πZTyn´ CovpZi,Xiqβ

˚8 ` ppΓmis ´Σxqβ˚8

ď1

1´ πZTXβ˚n´ CovpZi,Xiβ

˚q8 `1

1´ πZTεn8

`ppΓmis ´Σxqβ˚8. (3.22)

By the sub-Gaussianity of Z, Xβ˚, ε and Lemma 3.1,

P

˜

1

nZTXβ˚ ´ CovpZi,˚,Xi,˚β

˚q8 ě C

c

log p

nσ2xβ

˚2

¸

ď c1 expp´c2 log pq, (3.23)

and

P

˜

ZTε

n8 ě C

c

log p

nσxσ

¸

ď c1 expp´c2 log pq. (3.24)

To control ppΓmis ´Σxqβ˚8, we define matrix M ,

Mij “ Epηiηjq “

$

&

%

p1´ πq2, i ‰ j,

1´ π, i “ j,

and covariance matrix Σz “ CovpZi,˚,Zi,˚q. Then

ppΓ´Σxqβ˚8 “

`

ZTZn´Σzq cM˘

β˚›

1

p1´ πq2ZTZβ˚n´Σzβ

˚8.

Thus

P

˜

ppΓ´Σxqβ˚8 ě C

c

log p

n

1

p1´ πq2σ2xβ

˚2

¸

ď c1 expp´c2 log pq. (3.25)

Combining (3.22), (3.23), (3.24) and (3.25), we conclude that

P

˜

pγmis ´ pΓmisβ˚8 ě C

c

log p

n

` σ2x

1´ π`

σ2x

p1´ πq2˘

β˚2 `σx

1´ πσ

¸

ď c1 expp´c2 log c2pq.

Page 62: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

51

3.4 Scaled PLSE and Variance Estimation

In this section, we extend the scaled PLSE for fully observed data in Section 2.4 to the

missing or noisy data scenario.

We start by proposing the noise level estimator for missing or noisy data. Let us first

review the noise level estimator for fully observed design in a high-dimensional setting:

σ2pλq “ y ´Xpβpλq22n “pβTpλqΣpβpλq ´ 2zT pβpλq ` y22n (3.26)

with given penalty λ. When facing noisy/missing data issue, pΣ, zq is not directly available.

A natural estimation of noise level would be obtained by replacing pΣ, zq with ppΓ, pγq in

(3.26), and define

pσ2pλq “ pβTpλqpΓpβpλq ´ 2pγT pβ

Tpλq ` y22n. (3.27)

On the other hand, if the true coefficients is given, an “oracle” may estimate noise level

with noise/missing data by

tσou2 “ βT pΓβ ´ 2pγTβ ` y22n. (3.28)

σo can be viewed as a natural estimation targets for σ. Indeed, σo is not only the best noise

level estimator one can obtain in the missing/noisy data scenarios, σo is also close enough

to the true σ. In fact, one may prove that σo converge to σ˚ “ y ´Xβ˚2?n under

mild conditions. Since σ˚ is the maximum likelihood estimator for σ when β is known and

npσ˚σq2 follows the χ2n distribution under Gaussian assumption, this guarantees that σo

goes to σ under certain condition.

Given the the noise level estimator (3.27), the scaled PLSE is defined as

ppβscal

, σq “´

pβppλq, pσppλq¯

, pλ “ maxtλ : λ ď µ1pβpλq2 ` µ2pσpλqu (3.29)

where A and B are pre-known coefficients and changes depending on the nature of model

Page 63: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

52

(noise or missing). For example, we may let

pµ1, µ2q “

˜

1

η

c

log p

nσzσw,

1

η

c

log p

nσz

¸

(3.30)

for noisy data, and let

pµ1, µ2q “

˜

1

η

c

log p

n

` σ2x

1´ π`

σ2x

p1´ πq2˘

,1

η

c

log p

n

σx1´ π

¸

(3.31)

for missing data with missing probability π, where η P p0, 1s is some constant. In light of

oracle noise level estimator σo, we define a oracle penalty level as

λo “ µ1β˚2 ` µ2σ

o. (3.32)

We will derive upper and lower bounds for pλλo´1 in following analysis. Before that, some

more definitions are needed. Let

τ1 “2ξ|S|12

RE1pS; ηqRE2pS; ηq(12

, τ2 “2|S|12 tp1` ηqp1` 3ηqu12

p1´ ηqRE1pS; ηq. (3.33)

Theorem 3.3. Let ppβscal

, pσq be the scaled penalized regression estimator in (3.29) with

pµ1, µ2q be in (3.30) for noisy design and in (3.31) for missing design. Let ξ “ p1´ηqp1`ηq,

λo be in (3.32), τ1, τ2 be in (3.33) and τ0 “ µ1τ1 ` µ2τ2 Suppose RE22pS; ηq ě κ˚ holds in

the solution path with λ ě λ˚. When pγ ´ pΓβ˚8 ă p1´ τ0qηλo, we have

max

˜

1´pλ

λo, 1´

λo

¸

ď τ0, (3.34)

Moreover,

|pσ ´ σo| ď pλτ2 ďλoτ2

1´ τ0, βscal ´ β˚ ď pλτ1 ď

λoτ1

1´ τ0. (3.35)

Remark 3.2. If one choose pµ1, µ2q as in (3.30) for noisy design and as in (3.31) for

missing design, we have pλÑ λo when psnq log pÑ 0.

Page 64: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

53

Theorem 3.3 guarantees the consistency of penalty level estimation via an oracle

inequality for the prediction error of the concave PLSE for missing/noisy design.

Proof of Theorem 3.3. We first consider penalty level λ0 “ λop1´τ0q. Since pγ´pΓβ8 ă

λ0η, it follows from Theorem 3.1 that

pβpλ0q ´ β˚2 ď λ0τ1.

where τ1 “ 2ξ|S|12 tRE1pS; ηqRE2pS; ηqu12. Thus,

pβpλ0q2 ě β˚2 ´ pβpλ0q ´ β

˚2 ě β˚2 ´ λ0τ1. (3.36)

Now denote Λpβq “ p12qβT pΓβ ´ βTpγ. By Taylor’s expansion,

Λ`

pβpλ0q˘

“ Λpβ˚q `∇Λpβ˚qT`

pβpλ0q ´ β˚˘

` p12q`

pβpλ0q ´ β˚˘T

pΓ`

pβpλ0q ´ β˚˘

ě Λpβ˚q `∇Λpβ˚qT`

pβpλ0q ´ β˚˘

` p12qκ˚pβpλ0q ´ β˚22

ě Λpβ˚q `∇Λpβ˚qT`

pβpλ0q ´ β˚˘

. (3.37)

where the first inequality holds because λ0 ě λ˚ and thus RE22pS; ηq ě κ˚. It then follows

that

Λpβ˚q ´ Λ`

pβpλ0q˘

ď ´∇Λpβ˚qT`

pβpλ0q ´ β˚˘

ď pγ ´ pΓβ˚8pβpλ0q ´ β˚1

ď ηλ0pβpλ0q ´ β˚1

ď4λ2

0|S|

pη ` η2qp1´ ηq2(

RE21pS; ηq

( ă λ20τ

22 2, (3.38)

The forth inequality holds by Theorem 3.1. Then we have

pσpλ0q “ σ˚ ´´

b

2Λ`

β˚˘

` y22n´

b

2Λ`

pβpλ0q˘

` y22n¯

ě σ˚ ´ λ0τ2. (3.39)

Combining (3.36) and (3.39), we have that

µ1pβpλ0q2 ` µ2pσpλ0q ě µ1

`

β˚2 ´ λ0τ1

˘

` µ2

`

σ˚ ´ λ0τ2

˘

Page 65: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

54

“ µ1β˚2 ` µ2σ

˚ ´ λ0

`

µ1τ1 ` µ2τ2

˘

“ λo`

1´ p1´ τ0qτ0

˘

ě λo`

1´ τ0

˘

“ λ0.

Since pλ “ maxtλ : λ ď µ1pβpλq2 ` µ2pσpλqu, we have

pλ ě λ0 “ λop1´ τ0q ě λ˚. (3.40)

Now consider penalty level pλ. Since pλ ě λ˚, it follows from Theorem 3.1 that pβscal

´

β˚2 ď pλτ1. This proves the second part of (3.35). Furthermore,

pβscal2 ď β

˚2 ` pβscal

´ β˚2 ď β˚2 ` pλτ1. (3.41)

Similarly,

Λ`

pβscal˘

“ Λpβ˚q `∇Λpβ˚qT`

pβscal

´ β˚˘

` p12q`

pβscal

´ β˚˘T

pΓ`

pβscal

´ β˚˘

ď Λ`

β˚˘

` pγ ´ pΓβ˚8pβscal

´ β˚1 ` p12q`

pβ ´ β˚˘T

pΓ`

pβ ´ β˚˘

ď Λ`

β˚˘

` ηpλpβscal

´ β˚1 ` p12q`

pβscal

´ β˚˘T

pΓ`

pβscal

´ β˚˘

(3.42)

Since pλ ě λ˚, by Theorem 3.1,

pβscal

´ β˚1 ď4ξpλ|S|p1´ ηq

RE21pS; ηq

, ppβscal

´ β˚qT pΓppβscal

´ β˚q ď

2ξpλ(2|S|

RE21pS; ηq

.

Put above into (3.42), we obtain

Λ`

pβscal˘

´ Λpβ˚q ď2pλ2|S|

p1` ηqp1` 3ηqp1´ ηq2(

RE21pS; ηq

“ pλ2τ22 2.

It then follows that

pσ “ σ˚ `´

b

2Λ`

pβscal˘

` y22n´b

2Λ`

β˚˘

` y22n¯

ď σ˚ ` pλτ2 (3.43)

Page 66: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

55

Combining (3.41) and (3.43), we have

pλ ď µ1pβscal2 ` µ2pσ ď µ1pβ

˚2 ` pλτ1q ` µ2pσ˚ ` pλτ2q

“ λo ` pλ`

µ1τ1 ` µ2τ2

˘

“ λo ` pλτ0,

This implies pλ ď λop1´ τ0q. Combining with (3.40), we proves (3.34). On the other hand,

Λ`

pβscal˘

“ Λpβ˚q `∇Λpβ˚qT`

pβscal

´ β˚˘

` p12q`

pβscal

´ β˚˘T

pΓ`

pβscal

´ β˚˘

ě Λ`

β˚˘

`∇Λpβ˚qT`

pβscal

´ β˚˘

` pκ˚2qpβscal

´ β˚2

ě Λ`

β˚˘

´ pγ ´ pΓβ˚8pβscal

´ β˚1 (3.44)

It then follows that

Λpβ˚q ´ Λ`

pβscal˘

ď pγ ´ pΓβ˚8pβscal

´ β˚1 ă pλ2τ22 2,

Then we have

pσ “ σ˚ ´´

b

2Λ`

β˚˘

` y22n´

b

2Λ`

pβscal˘

` y22n¯

ě σ˚ ´ pλτ2. (3.45)

Then the first part of (3.35) follows from (3.43) and (3.45). ˝

3.5 Conclusions

In this chapter, we extend the PLSE to noisy or missing design and proved a rate-optimal

coefficients estimation error while requiring no additional condition. Moreover, we showed

that a linear combination of the `2 norm of coefficients and noise level is large enough for

penalty level when noise or missingness exists. This sharpens the commonly understood

results where `1 norm of coefficients is required. We further extend the scaled version

of PLSE to missing and noisy data case. Since the cross-validation based technique is

extremely time consuming and maybe misleading for missing or noisy data, the proposed

scaled solution is of great use.

Page 67: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

56

Chapter 4

Group Lasso under Low-Moment Conditions on Random

Designs

4.1 Introduction

As discussed in previous chapters, the restricted eigenvalue (RE; 6) condition is among

the mildest imposed on the design matrix to guarantee desired statistical properties of

regularized estimators in high-dimensional regression. When the effects of design variables

are naturally grouped, the group Lasso has been shown to provide sharper results compared

with the Lasso [33], and such benefit of group sparsity has been proven under groupwise

RE conditions [54, 47, 52]. However, the RE condition is still somewhat abstract compared

with well understood properties of the design matrix such as sparse eigenvalue and

moment conditions. In this chapter, we prove that the groupwise RE and closely related

compatibility conditions can be guaranteed by a low moment condition for random designs

when the RE of the population Gram matrix is bounded away from zero. Our results include

the ordinary RE condition for the Lasso as a special case.

Consider a linear model

y “Xβ˚ ` ε, (4.1)

where X “ px1, ..., xpq P Rnˆp is the design matrix, y P Rn is the response vector, ε is a

noise vector with mean E ε “ 0 and covariance σ2Inˆn and β˚ P Rp is the target coefficients

vector. The group Lasso [83] can be defined as

pβpGq“ pβ

pGqpλq “ arg min

β

!

y ´Xβ222n

`

Jÿ

j“1

λjβGj2

)

, (4.2)

Page 68: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

57

where tGj , 1 ď j ď Ju forms a partition of the index set t1, ..., pu and λ “ pλ1, ..., λJq P RJ

gives the penalty level. The Lasso [71] can be viewed as a special case of the group Lasso

with group size equal to one.

To guarantee desired prediction and estimation performance of the group Lasso,

restricted eigenvalue type conditions play a critical role. For any subset S Ď t1, ..., Ju

and positive number ξ, the groupwise RE for the `2 estimation error can be defined as

RE2˚pGqpΣ;S, ξ,λq “ inf

"

uTΣu

u22: u P CpGqpξ, S,λq

*

, (4.3)

where Σ P Rpˆp, λ specifies the penalty level in (4.2), and CpGqpξ, S,λq is a cone defined by

CpGqpξ, S,λq “!

u :ÿ

jPSc

λjuGj2 ď ξÿ

jPS

λjuGj2

)

. (4.4)

For the analysis of the prediction and weighted `2,1 estimation errorřJj“1 λj

pβpGq´ β2,

the groupwise compatibility coefficient (CC) can be defined as

CC2pGqpΣ;S, ξ,λq “ inf

#

uTΣuř

jPS λ2j

jPS λjuGj2˘2 : u P CpGqpξ, S,λq

+

. (4.5)

When |Gj | “ 1 and λj does not depend on j, the groupwise RE and CC reduces to their

original versions in [6] and [74] respectively. In what follows, the groupwise RE and CC

conditions respective refer to the case where the groupwise RE and CC are bounded away

from zero.

While somewhat different versions of the groupwise RE and CC have been considered in

[54], [75] and [47], we focus on verification of the groupswise RE and CC conditions with the

quantities defined in (4.3) and (4.5) as the theory associated with these quantities better

handle unequal group sizes and penalty levels. Specifically, the RE and CC in (4.3) and

(4.5) have been used in [52] to prove the following oracle inequalities.

Let β be a vector with supppβq P GS “ YjPSGj and pβpGq

the group Lasso estimator

in (4.2). Let Σ “ XTXn be the sample Gram matrix, ξ ą 1. Then, in the event

max1ďjďJ XTGjpy´Xβq2n ď λjpξ´ 1qpξ` 1q, the prediction and estimation loss of the

Page 69: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

58

group Lasso are bounded by

XpβpGq´Xβ22n ď

C1ř

jPS λ2j

CC2pGqpΣ;S, ξ,λq

, (4.6)

Jÿ

j“1

λjpβpGq

Gj´ βGj

2 ďC2

ř

jPS λ2j

CC2pGqpΣ;S, ξ,λq

, (4.7)

and

pβpGq´ β2 ď

C3

jPS λ2j

˘12

RE2˚pGqpΣ;S, ξ,λq

, (4.8)

where C1, C2, C3 are constants depending on ξ only.

Substantial effort has been made to deduce the RE-type conditions from commonly

understood conditions such as moment and eigenvalue conditions. [6], [77], [86] and [82]

used lower and upper sparse eigenvalues to bound the RE and CC. Raskutti et al. [60]

proved RE condition for Gaussian designs under a population RE condition and a sample

size condition of the form n ě Cs log p. Rudelson and Zhou [66] further reduced the

sample size requirement from n ě Cs log p to n ě Cs logppsq and extended the results

to sub-Gaussian designs. To establish the RE condition for sub-Gaussian designs, they

proved a reduction/transfer principle showing that the RE condition can be guaranteed by

examining the restricted isometry on a certain family of low-dimensional subspaces. More

importantly, these results contribute significantly to the literature by removing the upper

eigenvalue requirement imposed in earlier analyses of regularized least squares such as the

restricted isometry property (RIP; 12, 11) for the Dantzig selector and the sparse Riesz

condition (SRC; 85, 84) for the Lasso. Lecue and Mendelson [42] further weakened the sub-

Gaussian condition of Rudelson and Zhou [66] on the design to an m-th moment condition

of order m ě C log p and a small-ball condition, while van de Geer and Muro [76] imposed

an m-th order isotropy condition with m ą 2 and tail probability conditions on the sample

second moment of the design variables with nonzero coefficients.

Compared with the rich variety of existing results on the RE-type conditions for the

Page 70: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

59

Lasso, the literature on the validity of the groupwise RE-type conditions is rather thin.

Mitra and Zhang [52] proved that the groupwise RE and CC conditions hold for sub-

Gaussian matrices. However, their results require both the upper and lower eigenvalue

conditions on the population Gram matrix.

In this chapter, we show that the groupwise RE condition can be guaranteed by a

low moment condition on the design matrices when the population RE is bounded away

from zero. Specifically, we prove that the groupwise RE condition holds under: (i) a second

moment uniform integrability assumption on the linear combinations of the design variables

and (ii) a fourth moment uniform boundedness assumption on the individual design variables

and a m-th moment assumption on the linear combinations of the within group variables

for m ą 2, given a corresponding population RE condition and the usual sample size

requirement. Moreover, the fourth and m-th moment assumption can be removed given a

slightly larger sample size. Besides, the groupwise CC condition could also be guaranteed

under same type of low moment conditions. All the results include the RE-type conditions

for the Lasso as a special case. Our results indicate that accurate statistical estimation

and prediction is feasible in high-dimensional regression with grouped variables for a broad

class of design matrices. Furthermore, it also provide a theoretical foundation for the

bootstrapped penalized least-square estimation.

The rest of this chapter is organized as follows. In Section 4.2, we review existing

restricted-eigenvalue type conditions. In Section 4.3, we prove a group transfer principle

which is the key to proving the RE and CC conditions. In Section 4.4 and 4.5 we study

the groupwise CC and RE conditions respectively. In Section 4.6 we study the convergence

of the groupwise restricted eigenvalue and compatibility coefficient. Section 4.7 provides

some additional lemmas that used to prove the CC and RE conditions. Section 4.8 contains

discussion.

Notation: Throughout this chapter, we let X be the normalized design matrix, i.e.,

xj22 “ n, and β˚ be the corresponding vector of true regression coefficients. For a

vector v “ pv1, ..., vpq, vq “ř

jp|vj |qq1q denotes the `q norm, and |v|8 “ maxj |vj |,

v0 “ #tj : vj ‰ 0u. For a matrix M , M2 “ supu2“1 Mu2 be the operator norm,

φminpMq and φmaxpMq be the minimum and maximum eigenvalues of M respectively.

Page 71: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

60

For a number x, rxs denotes the smallest integer larger than x. Moreover, we let the set

S P p1, ..., pq for the Lasso and S P p1, ..., Jq for the group Lasso.

4.2 A review of restricted eigenvalue type conditions

In this section, we briefly review the existing conditions on the designs required by the

Lasso and group Lasso to achieve the oracle properties. We let the set S P p1, ..., pq for the

Lasso and S P p1, ..., Jq for the group Lasso in this chapter.

Before the introduction of the RE condition, the restricted isometry property (RIP;

12, 11) and sparse Riesz condition (SRC; 85, 84) were imposed to analyze the Dantzig

selector and Lasso respectively. Candes and Tao [11] further improved the RIP condition

and named it the uniform uncertainty principle (UUP). The RIP and UUP conditions are

specialized for random designs with covariance matrix close to Ipˆp, while the SRC condition

works for more general random designs.

Bickel et al. [6] proposed the RE condition and under which provided oracle inequalities

for prediction and coefficients estimation for the Lasso. For a positive number ξ, the ordinary

RE coefficient REpΣ;S, ξq for the prediction and `1 coefficients estimation takes the form

RE2pΣ;S, ξq “ inf

"

uTΣu

uS22: uSc1 ď ξuS1

*

. (4.9)

The RE coefficient for the `2 estimation error takes the form

RE2˚pΣ;S, ξq “ inf

"

uTΣu

u22: uSc1 ď ξuS1

*

. (4.10)

The ordinary RE (4.10) is a special case of (4.3) when group size dj “ 1. Moreover, the

groupwise version of RE (4.9) takes the form [54, 47]

RE2pGqpΣ;S, ξ,λq “ inf

"

uTΣu

uS22: u P CpGqpξ, S,λq

*

.

We note that CC2pGqpΣ;S, ξ,λq ě RE2

pGqpΣ;S, ξ,λq ě RE2˚pGqpΣ;S, ξ,λq. Same as the

groupwise CC, RE2pGqpΣ;S, ξ,λq is also aimed at the prediction and the mixed `2,1 estimation

Page 72: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

61

errors. When dj “ 1, the CC, originally formulated by van de Geer [74], becomes

CC2pΣ;S, ξq “ inf

"

uTΣu|S|

uS21: uSc1 ď ξuS1

*

. (4.11)

van de Geer and Buhlmann [77] proved that the prediction and `1 estimation loss of the

Lasso are under control when the CC is bounded away from zero.

The restricted strong convexity (RSC) condition introduced by Negahban et al. [55],

could also be viewed as an RE-type condition with a slightly larger cone. The RSC for

prediction and `1 estimation can be written as

κ2pΣ;S, ξq “ inf

"

uTΣu

uS22: uSc1 ď ξ|S|12uS2

*

. (4.12)

The original RSC condition takes the form

uTΣu ě

$

&

%

α1u22 ´ τ1tplog pqnuu21, u2 ď 1

α2u2 ´ τ2

a

plog pqnu1, u2 ě 1,

(4.13)

for positive constants α1, α2 and nonnegative constants τ1, τ2. We prove in Section 4.8 that

the RSC condition (4.13) is equivalent to that κpΣ;A, ξq is bounded below by a positive

constant for any set A with cardinality |A| ď Cn log p. Moreover, the RSC coefficient for

`2 estimation error can be defined as

κ2˚pΣ;S, ξq “ inf

"

uTΣu

u22: uSc1 ď ξ|S|12uS2

*

. (4.14)

We further define the groupwise RSC coefficient for the prediction and `2,1 estimation as

κ2pGqpΣ;S, ξ,λq “ inf

"

uTΣu

uGS22

: u P C ˚pGqpξ, S,λq

*

, (4.15)

and the groupwise RSC coefficient for the `2 estimation as

κ2˚pGqpΣ;S, ξ,λq “ inf

"

uTΣu

u22: u P C ˚pGqpξ, S,λq

*

, (4.16)

Page 73: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

62

where the cone C ˚pGqpξ, S,λq takes the form

C ˚pGqpξ, S,λq “

u :ÿ

jPSc

λjuGj2 ď ξ´

ÿ

jPS

λ2j

¯12uGS

2(

. (4.17)

4.3 The group transfer principle

The main purpose of this chapter is to show that the groupwise RE-type conditions can be

guaranteed by a low moment condition. The key to prove this is the group transfer principle.

Specifically, to control the RE-type coefficients, it is essential to minimize uTΣu in certain

cone. The “transfer principle” refers to the property that the cone in the minimization

problem can be transfered to a smaller cone of proper cardinality. Oliveira [57] proved the

transfer principle and use which proved the RE-type conditions. In this section, we provide

a groupwise version of the transfer principle.

Before introducing the group transfer principle, we first state the strong group sparsity

condition of Huang and Zhang [33]. A coefficients vector β P Rp is strongly group-sparse if

there exists integers g and s such that

supppβq P GS “ YjPSGj , |S| ď g, |GS | ď s. (4.18)

Further, we define that

dj “ |Gj |, j “ 1, ..., J and maxjPSc

dj “ d˚Sc , maxjPS

dj “ d˚S . (4.19)

We let C and c denote generic positive constants in all our theoretical results. Their values

may vary in different expressions but they remain universal constants.

Theorem 4.1. Let d˚Sc and d˚S be as in (4.19), λ˚ “ maxjPSc λj, ξ ą 0, σ be regression

noise level. Suppose that λj “ pa

dj `?

2 log JqA0σ?n with certain A0 ą 1, j “ 1, ..., J

and (4.18) holds for certain integers g, s. For any L ą 0, define k˚ “ CLξ2`

d˚Sc`s`g log J˘

and s˚ “ max!

ř

jPA dj : A P Sc,ř

jPApd12j `

?2 log Jq2 ď k˚

)

. Then,

Page 74: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

63

(i) For any L ą 0 and u P Rp, there exist v P Rp such that

v0 ď s˚, vGS“ uGS

,ÿ

jPSc

λjvGj2 “ÿ

jPSc

λjuGj2, (4.20)

and

uTΣu ě vTΣv ´min! 1

L,

jPS λ2j

˘12

Lλ˚

)

. (4.21)

(ii) Let ε ą 0 and D be a block diagonal matrix with block corresponding to G. Ifř

jPS λ2jφmaxpDGj ,Gj q ď p1 ` εq

ř

jPS λ2j and φminpDGj ,Gj q ě 1p1 ` εq,@j P Sc, then for

any ε1 ą 0, L ą 0 and u P Rp, there exist v P Rp such that (4.20) holds and

uTΣu

D´12u22ě p1´ ε1q

vTΣv

D´12v22´ C min

! 1

L,

jPS λ2j

˘12

Lλ˚

)

, (4.22)

where C is a constant that depending on ε and ε1 only.

Remark 4.1. A more explicit form of the k˚ in the definition of cardinality s˚ is

k˚ “ 4Lξ2 max

U, V(

` p43qU,

where U “`

td˚Scu12 `

?2 log J

˘2, V “

ř

jPS

`

d12j `

?2 log J

˘2. We do not seek the best

(smallest) cardinality here. In fact, smaller s˚ can be found easily.

We note that (4.21) aims to prove the compatibility condition, while (4.22) aims for the

restricted eigenvalue condition. Indeed, to prove the CC condition, one may only need to

control the nonzero part of u, or uGS. However, the whole vector u need to be controlled

in order to prove the RE condition.

A special case of Theorem 4.1 is when dj “ 1, j “ 1, ..., J , the transfer principle (4.1)

and (4.2) can be written as

v0 ď s˚ “ rLξ2ss, vGS“ uGS

, vS1 “ uS1 and

uTΣu ě vTΣv ´ ξ2ss˚. (4.23)

Page 75: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

64

Note that (4.23) is close to the transfer principle proved by Oliveira [57], and is also the

key for van de Geer and Muro [76] to prove the ordinary CC condition under low moment

conditions.

Proof of Theorem 4.1. We apply a stratified version of Maurey’s empirical method

to groups of different sizes. Let λ˚ “ minjPSc λj , λ1˚ “ minjPS λj , J0 “ S,

Jk “

j P SczJk´1 : λj ď 2kλ˚(

, k “ 1, . . . , k˚,

with k˚ “ rlog2pλ˚λ˚qs ` 1. Let u P CpGqpξ, S,λq. Define U j P Rp by tU juGj “ uGj and

tU juGi “ 0¯, i ‰ j. Let Zpi,kq be independent vectors independent of X with

P!

Zpi,kq “ π´1j,kU j

)

“ πj,k “λjuGj2Itj P Jkuř

jPJkλjuGj2

. (4.24)

Let

Zpkq“

k´1ÿ

`“0

ÿ

jPJ`

U j `

k˚ÿ

`“k

m´1`

mÿ

i“1

Zpi,`q, k “ 1, ..., k˚. (4.25)

Also let Zpk˚`1q

“ u. As E“

Zpkqˇ

ˇZpk`1q

,Σ‰

“ Zpk`1q

,

E“`

Zpkq˘T

ΣZpkqˇ

ˇZpk`1q

,Σ‰

ď`

Zpk`1q˘T

ΣZpk`1q

`m´1k

ÿ

jPJk

π´1j,ku

TGj

ΣGj ,GjuGj

“`

Zpk`1q˘T

ΣZpk`1q

`m´1k

´

ÿ

jPJk

λjuGj2

¯´

ÿ

jPJk

uGj2λj

¯

ď`

Zpk`1q˘T

ΣZpk`1q

`

´

p2k´1λ˚q2mk

¯´1´ ÿ

jPJk

λjuGj2

¯2.

Let

mk “

RLξř

jPJkλjuGj2

´

ř

jPS λ2j

¯12max

´

λ˚,`ř

jPS λ2j

˘12¯

p2k´1λ˚q2ř

jPS λjuGj2

V

. (4.26)

Page 76: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

65

It follows that

E“`

Zp1q˘T

ΣZp1qˇˇΣ

ď uTΣu`k˚ÿ

k“1

´

p2k´1λ˚q2mk

¯´1´ ÿ

jPJk

λjuGj2

¯2

ď uTΣu`min! 1

L,

jPS λ2j

˘12

Lλ˚

)

ˆ

´

ř

jPS λjuGj2

¯2

ř

jPS λ2j

. (4.27)

Moreover, as

k˚ÿ

k“1

p2k´1λ˚q2mk ď Lξ2

`

ÿ

jPS

λ2j

˘12max

!

λ˚,`

ÿ

jPS

λ2j

˘12)

` p43qpλ˚q2,

for λj “ pa

dj `?

2 log JqA0σ?n, we have

ÿ

jPSc,pZp1qqGj

‰0

´

d12j `

a

2 log J¯2

ď 4Lξ2 max

U, pUV q12(

` p43qU

ď 4Lξ2 max

U, V(

` p43qU,

where U “`

td˚Scu12 `

?2 log J

˘2, V “

ř

jPS

`

d12j `

?2 log J

˘2. Therefore, there exists v

satisfying vGSc 0 ď s˚ with

s˚ “ max!

ÿ

jPA

dj : A P Sc,ÿ

jPA

´

d12j `

a

2 log J¯2ď cLξ2

d˚Sc ` s` g log J(

)

for certain constant c. Moreover, whenř

jPS λjuGj2 “

´

ř

jPS λ2j

¯12,

vGS“ uGS

,ÿ

jPSc

λjvGj2 “ÿ

jPSc

λjuGj2,

vTΣv ď uTΣu`min! 1

L,

jPS λ2j

˘12

Lλ˚

)

.

This proves (4.20) and (4.21).

Page 77: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

66

To prove (4.22), we first note that

pZp1qqTΣZ

p1qď uTGS

ΣGS ,GSuGS

`

k˚ÿ

k“1

max Zpi,kq22

ďÿ

jPS

uGj22 `

k˚ÿ

k“1

´

ř

jPJkλjuGj2

2k´1λ˚

¯2

ď

´

ÿ

jPS

λjuGj2

¯2ˆ

1

λ1˚

˙2

`

´

ÿ

jPSc

λjuGj2

¯2 k˚ÿ

k“1

´ 1

2k´1λ˚

¯2

ď

ˆ

1

pλ1˚q2`

4ξ2

3λ2˚

˙

ÿ

jPS

λ2j , (4.28)

Moreover, by the property of D,

pÿ

jPS

λ2j q

12 “ÿ

jPS

λjD12Gj ,Gj

D´12Gj ,Gj

uGj2

ďÿ

jPS

λjφ12maxpDGj ,Gj qD

´12Gj ,Gj

uGj2 ď p1` εq12

`

ÿ

jPS

λ2j

˘12››D

´12GS ,GS

uGS

2.

It follows that

›D´12Zp1q››

›D´12GS ,GS

Zp1qGS

2“

›D´12GS ,GS

uGS

2ě 1p1` εq12. (4.29)

By Lemma 4.2, for any 0 ă ε1 ă 1,

P!

›D´12Zp1q››

2

2ď p1´ ε1q

›D´12u›

2

2

)

ď k˚ exp

´cε21L(

,

where c is a constant depending on ε and ε1 only. This combines with (4.28) and (4.29), we

have

E

«

pZp1qqTΣZ

p1q

›D´12Z›

2

2

ˇ

ˇ

ˇ

ˇ

ˇ

X

ff

ď E

«

pZp1qqTΣZ

p1q

p1´ ε1q›

›D´12u›

2

2

ˇ

ˇ

ˇ

ˇ

ˇ

X

ff

`k˚ exp`

´ cε21L˘

p1` εq

ˆ

1

tλ1˚u2`

4ξ2

3λ2˚

˙

ÿ

jPS

λ2j .

Page 78: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

67

Combing this with (4.27), we have that

E

«

pZp1qqTΣZ

p1q

›D´12Z›

2

2

ˇ

ˇ

ˇ

ˇ

ˇ

X

ff

ďuTΣu

p1´ ε1q›

›D´12u›

2

2

`1` ε

1´ ε1min

! 1

L,

jPS λ2j

˘12

Lλ˚

)

`k˚ exp`

´ cε21L˘

p1` εq

ˆ

1

tλ1˚u2`

4ξ2

3λ2˚

˙

ÿ

jPS

λ2j . (4.30)

Note that the third term of the RHS of (4.30) maybe of smaller order of the second term,

we have

E

«

pZp1qqTΣZ

p1q

›D´12Z›

2

2

ˇ

ˇ

ˇ

ˇ

ˇ

X

ff

ďuTΣu

p1´ ε1q›

›D´12u›

2

2

` C min! 1

L,

jPS λ2j

˘12

Lλ˚

)

(4.31)

holds for some constant C. Then (4.22) follows from (4.31). ˝

4.4 Groupwise compatibility condition

In this section, we prove that the groupwise compatibility condition, which is sufficient to

control the prediction and the mixed `2,1 estimation errors, can be guaranteed under low

moment conditions on random designs using the group transfer principle.

Following Yuan and Lin [83], suppose that the design matrix X in linear model (4.1)

is normalized such that XTGjXGjn “ Idjˆdj for j “ 1, ..., J . In random designs, this

corresponds to

XGj “ĂXGj

rΣ´12

Gj ,Gj, rΣ “ĂX

TĂXn, j “ 1, . . . , J. (4.32)

where ĂX P Rnˆp is the original design matrix before normalization and composed of

independent rows of observed covariates, ĂXi,˚ “ p rXij , j “ 1, . . . , pq from the i-th data

point. Further, let

Σ “XTXn and Σ “ ErΣ (4.33)

be the normalized sample Gram matrix and original population Gram matrix respectively.

Page 79: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

68

Moreover, let rD and D be the block diagonal matrix of rΣ and Σ respectively,

rDGj ,Gj “rΣGj ,Gj , DGj ,Gj “ ΣGj ,Gj , j “ 1, . . . , J.

The main reason using normalized X is the explicitness in the choice of the corresponding

penalty level λj in (4.2). For example, by Huang and Zhang [33], λj can be taken as

pa

dj`?

2 log JqA0σ?n for certain A0 ą 1 when noise level σ is known. Finally, we define

two events related to the original sample Gram matrix. For any ε ą 0, let

ΩS “

!

φmaxp rDGj ,Gj q ď 1` ε, @j P S)

,

ΩS “

!

ÿ

jPS

λ2jφmaxp

rDGj ,Gj q ď p1` εqÿ

jPS

λ2j

)

. (4.34)

Theorem 4.2. Suppose ΣGj ,Gj “ Idjˆdj , j “ 1, ..., J . Let ε ą 0, ξ ą 0, λ, λ˚, g, s, d˚S

and d˚Sc be as in Theorem 4.1, ΩS and ΩS be in (4.34). Let κ2pGq

`

Σ;S, ¨,λ˘

be in (4.15),

CpGq`

¨, S,λ˘

and C ˚pGq

`

¨, S,λ˘

be in (4.4) and (4.17) respectively. For any L ą 0, define

k˚ “ CLξ2`

d˚Sc`s`g log J˘

and s˚ “ max!

ř

jPA dj : A P Sc,ř

jPApd12j `

?2 log Jq2 ď k˚

)

.

(i) Suppose

L ě”

εCC2pGq

`

Σ;S, p1` εqξ,λ˘

ı´1. (4.35)

Suppose that the following variable class is uniformly integrable,

inf!

pĂXi,˚uq2 : u P CpGq

´

p1` εqξ, S,λ¯

, uSc0 ď s˚,uTΣu “ 1,@i)

. (4.36)

Then, if n ě Cs˚ log

eps˚(

,

P!

CCpGqpΣ;S, ξ,λq ě p1´ 3εq12CCpGqpΣ;S,`

1` ε˘

ξ,λq)

Ñ 1´ PpΩcSq, (4.37)

where C is a constant depending on ε only.

(ii) Suppose L ě”

εκ2pGq

`

Σ;S, p1` εqξ,λ˘

ı´1and

inf!

pĂXi,˚uq2 : u P C ˚pGq

´

p1` εqξ, S,λ¯

, uSc0 ď s˚,uTΣu “ 1,@i)

(4.38)

Page 80: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

69

is uniformly integrable. Then, if n ě Cs˚ log

eps˚(

,

P!

CCpGqpΣ;S, ξ,λq ě p1´ 3εq12κpGqpΣ;S,`

1` ε˘

ξ,λq)

Ñ 1´ PpΩcSq. (4.39)

Remark 4.2. The assumption that ΣGj ,Gj “ Idjˆdj does not lose any generality since

replacing ĂXi,˚ by ĂXi,˚D´12 yields the same X in (4.32).

In Theorem 4.2, we proved that the population version groupwise CC condition implies

its sample version under a second moment assumption on the linear combinations of the

design variables with probability 1 ´ PpΩcSq. Also, under same type of second moment

assumption, the groupwise CC condition could also be guaranteed given a restricted strong

convexity condition with probability 1´ PpΩcSq.

The next question is how large PpΩcSq and PpΩc

Sq would be. The following theorem

guarantees that both PpΩcSq and PpΩc

Sq go to zero under a fourth moment assumption on

the individual design variables and a m-th moment assumption on the linear combinations

of the design variables within each group j P S, with m ą 2. Moreover, the fourth and

m-th moment assumption can be removed given a slightly larger sample size.

Theorem 4.3. Suppose ΣGj ,Gj “ Idjˆdj . Let ΩS ,ΩS ,λ, g, s and d be as in Theorem 4.1.

(i) Suppose that for any i “ 1, ...., n, j P S, EtĂXi,Gj42u and

sup

"

E|ĂXi,Gju|2q : q “

1

1´ c?s, u P Rdj

*

. (4.40)

are bounded, where c is a constant. If n ě Csd˚Splog sq,

P

ΩSu ě P

ΩSu Ñ 1. (4.41)

(ii) With no uniform boundedness condition, if n ě Csd˚Splog sq2,

P

ΩSu ě P

ΩSu Ñ 1. (4.42)

Remark 4.3. The sample size requirement n ě Csd˚Splog sq in part (i) is rather minimum.

Page 81: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

70

It usually can be dominated by the sample size requirement n ě Cs˚ log

eps˚(

in Theorem

4.2.

Combing Theorem 4.2 and Theorem 4.3, we conclude that the groupwise CC condition

can be guaranteed under (i): a second moment uniform integrability assumption on the

linear combinations of the design variables and (ii) a fourth moment uniform boundedness

assumption on the individual design variables and a m-th moment assumption on the linear

combinations of the within group design variables for m ą 2, given a population CC or

RSC condition and a sample size n ě C

s˚ log

eps˚(

_ sd˚S log s(

. This is the usual

sample size requirement as sd˚Splog sq can usually be dominated by s˚ log

eps˚(

. If given

a slightly larger sample size n ě C

s˚ log

eps˚(

_ sd˚Splog sq2(

, the fourth and m-th

moment assumption can be further removed.

The ordinary CC is a special case of the groupwise CC. This leads to the following

corollary.

Corollary 4.1. Suppose diagpΣq “ Ipˆp. Let ε ą 0, ξ ą 0, ΩS and ΩS be the events

ΩS “

!

rΣj,j ď 1` ε, @j P S)

, ΩS “

!

ÿ

jPS

rΣj,j ď p1` εqs)

. (4.43)

(i)Suppose L ě”

εCC2`

Σ;S, p1` εqξ˘

ı´1and

inf!

pĂXi,˚uq2 : uSc1 ď p1` εqξuS1, uSc0 ď s˚,uTΣu “ 1,@i

)

, (4.44)

is uniformly integrable with s˚ “ rLξ2ss . Then, if n ě Cs˚ log

eps˚(

,

P!

CCpΣ;S, ξq ě p1´ 3εq12CCpΣ;S,`

1` ε˘

ξq)

Ñ 1´ PpΩcSq. (4.45)

(ii) Suppose L ě”

εκ2`

Σ;S, p1` εqξ˘

ı´1and

inf!

pĂXi,˚uq2 : uSc1 ď p1` εqξ

?suS2, uSc0 ď s˚,uTΣu “ 1,@i

)

Page 82: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

71

is uniformly integrable with s˚ “ rLξ2ss. If n ě Cs˚ log

eps˚(

, then

P!

CCpΣ;S, ξq ě p1´ 3εq12κpΣ;S,`

1` ε˘

ξq)

Ñ 1´ PpΩcSq. (4.46)

(iii) Moreover, with no further assumption,

PpΩSq Ñ 1. (4.47)

If for any 1 ď i ď n, j P S, EĂX4

ij is bounded,

PpΩSq Ñ 1. (4.48)

Remark 4.4. One notable difference between the ordinary CC condition and the groupwise

version is that the fourth moment assumption on the individual design variable variables

can be removed given the restricted strong convexity condition on the population, under

usual sample size requirement. The proof of (4.47) is straightforward by the weak law of

large numbers. Corollary 4.1 is close to Theorem 5.3 of van de Geer and Muro [76], where

the difference is that they require a m ą 2 order isotropy condition instead of the second

moment uniform integrability condition .

Proof of Theorem 4.2. We first prove (i). Let v P CpGq,s˚pξ, S,λq and supposeř

jPS λjvGj2 “`ř

jPS λ2j

˘12. When L satisfies (4.35), take infinitum in the cone

CpGqpξ, S,λq of both sides of (4.21), we have

CC2pGqpΣ;S, ξ,λq ě inf

!

vTΣv : v P CpGq,s˚pξ, S,λq,ÿ

jPS

λjvGj2 “`

ÿ

jPS

λ2j

˘12)

´εCC2pGq

´

Σ;S, p1` εqξ,λ¯

. (4.49)

Now consider the RHS of (4.49). Let rv “ rD´12

v. Denote event

ΩSc “

"

φminp rDGj ,Gj q ě1

1` ε, @j P Sc

*

. (4.50)

Page 83: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

72

In the event Ω “ ΩS X ΩSc , we have

ÿ

jPS

λjrvGj2 ě

ř

jPS λjvGj2

maxjPS φ12maxp rDGj ,Gj q

jPS λ2j

˘12

maxjPS φ12maxp rDGj ,Gj q

ě

jPS λ2j

˘12

?1` ε

, (4.51)

and

ÿ

jPSc

λjrvGj2 ď

ř

jPSc λjvGj2

maxjPSc φ12minp

rDGj ,Gj qď

ξř

jPS λjvGj2

maxjPSc φ12minp

rDGj ,Gj qď p1` εqξ

ÿ

jPS

λjrvGj2.

(4.52)

Thus, with ξ1 “ p1` εqξ, we have

p1` εq inf!

vTΣv : v P CpGq,s˚pξ, S,λq,ÿ

jPS

λjvGj2 “`

ÿ

jPS

λ2j

˘12)

ě p1` εq inf

$

&

%

ĂXrv22n

:ÿ

jPS

λjrvGj2 ě

jPS λ2j

˘12

?1` ε

, rv P CpGq`

ξ1, S,λ˘

, rvSc0 ď s˚

,

.

-

ě inf

#

ĂXrv22n

: rv P CpGqpξ1, S,λq,ÿ

jPS

λjrvGj2 “`

ÿ

jPS

λ2j

˘12, rvSc0 ď s˚

+

“ inf

#

rvTΣrvř

jPS λ2j

ř

jPS λjrvGj2ˆĂXrv22n

: rv P CpGqpξ1, S,λq, rvSc0 ď s˚, rvTΣrv “ 1

+

ě CC2pGqpΣ;S, ξ1,λq inf

#

1

n

nÿ

i“1

pĂXi,˚rvq2 : rv P CpGqpξ1, S,λq, rvSc0 ď s˚, rvTΣrv “ 1

+

.

Moreover, by Lemma 4.4, when n ě Cs˚ logteps˚u,

P

#

inf! 1

n

nÿ

i“1

pĂXi,˚rvq2 : rv P CpGqpξ1, S,λq, rvSc0 ď s˚, rvTΣrv “ 1

)

ě 1´ ε

+

Ñ 1.

Combing with (4.49), we have that

P!

CC2pGqpΣ;S, ξ,λq ě p1´ 3εqCC2

pGqpΣ;S,`

1` ε˘

ξ,λq)

ě P!

CC2pGqpΣ;S, ξ,λq ě p

1´ ε

1` ε´ εqCC2

pGqpΣ;S,`

1` ε˘

ξq,λ)

Ñ 1.

Now we need to control PtΩScu, by Lemma 4.4,

PtΩScu ě minjPSc

P!

inf 1

n

nÿ

i“1

pĂXi,˚uq2 : supppuq Ă Gj , u2 “ 1

(

ě1

1` ε

)

Ñ 1. (4.53)

Page 84: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

73

Then PtΩScu Ñ 1 and (4.37) holds. To prove (4.39), note that in the event Ω “ ΩS XΩSc ,

pÿ

jPS

λ2j q

12 “ÿ

jPS

λj rD12

Gj ,GjrvGj2

ďÿ

jPS

λjφ12maxp

rDGj ,Gj qrvGj2 ď p1` εq12

`

ÿ

jPS

λ2j

˘12››

rvGS

2.

So that

rvGS2 ě 1

?1` ε, (4.54)

and

ÿ

jPSc

λjrvGj2 ď

ř

jPSc λjvGj2

maxjPSc φ12minp

rDGj ,Gj q

ďξř

jPS λjvGj2

maxjPSc φ12minp

rDGj ,Gj qď p1` εqξ

´

ÿ

jPS

λj

¯12rvGS

2. (4.55)

The remaining proof follows the same way. ˝

Proof of Theorem 4.3. It is easy to see that P

ΩSu ě P

ΩSu, we only need to prove

P

ΩSu “ P

φmaxprΣGj ,Gj q ď 1` ε, @j P S(

Ñ 1 for any ε ą 0. We first truncate ĂXi,Gj as

follows:

ĂXi,Gj “ĂXi,GjI

ĂXi,Gj22 ď an

(

`ĂXi,GjI

ĂXi,Gj22 ą an

(

,

where an will be defined different for proving (i) and (ii). Further, for any j P S, we let

Mi “ĂXi,GjĂXT

i,GjI

ĂXi,Gj22 ď anu ´ EĂXi,Gj

ĂXT

i,GjI

ĂXi,Gj22 ď anu.

To prove (i), let an “ nε0 for certain ε0 ą 0, we have

EMiMTi 2 “ EĂXi,Gj

ĂXT

i,GjĂXi,Gj

ĂXT

i,GjI

ĂXi,Gj22 ď an

(

2

ď maxu2“1

E|ĂXi,Gju|2ĂXi,Gj

221`

ĂXi,Gj22 ď an

˘

ď maxu“1

´

E|ĂXi,Gju|2q¯1q´

EĂXi,Gj42

¯1´1qa1´2p1´1qqn 2

Page 85: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

74

ď Cd2p1´1qqj a1´2p1´1qq

n (4.56)

holds for some constant C, where the first inequality holds due to Holder’s inequality,

the second inequality holds due to the uniform boundedness of supu,i,j E|ĂXi,Gju|2q and

EĂXi,Gj42. Let σ2

j “ Cd2p1´1qqj a

1´2p1´1qqn and hpxq “ p1 ` xq logp1 ` xq ´ x. By Bennett

inequality in Tropp [73],

P!›

rΣGj ,GjI

max1ďiďn

ĂXi,Gj22 ď an

(

´ ErΣGj ,GjI

max1ďiďn

ĂXi,Gj22 ď an

(

2ą ε

)

ď dj exp

#

´nσ2

j

a2n

h´anε

σ2j

¯

+

ď dj exp

#

´nσ2

j

a2n

anε

σ2j

log´1

e`anε

eσ2j

¯

+

ď Cdj exp

#

´nε

anlog

´1

e`

anε

ed2p1´1qqj a

1´2p1´1qqn

¯

+

“ Cdj exp

#

´ε

ε0log

´1

e`εa

2p1´1qqn

ed2p1´1qqj

¯

+

ď Cdj exp

#

´2p1´ 1

q qε

ε0log

`andj

˘

+

ď Cdγ`1j

” γ

2p1´ 1qqε

ıγ, (4.57)

where γ “ 2p1´ 1q qεε0, C is a constant and vary in each inequalities. The second inequality

holds due to hpxq ě x logp1e` xeq. Fix γ “ 1 and 1´ 1q “ c?

log s, we have that when

n ě Cplog sqsd˚S ,

ÿ

jPS

P!›

rΣGj ,GjI

max1ďiďn

ĂXi,Gj22 ď an

(

´ ErΣGj ,GjI

max1ďiďn

ĂXi,Gj22 ď an

(

2ą ε

)

ďC?

log sř

jPS d2j

εnÑ 0. (4.58)

Moreover, the Markov inequality infers

max1ďiďn,jPS

ĂXi,Gj22 ą ann

¯

ďn

`

ε0n˘2

ÿ

jPS

EĂXi,Gj42I´

ĂXi,Gj22 ą ε0n

¯

ď Clog s

ř

jPS d2j

nÑ 0. (4.59)

Page 86: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

75

Combing above two inequalities, we have (4.40) holds.

To prove (ii), we let an “ ε0n log s for ε0 ą 0, we have

EMiMTi 2 ď EĂXi,Gj

ĂXT

i,GjĂXi,Gj

ĂXT

i,GjI

ĂXi,Gj22 ď an

(

2

ď pε0n log sq12EĂXi,GjĂXT

i,GjI

ĂXi,Gj22 ď ε0n log s

(

2

ď ε0n log s.

Let σ2j “ ε0n log s, and let ε0 satisfying 2

?ε0 ` p83qε0 “ ε. Then, by Bernstein inequality

in Tropp [73],

P!›

rΣGj ,GjI

max1ďiďn

ĂXi,Gj22 ď an

(

´ ErΣGj ,GjI

max1ďiďn

ĂXi,Gj22 ď an

(

2ą ε

)

ď Cdj exp´

´nε22

ε0n log s` 2nε0εp3 log sq

¯

“ Cdj exp´ ´p2 log sq

´

?ε0 ` p43qε0

¯2

ε0 ` p23qε0`

2?ε0 ` p83qε0

˘

¯

ď Cdjs2. (4.60)

Summing above inequality over j P S, we have

ÿ

jPS

P!›

rΣGj ,GjI

max1ďiďn

ĂXi,Gj22 ď an

(

´ ErΣGj ,GjI

max1ďiďn

ĂXi,Gj22 ď an

(

2ą ε

)

ď Cÿ

jPS

djs2“ Cs´1 Ñ 0. (4.61)

Moreover, by Markov inequality, when n ě Cplog sq2sd˚S ,

maxi,jPS

ĂXi,Gj22 ą ε0n log s

¯

ďn

`

ε0n log s˘2

ÿ

jPS

EĂXi,Gj42I´

ĂXi,Gj22 ą ε0n log s

¯

ď Cplog sq2

ř

jPS d2j

nÑ 0.

Combing with (4.61), we have (4.42) holds. ˝

Page 87: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

76

4.5 Groupwise restricted eigenvalue condition

In this section, we prove that the groupwise restricted eigenvalue condition, which is

sufficient to control the `2 estimation error, can be guaranteed under low moment conditions

in the same way as the groupwise CC in Section 4.4.

Recall that the design matrix X in linear model (4.1) is normalized such that

XTGjXGjn “ Idjˆdj . Let βporigq be the original regression coefficients in the linear model

y “ĂXβporigq ` ε (4.62)

before normalization of design matrix. As βporigq is the parameter of interest, it is natural

to use the loss function pβporigq

´ βporigq, where pβporigq

is the estimates for βporigq. If we

estimate βporigq by pβporigq

“ rD´12

pβpGq

, the loss function for pβporigq

can be written as

pβporigq

´ βporigq “ rD´12

ppβpGq´ β˚q.

This leads to the restricted eigenvalue for the `2 loss:

RE2˚pGqpΣ,

rD;S, ξ,λq “ inf

#

uTΣu

rD´12

u22

: u P CpGqpξ, S,λq

+

. (4.63)

Theorem 4.4. Suppose ΣGj ,Gj “ Idjˆdj . Let ε ą 0, ξ ą 0,λ, λ˚, g, s, d˚S and d˚Sc be as

in Theorem 4.1, ΩS and ΩS be in (4.34). Let κ2˚pGq

`

Σ;S, ¨,λ˘

be in (4.16), CpGq`

¨, S,λ˘

and C ˚pGq

`

¨, S,λ˘

be in (4.4) and (4.17) respectively. For any L ą 0, define k˚ and s˚ as in

Theorem 4.2.

(i) Suppose

L ěC min

!

1,`ř

jPS λ2j

˘12λ˚

)

εRE2pGq

`

Σ;S, p1` εqξ,λ˘ . (4.64)

Suppose that the following variable class is uniformly integrable,

inf!

pĂXi,˚uq2 : u P CpGq

´

p1` εqξ, S,λ¯

, uSc0 ď s˚,uTΣu “ 1,@i)

. (4.65)

Page 88: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

77

If n ě Cs˚ log

eps˚(

, then

P!

RE˚pGqpΣ, rD;S, ξ,λq ě p1´ 4εq12RE˚pGq`

Σ;S, p1` εqξ,λ˘

)

Ñ 1´ PpΩcSq, (4.66)

where C is a constant depending on ε only.

(ii) Suppose

L ěC min

!

1,`ř

jPS λ2j

˘12λ˚

)

εκ2˚pGq

`

Σ;S, p1` εqξ,λ˘ .

Suppose that the following variable class is uniformly integrable,

inf!

pĂXi,˚uq2 : u P C ˚pGq

´

p1` εqξ, S,λ¯

, uSc0 ď s˚,uTΣu “ 1,@i)

If n ě Cs˚ log

eps˚(

, then

P!

RE˚pGqpΣ, rD;S, ξ,λq ě p1´ 4εq12κ˚pGq`

Σ;S, p1` εqξ,λ˘

)

Ñ 1´ PpΩcSq. (4.67)

(iii) Let ξ ą 1 and β be a vector with supppβq P GS “ YjPSGj. In the event ΩS Y ΩS,

if max1ďjďJ XTGjpy´Xβq2n ď λjη holds with η “ pξ´ 1qpξ` 1q, then the group Lasso

solution pβpGq

in (4.2) has

rD´12

pβpGq´ βporigq2 ď

p1` ηq`ř

jPS λ2j

˘12

RE˚pGqpΣ, rD;S, ξ,λqCCpGqpΣ;S, ξ,λq.

ďp1` εq12p1` ηq

jPS λ2j

˘12

RE2˚pGqpΣ,

rD;S, ξ,λq. (4.68)

Since PpΩcSq and PpΩc

Sq go to zero under a uniform boundedness condition by Theorem

4.3, Theorem 4.4 proved that the population groupwise RE (or RSC) condition implies the

sample RE under a low moment condition in the same way as the groupwise CC. Moreover,

the inequality (4.68) confirms that the `2 estimation error of original coefficients can be

bounded via RE2˚pGqpΣ,

rD;S, ξ,λq. The ordinary RE condition could also be viewed as a

special case of the groupwise version as in the following corollary.

Page 89: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

78

Corollary 4.2. Suppose diagpΣq “ Ipˆp. Let ε ą 0, ξ ą 0, ΩS and ΩS be the events in

(4.43).

(i) Suppose L ě C”

εRE2˚

`

Σ;S, p1` εqξ˘

ı´1and

inf!

pĂXi,˚uq2 : uSc1 ď p1` εqξuS1, uSc0 ď s˚,uTΣu “ 1,@i

)

(4.69)

is uniformly integrable with s˚ “ rLξ2ss. If n ě Cs˚ log

eps˚(

, then

P!

RE˚pΣ, rD;S, ξq ě p1´ 4εq12RE˚`

Σ;S, p1` εqξ˘

)

Ñ 1´ PpΩcSq. (4.70)

(ii) Suppose ΩS, let L ě C”

εκ2˚

`

Σ;S, p1` εqξ˘

ı´1and

inf!

pĂXi,˚uq2 : uSc1 ď p1` εqξ

?suS2, uSc0 ď s˚,uTΣu “ 1,@i

)

is uniformly integrable with s˚ “ rLξ2ss. If n ě Cs˚ log

eps˚(

, then

P!

RE˚pΣ, rD;S, ξq ě p1´ 4εq12κ˚`

Σ;S, p1` εqξ˘

)

Ñ 1´ PpΩcSq. (4.71)

Existing results of the RE-type conditions are limited to the CC or the RE in (4.9),

which could only guarantee the prediction and `1 estimation error of the Lasso. Corollary

4.2 is the first result to guarantee the RE condition for `2 estimation error of the Lasso

under low moment conditions as to our knowledge. Proving the RE condition is also way

more difficult that that of the CC, because the whole vector u need to be controlled in order

to bound the RE. While for the CC condition, one may only need to control the nonzero

part of u.

Proof of Theorem 4.4. Consider u P CpGqpξ, S,λq and supposeř

jPS λjuGj2 “

jPS λ2j

˘12. In the event Ω “ ΩS X ΩSc , when L satisfies (4.35), take infinitum in the

cone CpGqpξ, S,λq of both sides of (4.22) and let ε1 “ ε, we have

RE2˚pGqpΣ,

rD;S, ξq ě p1´ εq inf

#

vTΣv

rD´12

v22

: v P CpGq,s˚pξ, S,λq

+

´εRE2˚pGq

´

Σ;S, p1` εqξ,λ¯

. (4.72)

Page 90: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

79

Let rv “ rD´12

v. Recall (4.51) and (4.52), we have

ÿ

jPS

λjrvGj2 ě

jPS λ2j

˘12

?1` ε

,ÿ

jPSc

λjrvGj2 ď p1` εqξÿ

jPS

λjrvGj2.

Thus, with ξ1 “ p1` εqξ, we have

p1` εq inf! vTΣv

rD´12

v22

: v P CpGq,s˚pξ, S,λq)

ě p1` εq inf!

ĂXrv22nrv22

:ÿ

jPS

λjrvGj2 ě

jPS λ2j

˘12

?1` ε

, rvSc0 ď s˚, rv P CpGqpξ1, S,λq)

ě RE2˚pGqpΣ;S, ξ1,λq inf

!

ĂXrv22n : rvSc0 ď s˚, rv P CpGqpξ1, S,λq, rvTΣrv “ 1

)

.

By Lemma 4.4,

inf!

ĂXrv22n : rvSc0 ď s˚, rv P CpGqpξ1, S,λq, rvTΣrv “ 1

)

ě 1´ ε

holds with probability goes to 1. When L satisfies (4.64), it follows from (4.72) that

RE2˚,pGqpΣ,

rD;S, ξ,λq ěp1´ εq2

1` εRE2

˚pGq

`

Σ;S, ξ1,λ˘

´ εRE2˚pGq

´

Σ;S, p1` εqξ,λ¯

ě p1´ 4εqRE2˚pGq

`

Σ;S, ξ1,λ˘

.

Also, the upper bound of PpΩScq have been controlled in (4.53). The further bound in the

event Ω “ ΩS X ΩSc follows the same way.

To prove the original coefficients estimation error bound (4.68), we start with the basic

inequality. Let h “ pβpGq´ β, by Mitra and Zhang [52],

hTΣh ď p1` ηqÿ

jPS

λjhGj2 ´ p1´ ηqÿ

jPSc

λjhGj2.

Setting h “ λtu and optimizing over t we have

rD´12

pβpGq´ βporigq2 ď λCest,`2p

rD;S, ηq,

Page 91: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

80

where Cest,`2 “ sup!

rD´12

u2

p1`ηqř

jPS λjuGj2´p1´ηqř

jPSc λjuGj2(

`: uTΣu “

1)

. Further,

Cest,`2 ď sup0ďtďξ

! rD´12

u2“

p1` ηqř

jPS λjuGj2 ´ p1´ ηqř

jPSc λjuGj2‰

uTΣu:

ÿ

jPS

λjuGj2 “ tÿ

jPSc

λjuGj2

)

ďsup0ďtďξ

!

p1` ηq ´ tp1´ ηq‰

)

jPS λ2j

˘12

RE˚pGqpΣ, rD;S, ξ,λqCCpGqpΣ;S, ξ,λq

ďp1` ηq

jPS λ2j

˘12

RE˚pGqpΣ, rD;S, ξ,λqCCpGqpΣ;S, ξ,λq. (4.73)

The first inequality in (4.68) holds. Moreover, for any u P CpGqpξ, S,λq,

uTΣu`ř

jPS λ2j

˘

jPS λjuGj2˘2 ě

uTΣu`ř

jPS λ2j

˘

jPS λjφ12maxp rDGj ,Gj q

rD´12

Gj ,GjuS2

˘2

ěuTΣu

jPS λ2j

˘

λ2jφmaxp rDGj ,Gj q

rD´12

GS ,GSuS22

ěuTΣup1` εq

rD´12

GS ,GSuS22

ěuTΣup1` εq

rD´12

u22

.

Taking infimum in the cone CpGqpξ, S,λq on both sides give us

CC2pGqpΣ;S, ξ,λq ě RE2

˚pGqpΣ,rD;S, ξ,λqp1` εq.

Combining with (4.73), we have the second inequality of (4.68) holds. ˝

4.6 Convergence of the restricted eigenvalue

In Section 4.4 and 4.5, we proved that the groupwise CC and RE condition can be controlled

under low moment conditions, given their population version condition hold. The next

question is whether the restricted eigenvalue truly converge to its population version.

Rudelson and Zhou [66] considered the case of bounded designs and proved the

convergence of the RE given certain sample size requirement. van de Geer and Muro [76]

Page 92: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

81

further extended the bounded assumption to sub-Gaussian designs, but assume a strong

isotropy condition. In this section, we show that van de Geer and Muro’s results can be

generalized to the groupwise setting easily with no further condition.

Definition 4.1. Let m ą 2. A random vector X0 P Rp is strongly m-th order isotropic

with constant cm if for all u P Rp with EpXT0 uq

2 “ 1 it holds that

rEpXT0 uq

ms1m ď cm.

Definition 4.2. A random variable Z P R is sub-Gaussian with constant c if for all λ ą 0

it holds that

E exprλ|Z|s ď 2 exprλ2c22s.

Lemma 4.1. Let rΣ “ĂXTĂXn and Σ “ ErΣ. If ĂXi,˚ is strongly m-th order isotropic with

constant cm and its components ĂXi,j are sub-Gaussian with constant c, then for a universal

constant c1 and all t ą 0 with probability at least 1´ p1` n´mpm´2qq expr´ts,

supuT Σu“1,u1ďM

ˇ

ˇuT rΣu´ uTΣuˇ

ˇc1

ď cM

d

`

2t` 2 logp2pq ` 2m log npm´ 2q˘

plog p log3 n` tq

n

` c2M2

`

2t` 2 logp2pq ` 2m log npm´ 2q˘

plog p log3 n` tq

n

` c2m expr´tpm´ 2qmsn. (4.74)

This Lemma is quoted from van de Geer and Muro [76]. Observably, to achieve the

convergence, the sample size n need to satisfy the condition

log2 p log3 n ! n, pm´ 2q´1m log2 p log3 n ! n.

Now we generalize the results to the groupwise setting.

Theorem 4.5. Let rΣ “ ĂXTĂXn and Σ “ ErΣ. Suppose ĂXi,˚ is strongly m-th order

isotropic with constant cm and its components ĂXi,j are sub-Gaussian with constant c. Let

Page 93: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

82

A0, σ, λ be as in Theorem 4.1. Then, for a universal constant c1 and all t ą 0 with

probability at least 1´ p1` n´mpm´2qq expr´ts,

supuT Σu“1,

řJj“1 λjuGj

2ďMA0σ?n

ˇ

ˇuT rΣu´ uTΣuˇ

ˇc1

ď cM

d

`

2t` 2 logp2pq ` 2m log npm´ 2q˘

plog p log3 n` tq

n

` c2M2

`

2t` 2 logp2pq ` 2m log npm´ 2q˘

plog p log3 n` tq

n

` c2m expr´tpm´ 2qmsn. (4.75)

The proof of Theorem 4.5 is straightforward. When λj “ pa

dj `?

2 log JqA0σ?n,

řJj“1 λjuGj2 ď MA0σ

?n indicate that

řJj“1 d

12j uGj2 ď M . By Cauchy-Schwarz

inequality,

uGS1 ď

Jÿ

j“1

d12j uGj2 ďM

Then (4.75) follows from (4.74).

4.7 Lemmas

In this section, we provide three lemmas that was used before.

Lemma 4.2. Let Zpi,kq

and Zpkq

be in (4.24) and (4.25) respectively. Let D be a block

diagonal matrix such thatř

jPS λ2jφmaxpDGj ,Gj q ď p1 ` εq

ř

jPS λ2j and φminpDGj ,Gj q ě

1p1` εq,@j P Sc. Then for any ε1 ą 0,

P!

›D´12Zp1q››

2

2ď p1´ ε1q

›D´12u›

2

2

)

ď k˚ exp

´cε21L(

.

where c is a constant that may depend on ε and ε1.

Proof of Lemma 4.2. We first note that

›D´12Zp1q››

2

2´›

›D´12u›

2

2“

`

Zp1q` u

˘TD´1

`

Zp1q´ u

˘

ě 2uTD´1`

Zp1q´ u

˘

.

Page 94: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

83

Let ζpi,kq “ uTD´1pZpi,kq ´ř

jPJkUjq

›D´12u›

2

2for k “ 1, ..., k˚ and i “ 1, ...,mk. Let

ζpkq“

řmki“1 ζ

pi,kqmk, we have Erζpi,kqs “ 0. Since φminpDGj ,Gj q ě 1p1 ` εq,@j P Sc and

D´12u ě 1p1` εq12 by (4.29), we further have

|ζpi,kq ` 1| ď maxjPJk

pD´1uqGj2Zpi,kq2

›D´12u›

2

2

ď maxjPJk

D´12Gj ,Gj

pD´12uqGj2

´ 1

λj

ÿ

jPJk

λjuGj2

¯

›D´12u›

2

2

ď p1` εq12pD´12uqGj2`

2k´1λ˚q´1

ÿ

jPJk

λjuGj2›

›D´12u›

2

2

ďp1` εq12

2k´1λ˚

ÿ

jPJk

λjuGj2›

›D´12u›

2

ďp1` εq

ř

jPJkλjuGj2

2k´1λ˚. (4.76)

Moreover,

Epζpi,kqq2 “ÿ

jPJk

πj

˜

π´1j u

TGjD´1Gj ,Gj

uGj

›D´12u›

2

2

¸2

´pD´12uqGJk

42›

›D´12u›

4

2

ďÿ

jPJk

ř

jPJkλjuGj2

λjuGj2

pD´1uqGj2uGj2pD´12uqGj

22

›D´12u›

4

2

ď

ř

jPJkλjuGj2 maxjPJk pD

´1uqGj2

2k´1λ˚›

›D´12u›

2

2

pD´12uqGJk22

›D´12u›

2

2

ď

ř

jPJkλjuGj2p1` εq

12pD´12uqGj2

2k´1λ˚›

›D´12u›

2

2

ďp1` εq

ř

jPJkλjuGj2

2k´1λ˚. (4.77)

Define ξk “ p1 ` εqř

jPJkλjuGj2p2

k´1λ˚q and let mk be in (4.26). Combining (4.76),

(4.77) and by the Bernstein inequality,

P!

´ ζpkqě

?2´ 1

2?

22´pk

˚`1´kq2ε1

)

“ P!

´1

mk

mkÿ

i“1

ζpi,kq ě

?2´ 1

2?

22´pk

˚`1´kq2ε1

ˇ

ˇ

ˇ

ĂX)

ď exp

$

&

%

´mkp?

2´12?

22´pk

˚`1´kq2ε1q22

ξkr1` p?

2´12?

22´pk˚`1´kq2ε1q3s

,

.

-

“ exp

´ cε21L(

,

Page 95: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

84

holds for any ε1 ą 0 and k “ 1, ...., k˚, where c is certain constant. Further we have

P!

›D´12Zp1q››

2

2´›

›D´12u›

2

2ď ´ε1

›D´12u›

2

2

)

“ P!

´

k˚ÿ

k“1

ζpkqě ε12

)

ď 1´k˚ź

k“1

P!

ζpkqď

?2´ 1

2?

22pk

˚`1´kq2ε1

)

ď 1´k˚ź

k“1

1´ exp

´ cε21L(‰

ď k˚ exp

´ cε21L(

.

˝

Lemma 4.3. For any A Ă t1, . . . , pu and M ą 0,!

pĂXi,˚uq2 ^M : supppuq Ă A

)

is a VC

class of functions of ĂXi,˚ with no greater VC-dimension than 10|A| ` 10, i “ 1, ..., n.

Proof of Lemma 4.3. Denote the set of subgraphs of functions in!

ĂXi,˚u : supppuq Ă A)

as C1, the set of subgraphs of functions in!

´ĂXi,˚u : supppuq Ă A)

as C2, then V pC1q “

V pC2q “ |A| ` 2 by Lemma 2.6.15 of van der Vaart and Wellner [78], where V p¨q denote

the VC-dimension. Let C be the subgraph of functions in!

pĂXi,˚uq2 ^M : supppuq Ă A

)

,

then C Ă C1 X C2 and V pCq ď V pC1 X C2q. Denote N “ 10|A| ` 10. To prove Lemma 4.3,

we only need to show V pC1 X C2q ď N . Or equivalently,

maxx1,...,xN

∆N pC1 X C2, x1, ..., xN q ă 2N ,

where

∆N pC1 X C2, x1, ..., xN q “ #

C X tx1, ..., xNu : C P C1 X C2

(

.

In fact, we have

maxx1,...,xN

∆N pC1 X C2, x1, ..., xN q

ď`

maxx1,...,xN

∆N pC1, x1, ..., xN q˘`

maxx1,...,xN

∆N pC2, x1, ..., xN q˘

Page 96: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

85

ď

¨

˝

V pC1q´1ÿ

j“0

ˆ

N

j

˙

˛

¨

˝

V pC2q´1ÿ

j“0

ˆ

N

j

˙

˛

‚“

¨

˝

|A|`1ÿ

j“0

ˆ

N

j

˙

˛

2

,

where the second inequality comes from Corollary 2.6.3 of van der Vaart and Wellner [78].

We need to prove´

ř|A|`1j“0

`

Nj

˘

¯2ă 2N . Let π “ |A|`1

N “ 110 . Since

|A|`1ÿ

j“0

ˆ

N

j

˙

|A|`1ÿ

j“0

ˆ

N

j

˙

πjp1´ πqN´j1

πjp1´ πqN´j

ď maxjď|A|`1

1

πjp1´ πqN´j

“1

π|A|`1p1´ πqN´|A|´1

ˆ

1

ππp1´ πq1´π

˙N

,

we only need to verify ππp1´ πq1´π ą 1?2

with π “ 110, which holds numerically. ˝

Lemma 4.4. Let ε ą 0 and s˚ be as in Theorem 4.1. Suppose ΣGj ,Gj “ Idjˆdj and

n ě Cs˚ log

eps˚(

for certain constant C. If (4.36) is uniformly integrable, then

P

#

inf! 1

n

nÿ

i“1

pĂXi,˚rvq2 : rv P CpGq

´

p1` εqξ, S,λ¯

, rvSc0 ď s˚, rvTΣrv “ 1)

ě 1´ ε

+

Ñ 1.

(4.78)

If (4.38) is uniformly integrable, then

P

#

inf! 1

n

nÿ

i“1

pĂXi,˚rvq2 : rv P C ˚pGq

´

p1` εqξ, S,λ¯

, rvSc0 ď s˚, rvTΣrv “ 1)

ě 1´ ε

+

Ñ 1.

(4.79)

Proof of Lemma 4.4. By Lemma 4.3, for any A Ă t1, . . . , pu with A Ą S and

|AzS| “ s˚,

!

pĂXi,˚rvq2 ^M : suppprvq Ă A

)

, i “ 1, ..., n

is a VC class of functions of ĂXi,˚ with VC-dimension no greater than 10|A| ` 10. Denote

this VC class as C. Let Q be any probability measure, 0 ă ε ă 1, Npε, C, ¨ q be the

covering number, the minimal number of balls tg : g ´ f ă εu of radius ε needed to cover

Page 97: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

86

C. The norm considered here is the L2pQq norm: fQ “` ş

|f |2dQ˘12

, for any f P C. By

Theorem 2.6.4 of van der Vaart and Wellner [78], there exists a universal constant C, such

that

supQN

`

ε, C, L2pQq˘

ď CV pCqp4eqV pCqp1εq2V pCq´2.

It then follows from Theorem 2.14.9 of van der Vaart and Wellner [78] that

P

#

inf! 1

n

nÿ

i“1

pĂXi,˚rvq2 ^M ´ EpĂXi,˚rvq

2 ^M‰

: suppprvq Ă A)

ď ´ε

+

À CV pCq expt´2npεMq2u

with C being a positive constant. Since there are totally`

p´ss˚

˘

choices of A,

P

#

inf! 1

n

nÿ

i“1

pĂXi,˚rvq2 ^M ´ EpĂXi,˚rvq

2 ^M‰

: rvSc0 ď s˚)

ď ´ε

+

À

ˆ

p´ s

˙

CV pCq expt´2npεMq2u

ă`

eps˚˘s˚

CV pCq expt´2nε2M2u

À exp

C0s˚ logpeps˚q ´ 2nε2M2

(

holds for certain constant C0. The second inequality holds because V pCq ď 10ps` s˚q ` 10

and`

ab

˘

ď p e¨ab qb for any integer a ě b ą 0. Then, if n ě Cs˚ logpeps˚q holds for certain C,

P

#

inf! 1

n

nÿ

i“1

pĂXi,˚rvq2 ^M ´ EpĂXi,˚rvq

2 ^M‰

: rvSc0 ď s˚)

ď ´ε

+

Ñ 0. (4.80)

Moreover, since (4.36) is uniformly integrable, there exists M ą 0 such that for rv P

CpGqpξ1, S,λq, rvSc0 ď s˚, rvTΣrv “ 1,

E! 1

n

nÿ

i“1

pĂXi,˚rvq2 ^M

)

ě 1´ε

2. (4.81)

Page 98: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

87

Combing (4.80) and (4.81), for ξ1 “ p1` εqξ, we have

P

#

inf! 1

n

nÿ

i“1

pĂXi,˚rvq2 : rv P CpGqpξ1, S,λq, rvSc0 ď s˚, rvTΣrv “ 1

)

ď 1´ ε

+

ď P!

inf! 1

n

nÿ

i“1

pĂXi,˚rvq2 ^M ´ EpĂXi,˚rvq

2 ^M‰

:

rv P CpGqpξ1, S,λq, rvSc0 ď s˚)

ď ´ε

2

)

goes to zero. Therefore, (4.78) holds. Moreover, (4.79) can be proved in the same way. ˝

4.8 Discussion

Although bootstrap method beyond the topic of this chapter, we do realize that our results

may provide a theoretical foundation for the bootstrapped penalized least-square estimator.

In bootstrapped estimation, one sample with replacement of size n,

pX˚i,˚, y

˚i q : i “ 1, ..., n

(

from the original data points

pĂXi,˚, yiq : i “ 1, ..., n(

and form the corresponding

design matrix X˚ and response vector y˚. Let Σ˚“ tX˚uTX˚n be the bootstrapped

Gram matrix. Then it can be shown that the RE condition holds on Σ˚

under the low

moment condition on random designs. In other words, the sample RE implies the RE

of bootstrapped sample as population RE implies the sample RE. Then, prediction and

coefficients estimation properties for the bootstrapped Lasso can be guaranteed under the

low moment condition.

In Section 4.2, we argued that the RSC condition can be viewed as an RE-type condition

with a slightly larger cone. Now we provide the detailed results and proofs.

Proposition 4.1. The restricted strong convexity condition (4.13) is equivalent to the RE-

type condition

uTΣu ě

$

&

%

αuA22, u2 ď 1

αuA2, u2 ě 1

(4.82)

Page 99: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

88

in the union cone

ď

|A|ďCn log p

!

u : uAc1 ď ξa

|A|uA2

)

(4.83)

for certain positive constants α and C.

Remark 4.5. In the linear model, one only need to check the RE for u2 “ 1, (4.82) is

equivalent to κ2pΣ;A, ξq ě α for u in the union cone (4.83).

Proof of Proposition 4.1. RSC to RE: Let α0 ą 0, 0 ď τ0 ď 1 be fixed numbers

satisfying

a

τ1α1 _ pτ2α2q ď νa

τ0α0

for 0 ă ν ă 1. Let A be any set satisfying p1 ` ξqa

|A|pτ0α0qplog pqn ď 1. When

uAc1 ď ξ|A|12uA2, we have

a

pτ0α0qplog pqnu1 ď p1` ξqa

|A|pτ0α0qplog pqnuA2 ď uA2.

The RSC condition (4.13) implies

uTΣu ě

$

&

%

p1´ νqα1uA22, u2 ď 1

p1´ νqα2uA2, u2 ě 1.

Therefore, the RE condition (4.82) holds with α “ p1´ νqmintα1, α2u in the union cone

ď

p1`ξq?pταqplog pqnď1|A|

!

u : uAc1 ď ξa

|A|uA2

)

.

RE to RSC: Now suppose (4.82) holds with |A| “ k, and α, τ be positive numbers

satisfying

1pξ?kq ` 1

?4k ď

a

pταqplog pqn.

Page 100: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

89

We only need to consider u satisfying

a

pταqplog pqnu1 ď u2,

otherwise the RSC condition (4.13) holds automatically. Let A˚ be the index set of the k

largest elements of |u|. As

uAc˚2 ď u1

?4k,

we have

u1pξ?kq ď

a

pταqplog pqnu1 ´ u1?

4k ď u2 ´ uAc˚2 ď uA˚2.

Thus u is in the cone tu : uAc˚1 ď ξ

a

|A˚|uA˚2u. Then, for u2 ď 1,

uTΣu ě αuA˚22 ě αuA˚

22 ` αuAc

˚22 ´ αu

21t4ku ě αu22 ´ τplog pqnu21.

Similarly, for u2 ě 1,

uTΣu ě αuA˚2 ě αuA˚2 ` αuAc˚2 ´ αu1

?4k ě αu2 ´

a

τplog pqnu1.

Therefore, the RSC condition holds. ˝

Page 101: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

90

Chapter 5

Nonparametric Maximum Likelihood for Mixture Models: A

Convex Optimization Approach to Fitting Arbitrary

Multivariate Mixing Distributions

5.1 Introduction

Consider a setting where we have iid observations from a mixture model. More specifically,

let G0 be a probability distribution on T Ď Rd and let tF0p¨ | θquθPT be a family of

probability distributions on Rn indexed by the parameter θ P T . Throughout the chapter,

we assume that T is closed and convex. Assume that X1, ..., Xp P Rn are observed iid

random variables and that Θ1, ...,Θp P Rd are corresponding iid latent variables, which

satisfy

Xj | Θj „ F0p¨ | Θjq and Θj „ G0. (5.1)

In (5.1), it may be the case that F0p¨ | θq and G0 are both known (pre-specified)

distributions; more frequently, this is not the case. In this chapter, we will study problems

where the mixing distribution G0 is unknown, but will assume F0p¨ | θq is known throughout.

Problems like this arise in applications throughout statistics, and various solutions have been

proposed. The distribution G0 can be modeled parametrically, which leads to hierarchical

modeling and parametric empirical Bayes methods [e.g. 18]. Another approach is to model

G0 as a discrete distribution supported on finitely- or infinitely-many points; this leads

to the study of finite mixture models or nonparametric Bayes, respectively [50, 23]. This

chapter focuses on another method for estimating G0: Nonparametric maximum likelihood.

Nonparametric maximum likelihood (NPML) methods for mixture models — and

closely related empirical Bayes methods — have been studied in statistics since the 1950s

[62, 38, 63]. They make virtually no assumptions on the mixing distribution G0 and

Page 102: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

91

provide an elegant approach to problems like (5.1). The general strategy is to first find

the nonparametric maximum likelihood estimator for G0, denoted by G, then perform

inference via empirical Bayes [63, 18]; that is, inference in (5.1) is conducted via the posterior

distribution Θj | Xj , under the assumption G0 “ G. Research in to NPMLEs for mixture

models has included work on algorithms for computing NPMLEs and theoretical work on

their statistical properties [e.g. 41, 8, 43, 26, 36]. However, implementing and analyzing

NPMLEs for mixture models has historically been considered very challenging [e.g. p.

571 of 13, 17]. In this chapter, we study a computationally convenient approach involving

approximate NPMLEs, which sidesteps many of these difficulties and is shown to be effective

in a variety of applications.

Our approach is largely motivated by recent work initiated by [40](In fact, Koenker &

Mizera’s work was itself partially inpsired by relatively recent theoretical work on NPMLEs

by [36].) and further pursued by others, including [28, 30, 29] and [15]. Koenker & Mizera

studied convex approximations to NPMLEs for mixture models in relatively large-scale

problems, with up to 100,000s of observations. In [40], they showed that for the Gaussian

location model, where Xj “ Θj ` Zj P R and Θj „ G0, Zj „ Np0, 1q are independent, a

good approximation to the NPMLE for G0 can be accurately and rapidly computed using

generic interior point methods.

[40]’s focus on convexity and scalability is one of the key concepts for this chapter. Here,

we show how a simple convex approximation to the NPMLE can be used effectively in a

broad range of problems with nonparametric mixture models; including problems involving

(i) multivariate mixing distributions, (ii) discrete data, (iii) high-dimensional classification,

and (iv) state-space models. Backed by new theoretical and empirical results, we provide

concrete guidance for efficiently and reliably computing approximate multivariate NPMLEs.

Our main theoretical result (Proposition 5.1) suggests a simple procedure for finding the

support set of the estimated mixing distribution. Many of our empirical results highlight

the benefits of using multivariate mixing distributions with correlated components (Sections

5.6.2, 5.7, and 5.8), as opposed to univariate mixing distributions, which have been the

primary focus of previous research in this area (notable exceptions include theoretical work

on the Gaussian location-scale model in [26] and applications in [30, 29] involving estimation

Page 103: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

92

problems with Gaussian models). In Sections 5.7–5.9, we illustrate the performance of the

methods described here in real-data applications involving baseball, cancer microarray data,

and online blood-glucose monitoring for diabetes patients. In comparison with other recent

work on NPMLEs for mixture models, this chapter distinguishes itself from [28] in that it

focuses on more practical aspects of fitting general multivariate NPMLEs. Additionally,

in this chapter we consider a substantially broader swath of applications than [30, 29] and

[15], where the focus is estimation in Gaussian models and classification with a univariate

NPMLE, respectively, and show that the same fundamental ideas may be effectively applied

in all of these settings.

5.2 NPMLEs for mixture models via convex optimization

5.2.1 NPMLEs

Let GT denote the class of all probability distribution on T Ď Rd and suppose that f0p¨ | θq is

the probability density corresponding to F0p¨ | θq (with respect to some given base measure).

For G P GT , the (negative) log-likelihood given the data X1, ..., Xp is

`pGq “ ´1

p

pÿ

j“1

log

Tf0pXj | θq dGpθq

*

.

The Kiefer-Wolfowitz NPMLE for G0 [38], denoted G, solves the optimization problem

minGPGT

`pGq; (5.2)

in other words, `pGq “ minGPGT `pGq.

Solving (5.2) and studying properties of G forms the basis for basically all of the

existing research into NPMLEs for mixture models (including this chapter). Two important

observations have had significant but somewhat countervailing effects on this research:

(i) The optimization problem (5.2) is convex;

(ii) If f0pXj |θq and T satisfy certain (relatively weak) regularity conditions, then G exists

and may be chosen so that it is a discrete measure supported on at most p points.

Page 104: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

93

The first observation above is obvious; the second summarizes Theorems 18–21 of [43].

Among the more significant regularity conditions mentioned in (ii) is that the set

tf0pXj |θquθPT should be bounded for each j “ 1, ..., p.

Observation (i) leads to KKT-like conditions that characterize G in terms of the gradient

of ` and can be used to develop algorithms for solving (5.2) [e.g. 43]. While this approach

is somewhat appealing, (5.2) is typically an infinite-dimensional optimization problem

(whenever T is infinite). Hence, there are infinitely many KKT conditions to check, which

is generally impossible in practice.

On the other hand, observation (ii) reduces (5.2) to a finite-dimensional optimization

problem. Indeed, (ii) implies that G can be found by restricting attention in (5.2) to G P Gp,

where Gp is the set of discrete probability measures supported on at most p points in T .

Thus, finding G is reduced to fitting a finite mixture model with at most p components.

This is usually done with the EM-algorithm [41], where in practice one may restrict to

G P Gq for some q ă p. However, while (ii) reduces (5.2) to a finite-dimensional problem,

we have lost convexity:

minGPGq

`pGq (5.3)

is not a convex problem because Gq is nonconvex. When q is large (and recall that the

theory suggests we should take q “ p), well-known issues related to nonconvexity and finite

mixture models become a significant obstacle [50].

5.2.2 A simple finite-dimensional convex approximation

In this chapter, we take a very simple approach to (approximately) solving (5.2), which

maintains convexity and immediately reduces (5.2) to a finite-dimensional problem.

Consider a pre-specified finite grid Λ Ď T . We study estimators GΛ, which solve

minGPGΛ

`pGq. (5.4)

The key difference between (5.3) and (5.4) is that GΛ, and hence (5.4), is convex, while Gq

is nonconvex. Additionally, (5.4) is a finite-dimensional optimization problem, because Λ is

finite.

Page 105: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

94

To derive a more convenient formulation of (5.4), suppose that

Λ “ tt1, ..., tqu Ď T (5.5)

and define the simplex ∆q´1 “ tw “ pw1, ..., wqq P Rq; wl ě 0, w1 ` ¨ ¨ ¨ ` wq “ 1u.

Additionally, let δt denote a point mass at t P Rd. Then there is a correspondence between

G “řqk“1wkδtk P GΛ and points w “ pw1, ..., wqq P ∆q´1. It follows that (5.4) is equivalent

to the optimization problem over the simplex,

minwP∆q´1

´1

p

pÿ

j“1

log

#

qÿ

k“1

f0pXj |tkqwk

+

. (5.6)

Researchers studying NPMLEs have previously considered estimators like GΛ, which

solve (5.4)–(5.6). However, most have focused on relatively simple models with univariate

mixing distributions G0 [8, 36, 40]. In very recent work, [29, 30] have considered multivariate

NPMLEs for estimation problems involving Gaussian models — by contrast, our aim is to

formulate strategies for solving and implementing the general problem, as specified in (5.2),

(5.4), and (5.6).

5.3 Choosing Λ

The approximate NPMLE GΛ is the estimator we use throughout the rest of the chapter.

One remaining question is: How should Λ be chosen? Our perspective is that GΛ is an

approximation to G and its performance characteristics are inherited from G. In general,

GΛ ‰ G. However, as one selects larger and larger finite grids Λ Ď T , which are more and

more dense in T , evidently GΛ Ñ G. Thus, heuristically, as long as the grid Λ is “dense

enough” in T , GΛ should perform similarly to G.

If T is compact, then any regular grid Λ Ď T is finite and implementing (5.4) is

straightforward (specific implementations are discussed in Section 5.5). Thus, for compact

T , one can choose Λ to be a regular grid with as many points as are computationally

feasible. For general T , we propose a two-step approach to choosing Λ: (i) Find a compact

Page 106: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

95

convex subset T0 Ď T so that (5.2) is equivalent (or approximately equivalent) to

infGPGT0

`pGq; (5.7)

(ii) choose Λ Ď T0 Ď T to be a regular grid with q points, for some sufficiently large q.

Empirical results seem to be fairly insensitive to the choice of q. In Sections 5.7–5.9, we

choose q “ 30d for models with d “ 2, 3 dimensional mixing distributions G. For some

simple models with univariate G (d “ 1), theoretical results suggest that if q “?p, then

GΛ is statistically indistinguishable from G [15].

For each j “ 1, . . . , p, define

θj “ θpXjq “ arg maxθPT

f0pXj | θq

to be the maximum likelihood estimator (MLE) for Θj , given the data Xj P Rn. The

following proposition implies that (5.2) and (5.7) are equivalent when the likelihoods f0pXj |

θq are from a class of elliptical unimodal distributions, and T0 “ convpθ1, . . . , θpq is the

convex hull of θ1, . . . , θp. This result enables us to employ the strategy described above for

choosing Λ and finding GΛ; specifically, we take Λ to be a regular grid contained in the

compact convex set convpθ1, . . . , θpq.

Proposition 5.1. Suppose that f0 has the form

f0pXj | θq “ h

pθj ´ θqJΣ´1pθj ´ θq

(

upXjq, (5.8)

where h : r0,8q Ñ r0,8q is a decreasing function, Σ is a pˆ p positive definite matrix, and

u : Rn Ñ R is some other function that does not depend on θ. Let T0 “ convpθ1, . . . , θpq.

Then `pGq “ infGPGT0`pGq.

Proof. Assume that G “řqk“1wkδtk , where t1, . . . , tq P T and w1, . . . , wq ą 0. Further

assume that tq R T0 “ convpθ1, . . . , θkq. We show that their is another probability

distribution G “řq´1k“1wkδtk ` wqδtq , with tq P T0, satisfying `pGq ď `pGq. This suffices to

prove the proposition.

Page 107: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

96

Let tq be the projection of tq onto T0 with respect to the inner product ps, tq ÞÑ sJΣ´1t.

To prove that `pGq ď `pGq, we show that f0pXj | tqq ě f0pXj | tqq for each j “ 1, . . . , p.

We have

pθj ´ tqqJΣ´1pθj ´ tqq “ pθj ´ tq ` tq ´ tqq

JΣ´1pθj ´ tq ` tq ´ tqq

“ pθj ´ tqqJΣ´1pθj ´ tqq ` 2ptq ´ tqq

JΣ´1pθj ´ tqq

` ptq ´ tqqJΣ´1ptq ´ tqq

ě pθj ´ tqqJΣ´1pθj ´ tqq,

where we have used the fact that ptq ´ tqqJΣ´1pθj ´ tqq “ 0, because tq is the projection of

tq onto T0. By (5.8), it follows that f0pXj | tqq ě f0pXj | tqq, as was to be shown.

The condition (5.8) is rather restrictive, but we believe it applies in a number of

important problems. The fundamental example where (5.8) holds is Xj | Θj „ NpΘj ,Σq; in

this case θj “ Xj and (5.8) holds with u being certain constant and hpzq9e´z2. Condition

(5.8) also holds in elliptical models, where Θj is the location parameter of Xj | Θj .

More broadly, if Xj “ pX1j , . . . , Xnjq P Rn may be viewed as a vector of replicates Xij ,

i “ 1, . . . , n, drawn from some common distribution conditional on Θj , then standard results

suggest that the MLEs θj may be approximately Gaussian if n is sufficiently large, and (5.8)

may be approximately valid. Specific applications where a normal approximation argument

for θj may imply that (5.8) is approximately valid include count data (similar to Section

5.7) and time series modeling (Section 5.9).

5.4 Connections with finite mixtures

Finding GΛ is equivalent to fitting a finite-mixture model, where the locations of the atoms

for the mixing measure have been pre-specified (specifically, the atoms are taken to be the

points in Λ). Thus, the approach in this chapter reduces computations for the relatively

complex nonparametric mixture model (5.1) to a convex optimization problem that is

substantially simpler than fitting a standard finite mixture model (generally a non-convex

problem).

Page 108: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

97

An important distinction of nonparametric mixture modelss is that they lack the built-in

interpretability of the components/atoms from finite mixture models, and are less suited

for clustering applications. On the other hand, taking the nonparametric approach provides

additional flexibility for modeling heterogeneity in applications where it is not clear that

there should be well-defined clusters. Moreover, post hoc clustering and finite mixture

model methods could still be used after fitting an NPMLE; this might be advisable if, for

instance, GΛ has several clearly-defined modes.

5.5 Implementation overview

A variety of well-known algorithms are available for solving (5.6) and finding GΛ. We’ve

experimented with several, including the EM-algorithm, interior point methods, and the

Frank-Wolfe algorithm. This section contains a brief overview of how we’ve implemented

these algorithms; numerical results comparing the algorithms are contained in the following

section.

One of the early applications of the EM-algorithm is mixture models [41]. Solving (5.6)

with the EM-algorithm for mixture models is especially simple, because the problem is

convex (recall that the finite mixture model problem — as opposed to the nonparametric

problem — is typically non-convex). [40] have developed interior point methods for solving

(5.6). Along with [39], they created an R package REBayes that solves (5.6) for a handful

of specific nonparametric mixture models, e.g. Gaussian mixtures and univariate Poisson

mixtures; the REBayes packages calls an external optimization software package, Mosek,

and relies on Mosek’s built-in interior point algorithms. In our numerical analyses, we

used REBayes to compute some one-dimensional NPMLEs with interior point methods.

To estimate multi-dimensional NPMLEs with interior point methods, we used our own

implementations based on another R package Rmosek [3] (REBayes does appear to have

some built-in functions for estimating two-dimensional NPMLEs, but we found them to be

somewhat unstable in our applications). We note that our interior point implementation

solves the primal problem (5.6), while REBayes solves the dual. The Frank-Wolfe algorithm

[24] is a classical algorithm for constrained convex optimization problems, which has recently

been the subject of renewed attention [e.g. 35]. Our implementation of the Frank-Wolfe

Page 109: TOPICS IN HIGH-DIMENSIONAL REGRESSION AND …

98

algorithm closely resembles the “vertex direction method,” which has previously been used

for finding the NPMLE in nonparametric mixture models [7].

All of the algorithms used in this chapter were implemented in R. We did not attempt to heavily optimize any of these implementations; instead, our main objective was to demonstrate that there is a range of simple and effective methods for finding (approximate) NPMLEs. While the REBayes and Rmosek packages were used for their interior point methods, no packages beyond base R were required for any of our other implementations.

5.6 Simulation studies

This section contains simulation results for NPMLEs and a Gaussian location-scale mixture

model. Section 5.6.1 contains a comparison of the various NPMLE algorithms described

in the previous section. In Section 5.6.2, we compare the performance of NPMLE-based

estimators to other commonly used methods for estimating the mean in a Gaussian location-

scale model.

In all of the simulations described in this section we generated the data as follows. For $j = 1, \ldots, p$, we generated independent $\Theta_j = (\mu_j, \sigma_j) \sim G_0$ and corresponding observations $X_j \in \mathbb{R}^n$. Each $X_j = (X_{1j}, \ldots, X_{nj})'$ was a vector of $n$ replicates
$$X_{1j}, \ldots, X_{nj} \mid \Theta_j \sim N(\mu_j, \sigma_j^2) \qquad (5.9)$$

that were generated independently, conditional on $\Theta_j$. In the general model (5.9), the mixing distribution $G_0$ is bivariate. However, we considered two values of $G_0$ in our simulations: one where the marginal distribution of $\sigma_j$ was degenerate (i.e. $\sigma_j$ was constant, so $G_0$ was effectively univariate) and one where the marginal distribution of $\sigma_j$ was non-degenerate. (Note that for non-degenerate $\sigma_j$, $n \geq 2$ replicates are essential in order to ensure that the likelihood is bounded and that $\hat G$ exists.)

Throughout the simulations, we took $p = 1000$ and $n = 16$. For the first mixing distribution $G_0$ (degenerate $\sigma_j$), we fixed $\sigma_j = 4$ and took $\mu_j$ so that $P(\mu_j = 0) = P(\mu_j = 5) = 1/2$. For the second mixing distribution (non-degenerate $\sigma_j$), we took $P(\mu_j = 0, \sigma_j = 5) = P(\mu_j = 5, \sigma_j = 3) = 0.5$; for this distribution $\mu_j$ and $\sigma_j$ are correlated.
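As a concrete illustration, here is a minimal R sketch of this simulation design; the seed and the column-per-$\Theta_j$ storage layout are arbitrary choices, not taken from the original implementation.

```r
# Simulated data for Section 5.6: p parameters, n replicates each (sketch).
set.seed(1)
p <- 1000; n <- 16

# Mixing distribution 1: sigma_j = 4 fixed, mu_j = 0 or 5 with probability 1/2.
mu1    <- sample(c(0, 5), p, replace = TRUE)
sigma1 <- rep(4, p)

# Mixing distribution 2: (mu_j, sigma_j) = (0, 5) or (5, 3), each with probability 1/2.
idx    <- sample(1:2, p, replace = TRUE)
mu2    <- c(0, 5)[idx]
sigma2 <- c(5, 3)[idx]

# n x p matrix of replicates X_{1j}, ..., X_{nj} ~ N(mu_j, sigma_j^2) (mixing dist. 1).
X <- matrix(rnorm(n * p, mean = rep(mu1, each = n), sd = rep(sigma1, each = n)),
            nrow = n, ncol = p)
```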


5.6.1 Comparing NPMLE algorithms

For each mixing distribution, we computed $\hat G_\Lambda$ using the algorithms described in Section 5.5: the EM algorithm, an interior point method with Rmosek, and the Frank-Wolfe algorithm. For each of these algorithms, we also computed $\hat G_\Lambda$ for various grids $\Lambda$. Specifically, we considered regular grids
$$\Lambda = \{m_k\}_{k=1}^{q_1} \times \{s_k\}_{k=1}^{q_2} \subseteq [\min_j \hat\mu_j, \max_j \hat\mu_j] \times [\min_j \hat\sigma_j, \max_j \hat\sigma_j] \subseteq \mathbb{R}^2,$$
where $\hat\mu_j = n^{-1}\sum_i X_{ij}$ and $\hat\sigma_j^2 = n^{-1}\sum_i (X_{ij} - \hat\mu_j)^2$. The values $q_1, q_2$ determine the number of grid points in $\Lambda$ for $\mu_j$ and $\sigma_j$, respectively, and in the simulations we fit estimators with $(q_1, q_2) = (30, 30)$, $(50, 50)$, and $(100, 100)$.

In addition to fitting the two-dimensional NPMLEs $\hat G_\Lambda$ described above, for the simulations with degenerate $\sigma_j$ we also fit one-dimensional NPMLEs to the data $\hat\mu_1, \ldots, \hat\mu_p$, according to the model
$$\hat\mu_j \mid \mu_j \sim N(\mu_j, 1).$$
We fit these one-dimensional NPMLEs using all of the same algorithms as for the two-dimensional NPMLEs (EM, interior point with Rmosek, and Frank-Wolfe), and we also used the REBayes interior point implementation to estimate the distribution of $\mu_j$ in this setting. For the one-dimensional NPMLEs, we took $\Lambda \subset [\min_j \hat\mu_j, \max_j \hat\mu_j] \subseteq \mathbb{R}$ to be the regular grid with $q = 300$ points. This allows us to compare the performance of methods for one- and two-dimensional NPMLEs (where the one-dimensional NPMLEs take the distribution of $\sigma_j$ to be known) and to compare the performance of the two interior point algorithms, among other things.

For each simulated dataset and estimator $\hat G_\Lambda$ we recorded several metrics. First, we computed the total squared error (TSE),
$$\mathrm{TSE} = \sum_{j=1}^{p} \{E_{\hat G_\Lambda}(\mu_j \mid X_j) - \mu_j\}^2.$$
Second, we computed the difference between the log-likelihood of $\hat G_\Lambda$ and the log-likelihood of $\hat G_{\mathrm{EM}}$, the corresponding estimator for $G_0$ based on the EM algorithm:
$$\Delta(\text{log-lik.}) = \ell(\hat G_{\mathrm{EM}}) - \ell(\hat G_\Lambda).$$


Note that $\Delta(\text{log-lik.}) > 0$ if $\hat G_\Lambda$ has a smaller negative log-likelihood than $\hat G_{\mathrm{EM}}$ (we are taking the EM estimator $\hat G_{\mathrm{EM}}$ as a baseline for measuring the log-likelihood). Finally, we recorded the time required to compute $\hat G_\Lambda$ (in seconds; all calculations were performed on a 2015 MacBook Pro laptop). Summary statistics are reported in Table 5.1.
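Continuing the sketches above (the simulated matrix `X` and true means `mu1` from this section, and the `npmle_em` helper from Section 5.5), the fragment below illustrates one way the grid, the Gaussian likelihood matrix, the posterior means $E_{\hat G_\Lambda}(\mu_j \mid X_j)$, and the TSE might be computed; it is a sketch under those assumptions, not the code behind Table 5.1.

```r
# Bivariate grid from the sample means/SDs, likelihood matrix, EM fit, and TSE (sketch).
mu_hat <- colMeans(X)
sd_hat <- sqrt(colMeans(sweep(X, 2, mu_hat)^2))

q1 <- 30; q2 <- 30
grid <- expand.grid(m = seq(min(mu_hat), max(mu_hat), length.out = q1),
                    s = seq(min(sd_hat), max(sd_hat), length.out = q2))

# logL[j, k] = sum_i log phi(X_ij; m_k, s_k); rescale rows for numerical stability
# (the row scaling cancels in the EM weight updates).
logL <- sapply(seq_len(nrow(grid)), function(k)
  colSums(dnorm(X, mean = grid$m[k], sd = grid$s[k], log = TRUE)))
L <- exp(logL - apply(logL, 1, max))

fit  <- npmle_em(L)
post <- L * rep(fit$weights, each = p) / as.vector(L %*% fit$weights)
mu_post <- as.vector(post %*% grid$m)    # E(mu_j | X_j) under the fitted mixing measure
TSE <- sum((mu_post - mu1)^2)            # compares to the true means from the data sketch
```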

It is evident that the results in Table 5.1 are relatively insensitive to the number of grid points $(q_1, q_2)$ chosen for the two-dimensional NPMLE implementations. In terms of TSE, the EM algorithm and the interior point methods perform very similarly across all of the settings, while the interior point methods appear to slightly outperform the EM algorithm in terms of $\Delta(\text{log-lik.})$ across the board. Additionally, the interior point methods have smaller compute times than the EM algorithm, though the difference is not too significant for applications at this scale (for mixing distribution 1, with degenerate $\sigma_j$, the REBayes dual implementation appears to be somewhat faster than our Rmosek primal implementation). The Frank-Wolfe algorithm is the fastest implementation we have considered, but its performance in terms of TSE and $\Delta(\text{log-lik.})$ is considerably worse than that of the EM algorithm or the interior point methods. In the remainder of the chapter, we use the EM algorithm exclusively for computing NPMLEs; we believe it strikes a good balance between simplicity and performance.

5.6.2 Gaussian location-scale mixtures: other methods for estimating a normal mean vector

Beyond NPMLEs, we also implemented several other methods that are commonly used for estimating the mean vector $\mu = (\mu_1, \ldots, \mu_p)' \in \mathbb{R}^p$ in Gaussian location-scale models and computed the corresponding TSE. Specifically, we considered the fixed-$\Theta_j$ MLE, $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_p)' \in \mathbb{R}^p$; the James-Stein estimator; the heteroscedastic SURE estimator of Xie, Kou, and Brown [81]; and a soft-thresholding estimator. The James-Stein estimator is a classical shrinkage estimator for the Gaussian location model; the version employed here is described in [81] and is designed for heteroscedastic data. The heteroscedastic SURE estimator is another shrinkage estimator, which was designed to ensure that features with a high noise variance are “shrunk” more than those with a low noise variance. Both the James-Stein estimator and the heteroscedastic SURE estimator depend on the values $\sigma_j$.


The soft-thresholding estimator takes the form $\hat\mu_j(X) = s_t(\hat\mu_j)$, where $t \geq 0$ is a constant and $s_t(x) = \mathrm{sign}(x)\max\{|x| - t, 0\}$, $x \in \mathbb{R}$. For the soft-thresholding estimator, $t$ was chosen to minimize the TSE. Observe that the James-Stein, SURE, and soft-thresholding estimators all depend on information that is typically not available in practice: the values $\sigma_j$ and the actual TSE. By contrast, the two-dimensional NPMLEs described in the previous subsection utilize only the observed data $X_1, \ldots, X_p$.
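For concreteness, a short sketch of the soft-thresholding benchmark, reusing `mu_hat` (the sample means) and `mu1` (the true means) from the fragments above; the grid of candidate thresholds is an arbitrary illustrative choice.

```r
# Soft-thresholding of the sample means, with the threshold chosen by oracle TSE (sketch).
soft_threshold <- function(x, t) sign(x) * pmax(abs(x) - t, 0)

t_grid  <- seq(0, 5, by = 0.01)
tse_t   <- sapply(t_grid, function(t) sum((soft_threshold(mu_hat, t) - mu1)^2))
mu_soft <- soft_threshold(mu_hat, t_grid[which.min(tse_t)])
```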

In Table 5.2, we report the TSE for the different estimators described in this section, along with the TSE for the bivariate NPMLE fit using the EM algorithm. We also fit a univariate NPMLE in this example, where $\sigma_j$ was not assumed to be known; instead, we used the plug-in estimator $\hat\sigma_j$ in place of $\sigma_j$ and then computed the NPMLE for the distribution of $\mu_j$. Table 5.2 shows that the NPMLEs dramatically outperform the alternative estimators in this setting, in terms of TSE. The bivariate NPMLE outperforms the univariate NPMLE under both mixing distributions 1 and 2, but its advantage is especially pronounced under mixing distribution 2, where $\mu_j$ and $\sigma_j$ are correlated. This highlights the potential advantages of bivariate NPMLEs over univariate approaches in settings with multiple parameters.

5.7 Baseball data

Baseball data is a well-established testing ground for empirical Bayes methods [19]. The

baseball dataset we analyzed contains the number of at-bats and hits for all of the Major

League Baseball players during the 2005 season and has been previously analyzed in a

number of papers [10, 37, 53, 81]. The goal of the analysis is to use the data from the first

half of the season to predict each player’s batting average (hits/at-bats) during the second

half of the season. Overall, there are 929 players in the baseball dataset; however, following

[10] and others, we restrict attention to the 567 players with more than 10 at-bats during

the first half of the season (we follow the other preprocessing steps described in [10] as well).

Let $A_j$ and $H_j$ denote the number of at-bats and hits, respectively, for player $j$ during the first half of the season. We assume that $(A_j, H_j)$ follows a Poisson-binomial mixture model,


where $A_j \mid (\lambda_j, \pi_j) \sim \mathrm{Poisson}(\lambda_j)$, $H_j \mid (A_j, \lambda_j, \pi_j) \sim \mathrm{binomial}(A_j, \pi_j)$, and $(\lambda_j, \pi_j) \sim G_0$. This model has a bivariate mixing distribution $G_0$, i.e. $d = 2$. In the notation of (5.1), $X_j = (A_j, H_j)$ and $\Theta_j = (\lambda_j, \pi_j)$. We propose to estimate each player's batting average for the second half of the season by the posterior mean of $\pi_j$, computed under $(\lambda, \pi) \sim \hat G_\Lambda$,
$$\hat\pi_j = E_{\hat G_\Lambda}(\pi_j \mid A_j, H_j). \qquad (5.10)$$
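A rough R sketch of how the posterior mean (5.10) can be computed with the fixed-grid machinery from Section 5.5 is given below. Here `A` and `H` are the first-half at-bats and hits, `npmle_em` is the EM sketch from Section 5.5, and the grid ranges and sizes are illustrative guesses rather than the values actually used.

```r
# Poisson-binomial NPMLE over a (lambda, pi) grid and posterior means (sketch).
grid <- expand.grid(lambda = seq(min(A), max(A), length.out = 50),
                    prob   = seq(0.01, 0.50, length.out = 50))   # prob = pi_j

logL <- sapply(seq_len(nrow(grid)), function(k)
  dpois(A, grid$lambda[k], log = TRUE) + dbinom(H, A, grid$prob[k], log = TRUE))
L <- exp(logL - apply(logL, 1, max))        # rescale rows; cancels in the EM updates

fit  <- npmle_em(L)
post <- L * rep(fit$weights, each = length(A)) / as.vector(L %*% fit$weights)
pi_hat <- as.vector(post %*% grid$prob)     # E_{G_Lambda}(pi_j | A_j, H_j), as in (5.10)
```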

Most previously published analyses of the baseball data begin by transforming the data via the variance-stabilizing transformation
$$W_j = \arcsin\sqrt{\frac{H_j + 1/4}{A_j + 1/2}} \qquad (5.11)$$
(Muralidharan [53] is a notable exception). Under this transformation, $W_j$ is approximately distributed as $N\{\mu_j, (4A_j)^{-1}\}$, where $\mu_j = \arcsin\sqrt{\pi_j}$. Methods for Gaussian observations may be applied to the transformed data, with the objective of estimating $\mu_j$. Following this approach, a variety of methods based on shrinkage, the James-Stein estimator, and parametric empirical Bayes methods for Gaussian data have been proposed and studied [10, 37, 81].

Under the transformation (5.11), it is standard to use total squared error to measure the performance of estimators $\hat\mu_j$ [e.g. 10]. In this example, the total squared error is defined as
$$\mathrm{TSE} = \sum_{j} \left\{ (\hat\mu_j - \tilde W_j)^2 - \frac{1}{4\tilde A_j} \right\},$$
where
$$\tilde W_j = \arcsin\sqrt{\frac{\tilde H_j + 1/4}{\tilde A_j + 1/2}},$$
and $\tilde A_j$ and $\tilde H_j$ denote the at-bats and hits from the second half of the season, respectively. For convenience of comparison, we used TSE to measure the performance of our estimates $\hat\pi_j$, after applying the transformation $\hat\mu_j = \arcsin\sqrt{\hat\pi_j}$.
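The fragment below sketches the transformation (5.11) and this TSE criterion; `A2` and `H2` are hypothetical names for the second-half at-bats and hits, and `pi_hat` is the vector of estimates (5.10) from the sketch above.

```r
# Variance-stabilizing transform and TSE for the baseball analysis (sketch).
vst <- function(H, A) asin(sqrt((H + 1/4) / (A + 1/2)))

W2     <- vst(H2, A2)              # transformed second-half batting averages
mu_est <- asin(sqrt(pi_hat))       # transformed NPMLE estimates of pi_j
TSE    <- sum((mu_est - W2)^2 - 1 / (4 * A2))
```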

Results from the baseball analysis are reported in Table 5.3. Following the work of

others, we have analyzed all players from the dataset together, and then the pitchers and


non-pitchers from the dataset separately. In addition to our Poisson-binomial NPMLE-

based estimators (5.10), we considered six other previously studied estimators:

1. The (fixed-parameter) MLE $\hat\mu_j = W_j$ uses each player's hits and at-bats from the first half of the season to estimate their performance in the second half.

2. The grand mean $\hat\mu_j = p^{-1}(W_1 + \cdots + W_p)$ gives the same estimate for each player's performance in the second half of the season, equal to the average performance of all players during the first half.

3. The James-Stein parametric empirical Bayes estimator described in [10].

4. The weighted generalized MLE (weighted GMLE), which uses at-bats as a covariate

[37]. This is essentially a univariate NPMLE method for Gaussian models with covariates.

5. The semiparametric SURE estimator is a flexible shrinkage estimator that may be

viewed as a generalization of the James-Stein estimator [81].

6. The binomial mixture method in [53] is another empirical Bayes method, which does not require the data to be transformed and estimates $\pi_j$ directly (in [53], the analysis is carried out conditionally on the at-bats $A_j$). TSE is computed after applying the $\arcsin\sqrt{\cdot}$ transformation.

The values reported in Table 5.3 are the TSEs of each estimator, relative to the TSE of the fixed-parameter MLE. Our Poisson-binomial method performs very well, recording the minimum TSE when all of the data (pitchers and non-pitchers) are analyzed together and for the non-pitchers. Moreover, the Poisson-binomial NPMLE $\hat G_\Lambda$ works on the original scale of the data (no normal transformation is required) and may be useful for other purposes beyond estimation/prediction. Figure 5.1 (a) contains a histogram of 20,000 independent draws from the estimated distribution of $(A_j, H_j/A_j)$, fitted with the Poisson-binomial NPMLE to all players in the baseball dataset. Observe that the distribution appears to be bimodal. By comparing this histogram with histograms of the observed data from the non-pitchers and pitchers separately (Figure 5.1 (b)–(c)), it appears that the mode at the left of Figure 5.1 (a) represents a group of players that includes the pitchers and the mode at the right represents the bulk of the non-pitchers.

Figure 5.1: (a) Histogram of 20,000 independent draws from the estimated distribution of $(A_j, H_j/A_j)$, fitted with the Poisson-binomial NPMLE to all players in the baseball dataset; (b) histogram of non-pitcher data from the baseball dataset; (c) histogram of pitcher data from the baseball dataset.

5.8 Two-dimensional NPMLE for cancer microarray classification

[15] proposed a univariate NPMLE-based method for high-dimensional classification

problems and studied applications involving cancer microarray data. The classifiers from

[15] are based on a Gaussian model with one-dimensional mixing distributions, i.e. $d = 1$.

In this section we show that using a bivariate mixing distribution may substantially improve

performance.

Two datasets from the Microarray Quality Control (MAQC) Phase II project [49] are considered: one from a breast cancer study and one from a myeloma study. The training dataset for the breast cancer study contains $n = 130$ subjects and $p = 22283$ probesets (genes); the test dataset contains 100 subjects. The training dataset for the myeloma study contains $n = 340$ subjects and $p = 54675$ probesets; the test dataset contains 214 subjects. The goal is to use the training data to build binary classifiers for several outcomes, and then to check the performance of these classifiers on the test data. Outcomes for the breast cancer data are response to treatment (“Response”) and estrogen receptor status (“ER status”); outcomes for the myeloma data are overall and event-free survival (“OS” and “EFS”).

For each of the studies, let $X_{ij}$ denote the expression level of gene $j$ in subject $i$ and let $Y_i \in \{0, 1\}$ be the class label for subject $i$. Let $X_j = (X_{1j}, \ldots, X_{nj})^\top \in \mathbb{R}^n$. We assume that each class ($k = 0, 1$) and each gene ($j = 1, \ldots, p$) has an associated mean expression level


$\mu_{jk} \in \mathbb{R}$, and that, conditional on the $Y_i$ and $\mu_{jk}$, all of the $X_{ij}$ are independent and Gaussian, satisfying $X_{ij} \mid (Y_i = k, \mu_{jk}) \sim N(\mu_{jk}, 1)$ (the gene-expression levels in the datasets are all standardized to have variance 1).

In [15], it is assumed that $\mu_{1k}, \ldots, \mu_{pk} \sim G_k$ ($k = 0, 1$) are all independent draws from two distributions, $G_0$ and $G_1$. The training data from classes $k = 0, 1$ are used to separately estimate the distributions $G_0$ and $G_1$ with NPMLEs, and then the Bayes classifier is implemented with $G_0$ and $G_1$ replaced by the corresponding estimates. In this chapter, we instead model $\Theta_j = (\mu_{j0}, \mu_{j1}) \sim G_0$ jointly, compute the bivariate NPMLE $\hat G_\Lambda$, and use $\hat G_\Lambda$ in place of $G_0$ in the Bayes classifier for this model. The model from [15] is equivalent to the model proposed here when $\mu_{j0}$ and $\mu_{j1}$ are independent. Results from analyzing the MAQC datasets using these two classifiers (the previously proposed method with 1-dimensional NPMLEs and the 2-dimensional NPMLE described here), along with some other well-known and relevant classifiers, may be found in Table 5.4; a sketch of the 2-dimensional plug-in classification rule appears after the list below. The other classifiers we considered were:

1. NP EBayes w/ smoothing. Another nonparametric empirical Bayes classifier, proposed in [27], which uses nonparametric smoothing to fit a univariate density to the $\mu_j$ and then employs a version of linear discriminant analysis.

2. Regularized LDA. A version of $\ell_1$-regularized linear discriminant analysis, proposed in Mai, Zou, and Yuan [48].

3. Logistic lasso. $\ell_1$-penalized logistic regression fit using the R package glmnet.
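The sketch below illustrates one way the plug-in Bayes classifier with a bivariate NPMLE might be implemented under the independence (naive Bayes) model above. It reflects our reading of the method rather than code from [15]; the function and argument names are illustrative, with `grid` holding the atoms $(\mu_{j0}, \mu_{j1})$ in columns `m0`, `m1` and `w` the fitted NPMLE weights.

```r
# Plug-in Bayes classifier for a new expression profile xnew (sketch).
# Xtrain: n x p training matrix, y: 0/1 class labels, grid/w: fitted bivariate NPMLE.
classify_2d_npmle <- function(Xtrain, y, xnew, grid, w) {
  score <- log(mean(y == 1) / mean(y == 0))        # class prior log-odds
  for (j in seq_len(ncol(Xtrain))) {
    x0 <- Xtrain[y == 0, j]; x1 <- Xtrain[y == 1, j]
    # posterior over the grid for gene j, given that gene's training data
    logpost <- log(w) + sapply(seq_len(nrow(grid)), function(k)
      sum(dnorm(x0, grid$m0[k], 1, log = TRUE)) +
      sum(dnorm(x1, grid$m1[k], 1, log = TRUE)))
    post <- exp(logpost - max(logpost)); post <- post / sum(post)
    # predictive density of the new value under each class, and the log-ratio
    f0 <- sum(post * dnorm(xnew[j], grid$m0, 1))
    f1 <- sum(post * dnorm(xnew[j], grid$m1, 1))
    score <- score + log(f1) - log(f0)
  }
  as.integer(score > 0)                            # 1 = classify to class 1
}
```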

For each of the datasets and outcomes, the 2-dimensional NPMLE classifier substantially outperforms the 1-dimensional NPMLE and is very competitive with the top-performing classifiers. Modeling dependence between $\mu_{j0}$ and $\mu_{j1}$, as with the 2-dimensional NPMLE, seems sensible because most of the genes are likely to have similar expression levels across classes, i.e. $\mu_{j0}$ and $\mu_{j1}$ are likely to be correlated. This may be interpreted as a kind of sparsity assumption on the data; such assumptions are prevalent in high-dimensional classification problems. Moreover, our proposed method involving NPMLEs should adapt to non-sparse settings as well, since $G_0$ is allowed to be an arbitrary bivariate distribution.


One of the main underlying assumptions of the NPMLE-based classification methods is

that the different genes have independent expression levels. This is certainly not true in

most applications, but is similar in principle to a “naive Bayes” assumption. Developing

methods for NPMLE-based classifiers to better handle correlation in the data may be of

interest for future research.

5.9 Continuous glucose monitoring

The analysis in this section is based on blood glucose data from a study involving 137 type

1 diabetes patients; more details on the study may be found in [31, 16]. Subjects in the

study were monitored for an average of 6 months each. Throughout the course of the study,

each subject wore a continuous glucose monitoring device, built around an electrochemical

glucose biosensor. Every 5 minutes while in use, the device records (i) a raw electrical

current measurement from the sensor (denoted ISIG), which is known to be correlated with

blood glucose density, and (ii) a timestamped estimate of blood glucose density (CGM),

which is based on a proprietary algorithm for converting the available data (including the

electrical current measurements from the sensor) into blood glucose density estimates. In

addition to using the sensors, each study subject maintained a self-monitoring routine,

whereby blood glucose density was measured approximately 4 times per day from a small

blood sample extracted by fingerstick. Fingerstick measurements of blood glucose density

are considered to be more accurate (and are more invasive) than the sensor-based estimates

(e.g. CGM). During the study, the result of each fingerstick measurement was manually

entered into the continuous monitoring device at the time of measurement; algorithms for

deriving continuous sensor-based estimates of blood glucose density, such as CGM, may use

the available fingerstick measurements for calibration purposes.

In the rest of this section, we show how NPMLE-based empirical Bayes methods can

be used to improve algorithms for estimating blood glucose density using the continuous

monitoring data. The basic idea is that after formulating a statistical model relating blood

glucose density to ISIG, we allow for the possibility that the model parameters may differ for

each subject, then use a training dataset to estimate the distribution of model parameters

across subjects (i.e. estimate $G_0$) via nonparametric maximum likelihood. This is illustrated


for two different statistical models in Sections 5.9.1–5.9.2.

Throughout the analysis below, we use fingerstick measurements as a proxy for the actual blood glucose density values. Let $\mathrm{FS}_j(t)$ and $\mathrm{ISIG}_j(t)$ denote the fingerstick blood glucose density and ISIG values, respectively, for the $j$-th subject at time $t$. Recall that $\mathrm{FS}_j(t)$ is measured, on average, once every 6 hours, while $\mathrm{ISIG}_j(t)$ is available every 5 minutes. Let $\mathcal{F}_t$ denote the $\sigma$-field of information available at time $t$ (i.e. all of the fingerstick and ISIG measurements taken before time $t$, plus $\mathrm{ISIG}_j(t)$). For each methodology, we use the first half of the available data for each subject to fit a statistical model relating $\mathrm{ISIG}_j(t)$ to $\mathrm{FS}_j(t)$, and then estimate each value $\mathrm{FS}_j(t)$ in the second half of the data using $\widehat{\mathrm{FS}}_j(t)$, an estimator based on $\mathcal{F}_t$. The performance of each method is measured by the average MSE on the test data, relative to the MSE of the proprietary estimator CGM.

5.9.1 Linear model

First we consider a basic linear regression model relating FS and ISIG,
$$\mathrm{FS}_j(t) = \mu_j + \beta_j \mathrm{ISIG}_j(t) + \sigma_j \varepsilon_j(t), \qquad (5.12)$$
where the $\varepsilon_j(t)$ are iid $N(0, 1)$ random variables and $\Theta_j = (\mu_j, \beta_j, \log(\sigma_j)) \in \mathbb{R}^3$ are unknown, subject-specific parameters. Three ways to fit (5.12) are (i) the combined model, where $\Theta_j = \Theta = (\mu, \beta, \log(\sigma))$ for all $j$, i.e. all of the subject-specific parameters are the same; (ii) the individual model, where $\Theta_1, \ldots, \Theta_p$ are all estimated separately from the corresponding subject's data; and (iii) the nonparametric mixture model, where $\Theta_j \sim G_0$ are iid draws from the $d = 3$-dimensional mixing distribution $G_0$. For each of these methods, we took $\widehat{\mathrm{FS}}_j(t) = \hat\mu_j + \hat\beta_j \mathrm{ISIG}_j(t)$, where $\hat\mu_j$ and $\hat\beta_j$ are the corresponding MLEs under the combined and individual models and, under the mixture model, $\hat\mu_j = E_{\hat G_\Lambda}(\mu_j \mid \mathcal{F}_t)$ and $\hat\beta_j = E_{\hat G_\Lambda}(\beta_j \mid \mathcal{F}_t)$. Results are reported in Table 5.5.
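A rough sketch of the mixture-model fit (iii) is given below. For simplicity, the posterior here conditions only on each subject's training data rather than the full filtration $\mathcal{F}_t$, the three-dimensional grid ranges are illustrative guesses, and the training pairs for subject $j$ are assumed to be stored as `train[[j]]` with columns `fs` and `isig`; `npmle_em` is the EM sketch from Section 5.5.

```r
# NPMLE fit for the linear model (5.12) over a (mu, beta, log sigma) grid (sketch).
grid <- expand.grid(mu   = seq(-50, 50, length.out = 20),
                    beta = seq(0, 10, length.out = 20),
                    lsig = seq(log(5), log(50), length.out = 10))

logL <- sapply(seq_len(nrow(grid)), function(k)
  vapply(train, function(d)
    sum(dnorm(d$fs, grid$mu[k] + grid$beta[k] * d$isig,
              exp(grid$lsig[k]), log = TRUE)), numeric(1)))
L   <- exp(logL - apply(logL, 1, max))
fit <- npmle_em(L)

post      <- L * rep(fit$weights, each = length(train)) / as.vector(L %*% fit$weights)
mu_post   <- as.vector(post %*% grid$mu)     # E(mu_j   | subject j's training data)
beta_post <- as.vector(post %*% grid$beta)   # E(beta_j | subject j's training data)
fs_hat    <- function(j, isig_t) mu_post[j] + beta_post[j] * isig_t
```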

5.9.2 Kalman filter

Substantial performance improvements are possible by allowing the model parameters

relating FS and ISIG to vary with time. In this section we consider the Gaussian state


space model (Kalman filter)
$$\mathrm{FS}_j(t_i) = \alpha_j(t_i)\,\mathrm{ISIG}_j(t_i) + \sigma_j \varepsilon_j(t_{i-1}), \qquad \alpha_j(t_i) = \alpha_j(t_{i-1}) + \tau_j \delta_j(t_{i-1}), \qquad (5.13)$$
where we assume that FS and ISIG are observed at times $t_1, \ldots, t_n$ and that $\varepsilon_j(t), \delta_j(t) \sim N(0, 1)$ are iid. In (5.13), $\{\alpha_j(t)\}$ are the state variables, which evolve according to a random walk, and $\Theta_j = (\log(\tau_j), \log(\sigma_j))$ are unknown parameters. Unlike (5.12), there is no intercept term in (5.13); dropping the intercept has previously been justified when using state space models to analyze glucose sensor data [e.g. 16]. The parameters $\sigma_j, \tau_j$ control how heavily recent observations are weighted when estimating $\alpha_j(t)$.

Similar to the analysis in Section 5.9.1, we fit (5.13) using (i) a combined model where $\Theta_j = \Theta$ for all $j$; (ii) an individual model where $\Theta_1, \ldots, \Theta_p$ are estimated separately; and (iii) a nonparametric mixture model, where $\Theta_j \sim G_0$ are iid draws from a $d = 2$-dimensional mixing distribution. Under (i)–(ii), $\sigma_j$ and $\tau_j$ are estimated by maximum likelihood and $\widehat{\mathrm{FS}}_j(t_i) = E\{\alpha_j(t_i) \mid \mathcal{F}_{t_i}\} \times \mathrm{ISIG}_j(t_i)$, where the conditional expectation is computed with respect to the Gaussian law governed by (5.13), with $\hat\sigma_j$ and $\hat\tau_j$ replacing $\sigma_j$ and $\tau_j$ (i.e. we use the Kalman filter). For the nonparametric mixture (iii), $\widehat{\mathrm{FS}}_j(t_i) = E_{\hat G_\Lambda}\{\alpha_j(t_i) \mid \mathcal{F}_{t_i}\} \times \mathrm{ISIG}_j(t_i)$, where the expectation is computed with respect to the model (5.13) and the estimated mixing distribution $\hat G_\Lambda$. Results are reported in Table 5.5.
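For reference, a minimal R sketch of the Kalman filter recursion for (5.13) for a single subject is given below; the function and argument names and the diffuse initialization are illustrative choices, and for simplicity only time points with a fingerstick observation are filtered.

```r
# Kalman filter for the state space model (5.13), one subject (sketch).
# fs, isig: fingerstick and ISIG values at the fingerstick times; sigma, tau: parameters
# (MLEs under (i)-(ii), or posterior means under the NPMLE mixture (iii)).
kalman_fs <- function(fs, isig, sigma, tau, a0 = 0, P0 = 1e4) {
  n <- length(fs)
  a <- a0; P <- P0
  fitted <- numeric(n)
  for (i in seq_len(n)) {
    P <- P + tau^2                  # predict: alpha(t_i) = alpha(t_{i-1}) + tau * delta
    fitted[i] <- a * isig[i]        # E{alpha(t_i) | F_{t_i}} * ISIG(t_i), before seeing fs[i]
    S <- isig[i]^2 * P + sigma^2    # innovation variance
    K <- P * isig[i] / S            # Kalman gain
    a <- a + K * (fs[i] - a * isig[i])
    P <- (1 - K * isig[i]) * P
  }
  fitted
}
```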

5.9.3 Comments on results

From Table 5.5, it is evident that the NPMLE mixture approach outperforms the individual

and combined methods for both the linear model and the Kalman filter/state space model.

The Kalman filter methods perform substantially better than the linear model, highlighting

the importance of time-varying parameters (scientifically, this is justified because the

sensitivity of the glucose sensor is known to change over time). Note that all of the relative

MSE values in Table 5.5 are greater than 1, indicating that CGM still outperforms all of

the methods considered here. Somewhat more ad hoc methods for estimating blood glucose

density that do outperform CGM are described in [16]; these methods (and CGM) leverage

additional data available to the continuous monitoring system, which is not described here


for the sake of simplicity. The methods in [16] are somewhat similar to the “combined”

Kalman filtering method from Section 5.9.2, where $\Theta_j = \Theta$ for all $j$; it would be interesting

to see if the performance of these methods could be further improved by using NPMLE

ideas.

5.10 Discussion

We have proposed a flexible, practical approach to fitting general multivariate mixing

distributions with NPMLEs and illustrated the effectiveness of this approach through

several real data examples. Theoretically, we proved that the support set of the NPMLE

is the convex hull of the MLEs when the likelihood $F_0$ comes from a class of elliptical

unimodal distributions. We believe that this approach may be attractive for many problems

where mixture models and empirical Bayes methods are relevant, offering both effective

performance and computational simplicity.


Table 5.1: Comparison of different NPMLE algorithms. Mean values (standard deviations in parentheses) reported from 100 independent datasets; $p = 1000$ throughout the simulations. Mixing distribution 1 has constant $\sigma_j$; mixing distribution 2 has correlated $\mu_j$ and $\sigma_j$.

  Algorithm                   $(q_1, q_2)$   TSE             ∆(log-lik)/10^4   Time (secs.)

Mixing dist. 1 (bivariate)
  EM                          (30, 30)       130.5 (42.6)        0 (0)              9
  EM                          (50, 50)       130.4 (42.7)        0 (0)             33
  EM                          (100, 100)     130.4 (42.6)        0 (0)            136
  Interior point (Rmosek)     (30, 30)       130.7 (42.6)        6 (1)              8
  Interior point (Rmosek)     (50, 50)       130.5 (42.9)        9 (1)             20
  Interior point (Rmosek)     (100, 100)     130.6 (42.8)       11 (1)             80
  Frank-Wolfe                 (30, 30)       147.3 (45.1)     -234 (130)             5
  Frank-Wolfe                 (50, 50)       147.0 (45.9)     -238 (134)            14
  Frank-Wolfe                 (100, 100)     146.2 (45.4)     -238 (128)            55

Mixing dist. 1 (univariate; $q = 300$; assume known $\sigma_j$)
  EM                                         124.4 (41.6)        0 (0)              1
  Interior point (Rmosek)                    124.3 (42.1)        6 (1)              3
  Interior point (REBayes)                   124.3 (42.1)        6 (1)              1
  Frank-Wolfe                                126.1 (41.8)       -4 (4)              1

Mixing dist. 2
  EM                          (30, 30)        54.0 (28.4)        0 (0)              9
  EM                          (50, 50)        54.0 (28.9)        0 (0)             34
  EM                          (100, 100)      53.9 (28.8)        0 (0)            141
  Interior point (Rmosek)     (30, 30)        54.3 (28.8)        5 (1)              8
  Interior point (Rmosek)     (50, 50)        54.2 (29.1)        8 (1)             20
  Interior point (Rmosek)     (100, 100)      54.3 (29.1)       10 (1)             82
  Frank-Wolfe                 (30, 30)        82.2 (39.3)     -372 (217)             5
  Frank-Wolfe                 (50, 50)        83.1 (36.4)     -402 (232)            14
  Frank-Wolfe                 (100, 100)      82.0 (37.7)     -396 (240)            56

Table 5.2: Mean TSE for various estimators of $\mu \in \mathbb{R}^p$ based on 100 simulated datasets; $p = 1000$. $(q_1, q_2)$ indicates the grid points used to fit $\hat G_\Lambda$.

  Method                                         Mixing dist. 1    Mixing dist. 2
  Fixed-$\Theta_j$ MLE                            997.0 (48.2)     1059.3 (57.9)
  Soft-thresholding                               826.2 (50.0)      793.7 (46.0)
  James-Stein                                     859.7 (43.6)      935.2 (53.0)
  SURE                                            859.7 (43.6)      880.7 (48.1)
  Univariate NPMLE ($q = 300$)                    170.7 (47.0)      285.4 (63.5)
  Bivariate NPMLE ($(q_1, q_2) = (100, 100)$)     130.4 (42.6)       53.9 (28.8)


Table 5.3: Baseball data. TSE relative to the naive estimator. The minimum error for each analysis is marked with an asterisk.

  Method              All      Non-Pitchers   Pitchers
  Naive               1         1              1
  Grand mean          0.85      0.38           0.13
  James-Stein         0.53      0.36           0.16
  GMLE                0.30      0.26           0.14
  SURE                0.41      0.26           0.08*
  Binomial mixture    0.59      0.31           0.16
  NPMLE               0.27*     0.25*          0.13

Table 5.4: Microarray data. Number of misclassification errors on test data.

  Dataset    Outcome      n_test   2d-NPMLE   1d-NPMLE   NP EBayes   Regularized LDA   Logistic lasso
  Breast     Response      100        15         36          47             30               18
  Breast     ER status     100        19         40          39             11               11
  Myeloma    OS            214        30         55         100             97               27
  Myeloma    EFS           214        34         76         100             63               32

Table 5.5: Blood glucose data. MSE relative to CGM.

               Linear model                          Kalman filter
  Combined    Individual    NPMLE        Combined    Individual    NPMLE
    1.56         1.54        1.51          1.05         1.07        1.03


Bibliography

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat., 40:2452–2482, 2012.

[2] Anestis Antoniadis. Comments on: $\ell_1$-penalization for mixture regression models. Test, 19(2):257–258, 2010.

[3] MOSEK ApS. MOSEK Rmosek Package. Release 8.0.0.46, 2015. URL https://mosek.com/resources/doc/.

[4] Pierre C Bellec, Guillaume Lecue, and Alexandre B Tsybakov. Slope meets lasso: improved oracle bounds and optimality. arXiv preprint arXiv:1605.08651, 2016.

[5] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.

[6] Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.

[7] D. Bohning. A review of reliable maximum likelihood algorithms for semiparametric mixture models. J. Stat. Plan. Infer., 47:5–28, 1995.

[8] D. Bohning, P. Schlattmann, and B. G. Lindsay. Computer-assisted analysis of mixtures (CA MAN): Statistical algorithms. Biometrics, 48:283–303, 1992.

[9] Patrick Breheny and Jian Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5:232–253, 2011.

[10] L. D. Brown. In-season prediction of batting averages: A field test of empirical Bayes and Bayes methodologies. Ann. Appl. Stat., 2:113–152, 2008.

[11] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.

[12] Emmanuel J Candes and Terence Tao. Decoding by linear programming. Information Theory, IEEE Transactions on, 51(12):4203–4215, 2005.

[13] A. DasGupta. Asymptotic Theory of Statistics and Probability. Springer, 2008.

[14] Abhirup Datta and Hui Zou. Cocolasso for high-dimensional error-in-variables regression. arXiv preprint arXiv:1510.07123, 2015.

[15] L. H. Dicker and S. D. Zhao. High-dimensional classification via nonparametric empirical Bayes and maximum likelihood inference. Biometrika, 103:21–34, 2016.


[16] L. H. Dicker, T. Sun, C.-H. Zhang, D. B. Keenan, and L. Shepp. Continuous blood glucose monitoring: A Bayes-hidden Markov approach. Stat. Sinica, 23:1595–1627, 2013.

[17] D. L. Donoho and G. Reeves. Achieving Bayes MMSE performance in the sparse signal + Gaussian white noise model when the noise level is unknown. In IEEE Int. Symp. Inf. Theory, pages 101–105, 2013.

[18] B. Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, 2010.

[19] B. Efron and C. Morris. Data analysis using Stein's estimator and its generalizations. J. Am. Stat. Assoc., 70:311–319, 1975.

[20] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

[21] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[22] Jianqing Fan, Han Liu, Qiang Sun, and Tong Zhang. Tac for sparse learning: Simultaneous control of algorithmic complexity and statistical error. arXiv preprint arXiv:1507.01037, 2015.

[23] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Ann. Stat., 1:209–230, 1973.

[24] M. Frank and P. Wolfe. An algorithm for quadratic programming. Nav. Res. Log., 3:95–110, 1956.

[25] Jerome Friedman, Trevor Hastie, Holger Hofling, Robert Tibshirani, et al. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

[26] S. Ghosal and A. W. Van der Vaart. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Stat., 29:1233–1263, 2001.

[27] E. Greenshtein and J. Park. Application of non parametric empirical Bayes estimation to high dimensional classification. J. Mach. Learn. Res., 10:1687–1704, 2009.

[28] J. Gu and R. Koenker. On a problem of Robbins. Int. Stat. Rev., 84:224–244, 2016.

[29] J. Gu and R. Koenker. Empirical Bayesball remixed: Empirical Bayes methods for longitudinal data. J. Appl. Econom., 2016. To appear.

[30] J. Gu and R. Koenker. Unobserved heterogeneity in income dynamics: An empirical Bayes perspective. J. Bus. Econ. Stat., 2016. To appear.

[31] I. B. Hirsch, J. Abelseth, B. W. Bode, J. S. Fischer, F. R. Kaufman, J. Mastrototaro, C. G. Parkin, H. A. Wolpert, and B. A. Buckingham. Sensor-augmented insulin pump therapy: Results of the first randomized treat-to-target study. Diabetes Technol. Ther., 10:377–383, 2008.


[32] Jian Huang and Cun-Hui Zhang. Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications. Journal of Machine Learning Research, 13:1809–1834, 2012.

[33] Junzhou Huang and Tong Zhang. The benefit of group sparsity. The Annals of Statistics, 38(4):1978–2004, 2010.

[34] P. J. Huber and E. M. Ronchetti. Robust Statistics, pages 172–175. Wiley, second edition, 2009.

[35] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. ICML 2013, 28:427–435, 2013.

[36] W. Jiang and C.-H. Zhang. General maximum likelihood empirical Bayes estimation of normal means. Ann. Stat., 37:1647–1684, 2009.

[37] W. Jiang and C.-H. Zhang. Empirical Bayes in-season prediction of baseball batting averages. In Borrowing Strength: Theory Powering Applications – A Festschrift for Lawrence D. Brown, pages 263–273. Institute of Mathematical Statistics, 2010.

[38] J. Kiefer and J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Stat., 27:887–906, 1956.

[39] R. Koenker and J. Gu. REBayes: An R package for empirical Bayes mixture methods, 2016.

[40] R. Koenker and I. Mizera. Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. J. Am. Stat. Assoc., 109:674–685, 2014.

[41] N. Laird. Nonparametric maximum likelihood estimation of a mixing distribution. J. Am. Stat. Assoc., 73:805–811, 1978.

[42] Guillaume Lecue and Shahar Mendelson. Sparse recovery under weak moment assumptions. arXiv preprint arXiv:1401.2188, 2014.

[43] B. G. Lindsay. Mixture Models: Theory, Geometry, and Applications. IMS, 1995.

[44] Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.

[45] Po-Ling Loh and Martin J Wainwright. Support recovery without incoherence: A case for nonconvex regularization. arXiv preprint arXiv:1412.5632, 2014.

[46] Po-Ling Loh and Martin J Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616, 2015.

[47] Karim Lounici, Massimiliano Pontil, Sara Van De Geer, and Alexandre B Tsybakov. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, pages 2164–2204, 2011.

[48] Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99:29–42, 2012.


[49] MAQC Consortium. The microarray quality control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol., 28:827–838, 2010.

[50] G. McLachlan and D. Peel. Finite Mixture Models. John Wiley & Sons, 2004.

[51] Nicolai Meinshausen and Peter Buhlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462, 2006.

[52] Ritwik Mitra and Cun-Hui Zhang. The benefit of group sparsity in group inference with de-biased scaled group lasso. Electronic Journal of Statistics, 10(2):1829–1873, 2016.

[53] O. Muralidharan. An empirical Bayes mixture method for effect size and false discovery rate estimation. Ann. Appl. Stat., 4:422–438, 2010.

[54] Yuval Nardi and Alessandro Rinaldo. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics, 2:605–633, 2008.

[55] Sahand Negahban, Pradeep K Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.

[56] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science, pages 538–557, 2012.

[57] Roberto Imbuzeiro Oliveira. The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv preprint arXiv:1312.2903, 2013.

[58] Michael R Osborne, Brett Presnell, and Berwin A Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis - Institute of Mathematics and its Applications, 20(3):389–404, 2000.

[59] Michael R Osborne, Brett Presnell, and Berwin A Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000.

[60] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.

[61] Stephen Reid, Robert Tibshirani, and Jerome Friedman. A study of error variance estimation in lasso regression. Statistica Sinica, 26:35–67, 2016.

[62] H. E. Robbins. A generalization of the method of maximum likelihood: Estimating a mixing distribution (abstract). Ann. Math. Stat., 21:314–315, 1950.

[63] H. E. Robbins. The empirical Bayes approach to statistical decision problems. In Proc. Third Berkeley Symp. on Math. Statist. and Prob., volume 1, pages 157–163, 1956.

[64] Mathieu Rosenbaum, Alexandre B Tsybakov, et al. Sparse recovery under matrix uncertainty. The Annals of Statistics, 38(5):2620–2651, 2010.


[65] Mathieu Rosenbaum, Alexandre B Tsybakov, et al. Improved matrix uncertainty selector. In From Probability to Statistics and Back: High-Dimensional Models and Processes – A Festschrift in Honor of Jon A. Wellner, pages 276–290. Institute of Mathematical Statistics, 2013.

[66] Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. Information Theory, IEEE Transactions on, 59(6):3434–3447, 2013.

[67] Nicolas Stadler, Peter Buhlmann, and Sara van de Geer. $\ell_1$-penalization for mixture regression models. Test, 19(2):209–256, 2010.

[68] T. Sun and C.-H. Zhang. Comments on: $\ell_1$-penalization for mixture regression models. Test, 19(2):270–275, 2010.

[69] Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, page ass043, 2012.

[70] Tingni Sun and Cun-Hui Zhang. Sparse matrix inversion with scaled lasso. The Journal of Machine Learning Research, 14(1):3385–3418, 2013.

[71] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[72] Joel Tropp et al. Just relax: Convex programming methods for identifying sparse signals in noise. Information Theory, IEEE Transactions on, 52(3):1030–1051, 2006.

[73] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

[74] Sara van de Geer. The deterministic lasso. 2007.

[75] Sara van de Geer. The lasso with within group structure. In Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in Honor of Professor Jana Jureckova, pages 235–244. Institute of Mathematical Statistics, 2010.

[76] Sara van de Geer and Alan Muro. On higher order isotropy conditions and lower bounds for sparse quadratic forms. Electronic Journal of Statistics, 8(2):3031–3061, 2014.

[77] Sara A van de Geer and Peter Buhlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

[78] Aad W van der Vaart and Jon A Wellner. Weak Convergence. Springer, 1996.

[79] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (lasso). Information Theory, IEEE Transactions on, 55(5):2183–2202, 2009.

[80] Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of Statistics, 42(6):2164, 2014.


[81] X. Xie, S. C. Kou, and L. D. Brown. SURE estimates for a heteroscedastic hierarchical model. J. Am. Stat. Assoc., 107:1465–1479, 2012.

[82] Fei Ye and Cun-Hui Zhang. Rate minimaxity of the lasso and Dantzig selector for the $\ell_q$ loss in $\ell_r$ balls. Journal of Machine Learning Research, 11(Dec):3519–3540, 2010.

[83] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

[84] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, pages 894–942, 2010.

[85] Cun-Hui Zhang and Jian Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, pages 1567–1594, 2008.

[86] Tong Zhang. Some sharp performance bounds for least squares regression with $\ell_1$ regularization. The Annals of Statistics, 37(5A):2109–2144, 2009.

[87] Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1087–1107, 2010.

[88] Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.

[89] Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509, 2008.