A regularization approach for estimation and variable selection in high dimensional regression models

Y. Dendramis*   L. Giraitis†   G. Kapetanios‡
Abstract
Model selection and estimation are important topics in econometric analysis which
can become considerably complicated in high dimensional settings, where the set of
possible regressors can become larger than the set of available observations. For large
scale problems, penalized regression methods (e.g. the Lasso) have become the de facto
benchmark that can effectively trade off parsimony and fit. In this paper we introduce
a regularized estimation and model selection approach that is based on sparse large
covariance matrix estimation, introduced by Bickel and Levina (2008) and extended by
Dendramis et al (2017). We provide asymptotic and small sample results that indicate
that our approach can be an important alternative to the penalized regression. More-
over, we also introduce a number of extensions that can improve the asymptotic and
small sample performance of the proposed method. The usefulness of what we propose
is illustrated via Monte Carlo exercises and an empirical application in macroeconomic
forecasting.
Keywords: large dimensional regression, sparse matrix, thresholding, shrinkage,
model selection.
* Yiannis Dendramis acknowledges support from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 708501.
* University of Cyprus, Department of Accounting and Finance, email: [email protected]
† Queen Mary University of London, Department of Economics and Finance, email: [email protected]
‡ King's College London, King's Business School, email: [email protected]
1 Introduction
In this paper we study the problem of model selection and estimation for high dimensional
datasets. The recent advent of large datasets has made this problem particularly important
as the information which is extracted from large datasets is a key to various scientific dis-
coveries or policy suggestions. In large datasets, conventional statistical and econometric
techniques such as regression estimation fail to work consistently due to the dimensional-
ity of the examined economic relationships. For instance, in a linear economic relationship
we frequently have T observations of a dependent variable (y) as a function of many poten-
tial predictors (p predictors). When the number of predictors (p) is large or larger than the
temporal dimension (T) then a regression with all available covariates becomes extremely
problematic if not impossible. In this paper we suggest new, theoretically valid techniques
to handle such situations.
Two main strands have dominated the relevant literature so far: the penalized regression
methods (see e.g. Tibshirani (1996), Fan and Li (2001), Zhou and Hastie (2005), Lv and Fan (2009))
and the greedy methods (see e.g. Friedman (2001)), whose origins come from the machine
learning literature. In the penalized regression methods all the available regressors are
considered at the same time, while a penalty function shrinks the estimated parameter
coefficients through the choice of a tuning parameter. The main drawback of this approach is
that the applied researcher has to choose a specific penalty function and a tuning parameter.
Different penalty functions lead to diverse theoretical and small sample performance,
while in applied research there is no way to choose an optimal penalty function in advance.
In the greedy methods, the idea is to screen in advance the covariates which are considered
more likely to explain the dependent variable. This is done based on the individual ability
of each regressor to explain the dependent variable. Machine learning methods such as
boosting, regression trees, and stepwise regressions fall into this category. The main
drawback of this strand of the literature is the scarcity of theoretical results for most
of the methods.
In a recent paper Fan and Lv (2008) have proposed a combination of the two previous
strands. They propose a two-step procedure in which, in the first step, all regressors are
ranked according to their absolute correlations with the dependent variable, and a fixed
proportion of regressors is selected. In the second step, a penalized regression approach is
applied to the selected subset of regressors.
The main idea of our approach is to first propose a theoretically valid regularized ordinary
least squares (OLS) estimator on the full set of available regressors. This is done through
a regularized (thresholded) estimate of the covariances that enables application to large
datasets. As we will show, this new estimator retains good theoretical properties even in
the case that the number of available regressors is increasingly large, allowing us to extract
the true signal covariates and discard the noise variables. Crucially, all the available
regressors are considered in a single regression step. To enhance the small sample performance
of this basic estimator we provide extensions related to the hypothesis testing literature.
To this end we provide an asymptotically valid testing procedure which can be used as an
extended screening mechanism that extracts the true signals and significantly improves
the small sample performance of the basic estimator. This is done through a Bonferroni-type
critical value, specialized to minimize the Type I error of the multiple testing procedure.
As a final step, all covariates that are found significant can be treated as joint
determinants of the dependent variable in a multiple regression setting.
Our approach provides some important advantages over the alternatives. First, we do
not need to rely on and choose a penalty function from the large set of proposals
available in the literature. Second, we do not need to examine each regressor separately,
as is done in greedy methods. Third, our method does not need standardized regressors as
penalized regression methods do. Fourth, it can be extended to an inferential procedure that
can be directly linked to classical statistical analysis. Fifth, it is theoretically valid. And
sixth, it can be adapted to the Fan and Lv (2008) method, resulting in possible gains in
the small sample performance.
The paper is structured as follows: Section 2 introduces the basic estimator for the large
dimensional regression problem and presents its asymptotic properties. Section 3 extends
the basic estimator, by proposing a testing approach that is able to enhance the performance
of the model selection and estimation mechanism. Theoretical results are also derived in
this section. Section 4 presents a number of further extensions and improvements that can
be useful in diverse real world situations. Section 5 presents in detail the cross validation
schemes which are essential for the application of the proposed methodologies. Section
6 gives the details of the Monte Carlo experiments and the simulation results. Section 7
presents the empirical application, and Section 8 concludes. Technical proofs are relegated
to appendices.
2 Estimation by regularization
We consider the case where we observe a sample $\{y_t\}_{t=1,\dots,T}$ of a dependent variable and a $p$-variate sample $\{x_t\}_{t=1,\dots,T}$ of possible regressors. Given the increased availability of large datasets, we assume that $p$ and $T$ are increasingly large, while it is possible to have $p > T$. In this setting, the dependent variable $y_t$ is assumed to be explained by a relatively small subset of regressors from the set $\{x_{it}\}_{i=1,\dots,p}$, where $x_{it}$ is the $t$-th observation ($t \in \{1,\dots,T\}$) for covariate $i$ ($i \in \{1,\dots,p\}$). In vector notation the dependent variable $y$ and the possible regressors $\{x_i\}_{i=1,\dots,p}$ are $T$-dimensional column vectors, the $\{x_t\}_{t=1,\dots,T}$ are $p$-dimensional row vectors, and in matrix form the $T \times p$ matrix $x$ includes all available potential covariates.
Our aim is to find the subset of covariates $\{x_i\}_{i=1,\dots,p}$ that truly explain the dependent variable $y$, and to estimate their corresponding regression coefficients. This is a model selection and estimation problem for high-dimensional regression, and it rests on the estimation of a sparse regression model of the form
$$y_t = \sum_{i=1}^{p} \beta_i x_{it} + u_t, \qquad u_t \sim N(0, \sigma_u^2), \qquad t = 1, \dots, T. \tag{1}$$
Without loss of generality we assume that the first $k$ regressors truly explain the dependent variable $y_t$ ($\{\beta_i\}_{i=1}^{k} \neq 0$) and the last $p-k$ are pure noise variables ($\{\beta_i\}_{i=k+1}^{p} = 0$). Estimating the parameter vector $\beta_i$ for each regressor $i$ in model (1) by classical OLS is not a feasible option, because the good theoretical properties of OLS rely on the assumption that $p$ is small and fixed. The problem is further complicated when we allow for non-zero cross sectional correlation, which is frequently observed in large panels of macroeconomic and/or financial series. Non-sparse correlation structures of $x_t$ will inflate the correlation between the dependent variable $y$ and a possible regressor $x_i$ that does not truly explain $y$, increasing the uncertainty about the true regression model for $y$.
Our approach relies on the estimation of the large dimensional covariance matrix of the $p+1\,(=p_z)$ dimensional random vector $z_t = (y_t, x_t)$, denoted $\Sigma_z$, of dimension $p_z \times p_z$, which is considered sparse, or approximately sparse. To this end, $\Sigma_z$ is assumed to belong to the following class of matrices:
$$\Sigma_z \in \mathcal{Q}\left(n_{p_z}, K\right) = \left\{ \Sigma : \sigma_{ii} \le K, \ \max_i \sum_{j=1}^{p_z} 1\left(\sigma_{ij} \neq 0\right) \le n_{p_z} \right\}. \tag{2}$$
The sparsity parameter $n_{p_z}$ is the maximum number of non-zero row elements of $\Sigma_z$, and it does not grow too fast with $p$. Bickel and Levina (2008) and Cai and Liu (2011) provide threshold and adaptive threshold estimators, respectively, of $\Sigma_z$ under an iid and light-tailed assumption for the random variable $z_t$. In a recent paper by Dendramis, Giraitis and Kapetanios (2017) it is shown that the theoretical results of Bickel and Levina (2008) remain valid when $z_t$ is a stationary $\alpha$-mixing process. Moreover, it is proven that $z_t$ can also be allowed to be a heavy-tailed random variable. The latter result comes at a cost in the rate at which $p_z$ is allowed to increase relative to $T$.
In the hard thresholding estimation approach $\Sigma_z$ is estimated by
$$T_\lambda\left(\widehat{\Sigma}_z\right) = \left( \widehat{\sigma}_{ij}\, I\left(|\widehat{\sigma}_{ij}| > \lambda_{ij}\right) \right), \tag{3}$$
with $\widehat{\sigma}_{ij} = [\widehat{\Sigma}_z]_{ij}$, $\widehat{\Sigma}_z = T^{-1} \sum_{t=1}^{T} (z_t - \bar{z})'(z_t - \bar{z})$, $\bar{z} = T^{-1} \sum_{t=1}^{T} z_t$, and $I(|\widehat{\sigma}_{ij}| > \lambda_{ij}) = 1$ when $|\widehat{\sigma}_{ij}| > \lambda_{ij}$ and $0$ otherwise. Estimator (3) sets a lower bound for the elements of the estimated covariance $\widehat{\Sigma}_z$: elements of $\widehat{\Sigma}_z$ below that bound are set equal to zero. In Bickel and Levina (2008) the threshold $\lambda_{ij}$ is universal for all $i, j$, with
$$\lambda_{ij} = \lambda = \kappa \sqrt{\frac{\log p_z}{T}}, \qquad \kappa > 0. \tag{4}$$
In Cai and Liu (2011) the threshold $\lambda_{ij}$ is adapted to each entry $i, j$ of the matrix $\widehat{\Sigma}_z$, as follows:
$$\lambda_{ij} = \kappa \sqrt{\frac{\widehat{\theta}_{ij} \log p_z}{T}}, \qquad \kappa > 0, \qquad \widehat{\theta}_{ij} = \frac{1}{T} \sum_{t=1}^{T} \left[ (z_{i,t} - \bar{z}_i)(z_{j,t} - \bar{z}_j) - \widehat{\sigma}_{ij} \right]^2. \tag{5}$$
Both the universal and the adaptive threshold $\lambda_{ij}$ depend on a tuning parameter $\kappa$ which does not affect the asymptotic performance of the estimator, but can have a significant impact on its small sample behaviour. In practice $\kappa$ can be chosen through cross validation (see also the following sections).
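For concreteness, the thresholding step (3)-(5) can be coded in a few lines. The following is a minimal numpy sketch, assuming the observations of $z_t = (y_t, x_t)$ are stacked in a $T \times p_z$ array; the function name and the default $\kappa$ are illustrative choices, and $\kappa$ would in practice come from the cross validation of Section 5.

```python
import numpy as np

def threshold_cov(z, kappa=1.0, adaptive=False):
    """Hard-thresholded covariance estimator T_lambda(Sigma_z_hat), cf. (3)-(5).

    z        : (T, p_z) array of observations of z_t = (y_t, x_t)
    kappa    : tuning constant kappa > 0, to be chosen by cross validation
    adaptive : False -> universal threshold (4); True -> entry-adaptive threshold (5)
    """
    T, pz = z.shape
    zc = z - z.mean(axis=0)                      # demeaned observations
    S = zc.T @ zc / T                            # sample covariance Sigma_z_hat
    if adaptive:
        # theta_hat_ij = (1/T) sum_t [(z_it - zbar_i)(z_jt - zbar_j) - sigma_hat_ij]^2
        theta = np.einsum('ti,tj->ij', zc**2, zc**2) / T - S**2
        theta = np.maximum(theta, 0.0)           # guard against tiny negative values
        lam = kappa * np.sqrt(theta * np.log(pz) / T)
    else:
        lam = kappa * np.sqrt(np.log(pz) / T) * np.ones((pz, pz))
    return S * (np.abs(S) > lam)                 # keep only entries above the threshold
```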
Given the theoretically valid regularized covariance matrix estimator of $z_t$ for large $p_z$ and $T$, we propose the regularized estimator of $\beta$ as
$$\beta_\lambda = T_\lambda\left(\widehat{\Sigma}_x\right)^{-1} T_\lambda\left(\widehat{\Sigma}_{xy}\right), \tag{6}$$
with $T_\lambda(\widehat{\Sigma}_x)$ and $T_\lambda(\widehat{\Sigma}_{xy})$ being the submatrix and subvector of the thresholded estimate $T_\lambda(\widehat{\Sigma}_z)$ that correspond to the covariance matrix of the regressors $x_t$ and to the covariance vector of $y_t$ with $x_t$, respectively. The operator $T_\lambda(\cdot)$ is defined in (3).
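Given a thresholded covariance of $z_t$, the estimator (6) amounts to partitioning that matrix and solving one linear system. A minimal sketch, reusing the illustrative threshold_cov helper above:

```python
import numpy as np

def beta_regularized(y, X, kappa=1.0, adaptive=False):
    """Regularized estimator (6): beta_lambda = T_lambda(Sigma_x)^{-1} T_lambda(Sigma_xy)."""
    z = np.column_stack([y, X])                  # z_t = (y_t, x_t), shape (T, p+1)
    S_thr = threshold_cov(z, kappa, adaptive)    # thresholded covariance of z_t
    S_x = S_thr[1:, 1:]                          # block corresponding to the regressors
    S_xy = S_thr[1:, 0]                          # block corresponding to cov(x_t, y_t)
    # Note: when p > T, S_x may fail to be positive definite; Section 4.3 discusses
    # shrinkage towards a target matrix to guarantee invertibility.
    return np.linalg.solve(S_x, S_xy)
```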
In Theorem 1 we prove consistency, with specific rates of convergence, of the basic estimator (6) under the following set of assumptions.

Assumption A1 The error term $u_t$ in model (1) is a martingale difference sequence (MDS) with respect to the filtration $\mathcal{F}^u_{t-1} = \sigma(u_{t-1}, u_{t-2}, \dots)$, with zero mean and constant variance $0 < \sigma_u^2 < C < \infty$.

Assumption A2 There exist sufficiently large positive constants $C_0, C_1, C_2, C_3$ and $s_u, s_z > 0$ such that $z_{it}$ and $u_t$ satisfy the tail conditions
$$\Pr(|z_{it}| \ge \zeta) \le C_0 \exp(-C_1 \zeta^{s_z}), \tag{7}$$
$$\Pr(|u_t| \ge \zeta) \le C_2 \exp(-C_3 \zeta^{s_u}). \tag{8}$$
This can be weakened to a polynomial rate, following Dendramis et al (2017), but at the cost of a smaller $p$.

Assumption A3 The regressors are uncorrelated with the errors, $E(x_{it} u_t \mid \mathcal{F}_{t-1}) = 0$ for all $t = 1, \dots, T$, $i = 1, \dots, p$, where $\mathcal{F}_{t-1} = \mathcal{F}^u_{t-1} \cup \mathcal{F}^x_{t-1} = \sigma(u_{t-1}, u_{t-2}, \dots) \cup \{\cup_{j=1}^{p} \sigma(x_{jt-1}, x_{jt-2}, \dots)\}$.

Assumption A4 The $x_{it}$, for $t = 1, \dots, T$, $i = 1, \dots, p$, are either non-stochastic or random with finite 8th order moments.

Assumption A5 The number of true regressors $k$ is finite, or at least $k < T$.

Assumption A6 $E(x_{it} x_{jt} - E(x_{it} x_{jt}) \mid \mathcal{F}_{t-1}) = 0$, for all $i, j$.

Assumption 1 allows for some dependence, e.g. through an ARCH process, but no serial correlation. Assumption 2 can be relaxed to allow the vector $z_t$ to be heavy-tailed. This is a departure from the exponentially declining bound for the probability tails that is usual in the literature, but it comes at the cost of slower rates in the dimension $p_z$. Assumption 3 is an MDS assumption on $x_{it} u_t$ with respect to $\mathcal{F}_{t-1}$. The rest of the assumptions are technical requirements which are necessary for the proof of Theorem 1 and are analogous to the assumptions used in the recent literature (see e.g. Chudik et al (2018)).
Theorem 1 Consider the DGP defined in (1), and suppose that Assumptions 1-6 hold. Then for sufficiently large $\kappa$, the regularized estimator $\beta_\lambda$ satisfies
$$\left\| \beta_\lambda - \beta \right\| = O_p\left( n_{p_z}^{3/2} \lambda_{p_z} \right). \tag{9}$$
Also, the probability bound of $\| \beta_\lambda - \beta \|$ is given by
$$\Pr\left( \left\| \beta_\lambda - \beta \right\| > n_{p_z}^{3/2} \lambda_{p_z} \right) \le p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right), \tag{10}$$
for constants $C_1, C_2, C_3, C_4$.
Theorem 1 states that under reasonable assumptions on the DGP of model (1), the proposed estimator converges to the true $\beta$. This means that asymptotically we will be able to extract the true signals ($\beta_i \neq 0$) and discard the noise variables ($\beta_i = 0$).
3 Testing to enhance the basic regularized estimator
In order to further boost the performance of the basic regularized estimator $\beta_\lambda$ we introduce a testing procedure. To this end, given the estimate $\beta_\lambda$, for each covariate $x_i$, $i \in \{1, \dots, p\}$, we define the ratio
$$\xi_{\lambda,i} = \frac{\beta_{\lambda,i}}{\widehat{\gamma}_{\lambda,i}}, \tag{11}$$
with $\widehat{\gamma}_{\lambda,i} = \sqrt{ s^2 \left[ \left( T \cdot T_\lambda\left(\widehat{\Sigma}_{xx}\right) \right)^{-1} \right]_{ii} }$, $s^2 = \frac{1}{T} \sum_{t=1}^{T} \widehat{u}_t^2$, $\widehat{u}_t = y_t - x_t \widehat{\beta}_\lambda$, and $[A]_{ij}$ the $(i,j)$-th element of the matrix $A$, for $i = 1, \dots, p$. The data-implied parameter $\widehat{\gamma}_{\lambda,i}$ is a normalization constant and is the vehicle that allows us to introduce the testing procedure. When $p$ is small relative to $T$ (e.g. when $p < T^{1/2}$), $\widehat{\gamma}_{\lambda,i}$ is the standard error of the estimate $\beta_i$, but in general it is not. The testing procedure is then defined as
but in general it is not. Then, the testing procedure is defined as
I�\βλ,i 6= 0
�=
(1 when ξλ,i > cp
0 otherwise(12)
with cp = Φ�1�
1� αcpπ
�, cp = O
�(π log (p))1/2
�= o (Tc0 ), 8c0 > 0, Φ�1 (.) is the quan-
tile function of the normal distribution and α is the nominal size of the individual test, to
be set by the researcher (e.g. α = 1% or 5%), and the constants π, c satisfy π > 0, c < ∞.
The critical value cp is a Bonferoni type critical value controlling the family wise error of
the multiple testing problem to be at the level of α%. The constants c, π do not affect the
asymptotic results but in small samples the testing performance can be boosted by these
parameters. In practise on can set them both 1, or choose them through cross validation.
In our empirical and simulation exercise we set them both equal to 1.
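A minimal sketch of the screening test (11)-(12), with $\pi = c = 1$ as in the simulations; the helper names and the use of scipy's normal quantile are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def screen_by_testing(y, X, beta_lam, S_x_thr, alpha=0.05, pi=1.0, c=1.0):
    """Return a boolean mask of covariates retained by the test (12).

    beta_lam : regularized estimates from (6)
    S_x_thr  : the thresholded covariance T_lambda(Sigma_x_hat) used to build them
    """
    T, p = X.shape
    u_hat = y - X @ beta_lam                        # residuals of the regularized fit
    s2 = np.mean(u_hat ** 2)                        # s^2 = (1/T) sum_t u_hat_t^2
    # gamma_hat_i = sqrt( s^2 [ (T * T_lambda(Sigma_x))^{-1} ]_ii )
    gamma = np.sqrt(s2 * np.diag(np.linalg.inv(T * S_x_thr)))
    xi = beta_lam / gamma                           # the ratios xi_{lambda,i} of (11)
    c_p = norm.ppf(1.0 - alpha / (c * p ** pi))     # Bonferroni-type critical value c_p
    return np.abs(xi) > c_p                         # keep covariate i when |xi_i| > c_p
```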
The testing procedure works as follows: when $\xi_{\lambda,i}$ is bounded in $T$, $x_{it}$ is considered a noise variable. In the opposite case, $\xi_{\lambda,i}$ diverges with $T$, meaning that $x_{it}$ has power in explaining $y_t$, i.e. $x_{it}$ can be considered a true signal variable. In the large dimensional regression literature, two critical measures of how well a method can separate the true regressors from the noise variables are the true positive rate (TPR) and the false positive rate (FPR), defined below:
$$TPR_{p,T} = \frac{\#\left\{ i : \beta_{\lambda,i} \neq 0 \text{ and } \beta_i \neq 0 \right\}}{\#\left\{ i : \beta_i \neq 0 \right\}}, \qquad FPR_{p,T} = \frac{\#\left\{ i : \beta_{\lambda,i} \neq 0 \text{ and } \beta_i = 0 \right\}}{\#\left\{ i : \beta_i = 0 \right\}}. \tag{13}$$
$TPR_{p,T}$ measures the percentage of true signals that we correctly identify as true signals (it should converge to 1), while $FPR_{p,T}$ measures the percentage of noise variables that we incorrectly identify as true signals (it should converge to zero). We enrich the methodological contribution of Theorem 1 by proposing the following two-step procedure:

1. estimate the parameter vector of model (1) by the basic estimator $\beta_{\lambda,i}$ defined in (6);

2. filter the covariates $\{x_i\}_{i=1}^{p}$ that truly explain $y_t$ by the testing approach (12).

Theorem 2 proves that this two-step procedure results in $TPR_{p,T}$ and $FPR_{p,T}$ that asymptotically converge to 1 and 0 respectively, meaning that for suitable rates of $p$ and $T$ we can correctly identify the true model.
Theorem 2 Consider the DGP defined in (1), and suppose that Assumptions 1-6 hold. Then the previously defined two-step procedure identifies the true model; equivalently, for the true positive rate we have
$$E\left( TPR_{p,T} \right) = k^{-1} \sum_{i=1}^{k} \Pr\left( |\xi_{\lambda,i}| > c_T \mid \beta_i \neq 0 \right) \ge 1 - O\left( p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right) \right),$$
or equivalently $TPR_{p,T} \to_p 1$, and for the false positive rate
$$E\left( FPR_{p,T} \right) = (p-k)^{-1} \sum_{i=k+1}^{p} \Pr\left( |\xi_{\lambda,i}| > c_T \mid \beta_i = 0 \right) = \frac{1}{p-k}\, O\left( p_z^2 \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + p_z \exp\left( -C_4^* T \lambda_{p_z} \right) \right) = O\left( p_z \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + \exp\left( -C_4^* T \lambda_{p_z} \right) \right),$$
or equivalently $FPR_{p,T} \to_p 0$.
As a final theoretical contribution we also provide Theorem 3, which states that if the previously defined two-step procedure is followed by a final regression step, the selected covariates and the parameter vector converge to the true ones asymptotically.

Theorem 3 Consider the DGP defined in (1), and suppose that Assumptions 1-6 hold. Consider the following three-step procedure:

1. estimate the parameter vector of model (1) by the basic estimator $\beta_{\lambda,i}$ defined in (6);

2. filter the covariates $\{x_i\}_{i=1}^{p}$ that truly explain $y_t$ by the testing approach (12); let $x_{t,-J}$ denote $x_t$ excluding the columns $J = \left\{ j : I\left( \widehat{\beta_{\lambda,j} \neq 0} \right) = 0 \right\}$; the remaining columns are considered the important covariates of model (1);

3. run a final regression on the $x_{t,-J}$ covariates, $\beta_{\lambda,f} = \left( x_{T,-J}' x_{T,-J} \right)^{-1} x_{T,-J}' y_T$.

For sufficiently large constants $C_1, C_2, C_3, C_4$, the rate of convergence of the resulting estimator $\beta_{\lambda,2s}$ is given by
$$\left\| \beta_{\lambda,2s} - \beta \right\| = O_p\left( \frac{1}{\sqrt{T}} \right) + O_p\left( p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right) \right).$$
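Putting the pieces together, the three-step procedure of Theorem 3 is: regularized estimation, testing, and a final OLS regression on the retained covariates. A compact sketch, reusing the illustrative helpers introduced above:

```python
import numpy as np

def three_step_estimator(y, X, kappa=1.0, alpha=0.05):
    """Three-step procedure of Theorem 3 (illustrative sketch)."""
    T, p = X.shape
    z = np.column_stack([y, X])
    S_thr = threshold_cov(z, kappa)                        # thresholded covariance of z_t
    S_x, S_xy = S_thr[1:, 1:], S_thr[1:, 0]
    beta_lam = np.linalg.solve(S_x, S_xy)                  # step 1: basic estimator (6)
    keep = screen_by_testing(y, X, beta_lam, S_x, alpha)   # step 2: testing (12)
    beta_final = np.zeros(p)
    if keep.any():                                         # step 3: OLS on the retained set
        beta_final[keep] = np.linalg.lstsq(X[:, keep], y, rcond=None)[0]
    return beta_final, keep
```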
4 Extensions and improvements
4.1 The sure screening extension
The Sure Independence Screening (SIS) developed by Fan and Lv (2008) is a simple and powerful method, with optimal properties, for variable selection when $p$ and $T$ are large. According to this method, the marginal correlations of the potential covariates $\{x_i\}_{i=1}^{p}$ with the dependent variable $y_t$ are ranked by absolute value. The $k^*$ covariates with the highest absolute correlation are considered the most likely true regressors of the dependent variable $y_t$, with $k^* \ll p$. As a final step the researcher can use a penalized regression method (e.g. Lasso) to estimate the parameter vector $\beta$ in (1). In practice the researcher has to choose the threshold $k^*$, which sets an upper limit on the number of likely covariates, and the final-step penalized regression method.

In our framework the SIS can deliver the dimensionality reduction that is important for better small sample performance. To this end, we first screen the important variables by SIS for a predetermined level of $k^*$. Then, we apply the basic estimator $\beta_{\lambda,i}$ defined in (6) to the selected $k^*$ regressors from the set $\{x_i\}_{i=1}^{p}$ and the dependent variable $y_t$. Alternatively, one can also apply the developed testing methodology (see Theorem 2) and the final regression step (see Theorem 3). A sketch of this combination is given below.
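A minimal sketch of the SIS pre-screening step combined with the basic estimator; $k^*$ and the helper names are illustrative choices, and the testing and final regression steps can be applied to the screened set in the same way.

```python
import numpy as np

def sis_then_regularized(y, X, k_star, kappa=1.0):
    """Sure Independence Screening followed by the regularized estimator (6)."""
    # rank covariates by the absolute marginal correlation with y
    corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
    top = np.argsort(-np.abs(corr))[:k_star]          # indices of the k_star strongest covariates
    beta_sub = beta_regularized(y, X[:, top], kappa)  # basic estimator on the screened set
    beta = np.zeros(X.shape[1])
    beta[top] = beta_sub
    return beta, top
```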
4.2 Non sparse covariance matrix estimator
The proposed basic estimator $\beta_{\lambda,i}$ defined in (6) depends crucially on the large dimensional covariance matrix of $z_t = (y_t, x_t)$, which is assumed to be sparse. Sparsity of $\Sigma_z$ is an essential assumption, necessary for acceptable rates of convergence of $\beta_{\lambda,i}$. Other consistent covariance matrix estimators have been proposed in the literature which do not require condition (2) for $\Sigma_z$. This is an important development, as there are applications in economics in which the sparsity assumption might be inappropriate, owing to the presence of common factors in $z_t$. In Fan et al (2013) the leading assumption for $\Sigma_z$ is that the first $q$ eigenvalues are sufficiently spiked and grow fast, at a rate $O(p)$. Then the common and idiosyncratic components of $z_t$ can be identified and, in addition, principal component analysis (PCA) on the sample covariance matrix can consistently estimate the space spanned by the eigenvectors of $\Sigma_z$. Let $\widehat{\mu}_1 \ge \widehat{\mu}_2 \ge \dots \ge \widehat{\mu}_p$ be the ordered eigenvalues of the sample covariance matrix $\widehat{\Sigma}_z$ and $\{\widehat{\nu}_j\}_{j=1}^{p}$ their corresponding eigenvectors. The sample covariance of $z_t$ has the spectral decomposition
$$\widehat{\Sigma}_z = \sum_{i=1}^{q} \widehat{\mu}_i \widehat{\nu}_i \widehat{\nu}_i' + \widehat{R}_q, \tag{14}$$
where $\widehat{R}_q = \sum_{i=q+1}^{p} \widehat{\mu}_i \widehat{\nu}_i \widehat{\nu}_i' = (\widehat{r}_{ij})_{p \times p}$ is the principal orthogonal complement, and $q$ is the number of diverging eigenvalues of $\Sigma_z$. When $\widehat{R}_q$ is sparse, the estimator of $\Sigma_z$ is defined as
$$\widehat{\Sigma}_{z,q} = \sum_{i=1}^{q} \widehat{\mu}_i \widehat{\nu}_i \widehat{\nu}_i' + T_\lambda\left( \widehat{R}_q \right), \tag{15}$$
with $T_\lambda(\cdot)$ being the regularized estimator defined in (3) and the threshold $\lambda_{ij}$ defined according to (4) or (5). In this setting, for a given value of $q$, the basic estimator $\beta_\lambda$ defined in (6) can be trivially adjusted to the large covariance estimator $\widehat{\Sigma}_{z,q}$ given in (15), by replacing (3) with (15). When $q$ is unknown, it can be calibrated via cross validation, or chosen via information criteria for the optimal $\widehat{q}$ (see e.g. Bai and Ng (2002)).
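A sketch of the factor-adjusted estimator (15): extract the top-$q$ principal components of the sample covariance, threshold only the orthogonal complement, and recombine. The interface mirrors the illustrative threshold_cov helper above.

```python
import numpy as np

def threshold_cov_factor(z, q, kappa=1.0):
    """Covariance estimator (15): spiked part plus thresholded residual R_q."""
    T, pz = z.shape
    zc = z - z.mean(axis=0)
    S = zc.T @ zc / T                                 # sample covariance Sigma_z_hat
    mu, V = np.linalg.eigh(S)                         # eigenvalues in ascending order
    mu, V = mu[::-1], V[:, ::-1]                      # reorder so that mu_1 >= mu_2 >= ...
    spiked = (V[:, :q] * mu[:q]) @ V[:, :q].T         # sum_{i<=q} mu_i v_i v_i'
    R_q = S - spiked                                  # principal orthogonal complement
    lam = kappa * np.sqrt(np.log(pz) / T)             # universal threshold (4)
    return spiked + R_q * (np.abs(R_q) > lam)
```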
Preliminary simulations for this extension do not suggest significant improvements over the basic estimator and testing procedure as defined in Sections 2 and 3.
4.3 Computational aspects and invertibility conditions
The proposed basic estimator $\beta_\lambda$ involves several computational challenges. The main challenge is the computation of the inverse $T_\lambda(\widehat{\Sigma}_x)^{-1}$ of a large covariance matrix. When $p > T$, $T_\lambda(\widehat{\Sigma}_x)$ is not necessarily positive definite (PD), and thus $T_\lambda(\widehat{\Sigma}_x)^{-1}$ need not exist. When the tuning parameter $\kappa$ of the threshold $\lambda_{ij}$ in (4) or (5) is unknown and needs to be calibrated via cross validation, conditions on the minimum eigenvalue of $T_\lambda(\widehat{\Sigma}_x)$ can be imposed in an ad hoc way. For instance, for a given grid of points $\kappa \in \{\kappa_1, \kappa_2, \dots, \kappa_N\}$, one can consider only the points which result in PD matrices $T_\lambda(\widehat{\Sigma}_x)$. This ensures the existence of $T_\lambda(\widehat{\Sigma}_x)^{-1}$. In the case that $\kappa$ is known, or when there is no a priori reason to discard specific values of $\kappa$, an alternative that yields a PD $T_\lambda(\widehat{\Sigma}_x)$ is to consider a convex combination of $T_\lambda(\widehat{\Sigma}_x)$ and a well-defined target matrix which is known to be PD. Simple target matrices such as the identity matrix $I_p$, or the diagonal matrix $\mathrm{diag}(\widehat{\Sigma}_x)$, have been proposed in the literature. The shrinkage-to-target approach is formally given as
$$T_{\lambda,s}\left(\widehat{\Sigma}_x\right) = \rho_s \Sigma_{tar} + (1 - \rho_s)\, T_\lambda\left(\widehat{\Sigma}_x\right), \tag{16}$$
with $\Sigma_{tar}$ being the target matrix and $\rho_s \in (\rho_{inv}, 1)$, $0 < \rho_{inv} < 1$, where $\rho_{inv}$ is the minimum shrinkage parameter that ensures the positive definiteness of $T_{\lambda,s}(\widehat{\Sigma}_x)$.
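A minimal sketch of the shrinkage fix (16): increase $\rho_s$ on a grid until the smallest eigenvalue of the combination is safely positive. The diagonal target and the tolerance are illustrative choices.

```python
import numpy as np

def shrink_to_target(S_thr, target=None, tol=1e-8):
    """Convex combination (16) ensuring positive definiteness of T_{lambda,s}(Sigma_x)."""
    if target is None:
        target = np.diag(np.diag(S_thr))              # diagonal target diag(Sigma_x_hat)
    for rho in np.linspace(0.0, 1.0, 101):            # smallest rho on a grid that yields a PD matrix
        S_s = rho * target + (1.0 - rho) * S_thr
        if np.linalg.eigvalsh(S_s).min() > tol:
            return S_s, rho
    return target, 1.0                                # fall back to the target itself
```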
Moreover, the problem of inverting a large covariance matrix, which is essential for the proposed $\beta_\lambda$, can be tackled from another interesting direction. Cai et al (2011) suggest a constrained $\ell_1$ minimization method for estimating a sparse inverse covariance matrix. The authors provide interesting theoretical and simulation results for data which are iid random variables, and show that their method is fast and easily implemented by linear programming. Preliminary simulations for the (16) and Cai et al (2011) extensions do not suggest significant improvements over the basic estimator and testing procedure as defined in Sections 2 and 3.
5 Implementation and Cross Validation Schemes
The application of the proposed methods requires the selection of a number of tuning parameters. For instance, the basic estimator $\beta_\lambda$ given in (6) depends on the value of the $\kappa$ parameter, while its extensions, adapted to (15) or (16), require the selection of additional tuning parameters. Moreover, in the developed testing approach one can also calibrate the $\pi$ parameter of the Bonferroni critical value, as is often done in the literature. To obtain values for all these parameters we propose a simple cross validation scheme whose objective focuses directly on the linear regression model (1). This is done through K-fold cross validation.

In this approach the original sample is randomly partitioned into $K$ roughly equally sized subsamples. Let $\zeta_i$, $i \in \{1, \dots, K\}$, denote these subsamples. Pick a partition $\zeta_i$ and use the $K-1$ partitions $\zeta_j$, $j \neq i$, $j \in \{1, \dots, K\}$, to estimate the regression coefficients. For a given vector of tuning parameters $\gamma$, let $\widehat{\beta}_{\gamma,j}$ denote this estimate. The discarded partition $\zeta_i$ is then used as a testing sample, on which we compute the average squared error
$$e_{i,\gamma} = \frac{1}{\#\{\text{obs. in partition } \zeta_i\}} \sum_{t \in \zeta_i} \left( y_t - x_t \widehat{\beta}_{\gamma,j} \right)^2. \tag{17}$$
This is repeated for all partitions $\zeta_i$, $i \in \{1, \dots, K\}$. Finally, the optimal tuning parameter is obtained by minimizing the objective
$$\widehat{\gamma} = \arg\min_{\gamma \in \Gamma} \frac{1}{K} \sum_{i=1}^{K} e_{i,\gamma} \tag{18}$$
over a suitable parameter space $\Gamma$ for the vector $\gamma$. When each partition of the K-fold scheme contains a single observation, with $T$ being the number of observations, we obtain leave-one-out cross validation. For instance, in the basic estimator $\beta_\lambda$ we have $\gamma \equiv \lambda$, and the parameter space $\Gamma$ includes the $\kappa$-points which imply invertibility of $T_\lambda(\widehat{\Sigma}_x)$.
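A sketch of the K-fold scheme (17)-(18) for the tuning constant $\kappa$ of the basic estimator; the grid, the fold construction, and the handling of non-invertible thresholded matrices are illustrative choices.

```python
import numpy as np

def cv_kappa(y, X, kappa_grid, K=10, seed=0):
    """Choose kappa by minimizing the K-fold average squared error (18)."""
    T = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(T), K)      # random, roughly equal partitions zeta_i
    scores = []
    for kappa in kappa_grid:
        errs = []
        for i in range(K):
            test = folds[i]
            train = np.setdiff1d(np.arange(T), test)
            try:
                b = beta_regularized(y[train], X[train], kappa)
            except np.linalg.LinAlgError:              # discard kappa values with singular T_lambda(Sigma_x)
                errs = [np.inf]
                break
            errs.append(np.mean((y[test] - X[test] @ b) ** 2))   # e_{i,gamma} of (17)
        scores.append(np.mean(errs))
    return kappa_grid[int(np.argmin(scores))]          # arg min of (18)
```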
6 A Monte Carlo study
To examine the small sample performance of our methodological contributions we carry out an extensive Monte Carlo study. We allow for a wide set of distributional assumptions for the covariates, chosen to be plausible for macroeconomic and/or financial time series. We then examine the ability of each method to disentangle the true regressors of the dependent variable, and to recover the true parameter vector.
The literature on large dimensional regression has been dominated by penalized regression methods, chief among them the Lasso and Adaptive Lasso. These can be viewed as extensions of the multiple regression, where the vector of regression coefficients is given as a solution to the optimization problem
$$\widehat{\beta} = \arg\min_{\beta} \sum_{t=1}^{T} \left( y_t - \sum_{i=1}^{p} \beta_i x_{it} \right)^2 + P(\beta, \lambda), \tag{19}$$
where $P(\beta, \lambda)$ is a function of the tuning parameter $\lambda$ that penalizes the complexity of the coefficient vector $\beta$. In empirical applications the tuning parameter $\lambda$ is calibrated by cross validation. In the celebrated Lasso regression, $P(\beta, \lambda)$ is $\lambda$ times the $L_1$ norm of the vector $\beta$. The $L_1$ geometry of the solution implies that the Lasso performs variable selection, in the sense that an estimated component can be exactly zero. In the adaptive Lasso (see e.g. Zhou and Hastie (2005)) the $L_1$ norm is replaced by a re-weighted version which diminishes the estimation bias implied by the Lasso.
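For reference, the penalized regression benchmark (19) with the $L_1$ penalty can be run with standard software; a minimal scikit-learn sketch (an illustrative choice of implementation, not the one used for the results reported below):

```python
from sklearn.linear_model import LassoCV

def lasso_benchmark(y, X, cv=10, seed=0):
    """Lasso (19) with the penalty chosen by 10-fold cross validation."""
    model = LassoCV(cv=cv, random_state=seed).fit(X, y)
    return model.coef_, model.alpha_       # estimated coefficients and selected penalty
```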
In our simulation experiments we generate the data from model (1), assuming that $y_t$ is truly explained by the first 4 columns of $x_t$ (i.e. $k = 4$) and $\beta_i = 1$ for $i = 1, \dots, 4$. Since most economic data support the inclusion of a constant in model (1), we also include $\beta_0 = 1$. For all methods the inclusion of the constant is assumed to be known; that is, the TPR and FPR do not account for the constant but only for the remaining $p$ possible regressors.

The tuning parameters of the methods are calibrated using the approaches presented in the previous section. The Lasso and Adaptive Lasso parameters are calibrated using 10-fold cross validation, as this is considered the literature's benchmark, while for the rest of the proposed methods leave-one-out cross validation is used. The Adaptive Lasso method is applied as described in section 2.8.4 of Buhlmann and van de Geer (2011).
We then use 500 simulation replications to measure the average values of the TPR and FPR defined in (13), the average RMSE of the parameter vector, defined as
$$RMSE = \sqrt{ \frac{1}{p} \sum_{i=1}^{p} \left( \widehat{\beta}_i - \beta_i \right)^2 }, \tag{20}$$
the average probability of observing exactly the true model, defined as
$$TrueModel = I\left( \left\{ \sum_{i=1}^{k} I\left( \widehat{\beta}_i \neq 0 \right) = k \right\} \cap \left\{ \sum_{i=k+1}^{p} I\left( \widehat{\beta}_i \neq 0 \right) = 0 \right\} \right), \tag{21}$$
with $I(\widehat{\beta}_i \neq 0) = 1$ when $\widehat{\beta}_i \neq 0$ and zero otherwise, and the average out-of-sample root mean squared forecast error, computed at the $(T+1)$-th simulated observation and defined as
$$RMSFE = \sqrt{ \left( y_{T+1} - \widehat{y}_{T+1} \right)^2 }. \tag{22}$$
For all these measures of performance the large set of covariates $x_t$ is generated according to a wide range of distributional designs, presented below, and a rich set of $T$, $p$, $R^2$ parameters.
6.1 Designs for regressors
6.1.1 dgp 1: naive case
In the naive case we assume that the covariates are generated as $x_t \sim N(0, I_p)$, for $t = 1, \dots, T$. This is the simplest case we examine, in which the regressors are uncorrelated.
6.1.2 dgp 43: temporally uncorrelated, weakly collinear covariates
In the temporally uncorrelated, weakly collinear covariates case we assume that $x_t$ is generated from
$$x_{it} = (\varepsilon_{it} + g_t)/\sqrt{2}, \ \ i = 1, \dots, 4, \qquad x_{5t} = \varepsilon_{5t}, \qquad x_{it} = (\varepsilon_{it} + \varepsilon_{i-1,t})/\sqrt{2}, \ \ i > 5, \tag{23}$$
with $g_t \sim N(0,1)$, $\varepsilon_{it} \sim N(0,1)$. In this case there is correlation among the true signal variables, and among the noise variables, but no correlation between the true signals and the noise regressors.
6.1.3 dgp 52: pseudo signal variables
In the pseudo signal variables case we assume that $x_t$ is generated from
$$x_{it} = (\varepsilon_{it} + g_t)/\sqrt{2}, \ \ i = 1, \dots, 4, \qquad x_{5t} = \varepsilon_{5t} + \kappa x_{1t}, \quad x_{6t} = \varepsilon_{6t} + \kappa x_{2t}, \ \ \kappa = 1.33, \qquad x_{it} = (\varepsilon_{it} + \varepsilon_{i-1,t})/\sqrt{2}, \ \ i > 6, \tag{24}$$
with $g_t \sim N(0,1)$, $\varepsilon_{it} \sim N(0,1)$. Here we allow for all types of correlation between the variables: correlation among the true signals, correlation among the noise variables, and correlation between the true signals and the noise variables.
6.1.4 dgp 44: strongly collinear noise variables
In the case of strongly collinear noise variables, due to a persistent and unobserved common factor, we assume that
$$x_{it} = (\varepsilon_{it} + g_t)/\sqrt{2}, \ \ i = 1, \dots, 4, \qquad x_{5t} = (\varepsilon_{5t} + b_5 f_t)/\sqrt{3}, \qquad x_{it} = \left( (\varepsilon_{it} + \varepsilon_{i-1,t})/\sqrt{2} + b_i f_t \right)/\sqrt{3}, \ \ i > 5, \tag{25}$$
with $b_i \sim N(1,1)$, $f_t = 0.95 f_{t-1} + \sqrt{1 - 0.95^2}\, v_t$, $v_t \sim N(0,1)$.
6.1.5 dgp 45: temporally correlated and weakly collinear covariates
In the temporally correlated and weakly collinear covariates case we assume that
$$x_{it} = (\varepsilon_{it} + g_t)/\sqrt{2}, \ \ i = 1, \dots, 4, \qquad x_{5t} = \varepsilon_{5t}, \qquad x_{it} = (\varepsilon_{it} + \varepsilon_{i-1,t})/\sqrt{2}, \ \ i > 5,$$
$$\varepsilon_{it} = \rho \varepsilon_{it-1} + \sqrt{1 - \rho^2}\, v_{it}, \qquad v_{it} \sim N(0,1), \qquad \rho = 0.5.$$
6.1.6 dgp 51: all covariates are collinear with the true signals
Finally, we examine the case in which all variables are collinear with the true signals:
$$x_t \sim N(0, \Sigma_p), \qquad \sigma_{ij} = 0.5^{|i-j|}, \qquad 1 \le i, j \le p.$$
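For concreteness, a sketch of how two of the designs can be generated, together with the dependent variable of model (1); dgp 1 and dgp 51 are shown, the remaining designs follow analogously from the equations above, and the error variance below is an illustrative stand-in for the value implied by the target $R^2$.

```python
import numpy as np

def dgp_1(T, p, rng):
    """dgp 1: uncorrelated covariates x_t ~ N(0, I_p)."""
    return rng.standard_normal((T, p))

def dgp_51(T, p, rng):
    """dgp 51: all covariates collinear with the true signals, sigma_ij = 0.5^{|i-j|}."""
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    return rng.multivariate_normal(np.zeros(p), Sigma, size=T)

# simulate y_t from model (1) with k = 4 true signals, beta_i = 1 and a constant beta_0 = 1
rng = np.random.default_rng(0)
T, p, k = 200, 200, 4
X = dgp_51(T, p, rng)
beta = np.zeros(p)
beta[:k] = 1.0
sigma_u = 1.0                                   # placeholder; chosen to match the target R^2
y = 1.0 + X @ beta + sigma_u * rng.standard_normal(T)
```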
6.2 Summary of simulation results
A number of interesting conclusions can be drawn from Tables 1 to 6. First, in all experiments considered, the Lasso and Adaptive Lasso fail to identify exactly the true model, as their FPR is consistently positive. This result is not surprising, as it is known that these two methods tend to select models larger than the true one. Second, in the vanilla case of uncorrelated covariates, the methods that we propose overall outperform the Lasso/Adaptive Lasso; the larger T and R², the better the performance. Only for small T (T = 150) and small R² (R² = 0.3) can the Lasso methods provide some gains. Third, for the temporally uncorrelated, weakly collinear covariates (43), R-BL-t-r seems to be the best method, and this is not affected by T or R²; the CL version of the estimator seems to provide some gains here over the universal thresholding of BL. Fourth, looking at the pseudo signals case (52), R-BL-t-r works well for large and medium T, while for small T it needs the help of the sure screening. Fifth, in the case that all covariates are collinear with the true signals (51), R-BL-t-r or S-R-BL-t-r provide gains over all T and R²; the CL method here fails to outperform the Lasso in terms of RMSE, but its performance is comparable with the other methods in terms of the remaining metrics. Sixth, the good results of R-BL-t-r and S-R-BL-t-r carry over to the case of strongly collinear noise variables due to a persistent unobserved common factor (44), and to the temporally correlated and weakly collinear covariates (45); in the last DGP the CL version of the regularized estimator also works well. To sum up, (S)-R-BL(CL) is a good choice when there is significant sparsity, while overall (S)-R-BL-t-r seems to be the most reliable method.
7 Empirical illustration
As an empirical illustration, we perform an out-of-sample forecasting exercise for two macroeconomic series using a large set of available covariates. We consider the quarterly data from Stock and Watson (2012), which consist of a set of 109 macroeconomic series. With this set of possible regressors, augmented with 4 lags of the dependent variable, we forecast the nondurable goods price index and the household operation price index. To this end, we use a rolling window of 120 observations as an estimation sample, and forecast data points from Dec-1991 until the end of the sample at Dec-2009. The first observation of our sample is Dec-1961. As in the Monte Carlo section, the tuning parameter is calibrated by leave-one-out cross validation for the proposed methods, while for the Lasso and Adaptive Lasso we use 10-fold cross validation, which is considered the literature's benchmark. We compare our results with the Lasso and Adaptive Lasso, the autoregressive model of order d (AR(d)), where d is selected by the Bayesian information criterion (BIC), and the factor augmented AR(d) (FAAR(d)), where the number of factors is chosen by the Bai and Ng (2002) information criterion.

As is visible from Table 7, there are forecasting gains from using the proposed methods. For the nondurable goods price index the Lasso is almost equivalent to the proposed methods, while for the household operation price index the proposed methods perform better. As is visible from Figures 1 and 2, the Lasso/Adaptive Lasso choose larger forecasting models, with a different set of important covariates, across the sequential out-of-sample forecasts.
8 Conclusion
In this paper we have considered the problem of model specification and estimation in large dimensional settings, where there is a large set of likely regressors for a dependent variable yt but only a small number of covariates truly explain it. We provided an alternative to penalized regression methods and greedy methods, one that can be related to classical statistical analysis. A fundamental ingredient of our theoretical contributions is the existence of a sparse covariance matrix. Our theoretical, simulation and empirical results provide evidence that the methods we propose deliver important gains over the methods currently applied in the literature.
References
Bai, J. and Ng, S. (2002) Determining the number of factors in approximate factor models.
Econometrica, 70, 191–221
Belloni, A., V. Chernozhukov, and C. Hansen (2014a): “High-Dimensional Methods and
Inference on Structural and Treatment Effects,” Journal of Economic Perspectives, 28, 29-
50.
Belloni, A., V. Chernozhukov, and C. Hansen (2014b): “Inference on Treatment Effects after Selection among High-Dimensional Controls,” Review of Economic Studies, 81, 608-650.
Bickel, P., and E. Levina (2008): “Covariance Regularization by Thresholding,” The Annals
of Statistics, 36, 2577–2604.
Buhlmann, P., and S. van de Geer (2011): Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.
Cai, T., and W. Liu (2011): “Adaptive Thresholding for Sparse Covariance Matrix Estima-
tion,” Journal of the American Statistical Association, 106, 672–684.
Cai, T., Liu W., and Luo, X. (2011): A Constrained l1 Minimization Approach to Sparse
Precision Matrix Estimation, Journal of the American Statistical Association, 106, 594-607.
Chudik, A., G. Kapetanios, and M. H. Pesaran (2018): “A One-Covariate at a Time, Multiple Testing Approach to Variable Selection in High-Dimensional Linear Regression Models,” Econometrica.
Dendramis, Y., L. Giraitis, and G. Kapetanios (2017): “Estimation of random coefficient
time varying covariance matrices for large datasets,” Mimeo QM.
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004): “Least Angle Regression,” An-
nals of Statistics, 32, 407-499.
Fan, J., and R. Li (2001): “Variable Selection via Nonconcave Penalized Likelihood and Its
Oracle Properties,” Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., and J. Lv (2008): “Sure Independence Screening for Ultra-High Dimensional Feature
Space,” Journal of Royal Statistical Society B, 70, 849–911.
Fan, J., Liao, Y. and Mincheva, M. (2013): “Large covariance estimation by thresholding
principal orthogonal complements, “ Journal of the Royal Statistical Society: Series B 75
1–44.
Fan, J., and C. Tang (2013): “Tuning parameter selection in high dimensional penalized
likelihood,” Journal of the Royal Statistical Society Series B, 75, 531–552.
Friedman, J., T. Hastie, and R. Tibshirani (2000): “Additive Logistic Regression: A Statisti-
cal View of Boosting,” Annals of Statistics, 28, 337-374.
Friedman, J. (2001): “Greedy Function Approximation: A Gradient Boosting Machine,”
Annals of Statistics, 29, 1189–1232.
Holm, S. (1979): “A simple sequentially rejective multiple test procedure,” Scandinavian
Journal of Statistics, 6, 65–70.
Lv, J., and Y. Fan (2009): “A Unified Approach to Model Selection and Sparse Recovery
Using Regularized Least Squares,” Annals of Statistics, 37, 3498–3528.
Tibshirani, R. (1996): “Regression Shrinkage and Selection via the Lasso,” Journal of the
Royal Statistical Society B, 58, 267-288.
Zhou, H., and T. Hastie (2005): “Regularization and Variable Selection via the Elastic Net,”
Journal of the Royal Statistical Society B, 67, 301-320.
Proof of Theorem 1: Consider the true parameter vector of model (1), given by $\beta = \Sigma_x^{-1} \Sigma_{xy}$. We then have
$$\widehat{\beta} - \beta = T_\lambda(\widehat{\Sigma}_x)^{-1} T_\lambda(\widehat{\Sigma}_{xy}) - \Sigma_x^{-1} \Sigma_{xy} = T_\lambda(\widehat{\Sigma}_x)^{-1} T_\lambda(\widehat{\Sigma}_{xy}) - T_\lambda(\widehat{\Sigma}_x)^{-1} \Sigma_{xy} + T_\lambda(\widehat{\Sigma}_x)^{-1} \Sigma_{xy} - \Sigma_x^{-1} \Sigma_{xy}$$
$$= T_\lambda(\widehat{\Sigma}_x)^{-1} \left( T_\lambda(\widehat{\Sigma}_{xy}) - \Sigma_{xy} \right) + \left( T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \right) \Sigma_{xy}. \tag{26-28}$$
So
$$\| \widehat{\beta} - \beta \| \le \left( \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| + \| \Sigma_x^{-1} \| \right) \| T_\lambda(\widehat{\Sigma}_{xy}) - \Sigma_{xy} \| + \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| \, \| \Sigma_{xy} \|. \tag{29-31}$$
From Theorem 1 in Dendramis, Giraitis and Kapetanios (2017), for $\lambda_{p_z} = M \sqrt{\log p_z / T}$, $n_{p_z}$ the sparsity of $\Sigma_z$, and $p_z = p + 1$,
$$\| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| = O_p(n_{p_z} \lambda_{p_z}), \qquad \| T_\lambda(\widehat{\Sigma}_{xy}) - \Sigma_{xy} \| = O_p(n_{p_z} \lambda_{p_z}), \tag{32-33}$$
while from Lemma 1, $\| \Sigma_x^{-1} \| = O(1)$.
For the term $\| \Sigma_{xy} \|$, first notice that
$$\| \Sigma_z \|_F^2 = \frac{1}{T^2} \sum_{i=1}^{p} \sum_{j=1}^{p} \left( E\left( \sum_{t=1}^{T} (z_{it} - E(z_{it})) (z_{jt} - E(z_{jt})) \right) \right)^2 = \frac{1}{T^2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sum_{t=1}^{T} \sum_{t'=1}^{T} E\left[ (z_{it} - E(z_{it})) (z_{jt} - E(z_{jt})) \right] E\left[ (z_{it'} - E(z_{it'})) (z_{jt'} - E(z_{jt'})) \right] = O(n_{p_z}^2), \tag{34-36}$$
given that $\Sigma_z$ has at most $n_{p_z}$ non-zero row elements and $E[(z_{it} - E(z_{it}))(z_{jt} - E(z_{jt}))]$ is bounded for all $i, j, t$, which is satisfied when $w_{ijt}$ is a stationary $\alpha$-mixing process. It then follows that
$$\| \Sigma_{xy} \| = \left( \frac{1}{T^2} \sum_{i=1}^{p} \left( E\left( \sum_{t=1}^{T} (x_{it} - E(x_{it})) (y_t - E(y_t)) \right) \right)^2 \right)^{1/2} = \left( \frac{1}{T^2} \sum_{i=1}^{p} \sum_{t=1}^{T} \sum_{t'=1}^{T} E\left[ (x_{it} - E(x_{it})) (y_t - E(y_t)) \right] E\left[ (x_{it'} - E(x_{it'})) (y_{t'} - E(y_{t'})) \right] \right)^{1/2} = O\left( \sqrt{n_{p_z}} \right). \tag{37}$$
It then follows that
$$\| \widehat{\beta} - \beta \| = \left( O_p(n_{p_z} \lambda_{p_z}) + O(1) \right) O_p(n_{p_z} \lambda_{p_z}) + O_p(n_{p_z} \lambda_{p_z}) O\left( \sqrt{n_{p_z}} \right), \tag{38}$$
or equivalently
$$\| \widehat{\beta} - \beta \| = O_p\left( n_{p_z}^2 \lambda_{p_z}^2 \right) + O_p\left( n_{p_z}^{3/2} \lambda_{p_z} \right). \tag{39}$$
To find the probability bound for $\| \widehat{\beta} - \beta \|$ we work as follows:
$$\Pr\left( \| \widehat{\beta} - \beta \| > q \right) \le \Pr\left( \left( \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| + \| \Sigma_x^{-1} \| \right) \| T_\lambda(\widehat{\Sigma}_{xy}) - \Sigma_{xy} \| + \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| \, \| \Sigma_{xy} \| > q \right) \tag{40}$$
$$\le \Pr\left( \left( \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| + \| \Sigma_x^{-1} \| \right) \| T_\lambda(\widehat{\Sigma}_{xy}) - \Sigma_{xy} \| > \tfrac{q}{2} \right) + \Pr\left( \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| \, \| \Sigma_{xy} \| > \tfrac{q}{2} \right) \tag{41-42}$$
$$\le \Pr\left( \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| + \| \Sigma_x^{-1} \| > \tfrac{q}{2 C_1} \right) + \Pr\left( \| T_\lambda(\widehat{\Sigma}_{xy}) - \Sigma_{xy} \| > C_1 \right) + \Pr\left( \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| > \tfrac{q}{2 C_2} \right) + \Pr\left( \| \Sigma_{xy} \| > C_2 \right) \tag{43-44}$$
$$\le \Pr\left( \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| > \tfrac{q}{4 C_1} \right) + \Pr\left( \| \Sigma_x^{-1} \| > \tfrac{q}{4 C_1} \right) + \Pr\left( \| T_\lambda(\widehat{\Sigma}_{xy}) - \Sigma_{xy} \| > C_1 \right) + \Pr\left( \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| > \tfrac{q}{2 C_2} \right) + \Pr\left( \| \Sigma_{xy} \| > C_2 \right). \tag{45-46}$$
Omitting the constants we have the following. When $C_1 \ge n_{p_z} \lambda_{p_z}$,
$$\Pr\left( \| T_\lambda(\widehat{\Sigma}_{xy}) - \Sigma_{xy} \| > C_1 \right) < p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right). \tag{47}$$
When $C_2$ is large, such that $C_2 \ge \sqrt{n_{p_z}}$,
$$\Pr\left( \| \Sigma_{xy} \| > C_2 \right) = 0. \tag{48}$$
When $\tfrac{q}{2 C_2} \ge n_{p_z} \lambda_{p_z} \Rightarrow q \ge 2 C_2 n_{p_z} \lambda_{p_z} \Rightarrow q \ge 2 n_{p_z}^{3/2} \lambda_{p_z}$, we have
$$\Pr\left( \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| > \tfrac{q}{2 C_2} \right) < p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right). \tag{49-50}$$
When $C_1$ is an increasing sequence such that $\tfrac{q}{4 C_1} \ge 1 \Rightarrow C_1 \le \tfrac{q}{4}$, we have
$$\Pr\left( \| \Sigma_x^{-1} \| > \tfrac{q}{4 C_1} \right) = 0. \tag{51-52}$$
When $\tfrac{q}{4 C_1} \ge n_{p_z} \lambda_{p_z} \Rightarrow C_1 \le \tfrac{q}{4 n_{p_z} \lambda_{p_z}}$, we have
$$\Pr\left( \| T_\lambda(\widehat{\Sigma}_x)^{-1} - \Sigma_x^{-1} \| > \tfrac{q}{4 C_1} \right) < p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right). \tag{53-54}$$
For $q = n_p^{3/2} \lambda_p$ all inequalities hold for a proper choice of the constants $C_1, C_2$, and overall we have
$$\Pr\left( \| \widehat{\beta} - \beta \| > n_p^{3/2} \lambda_p \right) \le p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right). \tag{55}$$
For the variance of the error term $u$, first consider
$$\widehat{u} = y - x \widehat{\beta} = x \beta + u - x \widehat{\beta} = x (\beta - \widehat{\beta}) + u, \tag{56}$$
and
$$\widehat{u}' \widehat{u} = u' u + 2 u' x (\beta - \widehat{\beta}) + (\beta - \widehat{\beta})' x' x (\beta - \widehat{\beta}). \tag{57}$$
So
$$\left| T^{-1} \widehat{u}' \widehat{u} - \sigma_u^2 \right| \le (\widehat{\beta} - \beta)' \tfrac{1}{T} x' x (\widehat{\beta} - \beta) + 2 \left| (\widehat{\beta} - \beta)' \tfrac{1}{T} x' u \right| + \left| T^{-1} u' u - \sigma_u^2 \right| \tag{58-60}$$
$$\le \| \widehat{\beta} - \beta \|^2 \left( \| \widehat{\Sigma}_x - \Sigma_x \| + \| \Sigma_x \| \right) + 2 \| \widehat{\beta} - \beta \| \, \| \widehat{\Sigma}_{xu} \| + \left| T^{-1} u' u - \sigma_u^2 \right|. \tag{61-62}$$
Since $E \| \widehat{\Sigma}_x - \Sigma_x \|_F^2 = O(p^2 / T)$ from Lemma 2, $E \| \widehat{\Sigma}_x - \Sigma_x \|_F \le ( E \| \widehat{\Sigma}_x - \Sigma_x \|_F^2 )^{1/2} = O(p / \sqrt{T})$, and since convergence in $L_1$ norm implies convergence in probability, $\| \widehat{\Sigma}_x - \Sigma_x \|_F = O_p(p / \sqrt{T})$, while $\| \Sigma_x \| = O(1)$ by Lemma 1. (63-65)
Also, since $u$ is a martingale difference and by Assumption 3,
$$E \left\| \tfrac{1}{T} x' u \right\|_F^2 = \frac{1}{T^2} \sum_{i=1}^{p} E\left( \sum_{t=1}^{T} x_{it} u_t \right)^2 = \frac{1}{T^2} \sum_{i=1}^{p} \sum_{t=1}^{T} \sum_{t'=1}^{T} E\left( x_{it} u_t x_{it'} u_{t'} \right) = O\left( \frac{p}{T} \right), \qquad \| \widehat{\Sigma}_{xu} \|_F = \left\| \tfrac{1}{T} x' u \right\|_F = O_p\left( \sqrt{\frac{p}{T}} \right). \tag{66-67}$$
We then have
$$\left| T^{-1} \widehat{u}' \widehat{u} - \sigma_u^2 \right| \le \left( O_p(n_{p_z}^2 \lambda_{p_z}^2) + O_p(n_{p_z} \lambda_{p_z} \sqrt{n_{p_z}}) \right)^2 \left( O_p\left( \tfrac{p}{\sqrt{T}} \right) + O(1) \right) + \left( O_p(n_{p_z}^2 \lambda_{p_z}^2) + O_p(n_{p_z} \lambda_{p_z} \sqrt{n_{p_z}}) \right) O\left( \sqrt{\tfrac{p}{T}} \right) + O\left( T^{-1/2} \right) \tag{68-69}$$
$$= O_p\left( n_{p_z}^4 \lambda_{p_z}^4 \tfrac{p}{\sqrt{T}} \right) + O_p\left( n_{p_z}^3 \lambda_{p_z}^2 \tfrac{p}{\sqrt{T}} \right) + O_p\left( n_{p_z}^2 \lambda_{p_z}^2 \sqrt{\tfrac{p}{T}} \right) + O_p\left( n_{p_z}^{3/2} \lambda_{p_z} \sqrt{\tfrac{p}{T}} \right). \tag{70-71}$$
The leading term of the previous formula is $O_p\left( n_{p_z}^3 \lambda_{p_z}^2 p / \sqrt{T} \right)$, i.e.
$$\left| T^{-1} \widehat{u}' \widehat{u} - \sigma_u^2 \right| = O_p\left( n_{p_z}^3 \lambda_{p_z}^2 \frac{p}{\sqrt{T}} \right). \tag{72}$$
Lemma 1 (Lemma A4 in the OCMT paper): Let $A_T = (a_{ijT})$ be a symmetric $p \times p$ matrix with eigenvalues $\mu_1 \le \mu_2 \le \dots \le \mu_p$ such that, for $\varepsilon > 0$ independent of $p$, $\varepsilon \le \mu_1 \le \mu_p \le 1/\varepsilon < \infty$. Then $\| A_T \| = O(1)$ and $\| A_T^{-1} \| = O(1)$.
Proof: $\| A_T \|_2 = \mu_p = O(1)$, so $\| A_T \| = O(1)$; also $\| A_T^{-1} \|_2 = 1/\mu_1 = O(1)$. (73-75)
Lemma 2 (Lemma 11 in the OCMT paper): Let $S_a$, $S_b$ be $T \times p_a$ and $T \times p_b$ matrices of observations and suppose that the $\{s_{a,it}, s_{b,jt}\}$ are either non-stochastic and bounded, or random with finite 8th order moments. Consider the sample covariance $\widehat{\Sigma}_{ab} = T^{-1} \widetilde{S}_a' \widetilde{S}_b$, $\widetilde{S}_b = S_b - \tfrac{1}{T} \iota \iota' S_b$ (with $\iota$ the $T \times 1$ vector of ones), and denote its expectation by $\Sigma_{ab} = T^{-1} E(\widetilde{S}_a' \widetilde{S}_b)$. Let $w_{ijt} = (s_{a,it} - \bar{s}_{a,i})(s_{b,jt} - \bar{s}_{b,j}) - E\left[ (s_{a,it} - E(s_{a,i})) (s_{b,jt} - E(s_{b,j})) \right]$ and suppose that
$$\sup_{i,j} \left( \sum_{t=1}^{T} \sum_{t'=1}^{T} E\left( w_{ijt} w_{ijt'} \right) \right) = O(T). \tag{76}$$
Then
$$E \left\| \widehat{\Sigma}_{ab} - \Sigma_{ab} \right\|_F^2 = O\left( \frac{p_a p_b}{T} \right). \tag{77}$$
Proof: The expectations $E(w_{ijt} w_{ijt'})$ and $E(w_{ijt} w_{ijt'} w_{i'j's} w_{i'j's'})$ exist by the assumption that the $\{s_{a,it}, s_{b,it}\}$ have finite 8th moments or are non-stochastic. The $(i,j)$ element of $\widehat{\Sigma}_{ab} - \Sigma_{ab}$ is given by $a_{ij} = T^{-1} \sum_{t=1}^{T} w_{ijt}$. Then
$$E \left\| \widehat{\Sigma}_{ab} - \Sigma_{ab} \right\|_F^2 = \sum_{i=1}^{p_b} \sum_{j=1}^{p_a} E\left( a_{ij}^2 \right) = T^{-2} \sum_{i=1}^{p_b} \sum_{j=1}^{p_a} \sum_{t=1}^{T} \sum_{t'=1}^{T} E\left( w_{ijt} w_{ijt'} \right) \le T^{-2} p_b p_a \sup_{i,j} \left[ \sum_{t=1}^{T} \sum_{t'=1}^{T} E\left( w_{ijt} w_{ijt'} \right) \right] = O\left( \frac{p_b p_a}{T} \right). \tag{78-79}$$
Proof of Theorem 2: For the TPR we have
$$E\left( TPR_{p,T} \right) = k^{-1} \sum_{i=1}^{k} \Pr\left( |\widehat{t}_i| > c_T \mid \beta_i \neq 0 \right). \tag{80}$$
When $\beta_i \neq 0$ we have
$$\Pr\left( |\widehat{t}_i| > c_T \mid \beta_i \neq 0 \right) = \Pr\left[ \left| \frac{\widehat{\beta}_i}{\widehat{se}(\widehat{\beta}_i)} - \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} + \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} \right| > c_T \right] = (L1),$$
and the event in (L1) occurs when $\left| \widehat{\beta}_i / se(\widehat{\beta}_i) \right| > 2 c_T$ and $\left| \widehat{\beta}_i / \widehat{se}(\widehat{\beta}_i) - \widehat{\beta}_i / se(\widehat{\beta}_i) \right| < c_T$. Also notice that
$$(L1) \ge \Pr\left[ \left\{ \left| \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} \right| > 2 c_T \right\} \cap \left\{ \left| \frac{\widehat{\beta}_i}{\widehat{se}(\widehat{\beta}_i)} - \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} \right| < c_T \right\} \ \Big| \ \beta_i \neq 0 \right] = 1 - \Pr\left[ \left\{ \left| \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} \right| < 2 c_T \right\} \cup \left\{ \left| \frac{\widehat{\beta}_i}{\widehat{se}(\widehat{\beta}_i)} - \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} \right| > c_T \right\} \ \Big| \ \beta_i \neq 0 \right] \tag{81-82}$$
$$\ge 1 - \Pr\left[ \left| \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} \right| < 2 c_T \right] - \Pr\left[ \left| \frac{\widehat{\beta}_i}{\widehat{se}(\widehat{\beta}_i)} - \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} \right| > c_T \right] = (L2). \tag{83}$$
For the first probability in (L2) we have
$$\Pr\left[ \left| \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} \right| < 2 c_T \right] = \Pr\left[ |\widehat{\beta}_i - \beta_i + \beta_i| < 2 c_T \sqrt{ \left[ \sigma_u^2 \tfrac{1}{T} \Sigma_x^{-1} \right]_{ii} } \right]. \tag{84-85}$$
When $|\beta_i| - 2 c_T \sqrt{ [\sigma_u^2 \tfrac{1}{T} \Sigma_x^{-1}]_{ii} } > 0$ we have
$$\Pr\left[ |\widehat{\beta}_i - \beta_i + \beta_i| < 2 c_T \sqrt{ \left[ \sigma_u^2 \tfrac{1}{T} \Sigma_x^{-1} \right]_{ii} } \right] \le \Pr\left[ |\widehat{\beta}_i - \beta_i| > |\beta_i| - 2 c_T \sqrt{ \left[ \sigma_u^2 \tfrac{1}{T} \Sigma_x^{-1} \right]_{ii} } \right] \le p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right), \tag{86-87}$$
when
$$|\beta_i| - \frac{2 c_T}{\sqrt{T}} \sqrt{ \left[ \sigma_u^2 \Sigma_x^{-1} \right]_{ii} } \ge n_p^{3/2} \lambda_p \ (\ge 0). \qquad (N1) \tag{88}$$
For the second probability term in (L2) we have
$$\Pr\left[ \left| \frac{\widehat{\beta}_i}{\widehat{se}(\widehat{\beta}_i)} - \frac{\widehat{\beta}_i}{se(\widehat{\beta}_i)} \right| > c_T \right] = \Pr\left[ |\widehat{\beta}_i| \left| \frac{se(\widehat{\beta}_i)}{\widehat{se}(\widehat{\beta}_i)} - 1 \right| > c_T\, se(\widehat{\beta}_i) \right] \le \Pr\left[ \left| \frac{se(\widehat{\beta}_i)}{\widehat{se}(\widehat{\beta}_i)} - 1 \right| > \frac{c_T\, se(\widehat{\beta}_i)}{C_1} \right] + \Pr\left[ |\widehat{\beta}_i - \beta_i + \beta_i| > C_1 \right] \tag{89-91}$$
$$\le \Pr\left[ \left| \frac{se(\widehat{\beta}_i)}{\widehat{se}(\widehat{\beta}_i)} - 1 \right| > \frac{c_T\, se(\widehat{\beta}_i)}{C_1} \right] + \Pr\left[ |\widehat{\beta}_i - \beta_i| + |\beta_i| > C_1 \right] \quad (L3) \quad \le \Pr\left[ \left| \frac{se(\widehat{\beta}_i)}{\widehat{se}(\widehat{\beta}_i)} - 1 \right| > \frac{c_T\, se(\widehat{\beta}_i)}{C_1} \right] + \Pr\left[ |\widehat{\beta}_i - \beta_i| > \tfrac{1}{2} C_1 \right] + \Pr\left[ |\beta_i| > \tfrac{1}{2} C_1 \right] = (L4), \tag{92-93}$$
where $se(\widehat{\beta}_i) = \sqrt{ [\sigma_u^2 \tfrac{1}{T} \Sigma_x^{-1}]_{ii} }$ and $C_1$ is a large constant.
The second probability in (L4) satisfies
$$\Pr\left[ |\widehat{\beta}_i - \beta_i| > \tfrac{1}{2} C_1 \right] \le p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right), \quad \text{when } \tfrac{1}{2} C_1 \ge n_p^{3/2} \lambda_p \ (\ge 0). \qquad (N2) \tag{94-95}$$
The third probability in (L4), with (N1), is zero when $C_1 > 2 |\beta_i| > 4 c_T\, se(\widehat{\beta}_i) + 2 n_p^{3/2} \lambda_p$; such a large constant $C_1$ exists.
For the remaining term in (L3)-(L4), first notice that
$$\frac{se(\widehat{\beta}_i)}{\widehat{se}(\widehat{\beta}_i)} - 1 = \frac{ \sqrt{ [\sigma_u^2 \tfrac{1}{T} \Sigma_x^{-1}]_{ii} } }{ \sqrt{ [\widehat{\sigma}_u^2 \tfrac{1}{T} T_\lambda(\widehat{\Sigma}_x)^{-1}]_{ii} } } - 1 = \frac{ \frac{ [\sigma_u^2 \Sigma_x^{-1}]_{ii} }{ [\widehat{\sigma}_u^2 T_\lambda(\widehat{\Sigma}_x)^{-1}]_{ii} } - 1 }{ \frac{ \sqrt{ [\sigma_u^2 \tfrac{1}{T} \Sigma_x^{-1}]_{ii} } }{ \sqrt{ [\widehat{\sigma}_u^2 \tfrac{1}{T} T_\lambda(\widehat{\Sigma}_x)^{-1}]_{ii} } } + 1 } \le \frac{ [\sigma_u^2 \Sigma_x^{-1}]_{ii} }{ [\widehat{\sigma}_u^2 T_\lambda(\widehat{\Sigma}_x)^{-1}]_{ii} } - 1, \tag{96-98}$$
since the denominator exceeds 1. Hence
$$\Pr\left[ \left| \frac{se(\widehat{\beta}_i)}{\widehat{se}(\widehat{\beta}_i)} - 1 \right| > \frac{c_T\, se(\widehat{\beta}_i)}{C_1} \right] \le \Pr\left[ \left| \frac{1}{ [\widehat{\sigma}_u^2 T_\lambda(\widehat{\Sigma}_x)^{-1}]_{ii} } \right| > C_2 \right] + \Pr\left[ \left| [\sigma_u^2 \Sigma_x^{-1}]_{ii} - [\widehat{\sigma}_u^2 T_\lambda(\widehat{\Sigma}_x)^{-1}]_{ii} \right| > \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} \right]. \tag{99-103}$$
Adding and subtracting $[\widehat{\sigma}_u^2 \Sigma_x^{-1}]_{ii}$ in the second term and bounding each piece, the right-hand side is further bounded by
$$2 \Pr\left[ \left| \left[ T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} \right| > \omega_1 \right] + 2 \Pr\left[ \left| \sigma_u^2 - \widehat{\sigma}_u^2 \right| > \omega_2 \right] + \Pr\left[ \left| \widehat{\sigma}_u^2 \right| > \sqrt{ \tfrac{1}{2} \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} } \right] + \Pr\left[ \left| \left[ \Sigma_x^{-1} \right]_{ii} - \left[ T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} \right| > \sqrt{ \tfrac{1}{2} \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} } \right], \qquad (L5) \tag{104-119}$$
where
$$\omega_1 = \min\left( \sqrt{ \tfrac{1}{2 C_2} },\ \tfrac{1}{2 C_2} \tfrac{1}{\sigma_u^2} \right) = \tfrac{1}{2 C_2} \tfrac{1}{\sigma_u^2}, \qquad \omega_2 = \min\left( \tfrac{1}{2} \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} \frac{1}{ [\Sigma_x^{-1}]_{ii} },\ \sqrt{ \tfrac{1}{2 C_2} } \right) = \tfrac{1}{2} \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} \frac{1}{ [\Sigma_x^{-1}]_{ii} }, \tag{114-116}$$
when $\sigma_u^2$ is a constant and $c_T\, se(\widehat{\beta}_i) = \frac{c_T}{\sqrt{T}} \sqrt{ [\sigma_u^2 \Sigma_x^{-1}]_{ii} } = O\left( (\pi \log p)^{1/2} / \sqrt{T} \right)$.
For the terms in (L5): the second term satisfies
$$\Pr\left[ \left| \sigma_u^2 - \widehat{\sigma}_u^2 \right| > \omega_2 \right] \le \exp\left( -C_5^* T^{C_6^*} \right), \tag{120}$$
for $C_2 = O(1/T^{v})$, $v > 0$; the first term satisfies
$$\Pr\left[ \left| \left[ T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} \right| > \omega_1 \right] \le \Pr\left[ \left| \left[ T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} - \left[ \Sigma_x^{-1} \right]_{ii} \right| > \tfrac{\omega_1}{2} \right] + \Pr\left[ \left| \left[ \Sigma_x^{-1} \right]_{ii} \right| > \tfrac{\omega_1}{2} \right] \le p_z^2 C_1^* \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + p_z C_3^* \exp\left( -C_4^* T \lambda_{p_z} \right); \tag{121-122}$$
the third term satisfies
$$\Pr\left[ \left| \widehat{\sigma}_u^2 \right| > \sqrt{ \tfrac{1}{2} \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} } \right] \le \Pr\left[ \left| \widehat{\sigma}_u^2 - \sigma_u^2 \right| > \tfrac{1}{2} \sqrt{ \tfrac{1}{2} \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} } \right] + \Pr\left[ \left| \sigma_u^2 \right| > \tfrac{1}{2} \sqrt{ \tfrac{1}{2} \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} } \right] \le \exp\left( -C_5^* T^{C_6^*} \right), \tag{123-124}$$
when $|\sigma_u^2| < \tfrac{1}{2} \sqrt{ \tfrac{1}{2} \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} } \to \infty$, for $C_1$ constant and $C_2 = O(1/T^{v})$, $v > 0$; and the fourth term satisfies
$$\Pr\left[ \left| \left[ \Sigma_x^{-1} \right]_{ii} - \left[ T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} \right| > \sqrt{ \tfrac{1}{2} \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} } \right] \le p_z^2 C_1^* \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + p_z C_3^* \exp\left( -C_4^* T \lambda_{p_z} \right), \tag{125}$$
since $\sqrt{ \tfrac{1}{2} \frac{c_T\, se(\widehat{\beta}_i)}{C_1 C_2} } \to \infty$. (126)
So overall we have
$$\Pr\left( |\widehat{t}_i| > c_T \mid \beta_i \neq 0 \right) \ge 1 - 4 \left( p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right) \right) - \exp\left( -C_5^* T^{C_6^*} \right), \tag{127}$$
when $c_T = o(T^{c_0})$, $\forall c_0 > 0$, and $se(\widehat{\beta}_i) = O(T^{-\theta})$ for $\theta > c_0$; in fact we need $c_T\, se(\widehat{\beta}) = \frac{c_T}{\sqrt{T}} \sqrt{ [\sigma_u^2 \Sigma_x^{-1}]_{ii} } \to 0$, which holds since $[\sigma_u^2 \Sigma_x^{-1}]_{ii}$ is constant and $c_T = O(\log p)$, so that $c_T\, se(\widehat{\beta}) = O\left( \log p / \sqrt{T} \right)$. Therefore
$$E\left( TPR_{p,T} \right) = k^{-1} \sum_{i=1}^{k} \Pr\left( |\widehat{t}_i| > c_T \mid \beta_i \neq 0 \right) \ge 1 - O\left( p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right) \right). \tag{128}$$
Proof for the FPR: We have
$$E\left( FPR_{p,T} \right) = (p-k)^{-1} \sum_{i=k+1}^{p} \Pr\left( |\widehat{t}_i| > c_T \mid \beta_i = 0 \right). \tag{129}$$
Now
$$\Pr\left( |\widehat{t}_i| > c_T \mid \beta_i = 0 \right) = \Pr\left[ \left| \frac{\widehat{\beta}_i - \beta_i + \beta_i}{se(\widehat{\beta}_i)} \cdot \frac{se(\widehat{\beta}_i)}{\widehat{se}(\widehat{\beta}_i)} \right| > c_T \right] \le \Pr\left[ \left| \frac{1}{se(\widehat{\beta}_i)} \cdot \frac{se(\widehat{\beta}_i)}{\widehat{se}(\widehat{\beta}_i)} \right| > \sqrt{c_T} \right] + \Pr\left[ \left| \widehat{\beta}_i - \beta_i + \beta_i \right| > \sqrt{c_T} \right] \tag{130-131}$$
$$\le C_9 \Pr\left[ \left| \frac{ \widehat{se}(\widehat{\beta}_i) - se(\widehat{\beta}_i) }{ se(\widehat{\beta}_i) } \right| > \sqrt{c_T} \ \Big| \ \beta_i = 0 \right] + \Pr\left[ \left| \widehat{\beta}_i - \beta_i + \beta_i \right| > \sqrt{c_T} \right], \quad \text{for } C_9 > 0. \qquad (S1) \tag{132}$$
For the second probability in (S1) we have
$$\Pr\left[ \left| \widehat{\beta}_i - \beta_i + \beta_i \right| > \sqrt{c_T} \right] \le \Pr\left[ \left| \widehat{\beta}_i - \beta_i \right| > \sqrt{c_T} \ \big| \ \beta_i = 0 \right] + \Pr\left[ |\beta_i| > \sqrt{c_T} \ \big| \ \beta_i = 0 \right], \tag{133}$$
but $\Pr\left[ |\beta_i| > \sqrt{c_T} \mid \beta_i = 0 \right] = 0$, and
$$\Pr\left[ \left| \widehat{\beta}_i - \beta_i \right| > \sqrt{c_T} \ \big| \ \beta_i = 0 \right] \le p_z^2 \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + p_z \exp\left( -C_4^* T \lambda_{p_z} \right), \quad \text{when } \sqrt{c_T} \ge n_{p_z}^{3/2} \lambda_p. \qquad (W1) \tag{134-135}$$
So overall
$$\Pr\left[ \left| \widehat{\beta}_i - \beta_i + \beta_i \right| > \sqrt{c_T} \ \big| \ \beta_i = 0 \right] \le p_z^2 \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + p_z \exp\left( -C_4^* T \lambda_{p_z} \right).$$
For the first probability in (S1) we have
$$\Pr\left[ \left| \frac{ \widehat{se}(\widehat{\beta}_i) - se(\widehat{\beta}_i) }{ se(\widehat{\beta}_i) } \right| > \sqrt{c_T} \ \Big| \ \beta_i = 0 \right] \le \Pr\left[ \left| \frac{ \left[ \frac{\widehat{\sigma}_u^2}{T} T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} }{ \left[ \frac{\sigma_u^2}{T} \Sigma_x^{-1} \right]_{ii} } - 1 \right| > \sqrt{c_T} \ \Big| \ \beta_i = 0 \right] = \Pr\left[ \left| \frac{\widehat{\sigma}_u^2}{\sigma_u^2} \left[ T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} - \left[ \Sigma_x^{-1} \right]_{ii} \right| > \sqrt{c_T} \left[ \Sigma_x^{-1} \right]_{ii} \right], \tag{136-137}$$
and notice that $[\Sigma_x^{-1}]_{ii}$, the $(i,i)$ element of $\Sigma_x^{-1}$, is constant (see also Lemma 1). Adding and subtracting $[T_\lambda(\widehat{\Sigma}_x)^{-1}]_{ii}$ and splitting terms, the above is bounded by
$$\Pr\left[ \left| \left[ T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} - \left[ \Sigma_x^{-1} \right]_{ii} \right| > \frac{\sqrt{c_T}}{4 C_3} \left[ \Sigma_x^{-1} \right]_{ii} \right] + \Pr\left[ \left| \left[ \Sigma_x^{-1} \right]_{ii} \right| > \frac{\sqrt{c_T}}{4 C_3} \left[ \Sigma_x^{-1} \right]_{ii} \right] + \Pr\left[ \left| \frac{\widehat{\sigma}_u^2}{\sigma_u^2} - 1 \right| > C_3 \right] + \Pr\left[ \left| \left[ T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} - \left[ \Sigma_x^{-1} \right]_{ii} \right| > \frac{\sqrt{c_T}}{2} \left[ \Sigma_x^{-1} \right]_{ii} \right]. \qquad (S2) \tag{138-145}$$
The first probability in (S2) satisfies
$$\Pr\left[ \left| \left[ T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} - \left[ \Sigma_x^{-1} \right]_{ii} \right| > \frac{\sqrt{c_T}}{4 C_3} \left[ \Sigma_x^{-1} \right]_{ii} \right] \le p_z^2 C_1^* \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + p_z C_3^* \exp\left( -C_4^* T \lambda_{p_z} \right), \tag{146}$$
when $\frac{\sqrt{c_T}}{4 C_3} [\Sigma_x^{-1}]_{ii} > n_{p_z} \lambda_{p_z}$, where $C_3$ can be a large constant. The second probability in (S2) is zero. The third probability in (S2) satisfies
$$\Pr\left[ \left| \frac{\widehat{\sigma}_u^2}{\sigma_u^2} - 1 \right| > C_3 \ \Big| \ \beta_i = 0 \right] \le \exp\left( -C_{10} T^{C_{11}} \right), \tag{147}$$
for $C_3$ a large constant. The fourth probability in (S2) satisfies
$$\Pr\left[ \left| \left[ T_\lambda(\widehat{\Sigma}_x)^{-1} \right]_{ii} - \left[ \Sigma_x^{-1} \right]_{ii} \right| > \frac{\sqrt{c_T}}{2} \left[ \Sigma_x^{-1} \right]_{ii} \right] \le p_z^2 C_1^* \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + p_z C_3^* \exp\left( -C_4^* T \lambda_{p_z} \right), \tag{148}$$
when $\frac{\sqrt{c_T}}{2} [\Sigma_x^{-1}]_{ii} > n_{p_z} \lambda_{p_z}$. So overall we have
$$\Pr\left( |\widehat{t}_i| > c_T \mid \beta_i = 0 \right) \le C_9 \Pr\left[ \left| \frac{ \widehat{se}(\widehat{\beta}_i) - se(\widehat{\beta}_i) }{ se(\widehat{\beta}_i) } \right| > \sqrt{c_T} \ \Big| \ \beta_i = 0 \right] + \Pr\left[ \left| \widehat{\beta}_i - \beta_i + \beta_i \right| > \sqrt{c_T} \right] \le p_z^2 \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + p_z \exp\left( -C_4^* T \lambda_{p_z} \right), \tag{149-151}$$
and therefore
$$E\left( FPR_{p,T} \right) = (p-k)^{-1} \sum_{i=k+1}^{p} \Pr\left( |\widehat{t}_i| > c_T \mid \beta_i = 0 \right) \le (p-k)^{-1} O\left( p_z^2 \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + p_z \exp\left( -C_4^* T \lambda_{p_z} \right) \right) = O\left( p_z \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + \exp\left( -C_4^* T \lambda_{p_z} \right) \right), \tag{152-154}$$
since $p_z = p + 1$.
Proof of Theorem 3: Define the event of correctly identifying the true model as
$$A_0 = \left\{ \sum_{i=1}^{k} I\left( \widehat{\beta_i \neq 0} \right) = k \right\} \cap \left\{ \sum_{i=k+1}^{p} I\left( \widehat{\beta_i \neq 0} \right) = 0 \right\}. \tag{155}$$
Then
$$\Pr\left( A_0^c \right) \le \Pr\left( \sum_{i=1}^{k} I\left( \widehat{\beta_i \neq 0} \right) < k \right) + \Pr\left( \sum_{i=k+1}^{p} I\left( \widehat{\beta_i \neq 0} \right) > 0 \right). \tag{156}$$
For the second term,
$$\Pr\left( \sum_{i=k+1}^{p} I\left( \widehat{\beta_i \neq 0} \right) > 0 \right) \le \sum_{i=k+1}^{p} E\left[ I\left( \widehat{\beta_i \neq 0} \right) \ \big| \ \beta_i = 0 \right] = \sum_{i=k+1}^{p} \Pr\left( |\widehat{t}_i| > c_T \mid \beta_i = 0 \right) \le (p-k)\, O\left( p_z \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + \exp\left( -C_4^* T \lambda_{p_z} \right) \right) \le O\left( p_z^2 \exp\left( -C_2^* T \lambda_{p_z}^2 \right) + p_z \exp\left( -C_4^* T \lambda_{p_z} \right) \right). \tag{157-159}$$
For the first term,
$$\Pr\left( \sum_{i=1}^{k} I\left( \widehat{\beta_i \neq 0} \right) < k \right) = 1 - \Pr\left( \sum_{i=1}^{k} I\left( \widehat{\beta_i \neq 0} \right) = k \right) \le 1 - 1 + O\left( p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right) \right). \tag{160-161}$$
That is, $\Pr(A_0^c) \le O\left( p_z^2 C_1 \exp\left( -C_2 T \lambda_{p_z}^2 \right) + p_z C_3 \exp\left( -C_4 T \lambda_{p_z} \right) \right)$.
For any $\varepsilon > 0$ there exists some $B_\varepsilon < \infty$ such that
$$\Pr\left( \sqrt{T} \| \widehat{\beta} - \beta \| > B_\varepsilon \right) = \Pr\left( \sqrt{T} \| \widehat{\beta} - \beta \| > B_\varepsilon \ \big| \ A_0 \right) \Pr(A_0) + \Pr\left( \sqrt{T} \| \widehat{\beta} - \beta \| > B_\varepsilon \ \big| \ A_0^c \right) \Pr(A_0^c) \le \Pr\left( \sqrt{T} \| \widehat{\beta} - \beta \| > B_\varepsilon \ \big| \ A_0 \right) + \Pr(A_0^c) < \varepsilon. \tag{162-164}$$
Also, define $F_u = T^{-1} \| \widehat{u}' \widehat{u} \|$; for the error term of the second step regression we have that, for any $\varepsilon > 0$, there exists some $C_\varepsilon < \infty$ such that
$$\Pr\left( \sqrt{T} \left| F_u - \sigma_u^2 \right| > C_\varepsilon \right) = \Pr\left( \sqrt{T} \left| F_u - \sigma_u^2 \right| > C_\varepsilon \ \big| \ A_0 \right) \Pr(A_0) + \Pr\left( \sqrt{T} \left| F_u - \sigma_u^2 \right| > C_\varepsilon \ \big| \ A_0^c \right) \Pr(A_0^c) \le \Pr\left( \sqrt{T} \left| F_u - \sigma_u^2 \right| > C_\varepsilon \ \big| \ A_0 \right) + \Pr(A_0^c) < \varepsilon.$$
Letter  Explanation
S       SIS
R-BL    Regularized estimator with Bickel and Levina thresholding
R-CL    Regularized estimator with Cai and Liu thresholding
t       testing procedure
r       final regression on the important covariates

Table A: Method names in the tables. Each letter stands for a method component; two or more components combine into a method. E.g. S-R-BL-t-r is the sure screening (S) combined with the basic regularized estimator (R-BL), the testing procedure (t), and a final regression (r). R-BL-t-r is the basic regularized estimator (R-BL) with testing (t) and a final regression (r). R-BL-t is the basic regularized estimator (R-BL) with testing (t). R-BL is the basic regularized estimator.
DGP=1, T=200, p=200
Columns: Lasso, A-Lasso, S-R-BL-t, S-R-BL-t-r, S-R-BL, S-R-CL-t, S-R-CL-t-r, S-R-CL, R-BL-t, R-BL-t-r, R-BL, R-CL-t, R-CL-t-r, R-CL
R2 = 0.7
TPR 1 1 1 1 1 1 1 1 1 1 1 1 1 1FPR 0.08 0.08 0 0 0 0 0.01 0 0 0 0 0 0 0.08
RMSE(β) 1 1.32 0.65 0.55 0.82 0.7 0.68 0.68 0.67 0.53 0.82 0.67 0.55 0.7TrueModel 0 0 0.98 0.67 0.72 0.88 0.35 0.94 0.98 0.75 0.72 0.95 0.56 0.84
RMFSE 1 1.05 0.97 0.98 0.99 0.98 1 0.97 0.97 0.98 0.99 0.98 0.98 0.98
R2 = 0.5
TPR 1 1 0.99 1 0.97 1 1 1 0.99 0.99 0.97 0.99 0.99 0.99FPR 0.08 0.08 0 0 0 0 0.01 0 0 0 0 0 0 0.27
RMSE(β) 1 1.34 0.67 0.6 0.87 0.76 0.73 0.73 0.67 0.57 0.87 0.69 0.59 0.75TrueModel 0.01 0.01 0.82 0.6 0.52 0.63 0.29 0.7 0.82 0.66 0.52 0.79 0.57 0.55
RMFSE 1 1.06 0.98 0.97 0.99 0.97 0.97 0.98 0.98 0.97 0.99 0.99 0.97 1
R2 = 0.3
TPR 0.99 0.99 0.86 0.89 0.87 0.89 0.9 0.89 0.77 0.78 0.87 0.78 0.8 0.88FPR 0.07 0.07 0 0.01 0.01 0 0.01 0.01 0 0 0.01 0 0 0.59
RMSE(β) 1 1.26 0.98 0.88 1.1 0.96 0.87 1.04 1.04 0.98 1.1 1.03 0.94 1.06TrueModel 0 0 0.24 0.21 0.17 0.25 0.2 0.21 0.21 0.21 0.17 0.24 0.24 0.07
RMFSE 1 1.04 1 0.99 1.01 1.01 1 1 1 1 1.01 1 0.99 1.01
DGP=1, T=300, p=200
R2 = 0.7
TPR 1 1 1 1 1 1 1 1 1 1 1 1 1 1FPR 0.08 0.08 0 0.01 0 0 0.01 0 0 0.01 0 0 0.01 0.05
RMSE(β) 1 1.34 0.68 0.69 0.71 0.73 0.74 0.7 0.66 0.61 0.71 0.7 0.65 0.7TrueModel 0 0 0.94 0.35 0.91 0.82 0.27 0.93 0.98 0.53 0.91 0.92 0.39 0.89
RMFSE 1 1.02 0.98 0.97 0.97 0.98 0.98 0.98 0.98 0.98 0.97 0.97 0.97 0.97
R2 = 0.5
TPR 1 1 1 1 0.99 1 1 1 1 1 0.99 1 1 1FPR 0.08 0.08 0 0.01 0 0 0.01 0 0 0 0 0 0 0.15
RMSE(β) 1 1.38 0.65 0.67 0.69 0.76 0.76 0.68 0.62 0.59 0.69 0.64 0.62 0.65TrueModel 0.01 0.01 0.85 0.44 0.82 0.66 0.27 0.8 0.91 0.6 0.82 0.87 0.51 0.74
RMFSE 1 1.05 0.98 0.98 0.98 0.97 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.97
R2 = 0.3
TPR 1 1 0.98 0.99 0.95 0.99 0.99 0.99 0.95 0.96 0.95 0.96 0.97 0.98FPR 0.07 0.07 0 0.01 0 0 0.01 0.01 0 0 0 0 0 0.48
RMSE(β) 1 1.34 0.76 0.71 0.81 0.8 0.76 0.81 0.71 0.66 0.81 0.7 0.66 0.77TrueModel 0 0 0.54 0.4 0.52 0.42 0.27 0.49 0.66 0.59 0.52 0.68 0.57 0.34
RMFSE 1 1.01 0.99 0.98 1 0.98 0.99 1 0.99 0.98 1 0.98 0.98 1
DGP=1, T=150, p=200
R2 = 0.7
TPR 1 1 0.99 1 0.99 1 1 1 0.99 1 0.99 0.99 1 0.99FPR 0.08 0.08 0 0 0 0 0.01 0 0 0 0 0 0 0.17
RMSE(β) 1 1.28 0.74 0.54 1 0.72 0.63 0.76 0.76 0.53 1 0.76 0.52 0.83TrueModel 0.01 0.01 0.91 0.66 0.49 0.83 0.4 0.85 0.88 0.64 0.49 0.85 0.61 0.66
RMFSE 1 1.04 0.95 0.93 0.98 0.96 0.94 0.95 0.94 0.93 0.98 0.95 0.94 0.96
R2 = 0.5
TPR 1 1 0.94 0.97 0.93 0.97 0.98 0.95 0.93 0.94 0.93 0.93 0.94 0.93FPR 0.08 0.08 0 0.01 0 0 0.01 0 0 0 0 0 0 0.41
RMSE(β) 1 1.29 0.88 0.72 1.01 0.84 0.72 0.9 0.87 0.72 1.01 0.88 0.73 0.99TrueModel 0 0 0.54 0.4 0.33 0.51 0.33 0.47 0.53 0.45 0.33 0.54 0.47 0.24
RMFSE 1 1.03 1.02 0.99 1.03 0.99 0.98 1.02 1.03 0.99 1.03 1.03 0.99 1.03
R2 = 0.3
TPR 0.97 0.97 0.73 0.77 0.79 0.77 0.78 0.77 0.61 0.62 0.79 0.61 0.63 0.77FPR 0.07 0.07 0 0.01 0.01 0 0.01 0.01 0 0 0.01 0 0 0.55
RMSE(β) 1 1.24 1.13 1.02 1.19 1.09 1.02 1.17 1.16 1.12 1.19 1.17 1.1 1.21TrueModel 0 0 0.09 0.08 0.06 0.09 0.07 0.06 0.08 0.08 0.06 0.08 0.08 0.02
RMFSE 1 1.05 1.03 1.01 1.04 1.01 1 1.05 1.03 1.03 1.04 1.03 1.02 1.05
Table 1: Simulation results for dgp 1 (iid covariates). Method names are given in Table A. The number of simulations is 500. The RMSE and the RMSFE are given as ratios relative to the Lasso method.
DGP=43, T=200, p=200
Columns: Lasso, A-Lasso, S-R-BL-t, S-R-BL-t-r, S-R-BL, S-R-CL-t, S-R-CL-t-r, S-R-CL, R-BL-t, R-BL-t-r, R-BL, R-CL-t, R-CL-t-r, R-CL
R2 = 0.7
TPR 1 1 0.74 0.99 0.86 0.99 0.99 1 0.65 1 0.78 0.99 0.99 1FPR 0.04 0.04 0 0 0.06 0 0 0 0 0 0.01 0 0 0
RMSE(β) 1 1.36 2.73 0.82 4.83 0.79 0.91 0.75 5.09 0.74 5.65 0.75 0.79 0.74TrueModel 0.1 0.1 0.09 0.91 0.01 0.94 0.7 0.98 0.02 0.98 0.14 0.97 0.95 1
RMFSE 1 1.01 1.14 0.97 1.46 0.98 0.97 0.97 1.46 0.97 1.6 0.97 0.97 0.97
R2 = 0.5
TPR 1 1 0.55 0.97 0.78 0.79 0.85 0.99 0.6 0.99 0.75 0.8 0.9 1FPR 0.04 0.04 0 0 0.05 0 0 0 0 0 0 0 0 0.01
RMSE(β) 1 1.35 2.57 0.83 3.5 1.48 1.32 0.89 3.26 0.77 3.5 1.32 1.08 0.76TrueModel 0.06 0.06 0 0.85 0 0.19 0.35 0.82 0 0.94 0.11 0.24 0.55 0.99
RMFSE 1 1.01 1.14 0.98 1.32 1.02 1 0.98 1.2 0.97 1.33 1.01 0.99 0.97
R2 = 0.3
TPR 0.95 0.95 0.49 0.9 0.74 0.52 0.55 0.94 0.5 0.89 0.73 0.49 0.76 1FPR 0.03 0.03 0 0 0.02 0 0 0 0 0 0 0 0 0.02
RMSE(β) 1 1.34 2.07 0.93 2.48 1.56 1.7 1.09 2.2 0.91 2.4 1.64 1.2 0.81TrueModel 0.07 0.07 0 0.57 0.09 0 0 0.41 0 0.59 0.13 0.02 0.43 0.98
RMFSE 1 1.03 1.06 0.98 1.17 1.04 1.02 0.99 1.05 0.99 1.15 1.05 1.01 0.99
DGP=43, T=300, p=200R2 = 0.7
TPR 1 1 0.98 1 1 1 1 1 0.89 1 0.82 1 1 1FPR 0.03 0.03 0 0.01 0.06 0 0.01 0 0 0 0.11 0 0 0
RMSE(β) 1 1.38 1.09 0.97 3.63 0.81 1.06 0.75 2.77 0.87 14.43 0.74 0.78 0.74TrueModel 0.08 0.08 0.77 0.56 0 0.93 0.43 0.99 0.52 0.76 0.03 1 0.97 1
RMFSE 1 1 0.99 0.97 1.22 0.98 0.98 0.98 1.12 0.98 1.55 0.98 0.98 0.98
R2 = 0.5
TPR 1 1 0.81 0.97 0.98 0.93 0.94 1 0.67 0.99 0.77 0.94 0.96 1FPR 0.03 0.03 0 0 0.1 0 0 0 0 0 0.02 0 0 0
RMSE(β) 1 1.38 1.82 0.96 3.49 1.24 1.21 0.91 2.82 0.83 5 0.97 0.95 0.75TrueModel 0.09 0.09 0.1 0.74 0 0.5 0.42 0.82 0.04 0.88 0.07 0.71 0.8 1
RMFSE 1 1.01 1.02 0.99 1.26 1 1.01 0.99 1.14 0.98 1.3 0.98 0.99 0.98
R2 = 0.3
TPR 0.99 0.99 0.53 0.94 0.85 0.64 0.67 0.97 0.54 0.95 0.75 0.62 0.88 1FPR 0.03 0.03 0 0 0.12 0 0 0 0 0 0.01 0 0 0.02
RMSE(β) 1 1.39 2.1 0.94 3.33 1.62 1.74 1.13 2.44 0.9 3.07 1.62 1.09 0.79TrueModel 0.08 0.09 0 0.67 0 0.01 0.01 0.48 0 0.77 0.14 0.01 0.67 0.98
RMFSE 1 1.01 1.04 0.98 1.17 1.02 1.01 1.01 1.08 0.98 1.19 1.03 1 0.98
DGP=43, T=150, p=200R2 = 0.7
TPR 1 1 0.56 0.99 0.79 0.96 0.97 1 0.6 1 0.78 0.96 0.98 0.99FPR 0.04 0.04 0 0 0.04 0 0 0 0 0 0 0 0 0
RMSE(β) 1 1.34 3.73 0.76 4.48 0.91 0.91 0.75 4.22 0.74 4.59 0.91 0.84 0.8TrueModel 0.08 0.08 0 0.97 0.05 0.79 0.72 0.98 0.01 0.99 0.17 0.81 0.88 0.98
RMFSE 1 1.02 1.37 0.97 1.73 0.99 0.99 0.98 1.45 0.98 1.92 0.99 0.98 0.98
R2 = 0.5
TPR 0.99 0.99 0.51 0.97 0.77 0.69 0.78 0.99 0.48 0.97 0.77 0.68 0.84 1FPR 0.03 0.03 0 0 0.02 0 0 0 0 0 0 0 0 0.01
RMSE(β) 1 1.35 2.76 0.82 3.21 1.65 1.44 0.88 2.81 0.8 3.15 1.59 1.23 0.83TrueModel 0.09 0.09 0 0.83 0.13 0.05 0.19 0.83 0 0.85 0.19 0.06 0.35 0.98
RMFSE 1 1.03 1.19 0.99 1.45 1.03 1.01 1 1.23 0.99 1.45 1.02 1 0.99
R2 = 0.3
TPR 0.91 0.91 0.43 0.87 0.75 0.48 0.51 0.91 0.43 0.83 0.75 0.46 0.7 0.99FPR 0.03 0.03 0 0 0 0 0 0 0 0 0 0 0 0.04
RMSE(β) 1 1.34 1.98 0.94 2.21 1.55 1.63 1.05 1.97 0.95 2.21 1.79 1.23 1.17TrueModel 0.06 0.06 0 0.56 0.16 0 0 0.42 0 0.45 0.17 0.01 0.29 0.93
RMFSE 1 1.02 1.03 0.97 1.14 1.03 1.02 0.99 1.05 0.97 1.15 1.08 1 0.99
Table 2: Simulation results for DGP 43 (temporally uncorrelated and weakly collinear covariates).
DGP=52, T=200, p=200
Columns, left to right: Lasso, A-Lasso, S-R-BL-t, S-R-BL-t-r, S-R-BL, S-R-CL-t, S-R-CL-t-r, S-R-CL, R-BL-t, R-BL-t-r, R-BL, R-CL-t, R-CL-t-r, R-CL
R2 = 0.7
TPR 1 1 0.56 1 0.77 0.9 0.97 1 0.51 0.99 0.45 0.65 0.94 0.98
FPR 0.03 0.03 0 0 0.08 0 0 0.01 0 0 0.02 0 0.01 0.68
RMSE(β) 1 1.31 3.57 0.86 4.57 1.49 0.96 1.04 4.92 0.86 5.05 3.16 1.26 4.21
TrueModel 0.06 0.06 0.01 0.51 0 0.49 0.57 0 0.04 0.6 0 0.07 0.07 0
RMFSE 1 1.01 1.37 0.97 1.45 1.04 0.98 0.99 1.53 0.98 1.53 1.2 0.98 1.07
R2 = 0.5
TPR 1 0.99 0.48 0.97 0.5 0.66 0.8 1 0.56 0.92 0.38 0.5 0.83 0.99
FPR 0.04 0.04 0 0 0.05 0 0 0.01 0 0 0.01 0 0.01 0.65
RMSE(β) 1 1.37 2.8 0.92 3.42 1.99 1.43 1.06 3.07 1.04 3.36 2.78 1.46 2.87
TrueModel 0.05 0.05 0.03 0.44 0 0.03 0.2 0.06 0.03 0.48 0 0 0.01 0
RMFSE 1 1.02 1.17 0.98 1.31 1.06 0.98 0.98 1.12 0.96 1.25 1.14 0.99 1.02
R2 = 0.3
TPR 0.92 0.92 0.49 0.85 0.35 0.5 0.52 0.94 0.52 0.72 0.34 0.35 0.65 0.99
FPR 0.03 0.03 0 0 0.02 0 0 0.01 0 0 0.01 0 0.01 0.69
RMSE(β) 1 1.35 2.04 1.07 2.34 1.59 1.62 1.13 2.1 1.26 2.32 2.35 1.49 6.23
TrueModel 0.04 0.04 0 0.25 0 0 0 0.13 0 0.08 0 0 0 0
RMFSE 1 1.02 1.09 1 1.14 1.04 1.05 1.01 1.06 1.01 1.12 1.15 1.02 1.29
DGP=52, T=300, p=200
R2 = 0.7
TPR 1 1 0.87 1 1 0.98 0.98 1 0.66 1 0.69 0.77 0.98 1
FPR 0.04 0.04 0.01 0.01 0.07 0 0.01 0.01 0 0 0.15 0 0.01 0.86
RMSE(β) 1 1.33 2.41 0.92 3.51 1.08 1.02 0.97 3.76 0.86 15.36 2.83 1.11 4.65
TrueModel 0.06 0.06 0.11 0.4 0 0.8 0.42 0 0 0.48 0 0.15 0.08 0
RMFSE 1 1.01 1.08 0.98 1.24 0.99 0.98 0.98 1.27 0.99 1.52 1.11 0.99 1.03
R2 = 0.5
TPR 1 1 0.63 0.99 0.94 0.81 0.89 1 0.55 0.99 0.54 0.58 0.92 0.99
FPR 0.04 0.04 0 0.01 0.11 0 0 0.01 0 0 0.02 0 0.01 0.83
RMSE(β) 1 1.37 2.48 0.96 3.44 1.71 1.28 1.08 3.17 0.92 4.12 2.87 1.34 3.29
TrueModel 0.05 0.05 0.01 0.33 0 0.24 0.31 0.1 0.04 0.45 0 0.01 0.01 0
RMFSE 1 1.03 1.12 0.99 1.23 1.03 1.02 1 1.19 0.99 1.23 1.13 1.01 1.02
R2 = 0.3
TPR 0.98 0.98 0.47 0.95 0.63 0.59 0.62 0.98 0.56 0.92 0.4 0.37 0.7 1
FPR 0.03 0.03 0 0 0.11 0 0 0.01 0 0 0.02 0 0.01 0.88
RMSE(β) 1 1.38 2.22 1.02 3.11 1.67 1.71 1.1 2.38 1.01 2.87 2.44 1.6 4.62
TrueModel 0.04 0.04 0.02 0.34 0 0.01 0.01 0.2 0.03 0.39 0 0.01 0.01 0
RMFSE 1 1.01 1.1 1 1.16 1.05 1.03 0.99 1.09 1 1.14 1.13 1.05 1.03
DGP=52, T=150, p=200
R2 = 0.7
TPR 1 1 0.47 0.99 0.42 0.8 0.93 1 0.56 0.88 0.37 0.57 0.89 0.92
FPR 0.04 0.04 0 0 0.02 0 0 0.01 0 0 0.01 0 0.01 0.38
RMSE(β) 1 1.35 4.06 0.85 4.4 1.92 1.11 1.22 4.04 1.39 4.43 3.26 1.45 4
TrueModel 0.06 0.06 0.01 0.61 0 0.22 0.53 0 0.01 0.43 0 0.04 0.07 0
RMFSE 1 1.03 1.48 0.97 1.51 1.06 0.98 0.99 1.34 0.99 1.52 1.31 0.99 1.2
R2 = 0.5
TPR 0.99 0.98 0.49 0.89 0.34 0.58 0.74 0.98 0.53 0.7 0.34 0.44 0.76 0.92
FPR 0.04 0.04 0 0 0.01 0 0 0.01 0 0 0.01 0 0.01 0.38
RMSE(β) 1 1.31 2.62 1.05 2.92 2 1.44 1.14 2.62 1.49 2.92 2.63 1.5 2.94
TrueModel 0.06 0.06 0.01 0.41 0 0.01 0.12 0.05 0 0.1 0 0 0.01 0
RMFSE 1 1.04 1.19 0.99 1.23 1.12 1.03 0.99 1.16 1.04 1.2 1.16 0.99 1.07
R2 = 0.3
TPR 0.89 0.89 0.48 0.76 0.37 0.45 0.46 0.88 0.34 0.4 0.37 0.34 0.6 0.99
FPR 0.04 0.04 0 0 0.01 0 0 0.01 0 0 0.01 0 0.01 0.39
RMSE(β) 1 1.35 1.92 1.11 2.12 1.56 1.58 1.2 1.79 1.51 2.11 2.74 1.46 4.6
TrueModel 0.03 0.03 0 0.09 0 0 0.01 0.11 0 0 0 0 0.01 0
RMFSE 1 1.02 1.02 0.98 1.16 1.03 1.04 1 1.05 1.06 1.15 2.07 1.05 1.22
Table 3: Simulation results for DGP 52 (pseudo-signal variables).
DGP=44, T=200, p=200
Columns, left to right: Lasso, A-Lasso, S-R-BL-t, S-R-BL-t-r, S-R-BL, S-R-CL-t, S-R-CL-t-r, S-R-CL, R-BL-t, R-BL-t-r, R-BL, R-CL-t, R-CL-t-r, R-CL
R2 = 0.7
TPR 1 1 0.82 0.99 0.88 0.97 0.98 1 0.65 0.99 0.74 0.67 0.98 0.69
FPR 0.03 0.03 0 0 0.02 0 0.01 0.02 0 0 0 0 0 0
RMSE(β) 1 1.43 3.2 0.86 3.58 1.67 1.12 2.05 4.47 0.76 4.69 4.39 0.8 4.41
TrueModel 0.08 0.05 0.08 0.7 0.01 0.34 0.26 0.01 0 0.93 0.16 0.07 0.91 0.1
RMFSE 1 0.99 1.18 0.92 1.21 0.93 0.88 0.93 1.46 0.96 1.86 1.42 0.94 1.47
R2 = 0.5
TPR 1 1 0.64 0.99 0.81 0.77 0.81 1 0.64 0.96 0.73 0.68 0.84 0.82
FPR 0.03 0.03 0 0 0.02 0 0 0.02 0 0 0 0 0 0
RMSE(β) 1 1.43 2.86 0.74 3.17 1.8 1.52 1.94 3.15 0.78 3.3 3.09 1.28 3.22
TrueModel 0.15 0.12 0 0.9 0.08 0.03 0.16 0.01 0.02 0.9 0.16 0.14 0.35 0.44
RMFSE 1 1.01 1.1 0.93 1.21 1.03 1.03 1.03 1.22 0.97 1.37 1.13 0.94 1.2
R2 = 0.3
TPR 0.96 0.95 0.6 0.98 0.72 0.49 0.48 0.98 0.66 0.95 0.72 0.49 0.59 0.99
FPR 0.02 0.02 0 0 0 0 0 0.01 0 0 0 0 0 0
RMSE(β) 1 1.47 2.27 0.76 2.4 1.51 1.84 1.34 2.36 0.82 2.39 1.88 1.56 1.09
TrueModel 0.1 0.07 0 0.91 0.06 0 0 0.47 0 0.78 0.1 0.07 0.05 0.94
RMFSE 1 1.03 1.24 0.94 1.22 1.01 1.04 0.98 1.18 0.99 1.21 1.1 1 1
DGP=44, T=300, p=200
R2 = 0.7
TPR 1 1 0.98 1 0.98 1 1 1 0.67 1 0.74 0.64 1 0.64
FPR 0.02 0.02 0 0 0.04 0 0 0.02 0 0 0.01 0 0 0
RMSE(β) 1 1.37 1.98 0.82 3.04 1.47 1.02 2.3 6.12 0.73 6.38 6.04 0.73 6.11
TrueModel 0.16 0.13 0.43 0.74 0 0.52 0.41 0.02 0 0.94 0.15 0.03 1 0.01
RMFSE 1 0.97 1.08 0.94 1.1 1.01 0.98 1.1 1.62 0.93 1.7 1.4 0.97 1.42
R2 = 0.5
TPR 1 1 0.84 0.99 0.95 0.9 0.91 1 0.63 0.99 0.73 0.74 0.87 0.79
FPR 0.02 0.02 0 0 0.05 0 0 0.02 0 0 0.01 0 0 0
RMSE(β) 1 1.45 2 0.85 2.97 1.67 1.4 2.16 3.99 0.8 4.17 4.06 1.35 4.1
TrueModel 0.07 0.03 0.24 0.79 0.01 0.24 0.2 0.02 0 0.91 0.09 0.21 0.44 0.36
RMFSE 1 1.02 1.08 0.99 1.24 1.04 1.04 1.02 1.24 0.97 1.34 1.24 0.98 1.29
R2 = 0.3
TPR 0.99 0.99 0.57 0.97 0.83 0.6 0.61 1 0.63 0.94 0.72 0.6 0.7 1
FPR 0.02 0.02 0 0 0.03 0 0 0.01 0 0 0.03 0 0 0
RMSE(β) 1 1.48 2.34 0.82 2.93 1.56 1.79 1.54 2.66 0.85 2.76 1.72 1.49 0.98
TrueModel 0.07 0.06 0.01 0.84 0.02 0 0 0.34 0 0.79 0.08 0.05 0.13 0.99
RMFSE 1 1.02 1.09 1.01 1.13 0.97 0.98 1.01 1.1 0.99 1.19 1 0.96 1.01
DGP=44, T=150, p=200
R2 = 0.7
TPR 1 1 0.7 1 0.77 0.94 0.95 1 0.66 1 0.74 0.68 0.97 0.73
FPR 0.03 0.03 0 0 0 0 0 0.01 0 0 0 0 0 0
RMSE(β) 1 1.45 3.81 0.68 4.11 1.84 1.22 2.09 4.18 0.68 4.39 4.15 0.88 4.17
TrueModel 0.09 0.07 0.01 0.94 0.05 0.21 0.16 0 0.02 1 0.12 0.06 0.84 0.16
RMFSE 1 1.01 1.47 0.97 1.95 1.06 1.03 1.13 1.73 0.99 2.06 1.82 1.01 1.91
R2 = 0.5
TPR 0.99 0.99 0.64 1 0.77 0.66 0.71 1 0.66 0.97 0.77 0.66 0.78 0.84
FPR 0.02 0.03 0 0 0 0 0 0.01 0 0 0.02 0 0 0.01
RMSE(β) 1 1.45 2.69 0.72 2.92 1.95 1.67 1.86 2.8 0.79 2.94 2.83 1.34 2.82
TrueModel 0.07 0.06 0.01 0.98 0.13 0 0.05 0.02 0.01 0.88 0.17 0.06 0.22 0.47
RMFSE 1 1 1.11 1.02 1.34 1.11 1 1.05 1.13 0.99 1.34 1.13 1.02 1.16
R2 = 0.3
TPR 0.92 0.92 0.58 0.96 0.73 0.42 0.42 0.98 0.64 0.86 0.74 0.53 0.56 0.98
FPR 0.03 0.03 0 0 0 0 0 0.01 0 0 0.05 0 0 0.01
RMSE(β) 1 1.53 1.88 0.77 1.99 1.36 1.58 1.17 1.91 0.83 1.97 2.03 1.28 1.27
TrueModel 0.02 0.01 0 0.8 0.12 0 0 0.52 0 0.49 0.13 0.12 0.06 0.93
RMFSE 1 1.05 1.08 0.99 1.13 1.01 1.04 0.95 1.08 0.99 1.12 1.13 1.06 1.01
Table 4: Simulation results for DGP 44 (strongly collinear noise variables due to a persistent unobserved common factor).
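As an illustration only, noise regressors of this type can be generated by loading all of them on a single persistent AR(1) factor; the parameter values and names below are arbitrary and do not reproduce the exact design of DGP 44.

import numpy as np

def collinear_noise_block(T, p_noise, rho=0.9, seed=0):
    # Noise regressors that are strongly collinear through one persistent common factor
    rng = np.random.default_rng(seed)
    f = np.zeros(T)
    shocks = rng.standard_normal(T)
    for t in range(1, T):
        f[t] = rho * f[t - 1] + shocks[t]     # persistent (AR(1)) unobserved factor
    loadings = rng.standard_normal(p_noise)
    idiosyncratic = rng.standard_normal((T, p_noise))
    return np.outer(f, loadings) + idiosyncratic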
DGP=45, T=200, p=200
Columns, left to right: Lasso, A-Lasso, S-R-BL-t, S-R-BL-t-r, S-R-BL, S-R-CL-t, S-R-CL-t-r, S-R-CL, R-BL-t, R-BL-t-r, R-BL, R-CL-t, R-CL-t-r, R-CL
R2 = 0.7
TPR 1 1 0.61 1 0.77 1 1 1 0.66 1 0.78 1 1 1
FPR 0.04 0.04 0 0 0.02 0 0 0 0 0 0 0 0 0
RMSE(β) 1 1.41 3.77 0.74 5.23 0.76 0.81 0.75 5.13 0.74 5.41 0.78 0.76 0.75
TrueModel 0.02 0.02 0 0.99 0.06 0.98 0.83 0.98 0 1 0.19 0.98 0.99 1
RMFSE 1 1 1.18 0.91 1.37 0.91 0.96 0.92 1.35 0.92 1.62 0.91 0.92 0.91
R2 = 0.5
TPR 1 1 0.52 1 0.71 0.8 0.8 1 0.62 1 0.71 0.8 0.83 1
FPR 0.04 0.03 0 0 0.01 0 0 0 0 0 0 0 0 0
RMSE(β) 1 1.29 2.77 0.75 3.37 1.31 1.45 0.73 3.21 0.73 3.37 1.27 1.33 0.74
TrueModel 0.09 0.06 0 1 0.06 0.22 0.22 0.99 0 1 0.1 0.24 0.3 1
RMFSE 1 1 1.1 0.95 1.46 0.96 1.02 0.94 1.17 0.95 1.45 0.96 0.98 0.93
R2 = 0.3
TPR 0.96 0.95 0.49 0.96 0.75 0.49 0.51 0.95 0.52 0.96 0.75 0.47 0.66 1
FPR 0.03 0.03 0 0 0 0 0 0 0 0 0 0 0 0.01
RMSE(β) 1 1.34 2.35 0.82 2.53 1.59 1.84 0.84 2.34 0.78 2.53 1.81 1.49 1.04
TrueModel 0.03 0.01 0 0.79 0.15 0 0 0.78 0.01 0.79 0.15 0.03 0.16 0.99
RMFSE 1 1.05 1.06 1 1.28 1.05 0.98 1.02 1.04 1.02 1.28 1.08 1.01 1.03
DGP=45, T=300, p=200
R2 = 0.7
TPR 1 1 0.97 1 0.98 1 1 1 0.67 1 0.75 1 1 1
FPR 0.03 0.03 0 0 0.07 0 0 0 0 0 0.01 0 0 0
RMSE(β) 1 1.33 1.41 0.84 4.36 0.72 0.83 0.72 6.42 0.76 6.73 0.72 0.72 0.72
TrueModel 0.09 0.05 0.56 0.69 0 1 0.74 1 0.01 0.88 0.06 1 1 1
RMFSE 1 0.98 0.94 0.98 1.25 0.99 0.96 0.99 1.47 0.99 1.65 0.99 0.98 0.99
R2 = 0.5
TPR 1 1 0.72 1 0.85 0.94 0.94 1 0.64 1 0.74 0.95 0.93 1
FPR 0.04 0.04 0 0 0.07 0 0 0 0 0 0 0 0 0
RMSE(β) 1 1.42 2.11 0.8 3.85 1.02 1.08 0.73 4.06 0.74 4.27 0.9 1.06 0.72
TrueModel 0.06 0.07 0 0.87 0 0.64 0.55 1 0 0.97 0.1 0.76 0.67 1
RMFSE 1 0.99 1.03 1.03 1.21 1.04 1.04 1.03 1.27 1.03 1.45 1.04 1.03 1.03
R2 = 0.3
TPR 0.99 0.98 0.5 0.99 0.75 0.64 0.65 0.98 0.62 1 0.71 0.64 0.76 1
FPR 0.03 0.03 0 0 0.04 0 0 0 0 0 0 0 0 0
RMSE(β) 1 1.38 2.25 0.81 3.04 1.56 1.76 0.79 2.8 0.79 2.95 1.51 1.43 0.75
TrueModel 0.05 0.02 0 0.88 0.06 0 0 0.88 0.01 0.95 0.12 0 0.23 1
RMFSE 1 1.02 0.93 0.98 1.18 0.99 1 0.97 1.09 0.95 1.11 1 0.96 0.95
DGP=45, T=150, p=200
R2 = 0.7
TPR 1 1 0.56 1 0.78 0.97 0.97 1 0.63 1 0.77 0.93 0.97 1
FPR 0.04 0.04 0 0 0.01 0 0 0 0 0 0 0 0 0.01
RMSE(β) 1 1.35 4.42 0.75 4.85 0.94 0.93 0.81 4.55 0.75 4.89 1.19 0.94 1.08
TrueModel 0.02 0.01 0 1 0.13 0.8 0.79 0.95 0 1 0.2 0.74 0.85 0.98
RMFSE 1 1.01 1.69 1 2.06 1.04 1.02 1.05 1.73 1 2.27 1.05 1.07 1.08
R2 = 0.5
TPR 0.99 0.99 0.51 0.99 0.75 0.66 0.7 0.99 0.54 0.99 0.75 0.65 0.77 1
FPR 0.04 0.04 0 0 0 0 0 0 0 0 0 0 0 0.03
RMSE(β) 1 1.32 2.82 0.81 3.12 1.6 1.71 0.87 2.84 0.79 3.12 1.6 1.46 0.93
TrueModel 0.06 0.06 0 0.91 0.16 0 0.02 0.9 0 0.94 0.16 0.01 0.17 0.95
RMFSE 1 0.95 1.16 0.99 1.29 1.04 1.09 0.98 1.16 0.98 1.29 1.02 1 0.97
R2 = 0.3
TPR 0.9 0.89 0.41 0.94 0.74 0.44 0.47 0.85 0.44 0.83 0.74 0.46 0.61 1
FPR 0.03 0.03 0 0 0 0 0 0 0 0 0 0 0 0.02
RMSE(β) 1 1.28 1.91 0.84 2.16 1.5 1.64 0.97 1.94 0.93 2.16 1.98 1.34 1.79
TrueModel 0.07 0.02 0 0.69 0.14 0 0 0.49 0 0.38 0.14 0.05 0.12 0.97
RMFSE 1 1.02 1.11 0.96 1.23 0.97 0.93 0.99 1.12 0.99 1.23 1.06 0.95 1
Table 5: Simulation results for DGP 45 (temporally correlated and weakly collinear covariates).
DGP=51, T=200, p=200
Columns, left to right: Lasso, A-Lasso, S-R-BL-t, S-R-BL-t-r, S-R-BL, S-R-CL-t, S-R-CL-t-r, S-R-CL, R-BL-t, R-BL-t-r, R-BL, R-CL-t, R-CL-t-r, R-CL
R2 = 0.7
TPR 1 1 0.69 1 0.8 0.99 1 1 0.69 1 0.74 0.81 0.97 0.89
FPR 0.04 0.04 0 0 0.02 0 0 0 0 0 0 0 0 0.45
RMSE(β) 1 1.37 3.25 0.72 4.47 1.26 0.84 1.28 4.49 0.69 4.67 2.83 1.01 2.92
TrueModel 0.01 0 0 0.67 0.02 0.66 0.58 0.48 0.01 0.9 0.07 0.2 0.49 0.19
RMFSE 1 1.06 1.21 1.08 1.62 1.12 1.06 1.12 1.77 1.05 1.78 1.15 1.07 1.18
R2 = 0.5
TPR 1 1 0.61 1 0.75 0.85 0.86 1 0.6 1 0.75 0.6 0.84 0.83
FPR 0.04 0.04 0 0 0 0 0 0 0 0 0 0 0 0.45
RMSE(β) 1 1.32 2.37 0.71 2.86 1.38 1.29 0.99 2.76 0.66 2.82 2.43 1.33 2.26
TrueModel 0.07 0.03 0 0.74 0.12 0.3 0.29 0.56 0.01 0.95 0.16 0.01 0.2 0.07
RMFSE 1 1.03 1.03 0.97 1.11 0.97 0.98 0.99 1.06 0.95 1.11 1 0.98 1.01
R2 = 0.3
TPR 0.97 0.96 0.53 0.96 0.73 0.59 0.63 0.96 0.53 0.93 0.73 0.43 0.72 0.93
FPR 0.04 0.04 0 0 0 0 0 0 0 0 0 0 0 0.77
RMSE(β) 1 1.35 1.96 0.77 2.08 1.57 1.55 1.11 2.02 0.8 2.08 2.08 1.3 2.48
TrueModel 0.06 0.02 0 0.67 0.07 0 0 0.47 0 0.64 0.07 0 0.08 0.01
RMFSE 1 1.01 0.99 0.91 0.98 0.98 0.96 0.91 1.02 0.89 0.98 1.02 0.97 0.99
DGP=51, T=300, p=200
R2 = 0.7
TPR 1 1 0.93 1 0.99 1 1 1 0.81 1 0.78 0.97 1 1
FPR 0.04 0.04 0 0.01 0.04 0 0.01 0.01 0.01 0 0 0 0 0.86
RMSE(β) 1 1.37 1.71 0.85 4.08 1.11 0.95 1.22 3.51 0.75 5.45 1.89 0.79 2.35
TrueModel 0.07 0.05 0.58 0.28 0 0.71 0.34 0.22 0.03 0.66 0.05 0.58 0.53 0.12
RMFSE 1 0.99 1.04 0.93 1.2 0.97 0.96 0.97 1.23 0.93 1.43 0.98 0.93 0.98
R2 = 0.5
TPR 1 1 0.77 1 0.93 0.96 0.97 1 0.63 1 0.73 0.77 0.91 0.99
FPR 0.04 0.04 0 0 0.06 0 0 0.01 0 0 0 0 0 0.88
RMSE(β) 1 1.34 2.17 0.82 3.48 1.21 1.09 1.11 2.95 0.76 3.68 2.22 1.33 2.08
TrueModel 0.06 0.02 0.03 0.45 0 0.52 0.37 0.36 0.02 0.72 0.07 0.1 0.24 0.08
RMFSE 1 1.04 1.04 1.02 1.18 1.04 1.03 1.04 1.02 1.01 1.26 1.03 0.94 1.03
R2 = 0.3
TPR 0.99 0.98 0.59 0.97 0.78 0.72 0.72 0.99 0.52 0.98 0.74 0.57 0.75 1
FPR 0.04 0.04 0 0 0.02 0 0 0.01 0 0 0 0 0 0.99
RMSE(β) 1 1.33 1.91 0.83 2.5 1.53 1.55 1.05 2.19 0.77 2.41 2.16 1.45 1.78
TrueModel 0.02 0 0 0.62 0.06 0 0 0.38 0.01 0.79 0.07 0 0.06 0
RMFSE 1 1.05 1.1 1.11 1.23 1.13 1.09 1.1 1.16 1.09 1.23 1.08 1.05 1.09
DGP=51, T=150, p=200
R2 = 0.7
TPR 1 1 0.66 1 0.75 0.97 0.99 1 0.66 1 0.74 0.76 0.93 0.83
FPR 0.05 0.04 0 0 0 0 0 0 0 0 0 0 0 0.21
RMSE(β) 1 1.33 3.27 0.69 3.89 1.28 0.84 1.22 3.78 0.65 3.89 2.76 1.15 2.73
TrueModel 0.01 0.01 0.01 0.84 0.08 0.7 0.71 0.65 0 0.98 0.08 0.15 0.44 0.21
RMFSE 1 1.04 1.32 0.98 1.77 1.03 0.95 1.05 1.69 0.97 1.77 1.23 1.02 1.17
R2 = 0.5
TPR 1 0.99 0.54 0.99 0.75 0.75 0.79 1 0.52 0.98 0.75 0.56 0.77 0.7
FPR 0.04 0.04 0 0 0 0 0 0 0 0 0 0 0 0.15
RMSE(β) 1 1.28 2.37 0.73 2.51 1.6 1.44 1.09 2.51 0.7 2.51 2.26 1.46 2.29
TrueModel 0.06 0.05 0 0.74 0.14 0 0.08 0.63 0.01 0.9 0.14 0 0.08 0.08
RMFSE 1 1.08 1.2 0.99 1.3 1.02 1.01 1.02 1.13 1 1.3 1.08 1.04 1.09
R2 = 0.3
TPR 0.92 0.93 0.46 0.92 0.77 0.54 0.55 0.86 0.47 0.8 0.77 0.36 0.61 0.87
FPR 0.03 0.03 0 0 0 0 0 0 0 0 0 0 0 0.26
RMSE(β) 1 1.31 1.78 0.84 1.84 1.56 1.52 1.29 1.79 0.98 1.84 1.92 1.34 3.29
TrueModel 0.05 0 0 0.55 0.14 0 0 0.29 0 0.27 0.14 0 0.03 0.06
RMFSE 1 1.03 1.11 0.98 1.06 1.02 1.05 0.99 1.06 0.99 1.06 1.09 1.04 1.17
Table 6: Simulation results for DGP 51 (all variables collinear with the true signals).
Forecasting Exercise
RMSFE ratio relative to AR(p)
              Nondurable goods Price Index   Household operation Price Index
Lasso         0.92                           1.06
A-Lasso       0.97                           1.16
ar(p)-factor  0.99                           1.0
ar(p)         1                              1
S-R-BL-t-r    0.92                           0.97
S-R-CL-t-r    0.91                           1.02
R-BL-t-r      0.96                           0.96
R-CL-t-r      0.92                           0.89
Table 7: Out-of-sample forecasting. The dataset runs from 08-Jan-1961 to 02-Jan-2009. Estimation is rolling, with 120 observations in each estimation window, and the forecast evaluation starts at 08-Jan-1991. Results are reported as RMSFE ratios relative to the AR(p) benchmark.
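The out-of-sample exercise can be reproduced, in outline, with a rolling loop of the following form (a sketch under the assumptions stated in the table notes; estimate stands for any of the competing selection/estimation methods and is not a function defined in the paper):

import numpy as np

def rolling_rmsfe(y, X, estimate, window=120):
    # X is assumed aligned so that row t is in the information set when forecasting y[t]
    errors = []
    for t in range(window, len(y)):
        beta = estimate(y[t - window:t], X[t - window:t])   # rolling estimation window
        errors.append(y[t] - X[t] @ beta)                   # one-step-ahead forecast error
    return np.sqrt(np.mean(np.square(errors)))

# Table 7 reports, for each method, rolling_rmsfe(y, X, method) / rolling_rmsfe(y, X, ar_benchmark).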
[Figure: frequency of regressors over the forecasting interval (Nondurable goods Price Index). Horizontal axis: regressor index 1 to 112; vertical axis: inclusion frequency (0 to 0.9). Series shown: Lasso, Ad. Lasso, S-R-BL-t-r, S-R-CL-t-r, R-BL-t-r, R-CL-t-r.]
Figure 1: Frequency of regressors included in the forecasting equation over the forecast interval. The first 4 regressors are the 4 lags of the dependent variable. Regressors indexed from 5 to 112 are the 108 candidate regressors from the Stock and Watson (2012) large dataset.
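The frequencies plotted here can be tabulated directly from the rolling selections; a minimal sketch, assuming a 0/1 selection indicator is stored for every regressor in every estimation window:

import numpy as np

def inclusion_frequency(selections):
    # selections: (n_windows, n_regressors) array of 0/1 selection indicators
    # returns, for each regressor, the share of windows in which it was retained
    return np.asarray(selections, dtype=float).mean(axis=0)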
[Figure: frequency of regressors over the forecasting interval (Household operation Price Index). Axes and series as in Figure 1.]
Figure 2: See the notes to Figure 1.