
Valid simultaneous inference in high-dimensional settings (with the HDM package for R)

Philipp Bach, Victor Chernozhukov, Martin Spindler

The Institute for Fiscal Studies
Department of Economics, UCL

cemmap working paper CWP30/19


VALID SIMULTANEOUS INFERENCE IN HIGH-DIMENSIONAL SETTINGS

(WITH THE HDM PACKAGE FOR R)

PHILIPP BACH, VICTOR CHERNOZHUKOV, MARTIN SPINDLER

Abstract. Due to the increasing availability of high-dimensional empirical applications in many research disciplines, valid simultaneous inference is becoming more and more important. For instance, high-dimensional settings may arise in economic studies due to very rich data sets with many potential covariates or in the analysis of treatment heterogeneities. The evaluation of potentially more complicated (non-linear) functional forms of the regression relationship also leads to many potential variables for which simultaneous inferential statements may be of interest. Here we provide a review of classical and modern methods for simultaneous inference in (high-dimensional) settings and illustrate their use in a case study using the R package hdm. The R package hdm implements valid, powerful, and efficient joint hypothesis tests for a potentially large number of coefficients as well as the construction of simultaneous confidence intervals and, therefore, provides useful methods for valid post-selection inference based on the lasso.

R and the package hdm are open-source software projects and can be freely downloaded from CRAN: http://cran.r-project.org.

Contents

1. Introduction
2. The Setting
3. Simultaneous Post-Selection Inference in High Dimensions – An Overview
4. Methods for Testing Multiple Hypotheses
4.1. A Global Test for Joint Significance with Lasso Regression
4.2. Multiple Hypotheses Testing with Control of the Familywise Error Rate
4.3. Multiple Hypotheses Testing with Control of the False Discovery Rate
5. Implementation in R
6. A Simulation Study for Valid Simultaneous Inference in High Dimensions
6.1. Simulation Setting
6.2. Results
7. A Real-Data Example for Simultaneous Inference in High Dimensions – The Gender Wage Gap Analysis
7.1. Data Preparation
7.2. Valid Simultaneous Inference on a Heterogeneous Gender Wage Gap
8. Conclusion
9. Appendix
9.1. Details of Simulation Study
9.2. Additional Simulation Results
9.3. Additional Results, Real-Data Example
References

Version: September 14, 2018.

1. Introduction

Valid simultaneous inference becomes very important when multiple hypotheses are tested at the same time. For instance, suppose a researcher wants to estimate a potentially large number of regression coefficients and to assess which of these coefficients are significantly different from zero at a given significance level α. It is well known that in such a situation the naive approach, i.e. simply ignoring multiple testing issues, will generally lead to flawed conclusions due to a large number of mistakenly rejected hypotheses. Indeed, the actual probability of incorrectly rejecting one or more hypotheses will in general exceed the claimed/desired level α by a large margin.

The statistical literature has proposed various approaches to mitigate the consequences of testing multiple hypotheses at the same time. These methods can be grouped into two approaches according to the underlying criterion. The first approach, initiated by the famous Bonferroni correction, seeks to control the probability of at least one false rejection, which is called the family-wise error rate (FWER). Since the definition of the FWER refers to the probability of making at least one type I error, the FWER criterion is appealing from an intuitive point of view. However, FWER control is often criticized as conservative, and instead the false discovery rate (FDR) is used as a criterion in many test procedures, leading to the second major class of multiple testing correction methods, e.g. Benjamini and Hochberg (1995). The FDR refers to the expected share of falsely rejected null hypotheses and, hence, results from FDR procedures differ from classical test results in terms of interpretation. Various approaches aim at maintaining control of the FWER while reducing conservativeness at the same time by incorporating a stepwise procedure, for instance the stepdown method of Holm (1979). Moreover, taking the dependence structure of the test statistics into consideration makes it possible to reduce the conservativeness of FWER procedures, as in the stepdown procedure of Romano and Wolf (2005a,b), which is based on resampling methods.

In the following, we review classical methods, describe how valid simultaneous inference can be conducted in a high-dimensional regression setting, i.e. when the number of covariates exceeds the number of observations, and give examples of how the presented methods can be applied with the statistical software R. It is well known that classical regression methods, such as ordinary least squares, break down in high-dimensional settings. Instead, regularization methods, e.g. the lasso, can be used for estimation. However, post-selection inference is non-trivial and requires modification of the estimators. The R package hdm implements the double-selection approach of Belloni, Chernozhukov, and Hansen (2014), which makes it possible to perform valid inference based on model selection with the lasso. Moreover, hdm provides powerful tests for a large number of hypotheses using the multiplier bootstrap, potentially in combination with the Romano-Wolf stepdown procedure (starting with Version 0.3.0). The implemented methods are less conservative and more powerful than traditional approaches, as they take the dependence structure among the test statistics into account and proceed in a stepwise manner. At the same time, the implemented procedures are attractive from an intuitive point of view, as they guarantee control of the family-wise error rate (FWER).


The remainder of the paper is organized as follows. First, the setting is introduced and an overview of valid post-selection inference in high dimensions is provided. Second, a short and selective review of traditional and recent methods to adjust for multiple testing is presented. Third, we give an overview of the tools for valid post-selection inference in high dimensions available in the R package hdm. Fourth, the use of the software is illustrated in a simulated and a replicable real-data example. A conclusion is provided in the last section.

2. The Setting

We are interested in testing a set of K hypotheses H_1, ..., H_K in a high-dimensional regression model, i.e. a regression where the number of covariates p is large, potentially much larger than the number of observations n, i.e. we have p ≫ n. The ultimate objective in this setting is to perform inference on a subset of the regression coefficients, i.e. a vector of so-called "target" coefficients θ_k with k = 1, ..., K and K < n,¹ in the model

$$y_i = \beta_0 + d_i'\theta + x_i'\beta + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}$$

where β_0 is an intercept and β denotes the regression coefficients of the control variables x_i. Moreover, it is assumed that E[ε_i x_i] = 0. In this setting, K hypotheses are tested for the coefficients that correspond to the effect of the "target" variables d_i on the outcome y_i:

$$H_{0,k}: \theta_k = 0, \quad k = 1, \ldots, K. \tag{2}$$

¹ The setting with a potentially infinite-dimensional vector of target coefficients is theoretically considered in Belloni, Chernozhukov, Chetverikov, and Wei (forthcoming) and Belloni, Chernozhukov, and Kato (2015).

For instance, such a high-dimensional regression setting arises in causal program evaluation studies, where a large number of regressors is included to approximate a potentially complicated, non-linear population regression function using transformations with dictionaries, e.g. interactions, splines or polynomials. Alternatively, an analysis of heterogeneous treatment effects across possibly many subgroups, as in the example in Section 7, might require a large number of interactions of the regressors.

Suppose there is a test procedure for each of the hypotheses, leading to test statistics t_1, ..., t_K and unadjusted p-values p_1, ..., p_K. In the context of multiple testing, it is often helpful to sort the p-values in increasing order (i.e. the "most significant" test result first) and the hypotheses likewise, i.e. p_(1), ..., p_(K) and H_(1), ..., H_(K) with p_(1) ≤ p_(2) ≤ ... ≤ p_(K). The test statistics are ordered by the same logic: |t_(1)| ≥ |t_(2)| ≥ ... ≥ |t_(K)|. A researcher decides whether to accept or reject a null hypothesis according to whether the corresponding p-value p_k is above or below a prespecified significance level α. Generally, the significance level corresponds to the probability of erroneously rejecting a true null hypothesis. However, if the conclusions are based on a comparison of unadjusted p-values and the significance level, the probability of incorrectly rejecting at least one of the hypotheses will exceed the claimed level α, potentially by a large margin. Hence, adjustment for multiple testing becomes necessary to draw appropriate inferential conclusions.

3. Simultaneous Post-Selection Inference in High Dimensions – An Overview

In high-dimensional settings, traditional regression methods such as ordinary least squares break down, and testing the K hypotheses will severely suffer from the shortcomings of the underlying estimation method. Penalization methods, for instance the lasso or other machine learning techniques, provide an opportunity to overcome the failure of traditional least squares estimation, as they regularize the regression problem in Equation 1 by introducing a penalization term. In the example of the lasso, the ordinary least squares minimization problem is extended by a penalization of the regression coefficients using the l1-norm. The lasso estimator is the solution to the minimization problem

$$(\hat{\theta}, \hat{\beta}) = \arg\min_{\theta, \beta}\; \mathbb{E}_n\big[(y_i - \beta_0 - d_i'\theta - x_i'\beta)^2\big] + \frac{\lambda}{n}\,\big\|\psi\,(\theta', \beta')'\big\|_1, \tag{3}$$

with ‖·‖₁ being the l1-norm, λ a penalization parameter, and ψ a diagonal matrix of penalty loadings. More details on the choice of λ and ψ as implemented in the hdm package can be found in the package vignette available at CRAN and in Chernozhukov, Hansen, and Spindler (2016). As a consequence of the l1-penalization, some of the coefficients are shrunk towards zero and some of them are set exactly equal to zero. In general, inference after such a selection step is only valid under the strong assumption of perfect model selection, i.e. that the lasso sets to zero only those coefficients that truly have no effect on y_i. However, perfect model selection and the underlying assumptions are often considered unrealistic in real-world applications, leading to a breakdown of the naive inferential framework and, thus, to flawed inferential conclusions.

In contrast to the naive procedure, the so-called double-selection approach of Belloni, Chernozhukov, and Hansen (2014) tolerates small model selection mistakes, so that valid confidence intervals and test procedures can be based on the lasso. The double-selection method is based on orthogonalized moment equations and introduces an auxiliary (lasso) regression step for each of the target coefficients in order to prevent variable selection from erroneously excluding variables that are related to both the outcome and the target regressors by setting their coefficients equal to zero. Double selection proceeds as follows:

(1) For each of the target variables d_{j,i}, j = 1, ..., K, a lasso regression is estimated to identify the most important predictors among the covariates x_i and the remaining target variables d_{−j,i}.
(2) A lasso regression of the outcome variable y_i on all explanatory variables, except for d_{j,i}, is estimated to identify predictors of y_i. This step is executed for each of the target variables d_j with j = 1, ..., K.
(3) The target coefficients θ are estimated from a linear regression of the outcome on all target variables as well as all covariates that have been selected in either step (1) or (2).

As a consequence of the double-selection procedure, the risk of an omitted variable bias that might arise due to imperfect model selection is reduced. It can be shown that the double-selection estimator θ̂_k^{DS} is asymptotically normally distributed under a set of regularity assumptions. Probably the most important of these assumptions is (approximate) sparsity. This assumption states that only a subset of the regressors suffices to describe the relationship between the outcome variable and the covariates, and that all other regressors have no or only a negligible effect on the outcome. In general, valid post-selection inference is compatible with other tools from the machine learning literature, for instance elastic nets or regression trees, as long as these methods satisfy some regularity conditions (Belloni, Chernozhukov, and Kato, 2015; Belloni, Chernozhukov, and Hansen, 2014).
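As a rough illustration, the following is a minimal sketch of the three steps for a single target variable, using hdm's rlasso() for the two selection steps. The helper double_selection() is our own illustrative construction, and we assume that rlasso()'s "index" component flags the selected covariates; in practice, rlassoEffects() (Section 5) automates this for many targets.

library(hdm)

double_selection <- function(d, X, y) {
  sel1 <- rlasso(X, d)$index            # step (1): lasso of the target d on the covariates
  sel2 <- rlasso(X, y)$index            # step (2): lasso of the outcome y on the covariates
  controls <- X[, sel1 | sel2, drop = FALSE]   # union of selected controls
  fit <- lm(y ~ d + controls)           # step (3): OLS with target and union of controls
  coef(summary(fit))["d", ]             # estimate, std. error, t value, p-value
}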

In the multiple testing scenario described above, the double-selection approach can be used to test the K null hypotheses H_{0,k}, k = 1, ..., K. Belloni, Chernozhukov, and Kato (2015) show that a valid (1 − α)-confidence interval can be constructed using the multiplier bootstrap as established in Chernozhukov, Chetverikov, and Kato (2013). Moreover, Chernozhukov, Chetverikov, and Kato (2013) and Belloni, Chernozhukov, and Kato (2015) show that a multiplier bootstrap version of the Romano-Wolf method can be used to construct a joint significance test in a high-dimensional setting such that asymptotic control of the FWER is obtained.

An advantage of the stepdown method of Romano and Wolf (2005a,b) is that it is based on resampling methods. Hence, it is able to account for the dependence structure underlying the test statistics and to give less conservative results compared to methods such as the Bonferroni and Holm corrections. The idea of the Romano-Wolf procedure is to construct rectangular simultaneous confidence intervals in subsequent steps, where in each step the coverage probability is kept above a level of (1 − α). If in step j the confidence set does not contain zero in dimension k, the corresponding H_k is rejected. In step j + 1, the algorithm proceeds analogously by constructing a rectangular joint confidence region for those coefficients for which the null hypothesis has not been rejected in step j or before. The algorithm stops when no further hypothesis is rejected. To take the dependence structure of the tested hypotheses into account, the classical Romano-Wolf stepdown procedure uses the bootstrap to compute the constant c(1 − α) which is needed to construct a rectangular confidence interval. This constant is estimated by the (1 − α)-quantile of the maxima of the bootstrapped test statistics in each step, so as to guarantee the coverage probability of (1 − α). The computational burden of the Romano-Wolf stepdown procedure can be reduced by using the multiplier bootstrap. This bootstrap method requires the solution of an orthogonal moment equation to be computed only once and then operates on perturbations by realizations of independent, standard normally distributed random variables (Chernozhukov, Chetverikov, and Kato, 2013).

4. Methods for Testing Multiple Hypotheses

4.1. A Global Test for Joint Significance with Lasso Regression. A basic question frequently arising in empirical work is whether the lasso regression has explanatory power, comparable to an F-test for the classical linear regression model. The construction of a joint significance test follows Chernozhukov, Chetverikov, and Kato (2013) (Appendix M). Based on the model y_i = β_0 + d_i'θ + x_i'β + ε_i with intercept β_0, the null hypothesis of joint statistical insignificance is H_0 : (θ', β')' = 0. The null hypothesis implies that

$$\mathbb{E}\big[(y_i - \beta_0) x_i\big] = 0,$$

and the restriction can be tested using the sup-score statistic

$$S = \big\| \sqrt{n}\, \mathbb{E}_n\big[(y_i - \widehat{\beta}_0)\, x_i\big] \big\|_\infty,$$

where β̂_0 = E_n[y_i]. The critical value for this statistic can be approximated by the multiplier bootstrap procedure, which simulates the statistic

$$S^* = \big\| \sqrt{n}\, \mathbb{E}_n\big[(y_i - \widehat{\beta}_0)\, x_i g_i\big] \big\|_\infty,$$

where the g_i are i.i.d. N(0, 1), conditional on the data. The (1 − α)-quantile of S* serves as the critical value, c(1 − α). We reject the null in favor of statistical significance if S > c(1 − α), and we keep the null of non-significance otherwise.
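A minimal sketch of this test, implementing the two displayed statistics directly, follows; the function name sup_score_test is our own, and hdm's summary(rlasso(...)) runs a refined version of this test automatically (see Section 5).

sup_score_test <- function(X, y, B = 1000, alpha = 0.05) {
  n <- nrow(X)
  e <- y - mean(y)                           # y_i - beta0_hat, with beta0_hat = E_n[y_i]
  S <- max(abs(sqrt(n) * colMeans(e * X)))   # sup-score statistic S
  S_star <- replicate(B, {
    g <- rnorm(n)                            # multiplier weights g_i ~ N(0, 1)
    max(abs(sqrt(n) * colMeans(e * X * g)))  # bootstrap statistic S*
  })
  crit <- quantile(S_star, 1 - alpha)        # critical value c(1 - alpha)
  list(statistic = S, critical.value = crit, reject = S > crit)
}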


4.2. Multiple Hypotheses Testing with Control of the Familywise Error Rate. The FWER is defined as the probability of falsely rejecting at least one hypothesis. The goal is to control the FWER and to ensure that it does not exceed a prespecified level α. We assume that the significance level for the individual tests is set uniformly to α.

4.2.1. Bonferroni Correction. According to the Bonferroni correction, the cutoff for the p-values is set to α* = α/K, and all hypotheses with p-values below the adjusted level α* are rejected. Boole's inequality then directly implies that the FWER is smaller than or equal to α. Instead of adjusting the level α to α*, it is possible to adjust the p-values, so that we reject a hypothesis H_k if p*_k = K · p_k < α. A drawback of the procedure is that it is quite conservative, meaning that in many applications, in particular in high-dimensional settings where many hypotheses are tested simultaneously, often no or very few hypotheses are rejected, increasing the risk of accepting false null hypotheses (i.e. of a type II error).
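For illustration, both the level adjustment and the p-value adjustment can be computed in base R (the p-values below are made-up examples):

pvals <- c(0.001, 0.004, 0.03, 0.12)      # illustrative unadjusted p-values
K <- length(pvals)
pvals < 0.05 / K                          # compare against the adjusted level alpha* = alpha/K
p.adjust(pvals, method = "bonferroni")    # equivalently: pmin(K * pvals, 1)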

4.2.2. Bonferroni-Holm Correction. We again assume that the p-values are ordered (from lowest to highest), p_(1) ≤ ... ≤ p_(K), with corresponding hypotheses H_(1), ..., H_(K). The Bonferroni-Holm procedure controls the FWER by the following procedure: let

$$k = \min\big\{\, j : p_{(j)} > \alpha^*_j \big\}, \qquad \alpha^*_j = \frac{\alpha}{K - j + 1},$$

i.e. k is the smallest index whose p-value exceeds the adjusted cutoff. Reject the null hypotheses H_(1), ..., H_(k−1) and accept H_(k), ..., H_(K). The Bonferroni-Holm procedure can be considered a general improvement over the Bonferroni correction that maintains control of the FWER while reducing the risk of a type II error at the same time. The adjusted p-values according to the Bonferroni-Holm correction are computed as p*_(j) = max_{l ≤ j} min{(K − l + 1) p_(l), 1}.
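A small sanity check of this formula against base R's built-in Holm adjustment (again with made-up p-values):

pvals <- sort(c(0.001, 0.004, 0.03, 0.12))       # ordered p-values p_(1) <= ... <= p_(K)
K <- length(pvals)
p.adjust(pvals, method = "holm")                 # built-in Holm adjustment
cummax(pmin((K - seq_len(K) + 1) * pvals, 1))    # the adjusted p-value formula, directly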

4.2.3. Joint Confidence Region Using the Multiplier Bootstrap. Belloni, Chernozhukov, and Kato (2015) derive valid (1 − α)-confidence regions for the vector of target coefficients θ in the high-dimensional regression setting of Equation 1 estimated with the lasso. The confidence regions, which are constructed with the multiplier bootstrap, can be used equivalently to a joint significance test of the K hypotheses. Accordingly, the null hypothesis H_{0,k} : θ_k = 0, k = 1, ..., K, is rejected at level α if the simultaneous (1 − α)-confidence region does not cover zero in dimension k.

4.2.4. Romano-Wolf Stepdown Procedure. While the Bonferroni and Bonferroni-Holm corrections do not take into account the dependence structure of the test statistics, further improvements in the control of the type II error can be achieved by modeling the dependence structure using resampling. A very popular and powerful method in this regard is the Romano-Wolf stepdown procedure. Stepdown methods proceed in several rounds, where in each round a decision is taken on the set of hypotheses being rejected. The algorithm continues until no further hypotheses are rejected. The Romano-Wolf stepdown procedure guarantees asymptotic control of the FWER at level α by constructing a sequence of simultaneous tests. We present the recent version of the Romano-Wolf method from Chernozhukov, Chetverikov, and Kato (2013) and Belloni, Chernozhukov, and Kato (2015), who prove the validity of the procedure in combination with the multiplier bootstrap.

1) Sort the test statistics in decreasing order of their absolute values: |t_(1)| ≥ |t_(2)| ≥ ... ≥ |t_(K)|.

2) Draw B multiplier bootstrap versions of each test statistic: t*,b_(k), b = 1, ..., B, k = 1, ..., K.

3) For each b and k, determine the maximum of the bootstrapped test statistics

$$m\big(t^{*,b}_{(k)}\big) = \max\big\{ |t^{*,b}_{(k)}|, |t^{*,b}_{(k+1)}|, \ldots, |t^{*,b}_{(K)}| \big\}.$$

4) Compute initial p-values, for (k) = 1, ..., K:

$$p^{\mathrm{init}}_{(k)} := \frac{\sum_{b=1}^{B} \mathbf{1}\big\{ m\big(t^{*,b}_{(k)}\big) \ge |t_{(k)}| \big\}}{B}.$$

5) Compute adjusted p-values by ensuring monotonicity:
   a) if (k) = 1: p*_(1) := p^init_(1);
   b) if (k) = 2, ..., K: p*_(k) := max{p^init_(k), p*_(k−1)}.

The p-value adjustment algorithm parallels that in Romano and Wolf (2016), with the only difference that the bootstrap test statistics are computed efficiently with the multiplier bootstrap procedure instead of the classical bootstrap. In contrast to traditional bootstrap methods, the multiplier bootstrap does not require re-estimation of the lasso regression for each bootstrap sample. Romano and Wolf (2005b, 2016) recommend a high number of bootstrap repetitions, B ≥ 1000.
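The adjustment itself (steps 1–5) is straightforward to sketch, assuming a vector of test statistics and a precomputed B × K matrix of multiplier bootstrap draws. The function rw_adjust() is our own illustrative helper; hdm's p_adjust(..., method = "RW"), described in Section 5, provides the full implementation.

rw_adjust <- function(t_stat, t_boot) {
  ord <- order(abs(t_stat), decreasing = TRUE)     # step 1: |t_(1)| >= |t_(2)| >= ...
  t_sorted <- abs(t_stat[ord])
  boot_sorted <- abs(t_boot[, ord, drop = FALSE])
  K <- length(t_stat)
  p_init <- numeric(K)
  for (k in seq_len(K)) {
    # step 3: row-wise maxima over the hypotheses not yet rejected
    m <- apply(boot_sorted[, k:K, drop = FALSE], 1, max)
    p_init[k] <- mean(m >= t_sorted[k])            # step 4: initial p-values
  }
  p_adj <- cummax(p_init)                          # step 5: enforce monotonicity
  p_adj[order(ord)]                                # return in the original order
}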

If the data stem from a randomized experiment, the method introduced in List, Shaikh, and Xu (2016) can be used. It is a variant of the Romano-Wolf procedure under unconfoundedness. Moreover, it makes it possible to compare the effects of different treatments and several outcome variables simultaneously.

4.3. Multiple Hypotheses Testing with Control of the False Discovery Rate. The FWER is a very strict criterion which is often very conservative. This means that in settings where thousands or hundreds of thousands of hypotheses are tested simultaneously, FWER control often fails to detect useful signals. Hence, in large-scale settings a less strict criterion, the so-called false discovery rate (FDR), is frequently employed. The false discovery proportion (FDP) is defined as the ratio of the number of hypotheses which are wrongly classified as significant (false positives) to the total number of positives; if the latter is zero, the FDP is defined as zero. The FDR is defined as the expected value of the FDP: FDR = E(FDP). The FDR concept reflects the tradeoff between false discoveries and true discoveries.

4.3.1. Benjamini-Hochberg Procedure. To control the FDR, the Benjamini-Hochberg (BH) procedure ranks the hypotheses according to the corresponding p-values and then chooses a cutoff along the ranking to control the FDR at a prespecified level γ ∈ (0, 1). The BH procedure first uses a stepup comparison to find a cutoff p-value,

$$k = \max_j \Big\{\, j : p_{(j)} \le \frac{j\,\gamma}{K} \Big\},$$

and then rejects all hypotheses H_(j), j = 1, ..., k. In most applications, γ = 0.1 is chosen.
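A direct sketch of the stepup rule and of the equivalent adjusted p-values (made-up p-values again):

pvals <- c(0.001, 0.004, 0.03, 0.12)   # illustrative unadjusted p-values
gamma <- 0.1
K <- length(pvals)
ord <- order(pvals)                    # ranking p_(1) <= ... <= p_(K)
k <- max(c(0, which(pvals[ord] <= seq_len(K) * gamma / K)))  # stepup cutoff index
reject <- ord[seq_len(k)]              # indices of the rejected hypotheses
p.adjust(pvals, method = "BH")         # equivalent adjusted p-values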


5. Implementation in R

Estimation of the high-dimensional regression model in Equation 1 and simultaneous inference on the target coefficients are implemented in the R package hdm available at CRAN. hdm provides an implementation of the double-selection approach of Belloni, Chernozhukov, and Kato (2015) using the lasso as the variable selection device. The function rlassoEffects() performs valid inference on a specified set of target parameters and returns an object of S3 class rlassoEffects. This output object is subsequently used to perform simultaneous inference, as described in the following. More details on the hdm package and introductory examples are provided in the hdm vignette available at CRAN and in Chernozhukov, Hansen, and Spindler (2016). The package hdm offers three ways to perform valid simultaneous inference in high-dimensional settings:

(1) Overall Significance Test
The hdm package provides a joint significance test that is comparable to the F-test known from classical ordinary least squares regression. Based on Chernozhukov, Chetverikov, and Kato (2013) (Appendix M), the null hypothesis that no covariate has explanatory power for the outcome y_i is tested, i.e.

H_0 : (θ', β')' = 0.

The test is performed automatically if summary() is called for an object of the S3 class rlasso. This object corresponds to the output of the function rlasso(), which implements the lasso estimator using a theory-based rule for determining the penalization parameter.

(2) Joint Confidence Interval
Based on an object of the S3 class rlassoEffects, a valid joint confidence interval with coverage probability (1 − α) can be constructed for the specified target coefficients using the command confint() with the option joint = TRUE.

(3) Multiple Testing Adjustment of p-Values
Starting with Version 0.3.0, the hdm package offers the S3 method p_adjust() for objects inheriting from the classes rlassoEffects and lm. By default, p_adjust() implements the Romano-Wolf stepdown procedure using the computationally efficient multiplier bootstrap procedure (option method = "RW"). Hence, hdm offers an implementation of the p-value adjustment that corresponds to a joint test à la Romano-Wolf, both for post-selection inference based on double selection with the lasso and for ordinary least squares regression. Moreover, the p_adjust() call offers classical adjustment methods in a user-friendly way, i.e. the function can be executed directly on the output object returned from a regression with rlassoEffects() or lm(). The hosted correction methods are those provided by the p.adjust() command of the base stats package, i.e. Bonferroni, Bonferroni-Holm, and Benjamini-Hochberg, among others. If an object of class lm is used, the user can provide an index of the coefficients to be tested simultaneously. By default, all coefficients are tested. A compact workflow combining the three tools is sketched below.
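A minimal end-to-end sketch on simulated data (all variable names and dimensions are illustrative; the calls mirror the real-data example in Section 7):

library(hdm)
set.seed(1)
n <- 100; p <- 80
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)                      # two truly non-zero coefficients
summary(rlasso(X, y), all = FALSE)                   # (1) overall significance test
effects <- rlassoEffects(x = X, y = y, index = 1:5)  # double selection for 5 targets
confint(effects, level = 0.95, joint = TRUE)         # (2) joint confidence interval
p_adjust(effects, method = "RW", B = 500)            # (3) Romano-Wolf adjusted p-values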

[Figure 1. Regression Coefficients, Simulation Study. Plot of the coefficients θ_k (vertical axis, 0.0 to 10.0) against k = 1, ..., 60 (horizontal axis).]

6. A Simulation Study for Valid Simultaneous Inference in High Dimensions

The simulation study provides a finite-sample comparison of different multiple testing corrections in a high-dimensional setting, i.e. the Bonferroni method, the Bonferroni-Holm procedure, the Benjamini-Hochberg adjustment, and the Romano-Wolf stepdown method. In addition, the study illustrates the failure of the naive approach that ignores the problem of simultaneous hypothesis testing, i.e. that applies no correction of the significance level or p-values.

6.1. Simulation Setting. We consider a regression of a continuous outcome variable y_i on a set of K = 60 regressors d_i,

$$y_i = \beta_0 + d_i'\theta + \varepsilon_i, \quad i = 1, \ldots, n, \tag{4}$$

with ε_i ∼ N(0, σ²) and variance σ² = 3. In our setting, the realizations of d_i are generated by a joint normal distribution, d_i ∼ N(µ, Σ), with µ = 0 and Σ_{j,k} = ρ^{|j−k|} with ρ = 0.9. We consider the case of an i.i.d. sample with n = 200 and n = 500 observations. The setting is sparse in that only a few regressors, s = 12, are truly non-zero, whereas the locations of the non-zero coefficients are known only to an inaccessible oracle. Thus, the lasso is used to select the set of explanatory variables with a non-zero coefficient. Figure 1 presents the regression coefficients in the simulation study.

The regression in Equation 4 is estimated with post-lasso, and inference for all regression coefficients is based on double selection.² The K = 60 hypotheses

$$H_{0,k} : \theta_k = 0, \quad k = 1, \ldots, K,$$

are tested simultaneously.
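A sketch of one simulation draw under this design follows; the magnitudes and positions of the s = 12 non-zero coefficients are our own illustrative assumptions, and the exact design is in the supplemental code.

library(MASS)   # for mvrnorm
library(hdm)
set.seed(123)
n <- 200; K <- 60; s <- 12; rho <- 0.9; sigma2 <- 3
Sigma <- rho^abs(outer(1:K, 1:K, "-"))            # Sigma_{j,k} = rho^|j - k|
d <- mvrnorm(n, mu = rep(0, K), Sigma = Sigma)    # regressors d_i ~ N(0, Sigma)
theta <- c(seq(10, 2.5, length.out = s), rep(0, K - s))  # assumed coefficient values
y <- as.vector(d %*% theta + rnorm(n, sd = sqrt(sigma2)))
effects <- rlassoEffects(x = d, y = y, index = 1:K)      # double selection, all 60 targets
pvals.rw <- p_adjust(effects, method = "RW", B = 500)    # Romano-Wolf adjusted p-values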

6.2. Results. The simulation results are summarized in Table 1. The reported results refer to averages over R = 5000 repetitions in terms of correct and incorrect rejections of null hypotheses at a prespecified level of α = 0.1, as well as the empirical FWER and FDR.

The results show that multiple testing adjustment is of great importance, since naive inferential statements might be invalid. If each of the hypotheses is tested at a significance level of α = 0.1 without adjustment of the p-values, the naive procedure leads to at least one incorrect rejection with a probability of almost 1. The Benjamini-Hochberg procedure with γ = 0.1 also incorrectly rejects a true null in more than 3000 out of the 5000 repetitions. On average, more than 5 (n = 200) and more than 4 (n = 500) true hypotheses are rejected without adjustment of the p-values. The simulation study illustrates that control of the FDR is achieved by the Benjamini-Hochberg correction: although one would incorrectly reject more than 1 hypothesis on average, over all 5000 repetitions only 9.5% (n = 200) and 8.7% (n = 500) of all rejections are incorrect (false positives), which is below the specified level of γ = 0.1.

Methods with asymptotic control of the FWER are much less likely to erroneously reject true null hypotheses. In the setting with n = 200 observations, the number of incorrect rejections is on average around 0.1 to 0.17, with the Bonferroni correction being the most conservative. The empirical familywise error rates are very close to the desired level 0.1, despite the small number of observations and the relatively large number of tested hypotheses. With the larger sample size (n = 500), the empirical FWERs approach the level 0.1. The price for controlling the probability of at least one type I error is paid in terms of power, with fewer correct rejections for the FWER methods compared to the Benjamini-Hochberg correction. The largest number of correct rejections while still maintaining control of the FWER is achieved by the Romano-Wolf stepdown procedure. Hence, taking the dependence structure of the test statistics into account is favorable in case of dependencies among the test statistics.

² More details on the implementation of the simulation study are provided in the appendix and the supplemental material available at https://www.bwl.uni-hamburg.de/en/statistik/forschung/software-und-daten.html.

                          Naive   Benjamini-Hochberg   Bonferroni   Bonferroni-Holm   Romano-Wolf
n = 200
Correct Rejections       10.749         9.375            7.597          7.674            7.765
                        (0.798)        (0.976)          (0.965)        (0.966)          (0.963)
Incorrect Rejections      5.114         1.128            0.129          0.145            0.163
                        (2.386)        (1.271)          (0.372)        (0.397)          (0.425)
Familywise Error Rate     0.991         0.606            0.117          0.129            0.143
False Discovery Rate      0.308         0.095            0.015          0.016            0.018

n = 500
Correct Rejections       11.872        11.578           10.658         10.735           10.785
                        (0.338)        (0.558)          (0.745)        (0.742)          (0.738)
Incorrect Rejections      4.893         1.226            0.101          0.121            0.137
                        (2.381)        (1.317)          (0.333)        (0.362)          (0.387)
Familywise Error Rate     0.987         0.636            0.091          0.110            0.123
False Discovery Rate      0.278         0.087            0.009          0.010            0.011

Table 1. Simulation Results
The table presents the average number of correct and incorrect rejections at a significance level of α = 0.1 and the FWER over all R = 5000 repetitions. For the Benjamini-Hochberg adjustment, γ = 0.1 is chosen. Standard deviations in parentheses.


7. A Real-Data Example for Simultaneous Inference in High Dimensions – The Gender Wage Gap Analysis

The following section demonstrates the methods for valid simultaneous inference implemented in the package hdm and provides a comparison of the classical correction methods in a replicable real-data example. The gender wage gap, i.e. the relative difference in wages between male and female employees, is a central topic in labor economics. Frequently, studies report an average gender wage gap estimate, i.e. how much less women earn compared to men in terms of average (i.e. mean or median) wages. However, it might be helpful for policy makers to gain a more detailed impression of the gender wage gap and to assess whether and to what extent the gender wage gap differs across individuals. A simplistic although frequently encountered approach to assessing wage gap heterogeneity is to compare the relative wage gap between female and male employees in subgroups that are defined in terms of a particular characteristic, e.g. industry. Obviously, this approach neglects the role of other characteristics relevant for wage income and, hence, for the wage gap, e.g. educational background, experience, etc. As an exemplary illustration, the gender gap in average (mean) earnings in 12 different industrial categories is presented in Figure 2, suggesting that the wage gap differs substantially across the subgroups. In contrast to this approach based purely on descriptive statistics, an extended regression equation including interaction terms of the gender indicator with observable characteristics is able to take the role of other labor market characteristics into account and, hence, can give insights into the determinants of the gender wage gap. As the regression approach leads to a large number of coefficients that are tested simultaneously, an appropriate multiple testing adjustment is required. Thus, the heterogeneous gender wage gap example is used to demonstrate the adjustment methods provided in the R package hdm. The presented example is an illustration of the more extensive analysis of a heterogeneous gender wage gap in Bach, Chernozhukov, and Spindler (2018).

7.1. Data Preparation. The exemplary data set is a subsample of the 2016 American Community Survey.³ The data provide information on civilian, full-time working (35+ hours a week, 50+ weeks a year), White, non-Hispanic employees aged older than 25 and younger than 40, with earnings exceeding the federal minimum level of earnings ($12,687.5 of yearly wage income). First, the data set is loaded from the hdm package and prepared for the analysis.

# Clear the workspace and load the hdm package
rm(list = ls())
library(hdm)
# Load the ACS data
load("ACS2016_gender.rda")

7.2. Valid Simultaneous Inference on a Heterogeneous Gender Wage Gap. In order to answer in a valid way the question whether the gender wage gap differs according to the observable characteristics of female employees, it is necessary to account for regressors that affect women's job market environment. In the example, variables on marriage, the presence of own children, geographic variation, job characteristics (industry, occupation, hours worked), human capital variables (years of education, experience (squared)), and field of degree are considered.

³ The example can be replicated with the documentation "Appendix: Replicable Data Example" that is available online. Further information and the code are provided at https://www.bwl.uni-hamburg.de/en/statistik/forschung/software-und-daten.html.

[Figure 2. Average Wage Gap across Industries, ACS 2016. Horizontal bar chart of the average wage gap (absolute value, 0.0 to 0.3) by industry: ADMIN, RETAIL, MANUF, PERSON, WHOLESALE, TRANS, AGRI, ENTER, BUISREPSERV, CONSTR, PROFE, FINANCE.]

A wage regression is set up that includes all two-way interactions of female with the available characteristics in addition to the baseline regressors x_i:

$$\ln w_i = \beta_0 + \sum_{k=1}^{K} \theta_k \,(\text{female}_i \times x_{k,i}) + x_i'\beta + \varepsilon_i, \quad i = 1, \ldots, n. \tag{5}$$

The analysis begins with constructing a model matrix that implements the regression relationship of interest.

# Weekly log wages as outcome variable
y = data$lwwage
# Model matrix containing 2-way interactions of female
# with relevant regressors + covariates
X = model.matrix( ~ 1 + fem + fem:(ind + occ + hw + deg + yos + exp + exp2 +
                  married + chld19 + region + msa) +
                  (married + chld19 + region + msa + ind + occ + hw +
                  deg + yos + exp + exp2), data = data)
# Exclude the constant variables
X = X[, which(apply(X, 2, var) != 0)]
dim(X)

## [1] 70473   123

Accordingly, the regression model considered has p = 123 regressors in total and is estimated on the basis of n = 70473 observations. The wage Equation 5 is estimated with the lasso with the theory-based choice of the penalty term ("rlasso"). To answer the question whether the included regressors have any explanatory power for the outcome variable, the global test of overall significance is run by calling summary() on the output object of the rlasso() function.

# Estimate rlasso
lasso1 = rlasso(X, y)
# Run test
summary(lasso1, all = FALSE)
# Output shifted to the Appendix.

The hypothesis that all coefficients in the model are zero can be rejected at all common significance levels. The main objective of the analysis is to estimate the magnitude of the effects associated with the gender interactions and to assess whether these effects are jointly significantly different from zero. The so-called "target" variables, in total 62 regressors, are specified in the index option of the rlassoEffects() function. Hence, it is necessary to indicate the columns of the created model matrix that correspond to interactions with the female dummy.

# Construct index for gender variable and interactions (target parameters)
index.female = grep("fem", colnames(X))
K = length(index.female)
# Perform inference on target coefficients
# estimation might take some time (about 10 minutes)
effects = rlassoEffects(x = X, y = y, index = index.female)
summary(effects)
# Output shifted to the Appendix.

The output presented in the appendix shows the K = 62 estimated coefficients together with t-statistics and unadjusted p-values. The next step is to adjust the p-values for multiple testing. Starting with Version 0.3.0, hdm offers the S3 method p_adjust() for objects inheriting from the classes rlassoEffects and lm. It hosts the correction methods from the function p.adjust() of the stats package, e.g. Bonferroni, Bonferroni-Holm, and Benjamini-Hochberg, as well as no correction at all. First, the naive approach without a multiple testing correction at a given significance level α is presented. Table 2 shows the number of rejections at significance levels α ∈ {0.01, 0.05, 0.1}.

# Extract (unadjusted) p-values
pvals.unadj = p_adjust(effects, method = "none")
# Coefficients and p-values
head(pvals.unadj)

##                   Estimate.    pval
## femTRUE             -0.0799 0.20800
## femTRUE:indAGRI     -0.1307 0.00349
## femTRUE:indCONSTR   -0.0521 0.15226
## femTRUE:indMANUF    -0.0097 0.70716
## femTRUE:indTRANS    -0.0403 0.15392
## femTRUE:indRETAIL    0.0264 0.30256

Method                    Significance Level
                          0.01    0.05    0.10
Naive                       18      21      25
Benjamini-Hochberg          10      19      21
Bonferroni                   9      10      10
Holm                         9      10      10
Joint Confidence Region      9      10      11
Romano-Wolf                  8      11      11

Table 2. Number of Rejected Hypotheses

Thus, without a correction for multiple testing, 18, 21, and 25 hypotheses can be rejected at the significance levels of 1%, 5%, and 10%. Returning to the initial example on variation by industry, one would find significant variation of the wage gap by industry (as compared to the baseline category "Wholesale") in 3 categories, namely "Agriculture", "Finance, Insurance, and Real Estate", and "Professional and Related Services", at a significance level of 0.1.

Second, classical correction methods like the Bonferroni, Bonferroni-Holm, and Benjamini-Hochberg adjustments are used to account for testing the 62 hypotheses at the same time.

# Bonferroni
pvals.bonf = p_adjust(effects, method = "bonferroni")
# Holm
pvals.holm = p_adjust(effects, method = "holm")
head(pvals.bonf)

##                   Estimate.  pval
## femTRUE             -0.0799 1.000
## femTRUE:indAGRI     -0.1307 0.217
## femTRUE:indCONSTR   -0.0521 1.000
## femTRUE:indMANUF    -0.0097 1.000
## femTRUE:indTRANS    -0.0403 1.000
## femTRUE:indRETAIL    0.0264 1.000

head(pvals.holm)

##                   Estimate.  pval
## femTRUE             -0.0799 1.000
## femTRUE:indAGRI     -0.1307 0.178
## femTRUE:indCONSTR   -0.0521 1.000
## femTRUE:indMANUF    -0.0097 1.000
## femTRUE:indTRANS    -0.0403 1.000
## femTRUE:indRETAIL    0.0264 1.000

As a general improvement, the Holm-corrected p-values are smaller than or equal to those obtained from a Bonferroni adjustment. At significance levels of 1%, 5%, and 10%, fewer hypotheses can be rejected when the p-values are corrected for multiple testing.

According to the Benjamini-Hochberg (BH) correction of the p-values (Benjamini and Hochberg, 1995), which achieves control of the FDR, it is possible to reject 10, 19, and 21 null hypotheses at specified FDR levels γ of 0.01, 0.05, and 0.1.

pvals.BH = p_adjust(effects, method = "BH")
head(pvals.BH)

##                   Estimate.   pval
## femTRUE             -0.0799 0.3778
## femTRUE:indAGRI     -0.1307 0.0174
## femTRUE:indCONSTR   -0.0521 0.3078
## femTRUE:indMANUF    -0.0097 0.8412
## femTRUE:indTRANS    -0.0403 0.3078
## femTRUE:indRETAIL    0.0264 0.4690

Regarding variation by industry, the Bonferroni and Holm procedures find a significantly different wage gap (at the 10% significance level) only for the industry "Finance, Insurance, and Real Estate", whereas the Benjamini-Hochberg correction with γ = 0.1 leads to the same conclusion as obtained without any adjustment. These results can now be compared to the results obtained from a joint significance test with and without the Romano-Wolf stepdown procedure. We start with the construction of a joint 0.9-confidence region for the 62 coefficients using the option joint = TRUE in the confint() function for objects of the class rlassoEffects.

# valid joint 0.9-confidence interval
alpha = 0.1
CI = confint(effects, level = 1 - alpha, joint = TRUE)
head(CI)

##                       5 %   95 %
## femTRUE           -0.2758 0.1160
## femTRUE:indAGRI   -0.2923 0.0310
## femTRUE:indCONSTR -0.1632 0.0589
## femTRUE:indMANUF  -0.0912 0.0718
## femTRUE:indTRANS  -0.1305 0.0499
## femTRUE:indRETAIL -0.0582 0.1110

The results from the confidence intervals are equivalent to a test at significance level α = 0.1, so that 11 hypotheses can be rejected. The Romano-Wolf stepdown procedure, however, can increase power further. For instance, at a significance level of 5%, the stepdown correction rejects one more hypothesis than the joint confidence interval. The p-values can be adjusted according to the Romano-Wolf stepdown algorithm by setting the option method = "RW" (the default) in the p_adjust() call. The number of bootstrap repetitions can be varied by specifying the option B (B = 1000 by default).

# Romano-Wolf stepdown adjustment
pvals.RW = p_adjust(effects, method = "RW", B = 1000)
head(pvals.RW)

##                   Estimate.  pval
## femTRUE             -0.0799 0.997
## femTRUE:indAGRI     -0.1307 0.369
## femTRUE:indCONSTR   -0.0521 0.986
## femTRUE:indMANUF    -0.0097 1.000
## femTRUE:indTRANS    -0.0403 0.992
## femTRUE:indRETAIL    0.0264 0.999

Using the joint confidence interval and the Romano-Wolf stepdown adjustment makes it possible to reject more hypotheses than with the traditional methods at the 5% and 10% significance levels. Hence, taking the dependence of the test statistics into account is beneficial in terms of power in the real-data example.

8. Conclusion

The previous sections provide a short overview of important methods for multiple testing adjustment in a high-dimensional regression setting. Throughout the paper, our intention was to present the concepts and the necessity of a multiple testing adjustment in a comprehensible way. Similarly, the tools for valid simultaneous inference in high-dimensional settings that are available in the R package hdm are intended to be easy to use in empirical applications. The demonstration of the methods in the real-data example is intended to motivate applied statisticians to (i) use modern statistical methods for high-dimensional regression, i.e. the lasso, and (ii) appropriately adjust when multiple hypotheses are tested simultaneously. Since hdm provides user-friendly adjustment methods for objects of the S3 class lm, we hope that users will apply the correction methods more frequently, even in classical least squares regression.


9. Appendix

9.1. Details of Simulation Study. The simulation study was implemented using the statistical software R (version 3.3.3, 2017-03-06) on an x86_64-redhat-linux-gnu (64-bit) platform. For the sake of replicability, the R code for the simulation study is available as supplemental material on the website https://www.bwl.uni-hamburg.de/en/statistik/forschung/software-und-daten.html. In the lasso regression, the theory-based and data-dependent choice of the penalty term λ for homoscedastic errors is implemented (Chernozhukov, Hansen, and Spindler, 2016).

9.2. Additional Simulation Results. We additionally provide the simulation results when the significance tests are based on a level of α = 0.05 or α = 0.01.

                          Naive   Benjamini-Hochberg   Bonferroni   Bonferroni-Holm   Romano-Wolf
n = 200
Correct Rejections       10.201         8.730            7.163          7.231            7.293
                        (0.842)        (1.008)          (0.974)        (0.981)          (0.983)
Incorrect Rejections      2.676         0.545            0.071          0.078            0.086
                        (1.758)        (0.836)          (0.275)        (0.290)          (0.305)
Familywise Error Rate     0.916         0.381            0.067          0.073            0.080
False Discovery Rate      0.194         0.052            0.009          0.009            0.010

n = 500
Correct Rejections       11.770        11.350           10.364         10.441           10.483
                        (0.440)        (0.648)          (0.781)        (0.775)          (0.771)
Incorrect Rejections      2.518         0.588            0.053          0.062            0.068
                        (1.721)        (0.888)          (0.240)        (0.260)          (0.272)
Familywise Error Rate     0.898         0.394            0.050          0.058            0.062
False Discovery Rate      0.165         0.045            0.005          0.005            0.006

Table 3. Simulation Results
The table presents the average number of correct and incorrect rejections at a significance level of α = 0.05 and the FWER over all R = 5000 repetitions. For the Benjamini-Hochberg adjustment, γ = 0.05 is chosen. Standard deviations in parentheses.

                          Naive   Benjamini-Hochberg   Bonferroni   Bonferroni-Holm   Romano-Wolf
n = 200
Correct Rejections        8.905         7.446            6.294          6.346            6.390
                        (0.930)        (1.041)          (0.940)        (0.949)          (0.961)
Incorrect Rejections      0.614         0.113            0.019          0.022            0.023
                        (0.841)        (0.355)          (0.140)        (0.152)          (0.155)
Familywise Error Rate     0.434         0.101            0.019          0.021            0.022
False Discovery Rate      0.059         0.013            0.003          0.003            0.003

n = 500
Correct Rejections       11.342        10.706            9.642          9.716            9.746
                        (0.641)        (0.763)          (0.803)        (0.811)          (0.828)
Incorrect Rejections      0.537         0.119            0.012          0.014            0.015
                        (0.792)        (0.362)          (0.110)        (0.119)          (0.126)
Familywise Error Rate     0.389         0.107            0.012          0.014            0.015
False Discovery Rate      0.042         0.010            0.001          0.001            0.001

Table 4. Simulation Results
The table presents the average number of correct and incorrect rejections at a significance level of α = 0.01 and the FWER over all R = 5000 repetitions. For the Benjamini-Hochberg adjustment, γ = 0.01 is chosen. Standard deviations in parentheses.


9.3. Additional Results, Real-Data Example. For the sake of brevity, the full summary outputs of the global significance test and the joint significance test for the target coefficients with lasso and double selection are omitted from the main text. For completeness, the output is presented in the following.

lasso1 = rlasso(X, y)
# Run test
summary(lasso1, all = FALSE)

##

## Call:

## rlasso.default(x = X, y = y)

##

## Post-Lasso Estimation: TRUE

##

## Total number of variables: 123

## Number of selected variables: 58

##

## Residuals:

## Min 1Q Median 3Q Max

## -2.50948 -0.27489 -0.00787 0.25579 2.66745

##

## Estimate

## (Intercept) 4.70

## femTRUE 0.01

## married 0.12

## chld19 0.09

## regionMiddle Atlantic Division 0.07

## regionEast North Central Div. -0.04

## regionWest North Central Div. -0.07

## regionEast South Central Div. -0.12

## regionMountain Division -0.06

## regionPacific Division 0.13

## msa 0.18

## indAGRI -0.22

## indMANUF 0.08

## indTRANS 0.09

## indRETAIL -0.14

## indFINANCE 0.18

## indBUISREPSERV 0.12

## indENTER -0.08

## indPROFE -0.06

## indADMIN -0.05

## occBus Operat Spec -0.04

## occComput/Math 0.02

## occLife/Physical/Soc Sci. -0.18

## occComm/Soc Serv -0.34

## occLegal 0.11

## occEduc/Training/Libr -0.35

## occArts/Design/Entert/Sports/Media -0.17

## occHealthc Pract/Technic 0.10

## occProtect Serv -0.07

## occOffice/Administr Supp -0.32

## occProd -0.32

## hw50to59 0.21

## hw60to69 0.28

## hw70plus 0.25

## degComp/Inform Sci 0.16

## degEngin 0.21

## degEnglish/Lit/Compos -0.06

## degLib Arts/Hum -0.06

## degBio/Life Sci 0.04

## degMath/Stats 0.16

## degPhys Fit/Parks/Recr/Leis -0.08

## degPsych -0.04

## degCrim Just/Fire Prot -0.04

## degPubl Aff/Policy/Soc Wo -0.05

## degSoc Sci 0.08

## degFine Arts -0.08


## degBus 0.08

## degHist -0.06

## yos 0.11

## exp 0.03

## femTRUE:married -0.05

## femTRUE:chld19 -0.04

## femTRUE:regionMountain Division -0.01

## femTRUE:indAGRI -0.07

## femTRUE:indFINANCE -0.13

## femTRUE:occArchit/Engin 0.03

## femTRUE:occOffice/Administr Supp -0.02

## femTRUE:degPubl Aff/Policy/Soc Wo -0.01

## femTRUE:exp2 -0.02

##

## Residual standard error: 0.463

## Multiple R-squared: 0.391

## Adjusted R-squared: 0.39

## Joint significance test:

## the sup score statistic for joint significance test is 280 with a p-value of 0

# Summary of significance test

summary(effects)

## [1] "Estimates and significance testing of the effect of target variables"

## Estimate. Std. Error t value

## femTRUE -0.079926 0.063479 -1.26

## femTRUE:indAGRI -0.130650 0.044735 -2.92

## femTRUE:indCONSTR -0.052142 0.036422 -1.43

## femTRUE:indMANUF -0.009699 0.025818 -0.38

## femTRUE:indTRANS -0.040300 0.028265 -1.43

## femTRUE:indRETAIL 0.026415 0.025622 1.03

## femTRUE:indFINANCE -0.135372 0.024672 -5.49

## femTRUE:indBUISREPSERV -0.035614 0.026010 -1.37

## femTRUE:indPERSON -0.047695 0.039551 -1.21

## femTRUE:indENTER -0.050513 0.039630 -1.27

## femTRUE:indPROFE -0.054828 0.024235 -2.26

## femTRUE:indADMIN -0.011253 0.028358 -0.40

## femTRUE:occBus Operat Spec 0.025938 0.016180 1.60

## femTRUE:occFinanc Spec -0.048161 0.016859 -2.86

## femTRUE:occComput/Math -0.006376 0.018829 -0.34

## femTRUE:occArchit/Engin 0.043310 0.027374 1.58

## femTRUE:occLife/Physical/Soc Sci. 0.061113 0.024642 2.48

## femTRUE:occComm/Soc Serv 0.156487 0.024372 6.42

## femTRUE:occLegal 0.009461 0.021722 0.44

## femTRUE:occEduc/Training/Libr 0.115386 0.015943 7.24

## femTRUE:occArts/Design/Entert/Sports/Media 0.046201 0.019747 2.34

## femTRUE:occHealthc Pract/Technic 0.006586 0.017898 0.37

## femTRUE:occProtect Serv -0.003907 0.036230 -0.11

## femTRUE:occSales -0.018405 0.015144 -1.22

## femTRUE:occOffice/Administr Supp -0.000996 0.015553 -0.06

## femTRUE:occProd 0.001608 0.036690 0.04

## femTRUE:hw40to49 -0.053407 0.017716 -3.01

## femTRUE:hw50to59 -0.071987 0.019040 -3.78

## femTRUE:hw60to69 -0.123902 0.022724 -5.45

## femTRUE:hw70plus -0.202611 0.031436 -6.45

## femTRUE:degAgri 0.008232 0.035918 0.23

## femTRUE:degComm 0.039552 0.021445 1.84

## femTRUE:degComp/Inform Sci -0.080594 0.030006 -2.69

## femTRUE:degEngin -0.007807 0.025530 -0.31

## femTRUE:degEnglish/Lit/Compos 0.019353 0.024121 0.80

## femTRUE:degLib Arts/Hum 0.046775 0.037581 1.24

## femTRUE:degBio/Life Sci -0.037057 0.022191 -1.67

## femTRUE:degMath/Stats -0.063706 0.034123 -1.87

## femTRUE:degPhys Fit/Parks/Recr/Leis -0.022403 0.030424 -0.74

## femTRUE:degPhys Sci -0.075693 0.026715 -2.83

## femTRUE:degPsych -0.007874 0.023112 -0.34

## femTRUE:degCrim Just/Fire Prot -0.085437 0.029390 -2.91

## femTRUE:degPubl Aff/Policy/Soc Wo -0.017049 0.041864 -0.41

## femTRUE:degSoc Sci -0.052175 0.019783 -2.64

## femTRUE:degFine Arts -0.017200 0.022201 -0.77

## femTRUE:degMed/Hlth Sci Serv -0.026064 0.025231 -1.03


## femTRUE:degBus -0.020136 0.017776 -1.13

## femTRUE:degHist -0.071948 0.026570 -2.71

## femTRUE:yos 0.005787 0.003359 1.72

## femTRUE:exp -0.001838 0.003762 -0.49

## femTRUE:exp2 -0.008556 0.009151 -0.93

## femTRUE:married -0.050840 0.009097 -5.59

## femTRUE:chld19 -0.051834 0.009268 -5.59

## femTRUE:regionMiddle Atlantic Division -0.024007 0.015626 -1.54

## femTRUE:regionEast North Central Div. -0.009862 0.015658 -0.63

## femTRUE:regionWest North Central Div. -0.011769 0.018628 -0.63

## femTRUE:regionSouth Atlantic Division -0.001011 0.015414 -0.07

## femTRUE:regionEast South Central Div. 0.006562 0.020696 0.32

## femTRUE:regionWest South Central Div. -0.072252 0.017598 -4.11

## femTRUE:regionMountain Division -0.028934 0.019022 -1.52

## femTRUE:regionPacific Division -0.057981 0.016211 -3.58

## femTRUE:msa -0.002253 0.014194 -0.16

## Pr(>|t|)

## femTRUE 0.20800

## femTRUE:indAGRI 0.00349 **

## femTRUE:indCONSTR 0.15226

## femTRUE:indMANUF 0.70716

## femTRUE:indTRANS 0.15392

## femTRUE:indRETAIL 0.30256

## femTRUE:indFINANCE 4.1e-08 ***

## femTRUE:indBUISREPSERV 0.17091

## femTRUE:indPERSON 0.22785

## femTRUE:indENTER 0.20245

## femTRUE:indPROFE 0.02368 *

## femTRUE:indADMIN 0.69150

## femTRUE:occBus Operat Spec 0.10892

## femTRUE:occFinanc Spec 0.00428 **

## femTRUE:occComput/Math 0.73491

## femTRUE:occArchit/Engin 0.11361

## femTRUE:occLife/Physical/Soc Sci. 0.01314 *

## femTRUE:occComm/Soc Serv 1.4e-10 ***

## femTRUE:occLegal 0.66316

## femTRUE:occEduc/Training/Libr 4.6e-13 ***

## femTRUE:occArts/Design/Entert/Sports/Media 0.01930 *

## femTRUE:occHealthc Pract/Technic 0.71291

## femTRUE:occProtect Serv 0.91413

## femTRUE:occSales 0.22424

## femTRUE:occOffice/Administr Supp 0.94896

## femTRUE:occProd 0.96504

## femTRUE:hw40to49 0.00257 **

## femTRUE:hw50to59 0.00016 ***

## femTRUE:hw60to69 5.0e-08 ***

## femTRUE:hw70plus 1.2e-10 ***

## femTRUE:degAgri 0.81872

## femTRUE:degComm 0.06514 .

## femTRUE:degComp/Inform Sci 0.00723 **

## femTRUE:degEngin 0.75976

## femTRUE:degEnglish/Lit/Compos 0.42237

## femTRUE:degLib Arts/Hum 0.21327

## femTRUE:degBio/Life Sci 0.09494 .

## femTRUE:degMath/Stats 0.06191 .

## femTRUE:degPhys Fit/Parks/Recr/Leis 0.46152

## femTRUE:degPhys Sci 0.00461 **

## femTRUE:degPsych 0.73334

## femTRUE:degCrim Just/Fire Prot 0.00365 **

## femTRUE:degPubl Aff/Policy/Soc Wo 0.68383

## femTRUE:degSoc Sci 0.00836 **

## femTRUE:degFine Arts 0.43848

## femTRUE:degMed/Hlth Sci Serv 0.30160

## femTRUE:degBus 0.25730

## femTRUE:degHist 0.00677 **

## femTRUE:yos 0.08494 .

## femTRUE:exp 0.62519

## femTRUE:exp2 0.34980

## femTRUE:married 2.3e-08 ***

## femTRUE:chld19 2.2e-08 ***


## femTRUE:regionMiddle Atlantic Division 0.12446

## femTRUE:regionEast North Central Div. 0.52878

## femTRUE:regionWest North Central Div. 0.52751

## femTRUE:regionSouth Atlantic Division 0.94770

## femTRUE:regionEast South Central Div. 0.75119

## femTRUE:regionWest South Central Div. 4.0e-05 ***

## femTRUE:regionMountain Division 0.12824

## femTRUE:regionPacific Division 0.00035 ***

## femTRUE:msa 0.87386

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


References

Bach, P., V. Chernozhukov, and M. Spindler (2018): "An econometric (re-)analysis of the gender wage gap in a high-dimensional setting," Work in Progress.

Belloni, A., V. Chernozhukov, D. Chetverikov, and Y. Wei (forthcoming): "Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework," Annals of Statistics, arXiv preprint arXiv:1512.07619.

Belloni, A., V. Chernozhukov, and C. Hansen (2014): "Inference on Treatment Effects After Selection Amongst High-Dimensional Controls," Review of Economic Studies, 81, 608–650.

Belloni, A., V. Chernozhukov, and K. Kato (2015): "Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems," Biometrika, 102(1), 77–94.

Benjamini, Y., and Y. Hochberg (1995): "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society, Series B (Methodological), pp. 289–300.

Chernozhukov, V., D. Chetverikov, and K. Kato (2013): "Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors," Annals of Statistics, 41, 2786–2819.

Chernozhukov, V., C. Hansen, and M. Spindler (2016): "hdm: High-Dimensional Metrics," The R Journal, 8(2), 185–199.

Holm, S. (1979): "A simple sequentially rejective multiple test procedure," Scandinavian Journal of Statistics, pp. 65–70.

List, J. A., A. M. Shaikh, and Y. Xu (2016): "Multiple Hypothesis Testing in Experimental Economics," Working Paper 21875, National Bureau of Economic Research.

Romano, J. P., and M. Wolf (2005a): "Exact and Approximate Stepdown Methods for Multiple Hypothesis Testing," Journal of the American Statistical Association, 100(469), 94–108.

Romano, J. P., and M. Wolf (2005b): "Stepwise Multiple Testing as Formalized Data Snooping," Econometrica, 73(4), 1237–1282.

Romano, J. P., and M. Wolf (2016): "Efficient computation of adjusted p-values for resampling-based stepdown multiple testing," Statistics & Probability Letters, 113, 38–40.