construction of posi statistics - arun-kuchibhotla.github.io · andreas buja and arun kumar...

28
Construction of PoSI Statistics 1 Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI 2018 1 Joint work with "Larry’s Group" at Wharton, including Larry Brown, Edward George, Linda Zhao and Junhui Cai. Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 1 / 28

Upload: others

Post on 20-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Construction of PoSI Statistics1

Andreas Buja and Arun Kumar Kuchibhotla

Department of StatisticsUniversity of Pennsylvania

September 8, 2018

WHOA-PSI 2018

1Joint work with "Larry’s Group" at Wharton, including Larry Brown, EdwardGeorge, Linda Zhao and Junhui Cai.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 1 / 28

Page 2: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

In Memory of

LAWRENCE D. BROWN †1940 – 2018.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 2 / 28

Page 3: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Outline

1 Introduction

2 PoSI in High-dimensions under Misspecification

3 Three PoSI Confidence Regions

4 Numerical Examples

5 Summary

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 3 / 28

Page 4: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Outline

1 Introduction

2 PoSI in High-dimensions under Misspecification

3 Three PoSI Confidence Regions

4 Numerical Examples

5 Summary

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 4 / 28

Page 5: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

A Crisis in the Sciences: Irreproducibility

Indicators of a crisis:Bayer Healthcare reviewed 67 in-house attempts at replicating findings inpublished research: < 1/4 were viewed as replicated

Arrowsmith (2011, Nat. Rev. Drug Discovery 10):Increasing failure rate in Phase II drug trials

Ioannidis (2005, PLOS Medicine):“Why Most Published Research Findings Are False”

Simmons, Nelson, Simonsohn (2011, Psychol.Sci):“False-Positive Psychology: Undisclosed Flexibility in Data Collectionand Analysis Allows Presenting Anything as Significant”

=⇒ Irreproducibility of Empirical Findings

Many potential causes – two major ones:Institutional: Publication bias, “file drawer problem”

Methodological: Statistical biases, “researcher degrees of freedom”

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 5 / 28

Page 6: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Irreproducibility: Methodol. Factor 1 – Selection

A statistical bias is due to lack of accounting forselection of variables, transforms, scales, subsets, weights, ....

Regressor/model selection (our focus) is on several levels:

formal selection: all subset (Cp, AIC, BIC,...), stepwise (F), lasso,...

informal selection: diagnostics for GoF, influence, collinearity,...

post hoc selection: “Effect size is too small, the variable too costly.”

Suspicions and Criticisms:

All three modes of selection are (should be) used.

More thorough data analysis =⇒ More spurious results

Not a solution: Post-selection inference for “adaptive Lasso”, say.Empirical researchers do not write contracts with themselves tocommit a priori to one formal selection method and nothing else.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 6 / 28

Page 7: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

The “PoSI” Solution to Selection: FWER Control

PoSI Procedure — general version:Define a universeM of models M you might ever consider/select:outcomes (Y ), regressors (X ), their transforms (f (X ),g(Y )), ...

Define the universe of all tests you might ever perform in thesemodels, typically for regression coeffs βj,M (j ’th coeff in model M).

Consider the minimum of the p-values for all these tests:Obtain its 0.05 quantile α0.05 for FWER adjustment.

Now freely examine your data and select models M ∈M,reconsider, re-select, re-reconsider, ... but compare all p-valuesagainst α0.05, not 0.05, for 0.05M-FWER control.

Cost-Benefit Analysis:Cost: Huge computation upfront — adjustment for millions of tests

Benefits: Solution to the circularity problem — select model M,don’t like it, select M ′, don’t like it, ... PoSI inference remains valid.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 7 / 28

Page 8: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Irreproducibility: Methodol. Factor 2 – Misspecification

Models are approximations, not generative truths.=⇒ Consequences!

What is the target βj,M of βj,M? Stay tuned.

Model bias interacts with regressor distributions to causemodel-trusting SEs to be off, sometimes too small by a factor of 2.

V [β] = E [V [β|X ]] + V [E [β|X ]]

Do not condition on the regressors; do not treat them as fixed!

Use model-robust standard errors, for example, fromthe x-y pairs or multiplier bootstraps, not the residual bootstrap!

Wanted: PoSI Protection under Misspecification!

Up next: PoSI under Misspecification and PoSI StatisticAndreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 8 / 28

Page 9: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Outline

1 Introduction

2 PoSI in High-dimensions under Misspecification

3 Three PoSI Confidence Regions

4 Numerical Examples

5 Summary

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 9 / 28

Page 10: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

For models M ⊆ {1,2, . . . ,p} and IID random vectors (Xi ,Yi). For anyfunction f , set

Pn[f (X ,Y )] =1n

n∑i=1

f (Xi ,Yi), and P[f (X ,Y )] = E[f (X1,Y1)].

SampleGram matrix:

Σn := Pn

[XX>

].

“Covariance” Vector:

Γn := Pn [XY ] .

Estimator:

βn,M := (Σn(M))−1Γn(M).

PopulationGram matrix:

Σ := P[XX>

].

“Covariance” Vector:

Γ := P [XY ] .

Target:

βM := (Σ(M))−1 Γ(M).

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 10 / 28

Page 11: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Uniform-in-submodel Result for OLS

If Zi := (Xi ,Yi) are sub-Gaussian, then the results of Kuchibhotla et al.(2018b) imply that for any 1 ≤ k ≤ p,

max|M|≤k

∥∥∥βn,M − βM

∥∥∥2

= Op

(√k log(ep/k)

n

),

and

max|M|≤k

∥∥∥∥∥√n(βn,M − βM

)− 1√

n

n∑i=1

ψM(Zi)

∥∥∥∥∥2

= Op

(k log(ep/k)√

n

),

whereψM(Zi) := (Σ(M))−1Xi(M)(Yi − X>i (M)βM).

Recall

Σ(M) = E[X1(M)X>1 (M)] and βM := (Σ(M))−1E[X1(M)Y1].

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 11 / 28

Page 12: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Implications for PoSI

These results imply that if k log(ep/k) = o(√

n), then as n→∞,simultaneously for all |M| ≤ k ,

√n(βn,M − βM

)≈ 1√

n

n∑i=1

ψM(Zi).

This implies one can apply bootstrap to estimate quantiles of the“max-|t|” statistic:

max-|t| := max|M|≤k , j∈M

∣∣∣∣∣√

n(βn,M(j)− βM(j))

σM(j)

∣∣∣∣∣ .The linear representation result holds also for functionallydependent and/or non-identically distributed observations. SeeKuchibhotla et al. (2018b).

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 12 / 28

Page 13: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Outline

1 Introduction

2 PoSI in High-dimensions under Misspecification

3 Three PoSI Confidence Regions

4 Numerical Examples

5 Summary

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 13 / 28

Page 14: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Is max-|t| the right statistic??

The “max-|t |” statistic was used for PoSI in Berk et al. (2013) andBachoc et al. (2016).

max-|t| is definitely not the only choice. So, can we do any better?

In regression analysis, there is a hierarchical structure:model M and then covariate j in model M.

Ignoring this structure leads to certain deficiencies of the “max-|t |”confidence regions.

To follow: Deficiencies of max-|t| andnew PoSI Regions

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 14 / 28

Page 15: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Deficiencies of max-|t| Regions: Part I

Define

TM := maxj∈M

∣∣∣∣∣∣√

n(βn,M(j)− βM(j)

)σM(j)

∣∣∣∣∣∣ and max-|t | := max|M|≤k

|TM |.

Suppose M ⊂ M ′ are two models. Then TM is usually smaller thanTM′ : under certain assumptions,

E [TM ] �√

log |M| and E [TM′ ] �√

log |M ′|.

So, the maximum in the max-|t| is usually attained at the largestmodel implying larger confidence regions for smaller models.

Smaller Models should have Smaller Confidence Regions

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 15 / 28

Page 16: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Deficiencies: Part I contd.

For any two (fixed) models M,M ′, as n→∞,

max

{√n|M|

∥∥∥βn,M − βM

∥∥∥2,

√n|M ′|

∥∥∥βn,M′ − βM′

∥∥∥2

}= Op (1) .

So, without model selection smaller models have smallerconfidence regions.

But the max-|t| confidence regions do NOT maintain this.

This is of importance especially if the total number of covariates islarger than the sample size.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 16 / 28

Page 17: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Deficiencies of max-|t| Regions: Part II

To understand the second major deficiency of max-|t| regions,consider the gram matrix

Σn :=

[Ip−1 c1p−1

c1>p−1 1

],

with c2 < 1/(p − 1) and where 1p−1 = (1,1, . . . ,1)>.

In this setting for most submodels, the covariates are uncorrelatedbut the full model is highly collinear for c2 ≈ 1/(p − 1).

It was shown in Berk et al. (2013) that max-|t| � √p. But if weignore the last covariate, then max-|t| �

√log p.

Collinearity in a model should not affect confidence regionsfor another model.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 17 / 28

Page 18: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

How to Remedy this: A Simplified Example

Suppose Wj ∼ N(µj ,1) for 1 ≤ j ≤ p. We want PoSI for µj for jchosen based on the sequence.If Wj are independent, then max-|t| confidence region for µj isessentially

{θ : |Wj − θ| ≤ (2 log p)1/2}.

Is this the best?? Consider the statistic

S? := max1≤j≤p

|Wj − θj |(2 log(j))1/2 ← index dependent scaling.

It is easy to prove that S? = Op(1) (even if p =∞). This statisticimplies the confidence region for µj :{

θ : |Wj − θ| ≤ C(2 log(j))1/2}

This is much less conservative if the chosen j is not too big.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 18 / 28

Page 19: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Simplified Example Contd.

Once again the confidence regions are

{θ : |Wj − θ| ≤ (2 log p)1/2}, (1)

{θ : |Wj − θ| ≤ C(2 log(j))1/2}. (2)

Confidence region (2) is uniformly better than (1) (rate-wise).Furthermore, both regions (1) and (2) are tight, i.e., there is a jsuch that

|Wj − µj | = max1≤j≤p

|Wj − µj |,

and there is also a j such that

|Wj − µj |

(2 log(j))1/2= max

1≤j≤p

|Wj − µj |(2 log(j))1/2 .

Moral of the story: layer-by-layer standardization helps.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 19 / 28

Page 20: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Three Confidence Regions

Recall the max-|t| for model M and standardized max-|t| as

TM := maxj∈M

∣∣∣∣∣∣√

n(βM(j)− βM(j)

)σM(j)

∣∣∣∣∣∣ , and T ?M :=

TM − E[TM ]√Var(TM)

.

Consider the following three max statistics:

T (1)k := max

|M|≤kTM ,

T (2)k := max

1≤s≤k

(max|M|=s T ?

M − Es

SDs

), where Es := E

[max|M|=s

T ?M

],

T (3)k := max

1≤s≤k

(max|M|≤s T ?

M − E?s

SD?s

), where E?

s := E[

max|M|≤s

T ?M

].

The quantities SDs and SD?s are defined similarly to Es and E?

s .

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 20 / 28

Page 21: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Three Confidence Regions Contd.

Define for θ ∈ R|M|,

TM(θ) := maxj∈M

∣∣∣√n(βM(j)− θ(j))/σn,M(j)∣∣∣ ,

and consider the confidence regions (rectangles) are given by

R(1)n,M :=

{θ : TM(θ) ≤ K (1)

α

},

R(2)n,M :=

{θ : TM(θ) ≤ E[TM ] +

√Var(TM)(Es + SDsK (2)

α )},

R(3)n,M :=

{θ : TM(θ) ≤ E[TM ] +

√Var(TM)(E?

s + SD?sK (3)

α )}.

Here K (j)α denote the quantiles of T (j)

k respectively for j = 1,2,3.

The quantiles K (j)α can be estimated using multiplier bootstrap

(where one replaces E[TM ],Var(TM),Es,SDs by their estimators).

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 21 / 28

Page 22: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Some Comments

The three confidence regions provide asymptotically validpost-selection inference.

The regions R(j)n,M , j = 2,3 provide model dependent scaling and

so give shorter confidence regions for smaller models.

Because of the model-dependent scaling for the last two, they areless conservative than the max-|t| confidence regions.

The three maximum-statistics listed here are not the only optionsand one can get very creative in designing others.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 22 / 28

Page 23: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Outline

1 Introduction

2 PoSI in High-dimensions under Misspecification

3 Three PoSI Confidence Regions

4 Numerical Examples

5 Summary

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 23 / 28

Page 24: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Boston Housing Data

The Boston housing dataset contains data on n = 506 median value ofa house along with 13 predictors.The confidence regions for model M ∈M(k) are given by

|TM(θ)| ≤

K (1)α ,

C(2)M := E[TM ] +

√Var(TM)(Es + SDsK (2)

α ),

C(3)M := E[TM ] +

√Var(TM)(E?

s + SD?sK (3)

α ).

To understand how small/wide the last two confidence regions are, wecompute:

Summary(

C(2)M

K (1)α

: M ∈M(k)

)and Summary

(C(3)

M

K (1)α

: M ∈M(k)

).

This tells for what proportion of models are the second and thirdregions shorter/wider and by how much?

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 24 / 28

Page 25: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Boston Housing Data Contd.

There are 14 predictors including the intercept. k ∈ {1, . . . ,14}represents the maximum model size allowed and j = 2,3 representsthe last two confidence regions.

Here we consider two cases k = 6 and k = 14 (NO file drawer prob.!!).

Table: Comparison of Constants in R(2)n,M and R(3)

n,M to max-|t| constant.

Quantiles→ Min. 5% 25% 50% Mean 75% 95% Max.k = 6 j = 2 0.702 0.978 1.037 1.060 1.052 1.077 1.098 1.140

j = 3 0.692 0.980 1.047 1.072 1.062 1.090 1.112 1.155k = 14 j = 2 0.718 0.996 1.044 1.065 1.060 1.083 1.105 1.148

j = 3 0.678 0.999 1.050 1.070 1.064 1.086 1.108 1.147

About 30% gain with about 15% loss over all models!For ≥ 90% of models, the confidence regions are wider.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 25 / 28

Page 26: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Outline

1 Introduction

2 PoSI in High-dimensions under Misspecification

3 Three PoSI Confidence Regions

4 Numerical Examples

5 Summary

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 26 / 28

Page 27: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

Conclusions

We have provided post-selection inference allowing for increasingnumber of models for linear regression.

Based on the Gaussian approximation results, we haveconstructed and implemented three different PoSI confidenceregions.

All three confidence regions are asymptotically tight. This impliesthat no one can uniformly dominate the other.

A generally interesting question: What kind of maximum statisticshould be to consider?

Efficient algorithms and detailed simulation studies are underdevelopment.

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 27 / 28

Page 28: Construction of PoSI Statistics - arun-kuchibhotla.github.io · Andreas Buja and Arun Kumar Kuchibhotla Department of Statistics University of Pennsylvania September 8, 2018 WHOA-PSI

References[1] Bachoc, F., Preinerstorfer, D., and Steinberger, L. (2016).

Uniformly valid confidence intervals post-model-selection.arxiv.org/abs/1611.01043.

[2] Banerjee, D., Kuchibhotla, A. K., and Mukherjee, S. (2018).Cramér-type large deviation and non-uniform central limit theorems in high dimensions.arxiv.org/abs/1806.06153.

[3] Belloni, A., Chernozhukov, V., Chetverikov, D., Hansen, C., Kato, K., et al. (2018).High-dimensional econometrics and regularized gmm.Technical report, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.

[4] Berk, R., Brown, L. D., Buja, A., Zhang, K., Zhao, L. (2013)Valid post-selection inference.Ann. Statist. 41, no. 2, 802–837.

[5] Kuchibhotla, Brown, Buja, Berk, Zhao, George (2018a)Valid Post-selection Inference in Assumption-lean Linear Regression.arxiv.org/abs/1806.04119.

[6] Kuchibhotla, Brown, Buja, Zhao, George (2018b)A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference.arxiv.org/abs/1802.05801.

[7] Kuchibhotla, Brown, Buja, Zhao, George (2018+)A Note on Post-selection Inference for M-estimators.in preparation.

Thank YouQuestions?

Andreas & Arun, Larry’s Group (UPenn) Construction of PoSI Statistics September 8, 2018 28 / 28