
Arun Kumar Kuchibhotla
University of Pennsylvania
https://arun-kuchibhotla.github.io/

Overview of my research interests

● Robust Statistics
● Single Index Models
● Post-selection Inference
● Nonparametric Regression
● Misspecification Analysis
● Concentration Inequalities and High-dimensional CLT

Valid Post-selection Inference: Why and How

Arun Kumar Kuchibhotla
University of Pennsylvania

Larry Brown, Andreas Buja, Junhui Cai, Linda Zhao, Ed George

Roadmap

Data Snooping: Effects and Examples
Formulation of the Problem
Solution for Covariate Selection
Example and Conclusions

Roadmap

Data Snooping: Effects and Examples
● Effects illustrated with stepwise selection.
● Data snooping in textbooks and practice.

Formulation of the Problem
● The problem & literature review for covariate selection.

Solution for Covariate Selection
● Key contributions.
● Simulations & main components of the theory.

Example and Conclusions
● Real data example, extensions & summary.


Effects of Data Snooping: Stepwise Selection

[Figures: simulation with 500 observations; the covariate most correlated with the response is selected, and the usual 95% CI for its slope is shown. A code sketch of the effect follows.]
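The following is a minimal simulation sketch (mine, not the slides' code; all names are illustrative) of the effect the figures illustrate: under a null model, select the covariate most correlated with the response, then compute the usual 95% CI for its slope and record how often it covers the true slope of zero.

```python
import numpy as np

# Null model: y is independent of all covariates, so every true slope is 0.
rng = np.random.default_rng(0)
n, p, reps = 500, 10, 2000
cover = 0
for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)
    j = np.argmax(np.abs(X.T @ y))     # snooping: pick the most correlated covariate
    x = X[:, j]
    bhat = x @ y / (x @ x)             # simple-regression slope (no intercept; means ~ 0)
    resid = y - bhat * x
    se = np.sqrt(resid @ resid / (n - 2)) / np.sqrt(x @ x)
    cover += abs(bhat) <= 1.96 * se    # does the 95% CI contain the true slope 0?
print(f"coverage of the unadjusted 95% CI: {cover / reps:.3f}")
```

With p = 10 candidate covariates, the reported coverage falls well below 0.95, because the selected |t|-statistic behaves like the maximum of ten t-statistics rather than a single one.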

Some Notes

❖ Unadjusted inference after data snooping can be (very) misleading.
❖ Data snooping contributes to the
➢ inability to replicate conclusions in future studies.
➢ 95% CIs should imply correct conclusions in 95% of studies.
❖ More concerning, the common practice of data snooping is more informal and imprecise than the example shown.

Replicability Crisis

Case Study 1: Covariate Selection

British Medical Journal, 2005

[The excerpt from the article was shown as an image.]

Case Study 2: Covariate Selection

Scientific Reports, 2019

"The MLR models were tested by sequential introduction of predictors and interaction terms. ... ultimately, from the three best models with similar adjusted R squared values the simplest one was chosen."

Case Study 3: Transformations

Harrison and Rubinfeld (1978)★ write: [quotation shown as an image]

★ H & R (1978) Hedonic housing prices and the demand for clean air.



Post-selection Inference: Problem 1

There are covariates $X_1, \dots, X_p$ and, for each $j \in \{1, \dots, p\}$, a target $\beta_j$: the population slope from the simple regression of $Y$ on $X_j$.

Want a valid CI $\widehat{\mathrm{CI}}_j$ for the parameter/target $\beta_j$:
$$\mathbb{P}\big(\beta_{\hat{j}} \in \widehat{\mathrm{CI}}_{\hat{j}}\big) \ge 1 - \alpha,$$
irrespective of how $\hat{j}$ is chosen based on the data.

Post-selection Inference: Problem 2

For each model $M \subseteq \{1, \dots, p\}$ with $|M| \le k$, define the projection target $\beta_M$.

Want a valid CI for a coordinate of $\beta_M$:
$$\mathbb{P}\big(\beta_{\hat{M}, j} \in \widehat{\mathrm{CI}}_{\hat{M}, j}\big) \ge 1 - \alpha,$$
irrespective of how $\hat{M}$ with size $|\hat{M}| \le k$ and $j \in \hat{M}$ are chosen based on the data.

Solution 0: Sample Splitting

Split the data into two parts. The first part allows arbitrary exploration: variable selection, transformations, etc. Finally, use the second part once for inference (a code sketch follows the reference below).

❌ Cannot revise the model after using the inference part.
❌ Invalidity when using multiple splits.
❌ Applicable only for independent data.

Rinaldo et al. (2019) Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. Annals of Statistics.
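Here is a minimal sketch of sample splitting (assuming i.i.d. rows; the selection rule is just an example): snoop freely on the first half, then run inference exactly once on the held-out half.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

half = n // 2
X1, y1 = X[:half], y[:half]            # exploration half: any selection rule is allowed
X2, y2 = X[half:], y[half:]            # inference half: must be used exactly once

j = int(np.argmax(np.abs(X1.T @ y1)))  # snoop on half 1: most correlated covariate
fit = sm.OLS(y2, sm.add_constant(X2[:, j])).fit()
print(fit.conf_int()[1])               # valid 95% CI for the selected slope
```

The CI is valid because the selected index j is independent of the inference half; the drawbacks listed above (no revisions, no multiple splits, independence) are exactly the price of that independence.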

Recap: Post-selection Inference

Want a valid CI for a coordinate of $\beta_{\hat{M}}$, irrespective of how $\hat{M}$ (with $|\hat{M}| \le k$) and the coordinate are chosen based on the data.

Literature Review

Buehler and Feddersen (1963); Olshen (1973); Sen (1979); Rencher and Pun (1980); Freedman (1983); Sen and Saleh (1987); Dijkstra and Veldkamp (1988); Hurvich and Tsai (1990); Pötscher (1991); Pfeiffer, Redd and Carroll (2017).

Cox (1965); Kabaila (1998); Hjort and Claeskens (2003); Claeskens and Carroll (2007); Berk et al. (2013); Lee, Sun, Sun and Taylor (2016); Tibshirani, Taylor, Lockhart and Tibshirani (2016); Bachoc, Preinerstorfer and Steinberger (2019); Rinaldo, Wasserman, G'Sell and Lei (2019).


A Guiding Principle

[Diagram relating Simultaneous Inference, Post-selection Inference, and FWER Control.]

➢ Simultaneity implies valid CIs for arbitrary selection.
➢ Simultaneity implies infinite revisions of a selection.
➢ Simultaneity also guarantees validity if multiple models are reported.

A Key Result

➢ Choose any model and get a confidence region.
➢ The choice can be revised as many times as desired.
➢ Simultaneity also guarantees validity if multiple models are reported.

Theorem: Simultaneous inference is necessary for valid post-selection inference.

K et al. (2019) Valid Post-selection Inference in Model-free Linear Regression. Annals of Statistics (Forthcoming).

Solution 1: Uniform Adjustment

The classical interval for $\beta_{M,j}$ uses the t-score:
$$\hat{\beta}_{M,j} \pm t_{n-|M|,\,1-\alpha/2}\,\widehat{\mathrm{se}}(\hat{\beta}_{M,j}).$$

For simultaneity, inflate the confidence regions:
$$\hat{\beta}_{M,j} \pm K_\alpha\,\widehat{\mathrm{se}}(\hat{\beta}_{M,j}),$$
where $K_\alpha$ is the $1-\alpha$ quantile of
$$\max_{|M| \le k}\ \max_{j \in M}\ \frac{\big|\hat{\beta}_{M,j} - \beta_{M,j}\big|}{\widehat{\mathrm{se}}(\hat{\beta}_{M,j})}.$$

Our Contributions

We show $K_\alpha = O\big(\sqrt{k \log(ep/k)}\big)$, whereas in the worst case $K_\alpha$ can grow like $\sqrt{p}$, and $K_\alpha$ can be estimated using the bootstrap (see the sketch below):

➢ in an assumption-lean setting,
➢ for independent and weakly dependent observations,
➢ for dimension $p$ and maximal model size $k$ diverging with $n$,
➢ for fixed or random covariates.

Berk et al. (2013): homoscedastic Gaussian response, fixed X.
Bachoc et al. (2019): assumption-lean, but fixed X and fixed p.
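A minimal sketch of the bootstrap estimate of $K_\alpha$ (my simplification, not the authors' implementation: the "universe" here is just the p simple-regression slopes, and a pairs bootstrap approximates the max-|t| quantile).

```python
import numpy as np

def tstats(X, y):
    """Signed t-statistic of the slope in each single-covariate regression."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        x = X[:, j]
        b = x @ y / (x @ x)
        r = y - b * x
        out[j] = b / (np.sqrt(r @ r / (n - 2)) / np.sqrt(x @ x))
    return out

rng = np.random.default_rng(2)
n, p, B = 500, 10, 1000
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

t0 = tstats(X, y)
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)                    # resample rows (pairs bootstrap)
    boot[b] = np.max(np.abs(tstats(X[idx], y[idx]) - t0))
K_alpha = np.quantile(boot, 0.95)                  # simultaneous cutoff
print(K_alpha)  # use bhat_j +/- K_alpha * se_j in place of +/- 1.96 * se_j
```

Because K_alpha exceeds the single-model t-score, every interval in the universe widens; that widening is exactly what buys validity under arbitrary selection.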

Simulation Examples: Revisiting Stepwise Selection

[Figures: the stepwise-selection simulation revisited with the proposed adjustment (500 observations; the covariate most correlated with the response; 95% CI for its slope).]

Key Steps in the Proof

For independent, sub-Gaussian data, a uniform-in-model linear representation of the OLS estimators holds:

➢ Doesn't require any parametric model assumptions.
➢ A finite-sample result; allows for diverging dimension.
➢ Extends beyond independent & sub-Gaussian data.

K et al. (2018) A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference. arXiv:1802.05801

Three Line Proof

➢ Step 1: The uniform linear representation implies the estimators are close in probability to a linear term, by the triangle inequality.
➢ Step 2: The linear term is close in distribution to a Gaussian limit, by the high-dimensional CLT (HDCLT).
➢ Step 3: Hence the estimators are close in distribution to the same limit, again by the triangle inequality. Justifies bootstrap! (See the sketch below.)

K et al. (2018) High-dimensional CLT: Improvements, Non-uniform Extensions and Large Deviations. arXiv:1806.06153
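The slides show the three steps as displayed formulas; here is a hedged LaTeX reconstruction (the notation $\psi_i(M)$ for the influence term is mine, not read off the slides):

```latex
% \hat\beta_M: OLS in model M;  \beta_M: its projection target;
% \psi_i(M) := \Sigma_M^{-1} X_{i,M}\,(Y_i - X_{i,M}^\top \beta_M): influence term.
\begin{align*}
\text{(1)}\quad & \max_{|M|\le k}\Big\|\sqrt{n}\,(\hat\beta_M-\beta_M)
   - \tfrac{1}{\sqrt{n}}\textstyle\sum_{i=1}^n \psi_i(M)\Big\| = o_p(1)
   &&\text{(uniform linear representation)}\\
\text{(2)}\quad & \tfrac{1}{\sqrt{n}}\textstyle\sum_{i=1}^n \psi_i(M)
   \;\approx_d\; \text{Gaussian, uniformly over models and rectangles}
   &&\text{(HDCLT)}\\
\text{(3)}\quad & \sqrt{n}\,(\hat\beta_M-\beta_M)
   \;\approx_d\; \text{the same Gaussian limit}
   &&\text{(triangle inequality)}
\end{align*}
% Step (3) is what licenses estimating the max-|t| quantile by the bootstrap.
```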


Telomere Length Analysis


Recall Case Study 2: Covariate Selection (Scientific Reports, 2019), where "from the three best models with similar adjusted R squared values the simplest one was chosen."

TL (telomere length) inheritance patterns based on 246 families.

Telomere Length Analysis

❖ Dependent Variable: MTL (mean telomere length)
❖ Child Variables:
➢ Sex
➢ Age
❖ Parental Variables:
➢ mMTL (mother's MTL)
➢ fMTL (father's MTL)
➢ MAC (mother's age at conception)
➢ PAC (father's age at conception)

(Additionally, 15 interaction variables were considered.)

Adjusted Inference: Telomere Length Analysis

Covariate   Unadjusted   Adjusted
AGE         ✔            ❌
mMTL        ✔            ✔
fMTL        ✔            ✔
MAC         ✔            ❌
PAC         ❌           ❌

✔ : significant at 5% level
❌ : insignificant at 5% level

Summary and Conclusions

❖ Data snooping contributes to the replicability crisis.
❖ Inference is possible after data snooping.
❖ The framework allows for
➢ misspecified models; random covariates;
➢ dependent data; high-dimensional features;
➢ variable transformations;
➢ M-estimators: logistic/Poisson/quantile/Cox.

Classical Framework      New Framework
Fix the test & model     Fix a universe
Collect the data         Collect the data

Thank you for your attention

References

Kuchibhotla A., Brown L., Buja A., Cai J., George E., Zhao L. (2019). Valid Post-selection Inference in Model-free Linear Regression. Annals of Statistics (forthcoming).

Kuchibhotla A., Brown L., Buja A., George E., Zhao L. (2018). A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference. arXiv:1802.05801.

Kuchibhotla A. (2018). Deterministic Inequalities for Smooth M-estimators. arXiv:1809.05172.

Kuchibhotla A., Mukherjee S., Banerjee D. (2018). High-dimensional CLT: Improvements, Non-uniform Extensions and Large Deviations. arXiv:1806.06153.

Kuchibhotla A., Brown L., Buja A., Cai J. (2019). All of Linear Regression. arXiv:1910.06386.

Post-selection for Transformations

For each transformation $f$ in a class $\mathcal{F}$, define the projection target $\beta_f$.

Want a valid CI for a coordinate of $\beta_{\hat{f}}$, irrespective of how $\hat{f} \in \mathcal{F}$ is chosen based on the data.

Solution for Transformations

Inflate the classical intervals,
$$\hat{\beta}_{f,j} \pm K_\alpha\,\widehat{\mathrm{se}}(\hat{\beta}_{f,j}),$$
with $K_\alpha$ being the $1-\alpha$ quantile of
$$\sup_{f \in \mathcal{F}}\ \max_{j}\ \frac{\big|\hat{\beta}_{f,j} - \beta_{f,j}\big|}{\widehat{\mathrm{se}}(\hat{\beta}_{f,j})}.$$

The bootstrap applies with validity for "nice" function classes $\mathcal{F}$, for example, the Box-Cox family (see the sketch below).
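A minimal sketch for the Box-Cox family (my illustration, with synthetic data standing in for a positive covariate such as NOX): bootstrap the maximal |t|-statistic over a grid of Box-Cox exponents to obtain the inflated cutoff.

```python
import numpy as np

def boxcox(x, lam):
    """Box-Cox transformation; x must be positive."""
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def slope_t(x, y):
    """Signed t-statistic of the slope of y on one (transformed) covariate."""
    n = len(y)
    xc, yc = x - x.mean(), y - y.mean()
    b = xc @ yc / (xc @ xc)
    r = yc - b * xc
    return b / (np.sqrt(r @ r / (n - 2)) / np.sqrt(xc @ xc))

rng = np.random.default_rng(3)
n, B = 506, 1000
x = rng.uniform(0.4, 0.9, n)                  # stand-in for a positive covariate
y = rng.standard_normal(n)
lams = np.linspace(-2.0, 2.0, 17)             # grid over the Box-Cox family

t0 = np.array([slope_t(boxcox(x, l), y) for l in lams])
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)               # pairs bootstrap
    tb = np.array([slope_t(boxcox(x[idx], l), y[idx]) for l in lams])
    boot[b] = np.max(np.abs(tb - t0))
print(np.quantile(boot, 0.95))                # use in place of the 1.96 cutoff
```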

Transformations: Boston Housing Data

This dataset has 506 census tracts with 13 features.

❖ Dependent Variable: MV, median value of homes.
❖ Covariate of Interest: NOX, nitrogen oxide concentration.
❖ Structural Variables: RM (number of rooms), AGE (% of homes built before 1940).
❖ Neighborhood Variables: CRIM (crime rate), ZN (% of residential land zoned for lots > 25,000 sq. ft.), INDUS (% of non-retail business acres per town), RIVER (Charles River dummy), TAX (property tax rate), PTRATIO (pupil-teacher ratio), B (racial diversity), LSTAT (% of lower socio-economic status population).
❖ Accessibility Variables: DIS (distance to employment centers), RAD (distance to radial highways).

Harrison and Rubinfeld (1978)★ write: [quotation shown as an image]

★ H & R (1978) Hedonic housing prices and the demand for clean air.

Transformations: Boston Housing Data


Adjusted Inference: Boston Housing

Covariate   Unadjusted   Adjusted      Covariate   Unadjusted   Adjusted
NOX2        ✔            ✔             TAX         ✔            ✔
RM          ✔            ✔             PTRATIO     ✔            ✔
AGE         ❌           ❌            B           ✔            ❌
CRIM        ✔            ✔             LSTAT       ✔            ✔
ZN          ❌           ❌            DIS         ✔            ✔
INDUS       ❌           ❌            RAD         ✔            ✔
RIVER       ✔            ❌

✔ : significant at 5% level
❌ : insignificant at 5% level

Implications of PoSI for Applications

Similar to the Boston housing data and the TL data: how do conclusions in applied data analysis change when exploration is accounted for?

Joint work with Junhui Cai and Linda Zhao.

High-dimensional CLT

Anderson, Hall and Titterington (1998, JSPI): [displayed question shown as an image]

Let $\mathcal{R}$ be the class of all rectangles in $\mathbb{R}^p$ (a generic statement is sketched below).

Joint work with Somabha Mukherjee.
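For context, a hedged generic form of the high-dimensional CLT over rectangles (the rate condition quoted in the comment is from the CLT literature, e.g. Chernozhukov, Chetverikov and Kato, not from the slide):

```latex
% For i.i.d. mean-zero X_1,\dots,X_n in R^p with suitable tails,
\sup_{R \in \mathcal{R}}
  \left|\,\mathbb{P}\Big(\tfrac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i \in R\Big)
        - \mathbb{P}(Z \in R)\,\right| \longrightarrow 0,
\qquad Z \sim N\big(0, \operatorname{Var}(X_1)\big),
% which can hold even with p >> n, e.g. when log p = o(n^{1/7}).
```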

Maximal Inequalities

➢ High-dimensional rates are sensitive to tails.
➢ What is the order of $\mathbb{E}\big[\max_{1 \le i \le n} X_i\big]$, subject to moment constraints on the $X_i$? (Standard benchmarks are sketched below.)

Joint work with Somabha Mukherjee and Sagnik Nandy.
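Two textbook benchmarks make the tail-sensitivity concrete (standard facts, not the slide's display):

```latex
% For mean-zero X_1, \dots, X_n:
\mathbb{E}\Big[\max_{1 \le i \le n} X_i\Big] \;\lesssim\;
\begin{cases}
\sigma\sqrt{2\log n}, & \text{if each } X_i \text{ is } \sigma\text{-sub-Gaussian},\\[2pt]
n^{1/q}\,\max_i \big(\mathbb{E}|X_i|^q\big)^{1/q}, & \text{if only } q\text{-th moments are bounded}.
\end{cases}
% Both orders are attained, so the maximum's growth depends on the tails.
```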

Multiple Testing under Dependence

➢ Multiple testing often requires independence.
➢ FWER control is possible under arbitrary dependence.
➢ What about FDR control?
➢ How does the BH procedure behave? (See the sketch below.)
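As a concrete reference point (standard procedures, not the talk's new results): Holm controls FWER and Benjamini-Yekutieli controls FDR under arbitrary dependence, while BH's FDR guarantee is proved under independence or positive dependence.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(4)
pvals = rng.uniform(size=20)   # stand-in p-values; real ones may be dependent

for method in ("holm", "fdr_bh", "fdr_by"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method}: {reject.sum()} rejections")
```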

Arun Kumar Kuchibhotla
University of Pennsylvania

Webpage: https://arun-kuchibhotla.github.io/
Email: arunku@upenn.edu

Case Study 3: Model Building

Journal of Statistics Education, 2008

➢ 76 Oregon homes and 12 features.
➢ Try a linear model with 12 predictors "as is".
➢ Residuals imply non-linearity.
➢ Bath and Bed also have high p-values, so add an interaction to the model.
➢ Price is skewed, suggesting a log-transformation.

Case Study 3: Transformations

In the context of curve fitting to bivariate data, Stine and Foster (2014)★, on page 515, write:

"Picking a transformation requires practice, and you may need to try several to find one that is interpretable and captures the pattern in the data."

★ Stine and Foster, Statistics for Business: Decision Making and Analysis.

Extensions

➢ The solution applies to most other estimation problems.
➢ Examples include logistic/Poisson regression, quantile regression, and Cox regression (a sketch of the logistic case follows).
➢ The solution also applies to transformations of variables.

K (2018) Deterministic Inequalities for Smooth M-estimators. arXiv:1809.05172
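A minimal sketch of the same max-|t| bootstrap idea for logistic regression (my illustration of the M-estimator extension, not the authors' implementation):

```python
import numpy as np
import statsmodels.api as sm

def zstats(X, y):
    """z-statistics of all slope coefficients in a logistic fit with intercept."""
    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    return (fit.params / fit.bse)[1:]          # drop the intercept

rng = np.random.default_rng(5)
n, p, B = 500, 5, 200
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n)                      # null response, independent of X

t0 = zstats(X, y)
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)                # pairs bootstrap
    boot[b] = np.max(np.abs(zstats(X[idx], y[idx]) - t0))
print(np.quantile(boot, 0.95))                 # simultaneous cutoff across coordinates
```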
