arun kumar kuchibhotla · arun kumar kuchibhotla university of pennsylvania larry brown andreas...

57
Arun Kumar Kuchibhotla University of Pennsylvania https://arun-kuchibhotla.github.io/

Upload: others

Post on 05-Jul-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Arun Kumar KuchibhotlaUniversity of Pennsylvania

https://arun-kuchibhotla.github.io/

Page 2: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Robust Statistics

Single Index Models

Post-selectionInference

Nonparametric Regression

MisspecificationAnalysis

2

Concentration Inequalities and

High-dimensional CLT

Overview of my research interests

Page 3: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Valid Post-selection InferenceWhy and How

Arun Kumar KuchibhotlaUniversity of Pennsylvania

Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George

Page 4: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Roadmap

Data Snooping: Effects and Examples

Formulation of the Problem

Solution for Covariate Selection

Example and Conclusions

4

Page 5: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

RoadmapData Snooping: Effects and Examples

● Effects illustrated with stepwise selection.● Data snooping in textbooks and practice.

Formulation of the Problem● The Problem & literature review for covariate selection.

Solution for Covariate Selection● Key contributions● Simulations & main components of the theory.

Example and Conclusions

● Real data example, Extensions & Summary

5

Page 6: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

RoadmapData Snooping: Effects and Examples

● Effects illustrated with stepwise selection.● Data snooping in textbooks and practice.

Formulation of the Problem

Solution for Covariate Selection

Example and Conclusions

6

Page 7: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

500 observations

most correlated with 95% CI for slope

Effects of Data Snooping: Stepwise Selection

Page 8: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Effects of Data Snooping: Stepwise Selection

500 observations

most correlated with 95% CI for slope

Page 9: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Effects of Data Snooping: Stepwise Selection

500 observations

most correlated with 95% CI for slope

Page 10: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Some Notes

❖ Unadjusted inference after data snooping can be (very) misleading.

❖ Data snooping contributes to the

➢ Inability to replicate conclusions in future studies.➢ 95% CIs should imply correct conclusions in 95% of

studies.

❖ More concerningly, common practice of data snooping is more informal and imprecise than the example shown.

10

Replicability Crisis

Page 11: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Case Study 1: Covariate Selection

11

BritishMedicalJournal

2005

Page 12: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

12

Case Study 1: Covariate SelectionBritishMedicalJournal

2005

Page 13: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

13

Case Study 2: Covariate Selection

Page 14: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Case Study 2: Covariate SelectionScientific Reports, 2019

“The MLR models were tested by sequential introduction of predictors and interaction terms.

ultimately, from the three best models with similar adjusted R squared values the simplest one was chosen.”

Scientific Reports, 2019

Page 15: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

★H & R (1978) Hedonic housing prices and the demand for clean air.

Case Study 3: Transformations

Harrison and Rubinfeld (1978)★ write

Page 16: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

RoadmapData Snooping: Effects and Examples

● Effects illustrated with stepwise selection.● Data snooping in textbooks and practice.

Formulation of the Problem● The Problem & literature review for covariate selection.

Solution for Covariate Selection

Example and Conclusions

16

Page 17: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

500 observations

most correlated with 95% CI for slope

Effects of Data Snooping: Stepwise Selection

Page 18: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

There are covariates and for each

Want a valid CI for the parameter/target :

irrespective of how is chosen based on data.

18

Post-selection Inference: Problem 1

Page 19: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

For each

Want a valid CI for a coordinate of :

irrespective of how with size and are chosen based on data.

19

Post-selection Inference: Problem 2

Page 20: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Allows arbitrary exploration in

❌ Cannot revise the model after using

❌ Invalidity when using multiple splits.

❌ Applicable only for independent data.

Variable selection,Transformations etc.

Finally, use once for inference.

20

Solution 0: Sample Splitting

Rinaldo et al. (2019) Bootstrapping and sample splitting for high-dimensional, assumption-lean inference, Annals of Statistics.

Page 21: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

For each

Want a valid CI for a coordinate of :

irrespective of how with size and are chosen based on data.

21

Recap: Post-selection Inference

Page 22: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Buehler and Feddersen (1963); Olshen (1973); Sen (1979);

Rencher and Pun (1980); Freedman (1983); Sen and Saleh

(1987); Dijkstra and Veldkamp (1988); Hurvich and Tsai (1990);

Potscher (1991); Pfeiffer, Redd and Carroll (2017).

Cox (1965); Kabaila (1998); Hjort and Claeskens (2003);

Claeskens and Carroll (2007); Berk et al. (2013); Lee, Sun, Sun

and Taylor (2016); Tibshirani, Taylor, Lockhart and Tibshirani

(2016); Bachoc, Preinerstorfer and Steinberger (2019); Rinaldo,

Wasserman, G’Sell and Lei (2019).

Literature Review

22

Page 23: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

RoadmapData Snooping: Effects and Examples

● Effects illustrated with stepwise selection.● Data snooping in textbooks and practice.

Formulation of the Problem● The Problem & literature review for covariate selection.

Solution for Covariate Selection● Key contributions● Simulations & main components of the theory.

Example and Conclusions

23

Page 24: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Simultaneous Inference FWER Control

Post-selectionInference

➢ Simultaneity implies valid CIs for arbitrary selection

➢ Simultaneity implies infinite revisions of a selection.

➢ Simultaneity also guarantees validity if multiple

models are reported. 24

A Guiding Principle

Page 25: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

➢ Choose any model and get a confidence region.➢ The choice can be revised as many times as desired.➢ Simultaneity also guarantees validity if multiple

models are reported.

Theorem: Simultaneous inference is necessary for valid Post-selection inference.

K et al. (2019) Valid Post-selection Inference in Model-free Linear Regression. Annals of Statistics (Forthcoming).

25

Post-selectionInference

A Key Result

Simultaneous Inference FWER Control

Page 26: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

The classical interval for ist-score

Solution 1: Uniform Adjustment

For simultaneity, inflate the confidence regions:

quantile of

Page 27: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

27

➢ in an assumption-lean setting,➢ for independent and weakly dependent obs.,➢ for and maximal model size ,➢ for fixed or random covariates,

can be estimated using bootstrap.

In the worst case,

We show

Our Contributions

Berk et al. (2013): Homoscedastic Gaussian response, fixed X.Bachoc et al (2019): assumption-lean but fixed X and fixed p.

Page 28: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Simulation Examples: Revisiting Stepwise Selection

Page 29: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

500 observations

most correlated with 95% CI for slope

Effects of Data Snooping: Stepwise Selection

Page 30: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Effects of Data Snooping: Stepwise Selection

500 observations

most correlated with 95% CI for slope

Page 31: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Effects of Data Snooping: Stepwise Selection

500 observations

most correlated with 95% CI for slope

Page 32: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Effects of Data Snooping: Stepwise Selection

500 observations

most correlated with 95% CI for slope

Page 33: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Key Steps in the Proof

Page 34: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

For independent, sub-Gaussian data

K et al. (2018) A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference. arXiv:1802.05801

➢ Doesn’t require any parametric model assumptions.

➢ A finite sample result. Allows for diverging

➢ Extends beyond independent & sub-Gaussian data.

34

Uniform Linear Representation

Page 35: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

implies

Close in Probability by triangle ineq.

35

Three Line Proof

Page 36: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Close inDistribution by HDCLT

K et al. (2018) High-dimensional CLT: Improvements, Non-uniform Extensions and Large Deviations. arXiv:1806.06153

Close in Probability by triangle ineq.

36

Three Line Proof

implies

Page 37: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Close in Probability by triangle ineq.

Close inDistribution by HDCLT

Close inDistributionby triangle ineq.

37

Three Line Proof

Justifies bootstrap!

implies

Page 38: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

RoadmapData Snooping: Effects and Examples

● Effects illustrated with stepwise selection.● Data snooping in textbooks and practice.

Formulation of the Problem● The Problem & literature review for covariate selection.

Solution for Covariate Selection● Key contributions● Simulations & main components of the theory.

Example and Conclusions● Real Data Example, Extensions & Summary.

38

Page 39: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Telomere Length Analysis

39

Page 40: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Case Study 2: Covariate SelectionScientific Reports, 2019

“The MLR models were tested by sequential introduction of predictors and interaction terms.

ultimately, from the three best models with similar adjusted R squared values the simplest one was chosen.”

40

Scientific Reports, 2019

Page 41: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

TL inheritance patterns based on 246 families.

41

Telomere Length Analysis

❖ Dependent Variable: MTL (Mean telomere length)

❖ Child Variables:

➢ Sex ➢ Age

❖ Parental Variables:

➢ mMTL (mother MTL) ➢ fMTL (father MTL)➢ MAC (mother’s age at conception)➢ PAC (father’s age at conception)

(Additionally, 15 interaction variables were considered.)

Page 42: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

42

Adjusted Inference: Telomere Length Analysis

Covariate Unadjusted Adjusted

AGE ✔ ❌

mMTL ✔ ✔

fMTL ✔ ✔

MAC ✔ ❌

PAC ❌ ❌

✔ : Significant at 5% level❌ : Insignificant at 5% level

Page 43: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Summary and Conclusions❖ Data snooping contributes to replicability crisis.❖ Inference is possible after data snooping.

❖ The framework allows for➢ Misspecified models; Random covariates;➢ Dependent data; High-dimensional features;➢ Variable Transformations;➢ M-estimators: logistic/Poisson/Quantile/Cox.

43

Classical Framework New FrameworkFix the test & model Fix a universe

Collect the data Collect the data

Page 44: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Thank you for your attention44

Kuchibhotla A., Brown L., Buja A., Cai J., George E., Zhao L. (2019) Valid Post-selection Inference in Model-free Linear Regression. Annals of Statistics (Forthcoming).

Kuchibhotla A., Brown L., Buja A., George E., Zhao L. (2018) A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference. arXiv:1802.05801

Kuchibhotla A. (2018) Deterministic Inequalities for Smooth M-estimators. arXiv:1809.05172

Kuchibhotla A., Mukherjee S., and Banerjee D. (2018) High-dimensional CLT: Improvements, Non-uniform Extensions and Large Deviations. arXiv:1806.06153

Kuchibhotla A., Brown L., Buja A., Cai J. (2019) All of Linear Regression. arXiv:1910.06386

References

Page 45: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

For each

Want a valid CI for a coordinate of

irrespective of how is chosen based on data.

45

Post-selection for Transformations

Page 46: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Inflate the classical intervals,

with being the quantile of

Bootstrap applies with validity for “nice” function classes , for example, Box-Cox family.

46

Solution for Transformations

Page 47: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Transformations: Boston Housing DataThis dataset has 506 census tracts with 13 features.

❖ Dependent Variable: MV, Median value of house.

❖ Covariate of Interest: NOX, Nitrogen Oxide Conc.

❖ Structural Variables: RM (No. of rms), AGE (% of homes bfr 1940).

❖ Neighborhood Variables: CRIM (Crime rate), ZN (% of res. land

zoned for lots > 25K ft2), INDUS (% non-retail business acres per twn),

RIVER (Charles river dummy), TAX (Property tax rate),

PTRATIO (Pupil-teacher ratio), B (Racial diversity),

LSTAT (% of lwr socio-econ. status of population).

❖ Accessibility Variables: DIS (Distance to Employment Ctr.),

RAD (Distance to Radial Highway). 47

Page 48: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Harrison and Rubinfeld (1978)★ write

★H & R (1978) Hedonic housing prices and the demand for clean air. 48

Transformations: Boston Housing Data

Page 49: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

49

Covariate Unadjusted Adjusted Covariate Unadjusted Adjusted

NOX2 ✔ ✔ TAX ✔ ✔

RM ✔ ✔ PTRATIO ✔ ✔

AGE ❌ ❌ B ✔ ❌

CRIM ✔ ✔ LSTAT ✔ ✔

ZN ❌ ❌ DIS ✔ ✔

INDUS ❌ ❌ RAD ✔ ✔

RIVER ✔ ❌

Adjusted Inference: Boston Housing

✔ : Significant at 5% level❌ : Insignificant at 5% level

Page 50: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

50

Implications of PoSI for Applications

Similar to Boston housing data and TL data,

How do conclusions in applied data analysis change when exploration is

accounted for?

Joint work with Junhui Cai and Linda Zhao.

Page 51: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

51

High-dimensional CLT

Anderson, Hall and Titterington (1998, JSPI):

Joint work with Somabha Mukherjee.

Let be the class of all rectangles in

Page 52: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

52

Maximal Inequalities

➢ High-dimensional rates are sensitive to tails.

➢ What is the order of

subject to

Joint work with Somabha Mukherjee and Sagnik Nandy.

Page 53: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

53

Multiple Testing under Dependence

➢ Multiple testing often requires independence.

➢ FWER is possible under arbitrary dependence.

➢ What about FDR control?

➢ How does BH procedure behave?

Page 54: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Arun KumarKuchibhotla

University of Pennsylvania

Webpage: https://arun-kuchibhotla.github.io/Email: [email protected]

Page 55: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

55

Case Study 3: Model Building

2008

➢ 76 Oregon homes and 12 features.

➢ Try a linear model with 12 predictors “as is”.

➢ Residuals imply non-linearity.

➢ Bath and Bed also have high p-values, so add an

interaction to the model.

➢ Price is skewed suggesting a log-transformation.

Journal of Statistics Education

Page 56: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

Case Study 3: Transformations

★Stine and Foster, Statistics for Business: Decision Making and Analysis.

In the context of curve fitting to bivariate data, Stine and Foster (2014)★ on page 515, write

56

“Picking a transformation requires practice, and you may need to try several to find one that is interpretable and captures the pattern in the data.”

Page 57: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation

➢ The solution applies to most other estimation problems.

➢ Examples include logistic/Poisson regression, quantile regression and Cox regression.

➢ Solution also applies to transformation of variables.

K (2018) Deterministic Inequalities for Smooth M-estimators, arXiv:1809.0517257

Extensions