What Do You Do With Missing Data? E Pluribus Unum (one out of many): A Comparison of Single Imputation Methods

25th Annual CTS Transportation Conference, May 22, 2014, St. Paul, MN. Yiwen Zhang; Jon Roesler, MS; Anna Gaichas, MS; Mark Kinde, MPH; Minnesota Department of Health.


Page 1: What Do You Do With Missing Data? - UMN CTSSingle imputed public use datasets (2006-2013). 15 2006-2013 will be available October 2014. Please contact Jon Roesler anytime by: jon.roesler@state.mn.us

25th Annual CTS Transportation Conference

May 22, 2014 / St. Paul, MN

Yiwen Zhang Jon Roesler, MS / Anna Gaichas, MS / Mark Kinde, MPH

Minnesota Department of Health

What Do You Do With Missing Data?

E Pluribus Unum (one out of many)

A Comparison of Single Imputation Methods


Background
Methods
Results
Discussion
Conclusions


Crash Outcome Data Evaluation System (CODES)
National Transportation Safety Board, June 2013


[Figure: probabilistic linkage (Strategicmatching.com) of crash records (n=183,689; ~15,000 taken to hospital) and hospital records (n=131,959; ~35,000 MV traffic) into CODES (23,569).]


Dataset Creation
CODES software (LinkSolv: StrategicMatching.com)
2009 MN CODES linked dataset (Anna Gaichas)

Ways to deal with the enigma of missing data*

3 Primary Strategies:
Complete Case Analysis
Multiple Imputation
**Single Imputation**

Making up the numbers

*It is a riddle, wrapped in a mystery, inside an enigma. – Winston Churchill


Markov Chain Monte Carlo
Propensity Score
Regular Regression
Maximum Likelihood
Predictive Mean Method
Stochastic Regression
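To make the difference between two of the listed methods concrete, here is a minimal sketch of regular versus stochastic regression imputation with one predictor. It is not the authors' implementation: the function name `regression_impute`, the single-predictor setup, and the synthetic data are assumptions made for illustration. Regular regression places every imputed value exactly on the fitted line; stochastic regression adds a random residual draw, which restores some of the lost variability.

```python
import numpy as np


def regression_impute(x, y, stochastic=False, rng=None):
    """Fill NaNs in y using a straight-line fit of y on x (observed cases only)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)

    # Least-squares line through the complete cases.
    slope, intercept = np.polyfit(x[observed], y[observed], 1)
    predicted = intercept + slope * x[~observed]

    if stochastic:
        if rng is None:
            rng = np.random.default_rng()
        # Add a normal draw scaled by the residual spread so that imputed
        # values scatter around the line instead of sitting exactly on it.
        residuals = y[observed] - (intercept + slope * x[observed])
        sigma = residuals.std(ddof=2)  # two fitted parameters
        predicted = predicted + rng.normal(0.0, sigma, size=predicted.shape)

    filled = y.copy()
    filled[~observed] = predicted
    return filled


# Synthetic data (not CODES data): y depends on x, and 30% of y is missing.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)
y_missing = y.copy()
y_missing[:150] = np.nan

regular_fill = regression_impute(x, y_missing)
stochastic_fill = regression_impute(
    x, y_missing, stochastic=True, rng=np.random.default_rng(1)
)
print("variance, regular regression:   ", regular_fill.var())
print("variance, stochastic regression:", stochastic_fill.var())
```

Observed values are left untouched in both cases; only the treatment of the missing entries differs.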


Multiple Imputation (IVEware 0.2)

Multivariate sequential regression: works by fitting a sequence of regression models. For example, given a variable type, a regression model is chosen:
continuous variable: regular linear regression
binary variable: logistic regression
count variable: Poisson regression
categorical variable (i.e., with >2 levels): polytomous regression
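The type-to-model rule above can be sketched as a simple lookup. This is only an illustration of the selection step, not IVEware's interface: `VarType`, `MODEL_FOR_TYPE`, `plan_models`, and the typing of the example variables are all hypothetical.

```python
from enum import Enum


class VarType(Enum):
    CONTINUOUS = "continuous"
    BINARY = "binary"
    COUNT = "count"
    CATEGORICAL = "categorical"  # more than 2 levels


# Variable type -> regression model, as listed on the slide.
MODEL_FOR_TYPE = {
    VarType.CONTINUOUS: "linear regression",
    VarType.BINARY: "logistic regression",
    VarType.COUNT: "Poisson regression",
    VarType.CATEGORICAL: "polytomous regression",
}


def plan_models(variable_types):
    """Return (variable, model) pairs in the order the variables are given."""
    return [(name, MODEL_FOR_TYPE[vtype]) for name, vtype in variable_types.items()]


# Hypothetical typing of a few variables, for illustration only.
example_types = {
    "speed_log": VarType.CONTINUOUS,
    "eject": VarType.BINARY,
    "weather": VarType.CATEGORICAL,
}
for name, model in plan_models(example_types):
    print(f"{name}: {model}")
```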


Table 1: Standard errors of the 6 single imputation methods plus the multiple imputation (larger is better!)

Variable     MCMC   PS     RR     ML     PMM    SR     MI
speed(log)   27.4   27.4   26.5   27.6   27.7   35.4   47.5
weather       0.8    0.9    1.0    0.8    1.3    2.1    2.4
light         1.0    1.0    0.6    0.9    1.9    3.5    3.8
diagram       0.3    0.3    0.3    0.3    0.3    0.3    0.4
event1        0.5    0.5    0.4    0.5    0.5    0.5    1.2
event2        0.3    0.3    0.3    0.3    0.3    0.6    1.1
eject         0.3    0.3    0.3    0.3    0.4    0.6    0.8
injsev        6.8    6.8    6.8    6.8    6.8    7.7    9.8
age          12.9   12.9   12.9   12.9   12.9   24.0   28.3

MCMC = Markov Chain Monte Carlo; PS = Propensity Score; RR = Regular Regression; ML = Maximum Likelihood; PMM = Predictive Mean Method; SR = Stochastic Regression; MI = Multiple Imputation.
Note: multiplying the values by 10^-3 gives the standard errors.
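The slides do not say how the multiple imputation standard errors were pooled. A common approach is Rubin's rules, sketched below under that assumption; the function name `pool_rubin` and the example numbers are illustrative only. The between-imputation term is what a single imputed dataset cannot supply, which is one reason single imputation standard errors tend to come out smaller.

```python
import math


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")

    pooled_estimate = sum(estimates) / m
    within = sum(variances) / m  # average sampling variance
    between = sum((q - pooled_estimate) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# Made-up estimates from 5 imputed datasets, each with squared SE 0.04.
estimate, std_error = pool_rubin([1.9, 2.1, 2.0, 2.2, 1.8], [0.04] * 5)
print(estimate, std_error)
```

When the imputed datasets all give the same estimate, the between term is zero and the pooled standard error reduces to the within-imputation value.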


[Figure: speed, by imputation method (vertical axis 0 to 0.05): MCMC, Propensity Score (PS), Regular Regression (RR), Maximum Likelihood (ML), Predictive Mean Method (PMM), Stochastic Regression (SR), Multiple Imputation (MI).]


Single Imputation Stochastic Regression yields almost as much variance as the "GOLD STANDARD"

[Figure: chart with a 0% to 100% axis comparing MCMC, Propensity Score (PS), Regular Regression (RR), Maximum Likelihood (ML), Predictive Mean Method (PMM), Stochastic Regression (SR), and Multiple Imputation (MI) for light condition, weather condition, log speed, eject, and injury severity.]


Percent difference between multiple imputation and stochastic regression

Variable   Multiple Imputation vs. Stochastic
speed      25%
weather    14%
light       8%
diagram    10%
event1     56%
event2     46%
eject      26%
injsev     22%
age        15%
Average    25%
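The slide does not state the formula behind these percentages. One plausible reading, assumed here, is the gap between the multiple imputation (MI) and stochastic regression (SR) standard errors taken relative to the MI value. The sketch below applies that to the rounded Table 1 entries; because those entries carry only one decimal, the results will not match the reported figures exactly, and small values such as diagram can differ noticeably.

```python
# (SR, MI) standard errors as printed in Table 1 (units of 10^-3).
table1 = {
    "speed": (35.4, 47.5),
    "weather": (2.1, 2.4),
    "light": (3.5, 3.8),
    "diagram": (0.3, 0.4),
    "event1": (0.5, 1.2),
    "event2": (0.6, 1.1),
    "eject": (0.6, 0.8),
    "injsev": (7.7, 9.8),
    "age": (24.0, 28.3),
}

# Percentages as reported on the slide.
reported = {
    "speed": 25, "weather": 14, "light": 8, "diagram": 10, "event1": 56,
    "event2": 46, "eject": 26, "injsev": 22, "age": 15,
}


def percent_difference(sr, mi):
    """Assumed definition: shortfall of SR relative to MI, in percent."""
    return (mi - sr) / mi * 100.0


recomputed = {name: percent_difference(sr, mi) for name, (sr, mi) in table1.items()}
for name, value in recomputed.items():
    print(f"{name:8s} recomputed {value:5.1f}%   reported {reported[name]}%")

reported_average = sum(reported.values()) / len(reported)
print(f"average of reported values: {reported_average:.1f}%")
```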


Stochastic regression single imputation is good for:
hypothesis generation
applications such as online query systems
less sophisticated users
introducing users to CODES


There are limitations:
Generalization (only for CODES data?)
Comparison was with multiple imputation using 5 imputed datasets.

However, the results are compelling…


Stochastic regression is the best for single imputation

Single imputation can be “good enough”…


More research on multiple imputation by changing to 10 imputed datasets

Proc MI (SAS) vs IVEware

Paper to be published

Online query: MIDAS - MN Injury Data Access System

Single imputed public use datasets (2006-2013).


2006-2013 will be available October 2014. Please contact Jon Roesler anytime by:

jon.roesler@state.mn.us


Yiwen Zhang [email protected] 612-242-4290
