what do you do with missing data? - umn ctssingle imputed public use datasets (2006-2013). 15...

25th Annual CTS Transportation Conference

May 22, 2014 / St. Paul, MN

Yiwen Zhang / Jon Roesler, MS / Anna Gaichas, MS / Mark Kinde, MPH

Minnesota Department of Health

What Do You Do With Missing Data?

E Pluribus Unum (one out of many)

A Comparison of Single Imputation Methods

1

Background / Methods / Results / Discussion / Conclusions

2

Crash Outcome Data Evaluation System / National Transportation Safety Board / June 2013

3

CODES: Probabilistic Linkage (Strategicmatching.com)
Crash (n=183,689), ~15,000 taken to hospital
Hospital (n=131,959), ~35,000 MV Traffic
CODES (23,569)

4

Dataset Creation
CODES software (LinkSolv: StrategicMatching.com)
2009 MN CODES linked dataset (Anna Gaichas)

Ways to deal with the enigma of missing data*

3 Primary Strategies:
Complete Case Analysis
Multiple Imputation
**Single Imputation**
Making up the numbers

*It is a riddle, wrapped in a mystery, inside an enigma. – Winston Churchill

5

Markov Chain Monte Carlo
Propensity Score
Regular Regression
Maximum Likelihood
Predictive Mean Method
Stochastic Regression

6

Multiple Imputation (IVEware 0.2)

Multivariate sequential regression: works by fitting a sequence of regression models. For example, given a variable type, a regression model is chosen:
continuous variable: regular linear regression
binary variable: logistic regression
count variable: Poisson regression
categorical variable (i.e., with >2 levels): polytomous regression
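The variable-type rule above can be sketched as a simple lookup. This is only an illustration of the rule as stated on the slide, not IVEware's implementation; the function name `imputation_plan` and the type assignments for the example variables are assumptions made for the sketch.

```python
# Model families as listed on the slide, keyed by variable type.
MODEL_BY_TYPE = {
    "continuous": "linear regression",
    "binary": "logistic regression",
    "count": "Poisson regression",
    "categorical": "polytomous regression",  # more than 2 levels
}


def imputation_plan(variable_types):
    """Return (variable, model) pairs in the order the variables are given."""
    plan = []
    for name, kind in variable_types.items():
        if kind not in MODEL_BY_TYPE:
            raise ValueError(f"unknown variable type for {name!r}: {kind}")
        plan.append((name, MODEL_BY_TYPE[kind]))
    return plan


# Hypothetical type assignments for a few of the study variables.
example_types = {
    "speed": "continuous",
    "eject": "binary",
    "injsev": "categorical",
}
plan = imputation_plan(example_types)
for name, model in plan:
    print(f"{name}: {model}")
```

In a sequential scheme each variable in the plan would then be imputed in turn from a model of that family fitted on the other variables.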

7

Table 1: Standard errors of the 6 single imputation methods plus the multiple imputation (larger is better!)

Variables    MCMC    PS      RR      ML      PMM     SR      MI
speed(log)   27.4    27.4    26.5    27.6    27.7    35.4    47.5
weather       0.8     0.9     1.0     0.8     1.3     2.1     2.4
light         1.0     1.0     0.6     0.9     1.9     3.5     3.8
diagram       0.3     0.3     0.3     0.3     0.3     0.3     0.4
event1        0.5     0.5     0.4     0.5     0.5     0.5     1.2
event2        0.3     0.3     0.3     0.3     0.3     0.6     1.1
eject         0.3     0.3     0.3     0.3     0.4     0.6     0.8
injsev        6.8     6.8     6.8     6.8     6.8     7.7     9.8
age          12.9    12.9    12.9    12.9    12.9    24.0    28.3

MCMC = Markov Chain Monte Carlo; PS = Propensity Score; RR = Regular Regression; ML = Maximum Likelihood; PMM = Predictive Mean; SR = Stochastic Regression; MI = Multiple Imputation.
Note: multiplying the values by 10^-3 gives the standard errors.
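The slides do not say how the multiple imputation standard errors were combined. A common approach is Rubin's rules, where the total variance is the average within-imputation variance plus an inflated between-imputation variance. The sketch below assumes that approach, and the estimates and variances in it are made-up numbers, not values from this study. It shows why a multiple imputation standard error is normally larger than one computed from a single completed dataset.

```python
import statistics


def rubin_standard_error(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (m - 1)
    total = within + (1 + 1 / m) * between
    return total ** 0.5


# Hypothetical results from m = 5 imputed datasets (illustrative only).
estimates = [10.0, 10.4, 9.8, 10.1, 9.7]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

se_multiple = rubin_standard_error(estimates, variances)
se_single = variances[0] ** 0.5  # what one completed dataset alone would report

print(f"single dataset SE: {se_single:.3f}")
print(f"pooled MI SE:      {se_multiple:.3f}")
```

The between-imputation term carries the uncertainty about the missing values themselves, which a single completed dataset cannot express.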

8

Chart: speed (axis 0 to 0.05), by method: MCMC, Propensity Score (PS), Regular Regression (RR), Maximum Likelihood (ML), Predictive Mean Method (PMM), Stochastic Regression (SR), Multiple Imputation (MI)

9

Single Imputation Stochastic Regression yields almost as much variance as the "GOLD STANDARD"

Chart (axis 0% to 100%): light condition, weather condition, logspeed, eject, injured severity, by method: MCMC, Propensity Score (PS), Regular Regression (RR), Maximum Likelihood (ML), Predictive Mean Method (PMM), Stochastic Regression (SR), Multiple Imputation (MI)
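Why stochastic regression keeps more variance than regular regression can be seen in a small simulation. The sketch below uses synthetic data (not CODES data) and a one-predictor linear model; the helper names `fit_line` and `impute` are made up for the illustration. Regular regression fills each gap with the fitted value, while stochastic regression adds a random draw from the residual distribution to that fitted value.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def impute(xs, ys, stochastic, rng):
    """Fill None entries of ys from a regression of observed ys on xs."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in observed], [y for _, y in observed])
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            y = a + b * x                  # regular regression: fitted value
            if stochastic:
                y += rng.gauss(0.0, sd)    # stochastic: add residual noise
        filled.append(y)
    return filled


rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(2000)]
truth = [2.0 + 0.5 * x + rng.gauss(0.0, 3.0) for x in xs]
# Delete roughly 40% of the outcomes completely at random.
ys = [None if rng.random() < 0.4 else y for y in truth]

sd_truth = statistics.stdev(truth)
sd_rr = statistics.stdev(impute(xs, ys, stochastic=False, rng=rng))
sd_sr = statistics.stdev(impute(xs, ys, stochastic=True, rng=rng))

print(f"true SD {sd_truth:.2f}, regular {sd_rr:.2f}, stochastic {sd_sr:.2f}")
```

In this setup the regular regression fill shrinks the spread of the completed variable, because every imputed value sits exactly on the fitted line, while the stochastic fill stays close to the spread of the full data.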

10

Percent difference between multiple imputation and stochastic regression

Variables    Multiple Imputation vs. Stochastic
speed        25%
weather      14%
light        8%
diagram      10%
event1       56%
event2       46%
eject        26%
injsev       22%
age          15%
Average      25%
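The slide does not state how the percent difference was computed. A minimal check, assuming it is (MI - SR) / MI applied to the standard errors: with the rounded Table 1 values this reproduces the reported figures for speed, light and age, and is close for most other variables. Remaining gaps (diagram in particular) may come from Table 1 being rounded to one decimal, but that is a guess. The reported average is the plain mean of the nine listed percentages.

```python
import statistics

# Table 1 standard errors (units of 10^-3): (stochastic regression, multiple imputation)
table1 = {
    "speed": (35.4, 47.5),
    "weather": (2.1, 2.4),
    "light": (3.5, 3.8),
    "diagram": (0.3, 0.4),
    "event1": (0.5, 1.2),
    "event2": (0.6, 1.1),
    "eject": (0.6, 0.8),
    "injsev": (7.7, 9.8),
    "age": (24.0, 28.3),
}


def percent_difference(sr, mi):
    """Assumed formula: shortfall of the SR standard error relative to MI."""
    return 100.0 * (mi - sr) / mi


from_table1 = {name: percent_difference(sr, mi) for name, (sr, mi) in table1.items()}

# Percentages as reported on the slide, in the same variable order.
reported = [25, 14, 8, 10, 56, 46, 26, 22, 15]
average_reported = statistics.fmean(reported)

for name, value in from_table1.items():
    print(f"{name}: {value:.1f}%")
print(f"mean of reported percentages: {average_reported:.1f}%")
```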

11

Stochastic regression single imputation is good for:
hypothesis generation
applications such as online query systems
less sophisticated users
introducing users to CODES

12

There are limitations
Generalization (only for CODES data?)
Compared only with multiple imputation using 5 imputed datasets.

However, the results are compelling…

13

Stochastic regression is the best for single imputation

Single imputation can be “good enough”…

14

More research on multiple imputation by changing to 10 imputed datasets

Proc MI (SAS) vs IVEware
Paper to be published
Online query: MIDAS - MN Injury Data Access System

Single imputed public use datasets (2006-2013).

15

The 2006-2013 datasets will be available October 2014. Please contact Jon Roesler anytime at:

[email protected]

16


Yiwen Zhang [email protected] 612-242-4290

17
