TRANSCRIPT
25th Annual CTS Transportation Conference
May 22, 2014 / St. Paul, MN
Yiwen Zhang / Jon Roesler, MS / Anna Gaichas, MS / Mark Kinde, MPH
Minnesota Department of Health
What Do You Do With Missing Data?
E Pluribus Unum (one out of many)
A Comparison of Single Imputation Methods
1
Background Methods Results
Discussion Conclusions
2
Crash Outcome Data Evaluation System National Transportation Safety Board June 2013
3
CODES
(23,569)
Crash (n=183,689)
~15,000 taken to hospital
Hospital (n=131,959)
~35,000 MV Traffic
Probabilistic Linkage
(Strategicmatching.com)
4
Dataset Creation
CODES software (LinkSolv: StrategicMatching.com)
2009 MN CODES linked dataset (Anna Gaichas)
Ways to deal with the enigma of missing data*
3 Primary Strategies:
Complete Case Analysis
Multiple Imputation
**Single Imputation**
Making up the numbers
*It is a riddle, wrapped in a mystery, inside an enigma. – Winston Churchill
5
Markov Chain Monte Carlo
Propensity Score
Regular Regression
Maximum Likelihood
Predictive Mean Method
Stochastic Regression
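Of the single imputation methods listed above, regular regression and stochastic regression differ in one step: regular regression fills each gap with the fitted value, while stochastic regression adds a random draw from the residual distribution to that fitted value. The sketch below illustrates this on simulated data, not on the CODES data. The variable names, the 40% missing rate, and the single-predictor linear model are all assumptions made for illustration.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def impute(xs, ys, stochastic, rng):
    """Fill None entries in ys from a regression of y on x."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in observed], [y for _, y in observed])
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            y = a + b * x                  # regular regression: fitted value
            if stochastic:
                y += rng.gauss(0.0, sd)    # stochastic regression: add residual noise
        filled.append(y)
    return filled

# Simulated data (hypothetical, for illustration only).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(2000)]
true_ys = [2.0 + 0.5 * x + rng.gauss(0.0, 3.0) for x in xs]
# Delete about 40% of y completely at random.
ys = [y if rng.random() > 0.4 else None for y in true_ys]

rr = impute(xs, ys, stochastic=False, rng=rng)
sr = impute(xs, ys, stochastic=True, rng=rng)
var_true = statistics.variance(true_ys)
var_rr = statistics.variance(rr)
var_sr = statistics.variance(sr)
print("variance, complete data:        ", round(var_true, 2))
print("variance, regular regression:   ", round(var_rr, 2))
print("variance, stochastic regression:", round(var_sr, 2))
```

In this simulation the regular regression fill understates the variance of the completed variable, because every imputed value sits exactly on the fitted line. The stochastic version restores the residual spread.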
6
Multiple Imputation (IVEware 0.2)
Multivariate sequential regression: works by fitting a sequence of regression models.
For example, given a variable type, a regression model is chosen:
continuous variable: regular linear regression
binary variable: logistic regression
count variable: Poisson regression
categorical variable: polytomous regression (i.e., with >2 levels)
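The type-to-model mapping above can be sketched as a small lookup. This is not IVEware code. The variable names, declared types, and missing counts are hypothetical, and ordering the sequence from least to most missing is an assumption about how a sequential regression pass is usually arranged.

```python
# Regression family chosen for each declared variable type (from the slide).
MODEL_FOR_TYPE = {
    "continuous": "linear regression",
    "binary": "logistic regression",
    "count": "Poisson regression",
    "categorical": "polytomous regression",
}

def plan_sequence(variable_types, missing_counts):
    """Return (variable, model) pairs in the order they would be imputed.

    Assumption: variables are visited from fewest to most missing values.
    """
    order = sorted(variable_types, key=lambda name: missing_counts[name])
    return [(name, MODEL_FOR_TYPE[variable_types[name]]) for name in order]

# Hypothetical variables and missing counts, for illustration only.
variable_types = {
    "logspeed": "continuous",
    "eject": "binary",
    "weather": "categorical",
    "n_vehicles": "count",
}
missing_counts = {"logspeed": 900, "eject": 50, "weather": 300, "n_vehicles": 10}

plan = plan_sequence(variable_types, missing_counts)
for name, model in plan:
    print(f"{name}: {model}")
```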
7
Table 1: Standard errors of the 6 single imputation methods plus the multiple imputation (larger is better!)
Variable     MCMC   PS     RR     ML     PMM    SR     MI
speed(log)   27.4   27.4   26.5   27.6   27.7   35.4   47.5
weather      0.8    0.9    1.0    0.8    1.3    2.1    2.4
light        1.0    1.0    0.6    0.9    1.9    3.5    3.8
diagram      0.3    0.3    0.3    0.3    0.3    0.3    0.4
event1       0.5    0.5    0.4    0.5    0.5    0.5    1.2
event2       0.3    0.3    0.3    0.3    0.3    0.6    1.1
eject        0.3    0.3    0.3    0.3    0.4    0.6    0.8
injsev       6.8    6.8    6.8    6.8    6.8    7.7    9.8
age          12.9   12.9   12.9   12.9   12.9   24.0   28.3
MCMC = Markov Chain Monte Carlo, PS = Propensity Score, RR = Regular Regression, ML = Maximum Likelihood, PMM = Predictive Mean, SR = Stochastic Regression, MI = Multiple Imputation
Note: multiplying the values by 10^-3 gives the standard errors.
8
[Figure: speed, by method: MCMC, Propensity Score (PS), Regular Regression (RR), Maximum Likelihood (ML), Predictive Mean Method (PMM), Stochastic Regression (SR), Multiple Imputation (MI); axis from 0 to 0.05]
9
Single Imputation Stochastic Regression yields almost as much variance as the "GOLD STANDARD"
[Figure: light condition, weather condition, logspeed, eject, and injury severity, by method: MCMC, Propensity Score (PS), Regular Regression (RR), Maximum Likelihood (ML), Predictive Mean Method (PMM), Stochastic Regression (SR), Multiple Imputation (MI); axis from 0.0% to 100.0%]
10
Percent difference between multiple imputation and stochastic regression
Variables Multiple Imputation vs. Stochastic
speed 25%
weather 14%
light 8%
diagram 10%
event1 56%
event2 46%
eject 26%
injsev 22%
age 15%
Average 25%
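The percentages above can be roughly rechecked against the Table 1 standard errors. The sketch below assumes the percent difference is (MI - SR) / MI. The slides do not state the formula, so this is an inference. It matches the reported speed, light, and age rows after rounding. Other rows come out differently, most visibly diagram, presumably because Table 1 shows only one decimal place.

```python
# Standard errors (x 10^-3) for stochastic regression (SR) and
# multiple imputation (MI), copied from Table 1.
TABLE1 = {
    "speed": (35.4, 47.5),
    "weather": (2.1, 2.4),
    "light": (3.5, 3.8),
    "diagram": (0.3, 0.4),
    "event1": (0.5, 1.2),
    "event2": (0.6, 1.1),
    "eject": (0.6, 0.8),
    "injsev": (7.7, 9.8),
    "age": (24.0, 28.3),
}

# Percent differences as reported on the slide.
REPORTED = {
    "speed": 25, "weather": 14, "light": 8, "diagram": 10, "event1": 56,
    "event2": 46, "eject": 26, "injsev": 22, "age": 15,
}

def percent_difference(sr, mi):
    """Assumed formula: shortfall of the SR standard error relative to MI."""
    return 100.0 * (mi - sr) / mi

computed = {name: percent_difference(sr, mi) for name, (sr, mi) in TABLE1.items()}
reported_average = sum(REPORTED.values()) / len(REPORTED)

for name in TABLE1:
    print(f"{name}: computed {computed[name]:.1f}%, reported {REPORTED[name]}%")
print(f"average of reported values: {reported_average:.1f}%")
```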
11
Stochastic regression single imputation is good for:
hypothesis generation
applications such as online query systems
less sophisticated users
introducing users to CODES
12
There are limitations
Generalization (only for CODES data?)
Comparison was against multiple imputation with 5 imputed datasets.
However, the results are compelling…
13
Stochastic regression is the best for single imputation
Single imputation can be “good enough”…
14
More research on multiple imputation by changing to 10 imputed datasets
Proc MI (SAS) vs IVEware
Paper to be published
Online query: MIDAS - MN Injury Data Access System
Single imputed public use datasets (2006-2013).
15
2006-2013 will be available October 2014. Please contact Jon Roesler anytime by:
16
Yiwen Zhang [email protected] 612-242-4290
17