recent advances in missing data methods: …...recent advances in missing data methods: imputation...
TRANSCRIPT
Recent Advances in Missing Data Methods:Imputation and Weighting
Elizabeth A. Stuart
Johns Hopkins Bloomberg School of Public HealthDepartment of Mental HealthDepartment of Biostatistics
[email protected]/∼estuart
MCHB/HRSASeptember 27, 2012
E. Stuart (JHSPH) MCHB September 27, 2012 1 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 2 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 2 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 2 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 2 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 2 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 2 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 3 / 103
Course description
Missing data is a problem in almost every research study, and standard ways of dealing
with missing values, such as complete case analysis, are generally inappropriate. This
session will discuss the drawbacks of traditional methods for dealing with missing data
and describe why newer methods, such as multiple imputation, are preferable.
Discussion will focus in particular on Multiple Imputation by Chained Equations, which
is particularly useful for large datasets with complex data structures. Emphasis will be
on providing practical tips and guidance for implementing multiple imputation and
analyzing and interpreting multiply imputed data.
E. Stuart (JHSPH) MCHB September 27, 2012 4 / 103
Missing data
Missing data common, especially with administrative data (e.g.,medical records) or sensitive surveys
Advanced methods have been developed to handle missing data
But how do we actually implement those methods?
What are the implications for analyses?
E. Stuart (JHSPH) MCHB September 27, 2012 5 / 103
Why should you pay attention?
Ignoring or inappropriately handling missing data may lead to...
Biased estimates
Incorrect standard errors
Incorrect inferences/results
E. Stuart (JHSPH) MCHB September 27, 2012 6 / 103
Types of missing data
Will discuss two main types of missing data:
“Unit nonresponse”: when data for an entire “unit” (e.g., individual)is missing
e.g., did not respond at all to follow-up surveyAlso called “attrition”Usually handled using nonresponse weighting adjustments
“Item nonresponse”: when individual items are missing for anindividual
e.g., someone answered most of the survey questions, but left a fewblankUsually handled using imputation approaches
E. Stuart (JHSPH) MCHB September 27, 2012 7 / 103
Lots of reasons for missingness...
Non-response/attrition
Data entry errors
Administrative data with missing values
Lost survey forms
Individuals not wanting to disclose (or not knowing) particularinformation
Note: sometimes entire variables are missing in that they are “latent”;we will generally not be talking about those types of variables
E. Stuart (JHSPH) MCHB September 27, 2012 8 / 103
More formally...“Missing data mechanisms”
Need to understand what led to missing values
Missing Completely at Random (MCAR): Missingness is totallyrandom; does not depend on anything
P(R|Y ,X ) = P(R|Y ,X obs ,Xmis) = P(R|ψ)Cases with missing values a random sample of the original sampleNo systematic differences between those with missing and observedvaluesAnalyses using only complete cases will not be biased, but may havelow powerGenerally unrealistic, although may be reasonable for things like dataentry errors
E. Stuart (JHSPH) MCHB September 27, 2012 9 / 103
Missing At Random (MAR): Missingness depends on observed data
P(R|Y ,X ) = P(R|Y ,X obs , ψ)e.g., women more likely to respond than menSo there are differences between those with observed and missingvalues, but we observe the ways in which they differCan use weighting or imputation approaches to deal with themissingnessThis is probably the assumption made most frequentlySatisfied for data missing by designIncluding a lot of predictors in the imputation model can make thismore plausible
E. Stuart (JHSPH) MCHB September 27, 2012 10 / 103
Not Missing At Random (NMAR): Missingness depends onunobserved values
P(R|Y ,X ) cannot be simplifiede.g., probability of someone reporting their income depends on whattheir income ise.g., probability of reporting psychiatric treatment depends on whetheror not they have received iti.e., even among people with the same values of the observedcovariates, those with missing values on Y have a different distributionof Y than do those with observed YSo we can’t just use the observed cases to help impute the missingcasesUnfortunately no easy ways of dealing with this...have to posit somemodel of the missing data process
E. Stuart (JHSPH) MCHB September 27, 2012 11 / 103
Of course those are assumptions...
Never know which of them is correct
Can do diagnostics/tests for whether missingness is MCAR vs. (MARor NMAR) (Enders 2010, p. 18)
Does the probability of missingness depend on other variables?e.g., are the mean ages of people with missing and non-missing valuesof drug use behavior different?e.g., In a logistic regression predicting missingness on some variable,are there other variables that are significant predictors?Little (1998; JASA): test for MCAR (implemented in Stata’s mipackage, possibly others)
E. Stuart (JHSPH) MCHB September 27, 2012 12 / 103
But never know for sure if missingness is MAR or NMAR...
Have to use substantive understanding of what might have led tomissing valuese.g., Are those who had been arrested more likely to not respond to aquestion asking about previous arrests? (They may not want to lie, butalso may not want to tell the truth...)Helps to have a good understanding of the data collection processIf believe missingness is NMAR, have to posit some model for themissingness (e.g., that those with previous arrests are 10% more likelyto not respond to that question)Tailored for each research questionSiddique and Belin (2008): example of missing depression levels;simulations show value in using a variety of assumptions and modelsOther references: Hedeker and Gibbons (1997), Resseguier et al.(2011; sensMICE package for R), Enders (2011),
E. Stuart (JHSPH) MCHB September 27, 2012 13 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 14 / 103
Inappropriate ways of handling missing data
Ignoring it
Complete case
Single imputation
Missing indicator approach
Last observation carried forward
E. Stuart (JHSPH) MCHB September 27, 2012 15 / 103
Ignoring it...
Common approach is to “ignore” it; just run models without doinganything about missingness
Then what is done will depend on the defaults of the software
Usually will be the same as complete-case analyses, discussed next
E. Stuart (JHSPH) MCHB September 27, 2012 16 / 103
Complete case analysis
Restrict analyses to individuals with observed data
Generally bad!
Assumes missingness is MCAROften results in lots of cases dropped...decreased power and loss ofrepresentativeness (Little and Rubin, 2002; page 42)Generally leads to biased results
Is also model-dependent...will mean that different analyses may usedifferent subsets of the data (unless do big restriction at thebeginning)
(Also called listwise deletion)
E. Stuart (JHSPH) MCHB September 27, 2012 17 / 103
Single imputation: Fill in (“impute”) each missing value
Ways of doing that imputation:
Mean
Regression prediction (“conditional mean imputation”)
e.g., impute mean within categories of observed covariates (gender,race, etc.)e.g., fit regression model among observed cases, use to predictresponse for individuals with missing values
Yi = α+ βXi
Regression prediction plus error (“stochastic regression imputation”)
Like regression prediction, but also add random error
Yi = α+ βXi + ei , ei ∼ N(0, σ2)
E. Stuart (JHSPH) MCHB September 27, 2012 18 / 103
“Hot-deck”
For an individual with missing data, find individuals with the sameobserved values on other variables, randomly pick one of their values asthe one to use for imputation
Predictive mean matching
Like a combination of regression prediction and hot-deckTake observed value from someone with similar predicted valueWorks well when values non-normally distributed or have bounds(ensures that imputed values in same range as observed values)
E. Stuart (JHSPH) MCHB September 27, 2012 19 / 103
Other (inappropriate) strategies
Missing data indicator
Do simple imputation and include indicator of missingness as anadditional predictor in regression modelsDoesn’t work very well and can lead to bias (Vach and Blettner 1991,Donders et al. 2006, Greenland and Finkle 1995)
Last observation carried forward
For longitudinal studiesIf someone drops out of study, the last value observed for them is“carried forward” (copied) to later time pointsBut generally biased (Carpenter et al. 2004; Cook, Zeng, and Yi, 2004;Jansen et al. 2006)
E. Stuart (JHSPH) MCHB September 27, 2012 20 / 103
Summary of single imputation approaches
Best are regression prediction plus error or hot-deck (based oncategorical versions of all of the variables observed)
Can be reasonable, especially if not a lot of missing data, e.g., < 5%(Graham 2008)
BUT...results in overly precise estimates
Analyses following single imputation do not know that some of thevalues have been imputedSimply treats all of the values as observed valuesSo does not take into account the uncertainty in the imputations
Anti-conservative...results will have more significance, narrowerconfidence intervals, than they should (Donders et al. 2006)
Higher Type I error rates
So what to do instead?
E. Stuart (JHSPH) MCHB September 27, 2012 21 / 103
Appropriate ways of handling missingness
Maximum likelihood
Alternative sources of information
Weighting
Multiple imputation
Remember: Goal is not to get correct predictions of missing values;goal is to obtain accurate parameter estimates for relationships ofinterest
E. Stuart (JHSPH) MCHB September 27, 2012 22 / 103
Maximum likelihood
In some cases, maximum likelihood approaches exist
Directly maximize the likelihood function, f (X ,Y )
Use observed values, take missingness into account
e.g., longitudinal analyses that use the observations available for eachperson and correctly account for the missing observations
When ML methods exist, can work very well (e.g., Mplus, LISREL)
But they don’t always exist so not always a feasible option
Another drawback is that you cannot use auxiliary information toimprove the predictions; uses only the variables in the actual analysis(and assumes MAR given those)
Graham (2008), Siddique et al. (2008)
E. Stuart (JHSPH) MCHB September 27, 2012 23 / 103
Alternative sources of information
In some cases, can utilize a secondary data source to get neededinformation
Or, for example, calibrate numbers to known totals
e.g., issue of missing information about offender and incident in theSupplemental Homicide Reports (SHR)
Compare victim counts in SHR to similar data from NCHS, adjust asnecessary (Fox and Zawitz 2004)Wadsworth and Roberts (2008) evaluates four common techniques fordealing with this missingness that utilize supplemental info from policerecords
E. Stuart (JHSPH) MCHB September 27, 2012 24 / 103
Nonresponse weighting
Often used to deal with attrition
Generate model predicting non-response given observed covariates
Weight respondents by their inverse probability of response
Weights the respondents up to represent the full sampleSame idea as survey sampling weights
Use analysis methods that allow for weights (e.g., survey packages)
Works well for simple missing data patterns (e.g., attrition)
Oh and Scheuren (1983), Kalton and Flores-Cervantes (2003), Hortonand Lipsitz (1999), Carpenter et al. (2006)
E. Stuart (JHSPH) MCHB September 27, 2012 25 / 103
A simple example...
Imagine 100 males and 100 females in sample
But only 80 males and 75 females respond
Male respondents will get weight of 100/80 = (1/(80/100)) = 1.25
Female respondents will get weight of 100/75 = (1/(75/100)) =1.333
So, e.g., a male respondent represents 1.25 males in the originalsample
These weights will make the 80 male and 75 female respondentsrepresent the full sample of 200
E. Stuart (JHSPH) MCHB September 27, 2012 26 / 103
To implement weighting adjustments:
Fit model predicting response as a function of fully observedcharacteristicsAssign respondents a weight of 1/(p(response))Use those weights in regression models and summary statistics
Will weight the respondents to look like the full original sample
Like survey sampling weights, except estimated instead of known
Can be used for attrition as well as for original survey response
E. Stuart (JHSPH) MCHB September 27, 2012 27 / 103
Use model with many characteristics, generally measured at baseline
Treat the weights like you would survey sampling weights (e.g., usingsurvey packages), run weighted models (e.g., pweight in Stata)
Some concern about extreme weights
Check distribution of weights, trim outliersSome do a “weighting class adjustment” where actually just form 5subclasses based on the probabilities and everyone in each subclass getsthe same weight
Relatively simple (and widely accepted) way of handling attrition/unitnon-response
E. Stuart (JHSPH) MCHB September 27, 2012 28 / 103
Now on to item non-response . . .
E. Stuart (JHSPH) MCHB September 27, 2012 29 / 103
Multiple imputation
Same idea as single imputation, but fills in each missing valuemultiple times
Like repeating the stochastic mean imputation multiple times
Creates multiple (e.g., 10) “complete” data sets
Analyses then run separately on each dataset and results combinedacross datasets
Standard “combining rules” (Rubin 1987)(Software will do this for you)
Total variance a function of within-imputation variance andbetween-imputation variance
Takes into account the uncertainty in the imputations
Also nice because very general: same set of imputations can be usedfor many analyses
“Imputer” may be different from “analyst”
E. Stuart (JHSPH) MCHB September 27, 2012 30 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 31 / 103
Traditional approach to creating multiple imputations
Joint model of all variables
e.g., multivariate normal distribution
Fit using the observed cases
Used to predict (multiple times) the missing values
Sometimes multivariate normal model used even with categoricalvariables, but this can be severely biased (Horton, Lipsitz, andParzen, 2003; Allison 2005)
Can’t easily handle complexities such as skip patterns, bounds,restrictions, complex designs
Software: norm, mix, SAS proc mi, Stata mi
E. Stuart (JHSPH) MCHB September 27, 2012 32 / 103
Newer approach: Multiple imputation by chained equations(MICE)
Fit model of each variable, conditional on all others
Iterate fitting model and imputing each variable
Models used depend on type of variable(categorical/continuous/binary)
Raghunathan et al. (2001), van Buuren et al. (2006)
Also called “fully conditional specification” or “sequential regressionmultiple imputation”
E. Stuart (JHSPH) MCHB September 27, 2012 33 / 103
Example of MICE
3 variables: X1 (binary), X2 (continuous), X3 (ordinal)Steps in MICE:
1 Do simple imputations to fill in missing values for X1, X2, X3
2 Using cases with observed X1, fit logistic regression model ofX1 ∼ X2 + X3; predict missing values of X1
3 Using cases with observed X2, fit normal regression model ofX2 ∼ X1 + X3; predict missing values of X2
4 Using cases with observed X3, fit proportional odds regression modelof X3 ∼ X1 + X2; predict missing values of X3
5 Iterate Steps 2-4
6 Repeat Step 5 to get multiple imputations
E. Stuart (JHSPH) MCHB September 27, 2012 34 / 103
Pros and cons of MICE
Benefits
Can more easily work in large datasetsModels can more accurately reflect distribution of each variableAllows bounds (e.g., age started smoking)Incorporates restriction to subpopulations (e.g., age started smoking)
Drawbacks
Potentially less principled than joint modeDoesn’t necessarily imply a proper joint distribution(Although this doesn’t seem to be a big problem in practice)
E. Stuart (JHSPH) MCHB September 27, 2012 35 / 103
Software to implement MICE
SAS and stand-alone: IVEWare
Stata: ice
R: mice, mi
SPSS: Missing Values add-on module (“fully conditionalspecification” option)
E. Stuart (JHSPH) MCHB September 27, 2012 36 / 103
Steps to implementing MI methods
1 Examine rates and patterns of missingness, and any predictors ofmissingness
2 Generate imputations
3 Diagnose and assess imputations
4 Analysis
E. Stuart (JHSPH) MCHB September 27, 2012 37 / 103
Motivating example: PRAMS data
Hypothetical analysis using NYC PRAMS data
Interested in predicting postpartum depression (“depress”) as afunction of other predictors
Predictors include demographics, information on the delivery, healthmeasures, etc.
About 40 possible predictors
But lots of missingness
Will go through some code and diagnostics here
Also see Stuart et al. (2009), Azur et al. (2011) for another example,including code
E. Stuart (JHSPH) MCHB September 27, 2012 38 / 103
Step 1: Rates of missingness
High rates of missingness for some variables
Variable % MissingBMI 0.2%Maternal age 1.6%Diabetes status 2%Freq of drinking 3.8%Sad 8.2%Nhope 11.2%Slow 11.1%Income 21%
E. Stuart (JHSPH) MCHB September 27, 2012 39 / 103
Missingness depends on observed characteristics
Not MCAR: means of variables differ across those with missing vs.observed values on other items
Don’t have reason to think missingness is NMAR so comfortable withMAR
E. Stuart (JHSPH) MCHB September 27, 2012 40 / 103
Step 2: Generate imputations
Need to specify model for each variable, conditional on all othervariablesCheck to see if transformations make sense (e.g., to look morenormally distributed; see White et al., 2011 for examples)
If non-normal, can also use predictive mean matching (White et al.,2011)
May make sense to include some variables in imputation model, evenif not going to be used in analyses (“auxiliary variables”; Collins et al.2001)
i.e., include more rather than fewer variables in imputation procedureThere may be some that are very predictive of missing values, even ifthey aren’t of primary interest in analyses
With so many variables, can’t possibly do careful model selection foreach one
Some software packages allow stepwise selection to select imputationmodel for each variable (IVEWare, add-on packages for Stata’s ice:pred eq, check eq)
E. Stuart (JHSPH) MCHB September 27, 2012 41 / 103
What variables should be included?
Any variables that will be used in subsequent analyses
Otherwise its associations with other variables will be attenuated inanalyses
Any higher-order effects that are of interest in the analysis phase
Any other special features of the data (e.g., survey weights)
Can be fairly loose and include a lot, especially if using stepwiseprocedures
In PRAMS, N ∼ 1400, had about 40 variables in imputationprocedure
Note: Don’t need to specify which are dependent vs. independent
E. Stuart (JHSPH) MCHB September 27, 2012 42 / 103
General guidance/issues
Imputation and analysis compatibility
Imputation model should be more general than analysis model that willbe used: otherwise risk finding null effects simply because dataimputed assuming no relationship between variablesMay want to force some variables into the models even if do stepwise
How many imputations to generate?
Conventional advice has been 5-10, but more (e.g., 40) may yieldincreased power (Graham, Olchowski, & Gilreath, 2007)White et al. (2011) recommend m = 100 ∗ FMI (FMI=fraction ofmissing information)
Since FMI hard to estimate, but Bodner’s approximation says FMI < %missing cases, approximate m = 100∗(% missing cases)e.g., 20% missing cases would imply m = 20
White et al. (2011) also argue that for reproducibility may needm > 100Stata command to check if you’ve done enough imputations: mimmcerror
E. Stuart (JHSPH) MCHB September 27, 2012 43 / 103
Importance of auxiliary variables
Can be very beneficial to include “auxiliary variables:” not of interest inthe analysis in and of themselves, but might help with the imputationsCollins et al. (2003) show that not much cost to including these extravariables and they can help a lotIncluding a lot of variables can also make MAR assumption morereasonable(No easy way to incorporate this extra information in maximumlikelihood approaches; see Enders (2010, Chp. 5))
E. Stuart (JHSPH) MCHB September 27, 2012 44 / 103
Variables that are functions of others
In this case, dependent variable (depress) is a function of 3 othervariablesBest to impute the 3 variables separately, then re-create the depresssummary variable after the imputations are created
Keeps the imputation process more general (e.g., in case you want touse the individual variables for something)Retains observed values on the 3 individual variables (e.g., someonewho is missing one would be missing for depress but we don’t want tothrow away their 2 observed values on the other two variables)
Once imputations can created, can manipulate those datasets basicallyin the same ways you normally would(A bit more on this later)
E. Stuart (JHSPH) MCHB September 27, 2012 45 / 103
Implementing MICE
In R using the mi package: imp< −mice(data, pred=pred,maxit=10,m=10,seed=92385)
In Stata using ice:
ice modeprt2 batch2 wtanal i.stratumc mmhbp vagdel m.race2 bornusplural2 married2 o.matdeg o.pnc /// lga2 sga o.kotelchuck2 o.prelbo.gestwk bmi wrknow o.income2 medicaid2 wicpreg /// o.pp drinkm.diabetes2 preexer infert tx m.feelpg2 o.smoke o.strs inficu o.los2o.bfeed2 back sleep2 cosleep2 /// ppvchk i.inf age2 o.sad o.nhopeo.slow matage totcnt, saving(impute, replace) m(10) boot(mmhbpvagdel race2 /// bornus matdeg pnc lga2 sga kotelchuck2 prelb gestwkwrknow income2 medicaid2 wicpreg pp drink diabetes2 /// preexerinfert tx feelpg2 smoke strs inficu los2 bfeed2 back sleep2 cosleep2ppvchk sad nhope slow matage) /// seed(1285964)
E. Stuart (JHSPH) MCHB September 27, 2012 46 / 103
Step 3: Diagnosing and assessing imputations
Try to identify potentially problematic variables
Two types of comparisons:
Before and after imputationAcross two imputation sets with slightly different settings (e.g.,different criteria in the stepwise model)
Can also do diagnostics that assess the imputation models themselves(Su et al., 2009)
Most packages have very limited diagnostics
R’s mi and mice packages have the most
Note: Differences don’t mean something is wrong! Could be becauseof differences in the types of people with observed vs. missing data
E. Stuart (JHSPH) MCHB September 27, 2012 47 / 103
Graphical summaries
Bivariate scatterplots of observed and imputed values
Residual plots, for observed and imputed values
Density plots of observed and imputed values
Next slide: Example from R mi package
E. Stuart (JHSPH) MCHB September 27, 2012 48 / 103
Den
sity
0.00
.20.
40.6
0.81
.0
−5 0 5 10
matdeg
01
23
0 1 2
pnc
0.00
.20.
40.6
0.81
.0
0 2 4
kotelchuck2
0.0
1.0
2.0
0 1 2 3 4
gestwk
0.0
0.1
0.2
0.3
0.4
10 20 30 40 50 60
bmi
0.00
0.10
0.20
0.30
−2 0 2 4 6 8
income2
0.0
0.5
1.0
1.5
0 2 4
pp_drink2
01
23
0 1 2
smoke2
0.0
0.4
0.8
−1 0 1 2 3
strs
0.0
0.4
0.8
−1 0 1 2 3
los2
0.0
0.5
1.0
−1 0 1 2 3
bfeed2
0.0
0.2
0.4
0.6
0 2 4
sad
0.0
0.5
1.0
1.5
0 1 2 3 4
nhope
0.0
0.2
0.4
0.6
0 2 4
slow
0.0
0.4
0.8
1.2
0 2 4 6
matage
E. Stuart (JHSPH) MCHB September 27, 2012 49 / 103
Numerical diagnostics
Some packages automatically print out some diagnostics
Model fit information to assess imputation models
Comparisons of original and imputed values (Abayomi et al., 2008)
Should check that imputations look reasonable (e.g., compare means,correlation coefficients)Make sure values being imputed are in the correct ranges
Potentially flag problematic variables to warn future data analysts
E. Stuart (JHSPH) MCHB September 27, 2012 50 / 103
At a basic level . . .
Can compare means of variables pre and post imputation, make surereasonable
Note that changes are not necessarily bad! (May in fact be fixing theproblem!)
Variable Mean before Mean afterBMI 24.97 24.97Freq of drinking 0.66 0.66Sad 1.18 1.17Nhope 0.56 0.57Slow 1.23 1.21Income 2.99 2.75
E. Stuart (JHSPH) MCHB September 27, 2012 51 / 103
Step 4: Analyses
Combining rules allow the combination of results across the multiplyimputed data sets (Rubin 1987)
Account for both within- and between-imputation variance
Run analysis separately within each “complete” dataset, thencombine across datasets
Software packages have automated version of this for many models
Stata: mim, mi packageSAS: proc mianalyzeHLM: multiple imputation optionsMplus: multiple imputation command
For other models, may need to do it “by hand”
Final results just need to report the combined estimates; interpret asyou would a standard regression
E. Stuart (JHSPH) MCHB September 27, 2012 52 / 103
The math behind the combining
Qj = estimate of scalar quantity of interest (e.g., regressioncoefficient) from complete dataset j
Uj = standard error of Qj
Overall estimate just the average of the estimates from each completedataset
Q =1
m
m∑j=1
Qj
E. Stuart (JHSPH) MCHB September 27, 2012 53 / 103
For the overall variance, first calculate the average within-imputationvariance (U) and the between-imputation variance (B)
U =1
m
m∑j=1
Uj
B =1
m − 1
m∑j=1
(Qj − Q)2
The total variance of Q is then
T = U + (1 +1
m)B
Degrees of freedom for t distribution can also be calculated (Enders,2010, p. 231+)
See Schafer (1997) or Little and Rubin (2002) for details
E. Stuart (JHSPH) MCHB September 27, 2012 54 / 103
PRAMS data analysis
Ran models predicting small for gestational age (sga; binary variable)as a function of other predictors
Note: Analyses not meant to have substantive meaning or to presentactual findings on these associations
Purely illustrative examples to demonstrate the methods
E. Stuart (JHSPH) MCHB September 27, 2012 55 / 103
PRAMS data analysis: Complete case (sga)
. svy: logit sga strs i.feelpg2 vagdel i.race2 income2 medicaid2 wicpreg matdeg bmi
bmisq smoke2 matage matagesq infert_tx (running logit on estimation sample)
Survey: Logistic regression
Number of strata = 2 Number of obs = 897
Number of PSUs = 897 Population size = 76003.841
Design df = 895
F( 19, 877) = 1.98
Prob > F = 0.0077
------------------------------------------------------------------------------
| Linearized
sga | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
strs | .0583217 .1563791 0.37 0.709 -.2485908 .3652342
feelpg2 |
3 | .083498 .4021682 0.21 0.836 -.7058046 .8728006
4 | .1576447 .3839254 0.41 0.681 -.5958543 .9111436
5 | .2775728 .358963 0.77 0.440 -.4269345 .9820801
vagdel | -.5620046 .2460526 -2.28 0.023 -1.044912 -.0790974
race2 |
3 | -.0955608 .3815155 -0.25 0.802 -.84433 .6532084
4 | -.6372159 .34151 -1.87 0.062 -1.30747 .0330378
E. Stuart (JHSPH) MCHB September 27, 2012 57 / 103
PRAMS data analysis: Single imputation (sga)
. svy: logit sga strs i.feelpg3 vagdel i.race3 income2 medicaid2 wicpreg matdeg
bmi bmisq smoke2 matage matagesq infert_tx (running logit on estimation sample)
Survey: Logistic regression
Number of strata = 2 Number of obs = 1436
Number of PSUs = 1436 Population size = 115091.01
Design df = 1434
F( 19, 1416) = 2.29
Prob > F = 0.0013
------------------------------------------------------------------------------
| Linearized
sga | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
strs | .0906505 .1258239 0.72 0.471 -.1561681 .337469
feelpg3 |
2 | .1569055 .3307433 0.47 0.635 -.4918871 .8056981
3 | .322266 .3297164 0.98 0.329 -.3245122 .9690442
4 | .2078305 .296976 0.70 0.484 -.3747235 .7903844
vagdel | -.2975287 .1860587 -1.60 0.110 -.6625051 .0674478
race3 |
2 | -.1242725 .2946427 -0.42 0.673 -.7022494 .4537043
3 | -.5491899 .2595075 -2.12 0.034 -1.058245 -.0401349
E. Stuart (JHSPH) MCHB September 27, 2012 59 / 103
PRAMS data analysis: Multiple imputation (sga)
. mi estimate: svy: logit sga strs i.feelpg2 vagdel i.race2 income2 medicaid2
wicpreg matdeg bmi bmisq smoke2 matage matage sq infert_tx
Multiple-imputation estimates Imputations = 10
Survey: Logistic regression Number of obs = 1436
Number of strata = 2 Population size = 115091
Number of PSUs = 1436
Average RVI = 0.0714
Complete DF = 1434
DF adjustment: Small sample DF: min = 100.19
avg = 971.24
max = 1391.55
Model F test: Equal FMI F( 19, 1374.4) = 2.18
Within VCE type: Linearized Prob > F = 0.0023
------------------------------------------------------------------------------
sga | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
strs | .114504 .135696 0.84 0.399 -.1521104 .3811184
feelpg2 |
3 | .1470457 .3411933 0.43 0.667 -.5223776 .816469
4 | .3355484 .337919 0.99 0.321 -.3274445 .9985413
5 | .2565546 .3094421 0.83 0.407 -.35068 .8637892
vagdel | -.304427 .1895014 -1.61 0.108 -.6761812 .0673273
race2 |
3 | -.1324997 .2969264 -0.45 0.656 -.7150331 .4500336
4 | -.5676001 .2659223 -2.13 0.033 -1.089379 -.0458213
E. Stuart (JHSPH) MCHB September 27, 2012 61 / 103
PRAMS data analysis: Comparison and Lessons
Complete case analysis drops a lot of subjects (almost 30%)
Output from MI models looks about the same; easy to read it just asyou would standard regression output!
Some point estimates vary across the 3 approaches, some also vary insignificance
It is the case that sometimes results won’t change after doing MI
But sometimes they will, and you don’t know if they will or won’tuntil you do it (and should trust the MI results more)
May think single imputation results “best” because they might matchour expectations about sga and stresses during pregnancy (strs), butbe careful of that interpretation
Single imputation approach has false precision; standard errors smallerthan they should be
E. Stuart (JHSPH) MCHB September 27, 2012 62 / 103
Example: mim command in Stata (other data)
mim: logit dsmmood sex age
Multiple-imputation estimates (logit) Imputations = 10
Logistic regression Minimum obs = 9185
Minimum dof = 4.8
--------------------------------------------------------------------------
dsmmood| Coef. Std. Err. t P>|t| [95% Conf. Int.] MI.df
-------------+------------------------------------------------------------
sex| .470223 .060136 7.82 0.000 0.346444 0.594003 25.3
age| .099599 .019677 5.06 0.004 0.04845 0.150748 4.8
cons|-2.05873 .257295 -8.00 0.001 -2.72791 -1.38954 4.8
--------------------------------------------------------------------------
E. Stuart (JHSPH) MCHB September 27, 2012 64 / 103
Calculating the fraction (%) of missing information
Captures how much information there is in the data about aparticular parameter
Compares the within-imputation and between-imputation variance
To calculate:
γ =(r + 2)/(df + 3)
r + 1
r =(1 + 1/m) ∗ B
U
r is the relative increase in variance due to the nonresponse/missingdata
Alternatively (Enders, 2010, p. 225): FMI = B+B/m+2/(ν+3)T
ν is a degrees of freedom value; goes to infinity as m goes to infinity
E. Stuart (JHSPH) MCHB September 27, 2012 65 / 103
FMI will typically be lower than % of missing values because ofcorrelations in the data
Quantifies influence of missing data on the standard errors
Also can be used as diagnostic: should pay more attention tovariables that have high FMI when doing imputation diagnostics
E. Stuart (JHSPH) MCHB September 27, 2012 66 / 103
Some quantities cannot be easily combined using Rubin’srules
Likelihood ratio tests hard to combine; better to use Wald tests withmultiply imputed data (White et al., 2011)
Stata’s mi package can combine subsets of coefficients or linear ornonlinear hypotheses (mi test, mi testtransform, mim: testparm)
If model you are running not part of standard combining software, canjust send point estimates and variances to a few functions
e.g., R: mitools, Stata: mi estimate (option cmdok)
Combining R2 values:http://www.ats.ucla.edu/stat/stata/faq/mi r squared.htm
Combining rules assume normality so some parameters work betterwhen transformed (Enders, 2010, p. 220+); e.g., correlationcoefficient
E. Stuart (JHSPH) MCHB September 27, 2012 67 / 103
White et al. (2011), Table VIII
E. Stuart (JHSPH) MCHB September 27, 2012 68 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 69 / 103
FAQ’s
Isn’t imputation “making up” data?
No! It is creating our best guesses at the missing valuesIn fact non-imputation methods (e.g., complete case analysis) generallyrely on much stronger assumptionsAlso important to note that we aren’t assuming that we are imputingthe correct values...generating the imputations only as an intermediatestep to estimating the model parameters of real interest
What if the imputation model is wrong?
Usually it’s fine; most results indicate that MI still works well even ifthe imputation models are not correct (Schafer 1997)Can help the situation by, for example, taking logs to make data morenormally distributed when using linear regression
E. Stuart (JHSPH) MCHB September 27, 2012 70 / 103
Are there guidelines for how much missingness is “too much”?
Unfortunately, noAnd remember that if there is not good information in the data to doimputations (i.e., not much that is predictive of the missing values), MIwill take that into account by making the imputations very variableGood results have been found with over 40% missingnessKey quantity is the fraction of missing information (Schafer 1997),which combines the % missing with how correlated the missing variableis with observed values
What if you are really worried about NMAR?
Sensitivity analyses (e.g., Siddique and Belin 2008)Pattern mixture models (fitting model separately for each missing datapattern; e.g., Hedeker and Gibbons, 1997)
E. Stuart (JHSPH) MCHB September 27, 2012 71 / 103
Should I impute a scale or the individual items?
Impute the scale if: (1) over half of the individual items observed if anyare observed, (2) items have high α’s, and (3) the item-totalcorrelations are similar across items (Graham, 2008)Otherwise (and if have the code to recreate the scales), impute theitems
Should I impute raw or standardized scores?
Assuming you have the ability to recreate the standardized scores...Impute whichever one looks more normally distributed
E. Stuart (JHSPH) MCHB September 27, 2012 72 / 103
What about data with a multilevel structure?
Not a lot of guidance on thisIf analysis will have only random intercepts, can include clusterindicators as possible predictors (this done in CMHI data)If analysis will have random intercepts and slopes (i.e., if going to lookat relationships between variables separately for different clusters),impute separately within each cluster or include cluster*variableinteractions in imputation model (Graham, 2008)MLWwiN macro for imputing multi-level data: www.missingdata.org.ukYucel (2008), Reiter et al. (2006)
E. Stuart (JHSPH) MCHB September 27, 2012 73 / 103
What if imputing data within a study estimating causal effects?
Impute covariates and outcomes together, include lots of interactionsbetween treatment status, covariates, and outcomes in imputationmodel (Want to make sure not to impose a treatment effect on theimputations)Although some people may balk at including outcome in imputationprocess, better to impute them than to leave it out, which wouldassume no treatment effect (Moons et al., 2006; Sterne et al., 2009)That said . . . some recommendation to include only those withobserved treatment and outcome in the outcome analyses (but still usethe outcome and treatment when creating the imputations)
Using imputed treatment status and imputed outcomes in analyses mayjust add noise/random error (White et al., 2011)
E. Stuart (JHSPH) MCHB September 27, 2012 74 / 103
Should I include variables that are predictive of the missingness orpredictive of the missing values?
Ideally would be inclusive and include any variables that may be relatedto the missingness AND/OR the values themselvesIf can’t do that (e.g., small samples), better to include variablespredictive of the missing values
What should I do if some analysis I want to do isn’t covered by any ofthe existing packages that analyze multiply imputed data?
If just exploratory (e.g., regression diagnostics, graphics), run it on 2-3of the imputed datasets separately and see how consistent the resultsare. If results consistent, just go with them. If not consistent, rethinkimputations: why are they so variable?If want to actually estimate models, will need to write code to do thecombining across datasets yourselfThe mitools() package for R gives some examples of this, makes it easyif you can send it coefficient estimates and their associated variances
E. Stuart (JHSPH) MCHB September 27, 2012 75 / 103
Overall lessons
Missing data can have serious implications for analyses
Requires making assumptions about the missingness and missingvalues
Best approach: Minimize the amount of missing data up front
Invest substantial resources in following up individualsDesign surveys to encourage full responseExplore alternative data sources (e.g., administrative records) asnecessary
Important to have a good understanding of the missing data process
Why were some cases missing?How plausible is MAR? Are we worried about NMAR?Can we collect additional data that will inform about the missingness?e.g., for attrition, can ask in earlier waves about individual’s likelihoodof answering subsequent surveysIs it possible to follow-up a subsample of those who initially did notrespond?
E. Stuart (JHSPH) MCHB September 27, 2012 76 / 103
Benefits of weighting
Fairly easy to implement
Allows inferences to be made for original full sample
An established way of handling attrition/non-response
E. Stuart (JHSPH) MCHB September 27, 2012 77 / 103
Benefits of multiple imputation
Yields accurate standard errors and inferences
Allows the use of auxiliary variables to improve imputations
Can be used for very general settings
Can impute a dataset and then use for lots of different analyses (Stuartet al. 2009)Imputer and analyst can be different peopleAnalyses run on “complete” data sets and so any type of analysis canbe run
E. Stuart (JHSPH) MCHB September 27, 2012 78 / 103
Lessons
If just have a simple attrition problem, use non-response weights. Butif things more complex . . .
If rates of missingness low (e.g., 1-2%), consider doing singleimputation (e.g., regression prediction with noise)
Otherwise use multiple imputation
Very general, flexible
With either approach, it may (or may not) change the results, butwon’t know until you do it, and using multiply imputed data shouldbe more accurate and correct
Either approach also allows you to make statements about the wholesample, not just about the respondents/those with complete data(and uses that same sample for all analyses; sample not dependent onthe variables you happen to use in a given analysis)
E. Stuart (JHSPH) MCHB September 27, 2012 79 / 103
When using multiple imputation . . .
MICE approaches can be useful, especially for large datasets with lotsof categorical variables
Make imputation models very general: lots of terms and interactions(little cost to including lots of potential predictors)
Compare distributions of data pre- and post-imputation
Determine ways to summarize the results across variables
If others will be using the imputed data, make clear documentation
Specify models used, interactions includedHighlight potentially problematic variables
E. Stuart (JHSPH) MCHB September 27, 2012 80 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 81 / 103
General resources
http://web.inter.nl.net/users/S.van.Buuren/mi/hmtl/software.htm
Code for many packages in Horton and Kleinman (2007)
http://www.math.smith.edu/muchado-appendix.pdf
E. Stuart (JHSPH) MCHB September 27, 2012 82 / 103
Software for MICE: R
mi package (Su et al., 2011)
http://cran.r-project.org/web/packages/mi/index.htmlFor creating and analyzing multiply imputed dataMultiple imputation by chained equations incorporated with predictivemean matchingLots of good diagnosticsAutomatically determines the correct model (e.g., linear vs.multinomial logit)Goal is for the software to handle many complexities automatically (likecollinearity, perfect prediction)Eventually want to include procedures to explicitly handle time-seriesdata and multilevel/clustered dataNot a lot of functionality currently, but check back soon...should begood!
E. Stuart (JHSPH) MCHB September 27, 2012 83 / 103
mice package
http://web.inter.nl.net/users/S.van.Buuren/mi/hmtl/mice.htmFor creating and analyzing multiply imputed dataMultiple imputation by chained equationsSome good diagnostics, added functionality recentlyCan incorporate bounds, restrictions
E. Stuart (JHSPH) MCHB September 27, 2012 84 / 103
Analyzing MI data in R
mice and mi packages have built in commands
mitools package
Run imputationList() command to combine the imputed datasets(could be from mice or from another package)Use the with() command to run analysis on each complete dataset inthe imputationList objectUse micombine() command on the results from the with() command toget results pooled across the complete datasetsVery general: Can run as long as you have the estimates and variancesfrom each complete datasethttp://cran.r-project.org/web/packages/mitools/mitools.pdf
Zelig package
Can run almost any modelJust say data=mi(dataset1, dataset2, ...) in the command
E. Stuart (JHSPH) MCHB September 27, 2012 85 / 103
Software for MICE: SAS
IVEWare (stand-alone as well)
http://www.isr.umich.edu/src/smp/ive/For creating multiple imputationsMultiple imputation by chained equationsMore details below
proc mianalyze
http://www.sas.com/rnd/app/papers/mianalyzev802.pdfFor analyzing multiply imputed dataCan be run on data imputed using proc mi or imputed using anotherpackageHorton and Kleinman (2007) appendix shows code for reading multiplyimputed data into SAS and running mianalyze
E. Stuart (JHSPH) MCHB September 27, 2012 86 / 103
Software for MICE: Stata
ice
http://ideas.repec.org/c/boc/bocode/s446602.htmlhttp://www.ats.ucla.edu/stat/Stata/library/ice.htmFor creating multiple imputationsMultiple imputation by chained equations enditemizeFor analyzing mi data: micombine, mim, mi estimateNote: New in Stata 12, the ”mi” package can do the ice procedureFor pre-Stata 12 users: can easily go between ice and mi proceduresusing “mi import ice” and “mi export ice” commands so can use mi’sprocedures, like for analyzing MI data
E. Stuart (JHSPH) MCHB September 27, 2012 87 / 103
Using ice in Stata
To install ice: ssc install ice, replace
To install mim: ssc install mim, replace
Main command:
ice cohort sex age income totchild totadu nrace3 nrace5 nrace7totrole bersraw ctotcomr ctotraw cintraw cextraw ytotraw yintrawyextraw i.siteid, clear;
Default is to let each variable be regressed on all other variables
Often run into convergence/collinearity issues
http://www.ats.ucla.edu/stat/Stata/library/ice.htm
http://www.stata-press.com/manuals/mi.html
http://www.ats.ucla.edu/stat/stata/seminars/missing data/mi in stata pt1.htm
E. Stuart (JHSPH) MCHB September 27, 2012 88 / 103
Options in ice
Passive imputation: for variables that are a direct function of others(e.g., interactions)
Need to make sure the imputations are consistent with each other“passive” optionpassive(sexxrace1: sex*nrace1 \ sexxrace3: sex*nrace3)
Specify regression model to be used
e.g., default for categorical is multinomial logit (unordered), but whatif want to use ordered logit?“cmd” optioncmd(income:ologit)
E. Stuart (JHSPH) MCHB September 27, 2012 89 / 103
Specify predictors in particular regression models
ice doesn’t do stepwise, so what if want to use simpler model (notinclude all variables as predictors)?“eq” optioneq(income: sex cintraw, cextraw: nrace1 nrace2)(Note: of course the models in previous line make no sense; no reasonto do that, but this could be useful to, e.g., exclude certain predictorsfrom particular models)Can also use user-written pred eq and check eq functions to facilitatestepwise models within context of ice
Impute categorical variables as categories, but when predictors useseries of dummy variables
“sub” optionpassive(\ inc1:(income==1) \ inc2:(income==2)) sub(income: inc1inc2)(Assuming just 2 levels of income variable)
E. Stuart (JHSPH) MCHB September 27, 2012 90 / 103
Software for SPSS
http://support.spss.com/ProductsExt/SPSS/ESD/17/Download/User%20Manuals/English/SPSS%20Missing%20Values%2017.0.pdf
Missing values add-on package: Creates imputations and analyzes MIdata
Monotone or MICE approaches
Option to include all two-way interactions in prediction models
E. Stuart (JHSPH) MCHB September 27, 2012 91 / 103
Outline
1 Introduction and terminologyUnderstanding types of missingness
2 Ways of handling missing data(Generally) improper ways of handling missing data...Better ways of dealing with missing data...
3 Implementing multiple imputation: MICE
4 Conclusions
5 Software
6 References
E. Stuart (JHSPH) MCHB September 27, 2012 92 / 103
References: General references on missing data and MICE
http://missingdata.org.uk/
http://www.stat.psu.edu/∼jls/mifaq.html
Allison, PD. (2002) Missing Data in Quantitative Applications in the SocialSciences. Thousand Oaks, CA. Sage.
Brame, R. and Paternoster, R. (2003). Missing data problems in criminologicalresearch: Two case studies. Journal of Quantitative Criminology 19(1): 55-78.
Carpenter, J. (2006). Missing Data Example Analysis. Available athttp://www.lshtm.ac.uk/msu/missingdata/example.html
van Buuren S, Brand JPL, Groothuis-Oudshoorn K, Rubin DB (2006). Fullyconditional specification in multivariate imputation. Journal of StatisticalComputation and Simulation, 76(12), 1049-1064
van der Heidjen, G.J.M.G., Donders, A.R.T., Stijnen, T., and Moons, K.G.M.(2006). Imputation of missing values is superior to complete case analysis and themissing-indicator method in multivariable diagnostic research: A clinical example.Journal of Clinical Epidemiology 59: 1102-1109.
E. Stuart (JHSPH) MCHB September 27, 2012 93 / 103
References: Books
Brame, R., Turner, M.G., and Paternoster, R. (2010). Missing data problems incriminologic research. Chapter 14 in Handbook of Quantitative Criminology (eds:A.R. Piquero and D. Weisburd). Springer Science and Business Media.
Enders, C.K. (2010). Applied missing data analysis. Guilford Press.
Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall,London.
Little, R.J.A. and Rubin, D.B. (2002) Statistical Analysis with Missing Data, 2ndEdition. J. Wiley & Sons, New York.
E. Stuart (JHSPH) MCHB September 27, 2012 94 / 103
References: Guidance for dealing with missing data
Carpenter, J. (2006). Annotated Bibliography on Missing Data [accessed July 30,2006]. Available online at http://www.lshtm.ac.uk/msu/missingdata/biblio.html
Carpenter, J.R. and Kenward, M.G. (2007). Missing data in randomised controlledtrials: A practical guide. Available athttp://www.pcpoh.bham.ac.uk/publichealth/methodology
/docs/invitations/Final Report RM04 JH17 mk.pdf.
Graham, J.W. (2008). Missing data analysis: Making it work in the real world.Annual Review of Psychology 60(6): 1-28.
He, Y., Zaslavsky, A.M., Landrum, M.B., Harrington, D.P., and Catalano, P.(2009). Multiple imputation in a large-scale complex survey: A practical guide.Statistical Methods in Medical Research 1-18.
Lavori, P. et al. (2008). Missing data in longitudinal clinical trials Part A: Designand Conceptual Issues. Psychiatric Annals 38(12): 784-792.
Schafer, J.L., & Graham, J.W. (2002). Missing data: Our view of the state of theart. Psychological Methods 7(2): 147-177.
Siddique J. et al. (2008). Missing data in longitudinal trials Part B AnalyticIssues. Psychiatric Annals 38(12): 793-801.
Wadsworth, T., and Roberts, J.M. (2008). When missing data are not missing: Anew approach to evaluating supplemental homicide report imputation strategies.Criminology 46(4): 841-870.E. Stuart (JHSPH) MCHB September 27, 2012 95 / 103
References: Statistical basis for MI
Kenward, M.G. and Carpenter, J. (2007). Multiple imputation:Current perspectives. Statistical Methods in Medical Research 16:199-218.
Raghunathan, T.E., Lepkowski, J.M., Van Hoewyk, J., &Solenberger, P. (2001). A multivariate technique for multiplyimputing missing values using a sequence of regression models.Survey Methodology 27: 85-95.
Rubin, D.B. (1976) Inference and missing data. Biometrika, 63,581-592.
Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys.J. Wiley & Sons, New York.
Rubin, D.B. (1996) Multiple imputation after 18+ years (withdiscussion). Journal of the American Statistical Association, 91,473-489.
E. Stuart (JHSPH) MCHB September 27, 2012 96 / 103
References: Tutorials and software for implementing MI
www.multiple-imputation.com
MI FAQ’s: http://www.stat.psu.edu/∼jls/mifaq.html
http://www.ats.ucla.edu/stat/stata/seminars/missing data/mi in stata pt1.htm
Azur, M., Stuart, E.A., Frangakis, C.M., and Leaf, P.J. (2011). Multipleimputation by chained equations: What is it and how does it work? InternationalJournal of Methods in Psychiatric Research 20(1): 40-49.
Donders ART, van der Heijden GJMG, Stijnen T, Moons KGM.(2006). Review: Agentle introduction to imputation of missing values. Journal of ClinicalEpidemiology 1087-1091.
Drechsler, J. (2011). Multiple imputation in practice: A case study using aGerman establishment survey. Advances in Statistical Analysis 96: 1-26.
Horton, N. & Kleinman, K.P. (2007). Much ado about nothing: A comparison ofmissing data methods and software to fit incomplete data regression models. TheAmerican Statistician 61(1): 79-90. Software appendix:http://www.math.smith.edu/muchado-appendix.pdf
E. Stuart (JHSPH) MCHB September 27, 2012 97 / 103
Lunt, M. (2008). A guide to imputing missing data with Stata.http://personalpages.manchester.ac.uk/staff/mark.lunt/mi.html
Raghunathan, T.E., Solenberger, P.W., & Van Hoewyk, J.V. (2002). IVEWare:Imputation and Variance Estimation Software User’s Guide. Ann Arbor, MI:Institute for Social Research, University of Michigan.www.isr.umich.edurs/c/smp/ive/
Schafer, J.L. (1999). Multiple imputation: A primer. Statistical methods inmedical research 8(1): 3-15.
Sterne, J.A.C., et al. (2009). Multiple imputation for missing data inepidemiological and clinical research: potential and pitfalls. British MedicalJournal 338:b2393.
Stuart, E.A., Azur, M., Frangakis, C.E., and Leaf, P. (2009). Multiple imputationwith large datasets: A case study of the Children’s Mental Health Initiative.American Journal of Epidemiology 169(9): 1133-1139.
White, I.R., Royston, P., and Wood, A.M. (2011). Multiple imputation usingchained equations: Issues and guidance for practice. Statistics in Medicine 30:377-399. (Includes Stata code snippets).
E. Stuart (JHSPH) MCHB September 27, 2012 98 / 103
References: Survey weighting for nonresponse
Kalton, G., and Flores-Cervantes, I. (2003). Weighting Methods.Journal of Official Statistics, 19, pp. 81-97.
Kalton, G. and Kasprzyk, D. (1986). The Treatment of MissingSurvey Data. Survey Methodology, Vol. 12, No. 1: 1-16.
Oh, H. and Scheuren, F. (1983). Weighting Adjustment for UnitNonresponse. Chap. 13 in vol. 2, part 4 of Incomplete Data inSample Surveys. New York: Academic Press
E. Stuart (JHSPH) MCHB September 27, 2012 99 / 103
References: Not missing at random models
Enders, C.K. (2011). Missing not at random models for latent growthcurve analysis. Psychological Methods 16(1): 1-16.
Jackson, D., White, I.R., and Leese, M. (2010). How much can welearn about missing data? An exploration of a clinical trial inpsychiatry. Journal of the Royal Statistical Society Series A 173(3):593-612.
Resseguier, N., Giorgi, R. & Paoletti, X. (2011). Sensitivity AnalysisWhen Data Are Missing Not-at-random. Epidemiology 22(2): 282.
Siddique, J. and Belin, T.R. (2008). Using an approximate Bayesianbootstrap to multiply impute nonignorable missing data.Computational Statistics and Data Analysis 53: 405-415.
E. Stuart (JHSPH) MCHB September 27, 2012 100 / 103
References: Diagnostics
Abayomi, K., Gelman, A., & Levy, M. (2008). Diagnostics for multivariateimputations. Applied Statistics 57(3), 273-291.
Gelman A, King G, & Liu C (1998). Not Asked and Not Answered: MultipleImputation for Multiple Surveys. Journal of the American Statistical Association93(443), 846:857.
Su, Y., Gelman, A., Hill, J., and Yajima, M. (2011) Multiple imputation withdiagnostics (mi) in R: Opening windows into the black box. Forthcoming inJournal of Statistical Software.
E. Stuart (JHSPH) MCHB September 27, 2012 101 / 103
References: Other
Allison, P.D. (2005). Imputation of Categorical Variables with PROC MI [accessedJuly 30, 2006]. Available online at http:// www2.sas.com/proceedings/sugi30/113-30.pdf
Carpenter, J., Kenward, M., Evans, S., and White, I. (2004). Last ObservationCarry-Forward and Last Observation Analysis. Statistics in Medicine 23: 32413244.
Collins LM, Schafer JL, Kam CK. (2001). A comparison of inclusive and restrictivestrategies in modern missing-data procedures. Psychological Methods 6(4):330351.
Cook, R. J., Zeng, L., and Yi, G. Y. (2004). Marginal Analysis of IncompleteLongitudinal Binary Data: A Cautionary Note on LOCF Imputation. Biometrics60: 820828.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum LikelihoodFrom Incomplete Data via theEMAlgorithm. Journal of the Royal StatisticalSociety, Series B 39: 122.
E. Stuart (JHSPH) MCHB September 27, 2012 102 / 103
Greenland S and Finkle WD. (1995). A Critical Look at Methods for HandlingMissing Covariates in Epidemiologic Regression Analyses. American Journal ofEpidemiology 142:1255-64.
Moons, K.G.M., Donders, R.A.R.T., Stijnen, T., and Harrell, F.E., Jr. (2006).Using the outcome for imputation of missing predictor values was preferred.Journal of Clinical Epidemiology 59: 1092-1101.
Reiter, J. P., Raghunathan, T. E., and Kinney, S. (2006), The importance of thesampling design in multiple imputation for missing data, Survey Methodology,32.2, 143 - 150.
Siddique, J. and Belin, T.R. (2008). Using an approximate Bayesian bootstrap tomultiply impute nonignorable missing data. Computational Statistics and DataAnalysis 53: 405-415.
Spratt, M., Carpenter, J., Sterne, J.A.C., Carlin, J.B., Heron, J., Henderson, J.,and Tilling, K. (2010). Strategies for multiple imputation in longitudinal studies.American Journal of Epidemiology 172(4): 478-487.
Vach W and Blettner M. (1991). Biased Estimation of the Odds Ratio inCase-Control Studies due to the Use of Ad Hoc Methods of Correcting for MissingValues for Confounding Variables. American Journal of Epidemiology 134:895-907.
E. Stuart (JHSPH) MCHB September 27, 2012 103 / 103