Recent Advances in Missing Data Methods: Imputation and Weighting

Elizabeth A. Stuart

Johns Hopkins Bloomberg School of Public Health
Department of Mental Health
Department of Biostatistics

[email protected]
www.biostat.jhsph.edu/~estuart

MCHB/HRSA
September 27, 2012

E. Stuart (JHSPH) MCHB September 27, 2012 1 / 103


Outline

1 Introduction and terminology
Understanding types of missingness

2 Ways of handling missing data
(Generally) improper ways of handling missing data...
Better ways of dealing with missing data...

3 Implementing multiple imputation: MICE

4 Conclusions

5 Software

6 References


Course description

Missing data is a problem in almost every research study, and standard ways of dealing with missing values, such as complete case analysis, are generally inappropriate. This session will discuss the drawbacks of traditional methods for dealing with missing data and describe why newer methods, such as multiple imputation, are preferable. Discussion will focus in particular on Multiple Imputation by Chained Equations, which is particularly useful for large datasets with complex data structures. Emphasis will be on providing practical tips and guidance for implementing multiple imputation and analyzing and interpreting multiply imputed data.


Missing data

Missing data common, especially with administrative data (e.g., medical records) or sensitive surveys

Advanced methods have been developed to handle missing data

But how do we actually implement those methods?

What are the implications for analyses?


Why should you pay attention?

Ignoring or inappropriately handling missing data may lead to...

Biased estimates

Incorrect standard errors

Incorrect inferences/results


Types of missing data

Will discuss two main types of missing data:

“Unit nonresponse”: when data for an entire “unit” (e.g., individual) is missing

e.g., did not respond at all to follow-up survey
Also called “attrition”
Usually handled using nonresponse weighting adjustments

“Item nonresponse”: when individual items are missing for an individual

e.g., someone answered most of the survey questions, but left a few blank
Usually handled using imputation approaches


Lots of reasons for missingness...

Non-response/attrition

Data entry errors

Administrative data with missing values

Lost survey forms

Individuals not wanting to disclose (or not knowing) particular information

Note: sometimes entire variables are missing in that they are “latent”; we will generally not be talking about those types of variables


More formally... “Missing data mechanisms”

Need to understand what led to missing values

Missing Completely at Random (MCAR): Missingness is totally random; does not depend on anything

$P(R \mid Y, X) = P(R \mid Y, X^{obs}, X^{mis}) = P(R \mid \psi)$
Cases with missing values a random sample of the original sample
No systematic differences between those with missing and observed values
Analyses using only complete cases will not be biased, but may have low power
Generally unrealistic, although may be reasonable for things like data entry errors


Missing At Random (MAR): Missingness depends on observed data

$P(R \mid Y, X) = P(R \mid Y, X^{obs}, \psi)$
e.g., women more likely to respond than men
So there are differences between those with observed and missing values, but we observe the ways in which they differ
Can use weighting or imputation approaches to deal with the missingness
This is probably the assumption made most frequently
Satisfied for data missing by design
Including a lot of predictors in the imputation model can make this more plausible


Not Missing At Random (NMAR): Missingness depends on unobserved values

$P(R \mid Y, X)$ cannot be simplified
e.g., probability of someone reporting their income depends on what their income is
e.g., probability of reporting psychiatric treatment depends on whether or not they have received it
i.e., even among people with the same values of the observed covariates, those with missing values on Y have a different distribution of Y than do those with observed Y
So we can’t just use the observed cases to help impute the missing cases
Unfortunately no easy ways of dealing with this...have to posit some model of the missing data process
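To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is hypothetical (the age and income variables, the missingness models, the sample size); it only illustrates that the mean among observed cases stays close to the full-sample mean under MCAR but drifts away under MAR and NMAR.

```python
import math
import random

random.seed(0)

# Hypothetical population: income depends on age; all names are illustrative.
n = 20000
age = [random.uniform(20, 60) for _ in range(n)]
income = [1000 * a + random.gauss(0, 5000) for a in age]

def logistic(z):
    return 1 / (1 + math.exp(-z))

def observed_mean(p_missing):
    """Mean income among cases that remain observed."""
    kept = [y for y, p in zip(income, p_missing) if random.random() > p]
    return sum(kept) / len(kept)

full_mean = sum(income) / n

# MCAR: every case has the same 30% chance of being missing.
mcar = observed_mean([0.3] * n)
# MAR: missingness depends only on the observed covariate (age).
mar = observed_mean([logistic((a - 40) / 5) for a in age])
# NMAR: missingness depends on the unobserved income value itself.
nmar = observed_mean([logistic((y - 40000) / 5000) for y in income])

print(round(full_mean), round(mcar), round(mar), round(nmar))
```

Under MAR the bias could be removed by adjusting for age, since age is observed for everyone; under NMAR no observed variable carries that information.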


Of course those are assumptions...

Never know which of them is correct

Can do diagnostics/tests for whether missingness is MCAR vs. (MAR or NMAR) (Enders 2010, p. 18)

Does the probability of missingness depend on other variables?
e.g., are the mean ages of people with missing and non-missing values of drug use behavior different?
e.g., in a logistic regression predicting missingness on some variable, are there other variables that are significant predictors?
Little (1988; JASA): test for MCAR (implemented in Stata’s mi package, possibly others)
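A minimal sketch of the first diagnostic, assuming made-up data in which drug use is more often missing for younger respondents. It compares mean age between cases with missing and observed drug use, using a Welch t statistic; the variable names and the missingness rates are assumptions for illustration only.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical data: age is fully observed; drug_use is missing more often
# for younger respondents (an assumption made only for this illustration).
rows = []
for _ in range(2000):
    age = random.gauss(40, 10)
    p_missing = 0.5 if age < 35 else 0.1
    drug_use = None if random.random() < p_missing else random.randint(0, 1)
    rows.append({"age": age, "drug_use": drug_use})

def welch_t(a, b):
    """Welch two-sample t statistic for a difference in means."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

age_missing = [r["age"] for r in rows if r["drug_use"] is None]
age_observed = [r["age"] for r in rows if r["drug_use"] is not None]

t = welch_t(age_missing, age_observed)
print(f"mean age (missing)  = {statistics.mean(age_missing):.1f}")
print(f"mean age (observed) = {statistics.mean(age_observed):.1f}")
print(f"Welch t = {t:.2f}")
```

A large difference rules out MCAR, but it cannot distinguish MAR from NMAR.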


But never know for sure if missingness is MAR or NMAR...

Have to use substantive understanding of what might have led to missing values
e.g., Are those who had been arrested more likely to not respond to a question asking about previous arrests? (They may not want to lie, but also may not want to tell the truth...)
Helps to have a good understanding of the data collection process
If believe missingness is NMAR, have to posit some model for the missingness (e.g., that those with previous arrests are 10% more likely to not respond to that question)
Tailored for each research question
Siddique and Belin (2008): example of missing depression levels; simulations show value in using a variety of assumptions and models
Other references: Hedeker and Gibbons (1997), Resseguier et al. (2011; sensMICE package for R), Enders (2011)


Inappropriate ways of handling missing data

Ignoring it

Complete case

Single imputation

Missing indicator approach

Last observation carried forward


Ignoring it...

Common approach is to “ignore” it; just run models without doing anything about missingness

Then what is done will depend on the defaults of the software

Usually will be the same as complete-case analyses, discussed next


Complete case analysis

Restrict analyses to individuals with observed data

Generally bad!

Assumes missingness is MCAR
Often results in lots of cases dropped...decreased power and loss of representativeness (Little and Rubin, 2002; page 42)
Generally leads to biased results

Is also model-dependent...will mean that different analyses may use different subsets of the data (unless do big restriction at the beginning)

(Also called listwise deletion)


Single imputation: Fill in (“impute”) each missing value

Ways of doing that imputation:

Mean

Regression prediction (“conditional mean imputation”)

e.g., impute mean within categories of observed covariates (gender, race, etc.)
e.g., fit regression model among observed cases, use to predict response for individuals with missing values

$Y_i = \alpha + \beta X_i$

Regression prediction plus error (“stochastic regression imputation”)

Like regression prediction, but also add random error

$Y_i = \alpha + \beta X_i + e_i, \quad e_i \sim N(0, \sigma^2)$
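The two regression-based options can be compared in a short sketch. The data below are simulated (an assumed true model Y = 2 + 3X + noise with about 30% of Y missing completely at random); the point is only that conditional mean imputation shrinks the spread of Y, while adding the random error term roughly restores it.

```python
import random
import statistics

random.seed(2)

# Hypothetical data: Y = 2 + 3X + noise, with Y missing for about 30% of cases.
x = [random.gauss(0, 1) for _ in range(1000)]
y_true = [2 + 3 * xi + random.gauss(0, 2) for xi in x]
y = [None if random.random() < 0.3 else yi for yi in y_true]

# Fit Y = alpha + beta * X by least squares among the observed cases.
obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mx = statistics.mean(xi for xi, _ in obs)
my = statistics.mean(yi for _, yi in obs)
beta = sum((xi - mx) * (yi - my) for xi, yi in obs) / sum((xi - mx) ** 2 for xi, _ in obs)
alpha = my - beta * mx
resid = [yi - (alpha + beta * xi) for xi, yi in obs]
sigma = (sum(r * r for r in resid) / (len(obs) - 2)) ** 0.5

# Conditional mean imputation: every missing Y lands exactly on the fitted line.
y_mean_imp = [alpha + beta * xi if yi is None else yi for xi, yi in zip(x, y)]
# Stochastic regression imputation: add a draw e_i ~ N(0, sigma^2).
y_stoch_imp = [alpha + beta * xi + random.gauss(0, sigma) if yi is None else yi
               for xi, yi in zip(x, y)]

print("SD of true Y:           ", round(statistics.stdev(y_true), 2))
print("SD after mean imputing: ", round(statistics.stdev(y_mean_imp), 2))
print("SD after stochastic:    ", round(statistics.stdev(y_stoch_imp), 2))
```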


“Hot-deck”

For an individual with missing data, find individuals with the same observed values on other variables, randomly pick one of their values as the one to use for imputation

Predictive mean matching

Like a combination of regression prediction and hot-deck
Take observed value from someone with similar predicted value
Works well when values non-normally distributed or have bounds (ensures that imputed values in same range as observed values)
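A simplified sketch of predictive mean matching, under stated assumptions: one predictor, a simulated non-negative count outcome, and k = 5 donors (the donor count is an arbitrary choice here). Production implementations also draw the regression parameters before matching, which this sketch omits.

```python
import random
import statistics

random.seed(3)

# Hypothetical bounded outcome: a count that can never be negative.
x = [random.uniform(0, 10) for _ in range(500)]
y_true = [max(0, round(xi + random.gauss(0, 2))) for xi in x]
y = [None if random.random() < 0.25 else yi for yi in y_true]

# Step 1: regression prediction fitted on the observed cases.
obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mx = statistics.mean(xi for xi, _ in obs)
my = statistics.mean(yi for _, yi in obs)
beta = sum((xi - mx) * (yi - my) for xi, yi in obs) / sum((xi - mx) ** 2 for xi, _ in obs)
alpha = my - beta * mx

def pmm_draw(xi, k=5):
    """Borrow an observed Y from one of the k donors with the closest prediction."""
    target = alpha + beta * xi
    donors = sorted(obs, key=lambda d: abs(alpha + beta * d[0] - target))[:k]
    return random.choice(donors)[1]

# Step 2: each missing Y is replaced by a real observed value from a donor.
y_imputed = [pmm_draw(xi) if yi is None else yi for xi, yi in zip(x, y)]
```

Because every imputed value is copied from an observed case, imputations cannot fall outside the observed range (here, no negative counts).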


Other (inappropriate) strategies

Missing data indicator

Do simple imputation and include indicator of missingness as an additional predictor in regression models
Doesn’t work very well and can lead to bias (Vach and Blettner 1991, Donders et al. 2006, Greenland and Finkle 1995)

Last observation carried forward

For longitudinal studies
If someone drops out of study, the last value observed for them is “carried forward” (copied) to later time points
But generally biased (Carpenter et al. 2004; Cook, Zeng, and Yi, 2004; Jansen et al. 2006)


Summary of single imputation approaches

Best are regression prediction plus error or hot-deck (based on categorical versions of all of the variables observed)

Can be reasonable, especially if not a lot of missing data, e.g., < 5% (Graham 2008)

BUT...results in overly precise estimates

Analyses following single imputation do not know that some of the values have been imputed
Simply treats all of the values as observed values
So does not take into account the uncertainty in the imputations

Anti-conservative...results will have more significance, narrower confidence intervals, than they should (Donders et al. 2006)

Higher Type I error rates

So what to do instead?


Appropriate ways of handling missingness

Maximum likelihood

Alternative sources of information

Weighting

Multiple imputation

Remember: Goal is not to get correct predictions of missing values; goal is to obtain accurate parameter estimates for relationships of interest


Maximum likelihood

In some cases, maximum likelihood approaches exist

Directly maximize the likelihood function, $f(X, Y)$

Use observed values, take missingness into account

e.g., longitudinal analyses that use the observations available for each person and correctly account for the missing observations

When ML methods exist, can work very well (e.g., Mplus, LISREL)

But they don’t always exist so not always a feasible option

Another drawback is that you cannot use auxiliary information to improve the predictions; uses only the variables in the actual analysis (and assumes MAR given those)

Graham (2008), Siddique et al. (2008)


Alternative sources of information

In some cases, can utilize a secondary data source to get needed information

Or, for example, calibrate numbers to known totals

e.g., issue of missing information about offender and incident in the Supplemental Homicide Reports (SHR)

Compare victim counts in SHR to similar data from NCHS, adjust as necessary (Fox and Zawitz 2004)
Wadsworth and Roberts (2008) evaluates four common techniques for dealing with this missingness that utilize supplemental info from police records


Nonresponse weighting

Often used to deal with attrition

Generate model predicting non-response given observed covariates

Weight respondents by their inverse probability of response

Weights the respondents up to represent the full sample
Same idea as survey sampling weights

Use analysis methods that allow for weights (e.g., survey packages)

Works well for simple missing data patterns (e.g., attrition)

Oh and Scheuren (1983), Kalton and Flores-Cervantes (2003), Horton and Lipsitz (1999), Carpenter et al. (2006)


A simple example...

Imagine 100 males and 100 females in sample

But only 80 males and 75 females respond

Male respondents will get weight of 100/80 = (1/(80/100)) = 1.25

Female respondents will get weight of 100/75 = (1/(75/100)) = 1.333

So, e.g., a male respondent represents 1.25 males in the original sample

These weights will make the 80 male and 75 female respondents represent the full sample of 200
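The arithmetic in this example can be checked in a few lines. The dictionary names are illustrative only.

```python
sampled = {"male": 100, "female": 100}
responded = {"male": 80, "female": 75}

# Weight = 1 / (response rate) within each group.
weights = {g: 1 / (responded[g] / sampled[g]) for g in sampled}

# The weighted respondent count recovers the original sample size.
weighted_total = sum(weights[g] * responded[g] for g in sampled)

print(weights)
print(weighted_total)
```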


To implement weighting adjustments:

Fit model predicting response as a function of fully observed characteristics
Assign respondents a weight of 1/(p(response))
Use those weights in regression models and summary statistics

Will weight the respondents to look like the full original sample

Like survey sampling weights, except estimated instead of known

Can be used for attrition as well as for original survey response


Use model with many characteristics, generally measured at baseline

Treat the weights like you would survey sampling weights (e.g., using survey packages), run weighted models (e.g., pweight in Stata)

Some concern about extreme weights

Check distribution of weights, trim outliers
Some do a “weighting class adjustment” where actually just form 5 subclasses based on the probabilities and everyone in each subclass gets the same weight

Relatively simple (and widely accepted) way of handling attrition/unit non-response
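A sketch of the two ways of taming extreme weights, under assumptions that are not from the slides: the estimated response probabilities are simulated rather than fitted, the trimming cap is set at the 95th percentile, and each weighting class gets the inverse of its mean estimated probability.

```python
import random
import statistics

random.seed(4)

# Hypothetical estimated response probabilities for the respondents, e.g.
# fitted values from a logistic regression on baseline covariates (not shown).
p_response = [random.uniform(0.05, 0.95) for _ in range(1000)]

# Inverse-probability-of-response weights.
weights = [1 / p for p in p_response]

# Option 1: trim extreme weights at a chosen cap (here the 95th percentile;
# the cutoff is a judgment call, not a rule from the slides).
cap = statistics.quantiles(weights, n=20)[-1]
trimmed = [min(w, cap) for w in weights]

# Option 2: weighting class adjustment with 5 subclasses. Sort by estimated
# probability, split into 5 equal-sized classes, and give everyone in a class
# the inverse of that class's mean response probability.
order = sorted(range(len(p_response)), key=lambda i: p_response[i])
class_size = len(order) // 5
class_weights = [0.0] * len(p_response)
for c in range(5):
    members = order[c * class_size:] if c == 4 else order[c * class_size:(c + 1) * class_size]
    w = 1 / statistics.mean(p_response[i] for i in members)
    for i in members:
        class_weights[i] = w

print("max raw weight:    ", round(max(weights), 2))
print("max trimmed weight:", round(max(trimmed), 2))
print("distinct class weights:", sorted({round(w, 2) for w in class_weights}))
```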


Now on to item non-response . . .


Multiple imputation

Same idea as single imputation, but fills in each missing value multiple times

Like repeating the stochastic regression imputation multiple times

Creates multiple (e.g., 10) “complete” data sets

Analyses then run separately on each dataset and results combined across datasets

Standard “combining rules” (Rubin 1987)
(Software will do this for you)

Total variance a function of within-imputation variance and between-imputation variance

Takes into account the uncertainty in the imputations

Also nice because very general: same set of imputations can be used for many analyses

“Imputer” may be different from “analyst”
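For reference, the combining rules amount to a few lines of arithmetic. The sketch below pools hypothetical coefficients and standard errors from m = 5 imputed datasets; the numbers are invented, and in practice the software does this step.

```python
import math
import statistics

def pool(estimates, variances):
    """Combine m point estimates and their squared standard errors (Rubin 1987)."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between    # total variance
    return q_bar, math.sqrt(total)

# Hypothetical regression coefficients and standard errors from m = 5 datasets.
coefs = [0.52, 0.47, 0.55, 0.50, 0.46]
ses = [0.10, 0.11, 0.10, 0.12, 0.10]

estimate, se = pool(coefs, [s ** 2 for s in ses])
print(f"pooled estimate = {estimate:.3f}, pooled SE = {se:.3f}")
```

The pooled standard error is larger than the average within-imputation standard error; the difference is the penalty for not knowing the missing values.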


Traditional approach to creating multiple imputations

Joint model of all variables

e.g., multivariate normal distribution

Fit using the observed cases

Used to predict (multiple times) the missing values

Sometimes multivariate normal model used even with categorical variables, but this can be severely biased (Horton, Lipsitz, and Parzen, 2003; Allison 2005)

Can’t easily handle complexities such as skip patterns, bounds, restrictions, complex designs

Software: norm, mix, SAS proc mi, Stata mi


Newer approach: Multiple imputation by chained equations (MICE)

Fit model of each variable, conditional on all others

Iterate fitting model and imputing each variable

Models used depend on type of variable (categorical/continuous/binary)

Raghunathan et al. (2001), van Buuren et al. (2006)

Also called “fully conditional specification” or “sequential regression multiple imputation”


Example of MICE

3 variables: X1 (binary), X2 (continuous), X3 (ordinal)

Steps in MICE:

1 Do simple imputations to fill in missing values for X1, X2, X3

2 Using cases with observed X1, fit logistic regression model of X1 ~ X2 + X3; predict missing values of X1

3 Using cases with observed X2, fit normal regression model of X2 ~ X1 + X3; predict missing values of X2

4 Using cases with observed X3, fit proportional odds regression model of X3 ~ X1 + X2; predict missing values of X3

5 Iterate Steps 2-4

6 Repeat Step 5 to get multiple imputations
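The loop structure of these steps can be sketched in a deliberately simplified form. This is not the algorithm as implemented in mice or ice: it uses only two continuous variables, substitutes stochastic linear regression for the logistic and proportional odds models, and does not draw the regression parameters from their posterior (so it understates uncertainty). All data and names are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for y ~ x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd

def chained_imputation(x1, x2, n_iter=10, rng=random):
    """One imputed dataset for two continuous variables with missing values (None)."""
    miss1 = [v is None for v in x1]
    miss2 = [v is None for v in x2]
    # Step 1: simple starting imputations (observed means).
    m1 = statistics.mean(v for v in x1 if v is not None)
    m2 = statistics.mean(v for v in x2 if v is not None)
    cur1 = [m1 if m else v for v, m in zip(x1, miss1)]
    cur2 = [m2 if m else v for v, m in zip(x2, miss2)]
    for _ in range(n_iter):
        # Fit X1 ~ X2 on cases with observed X1, then re-impute missing X1.
        a, b, s = fit_line([c2 for c2, m in zip(cur2, miss1) if not m],
                           [c1 for c1, m in zip(cur1, miss1) if not m])
        cur1 = [a + b * c2 + rng.gauss(0, s) if m else c1
                for c1, c2, m in zip(cur1, cur2, miss1)]
        # Fit X2 ~ X1 on cases with observed X2, then re-impute missing X2.
        a, b, s = fit_line([c1 for c1, m in zip(cur1, miss2) if not m],
                           [c2 for c2, m in zip(cur2, miss2) if not m])
        cur2 = [a + b * c1 + rng.gauss(0, s) if m else c2
                for c1, c2, m in zip(cur1, cur2, miss2)]
    return cur1, cur2

# Hypothetical data with about 20% missingness in each variable.
random.seed(5)
true1 = [random.gauss(0, 1) for _ in range(300)]
true2 = [0.8 * v + random.gauss(0, 0.6) for v in true1]
x1 = [None if random.random() < 0.2 else v for v in true1]
x2 = [None if random.random() < 0.2 else v for v in true2]

# Repeat the whole procedure to obtain m = 5 imputed datasets.
imputed = [chained_imputation(x1, x2) for _ in range(5)]
```

Observed values are never changed; only the missing entries are redrawn at each pass, and each of the 5 datasets ends up with different imputed values.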


Pros and cons of MICE

Benefits

Can more easily work in large datasets
Models can more accurately reflect distribution of each variable
Allows bounds (e.g., age started smoking)
Incorporates restriction to subpopulations (e.g., age started smoking)

Drawbacks

Potentially less principled than joint model
Doesn’t necessarily imply a proper joint distribution
(Although this doesn’t seem to be a big problem in practice)


Software to implement MICE

SAS and stand-alone: IVEWare

Stata: ice

R: mice, mi

SPSS: Missing Values add-on module (“fully conditional specification” option)


Steps to implementing MI methods

1 Examine rates and patterns of missingness, and any predictors of missingness

2 Generate imputations

3 Diagnose and assess imputations

4 Analysis


Motivating example: PRAMS data

Hypothetical analysis using NYC PRAMS data

Interested in predicting postpartum depression (“depress”) as a function of other predictors

Predictors include demographics, information on the delivery, health measures, etc.

About 40 possible predictors

But lots of missingness

Will go through some code and diagnostics here

Also see Stuart et al. (2009), Azur et al. (2011) for another example, including code


Step 1: Rates of missingness

High rates of missingness for some variables

Variable            % Missing
BMI                 0.2%
Maternal age        1.6%
Diabetes status     2%
Freq of drinking    3.8%
Sad                 8.2%
Nhope               11.2%
Slow                11.1%
Income              21%
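A table like this one can be produced with a short helper. The sketch below uses a handful of made-up records (not the PRAMS data) with None marking a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"bmi": 22.1, "income": None, "sad": 2},
    {"bmi": 27.5, "income": 3,    "sad": None},
    {"bmi": 24.0, "income": None, "sad": 1},
    {"bmi": 30.2, "income": 1,    "sad": 4},
]

def percent_missing(rows):
    """Percentage of rows with a missing value, for each variable."""
    return {
        var: 100 * sum(row[var] is None for row in rows) / len(rows)
        for var in rows[0]
    }

rates = percent_missing(records)
for var, pct in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{var}: {pct:.1f}%")
```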


Missingness depends on observed characteristics

Not MCAR: means of variables differ across those with missing vs. observed values on other items

Don’t have reason to think missingness is NMAR so comfortable with MAR


Step 2: Generate imputations

Need to specify model for each variable, conditional on all other variables
Check to see if transformations make sense (e.g., to look more normally distributed; see White et al., 2011 for examples)

If non-normal, can also use predictive mean matching (White et al., 2011)

May make sense to include some variables in imputation model, even if not going to be used in analyses (“auxiliary variables”; Collins et al. 2001)

i.e., include more rather than fewer variables in imputation procedure
There may be some that are very predictive of missing values, even if they aren’t of primary interest in analyses

With so many variables, can’t possibly do careful model selection for each one

Some software packages allow stepwise selection to select imputation model for each variable (IVEWare, add-on packages for Stata’s ice: pred eq, check eq)


What variables should be included?

Any variables that will be used in subsequent analyses

Otherwise its associations with other variables will be attenuated in analyses

Any higher-order effects that are of interest in the analysis phase

Any other special features of the data (e.g., survey weights)

Can be fairly loose and include a lot, especially if using stepwise procedures

In PRAMS, N ≈ 1400, had about 40 variables in imputation procedure

Note: Don’t need to specify which are dependent vs. independent


General guidance/issues

Imputation and analysis compatibility

Imputation model should be more general than analysis model that will be used: otherwise risk finding null effects simply because data imputed assuming no relationship between variables
May want to force some variables into the models even if do stepwise

How many imputations to generate?

Conventional advice has been 5-10, but more (e.g., 40) may yield increased power (Graham, Olchowski, & Gilreath, 2007)
White et al. (2011) recommend m = 100 × FMI (FMI = fraction of missing information)

FMI is hard to estimate, but Bodner’s approximation says FMI < fraction of cases with missing values, so approximate m = 100 × (fraction of cases with missing values)
e.g., 20% missing cases would imply m = 20

White et al. (2011) also argue that for reproducibility may need m > 100
Stata command to check if you’ve done enough imputations: mimmcerror
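The rule of thumb above is simple enough to write as a helper. The function name and the floor of 5 imputations (the low end of the conventional advice) are assumptions added for illustration.

```python
def suggested_m(fraction_incomplete, minimum=5):
    """Rule of thumb: m is roughly 100 times the fraction of incomplete cases."""
    return max(minimum, round(100 * fraction_incomplete))

# 20% of cases with at least one missing value.
print(suggested_m(0.20))
```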


Importance of auxiliary variables

Can be very beneficial to include “auxiliary variables”: not of interest in the analysis in and of themselves, but might help with the imputations

Collins et al. (2003) show that there is not much cost to including these extra variables and they can help a lot

Including a lot of variables can also make the MAR assumption more reasonable

(No easy way to incorporate this extra information in maximum likelihood approaches; see Enders (2010, Chp. 5))


Variables that are functions of others

In this case, the dependent variable (depress) is a function of 3 other variables

Best to impute the 3 variables separately, then re-create the depress summary variable after the imputations are created

Keeps the imputation process more general (e.g., in case you want to use the individual variables for something)

Retains observed values on the 3 individual variables (e.g., someone who is missing one would be missing for depress but we don’t want to throw away their 2 observed values on the other two variables)

Once imputations are created, can manipulate those datasets basically in the same ways you normally would

(A bit more on this later)
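To make the "impute the pieces, then rebuild the summary" idea concrete, here is a small Python sketch. It assumes, purely for illustration, that the three components are the items sad, nhope, and slow and that depress is their sum; the slides do not state the actual scoring rule, and the toy data are made up.

```python
# Toy imputed datasets: each row already holds imputed values for the three items.
imputed_datasets = [
    [{"sad": 1, "nhope": 0, "slow": 2}, {"sad": 2, "nhope": 1, "slow": 1}],
    [{"sad": 1, "nhope": 1, "slow": 2}, {"sad": 2, "nhope": 1, "slow": 1}],
]

def add_depress(dataset):
    """Return a copy of one imputed dataset with depress rebuilt from its items.

    The sum is a hypothetical scoring rule used only for this example.
    """
    return [dict(row, depress=row["sad"] + row["nhope"] + row["slow"])
            for row in dataset]

# Rebuild the summary variable separately within every imputed dataset.
completed = [add_depress(dataset) for dataset in imputed_datasets]

for j, dataset in enumerate(completed, start=1):
    print(j, [row["depress"] for row in dataset])
```

Because the summary is recomputed within each dataset, its between-imputation variability reflects the uncertainty in the imputed items.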


Implementing MICE

In R using the mice package: imp <- mice(data, pred=pred, maxit=10, m=10, seed=92385)

In Stata using ice:

ice modeprt2 batch2 wtanal i.stratumc mmhbp vagdel m.race2 bornus plural2 married2 o.matdeg o.pnc ///
    lga2 sga o.kotelchuck2 o.prelb o.gestwk bmi wrknow o.income2 medicaid2 wicpreg ///
    o.pp_drink m.diabetes2 preexer infert_tx m.feelpg2 o.smoke o.strs inficu o.los2 o.bfeed2 back sleep2 cosleep2 ///
    ppvchk i.inf age2 o.sad o.nhope o.slow matage totcnt, saving(impute, replace) m(10) boot(mmhbp vagdel race2 ///
    bornus matdeg pnc lga2 sga kotelchuck2 prelb gestwk wrknow income2 medicaid2 wicpreg pp_drink diabetes2 ///
    preexer infert_tx feelpg2 smoke strs inficu los2 bfeed2 back sleep2 cosleep2 ppvchk sad nhope slow matage) ///
    seed(1285964)


Step 3: Diagnosing and assessing imputations

Try to identify potentially problematic variables

Two types of comparisons:

Before and after imputation

Across two imputation sets with slightly different settings (e.g., different criteria in the stepwise model)

Can also do diagnostics that assess the imputation models themselves (Su et al., 2009)

Most packages have very limited diagnostics

R’s mi and mice packages have the most

Note: Differences don’t mean something is wrong! Could be because of differences in the types of people with observed vs. missing data


Graphical summaries

Bivariate scatterplots of observed and imputed values

Residual plots, for observed and imputed values

Density plots of observed and imputed values

Next slide: Example from R mi package


[Figure: density plots from the R mi package, one panel per variable (y-axis: density; x-axis: variable value). Panels: matdeg, pnc, kotelchuck2, gestwk, bmi, income2, pp_drink2, smoke2, strs, los2, bfeed2, sad, nhope, slow, matage.]


Numerical diagnostics

Some packages automatically print out some diagnostics

Model fit information to assess imputation models

Comparisons of original and imputed values (Abayomi et al., 2008)

Should check that imputations look reasonable (e.g., compare means, correlation coefficients)

Make sure values being imputed are in the correct ranges

Potentially flag problematic variables to warn future data analysts


At a basic level . . .

Can compare means of variables pre and post imputation, make sure reasonable

Note that changes are not necessarily bad! (May in fact be fixing the problem!)

Variable            Mean before   Mean after
BMI                 24.97         24.97
Freq of drinking    0.66          0.66
Sad                 1.18          1.17
Nhope               0.56          0.57
Slow                1.23          1.21
Income              2.99          2.75
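A before/after comparison like the one in the table can be sketched in a few lines of Python. This is only an illustration with made-up values: the helper names `compare_means` and `count_out_of_range`, the toy income data, and the assumed valid range of 1 to 5 are not taken from the PRAMS analysis.

```python
def mean(values):
    return sum(values) / len(values)

def compare_means(observed, completed_sets):
    """Mean of the observed values vs. mean across the completed datasets.

    observed: list with None marking missing values.
    completed_sets: list of fully imputed versions of the same variable.
    """
    before = mean([v for v in observed if v is not None])
    after = mean([mean(values) for values in completed_sets])
    return before, after

def count_out_of_range(completed_sets, low, high):
    """Count imputed-or-observed values falling outside the valid range."""
    return sum(1 for values in completed_sets for v in values if not low <= v <= high)

income_observed = [3, 4, None, 2, None, 3]
income_completed = [
    [3, 4, 2, 2, 1, 3],
    [3, 4, 1, 2, 2, 3],
]

before, after = compare_means(income_observed, income_completed)
print(f"income: mean before = {before:.2f}, mean after = {after:.2f}")
print("values outside 1-5:", count_out_of_range(income_completed, 1, 5))
```

As noted above, a shift in the mean is not by itself a sign of trouble; it may reflect real differences between people with observed and missing values.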


Step 4: Analyses

Combining rules allow the combination of results across the multiply imputed data sets (Rubin 1987)

Account for both within- and between-imputation variance

Run analysis separately within each “complete” dataset, then combine across datasets

Software packages have automated version of this for many models

Stata: mim, mi package

SAS: proc mianalyze

HLM: multiple imputation options

Mplus: multiple imputation command

For other models, may need to do it “by hand”

Final results just need to report the combined estimates; interpret as you would a standard regression


The math behind the combining

$Q_j$ = estimate of scalar quantity of interest (e.g., regression coefficient) from complete dataset $j$

$U_j$ = variance (squared standard error) of $Q_j$

Overall estimate is just the average of the estimates from each complete dataset

$$\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} Q_j$$


For the overall variance, first calculate the average within-imputation variance ($\bar{U}$) and the between-imputation variance ($B$)

$$\bar{U} = \frac{1}{m} \sum_{j=1}^{m} U_j$$

$$B = \frac{1}{m - 1} \sum_{j=1}^{m} (Q_j - \bar{Q})^2$$

The total variance of $\bar{Q}$ is then

$$T = \bar{U} + \left(1 + \frac{1}{m}\right) B$$

Degrees of freedom for t distribution can also be calculated (Enders, 2010, p. 231+)

See Schafer (1997) or Little and Rubin (2002) for details
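The combining rules translate directly into a few lines of code. The following is a minimal Python sketch, not one of the packages discussed in these slides; the function name `pool_rubin` and the toy numbers are illustrative assumptions. Its inputs are the per-dataset estimates and their variances (squared standard errors).

```python
import math

def pool_rubin(estimates, variances):
    """Pool m complete-data estimates and their variances using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least 2 estimates and one variance per estimate")
    q_bar = sum(estimates) / m                              # pooled point estimate
    u_bar = sum(variances) / m                              # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                             # total variance
    return {"estimate": q_bar, "within": u_bar, "between": b,
            "total": t, "se": math.sqrt(t)}

# Toy example: one coefficient estimated in m = 3 imputed datasets.
pooled = pool_rubin([0.10, 0.12, 0.14], [0.04, 0.05, 0.03])
print(pooled)
```

Note that the pooled standard error is the square root of the total variance, so it is never smaller than what the average within-imputation variance alone would give.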


PRAMS data analysis

Ran models predicting small for gestational age (sga; binary variable) as a function of other predictors

Note: Analyses not meant to have substantive meaning or to present actual findings on these associations

Purely illustrative examples to demonstrate the methods


PRAMS data analysis: Complete case (sga)

. svy: logit sga strs i.feelpg2 vagdel i.race2 income2 medicaid2 wicpreg matdeg bmi
      bmisq smoke2 matage matagesq infert_tx
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =         2                  Number of obs      =        897
Number of PSUs     =       897                  Population size    =  76003.841
                                                Design df          =        895
                                                F(  19,    877)    =       1.98
                                                Prob > F           =     0.0077

------------------------------------------------------------------------------
             |             Linearized
         sga |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        strs |   .0583217   .1563791     0.37   0.709    -.2485908    .3652342
     feelpg2 |
          3  |    .083498   .4021682     0.21   0.836    -.7058046    .8728006
          4  |   .1576447   .3839254     0.41   0.681    -.5958543    .9111436
          5  |   .2775728    .358963     0.77   0.440    -.4269345    .9820801
      vagdel |  -.5620046   .2460526    -2.28   0.023    -1.044912   -.0790974
       race2 |
          3  |  -.0955608   .3815155    -0.25   0.802      -.84433    .6532084
          4  |  -.6372159     .34151    -1.87   0.062     -1.30747    .0330378


PRAMS data analysis: Single imputation (sga)

. svy: logit sga strs i.feelpg3 vagdel i.race3 income2 medicaid2 wicpreg matdeg
      bmi bmisq smoke2 matage matagesq infert_tx
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =         2                  Number of obs      =       1436
Number of PSUs     =      1436                  Population size    =  115091.01
                                                Design df          =       1434
                                                F(  19,   1416)    =       2.29
                                                Prob > F           =     0.0013

------------------------------------------------------------------------------
             |             Linearized
         sga |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        strs |   .0906505   .1258239     0.72   0.471    -.1561681     .337469
     feelpg3 |
          2  |   .1569055   .3307433     0.47   0.635    -.4918871    .8056981
          3  |    .322266   .3297164     0.98   0.329    -.3245122    .9690442
          4  |   .2078305    .296976     0.70   0.484    -.3747235    .7903844
      vagdel |  -.2975287   .1860587    -1.60   0.110    -.6625051    .0674478
       race3 |
          2  |  -.1242725   .2946427    -0.42   0.673    -.7022494    .4537043
          3  |  -.5491899   .2595075    -2.12   0.034    -1.058245   -.0401349


PRAMS data analysis: Multiple imputation (sga)

. mi estimate: svy: logit sga strs i.feelpg2 vagdel i.race2 income2 medicaid2
      wicpreg matdeg bmi bmisq smoke2 matage matagesq infert_tx

Multiple-imputation estimates                   Imputations        =         10
Survey: Logistic regression                     Number of obs      =       1436
Number of strata   =         2                  Population size    =     115091
Number of PSUs     =      1436
                                                Average RVI        =     0.0714
                                                Complete DF        =       1434
DF adjustment:   Small sample                   DF:     min        =     100.19
                                                        avg        =     971.24
                                                        max        =    1391.55
Model F test:       Equal FMI                   F(  19, 1374.4)    =       2.18
Within VCE type:   Linearized                   Prob > F           =     0.0023

------------------------------------------------------------------------------
         sga |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        strs |    .114504    .135696     0.84   0.399    -.1521104    .3811184
     feelpg2 |
          3  |   .1470457   .3411933     0.43   0.667    -.5223776     .816469
          4  |   .3355484    .337919     0.99   0.321    -.3274445    .9985413
          5  |   .2565546   .3094421     0.83   0.407      -.35068    .8637892
      vagdel |   -.304427   .1895014    -1.61   0.108    -.6761812    .0673273
       race2 |
          3  |  -.1324997   .2969264    -0.45   0.656    -.7150331    .4500336
          4  |  -.5676001   .2659223    -2.13   0.033    -1.089379   -.0458213


PRAMS data analysis: Comparison and Lessons

Complete case analysis drops a lot of subjects (nearly 40%: only 897 of the 1436 observations remain)

Output from MI models looks about the same; easy to read it just as you would standard regression output!

Some point estimates vary across the 3 approaches, some also vary in significance

It is the case that sometimes results won’t change after doing MI

But sometimes they will, and you don’t know if they will or won’t until you do it (and should trust the MI results more)

May think single imputation results “best” because they might match our expectations about sga and stresses during pregnancy (strs), but be careful of that interpretation

Single imputation approach has false precision; standard errors smaller than they should be


Example: mim command in Stata (other data)

mim: logit dsmmood sex age

Multiple-imputation estimates (logit)            Imputations =      10
Logistic regression                              Minimum obs =    9185
                                                 Minimum dof =     4.8

--------------------------------------------------------------------------------
     dsmmood |     Coef.  Std. Err.       t    P>|t|    [95% Conf. Int.]   MI.df
-------------+------------------------------------------------------------------
         sex |   .470223    .060136    7.82   0.000   0.346444  0.594003    25.3
         age |   .099599    .019677    5.06   0.004    0.04845  0.150748     4.8
       _cons |  -2.05873    .257295   -8.00   0.001   -2.72791  -1.38954     4.8
--------------------------------------------------------------------------------


Calculating the fraction (%) of missing information

Captures how much information there is in the data about a particular parameter

Compares the within-imputation and between-imputation variance

To calculate:

$$\gamma = \frac{r + 2/(df + 3)}{r + 1}$$

$$r = \frac{(1 + 1/m)\, B}{\bar{U}}$$

$r$ is the relative increase in variance due to the nonresponse/missing data

Alternatively (Enders, 2010, p. 225): $\mathrm{FMI} = \dfrac{B + B/m + 2/(\nu + 3)}{T}$

$\nu$ is a degrees of freedom value; goes to infinity as $m$ goes to infinity


FMI will typically be lower than % of missing values because of correlations in the data

Quantifies influence of missing data on the standard errors

Also can be used as diagnostic: should pay more attention to variables that have high FMI when doing imputation diagnostics
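As a rough illustration of the FMI calculation, the Python sketch below computes the relative increase in variance r and the fraction of missing information, using the standard form γ = (r + 2/(df + 3)) / (r + 1). The degrees of freedom formula df = (m − 1)(1 + 1/r)² is the usual large-sample version and is an assumption here, since the slides only point to Enders (2010) for it; the function name and the input values are made up.

```python
def fraction_missing_info(u_bar, b, m):
    """Return (r, df, gamma) from the within variance, between variance, and m.

    Assumes b > 0 and uses the large-sample degrees of freedom
    df = (m - 1) * (1 + 1/r)**2, which is not given in the slides.
    """
    if u_bar <= 0 or b <= 0 or m < 2:
        raise ValueError("need u_bar > 0, b > 0, and m >= 2")
    r = (1 + 1 / m) * b / u_bar            # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2        # assumed large-sample df
    gamma = (r + 2 / (df + 3)) / (r + 1)   # fraction of missing information
    return r, df, gamma

r, df, gamma = fraction_missing_info(u_bar=0.04, b=0.01, m=10)
print(f"r = {r:.3f}, df = {df:.1f}, FMI = {gamma:.3f}")
```

With many imputations the df term becomes negligible and the FMI approaches r / (r + 1).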


Some quantities cannot be easily combined using Rubin’s rules

Likelihood ratio tests hard to combine; better to use Wald tests with multiply imputed data (White et al., 2011)

Stata’s mi package can combine subsets of coefficients or linear or nonlinear hypotheses (mi test, mi testtransform, mim: testparm)

If model you are running not part of standard combining software, can just send point estimates and variances to a few functions

e.g., R: mitools, Stata: mi estimate (option cmdok)

Combining R² values: http://www.ats.ucla.edu/stat/stata/faq/mi_r_squared.htm

Combining rules assume normality so some parameters work better when transformed (Enders, 2010, p. 220+); e.g., correlation coefficient
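For the correlation coefficient, the usual transformation is Fisher's z. The Python sketch below pools correlations on the z scale and back-transforms the result. It is an illustration under stated assumptions: the function name is made up, every imputed dataset is assumed to have the same sample size n, the within-imputation variance of z is taken as 1/(n − 3), and the interval uses a normal critical value of 1.96 rather than the t reference distribution.

```python
import math

def pool_correlations(correlations, n):
    """Pool correlations from m imputed datasets via Fisher's z transformation."""
    m = len(correlations)
    if m < 2 or n <= 3:
        raise ValueError("need at least 2 correlations and n > 3")
    z = [math.atanh(r) for r in correlations]         # transform to the z scale
    z_bar = sum(z) / m                                # pooled z
    u_bar = 1 / (n - 3)                               # within variance of Fisher's z
    b = sum((zi - z_bar) ** 2 for zi in z) / (m - 1)  # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                       # total variance on the z scale
    half_width = 1.96 * math.sqrt(t)                  # normal approximation (assumption)
    # Back-transform the estimate and interval limits to the correlation scale.
    return (math.tanh(z_bar),
            math.tanh(z_bar - half_width),
            math.tanh(z_bar + half_width))

pooled_r, lower, upper = pool_correlations([0.30, 0.35, 0.32], n=200)
print(f"pooled r = {pooled_r:.3f}, approx. 95% CI = ({lower:.3f}, {upper:.3f})")
```

Averaging on the z scale keeps the pooled correlation and its interval inside (−1, 1), which simple averaging of the raw correlations does not guarantee for the interval.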


White et al. (2011), Table VIII



FAQs

Isn’t imputation “making up” data?

No! It is creating our best guesses at the missing values

In fact non-imputation methods (e.g., complete case analysis) generally rely on much stronger assumptions

Also important to note that we aren’t assuming that we are imputing the correct values...generating the imputations only as an intermediate step to estimating the model parameters of real interest

What if the imputation model is wrong?

Usually it’s fine; most results indicate that MI still works well even if the imputation models are not correct (Schafer 1997)

Can help the situation by, for example, taking logs to make data more normally distributed when using linear regression


Are there guidelines for how much missingness is “too much”?

Unfortunately, no

And remember that if there is not good information in the data to do imputations (i.e., not much that is predictive of the missing values), MI will take that into account by making the imputations very variable

Good results have been found with over 40% missingness

Key quantity is the fraction of missing information (Schafer 1997), which combines the % missing with how correlated the missing variable is with observed values

What if you are really worried about NMAR?

Sensitivity analyses (e.g., Siddique and Belin 2008)

Pattern mixture models (fitting model separately for each missing data pattern; e.g., Hedeker and Gibbons, 1997)


Should I impute a scale or the individual items?

Impute the scale if: (1) over half of the individual items observed if any are observed, (2) items have high α’s, and (3) the item-total correlations are similar across items (Graham, 2008)

Otherwise (and if have the code to recreate the scales), impute the items

Should I impute raw or standardized scores?

Assuming you have the ability to recreate the standardized scores...

Impute whichever one looks more normally distributed


What about data with a multilevel structure?

Not a lot of guidance on this

If analysis will have only random intercepts, can include cluster indicators as possible predictors (this was done in CMHI data)

If analysis will have random intercepts and slopes (i.e., if going to look at relationships between variables separately for different clusters), impute separately within each cluster or include cluster*variable interactions in imputation model (Graham, 2008)

MLwiN macro for imputing multi-level data: www.missingdata.org.uk

Yucel (2008), Reiter et al. (2006)


What if imputing data within a study estimating causal effects?

Impute covariates and outcomes together, include lots of interactions between treatment status, covariates, and outcomes in imputation model (Want to make sure not to impose a treatment effect on the imputations)

Although some people may balk at including the outcome in the imputation process, better to include it than to leave it out, which would assume no treatment effect (Moons et al., 2006; Sterne et al., 2009)

That said . . . some recommend including only those with observed treatment and outcome in the outcome analyses (but still use the outcome and treatment when creating the imputations)

Using imputed treatment status and imputed outcomes in analyses may just add noise/random error (White et al., 2011)


Should I include variables that are predictive of the missingness or predictive of the missing values?

Ideally would be inclusive and include any variables that may be related to the missingness AND/OR the values themselves

If can’t do that (e.g., small samples), better to include variables predictive of the missing values

What should I do if some analysis I want to do isn’t covered by any of the existing packages that analyze multiply imputed data?

If just exploratory (e.g., regression diagnostics, graphics), run it on 2-3 of the imputed datasets separately and see how consistent the results are. If results consistent, just go with them. If not consistent, rethink imputations: why are they so variable?

If want to actually estimate models, will need to write code to do the combining across datasets yourself

The mitools package for R gives some examples of this, makes it easy if you can send it coefficient estimates and their associated variances


Overall lessons

Missing data can have serious implications for analyses

Requires making assumptions about the missingness and missing values

Best approach: Minimize the amount of missing data up front

Invest substantial resources in following up individuals

Design surveys to encourage full response

Explore alternative data sources (e.g., administrative records) as necessary

Important to have a good understanding of the missing data process

Why were some cases missing?

How plausible is MAR? Are we worried about NMAR?

Can we collect additional data that will inform about the missingness? e.g., for attrition, can ask in earlier waves about individual’s likelihood of answering subsequent surveys

Is it possible to follow up a subsample of those who initially did not respond?


Benefits of weighting

Fairly easy to implement

Allows inferences to be made for original full sample

An established way of handling attrition/non-response


Benefits of multiple imputation

Yields accurate standard errors and inferences

Allows the use of auxiliary variables to improve imputations

Can be used for very general settings

Can impute a dataset and then use for lots of different analyses (Stuart et al. 2009)

Imputer and analyst can be different people

Analyses run on “complete” data sets and so any type of analysis can be run


Lessons

If just have a simple attrition problem, use non-response weights. But if things more complex . . .

If rates of missingness low (e.g., 1-2%), consider doing single imputation (e.g., regression prediction with noise)

Otherwise use multiple imputation

Very general, flexible

With either approach, it may (or may not) change the results, but won’t know until you do it, and using multiply imputed data should be more accurate and correct

Either approach also allows you to make statements about the whole sample, not just about the respondents/those with complete data (and uses that same sample for all analyses; sample not dependent on the variables you happen to use in a given analysis)
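The "regression prediction with noise" form of single imputation mentioned above can be sketched as follows. This is a deliberately small Python illustration, assuming one fully observed predictor x, a simple linear regression fit on the complete cases, and normally distributed residual noise; the function name, seed, and toy data are all made up for the example.

```python
import math
import random

def impute_with_noise(x, y, seed=0):
    """Fill missing y (None) with a regression prediction from x plus random noise."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    if n < 3:
        raise ValueError("need at least 3 complete cases")
    x_mean = sum(xi for xi, _ in observed) / n
    y_mean = sum(yi for _, yi in observed) / n
    sxx = sum((xi - x_mean) ** 2 for xi, _ in observed)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in observed) / sxx
    intercept = y_mean - slope * x_mean
    # Residual standard deviation from the complete cases (n - 2 degrees of freedom).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in observed)
    resid_sd = math.sqrt(sse / (n - 2))
    # Observed values are kept; missing ones get prediction + noise.
    return [yi if yi is not None else intercept + slope * xi + rng.gauss(0, resid_sd)
            for xi, yi in zip(x, y)]

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, None, 8.2, 9.8, None]
filled = impute_with_noise(x, y, seed=1)
print(filled)
```

Adding the noise term keeps the imputed values from sitting exactly on the regression line, which would otherwise overstate the strength of the relationship; it still does not reflect uncertainty in the regression coefficients, which is part of why multiple imputation is preferred beyond very low missingness.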


When using multiple imputation . . .

MICE approaches can be useful, especially for large datasets with lots of categorical variables

Make imputation models very general: lots of terms and interactions (little cost to including lots of potential predictors)

Compare distributions of data pre- and post-imputation

Determine ways to summarize the results across variables

If others will be using the imputed data, make clear documentation

Specify models used, interactions included

Highlight potentially problematic variables



General resources

http://web.inter.nl.net/users/S.van.Buuren/mi/hmtl/software.htm

Code for many packages in Horton and Kleinman (2007)

http://www.math.smith.edu/muchado-appendix.pdf


Software for MICE: R

mi package (Su et al., 2011)

http://cran.r-project.org/web/packages/mi/index.html

For creating and analyzing multiply imputed data

Multiple imputation by chained equations incorporated with predictive mean matching

Lots of good diagnostics

Automatically determines the correct model (e.g., linear vs. multinomial logit)

Goal is for the software to handle many complexities automatically (like collinearity, perfect prediction)

Eventually want to include procedures to explicitly handle time-series data and multilevel/clustered data

Not a lot of functionality currently, but check back soon...should be good!


mice package

http://web.inter.nl.net/users/S.van.Buuren/mi/hmtl/mice.htm

For creating and analyzing multiply imputed data

Multiple imputation by chained equations

Some good diagnostics, added functionality recently

Can incorporate bounds, restrictions


Analyzing MI data in R

mice and mi packages have built in commands

mitools package

Run imputationList() command to combine the imputed datasets (could be from mice or from another package)

Use the with() command to run analysis on each complete dataset in the imputationList object

Use MIcombine() command on the results from the with() command to get results pooled across the complete datasets

Very general: Can run as long as you have the estimates and variances from each complete dataset

http://cran.r-project.org/web/packages/mitools/mitools.pdf

Zelig package

Can run almost any model

Just say data=mi(dataset1, dataset2, ...) in the command


Software for MICE: SAS

IVEWare (stand-alone as well)

http://www.isr.umich.edu/src/smp/ive/

For creating multiple imputations

Multiple imputation by chained equations

More details below

proc mianalyze

http://www.sas.com/rnd/app/papers/mianalyzev802.pdf

For analyzing multiply imputed data

Can be run on data imputed using proc mi or imputed using another package

Horton and Kleinman (2007) appendix shows code for reading multiply imputed data into SAS and running mianalyze


Software for MICE: Stata

ice

http://ideas.repec.org/c/boc/bocode/s446602.html

http://www.ats.ucla.edu/stat/Stata/library/ice.htm

For creating multiple imputations

Multiple imputation by chained equations

For analyzing mi data: micombine, mim, mi estimate

Note: New in Stata 12, the “mi” package can do the ice procedure

For pre-Stata 12 users: can easily go between ice and mi procedures using “mi import ice” and “mi export ice” commands so can use mi’s procedures, like for analyzing MI data


Using ice in Stata

To install ice: ssc install ice, replace

To install mim: ssc install mim, replace

Main command:

ice cohort sex age income totchild totadu nrace3 nrace5 nrace7 totrole bersraw ctotcomr ctotraw cintraw cextraw ytotraw yintraw yextraw i.siteid, clear;

Default is to let each variable be regressed on all other variables

Often run into convergence/collinearity issues

http://www.ats.ucla.edu/stat/Stata/library/ice.htm

http://www.stata-press.com/manuals/mi.html

http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt1.htm


Options in ice

Passive imputation: for variables that are a direct function of others (e.g., interactions)

Need to make sure the imputations are consistent with each other

“passive” option

passive(sexxrace1: sex*nrace1 \ sexxrace3: sex*nrace3)

Specify regression model to be used

e.g., default for categorical is multinomial logit (unordered), but what if want to use ordered logit?

“cmd” option

cmd(income:ologit)
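The idea behind passive imputation can be shown outside Stata: a derived variable is never imputed on its own but recomputed from its parent variables each time they are updated, so the two always agree. The Python sketch below reuses the variable names from the passive() example above; the function `refresh_passive` and the example record are hypothetical.

```python
def refresh_passive(row):
    """Recompute derived columns from their (possibly imputed) parent columns."""
    row["sexxrace1"] = row["sex"] * row["nrace1"]
    row["sexxrace3"] = row["sex"] * row["nrace3"]
    return row

# One record after sex, nrace1, and nrace3 have been observed or imputed
row = {"sex": 1, "nrace1": 0, "nrace3": 1, "sexxrace1": None, "sexxrace3": None}
refresh_passive(row)
```

Imputing the interaction terms as if they were ordinary variables could instead produce values that contradict the imputed sex and race indicators.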


Specify predictors in particular regression models

ice doesn’t do stepwise, so what if want to use simpler model (not include all variables as predictors)?

“eq” option

eq(income: sex cintraw, cextraw: nrace1 nrace2)

(Note: of course the models in previous line make no sense; no reason to do that, but this could be useful to, e.g., exclude certain predictors from particular models)

Can also use user-written pred_eq and check_eq functions to facilitate stepwise models within context of ice

Impute categorical variables as categories, but when predictors use series of dummy variables

“sub” option

passive(\ inc1:(income==1) \ inc2:(income==2)) sub(income: inc1 inc2)

(Assuming just 2 levels of income variable)
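To make the substitution idea concrete: the categorical variable is imputed as a single variable, but whenever it appears as a predictor of another variable it is swapped for its dummy indicators. The Python sketch below mirrors the inc1 and inc2 definitions above; `predictor_row` and the example record are hypothetical and do not reflect how ice is implemented.

```python
def predictor_row(record, target):
    """Build the predictor values used when imputing `target`.

    The categorical `income` is replaced by its dummy indicators,
    mirroring inc1:(income==1) and inc2:(income==2).
    """
    row = {k: v for k, v in record.items() if k != target}
    if "income" in row:
        income = row.pop("income")
        row["inc1"] = int(income == 1)
        row["inc2"] = int(income == 2)
    return row

record = {"income": 2, "sex": 1, "cintraw": 14.0}
predictors = predictor_row(record, target="cintraw")
```

When income itself is the target, the dummies are left out, so the category is modeled directly rather than predicted from its own indicators.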


Software for SPSS

http://support.spss.com/ProductsExt/SPSS/ESD/17/Download/User%20Manuals/English/SPSS%20Missing%20Values%2017.0.pdf

Missing values add-on package: Creates imputations and analyzes MI data

Monotone or MICE approaches

Option to include all two-way interactions in prediction models


References: General references on missing data and MICE

http://missingdata.org.uk/

http://www.stat.psu.edu/~jls/mifaq.html

Allison, P.D. (2002). Missing Data in Quantitative Applications in the Social Sciences. Thousand Oaks, CA: Sage.

Brame, R. and Paternoster, R. (2003). Missing data problems in criminological research: Two case studies. Journal of Quantitative Criminology 19(1): 55-78.

Carpenter, J. (2006). Missing Data Example Analysis. Available at http://www.lshtm.ac.uk/msu/missingdata/example.html

van Buuren, S., Brand, J.P.L., Groothuis-Oudshoorn, K., and Rubin, D.B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation 76(12): 1049-1064.

van der Heijden, G.J.M.G., Donders, A.R.T., Stijnen, T., and Moons, K.G.M. (2006). Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. Journal of Clinical Epidemiology 59: 1102-1109.


References: Books

Brame, R., Turner, M.G., and Paternoster, R. (2010). Missing data problems in criminologic research. Chapter 14 in Handbook of Quantitative Criminology (eds: A.R. Piquero and D. Weisburd). Springer Science and Business Media.

Enders, C.K. (2010). Applied missing data analysis. Guilford Press.

Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall,London.

Little, R.J.A. and Rubin, D.B. (2002) Statistical Analysis with Missing Data, 2nd Edition. J. Wiley & Sons, New York.


References: Guidance for dealing with missing data

Carpenter, J. (2006). Annotated Bibliography on Missing Data [accessed July 30, 2006]. Available online at http://www.lshtm.ac.uk/msu/missingdata/biblio.html

Carpenter, J.R. and Kenward, M.G. (2007). Missing data in randomised controlled trials: A practical guide. Available at http://www.pcpoh.bham.ac.uk/publichealth/methodology/docs/invitations/Final_Report_RM04_JH17_mk.pdf.

Graham, J.W. (2008). Missing data analysis: Making it work in the real world. Annual Review of Psychology 60(6): 1-28.

He, Y., Zaslavsky, A.M., Landrum, M.B., Harrington, D.P., and Catalano, P. (2009). Multiple imputation in a large-scale complex survey: A practical guide. Statistical Methods in Medical Research 1-18.

Lavori, P. et al. (2008). Missing data in longitudinal clinical trials Part A: Design and Conceptual Issues. Psychiatric Annals 38(12): 784-792.

Schafer, J.L., & Graham, J.W. (2002). Missing data: Our view of the state of the art. Psychological Methods 7(2): 147-177.

Siddique J. et al. (2008). Missing data in longitudinal trials Part B: Analytic Issues. Psychiatric Annals 38(12): 793-801.

Wadsworth, T., and Roberts, J.M. (2008). When missing data are not missing: A new approach to evaluating supplemental homicide report imputation strategies. Criminology 46(4): 841-870.


References: Statistical basis for MI

Kenward, M.G. and Carpenter, J. (2007). Multiple imputation: Current perspectives. Statistical Methods in Medical Research 16: 199-218.

Raghunathan, T.E., Lepkowski, J.M., Van Hoewyk, J., & Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27: 85-95.

Rubin, D.B. (1976) Inference and missing data. Biometrika, 63, 581-592.

Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. J. Wiley & Sons, New York.

Rubin, D.B. (1996) Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91, 473-489.


References: Tutorials and software for implementing MI

www.multiple-imputation.com

MI FAQ’s: http://www.stat.psu.edu/~jls/mifaq.html

http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mi_in_stata_pt1.htm

Azur, M., Stuart, E.A., Frangakis, C.M., and Leaf, P.J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research 20(1): 40-49.

Donders, A.R.T., van der Heijden, G.J.M.G., Stijnen, T., and Moons, K.G.M. (2006). Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology 1087-1091.

Drechsler, J. (2011). Multiple imputation in practice: A case study using a German establishment survey. Advances in Statistical Analysis 96: 1-26.

Horton, N. & Kleinman, K.P. (2007). Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. The American Statistician 61(1): 79-90. Software appendix: http://www.math.smith.edu/muchado-appendix.pdf


Lunt, M. (2008). A guide to imputing missing data with Stata. http://personalpages.manchester.ac.uk/staff/mark.lunt/mi.html

Raghunathan, T.E., Solenberger, P.W., & Van Hoewyk, J.V. (2002). IVEWare: Imputation and Variance Estimation Software User’s Guide. Ann Arbor, MI: Institute for Social Research, University of Michigan. www.isr.umich.edu/src/smp/ive/

Schafer, J.L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research 8(1): 3-15.

Sterne, J.A.C., et al. (2009). Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. British Medical Journal 338: b2393.

Stuart, E.A., Azur, M., Frangakis, C.E., and Leaf, P. (2009). Multiple imputation with large datasets: A case study of the Children’s Mental Health Initiative. American Journal of Epidemiology 169(9): 1133-1139.

White, I.R., Royston, P., and Wood, A.M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 30: 377-399. (Includes Stata code snippets).


References: Survey weighting for nonresponse

Kalton, G., and Flores-Cervantes, I. (2003). Weighting Methods. Journal of Official Statistics, 19, pp. 81-97.

Kalton, G. and Kasprzyk, D. (1986). The Treatment of Missing Survey Data. Survey Methodology, Vol. 12, No. 1: 1-16.

Oh, H. and Scheuren, F. (1983). Weighting Adjustment for Unit Nonresponse. Chap. 13 in vol. 2, part 4 of Incomplete Data in Sample Surveys. New York: Academic Press.


References: Not missing at random models

Enders, C.K. (2011). Missing not at random models for latent growth curve analysis. Psychological Methods 16(1): 1-16.

Jackson, D., White, I.R., and Leese, M. (2010). How much can we learn about missing data? An exploration of a clinical trial in psychiatry. Journal of the Royal Statistical Society Series A 173(3): 593-612.

Resseguier, N., Giorgi, R. & Paoletti, X. (2011). Sensitivity Analysis When Data Are Missing Not-at-random. Epidemiology 22(2): 282.

Siddique, J. and Belin, T.R. (2008). Using an approximate Bayesian bootstrap to multiply impute nonignorable missing data. Computational Statistics and Data Analysis 53: 405-415.


References: Diagnostics

Abayomi, K., Gelman, A., & Levy, M. (2008). Diagnostics for multivariate imputations. Applied Statistics 57(3), 273-291.

Gelman, A., King, G., & Liu, C. (1998). Not Asked and Not Answered: Multiple Imputation for Multiple Surveys. Journal of the American Statistical Association 93(443), 846-857.

Su, Y., Gelman, A., Hill, J., and Yajima, M. (2011) Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Forthcoming in Journal of Statistical Software.


References: Other

Allison, P.D. (2005). Imputation of Categorical Variables with PROC MI [accessed July 30, 2006]. Available online at http://www2.sas.com/proceedings/sugi30/113-30.pdf

Carpenter, J., Kenward, M., Evans, S., and White, I. (2004). Last Observation Carry-Forward and Last Observation Analysis. Statistics in Medicine 23: 3241-3244.

Collins, L.M., Schafer, J.L., and Kam, C.K. (2001). A comparison of inclusive and restrictive strategies in modern missing-data procedures. Psychological Methods 6(4): 330-351.

Cook, R. J., Zeng, L., and Yi, G. Y. (2004). Marginal Analysis of Incomplete Longitudinal Binary Data: A Cautionary Note on LOCF Imputation. Biometrics 60: 820-828.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum Likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39: 1-22.


Greenland, S. and Finkle, W.D. (1995). A Critical Look at Methods for Handling Missing Covariates in Epidemiologic Regression Analyses. American Journal of Epidemiology 142: 1255-64.

Moons, K.G.M., Donders, R.A.R.T., Stijnen, T., and Harrell, F.E., Jr. (2006). Using the outcome for imputation of missing predictor values was preferred. Journal of Clinical Epidemiology 59: 1092-1101.

Reiter, J. P., Raghunathan, T. E., and Kinney, S. (2006). The importance of the sampling design in multiple imputation for missing data. Survey Methodology 32.2: 143-150.

Siddique, J. and Belin, T.R. (2008). Using an approximate Bayesian bootstrap to multiply impute nonignorable missing data. Computational Statistics and Data Analysis 53: 405-415.

Spratt, M., Carpenter, J., Sterne, J.A.C., Carlin, J.B., Heron, J., Henderson, J., and Tilling, K. (2010). Strategies for multiple imputation in longitudinal studies. American Journal of Epidemiology 172(4): 478-487.

Vach, W. and Blettner, M. (1991). Biased Estimation of the Odds Ratio in Case-Control Studies due to the Use of Ad Hoc Methods of Correcting for Missing Values for Confounding Variables. American Journal of Epidemiology 134: 895-907.
