
Chained equations and more in multiple imputation in Stata 12

Yulia Marchenko

Associate Director, Biostatistics
StataCorp LP

2011 Italian Stata Users Group Meeting

November 17, 2011


Outline

Brief overview of MI

Brief history of MI in Stata

New official MI features in Stata 12

Multiple imputation using chained equations (MICE)

    Overview
    Examples
    Convergence
    Advantages/Disadvantages
    Incompatibility of conditionals
    MICE versus MVN

Concluding remarks

References


Brief overview of MI

Multiple imputation (MI) is a principled, simulation-based approach for analyzing incomplete data

The MI procedure 1) replaces missing values with multiple sets of simulated values to complete the data, 2) applies standard analyses to each completed dataset, and 3) adjusts the obtained parameter estimates for missing-data uncertainty

The objective of MI is not to predict missing values as close as possible to the true ones but to handle missing data in a way resulting in valid statistical inference (Rubin 1996)

MI is statistically valid if an imputation model is proper and the primary, completed-data analysis is statistically valid in the absence of missing data (Rubin 1987)
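
Step 3 of the procedure is commonly carried out with Rubin's combining rules (Rubin 1987). The Python sketch below is only an illustration of those rules for a single scalar parameter, not StataCorp's implementation; the function name pool_rubin and the input numbers are hypothetical.

```python
import math

def pool_rubin(estimates, variances):
    """Pool M completed-data estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = sum(estimates) / m                               # pooled point estimate
    w = sum(variances) / m                                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = w + (1 + 1 / m) * b                                  # total variance
    return q_bar, math.sqrt(t)

# Hypothetical estimates and squared standard errors from M = 3 completed datasets
estimate, std_err = pool_rubin([1.0, 1.2, 1.1], [0.04, 0.05, 0.045])
```

The between-imputation term is what carries the missing-data uncertainty: with identical estimates in every completed dataset it vanishes and the pooled standard error reduces to the usual complete-data one.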


Brief history of MI in Stata

User-written tools

Stata 7

2003 (Carlin et al. 2003): tools for analyzing multiply imputed data (mifit, miset, mido, mici, mitestparm, miappend, etc.)

Stata 8

2004 (Royston 2004): univariate imputation (uvis) and multivariate imputation using chained equations (mvis), analysis of multiply imputed data (micombine similar to Carlin’s mifit)

2005 (Royston 2005a, 2005b): ice replaces and extends mvis for imputation using chained equations

2007 (Royston 2007): updates for ice with an emphasis on interval censoring

2008: mira by Rodrigo Alfaro for analyzing MI data stored in separate files


Stata 9

2008 (Carlin et al. 2008): new framework for managing and analyzing MI data (the mim: prefix replaces micombine, mifit, and other earlier tools for analyzing and manipulating MI data)

2009 (Royston 2009, Royston et al. 2009): updates to ice and mim

inorm by John Galati and John Carlin for performing imputation using MVN


Official tools

Stata 11

2009: an official suite of commands for creating (mi impute), manipulating (mi merge, mi reshape, etc.), and analyzing (mi estimate) MI data

    mi provides 4 different styles of storing MI data, MI data verification, and extensive data-management support

    mi impute provides a number of univariate imputation methods and multivariate imputation using MVN

    the mi estimate: prefix, similar to mim:, analyzes MI data

Stata 12

2011: various additions to mi, including multivariate imputation using chained equations (mi impute chained)

See http://www.stata.com/support/faqs/stat/mi_ice.html for comparison of mi with user-written commands ice and mim


Some of the new official MI features in Stata 12

Imputation

Multivariate imputation using chained equations (mi impute chained)

Four new univariate imputation methods of mi impute: truncreg, intreg, poisson, and nbreg

Conditional imputation within mi impute chained and mi impute monotone

Handling of perfect prediction via the new augment option during imputation of categorical data

Separate imputation for different groups of the data via the new by() option of mi impute


Estimation

mi estimate, mcerror estimates the amount of simulation error associated with MI results

New commands mi predict and mi predictnl to compute linear and nonlinear MI predictions

misstable summarize, generate() creates missing-value indicators for variables containing missing values


Multiple imputation using chained equations

Overview

MICE (van Buuren et al. 1999) is an iterative imputation method that imputes multiple variables by using chained equations, a sequence of univariate imputation methods with fully conditional specification (FCS) of prediction equations

That is, to get one set of imputed values, iterate over t = 0, 1, ..., T and impute:

$$
\begin{aligned}
&X_1^{(t+1)} \text{ using } X_2^{(t)}, X_3^{(t)}, \ldots, X_q^{(t)} \\
&X_2^{(t+1)} \text{ using } X_1^{(t+1)}, X_3^{(t)}, \ldots, X_q^{(t)} \\
&\qquad \cdots \\
&X_q^{(t+1)} \text{ using } X_1^{(t+1)}, X_2^{(t+1)}, \ldots, X_{q-1}^{(t+1)}
\end{aligned}
$$
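
The iteration scheme above can be sketched in Python with NumPy. This is a simplified illustration, not how mi impute chained is implemented: every variable is imputed by linear regression, regression parameters are not drawn from their posterior (so the imputations are not proper in Rubin's sense), and the starting values are random draws from the observed data rather than a monotone initialization. The names impute_regress and chained_equations and the demo data are hypothetical.

```python
import numpy as np

def impute_regress(y, X, miss, rng):
    """Fill y[miss] with OLS predictions plus normal noise (simplified draw)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A[~miss], y[~miss], rcond=None)
    resid = y[~miss] - A[~miss] @ beta
    dof = max(int((~miss).sum()) - A.shape[1], 1)
    sigma = np.sqrt(resid @ resid / dof)
    y = y.copy()
    y[miss] = A[miss] @ beta + rng.normal(0.0, sigma, int(miss.sum()))
    return y

def chained_equations(data, n_iter=10, seed=0):
    """Return one imputed copy of data (n x q array, NaN marks missing)."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(data)
    x = data.copy()
    # Iteration 0: fill each incomplete column with draws from its observed values
    for j in range(x.shape[1]):
        if miss[:, j].any():
            x[miss[:, j], j] = rng.choice(data[~miss[:, j], j], int(miss[:, j].sum()))
    for _ in range(n_iter):
        for j in range(x.shape[1]):
            if miss[:, j].any():
                # Impute X_j using the most recent values of all other columns
                others = np.delete(x, j, axis=1)
                x[:, j] = impute_regress(x[:, j], others, miss[:, j], rng)
    return x

# Hypothetical data: 50 rows, 3 variables, two of them incomplete
demo = np.random.default_rng(1).normal(size=(50, 3))
demo[:5, 0] = np.nan
demo[10:18, 1] = np.nan
completed = chained_equations(demo, n_iter=5, seed=2)
```

Calling chained_equations with M different seeds would give M imputed datasets from M independent chains; observed values are never modified.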


MICE is also known as FCS and SRMI, sequential regression multivariate imputation (Raghunathan et al. 2001)

MICE can handle variables of different types

MICE can handle arbitrary missing-data patterns

MICE can accommodate certain important characteristics (data ranges, restrictions within a subset) of the observational data

Being an iterative method, MICE requires checking of convergence

MICE requires careful modeling of conditional specifications

See White et al. (2011) for practical guidelines about using MICE


Examples: Data

Consider fictional data recording heart attacks

. use mheart8
(Fictional heart attack data; bmi and age missing; arbitrary pattern)

. describe

Contains data from mheart8.dta
  obs:           154                          Fictional heart attack data; bmi and age missing; arbitrary pattern
 vars:             6                          1 Sep 2011 10:11
 size:         1,848

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------
attack          byte    %9.0g                 Outcome (heart attack)
smokes          byte    %9.0g                 Current smoker
age             float   %9.0g                 Age, in years
bmi             float   %9.0g                 Body Mass Index, kg/m^2
female          byte    %9.0g                 Gender
hsgrad          byte    %9.0g                 High school graduate
-------------------------------------------------------------------
Sorted by:


Let’s summarize missing values

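. misstable summarize, generate(Mis_)
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
           age |        12                 142  |    142   20.73613    83.78423
           bmi |        28                 126  |    126   17.22643    38.24214
  -----------------------------------------------------------------------------

and explore missing-data patterns

. misstable patterns

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1  2
  ------------+-------------
       77%    |  1  1
              |
       16     |  1  0
        5     |  0  1
        3     |  0  0
  ------------+-------------
      100%    |

  Variables are  (1) age  (2) bmi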


Examples: Prepare data for imputation

Declare the storage style

. mi set wide

Register variables

. mi register imputed age bmi

. mi register regular attack smokes female hsgrad


Example 1: Default prediction equations

Impute age and bmi using regression imputation

. mi impute chained (regress) age bmi = attack smokes female hsgrad, add(5) rseed(27654)

Conditional models:
               age: regress age bmi attack smokes female hsgrad
               bmi: regress bmi age attack smokes female hsgrad

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        5
Imputed: m=1 through m=5                        updated =        0

Initialization: monotone                     Iterations =       50
                                                burn-in =       10

               age: linear regression
               bmi: linear regression

------------------------------------------------------------------
                   |               Observations per m
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               age |        142           12        12 |       154
               bmi |        126           28        28 |       154
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)


Example 1: MI diagnostics

Compare distributions of the imputed, completed, and observed data for age (midiagplots is a forthcoming user-written command; see Marchenko and Eddings (2011) for how to create MI diagnostic plots manually)

. midiagplots age, m(1/5) combine
(M = 5 imputations)
(imputed: age bmi)

[Figure: cumulative distributions of the observed, imputed, and completed values of age (x axis: Age, in years; y axis: Cumulative distribution), one panel for each of Imputation 1 through Imputation 5]


Compare distributions of the imputed, completed, and observed data for bmi

. midiagplots bmi, m(1/5) combine
(M = 5 imputations)
(imputed: age bmi)

[Figure: cumulative distributions of the observed, imputed, and completed values of bmi (x axis: Body Mass Index, kg/m^2; y axis: Cumulative distribution), one panel for each of Imputation 1 through Imputation 5]

. mi estimate, mcerror cformat(%8.4f): logit attack smokes age bmi female hsgrad

Multiple-imputation estimates                     Imputations     =          5
Logistic regression                               Number of obs   =        154
                                                  Average RVI     =     0.0338
                                                  Largest FMI     =     0.0866
DF adjustment:   Large sample                     DF:     min     =     574.54
                                                          avg     = 1370395.93
                                                          max     = 7973220.18
Model F test:       Equal FMI                     F(   5, 9595.8) =       3.53
Within VCE type:          OIM                     Prob > F        =     0.0035

------------------------------------------------------------------------------
      attack |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |     1.1326     0.3561     3.18   0.001       0.4347      1.8306
             |     0.0145     0.0009     0.04   0.000       0.0137      0.0155
             |
         age |     0.0372     0.0162     2.30   0.022       0.0054      0.0691
             |     0.0019     0.0003     0.12   0.007       0.0019      0.0021
             |
         bmi |     0.0935     0.0457     2.05   0.041       0.0039      0.1831
             |     0.0044     0.0011     0.11   0.011       0.0050      0.0048
             |
      female |    -0.1331     0.4171    -0.32   0.750      -0.9507      0.6844
             |     0.0195     0.0020     0.05   0.035       0.0209      0.0189
             |
      hsgrad |     0.1324     0.4019     0.33   0.742      -0.6553      0.9201
             |     0.0112     0.0007     0.03   0.021       0.0099      0.0126
             |
       _cons |    -5.2048     1.5652    -3.33   0.001      -8.2726     -2.1371
             |     0.0170     0.0163     0.03   0.000       0.0413      0.0304
------------------------------------------------------------------------------
Note: values displayed beneath estimates are Monte Carlo error estimates.
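
To give a feel for what a Monte Carlo error estimate measures, the Python sketch below applies a leave-one-out jackknife across the M per-imputation results. It rests on the assumption that the reported values are jackknife based, as Stata's documentation for the mcerror option describes; it is not Stata's code, and the helper names and the five coefficient values are hypothetical.

```python
import math

def jackknife_mc_error(values, statistic):
    """Jackknife standard error of statistic over M per-imputation results."""
    m = len(values)
    # Recompute the statistic with each imputation left out in turn
    leave_one_out = [statistic(values[:i] + values[i + 1:]) for i in range(m)]
    center = sum(leave_one_out) / m
    return math.sqrt((m - 1) / m * sum((s - center) ** 2 for s in leave_one_out))

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical coefficient estimates from M = 5 completed datasets
coefs = [1.10, 1.15, 1.12, 1.16, 1.13]
mce = jackknife_mc_error(coefs, mean)
```

For the pooled coefficient, which is a plain mean of the M estimates, this jackknife value equals sqrt(B/M), where B is the between-imputation variance, so it shrinks as more imputations are added.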


Example 2: Different imputation methods

Impute bmi using predictive mean matching instead

. mi impute chained (regress) age (pmm) bmi = attack smokes female hsgrad, replace

Conditional models:
               age: regress age bmi attack smokes female hsgrad
               bmi: pmm bmi age attack smokes female hsgrad

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        0
Imputed: m=1 through m=5                        updated =        5

Initialization: monotone                     Iterations =       50
                                                burn-in =       10

               age: linear regression
               bmi: predictive mean matching

------------------------------------------------------------------
                   |               Observations per m
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               age |        142           12        12 |       154
               bmi |        126           28        28 |       154
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)
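
Predictive mean matching fills each missing value with an observed value borrowed from a "donor" whose predicted mean is close. The Python sketch below is a simplified illustration of that idea, not the algorithm behind the pmm method: it uses a single OLS fit instead of drawing regression parameters from their posterior. The function name pmm_impute and the demo data are hypothetical; the knn argument plays the role of the number of nearest neighbors.

```python
import numpy as np

def pmm_impute(y, X, knn=1, seed=0):
    """Replace NaNs in y with observed y values of the nearest donors."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(y)
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A[~miss], y[~miss], rcond=None)
    pred = A @ beta                           # predicted means for every row
    donor_pred, donor_y = pred[~miss], y[~miss]
    out = y.copy()
    for i in np.flatnonzero(miss):
        # Indices of the knn donors with the closest predicted means
        nearest = np.argsort(np.abs(donor_pred - pred[i]))[:knn]
        out[i] = donor_y[rng.choice(nearest)]
    return out

# Hypothetical data: y depends linearly on x, first 6 values of y missing
gen = np.random.default_rng(3)
x = gen.normal(size=40)
y_obs = 2 + 3 * x + gen.normal(size=40)
y_obs[:6] = np.nan
filled = pmm_impute(y_obs, x, knn=3)
```

Because every imputed value is copied from an observed one, imputations never fall outside the observed range of the variable, which is one reason PMM is attractive for a variable such as bmi.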


Example 3.1: Custom prediction equations (different sets of predictors)

Omit hsgrad from the prediction equation for bmi

. mi impute chained (regress) age ///
>                   (pmm, omit(hsgrad)) bmi ///
>                   = attack smokes female hsgrad, replace

Conditional models:
               age: regress age bmi attack smokes female hsgrad
               bmi: pmm bmi age attack smokes female

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        0
Imputed: m=1 through m=5                        updated =        5

Initialization: monotone                     Iterations =       50
                                                burn-in =       10

               age: linear regression
               bmi: predictive mean matching

------------------------------------------------------------------
                   |               Observations per m
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               age |        142           12        12 |       154
               bmi |        126           28        28 |       154
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)


Or, include hsgrad in the prediction equation for age

. mi impute chained (regress, include(hsgrad)) age ///
>                   (pmm) bmi ///
>                   = attack smokes female, replace

Conditional models:
               age: regress age bmi hsgrad attack smokes female
               bmi: pmm bmi age attack smokes female

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        0
Imputed: m=1 through m=5                        updated =        5

Initialization: monotone                     Iterations =       50
                                                burn-in =       10

               age: linear regression
               bmi: predictive mean matching

------------------------------------------------------------------
                   |               Observations per m
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               age |        142           12        12 |       154
               bmi |        126           28        28 |       154
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)


Example 3.2: Custom prediction equations (functions of imputed variables)

What if the relationship between age and bmi is curvilinear?

. mi impute chained (regress, include(hsgrad (bmi^2))) age ///
>                   (pmm) bmi ///
>                   = attack smokes female, replace

Conditional models:
               age: regress age bmi hsgrad (bmi^2) attack smokes female
               bmi: pmm bmi age attack smokes female

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        0
Imputed: m=1 through m=5                        updated =        5

Initialization: monotone                     Iterations =       50
                                                burn-in =       10

               age: linear regression
               bmi: predictive mean matching

------------------------------------------------------------------
                   |               Observations per m
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               age |        142           12        12 |       154
               bmi |        126           28        28 |       154
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)


Example 4: Variables with a restricted range

What if unobserved values of age are known to lie in [20, 84]?

. generate age_l = cond(age==., 20, age)

. generate age_u = cond(age==., 84, age)

. mi impute chained (intreg, ll(age_l) ul(age_u) include(hsgrad)) age ///
>                   (pmm) bmi ///
>                   = attack smokes female, replace

Conditional models:
               age: intreg age bmi hsgrad attack smokes female , ll(age_l) ul(age_u)
               bmi: pmm bmi age attack smokes female

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        0
Imputed: m=1 through m=5                        updated =        5

Initialization: monotone                     Iterations =       50
                                                burn-in =       10

               age: interval regression
               bmi: predictive mean matching

------------------------------------------------------------------
                   |               Observations per m
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               age |        142           12        12 |       154
               bmi |        126           28        28 |       154
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)
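
The two generate commands in Example 4 build per-observation bounds: an observed age is its own lower and upper limit, while a missing age gets the known range [20, 84]. The Python sketch below mirrors that construction and then draws one value inside the interval by rejection sampling from a normal distribution. Rejection sampling is a stand-in chosen for brevity, not the method used by the intreg imputation method; the function names are hypothetical, and the mean of 56 and standard deviation of 11.6 are illustrative values only.

```python
import numpy as np

def interval_bounds(values, lower, upper):
    """Return (ll, ul): the value itself when observed, else the known range."""
    missing = np.isnan(values)
    return np.where(missing, lower, values), np.where(missing, upper, values)

def draw_within(mean, sd, lower, upper, rng, max_tries=10_000):
    """Rejection-sample one normal draw that falls inside [lower, upper]."""
    for _ in range(max_tries):
        draw = rng.normal(mean, sd)
        if lower <= draw <= upper:
            return draw
    raise RuntimeError("no draw landed inside the interval")

# Hypothetical ages; the second one is missing
age = np.array([45.0, np.nan, 62.5])
age_l, age_u = interval_bounds(age, 20, 84)
value = draw_within(56, 11.6, age_l[1], age_u[1], np.random.default_rng(0))
```

Encoding observed values as degenerate intervals lets a single interval-based model treat observed and missing rows uniformly.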


Example 5: Imputing on subsamples

Impute age and bmi separately for males and females

. mi impute chained (regress) age (pmm) bmi = attack smokes hsgrad,
> replace by(female, noreport)

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        0
Imputed: m=1 through m=5                        updated =        5

Initialization: monotone                     Iterations =       50
                                                burn-in =       10

               age: linear regression
               bmi: predictive mean matching

------------------------------------------------------------------
                   |               Observations per m
by()      Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
female = 0         |
               age |        106           10        10 |       116
               bmi |         95           21        21 |       116
                   |
female = 1         |
               age |         36            2         2 |        38
               bmi |         31            7         7 |        38
-------------------+-----------------------------------+----------
Overall            |
               age |        142           12        12 |       154
               bmi |        126           28        28 |       154
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)


Example 6: Conditional imputation

Consider heart attack data containing hightar, an indicator for smoking high-tar cigarettes

. webuse mheart10s0
(Fict. heart attack data; bmi, age, hightar, & smokes missing; arbitrary pattern)

. mi describe

  Style:  mlong
          last mi update 25mar2011 11:00:38, 66 days ago

  Obs.:   complete           92
          incomplete         62  (M = 0 imputations)
          ---------------------
          total             154

  Vars.:  imputed:  4; bmi(24) age(30) hightar(19) smokes(14)

          passive:  0

          regular:  3; attack female hsgrad

          system:   3; _mi_m _mi_id _mi_miss

         (there are no unregistered variables)


Explore missing-data patterns

. mi misstable patterns

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1  2  3  4
  ------------+-------------
       60%    |  1  1  1  1
              |
       14     |  1  1  1  0
       10     |  1  1  0  1
        7     |  0  0  1  1
        3     |  1  1  0  0
        2     |  1  0  1  1
        1     |  0  0  0  1
       <1     |  0  0  1  0
       <1     |  1  0  0  0
       <1     |  1  0  1  0
  ------------+-------------
      100%    |

  Variables are  (1) smokes  (2) hightar  (3) bmi  (4) age

. mi misstable nested

     1.  smokes(14) -> hightar(19)
     2.  bmi(24)
     3.  age(30)


Impute hightar conditionally on smokes; check prediction equations prior to imputation (option dryrun)

. mi impute chained ///
>         (regress) age ///
>         (pmm) bmi ///
>         (logit) smokes ///
>         (logit, conditional(if smokes==1) omit(i.smokes)) hightar ///
>         = attack hsgrad female, dryrun

Conditional models:
            smokes: logit smokes bmi age attack hsgrad female
           hightar: logit hightar bmi age attack hsgrad female , conditional(if smokes==1)
               bmi: pmm bmi i.smokes i.hightar age attack hsgrad female
               age: regress age i.smokes i.hightar bmi attack hsgrad female


Prediction equations are as intended; proceed to imputation

. mi impute chained ///
>         (regress) age ///
>         (pmm) bmi ///
>         (logit) smokes ///
>         (logit, conditional(if smokes==1) omit(i.smokes)) hightar ///
>         = attack hsgrad female, add(5)

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        5
Imputed: m=1 through m=5                        updated =        0

Initialization: monotone                     Iterations =       50
                                                burn-in =       10

Conditional imputation:
           hightar: incomplete out-of-sample obs. replaced with value 0

               age: linear regression
               bmi: predictive mean matching
            smokes: logistic regression
           hightar: logistic regression

------------------------------------------------------------------
                   |               Observations per m
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               age |        124           30        30 |       154
               bmi |        130           24        24 |       154
            smokes |        140           14        14 |       154
           hightar |        135           19        19 |       154
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)


Convergence

MICE is an iterative method, so its convergence needs to be evaluated

Recall the imputation model for age and bmi from example 2 (here we use 3 nearest neighbors with PMM)

Let’s explore the convergence of MICE

. webuse mheart8s0
(Fictional heart attack data; bmi and age missing; arbitrary pattern)

. set seed 38762

. mi impute chained (regress) age (pmm, knn(3)) bmi = attack smokes female hsgrad,
> chainonly burnin(50) savetrace(impstats)

Conditional models:
               age: regress age bmi attack smokes female hsgrad
               bmi: pmm bmi age attack smokes female hsgrad , knn(3)

Performing chained iterations ...

Note: no imputation performed.


Trace plots of means and standard deviations of imputed values

. use impstats
(Summaries of imputed values from -mi impute chained-)

. tsset iter
        time variable:  iter, 0 to 50
                delta:  1 unit

. tsline bmi_mean, name(gr1) nodraw yline(25)

. tsline bmi_sd, name(gr2) nodraw yline(4)

. tsline age_mean, name(gr3) nodraw yline(56)

. tsline age_sd, name(gr4) nodraw yline(11.6)

. graph combine gr1 gr2 gr3 gr4, title(Trace plots of summaries of imputed values)
> rows(2)

[Figure: Trace plots of summaries of imputed values. Four panels show the mean of bmi, the std. dev. of bmi, the mean of age, and the std. dev. of age against iteration numbers 0 to 50]


MICE uses separate independent chains to obtain imputations

Use add() instead of chainonly in combination with savetrace() to save summaries of imputed values from multiple chains

. webuse mheart8s0, clear
(Fictional heart attack data; bmi and age missing; arbitrary pattern)

. qui mi impute chain (regress) age (pmm, knn(3)) bmi = attack smokes female hsgrad,
> add(5) burnin(20) savetrace(impstats, replace)


Trace plots of means and standard deviations of imputed values from multiple chains

. use impstats, clear
(Summaries of imputed values from -mi impute chained-)

. reshape wide *mean *sd, i(iter) j(m)
(note: j = 1 2 3 4 5)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                      105   ->      21
Number of variables                   6   ->      21
j variable (5 values)                 m   ->   (dropped)
xij variables:
                               age_mean   ->   age_mean1 age_mean2 ... age_mean5
                               bmi_mean   ->   bmi_mean1 bmi_mean2 ... bmi_mean5
                                 age_sd   ->   age_sd1 age_sd2 ... age_sd5
                                 bmi_sd   ->   bmi_sd1 bmi_sd2 ... bmi_sd5
-----------------------------------------------------------------------------


. tsset iter
        time variable:  iter, 0 to 20
                delta:  1 unit

. tsline bmi_mean*, name(gr1) nodraw legend(off) ytitle(Mean of bmi) yline(25)

. tsline bmi_sd*, name(gr2) nodraw legend(off) ytitle(Std. Dev. of bmi) yline(4)

. tsline age_mean*, name(gr3) nodraw legend(off) ytitle(Mean of age) yline(56)

. tsline age_sd*, name(gr4) nodraw legend(off) ytitle(Std. Dev. of age) yline(11.6)

. graph combine gr1 gr2 gr3 gr4, title(Trace plots of summaries of imputed values
> from 5 chains) rows(2)

[Figure: Trace plots of summaries of imputed values from 5 chains. Four panels show the mean of bmi, the std. dev. of bmi, the mean of age, and the std. dev. of age against iteration numbers 0 to 20]


Advantages

The variable-by-variable specification of MICE makes it easy to build complicated imputation models for multiple variables

Unlike sequential monotone imputation, MICE does not require monotone missing-data patterns

MICE accommodates variables of different types by using an imputation method appropriate for each variable

MICE allows different sets of predictors when imputing different variables

MICE allows imputing missing values within the observed (or pre-specified) ranges of the data

MICE can handle imputation of variables defined only on a subset of the data (conditional imputation)

MICE can incorporate functional relationships among variables


Disadvantages

MICE lacks formal theoretical justification

In particular, its theoretical weakness is possible incompatibility of fully conditional specifications for which no proper joint multivariate distribution exists

The variable-by-variable specification of MICE also makes it easy to build models with incompatible conditionals


Incompatibility of conditionals

MICE is similar in spirit to a Gibbs sampler but is not a true Gibbs sampler except in rare cases

A set of fully conditional specifications may be incompatible, that is, it may not correspond to any proper joint multivariate distribution (e.g., Arnold et al. 2001)

For example, $X_1 \mid X_2 \sim N(\alpha_1 + \beta_1 X_2, \sigma_1^2)$ and $X_2 \mid X_1 \sim N(\alpha_2 + \beta_2 \ln X_1, \sigma_2^2)$ are incompatible

See, for example, van Buuren et al. (2006) and van Buuren (2007) for the impact of incompatible conditionals on final MI results; only a minor impact was found in the examples considered
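
One way to see why the pair of conditionals in the example cannot come from a single joint distribution is the factorization criterion from the literature on conditionally specified distributions: two conditional densities can be compatible only if their ratio splits into a function of $x_1$ times a function of $x_2$. The derivation below is a sketch added for illustration under that criterion, assuming $x_1 > 0$ so that $\ln x_1$ is defined; it does not appear in the original slides.

$$
\begin{aligned}
\ln \frac{f(x_1 \mid x_2)}{g(x_2 \mid x_1)}
 &= -\frac{(x_1 - \alpha_1 - \beta_1 x_2)^2}{2\sigma_1^2}
    + \frac{(x_2 - \alpha_2 - \beta_2 \ln x_1)^2}{2\sigma_2^2} + \text{const} \\
 &= a(x_1) + b(x_2)
    + x_2 \left( \frac{\beta_1 x_1}{\sigma_1^2} - \frac{\beta_2 \ln x_1}{\sigma_2^2} \right)
\end{aligned}
$$

Here $a(x_1)$ and $b(x_2)$ collect the terms that depend on one variable only. The ratio factorizes only if the term in parentheses is constant in $x_1$, which requires $\beta_1 = \beta_2 = 0$. Whenever either variable actually helps predict the other, the criterion fails, so no joint distribution has both conditionals.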


MICE versus MVN

MICE uses a sequential (variable-by-variable) approach for imputation; MVN (Schafer 1997) uses a joint modeling approach based on a multivariate normal distribution

MICE has no theoretical justification (except in some particular cases); MVN does

MICE can handle variables of different types; MVN is intended for continuous variables and requires normality (Schafer [1997] and Allison [2001] note that MVN can be robust to departures from normality and can sometimes be used to model binary and ordinal variables)

MICE can incorporate important data characteristics such as ranges and restrictions within a subset of the data; in general, MVN cannot

In practice, the quality of imputations from either of the methods should be examined

See, for example, Lee and Carlin (2010) for a recent comparison of MVN and MICE


Concluding remarks

Stata 12’s mi provides multivariate imputation using chained equations, mi impute chained, among other new features

MICE is a very powerful and flexible imputation tool. Its flexibility, however, must be used with caution.

MICE has no formal theoretical justification but provides ways of capturing important data characteristics

MICE is an iterative imputation method, so its convergence needs to be evaluated

As with any imputation method, the quality of imputations needs to be evaluated after MICE

Careful modeling is required with MICE to avoid incompatible conditionals, although a few simulation studies suggest the impact of incompatible conditionals on final MI inference is minor


References

Allison, P. D. 2001. Missing Data. Thousand Oaks, CA: Sage.

Arnold, B. C., E. Castillo, and J. M. Sarabia. 2001. Conditionally specified distributions: An introduction. Statistical Science 16: 249–274.

Carlin, J. B., J. C. Galati, and P. Royston. 2008. A new framework for managing and analyzing multiply imputed data in Stata. Stata Journal 8: 49–67.

Carlin, J. B., N. Li, P. Greenwood, and C. Coffey. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3: 226–244.

Lee, K. J., and J. B. Carlin. 2010. Multiple imputation for missing data: Fully conditional specification versus multivariate normal imputation. American Journal of Epidemiology 171: 624–632.

Marchenko, Y. V., and W. D. Eddings. 2011. A note on how to perform multiple-imputation diagnostics in Stata. http://www.stata.com/users/ymarchenko/midiagnote.pdf.


Raghunathan, T. E., J. M. Lepkowski, J. Van Hoewyk, and P. Solenberger. 2001. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27: 85–95.

Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4: 227–241.

Royston, P. 2005a. Multiple imputation of missing values: Update. Stata Journal 5: 188–201.

Royston, P. 2005b. Multiple imputation of missing values: Update of ice. Stata Journal 5: 527–536.

Royston, P. 2007. Multiple imputation of missing values: Further update of ice, with an emphasis on interval censoring. Stata Journal 7: 445–464.


Royston, P. 2009. Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. Stata Journal 9: 466–477.

Royston, P., J. B. Carlin, and I. R. White. 2009. Multiple imputation of missing values: New features for mim. Stata Journal 9: 252–264.

Rubin, D. B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Rubin, D. B. 1996. Multiple imputation after 18+ years. Journal of the American Statistical Association 91: 473–489.

Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. Boca Raton, FL: Chapman & Hall/CRC.


van Buuren, S. 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research 16: 219–242.

van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18: 681–694.

van Buuren, S., J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, and D. B. Rubin. 2006. Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation 76: 1049–1064.

White, I. R., P. Royston, and A. M. Wood. 2011. Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 30: 377–399.
