UNDERSTANDING AND ADDRESSING

MISSING DATA

DANIELLE ROBBINS, PHD

CINDY SANGALANG, PHD

AUGUST 28, 2013

OVERVIEW

Goals for workshop

• Understand what missing data is and why it matters

• Describe ways to address missing data

• Gain exposure to advanced methods for addressing missing data

Topics covered

• Patterns of missing data

• Conventional approaches for handling missing data

• Advanced topics:

• Maximum likelihood (ML)

• Multiple imputation (MI)

WHAT IS MISSING DATA?

Observations/cases with any missing values

[Figure: data grid with cases as rows and variables as columns; ? = missing]

WHY CAN MISSING DATA

BE A PROBLEM?

Reduced sample size

Potential bias

HOW MIGHT DATA BE MISSING?

• Study participants

• Item nonresponse

• Study design

• Attrition (panel studies)

• Error in data recording

WHEN IS MISSING DATA

A PROBLEM?

• Amount of missing data

• Adequate statistical power to detect effects

• Patterns of missingness

MISSING DATA PATTERNS:

UNDERLYING ASSUMPTIONS

Classification system developed by Rubin and colleagues

(Rubin, 1976; Little & Rubin, 1987, 2002)

In other words, is the reason for missing data random or not

random? 3 categories:

• Missing Completely at Random (MCAR)

• Missing at Random (MAR)

• Not Missing at Random (NMAR or MNAR)

MISSING COMPLETELY AT

RANDOM (MCAR)

• Data are considered MCAR when there are no patterns in

missingness

• Missing values are not related (or correlated) to any variables

in the study

• If data are MCAR, the complete data is considered a random

subsample from the original target sample

• MCAR is the ideal situation

MISSING AT RANDOM (MAR)

• Data are MAR if missingness on a variable is related to other observed variables in your analysis, but not to the unobserved values of that variable itself

• Missingness on a variable can be predicted by other variables

• Considered a weaker assumption than MCAR

NOT MISSING AT RANDOM

(NMAR)

• Missingness is related to the unobserved values of the variable itself

• Whether data are NMAR is a theoretical and conceptual consideration

• Must create a separate model that accounts for the missing data mechanism
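
The three mechanisms can be illustrated with a small simulation. This is only a sketch and is not part of the workshop materials: the variables x and y, the sample size, the seed, and the missingness probabilities are all made-up assumptions. It compares the mean of y among the cases that remain observed under each mechanism.

```python
import random

rng = random.Random(2013)  # fixed seed so the run is repeatable
n = 10_000

# Hypothetical data: y depends on x, and only y will have missing values.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]


def mean(values):
    return sum(values) / len(values)


def observed_mean(values, is_missing):
    """Mean of the values that remain after dropping the missing ones."""
    return mean([v for v, m in zip(values, is_missing) if not m])


# MCAR: every y has the same 30% chance of being missing.
mcar = [rng.random() < 0.3 for _ in range(n)]
# MAR: missingness on y depends only on the fully observed x.
mar = [rng.random() < (0.6 if xi > 0 else 0.1) for xi in x]
# NMAR: missingness on y depends on the value of y itself.
nmar = [rng.random() < (0.6 if yi > 0 else 0.1) for yi in y]

full_mean = mean(y)
observed_means = {
    "MCAR": observed_mean(y, mcar),
    "MAR": observed_mean(y, mar),
    "NMAR": observed_mean(y, nmar),
}

print(f"full data: {full_mean:.3f}")
for mechanism, value in observed_means.items():
    print(f"{mechanism}: {value:.3f}")
```

Under these assumptions the MCAR mean should stay close to the full-data mean, while the MAR and NMAR means are pulled downward. In the MAR case the bias can in principle be corrected by using x; in the NMAR case it cannot be corrected from the observed data alone.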

CONVENTIONAL APPROACHES FOR

HANDLING MISSING DATA

Listwise deletion

Pairwise deletion

Mean imputation/substitution

LISTWISE DELETION

Also known as complete case analysis

Strengths

• Unbiased when data are MCAR

• Works for any kind of statistical analysis

Weaknesses

• If a large proportion of data is missing, could result in loss of statistical power

• May introduce bias if data are MAR

PAIRWISE DELETION

Only cases relating to a pair of variables are used in analysis

(e.g. correlations)

AKA “unwise deletion”

Strengths

• Uses all available information

• Approximately unbiased under MCAR

Weaknesses

• Estimates are based on different subsets of the data, so sample sizes and standard errors differ from one estimate to the next
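
The difference between listwise and pairwise deletion is easiest to see by counting cases. The sketch below uses a tiny made-up data set (the values are hypothetical; only the variable names are borrowed from the analysis example later in the deck) and counts how many cases each approach would use.

```python
from itertools import combinations

# Hypothetical rows; None marks a missing value.
rows = [
    {"anti": 1, "self": 20, "pov": 0},
    {"anti": 2, "self": None, "pov": 1},
    {"anti": None, "self": 18, "pov": 1},
    {"anti": 0, "self": 22, "pov": None},
    {"anti": 3, "self": 15, "pov": 1},
]
variables = ["anti", "self", "pov"]

# Listwise deletion: keep only rows with no missing value on any variable.
listwise = [r for r in rows if all(r[v] is not None for v in variables)]

# Pairwise deletion: for each pair of variables, keep rows observed on both.
pairwise_n = {
    (a, b): sum(1 for r in rows if r[a] is not None and r[b] is not None)
    for a, b in combinations(variables, 2)
}

print("listwise n:", len(listwise))
for pair, count in pairwise_n.items():
    print("pairwise n for", pair, ":", count)
```

Here listwise deletion keeps 2 of the 5 cases, while each pairwise correlation would be based on 3 cases, and not the same 3 cases for every pair.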

MEAN IMPUTATION

(SUBSTITUTION)

Missing values replaced with mean values

Strengths

• Easy to implement

Weaknesses

• Introduces bias - the sample size appears larger, but variances and standard errors are underestimated
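
A short numeric sketch of why mean imputation understates variability. The scores are made up for illustration; the point is only that filling in the mean leaves the mean unchanged while shrinking the sample variance.

```python
from statistics import mean, variance

# Hypothetical scores; None marks a missing value.
scores = [2, 4, None, 6, None, 8]

observed = [s for s in scores if s is not None]
fill = mean(observed)  # the mean of the observed values

# Replace every missing value with the observed mean.
imputed = [fill if s is None else s for s in scores]

var_observed = variance(observed)  # sample variance, 4 observed cases
var_imputed = variance(imputed)    # sample variance, 6 "cases" after imputation

print("mean before and after:", mean(observed), mean(imputed))
print("variance before:", var_observed)
print("variance after: ", var_imputed)
```

The imputed values add nothing to the sum of squared deviations but do increase the case count, so the variance (and any standard error built from it) comes out smaller than it should.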

ADVANCED APPROACHES:

Maximum Likelihood

Multiple Imputation

MAXIMUM LIKELIHOOD

A method that uses a likelihood function to determine parameter estimates

Likelihood functions:

• Start from a statistical model that relates the data to unknown parameters, such as a regression:

• $Y = \beta_1 X_1 + \beta_2 X_2 + \cdots$

• The likelihood function characterizes the probability of the observed data as a function of the unknown parameters

• $L(\theta \mid Y) = P(Y \mid \theta)$, so the question is which set of parameter estimates, $\theta$, maximizes the likelihood function
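
To make the idea of "the θ that maximizes the likelihood" concrete, here is a minimal sketch for the simplest possible case: a normal model with a single unknown mean. The data values, the assumption that the standard deviation is known to be 1, and the grid search are all illustrative choices, not the method any particular package uses.

```python
import math

data = [2.1, 1.9, 3.0, 2.4, 2.6]  # hypothetical observations
sigma = 1.0                       # standard deviation, assumed known here


def log_likelihood(mu, values, sigma):
    """Log of P(values | mu) for independent normal observations."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (v - mu) ** 2 / (2 * sigma ** 2)
        for v in values
    )


# Evaluate the log-likelihood on a grid of candidate means from 0 to 5.
grid = [i / 100 for i in range(0, 501)]
best_mu = max(grid, key=lambda mu: log_likelihood(mu, data, sigma))

sample_mean = sum(data) / len(data)
print("grid maximizer:", best_mu)
print("sample mean:   ", sample_mean)
```

For this model the maximizer coincides with the sample mean, which is why the grid search lands on it. Real software maximizes over many parameters at once using numerical optimization instead of a grid.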

EXAMPLES OF JOINT

PROBABILITY DENSITY

FUNCTIONS

COMPONENTS OF

MAXIMUM LIKELIHOOD*

Estimates are obtained when θ maximizes L(θ|Y).

Estimates are approximately unbiased in large samples (consistency)

Estimates have the smallest standard errors attainable in large samples (asymptotic efficiency)

Estimates are approximately normally distributed in large samples, due to the central limit theorem

* These are large-sample properties: given N > 200; hard to use for N < 100

MAXIMUM LIKELIHOOD

PROCEDURE

One method for obtaining the estimates is the EM algorithm, which is applied iteratively in a two-step process

1. Expectation: find the expected value of the log-likelihood function given the current estimate, θ0

2. Maximization: maximize that function to get new estimates, θ1

Repeat steps 1 and 2 until the estimates converge
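
The two steps can be sketched for a small case: two variables assumed to be bivariate normal, with x fully observed and y missing for some cases. Everything here is an illustrative assumption (the function name, the data, the tolerance); it is a teaching sketch, not the routine used by SAS or Stata.

```python
def em_bivariate_normal(x, y, tol=1e-10, max_iter=1000):
    """EM estimates of means and (co)variances when only y has missing values."""
    n = len(x)
    # x is complete, so its mean and variance are estimated directly.
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n
    # Starting values for the y parameters come from the observed y values.
    obs = [yi for yi in y if yi is not None]
    mu_y = sum(obs) / len(obs)
    s_yy = sum((yi - mu_y) ** 2 for yi in obs) / len(obs)
    s_xy = 0.0
    for iteration in range(1, max_iter + 1):
        # E-step: expected y and y^2 for missing cases, given current estimates.
        slope = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        ey, ey2 = [], []
        for xi, yi in zip(x, y):
            if yi is None:
                m = mu_y + slope * (xi - mu_x)
                ey.append(m)
                ey2.append(m * m + resid_var)
            else:
                ey.append(yi)
                ey2.append(yi * yi)
        # M-step: re-estimate the parameters from the expected statistics.
        new_mu_y = sum(ey) / n
        new_s_xy = sum(xi * e for xi, e in zip(x, ey)) / n - mu_x * new_mu_y
        new_s_yy = sum(ey2) / n - new_mu_y ** 2
        change = max(abs(new_mu_y - mu_y), abs(new_s_xy - s_xy),
                     abs(new_s_yy - s_yy))
        mu_y, s_xy, s_yy = new_mu_y, new_s_xy, new_s_yy
        if change < tol:  # stop once the estimates no longer move
            break
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy,
            "s_yy": s_yy, "iterations": iteration}


# Hypothetical data; None marks a missing y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, None, 6.1, None, None]
fit = em_bivariate_normal(x, y)
print(fit)
```

In this particular pattern (x complete, y partly missing) the ML estimate of the mean of y has a closed form: the complete-case regression line evaluated at the mean of all x values. The EM result can be checked against it.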

IDEAL CONDITIONS FOR

MAXIMUM LIKELIHOOD

Data should be continuous

Data should be normally distributed

Constant variance, i.e. the variance of a variable should not depend on the values of the data

Overall, multivariate normal data are ideal, but maximum likelihood is still used with categorical data

Missing data assumed to be MAR

IMPLEMENTATION OF

MAXIMUM LIKELIHOOD

Can use EM algorithm *

Direct Maximization

Factoring

* Used in both ML and MI

DATA EXAMPLES

Can you pick which data sets might not be MAR?

ID   Alcohol Amount   Alcohol Frequency
1    1                1
2    2                1
3    3                1
4    4                1
5    4                x
6    2                x
7    2                x

ID   Alcohol Amount   Alcohol Frequency
1    1                1
2    2                x
3    3                1
4    4                2
5    4                3
6    2                2
7    2                x

ID   Alcohol Amount   Alcohol Frequency
1    1                x
2    2                x
3    3                x
4    4                x
5    4                x
6    2                x
7    2                x

DATA EXAMPLE FOR

ANALYSIS

National Longitudinal Survey of Youth (1990)

Variables:

• ANTI: antisocial behavior, (0-6)

• SELF: self-esteem, (6-24)

• POV: poverty status, dichotomous, 1 = in poverty

• BLACK: 1 if child is Black, dichotomous

• HISPANIC: 1 if child is Hispanic, dichotomous

• CHILDAGE: age in 1990

• DIVORCE: 1 if mother divorced in 1990, dichotomous

• GENDER: 1 if female, dichotomous

• MOMAGE: mother’s age at birth of child

• MOMWORK: 1 if mother is employed, dichotomous

Missing data on predictor variables

N=581

ANALYSIS ON

MISSING DATA SET

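/* OLS regression on the incomplete data; PROC REG drops every case */
/* with a missing value on any model variable (listwise deletion)   */
proc reg data=nlsyem;
  model anti = self pov black hispanic childage divorce gender
               momage momwork;
run;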

MAXIMUM

LIKELIHOOD: SAS

proc calis data=nlsyem method=fiml;

path anti <- self pov black hispanic childage divorce gender

momage momwork;

run;

MAXIMUM LIKELIHOOD: STATA

sem (anti <- self pov black hispanic childage divorce gender ///
    momage momwork), method(mlmv)

MAXIMUM LIKELIHOOD:

REPEATED MEASURES

Proc mixed and Proc glimmix (SAS) and xtreg, xtmixed, etc. (STATA) automatically handle missing data on the outcome by ML, but only if there are no missing values on the predictors

LIMITATIONS OF ML

Must use a joint distribution of all variables to make estimates (i.e. L(θ|Y))

May not be robust

Auxiliary variables may be difficult to use; an auxiliary variable is a variable outside the analysis model that is allowed to correlate freely with all other variables

MULTIPLE IMPUTATION

Similar properties to ML; both can start from estimates produced by the EM algorithm

Used with any type of data

Multiple software options

Stochastic results

Complex implementation

The imputation model must be at least as general as the analysis model (it should include every variable in the analysis model)

MULTIPLE IMPUTATION

EQUATIONS

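Given dependent variable Y and independent variable X, a regression is first estimated from the cases with complete data:

$\hat{y}_i = b_0 + b_1 x_i$

For cases with missing data on Y, imputations are generated using:

$\tilde{y}_i = b_0 + b_1 x_i + s_{y \cdot x} r_i$

where $s_{y \cdot x}$ is the residual standard deviation from the regression and $r_i$ is a random draw from a standard normal distribution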
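
The imputation equation can be sketched in a few lines. This is a simplified illustration with made-up data and hypothetical function names: it fits the regression on the complete cases and adds a random residual to each predicted value. Full multiple imputation procedures also draw the regression parameters themselves at random for each imputed data set, a step this sketch leaves out.

```python
import random


def fit_line(pairs):
    """Least-squares intercept, slope, and residual SD from complete cases."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in pairs)
    s = (sse / (n - 2)) ** 0.5  # residual standard deviation
    return b0, b1, s


def impute_once(x, y, rng):
    """Fill each missing y with its predicted value plus random noise."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    b0, b1, s = fit_line(complete)
    return [
        yi if yi is not None else b0 + b1 * xi + s * rng.gauss(0, 1)
        for xi, yi in zip(x, y)
    ]


# Hypothetical data; None marks a missing y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, None, 6.1, None, None]

rng = random.Random(42)  # fixed seed so the imputations are repeatable
imputed_sets = [impute_once(x, y, rng) for _ in range(5)]
for data_set in imputed_sets:
    print([round(v, 2) for v in data_set])
```

Each of the five data sets keeps the observed values untouched and differs only in the imputed ones, which is the source of the "stochastic results" noted for MI.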

MULTIPLE IMPUTATION

ASSUMPTIONS

Data are assumed to be missing at random (MAR)

Data are assumed to be multivariate normal

IMPLEMENTATION OF MI: SAS

(N=10)

proc mi data=nlsyem out=impnlsyem noprint nimpute=100;

var anti self pov black hispanic childage divorce gender momage momwork;

run;

proc reg data=impnlsyem outest=a covout;

model anti=self pov black hispanic childage

divorce gender momage momwork;

by _imputation_;

run;

proc mianalyze data=a;

modeleffects intercept self pov black hispanic childage divorce gender momage

momwork;

run;
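
The final combining step (the job PROC MIANALYZE performs in the code above) follows Rubin's rules: average the estimates across imputed data sets, and combine the within-imputation and between-imputation variance. The sketch below shows the arithmetic for a single coefficient; the function name and the five estimates and standard errors are hypothetical numbers, not output from the NLSY analysis.

```python
from statistics import mean, variance


def pool(estimates, std_errors):
    """Combine one coefficient across m imputed data sets (Rubin's rules)."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total ** 0.5


# Hypothetical results for one coefficient from m = 5 imputed data sets.
estimates = [0.10, 0.12, 0.08, 0.11, 0.09]
std_errors = [0.03, 0.03, 0.03, 0.03, 0.03]

pooled_estimate, pooled_se = pool(estimates, std_errors)
print("pooled estimate:", round(pooled_estimate, 4))
print("pooled SE:      ", round(pooled_se, 4))
```

The pooled standard error is larger than any single-imputation standard error because it also reflects how much the estimate moves from one imputed data set to the next.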

IMPLEMENTATION OF

MI: SAS (N=10)

IMPLEMENTATION OF MI: SAS

(N=100)

IMPLEMENTATION OF MI: STATA

mi set flong

mi register imputed anti self pov black hispanic childage divorce ///
    gender momage momwork

mi impute mvn anti self pov black hispanic childage divorce ///
    gender momage momwork, add(10)

mi estimate: regress anti self pov black hispanic childage divorce ///
    gender momage momwork

IMPLEMENTATION OF MI: STATA

MI VS ML

Maximum likelihood (ML) is simpler to implement

Multiple imputation (MI) can handle various types of data, but the imputation model must be compatible with the analysis model

ML offers one result given a set of parameters

MI gives stochastic results

ML offers no conflict between imputation and analysis model

CHOOSING ML VS MI

Choose ML when data is mostly continuous

Choose MI when data is mostly categorical

PRACTICAL STEPS

Assess the amount of missing data and which variables it affects

Select an appropriate technique for handling missing data

Conduct a sensitivity analysis: try different approaches and see whether the results are consistent, so that you can empirically compare the estimates each approach produces
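
The first step, assessing how much is missing and where, amounts to a simple tabulation. A minimal sketch with made-up rows (the values and the function name are hypothetical; only the variable names come from the analysis example):

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"anti": 1, "self": 20, "pov": 0},
    {"anti": 2, "self": None, "pov": 1},
    {"anti": 0, "self": 18, "pov": None},
    {"anti": 3, "self": None, "pov": 1},
    {"anti": 1, "self": 22, "pov": 0},
]


def percent_missing(rows):
    """Percentage of cases with a missing value, per variable."""
    return {
        name: 100 * sum(r[name] is None for r in rows) / len(rows)
        for name in rows[0]
    }


by_variable = percent_missing(rows)

# Percentage of cases with at least one missing value on any variable.
incomplete = 100 * sum(
    any(value is None for value in r.values()) for r in rows
) / len(rows)

for name, pct in by_variable.items():
    print(f"{name}: {pct:.1f}% missing")
print(f"cases with any missing value: {incomplete:.1f}%")
```

The per-variable percentages and the share of incomplete cases are the figures usually reported when describing missing data in a manuscript.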

REPORTING MISSING DATA

FOR PUBLICATION

• Report amount of missing data as a percentage of the

complete data

• Consider assumptions and patterns of missing data

• Report appropriate method of handling missing data

REFERENCES

Allison, P.D. (2012). Handling Missing Data by Maximum Likelihood. Statistics and Data Analysis: SAS Global Forum.

Allison, P.D. (2002). Missing Data. Thousand Oaks, CA: Sage.

Baraldi, A.N. & Enders, C.K. (2010). An introduction to modern missing data analyses. Journal of School Psychology, 48, 5-37.

Schlomer, G.L., Bauman, S. & Card, N.A. (2010). Best practices for missing data management in counseling psychology. Journal of Counseling Psychology, 57(1), 1-10.