
Imputing (Item-level) Missing Data

Professor Ron Fricker
Naval Postgraduate School
Monterey, California

Reading Assignment: Groves et al., Chapter 10
8/18/12


Goals for this Lecture

•  Define what imputation is and why it is important
   –  Explicit vs. implicit imputation
   –  Advantages and disadvantages
•  Describe some of the most common methods
•  A bit about survey data documentation and metadata


Terminology

•  Unit nonresponse: Respondent refuses to take the survey at all
   –  Sampled person: “I don’t take surveys. Please do not contact me again.”
•  Item nonresponse: Respondent refuses to answer one or more survey questions
   –  Interviewer: “What was your total family income last year?”
   –  Response: “That’s personal. I’m not going to tell you.”


Addressing Unit and Item Non-response

•  Unit nonresponse handled via appropriate weighting
•  What about respondents who failed/refused to answer one or more items (i.e., item nonresponse)?

Source: Survey Methodology, 1st ed., Groves, et al, 2004.


Imputation

•  Imputation is the substitution of some value for a missing data point
   –  Used for item nonresponse
   –  It’s within-sample inference
•  Naïve survey analysts sometimes ignore all records that are not complete (aka casewise deletion)
   –  Called complete case analysis
      •  Can be a terrible waste of data/information
      •  Why ignore a whole record because one item is missing?
   –  Likely paid a lot to collect the data, so should not lightly ignore/delete data
•  If a respondent fails to answer one or more items, still would like to use rest of their data


Casewise Deletion is Also Imputation (But Usually Poor Imputation)

•  Remember, goal is to infer from the sample to the population
•  Deleting an observation is the same as taking the respondent out of the sample
   –  But they are still part of the population
   –  So it’s equivalent to imputing (inferring) all of their data from the remaining sample
   –  Casewise deletion is implicit (vs. explicit) imputation
   –  Also, if missing values are not random, can introduce bias
•  But casewise deletion sometimes appropriate


Can Be a Major Issue When Modeling

•  For multivariate regression with many covariates, a small number missing for each variable can result in dropping too many observations with casewise deletion
   –  If only doing simple univariate (question by question) analyses, imputation not necessary
•  A real example:

                         Country A  Country B  Country C  Country D  Country E  Country F
Number of Respondents        3,703      1,678      1,407      1,796      1,571      1,383
Percentage Missing            18.8       60.0       46.6       48.9       32.3       39.8
Percentage Unknown             9.5       20.8       17.3       24.7       15.0       15.8
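To see why small item-level missing rates add up, here is a minimal Python sketch. It assumes, purely for illustration, that missingness is independent across items; the lecture does not state this, and the 25 items and 2% rate are hypothetical numbers rather than values from the table above.

```python
def complete_case_fraction(missing_rates):
    """Expected share of records with no missing items.

    Assumes missingness is independent across items (an illustrative
    assumption, not something the lecture claims).
    """
    fraction = 1.0
    for p in missing_rates:
        fraction *= (1.0 - p)
    return fraction


# Hypothetical survey: 25 covariates, each missing for 2% of respondents.
rates = [0.02] * 25
kept = complete_case_fraction(rates)
print(f"Share of records surviving casewise deletion: {kept:.3f}")
```

Under these assumptions roughly two in five records would be dropped, even though no single item is missing more than 2% of the time.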


Question-level Detail

Percent Missing by Question

          Country A  Country B  Country C  Country D  Country E  Country F
Q41            0.6%       1.9%       1.7%       1.9%       1.3%       0.7%
Q5             1.1%       1.0%       0.6%       0.6%       0.6%       1.2%
Q7EDU          0.8%       0.5%       0.2%       0.7%       0.8%       0.2%
Q7HEA          0.6%       0.8%       0.1%       1.9%       1.4%       0.6%
Q7WAT          0.7%       0.7%       1.1%       2.0%       1.6%       0.5%
Q7ROA          0.8%       1.1%       0.8%       2.7%       1.3%       0.8%
Q7ELE          0.7%       9.1%      14.8%       7.1%       5.0%       2.4%
Q8EDU          0.1%       1.0%       0.6%       1.0%       0.9%       1.4%
Q8HEA          0.2%       1.0%       0.8%       0.9%       1.7%       1.4%
Q8WAT          0.3%       1.5%       1.5%       1.1%       1.8%       1.8%
Q8ROA          0.4%       1.2%       1.3%       1.9%       2.0%       1.2%
Q8ELE          0.3%       9.6%      12.2%       6.4%       5.2%       3.1%
D0             0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
D2             0.0%       6.0%       1.7%       1.7%       1.6%       0.1%
D3             0.0%       1.1%       1.2%       1.3%       2.4%       0.0%
D7             0.2%       3.4%       5.2%       1.1%       1.0%       1.1%
D8            10.8%      26.1%      22.8%      33.7%      12.9%      26.9%
D9             0.5%      16.3%       2.3%       3.3%       3.9%       6.9%
D15            1.2%       2.3%       0.6%       1.2%       2.0%       1.2%
D16            0.6%       2.0%       0.9%       1.5%       2.2%       0.7%
D17            0.5%      19.4%       1.4%       1.3%       1.7%       0.8%
D18            3.6%       3.0%       0.6%       2.3%       1.6%       1.2%
D20            1.0%       0.0%       1.1%       1.2%       2.1%       0.9%
D22            0.9%       3.0%       3.9%       4.5%       2.2%       1.2%
D23            0.6%       2.3%       1.4%       1.1%       2.2%       1.7%
D33            0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
D34            0.0%       0.0%       0.0%       0.0%       0.0%       0.0%

Percent Unknown by Question

          Country A  Country B  Country C  Country D  Country E  Country F
Q41            1.1%       4.5%       2.9%       4.4%       1.7%       4.6%
Q5             1.0%       1.6%       0.8%       1.3%       1.7%       1.3%
Q7EDU          1.1%       2.6%       0.9%       4.4%       1.9%       1.8%
Q7HEA          0.7%       1.7%       0.3%       2.7%       1.0%       0.7%
Q7WAT          0.7%       1.0%       0.4%       1.7%       0.4%       0.2%
Q7ROA          0.6%       1.4%       0.4%       2.6%       0.8%       0.4%
Q7ELE          0.9%       3.5%       4.1%       5.5%       4.1%       2.4%
Q8EDU          1.3%       3.2%       1.1%       3.8%       2.4%       1.7%
Q8HEA          0.6%       2.1%       0.9%       2.5%       1.5%       1.0%
Q8WAT          1.0%       1.3%       1.0%       1.7%       0.9%       0.3%
Q8ROA          0.8%       1.8%       0.9%       2.4%       1.0%       0.4%
Q8ELE          1.1%       3.2%       5.1%       5.1%       2.7%       2.5%
D0             0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
D2             0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
D3             0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
D7             0.1%       0.1%       2.4%       0.4%       0.2%       0.8%
D8             0.4%       0.2%       3.9%       0.9%       0.4%       0.4%
D9             0.1%       2.4%       0.4%       1.0%       0.6%       4.1%
D15            1.5%       0.3%       0.4%       2.4%       0.8%       0.2%
D16            1.4%       2.0%       0.6%       2.4%       0.7%       0.3%
D17            0.9%       1.1%       0.9%       3.1%       0.4%       0.2%
D18            2.2%       6.1%       0.3%       3.3%       0.8%       0.3%
D20            0.1%       0.0%       0.0%       0.1%       0.4%       0.1%
D22            0.8%       0.2%       0.8%       1.3%       1.3%       1.0%
D23            0.7%       0.4%       0.1%       0.4%       0.4%       0.9%
D33            0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
D34            0.0%       0.0%       0.0%       0.0%       0.0%       0.0%


Goal of (Explicit) Imputation

•  “Fill in” values for the missing data in such a way that
   –  Information from actual survey can be used
   –  Imputed data does not bias/affect results
•  Last bullet is key
   –  This is not an exercise in manipulating data to make it say what you want
•  Done in a careful manner, gains from use of all observations outweigh potential issues from the imputation


Advantages and Disadvantages of Explicit Imputation

•  Advantages
   –  Maximizes the use of all survey data
   –  Univariate analysis of each variable will have same number of observations
•  Disadvantages
   –  Some think of imputed data as “made up” data
   –  Most statistical software is not designed to distinguish between real and imputed data
      •  If not handled correctly, results in improper standard error estimates, too narrow confidence intervals, etc.


Some Common Imputation Methods

•  There are many types of imputation methods
•  We’ll discuss four of the more common methods:
   –  Mean value imputation
   –  Regression imputation
   –  Hot deck imputation
   –  Multiple imputation


Mean Value Imputation

•  Basic idea: replace missing values with the sample average for that item ($\bar{y}^{(r)}$)
•  Advantages
   –  Easy
   –  Does not affect estimates of the mean
•  Disadvantages
   –  Distorts the distribution (spike at the mean value)
   –  Can result in underestimation of standard errors
   –  Using overall mean value may not be appropriate for all observations
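The trade-off described above can be checked with a short Python sketch. The simulated uniform data, the 30% missing rate, the seed, and the function name `mean_impute` are all illustrative assumptions, not part of the lecture.

```python
import random
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed (respondent) values."""
    observed_values = [v for v in values if v is not None]
    respondent_mean = statistics.fmean(observed_values)
    return [respondent_mean if v is None else v for v in values]


random.seed(1)
# Simulated Y ~ U(0,1); each item is set to missing with probability 0.3
# (a hypothetical rate chosen only for this sketch).
data = [random.random() if random.random() > 0.3 else None for _ in range(1000)]

observed = [v for v in data if v is not None]
imputed = mean_impute(data)

print("mean, observed only:", statistics.fmean(observed))
print("mean, after imputing:", statistics.fmean(imputed))
print("sd, observed only:  ", statistics.stdev(observed))
print("sd, after imputing: ", statistics.stdev(imputed))
```

The two means agree, while the standard deviation after imputation is smaller because every filled-in value sits exactly at the mean. That shrinkage is what leads to understated standard errors.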


Examples

Simulated data, $Y \sim U(0,1)$: $E(Y) = 0.5$ and $\sigma_Y = 0.289$

[Figure: two panels, labeled “Not a Good Idea” and “Probably Okay”]


Stochastic Mean Value Imputation

•  One solution is to add “noise” to the imputed value:

   $y_{ij}^{(r)} = \bar{y}_j^{(r)} + \varepsilon_{ij}$

   –  where $\varepsilon_{ij} \sim N(0, s_j^2)$ and $s_j^2$ is the sample variance calculated on the nonmissing items for question j
•  Example:
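A minimal Python sketch of the noise step follows, assuming missing items are coded as `None`. The function name `stochastic_mean_impute`, the seed, and the answer values are hypothetical and chosen only for illustration.

```python
import random
import statistics


def stochastic_mean_impute(values, rng):
    """Fill None entries with the respondent mean plus N(0, s_j^2) noise."""
    observed = [v for v in values if v is not None]
    mean_j = statistics.fmean(observed)
    # statistics.stdev is the square root of the sample variance s_j^2
    # computed on the nonmissing items.
    sd_j = statistics.stdev(observed)
    return [mean_j + rng.gauss(0.0, sd_j) if v is None else v for v in values]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
answers = [3.1, None, 2.7, 4.0, None, 3.6, 2.9]  # hypothetical responses to question j
filled = stochastic_mean_impute(answers, rng)
print(filled)
```

Each missing item gets its own random draw, so the imputed values no longer pile up at a single point the way they do with plain mean value imputation.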


Mean Value Imputation by Subgroups

•  Mean value imputation by subgroup can sometimes provide more accurate estimate
   –  E.g., if data is an individual’s weight, better to use mean value after grouping by gender
•  Can also do stochastic mean value imputation by subgroup
•  Can generalize to subgroups based on multiple variables
   –  E.g., impute weight using subgroups based on gender and height categories
   –  Can quickly get out of hand computationally
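The subgroup idea can be sketched in a few lines of Python. The record layout (a list of dicts), the function name `impute_by_subgroup`, and the weights are hypothetical. The sketch also assumes every subgroup has at least one observed value, since otherwise there is no subgroup mean to use.

```python
from collections import defaultdict
import statistics


def impute_by_subgroup(records, group_key, target):
    """Fill missing `target` values with the mean of the record's subgroup."""
    groups = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            groups[rec[group_key]].append(rec[target])
    # Raises KeyError below if a subgroup has no observed values at all.
    group_means = {g: statistics.fmean(vals) for g, vals in groups.items()}
    return [
        {**rec, target: group_means[rec[group_key]]} if rec[target] is None else dict(rec)
        for rec in records
    ]


# Hypothetical weights (kg) grouped by gender.
respondents = [
    {"gender": "F", "weight": 60.0},
    {"gender": "F", "weight": 64.0},
    {"gender": "F", "weight": None},
    {"gender": "M", "weight": 80.0},
    {"gender": "M", "weight": None},
    {"gender": "M", "weight": 84.0},
]
completed = impute_by_subgroup(respondents, "gender", "weight")
for rec in completed:
    print(rec)
```

To group on several variables, `group_key` could be replaced by a tuple of keys. The number of cells then grows quickly and some cells may end up with no donors, which is one way the approach gets out of hand.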


Regression Imputation (1)

•  Can use regression to predict mean value
   –  Better than simple mean value imputation when there are multiple variables
•  First, use information from those with data to estimate regression model

   $Y_i^{(r)} = \beta_0 + \beta_1 X_{1,i}^{(r)} + \beta_2 X_{2,i}^{(r)} + \cdots + \beta_k X_{k,i}^{(r)} + \varepsilon$

•  Then use model to predict those missing data

   $Y_i^{(nr)} = \beta_0 + \beta_1 X_{1,i}^{(nr)} + \beta_2 X_{2,i}^{(nr)} + \cdots + \beta_k X_{k,i}^{(nr)} + \varepsilon$

   where $\varepsilon$ here is added “noise”
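The two steps can be sketched in Python for the simplest case of a single covariate. This is an illustration under stated assumptions, not the lecture's implementation: the noise is drawn from a normal distribution with the residual standard deviation of the fitted model, and the function names and data are hypothetical.

```python
import random
import statistics


def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x on the respondents."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sigma = (sum(r * r for r in residuals) / (len(x) - 2)) ** 0.5
    return b0, b1, sigma


def regression_impute(rows, rng):
    """rows: (x, y) pairs where y may be None; x must always be present."""
    resp = [(x, y) for x, y in rows if y is not None]
    b0, b1, sigma = fit_simple_regression([x for x, _ in resp], [y for _, y in resp])
    # Predicted value plus assumed normal noise for each nonrespondent.
    return [
        (x, y if y is not None else b0 + b1 * x + rng.gauss(0.0, sigma))
        for x, y in rows
    ]


rng = random.Random(7)
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, 9.8), (6.0, None)]
for x, y in regression_impute(rows, rng):
    print(x, round(y, 2))
```

With more covariates the same pattern applies, but the fit would normally be handed to a statistics library rather than written by hand.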


Regression Imputation (2)

•  For multiple regression, must have data available on all covariates in the model
   –  Sometimes too hard and/or results in too few observations available with all data
•  To get around, sometimes must do a series of imputations
   –  I.e., impute values with one model that become covariates in the next regression imputation model


Hot Deck Imputation

•  Hot deck imputation often used in large-scale imputation processes
•  The name dates back to the use of computer punch cards
   –  Sometimes called last value carried forward
•  Basic idea:
   –  Sort data by important variables
   –  Start at the top and replace any missing data with value of the immediately preceding observation
   –  If first one is missing, replace with appropriate mean value
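The basic idea above can be sketched in Python as follows. The record layout, the function name `sequential_hot_deck`, and the data are hypothetical. Using the overall respondent mean as the starting value is one possible reading of “appropriate mean value”, not something the lecture specifies.

```python
import statistics


def sequential_hot_deck(records, sort_keys, target):
    """Sort the records, then fill each missing value from the preceding record."""
    ordered = sorted(records, key=lambda r: tuple(r[k] for k in sort_keys))
    observed = [r[target] for r in ordered if r[target] is not None]
    # Starting value, used only if the first sorted record is missing.
    last_value = statistics.fmean(observed)
    completed = []
    for rec in ordered:
        if rec[target] is None:
            rec = {**rec, target: last_value}  # copy so the input is not modified
        else:
            last_value = rec[target]
        completed.append(rec)
    return completed


# Hypothetical records: income (in thousands) is sometimes missing.
people = [
    {"region": "N", "age": 30, "income": None},
    {"region": "N", "age": 45, "income": 52.0},
    {"region": "S", "age": 28, "income": None},
    {"region": "S", "age": 33, "income": 40.0},
    {"region": "N", "age": 50, "income": None},
]
result = sequential_hot_deck(people, ["region", "age"], "income")
for rec in result:
    print(rec)
```

Note that the first record in region S inherits its value from the last record in region N. A donor can cross a sort boundary in this simple form, which is one reason the choice of sort variables matters.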


Hot Deck Imputation Example


Source: Survey Methodology, 1st ed., Groves, et al, 2004.


More Sophisticated Hot Deck Methods

•  Instead of using the last “hot” observation, use nearest neighbor methodology
   –  Select a set of observations that matches the record to be imputed on one or more variables (the “donor class”)
   –  Then pick the observation that is closest (according to some measure) to the record to be imputed according to some other variables
      •  If multiple closest, randomly choose one out of the group
   –  Use this record to impute missing items
•  In R, see the StatMatch package


Illustration

[Figure: nearest neighbor hot deck imputation, with records labeled “Recipient record”, “Donor records”, and “Matching records”]

•  Entire survey dataset: record “i” is missing data for variable “j”
•  Select subset of observations matching record “i” on variables “w” and “x”
•  Find observation(s) closest to record “i” on variables “y” and “z”
•  Choose one record, take its variable “j” value and fill in record “i” missing value
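The flow in the illustration can be sketched in Python using the same variable names (“w”, “x”, “y”, “z”, “j”). The squared Euclidean distance, the function name `nearest_neighbor_hot_deck`, and the data are assumptions made for the sketch; the lecture only says “according to some measure”.

```python
import random


def nearest_neighbor_hot_deck(recipient, records, match_vars, distance_vars, target, rng):
    """Return a donor's `target` value for the recipient, or None if no donor exists."""
    # Donor class: records with the target observed that match exactly on match_vars.
    donors = [
        r for r in records
        if r[target] is not None and all(r[v] == recipient[v] for v in match_vars)
    ]
    if not donors:
        return None  # no donor available; the caller decides on a fallback

    def distance(r):
        # Squared Euclidean distance on the distance variables (assumed measure).
        return sum((r[v] - recipient[v]) ** 2 for v in distance_vars)

    best = min(distance(r) for r in donors)
    closest = [r for r in donors if distance(r) == best]
    # If several donors tie for closest, choose one at random.
    return rng.choice(closest)[target]


survey = [
    {"w": "A", "x": 1, "y": 10.0, "z": 5.0, "j": 7.0},
    {"w": "A", "x": 1, "y": 12.0, "z": 6.0, "j": 9.0},
    {"w": "A", "x": 2, "y": 11.0, "z": 5.0, "j": 3.0},
    {"w": "B", "x": 1, "y": 11.0, "z": 5.5, "j": 1.0},
]
record_i = {"w": "A", "x": 1, "y": 11.5, "z": 6.0, "j": None}

value = nearest_neighbor_hot_deck(record_i, survey, ["w", "x"], ["y", "z"], "j", random.Random(0))
print("Imputed value for variable j:", value)
```

In practice the distance variables are often rescaled first so that one variable with a large range does not dominate the match.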


Multiple Imputation

•  Even more sophisticated (and complicated) imputation method
•  Creates multiple imputed data sets (hence the name)
•  Variation across the multiple data sets allows estimation of overall variation
   –  Including both sampling and imputation variance
•  Requires specification of “imputation model” and use of specialized software / methods
   –  E.g., see http://www.multiple-imputation.com/
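One common way to combine results across the imputed data sets is the set of pooling rules usually attributed to Rubin. The slide does not spell these out, so the Python sketch below is an assumption about how the combination might be done. The function name `pool_estimates` and the numbers are hypothetical.

```python
import statistics


def pool_estimates(estimates, variances):
    """Pool m point estimates and their variances from m imputed data sets."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations (m - 1 divisor)
    total = within + (1 + 1 / m) * between     # sampling plus imputation variance
    return q_bar, total


# Hypothetical results from analysing three imputed data sets.
estimates = [10.0, 12.0, 11.0]
variances = [4.0, 4.0, 4.0]
q_bar, total = pool_estimates(estimates, variances)
print("pooled estimate:", q_bar)
print("total variance: ", total)
```

The between-imputation term is what a single imputed data set cannot provide, and it is why analysing one filled-in data set as if it were fully observed tends to give standard errors that are too small.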


Survey Data Documentation and Metadata

•  Data from large surveys often used by many researchers for many years
   –  Critical to carefully and fully document data, including weights and imputed data
•  Metadata: data about data
   –  Sometimes called data dictionary or codebook
•  Types of metadata
   –  Definitional
   –  Procedural
   –  Operational
   –  Systems


What We Have Covered

•  Defined what imputation is and discussed why it is important
   –  Explicit vs. implicit imputation
   –  Advantages and disadvantages
•  Described some of the most common methods
•  Talked a bit about survey data documentation and metadata