Postcollection Processing of Survey Data
TRANSCRIPT

Professor Ron Fricker
Naval Postgraduate School
Monterey, California
Goals for this Lecture
• Discuss coding, how to code survey data, and various types of editing
• Describe the various types of survey weights and how to calculate and use them
• Define what imputation is
– Describe some of the most common methods
– Discuss techniques for variance estimation
• A bit about survey data documentation and metadata
Post-survey Processing
• There are a number of steps between collection of survey data and analysis
– Varies by mode of data collection
– Easier or harder depending on a number of factors
Terminology
• Coding: Process of turning text-based answers into numerically-coded categories
• Data entry: Process of entering numeric data into data files
• Editing: Examination of data file to detect errors and inconsistencies (to possibly correct or delete)
• Imputation: Repair of item-level missing data by estimating and inserting an answer in the data field
• Weighting: Adjustment of survey statistic computations to account for sampling design, nonresponse, and noncoverage
• Variance estimation: Computation of the variance of the sampling distribution of a statistic
Coding for Closed-ended Items
• Example: "Indicate the extent to which you agree or disagree with the following statement: OA4109 is the best class I've taken at NPS."

  Response            Code
  Strongly agree      1
  Agree               2
  Neutral             3
  Disagree            4
  Strongly disagree   5

• When creating codes, be sure the direction of the scale from positive to negative is always the same
– Makes it easier to interpret and report results
Coding for Open-ended Items
• Coding is both an act of translation and an act of summarization

  "Well, in our house we have a dog, a cat, and two parakeets. Oh yeah, and my stepdaughter visits every Tuesday and brings her ferret…"

  "code = 5"
Editing
• Most basic form of editing is accomplished via different kinds of data checks
– Range checks
– Ratio checks
– Balance checks
– Outlier checks
– Consistency checks
– Logic checks
– Comparisons to historical data
• Basic point: Sanity check the raw data file
Editing Suggestions
• Graphical plots help easily identify odd data
– Bar charts, histograms, and scatterplots
– Don't forget to look at the amount of missingness
• Can identify problem items or skip patterns
• On longer / complicated instruments, logic checks are important
– If service = Navy and rank in {1LT, 2LT, CPT, MAJ, LTC, COL} then problem_flag_1 = 1
– If family_sep_allowance > 0 and deploy_ind = 0 then problem_flag_2 = 1
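The two logic checks above can be sketched in a few lines of Python. The records and field names are toy data mirroring the slide's examples, not a real instrument:

```python
# Toy survey records; field names mirror the slide's illustrative checks.
records = [
    {"service": "Navy", "rank": "LTJG", "family_sep_allowance": 0,    "deploy_ind": 1},
    {"service": "Army", "rank": "CPT",  "family_sep_allowance": 1200, "deploy_ind": 1},
    {"service": "Navy", "rank": "MAJ",  "family_sep_allowance": 500,  "deploy_ind": 0},
]

army_ranks = {"1LT", "2LT", "CPT", "MAJ", "LTC", "COL"}
for r in records:
    # Logic check 1: a Navy respondent should not report an Army officer rank.
    r["problem_flag_1"] = int(r["service"] == "Navy" and r["rank"] in army_ranks)
    # Logic check 2: a family separation allowance implies a deployment.
    r["problem_flag_2"] = int(r["family_sep_allowance"] > 0 and r["deploy_ind"] == 0)

print([r["problem_flag_1"] for r in records], [r["problem_flag_2"] for r in records])
```

In practice such flags would be computed over the full raw data file and the flagged records reviewed by hand.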
Weighting
• Four basic types:
– Weighting as a first-stage ratio adjustment
– Weighting for differential selection probabilities
– Weighting to adjust for unit nonresponse
– Poststratification weighting
• Variance reduction
• Undercoverage
• Unit nonresponse
What is a Weight?
• Think of it like the number of units in the population that the respondent represents
• SRS is simplest: imagine a population of size N = 2,000,000 from which we take a sample of size n = 1,000
– Then the probability of selecting any one unit is p = n/N = 0.0005
– Each unit in the sample gets a weight of 1/p = N/n = 2,000
• That is, each unit in the sample represents 2,000 units in the population
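The SRS weight arithmetic above can be scripted directly as a sanity check, using the numbers from the example:

```python
# SRS weighting example from the slide: N = 2,000,000, n = 1,000.
N = 2_000_000   # population size
n = 1_000       # sample size

p = n / N       # probability that any one unit is selected
weight = 1 / p  # each sampled unit "represents" N/n = 2,000 population units

print(p, weight)
```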
Weighting for First-stage Ratio Adjustment
• In stratified multi-stage designs, primary sampling units (PSUs) are sampled
– Usually the sampling probability is proportional to some size measure
• Weight calculated as

  W_{1i} = (Stratum pop'n total from frame) / (Pop'n total for selected PSU / Pr(selecting PSU))

• Called the "first-stage ratio adjustment weight"
Weighting for Differential Selection Probabilities
• Differential selection probability means some units in the population have a higher or lower probability of being selected into the sample
– For example, stratification usually results in differential selection probabilities
• Weight is just the inverse of the selection probability
• Weighted mean is

  ȳ_w = ( Σ_{i=1}^{n} w_i y_i ) / ( Σ_{i=1}^{n} w_i )
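The weighted mean can be computed directly from the formula; here is a minimal Python sketch with made-up responses and selection probabilities:

```python
# Weighted mean: y_bar_w = sum(w_i * y_i) / sum(w_i).
# Toy data: three respondents with unequal selection probabilities.
y = [10.0, 20.0, 30.0]
p = [0.1, 0.5, 0.5]            # selection probabilities (illustrative)
w = [1 / pi for pi in p]       # weight = inverse of selection probability

y_bar_w = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
print(y_bar_w)
```

Note how the respondent with the low selection probability (and hence the large weight) pulls the weighted mean toward its value.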
Examples
• Consider SRS (just to show it makes sense):

  ȳ_w = ( Σ_{i=1}^{n} w_i y_i ) / ( Σ_{i=1}^{n} w_i )
      = ( Σ_{i=1}^{n} (N/n) y_i ) / ( Σ_{i=1}^{n} (N/n) )
      = (1/n) Σ_{i=1}^{n} y_i
      = ȳ

• Now, assume two strata with N_1 + N_2 = N, samples of size n_1 < N_1 and n_2 < N_2, and n_1 + n_2 = n:

  ȳ_w = ( Σ_{i=1}^{n_1} (N_1/n_1) y_{1i} + Σ_{i=1}^{n_2} (N_2/n_2) y_{2i} ) / ( n_1 (N_1/n_1) + n_2 (N_2/n_2) )
      = ( N_1 ȳ_1 + N_2 ȳ_2 ) / N
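The two-strata identity can be verified numerically. The stratum sizes and sample values below are made up for illustration:

```python
# Two-stratum check: the weighted mean should equal (N1*y_bar1 + N2*y_bar2) / N.
N1, N2 = 1000, 4000                      # stratum population sizes (illustrative)
s1 = [2.0, 4.0]                          # stratum 1 sample, n1 = 2
s2 = [10.0, 20.0, 30.0, 40.0]            # stratum 2 sample, n2 = 4

w = [N1 / len(s1)] * len(s1) + [N2 / len(s2)] * len(s2)   # w_i = N_h / n_h
y = s1 + s2
y_bar_w = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

y_bar1 = sum(s1) / len(s1)
y_bar2 = sum(s2) / len(s2)
direct = (N1 * y_bar1 + N2 * y_bar2) / (N1 + N2)
print(y_bar_w, direct)
```

Both computations agree, which is the point of the derivation: the weighted mean is just the population-share-weighted combination of the stratum means.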
Weighting to Adjust for Unit Nonresponse
• Can only adjust for nonresponse using variables known for everyone in the sample
– Has to be based on external (not survey) data
– Often these are demographic types of variables
• Adjustment assumes data are missing at random (MAR) within observed groups
• Weight is the inverse of the response rate for each categorical group
– Same idea as the weight for differential selection probability
Nonresponse Weight Example
  w_2(Latino) = 24,937,500 / 62,500 = 399
  w_2(Non-Latino) = 174,562,500 / 62,500 = 2,793

So, Latinos were oversampled at a rate of 7 times that of Non-Latinos
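The arithmetic in the example checks out directly (the "7 times" figure is just the ratio of the two weights):

```python
# Weights from the slide's Latino / Non-Latino example.
w_latino = 24_937_500 / 62_500       # = 399
w_non_latino = 174_562_500 / 62_500  # = 2,793

oversampling_rate = w_non_latino / w_latino  # how much more heavily Latinos were sampled
print(w_latino, w_non_latino, oversampling_rate)
```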
Poststratification Weighting
• Most frequently used to ensure survey totals match known population totals
• Example: Population is known to be 52% female and 48% male
– However, survey results (perhaps using first-stage and/or nonresponse weights) differ
– Survey results show 50% female and 50% male
– Then adjust female weight by 0.52/0.50 = 1.04 and male weight by 0.48/0.50 = 0.96
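The adjustment in the example is just a ratio of known to observed shares per group, which a short sketch makes explicit:

```python
# Poststratification factors: (known population share) / (weighted survey share).
pop_share = {"female": 0.52, "male": 0.48}      # known population totals
survey_share = {"female": 0.50, "male": 0.50}   # shares observed in the survey

adjustment = {g: pop_share[g] / survey_share[g] for g in pop_share}
print(adjustment)
```

Each respondent's existing weight would then be multiplied by the factor for their group.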
Poststratification Weighting, Part II
• When there are multiple poststratification variables, things can get pretty complicated
• Options
– Raking (aka sample balancing)
  • Method for sequentially adjusting weights for each variable until the weights converge
  • See www.abtassociates.com/presentations/raking_survey_data_2_JOS.pdf
– Logistic regression
  • Fit a logistic regression model and use the predicted probabilities as weights
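Raking can be illustrated with a bare-bones iterative proportional fitting loop over two margins. All counts here are made up, and production raking would use a survey package rather than hand-rolled code:

```python
# Minimal raking (iterative proportional fitting) sketch over two margins.
# Cells index (row = sex, col = age group); targets are known population margins.
cells = [[40.0, 60.0],
         [50.0, 50.0]]          # initial weighted counts (illustrative)
row_targets = [104.0, 96.0]     # e.g., population counts by sex
col_targets = [90.0, 110.0]     # e.g., population counts by age group

for _ in range(50):             # alternate margin adjustments until convergence
    # Scale rows to match the row targets.
    for i, target in enumerate(row_targets):
        s = sum(cells[i])
        cells[i] = [c * target / s for c in cells[i]]
    # Scale columns to match the column targets.
    for j, target in enumerate(col_targets):
        s = cells[0][j] + cells[1][j]
        cells[0][j] *= target / s
        cells[1][j] *= target / s

print([round(sum(row), 6) for row in cells])
```

After convergence both sets of margins match their targets simultaneously, which is exactly the "sequentially adjust for each variable" idea.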
Remember: Two Types of Nonresponse – Unit and Item
• Nonrespondents handled via appropriate weighting
• What about respondents who failed/refused to answer one or more items?
Imputation
• Imputation is the substitution of some value for a missing data point
– Used for item nonresponse
– It's within-sample inference
• Naïve survey analysts sometimes ignore all records that are not complete (aka casewise deletion)
– Can be a terrible waste of data/information
• Why ignore a whole record because one item is missing?
– Likely paid a lot to collect the data, so should not lightly ignore/delete data
• If a respondent fails to answer one or more items, still would like to use the rest of their data
Casewise Deletion is Also Imputation (But Usually Poor Imputation)
• Remember, goal is to infer from the sample to the population
• Deleting an observation is the same as taking the respondent out of the sample
– But they are still part of the population
– So it's equivalent to imputing (inferring) all of their data from the remaining sample
– Casewise deletion is implicit (vs. explicit) imputation
– Also, if missings are not random, can introduce bias
• However, sometimes casewise deletion is appropriate / necessary
Goal of (Explicit) Imputation
• Example: Analysis requires a multivariate regression using many covariates drawn from the survey data
– Without imputation, all records with one or more missing covariates cannot be used
• Goal: Impute values for the missing data in such a way that
– Information from the actual survey can be used
– Imputed data do not bias/affect results
Advantages and Disadvantages of Explicit Imputation
• Advantages
– Maximizes the use of all survey data
– Univariate analysis of each variable will have the same number of observations
• Disadvantages
– Some think of imputed data as "made up" data
– Statistical software often not designed to distinguish between real and imputed data
  • Results in improper standard error estimates, too-narrow confidence intervals, etc.
Some Common Imputation Methods
• There are many types of imputation methods
• We'll discuss four of the more common methods:
– Mean value imputation
– Regression imputation
– Hot deck imputation
– Multiple imputation
Mean Value Imputation
• Basic idea: replace missing values with the sample average for that item (ȳ_r)
• Advantages
– Easy
– Does not affect estimates of the mean
• Disadvantages
– Distorts the distribution (spike at the mean value)
– Can result in underestimation of standard errors
– Using the overall mean value may not be appropriate for all observations
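A minimal mean value imputation sketch, with missing answers represented as None in a toy item:

```python
# Mean value imputation: fill missing responses with the respondent mean y_bar_r.
responses = [4.0, None, 3.0, 5.0, None, 4.0]

answered = [y for y in responses if y is not None]
y_bar_r = sum(answered) / len(answered)          # mean over respondents only
imputed = [y if y is not None else y_bar_r for y in responses]

print(y_bar_r, imputed)
```

Note that the mean of the imputed series equals ȳ_r, illustrating the "does not affect estimates of the mean" advantage, while the two identical imputed values illustrate the spike at the mean.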
Stochastic Mean Value Imputation
• One solution for distribution distortion is to add "noise" to the imputed value:

  ȳ_r + ε_i, where ε_i ~ N(0, s_i²) and s_i² is the sample variance calculated on the nonmissing items for question i

• Improves the standard error underestimation problem
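A sketch of the stochastic variant, drawing the noise from a normal distribution with the respondent-based sample variance (toy data, fixed seed for reproducibility):

```python
import random
import statistics

# Stochastic mean value imputation: impute y_bar_r + eps, eps ~ N(0, s_i^2),
# with s_i^2 estimated from the nonmissing answers to the item.
random.seed(1)
responses = [4.0, None, 3.0, 5.0, None, 4.0]

answered = [y for y in responses if y is not None]
y_bar_r = sum(answered) / len(answered)
s_i = statistics.stdev(answered)                 # sample standard deviation

imputed = [y if y is not None else y_bar_r + random.gauss(0, s_i)
           for y in responses]
print(imputed)
```

Unlike plain mean imputation, the two filled-in values now differ from each other and from ȳ_r, so the imputed distribution no longer has a spike at the mean.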
Mean Value Imputation by Subgroups
• Mean value imputation by subgroup can sometimes provide a more accurate estimate
– E.g., if the data are an individual's weight, better to use the mean value after grouping by gender
• Can also do stochastic mean value imputation by subgroup
• Can generalize to subgroups based on multiple variables
– E.g., impute weight using subgroups based on gender and height categories
– Can quickly get out of hand computationally
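The body-weight example above can be sketched with per-group means (all numbers illustrative):

```python
# Subgroup mean imputation: impute a missing body weight with the mean weight
# of respondents of the same gender.
records = [
    {"gender": "F", "weight": 135.0},
    {"gender": "F", "weight": None},
    {"gender": "F", "weight": 145.0},
    {"gender": "M", "weight": 180.0},
    {"gender": "M", "weight": None},
    {"gender": "M", "weight": 190.0},
]

# Per-group means over respondents only.
group_means = {}
for g in {"F", "M"}:
    vals = [r["weight"] for r in records if r["gender"] == g and r["weight"] is not None]
    group_means[g] = sum(vals) / len(vals)

for r in records:
    if r["weight"] is None:
        r["weight"] = group_means[r["gender"]]

print(group_means)
```

Generalizing to multiple grouping variables just means keying the means on tuples (e.g., (gender, height category)), which is where the combinatorics can get out of hand.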
Regression Imputation (1)
• Can use regression to predict the mean value
– Better than simple mean value imputation when there are multiple variables
• First, use information from those with data to estimate the regression model

  y_{i(r)} = β_0 + β_1 x_{1i(r)} + ε_{i(r)}

• Then use the model to predict those missing data

  y′_{i(r)} = β_0 + β_1 x′_{1i(r)} + ε′_{i(r)}

where here the epsilon is added "noise"
Regression Imputation (2)
• Of course, can also do multiple regression

  y_{i(r)} = β_0 + β_1 x_{1i(r)} + β_2 x_{2i(r)} + ⋯ + β_p x_{pi(r)} + ε_{i(r)}

• However, must have data available on all covariates
– Sometimes too hard and/or results in too few observations available with all data
• To get around this, sometimes must do a series of imputations
– I.e., impute values with one model that become covariates in the next regression imputation model
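A single-covariate regression imputation can be sketched with hand-rolled least squares (toy data; a real analysis would use a statistics package):

```python
# Simple-regression imputation: fit y = b0 + b1*x on complete cases,
# then predict y where it is missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, None, 8.1, 9.9]          # y missing for x = 3

pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
n = len(pairs)
mx = sum(xi for xi, _ in pairs) / n
my = sum(yi for _, yi in pairs) / n
b1 = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
      / sum((xi - mx) ** 2 for xi, _ in pairs))   # OLS slope
b0 = my - b1 * mx                                 # OLS intercept

y_imputed = [yi if yi is not None else b0 + b1 * xi for xi, yi in zip(x, y)]
print(round(b0, 3), round(b1, 3), round(y_imputed[2], 3))
```

The slide's ε′ "noise" term could be added to the prediction exactly as in stochastic mean imputation; it is omitted here to keep the fit step visible.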
Hot Deck Imputation
• Hot deck imputation often used in large-scale imputation processes
• The name dates back to the use of computer punch cards
• Basic idea:
– Sort data by important variables
– Start at the top and replace any missing data with the value of the immediately preceding observation
– If the first one is missing, replace with an appropriate mean value
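A sequential hot deck sketch following the steps above (toy records; the code assumes the first sorted record is observed, otherwise the slide's mean-value fallback would apply):

```python
# Sequential hot deck: after sorting by an auxiliary variable, fill each
# missing value with the immediately preceding observed ("donor") value.
records = [
    {"age": 25, "income": 30000.0},
    {"age": 31, "income": None},
    {"age": 42, "income": 52000.0},
    {"age": 47, "income": None},
    {"age": 60, "income": 61000.0},
]

records.sort(key=lambda r: r["age"])     # sort by an important variable
last_seen = None
for r in records:
    if r["income"] is None:
        r["income"] = last_seen          # donor = immediately preceding observation
    else:
        last_seen = r["income"]

print([r["income"] for r in records])
```

Because the data are sorted on a variable related to the item, each donor tends to resemble the recipient, which is what makes the method work.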
Multiple Imputation
• More sophisticated (and complicated) imputation method
• Creates multiple imputed data sets (hence the name)
• Variation across the multiple data sets allows estimation of overall variation
– Including both sampling and imputation variance
• Requires specification of an "imputation model" and use of specialized software / methods
– E.g., see http://www.multiple-imputation.com/
Variance Estimation
• Complex sampling methods require nonstandard methods to estimate variances
– I.e., can't just plug the data into statistical software and use their standard errors
– (Very rare) exception: SRS with a large population and low nonresponse
• Software for (some) complex survey designs:
– Free: CENVAR, VPLX, CPLX, EpiInfo
– Commercial: SAS, Stata, SUDAAN, WesVar
• Two estimation methods: Taylor series expansion and jackknife
Variance Estimation (Taylor Series)
• Taylor series approximation: converts ratios into sums
• Example: Variance for the weighted mean

  ȳ_w = ( Σ_{i=1}^{n} w_i y_i ) / ( Σ_{i=1}^{n} w_i )

assuming a SRS can be expressed as

  Var(ȳ_w) ≈ [ Var(Σ w_i y_i) + ȳ_w² Var(Σ w_i) − 2 ȳ_w Cov(Σ w_i y_i, Σ w_i) ] / ( Σ w_i )²
Variance Estimation (Jackknife and Balanced Repeated Replication)
• Jackknife and balanced repeated replication methods rely on empirical methods
– Basically, resample from the data c times
– Calculate the overall mean as

  ȳ = (1/c) Σ_{γ=1}^{c} ȳ_γ

and then estimate the variance as

  v(ȳ) = [ 1 / (c(c−1)) ] Σ_{γ=1}^{c} ( ȳ_γ − ȳ )²
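The replicate formulas above can be sketched as follows. The resampling step here is a plain resample with replacement, standing in for a real jackknife or BRR replication scheme, and all data are illustrative:

```python
import random

# Replication-based variance estimate as on the slide:
# y_bar = (1/c) * sum(y_gamma),  v(y_bar) = 1/(c(c-1)) * sum((y_gamma - y_bar)^2).
random.seed(0)
data = [4.0, 8.0, 6.0, 5.0, 7.0]
c = 10  # number of replicates

# Each replicate mean comes from one resample of the data.
rep_means = [sum(random.choices(data, k=len(data))) / len(data) for _ in range(c)]
y_bar = sum(rep_means) / c
v = sum((yg - y_bar) ** 2 for yg in rep_means) / (c * (c - 1))

print(round(y_bar, 4), round(v, 6))
```

In a real design, each replicate would drop or reweight whole PSUs according to the sample design rather than resampling individual observations.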
Survey Data Documentation and Metadata
• Data from large surveys often used by many researchers for many years
– Critical to carefully and fully document data, including weights and imputed data
• Metadata: data about data
– Sometimes called a data dictionary or codebook
• Types of metadata
– Definitional
– Procedural
– Operational
– Systems
What We Have Covered
• Discussed coding, how to code survey data, and various types of editing
• Described the various types of survey weights and how to calculate and use them
• Defined what imputation is
– Described some of the most common methods
– Discussed techniques for variance estimation
• Talked a bit about survey data documentation and metadata