
Postcollection Processing of Survey Data

Professor Ron Fricker, Naval Postgraduate School
Monterey, California

Goals for this Lecture

• Discuss coding, how to code survey data, and various types of editing

• Describe the various types of survey weights and how to calculate and use them

• Define what imputation is
  – Describe some of the most common methods
  – Discuss techniques for variance estimation

• A bit about survey data documentation and metadata

Post-survey Processing

• There are a number of steps between collection of survey data and analysis
  – Varies by mode of data collection
  – Easier or harder depending on a number of factors

Terminology

• Coding: Process of turning text-based answers into numerically coded categories

• Data entry: Process of entering numeric data into data files

• Editing: Examination of the data file to detect errors and inconsistencies (to possibly correct or delete)

• Imputation: Repair of item-level missing data by estimating and inserting an answer in the data field

• Weighting: Adjustment of survey statistic computations to account for sampling design, nonresponse, and noncoverage

• Variance estimation: Computation of the variance of the sampling distribution of a statistic

Coding for Closed-ended Items

• Example: "Indicate the extent to which you agree or disagree with the following statement: OA4109 is the best class I've taken at NPS."

  Strongly agree      1
  Agree               2
  Neutral             3
  Disagree            4
  Strongly disagree   5

• When creating codes, be sure the direction of the scale from positive to negative is always the same
  – Makes it easier to interpret and report results
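The response-to-code mapping above can be sketched as a simple lookup; the dictionary and function names here are hypothetical, not part of the lecture.

```python
# Hypothetical coding scheme for the Likert item above: the positive end
# of the scale always gets code 1, so the direction reads the same everywhere.
LIKERT_CODES = {
    "Strongly agree": 1,
    "Agree": 2,
    "Neutral": 3,
    "Disagree": 4,
    "Strongly disagree": 5,
}

def code_response(text):
    """Map a verbatim closed-ended answer to its numeric code."""
    return LIKERT_CODES[text.strip()]
```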

Coding for Open-ended Items

• Coding is both an act of translation and an act of summarization

  "Well, in our house we have a dog, a cat, and two parakeets. Oh yeah, and my stepdaughter visits every Tuesday and brings her ferret…"

  "code = 5"

Example of Field Coding


Example of Standard Codes


Editing

• Most basic form of editing is accomplished via different kinds of data checks
  – Range checks
  – Ratio checks
  – Balance checks
  – Outlier checks
  – Consistency checks
  – Logic checks
  – Comparisons to historical data

• Basic point: Sanity check the raw data file
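Two of the checks above can be sketched as follows; the field names, ranges, and the age-versus-service rule are made-up illustrations, not from the lecture.

```python
# A minimal sketch of a range check and a consistency check on one raw
# record. Field names ("age", "years_of_service") and the plausible
# ranges are hypothetical.
def range_check(value, low, high):
    """Flag values outside a plausible range."""
    return low <= value <= high

def consistency_check(record):
    """Flag records where age and years of service conflict:
    nobody enters service before roughly age 17."""
    return record["age"] - record["years_of_service"] >= 17

rec = {"age": 25, "years_of_service": 3}
ok = range_check(rec["age"], 17, 70) and consistency_check(rec)
```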

Editing Suggestions

• Graphical plots help easily identify odd data
  – Bar charts, histograms, and scatterplots
  – Don't forget to look at the amount of missingness

• Can identify problem items or skip patterns

• On longer / complicated instruments, logic checks important
  – If service = Navy and rank in {1LT, 2LT, CPT, MAJ, LTC, COL} then problem_flag_1 = 1
  – If family_sep_allowance > 0 and deploy_ind = 0 then problem_flag_2 = 1
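The two logic checks above can be written out directly; the field names follow the slide, while the record layout (a dict per respondent) is an assumption.

```python
# The two logic checks from the slide. A Navy respondent should not carry
# an Army/Air Force officer rank, and family separation allowance without
# a deployment indicator is inconsistent.
ARMY_RANKS = {"1LT", "2LT", "CPT", "MAJ", "LTC", "COL"}

def logic_flags(record):
    flags = {"problem_flag_1": 0, "problem_flag_2": 0}
    if record["service"] == "Navy" and record["rank"] in ARMY_RANKS:
        flags["problem_flag_1"] = 1
    if record["family_sep_allowance"] > 0 and record["deploy_ind"] == 0:
        flags["problem_flag_2"] = 1
    return flags
```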

Weighting

• Four basic types:
  – Weighting as a first-stage ratio adjustment
  – Weighting for differential selection probabilities
  – Weighting to adjust for unit nonresponse
  – Poststratification weighting

• Variance reduction
• Undercoverage
• Unit nonresponse

What is a Weight?

• Think of it like the number of units in the population that the respondent represents

• SRS is simplest: imagine a population of size N = 2,000,000 from which we take a sample of size n = 1,000
  – Then the probability of selecting any one unit is p = n/N = 0.0005
  – Each unit in the sample gets a weight of 1/p = N/n = 2,000

• That is, each unit in the sample represents 2,000 units in the population
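The SRS arithmetic above is a two-line computation:

```python
# The SRS weight example from the slide: N = 2,000,000, n = 1,000.
N, n = 2_000_000, 1_000
p = n / N        # selection probability for any one unit
weight = 1 / p   # each respondent represents N/n = 2,000 population units
```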

Weighting for First-stage Ratio Adjustment

• In stratified multi-stage designs, primary sampling units (PSUs) are sampled
  – Usually the sampling probability is proportional to some size measure

• Weight is calculated as

$$ W_{1i} = \frac{\text{Stratum population total from frame}}{\text{Population total for selected PSU} \,/\, \Pr(\text{selecting PSU})} $$

• Called the "first-stage ratio adjustment weight"

Weighting for Differential Selection Probabilities

• Differential selection probability means some units in the population have a higher or lower probability of being selected into the sample
  – For example, stratification usually results in differential selection probabilities

• Weight is just the inverse of the selection probability

• Weighted mean is

$$ \bar{y}_w = \frac{\sum_{i=1}^{n} w_{2i}\, y_i}{\sum_{i=1}^{n} w_{2i}} $$

Examples

• Consider SRS (just to show it makes sense): with $w_i = N/n$ for all $i$,

$$ \bar{y}_w = \frac{\sum_{i=1}^{n} (N/n)\, y_i}{\sum_{i=1}^{n} (N/n)} = \frac{(N/n) \sum_{i=1}^{n} y_i}{n\,(N/n)} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y} $$

• Now, assume two strata with $N_1 + N_2 = N$, samples of size $n_1 < N_1$ and $n_2 < N_2$, and $n_1 + n_2 = n$: with $w_i = N_1/n_1$ in stratum 1 and $w_i = N_2/n_2$ in stratum 2,

$$ \bar{y}_w = \frac{\sum_{i=1}^{n_1} (N_1/n_1)\, y_i + \sum_{i=1}^{n_2} (N_2/n_2)\, y_i}{\sum_{i=1}^{n_1} N_1/n_1 + \sum_{i=1}^{n_2} N_2/n_2} = \frac{N_1 \bar{y}_1 + N_2 \bar{y}_2}{N} $$
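The stratified identity above can be verified numerically; all stratum sizes and data values below are made up for illustration.

```python
# Numeric check of the two-strata result: with w_i = N_h/n_h, the weighted
# mean equals (N1*ybar1 + N2*ybar2)/N. Stratum sizes and y values are made up.
N1, N2 = 1200, 800
y1 = [10.0, 12.0, 14.0]   # stratum 1 sample (n1 = 3), weight N1/n1 = 400
y2 = [20.0, 22.0]         # stratum 2 sample (n2 = 2), weight N2/n2 = 400
w = [N1 / len(y1)] * len(y1) + [N2 / len(y2)] * len(y2)
y = y1 + y2
ybar_w = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
closed_form = (N1 * (sum(y1) / len(y1)) + N2 * (sum(y2) / len(y2))) / (N1 + N2)
```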

Weighting to Adjust for Unit Nonresponse

• Can only adjust for nonresponse using variables known for everyone in the sample
  – Has to be based on external (not survey) data
  – Often these are demographic types of variables

• Adjustment assumes data are missing at random (MAR) within observed groups

• Weight is the inverse of the response rate for each categorical group
  – Same idea as the weight for differential selection probability
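The inverse-response-rate idea above can be sketched in a few lines; the groups and counts here are hypothetical, not the lecture's example.

```python
# Nonresponse adjustment sketch: within each group known for the whole
# sample, the weight is the inverse of that group's response rate.
# Group labels and counts are made up.
sampled = {"male": 500, "female": 500}      # units drawn into the sample
responded = {"male": 400, "female": 450}    # units that actually responded
nr_weight = {g: sampled[g] / responded[g] for g in sampled}
```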

Nonresponse Weight Example

$$ w_{2(\text{Latino})} = 24{,}937{,}500 / 62{,}500 = 399 $$

$$ w_{2(\text{Non-Latino})} = 174{,}562{,}500 / 62{,}500 = 2{,}793 $$

So Latinos were oversampled at a rate 7 times that of Non-Latinos (2,793 / 399 = 7).

Poststratification Weighting

• Most frequently used to ensure survey totals match known population totals

• Example: Population is known to be 52% female and 48% male
  – However, survey results – perhaps using first-stage and/or nonresponse weights – differ
  – Survey results show 50% female and 50% male
  – Then adjust the female weight by 0.52/0.50 = 1.04 and the male weight by 0.48/0.50 = 0.96
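The adjustment factors in the example above are just ratios of population share to survey share:

```python
# Poststratification factors from the example: scale survey shares (50/50)
# to the known population shares (52% female, 48% male).
pop_share = {"female": 0.52, "male": 0.48}
survey_share = {"female": 0.50, "male": 0.50}
adjust = {g: pop_share[g] / survey_share[g] for g in pop_share}
```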

Poststratification Weighting, Part II

• When there are multiple poststratification variables, can get pretty complicated

• Options
  – Raking (aka sample balancing)
    • Method for sequentially adjusting weights for each variable until converges
    • See www.abtassociates.com/presentations/raking_survey_data_2_JOS.pdf
  – Logistic regression
    • Fit logistic regression model and use predicted probabilities as weights
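Raking's sequential adjustment can be sketched as iterative proportional fitting on a small two-way table; the starting table and both sets of margin targets below are made up for illustration.

```python
# A minimal raking (iterative proportional fitting) sketch on a 2x2 table
# of weighted counts. Rows are scaled to their target margins, then
# columns, and the two steps repeat until both margins (nearly) match.
def rake(table, row_targets, col_targets, iters=100):
    """Alternately scale rows, then columns, toward their target margins."""
    t = [row[:] for row in table]
    for _ in range(iters):
        for i, target in enumerate(row_targets):
            s = sum(t[i])
            t[i] = [x * target / s for x in t[i]]
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in t)
            for row in t:
                row[j] *= target / s
    return t

# Made-up table and margins (both margin sets total 100)
raked = rake([[30.0, 20.0], [25.0, 25.0]], [52.0, 48.0], [55.0, 45.0])
```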

Weighting: Putting It All Together


Remember: Two Types of Nonresponse – Unit and Item

• Nonrespondents handled via appropriate weighting

• What about respondents who failed/refused to answer one or more items?

Imputation

• Imputation is the substitution of some value for a missing data point
  – Used for item nonresponse
  – It's within-sample inference

• Naïve survey analysts sometimes ignore all records that are not complete (aka casewise deletion)
  – Can be a terrible waste of data/information

• Why ignore a whole record because one item is missing?
  – Likely paid a lot to collect the data, so should not lightly ignore/delete data

• If a respondent fails to answer one or more items, still would like to use the rest of their data

Casewise Deletion is Also Imputation (But Usually Poor Imputation)

• Remember, goal is to infer from the sample to the population

• Deleting an observation is the same as taking the respondent out of the sample
  – But they are still part of the population
  – So it's equivalent to imputing (inferring) all of their data from the remaining sample
  – Casewise deletion is implicit (vs. explicit) imputation
  – Also, if missings not random, can introduce bias

• However, sometimes casewise deleting is appropriate / necessary

Goal of (Explicit) Imputation

• Example: Analysis requires a multivariate regression using many covariates drawn from the survey data
  – Without imputation, all records with one or more missing covariates cannot be used

• Goal: Impute values for the missing data in such a way that
  – Information from the actual survey can be used
  – Imputed data does not bias/affect results

Advantages and Disadvantages of Explicit Imputation

• Advantages
  – Maximizes the use of all survey data
  – Univariate analysis of each variable will have the same number of observations

• Disadvantages
  – Some think of imputed data as "made up" data
  – Statistical software often not designed to distinguish between real and imputed data
    • Results in improper standard error estimates, too-narrow confidence intervals, etc.

Some Common Imputation Methods

• There are many types of imputation methods

• We'll discuss four of the more common methods:
  – Mean value imputation
  – Regression imputation
  – Hot deck imputation
  – Multiple imputation

Mean Value Imputation

• Basic idea: replace missing values with the sample average for that item ($\bar{y}_r$)

• Advantages
  – Easy
  – Does not affect estimates of the mean

• Disadvantages
  – Distorts the distribution (spike at the mean value)
  – Can result in underestimation of standard errors
  – Using the overall mean value may not be appropriate for all observations
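The basic idea above is a one-pass replacement; `None` marks item nonresponse in this sketch.

```python
# Mean value imputation sketch: replace each missing item (None) with the
# mean of the observed responses for that item.
def mean_impute(values):
    observed = [v for v in values if v is not None]
    ybar_r = sum(observed) / len(observed)   # respondent mean
    return [ybar_r if v is None else v for v in values]

filled = mean_impute([4.0, None, 6.0, None])
```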

Stochastic Mean Value Imputation

• One solution for distribution distortion is to add "noise" to the imputed value: impute $\bar{y}_r + \varepsilon_i$, where $\varepsilon_i \sim N(0, s_i^2)$ and $s_i^2$ is the sample variance calculated on the nonmissing items for question $i$

• Improves the standard error underestimation problem
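A minimal sketch of the noise-added version, estimating $s_i$ from the nonmissing values; the data values are made up.

```python
# Stochastic mean value imputation sketch: impute ybar_r + N(0, s^2) noise,
# with s estimated from the observed (nonmissing) responses.
import random
import statistics

def stochastic_mean_impute(values, rng):
    observed = [v for v in values if v is not None]
    ybar_r = statistics.mean(observed)
    s = statistics.stdev(observed)   # sample std dev of observed items
    return [ybar_r + rng.gauss(0, s) if v is None else v for v in values]

filled = stochastic_mean_impute([4.0, None, 6.0, 8.0], random.Random(0))
```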

Mean Value Imputation by Subgroups

• Mean value imputation by subgroup can sometimes provide a more accurate estimate
  – E.g., if the data is an individual's weight, better to use the mean value after grouping by gender

• Can also do stochastic mean value imputation by subgroup

• Can generalize to subgroups based on multiple variables
  – E.g., impute weight using subgroups based on gender and height categories
  – Can quickly get out of hand computationally

Regression Imputation (1)

• Can use regression to predict the mean value
  – Better than simple mean value imputation when there are multiple variables

• First, use information from those with data to estimate the regression model

$$ y_{i(r)} = \beta_0 + \beta_1 x_{1i(r)} + \varepsilon_{i(r)} $$

• Then use the model to predict for those missing data

$$ y'_{i(r)} = \beta_0 + \beta_1 x'_{1i(r)} + \varepsilon'_{i(r)} $$

where $\varepsilon'_{i(r)}$ is added "noise"
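The single-covariate model can be sketched with ordinary least squares on the complete cases; this version omits the added noise, and all data values are made up.

```python
# Regression imputation sketch (simple linear model, no added noise):
# fit beta0 and beta1 on the complete cases, then predict the missing
# y values from their observed x.
def regression_impute(xs, ys):
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in pairs)
          / sum((x - xbar) ** 2 for x, _ in pairs))
    b0 = ybar - b1 * xbar
    return [b0 + b1 * x if y is None else y for x, y in zip(xs, ys)]

# Complete cases lie exactly on y = 2x, so the missing value imputes to 6
filled = regression_impute([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, None, 8.0])
```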

Regression Imputation (2)

• Of course, can also do multiple regression

$$ y_{i(r)} = \beta_0 + \beta_1 x_{1i(r)} + \beta_2 x_{2i(r)} + \cdots + \beta_p x_{pi(r)} + \varepsilon_{i(r)} $$

• However, must have data available on all covariates
  – Sometimes too hard and/or results in too few observations available with all data

• To get around this, sometimes must do a series of imputations
  – I.e., impute values with one model that become covariates in the next regression imputation model

Hot Deck Imputation

• Hot deck imputation often used in large-scale imputation processes

• The name dates back to the use of computer punch cards

• Basic idea:
  – Sort data by important variables
  – Start at the top and replace any missing data with the value of the immediately preceding observation
  – If the first one is missing, replace with an appropriate mean value
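The sort-and-carry-forward idea above can be sketched as follows; the sort keys and field names are hypothetical, and the first-record-missing case is left out for brevity.

```python
# Hot deck sketch: sort records by the important variables, then replace
# each missing value with the immediately preceding record's value.
# (Mutates the records in place; a sketch, not production code.)
def hot_deck(records, sort_keys, target):
    out = sorted(records, key=lambda r: [r[k] for k in sort_keys])
    for prev, cur in zip(out, out[1:]):
        if cur[target] is None:
            cur[target] = prev[target]
    return out

records = [{"grp": 1, "y": 10}, {"grp": 1, "y": None}, {"grp": 2, "y": 30}]
filled = hot_deck(records, ["grp"], "y")
```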

Hot Deck Imputation Example


Multiple Imputation

• More sophisticated (and complicated) imputation method

• Creates multiple imputed data sets (hence the name)

• Variation across the multiple data sets allows estimation of overall variation
  – Including both sampling and imputation variance

• Requires specification of an "imputation model" and use of specialized software / methods
  – E.g., see http://www.multiple-imputation.com/

Variance Estimation

• Complex sampling methods require nonstandard methods to estimate variances
  – I.e., can't just plug the data into statistical software and use their standard errors
  – (Very rare) exception: SRS with a large population and low nonresponse

• Software for (some) complex survey designs:
  – Free: CENVAR, VPLX, CPLX, EpiInfo
  – Commercial: SAS, Stata, SUDAAN, WesVar

• Two estimation methods: Taylor series expansion and Jackknife

Variance Estimation (Taylor Series)

• Taylor series approximation: converts ratios into sums

• Example: The variance of the weighted mean

$$ \bar{y}_w = \sum_{i=1}^{n} w_i y_i \Big/ \sum_{i=1}^{n} w_i $$

assuming an SRS can be expressed as

$$ \mathrm{Var}(\bar{y}_w) \approx \frac{\mathrm{Var}\!\left(\sum_i w_i y_i\right) + \bar{y}_w^2\, \mathrm{Var}\!\left(\sum_i w_i\right) - 2\,\bar{y}_w\, \mathrm{Cov}\!\left(\sum_i w_i y_i,\, \sum_i w_i\right)}{\left(\sum_i w_i\right)^2} $$

Variance Estimation (Jackknife and Balanced Repeated Replication)

• Jackknife and balanced repeated replication methods rely on empirical methods
  – Basically, resample from the data $c$ times
  – Calculate the overall mean as

$$ \bar{y} = \frac{1}{c} \sum_{\gamma=1}^{c} \bar{y}_\gamma $$

and then estimate the variance as

$$ v(\bar{y}) = \frac{1}{c(c-1)} \sum_{\gamma=1}^{c} \left( \bar{y}_\gamma - \bar{y} \right)^2 $$
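The two replication formulas above translate directly into code; the replicate estimates below are made-up values.

```python
# The replication formulas above: the overall mean is the average of the
# c replicate estimates, and the variance estimate is the scaled sum of
# squared deviations of the replicates from that mean.
def replication_variance(reps):
    c = len(reps)
    ybar = sum(reps) / c
    v = sum((yg - ybar) ** 2 for yg in reps) / (c * (c - 1))
    return ybar, v

ybar, v = replication_variance([10.0, 11.0, 9.0, 12.0])
```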

Survey Data Documentation and Metadata

• Data from large surveys often used by many researchers for many years
  – Critical to carefully and fully document data, including weights and imputed data

• Metadata: data about data
  – Sometimes called a data dictionary or codebook

• Types of metadata
  – Definitional
  – Procedural
  – Operational
  – Systems

What We Have Covered

• Discussed coding, how to code survey data, and various types of editing

• Described the various types of survey weights and how to calculate and use them

• Defined what imputation is
  – Described some of the most common methods
  – Discussed techniques for variance estimation

• Talked a bit about survey data documentation and metadata