Postcollection Processing of Survey Data
TRANSCRIPT

Professor Ron Fricker
Naval Postgraduate School
Monterey, California
Goals for this Lecture
• Discuss coding, how to code survey data, and various types of editing
• Describe the various types of survey weights and how to calculate and use them
• Define what imputation is
– Describe some of the most common methods
– Discuss techniques for variance estimation
• A bit about survey data documentation and metadata
Post-survey Processing
• There are a number of steps between collection of survey data and analysis
– Varies by mode of data collection
– Easier or harder depending on a number of factors
Terminology
• Coding: Process of turning text-based answers into numerically-coded categories
• Data entry: Process of entering numeric data into data files
• Editing: Examination of data file to detect errors and inconsistencies (to possibly correct or delete)
• Imputation: Repair of item-level missing data by estimating and inserting an answer in the data field
• Weighting: Adjustment of survey statistic computations to account for sampling design, nonresponse, and noncoverage
• Variance estimation: Computation of the variance of the sampling distribution of a statistic
Coding for Closed-ended Items
• Example: "Indicate the extent to which you agree or disagree with the following statement: OA4109 is the best class I've taken at NPS."

  Response            Code
  Strongly agree      1
  Agree               2
  Neutral             3
  Disagree            4
  Strongly disagree   5

• When creating codes, be sure the direction of the scale from positive to negative is always the same
– Makes it easier to interpret and report results
Coding for Open-ended Items
• Coding is both an act of translation and an act of summarization

  "Well, in our house we have a dog, a cat, and two parakeets. Oh yeah, and my stepdaughter visits every Tuesday and brings her ferret…"

  "code = 5"
Editing
• Most basic form of editing is accomplished via different kinds of data checks
– Range checks
– Ratio checks
– Balance checks
– Outlier checks
– Consistency checks
– Logic checks
– Comparisons to historical data
• Basic point: Sanity check the raw data file
Editing Suggestions
• Graphical plots help easily identify odd data
– Bar charts, histograms, and scatterplots
– Don't forget to look at the amount of missingness
• Can identify problem items or skip patterns
• On longer / complicated instruments, logic checks are important
– If service = Navy and rank in {1LT, 2LT, CPT, MAJ, LTC, COL} then problem_flag_1 = 1
– If family_sep_allowance > 0 and deploy_ind = 0 then problem_flag_2 = 1
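The two logic checks above can be sketched in a few lines of Python. The records and field names are toy data mirroring the slide's examples, not a real instrument:

```python
# Toy survey records; field names mirror the slide's illustrative checks.
records = [
    {"service": "Navy", "rank": "LTJG", "family_sep_allowance": 0,    "deploy_ind": 1},
    {"service": "Army", "rank": "CPT",  "family_sep_allowance": 1200, "deploy_ind": 1},
    {"service": "Navy", "rank": "MAJ",  "family_sep_allowance": 500,  "deploy_ind": 0},
]

army_ranks = {"1LT", "2LT", "CPT", "MAJ", "LTC", "COL"}
for r in records:
    # Logic check 1: a Navy respondent should not report an Army officer rank.
    r["problem_flag_1"] = int(r["service"] == "Navy" and r["rank"] in army_ranks)
    # Logic check 2: a family separation allowance implies a deployment.
    r["problem_flag_2"] = int(r["family_sep_allowance"] > 0 and r["deploy_ind"] == 0)

print([r["problem_flag_1"] for r in records], [r["problem_flag_2"] for r in records])
```

In practice such flags would be computed over the full raw data file and the flagged records reviewed by hand.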
Weighting
• Four basic types:
– Weighting as a first-stage ratio adjustment
– Weighting for differential selection probabilities
– Weighting to adjust for unit nonresponse
– Poststratification weighting
• Variance reduction
• Undercoverage
• Unit nonresponse
What is a Weight?
• Think of it like the number of units in the population that the respondent represents
• SRS is simplest: imagine a population of size N = 2,000,000 from which we take a sample of size n = 1,000
– Then the probability of selecting any one unit is p = n/N = 0.0005
– Each unit in the sample gets a weight of 1/p = N/n = 2,000
• That is, each unit in the sample represents 2,000 units in the population
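The SRS weight arithmetic above can be scripted directly as a sanity check, using the numbers from the example:

```python
# SRS weighting example from the slide: N = 2,000,000, n = 1,000.
N = 2_000_000   # population size
n = 1_000       # sample size

p = n / N       # probability that any one unit is selected
weight = 1 / p  # each sampled unit "represents" N/n = 2,000 population units

print(p, weight)
```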
Weighting for First-stage Ratio Adjustment
• In stratified multi-stage designs, primary sampling units (PSUs) are sampled
– Usually the sampling probability is proportional to some size measure
• Weight calculated as

  W_{1i} = (Stratum pop'n total from frame) / (Pop'n total for selected PSU / Pr(selecting PSU))

• Called the "first-stage ratio adjustment weight"
Weighting for Differential Selection Probabilities
• Differential selection probability means some units in the population have a higher or lower probability of being selected into the sample
– For example, stratification usually results in differential selection probabilities
• Weight is just the inverse of the selection probability
• Weighted mean is

  ȳ_w = ( Σ_{i=1}^{n} w_i y_i ) / ( Σ_{i=1}^{n} w_i )
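The weighted mean can be computed directly from the formula; here is a minimal Python sketch with made-up responses and selection probabilities:

```python
# Weighted mean: y_bar_w = sum(w_i * y_i) / sum(w_i).
# Toy data: three respondents with unequal selection probabilities.
y = [10.0, 20.0, 30.0]
p = [0.1, 0.5, 0.5]            # selection probabilities (illustrative)
w = [1 / pi for pi in p]       # weight = inverse of selection probability

y_bar_w = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
print(y_bar_w)
```

Note how the respondent with the low selection probability (and hence the large weight) pulls the weighted mean toward its value.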
Examples
• Consider SRS (just to show it makes sense):

  ȳ_w = ( Σ_{i=1}^{n} w_i y_i ) / ( Σ_{i=1}^{n} w_i )
      = ( Σ_{i=1}^{n} (N/n) y_i ) / ( Σ_{i=1}^{n} (N/n) )
      = (1/n) Σ_{i=1}^{n} y_i
      = ȳ

• Now, assume two strata with N_1 + N_2 = N, samples of size n_1 < N_1 and n_2 < N_2, and n_1 + n_2 = n:

  ȳ_w = ( Σ_{i=1}^{n_1} (N_1/n_1) y_{1i} + Σ_{i=1}^{n_2} (N_2/n_2) y_{2i} ) / ( n_1 (N_1/n_1) + n_2 (N_2/n_2) )
      = ( N_1 ȳ_1 + N_2 ȳ_2 ) / N
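The two-strata identity can be verified numerically. The stratum sizes and sample values below are made up for illustration:

```python
# Two-stratum check: the weighted mean should equal (N1*y_bar1 + N2*y_bar2) / N.
N1, N2 = 1000, 4000                      # stratum population sizes (illustrative)
s1 = [2.0, 4.0]                          # stratum 1 sample, n1 = 2
s2 = [10.0, 20.0, 30.0, 40.0]            # stratum 2 sample, n2 = 4

w = [N1 / len(s1)] * len(s1) + [N2 / len(s2)] * len(s2)   # w_i = N_h / n_h
y = s1 + s2
y_bar_w = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

y_bar1 = sum(s1) / len(s1)
y_bar2 = sum(s2) / len(s2)
direct = (N1 * y_bar1 + N2 * y_bar2) / (N1 + N2)
print(y_bar_w, direct)
```

Both computations agree, which is the point of the derivation: the weighted mean is just the population-share-weighted combination of the stratum means.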
Weighting to Adjust for Unit Nonresponse
• Can only adjust for nonresponse using variables known for everyone in the sample
– Has to be based on external (not survey) data
– Often these are demographic types of variables
• Adjustment assumes data are missing at random (MAR) within observed groups
• Weight is the inverse of the response rate for each categorical group
– Same idea as the weight for differential selection probability
Nonresponse Weight Example
  w_2(Latino) = 24,937,500 / 62,500 = 399
  w_2(Non-Latino) = 174,562,500 / 62,500 = 2,793

So, Latinos were oversampled at a rate of 7 times that of Non-Latinos
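The arithmetic in the example checks out directly (the "7 times" figure is just the ratio of the two weights):

```python
# Weights from the slide's Latino / Non-Latino example.
w_latino = 24_937_500 / 62_500       # = 399
w_non_latino = 174_562_500 / 62_500  # = 2,793

oversampling_rate = w_non_latino / w_latino  # how much more heavily Latinos were sampled
print(w_latino, w_non_latino, oversampling_rate)
```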
Poststratification Weighting
• Most frequently used to ensure survey totals match known population totals
• Example: Population is known to be 52% female and 48% male
– However, survey results (perhaps using first-stage and/or nonresponse weights) differ
– Survey results show 50% female and 50% male
– Then adjust female weight by 0.52/0.50 = 1.04 and male weight by 0.48/0.50 = 0.96
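The adjustment in the example is just a ratio of known to observed shares per group, which a short sketch makes explicit:

```python
# Poststratification factors: (known population share) / (weighted survey share).
pop_share = {"female": 0.52, "male": 0.48}      # known population totals
survey_share = {"female": 0.50, "male": 0.50}   # shares observed in the survey

adjustment = {g: pop_share[g] / survey_share[g] for g in pop_share}
print(adjustment)
```

Each respondent's existing weight would then be multiplied by the factor for their group.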
Poststratification Weighting, Part II
• When there are multiple poststratification variables, things can get pretty complicated
• Options
– Raking (aka sample balancing)
  • Method for sequentially adjusting weights for each variable until the weights converge
  • See www.abtassociates.com/presentations/raking_survey_data_2_JOS.pdf
– Logistic regression
  • Fit a logistic regression model and use the predicted probabilities as weights
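Raking can be illustrated with a bare-bones iterative proportional fitting loop over two margins. All counts here are made up, and production raking would use a survey package rather than hand-rolled code:

```python
# Minimal raking (iterative proportional fitting) sketch over two margins.
# Cells index (row = sex, col = age group); targets are known population margins.
cells = [[40.0, 60.0],
         [50.0, 50.0]]          # initial weighted counts (illustrative)
row_targets = [104.0, 96.0]     # e.g., population counts by sex
col_targets = [90.0, 110.0]     # e.g., population counts by age group

for _ in range(50):             # alternate margin adjustments until convergence
    # Scale rows to match the row targets.
    for i, target in enumerate(row_targets):
        s = sum(cells[i])
        cells[i] = [c * target / s for c in cells[i]]
    # Scale columns to match the column targets.
    for j, target in enumerate(col_targets):
        s = cells[0][j] + cells[1][j]
        cells[0][j] *= target / s
        cells[1][j] *= target / s

print([round(sum(row), 6) for row in cells])
```

After convergence both sets of margins match their targets simultaneously, which is exactly the "sequentially adjust for each variable" idea.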
Remember: Two Types of Nonresponse – Unit and Item
• Nonrespondents handled via appropriate weighting
• What about respondents who failed/refused to answer one or more items?
Imputation
• Imputation is the substitution of some value for a missing data point
– Used for item nonresponse
– It's within-sample inference
• Naïve survey analysts sometimes ignore all records that are not complete (aka casewise deletion)
– Can be a terrible waste of data/information
• Why ignore a whole record because one item is missing?
– Likely paid a lot to collect the data, so should not lightly ignore/delete data
• If a respondent fails to answer one or more items, still would like to use the rest of their data
Casewise Deletion is Also Imputation (But Usually Poor Imputation)
• Remember, goal is to infer from the sample to the population
• Deleting an observation is the same as taking the respondent out of the sample
– But they are still part of the population
– So it's equivalent to imputing (inferring) all of their data from the remaining sample
– Casewise deletion is implicit (vs. explicit) imputation
– Also, if missings are not random, can introduce bias
• However, sometimes casewise deletion is appropriate / necessary
Goal of (Explicit) Imputation
• Example: Analysis requires a multivariate regression using many covariates drawn from the survey data
– Without imputation, all records with one or more missing covariates cannot be used
• Goal: Impute values for the missing data in such a way that
– Information from the actual survey can be used
– Imputed data do not bias/affect results
Advantages and Disadvantages of Explicit Imputation
• Advantages
– Maximizes the use of all survey data
– Univariate analysis of each variable will have the same number of observations
• Disadvantages
– Some think of imputed data as "made up" data
– Statistical software often not designed to distinguish between real and imputed data
  • Results in improper standard error estimates, too-narrow confidence intervals, etc.
Some Common Imputation Methods
• There are many types of imputation methods
• We'll discuss four of the more common methods:
– Mean value imputation
– Regression imputation
– Hot deck imputation
– Multiple imputation
Mean Value Imputation
• Basic idea: replace missing values with the sample average for that item (ȳ_r)
• Advantages
– Easy
– Does not affect estimates of the mean
• Disadvantages
– Distorts the distribution (spike at the mean value)
– Can result in underestimation of standard errors
– Using the overall mean value may not be appropriate for all observations
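A minimal mean value imputation sketch, with missing answers represented as None in a toy item:

```python
# Mean value imputation: fill missing responses with the respondent mean y_bar_r.
responses = [4.0, None, 3.0, 5.0, None, 4.0]

answered = [y for y in responses if y is not None]
y_bar_r = sum(answered) / len(answered)          # mean over respondents only
imputed = [y if y is not None else y_bar_r for y in responses]

print(y_bar_r, imputed)
```

Note that the mean of the imputed series equals ȳ_r, illustrating the "does not affect estimates of the mean" advantage, while the two identical imputed values illustrate the spike at the mean.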
Stochastic Mean Value Imputation
• One solution for distribution distortion is to add "noise" to the imputed value:

  ȳ_r + ε_i, where ε_i ~ N(0, s_i²) and s_i² is the sample variance calculated on the nonmissing items for question i

• Improves the standard error underestimation problem
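A sketch of the stochastic variant, drawing the noise from a normal distribution with the respondent-based sample variance (toy data, fixed seed for reproducibility):

```python
import random
import statistics

# Stochastic mean value imputation: impute y_bar_r + eps, eps ~ N(0, s_i^2),
# with s_i^2 estimated from the nonmissing answers to the item.
random.seed(1)
responses = [4.0, None, 3.0, 5.0, None, 4.0]

answered = [y for y in responses if y is not None]
y_bar_r = sum(answered) / len(answered)
s_i = statistics.stdev(answered)                 # sample standard deviation

imputed = [y if y is not None else y_bar_r + random.gauss(0, s_i)
           for y in responses]
print(imputed)
```

Unlike plain mean imputation, the two filled-in values now differ from each other and from ȳ_r, so the imputed distribution no longer has a spike at the mean.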
Mean Value Imputation by Subgroups
• Mean value imputation by subgroup can sometimes provide a more accurate estimate
– E.g., if the data are an individual's weight, better to use the mean value after grouping by gender
• Can also do stochastic mean value imputation by subgroup
• Can generalize to subgroups based on multiple variables
– E.g., impute weight using subgroups based on gender and height categories
– Can quickly get out of hand computationally
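The body-weight example above can be sketched with per-group means (all numbers illustrative):

```python
# Subgroup mean imputation: impute a missing body weight with the mean weight
# of respondents of the same gender.
records = [
    {"gender": "F", "weight": 135.0},
    {"gender": "F", "weight": None},
    {"gender": "F", "weight": 145.0},
    {"gender": "M", "weight": 180.0},
    {"gender": "M", "weight": None},
    {"gender": "M", "weight": 190.0},
]

# Per-group means over respondents only.
group_means = {}
for g in {"F", "M"}:
    vals = [r["weight"] for r in records if r["gender"] == g and r["weight"] is not None]
    group_means[g] = sum(vals) / len(vals)

for r in records:
    if r["weight"] is None:
        r["weight"] = group_means[r["gender"]]

print(group_means)
```

Generalizing to multiple grouping variables just means keying the means on tuples (e.g., (gender, height category)), which is where the combinatorics can get out of hand.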
Regression Imputation (1)
• Can use regression to predict the mean value
– Better than simple mean value imputation when there are multiple variables
• First, use information from those with data to estimate the regression model

  y_{i(r)} = β_0 + β_1 x_{1i(r)} + ε_{i(r)}

• Then use the model to predict those missing data

  y′_{i(r)} = β_0 + β_1 x′_{1i(r)} + ε′_{i(r)}

where here the epsilon is added "noise"
Regression Imputation (2)
• Of course, can also do multiple regression

  y_{i(r)} = β_0 + β_1 x_{1i(r)} + β_2 x_{2i(r)} + ⋯ + β_p x_{pi(r)} + ε_{i(r)}

• However, must have data available on all covariates
– Sometimes too hard and/or results in too few observations available with all data
• To get around this, sometimes must do a series of imputations
– I.e., impute values with one model that become covariates in the next regression imputation model
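A single-covariate regression imputation can be sketched with hand-rolled least squares (toy data; a real analysis would use a statistics package):

```python
# Simple-regression imputation: fit y = b0 + b1*x on complete cases,
# then predict y where it is missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, None, 8.1, 9.9]          # y missing for x = 3

pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
n = len(pairs)
mx = sum(xi for xi, _ in pairs) / n
my = sum(yi for _, yi in pairs) / n
b1 = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
      / sum((xi - mx) ** 2 for xi, _ in pairs))   # OLS slope
b0 = my - b1 * mx                                 # OLS intercept

y_imputed = [yi if yi is not None else b0 + b1 * xi for xi, yi in zip(x, y)]
print(round(b0, 3), round(b1, 3), round(y_imputed[2], 3))
```

The slide's ε′ "noise" term could be added to the prediction exactly as in stochastic mean imputation; it is omitted here to keep the fit step visible.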
Hot Deck Imputation
• Hot deck imputation often used in large-scale imputation processes
• The name dates back to the use of computer punch cards
• Basic idea:
– Sort data by important variables
– Start at the top and replace any missing data with the value of the immediately preceding observation
– If the first one is missing, replace with an appropriate mean value
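A sequential hot deck sketch following the steps above (toy records; the code assumes the first sorted record is observed, otherwise the slide's mean-value fallback would apply):

```python
# Sequential hot deck: after sorting by an auxiliary variable, fill each
# missing value with the immediately preceding observed ("donor") value.
records = [
    {"age": 25, "income": 30000.0},
    {"age": 31, "income": None},
    {"age": 42, "income": 52000.0},
    {"age": 47, "income": None},
    {"age": 60, "income": 61000.0},
]

records.sort(key=lambda r: r["age"])     # sort by an important variable
last_seen = None
for r in records:
    if r["income"] is None:
        r["income"] = last_seen          # donor = immediately preceding observation
    else:
        last_seen = r["income"]

print([r["income"] for r in records])
```

Because the data are sorted on a variable related to the item, each donor tends to resemble the recipient, which is what makes the method work.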
Multiple Imputation
• More sophisticated (and complicated) imputation method
• Creates multiple imputed data sets (hence the name)
• Variation across the multiple data sets allows estimation of overall variation
– Including both sampling and imputation variance
• Requires specification of an "imputation model" and use of specialized software / methods
– E.g., see http://www.multiple-imputation.com/
Variance Estimation
• Complex sampling methods require nonstandard methods to estimate variances
– I.e., can't just plug the data into statistical software and use their standard errors
– (Very rare) exception: SRS with a large population and low nonresponse
• Software for (some) complex survey designs:
– Free: CENVAR, VPLX, CPLX, EpiInfo
– Commercial: SAS, Stata, SUDAAN, WesVar
• Two estimation methods: Taylor series expansion and jackknife
Variance Estimation (Taylor Series)
• Taylor series approximation: converts ratios into sums
• Example: Variance for the weighted mean

  ȳ_w = ( Σ_{i=1}^{n} w_i y_i ) / ( Σ_{i=1}^{n} w_i )

assuming a SRS can be expressed as

  Var(ȳ_w) ≈ [ Var(Σ w_i y_i) + ȳ_w² Var(Σ w_i) − 2 ȳ_w Cov(Σ w_i y_i, Σ w_i) ] / ( Σ w_i )²
Variance Estimation (Jackknife and Balanced Repeated Replication)
• Jackknife and balanced repeated replication methods rely on empirical methods
– Basically, resample from the data c times
– Calculate the overall mean as

  ȳ = (1/c) Σ_{γ=1}^{c} ȳ_γ

and then estimate the variance as

  v(ȳ) = [ 1 / (c(c−1)) ] Σ_{γ=1}^{c} ( ȳ_γ − ȳ )²
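The replicate formulas above can be sketched as follows. The resampling step here is a plain resample with replacement, standing in for a real jackknife or BRR replication scheme, and all data are illustrative:

```python
import random

# Replication-based variance estimate as on the slide:
# y_bar = (1/c) * sum(y_gamma),  v(y_bar) = 1/(c(c-1)) * sum((y_gamma - y_bar)^2).
random.seed(0)
data = [4.0, 8.0, 6.0, 5.0, 7.0]
c = 10  # number of replicates

# Each replicate mean comes from one resample of the data.
rep_means = [sum(random.choices(data, k=len(data))) / len(data) for _ in range(c)]
y_bar = sum(rep_means) / c
v = sum((yg - y_bar) ** 2 for yg in rep_means) / (c * (c - 1))

print(round(y_bar, 4), round(v, 6))
```

In a real design, each replicate would drop or reweight whole PSUs according to the sample design rather than resampling individual observations.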
Survey Data Documentation and Metadata
• Data from large surveys often used by many researchers for many years
– Critical to carefully and fully document data, including weights and imputed data
• Metadata: data about data
– Sometimes called a data dictionary or codebook
• Types of metadata
– Definitional
– Procedural
– Operational
– Systems
What We Have Covered
• Discussed coding, how to code survey data, and various types of editing
• Described the various types of survey weights and how to calculate and use them
• Defined what imputation is
– Described some of the most common methods
– Discussed techniques for variance estimation
• Talked a bit about survey data documentation and metadata