Mannheim Research Institute for the Economics of Aging www.mea.uni-mannheim.de
Data Cleaning Process
Patrick Bartels
MEA
Frankfurt, December 6th
A short reminder
"Respondents don't lie!"
only change values if you're really sure
gather information about your country-specific database
  by references of survey agencies
  by information of remarks
  by own investigation
write syntax or a do-file, don't change the data directly
save the original variable when recoding values
  e.g. varname_original
indicate changes by a flag variable
  e.g. varname_flag
save corrected data files with a new name
  e.g. filename_corrected
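The three naming rules above can be sketched in a few lines. This is only an illustration, written in Python rather than Stata or SPSS, and it assumes each record is held as a dictionary. The helper names `recode_with_backup` and `corrected_filename` and the example values are hypothetical and not taken from the SHARE data.

```python
import os


def recode_with_backup(record, varname, new_value):
    """Return a corrected copy of `record`; the input is left untouched.

    The old value is kept in <varname>_original and the change is marked
    in <varname>_flag (1 = value was changed, 0 = unchanged).
    """
    corrected = dict(record)
    corrected[f"{varname}_original"] = record[varname]
    corrected[f"{varname}_flag"] = 1 if record[varname] != new_value else 0
    corrected[varname] = new_value
    return corrected


def corrected_filename(filename):
    """Derive the name of the corrected file, keeping the extension."""
    root, ext = os.path.splitext(filename)
    return f"{root}_corrected{ext}"


# Hypothetical record, invented for illustration only
raw = {"sampid": "100123", "respid": 1, "gender": "male"}
fixed = recode_with_backup(raw, "gender", "female")
```

Working on a copy mirrors the rule "don't change the data directly": the raw record stays available for comparison, and the flag makes every recoded value easy to find later.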
Division of work
What we do
consistency checks
  between cv_r & modules
  between wave_1 & wave_2
  for demography
  for children
fixing of interchanged IDs
  by automatic exchanges
Automatic corrections (respid)
sampid | respid | gender_w1 | gender_w2 | month / year of birth_w1 | month / year of birth_w2
100123 | 01 | female | male | Oct. 1945 | Apr. 1942
100123 | 02 | male | female | Apr. 1942 | Oct. 1945
Exchange of respid in wave2 for sampid 100123:
respid wave1 | respid wave2 | respid wave2 after correction
01 | 02 | 01
02 | 01 | 02
compute respid_original = respid.
compute respid_flag = 1.
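The exchange shown above could be automated roughly as follows. This is a sketch under stated assumptions, not the procedure actually used for the data: it handles only two-person households, treats gender plus month/year of birth as the matching key, and uses a hypothetical helper `fix_swapped_respids` with a hypothetical field name `birth`. The example household reuses the values from the table above.

```python
def fix_swapped_respids(wave1, wave2):
    """Swap respid in wave2 if the two household members are interchanged.

    wave1, wave2: lists of dicts for one household, each with the keys
    'respid', 'gender' and 'birth'. Returns corrected copies of the wave2
    records with respid_original and respid_flag added.
    """
    def key(rec):
        return (rec["gender"], rec["birth"])

    # Copies that keep the original respid; flag 0 means "not changed".
    result = [dict(rec, respid_original=rec["respid"], respid_flag=0)
              for rec in wave2]

    w1 = {rec["respid"]: key(rec) for rec in wave1}
    w2 = {rec["respid"]: key(rec) for rec in wave2}
    ids = sorted(w1)
    if len(wave1) != 2 or len(wave2) != 2 or len(ids) != 2 or sorted(w2) != ids:
        return result  # not a simple two-person case, leave untouched

    a, b = ids
    # Interchanged: each person matches the other respid in wave2, and the
    # two persons are distinguishable by their demography.
    interchanged = w1[a] != w1[b] and w1[a] == w2[b] and w1[b] == w2[a]
    if interchanged:
        for rec in result:
            rec["respid"] = b if rec["respid_original"] == a else a
            rec["respid_flag"] = 1
    return result


# Household 100123 from the table; birth is given as (year, month)
wave1 = [{"respid": 1, "gender": "female", "birth": (1945, 10)},
         {"respid": 2, "gender": "male", "birth": (1942, 4)}]
wave2 = [{"respid": 1, "gender": "male", "birth": (1942, 4)},
         {"respid": 2, "gender": "female", "birth": (1945, 10)}]
corrected = fix_swapped_respids(wave1, wave2)
```

Requiring both members to match crosswise keeps the rule conservative: a household where only one value differs is left alone and can be checked by hand.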
Overview of merge between wave_1 and wave_2
Columns: wave_1 gender. Rows: wave_2 gender. Each cell shows the count before / after auto-corrections.
wave_2 gender | male | female | missing | total
male | 7,865 / 8,566 | 757 / 121 | 6,717 / 6,633 | 15,339 / 15,320
female | 755 / 121 | 10,184 / 10,949 | 8,322 / 8,212 | 19,261 / 19,282
refusal | - / - | 2 / 1 | 1 / 1 | 3 / 1
missing | 5,286 / 5,219 | 6,484 / 6,358 | 23 / 23 | 11,793 / 11,600
total | 13,906 / 13,906 | 17,427 / 17,429 | 15,063 / 14,869 | 46,396 / 46,204
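A cross-tabulation of this kind can be produced by counting the (wave_1, wave_2) gender pairs over the merged records. The following is a minimal Python sketch with invented toy records; the counts of the real table are not reproduced here, and the helpers `gender_crosstab` and `count_mismatches` are hypothetical.

```python
from collections import Counter

VALID = {"male", "female"}


def gender_crosstab(pairs):
    """Count (gender_w1, gender_w2) combinations; None is shown as 'missing'."""
    def label(value):
        return "missing" if value is None else value

    return Counter((label(w1), label(w2)) for w1, w2 in pairs)


def count_mismatches(table):
    """Cases where both waves report male or female and the two disagree."""
    return sum(n for (w1, w2), n in table.items()
               if w1 in VALID and w2 in VALID and w1 != w2)


# Toy data, invented for illustration only
toy_pairs = [("male", "male"), ("female", "male"),
             ("female", "female"), ("male", None), ("female", "refusal")]
table = gender_crosstab(toy_pairs)
mismatches = count_mismatches(table)
```

Running the same count before and after the automatic corrections gives the two numbers per cell; a falling number of mismatches indicates that interchanged IDs were resolved.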
Division of work
What we do
consistency checks
  between cv_r & modules
  between wave_1 & wave_2
  for demography
  for children
fixing of interchanged IDs by automatic exchanges
correction of wave_1 by further information in wave_2
response for not fixable cases to country teams
What we want you to do
ID corrections initiated by survey agencies
check booklets, tests, HH composition (> Omar)
check financial modules (> Mario)
check remarks (> Laura)
check country-specific deviations (> Stephanie)
encoding open questions
  priority: education, ep005
check data again, inquire survey agencies if necessary
you're much better at doing this
we can fix a lot of cases
Do-File or Syntax
name of author, date of program
short description of what the program does
which database and which modules
version of data, date of publishing
conditions / order of do-files
for STATA-users: define global path
Example of STATA-do_file (1)
/******************************************************************************
This program provides changes in cvid and respid variables in wave2 datasets
of the longitudinal sample, in order to get exact matching between wave1 and
wave2 respondents. A variable called "mix_hh_flag" is added to the final
dataset: it is equal to 1 in each household when the value of the respid
variable was changed in one or two interviews of that household.

data-version: 2007/Oct/26
Omar Paccagnella, 30 October 2007

VERY IMPORTANT! IN ORDER TO GET EXACT MATCHING OF RESPONDENTS WITHIN AND
BETWEEN WAVES, THIS PROGRAM MUST BE RUN ONLY AFTER THE PROGRAMS:
"IT_DN_changes_w2.do", "IT_CV_changes_w2.do" and "IT_XT_changes_w2.do" !
******************************************************************************/
author's name & date of program
short description
which dataset
order of do-files
data-version
Example of STATA-do_file (2)
global drive "S:/Share/wave2"
/*************************************************************
THIS PROGRAM HAS TO BE RUN FOR ALL SECTIONS FROM DN TO IV
**************************************************************/
foreach module in ac as br cf ch co cs dn ep ex hc hh ho iv mh pf ph sp ws {
    * load the module; "clear" is needed because each pass replaces the data in memory
    use "$drive/sharew2_`module'", clear
    * flag variable, 0 = household unchanged
    gen mix_hh_flag = 0
    * save original variables before recoding
    gen sampid_original = sampid
    gen respid_original = respid
    replace respid = 1 if sampid == "1604200015300" & cvid == 2 & respid == 2
    replace mix_hh_flag = 1 if sampid == "1604200015300"
    [...]
    * save under a new name; "replace" allows the program to be re-run
    save "$drive/sharew2_`module'_corrected", replace
}
global drive
save original variables
flag-variable
for which modules?
new version of data
Example of SPSS-syntax (1)
COMMENT This program provides changes in cvid and respid variables in wave2 datasets of the
longitudinal sample, in order to get exact matching between wave1 and wave2
respondents. A variable called "mix_hh_w2" is added to the final dataset (called
sharew2_`var'_checked): it is equal to 1 in each household when the value of the
respid variable was changed in one or two interviews of that household.
* date of data: 2007/Oct/26
* Omar Paccagnella, October 2007
* VERY IMPORTANT! IN ORDER TO GET EXACT MATCHING OF RESPONDENTS WITHIN AND BETWEEN WAVES,
* THIS PROGRAM MUST BE RUN ONLY AFTER THE PROGRAMS: "IT_DN_changes_w2.do",
* "IT_CV_changes_w2.do" and "IT_XT_changes_w2.do" !
****************************************************************************
* THIS PROGRAM HAS TO BE RUN FOR ALL SECTIONS FROM DN TO IV.
short description
author's name
which dataset
order of syntax
data-version
for which modules?
Example of SPSS-syntax (2)
GET FILE='S:\SHARE\wave2\dn_module.sav'.
EXE.
* flag variable, 0 = household unchanged.
compute mix_hh_flag = 0.
* save original variables before recoding.
compute cvid_original = cvid.
compute respid_original = respid.
compute sampid_original = sampid.
* conditions test cvid_original, because cvid itself is changed by the first IF.
if (sampid = 1604200015300 & cvid_original = 2) cvid = 1.
if (sampid = 1604200015300 & cvid_original = 2) respid = 2.
if (sampid = 1604200015300) mix_hh_flag = 1.
EXE.
[...]
SAVE OUTFILE='S:\SHARE\wave2\dn_module_corrected.sav'.
EXE.
flag-variable
save original variables
Any problems with programming do-files or syntax?
Please give us a call.