data cleaning process

Mannheim Research Institute for the Economics of Aging www.mea.uni-mannheim.de

Data Cleaning Process

Patrick Bartels

MEA

Frankfurt, December 6th

A short reminder

"Respondents don't lie!" Only change values if you are really sure.

Gather information about your country-specific database:

- from the references of the survey agencies
- from the information in the remarks
- from your own investigation

Write a syntax or do-file; don't change the data directly.

- Save the original variable when recoding values, e.g. varname_original
- Indicate changes with a flag variable, e.g. varname_flag
- Save corrected data files under a new name, e.g. filename_corrected
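The three rules above (keep the original, set a flag, never touch the raw data) can be sketched in a few lines. This is only an illustration in Python, not SHARE code: the function name, the row layout and the variable yrbirth are assumptions made for the example.

```python
def recode_with_audit(rows, varname, corrections):
    """Apply corrections to `varname`, keeping the original value and a flag.

    rows: list of dicts, one per respondent (hypothetical layout).
    corrections: dict mapping (sampid, respid) to the corrected value.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)  # work on a copy, the raw data stay unchanged
        new_row[f"{varname}_original"] = row[varname]
        key = (row["sampid"], row["respid"])
        if key in corrections:
            new_row[varname] = corrections[key]
            new_row[f"{varname}_flag"] = 1
        else:
            new_row[f"{varname}_flag"] = 0
        cleaned.append(new_row)
    return cleaned


# Hypothetical raw data with an obvious typo in the year of birth.
raw = [
    {"sampid": "100123", "respid": "01", "yrbirth": 1845},
    {"sampid": "100123", "respid": "02", "yrbirth": 1942},
]
# Hypothetical correction, e.g. confirmed by the survey agency.
corrected = recode_with_audit(raw, "yrbirth", {("100123", "01"): 1945})
```

Because every changed row carries both the old value and a flag, each correction can be reviewed or undone later.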

Division of work

What we do

- consistency checks
  - between cv_r & modules
  - between wave_1 & wave_2
    - for demography
    - for children
- fixing of interchanged IDs by automatic exchanges

Automatic corrections (respid)

sampid  respid  gender_w1  gender_w2  month / year of birth_w1  month / year of birth_w2
100123  01      female     male       Oct. 1945                 Apr. 1942
100123  02      male       female     Apr. 1942                 Oct. 1945

Automatic corrections (respid)

sampid  respid (wave1)  respid (wave2)  gender  month / year of birth
100123  01              02 -> 01        female  Oct. 1945
100123  02              01 -> 02        male    Apr. 1942

compute respid_original = respid.
compute respid_flag = 1.
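A minimal sketch of such an automatic exchange, assuming each record carries sampid, respid, gender and birth (month / year). The function name and the field names are hypothetical; the real procedure may use further variables. Respids are exchanged only when both household members mismatch their wave-1 record and match each other's.

```python
def fix_interchanged_respids(wave1, wave2):
    """Exchange wave-2 respids when two household members are swapped."""
    # Wave-1 lookup: (sampid, respid) -> (gender, birth).
    w1 = {(p["sampid"], p["respid"]): (p["gender"], p["birth"]) for p in wave1}
    # Copy wave-2 rows, keep the original respid, start with flag 0.
    fixed = [dict(p, respid_original=p["respid"], respid_flag=0) for p in wave2]

    households = {}
    for p in fixed:
        households.setdefault(p["sampid"], []).append(p)

    def signature(p):
        return (p["gender"], p["birth"])

    def w1_signature(p):
        return w1.get((p["sampid"], p["respid_original"]))

    for members in households.values():
        if len(members) != 2:
            continue  # this sketch only handles two-person exchanges
        a, b = members
        both_mismatch = (w1_signature(a) != signature(a)
                         and w1_signature(b) != signature(b))
        crossed = (w1_signature(a) == signature(b)
                   and w1_signature(b) == signature(a))
        if both_mismatch and crossed:
            a["respid"], b["respid"] = b["respid"], a["respid"]
            a["respid_flag"] = b["respid_flag"] = 1
    return fixed


# The household from the slide above.
wave1 = [
    {"sampid": "100123", "respid": "01", "gender": "female", "birth": "1945-10"},
    {"sampid": "100123", "respid": "02", "gender": "male", "birth": "1942-04"},
]
wave2 = [
    {"sampid": "100123", "respid": "01", "gender": "male", "birth": "1942-04"},
    {"sampid": "100123", "respid": "02", "gender": "female", "birth": "1945-10"},
]
fixed = fix_interchanged_respids(wave1, wave2)
```

Households where only one person mismatches are left untouched, since those cases need a manual check rather than an automatic exchange.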

Overview of merge between wave_1 and wave_2

Rows: wave_2 gender. Columns: wave_1 gender. Each cell shows: before / after auto-corrections.

wave_2 \ wave_1  male             female           missing          total
male             7,865 / 8,566    757 / 121        6,717 / 6,633    15,339 / 15,320
female           755 / 121        10,184 / 10,949  8,322 / 8,212    19,261 / 19,282
refusal          - / -            2 / 1            1 / 1            3 / 1
missing          5,286 / 5,219    6,484 / 6,358    23 / 23          11,793 / 11,600
total            13,906 / 13,906  17,427 / 17,429  15,063 / 14,869  46,396 / 46,204
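A cross-tabulation like the one above can be produced from the merged data with a simple count of gender pairs. The sketch below uses made-up toy pairs, not the SHARE figures, and the function names are hypothetical.

```python
from collections import Counter


def gender_crosstab(pairs):
    """Count each (wave_1 gender, wave_2 gender) combination."""
    return Counter(pairs)


def count_inconsistent(table):
    """Number of respondents whose known gender differs between waves."""
    known = {"male", "female"}
    return sum(
        n for (g1, g2), n in table.items()
        if g1 in known and g2 in known and g1 != g2
    )


# Toy data: one (wave_1, wave_2) pair per merged respondent.
pairs = [
    ("male", "male"),
    ("female", "female"),
    ("female", "male"),     # inconsistent, candidate for an ID exchange
    ("male", "female"),     # inconsistent, candidate for an ID exchange
    ("missing", "female"),  # not in wave_1, nothing to compare
]
table = gender_crosstab(pairs)
n_bad = count_inconsistent(table)
```

Running the same count before and after the automatic exchanges shows how many off-diagonal cases were resolved.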






Division of work

What we do

- correction of wave_1 by further information in wave_2
- feedback to country teams on cases that cannot be fixed

We can fix a lot of cases.

What we want you to do

- ID corrections initiated by survey agencies
- check booklets, tests, HH-composition (> Omar)
- check financial modules (> Mario)
- check remarks (> Laura)
- check country-specific deviations (> Stephanie)
- encoding of open questions (priority: education, ep005)
- check data again, inquire with survey agencies if necessary

You're much better at doing this.

Do-File or Syntax

Every do-file or syntax file should state:

- name of author, date of program
- short description of what the program does
- which database and which modules
- version of data, date of publishing
- conditions / order of do-files
- for Stata users: define a global path

Example of STATA-do_file (1)

/******************************************************************************
This program provides changes in cvid and respid variables in wave2 datasets
of the longitudinal sample, in order to get exact matching between wave1 and
wave2 respondents. A variable called "mix_hh_flag" is added to the final
dataset: it is equal to 1 in each household when the value of the respid
variable was changed in one or two interviews of that household.

data-version: 2007/Oct/26
Omar Paccagnella, 30 October 2007

VERY IMPORTANT! IN ORDER TO GET EXACT MATCHING OF RESPONDENTS WITHIN AND
BETWEEN WAVES, THIS PROGRAM MUST BE RUN ONLY AFTER THE PROGRAMS:
"IT_DN_changes_w2.do", "IT_CV_changes_w2.do" and "IT_XT_changes_w2.do" !
******************************************************************************/

Points to note in this header: short description; which dataset; data-version; author's name & date of program; order of do-files.

Example of STATA-do_file (2)

global drive "S:/Share/wave2"

/*************************************************************
THIS PROGRAM HAS TO BE RUN FOR ALL SECTIONS FROM DN TO IV
**************************************************************/

foreach module in ac as br cf ch co cs dn ep ex hc hh ho iv mh pf ph sp ws {
    * clear is needed so that the next module can be loaded inside the loop
    use "$drive/sharew2_`module'", clear

    * flag variable and copies of the original IDs
    gen mix_hh_flag = 0
    gen sampid_original = sampid
    gen respid_original = respid

    replace respid = 1 if sampid == "1604200015300" & cvid == 2 & respid == 2
    replace mix_hh_flag = 1 if sampid == "1604200015300"

    [...]

    * corrected data are saved under a new name
    save "$drive/sharew2_`module'_corrected"
}

Points to note in this do-file: global drive; for which modules?; save original variables; flag-variable; new version of data.
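For readers who use neither Stata nor SPSS, the same pattern (one base path, a loop over modules, original IDs kept, a flag variable, output under a new name) might look as follows in Python. This is a sketch under assumptions: the CSV file layout, the function name correct_module and the second sampid are invented for the example, and the correction is keyed on (sampid, respid) only.

```python
import csv
import tempfile
from pathlib import Path


def correct_module(base, module, fixes):
    """Read sharew2_<module>.csv, apply ID fixes, write a *_corrected copy."""
    src = base / f"sharew2_{module}.csv"
    dst = base / f"sharew2_{module}_corrected.csv"
    with src.open(newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = list(reader.fieldnames) + ["respid_original", "mix_hh_flag"]
    flagged_households = {sampid for sampid, _ in fixes}
    for row in rows:
        row["respid_original"] = row["respid"]  # save the original variable
        key = (row["sampid"], row["respid"])
        if key in fixes:
            row["respid"] = fixes[key]
        # flag every interview of a household in which an ID was changed
        row["mix_hh_flag"] = "1" if row["sampid"] in flagged_households else "0"
    with dst.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return dst  # the source file is left untouched


# Demo on throwaway files; module names and data are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)  # plays the role of the global drive
    for module in ("dn", "ep"):
        (base / f"sharew2_{module}.csv").write_text(
            "sampid,respid\n1604200015300,2\n1604200015301,1\n"
        )
    fixes = {("1604200015300", "2"): "1"}
    outputs = [correct_module(base, m, fixes) for m in ("dn", "ep")]
    with outputs[0].open(newline="") as f:
        result = list(csv.DictReader(f))
```

Defining the base path once means the whole set of corrections can be rerun on another machine by changing a single line.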

Example of SPSS-syntax (1)

COMMENT This program provides changes in cvid and respid variables in wave2 datasets of the
  longitudinal sample, in order to get exact matching between wave1 and wave2
  respondents. A variable called "mix_hh_w2" is added to the final dataset (called
  sharew2_`var'_checked): it is equal to 1 in each household when the value of the
  respid variable was changed in one or two interviews of that household.
* date of data: 2007/Oct/26.
* Omar Paccagnella, October 2007.
* VERY IMPORTANT! IN ORDER TO GET EXACT MATCHING OF RESPONDENTS WITHIN AND BETWEEN WAVES,
  THIS PROGRAM MUST BE RUN ONLY AFTER THE PROGRAMS: "IT_DN_changes_w2.do",
  "IT_CV_changes_w2.do" and "IT_XT_changes_w2.do".
* THIS PROGRAM HAS TO BE RUN FOR ALL SECTIONS FROM DN TO IV.

Points to note in this header: short description; which dataset; data-version; author's name; order of syntax; for which modules?

Example of SPSS-syntax (2)

GET FILE='S:\SHARE\wave2\dn_module.sav'.

* Flag variable and copies of the original IDs.
compute mix_hh_flag = 0.
compute cvid_original = cvid.
compute respid_original = respid.
compute sampid_original = sampid.

* Test against cvid_original: once cvid is recoded, a second test on cvid = 2 would never match.
if (sampid = 1604200015300 & cvid_original = 2) cvid = 1.
if (sampid = 1604200015300 & cvid_original = 2) respid = 2.
if (sampid = 1604200015300) mix_hh_flag = 1.
EXE.

[...]

SAVE OUTFILE='S:\SHARE\wave2\dn_module_corrected.sav'.
EXE.

Points to note in this syntax: flag-variable; save original variables.

Any problems with programming do-files or syntax?

Please give us a call
