ipums-international integration process

IPUMS-International Integration Process Matt Sobek Minnesota Population Center [email protected]

Upload: chase-williamson

Post on 14-Mar-2016



Page 1: IPUMS-International Integration Process

IPUMS-International Integration Process

Matt Sobek, Minnesota Population Center

[email protected]

Page 2: IPUMS-International Integration Process

Stages: 1. Input material, 2. Pre-processing, 3. Standardization, 4. Integration

DATA

1. Input material: Data files

2. Pre-processing: Batch samples; Reformat data; Donation; Draw sample; Confidentiality A

3. Standardization: Code clean-up; Verify data; Confidentiality B

4. Integration: Harmonize codes; Variable programming; Constructed variables

METADATA

1. Input material: Data dictionary; Enumeration forms; Enum. instructions; Sample information

2. Pre-processing: Translate to English; Images to editable files

3. Standardization: IPUMS data dictionary; Tag enumeration text; Document unharmonized variables

4. Integration: Variable descriptions; Sample design

Page 3: IPUMS-International Integration Process

End

Matt Sobek, Minnesota Population Center

[email protected]

Page 4: IPUMS-International Integration Process

Batch Samples

In spring we identify the samples to integrate the following year.

Samples are processed as a batch, one batch per year. The entire batch is processed through each stage before we proceed to the next stage.

There is little flexibility in the work process. If a sample is not available for processing during the earliest stages of integration, it cannot be included in the data release for that year.

Page 5: IPUMS-International Integration Process

Original Input Data

Some examples of differing file formats:

• SPSS and SAS system files

• Redatam-format

• IMPS format

• Records that combine household and person characteristics

• Separate files for persons, households (and dwellings, buildings)

• Different types of records (mortality or migration)

• Separate files for different administrative units

Page 6: IPUMS-International Integration Process

Reformatting: Original Data File

Page 7: IPUMS-International Integration Process

Reformatting: Data File after Reformatting

Page 8: IPUMS-International Integration Process

Reformatting: Rectangular Sample

(Brazil 1980; person records only, with household data duplicated on person records)

Input, one record per person:

geography housing person (head)
geography housing person (child)
geography housing person (child)
geography housing person (head)
geography housing person (spouse)
geography housing person (child)
geography housing person (child)

Output, one household record followed by its person records:

geography housing
person (head)
person (child)
person (child)

geography housing
person (head)
person (spouse)
person (child)
person (child)
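The rectangular-to-hierarchical step can be sketched roughly as follows. This is a minimal illustration, not the project's actual reformatting program: the field names and the household identifier `hh_id` are assumptions, since the slide does not say how persons are grouped into households.

```python
from itertools import groupby

# Hypothetical rectangular input: every person record repeats the
# geography and housing fields of its household.
rectangular = [
    {"hh_id": 1, "geography": "A", "housing": "owned", "relate": "head"},
    {"hh_id": 1, "geography": "A", "housing": "owned", "relate": "child"},
    {"hh_id": 1, "geography": "A", "housing": "owned", "relate": "child"},
    {"hh_id": 2, "geography": "B", "housing": "rented", "relate": "head"},
    {"hh_id": 2, "geography": "B", "housing": "rented", "relate": "spouse"},
]

HOUSEHOLD_FIELDS = ("hh_id", "geography", "housing")

def to_hierarchical(records):
    """Emit one household record followed by its person records."""
    out = []
    # groupby requires the records of a household to be adjacent.
    for _, group in groupby(records, key=lambda r: r["hh_id"]):
        members = list(group)
        household = {f: members[0][f] for f in HOUSEHOLD_FIELDS}
        out.append(("H", household))
        for m in members:
            person = {k: v for k, v in m.items() if k not in HOUSEHOLD_FIELDS}
            out.append(("P", person))
    return out

hierarchical = to_hierarchical(rectangular)
```

Taking the household fields from the first person record assumes the duplicated values agree across the household; a real check would compare them.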

Page 9: IPUMS-International Integration Process

Reformatting: Dwelling-Household-Person Sample

(Chile 1992; separate dwelling and household records)

Input, with dwelling and household as separate record types:

dwelling
household
person (head)
person (spouse)
person (child)
household
person (head)
person (child)
dwelling
household
person (head)
person (spouse)

Output, with the dwelling data attached to each household record:

dwelling household
person (head)
person (spouse)
person (child)
dwelling household
person (head)
person (child)
dwelling household
person (head)
person (spouse)

Page 10: IPUMS-International Integration Process

Reformatting: Merge Household and Person Files

(Brazil 2000)

Household File:

serial 001 geog & housing
serial 002 geog & housing
serial 003 geog & housing

Person File:

serial 001 head
serial 001 spouse
serial 002 head
serial 002 child
serial 003 head

Merged file:

serial 001 household
serial 001 head
serial 001 spouse
serial 002 household
serial 002 head
serial 002 child
serial 003 household
serial 003 head
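A merge on the shared serial number might look like the sketch below. The record layout is hypothetical; only the idea of interleaving each household record with its persons comes from the slide. The function also reports unmatched records, the kind of structural problem that is tested for later in the process.

```python
# Hypothetical records keyed by a shared serial number.
household_file = [
    {"serial": "001", "rectype": "H"},
    {"serial": "002", "rectype": "H"},
    {"serial": "003", "rectype": "H"},
]
person_file = [
    {"serial": "001", "relate": "head"},
    {"serial": "001", "relate": "spouse"},
    {"serial": "002", "relate": "head"},
    {"serial": "002", "relate": "child"},
    {"serial": "003", "relate": "head"},
]

def merge_files(households, persons):
    """Interleave each household record with its person records."""
    by_serial = {}
    for p in persons:
        by_serial.setdefault(p["serial"], []).append(p)

    merged, households_without_persons = [], []
    for h in households:
        members = by_serial.pop(h["serial"], [])
        if not members:
            # Could be a vacant unit or a structural error; flag for review.
            households_without_persons.append(h["serial"])
        merged.append(h)
        merged.extend(members)

    # Any serials left over belong to persons with no household record.
    persons_without_household = sorted(by_serial)
    return merged, households_without_persons, persons_without_household

merged, no_persons, no_household = merge_files(household_file, person_file)
```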

Page 11: IPUMS-International Integration Process

Reformatting: Persons not Organized in Households

(Mexico 1960; individuals only, not organized in households)

[Diagram: the input file consists of individual records that each carry geography, person, and housing fields. After reformatting, the file contains household records and person records.]

Page 12: IPUMS-International Integration Process

Donation and Error Correction

Data are tested for errors that affect structural integrity, such as merged households, unmatched person and household records, corrupted records, etc. Such errors often do not affect tabulations, but create inconsistencies across records within households that affect sophisticated analyses.

• Some problems can be resolved with custom programming.

• Other problems are resolved by donation: substituting a donor household for the corrupted one.

Households are divided into strata based on predictor variables. Donors are drawn from the same stratum as the corrupted household, ensuring they share key characteristics.

If a sample is drawn from the full census, a substitute donor record is used; if we are already starting with a sample, the donor record is duplicated. A flag indicates that a record was duplicated.
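The donation step could be sketched along these lines. Everything specific here is an assumption for illustration: the predictor variables (`region`, `size`), the random choice within a stratum, and the `donor_id` field are not described in the slides.

```python
import random
from collections import defaultdict

def stratum_of(household):
    # Hypothetical predictor variables; the slides do not name the ones used.
    return (household["region"], household["size"])

def donate(households, corrupted_ids, starting_from_sample, seed=0):
    """Replace each corrupted household with a donor from the same stratum."""
    rng = random.Random(seed)
    pools = defaultdict(list)
    for h in households:
        if h["id"] not in corrupted_ids:
            pools[stratum_of(h)].append(h)

    result = []
    for h in households:
        if h["id"] not in corrupted_ids:
            result.append(dict(h, donor_id=None, duplicated=False))
            continue
        pool = pools[stratum_of(h)]
        if not pool:
            raise ValueError(f"no donor available for household {h['id']}")
        donor = rng.choice(pool)
        # Following the slide, the duplication flag is set only when the input
        # is already a sample. How a full-count substitute is handled later
        # is not modeled here.
        result.append(dict(donor, id=h["id"], donor_id=donor["id"],
                           duplicated=starting_from_sample))
    return result
```

For example, `donate(households, {2}, starting_from_sample=True)` would replace household 2 with the content of a household from the same stratum and mark it as duplicated.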

Page 13: IPUMS-International Integration Process

Drawing a Sample

About one-third of IPUMS samples are drawn from full-count data.

After reformatting, we draw a systematic sample of every Nth dwelling to yield the desired sample density – typically 10%.

If the input data are not full-count (for example, they include only the long-form records), the sample design might have to account for differing sample densities between areas.

Very large dwelling units (over 30 persons) are sampled at the individual level – not as intact units – in order to reduce sampling error. Every Nth individual is taken.
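A systematic draw of this kind can be sketched as below. The fixed starting position and the data layout are assumptions; the step of 10 and the 30-person cutoff follow the figures on the slide.

```python
def systematic_sample(dwellings, step=10, start=0, large=30):
    """Take every `step`-th dwelling; sample very large dwellings by person."""
    sampled_dwellings = []
    sampled_persons = []   # individuals drawn from very large dwellings
    dwellings_seen = 0
    persons_seen = 0
    for d in dwellings:
        if len(d["persons"]) > large:
            # Large units are not kept intact: every Nth individual is taken.
            for p in d["persons"]:
                if persons_seen % step == start:
                    sampled_persons.append(p)
                persons_seen += 1
        else:
            if dwellings_seen % step == start:
                sampled_dwellings.append(d)
            dwellings_seen += 1
    return sampled_dwellings, sampled_persons

# Hypothetical input: 100 three-person dwellings and one 40-person unit.
dwellings = [{"id": i, "persons": [f"{i}-{j}" for j in range(3)]} for i in range(100)]
dwellings.append({"id": 100, "persons": [f"100-{j}" for j in range(40)]})
kept, people = systematic_sample(dwellings)
```

With a step of 10 this keeps 10 of the 100 ordinary dwellings and 4 of the 40 persons in the large unit.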

Page 14: IPUMS-International Integration Process

Confidentiality Measures: A

Swap a small percentage of cases between geographic areas.

Reorder households within geographic areas.

Suppress low-level geographic variables.

Suppress any variable deemed too sensitive by the National Statistical Office.

Encrypt all versions of the data that predate these confidentiality measures.

Page 15: IPUMS-International Integration Process

Code Clean-Up: Recoding Unharmonized Variables

• Recode the input variables to conform to some basic standards for treatment of missing values, etc.

• Recode stray values into a consolidated missing category as appropriate.

• Convert non-numeric characters to numeric.

Most recoding is performed using a data translation matrix like the one below for Marital Status in 1984 Costa Rica. If the recoding requires more complex logic, we use custom programming.

CR840018  label             cos1984
          Marital status    P 75

0         NIU               B=Under age 10
1         Consensual union  1=Consensual union
2         Married           2=Married
3         Separated         3=Separated
4         Divorced          4=Divorced
5         Widowed           5=Widowed
6         Single            6=Single
9         Missing           0=[undocumented]
9         "                 8=[undocumented]
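A table-driven recode of this kind can be sketched as a simple lookup. The mapping follows the Costa Rica 1984 marital status table; the function name and the choice to send every unlisted value to the missing category are assumptions for illustration.

```python
# Input code -> cleaned code, following the Costa Rica 1984 table.
MARST_CR1984 = {
    "B": 0,  # under age 10 -> NIU (not in universe)
    "1": 1,  # consensual union
    "2": 2,  # married
    "3": 3,  # separated
    "4": 4,  # divorced
    "5": 5,  # widowed
    "6": 6,  # single
    "0": 9,  # undocumented -> missing
    "8": 9,  # undocumented -> missing
}
MISSING = 9

def clean_code(raw, table=MARST_CR1984, missing=MISSING):
    """Recode one raw value; stray values fall into the missing category."""
    return table.get(raw, missing)

cleaned = [clean_code(v) for v in ["B", "2", "0", "8", "7"]]
```

The non-numeric input "B" becomes a numeric code, and the stray value "7", which is not in the table, is consolidated into missing.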

Page 16: IPUMS-International Integration Process

Verify Data: Unharmonized Variables

Examine the marginal frequencies of every input variable.

Analyze the data universe for each variable – the population at risk of having a response. Determine the theoretical universe from enumeration materials or other documentation, then empirically determine any discrepancies from that universe.

Document the universe for each variable and any other observations.
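The two checks can be illustrated with a small sketch. The records, the NIU code of 0, and the age-10 universe are assumptions chosen to echo the marital status example; the real universe comes from the enumeration materials.

```python
from collections import Counter

# Hypothetical person records; marst 0 stands for NIU (not in universe).
persons = [
    {"pernum": 1, "age": 46, "marst": 2},
    {"pernum": 2, "age": 44, "marst": 2},
    {"pernum": 3, "age": 8, "marst": 0},
    {"pernum": 4, "age": 9, "marst": 6},   # response outside the universe
    {"pernum": 5, "age": 30, "marst": 0},  # no response inside the universe
]

def marginal_frequencies(records, var):
    """Count how often each value of `var` occurs."""
    return Counter(r[var] for r in records)

def universe_discrepancies(records, var, in_universe, niu=0):
    """Return records whose value contradicts the theoretical universe."""
    return [r for r in records if in_universe(r) != (r[var] != niu)]

freq = marginal_frequencies(persons, "marst")
problems = universe_discrepancies(persons, "marst", lambda r: r["age"] >= 10)
```

Here `problems` holds persons 4 and 5: one answered although outside the assumed universe, the other is inside it but has no response.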

Page 17: IPUMS-International Integration Process

Confidentiality Measures: B

Recode geographic units to ensure small localities cannot be identified (typically those with fewer than 20,000 persons).

For recent censuses:

Identify cells that represent very small numbers of persons in the population. Code them to a residual category or combine them.

Top- or bottom-code continuous variables that have a long tail that could identify small subpopulations.

Suppress specific categories of variables as requested by the National Statistical Office.
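Two of these measures can be sketched in a few lines. The 20,000 threshold comes from the slide; the residual code, the cap value, and the idea of collapsing small localities into one residual category (rather than combining them with neighbors) are assumptions made only for illustration.

```python
def recode_small_localities(records, population, threshold=20000, residual=0):
    """Replace a locality code when its population is under the threshold."""
    return [
        dict(r, locality=r["locality"]
             if population[r["locality"]] >= threshold else residual)
        for r in records
    ]

def top_code(records, var, cap):
    """Limit a continuous variable to `cap` so the long tail is hidden."""
    return [dict(r, **{var: min(r[var], cap)}) for r in records]

# Hypothetical population counts and records.
population = {101: 250000, 102: 4800}
records = [
    {"locality": 101, "income": 1200},
    {"locality": 102, "income": 990000},
]
safe = top_code(recode_small_localities(records, population), "income", 50000)
```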

Page 18: IPUMS-International Integration Process

Harmonize Codes: Translation Matrix for Marital Status

MARST Marital Status

code  label                    CN82A403         CO73A411            KN89A413       MX70A402               US90A425
                               (China 1982)     (Colombia 1973)     (Kenya 1989)   (Mexico 1970)          (U.S.A. 1990)
100   SINGLE/NEVER MARRIED     1=never married  4=single            1=single       9=single               6=never married
200   MARRIED/IN UNION         -                -                   -              -                      -
210   Married (not specified)  2=married        2=married           3=monogamous   -                      1=married
211   Civil                    -                -                   -              3=only civil           -
212   Religious                -                -                   -              4=only religious       -
213   Civil and religious      -                -                   -              2=civil and religious  -
214   Polygamous               -                -                   3=polygamous   -                      -
220   Consensual union         -                1=free union        -              5=free union           -
300   SEPARATED/DIVORCED       -                3=sep. or divorced  -              -                      -
310   Separated                -                -                   6=separated    8=separated            3=separated
321   Legally separated        -                -                   -              -                      -
322   De facto separated       -                -                   -              -                      -
330   Divorced                 4=divorced       -                   5=divorced     7=divorced             4=divorced
400   WIDOWED                  3=widowed        5=widowed           4=widowed      6=widowed              5=widowed
999   UNKNOWN/MISSING          0=missing        6=unknown           B=blank        1=unknown              -

(- = no input code maps to this category)
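One way to use such a matrix is to store it by harmonized code and invert it into one lookup per input variable. The sketch below uses a subset of the rows as read from the slide; the data structure, the function names, and the fallback to 999 for unlisted values are assumptions, not the project's actual software.

```python
# Harmonized code -> {input variable: input code}; a subset of the matrix.
MARST_MATRIX = {
    100: {"CN82A403": "1", "CO73A411": "4", "KN89A413": "1",
          "MX70A402": "9", "US90A425": "6"},
    211: {"MX70A402": "3"},
    220: {"CO73A411": "1", "MX70A402": "5"},
    300: {"CO73A411": "3"},
    330: {"CN82A403": "4", "KN89A413": "5", "MX70A402": "7", "US90A425": "4"},
    400: {"CN82A403": "3", "CO73A411": "5", "KN89A413": "4",
          "MX70A402": "6", "US90A425": "5"},
}
UNKNOWN = 999

def invert(matrix):
    """Build {input variable: {input code: harmonized code}}."""
    lookups = {}
    for harmonized, inputs in matrix.items():
        for svar, raw in inputs.items():
            per_var = lookups.setdefault(svar, {})
            if raw in per_var:
                # An input code may map to only one harmonized code.
                raise ValueError(f"{svar} code {raw} is assigned twice")
            per_var[raw] = harmonized
    return lookups

LOOKUPS = invert(MARST_MATRIX)

def harmonize(svar, raw):
    """Translate one input code to its harmonized MARST code."""
    return LOOKUPS[svar].get(raw, UNKNOWN)
```

With this layout, "single" in Mexico 1970 (code 9) and "never married" in U.S.A. 1990 (code 6) both come out as 100.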

Page 19: IPUMS-International Integration Process

Variable Programming

Some variable manipulations are too complex to be handled using the translation matrix tables. Typically these involve continuous variables or recoding logic that refers to multiple variables. This programming is written in C++.

Page 20: IPUMS-International Integration Process

Constructed “Pointer” Variables

(Colombia 1985; simple household)

                                              Spouse's  Mother's  Father's
Pernum  Relate  Age  Sex     Marst    Chborn  Location  Location  Location
1       head    46   male    married  n/a     2         0         0
2       spouse  44   female  married  3       1         0         0
3       aunt    77   female  widow    7       0         0         0
4       child   15   female  single   0       0         2         1
5       child   13   female  single   n/a     0         2         1
6       child   11   male    single   n/a     0         2         1
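For the simple household, the pointers can be derived from the relationship codes alone, as in the sketch below. This is a deliberately simplified illustration, not the project's algorithm: it handles only head, spouse, and children of the head, and it assumes the head's spouse is a parent of every child of the head. The field names `spouse_loc`, `mother_loc`, and `father_loc` are made up for the example.

```python
household = [
    {"pernum": 1, "relate": "head", "sex": "male"},
    {"pernum": 2, "relate": "spouse", "sex": "female"},
    {"pernum": 3, "relate": "aunt", "sex": "female"},
    {"pernum": 4, "relate": "child", "sex": "female"},
    {"pernum": 5, "relate": "child", "sex": "female"},
    {"pernum": 6, "relate": "child", "sex": "male"},
]

def add_pointers(persons):
    """Attach spouse, mother, and father locations (0 = not present)."""
    head = next((p for p in persons if p["relate"] == "head"), None)
    spouse = next((p for p in persons if p["relate"] == "spouse"), None)
    couple = [p for p in (head, spouse) if p is not None]
    out = []
    for p in persons:
        spouse_loc = mother_loc = father_loc = 0
        if head is not None and spouse is not None:
            if p is head:
                spouse_loc = spouse["pernum"]
            elif p is spouse:
                spouse_loc = head["pernum"]
        if p["relate"] == "child":
            # Simplifying assumption: both members of the couple are parents.
            for parent in couple:
                if parent["sex"] == "female":
                    mother_loc = parent["pernum"]
                else:
                    father_loc = parent["pernum"]
        out.append(dict(p, spouse_loc=spouse_loc,
                        mother_loc=mother_loc, father_loc=father_loc))
    return out

linked = add_pointers(household)
```

Complex households, such as one with grandchildren and children-in-law, need additional rules (age, marital status, adjacency of records) that this sketch does not attempt.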

Page 21: IPUMS-International Integration Process

Constructed “Pointer” Variables

(Colombia 1985; complex household)

                                                        Spouse's  Mother's  Father's
Pernum  Relationship  Age  Sex     Marst      Chborn    Location  Location  Location
1       head          53   female  separated  6         0         0         0
2       child         28   male    single     n/a       0         1         0
3       child         22   male    single     n/a       0         1         0
4       child         21   male    single     n/a       0         1         0
5       child         25   female  married    2         6         1         0
6       child-in-law  28   male    married    n/a       5         0         0
7       grandchild    3    male    single     n/a       0         5         6
8       grandchild    1    male    single     n/a       0         5         6
9       non-relative  32   female  separated  2         0         0         0
10      non-relative  10   male    single     n/a       0         9         0
11      non-relative  5    female  single     n/a       0         9         0

Page 22: IPUMS-International Integration Process

. C006-EA-TYPE       N  13     1      RURAL          1
                                      URBAN          2
. C007-HHOLDNUM      N  14-16  3      HHOLD-CODE     001:999
. (record type)      A  17     1
.
.age 2   Data Dictionary: REAL1   IMPS Version 3.1
.        Created: 31/10/95 11:57:21
.        Record Name: POP-RECORD   Record Type: 2
.-------------------------------------------------------------------------------
.tem (occurs) Data Item
. Subitem (occurs)   Type  Position  Len.  Dec.  Value Name  Values
.-------------------------------------------------------------------------------
POP1                 A  18-67  50
. P00-LINENUMBER     N  18-19  2   0  LINE-NUMBER    01:49
. P10-RELATIONSHI    N  20     1   0  HEAD           1
                                      SPOUSE         2
                                      SON-OFHEAD     3
                                      DAU-OFHEAD     4
                                      FATHER         5
                                      MOTHER         6
                                      OTHERRELATIVE  7
                                      NOTRELATED     8
                                      NR             9
. P11-SEX            N  21     1   0  MALE           1
                                      FEMALE         2
                                      NR             9
. P12-AGE            N  22-23  2   0  UNDERONE       00
                                      YEARGIVEN      01:96
                                      OVR97          97
                                      NR             99

Original Data Dictionary – Kenya 1989
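A dictionary like this gives the column positions needed to read the fixed-width records. The sketch below is only an illustration: the item names and positions are copied from the Kenya 1989 dictionary, while the function and the example record are made up.

```python
# (item name, first column, last column), 1-based and inclusive,
# as listed in the Kenya 1989 dictionary for the POP-RECORD.
POP_RECORD_LAYOUT = [
    ("P00-LINENUMBER", 18, 19),
    ("P10-RELATIONSHI", 20, 20),
    ("P11-SEX", 21, 21),
    ("P12-AGE", 22, 23),
]

def parse_record(line, layout=POP_RECORD_LAYOUT):
    """Slice one fixed-width record into named string fields."""
    return {name: line[first - 1:last] for name, first, last in layout}

# Hypothetical record: columns 1-16 are filler, column 17 is the record type.
example = "0" * 16 + "2" + "01" + "1" + "1" + "46"
parsed = parse_record(example)
```

The values are kept as strings at this stage, since codes such as "00" (UNDERONE) and "99" (NR) carry meaning that would be lost in a plain integer conversion.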

Page 23: IPUMS-International Integration Process

Line No.  Item  Data type and item len.  Signification and values

1.        MAPA  N 6                      010001-47XXXX number of the file, where:
                                           - 01-47 is the code of the county
                                           - 0001-XXXX is the code of the census sector within the county
2.        CLAD  N 3                      The order number of the building in the file
3.        LOC   N 3                      The order number of the dwelling within the building
4.        RT    N 1                      Record type value: 4
5.        P00   N 1                      The order number of the household in the dwelling
6.        PNR   N 2                      The order number of the person in the household
7.        P01   N 2                      Relationship with the household head:
                                           . household head 1
                                           . husband / wife 2
                                           . son / daughter 3
                                           . son in law / daughter in law 4
                                           . grandson / granddaughter 5
                                           . father / mother 6
                                           . grandfather / grandmother 7
                                           . brother / sister 8
                                           . brother in law / sister in law 9
                                           . father in law / mother in law 10
                                           . other relative 11
                                           . non-related person 20
8.        P05   N 1                      Situation at the census moment:
                                           . present 1
                                           . temporally absent from the household:
                                             - left in other place of the country 2
                                             - left abroad 3
                                           . absent for a long time:
                                             - for working 4
                                             - for studies 5
                                             - other reason 6

Original Data Dictionary – Romania 1992

Page 24: IPUMS-International Integration Process

=======================================
year: 1982, sample: 1%, record: individual, variable: age
Length: 3  Start: 7
Age in years
0..99
=======================================
year: 1982, sample: 1%, record: individual, variable: race
Length: 2  Start: 10
Ethnicity
01: Han       21: Va         41: Tajik
02: Mongol    22: She        42: Nu
03: Hui       23: Gaoshan    43: Uzbek
04: Tibetan   24: Lahu       44: Russian
05: Uygur     25: Sui        45: Ewenkei
06: Miao      26: Dongxiang  46: Benglong
07: Yi        27: Naxi       47: Baoan
08: Zhuang    28: Jingpo     48: Yugur
09: Bouyi     29: Kirgiz     49: Gin
10: Korean    30: Tu         50: Tatar
11: Man       31: Daur       51: Derung
12: Dong      32: Mulam      52: Orogen
13: Yao       33: Qiang      53: Hezhen
14: Bai       34: Bulang     54: monba
15: Tujia     35: Salar      55: Lhoba
16: Hani      36: Maonan     56: Jino
17: Kazak     37: Gelao      97: Other Unidentified
18: Dai       38: Xibe       98: Naturalized Foreigners
19: Li        39: Achang
20: Lisu      40: Pumi
=======================================
year: 1982, sample: 1%, record: individual, variable: regstats
Length: 1  Start: 12
Registration Status
1: Residing and registered here
2: Residing here over 1 year, but registered elsewhere.
3: Residing here less than 1 year, absent from the registration place 1 year or more.
4: Living here with registration unsettled
5: Used to reside here; is now abroad with no local registration
=======================================

Original Data Dictionary – China 1982

Page 25: IPUMS-International Integration Process

25  CLAVE DE PARENTESCO
    CATALOGO DE PARENTESCO (CATPAREN.TXT)
    PRIMER DIGITO IGUAL A:
    1 JEFE(A)
    2 ESPOSA(O) O COMPAÑERA(O)
    3 HIJO(A)
    4 SIRVIENTE
    5 SIN PARENTESCO
    6 OTRO PARENTESCO
    7 PERSONA SOLA
    9 PARENTESCO NO ESPECIFICADO
26  SEXO
    1 HOMBRE
    2 MUJER
27  EDAD
    AÑOS CUMPLIDOS
    999 EDAD NO ESPECIFICADA
28  LUGAR DE NACIMIENTO
    CATALOGO DE PAISES (CATPAISE.TXT)
    001..032 ENTIDADES DEL PAIS
    033..099 ENTIDAD INSUFICIENTEMENTE ESPECIFICADO
    100..998 OTRO PAIS
    999 NO ESPECIFICO LUGAR DE NACIMIENTO
29  LUGAR DE RESIDENCIA ANTERIOR
    CATALOGO DE PAISES (CATPAISE.TXT)
    001..032 ENTIDADES DEL PAIS
    033..099 ENTIDAD INSUFICIENTEMENTE ESPECIFICADO
    100..998 OTRO PAIS
    999 NO ESPECIFICO LUGAR DE RESIDENCIA ANTERIOR

Original Data Dictionary – Mexico 1990

Page 26: IPUMS-International Integration Process

Enumeration Form: Original File

Page 27: IPUMS-International Integration Process

Enumeration Instructions: Original File (Mexico 1990)

Page 28: IPUMS-International Integration Process

Sample Information – from Statistical Office

Sample information is difficult for the IPUMS project to collect. Often only limited information can be gleaned from available documentation. It is extremely helpful when countries collate the information themselves, as was done below by the Netherlands:

Page 29: IPUMS-International Integration Process

Translate Documents to English

Many countries provide their census documentation in English. For those that do not, the IPUMS project hires translators from around the world. Often these are persons currently or formerly associated with National Statistical Offices. Some common languages are translated by staff in Minnesota.

Page 30: IPUMS-International Integration Process

Editable Enumeration Form – In English

5. Number of Rooms

How many rooms are used for sleeping without counting hallways? _____ Write the number

Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen

_____Write the number

6. Access to water
Read all of the options until you get an affirmative answer. Circle only one answer

1 Running water inside the dwelling
2 Running water outside the dwelling but on the land
3 Running water from a public faucet or hydrant
4 Running water that is carried from another dwelling
5 Tanked in by truck
6 Water from a well, river, lake, stream or other

Answers 3, 4, 5, 6 continue with number 8

7. Water supply
How many days of the week is water available? Circle only one answer

1 Daily
2 Every third day
3 Twice a week
4 Once a week
5 Occasionally

Page 31: IPUMS-International Integration Process

IPUMS Data Dictionary

Rec  Var      Col  Wid  Value  Value_Label                         Value_Label_Original           Freq       Svar
P    relate   36   2           Relationship to household head      P01-Parentesco con el jefe(a)             CR00A400
                        1      Head (male or female)               Jefe o jefa                    960,098
                        2      Spouse or partner                   Esposo(a)/compañera            680,217
                        3      Child or stepchild                  Hijo(a)/hijastro               1,763,230
                        4      Son-in-law or daughter-in-law       Yerno o nuera                  23,644
                        5      Grandchild                          Nieto(a)                       140,300
                        6      Parent or parent in-law             Padres o suegros               44,393
                        7      Other relative                      Otro familiar                  117,223
                        8      Domestic servant or relative        Serv.Domestico o su familiar   11,884
                        9      Other non-relative                  Otro no familiar               69,190
P    sex      38   1           Sex                                 P02-Sexo                                  CR00A401
                        1      Male                                Masculino                      1,902,614
                        2      Female                              Femenino                       1,907,565
P    bpl      39   1           Place of birth                      P04-Lugar de Nacimiento                   CR00A403
                        1      In this same canton                 Mismo canton                   2,303,784
                        2      In another canton                   Otro canton                    1,209,934
                        3      In another country                  Otro pais                      296,461
P    ethnic   40   2           Ethnic group                        P06-Etnia                                 CR00A408
                        1      Indigenous                          Indigena                       63,876
                        2      Black or Afrocostarican             Negra o Afrocostarricense      72,784
                        3      Asian                               China                          7,873
                        4      None of the above                   Ninguna anterior               3,568,471
                        9      Unknown                             Ignorado                       97,175
P    indigsp  42   2           Speaks Indigenous language          P06b-Habla lengua indigena                CR00A410
                        1      Yes, speaks Indigenous lang         Si habla lengua indígena       15,806
                        2      No, does not speak Indigenous lang  No habla lengua indígena       13,768
                        9      Unknown                             Ignorado                       3,554
                        10     [no label]                                                         3,777,051

Page 32: IPUMS-International Integration Process

XML-Tagged Enumeration Form

5. Number of Rooms <svar v="MX00A016" a="all"> How many rooms are used for sleeping without counting hallways?

<i1> _____ Write the number </i1>

</svar> <svar v="MX00A017" a="all"> Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen

<i1> _____Write the number </i1>

</svar> <svar v="MX00A018" a="all"> 6. Access to water Read all of the options until you get an affirmative answer. Circle only one answer

<i1> 1 Running water inside the dwelling 2 Running water outside the dwelling but on the land 3 Running water from a public faucet or hydrant 4 Running water that is carried from another dwelling 5 Tanked in by truck 6 Water from a well, river, lake, stream or other </i1>

Answers 3, 4, 5, 6 continue with number 8 </svar>
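Once the form text is tagged, the passage belonging to each input variable can be pulled out mechanically. The sketch below uses Python's standard XML parser on a shortened copy of the tagged form; wrapping the fragment in a `<form>` root element is an assumption made so that the fragment parses, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

fragment = """
5. Number of Rooms
<svar v="MX00A016" a="all"> How many rooms are used for sleeping without counting hallways?
<i1> _____ Write the number </i1>
</svar>
<svar v="MX00A017" a="all"> Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen
<i1> _____Write the number </i1>
</svar>
"""

def text_by_variable(tagged):
    """Map each svar's variable name to the form text it encloses."""
    # The fragment has no single root element, so wrap it in one.
    root = ET.fromstring(f"<form>{tagged}</form>")
    result = {}
    for svar in root.iter("svar"):
        # itertext() includes nested elements; split/join collapses whitespace.
        result[svar.get("v")] = " ".join("".join(svar.itertext()).split())
    return result

texts = text_by_variable(fragment)
```

This only works if the tagged text is well-formed, with characters such as `&` and `<` escaped.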

Page 33: IPUMS-International Integration Process

Document Unharmonized Variables

The enumeration form and instruction text provide most of the documentation for the unharmonized input variables.

Other documentation is written as needed to clarify the interpretation of the variable for users.

We also empirically determine the universe of persons or households with valid values for each variable.

Page 34: IPUMS-International Integration Process

Variable Description (Literacy)

<vardesc>
<var> LIT </var>
<desc> LIT indicates whether or not the respondent could read and write in any language. A person is typically considered literate if they can both read and write. All other persons are illiterate, including those who can either read or write but cannot do both. </desc>
<comp> Some samples provided more specific criteria than others with respect to the level of ability that should constitute literacy. Typically, the instructions appear to be aimed at distinguishing persons who have memorized how to write their signature or recognize certain words from those who can truly write and comprehend text they read. In 1999 Vietnam, all persons with 5 or more years of schooling are automatically considered literate. </comp>
<comp.bra> All Brazilian censuses consistently stipulated that to be considered literate a person must be able to read and write a simple note in any language. Persons are not literate if they can only write their name or if they once learned to read and write but have since forgotten. </comp.bra>
<comp.chn> The Chinese census instructions supplied explicit criteria for defining illiterate and semi-literate persons, who are combined in the data as "illiterate." The instructions stated that illiterate and semi-literate persons were those who knew fewer than 1500 words and could not read "simple language books and newspapers or write a simple message." </comp.chn>

Page 35: IPUMS-International Integration Process

Sample Design