Data Projects at the Minnesota Population Center
Resources for Comparative Population and Health Research
Seattle, Washington
May 22, 2014
Elizabeth Boyle, Miriam King, Matthew Sobek
Minnesota Population Center, University of Minnesota
Integrated Public Use Microdata Series
U.S. Labor Force Participation: 1850-2012
Men
Women
Steve Ruggles
1995: “King of Quant”
President Population Association of America
New U.S. Data From Ancestry.com
We build data infrastructure for the research community and specialize in data harmonization.
World’s largest collection of individual population and health data, across 9 projects.
50,000 registered users from over 100 countries.
Free
Minnesota Population Center
MPC Data Dissemination, 1993-2012
Gigabytes per week
MPC Data Projects
The Problem
1. Combining data from multiple sources is time consuming
   Discovery
   Data management
2. It’s error prone
   Recoding data
   Overlooking documentation
3. Hard to replicate results
4. Discourages comparative research
Outline
Harmonization methods
Dissemination system
International projects
   Integrated DHS
   Terra Populus
   IPUMS-International
Terminology
Harmonization:
Combining datasets collected at different times or places into a single, consistent data series.
“Integration”
Metadata:
Data about data. Documentation in broadest sense.
Relation to head
Marital status Education Occupation
Microdata
Summary Data
Harmonization Methods
Metadata
Data
Dissemination
Systematize Metadata (record layout file, pdf)
MPC Data Dictionary

Variable  Start  Width  Value    Variable/value label                    Frequency  Universe
SMOKE100  57     1               Ever smoked 100 cigarettes                         All persons
                        1        Yes                                     54,189
                        2        No                                      59,501
                        7        Don't know/Not sure                     205
                        9        Refused                                 39
SMOKENOW  58     1               Smoke cigarettes now                               Persons who ever smoked
                        1        Yes                                     25,644
                        2        No                                      28,535
                        7        Don't know/Not sure                     0
                        9        Refused                                 10
                        Blank    [no label]                              59,745
SMOKE30   59     2               Number of days smoked in the last 30               Persons who currently smoke
                        1 to 30  Number of days                          25,290
                        77       Don't know/Not sure                     293
                        88       None                                    49
                        99       Refused                                 12
                        Blank    [no label]                              88,290
SMOKENUM  61     2               Number of cigarettes smoked per day                Persons who currently smoke
                        0 to 76  Number of cigarettes                    22,292
                        77       Don't know/Not sure                     248
                        99       Refused                                 43
                        Blank    [no label]                              91,351
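A record layout of this kind is enough to drive a generic fixed-width reader. The sketch below is only an illustration of that idea, not MPC software: it assumes the Start column is 1-based, and `LAYOUT`, `parse_record`, and the sample record are made up for the example.

```python
# Hypothetical layout taken from the data dictionary above:
# variable name -> (1-based start column, width).
LAYOUT = {
    "SMOKE100": (57, 1),
    "SMOKENOW": (58, 1),
    "SMOKE30": (59, 2),
    "SMOKENUM": (61, 2),
}


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of raw string values.

    Blank fields (cases outside the variable's universe) come back as None.
    """
    values = {}
    for name, (start, width) in layout.items():
        raw = line[start - 1:start - 1 + width]
        values[name] = raw.strip() or None
    return values


# Made-up record: 56 filler columns, then SMOKE100=1, SMOKENOW=1,
# SMOKE30=30, SMOKENUM=20.
record = " " * 56 + "1" + "1" + "30" + "20"
parsed = parse_record(record)
print(parsed)
```

Keeping the layout as data rather than hard-coded slices is what lets the same reader serve every sample once its dictionary has been systematized.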
WaterAccess
Convert Questionnaires to Metadata (Mexico 2000)
5. Number of Rooms
How many rooms are used for sleeping without counting hallways?
_____ Write the number
Without counting the hallways or bathrooms, how many total rooms are in this dwelling? Count the kitchen.
_____ Write the number
6. Access to water
Read all of the options until you get an affirmative answer. Circle only one answer.
1 Running water inside the dwelling
2 Running water outside the dwelling but on the land
3 Running water from a public faucet or hydrant
4 Running water that is carried from another dwelling
5 Tanked in by truck
6 Water from a well, river, lake, stream or other
Answers 3, 4, 5, 6 continue with number 8
7. Water supply
How many days of the week is water available? Circle only one answer.
1 Daily
2 Every third day
3 Twice a week
4 Once a week
5 Occasionally
Metadata: Questionnaire Text
Water access
Bedrooms
Rooms
XML-Tagged Questionnaire Text
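Tagging each block of questionnaire text with the variable it feeds makes it possible to pull up the relevant wording for any variable. A minimal sketch of that lookup follows; the element name `question`, the attribute `vars`, and the variable names are assumptions for illustration and are not the MPC's actual tagging scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element, attribute, and variable names are illustrative.
TAGGED = """
<questionnaire sample="Mexico 2000">
  <question vars="BEDROOMS">How many rooms are used for sleeping without counting hallways?</question>
  <question vars="ROOMS">Without counting the hallways or bathrooms, how many total rooms are in this dwelling? Count the kitchen.</question>
  <question vars="WATERACCESS">Access to water: read all of the options until you get an affirmative answer.</question>
</questionnaire>
"""


def text_for_variable(xml_text, variable):
    """Return the text of every question tagged with the given variable."""
    root = ET.fromstring(xml_text)
    return [
        q.text.strip()
        for q in root.iter("question")
        if variable in q.get("vars", "").split()
    ]


print(text_for_variable(TAGGED, "ROOMS"))
```

Because `vars` is split on whitespace, one question could be tagged with several variables in this sketch.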
Data: Variable Harmonization
Marital Status: IPUMS-International
Bangladesh 2011
1 = Unmarried
2 = Married
3 = Widowed
4 = Divorced/separated
Mexico 1970
1 = Married, civil & relig
2 = Married, civil
3 = Married, religious
4 = Consensual union
5 = Widowed
6 = Divorced
7 = Separated
8 = Single
Kenya 1999
1 = Never married
2 = Monogamous
3 = Polygamous
4 = Widowed
5 = Divorced
6 = Separated
Translation Table

Harmonized                    Input
Code  Label                   Mexico 1970                 Bangladesh 2011         Kenya 1999
100   Single                  8 = Single                  1 = Unmarried           1 = Never married
200   Married or in union                                 2 = Married
210   Married, formally
211   Civil                   2 = Married, civil
212   Religious               3 = Married, religious
213   Civil and religious     1 = Married, civil & relig
214   Monogamous                                                                  2 = Monogamous
215   Polygamous                                                                  3 = Polygamous
220   Consensual union        4 = Consensual union
300   Divorced or separated                               4 = Divrc or separated
310   Separated               7 = Separated                                       6 = Separated
320   Divorced                6 = Divorced                                        5 = Divorced
400   Widowed                 5 = Widowed                 3 = Widowed             4 = Widowed
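The translation table can be applied mechanically once it is stored as data. The sketch below is one possible encoding, not the MPC's implementation: the dictionary layout and the function names are assumptions, the sample-to-code rows are read off the slide by matching labels, and treating the first digit as a general category is an interpretation of the three-digit codes.

```python
# (sample, input code) -> harmonized code, read off the translation table.
TRANSLATION = {
    ("Mexico 1970", 1): 213, ("Mexico 1970", 2): 211,
    ("Mexico 1970", 3): 212, ("Mexico 1970", 4): 220,
    ("Mexico 1970", 5): 400, ("Mexico 1970", 6): 320,
    ("Mexico 1970", 7): 310, ("Mexico 1970", 8): 100,
    ("Bangladesh 2011", 1): 100, ("Bangladesh 2011", 2): 200,
    ("Bangladesh 2011", 3): 400, ("Bangladesh 2011", 4): 300,
    ("Kenya 1999", 1): 100, ("Kenya 1999", 2): 214,
    ("Kenya 1999", 3): 215, ("Kenya 1999", 4): 400,
    ("Kenya 1999", 5): 320, ("Kenya 1999", 6): 310,
}

# Labels for the general (first-digit) categories.
GENERAL_LABELS = {
    1: "Single",
    2: "Married or in union",
    3: "Divorced or separated",
    4: "Widowed",
}


def harmonize(sample, code):
    """Translate a sample-specific input code to the harmonized code."""
    return TRANSLATION[(sample, code)]


def general(harmonized_code):
    """Collapse a detailed harmonized code to its first digit."""
    return harmonized_code // 100


# Mexico distinguishes divorce from separation; Bangladesh does not.
# Both remain comparable at the general level.
for sample, code in [("Mexico 1970", 6), ("Bangladesh 2011", 4)]:
    detailed = harmonize(sample, code)
    print(sample, code, detailed, GENERAL_LABELS[general(detailed)])
```

The composite code keeps the detail each sample offers while still giving every sample a value at the level all of them share.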
Data Dissemination System
Variables Page
238 censuses
Sample Filtering
Variables Page – Filtered
Variable Page: Marital Status
Variable Codes (Marital status)
Variable Page: Marital Status
Variable Comparability Discussion (Marital status)
Variable Page: Documentation
Questionnaire Text
Questionnaire Text (Marital status, Cambodia)
Variables Page
Extract Summary
Case Selection
Age of spouse
Employment status of father
Occupation of father
Attached Characteristics
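Attaching a characteristic means copying a value from a related household member (spouse, father) onto each person's own record. A minimal sketch, assuming every person record carries pointers to the spouse's and father's person number within the household; the field names follow the IPUMS pointer variables SPLOC and POPLOC, but the record layout, the function `attach`, and the household are made up for illustration.

```python
def attach(household, pointer, source_var, new_var):
    """Copy source_var from the pointed-to household member into new_var.

    household: list of person dicts, each with a PERNUM (person number).
    pointer:   field holding the related person's PERNUM, where 0 means
               no such person is present in the household.
    """
    by_pernum = {p["PERNUM"]: p for p in household}
    for person in household:
        other = by_pernum.get(person[pointer])
        person[new_var] = other[source_var] if other else None
    return household


# Made-up three-person household: two spouses and their child.
household = [
    {"PERNUM": 1, "AGE": 41, "OCC": "farmer", "SPLOC": 2, "POPLOC": 0},
    {"PERNUM": 2, "AGE": 38, "OCC": "teacher", "SPLOC": 1, "POPLOC": 0},
    {"PERNUM": 3, "AGE": 12, "OCC": None, "SPLOC": 0, "POPLOC": 1},
]
attach(household, "SPLOC", "AGE", "AGE_SP")    # age of spouse
attach(household, "POPLOC", "OCC", "OCC_POP")  # occupation of father
print(household[0]["AGE_SP"], household[2]["OCC_POP"])
```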
Extract Summary
Download or Revise Extract
On-line Analysis
The International Projects
Integrated DHS
Foremost source of health information for the developing world
Funded by USAID
Since the 1980s: over 300 surveys in 90 countries
Topics: fertility, nutrition, HIV, malaria, maternal and child health, etc.
Demographic and Health Surveys
5-year NIH grant (end of year 2)
Focus on Africa, with India
Partnership with ICF-International and USAID
IDHS Project
Motivation: DHS is incredibly valuable, but it’s hard to capitalize on its full potential.
Problem:
Data discovery
Dispersed documentation
Data management
Variable changes over time
Not unique to DHS: endemic to any survey that’s persisted over decades.
Why an Integrated DHS?
DHS Research Process Example: Find data on female genital cutting
Survey Search Tool
Recode notes
Data dictionary
Just the woman file – for one survey. 61 to go.
Still need Report (377 page pdf)
• Contains questionnaire and sample design information
• Errata file
DHS “Recode Variables” make it more harmonized than most surveys
   Consistent variable names
   Each DHS phase has a shared model questionnaire
But:
6 phases over 25+ years
Country control over final wording of surveys
Country-specific variables
The recode variables can be a two-edged sword
At least the DHS variables are already harmonized, right?
Harmonization: Religion
Samples: Ghana 1993, Ghana 2008, India 1992, India 2005 (source variable V130 in each)
Harmonized code and label, followed by the source codes mapped to it:
100 Muslim/Islam: 4 = Muslim; 7 = Moslem; 1 = Muslim; 2 = Muslim
200 Christian: 2 = Christian; 3 = Christian
201 Catholic: 2 = Catholic; 1 = Catholic
202 Protestant: 1 = Protestant
203 Anglican: 2 = Anglican
204 Methodist: 3 = Methodist
205 Presbyterian: 4 = Presbyterian
206 Pentecostal: 5 = Pentecostal
208 Other Christian: 3 = Other Christian; 6 = Other Christian
300 Other
301 Hindu: 0 = Hindu; 1 = Hindu
302 Sikh: 3 = Sikh; 4 = Sikh
303 Buddhist: 5 = Buddhist
304 Jain: 6 = Jain
305 Jewish: 7 = Jewish
306 Parsi/Zoroastrian: 8 = Parsi/Zoroastrian
307 Donyi-Polo: 10 = Donyi polo
400 Traditional/spiritual: 8 = Trad/spiritualist
401 Traditional: 5 = Traditional
402 Spiritual
403 Animist
500 No religion: 0 = No religion; 9 = No religion; 9 = No religion
600 Other: 96 = Other; 4 = Other; 96 = Other
Harmonization: Female Circumcision
Ever Circumcised

Sample         Variable  Label
Egypt 1995     S802      Ever circumcised
Egypt 2005     S801      Respondent circumcised
Egypt 2008     G102      Respondent circumcised
Ethiopia 2000  FG103     Circumcised
Ethiopia 2005  FG103     Circumcised
Ghana 2003     S821      Circumcised
Kenya 1998     S1002     Respondent circumcised
Kenya 2003     S821      Circumcised
Kenya 2008     G102      Respondent circumcised
Mali 1995      S551      Circumcised
Mali 2001      FG103     Circumcised?
Mali 2006      G102      Respondent circumcised
Nigeria 1999   S521      Type of circumcision
Nigeria 2003   FG103     Circumcised
Nigeria 2008   G102      Respondent circumcised
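The listing above amounts to a crosswalk from sample to source variable name. A sketch of how such a crosswalk might be used to read the same concept from differently named variables follows; the function name and the records are made up, and the value codes inside each source variable would still need their own recoding (Nigeria 1999, for instance, records type of circumcision rather than a yes/no answer).

```python
# Source variable carrying the circumcision question in each sample,
# transcribed from the listing above.
SOURCE_VAR = {
    "Egypt 1995": "S802", "Egypt 2005": "S801", "Egypt 2008": "G102",
    "Ethiopia 2000": "FG103", "Ethiopia 2005": "FG103",
    "Ghana 2003": "S821",
    "Kenya 1998": "S1002", "Kenya 2003": "S821", "Kenya 2008": "G102",
    "Mali 1995": "S551", "Mali 2001": "FG103", "Mali 2006": "G102",
    "Nigeria 1999": "S521", "Nigeria 2003": "FG103", "Nigeria 2008": "G102",
}


def read_source_value(sample, record, crosswalk=SOURCE_VAR):
    """Return the raw answer stored under the sample-specific variable name."""
    return record[crosswalk[sample]]


# Made-up records; the values are placeholders, not real DHS codes.
print(read_source_value("Kenya 1998", {"S1002": 1}))
print(read_source_value("Mali 2006", {"G102": 0}))
print(len(SOURCE_VAR), "samples,",
      len(set(SOURCE_VAR.values())), "distinct variable names")
```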
Timeline: 2014 (current)
9 countries, 39 samples
Much of the woman files
Women of childbearing age as unit of analysis
Timeline: 2015
15 countries, 69 samples
Complete the woman files
Children & birth files
Timeline: 2017
21 countries, 94 samples
Men and couples files
Timeline: Next grant
41 African countries, 130+ samples
11 Asian countries, 32+ samples
Beta
Lower barriers to conducting research on population and the environment.
Motivation:
The data from different domains have incompatible formats, and few researchers have the skills to combine them
Terra Populus Goal
5-year NSF grant
At mid-point: year 3
TerraPop
6 countries: Argentina
Brazil
Malawi
Spain
United States
Vietnam
Population Microdata
Tabulations of census data for administrative units
Area-level Data
Land cover from satellite images (Global Land Cover 2000)
Agricultural use from satellites and government records (Global Landscapes Initiative)
Climate from weather stations (WorldClim)
Environmental Data
Rasters (Grid Cells)
Microdata
Area-level data
Rasters
Mix and match variables originating in any of the data structures
Obtain output in the data structure most useful to you
Location-Based Integration
Individuals and households with their environmental and social context
Microdata
Area-level data
Rasters
Location-Based Integration
Summarized environmental and population characteristics for administrative districts
Microdata
Area-level data
Rasters

County ID     Mean Ann. Temp.  Max. Ann. Precip.  Rent, Rural  Rent, Urban  Own, Rural  Own, Urban
G17003100001  21.2             768                3129         1063         637         365
G17003100002  23.4             589                2949         1075         1469        717
G17003100003  24.3             867                3418         1589         1108        617
G17003100004  21.5             943                1882         425          202         142
G17003100005  24.1             867                2416         572          426         197
G17003100006  24.4             697                2560         934          950         563
G17003100007  25.6             701                2126         653          321         215
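Once environmental summaries and census tabulations share a district identifier, combining them is a join on that key. A minimal sketch using the first three county IDs from the slide; the dictionary layout, the field names, and `join_by_area` are assumptions for illustration and do not describe the TerraPop system.

```python
# Raster-derived summaries keyed by county ID (values from the slide).
ENVIRONMENT = {
    "G17003100001": {"mean_ann_temp": 21.2, "max_ann_precip": 768},
    "G17003100002": {"mean_ann_temp": 23.4, "max_ann_precip": 589},
    "G17003100003": {"mean_ann_temp": 24.3, "max_ann_precip": 867},
}

# Microdata tabulated by tenure and urban/rural status (values from the slide).
TENURE = {
    "G17003100001": {"rent_rural": 3129, "rent_urban": 1063,
                     "own_rural": 637, "own_urban": 365},
    "G17003100002": {"rent_rural": 2949, "rent_urban": 1075,
                     "own_rural": 1469, "own_urban": 717},
    "G17003100003": {"rent_rural": 3418, "rent_urban": 1589,
                     "own_rural": 1108, "own_urban": 617},
}


def join_by_area(*tables):
    """Merge per-area tables on their shared keys (an inner join)."""
    keys = set(tables[0])
    for table in tables[1:]:
        keys &= set(table)
    merged = {}
    for key in sorted(keys):
        row = {}
        for table in tables:
            row.update(table[key])
        merged[key] = row
    return merged


area_level = join_by_area(ENVIRONMENT, TENURE)
for county, row in area_level.items():
    print(county, row)
```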
Location-Based Integration
Rasters of population and environment data
Microdata
Area-level data
Rasters
Location-Based Integration
Rasterization of Area-Level Data
Area-Level Summary of Raster Data
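Both transformations can be illustrated on a toy grid. The sketch assumes the boundaries have already been converted to a grid of area IDs aligned cell for cell with the data raster; real workflows rely on boundary polygons and GIS software, and the grids, values, and function names here are made up.

```python
from collections import defaultdict


def summarize_by_area(zone_grid, value_grid):
    """Area-level summary of raster data: mean cell value per area."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for zone_row, value_row in zip(zone_grid, value_grid):
        for zone, value in zip(zone_row, value_row):
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


def rasterize(zone_grid, area_values):
    """Rasterization of area-level data: give each cell its area's value."""
    return [[area_values[zone] for zone in row] for row in zone_grid]


# Made-up 2 x 3 grids: area IDs and a temperature raster.
ZONES = [["A", "A", "B"],
         ["A", "B", "B"]]
TEMPS = [[20.0, 22.0, 24.0],
         [21.0, 26.0, 25.0]]

means = summarize_by_area(ZONES, TEMPS)
cells = rasterize(ZONES, {"A": 100, "B": 250})
print(means)
print(cells)
```

Either direction depends on the area grid, which is why the quality of the boundary files determines the quality of the linkage.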
Boundaries are Key
Linkages across data formats rely on administrative unit boundaries
Particular needs
   Lower level boundaries
   Historical boundaries
Geographic Harmonization
Web interface will change significantly in fall 2014
Fast microdata tabulator needed
Beta Version
IPUMS-International
Census microdata from around the world
Funded by NSF and NIH
Motivation:
Provide data access
Preservation
Khartoum, CBS-Sudan
Dhaka, Bangladesh Bureau of Statistics
IPUMS-International
Participating
Disseminating
IPUMS Censuses Per Country
Variables Included in Extracts
Top Institutional Users

Rank  Country    Institution
1     USA        University of Minnesota
2     USA        Harvard University
3     USA        University of Michigan Ann Arbor
4     USA        Columbia University
5     Spain      Autonomous University Barcelona
6     USA        Arizona State University
7     Singapore  National University of Singapore
8     IADB       Inter American Development Bank
9     WB         World Bank Group
10    USA        University of California Berkeley
11    USA        Vanderbilt University
12    USA        University of Chicago
13    Australia  University of Queensland Australia
14    USA        University of California Los Angeles
15    USA        Dartmouth College
16    Brazil     Universidade Federal de Minas Gerais
17    Mexico     El Colegio de México
18    USA        Yale University
19    China      University of Hong Kong
20    USA        University of Washington
21    UK         London School Economics
22    UK         University of Stirling
23    France     Université de Bordeaux 4
24    Austria    University of Vienna
25    Malaysia   National University of Malaysia
26    Austria    Vienna Institute of Demography
27    USA        Pew Research Center
28    Colombia   Universidad del Valle
29    USA        University of Delaware
30    USA        Brown University
Millennium Development Goals
Ratio of literate women to men, 15-24 years old
Source: Cuesta and Lovatón (2014) 1990 Census round
Millennium Development Goals
Source: Cuesta and Lovatón (2014) Data Source: IPUMS-International, Minnesota Population Center
Census 1993 Census 2005
Colombia: Adolescent Birth Rate
Data acquisition
Outreach: developing countries
Virtual data enclave
IPUMSI Future
Thank you!