geo-referencing and integrating survey and census data · 4/23/2013 · 6th seminar “jean...
Geo-Referencing and Integrating Survey and Census Data
6th Seminar “Jean Paelinck” in Spatial Econometrics Autonomous University of Madrid
Motivation and Aim of the paper
There is an increasing need for small-area data; however, missing or unavailable data is very frequent, especially in developing countries. Census data on housing and households is accessible in most countries, but it covers a limited number of variables. In the last two decades, a growing number of household surveys have become available for developing countries, promoted especially by international organizations, so their questionnaires and sampling methodologies are similar. A household survey has a larger number of questions and variables than the census, but its observations are a small portion of the population.
The objective of the paper is to propose a methodology to extrapolate the information in the household survey to the census.
ELL Methodology
The Elbers, Lanjouw & Lanjouw (2003) methodology (ELL) has been very influential in producing data for small areas, especially in generating poverty maps in more than 40 developing countries.
A large set of papers reports applications of ELL to generate data for small areas that are covered by the Census but lack the variables available in the household survey of the same area.
Elbers, C., J.O. Lanjouw, P. Lanjouw (2003). Micro‐Level Estimation of Poverty and Inequality. Econometrica, Vol. 71, No. 1, pp. 355‐364
ELL Methodology
Let W be an indicator of poverty or inequality based on the distribution of a household‐level variable of interest, Yh.
Using a smaller and richer data sample, the joint distribution of Yh (which is not in the larger data set) and a vector of covariates, Xh, is estimated, restricting the set of explanatory variables to those that can also be linked to households in the larger sample or census.
The estimated distribution can be used to generate the distribution of Yh for any subpopulation in the larger sample, conditional on the subpopulation's observed characteristics.
This allows the generation of the conditional distribution of W, in particular its point estimate and prediction error.
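The steps above can be illustrated with a minimal sketch. It is not the ELL estimator itself: it assumes a plain linear model with a single homoscedastic error, ignores parameter uncertainty and cluster effects, and uses synthetic data. The sample sizes, coefficients, and poverty line are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical survey: covariates X_s (with intercept) and log income y_s.
n_survey, n_census = 500, 5000
X_s = np.column_stack([np.ones(n_survey), rng.normal(size=(n_survey, 2))])
beta_true = np.array([12.0, 0.5, -0.3])  # assumed, for simulation only
y_s = X_s @ beta_true + rng.normal(scale=0.4, size=n_survey)

# Step 1: estimate the relation between Y_h and X_h on the survey.
beta_hat, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)
residuals = y_s - X_s @ beta_hat
sigma_hat = residuals.std(ddof=X_s.shape[1])

# Hypothetical census: the same covariates, but no income variable.
X_c = np.column_stack([np.ones(n_census), rng.normal(size=(n_census, 2))])

# Step 2: simulate Y_h for census households and compute W in each draw.
poverty_line = np.log(100_000)  # assumed poverty line, in log income


def headcount(log_income):
    """W: share of households below the poverty line."""
    return float(np.mean(log_income < poverty_line))


draws = np.array([
    headcount(X_c @ beta_hat + rng.normal(scale=sigma_hat, size=n_census))
    for _ in range(200)
])

# Point estimate of W and a simulation-based prediction error.
w_point, w_error = draws.mean(), draws.std(ddof=1)
print(f"W = {w_point:.3f} (prediction error {w_error:.4f})")
```

The mean of the simulated draws plays the role of the point estimate of W, and their spread gives a rough prediction error for the chosen subpopulation.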
ELL Methodology
$$Y_h = X_h \beta + \varepsilon_h + u_{ch}$$
The second error term accounts for “cluster correlation”, which we would call spatial correlation within the cluster.
However, the authors assume that this affects only the variance estimation (an efficiency problem, not a bias in the estimator), and they also implicitly assume that the clusters are independently distributed across the territory.
This assumption works, in the best of cases, when spatial autocorrelation affects only the efficiency of the estimator; it does not work if the spatial autocorrelation biases the estimator.
BUT
Some questions:
How does the ELL method work when there are geographic areas that are not covered by the survey?
How can the ELL method allow for the location of observations in the survey and the interactions between them?
There is no information available regarding the first question, which we address in this article. Recent applications of small area estimation question the homogeneity conditions and the treatment of space that the ELL model implicitly assumes.
Questioning the ELL model
Tarozzi and Deaton (2009) pointed out that the hypothesis of random effects within homoscedastic, independent and identically distributed clusters assumed by the ELL method is a very strong one, because it ignores the integration of areas with their neighbors.
Minot and Baulch (2005) showed significant differences in the spatial distribution of poverty in small areas, depending on their closeness to large urban centers.
Sowunmi et al. (2012) and Benson et al. (2005) identified spatial patterns of poverty in small areas that are strongly autocorrelated with their neighbors.
Mohamed and Mohamed (2009) question the ELL method in a study of household expenditure in small areas using two spatial models and one non‐spatial model, showing that the same set of variables differs significantly in how it explains a variable across areas, owing to spatial heterogeneity and interdependence.
Matching‐Kriging Methodology
The methodology proposed in this paper has two stages:
Stage 1: Using Coarsened Exact Matching (CEM) to find survey observations in the census. This allows using the geographical information of the census at a lower level than is available in the survey.
Stage 2: Interpolation by Kriging to estimate missing data in small areas.
There will also be an extension that uses not only geographical information but also information related to the variables of interest (co‐kriging).
Stage 1: Spatial Matching for a County and its Districts
The matching estimator technique (Coarsened Exact Matching) consists of pairing sample units that belong to a “treated” (survey) group with units in a “control” (census) group that are similar in terms of their observable characteristics.
After homogenizing certain conditions between the two data collection methods, we can integrate survey and census data. Then, using a data warehouse and OLAP (online analytical processing) tools, the user can access a joint and integrated view of the two instruments. The integrated database is used to generate models for each subject of the survey.
This pairing transfers the geographic location of census clones to their clones in the survey so that they can be associated with a specific place within the sampled area.
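The pairing idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: it matches on five covariates rather than the full set of shared questions, the coarsening (10-year age bands) is arbitrary, ties are broken by taking the first census clone, and all records and incomes are hypothetical.

```python
from collections import defaultdict


def stratum(record):
    """Coarsen the shared covariates into a matching key (bin widths are illustrative)."""
    age_band = min(record["age"] // 10, 8)  # 10-year age bands, 80+ pooled
    return (record["county"], record["zone"], record["gender"],
            age_band, record["education"])


# Hypothetical census records: they carry the district but no income.
census = [
    {"id": "c1", "county": "Ñuñoa", "district": "Pucará", "zone": "urban",
     "gender": "F", "age": 34, "education": 3},
    {"id": "c2", "county": "Ñuñoa", "district": "Santa Julia", "zone": "urban",
     "gender": "M", "age": 52, "education": 2},
    {"id": "c3", "county": "Macul", "district": "Pedreros", "zone": "urban",
     "gender": "F", "age": 37, "education": 3},
]

# Hypothetical survey records: they carry income but no district.
survey = [
    {"id": "s1", "county": "Ñuñoa", "zone": "urban", "gender": "F",
     "age": 38, "education": 3, "income": 550_000},
    {"id": "s2", "county": "Macul", "zone": "urban", "gender": "M",
     "age": 45, "education": 1, "income": 210_000},
]

# Index the census by stratum so each survey record can look up its clones.
census_by_stratum = defaultdict(list)
for person in census:
    census_by_stratum[stratum(person)].append(person)

# Transfer the district of a census clone to each matched survey record.
geo_referenced = []
for respondent in survey:
    clones = census_by_stratum.get(stratum(respondent), [])
    if clones:
        geo_referenced.append({
            "survey_id": respondent["id"],
            "district": clones[0]["district"],  # tie-breaking rule is an assumption
            "income": respondent["income"],
        })

print(geo_referenced)
```

Survey records without a census clone in the same stratum stay unmatched, which is the usual trade-off in coarsened matching: finer bins give closer clones but fewer matches.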
Stage 1: Spatial Matching for a County and its Districts
Homogenizing conditions between the two data collections. For example, recoding the question “What is your relationship with the householder?”:

Census
Relationship with the householder   Census code   Change code
Chief or head of household          1
Husband-wife/spouse                 2             2
Live together/consort               3             2
Son/daughter                        4             3
Step son/daughter                   5             3
Son/daughter in law                 6
Grandchild                          7
Brother/sister                      8
Brother/sister in law               9
Father/Mother                       10            4
Father/Mother in law                11            5
Other sib                           12            10
Roommate (friends)                  13            11
Maid servant house                  14            12
Roommate (no friends)               15

Survey
Survey code   Relationship with the householder
1             Chief or head of household
2             Husband-wife/consort
3             Son/daughter- step son/daughter
4             Father/Mother
5             Father/Mother in law
6             Son/daughter in law
7             Grandchild
8             Brother/sister
9             Brother/sister in law
10            Other sib
11            Roommate (friends)
12            Maid servant house
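The recoding in the table above could be applied with a simple lookup, sketched here. Two points are assumptions rather than something the slide states: a blank change code is read as "keep the census code", and census code 15, which has no survey counterpart, is mapped to None. The function name is hypothetical.

```python
# Census codes that the table recodes to a different survey code.
CENSUS_TO_SURVEY = {
    2: 2,    # Husband-wife/spouse      -> Husband-wife/consort
    3: 2,    # Live together/consort    -> Husband-wife/consort
    4: 3,    # Son/daughter             -> Son/daughter - step son/daughter
    5: 3,    # Step son/daughter        -> Son/daughter - step son/daughter
    10: 4,   # Father/Mother            -> Father/Mother
    11: 5,   # Father/Mother in law     -> Father/Mother in law
    12: 10,  # Other sib                -> Other sib
    13: 11,  # Roommate (friends)       -> Roommate (friends)
    14: 12,  # Maid servant house       -> Maid servant house
}

# Census codes with a blank change code (assumed to stay unchanged).
UNCHANGED = {1, 6, 7, 8, 9}


def harmonize_relationship(census_code):
    """Return the survey code for a census relationship code, or None if there is none."""
    if census_code in CENSUS_TO_SURVEY:
        return CENSUS_TO_SURVEY[census_code]
    if census_code in UNCHANGED:
        return census_code
    return None  # e.g. 15, Roommate (no friends): no survey category


print([harmonize_relationship(code) for code in range(1, 16)])
```

Once both sources share one coding, the relationship variable can be used as one of the matching covariates.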
This search is conducted using exact pairing in order to identify an observation in the census that has the same characteristics as each observation from the survey with characteristics xi (set of shared co‐variables).
Stage 1: Spatial Matching for a County and its Districts
Identification: Area code, Province code, County code, Zone (rural-urban)
Relationship with the head of household, Gender, Native group descendant, Age, Educational level, Employment status
Housing Variables: Walls quality, Floor quality, Roof quality, Type of housing, Homeownership, Power Supplier, Water Supplier, Water connection, Sewer System
Home Appliance Variables: Refrigerator, Telephone, Microwave, Computer, Calefont (water heater), Cell phone, TV Cable
We call this event spatial matching, and it is generated by finding n regressions of variables for n spatial clones. As such, the “pairs or clones” of the survey provide income information while their pair or clone in the census provides the spatial location within the municipality.
Grey color represents the census data, while the dark dots are the non geo‐referenced survey data. We use 29 similar questions in both data sets to match individuals.
Visual Representation of the Spatial Matching Estimator Between Census and Survey Data
Stage 1: Spatial Matching for a County and its Districts
Stage 2: Interpolation by Kriging
According to a comprehensive evaluation of spatial interpolation models (Liu et al., 2011), Kriging is one of the most reliable methods for data with significant spatial autocorrelation.
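Then the interpolation of the variable, or a function of the variable, at a point with unknown value (x0) in the space will be:
$$\hat{Z}(x_0) = \lambda_1 Z(x_1) + \lambda_2 Z(x_2) + \dots + \lambda_n Z(x_n)$$
or in general terms
$$\hat{Z}(x_0) = \sum_{i=1}^{n} \lambda_i Z(x_i)$$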
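Stage 2: Interpolation by Kriging
• Let x be a geo‐referenced variable of interest and h a distance measure on the space, then define:
(Semi‐variogram)
$$\gamma(h) = \frac{1}{2} E\left[ \left( Z(x_i) - Z(x_i + h) \right)^2 \right]$$
• Then, a matrix is defined for all the x used for the interpolation as:
$$\Gamma = \begin{pmatrix}
0 & \gamma(x_1 - x_2) & \cdots & \gamma(x_1 - x_n) & 1 \\
\gamma(x_2 - x_1) & 0 & \cdots & \gamma(x_2 - x_n) & 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\gamma(x_n - x_1) & \gamma(x_n - x_2) & \cdots & 0 & 1 \\
1 & 1 & \cdots & 1 & 0
\end{pmatrix}$$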
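In practice the semi‐variogram has to be estimated from the geo‐referenced observations before the matrix can be filled in. A minimal sketch of the classical empirical estimator follows: half the mean squared difference over pairs of points whose distance falls in each bin. The function name, the binning, and the toy data are assumptions for illustration.

```python
import numpy as np


def empirical_semivariogram(coords, values, bin_edges):
    """Half the mean squared difference of values, per distance bin.

    Returns one value per bin; bins without any pair of points are NaN.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)          # all unordered pairs
    distances = np.linalg.norm(coords[i] - coords[j], axis=1)
    squared_diff = (values[i] - values[j]) ** 2

    gamma = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(bin_edges) - 1):
        in_bin = (distances >= bin_edges[b]) & (distances < bin_edges[b + 1])
        if in_bin.any():
            gamma[b] = 0.5 * squared_diff[in_bin].mean()
    return gamma


# Toy example: three points on a line, with hypothetical values.
toy_coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
toy_values = [1.0, 2.0, 4.0]
toy_gamma = empirical_semivariogram(toy_coords, toy_values, [0.5, 1.5, 2.5])
print(toy_gamma)
```

A parametric model (for example exponential or spherical) is usually fitted to these binned values so that the semi‐variogram can be evaluated at any distance.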
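Stage 2: Interpolation by Kriging
• Then, a weight function is defined as:
$$\begin{pmatrix} \lambda \\ \mu \end{pmatrix} = \Gamma^{-1} \gamma_0$$
• Where $\Gamma$ is the matrix of semi‐variogram values defined for the points used in the interpolation, $\lambda = (\lambda_1, \dots, \lambda_n)'$ is the vector of weights, and
$$\gamma_0 = \begin{pmatrix} \gamma(x_1 - x_0) \\ \gamma(x_2 - x_0) \\ \vdots \\ \gamma(x_n - x_0) \\ 1 \end{pmatrix}$$
• Where x0 is the point to be interpolated. The 1 in the last position of the vector imposes the constraint that the weights sum to one; its Lagrange multiplier $\mu$ enters the measure of the interpolation error.
In order to evaluate the matching estimator and its interpolation, the estimated values of per capita household income are expected to replicate the pattern of spatial autocorrelation measured by Moran's I in the original income values (Liu et al., 2011).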
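The kriging system described in this stage can be sketched numerically as follows. This is an illustrative ordinary kriging solver, not the authors' implementation: the exponential semi‐variogram and its sill and range parameters are assumptions, as are the function names and the toy data.

```python
import numpy as np


def semivariogram(h, sill=1.0, range_=10.0):
    """Assumed exponential semi-variogram model, with gamma(0) = 0."""
    return sill * (1.0 - np.exp(-np.asarray(h, dtype=float) / range_))


def ordinary_kriging(coords, values, x0):
    """Solve the kriging system for x0; return (estimate, weights, variance)."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)

    # Bordered matrix: semi-variogram between data points, ones on the border.
    pairwise = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    system = np.ones((n + 1, n + 1))
    system[:n, :n] = semivariogram(pairwise)
    system[n, n] = 0.0

    # Right-hand side: semi-variogram between each data point and x0, then 1.
    rhs = np.ones(n + 1)
    rhs[:n] = semivariogram(
        np.linalg.norm(coords - np.asarray(x0, dtype=float), axis=1))

    solution = np.linalg.solve(system, rhs)
    weights, multiplier = solution[:n], solution[n]

    estimate = float(weights @ values)               # weighted sum of observations
    variance = float(weights @ rhs[:n] + multiplier)  # kriging (interpolation) variance
    return estimate, weights, variance


# Toy example: four corners of a unit square, interpolating at the center.
square = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
observed = [1.0, 2.0, 3.0, 4.0]
center_estimate, center_weights, center_variance = ordinary_kriging(
    square, observed, [0.5, 0.5])
print(center_estimate, center_weights, center_variance)
```

Because the weights are constrained to sum to one, the predictor is unbiased under a constant mean, and at an observed location it reproduces the observed value.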
Chilean Data
CASEN 2003: Household survey for the country, with 250 thousand records and more than 60 thousand families.
The spatial units associated with the CASEN 2003 survey are:
Region (13 regions)
Province (52 provinces)
County (345 counties)
There is data for only 301 of the 345 counties.
Census 2002: more than 16 million records in three databases: Houses, Households within each house, and Individuals within each household.
The Census 2002 has one additional division, called:
District (2749 districts).
Chilean Data
• Borders: Regions, Provinces and Counties
• Differences between ELL & this proposal:
• Homogeneity
• Missing data
• Spatial autocorrelation and the borders
• Use of the geographical information of the census
[Figure: Median household income, RM, CASEN 2003. Panels: Small areas RM; County RM. Legend classes: 81.588 - 132.284; 132.284 - 176.319; 176.319 - 226.518; 226.518 - 408.130; 408.130 - 613.445; 613.445 - 1.124.075. ymed=$665.262]
ymed=$127.831
Quality testing
- Benchmarking with ELL
- Benchmarking in the Missing data case
Comparison of income estimated using spatial matching‐Kriging with the ELL method estimate
                   RMSE      BIAS      MAE      NMAE
ELL                184.365   -87.229   87.318   0,2135
Matching-Kriging   121.669   40.203    53.597   0,1528
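For reference, the four benchmark statistics can be computed as in the sketch below. The slides do not define them, so the conventions here are assumptions: errors are taken as predicted minus observed (so a negative bias means underestimation), and NMAE is the MAE divided by the mean observed value. The input numbers are hypothetical.

```python
import numpy as np


def error_metrics(observed, predicted):
    """Return RMSE, BIAS, MAE and NMAE under the conventions assumed above."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = predicted - observed
    mae = float(np.mean(np.abs(error)))
    return {
        "RMSE": float(np.sqrt(np.mean(error ** 2))),
        "BIAS": float(np.mean(error)),
        "MAE": mae,
        "NMAE": mae / float(np.mean(observed)),  # normalization is an assumption
    }


# Hypothetical observed and predicted incomes for two areas.
metrics = error_metrics(observed=[100.0, 200.0], predicted=[110.0, 180.0])
print(metrics)
```

Reporting BIAS next to MAE helps separate a systematic shift from dispersion: when the two are close in absolute value, most of the error comes from a consistent over- or underestimation.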
Example with two counties

Ñuñoa County
Survey Income     408.130
Matching Income   450.838
ELL income        258.146

Ñuñoa District
Plaza Los Guindos         555.666
Pucará                    551.604
Plaza Ñuñoa               530.574
Crescente Errázuriz       521.729
Simón Bolivar             517.740
Hospital de Carabineros   482.430
Chacra Valparaíso         440.553
Plaza Zañartu             421.905
Piscina Mundt             385.860
Santa Julia               350.857
Estadio Nacional          200.302

Macul County
Survey Income     277.529
Matching Income   269.304
ELL income        164.602

Macul District
Villa Santa Carolina    327.918
Macul                   296.058
Pedreros                289.908
Ignacio Carrera Pinto   263.735
Camino Agrícola         231.885
Lo Plaza                206.317
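To make the county comparison easier to read, the figures above can be expressed as percentage deviations from the survey income. The short sketch below does this arithmetic, assuming the dot in the reported figures is a thousands separator (so 408.130 is read as 408130); the helper name is hypothetical.

```python
# County-level incomes as reported in the example, read as integers.
counties = {
    "Ñuñoa": {"survey": 408130, "matching": 450838, "ell": 258146},
    "Macul": {"survey": 277529, "matching": 269304, "ell": 164602},
}


def pct_deviation(estimate, reference):
    """Percentage deviation of an estimate from the reference value."""
    return 100.0 * (estimate - reference) / reference


deviations = {
    name: {
        method: pct_deviation(incomes[method], incomes["survey"])
        for method in ("matching", "ell")
    }
    for name, incomes in counties.items()
}

for name, result in deviations.items():
    print(f"{name}: matching {result['matching']:+.1f}%, ELL {result['ell']:+.1f}%")
```

In both counties the matching estimate lies much closer to the survey income than the ELL estimate, which falls well below it.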
Conclusions
The findings presented above allow us to move closer to empirically validating the questioning of the ELL model.
They also show that the integration of the census and survey databases proposed in our model through Matching and Kriging allows researchers to obtain models that capture local differences and spatial interactions.
The differences between ELL and the proposed methodology are larger when the neighbors are more different from the predicted spatial unit, because ELL ignores this fact while Kriging takes it into consideration.
Thanks!!!