geo-referencing and integrating survey and census data · 4/23/2013 · 6th seminar “jean...

Geo-Referencing and Integrating Survey and Census Data

6th Seminar “Jean Paelinck” in Spatial Econometrics, Autonomous University of Madrid



Motivation and Aim of the paper

There is an increasing need for small-area data; however, missing data are very frequent, especially in developing countries. Census data on housing and households are accessible in most countries, but they cover a limited number of variables. Over the last two decades, a growing number of household surveys have become available for developing countries, many of them promoted by international organizations, so their questionnaires and sampling methodologies are similar. A household survey has many more questions and variables than the census, but its observations cover only a small portion of the population.

The objective of the paper is to propose a methodology to extrapolate the information in the household survey to the census.



ELL Methodology

The methodology of Elbers, Lanjouw & Lanjouw (2003), known as ELL, has been very influential in producing data for small areas, especially for generating poverty maps in more than 40 developing countries.

A large set of papers reports its application to generate data for small areas that are covered by the census but lack the variables available in the household survey of the same area.

Elbers, C., J.O. Lanjouw, P. Lanjouw (2003). Micro‐Level Estimation of Poverty and Inequality. Econometrica, Vol. 71, No. 1, pp. 355‐364



ELL Methodology

Let W be an indicator of poverty or inequality based on the distribution of a household‐level variable of interest, Yh.

Using a smaller and richer data sample, the distribution of Yh (which is not in the larger data set) is estimated conditional on a vector of covariates, Xh, restricting the set of explanatory variables to those that can also be linked to households in the larger sample or census.

The estimated distribution can be used to generate the distribution of Yh for any subpopulation in the larger sample, conditional on the subpopulation's observed characteristics.

This allows the generation of the conditional distribution of W, in particular, its point estimate and prediction error



ELL Methodology

$$Y_h = X_h \beta + \varepsilon_h + u_{ch}$$

The second error term accounts for “cluster correlation”, which we would call spatial correlation within the cluster.

However, the authors assume that this correlation affects only the variance estimation (an efficiency problem, not a bias in the estimator), and they also implicitly assume that the clusters are independently distributed across the territory.

This assumption works in the best case, when spatial autocorrelation affects only the efficiency of the estimator; it does not work if the spatial autocorrelation biases the estimator.
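The ELL steps described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes an OLS fit (ELL use GLS and also redraw the coefficients), it resamples the empirical residuals, and the function and argument names are hypothetical.

```python
import numpy as np

def ell_simulate(X_survey, y_survey, cluster_survey, X_census, cluster_census,
                 n_sims=200, seed=0):
    """Fit y = X b + cluster error + household error on the survey,
    then simulate y for every census record."""
    rng = np.random.default_rng(seed)
    # OLS fit on the survey (simplification: ELL estimate by GLS)
    beta, *_ = np.linalg.lstsq(X_survey, y_survey, rcond=None)
    resid = y_survey - X_survey @ beta
    # Split residuals into a cluster mean and a household remainder
    clusters = np.unique(cluster_survey)
    eta = np.array([resid[cluster_survey == c].mean() for c in clusters])
    eps = resid - eta[np.searchsorted(clusters, cluster_survey)]
    # One cluster draw per census cluster, one household draw per census record
    census_clusters, idx = np.unique(cluster_census, return_inverse=True)
    sims = np.empty((n_sims, X_census.shape[0]))
    for s in range(n_sims):
        eta_draw = rng.choice(eta, size=census_clusters.size)[idx]
        eps_draw = rng.choice(eps, size=X_census.shape[0])
        sims[s] = X_census @ beta + eta_draw + eps_draw
    return sims
```

Averaging an indicator W over the census records of one small area in each simulated row gives its point estimate (mean across simulations) and prediction error (standard deviation across simulations). Note that the cluster draws are independent of each other, which is exactly the assumption questioned in this presentation.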



BUT

Some questions:

How does the ELL method work when there are geographic areas that are not covered by the survey?

How can the ELL method allow for the location of observations on the survey and interactions between them?

There is no information available in regard to the first question, which we will address in this article. Recent applications of small area estimates question conditions of homogeneity and treatments of space that the ELL model implicitly assumes.



Questioning the ELL model

Tarozzi and Deaton (2009) pointed out that the hypothesis of random effects within homoscedastic, independent and identically distributed clusters assumed by the ELL method is a very strong one, because it ignores the integration of areas with their neighbors.

Minot and Baulch (2005) showed significant differences in the spatial distribution of poverty in small areas, depending on their closeness to large urban centers.

Sowunmi et al (2012) and Benson et al (2005) identified spatial patterns of poverty in small areas that are strongly autocorrelated with their neighbors.

Mohamed and Mohamed (2009) question the ELL method in a study of household expenditure in small areas using two spatial models and one non‐spatial model, showing that the same set of variables differs significantly in how it explains a variable across areas, due to spatial heterogeneity and interdependence.



Matching‐Kriging Methodology

The methodology proposed in this paper has two stages:

Stage 1: Use Coarsened Exact Matching (CEM) to find survey observations in the census. This makes it possible to use the geographical information of the census, which is available at a lower level than in the survey.

Stage 2: Interpolate by Kriging to fill in missing data in small areas.

There are also extensions that use not only geographical information but also information related to the variables of interest (co‐kriging).



Stage 1: Spatial Matching for a County and its Districts

The matching estimator technique (Coarsened Exact Matching) consists of pairing sample units that belong to a “treated” (survey) group with units of a “control” (census) group that are similar in terms of their observable characteristics.

After homogenizing certain conditions between the two data collection methods, we can integrate the survey and census data. Then, using a data warehouse and OLAP (online analytical processing) tools, the user can access a joint and integrated view of the two instruments. The integrated database is used to generate models for each subject of the survey.

This pairing transfers the geographic location of census clones to their clones in the survey so that they can be associated with a specific place within the sampled area.




Homogenizing conditions between the two data collections: for example, recoding the question “What is your relationship with the householder?”

Census

Relationship with the householder   Census code   Change code
Chief or head of household          1
Husband-wife/spouse                 2             2
Live together/consort               3             2
Son/daughter                        4             3
Step son/daughter                   5             3
Son/daughter in law                 6
Grandchild                          7
Brother/sister                      8
Brother/sister in law               9
Father/Mother                       10            4
Father/Mother in law                11            5
Other sib                           12            10
Roommate (friends)                  13            11
Maid servant house                  14            12
Roommate (no friends)               15

Survey

Survey code   Relationship with the householder
1             Chief or head of household
2             Husband-wife/consort
3             Son/daughter - step son/daughter
4             Father/Mother
5             Father/Mother in law
6             Son/daughter in law
7             Grandchild
8             Brother/sister
9             Brother/sister in law
10            Other sib
11            Roommate (friends)
12            Maid servant house
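The recoding in the table above can be written as a simple lookup. This is a minimal sketch: the names are hypothetical, census codes with no change code listed (1 and 6 to 9) are assumed to keep their value, and census code 15, which has no survey counterpart in the table, is mapped to None.

```python
# Census relationship codes -> survey codes, copied from the recoding table.
# Assumption: codes without a listed change code (1, 6-9) stay unchanged.
# Assumption: census code 15 has no survey counterpart and maps to None.
CENSUS_TO_SURVEY = {
    1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 6, 7: 7, 8: 8, 9: 9,
    10: 4, 11: 5, 12: 10, 13: 11, 14: 12, 15: None,
}

def harmonize(census_codes):
    """Translate a sequence of census codes into survey codes."""
    return [CENSUS_TO_SURVEY.get(code) for code in census_codes]
```

For instance, census codes 3 (live together/consort) and 5 (step son/daughter) collapse into survey codes 2 and 3, because the survey does not separate those categories.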



This search is conducted using exact pairing in order to identify an observation in the census that has the same characteristics as each observation from the survey with characteristics xi (the set of shared covariates).


Identification

• Area code, Province code, County code, Zone (rural-urban)

• Relationship with the head of household, Gender, Native group descendant, Age, Educational level, Employment status

• Housing Variables: Walls quality, Floor quality, Roof quality, Type of housing, Homeownership, Power Supplier, Water Supplier, Water connection, Sewer System

• Home Appliance Variables: Refrigerator, Telephone, Microwave, Computer, Calefont (water heater), Cell phone, TV Cable

We call this event spatial matching, and it is generated by finding n regressions of variables for n spatial clones. As such, the “pairs or clones” of the survey provide income information, while their pair or clone in the census provides the spatial location within the municipality.
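The spatial matching described above can be illustrated with a small pandas sketch. It is not the authors' implementation: the column names (`age`, `district`), the age bins, and the rule of keeping the first census clone found for each profile are all assumptions made for illustration.

```python
import pandas as pd

def spatial_match(survey, census, keys, age_bins=(0, 15, 30, 45, 60, 120)):
    """Attach a census district to each survey record by exact matching
    on shared covariates, after coarsening age into groups."""
    s = survey.copy()
    c = census.copy()
    # Coarsen the continuous covariate so that exact matching is feasible
    for df in (s, c):
        df["age_group"] = pd.cut(df["age"], bins=list(age_bins),
                                 labels=False, include_lowest=True)
    on = list(keys) + ["age_group"]
    # Keep one census "clone" per covariate profile (assumption: the first)
    clones = c.drop_duplicates(subset=on)[on + ["district"]]
    # Left join: survey records without a clone get a missing district
    return s.merge(clones, on=on, how="left")
```

Survey records that find a clone inherit its district, so the survey income can be placed inside the county; unmatched records keep a missing district and would need a coarser set of covariates.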



Grey represents the census data, while the dark dots are the non-geo‐referenced survey data. We use 29 similar questions in both data sets to match individuals.

Visual Representation of the Spatial Matching Estimator Between Census and Survey Data




Stage 2: Interpolation by Kriging

According to a comprehensive evaluation of spatial interpolation models (Liu et al, 2011), Kriging is one of the most reliable methods for data with significant spatial autocorrelation.

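Then the interpolation of the variable, or of a function of the variable, at an unknown point ($x_0$) in space will be:

$$\hat{Z}(x_0) = \lambda_1 Z(x_1) + \lambda_2 Z(x_2) + \dots + \lambda_n Z(x_n)$$

or, in general terms,

$$\hat{Z}(x_0) = \sum_{i=1}^{n} \lambda_i Z(x_i)$$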



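Stage 2: Interpolation by Kriging

• Let x be a geo‐referenced variable of interest and h a distance measure on the space; then define the semi‐variogram:

$$\gamma(h) = \frac{1}{2} E\left[ Z(x_i) - Z(x_i + h) \right]^2$$

• Then, a matrix is defined for all the x used for the interpolation as:

$$A = \begin{pmatrix}
\gamma(0) & \gamma(x_1 - x_2) & \cdots & \gamma(x_1 - x_n) & 1 \\
\gamma(x_2 - x_1) & \gamma(0) & \cdots & \gamma(x_2 - x_n) & 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\gamma(x_n - x_1) & \gamma(x_n - x_2) & \cdots & \gamma(0) & 1 \\
1 & 1 & \cdots & 1 & 0
\end{pmatrix}$$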
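As an illustration of the semi‐variogram definition above, an empirical version can be computed by averaging half the squared differences over pairs of points whose distance falls in each bin. This is a minimal sketch; the function name and the binning scheme are assumptions, and in practice a parametric model is then fitted to these binned values.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Average 0.5 * (Z(x_i) - Z(x_j))^2 over point pairs, by distance bin."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Pairwise distances and half squared differences
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sq = 0.5 * (values[:, None] - values[None, :]) ** 2
    # Use each unordered pair once
    iu = np.triu_indices(len(values), k=1)
    d, sq = d[iu], sq[iu]
    gamma = np.full(len(bin_edges) - 1, np.nan)
    for k in range(len(bin_edges) - 1):
        mask = (d >= bin_edges[k]) & (d < bin_edges[k + 1])
        if mask.any():
            gamma[k] = sq[mask].mean()
    return gamma
```

Bins that contain no pairs are left as NaN, so they can be dropped before fitting a model.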



Stage 2: Interpolation by Kriging

• Then, a weight function is defined as:

$$\lambda = A^{-1} b$$

• Where A is the semi‐variogram matrix defined above for the interpolation points and

$$b = \begin{pmatrix} \gamma(x_1 - x_0) \\ \gamma(x_2 - x_0) \\ \vdots \\ \gamma(x_n - x_0) \\ 1 \end{pmatrix}$$

• Where $x_0$ is the point to be interpolated, and the 1 is included in the vector in order to measure the interpolation error.

In order to evaluate the matching estimator and its interpolation, the estimated values of per capita household income are expected to replicate the pattern of spatial autocorrelation measured by Moran's I in the original income values (Liu et al, 2011).
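The kriging steps above (semi‐variogram matrix, right-hand-side vector, weights, weighted sum) can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the linear semi‐variogram model is chosen only to keep the example short, and all function names are hypothetical.

```python
import numpy as np

def linear_variogram(h, slope=1.0):
    # Assumed model for illustration; in practice it is fitted to the data
    return slope * np.asarray(h, dtype=float)

def ordinary_kriging(coords, values, x0, variogram=linear_variogram):
    """Predict Z(x0) as a weighted sum of the observed values."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Pairwise distances between observed points
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Kriging matrix: semi-variogram values bordered by ones, zero corner
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(d)
    A[n, n] = 0.0
    # Right-hand side: semi-variogram between each point and x0, then a 1
    b = np.ones(n + 1)
    b[:n] = variogram(np.linalg.norm(coords - np.asarray(x0, dtype=float), axis=1))
    sol = np.linalg.solve(A, b)
    weights, mu = sol[:n], sol[n]        # mu is the Lagrange multiplier
    prediction = weights @ values
    variance = weights @ b[:n] + mu      # kriging (prediction error) variance
    return prediction, variance, weights
```

The last row of the system forces the weights to sum to one, and the extra unknown it introduces (the Lagrange multiplier) enters the prediction error variance. At an observed location the predictor returns the observed value with zero variance.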



Chilean Data

CASEN 2003: household survey for the country, with 250 thousand records and more than 60 thousand families.

The spatial units associated with the CASEN 2003 survey are:

• Region (13 regions)

• Province (52 provinces)

• County (345 counties)

There are data for only 301 of the 345 counties.

Census 2002: more than 16 million records in three databases: houses, households within each house, and individuals within each household.

The Census 2002 has one additional division, the District (2749 districts).



Chilean Data

• Borders: Regions, Provinces and Counties

• Differences between ELL & this proposal:

• Homogeneity

• Missing data

• Spatial autocorrelation and the borders

• Use of the geographical information of the census



Figure: Median household income, Metropolitan Region (RM), CASEN 2003. Panels: small areas (districts) of the RM and counties (comunas) of the RM. Income classes in the legend: 81.588 - 132.284; 132.284 - 176.319; 176.319 - 226.518; 226.518 - 408.130; 408.130 - 613.445; 613.445 - 1.124.075. Values labeled on the map: ymed=$665.262 and ymed=$127.831.



Quality testing

• Benchmarking with ELL

• Benchmarking in the missing data case

Comparison of income estimated using spatial Matching‐Kriging with the ELL method estimate

Method             RMSE      BIAS      MAE      NMAE
ELL                184.365   -87.229   87.318   0,2135
Matching-Kriging   121.669    40.203   53.597   0,1528
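The four error measures in the table can be computed as in the sketch below. The exact definitions used for the table are not given in the slides, so two assumptions are made here: bias is taken as estimated minus observed, and NMAE is taken as MAE divided by the mean observed value.

```python
import numpy as np

def error_metrics(estimated, observed):
    """RMSE, bias, MAE and NMAE of estimated against observed values."""
    estimated = np.asarray(estimated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    err = estimated - observed  # assumption: positive means overestimation
    mae = float(np.mean(np.abs(err)))
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "BIAS": float(np.mean(err)),
        "MAE": mae,
        # Assumption: normalization by the mean of the observed values
        "NMAE": mae / float(np.mean(observed)),
    }
```

Under this sign convention a negative bias, as reported for ELL, means the method underestimates income on average.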



Example with two counties

Ñuñoa County

Survey Income     408.130
Matching Income   450.838
ELL income        258.146

Ñuñoa District

Plaza Los Guindos         555.666
Pucará                    551.604
Plaza Ñuñoa               530.574
Crescente Errázuriz       521.729
Simón Bolivar             517.740
Hospital de Carabineros   482.430
Chacra Valparaíso         440.553
Plaza Zañartu             421.905
Piscina Mundt             385.860
Santa Julia               350.857
Estadio Nacional          200.302

Macul County

Survey Income     277.529
Matching Income   269.304
ELL income        164.602

Macul District

Villa Santa Carolina    327.918
Macul                   296.058
Pedreros                289.908
Ignacio Carrera Pinto   263.735
Camino Agrícola         231.885
Lo Plaza                206.317



Conclusions

The findings presented above allow us to move closer to empirically validating the questioning of the ELL model.

They also show that the integration of the census and survey databases proposed in our model through Matching and Kriging allows researchers to obtain models that capture local differences and spatial interactions.

The differences between ELL and the proposed methodology are larger when the neighbors differ more from the predicted spatial unit, because ELL ignores this fact while Kriging takes it into account.

Thanks!!!
