Missingness in Action: Selectivity Bias in GPS‐Based Land Area Measurements
Talip Kilic †, Alberto Zezza †, Calogero Carletto †, Sara Savastano ‡
† The World Bank, Development Research Group ‡ University of Rome, Tor Vergata
AIEAA Conference 2013: "Between Crisis and Development: Which Role for the Bio‐Economy"
Parma, 6‐7 June 2013
Why GPS land area measurement?
[Figure: Plot Size Measured with GPS and Farmers' Estimate. Horizontal axis: Acres (0 to 4); vertical axes: % (Area Self‐Reported) and % (Area GPS); series: GPS, Farmers' Estimate.]
Background & Motivation
• Land areas: Fundamental component of agricultural statistics
• Traversing assumed to be the gold‐standard in land area measurement, but neither time‐ nor cost‐effective
• Increasing use of GPS technology in measuring land areas
However...
• Collecting GPS‐based land areas not always feasible – field work protocols, lack of physical access, refusals
• Substantial presence of missing values (up to 30 percent or more): Empirical implications unclear
Background & Motivation Cont’d
• Literature documents the value of GPS‐based land measurements but without discussion on missing data issues
– Udry & Goldstein (1999)
– De Groote & Traore (2005)
– Keita & Carfagna (2009)
– Dorward & Chirwa (2010)
– Carletto et al. (Forthcoming)
Objectives
• Document correlates of missing data in GPS‐Based land area measurements in Uganda National Panel Survey 2009/10 & Tanzania National Panel Survey 2010/11
• Implement Multiple Imputation to predict missing GPS‐based land area measurements in each survey
• Investigate the empirical implications of incomplete vs. multiply‐imputed complete data in the context of inverse scale‐land productivity relationship
Data
• Uganda National Panel Survey 2009/10
– Implemented by Uganda Bureau of Statistics
– 1st Wave of a panel household survey program
– Target: 3,123 HHs from the UNHS 2005/06
• Tanzania National Panel Survey 2010/11
– Implemented by National Bureau of Statistics
– 2nd Wave of a panel household survey program
– 1st Wave (2008/09) interviewed 3,200 HHs
• Samples representative at the national, urban/rural, regional levels
• Integrated questionnaire design, with a strong focus on agriculture
– Multi‐Topic Household Questionnaire
– Agriculture Questionnaire (Farming Households)
• Allows for plot‐level analyses of productivity
• Geo‐referenced plot locations & GPS‐Based measurement of plot areas
Field Insights on GPS‐Based Measurement of Plot Areas
• Both surveys rely on mobile survey teams that spend on average 3‐4 days in a given enumeration area, net of their tracking assignments
• Teams allowed to exclude “distant” plots from GPS‐based measurement
– Radius for GPS‐based measurement: EA in Uganda, 1hr travel in Tanzania
– Reasons for the protocol: Transportation costs, field workload (panel surveys), interview durations & frequency
• Missing GPS‐based plot area information driven by refusals or physical inaccessibility represents a very limited share (less than 5%)
Selected Plot‐Level Descriptives – UNPS 2009/10
                                     Entire Sample   W/ GPS        W/o GPS
Observations                         4,333           2,814 (65%)   1,519 (35%)
GPS‐Based Plot Area (Acres)          2.13            2.13          ‐‐
Farmer‐Reported Plot Area (Acres)    2.05            2.00          2.12
Less Than 15 Mins Away from HH †     0.62            0.80          0.31 ***
15‐30 Mins Away from HH †            0.17            0.14          0.21 ***
More Than 30 Mins Away from HH †     0.22            0.06          0.48 ***
Rented/Other †                       0.26            0.14          0.46 ***
Hilly, Steep or Valley †             0.20            0.17          0.25 ***
# of Plots in Holding                3.31            3.17          3.54 ***
Mover Original HH †                  0.04            0.01          0.09 ***
Split‐Off HH †                       0.13            0.06          0.25 ***
Wealth Index 2005/06                 ‐0.66           ‐0.77         ‐0.47 ***
Note: Results from tests of mean differences (W/ GPS vs. W/o GPS) reported. *** p<0.01, ** p<0.05, * p<0.1. Statistics weighted through the use of household sampling weights. † denotes a dummy variable.
Selected Plot‐Level Descriptives – TZNPS 2010/11
                                     Entire Sample   W/ GPS        W/o GPS
Observations                         4,142           3,383 (82%)   759 (18%)
GPS‐Based Plot Area (Acres)          2.59            2.59          ‐‐
Farmer‐Reported Plot Area (Acres)    2.31            2.30          2.35
Distance to Home (KM)                3.74            1.95          13.92 ***
Distance to Road (KM)                2.18            1.62          5.39 ***
Rented/Other †                       0.12            0.09          0.25 ***
# of Plots in Holding                3.09            3.08          3.15 ***
Mover Original HH †                  0.06            0.05          0.09 ***
Split‐Off HH †                       0.09            0.08          0.15 ***
Wealth Index 2008/09                 ‐1.06           ‐1.09         ‐0.88 ***
Note: Results from tests of mean differences (W/ GPS vs. W/o GPS) reported. *** p<0.01, ** p<0.05, * p<0.1. Statistics weighted through the use of household sampling weights. † denotes a dummy variable.
Multiple Imputation (MI): Background
• MI originally proposed to handle missing data in public use files from censuses & sample household surveys (Rubin, 1977)
• Uses the distribution of observed data to estimate plausible values for missing data, incorporating random, imputation‐related components to reflect uncertainty (Rubin, 1987)
• Superior to casewise deletion & conditional mean imputation, which are known to understate true variance (Schafer & Graham, 2002)
• Key assumption: Missing At Random (MAR) conditional on observables; plausibility depends on the nature & sources of missing data
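The variance point above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the plot areas are simulated (not survey data), missingness is generated completely at random, and the simpler unconditional mean imputation stands in for conditional mean imputation. The mechanism is the same in both cases: every imputed value sits exactly on the predicted mean, so no residual noise is added. The helper name `mean_impute` is hypothetical.

```python
import random
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


random.seed(0)
# Simulated plot areas in acres (illustrative only, not survey data).
areas = [random.lognormvariate(0.5, 0.6) for _ in range(500)]
# Drop roughly 30% of the values at random.
with_gaps = [None if random.random() < 0.3 else a for a in areas]

observed = [v for v in with_gaps if v is not None]
completed = mean_impute(with_gaps)

var_observed = statistics.variance(observed)
var_completed = statistics.variance(completed)
print(f"variance, observed cases only: {var_observed:.3f}")
print(f"variance, after mean imputation: {var_completed:.3f}")
```

The completed series always shows the smaller variance, because the imputed points add observations without adding any spread. MI counters this by drawing imputed values with a random component and repeating the draw m times.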
MI: Background Cont’d
• Empirical applications, mostly in developed countries, with rate of missing information at 20‐40%:
– Schenker et al. (2006) – Income data from the NHIS (US)
– Moss & Mishra (2010) – Agricultural input data from the FUPD (US)
– Zarnoch et al. (2010) – Income data from the NSRE (US)
– Giusti & Little (2011) – Income data from the LFS (Tuscany, Italy)
– Vermaak (2012) – Earnings data from the LFS (South Africa)
Steps in MI
Steps in MI Cont’d
Step 3
• Estimate the final multivariate regression of interest using each of the m multiply‐imputed, complete databases & store regression estimates
Step 4
• Combine estimates from each of m databases, according to Rubin (1987)
• Overall parameter estimate: Average across parameters from m databases
• Overall variance takes into account:
– Within‐imputation variability: Uncertainty from results from one dataset
– Between‐imputation variability: Uncertainty due to missing information
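The combination step in Step 4 can be sketched as follows for a single coefficient. The function name `pool_rubin` and the example numbers are hypothetical; the formula is the standard one from Rubin (1987), in which total variance equals the within‐imputation variance plus (1 + 1/m) times the between‐imputation variance.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets.

    estimates: the coefficient estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)       # average across the m datasets
    within = statistics.mean(variances)       # within-imputation variability
    between = statistics.variance(estimates)  # between-imputation variability (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return pooled, total, math.sqrt(total)


# Hypothetical coefficients and squared standard errors from m = 3 imputed datasets.
est, var, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(est, var, se)
```

The between‐imputation term is what casewise deletion and single imputation leave out: it grows when the imputed datasets disagree, which is exactly when the missing information matters most.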
Our Approach:
• 50 imputations of GPS‐based plot area, using PMM with 5 neighbors
• Robustness checks: # of imputations (m), # of neighbors, bootstrapping, PMM vs. OLS
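A minimal sketch of PMM (predictive mean matching) with 5 neighbors is given below. It is a simplification, not the authors' implementation: it uses a single predictor (farmer‐reported area) instead of the full imputation model, and it skips the step in which regression parameters are redrawn for each imputation. The function names and the toy data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(farmer, gps, k=5, rng=random):
    """Return one completed copy of gps, filling None entries by PMM."""
    obs = [i for i, g in enumerate(gps) if g is not None]
    a, b = fit_line([farmer[i] for i in obs], [gps[i] for i in obs])
    pred = [a + b * f for f in farmer]  # predicted GPS area for every plot
    completed = list(gps)
    for i, g in enumerate(gps):
        if g is None:
            # k observed plots whose predicted value is closest to this plot's
            donors = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:k]
            # borrow the actually observed GPS area of a randomly chosen donor
            completed[i] = gps[rng.choice(donors)]
    return completed


# Toy data in acres (illustrative only); None marks a missing GPS measurement.
farmer = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 1.2, 2.8]
gps = [0.4, 0.9, 1.6, 1.8, None, 2.9, 3.2, None, 1.1, 2.6]
rng = random.Random(42)
imputations = [pmm_impute(farmer, gps, k=5, rng=rng) for _ in range(50)]
```

Because each imputed value is an observed GPS area borrowed from a similar plot, PMM never produces implausible values such as negative areas, which a plain OLS prediction with added noise could.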
Selected OLS Regression Results Underlying Multiple Imputation
Dependent Variable: GPS‐Based Plot Area (Acres)
                                        UNPS 2009/10   TZNPS 2010/11
Farmer‐Reported Plot Area (Acres)       0.945***       0.866***
Log Value of Plot Output                0.023          0.056***
Log Value of Plot Input                 0.027**        0.032***
# of Plots in Holding                   ‐0.141***      ‐0.094**
District & Enumerator Fixed Effects     YES            YES
Observations                            2,814          3,363
R2                                      0.658          0.688
MI Results
MI Results Cont’d
Key Descriptive Statistics Following Multiple Imputation
                                                         UNPS 2009/10                 TZNPS 2010/11
                                                         Obs     Mean      Std Err    Obs     Mean     Std Err
Observed GPS‐Based Plot Area                             2,814   2.130     0.132      3,383   2.588    0.148
Value of Output Per Unit Observed GPS‐Based Plot Area    2,814   484,861   27,119     3,383   94,000   6,461
MI GPS‐Based Plot Area                                   4,333   2.124     0.125      4,141   2.564    0.147
Value of Output Per Unit MI GPS‐Based Plot Area          4,333   553,951   55,010     4,141   92,142   10,338
Empirical Application
• Investigate the implications of using incomplete vs. multiply‐imputed data in exploring the inverse scale‐land productivity relationship (IR)
• Estimate a plot‐level production function similar to the models estimated by Barrett et al. (2010) & Carletto et al. (Forthcoming), where:
– i, h denote plot & household, respectively
– A & Y: Plot area (acres) & value of output, respectively
– Vectors P & H: Plot‐ & household‐level attributes, respectively
– Vectors D & E: District & enumerator fixed effects, respectively
• Estimate separately with (i) the plot sample with observed GPS areas, & (ii) the complete plot sample with multiply‐imputed GPS areas
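The estimating equation itself is not preserved in this transcript. One specification consistent with the symbols defined on this slide, and with the dependent variable (log value of plot output per acre) and regressor (log plot area) reported in the results, would be the following. The coefficient symbols and the error term are assumptions introduced here for illustration:

```latex
\ln\!\left(\frac{Y_{ih}}{A_{ih}}\right)
  = \alpha + \beta \,\ln A_{ih}
  + P_{ih}'\gamma + H_{h}'\delta
  + D_{ih}'\theta + E_{ih}'\lambda
  + \varepsilon_{ih}
```

Under this form, an inverse relationship corresponds to \beta < 0: larger plots are associated with a lower value of output per acre, holding the plot, household, district, and enumerator controls fixed.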
Results
Selected OLS Regression Results
Dependent Variable: Log Value of Plot Output/Acre
                            UNPS 2009/10                     TZNPS 2010/11
                            (1)          (2)                 (3)          (4)
GPS‐Based Parcel Area       Observed     Multiply Imputed    Observed     Multiply Imputed
Log Plot Area (Acres)       ‐0.388***    ‐0.515***           ‐0.448***    ‐0.487***
Observations                2,814        4,333               3,383        4,121
Note: *** p<0.01, ** p<0.05, * p<0.1. Complex survey regressions underlie the combined MI estimates reported here.
• Stronger IR under MI – Robust to using District vs. EA vs. HH Fixed Effects, or to using log per‐acre value of output net of value of input as the dependent variable.
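To read the size of these coefficients, note that in a log‐log specification, scaling plot area by a factor c scales predicted per‐acre output by c raised to the coefficient, all else held fixed. The sketch below applies this to the two UNPS coefficients reported above; the function name is hypothetical and the doubling of plot area is chosen purely for illustration.

```python
def per_acre_output_ratio(elasticity, area_multiplier):
    """Ratio of predicted per-acre output after scaling plot area, all else fixed."""
    return area_multiplier ** elasticity


# UNPS 2009/10 coefficients on log plot area, taken from the results table.
observed = per_acre_output_ratio(-0.388, 2.0)  # observed GPS sample
imputed = per_acre_output_ratio(-0.515, 2.0)   # multiply-imputed sample
print(f"doubling plot area, observed sample: x{observed:.2f} per-acre output")
print(f"doubling plot area, MI sample:       x{imputed:.2f} per-acre output")
```

A ratio further below 1 means a steeper fall in per‐acre output as plots get larger, which is the sense in which the IR is stronger under MI.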
Conclusions
• GPS‐based plot areas in large household surveys suffer from non‐random missingness
• MI offers a promising way to reliably simulate missing information
• A full dataset can be obtained with reasonable quality: Farmer‐reported plot area information is a key input into the imputation model
• Imputing missing GPS‐based plot areas has clear implications for policy‐relevant productivity analysis