September 2015

Spatial Disaggregation and Small-Area Estimation

Methods for Agricultural Surveys:

Solutions and Perspectives

Technical Report Series GO-07-2015


Table of Contents

Preface
Acknowledgements
Acronyms and Abbreviations
General Introduction

Review of Spatial Disaggregation and SAE Methods

1. Spatial Disaggregation: Mapping Techniques
1.1 Introduction
1.2 Areal interpolation methods
1.2.1 Simple area weighting method
1.2.2 Pycnophylactic interpolation methods
1.2.3 Dasymetric mapping
1.2.4 Examples
1.3 Spatial disaggregation with interpolation based on Regression models
1.3.1 Regression models
1.3.2 The EM algorithm
1.3.3 Examples

2. Small-Area Estimators
2.1 Introduction
2.2 A classification of SAE models
2.3 Model-assisted estimators
2.3.1 GREG estimator
2.3.2 Example of calculation of GREG estimator
2.4 Model-Based estimators: area-level
2.4.1 FH-EBLUP
2.4.2 Example of calculation of the FH-EBLUP estimator
2.4.3 SEBLUP
2.4.4 Example of calculation of FH-SEBLUP estimator
2.4.5 Applications to agricultural data
2.4.6 Final remarks on the FH-EBLUP and FH-SEBLUP
2.5 Model-based estimators: unit-level
2.5.1 EBLUP
2.5.2 SEBLUP
2.5.3 The MQ estimator
2.5.4 The MQGWR estimator
2.5.5 Example of calculation of EBLUP, MQ and MQGWR estimators
2.5.6 Application to agricultural data
2.5.7 Final remarks on the EBLUP, SEBLUP, MQ and MQGWR
2.6 Extensions of the previous small-area models
2.6.1 Semi-parametric Fay and Herriot model
2.6.2 NPEBLUP specified at the unit level
2.6.3 Non-parametric MQ specified at unit level
2.6.4 GWEBLUP
2.6.5 MBDE and SMBDE
2.6.6 A note on Bayesian SAE methods
2.6.7 SAE for binary and count data
2.7 Geostatistical methods
2.7.1 Geoadditive models
2.7.2 Kriging
2.7.3 GWR

References (Part I)

Resilience of SAE Methods to Non-Standard Situations

Introduction

3. Sensitivity of SAE Predictors to Spatial Model Specifications
3.1 Introduction
3.2 Model-based simulation experiment
3.3 Design-based simulation experiment
3.4 Remarks and findings

4. The Modifiable Area Unit Problem
4.1 Introduction
4.2 An evaluation of the impact of the scale effect on SAE predictors and interpolation methods
4.3 Remarks and findings

5. The Robustness of SAE Predictors
5.1 Introduction
5.2 Small-area robust estimators
5.2.1 MQ estimators
5.2.2 Robust EBLUP
5.3 Assessment of the robustness of the EBLUP, MQ and robust EBLUP
5.4 The RSEBLUP: robust SAE using geo-referenced information in the mixed-model approach
5.4.1 Evaluating the Spatial REBLUP estimator using simulation studies
5.5 Remarks and findings

6. The Complexity of Sample Design
6.1 Introduction
6.2 Design-consistent small-area estimators
6.2.1 Expansion estimator
6.2.2 Modified GREG estimators
6.2.3 The Pseudo-EBLUP
6.2.4 Weighted MQ estimators
6.3 Simulation study of the impact of ignorable and non-ignorable designs
6.3.1 Description of the simulation experiment
6.3.2 Simulation results
6.4 Investigating the impact of sampling designs on data interpolation
6.4.1 A short introduction about the design effect on data interpolation
6.4.2 A simulation experiment to assess the impact of the design effect on spatial interpolation
6.5 Remarks and findings

7. Missing Data in Spatial Datasets
7.1 Introduction
7.2 Missing values in datasets: general concepts and solutions
7.2.1 Multiple imputation
7.2.2 Missing values in spatial data as measurement error
7.3 Missing data in spatial analysis
7.3.1 Missing spatial information
7.3.2 Missing values in auxiliary and target variables
7.3.3 Missing information in methods of data integration
7.4 Remarks and findings

8. Analysis of Zero-Inflated Data in SAE
8.1 Introduction
8.2 Bayesian small-area estimator for zero-inflated data
8.3 Frequentist SAE for zero-inflated data
8.4 Empirical evaluation for the frequentist approach
8.5 Remarks and findings

9. Final Remarks and Recommendations

References (Part II)

General Summary

Preface

This Technical Report on Spatial Disaggregation and Small-Area Estimation Methods for Agricultural Surveys: Solutions and Perspectives was prepared within the framework of the Global Strategy to Improve Agricultural and Rural Statistics. The Global Strategy is an initiative endorsed in 2010 by the United Nations Statistical Commission to provide a framework and a blueprint to meet current and emerging data requirements and the needs of policymakers and other data users. Its goal is to contribute to greater food security, reduced food price volatility, higher incomes and greater well-being for rural populations, through evidence-based policies. The Global Strategy is centred upon three pillars: (1) establishing a minimum set of core data; (2) integrating agriculture into National Statistical Systems (NSSs); and (3) fostering the sustainability of the statistical system through governance and statistical capacity building.

The Action Plan to Implement the Global Strategy includes an important research programme, to address methodological issues for improving the quality of agricultural and rural statistics. The outcome of the research programme is to produce scientifically sound and cost-effective methods that will be used as inputs to prepare practical guidelines for use by country statisticians, training institutions, consultants, etc.

To enable countries and partners to benefit at an early stage from research activity results that are already available, it has been decided to establish a Technical Reports Series, to widely disseminate available technical reports and advanced draft guidelines and handbooks. This will also provide an opportunity for countries to give feedback on the papers.

Technical reports and draft guidelines and handbooks published in this Technical Report Series have been prepared by senior consultants and experts and reviewed by the Scientific Advisory Committee (SAC)1 of the Global Strategy, the Research Coordinator at the Global Office and other independent senior experts. For some of the research topics, field tests will be organized before final results are included in guidelines and handbooks.

The aim of this report on Spatial Disaggregation and Small-Area Estimation Methods for Agricultural Surveys: Solutions and Perspectives is to enhance disaggregation methods for adaptation to various agricultural situations and datasets.

Part 1 reviews the literature on this subject under two topics: i) mapping techniques and ii) small-area estimators.

With regard to mapping techniques, the main areal interpolation methods, including those based on regression techniques, are presented. SAE methods are classified as: i) model-assisted methods, for example the generalized regression estimator; and ii) model-based methods, which are considered in unit-level and area-level specifications, such as the empirical best linear unbiased predictor, the M-quantile estimator and the Fay and Herriot estimator, with spatial specifications where available. Assumptions are explained and the information needed for each method is given, with illustrations from applications to rural and agricultural statistics or to socio-economic statistics.

Part 2 examines the reliability of the methods in non-standard situations that commonly arise in agricultural surveys. The main topics are sensitivity to spatial model specification, the modifiable area unit problem, robustness of predictors, complexity of sample design, missing data in spatial datasets and excess of zeros in survey data. This part analyses the methods presented in Part 1, presents the main contributions to the topics and proposes methodological and operational solutions.

1 The SAC is composed of ten well-known senior experts in various fields relevant to the Research Programme of the Global Strategy. They are selected for a two-year term. The membership at the time of preparation of this report was composed of: Fred Vogel, Sarah Nusser, Ben Kiregyera, Seghir Bouzaffour, Miguel Galmes, Cristiano Ferraz, Ray Chambers, Vijay Bhatia, Jacques Delincé and Anders Walgreen.

Part 3 summarizes the issues in the review of mapping techniques and small-area estimators. It also offers remarks and recommendations based on the analysis of the reliability of the methods and draft guidelines for applying them in field tests.

Acknowledgements

This paper was prepared by Monica Pratesi (Professor) of the University of Pisa, with the assistance of Alessandra Petrucci (Professor) of the University of Florence and Nicola Salvati (Professor) of the University of Pisa, and supported by Caterina Giusti (PhD, researcher) and Stefano Marchetti (PhD, researcher) of the University of Pisa, with the guidance and supervision of Elisabetta Carfagna and Naman Keita of the Global Office of the Global Strategy to Improve Agricultural and Rural Statistics (FAO).

The report was reviewed by the Scientific Advisory Committee of the Global Strategy, who provided comments and inputs.

Valuable inputs and comments were provided at various stages by Luigi Biggeri (Emeritus Professor) of the University of Florence and by Loredana di Consiglio of the Italian Istituto Nazionale di Statistica.

This publication was prepared with support from the Trust Fund of the Global Strategy funded by the United Kingdom Department for International Development and the Bill & Melinda Gates Foundation.

Acronyms and Abbreviations

ANC Acid-Neutralizing Capacity
AvRBias Average Relative Bias
BARE Broad Area Ratio Estimator
CAR Conditional Auto-Regressive
CV Cross-Validation
DFID Department For International Development
EBLUP Empirical Best Linear Unbiased Predictor
EM Expectation-Maximization (algorithm)
EMAP Environmental Monitoring and Assessment Program
FAO Food and Agriculture Organization of the United Nations
FH-EBLUP Fay and Herriot Empirical Best Linear Unbiased Predictor
GIS Geographic Information System
GLM Generalized Linear Mixed (model)
GREG Generalized Regression (estimator)
GREG-LV GREG under Lehtonen-Veijanen specification
GREG-S GREG with Sample weights
GWEBLUP Geographically Weighted Empirical Best Linear Unbiased Predictor
GWR Geographically Weighted Regression
GWR-W Geographically Weighted Regression with Sample Weights
HT Horvitz and Thompson (expansion estimator)
HUC Hydrologic Unit Code
ISTAT Istituto Nazionale di Statistica (Italian National Statistics Institute)
LMM Linear Mixed Model
MAE Model-Assisted Estimator
MAR Missing At Random
MAUP Modifiable Areal Unit Problem
MBDE Model-Based Direct Estimator
MBE Model-Based Estimator
MCAR Missing Completely At Random
MCMC Markov Chain Monte Carlo
MI Multiple Imputation
ML Maximum Likelihood
MNAR Missing Not At Random
MPML Multi-level Pseudo Maximum Likelihood
MQ M-Quantile
MQ-GC M-Quantile – Geographic Coordinates
MQGWR Model Quantile Geographically Weighted Regression (estimator)
MQGWR-CD Model Quantile Geographically Weighted Regression
MQ-WR M-Quantile – Welsh-Ronchetti
MSE Mean Squared Error
NPEBLUP Non-Parametric Empirical Best Linear Unbiased Predictor
P-spline Polynomial spline
RB Relative Bias
REML REstricted Maximum Likelihood
RRMSE Relative Root Mean Squared Error
SAE Small-Area Estimation
SAC Scientific Advisory Committee (of the Global Strategy)
SAR Simultaneously Auto-Regressive
SEBLUP Spatial Empirical Best Linear Unbiased Predictor
SMBDE Spatial Model-Based Direct Estimator
WMQ Weighted M-Quantile
ZIEBLUP Zero-Inflated data Empirical Best Linear Unbiased Predictor

General Introduction

This report presents methods for estimating agricultural and rural statistics at the local level – small-area estimation (SAE).

The local level is the geographical level at which data are requested with a view to planning sub-regional policies or evaluating the results of policy. For this purpose a region can be split into subsets or domains of study. The “local” and “regional” levels will vary from country to country: administrative areas include municipalities and census divisions and localities such as the tehsil in India and the woreda in Ethiopia; the area level may also depend on the method applied. The domains may refer to a demographic group or a geographical area, or both.

In agro-environmental studies, such administrative areas are the traditional spatial area for which statistical data are available. Numbers 11, 12 and 14 of the Statistical Development Series of the Food and Agriculture Organization of the United Nations (FAO) review the data available in each country after the 2000 agricultural census.2

Satellite imagery and geographic information systems (GIS) facilitate surveys of the dynamics of change in land use and cover at any required level of spatial resolution; the LUCAS project in Europe3 is an example. Some of this work focuses on yield forecasting and estimation only. The reality is that the area harvested is needed to determine total production: even if a model provides information about crop yields, it may say nothing about the area being harvested. And, unfortunately, the data needed to produce detailed information and maps for various phenomena are often unavailable.

FAO publications show that the main sources of detailed local-level agricultural and rural data are agricultural censuses, sample surveys and administrative registers. But the censuses are conducted only periodically, and the FAO publications on the 2000 agricultural census reveal that national surveys of agricultural and other statistics differ widely and may be constrained by cost and other considerations.

Not every country has a current survey of agriculture, and some – often developing countries – conduct agricultural censuses on a sample basis.4 And where a country has statistical data from which to compute indicators of land use, crop production, livestock, farm structure, incomes and living conditions, their quality at the local level is not homogeneous.

Two situations stem from the analysis of the main sources of agricultural and rural data:

i. Data on the target variable at the local level can be obtained from a survey such as an agricultural census or a sample or other survey.5 This will not be as exhaustive as the Cross-Cutting Experiment in India, for example, where the sample size and number of observations in the domain of interest is large enough to provide statistically sound estimates, and where information at the local level is available.

2 No. 11 – A system of integrated agricultural censuses and surveys; no. 12 – 2000 World Census of Agriculture; no. 14 – 2000 World Census of Agriculture 1996–2005, methodological review.

3 Since 2006 the EUROSTAT LUCAS survey has observed changes in land use and land cover in the European Union every three years. The latest survey in 2012 covered all 27 countries with observations at 270,000 locations.

4 Follow-up surveys can be register-based as in Denmark and Kuwait, or census-based as in India. See SDS 12, p. ix: “The Denmark agriculture census is linked to the registers of Integrated Administration and Control System (IACS) which contain information relating to area under major crops for all the farms applying for crop subsidies. In India, the administrative functions of maintaining land ownership records and doing seasonal crop enumeration are vested in a single office at village level. The services of this office are utilized to carry out an agriculture census (limited to crops) once every five year by re-tabulating the land ownership registers to obtain a list of agricultural holders which provides the frame for the agriculture census and a follow-up survey”.

5 In many administrative data sources such as registers of farmers, location information may be part of the original microdata set or its metadata.

ii. Local-level data on the target variable are not collected or are collected from a small number of sources in a sample survey. Such surveys can provide local-level data, but to be truly representative they must come from large samples, which increases the costs significantly. Data collection for countries in special situations – developing countries are an example – requires a good deal of work because the data sources mentioned above are frequently not available.

When the survey data provide few observations and cost constraints prevent additional surveys or additional sampling of the study area, existing information must be integrated and harmonized to produce credible statistics on the dynamics of change at the local level. In this case, an estimate of the target variable for local domains can be obtained from reliable data relating to a larger domain that includes the domains in question. In short, the available aggregate data for broad areas must be disaggregated to the local level for small areas.

A more reliable estimate of the target variable at the local level can be obtained when auxiliary variables are known for larger areas or domains and for the small areas of interest for sampled and non-sampled units in the domains.

Spatial auxiliary information is crucial in many applications of the estimation method for local areas, because it can increase the efficiency and effectiveness of the estimations. Spatial auxiliary information can be derived from administrative archives and maps of the territory under study, and geographical information systems can provide spatial data relating to coverage, perimeters, extensions and distances.

Today, the quality and coverage of spatial information on land use are generally satisfactory. Such information constitutes the bulk of relevant auxiliary information; this also applies to developing countries. GIS satellites provide maps of land use from which indications of crop quantities and yields can be obtained, with acceptable and useful indications for official statisticians and other stakeholders.

As we showed in a previous FAO report,6 there are several methods for estimating small areas for which no data or insufficient or low-quality data are available. These can be classified into four groups: data interpolation, data integration, data fusion and data disaggregation. The techniques of the first three groups are well known, and are used for crop-yield estimates; they usually use a detailed cell size or a grid laid over the study zone, and they integrate various sources of data.7 The application of remotely sensed data frequently depends on the characteristics of the study zone and the quality of satellite imagery.8 For these reasons it is difficult to provide a set of methods that will be useful for most countries.

Where there are credible aggregate data for large areas that include the small areas in question, data-disaggregation methods are applicable; this applies in developing and developed countries. These methods have general conditions of applicability, and they can be useful in countries involved in the Global Strategy because they can be adapted to a range of situations.

The spatial disaggregation techniques developed by GIS and geo-statistics researchers (Kim and Yao, 2010; Li et al., 2007) are used to break down maps and spatially aggregated data into a zoning system with finer spatial resolution. They are based on estimation and data-interpolation techniques, and take into account assumptions about the spatial distribution of the target variable or the relationship between the target variable and the auxiliary geographical variable. The local area – the target zone – is smaller in extent than the source zone, which is generally a broad area. In other words, “local area” is synonymous with “area of small extension”.

6 Review of projects and contributions on statistical methods for spatial disaggregation and for integration of various kinds of geographical information and geo-referenced survey data. Available at: http://www.fao.org/fileadmin/templates/ess/documents/meetings_and_work-shops/GS_SAC_2013/Better_integration_of_geographic_information_and_statistics/Spatial_disaggregation_and_integration.docx

7 The history of applying GIS and remote sensing data analysis to crop production forecasting is long and rich. A comprehensive collection of possible results using a variety of observation data and processing models is accessible through Crop Explorer http://www.pecad.fas.usda.gov/cropexplorer

8 Data integration, aggregation and fusion techniques are applied to a convergence of evidence including weather patterns, actual ground observations and remotely sensed data. There are, however, numerous countries where some of these areas of evidence are not available.

The SAE methods developed by sample survey statisticians are used to produce statistically sound estimates on the basis of data from surveys and administrative records when the sample size is small or equal to zero in the target area and provides few observations on the variable in question. The term “small area” is used to describe domains with too few observations to give statistically significant results. The estimates are based on the specification of a model that borrows strength from the related areas and links the study variable to the existing auxiliary information (Rao, 2010).

This report contains an overview of methods, particularly those related to the specification of statistical models for SAE, and considers the problems of some advanced SAE methods in particular situations or where deviation from standard assumptions occurs in agricultural surveys. Full understanding of the methods presented requires advanced knowledge of statistics, but our presentation in terms of simulation studies and actual applications should make them useful and accessible to people with relatively limited knowledge of statistics. Some of the methods are complex, and relevant software is identified where needed.

The report is in two parts, and contains nine chapters.

Part I: Review of spatial data disaggregation and SAE methods
Chapter 1: Mapping techniques.
Chapter 2: Small-area estimators.

These chapters review the most common methods for data disaggregation and the most effective models for SAE, which usually use spatial information obtained by aggregation and integration of existing data sources. The assumptions on which the methods are based are clarified. The information needed for each method is described – auxiliary information, coordinates and geographical information, for example – and their role in the production of almost cost-free official statistics is suggested.9 Most of the methods are based on the assumption that the data come from a simple random sample of the population; extensions to more complex sampling systems are also presented. A comparison of the methods on the basis of available information is given, and there are examples of applications to rural and agricultural statistics or to general social and economic statistics.

Part II: Reliability of SAE methods in non-standard situations
Chapter 3: Sensitivity of SAE predictors to spatial model specifications.
Chapter 4: The modifiable area unit problem.
Chapter 5: Robustness of SAE predictors.
Chapter 6: The complexity of sample design.
Chapter 7: Missing data in spatial datasets.
Chapter 8: Excess of zeros in survey data.

There are several open issues with regard to the quality of small-area estimates. These are important for practitioners because they show the benefits and limitations of some advanced methods in specific situations or where deviations from standard assumptions occur in agricultural surveys. We refer in particular to the sensitivity and robustness of SAE methods in the specification of the model linking the study variable to spatial auxiliary information (see Chapter 3). The level of aggregation of spatial data to define target areas affects the “fit” of the model: in other words the area units are modifiable, and this affects the significance of the SAE models (see Chapter 4). The type and quality of available data determine the accuracy of small-area estimates: in general, the robustness of SAE predictors is significant when applying the methods in the presence of outliers and data errors (see Chapter 5). The data used for estimations often come from sample surveys that do not follow the simple random-sampling assumption common to many SAE models. We review the effect of the complexity of the sample design on the model (see Chapter 6), the effect of missing data in the spatial dataset used as auxiliary data (see Chapter 7) and of the excess of zeros in the study variable (see Chapter 8).

9 The Australian Bureau of Statistics provides an online guide to SAE for stakeholders interested in having local data (see http://nss.gov.au/nss/home.NSF/). Interesting work has also been done on the deliverables of the ESSnet SAE and MEMOBUST handbook projects; see: http://www.cros-portal.eu

Chapter 9: Conclusions and final remarks. This chapter summarizes our main findings and recommendations.

The report is intended to stimulate research and application studies of SAE applied to agricultural and rural statistics when geo-referenced auxiliary information is used. For this reason, and because many topics were intentionally omitted, it cannot be considered a compendium of methods of integrating spatial information into SAE applied to agricultural surveys. But it reflects, to the best of our knowledge, the state-of-the-art with regard to several crucial issues.

Part I: Review of Spatial Disaggregation and SAE Methods

1. Spatial Disaggregation: Mapping Techniques

1.1 Introduction

The main idea underlying the techniques in this chapter is to disaggregate spatially aggregated data into a zoning system of higher resolution. The original areas, with known data, are called source zones; the targeted areas are called target zones (Lam, 1983). Spatial disaggregation methods, which are based on areal interpolation techniques, can be classified according to various criteria such as underlying assumptions or the use of ancillary data (Wu et al., 2005). In all such techniques error is inevitably generated by the assumptions about the distribution of the objects – homogeneity of density, for example – or by the spatial relationship imposed in the disaggregation process – the size of the target zones, for example (Li et al., 2007).

Areal interpolation – the process whereby data from one set of source polygons are redistributed to another set of overlapping target polygonal areas10 – is used primarily when the target variables are estimated on the basis of data available from various sources covering the same area, but with different internal boundaries. This approach frequently uses census data as the input, and applies interpolation or disaggregation techniques to obtain a refined population surface.

Two groups of techniques are considered below: i) interpolation based on proportionality to the density distribution of the target variable or of other, auxiliary, variables – the simple area-weighting method, pycnophylactic (mass-preserving) interpolation methods and dasymetric mapping; and ii) interpolation based on regression models – the expectation-maximization (EM) algorithm. Other methods are then proposed to overcome some implausible hypotheses of the first set, and examples of practical applications are presented; further examples can be found at: http://www.integrated-assessment.eu/guidebook/spatial_disaggregation. Most of these methods require digital maps and GIS data to estimate target variables such as crop production, land use and pesticide use.

1.2 Areal interpolation methods

1.2.1 Simple area weighting method

The simplest interpolation approach for disaggregating data is the basic area weighting method, which apportions the attribute of interest by area, given the geometric intersection of the source zones with the target zones. This method assumes that the target variable y is uniformly distributed in each source zone. Given this hypothesis, the data in each target zone can be estimated as:

$$\hat{y}_t = \sum_{s} \frac{A_{st}}{A_s}\, y_s \qquad (1.1)$$

where $\hat{y}_t$ is the estimated value of the target variable in the target zone t, $y_s$ is the observed value of the target variable in source zone s, $A_s$ is the area of source zone s and $A_{st}$ is the area of the intersection of the source and target zones. This method satisfies the pycnophylactic or volume-preserving property, which requires the preservation of the initial data: the predicted value for source area s obtained by aggregating the predicted values at intersections with area s should coincide with the observed value for area s (Do et al., 2013; Li et al., 2007). Several studies show, however, that the overall accuracy of simple area weighting is low compared with other techniques (see, for example Langford, 2006; Gregory and Paul, 2005; Reibel and Aditya, 2006).

10 Includes the simple case of the existence only of non-overlapping target areas in a source area.
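As an illustration of the area-weighting rule in formula (1.1), the following minimal Python sketch apportions source-zone totals to target zones in proportion to the area of each intersection. The function name, the array names and the toy figures are hypothetical, and the sketch assumes that the target zones tile each source zone completely, so that the source-zone area equals the sum of its intersection areas.

```python
import numpy as np

def area_weighting(y_source, overlap_area):
    """Apportion source-zone totals to target zones by share of area.

    y_source     : shape (S,), observed value of the target variable per source zone
    overlap_area : shape (S, T), area of the intersection of source zone s and target zone t
    """
    y_source = np.asarray(y_source, dtype=float)
    overlap_area = np.asarray(overlap_area, dtype=float)
    # Assumption: targets tile each source zone, so its area is the row sum.
    source_area = overlap_area.sum(axis=1)
    shares = overlap_area / source_area[:, None]   # share of zone s falling in target t
    return shares.T @ y_source                     # sum over source zones

# Hypothetical data: 2 source zones, 3 target zones.
y_s = np.array([100.0, 60.0])
A_st = np.array([[2.0, 2.0, 0.0],
                 [0.0, 1.0, 3.0]])
y_t = area_weighting(y_s, A_st)
```

Because the shares of each source zone sum to one, the target estimates add up to the same grand total as the source data, which is the volume-preserving property described above.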

Because the homogeneity assumption of the simple area-weighting method is very rarely acceptable, several approaches have been proposed to relax it. A number of studies, for example, aim to overcome the problem by “smoothing” with density functions such as kernel-based surface functions around area centroids; another approach is Tobler’s (1979) pycnophylactic-interpolation method (Kim and Yao, 2010).

1.2.2 Pycnophylactic interpolation methods

Tobler (1979) proposed the pycnophylactic interpolation method as an extension of simple area weighting to produce smooth population-density data from areally aggregated data. It calculates the target area values on the basis of the values and weighted distance-from-the-centre of neighbouring source areas, maintaining volume consistency in the source areas. It uses the following algorithm:

1. intersect a dense grid over the study region;
2. assign a value to each grid cell using simple area weighting;
3. smooth the values of all the cells by replacing each cell value with the average of its neighbours;
4. calculate the value in each source region by summing all the cell values;
5. weight the values of the target cells in each source area equally, so that source-area values are consistent; and
6. repeat steps 3 to 5 until there are no further changes to a specified tolerance.
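The steps above can be sketched in Python as follows. This is only an illustrative reading of the algorithm, not Tobler's original implementation: the function and variable names are hypothetical, the neighbourhood is assumed to be the four adjacent cells (with border cells reusing their own value for a missing neighbour), step 5 is assumed to be a common multiplicative factor per source zone, and zone totals are assumed to be strictly positive.

```python
import numpy as np

def pycnophylactic(zone_id, zone_total, max_iter=500, tol=1e-6):
    """Smooth a gridded surface while preserving every source-zone total.

    zone_id    : 2-D integer array; zone_id[i, j] is the source zone of grid cell (i, j)
    zone_total : 1-D array of strictly positive observed totals, one per source zone
    """
    zone_id = np.asarray(zone_id)
    zone_total = np.asarray(zone_total, dtype=float)
    if np.any(zone_total <= 0):
        raise ValueError("this sketch assumes strictly positive zone totals")
    n_zones = zone_total.size
    cells_per_zone = np.bincount(zone_id.ravel(), minlength=n_zones)

    # Steps 1-2: on the grid, start from simple area weighting (equal share per cell).
    surface = (zone_total / cells_per_zone)[zone_id]

    for _ in range(max_iter):
        # Step 3: replace each cell by the mean of its four neighbours.
        p = np.pad(surface, 1, mode="edge")
        smoothed = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
        # Step 4: total of the smoothed cells in each source zone.
        zone_sum = np.bincount(zone_id.ravel(), weights=smoothed.ravel(),
                               minlength=n_zones)
        # Step 5: rescale all cells of a zone by one factor to restore its total.
        updated = smoothed * (zone_total / zone_sum)[zone_id]
        # Step 6: stop once the surface changes by less than the tolerance.
        converged = np.max(np.abs(updated - surface)) < tol
        surface = updated
        if converged:
            break
    return surface

# Hypothetical 4 x 6 grid: left half is zone 0, right half is zone 1.
zone_id = np.zeros((4, 6), dtype=int)
zone_id[:, 3:] = 1
totals = np.array([120.0, 24.0])
surface = pycnophylactic(zone_id, totals)
recovered = np.bincount(zone_id.ravel(), weights=surface.ravel())
```

Since every pass ends with the rescaling step, the zone totals are reproduced whether or not the tolerance is reached before `max_iter`.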

In this approach, the choices of the appropriate smooth-density function and of the search window size depend on the characteristics of individual applications. The underlying assumption is that the value of a spatial variable in neighbouring target areas tends to be similar; Tobler’s first law of geography asserts that neighbouring things are more related than distant ones (Tobler, 1979). Comber et al. (2008), for example, refer to an application of pycnophylactic interpolation to agricultural data to identify land-use areas from aggregated agricultural census data.

1.2.3 Dasymetric mapping

The dasymetric mapping method (Wright, 1936; Langford, 2003; Mennis and Hultgren, 2006) takes a different approach. To reflect density variation within source zones, it uses auxiliary information x to distribute y: the additional information is used to estimate the actual distribution of the aggregated data within the target units of analysis, so that y can be allocated to the small intersection zones within the source zones, provided that y is proportional to, and strongly correlated with, x. Hence this method replaces the homogeneity assumption of simple area weighting with the assumption that data are proportional to the auxiliary information on any sub-region. Considering a quantitative variable x, the dasymetric mapping method extends formula (1.1) by substituting x for the area:

$$\hat{y}_t = \sum_{s} \frac{x_{st}}{x_s}\, y_s \qquad (1.2)$$

where $x_s$ is the value of x in source zone s and $x_{st}$ is its value in the intersection of source zone s and target zone t.

The simplest scheme for implementing dasymetric mapping is to use a binary mask of land-cover types (Langford and Unwin, 1994; Langford and Fisher, 1996; Eicher and Brewer, 2001; Mennis and Hultgren, 2006). In this case the auxiliary information is categorical, and its level defines the control zones (see Figure 2 in Example 1: Mask area weighting). The classic case, called binary dasymetric mapping, is population estimation when there are two control zones, one known to be populated and the other unpopulated, and it is assumed that the count density is uniform throughout the control zones. In this case formula (1.1) becomes:

$$\hat{y}_t = \sum_s \frac{A_{tsp}}{A_{sp}}\, y_s \qquad (1.3)$$

where $\hat{y}_t$ is the estimated population in the target zone t, $y_s$ is the total population in source zone s, $A_{sp}$ is the source zone area identified as populated, and $A_{tsp}$ is the area of overlap between target zone t and source zone s, with land cover identified as populated.
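A minimal Python sketch of binary dasymetric allocation follows. The zone names, areas and populations are hypothetical and chosen only for illustration; the function simply spreads each source total over the populated overlap areas in proportion to their share of the populated area of the source zone.

```python
def binary_dasymetric(y_s, a_sp, a_tsp):
    """Binary dasymetric estimate of target-zone totals (illustrative sketch).

    y_s   : dict source -> total population of the source zone
    a_sp  : dict source -> populated area of the source zone
    a_tsp : dict (target, source) -> populated area of their overlap
    """
    estimates = {}
    for (t, s), overlap in a_tsp.items():
        if a_sp[s] > 0:  # an unpopulated source zone contributes nothing
            share = overlap / a_sp[s]
            estimates[t] = estimates.get(t, 0.0) + share * y_s[s]
    return estimates


# Hypothetical example: two source zones, two target zones.
y_s = {"S1": 200.0, "S2": 90.0}
a_sp = {"S1": 4.0, "S2": 3.0}
a_tsp = {("T1", "S1"): 1.0, ("T2", "S1"): 3.0,
         ("T1", "S2"): 2.0, ("T2", "S2"): 1.0}
target_population = binary_dasymetric(y_s, a_sp, a_tsp)
```

Because the populated overlaps of each source zone add up to its populated area, the target estimates sum to the total of the source populations (290 persons in this example).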

Several multi-class extensions to binary dasymetric mapping have been proposed (Kim and Yao, 2010; Mennis, 2003; Langford, 2006). Li et al. (2007) present three-class dasymetric mapping for population estimation that takes advantage of binary dasymetric mapping and a regression model with a limited number of ancillary class variables – non-urban, low-density residential and high-density residential – to represent a range of residential densities in each source zone. The technique is based on a relaxed assumption about homogeneous density for each land class in each source zone:

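$$\hat{y}_t = \sum_s \sum_c \frac{A_{tsc}}{A_{sc}}\, \hat{y}_{sc} \qquad (1.4)$$

Here, $A_{tsc}$ is the area of intersection between target zone t and source zone s identified as land class c, and $A_{sc}$ is the area of source zone s identified as land class c. Hence $\hat{y}_{sc}/A_{sc}$ represents the density estimate for class c in zone s, with $\hat{y}_{sc}$ the part of the population of source zone s attributed to land class c. These densities can be estimated in a regression model, as described below.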

The dasymetric and pycnophylactic methods have complementary strengths and shortcomings for population estimation and target variable disaggregation. For this reason, several hybrid pycnophylactic/dasymetric methods have been proposed (Kim and Yao, 2010; Mohammed et al., 2012; Comber et al., 2008). These use dasymetric mapping for a preliminary redistribution of the population or variable of interest, and an iterative pycnophylactic-interpolation process to obtain a volume-preserved smoothed surface. Comber et al. (2008) use the hybrid method to disaggregate agricultural census data to obtain fine-grained 1 km² maps of agricultural land use in the United Kingdom.

1.2.4 Examples

Example 1 – Simple tabular and graphical examples

In this first example we show how the different methods work, with simple examples.

Figure 1.1 illustrates a hypothetical example taken from Shu et al. (2010) with data for three source areas that are to be split into 25 target areas. The examples also compare the two approaches: volume-preserving interpolation and non-volume preserving interpolation.

Block (b) shows the results of applying the simple area weighting method; blocks (c) and (d) show the results obtained by applying volume-preserving interpolation and non-volume-preserving interpolation. In block (c), non-volume-preserving interpolation returns total values of 112, 76 and 304 for the three polygons; these differ from the original values of 90, 80 and 360 because the volumes are not preserved.


Figure 1.1: Example of volume-preserving interpolation


Figure 1.2 shows a simplified simulation of areal interpolation of artificial population data (see: http://www.integrated-assessment.eu/guidebook/spatial_disaggregation) comparing the simple area-weighting method with dasymetric mapping and considering the different density distribution of the different areas.

The simple area-weighting method attributes T = (100 × 0.25) + (60 × 0.25) = 40 persons to the target area. The calculation is done in proportion to the relative extent of the intersections T∩A and T∩B between the source and target areas.

The mask area weighting limits the intersection to the populated zone, thereby modifying the relative weights of T∩A and T∩B to the values 0.5 and 0.25.

Finally, dasymetric disaggregation takes into account the distribution of the population in the intersections and their relative extensions.
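As a worked check of the first two cases, and assuming that the mask weights of 0.5 and 0.25 apply to the same source totals of 100 and 60 persons used in the area-weighting calculation (an assumption, since the figure itself is not reproduced here), the two estimates compare as follows:

```latex
\begin{align*}
T_{\text{area}} &= 100 \times 0.25 + 60 \times 0.25 = 25 + 15 = 40,\\
T_{\text{mask}} &= 100 \times 0.50 + 60 \times 0.25 = 50 + 15 = 65.
\end{align*}
```

Under this assumption the mask raises the estimate for the target area because a larger share of the populated part of source area A falls inside it.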


Figure 1.2: Simulation of areal interpolation of population data

Figure 1.3 shows the results of mask area weighting applied to the disaggregation of county pesticide usage in East Anglia in the UK, where the target areas are defined on a 5 × 5 km grid.


Figure 1.3: Disaggregation of county pesticide usage

(http://www.integratedassessment.eu/guidebook/mask_area_weighting_example_pesticide_usage)

Example 2 – Comparison of simple area weighting method to estimate crop production, applied by using three different kinds of proportions

You and Wood (2006) presented an application in which Brazilian state-level production statistics were used to generate pixel-level crop production data for eight crops. The robustness of the results of this entropy-based approach was compared with short-cut approaches to allocating crop-production statistics. They examined three possible short-cut methods for assigning state-level crop areas to municipalities: i) in proportion to the total land area of the municipalities; ii) in proportion to the cropland area of each municipality; and iii) in proportion to the amount of biophysically suitable land for the production of each crop in each municipality. For all crops, the proposed entropy-based approach was most successful in predicting municipality crop areas – by a large margin for wheat and beans. The simplest procedure – distributing crop production in proportion to the total areas of the municipalities – was the second-best method for maize and beans, which are grown extensively to meet ubiquitous demand for primary foods and commodities such as maize-based feed.
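The three short-cut allocations differ only in the proxy variable used to split the state total. The Python sketch below illustrates this with hypothetical figures (the municipality names, areas and state total are invented for illustration and are not taken from You and Wood).

```python
def allocate_proportionally(state_total, proxy):
    """Split a state-level total across municipalities in proportion to a proxy."""
    proxy_sum = sum(proxy.values())
    if proxy_sum <= 0:
        raise ValueError("proxy values must sum to a positive number")
    return {m: state_total * value / proxy_sum for m, value in proxy.items()}


# Hypothetical state with 1200 ha of one crop and three municipalities.
state_crop_area = 1200.0
total_land = {"M1": 500.0, "M2": 300.0, "M3": 200.0}   # method i
cropland = {"M1": 100.0, "M2": 250.0, "M3": 50.0}      # method ii
suitable = {"M1": 0.0, "M2": 300.0, "M3": 100.0}       # method iii

by_land = allocate_proportionally(state_crop_area, total_land)
by_cropland = allocate_proportionally(state_crop_area, cropland)
by_suitable = allocate_proportionally(state_crop_area, suitable)
```

Each allocation preserves the state total, but the municipal shares can differ sharply: here M1 receives 600 ha under method i, 300 ha under method ii and nothing under method iii.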

Example 3 – Application of pycnophylactic interpolation and dasymetric mapping

Comber et al. (2008) describe an approach combining dasymetric and volume-preserving techniques to create a national land-use dataset at 1 km² resolution. The results for an English county are compared with contemporaneous aggregated habitat data, and the results show that accurate estimates of local arable and grass land-use patterns can be obtained when individual 1 km squares are combined into blocks of more than nine squares, thereby providing local estimates of agricultural land use. This in turn allows more detailed modelling of land uses related to livestock and cropping activities.


Example 4 – Adaptation of Dasymetric mapping to agricultural data

De Belém et al. (2012) examine the adaptation of dasymetric mapping methods to agricultural data, including testing and transposition, to recover the underlying statistical surface – that is, an approximation of the real distribution of data. The method was applied in the Alentejo region of Portugal using data from the 1999 agricultural census; several counties were used as source zones. The aim was to generate a distribution of agro-forestry occupations as close as possible to reality. Two lines of analysis were followed: i) simultaneous application of the method to all counties to obtain a definition of regional densities; and ii) separate application of the method to different sub-areas with similar characteristics to obtain a definition of sub-regional densities. The results were validated through error indicators at the county level and in a sample of parishes. The second variant of the method, which gave more precise results and was superior for the types of data available, yielded maps in which the distribution of the most relevant agro-forestry occupations was closest to reality.

1.3 Spatial disaggregation with interpolation based on Regression models

1.3.1 Regression models

The dasymetric weighting schemes in the previous section have several restrictions: i) the assumption of proportionality of y and x; ii) the fact that the auxiliary information should be known at the intersection level; and iii) the limitation to a single auxiliary variable. Spatial disaggregation techniques based on regression models can overcome these constraints (Langford et al., 1991; Yuan et al., 1997; Shu and Lam, 2011). Another limitation of the dasymetric method is that when predicting at the level of the source/target intersection s-t, only the areal datum $y_s$ of the source zone in which the intersection is nested is used for prediction. This is not the case for regression: in general, the regression techniques involve a regression of the source-level data of y on the target or the control values of x.

Generally speaking, regression models for estimating population counts assume that the given source zone population may be expressed as a sum of a set of densities related to the areas assigned to the different land classes. Other ancillary variables may be included for these area densities, but the basic model is:

$$y_s = \sum_c \beta_c\, A_{sc} + \varepsilon_s \qquad (1.5)$$

where $y_s$ is the total population count for each source zone s, c is the land cover class, $A_{sc}$ is the area size for each land class in each source zone, $\beta_c$ is the coefficient of the regression model and $\varepsilon_s$ is the random error. The output of the regression model is an estimate of the population densities $\hat{\beta}_c$. A problem with this regression model is that the densities are derived from a global context and remain spatially stable in each land class in the study area; it has therefore been suggested that the locally fitted approach used by the dasymetric method will always outperform the global fitting approach used by regression models (Li et al., 2007). To overcome this limitation, locally fitted regression models have been proposed where the globally estimated density for each land class is locally adjusted in each source zone by the ratio of the predicted population and census counts to obtain a variation of the absolute value of population densities by reflecting the differences in terms of local population density between source zones. These methods were developed initially to ensure that the populations reported in target zones were constrained to match the sum of the source zones – the pycnophylactic property.
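The global fit and the local adjustment can be sketched as follows. This is an illustrative Python example under stated assumptions: two land classes only, ordinary least squares without an intercept, hypothetical areas and counts, and a local adjustment that multiplies the global densities by the ratio of the census count to the predicted count in each source zone (one reading of the adjustment described above).

```python
def fit_class_densities(areas, counts):
    """Least-squares fit of y_s = b1 * A_s1 + b2 * A_s2 (no intercept).

    areas  : list of (A_s1, A_s2) tuples, one per source zone
    counts : list of source-zone population counts y_s
    """
    s11 = sum(a1 * a1 for a1, _ in areas)
    s12 = sum(a1 * a2 for a1, a2 in areas)
    s22 = sum(a2 * a2 for _, a2 in areas)
    t1 = sum(a1 * y for (a1, _), y in zip(areas, counts))
    t2 = sum(a2 * y for (_, a2), y in zip(areas, counts))
    det = s11 * s22 - s12 * s12  # normal equations solved by Cramer's rule
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


def locally_adjusted_densities(areas, counts, beta):
    """Rescale the global densities so each source zone reproduces its count."""
    adjusted = []
    for (a1, a2), y in zip(areas, counts):
        predicted = beta[0] * a1 + beta[1] * a2
        ratio = y / predicted  # census count over predicted count
        adjusted.append((beta[0] * ratio, beta[1] * ratio))
    return adjusted


# Hypothetical source zones: areas of two land classes and census counts.
areas = [(2.0, 1.0), (1.0, 3.0), (4.0, 2.0), (3.0, 3.0)]
counts = [77.0, 160.0, 140.0, 180.0]
beta = fit_class_densities(areas, counts)
local = locally_adjusted_densities(areas, counts, beta)
```

Applying the locally adjusted densities to the class areas of a source zone returns its census count exactly, which is how the adjustment restores the pycnophylactic property.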


1.3.2 The EM algorithm

Another statistical approach is based on the EM algorithm (Flowerdew and Green, 1992). Rather than using a regression approach, the interpolation problem is set as a missing-data problem considering the intersection values of the target variable as unknown and the source values as known. The EM algorithm is used to predict the intersection values. This method is useful when the variable of interest is not a count but can be assumed to follow the normal distribution. Let $y_{st}$ be the mean of the variable of interest over the $n_{st}$ values in the intersection zone s-t, and assume that:

$$y_{st} \sim N\left(\mu_{st}, \frac{\sigma^2}{n_{st}}\right) \qquad (1.6)$$

The values $n_{st}$ are assumed as known or interpolated from $n_s$. Hence:

$$\mu_s = E(y_s) = \frac{1}{n_s}\sum_t n_{st}\,\mu_{st} \qquad (1.7)$$

and:

$$\mu_t = E(y_t) = \frac{1}{n_t}\sum_s n_{st}\,\mu_{st} \qquad (1.8)$$

If the $y_{st}$ were known, we would obtain the mean in target zone t as:

$$y_t = \frac{1}{n_t}\sum_s n_{st}\, y_{st}$$

with $n_t = \sum_s n_{st}$. Setting $\hat{y}_{st} = y_s$ would give the simple areal weighting solution. But with the EM algorithm the interpolated values $\hat{y}_{st}$ can be obtained following E-step and M-step operations until convergence is reached:

E-step:

$$\hat{y}_{st} = E(y_{st} \mid y_s) = \mu_{st} + (y_s - \mu_s)$$

where $\mu_s = \frac{1}{n_s}\sum_t n_{st}\,\mu_{st}$.

M-step:

Treat the $\hat{y}_{st}$ as a sample of independent observations with distribution $N(\mu_{st}, \sigma^2/n_{st})$ and fit the model for $\mu_{st}$ by weighted least squares.

These steps are repeated until convergence, and then the interpolated $\hat{y}_t$ are computed as the weighted mean of the $\hat{y}_{st}$ values from the E-step:

$$\hat{y}_t = \frac{1}{n_t}\sum_s n_{st}\,\hat{y}_{st} \qquad (1.9)$$

If convergence cannot be achieved, an alternative non-iterative scheme can be used (see Flowerdew and Green, 1992).
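The E-step and M-step can be illustrated with a short Python sketch. It is a simplified instance rather than Flowerdew and Green's full procedure: it assumes that the intersection mean depends only on the target zone (a separate mean for each target zone, with no ancillary covariates), so that the weighted least-squares fit in the M-step reduces to a weighted mean, and it starts from the areal weighting solution. The zone counts and source means are hypothetical.

```python
def em_interpolate(y_s, n, tol=1e-10, max_iter=10000):
    """EM areal interpolation for a normally distributed variable (sketch).

    y_s : list of observed source-zone means
    n   : n[s][t] = number of values in the intersection of source s and target t
    Assumption: the intersection mean depends on the target zone only.
    """
    S, T = len(y_s), len(n[0])
    n_s = [sum(n[s]) for s in range(S)]
    n_t = [sum(n[s][t] for s in range(S)) for t in range(T)]
    # Start from the areal weighting solution: every intersection gets y_s.
    y_hat = [[y_s[s]] * T for s in range(S)]
    mu_t = None
    for _ in range(max_iter):
        # M-step: weighted least squares reduces to a weighted mean per target.
        new_mu = [sum(n[s][t] * y_hat[s][t] for s in range(S)) / n_t[t]
                  for t in range(T)]
        # E-step: shift the fitted means so each source mean is reproduced.
        mu_s = [sum(n[s][t] * new_mu[t] for t in range(T)) / n_s[s]
                for s in range(S)]
        y_hat = [[new_mu[t] + y_s[s] - mu_s[s] for t in range(T)]
                 for s in range(S)]
        converged = (mu_t is not None and
                     max(abs(a - b) for a, b in zip(new_mu, mu_t)) < tol)
        mu_t = new_mu
        if converged:
            break
    # Interpolated target means: weighted mean of the intersection values.
    y_t = [sum(n[s][t] * y_hat[s][t] for s in range(S)) / n_t[t]
           for t in range(T)]
    return y_t, y_hat


# Hypothetical example: two source zones and two target zones.
source_means = [10.0, 20.0]
n_values = [[3, 1],
            [1, 3]]
target_means, intersection_means = em_interpolate(source_means, n_values)
```

In this small example the iterations settle on target means of about 5 and 25, the only pair that reproduces both source means, and the E-step keeps each source mean unchanged at every iteration.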

Analogous regression models can be used also to disaggregate count, binary and categorical data (Langford and Harvey, 2001; Tassone et al., 2010).

1.3.3 Examples

Example 5 – Application of regression models to describe the spatial patterns of corn yield

A study by Kaspar et al. (2003) developed a linear-regression model to describe the spatial patterns of corn yield for a 16 ha field in central Iowa, USA. The study examined the relationship between corn (Zea mays L.) yield data from six crop years and relative elevations, slopes and curvatures. Relative elevations were measured by GIS imagery; slopes and curvatures were then determined by digital terrain analysis. The data showed that in the four years with less rain than usual in the growing season, corn yield was negatively correlated with relative elevations, slopes and curvatures, whereas in the two years with more rain than usual, yield was positively correlated with relative elevations and slopes. A multiple linear regression model based on relative elevation, slope and curvature was developed that predicted 78 percent of the spatial variability of the average yield of the transect plots for the four dry years, and that identified the spatial patterns in the entire field for yield monitoring data from 1997, which was one of the dry years. The relationship between terrain attributes and corn yield spatial patterns may provide opportunities for site-specific crop management.

Example 6 – Application of the EM algorithm to produce population density grids

Gallego (2010) described four methods for producing dasymetric population density grids combining population data by commune with the CORINE land-cover map, which is available across the European Union. The four methods apply different versions of the dasymetric method and the EM algorithm. An accuracy assessment in five countries for which a reliable 1 km population-density grid exists showed that the improvement compared with the choropleth map by commune ranged from 20 percent for the weakest result – in Finland – to 62 percent for the best result – in the Netherlands. All methods overestimate populations in agricultural, heterogeneous and forest areas; the overestimation is often smaller for the EM method, but this approach significantly overestimates the population in the class "infrastructure" because it appears mainly in highly populated communes.


2. Small-Area Estimators

2.1 Introduction

Many target parameters in agricultural and rural statistics can be expressed in the form of means and percentages.¹¹

A common practice is to estimate these quantities for sub-populations – or domains – with survey data, but as stated in the Introduction there are geographic domains, or areas, for which sufficiently precise direct estimates cannot be produced. Survey designs usually focus on achieving a particular degree of precision for estimates at a level of aggregation higher than that of small areas.

Knowledge of the parameter for a given domain or small area can be obtained in three ways, depending on the level and type of information from administrative archives and survey data:¹²

The broad area ratio estimator (BARE) is one of the simplest types of small-area model; it is applicable when the study variable is known for a larger domain in which the small area is included. In this case the estimator is calculated by applying the rate for a broad area obtained from a survey – for example, crop yield rates at the district or community-development block level from the Cross-Cutting Experiment in India¹³ – to the small-area population obtained from a population census or demographic estimate.

The success of BARE relies on the choice of the broad area, which must be large enough in terms of sample size to allow for a reliable direct survey estimate but small enough to enable the assumption that the small areas in the broad area are homogenous in terms of the characteristic of interest. This is a major assumption, so users must be aware of it. As with direct estimation, BARE can be used to validate more complex approaches.

The BARE with auxiliary data approach uses information that is correlated with the variable of interest and available at the small-area level to derive an estimate adjusted for compositional differences in small areas. It is a deterministic model that assumes that crop yield rates vary only by household size or farm size; it does not allow for other effects. It can, however, be applied in association with a broad area ratio estimate. As in BARE, there is a strong underlying assumption of homogeneity in broad areas.

Survey evidence shows that in many developing countries crop-yield rates can be correlated with household size. This could mean that areas with a large proportion of large households will have high crop yields. The estimator applies household size and crop yields to a small-area population classified by household size.
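The two BARE variants reduce to simple arithmetic, sketched below in Python. All figures are hypothetical: a broad-area rate per household, and broad-area rates by household-size class applied to the number of households in each class in two small areas.

```python
def bare(broad_rate, small_area_population):
    """Broad area ratio estimate: broad-area rate times small-area population."""
    return broad_rate * small_area_population


def bare_with_auxiliary(class_rates, small_area_counts):
    """BARE with auxiliary data: class-specific broad-area rates applied to
    the small-area population classified by the same classes."""
    return sum(class_rates[k] * small_area_counts[k] for k in small_area_counts)


# Hypothetical broad-area rates (tonnes per household), overall and by class.
overall_rate = 2.0
rates_by_size = {"small": 1.5, "medium": 2.0, "large": 3.0}

# Two hypothetical small areas with 500 households each.
area_1 = {"small": 200, "medium": 200, "large": 100}
area_2 = {"small": 100, "medium": 100, "large": 300}

plain_1 = bare(overall_rate, sum(area_1.values()))
plain_2 = bare(overall_rate, sum(area_2.values()))
adjusted_1 = bare_with_auxiliary(rates_by_size, area_1)
adjusted_2 = bare_with_auxiliary(rates_by_size, area_2)
```

Plain BARE gives both areas the same estimate (1000 tonnes), whereas the version with auxiliary data gives 1000 and 1250 tonnes, reflecting the larger share of large households in the second area.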

These two methods of estimation for small domains are similar to the simplest spatial disaggregation techniques in chapter 1, but they are based on different assumptions. The spatial disaggregation techniques disaggregate maps into more detailed maps on the basis of the assumed spatial relations between source and target zones. The BARE estimators start from survey data and estimates, and spatial relations are not essential.

In a third approach based on SAE methods, spatial relations between areas can be inserted in the definition of model-assisted estimators (MAE) and model-based estimators (MBE), which are based on regression models and make it possible to base small-area predictions on a number of variables. These come from sources other than the sample survey and refer to the area of interest – a municipality or a district for example – or to the unit of interest, which could be a person, a rural household or a farm. In the first case area-level small-area models are defined; in the second, unit-level small-area models.

11 In Europe, many agro-environmental indicators are expressed in percentages, combining different kinds of data with arable land or mostly the “utilized agricultural area” – the total area taken up by arable land, permanent grassland, permanent crops and kitchen gardens (see Eurostat, LUCAS).

12 See also: Australian Bureau of Statistics. 2006. A Guide to Small-Area Estimation. Canberra.

13 In this context, crop yield, or agricultural output, refers to the measure of the yield of a crop per unit area of land cultivation (see Sud et al., 2011).

The term “small area” is used to describe domains whose sample sizes are not large enough to enable sufficiently precise direct estimates. In practice it is not possible to plan for all possible areas or domains and uses of survey data, because “the client will always require more than is specified at the design stage” (Fuller, 1999). When direct estimation is not possible, one must rely on alternative model-based methods for producing small-area estimates: these depend on the availability of population-level auxiliary information related to the variable in question and use linear mixed models; they are commonly referred to as “indirect methods” (see Rao, 2003; Ghosh and Rao, 1994; Pfeffermann, 2002; Jiang and Lahiri, 2006a; and Pfeffermann, 2013).

Accurate spatial data can nowadays be derived from satellite imagery and GIS.

The SAE models are generally regression models. Figure 2.1 shows how an SAE model works with the MAE or MBE approach.

To understand the characteristics of SAE methods, we may suppose that a population U of size N is divided into m non-overlapping subsets $U_1, \dots, U_m$, which may be domains of study or small areas, of sizes $N_1, \dots, N_m$. These domains refer to geographical areas such as municipalities or census divisions, or to an agricultural group such as type of production or a farm, or to a demographic group such as a population defined by age, gender and race in a large geographical area, or to a cross-classification of these.

The index j identifies the units of the population, the index i the small areas. The population data consist of values $y_{ij}$ of the variable of interest, and of values $\mathbf{x}_{ij}$ of a vector of p auxiliary variables. A sample s of n units is drawn from the population according to some sampling system such that the inclusion probability of unit j in area i is given by $\pi_{ij}$. The values of $y_{ij}$ are known for area-specific samples $s_i$ of size $n_i$, and unknown for each unit of the set $r_i$, which contains the non-sampled units in small area i. The p-vector of auxiliary variables $\mathbf{x}_{ij}$ is known for each unit of area i from external sources such as a census. At the least, the area-level totals or means are accurately known for all the small areas of interest.

Spatial information can enrich the auxiliary variables for sampled and non-sampled units.

Note that an area-specific sample of size $n_i > 0$ may not be available for every area. The sample design may leave areas with $n_i = 0$, in which case $s_i$ is the empty set and the areas are out-of-sample areas.

Figure 2.1 shows fixed effects covariates $f(\mathbf{x}_{ij})$ and random effects at the area level $g(u_i)$ and at the individual level $e_{ij}$.


Figure 2.1: How an SAE unit-level model works

Once the model has been fitted to the data, the SAE estimates are obtained by combining the direct estimates with the predicted values ŷij. Their properties are evaluated in the MAE or MBE approach (see section 2.2) to find efficient predictors in terms of mean squared error (MSE) for the target area parameters. The objective of SAE predictors is to produce accurate estimates for small areas, and they should improve the precision of direct estimates. Although direct estimators have several sound properties, direct estimates often lack precision when domain sample sizes are small.

2.2 A classification of SAE models

As we have indicated, target variables at the area level can be estimated through: i) the design-based approach (see Hansen et al., 1953; Kish, 1965; Cochran, 1977); ii) the MAE approach (see Särndal et al., 1992); and iii) the MBE approach (see Ghosh and Meeden, 1997; Valliant et al., 2000; Rao, 2003). The resulting small-area estimates are either direct or indirect (see Figure 2.2).

Direct estimates are obtained in the design-based approach from data obtained from a single survey with the application of Horvitz-Thompson-type estimators. For the production of significant direct estimates at the small-area level, design issues that affect small-area estimation must be considered, particularly in the context of large-scale surveys. Rao (2003) discusses some of these design issues, and refers to Singh et al. (1994) for detailed analysis. When designing a survey, the use of direct estimators in SAE can be facilitated by: i) minimizing clustering; ii) replacing broad strata with several narrow strata from which samples are drawn; iii) adopting compromise sample allocations to satisfy reliability requirements at the small-area level and the large-area level; and iv) integrating surveys such as dual-frame surveys and repeated surveys.
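A direct estimate uses only the sampled units of the area and their inclusion probabilities. The Python sketch below shows the Horvitz-Thompson total and, as a labelled variant, the Hájek ratio form of the domain mean, which divides by the sum of the weights instead of a known area size. The values, inclusion probabilities and area size are hypothetical.

```python
def horvitz_thompson_total(y, pi):
    """Horvitz-Thompson estimate of a domain total: sum of y_j / pi_j."""
    return sum(yj / pj for yj, pj in zip(y, pi))


def hajek_mean(y, pi):
    """Hajek (ratio-type) estimate of a domain mean, usable when the
    domain size is unknown: weighted sum divided by the sum of weights."""
    weights = [1.0 / pj for pj in pi]
    return sum(w * yj for w, yj in zip(weights, y)) / sum(weights)


# Hypothetical sample of three units from one small area.
y = [10.0, 20.0, 30.0]
pi = [0.5, 0.5, 0.25]   # inclusion probabilities
N_i = 8                 # assumed known size of the small area

ht_total = horvitz_thompson_total(y, pi)
ht_mean = ht_total / N_i
ratio_mean = hajek_mean(y, pi)
```

With only three sampled units the estimate rests on very little information, which is the precision problem that motivates the indirect estimators discussed next.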


The indirect estimates use auxiliary information, or variables, to improve the accuracy of survey estimates and to break down known values for large areas by using regression models. Indirect estimates are obtained in the model-assisted and model-based approaches where a statistical model – usually a regression model – is specified with a view to reinforcing the validity of evidence, or “borrowing strength”, from the auxiliary variables. Figure 2.2 shows the classification of SAE methods used in this report.

Figure 2.2: A classification of SAE estimation methods

In the MAE approach estimators generally have design-based properties, and their accuracy – as measured by MSE – is derived with the sampling system used to collect the survey data. In the MBE approach the properties of the estimators and their accuracy are evaluated with the statistical model specified for obtaining confirmation from the auxiliary variables.

In the last 30 years, indirect estimates have become popular. The generalized regression (GREG) estimator and its modifications together with the empirical best linear unbiased predictors (EBLUP) specified in the linear mixed models (LMMs) are currently applied by many statistics agencies. In LMMs the distribution of the study variable is a function of area-specific random effects and of unit random effects. Area-random effects make it possible to include differences between areas in the model.

The characteristics of the data available for the study have motivated the specification of area-level and unit-level models. The best linear unbiased predictors can be obtained in an area-level model – the Fay and Herriot model (FH-EBLUP) – or an EBLUP (Henderson, 1975; Rao, 2003), in which case it takes into account the area-specific random effect and the individual random effect.


These predictors can incorporate geographic information referring to the areas of interest as the spatial empirical best linear unbiased predictor (SEBLUP). Models can include random effects; those that do not are known as synthetic models.

All SAE models must reflect underlying data – continuous, count or categorical, for example – and must take account of specific characteristics of the distribution of the target variable such as non-parametric specifications or methods unaffected by outliers.

A recent approach to SAE is based on the use of M-quantile (MQ) models, which are specified at the unit level (Chambers and Tzavidis, 2006). Differences between areas can be captured through M-quantile coefficients. This approach can be extended with the M-quantile geographically weighted regression (MQGWR) estimator (Salvati et al., 2012).

Geostatistical models such as geo-additive models, kriging and MQGWR can also play a role in the spatial extension of SAE estimators (see section 2.7 of Part I and chapter 3 of Part II).

A number of further developments have taken place in the SAE literature in recent years.

The estimation of parameters other than averages and totals is the subject of several papers: examples include quantiles of the small-area distribution function of the outcome of interest (Tzavidis et al., 2010) and complex indicators (Molina and Rao, 2010; Marchetti et al., 2012). Opsomer et al. (2008) focused on non-parametric versions of the random-effects model; others focused on the specification of models that borrow strength in spatial terms by applying models with spatially correlated or non-stationary random effects (Benedetti et al., 2012; Salvati et al., 2012; Chandra et al., 2012). The issue of outlier-robust SAE has attracted interest mainly because in many real data applications the Gaussian assumptions of the conventional random effects model are not satisfied (Sinha and Rao, 2009).

Categorical survey variables are not suited to standard SAE methods based on LMM. One option in such cases is to adopt an empirical best predictor based on generalized LMMs. Some details are given in section 2.6.7; a Bayesian approach to the non-spatial and spatial mixed effects models for SAE is described in section 2.6.6.

The main estimators of the MAE and MBE approaches are described in sections 2.3, 2.4 and 2.5. We give an example of calculation for each estimator, and present previous work applied to agricultural data. For each estimator the data needed to apply it is specified, its advantages and disadvantages in comparison with others are discussed and the extensions applied to overcome them are identified. Additional extensions or alternatives to the main solutions that will be used in Part II are reviewed in section 2.6.

2.3 Model-assisted estimators

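2.3.1 GREG estimator

GREG design-based estimators were introduced for SAE by Särndal (1984). The class of GREG estimators, which encompasses a range of estimators assisted by a model, is characterized by asymptotic design unbiasedness and design consistency. GREG estimators share the following structure:

$$\hat{\bar{Y}}_i^{GREG} = \frac{1}{N_i}\left[\sum_{j \in U_i} \hat{y}_{ij} + \sum_{j \in s_i} w_{ij}\,(y_{ij} - \hat{y}_{ij})\right] \qquad (2.1)$$

Different GREG estimators are obtained in association with different estimation models, that is, different models for calculating the predicted values $\hat{y}_{ij}$, $j \in U_i$. To define this estimator and subsequent estimators we assume that $\mathbf{x}_{ij}$ contains 1 as its first component.

In the simplest case, a fixed-effects regression model is assumed: $E(y_{ij}) = \mathbf{x}_{ij}^T \boldsymbol{\beta}$, $j \in U_i$, where the expectation is taken with respect to the assisting model. When sampling weights are used in the estimation of the regression model, this leads to the estimator GREG-S:

$$\hat{\bar{Y}}_i^{GREG\text{-}S} = \frac{1}{N_i}\left[\sum_{j \in U_i} \mathbf{x}_{ij}^T \hat{\boldsymbol{\beta}}_w + \sum_{j \in s_i} w_{ij}\,(y_{ij} - \mathbf{x}_{ij}^T \hat{\boldsymbol{\beta}}_w)\right] \qquad (2.2)$$

where $\hat{\boldsymbol{\beta}}_w = \left(\sum_i \sum_{j \in s_i} w_{ij}\, \mathbf{x}_{ij} \mathbf{x}_{ij}^T\right)^{-1} \sum_i \sum_{j \in s_i} w_{ij}\, \mathbf{x}_{ij}\, y_{ij}$ and $w_{ij} = 1/\pi_{ij}$ (Rao, 2003, section 2.5). Note that in this case the regression coefficients are calculated on the basis of data from the whole sample and are not area-specific.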
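The structure of the GREG-S estimator can be illustrated with a small Python sketch. It assumes a single auxiliary variable plus an intercept, so that the weighted least-squares fit has a closed form, and it works with area means: the synthetic part is the fitted line evaluated at the known population mean of the auxiliary variable, and the correction is the weighted sum of sample residuals divided by the area size. The sample records, weights and population figures are hypothetical.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def greg_means(sample, population):
    """GREG estimates of area means (illustrative sketch).

    sample     : list of (area, y, x, weight) records from the whole sample
    population : dict area -> (N_i, population mean of x in the area)
    """
    b0, b1 = weighted_line_fit([r[2] for r in sample],
                               [r[1] for r in sample],
                               [r[3] for r in sample])
    estimates = {}
    for area, (N_i, x_mean) in population.items():
        synthetic = b0 + b1 * x_mean
        # Weighted residuals of the sampled units in this area; the sum is
        # empty (zero) for an out-of-sample area.
        correction = sum(w * (y - (b0 + b1 * x))
                         for a, y, x, w in sample if a == area) / N_i
        estimates[area] = synthetic + correction
    return estimates


# Hypothetical sample from areas A and B; area C has no sampled units.
sample = [("A", 5.0, 1.0, 10.0), ("A", 11.0, 3.0, 10.0),
          ("B", 9.0, 2.0, 10.0), ("B", 13.0, 4.0, 10.0)]
population = {"A": (20, 2.0), "B": (20, 3.5), "C": (30, 1.0)}
area_means = greg_means(sample, population)
```

The regression coefficients come from the whole sample, while the correction term uses only the residuals of the area concerned; for area C the estimate is the synthetic part alone.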

Table 2.1 summarizes the characteristics of the linear GREG estimator, with a focus on its underlying assumptions, its behaviour as an out-of-sample predictor, its design consistency and its robustness against outliers. It also highlights its advantages and disadvantages, which determine its extensions.

Table 2.1. Linear GREG: advantages, disadvantages, extensions

Properties | Advantages | Disadvantages | Extensions
Model assumptions | Design-based from linear-regression model | One-level linear regression only, with fixed effects | Two-level linear model extension (Lehtonen and Veijanen, 1999)
Design consistency | Asymptotic design, absence of bias and design consistency | Sensitivity to extreme values of sampling inclusion probabilities |
Robustness to outliers | Yes | Not robust against outliers | Robust GREG (Duchesne, 1999)
Out-of-sample predictions | | Prediction not inclusive of spatial information | Spatial versions not yet developed; spatial auxiliary info admitted; coordinates of sampled and non-sampled units

Model assumptions

The model assisting the estimation is a fixed-effect linear model with common regression parameters as in Rao (2003), Section 2.5. In this case the resulting small-area estimators can overlook the “area effects” – the inter-area variation beyond that accounted for by model covariates – and may result in inefficient estimators. For this reason, Lehtonen and Veijanen (1999) introduce a supporting two-level model where $E(y_{ij} \mid \mathbf{u}_i) = \mathbf{x}_{ij}^T(\boldsymbol{\beta} + \mathbf{u}_i)$, which is a model with area-specific regression coefficients. In practice not all coefficients need to be random, and models with area-specific intercepts that mimic LMM may be used (see Lehtonen et al., 2003). In this case the estimator GREG-LV takes the form (2.1) with $\hat{y}_{ij} = \mathbf{x}_{ij}^T(\hat{\boldsymbol{\beta}} + \hat{\mathbf{u}}_i)$. Estimators $\hat{\boldsymbol{\beta}}$ and $\hat{\mathbf{u}}_i$ are obtained by using generalized least squares and restricted maximum likelihood methods (see Lehtonen and Pahkinen, 2004, section 6.3).

Design consistency

Design consistency is a general-purpose form of protection against model failures in that it guarantees that estimates make sense even if the assumed model fails completely, at least for large domains. The GREG estimator is asymptotically design-unbiased and consistent, but it can be sensitive to extreme values of inclusion probabilities (Fabrizi et al., 2014). GREG estimators supported by LMM rely on model-based estimation of the model parameters, so the efficiency of the resulting small-area estimators depends on the validity of the model assumptions, and typically on the normality of the residuals.


Robustness to outliers

GREG and GREG-S expressions allow for survey weighting of outlying observations, but this does not guarantee protection against the outlying observations. A robust version of GREG was proposed in Duchesne (1999).

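Predictions for out-of-sample areas

Predictions for the out-of-sample areas – those with zero sample size – are based on the estimated parameters of the linear regression model and on the X auxiliary information:

$$\hat{\bar{Y}}_i^{GREG} = \frac{1}{N_i}\sum_{j \in U_i} \mathbf{x}_{ij}^T \hat{\boldsymbol{\beta}}_w \qquad (2.3)$$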

Spatial versions of the GREG estimator have not yet been developed. Nonetheless, the coordinates of the positions of the sampled and non-sampled units and other auxiliary geographical variables referring to the same area can be included in the regression model (see Part II, chapter 3). This method takes spatial interaction into account when it results from the covariates themselves and not from the spatial relation between the areas in the study zone.

2.3.2 Example of calculation of GREG estimator

This example shows how to use and apply the generalized-regression estimator (2.1) introduced by Särndal (1984) to obtain small-area estimates of area mean values.

The target parameter is the mean forest biomass per hectare in municipalities, taken as small areas. The data are from the Norwegian National Forest Inventory, which provides estimates of forest parameters at the national and regional scales from a network of permanent sample plots. The dataset is in the public domain, and detailed information is available in Breidenbach and Astrup (2012).

The application uses R, a language and environment for statistical computing and graphics that can be downloaded free from http://www.r-project.org. It offers many packages of routines and functions that implement SAE techniques.

The forest in Vestfold county in Norway is a finite population subdivided into 14 municipalities, which are the small areas of interest. Above-ground forest biomass per hectare is the variable of interest, and mean forest biomass per hectare in the municipalities is the population characteristic of interest. Data on forest biomass per hectare – biomass/ha – are available for 145 sample plots; auxiliary data on mean canopy height are also available from GIS images.

Table 2.2 shows the first 6 of 145 lines on the sample plots of the Norwegian National Forest Inventory.

The R package JoSAE contains the function eblup.mse.f.wrap, which can be used to obtain GREG and EBLUP small-area estimates. The data can be loaded into R with the commands:

    library(JoSAE)
    data(JoSAE.sample.data)


Table 2.2: Norwegian National Forest Inventory sample data

sample.ID   domain.ID   biomass.ha   mean.canopy.ht
103         1            92.726264    93.278765
110         2            45.674576    67.355850
112         2            18.350648    32.969134
113         2           163.805240   125.400452
114         2            15.952520    30.983355
115         2           309.148416   167.646089
…           …           …            …

The relationship between the target and the auxiliary data available from the sample is shown in Figure 2.3.

The command to obtain the scatterplot is:

    plot(biomass.ha ~ mean.canopy.ht, data = JoSAE.sample.data)

Figure 2.3: Scatterplot of the biomass/ha vs mean canopy height sample data


Table 2.3: Population data of mean canopy height from digital aerial images

domain.ID   N.i       mean.canopy.ht.bar
1            105267   108.15832
2            202513    77.34845
3            134156    94.26035
4            193807    86.64053
5           1379945    84.87776
6            176731    77.66091
7            474615    71.40756
8            442280    65.50692
9            495568    81.65170
10           520141    80.04376
11           230756    92.17368
12            83441    82.38918
13            57858    63.28690
14           905387    66.04283

Data on mean canopy height are also available for all the population elements. The population here is the forest covered by the GIS images from which the mean canopy height values are derived. Hence the population elements are the image tiles in the forest for which the auxiliary variable, mean canopy height, was calculated.

To load the data in R, use the command:

    data(JoSAE.domain.data)

Using the data in Tables 2.2 and 2.3 we can obtain small-area GREG estimates of the mean of forest biomass/ha in the 14 municipalities.

First, an LMM must be estimated to obtain predicted yij values.


The R commands are:

    library(nlme)  # provides lme()
    fit.lme <- lme(biomass.ha ~ mean.canopy.ht, data = JoSAE.sample.data,
                   random = ~1 | domain.ID)

where biomass.ha is the response variable, mean.canopy.ht is the auxiliary variable, JoSAE.sample.data is the data source and random = ~1 | domain.ID indicates that an LMM is being fitted in which the second-level units are identified by domain.ID.

Check that the name of the auxiliary variable is the same in the population and sample datasets. This is not the case in the package example data, so the name of the mean canopy data must be changed, for example from mean.canopy.ht.bar to mean.canopy.ht. This can be done in R with the commands:

    d.data <- JoSAE.domain.data
    names(d.data)[3] <- "mean.canopy.ht"

This provides all the information needed to obtain the GREG estimates with the eblup.mse.f.wrap function:

    results <- eblup.mse.f.wrap(domain.data = d.data, lme.obj = fit.lme)

The eblup.mse.f.wrap function takes two arguments: domain.data, which contains the population data, in this case the dataset d.data, and lme.obj, which contains the fitted LMM, in this case fit.lme.

The eblup.mse.f.wrap function automatically produces several results, including the GREG point estimates and the associated GREG.se values. These results can be obtained with the commands:

    results.GREG <- cbind(results$GREG, results$GREG.se)
    results.GREG
               [,1]       [,2]
     [1,] 112.97430         NA
     [2,]  87.43037 22.3619200
     [3,] 105.08065 24.9611244
     [4,]  99.75545  0.6452861
     [5,] 115.19719  8.6429206
     [6,] 136.17706 16.9847138
     [7,] 135.54343 14.8789830
     [8,] 105.79197 15.4088365
     [9,] 112.59132  7.1419053
    [10,] 100.88560 12.3484579
    [11,] 142.97128 24.7789172
    [12,]  74.36564         NA
    [13,] 124.35662         NA
    [14,] 106.32493  8.3036947

The results are also shown in Table 2.4.


Table 2.4: Point and MSE GREG estimates of the mean forest biomass/ha for the 14 municipalities

domain.ID   GREG        GREG.se
1           112.97430   NA
2            87.43037   22.3619200
3           105.08065   24.9611244
4            99.75545    0.6452861
5           115.19719    8.6429206
6           136.17706   16.9847138
7           135.54343   14.8789830
8           105.79197   15.4088365
9           112.59132    7.1419053
10          100.88560   12.3484579
11          142.97128   24.7789172
12           74.36564   NA
13          124.35662   NA
14          106.32493    8.3036947

Note that for the three municipalities with only one sample observation (domains 1, 12 and 13) no MSE values are estimated.

Table 2.5 summarizes the data and software needed to implement the GREG estimator. It is data-hungry in that unit-level information is required to fit the model.

Given the estimated regression parameters, however, the out-of-sample predictions can be obtained even when only the population-level average values of the auxiliary variables are known.
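The point made above can be illustrated with a minimal base-R sketch. It is not the JoSAE computation: it uses only the six sample plots displayed in Table 2.2 (not the full 145-plot sample), replaces the LMM with an ordinary least-squares working model, and assumes equal survey weights within a domain. The object names (plots, xbar, greg_2, pred_3) are chosen here for illustration, so the resulting numbers will not match Table 2.4.

```r
# Six sample plots displayed in Table 2.2 (illustration only)
plots <- data.frame(
  domain.ID      = c(1, 2, 2, 2, 2, 2),
  biomass.ha     = c(92.726264, 45.674576, 18.350648,
                     163.805240, 15.952520, 309.148416),
  mean.canopy.ht = c(93.278765, 67.355850, 32.969134,
                     125.400452, 30.983355, 167.646089)
)

# Population means of the auxiliary variable for domains 2 and 3 (Table 2.3)
xbar <- c("2" = 77.34845, "3" = 94.26035)

# Assumption: OLS working model instead of the LMM used in the report
fit <- lm(biomass.ha ~ mean.canopy.ht, data = plots)
b0  <- unname(coef(fit))[1]
b1  <- unname(coef(fit))[2]

# Regression-synthetic part: model prediction at the population mean of X
synthetic <- setNames(b0 + b1 * unname(xbar), names(xbar))

# Sampled domain 2: add the mean residual of its plots (equal weights assumed)
res2   <- residuals(fit)[plots$domain.ID == 2]
greg_2 <- synthetic[["2"]] + mean(res2)

# Domain 3 has no plots among these six rows, so only the synthetic
# prediction is available; it needs nothing beyond the population mean of X
pred_3 <- synthetic[["3"]]

c(greg_domain2 = greg_2, synthetic_domain3 = pred_3)
```

For the sampled domain the residual correction pulls the model prediction towards the observed plots; for the domain without plots the prediction rests entirely on the fitted coefficients.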

The method is popular, and routines to implement it can be downloaded free from several websites. In this paper, the main references are the websites of two SAE projects funded by the European Commission: the EURAREA project (see footnote 14) and the SAMPLE project (see footnote 15). The Italian Istituto Nazionale di Statistica (ISTAT; National Statistics Institute) also provides SMART2 software (Fasulo et al., 2013). In any case, GREG estimates can be obtained by applying the R functions described in this section.

14 http://www.ons.gov.uk/ons/guide-method/method-quality/general-methodology/spatial-analysis-and-modelling/eurarea/index.html
15 http://www.sample-project.eu


Table 2.5: Model-assisted methods: data needed and available software

GREG
  Data needed, study variable Y: individual Y microdata classified by areas (sampled units) + individual survey weights (sampled units)
  Data needed, auxiliary information X: individual X microdata classified by areas (sampled and non-sampled units)
  Software: EURAREA project website; AMELI project website; ISTAT; R functions

GREG_S
  Data needed, study variable Y: individual Y microdata classified by areas (sampled units) + individual survey weights (sampled units)
  Data needed, auxiliary information X: individual X microdata classified by areas
  Software: AMELI project website; ISTAT; R functions, SMART2

2.4 Model-Based estimators: area-level

The most popular methods used for model-based SAE employ LMMs. Publications dealing with LMMs include Searle et al. (1992), Longford (1995), McCulloch and Searle (2001) and Demidenko (2004).

Model-dependent estimators that rely on linear-mixed or random-effects models have gained popularity (Rao, 2003; Jiang and Lahiri, 2006a) because they enable the inclusion of a random-area effect to explain inter-area variation in addition to that explained by fixed-effect covariates. The reliability of these methods depends on the validity of the model assumptions, however, which is a criticism often raised in design-based research (Estevao and Särndal, 2004).

2.4.1 FH-EBLUP

The FH-EBLUP is the most popular method for producing small-area estimates from area-level data. The model can be extended to include correlated random area effects, giving the FH-Spatial EBLUP (FH-SEBLUP).

Let $\theta$ be the $m \times 1$ vector of the parameters of inferential interest (typically small-area totals $y_i$ or small-area means $\bar{y}_i$, with $i = 1, \ldots, m$) and assume that the direct estimator $\hat{\theta}$ is available and is design-unbiased:

$$\hat{\theta} = \theta + e \qquad (2.4)$$

where $e$ is an $m \times 1$ vector of independent sampling errors with mean vector 0 and a known diagonal variance matrix $R = \mathrm{diag}(\psi_i)$, with $\psi_i$ representing the sampling variances of the direct estimators of the area parameters of interest.

Usually, $\psi_i$ is unknown and is estimated by various methods such as generalized variance functions (Wolter, 1985; Wang and Fuller, 2003).

The basic area-level model assumes that an $m \times p$ matrix $X$ of area-specific auxiliary variables, including an intercept term, is linearly related to $\theta$ as:

$$\theta = X\beta + u \qquad (2.5)$$

where $\beta$ is the $p \times 1$ vector of regression parameters and $u$ is the $m \times 1$ vector of independent random area-specific effects with zero mean and $m \times m$ covariance matrix $\Sigma_u = \sigma_u^2 I_m$, where $I_m$ is the $m \times m$ identity matrix. The combined model (Fay and Herriot, 1979) can be written as:

$$\hat{\theta} = X\beta + u + e \qquad (2.6)$$


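It is a special type of LMM in which normality and symmetry of the distributions of the u and e components are assumed to hold. In this model, the EBLUP is extensively used to obtain model-based indirect estimators of small-area parameters and associated measures of variability. This approach and its modifications (see footnote 16) allow the survey data to be combined with other data in a synthetic regression fitted using population area-level covariates. The EBLUP estimate of $\theta_i$ is a composite estimate of the form:

$$\tilde{\theta}_i = \hat{\gamma}_i \hat{\theta}_i + (1 - \hat{\gamma}_i)\, x_i^T \hat{\beta} \qquad (2.7)$$

where $\hat{\gamma}_i = \hat{\sigma}_u^2 / (\psi_i + \hat{\sigma}_u^2)$, $x_i^T$ is the i-th row of the matrix of area-level auxiliary variables, $\hat{\beta}$ is the weighted least squares estimate of $\beta$, with weights $(\psi_i + \hat{\sigma}_u^2)^{-1}$, obtained by regressing $\hat{\theta}_i$ on $x_i$, and $\hat{\sigma}_u^2$ is an estimate of the variance component $\sigma_u^2$.

The EBLUP estimate gives more weight to the synthetic estimate $x_i^T \hat{\beta}$ when the sampling variance $\psi_i$ is large or when $\hat{\sigma}_u^2$ is small, and moves towards the direct estimate $\hat{\theta}_i$ as $\psi_i$ decreases or $\hat{\sigma}_u^2$ increases.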
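The shrinkage behaviour of the composite form (2.7) can be sketched in base R as follows. This is a hedged illustration, not the routine used in the report: the helper name fh_composite is hypothetical, the variance component sigma2_u = 70 is an assumed value rather than a fitted one, and the input is limited to the ten areas listed in Table 2.7, so the output will not reproduce Table 2.8.

```r
# Hypothetical helper: composite estimate for a GIVEN variance component
fh_composite <- function(direct, psi, X, sigma2_u) {
  w     <- 1 / (psi + sigma2_u)              # weighted least squares weights
  XtWX  <- crossprod(X, X * w)               # X' W X
  XtWy  <- crossprod(X, direct * w)          # X' W theta-hat
  beta  <- solve(XtWX, XtWy)                 # WLS estimate of beta
  synth <- drop(X %*% beta)                  # regression-synthetic estimates
  gamma <- sigma2_u / (sigma2_u + psi)       # shrinkage weights
  list(beta = drop(beta), gamma = gamma, synth = synth,
       eblup = gamma * direct + (1 - gamma) * synth)
}

# First ten areas of Table 2.7 (no intercept, as in the report's formula)
direct   <- c(30.94, 57.21, 73.75, 66.24, 36.93, 78.53, 50.45, 41.97, 111.57, 10.23)
psi      <- c(21.53, 201.43, 2.82, 21.38, 64.24, 0.16, 14.74, 19.51, 329.39, 0.01)
area     <- c(203.93, 187.22, 590.73, 318.32, 217.25, 1562.02, 101.47, 147.48, 274.27, 145.89)
workdays <- c(73.95, 148.21, 171.34, 105.86, 87.83, 202.68, 93.48, 85.71, 233.63, 32.67)
X        <- cbind(area, workdays)

out <- fh_composite(direct, psi, X, sigma2_u = 70)  # 70 is an assumed value
round(cbind(psi, gamma = out$gamma, direct, synth = out$synth, eblup = out$eblup), 3)
```

Areas with a small sampling variance (areas 6 and 10) receive a weight close to 1 and stay near their direct estimates, while areas with a large sampling variance (areas 2 and 9) are pulled further towards the synthetic estimate.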

Table 2.6 summarizes the properties of the FH-EBLUP.

Table 2.6: EBLUP under area-level specification: advantages, disadvantages and extensions

Model assumptions
  Advantages: efficiency under the assumption of normality of the LMM
  Disadvantages: linearity of the relation with the fixed-effects auxiliary variables; no correlation between the random area effects
  Extensions: non-parametric extension of the EBLUP (Giusti et al., 2012); SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2008)

Design consistency
  Advantages: design-consistent

Robustness to outliers
  Disadvantages: not robust against outliers

Predictions for out-of-sample areas
  Extensions: SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2008, 2009)

Model assumptions

The EBLUP is popular and is efficient under the assumption of normality of the LMM. It is specified under the assumption of linearity of the relation between the study variable and the auxiliary variables. Giusti et al. (2012) extended it, however, with a semi-parametric specification obtained by P-splines, which allows non-linearities in the relationship between the response variable and the auxiliary variables (see section 2.6.1). The correlation between random-area effects is introduced in the SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2008, 2009).

Design consistency

The FH-EBLUP is design-consistent, although the estimator makes use of survey weights only to compute the direct estimates involved in expression (2.7) and, in general, the ψi representing the sampling variances of the direct estimators of the area parameters of interest. These predictors are model-based, and their statistical properties such as bias and MSE are evaluated with respect to the distribution induced by the data-generating process and not with respect to the randomization induced by the sampling design.

16 Jiang et al. (2011) derive the best predictive estimator of the fixed parameters under the Fay–Herriot model and the nested-error regression model. This leads to a new prediction procedure called “observed best prediction”, which is different from EBLUP. The authors show that the best predictive estimator is more reasonable than the traditional estimators derived from estimation considerations such as maximum likelihood and restricted maximum likelihood if the main interest is estimation of small-area means, which is a mixed-model prediction problem.


Robustness to outliers

The FH-EBLUP is not outlier-robust, but it is anticipated that the protection introduced by Sinha and Rao (2009) in the fitting procedure of the unit-level EBLUP can also be used in the FH-EBLUP to reduce the effect of influential residuals.

Predictions for out-of-sample areas

For non-sampled areas, the EBLUP estimate is given by the regression-synthetic estimate $x_i^T \hat{\beta}$, using the known covariates associated with the non-sampled areas. This allows for the inclusion of geographical auxiliary variables, such as the coordinates of the centroids of the areas, as suggested for GREG.

Geographical covariates can take into account spatial interaction when it results from the covariates themselves. In this case it is reasonable to assume that the random small-area effects are independent and that the EBLUP is still a valid predictor. There are circumstances, however, where the spatial interactions between the areas are not self-contained in the covariates themselves and the random effects are consequently spatially correlated. This motivated the spatial extensions of the method (Petrucci and Salvati, 2006; Pratesi and Salvati, 2008).

2.4.2 Example of calculation of the FH-EBLUP estimator

This section describes the use of the FH-EBLUP to estimate the mean agrarian surface area used for grape production (θi) in the 274 municipalities of Tuscany.

The population is based on the 2000 Italian Agricultural Census for the region, which collected information about farmland by type of cultivation, amount of breeding, kind of production, and structure and amount of farm employment. The municipalities are taken to be small areas with population sizes Ni , i = 1, . . . , m from the census.

The aim is to estimate the mean agrarian surface area used for grape production (θi ) in each municipality by using the agrarian surface area for production in hectares (x1i) and the average number of working days in the reference year (x2i ) as covariates in the model.

The sample data are collected by simple random sampling of ni units from each area, with sampling fractions ni/Ni approximately constant and equal to 0.05. For each municipality i, these data are used to compute the direct estimate of the mean surface area for grape production in hectares (yi) and its sampling variance (ψi). The census data provided the agrarian production area in hectares (x1i) and the average number of working days in the reference year (x2i).
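As a hedged illustration of how a direct estimate and its sampling variance could be obtained for one municipality under simple random sampling without replacement, the sketch below applies the textbook estimator with the finite population correction. The function name direct_srs and the farm-level values are hypothetical; the report does not list the unit-level data or state the exact variance estimator that was used.

```r
# Hypothetical helper: direct mean and its estimated sampling variance
# under simple random sampling without replacement
direct_srs <- function(y, N) {
  n <- length(y)
  f <- n / N                                  # sampling fraction
  list(mean = mean(y),                        # direct estimate (yi)
       var  = (1 - f) * var(y) / n)           # sampling variance (psi_i)
}

# Made-up grape surface areas (hectares) for 10 sampled farms out of N = 200,
# i.e. a sampling fraction of 0.05 as in the example
y_farm <- c(12, 0, 35, 48, 5, 60, 22, 0, 17, 31)
est    <- direct_srs(y_farm, N = 200)
est
```

With small ni the estimated variance is itself unstable, which is one reason for smoothing the ψi, for instance with generalized variance functions, before fitting an area-level model.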

The FH-EBLUP is computed with the functions of the R package sae, created by I. Molina from the R functions developed in the SAMPLE project.

Table 2.7 lists the data for the first ten areas: sample size ni, direct estimate yi, the production area in hectares (x1i), the average number of working days in the reference year (x2i) and the sampling variance of the direct estimator (ψi).


Table 2.7: Data on grape production

Small area   ni   yi (grapehect)   x1i (area)   x2i (workdays)   ψi (var)
1            55    30.94            203.93        73.95            21.53
2            14    57.21            187.22       148.21           201.43
3            35    73.75            590.73       171.34             2.82
4            13    66.24            318.32       105.86            21.38
5            10    36.93            217.25        87.83            64.24
6            34    78.53           1562.02       202.68             0.16
7            37    50.45            101.47        93.48            14.74
8            22    41.97            147.48        85.71            19.51
9            45   111.57            274.27       233.63           329.39
10           42    10.23            145.89        32.67             0.01
…            …    …                …             …                …

The following R code reads the dataset in Table 2.7 and runs the function eblupFH on it. Load the package with the command:

    > library(sae)

Load the dataset:

    > data(grapes)

The formula of the mixed-effects model (formula), the variance of the direct estimator (vardir), the maximum number of iterations (MAXITER) and the tolerance (PRECISION) are specified in the function call; the estimation method (method) defaults to REML when it is omitted:

    > resultREML <- eblupFH(formula = grapehect ~ area + workdays - 1, vardir = var,
    +                       data = grapes, MAXITER = 500, PRECISION = 1e-04)

The function returns a list with the following items:

eblup: vector with the values of the estimators for the domains.
fit: a list containing the following items:
• method: type of fitting method applied ("REML", "ML" or "FH").
• convergence: a logical value equal to TRUE if the Fisher-scoring algorithm converges in less than MAXITER iterations.
• iterations: number of iterations performed by the Fisher-scoring algorithm.
• estcoef: a data frame with the estimated model coefficients in the first column (beta), their asymptotic standard errors in the second column (std.error), the t statistics in the third column (tvalue) and the p-values of the significance of each coefficient in the fourth column (pvalue).
• refvar: estimated random effects variance.
• goodness: vector containing three goodness-of-fit measures: log-likelihood (loglike), AIC and BIC.

The results for the first ten municipalities are shown in Table 2.8. The results for all 274 municipalities can be obtained at www.sample-project.eu



Table 2.8

Small area   ni   yi (grapehect)   FH-EBLUP     MSE
1            55    30.94            31.43       1.795904e+01
2            14    57.21            65.60       6.992134e+01
3            35    73.75            73.842213   2.747896e+00
4            13    66.24            63.145510   1.786207e+01
5            10    36.93            38.247705   4.018609e+01
6            34    78.53            78.540217   1.626821e-01
7            37    50.45            49.688146   1.298132e+01
8            22    41.97            41.668088   1.653500e+01
9            45   111.57           110.704473   8.190804e+01
10           42    10.23            10.232886   6.658895e-03
…            …    …                …            …

The MSE estimates can be obtained with the function mseFH:

    > resultMSE <- mseFH(grapehect ~ area + workdays - 1, var, data = grapes)
    > resultMSE$mse
     [1] 1.795904e+01 6.992134e+01 2.747896e+00 1.786207e+01 4.018609e+01
     [6] 1.626821e-01 1.298132e+01 1.653500e+01 8.190804e+01 6.658895e-03
    [11] 1.179526e+01 …

Table 2.8 shows that the MSE of the FH-EBLUP is lower than the variance of the direct estimates in Table 2.7 for most of the areas listed; for area 6 the two values coincide to two decimal places.
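The gain can also be expressed as a coefficient of variation (CV). The following base-R sketch uses only the ten rows printed in Tables 2.7 and 2.8; the vector names are chosen here for illustration, and the tabulated ψi are rounded to two decimals, which affects the comparison for areas with very small variances.

```r
# Values copied from Table 2.7 (direct estimate, sampling variance)
direct <- c(30.94, 57.21, 73.75, 66.24, 36.93, 78.53, 50.45, 41.97, 111.57, 10.23)
psi    <- c(21.53, 201.43, 2.82, 21.38, 64.24, 0.16, 14.74, 19.51, 329.39, 0.01)

# Values copied from Table 2.8 (FH-EBLUP and its estimated MSE)
eblup  <- c(31.43, 65.60, 73.842213, 63.145510, 38.247705,
            78.540217, 49.688146, 41.668088, 110.704473, 10.232886)
mse    <- c(1.795904e+01, 6.992134e+01, 2.747896e+00, 1.786207e+01, 4.018609e+01,
            1.626821e-01, 1.298132e+01, 1.653500e+01, 8.190804e+01, 6.658895e-03)

# Percentage coefficients of variation
cv_direct <- 100 * sqrt(psi) / direct
cv_eblup  <- 100 * sqrt(mse) / eblup

round(cbind(area = 1:10, cv_direct, cv_eblup), 2)
which(cv_eblup >= cv_direct)   # areas where no reduction is visible
```

In these ten rows the CV falls for every area except area 6, where the rounded ψi (0.16) is marginally below the tabulated MSE; the largest reductions occur for the areas with the largest sampling variances (areas 2 and 9).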

2.4.3 SEBLUP

Salvati (2004), Singh, B. et al. (2005), Petrucci and Salvati (2006) and Pratesi and Salvati (2008) proposed the introduction of spatial autocorrelation in SAE in the Fay-Herriot model. The spatial dependence among small areas is introduced by specifying an LMM with spatially correlated random area effects for $\theta$:

$$\theta = X\beta + Dv \qquad (2.8)$$

where $D$ is an $m \times m$ matrix of known positive constants and $v$ is an $m \times 1$ vector of spatially correlated random area effects given by the following simultaneous auto-regressive (SAR) process with SAR coefficient $\rho$ and spatial contiguity matrix $W$ (Cressie, 1993; Anselin, 1992):

$$v = \rho W v + u \quad \Rightarrow \quad v = (I_m - \rho W)^{-1} u \qquad (2.9)$$

where $u$ is the vector of independent random area effects with variance $\sigma_u^2$ introduced in (2.5).

The W matrix describes the spatial interaction structure of the small areas, usually defined through the neighbourhood relationship between areas; generally speaking, W has a value of 1 in row i and column j if areas i and j are neighbours, and 0 otherwise. The auto-regressive coefficient $\rho$ defines the strength of the spatial relationship among the random effects associated with neighbouring areas. For ease of interpretation the spatial interaction matrix is generally defined in row-standardized form, in which the row elements sum to 1; in this case $\rho$ is called a spatial auto-correlation parameter (Banerjee et al., 2004).


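Combining (2.4) and (2.8), the model with spatially correlated errors can be written as:

$$\hat{\theta} = X\beta + D (I_m - \rho W)^{-1} u + e \qquad (2.10)$$

The error terms $v$ have the SAR covariance matrix $G = \sigma_u^2 [(I_m - \rho W)^T (I_m - \rho W)]^{-1}$, and the covariance matrix of $\hat{\theta}$ is given by $V = R + D G D^T$, where $R = \mathrm{diag}(\psi_i)$.

Under model (2.10), the SEBLUP estimator is:

$$\tilde{\theta}_i^S = x_i^T \hat{\beta} + b_i^T D \hat{G} D^T \hat{V}^{-1} (\hat{\theta} - X\hat{\beta}) \qquad (2.11)$$

where $\hat{\beta} = (X^T \hat{V}^{-1} X)^{-1} X^T \hat{V}^{-1} \hat{\theta}$, $\hat{G}$ and $\hat{V}$ are $G$ and $V$ evaluated at the estimates of $\sigma_u^2$ and $\rho$, and $b_i$ is an $m \times 1$ vector with value 1 in the ith position and 0 elsewhere. The predictor is obtained from Henderson’s (1975) results for general LMMs involving fixed and random effects.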
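The matrix expressions above can be sketched in base R as follows. This is a hedged illustration for given values of the variance component and the SAR coefficient, with D taken as the identity matrix; it is not the fitting routine of any package. The function name seblup, the chain-shaped neighbourhood matrix and the values sigma2_u = 70 and rho = 0.6 are assumptions made for the example; only the first four rows of Table 2.7 are used.

```r
# Hypothetical helper: spatial predictor for GIVEN sigma2_u and rho (D = I)
seblup <- function(direct, psi, X, W, sigma2_u, rho) {
  m    <- length(direct)
  A    <- diag(m) - rho * W
  G    <- sigma2_u * solve(crossprod(A))   # sigma2_u [(I - rho W)'(I - rho W)]^-1
  V    <- diag(psi, nrow = m) + G          # V = R + G
  Vi   <- solve(V)
  beta <- solve(t(X) %*% Vi %*% X, t(X) %*% Vi %*% direct)  # GLS estimate
  drop(X %*% beta + G %*% Vi %*% (direct - X %*% beta))
}

# First four areas of Table 2.7
direct <- c(30.94, 57.21, 73.75, 66.24)
psi    <- c(21.53, 201.43, 2.82, 21.38)
X      <- cbind(area     = c(203.93, 187.22, 590.73, 318.32),
                workdays = c(73.95, 148.21, 171.34, 105.86))

# Assumed row-standardized neighbourhood matrix (a chain 1-2-3-4),
# NOT the actual contiguity structure of the municipalities
W <- rbind(c(0.0, 1.0, 0.0, 0.0),
           c(0.5, 0.0, 0.5, 0.0),
           c(0.0, 0.5, 0.0, 0.5),
           c(0.0, 0.0, 1.0, 0.0))

est_spatial <- seblup(direct, psi, X, W, sigma2_u = 70, rho = 0.6)
est_nonsp   <- seblup(direct, psi, X, W, sigma2_u = 70, rho = 0)
round(cbind(direct, est_nonsp, est_spatial), 3)
```

Setting rho to 0 makes G diagonal, so the predictor collapses to the non-spatial composite form, which gives a convenient check of the implementation.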

In the SEBLUP estimator the values of $\sigma_u^2$ and $\rho$ are obtained either by maximum likelihood (ML) or restricted maximum likelihood (REML) methods based on the normality assumption of the random effects (see Singh, B. et al., 2005; Pratesi and Salvati, 2008).

The main features of the SEBLUP are summarized in Table 2.9.

Model assumptions

The SEBLUP is efficient under the assumptions of normality and spatial correlation in the LMM. Its main advantage is the introduction of the spatial relation among the targeted areas through the spatial correlation of the random area effects. When the spatial relationship among the random effects associated with neighbouring areas is strong (absolute value of the autoregressive coefficient greater than 0.5), the efficiency gains are appreciable in comparison with the FH-EBLUP. However, it relies on the stationarity of the spatial relation in the studied zone. It can be extended to allow for local non-stationarity (Benedetti et al., 2012).

Design consistency

The FH-SEBLUP is not design-consistent; it is model-based like the FH-EBLUP. It makes use of survey weights only to compute the direct estimates involved in expression (2.11) and, in general, the ψi representing the sampling variances of the direct estimators of the area parameters of interest.

Table 2.9: SEBLUP under area-level specification: advantages, disadvantages and extensions

Model assumptions
  Advantages: efficiency under the assumption of stationarity of spatial correlation
  Disadvantages: stationarity of spatial correlation given the contiguity matrix
  Extensions: local stationarity extension (Benedetti et al., 2012)

Design consistency
  Disadvantages: not design-consistent

Robustness to outliers
  Disadvantages: not robust against outliers
  Extensions: robust-to-outliers extension (Schmid and Münnich, 2013)

Predictions for out-of-sample areas
  Advantages: prediction based on individual X information, and on spatial contiguity and spatial correlation of out-of-sample area


Robustness to outliers

The FH-SEBLUP is not outlier-robust, but a robustified version protects the estimates against outlying observations (Schmid and Münnich, 2013). The protection is based on the extension of the correction by Chambers et al. (2014) to the SEBLUP, considering area and individual outliers in u and e.

Predictions for out-of-sample areas

The main advantage of the FH-SEBLUP is the introduction of the spatial relation among the target areas into the predictions for the out-of-sample areas (see Saei and Chambers, 2005b).

When the spatial relationship among the random effects associated with neighbouring areas is strong (absolute value of the autoregressive coefficient greater than 0.5), introducing it can mitigate the smoothing of the variability of the predicted values that occurs with the FH-EBLUP (see Saei and Chambers, 2005b).

2.4.4 Example of calculation of FH-SEBLUP estimator

The objective is still to estimate the mean agrarian surface area used for grape production (θi) in each municipality of Tuscany. The data in the FH-EBLUP example can also be regarded as lattice data. In this case the information on the spatial structure of the areas is to be included in the estimation process. The spatial relation between contiguous areas is described by the SAR process. In order to apply the FH-SEBLUP the centroids of the municipalities are taken as spatial reference points.

The m × m proximity matrix W = (wij) was obtained from the neighbourhood structure of the municipalities. We first set wij equal to 1 if municipality i shares an edge with municipality j, and 0 otherwise. Next the rows of W are standardized so that their elements sum to 1. The W matrix for the first ten areas is:

Area   1      2      3      4      5      6      7      8      9      10     …
1      0.000  0.333  0.333  0.000  0.000  0.000  0.000  0.333  0.000  0.000  …
2      0.250  0.000  0.000  0.250  0.000  0.000  0.250  0.250  0.000  0.000  …
3      0.500  0.000  0.000  0.000  0.000  0.000  0.000  0.500  0.000  0.000  …
4      0.000  0.333  0.000  0.000  0.333  0.000  0.333  0.000  0.000  0.000  …
5      0.000  0.000  0.000  0.143  0.000  0.143  0.143  0.000  0.143  0.000  …
6      0.000  0.000  0.000  0.000  0.500  0.000  0.000  0.000  0.500  0.000  …
7      0.000  0.200  0.000  0.200  0.200  0.000  0.000  0.200  0.000  0.000  …
8      0.200  0.200  0.200  0.000  0.000  0.000  0.200  0.000  0.000  0.000  …
9      0.000  0.000  0.000  0.000  0.111  0.111  0.000  0.000  0.000  0.111  …
10     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.250  0.000  …
…      …      …      …      …      …      …      …      …      …      …      …
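The construction of a row-standardized proximity matrix described above can be sketched in base R. The four-area neighbourhood structure below is hypothetical and is not the Tuscan contiguity structure; it only shows the two steps, setting wij to 1 for neighbours and then dividing each row by its sum.

```r
# Hypothetical neighbourhood structure for four areas:
# 1-2, 1-3, 2-3 and 3-4 share an edge
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))

# Step 1: binary contiguity matrix (1 if the areas share an edge, 0 otherwise)
A <- matrix(0, nrow = 4, ncol = 4)
A[edges]          <- 1
A[edges[, 2:1]]   <- 1          # contiguity is symmetric

# Step 2: row standardization, so that each row sums to 1
W <- A / rowSums(A)

round(W, 3)
rowSums(W)
```

An area with no neighbours would have a zero row sum and produce NaN values in its row, so isolated areas (for example islands) need separate treatment before standardization.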


The function eblupSFH can be used to fit the spatial Fay-Herriot model. It gives small-area estimators based on a spatial Fay-Herriot model in which the area effects follow a SAR(1) process. Compared with the call to eblupFH, only the proximity matrix W (grapesprox) has to be added as an argument:

    > data(grapesprox)
    > resultREML.sp <- eblupSFH(formula = grapehect ~ area + workdays - 1, vardir = var,
    +                           proxmat = grapesprox, data = grapes,
    +                           MAXITER = 500, PRECISION = 1e-04)

The results returned by this function are the same as those of eblupFH, with the addition of the estimated spatial correlation (spatialcorr), here equal to 0.61:

    > resultREML.sp
    $eblup
    31.2473574 71.7091179 73.8818783 62.3119391 39.5331862 78.5372343 50.1179508
    41.2215285 109.4147029 10.2332370 …

    $fit
    $fit$method
    [1] "REML"

    $fit$convergence
    [1] TRUE

    $fit$iterations
    [1] 6

    $fit$estcoef
                    beta   std.error    tvalue      pvalue
    area     -0.01236461 0.002071297 -5.969498 2.37984e-09
    workdays  0.49978791 0.012429599 40.209495 0.00000e+00

    $fit$refvar
    [1] 69.74899

    $fit$spatialcorr
    [1] 0.6142697

    $fit$goodness
      loglike       AIC       BIC
    -1210.200  2428.401  2442.853

The MSE estimates can be obtained with the function mseSFH (Molina et al., 2009).

Table 2.10 shows the SAE estimates and the MSE.



Table 2.10

Small area   ni   yi (grapehect)   FH-SEBLUP   MSE
1            55    30.94            31.25      1.660957e+01
2            14    57.21            71.71      5.176486e+01
3            35    73.75            73.88      2.720800e+00
4            13    66.24            62.31      1.690723e+01
5            10    36.93            39.53      3.136957e+01
6            34    78.53            78.54      1.626324e-01
7            37    50.45            50.12      1.226629e+01
8            22    41.97            41.22      1.509230e+01
9            45   111.57           109.41      4.918304e+01
10           42    10.23            10.23      7.456631e-03
…            …    …                …           …

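A comparison of Table 2.8 with Table 2.10 shows that in this case the FH-SEBLUP has lower MSE values than the FH-EBLUP for nine of the ten areas listed; area 10, whose MSE is very small under both estimators, is the exception. The reduction occurs because the data are spatially correlated.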
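The comparison between the two tables can be made explicit with a few lines of base R. The sketch uses only the ten MSE values printed in Tables 2.8 and 2.10; the vector names are chosen here for illustration.

```r
# Estimated MSE values copied from Table 2.8 (FH-EBLUP) and Table 2.10 (FH-SEBLUP)
mse_fh  <- c(1.795904e+01, 6.992134e+01, 2.747896e+00, 1.786207e+01, 4.018609e+01,
             1.626821e-01, 1.298132e+01, 1.653500e+01, 8.190804e+01, 6.658895e-03)
mse_sfh <- c(1.660957e+01, 5.176486e+01, 2.720800e+00, 1.690723e+01, 3.136957e+01,
             1.626324e-01, 1.226629e+01, 1.509230e+01, 4.918304e+01, 7.456631e-03)

# Relative change in MSE when moving from the FH-EBLUP to the FH-SEBLUP (%)
change_pct <- 100 * (mse_sfh - mse_fh) / mse_fh

data.frame(area = 1:10, mse_fh, mse_sfh, change_pct = round(change_pct, 1))
which(mse_sfh > mse_fh)   # areas where the spatial model does not reduce the MSE
```

The largest relative reductions are found for the areas with the largest direct variances (areas 2, 5 and 9), where borrowing strength from neighbouring areas matters most.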

2.4.5 Applications to agricultural data

In the last 40 years the Fay-Herriot model has been applied in many empirical studies. The results are generally satisfactory given the characteristics of each case-study.

Fuller (1981) applied the area-level model FH-EBLUP to estimate mean soybean hectares per segment in 1978 at the county level in the USA. He used the mean number of pixels of soybeans per area segment obtained by satellite imagery and the mean soybean hectares from the 1974 United States Agricultural Census as area-level covariates. Survey estimates for a sample of m = 10 counties were obtained by sampling area segments in sampled counties (see also Rao, 2010).

Petrucci et al. (2005) applied the FH-SEBLUP to estimate the average production of olives per farm in 42 local economic systems in Tuscany. The authors note that the introduction of spatial interaction through the SEBLUP improves the estimates by reducing the MSE. This happens because the covariates cannot take into account the spatial interaction in the target variable.

Sud et al. (2011) applied the FH-EBLUP to estimate crop yield at the district level in Uttar Pradesh in India. They used data from supervised crop-cutting experiments on paddy rice under the Improvement of Crop Statistics scheme for the kharif (autumn crop) season, collected during 2009/10. The state is divided into 70 districts: under the Improvement of Crop Statistics scheme there are sample data in 58 districts, and 12 districts are out-of-sample areas. The population census provides auxiliary variables of average household size and the female population in marginal households, which are also used for the SAE in the out-of-sample districts. The estimated coefficients of variation indicate that the resulting estimators are more reliable than the direct survey estimates.

2.4.6 Final remarks on the FH-EBLUP and FH-SEBLUP

The FH-EBLUP and FH-SEBLUP require only the direct survey estimates at the area level; they do not require access to microdata. For this reason the methods are frequently applied, and routines to implement them can be downloaded free from several websites. Table 2.11 shows the data and software required.


The FH-SEBLUP also requires spatial information on the area of interest. Spatial-contiguity matrices, centroids of the areas and their coordinates and the distances between them are easily obtained when GIS maps of the studied areas are available. Using GIS, the available dataset for a study can be combined with digital maps of the study area to enrich the description of small areas with geographical coordinates, their geometric properties and neighbourhood structures. Other spatial reference data such as land-parcel codes, street addresses and postal codes can be added from other digital maps to facilitate the link with additional auxiliary variables from other sources.

Table 2.11: Model-based methods under area-level specification: data and software

EBLUP
  Data needed: area-level direct estimates of means, percentages and totals (sampled areas); area-level auxiliary information (sampled and non-sampled areas)
  Software: SAMPLE project website; CRAN repository

SEBLUP
  Data needed: area-level direct estimates of means, percentages and totals (sampled areas); area-level auxiliary information and contiguity matrix W
  Software: SAMPLE project website; ISTAT; CRAN repository

In practical applications it is important to complement the estimates with the estimated MSE as a measure of their accuracy. FH-EBLUP and FH-SEBLUP have estimates of their MSE. The routines for application provide estimates of MSE based on this approximately unbiased analytical estimator:

$$\mathrm{mse}[\tilde{\theta}_i(\hat{\omega})] = g_{1i}(\hat{\omega}) + g_{2i}(\hat{\omega}) + 2 g_{3i}(\hat{\omega})$$

where $\hat{\omega}$ is equal to $\hat{\sigma}_u^2$ when the EBLUP is used, and to $(\hat{\sigma}_u^2, \hat{\rho})$ when considering the SEBLUP estimator. The MSE estimator has the same form as that derived by Prasad and Rao (1990). For more details on the specification of the g components in both models see Pratesi and Salvati (2009), and Giusti et al. (2012) for the semi-parametric version of the FH-EBLUP. For a detailed discussion of the MSE and its estimation for the EBLUP based on the traditional Fay-Herriot model, see Rao (2003). An alternative procedure for estimating the MSE of the estimators $\tilde{\theta}_i$ and $\tilde{\theta}_i^S$ can be based on the bootstrapping procedures proposed by Gonzalez-Manteiga et al. (2007), Molina et al. (2009) and Opsomer et al. (2008).

The explicit modelling of spatial effects in the FH-EBLUP is advisable when: i) there are no geographic covariates that can take into account the spatial interaction in the target variable; or ii) there are some geographic covariates but the spatial interaction is so important (absolute value of the autoregressive spatial coefficient greater than 0.5) that the small-area random effects are presumed to be still correlated. In this case, taking advantage of the information about the related areas appears to be the best solution; the FH-SEBLUP is more efficient than the FH-EBLUP with uncorrelated area random effects.

Both predictors are useful for estimating small-area parameters efficiently when the model assumptions hold, but they can be sensitive to “representative outliers”, or departures from the assumed normal distributions for the random effects in the model. Chambers (1986) defines a representative outlier as “a sample element with a value that has been correctly recorded and that cannot be regarded as unique. In particular, there is no reason to assume that there are no more similar outliers in the non-sampled part of the population.” Welsh and Ronchetti (1998) regard representative outliers as “extremely related to the bulk of the data.” That is, the deviations from the underlying distributions or assumptions refer to the fact that a small proportion of the data may come from an arbitrary distribution rather than the underlying “true” distribution, which may result in outliers or influential observations in the data. When the outlying observations are representative, the protections suggested by Sinha and Rao (2009) and Schmid and Münnich (2013) are recommended.


2.5 Model-based estimators: unit-level

The EBLUP based on unit-level data is the standard tool for producing small-area estimates. As with the area-level specification, it can be extended to allow for correlated random-area effects, giving the SEBLUP. The MQ small-area approach, which is based on MQ regression models, is a robust alternative to the standard approach based on mixed-effects models (EBLUP). The MQGWR model extends the MQ model to include spatial relations between areas, hence enabling local rather than global robust parameters for MQ models.

2.5.1 EBLUP

Let $x_{ij}$ denote a vector of p auxiliary variables for each population unit j in small area i and assume that information for the variable of interest y is available only from the sample. The aim is to use the data to estimate various area-specific quantities. A popular approach is to use mixed-effects models with random-area effects. A linear mixed-effects model is:

$$y_{ij} = x_{ij}^T \beta + z_{ij} u_i + e_{ij},$$ (2.12)

where $\beta$ is the vector of regression coefficients, $u_i$ denotes a random-area effect that characterizes differences in the conditional distribution of y given x between the m small areas, $z_{ij}$ is a constant whose value is known for all units in the population and $e_{ij}$ is the error term associated with the j-th unit in the i-th area. Conventionally, $u_i$ and $e_{ij}$ are assumed to be independent and normally distributed, with mean zero and variances $\sigma_u^2$ and $\sigma_e^2$ respectively. The EBLUP of the mean for small area i (Battese et al., 1988; Rao, 2003) is then:

$$\hat{m}_i^{EBLUP} = N_i^{-1} \Big[ \sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} \hat{y}_{ij} \Big]$$ (2.13)

where $\hat{y}_{ij} = x_{ij}^T \hat{\beta} + z_{ij} \hat{u}_i$, $N_i$ is the number of population units in area i, $s_i$ denotes the sampled units in area i, $r_i$ denotes the remaining units in the area i and $\hat{\beta}$ and $\hat{u}_i$ are obtained by substituting an optimal estimate of the covariance matrix of the random effects in (2.12) into the best linear unbiased estimator of $\beta$ and the best linear unbiased predictor of $u_i$. For the estimation of the MSE of (2.13) see Prasad and Rao (1990).
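The calculation behind (2.13) can be sketched for a single area in base R. This is a minimal illustration, not the routine used in the report: it assumes $z_{ij} = 1$, one auxiliary variable plus an intercept, and treats the regression coefficients and variance components as already estimated. Under these assumptions the predicted area effect is the usual shrinkage of the mean sample residual. The function name `eblup_area_mean` and all numbers are hypothetical.

```r
# EBLUP of one small-area mean under the nested-error model with z_ij = 1.
# Assumption: beta, sigma2u and sigma2e are treated as already estimated.
eblup_area_mean <- function(y_s, x_s, x_r, beta, sigma2u, sigma2e) {
  X_s <- cbind(1, x_s)                                  # sampled units: intercept + covariate
  X_r <- cbind(1, x_r)                                  # non-sampled units
  n_i <- length(y_s)
  N_i <- n_i + nrow(X_r)
  gamma_i <- sigma2u / (sigma2u + sigma2e / n_i)        # shrinkage factor
  u_hat <- gamma_i * (mean(y_s) - mean(X_s %*% beta))   # predicted area effect
  y_hat_r <- as.numeric(X_r %*% beta) + u_hat           # predictions for units in r_i
  (sum(y_s) + sum(y_hat_r)) / N_i                       # observed + predicted, averaged
}

# Hypothetical toy area: 3 sampled units, 2 non-sampled units
m_hat <- eblup_area_mean(y_s = c(10, 12, 14), x_s = c(1, 2, 3), x_r = c(4, 5),
                         beta = c(2, 3), sigma2u = 4, sigma2e = 6)
```

The shrinkage factor moves toward 1 as the area sample size grows, so the prediction leans more on the area's own residuals; for an area with no sample the area effect is predicted as zero and only the regression part remains.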

Table 2.12 shows the properties of the EBLUP under the unit-level specification.

Table 2.12: EBLUP under unit-level specification: advantages, disadvantages and extensions

| Properties | Advantages | Disadvantages | Extensions/Alternatives |
|---|---|---|---|
| Model assumptions | Efficiency assuming normality of the LMM; random effects at area and unit levels | Linearity of the relation with the auxiliary information; independence of the random-area effects | Non-parametric extension (Opsomer et al., 2008); geographically weighted EBLUP (GWEBLUP) (Chandra et al., 2012); SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2008) |
| Design consistency | | Not design-consistent | Design-consistent weighted extensions (Kott, 1989; Prasad and Rao, 1990; Rao, 2003) |
| Robustness to outliers | | | Robust-to-outliers extension (Sinha and Rao, 2009) |


Model assumptions

The predictor is widely used in real-life applications. It has been extended to overcome the disadvantages deriving from the linearity of the relation with the auxiliary variables and the independence of random area effects. There is a non-parametric extension by Opsomer et al. (2008) and two extensions to include geography and correlation of the random area effects in the specification of the model. The first is the GWEBLUP by Chandra et al. (2012); see section 2.6.4. The second is the EBLUP at the unit level specified under the SAR assumption to derive the SEBLUP, which is described in the following section.

Design consistency

Model-based estimators using unit-level models typically do not make use of survey weights and the derived estimators are generally not design-consistent unless the sampling design is self-weighting within areas. Modifications to achieve design consistency were proposed by Kott (1989), Prasad and Rao (1999) and You and Rao (2002). Although they are design-consistent, these predictors are model-based and their statistical properties such as bias and MSE are evaluated with respect to the distribution induced by the data-generating process and not randomization. Jiang and Lahiri (2006b) obtained design-consistent predictors for generalized linear models, and evaluated their corresponding MSEs with respect to the joint randomization-model distribution.

Robustness to outliers

Sinha and Rao (2009) proposed a robust version of (2.13) that works well in the presence of outlying values. It is based on a modification of the iterative method for the SAE model fitting based on M-estimation.

Out-of-sample predictions

In the mixed model (2.12) the synthetic mean predictor for out-of-sample area i is:

$$\hat{m}_i^{SYNTH} = N_i^{-1} \sum_{j \in U_i} x_{ij}^T \hat{\beta}$$

where $U_i$ denotes the set of the $N_i$ population units in area i. Note that all variation in the area-specific predictions comes from the area-specific auxiliary information. The conventional synthetic estimation for out-of-sample areas can potentially be improved by using a model that borrows strength over space.

Besides the SEBLUP spatial extension, which is described below, there are a non-parametric extension and a geographically weighted extension (see sections 2.6.2 and 2.6.4).

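2.5.2 SEBLUP

As in the area-level models, model (2.12) can be extended to allow for correlated random area effects, specifying an SAR mixed model. Let the deviations v from the fixed part of the model be the result of an autoregressive process with parameter $\rho$ and proximity matrix W (Cressie, 1993): $v = \rho W v + u$, then $v = (I - \rho W)^{-1} u$. The matrix $(I - \rho W)$ needs to be strictly positive-definite to ensure the existence of $(I - \rho W)^{-1}$. This happens if $\rho \in (1/\lambda_{min}, 1/\lambda_{max})$, where the $\lambda$'s are the eigenvalues of matrix W. The model with spatially correlated errors can be expressed as:

$$y = X\beta + Z (I - \rho W)^{-1} u + e$$ (2.14)

with e independent of v. Under (2.14), the spatial best linear unbiased predictor of the small-area mean and its empirical version, the SEBLUP, are obtained following Henderson (1975). In particular, the SEBLUP of the small-area mean is:

$$\hat{m}_i^{SEBLUP} = N_i^{-1} \Big[ \sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} \big( x_{ij}^T \hat{\beta} + z_{ij} b_i^T \hat{v} \big) \Big]$$ (2.15)

where $\hat{\beta} = (X_s^T \hat{V}_s^{-1} X_s)^{-1} X_s^T \hat{V}_s^{-1} y_s$, $\hat{v} = \hat{G} Z_s^T \hat{V}_s^{-1} (y_s - X_s \hat{\beta})$, $\hat{V}_s = Z_s \hat{G} Z_s^T + \hat{\sigma}_e^2 I_n$, $\hat{G} = \hat{\sigma}_u^2 [(I - \hat{\rho} W)^T (I - \hat{\rho} W)]^{-1}$, $X_s$ and $Z_s$ are the sample rows of X and Z, $y_s$ is the vector of the sample observations, $\hat{\sigma}_u^2$, $\hat{\sigma}_e^2$ and $\hat{\rho}$ are asymptotically consistent estimators of the parameters obtained by ML or REML estimation, and $b_i^T$ is the $1 \times m$ vector with value 1 in the i-th position and 0 elsewhere. For the MSE of the predictor (2.15), see Singh et al. (2005).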
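The SAR structure of the area effects can be illustrated with a short base R sketch. It assumes a hypothetical symmetric binary contiguity matrix for three areas arranged in a row, computes the admissible interval for the autoregressive parameter from the extreme eigenvalues of W, and builds the implied covariance matrix of the area effects for given parameter values. The function name `sar_cov` and the numbers are illustrative only; in practice the parameters are estimated by ML or REML.

```r
# SAR structure for area effects: v = (I - rho W)^{-1} u, u ~ (0, sigma2u I).
# Assumption: W is a symmetric binary contiguity matrix (three areas in a row).
W <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)

# Admissible interval for rho from the smallest and largest eigenvalues of W
lambda <- eigen(W, symmetric = TRUE, only.values = TRUE)$values
rho_range <- c(1 / min(lambda), 1 / max(lambda))

# Covariance matrix of v for given rho and sigma2u
sar_cov <- function(rho, sigma2u, W) {
  A <- diag(nrow(W)) - rho * W
  sigma2u * solve(t(A) %*% A)
}

G <- sar_cov(rho = 0.5, sigma2u = 1, W = W)   # rho chosen inside rho_range
```

With the autoregressive parameter set to zero the covariance matrix reduces to a diagonal matrix, which is the independent random-effects case of model (2.12).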

Table 2.13 summarizes the main characteristics of the SEBLUP.

Table 2.13: SEBLUP under unit-level specification: advantages, disadvantages and extensions

| Properties | Advantages | Disadvantages | Extensions/Alternatives |
|---|---|---|---|
| Model assumptions | Efficiency assuming stationarity of spatial correlation | Stationarity of spatial correlation given the contiguity matrix | GWEBLUP (Chandra et al., 2012); spatial model-based direct estimator (Chandra et al., 2007); possible extensions to outlier resistance (Sinha and Rao, 2009; Schmid and Münnich, 2013) |
| Design consistency | | Not design-consistent | |
| Out-of-sample predictions | Prediction based on individual X information and on spatial contiguity and spatial correlation of out-of-sample areas | | |

Model assumptions

An alternative approach for incorporating the same spatial information in the model is the spatial model-based direct estimator by Chandra et al. (2007). This also assumes the stationarity of spatial correlation given the contiguity matrix. A more general approach assumes that the regression coefficients vary spatially across the geography of interest. Models of this type can be fitted using geographically weighted regression (GWR), and are suitable for modelling spatial non-stationarity (Brunsdon et al., 1998; Fotheringham et al., 2002). Chandra et al. (2012) proposed a GWEBLUP for a small-area average and an estimator of its conditional MSE (see sections 2.6 and 2.7).

Design consistency

As with the unit-level EBLUP, the SEBLUP typically does not make use of survey weights and the derived estimators are generally not design-consistent unless the sampling design is self-weighting within areas. Modifications following the solutions proposed for the EBLUP can be design-consistent, but their statistical properties such as bias and MSE are evaluated with respect to the distribution induced by the data-generating process and not randomization.

Robustness to outliers

Sinha and Rao (2009) proposed a robust version of (2.13) that works well in the presence of outlying values. This solution may possibly be extended to the SEBLUP following Schmid and Münnich (2013).

Out-of-sample predictions

Because the SEBLUP allows for spatially correlated area random effects, its predictions for area parameters can take into account the contribution of the random part of the model for both sampled and out-of-sample areas (see Saei and Chambers, 2005a). This counters the tendency to smooth the variability of the predicted values in comparison with those obtained by the EBLUP.

2.5.3 The MQ estimator

None of the predictors described above are robust to deviations from the underlying distributions and assumptions unless they are extended with another estimator designed to address the problem.

A recently proposed approach to SAE based on the use of MQ models (Chambers and Tzavidis, 2006) is naturally robust against the effect of outlying observations on the validity of the small-area model, a feature that should be useful when the method is applied to agricultural surveys.

A linear MQ regression model is one where the MQ $Q_q(x; \psi)$ of the conditional distribution of y given x satisfies:

$$Q_q(x; \psi) = x^T \beta_\psi(q)$$ (2.16)

Here $\psi$ denotes the influence function associated with the MQ, usually a Huber-type function $\psi(t) = t\,I(-c \le t \le c) + c\,\mathrm{sgn}(t)\,I(|t| > c)$, where c is the tuning constant and t is the error. Other influence functions can be used.

For specified q and continuous $\psi$, an estimate $\hat{\beta}_\psi(q)$ of $\beta_\psi(q)$ is obtained from iterative weighted least squares.
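The iterative weighted least squares fit can be sketched in base R as follows. This is an illustrative implementation under stated assumptions, not the SAMPLE project code: it uses one auxiliary variable plus an intercept, a Huber influence function with tuning constant `k`, and the median absolute deviation divided by 0.6745 as the robust scale estimate. The function name `mq_fit` and the data are hypothetical.

```r
# M-quantile regression by iteratively re-weighted least squares (IWLS).
# Assumptions: one covariate plus intercept, Huber influence function with
# tuning constant k, and MAD / 0.6745 as the robust scale estimate.
mq_fit <- function(x, y, q = 0.5, k = 1.345, maxit = 100, tol = 1e-6) {
  X <- cbind(1, x)
  beta <- solve(crossprod(X), crossprod(X, y))       # OLS starting values
  for (it in seq_len(maxit)) {
    r <- as.numeric(y - X %*% beta)                  # current residuals
    s <- median(abs(r - median(r))) / 0.6745         # robust scale
    if (s == 0) stop("scale estimate is zero")
    u <- r / s
    psi <- pmax(-k, pmin(k, u))                      # Huber influence function
    w <- ifelse(u == 0, 1, psi / u) *                # Huber weights
      2 * ifelse(r > 0, q, 1 - q)                    # asymmetric M-quantile factor
    beta_new <- solve(crossprod(X, w * X), crossprod(X, w * y))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  as.numeric(beta)                                   # (intercept, slope)
}

# Hypothetical data
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)
beta_q50 <- mq_fit(x, y, q = 0.5)
beta_q90 <- mq_fit(x, y, q = 0.9)
```

With q = 0.5 and a very large tuning constant all weights equal 1, so the fit coincides with ordinary least squares; smaller tuning constants downweight large standardized residuals.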

The MQ coefficients of the population units obtained by the fitting of the model are the basis for constructing an alternative to random effects for characterizing the variability across the population. For unit j with values $y_j$ and $x_j$, this coefficient is the value $\theta_j$ such that $Q_{\theta_j}(x_j; \psi) = y_j$. We observe that if a hierarchical structure explains part of the variability in the population data, units within the areas (or clusters) defined by this hierarchy are expected to have similar MQ coefficients.

When the conditional MQs are assumed to follow a linear model, with $\beta_\psi(q)$ a sufficiently smooth function of q, a predictor of small-area parameters is suggested in the form:

$$\hat{m}_i^{MQ} = N_i^{-1} \Big[ \sum_{j \in s_i} y_j + \sum_{j \in r_i} \hat{Q}_{\hat{\theta}_i}(x_j; \psi) \Big]$$ (2.17)

where $\hat{Q}_{\hat{\theta}_i}(x_j; \psi) = x_j^T \hat{\beta}_\psi(\hat{\theta}_i)$ and $\hat{\theta}_i$ is an estimate of the average value of the MQ coefficients of the units in area i. This is typically the average of the estimates of these coefficients for sample units in the area. These unit-level coefficients are estimated by solving $\hat{Q}_{\theta_j}(x_j; \psi) = y_j$ for $\theta_j$, with $\hat{Q}_q$ denoting the estimated value of (2.16) at q. When there are no sample units in area i, then $\hat{\theta}_i = 0.5$.

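Tzavidis et al. (2010) refer to (2.17) as the “naïve” MQ predictor and note that it can be biased. To address the problem, the authors propose a bias-adjusted MQ predictor of the small-area parameter derived as the mean functional of the Chambers and Dunstan (1986) estimator of the distribution function given by:

$$\hat{m}_i^{MQ/CD} = N_i^{-1} \Big[ \sum_{j \in s_i} y_j + \sum_{j \in r_i} \hat{Q}_{\hat{\theta}_i}(x_j; \psi) + \frac{N_i - n_i}{n_i} \sum_{j \in s_i} \big\{ y_j - \hat{Q}_{\hat{\theta}_i}(x_j; \psi) \big\} \Big]$$ (2.18)

where $n_i$ is the number of sampled units in area i.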
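The difference between the naïve predictor (2.17) and the bias-adjusted predictor (2.18) can be made concrete with a small base R sketch. It assumes that the fitted area-level MQ values are already available for the sampled and non-sampled units of one area; the function names and the numbers are hypothetical.

```r
# Naive (2.17) and bias-adjusted (2.18) M-quantile predictors for one area.
# Assumption: q_hat_s and q_hat_r hold the fitted area-level M-quantile values
# for the sampled and the non-sampled units respectively.
mq_naive_mean <- function(y_s, q_hat_r) {
  N_i <- length(y_s) + length(q_hat_r)
  (sum(y_s) + sum(q_hat_r)) / N_i
}

mq_cd_mean <- function(y_s, q_hat_s, q_hat_r) {
  n_i <- length(y_s)
  N_i <- n_i + length(q_hat_r)
  bias_adj <- (N_i - n_i) / n_i * sum(y_s - q_hat_s)   # scaled sum of sample residuals
  (sum(y_s) + sum(q_hat_r) + bias_adj) / N_i
}

# Hypothetical toy area: 3 sampled units, 2 non-sampled units
naive <- mq_naive_mean(y_s = c(10, 12, 14), q_hat_r = c(15, 17))
cd <- mq_cd_mean(y_s = c(10, 12, 14), q_hat_s = c(9, 12, 12), q_hat_r = c(15, 17))
```

In this toy area the sample residuals sum to 3, so the adjustment raises the predicted mean from 13.6 to 14; when the sample residuals sum to zero the two predictors coincide.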


Note that in simple random sampling in the small areas, (2.18) is also derived from the expected value functional of the area i version of the Rao et al. (1990) distribution function estimator, which is a design-consistent and model-consistent estimator of the finite population distribution function. Table 2.14 shows the main characteristics of the method.

Table 2.14: MQ method under unit-level specification: advantages, disadvantages and extensions

| Properties | Advantages | Disadvantages | Extensions/Alternatives |
|---|---|---|---|
| Model assumptions | It does not require normality of errors | It requires a linear regression model for quantiles | Non-parametric non-linear extension (Pratesi et al., 2008) |
| Design consistency | | Not design-consistent | Design-consistent weighted extension (Fabrizi et al., 2014) |
| Robustness to outliers | Robust against outliers (Giusti et al., 2014) | Possible failures in protection against outliers | Extension for bias correction (Chambers et al., 2014) |
| Out-of-sample predictions | | Prediction not inclusive of spatial information and based only on individual X information | Spatial extension MQGWR (Salvati et al., 2012) |

Model assumptions

The method does not require distributional assumptions about errors. This is an advantage because it makes the method suitable for non-normal study variables, but the MQ approach requires the relation between the quantiles of the study variable and the auxiliary variables to be linear. Its non-parametric extension to allow for non-linearities is described in section 2.6.3. The linear model for quantiles can be extended to local geographical regression in the MQGWR described below.

Design consistency

As with the other unit-level SAE models, MQ models typically do not make use of survey weights and the derived estimators are not design-consistent unless the sampling design is self-weighting within areas. Fabrizi et al. (2014) proposed a model-assisted version of the MQ estimator that includes the sample weights and is design-consistent. Its statistical properties such as the bias and MSE are evaluated with respect to the distribution induced by the data-generating process and by the sampling design.

Robustness to outliers

Resistance to outliers is the result of the M-estimation algorithm and of the Chambers and Dunstan adjustment (Giusti et al., 2014). However, the proposed outlier-robust small-area estimators can be substantially biased when outliers are drawn from a distribution that has a different mean from the rest of the survey data. Chambers et al. (2014) proposed an outlier-robust bias correction for these estimators and two analytical MSE estimators for the ensuing bias-corrected outlier-robust estimators.

Out-of-sample predictions

The predictions for the out-of-sample areas are given by:

$$\hat{m}_i^{MQ/SYNTH} = N_i^{-1} \sum_{j \in U_i} x_j^T \hat{\beta}_\psi(0.5)$$

and are based only on individual X information and the conventional MQ average coefficient value $\hat{\theta}_i = 0.5$. The expression can be modified to include the distances between the areas.

2.5.4 The MQGWR estimator

SAR mixed models are global models in the sense that it is assumed that the relations being modelled hold everywhere in the study area and that spatial correlation at the area level is allowed for. One way of incorporating the spatial structure of the data in the MQ small-area model is through an MQGWR model (Salvati et al., 2012); for a description of GWR see section 2.7. Unlike SAR mixed models, MQGWR models are local models that allow for a spatially non-stationary process in the mean structure of the model.

Given n observations at a set of L locations $\{h_l; l = 1, \dots, L\}$ with $n_l$ data values $\{(y_{jl}, x_{jl}); j = 1, \dots, n_l\}$ observed at location $h_l$, an MQGWR model is defined as:

$$Q_q(x; \psi, h) = x^T \beta_\psi(h; q)$$ (2.19)

where $\beta_\psi(h; q)$ now varies with h as well as with q. The MQGWR is a local model for the entire conditional distribution, not just the mean, of y given x. Estimates of $\beta_\psi(h; q)$ in (2.19) can be obtained by solving:

$$\sum_{l=1}^{L} w(h_l, h) \sum_{j=1}^{n_l} \psi_q\{ y_{jl} - x_{jl}^T \beta_\psi(h; q) \}\, x_{jl} = 0$$ (2.20)

where $w(h_l, h)$ is a spatial weighting function, generally a function of the Euclidean distances between the locations $h_l$ and h, and $\psi_q(t) = 2 \psi(s^{-1} t) \{ q I(t > 0) + (1 - q) I(t \le 0) \}$, where s is a suitable robust estimate of scale such as the median absolute deviation estimate; $\psi$ is normally assumed to be a Huber-type influence function, but other influence functions are also possible. A Huber-type function gives $\psi(t) = t\,I(-c \le t \le c) + c\,\mathrm{sgn}(t)\,I(|t| > c)$. Provided c, the tuning constant, is bounded away from zero, an iteratively re-weighted least squares algorithm can be used to solve (2.20), leading to estimates of the form:

$$\hat{\beta}_\psi(h; q) = \{ X^T W^*(h; q) X \}^{-1} X^T W^*(h; q)\, y$$ (2.21)

In (2.21) y is the vector of n sample y values and X is the corresponding design matrix of order $n \times p$ of sample x values. The matrix $W^*(h; q)$ is a diagonal matrix of order n. Each entry corresponds to a particular sample observation and equals the product of the spatial weight of this observation, which depends on its distance from location h, with the weight that this observation has when the sample data are used to calculate the “spatially stationary” MQ estimate $\hat{\beta}_\psi(q)$.

Provided there are sample observations in area i, an area-specific MQGWR coefficient, $\hat{\theta}_i$, can be defined as the average value of the sample MQGWR coefficients in area i.

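Following Salvati et al. (2012), the bias-adjusted MQGWR predictor of the mean in small area i is:

$$\hat{m}_i^{MQGWR/CD} = N_i^{-1} \Big[ \sum_{j \in U_i} \hat{Q}_{\hat{\theta}_i}(x_{ij}; \psi, h_{ij}) + \frac{N_i}{n_i} \sum_{j \in s_i} \big\{ y_{ij} - \hat{Q}_{\hat{\theta}_i}(x_{ij}; \psi, h_{ij}) \big\} \Big]$$ (2.22)

where $\hat{Q}_{\hat{\theta}_i}(x_{ij}; \psi, h_{ij})$ is defined through the model (2.19), $h_{ij}$ is the location of unit j in area i and $U_i$ is the set of population units in area i. For details of the MSE estimator of predictor (2.22) see Salvati et al. (2012).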

Table 2.15 summarizes the main features of the MQGWR method.

Table 2.15: MQGWR under unit-level specification: advantages, disadvantages and extensions

| Properties | Advantages | Disadvantages | Extensions |
|---|---|---|---|
| Model assumptions | It does not require normality of errors | It requires linear models for quantiles, but includes spatial regression | Possible extensions to other spatial weighting functions in the spatial regression |
| Design consistency | | Not design-consistent | |
| Robustness to outliers | Robust against outliers | | |
| Out-of-sample predictions | Prediction based on individual X information and on a weighting function based on distances between in-sample and out-of-sample areas | | |

Model specifications

In addition to the characteristics of the MQ method, which are common to this extension, the spatial-regression coefficient allows for the representation of local non-stationarity in the data. Note that the spatial weight $w(h_l, h)$ is derived from a spatial-weighting function whose value depends on the distance from sample location $h_l$ to h, such that sample observations with locations close to the prediction location h receive more weight than those further away. A popular approach to defining such a weighting function is to use:

$$w(h_l, h) = \exp\{ -d_{h_l, h}^2 / (2 b^2) \}$$

where $d_{h_l, h}$ denotes the Euclidean distance between $h_l$ and h and b is the bandwidth, which is best defined using a least-squares criterion (Fotheringham et al., 2002). Alternative weighting functions such as the bi-square function (ibid.) can also be used.
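A minimal base R sketch of the Gaussian weighting function and of one geographically weighted least-squares fit is given below. It is a simplification made for illustration: the robust MQ weights are set to 1, so the fit is the plain GWR special case rather than the full MQGWR estimate, and longitude and latitude in degrees are treated as planar coordinates. The function names are hypothetical; the six data rows are those shown in Table 2.16.

```r
# Gaussian spatial weights and one geographically weighted least-squares fit.
# Assumption: robust M-quantile weights are set to 1, so this is the plain GWR
# special case rather than the full MQGWR fit.
gauss_weight <- function(d, b) exp(-d^2 / (2 * b^2))

local_fit <- function(x, y, lon, lat, lon0, lat0, b) {
  d <- sqrt((lon - lon0)^2 + (lat - lat0)^2)   # Euclidean distance to target location
  w <- gauss_weight(d, b)                      # closer observations get more weight
  X <- cbind(1, x)
  as.numeric(solve(crossprod(X, w * X), crossprod(X, w * y)))
}

# First six lakes of the EMAP survey data (Table 2.16)
x   <- c(237, 260, 333, 475, 171, 335)                 # Elev
y   <- c(412.2, 535.1, 445, 627, 421, 473.8)           # ANC
lon <- c(-68.72129, -69.10148, -69.83241, -69.06759, -68.40639, -68.68429)
lat <- c(47.19984, 47.18163, 46.77043, 46.80504, 47.08244, 46.77794)

# Local coefficients at the location of the first lake, bandwidth 0.5 (arbitrary)
beta_local <- local_fit(x, y, lon, lat, lon0 = lon[1], lat0 = lat[1], b = 0.5)
```

As the bandwidth grows all weights approach 1 and the local fit approaches the global least-squares fit, which is one way to see that the bandwidth controls how local the model is.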

Out-of-sample predictions

The main advantage of the MQGWR in comparison with the MQ estimator is in the out-of-sample predictions. Focusing on the spatial structure of the estimator, the out-of-sample predictions are obtained by:

$$\hat{m}_i^{MQGWR/SYNTH} = N_i^{-1} \sum_{j \in U_i} \hat{Q}_{0.5}(x_{ij}; \psi, h_{ij})$$

With non-spatial modelling, all variation in the area-specific predictions comes from the area-specific auxiliary information. As described above, one way of improving the conventional synthetic estimation for out-of-sample areas is by using a model that borrows strength over space. In the MQGWR model the improvement is sought under the assumption of local rather than global stationarity of the spatial relation.

Design consistency

The MQGWR typically does not make use of sampling weights, and it is not design-consistent. Possible extensions such as that of Fabrizi et al. (2014) have not yet been tested.

Robustness to outliers

As with the MQ predictor, robustness to outliers is the result of the M-estimation algorithm and of the Chambers and Dunstan adjustment (Giusti et al., 2014). However, no studies have yet tested the robustness of the MQGWR.

2.5.5 Example of calculation of EBLUP, MQ and MQGWR estimators

This section gives an example of applying the EBLUP, the MQ and the MQGWR methods to obtain small-area estimates.

The target parameter is mean acid neutralizing capacity (ANC), which is an indicator of the acidification risk in bodies of water. This indicator has to be estimated at the level of the hydrologic unit, a domain for which there are few small-area observations in the United States Environmental Protection Agency's northeast lakes survey (Larsen et al. 2001).

A sample of 334 lakes is selected from the population of 21,026 using a random-systematic design. The lakes in this population are grouped according to 113 hydrologic unit codes (HUCs), of which 64 contain fewer than 5 observations and 27 have no observations. The variable of interest, y, is the ANC indicator: the higher the ANC the more acid a body of water can neutralize, and the less susceptible it is to acidification. The number of observed sites is 349, with 551 measurements: this gives 86 sampled areas out of 113 areas, with a sample of 551 units. For each sampled location, the Environmental Monitoring and Assessment Program (EMAP) dataset includes the elevation, the auxiliary variables used in the small-area models, x, and geographical coordinates of the centroid of each lake in the target area.

Table 2.16 shows the first 6 of the 551 lines in the EMAP survey dataset.

Table 2.16: EMAP Northeast lakes survey data

Lake id HUC lon lat Elev ANC

ME750L 1010001 -68.72129 47.19984 237 412.2

ME250L 1010001 -69.10148 47.18163 260 535.1

ME751L 1010001 -69.83241 46.77043 333 445

ME753L 1010002 -69.06759 46.80504 475 627

ME519L 1010003 -68.40639 47.08244 171 421

ME251L 1010003 -68.68429 46.77794 335 473.8

… … … … … …

For each lake in the target population, the HUC, the elevation, the longitude and latitude are known. Table 2.17 shows the first 6 of the 21,026 lines of the population data.


Table 2.17: EMAP Northeast lakes population data

Lake id HUC lon lat Elev

ME750L 1010001 -68.72129 47.19984 237

ME250L 1010001 -69.10148 47.18163 260

ME751L 1010001 -69.83241 46.77043 333

ME753L 1010002 -69.06759 46.80504 475

ME519L 1010003 -68.40639 47.08244 171

ME251L 1010003 -68.68429 46.77794 335

… … … … …

For each unit in the population, sampled and non-sampled, the HUC – the small area to which the lake belongs – the longitude, the latitude and elevation are known, and the ANC target variable for the 551 sampled units is also known. From these data we can obtain small-area estimates of the mean of ANC using the EBLUP, the MQ and the MQGWR.


Estimates of the EBLUP can be obtained using the function eblupBHF in the R package “sae”. The MQ and the MQGWR estimates can be obtained using the R functions mq_function and mqgwr.sae, available under the SAMPLE project.

EBLUP: function eblupBHF, package “sae”

The eblupBHF function requires the following arguments:
• formula: A symbolic description of the model to be fitted, for example y~1+x (ANC~1+Elev)
• dom: Vector of small-area codes for sampled units (dom = HUC in Table 2.16)
• selectdom: Vector of the small-area codes for which estimates are required (here all 113 hydrologic units)
• meanxpop: Data frame with domain codes in the first column. Each remaining column contains the population means of each of the p auxiliary variables for all the domains. The domains considered in meanxpop must contain those specified in selectdom (in the example, meanxpop contains the small-area means of the Elev variable)
• popnsize: Data frame with small-area codes in the first column and the corresponding small-area population sizes in the second column (in the example, popnsize contains the population size for each of the 113 hydrologic units)
• method: A character string. If "REML", the model is fitted by maximizing the restricted log-likelihood. If "ML" the log-likelihood is maximized. Defaults to "REML"
• data: Optional data frame containing the variables named in formula and dom. By default the variables are taken from the environment from which eblupBHF is called.

The following example of usage of the R package sae lists all the commands used:

>library(sae)

This command loads the package “sae”, which contains the function eblupBHF. Now the EMAP survey dataset shown in Table 2.16 and the EMAP population dataset shown in Table 2.17 are loaded:

>survey.lake=read.table("EMAPLakeSurvey.txt",header=TRUE,dec=",")
>population.lake=read.table("EMAPLakePopulation.txt",header=TRUE, dec=",")

The data frame survey.lake contains the sampled-unit data; the population data are in the data frame population.lake. The small-area means of Elev and the small-area population sizes are arranged as data frames with the domain codes in the first column, and the eblupBHF function can now be run:

>Xmean = aggregate(Elev~HUC, data=population.lake, FUN=mean)
>Popn = aggregate(Elev~HUC, data=population.lake, FUN=length)
>SaeEst = eblupBHF(formula=ANC~1+Elev, dom=HUC, selectdom=Xmean$HUC, meanxpop=Xmean, popnsize=Popn, method="REML", data=survey.lake)

The function eblupBHF returns a list with the following objects:
• eblup: Data frame with number of rows equal to the number of selected domains (113 here), containing in its columns the domain codes (domain) and the EBLUPs of the means of the selected domains based on the nested error linear regression model (eblup). For domains with zero sample size, the EBLUPs are the synthetic regression estimators.
• fit: A list containing the following objects:
  • summary: Summary of the unit-level model fitting
  • fixed: Vector with the estimated values of the fixed regression coefficients
  • random: Vector with the predicted random effects
  • errorvar: Estimated model error variance
  • refvar: Estimated random effects variance
  • loglike: Log-likelihood
  • residuals: Vector with raw residuals


Results are shown in Table 2.18. The estimates for the 27 out-of-sample areas are obtained with a synthetic estimator.

Table 2.18: Estimated average ANC for all the 113 HUCs obtained using the eblupBHF R function

HUC EBLUP RMSE CV

1010001 396.29 144.00 0.363

1010002 709.67 239.29 0.337

1010003 513.85 87.13 0.170

… … … …
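The CV column of Table 2.18 appears consistent with the ratio of the RMSE to the point estimate. The following base R lines, using only the three rows shown in the table, reproduce the reported values under that assumption; the same relation also holds for the rows shown in Tables 2.19 and 2.20.

```r
# Reproduce the CV column of Table 2.18 as RMSE divided by the estimate.
# Assumption: CV = RMSE / EBLUP, which matches the three rows shown.
est  <- c("1010001" = 396.29, "1010002" = 709.67, "1010003" = 513.85)
rmse <- c(144.00, 239.29, 87.13)
cv <- round(rmse / est, 3)
```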


MQ: function mq_function, available on the SAMPLE project website

The mq_function requires the following arguments:
• x: Matrix of auxiliary variables for sampled units
• y: The numeric response vector for sampled units
• regioncode.s: Area code for sampled units
• m: The number of small areas
• p: The number of auxiliary variables in x plus 1
• x.outs: Matrix of auxiliary variables for out-of-sample units
• regioncode.r: Area code for out-of-sample units
• tol.value: Convergence tolerance limit for the MQ model; defaults to 0.0001
• maxit.value: Maximum number of iterations for the iterative weighted least squares procedure; defaults to 100
• k.value: Tuning constant used with the Huber proposal 2 scale estimation; defaults to 1.345

The following is an example of how to use mq_function with the EMAP dataset. First, the functions in the file “mq.sae.R” are loaded (at the time of writing an R package is still not available):

>source("mq.sae.R")

The EMAP survey dataset shown in Table 2.16 and the EMAP population dataset shown in Table 2.17 are loaded:

>survey.lake=read.table("EMAPLakeSurvey.txt",header=TRUE, dec=",")
>population.lake=read.table("EMAPLakePopulation.txt",header=TRUE, dec=",")

The data frame survey.lake contains the sampled-unit data; the population data are in the data frame population.lake. To run mq_function the sampled units must be removed from the population data frame to obtain the out-of-sample data needed by the function. Because the lake identifiers are character codes, the sampled lakes are dropped by matching on the identifier:

>outsample.lake=population.lake[!(population.lake$id %in% survey.lake$id),]

The mq_function function can now be run:

>SaeEst = mq_function(x=survey.lake$Elev, y=survey.lake$ANC, regioncode.s=survey.lake$HUC, m=86, p=2, x.outs=outsample.lake$Elev, regioncode.r=outsample.lake$HUC, tol.value=0.0001, maxit.value=100, k.value=1.345)

The function returns small-area estimates of the mean under the MQ model and the corresponding MSE estimates:
• mq.cd: Estimates of small-area means using the MQ Chambers and Dunstan estimator (Tzavidis et al., 2010)
• mq.naive: Estimates of small-area means using the MQ naive estimator (Chambers and Tzavidis, 2006)
• mse.cd: MSE estimates for the MQ Chambers and Dunstan small-area means
• mse.naive: MSE estimates for the MQ naive small-area means
• code.area: The codes of the small areas

Results for the first three areas are shown in Table 2.19 as an example. Estimates for out-of-sample areas are synthetic estimates obtained at quantile 0.5.


Table 2.19: Estimated average of ANC for all the 113 HUCs obtained using the mq_function R function

HUC MQ RMSE CV

1010001 429.21 31.62 0.074

1010002 719.79 58.74 0.082

1010003 460.32 86.71 0.188

… … … …

MQGWR: function mqgwr.sae, available on the SAMPLE project website

The mqgwr.sae R function requires the following arguments:
• x.s: matrix of auxiliary variables for sampled units (x.s = Elev in Table 2.16)
• y: numeric response vector for sampled units (y = ANC in Table 2.16)
• area.s: vector of small-area codes for sampled units (area.s = HUC in Table 2.16)
• lon.s: vector of longitude of points representing the spatial positions of the sampled observations (lon.s = lon in Table 2.16)
• lat.s: vector of latitude of points representing the spatial positions of the sampled observations (lat.s = lat in Table 2.16)
• x.r: matrix of auxiliary variables for out-of-sample units (x.r = Elev in Table 2.17)
• area.r: vector of small-area codes for out-of-sample units (area.r = HUC in Table 2.17)
• lon.r: vector of longitude of points representing the spatial positions of the out-of-sample observations (lon.r = lon in Table 2.17)
• lat.r: vector of latitude of points representing the spatial positions of the out-of-sample observations (lat.r = lat in Table 2.17)
• k.value: tuning constant used for Huber proposal 2 scale estimation; defaults to 1.345
• method: a character string. If “mqgwr” the MQGWR model is used to fit the MQ surface. If “mqgwr-li” the MQGWR-local intercepts (LI) model is used; defaults to “mqgwr”
• mqgwrweight: geographical weighting function: gwr.gauss() if TRUE or gwr.bisquare() if FALSE; defaults to TRUE

The following is an example of using the R function mqgwr.sae. First, the functions in the file “mqgwr.sae.R” are loaded (at the time of writing an R package is still not available):

>source("mqgwr.sae.R")

Then the EMAP survey dataset shown in Table 2.16 and the EMAP population dataset shown in Table 2.17 are loaded:

>survey.lake=read.table("EMAPLakeSurvey.txt",header=TRUE, dec=",")
>population.lake=read.table("EMAPLakePopulation.txt",header=TRUE, dec=",")

The data frame survey.lake contains the sampled-unit data; the population data are in the data frame population.lake. To run the function mqgwr.sae the sampled units must be discarded from the population data frame to obtain the out-of-sample data needed by the function. Because the lake identifiers are character codes, the sampled lakes are dropped by matching on the identifier:

>outsample.lake=population.lake[!(population.lake$id %in% survey.lake$id),]

The mqgwr.sae function can now be run, passing the out-of-sample arguments listed above:

>SaeEst = mqgwr.sae(x.s=survey.lake$Elev, y=survey.lake$ANC, m=86, area.s=survey.lake$HUC, lon.s=survey.lake$lon, lat.s=survey.lake$lat, x.r=outsample.lake$Elev, area.r=outsample.lake$HUC, lon.r=outsample.lake$lon, lat.r=outsample.lake$lat, k.value=1.345, method="mqgwr", mqgwrweight=TRUE)

The function mqgwr.sae returns:
• area.code.in: unique list of the area codes of the sampled areas (86 codes)
• area.code.out: unique list of the area codes of the non-sampled areas (27 codes)
• est.mean.in: small-area estimates of the mean for the sampled areas (86 areas)
• est.mean.out: small-area estimates of the mean for the non-sampled areas (27 areas)
• est.mse.in: estimates of the MSE of the mean estimator (only available for the 86 sampled areas)

Note: The sampled areas are the areas where there is at least one observation in the sample; the non-sampled areas are those where there is no observation.

As an example, the first three estimates are shown in Table 2.20. With this estimator the spatial information is used to obtain the out-of-sample area estimates.

Table 2.20: Estimated average of ANC for all the 113 HUCs obtained using the mqgwr.sae R function

HUC MQGWR RMSE CV

1010001 422.46 59.27 0.140

1010002 705.15 235.22 0.333

1010003 411.26 37.58 0.091

… … … …

2.5.6 Application to agricultural data

Since Battese et al. (1988), SAE unit-level models have been applied to various case studies in agriculture. This section also refers to the model-based direct estimator (MBDE) and the spatial model-based direct estimator (SMBDE), which are presented in section 2.6.5.

Battese et al. (1988) applied EBLUP to estimate the area under corn and soybeans for each of 12 counties in north central Iowa using farm-interview data in conjunction with LANDSAT data. Each county was divided into area segments, and the areas under corn and soybeans were ascertained for a sample of segments by interviewing farmers. The number of sample segments ni in a county ranged from 1 to 5. In this application the auxiliary variables xij are the number of pixels classified as corn and the number of pixels classified as soybeans in the jth area segment of the ith county, and the response variable is the number of hectares of corn or soybeans in the jth sample area segment of the ith county. Unit-level auxiliary data in the form of number of pixels classified as corn and soybeans were also obtained for all the area segments, including the sample segments, in each county using the LANDSAT readings.

Chandra et al. (2007) use real data and design-based simulation to evaluate the performance of EBLUP, MBDE, SEBLUP and SMBDE in the context of a real population and realistic sampling methods, using the ISTAT farm structure survey in Tuscany. They generated a population of N = 22,977 farms by sampling with replacement from the original sample of 529 farms, with probabilities proportional to their sample weights. The small areas of interest are defined by the 23 local economic systems of northern Tuscany. Sample sizes in these areas were fixed to be the same as in the original sample. The aim was to estimate average olive production in quintals (100 kg units) in each local economic system using the surface utilized for olives in hectares as the auxiliary variable. The results show that EBLUP and SEBLUP are unstable in a few small areas, mainly because there is little or no variability in the variable of interest in these areas. In contrast, the MBDE and SMBDE methods appear unaffected by such behaviour. The median relative bias of MBDE is smaller than that of EBLUP, whereas the median relative root mean squared error (RRMSE) of EBLUP is smaller than that of MBDE. The median relative bias and median RRMSE of SEBLUP are marginally smaller than those of EBLUP.

Salvati et al. (2009) applied EBLUP, SEBLUP and a spatial version of the MQ predictor to estimate the average production of olives per farm in quintals for each of the small areas making up the local economic systems in Tuscany. In this application the authors employed data from the 2003 ISTAT farm structure survey, which is carried out every two years to collect information on farmland by type of cultivation, amount of animal production and structure and amount of farm employment on 55,030 farms. The GIS Atlas of Coverage of the Tuscany Region provided information on coordinates, surface area and positions of the small areas of interest. The centroid of each area is the spatial reference for all the units residing in the same small area. The auxiliary variable employed in the models is the surface area used for olive production.

Coelho and Pereira (2011) describe the design of a Monte Carlo simulation study and present empirical results on the performance of direct and indirect estimators using a real dataset from an agricultural survey conducted by the Portuguese Statistical Office. In particular, the authors analyse the performance of an EBLUP whose random small-area effects have a spatial covariance structure following an isotropic exponential model. To explore the behaviour of the small-area predictors the authors built a pseudo-population from a real dataset containing the responses to the 1993 Agricultural Structure Survey, which is carried out by the Portuguese Statistical Office between agricultural censuses. The responses for the variable total production of cereals were extracted and circumscribed to the Nomenclatura das Unidades Territoriais para Fins Estatísticos 2 of the Alentejo region. The total sample size was 7,060 and the population size 47,049. Production in 1989 is used as an auxiliary variable in the models applied in the simulation. Geographical coordinates associated with the centroids of freguesias (administrative divisions) are recorded. This is the lowest level of aggregation for which geographical referencing is available. From the results of the simulation experiment it is evident that when the data display spatial variability, the estimators that reflect the spatial correlation between observations tend to present reductions in bias and bias ratio when compared with estimators that ignore this variability. These reductions are usually accompanied by a modest loss of precision, resulting in bias ratios that are generally substantially lower than those obtained for the other estimators.
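An isotropic exponential covariance structure depends only on the distance between area centroids. The sketch below assumes one common parameterisation, sigma2 * exp(-d / phi); the parameter names `sigma2` and `phi`, the function name and the coordinates are illustrative and are not taken from Coelho and Pereira (2011).

```python
from math import exp, hypot


def exponential_covariance(centroids, sigma2, phi):
    """Covariance matrix with entries sigma2 * exp(-d_ik / phi), where d_ik is
    the Euclidean distance between the centroids of areas i and k."""
    n = len(centroids)
    cov = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(centroids):
        for k, (xk, yk) in enumerate(centroids):
            cov[i][k] = sigma2 * exp(-hypot(xi - xk, yi - yk) / phi)
    return cov


# Hypothetical centroids of three areas lying on a line, 5 units apart.
centroids = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
cov = exponential_covariance(centroids, sigma2=2.0, phi=5.0)
```

The covariance equals `sigma2` on the diagonal and decays with distance, so areas further apart borrow less strength from each other.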

2.5.7 Final remarks on the EBLUP, SEBLUP, MQ and MQGWR

The following remarks stem from the review of the features of the unit level models.

1. The EBLUP, SEBLUP, MQ and MQGWR require access to microdata files from the survey, with the sampled units classified by area of interest. Microdata are also needed for the auxiliary variables, which must be classified by area. The methods are frequently applied nonetheless, and routines to implement them can be downloaded free from several websites. The references here are mainly to routines developed under the EURAREA and SAMPLE projects. Table 2.21 summarizes the data and software needed to implement the methods.

2. The SEBLUP and MQGWR require spatial information on the individual units and areas, and localization of the individual units. Many spatial references useful to describe the spatial relation can be obtained from GIS digital maps, for example the location of the units and their coordinates, spatial contiguity matrices for areas, and the centroids of the areas with their coordinates and the distances between them.


Table 2.21: Model-based methods under unit-level specification: data needed and software

| SAE method | Data needed |
|---|---|
| EBLUP | Individual Y microdata classified by area (sampled units); individual X microdata classified by area |
| SEBLUP | Individual Y microdata classified by area (sampled units); individual X microdata classified by area + contiguity matrix W (sampled and non-sampled areas) |
| MQ | Individual Y microdata; individual X microdata classified by area |
| MQGWR | Individual Y microdata; individual X microdata classified by area (sampled and non-sampled units) + Euclidean distances between the centroids of the areas |

Software: http://www.sample-project.eu and http://www.ons.gov.uk/ons/guide-method/method-quality/general-methodology/small-area-estimation/eurarea/index.html

3. In practical applications it is important to complement estimates with the estimated MSE as a measure of their accuracy. For EBLUP, SEBLUP, MQ and MQGWR the routines for their application provide estimates of MSE. Basically the MSE estimator is obtained following Chambers et al. (2011), who proposed a method of MSE estimation for estimators of finite population-domain means that can be expressed in pseudo-linear form, that is as weighted sums of sample values. In particular, it can be used for estimating the MSE of the EBLUP, the MB direct estimator and the MQ-based predictors. There are many applications where the performance of the predictors is compared with real-life case studies. Gains in accuracy can be obtained when the underlying models fit the distribution of the study variable more closely. The estimators that reflect the spatial correlation between observations tend to present reductions in bias and bias ratio when compared with estimators that ignore this variability.

4. Explicit modelling of spatial effects in the SEBLUP and MQGWR becomes necessary when there are insufficient geographic covariates to explain local interactions. Simulation studies show that this happens when the absolute value of the auto-regressive spatial coefficient exceeds 0.5. In this case, the best solution appears to be to take advantage of the information of the related areas via an SAR model or a local geographic regression: the SEBLUP will outperform the basic EBLUP model and the MQGWR will outperform the MQ model.

5. With regard to unit-level specifications, the EBLUP and SEBLUP can be very sensitive to “representative” outliers or departures from the assumed normal distributions for the random effects in the model (see Sinha and Rao, 2009 for a robust version of EBLUP; Schmid and Münnich, 2013 for a robust version of SEBLUP). Simulation studies show that MQ predictors have better resistance to outliers than the traditional EBLUP (Giusti et al., 2014). There are no specific studies to test the resilience to outliers of the MQGWR.
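The spatial inputs mentioned in remark 2 and in Table 2.21, a contiguity matrix W and the distances between area centroids, can be assembled from a list of neighbouring areas and centroid coordinates exported from a GIS. The sketch below is illustrative only: the row standardisation of W is a common convention assumed here, and the four-area map, the function names and the coordinates are hypothetical.

```python
from math import hypot


def row_standardised_contiguity(neighbours, n_areas):
    """Build W with w_ik = 1 / (number of neighbours of area i) when areas i
    and k share a border, and 0 otherwise."""
    w = [[0.0] * n_areas for _ in range(n_areas)]
    for i, adjacent in neighbours.items():
        for k in adjacent:
            w[i][k] = 1.0 / len(adjacent)
    return w


def centroid_distances(centroids):
    """Euclidean distances between all pairs of area centroids."""
    return [[hypot(xi - xk, yi - yk) for (xk, yk) in centroids]
            for (xi, yi) in centroids]


# Hypothetical map of four areas arranged in a line: 0 - 1 - 2 - 3.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
W = row_standardised_contiguity(neighbours, n_areas=4)
D = centroid_distances([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)])
```

W is the kind of input required by the SEBLUP, while the distance matrix D corresponds to the additional input listed for the MQGWR.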


2.6 Extensions of the previous small-area models

2.6.1 Semi-parametric Fay and Herriot model

An alternative approach for introducing the spatial correlation in an area-level model was proposed by Giusti et al. (2012) with a semi-parametric specification of the Fay-Herriot model obtained by truncated polynomial splines (P-splines). This allows non-linearities in the relationship between the response variable and the auxiliary variables. A semi-parametric additive model – hereinafter the semi-parametric model – with one covariate $x_1$ can be written as $\tilde{m}(x_1)$, where the function $\tilde{m}(\cdot)$ is unknown but assumed to be sufficiently well approximated by the function:

$$m(x_1; \boldsymbol{\beta}, \boldsymbol{\gamma}) = \beta_0 + \beta_1 x_1 + \dots + \beta_q x_1^q + \sum_{k=1}^{K} \gamma_k (x_1 - \kappa_k)_+^q \qquad (2.23)$$

where $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_q)^T$ is the vector of the coefficients of the polynomial function, $\boldsymbol{\gamma} = (\gamma_1, \dots, \gamma_K)^T$ is the coefficient vector of the truncated polynomial spline (P-spline) basis, q is the degree of the spline, and $(t)_+^q = t^q$ if $t > 0$, 0 otherwise. The latter portion of the model allows for handling departures from a q-degree polynomial fit in the structure of the relationship. In this portion $\kappa_k$, for $k = 1, \dots, K$, is a set of fixed knots, and if K is sufficiently large the class of functions in (2.23) is very large and can approximate most smooth functions. Details of the choice of bases and knots can be found in Ruppert et al. (2003).
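As an illustration of how a truncated polynomial spline basis is assembled, the sketch below builds the polynomial columns 1, x, ..., x^q and the spline columns (x - knot)_+^q for a single covariate. It is a minimal example under stated assumptions: the function name, the knot positions and the covariate values are hypothetical, and the spline degree is taken to be at least 1.

```python
def truncated_power_basis(x, knots, degree):
    """Return (X, Z) for a truncated polynomial spline of the given degree (>= 1).

    X holds the polynomial columns 1, x, ..., x^degree; Z holds one column per
    knot with entries (x - knot)^degree where x > knot, and 0 otherwise."""
    X = [[xi ** p for p in range(degree + 1)] for xi in x]
    Z = [[max(xi - k, 0.0) ** degree for k in knots] for xi in x]
    return X, Z


# Four hypothetical covariate values, two fixed knots, linear spline.
x = [0.0, 1.0, 2.0, 3.0]
X, Z = truncated_power_basis(x, knots=[0.5, 2.5], degree=1)
```

In a mixed-model representation, the columns of X would typically carry the fixed polynomial coefficients and the columns of Z the spline coefficients treated as random effects.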

Since a P-spline model can be viewed as a random-effects model (Ruppert et al. 2003; Opsomer et al. 2008), it can be combined with the Fay-Herriot model to obtain a semi-parametric SAE framework based on LMM regression.

Correspondingly, the and vectors define:

Following the notation introduced previously for the Fay-Herriot model, and adding the matrix to the X effect matrix, , the model becomes:

(2.24)

where is a vector of regression coefficients, the component can be treated as a vector of independent and identically distributed random variables with mean 0 and variance matrix . The covariance matrix of model (2.24) is , where .

Model-based estimation of the small-area parameters can be obtained by using the EBLUP (Henderson, 1975):

(2.25)

with and – hereinafter non-parametric EBLUP (NPEBLUP).


When geographically referenced responses play a central role in the analysis and need to be converted to maps, we can deal with bivariate smoothing and specify a semi-parametric bivariate additive model (see Giusti et al., 2012).

2.6.2 NPEBLUP specified at the unit level

Although useful in many estimation contexts, LMMs depend on distributional assumptions for the random part of the model and do not easily allow for outlier-robust inference. The fixed part of the model may not be flexible enough to handle estimation contexts in which the relationship between the variable of interest and some covariates is more complex than a linear model. Opsomer et al. (2008) usefully extend model (2.12) to the case in which the small-area random effects can be combined with a smooth non-parametrically specified trend. In the simplest case:

(2.26)

where is an unknown smooth function of the variable , the estimator of the small-area mean is:

(2.27)

as in (2.13), where . By using penalized splines as the representation for the non-parametric trend, Opsomer et al. (2008) express the non-parametric small-area estimation problem as a mixed-effect model regression. The latter can be easily extended to handle bivariate smoothing and additive modelling. The P-spline model proposed by Ugarte et al. (2009) is considered to analyse trends in small areas and to forecast future values of the response. The prediction MSE for the fitted and the predicted values, together with estimators for those quantities, were also derived.

2.6.3 Non-parametric MQ specified at unit level

Pratesi et al. (2008) extended this approach to the MQ method for estimating the small-area parameters using a non-parametric specification of the conditional MQ of the response variable, given the covariates. When the functional form of the relationship between the qth MQ and the covariates deviates from the assumed form, the traditional MQ regression can lead to biased estimators of the small-area parameters. Using P-splines for MQ regression exploits the properties of MQ models and also makes it possible to deal with an undefined functional relationship that can be estimated from the data. When the relationship between the qth MQ and the covariates is not linear, a P-spline MQ regression model may have significant advantages compared to the linear MQ model.

The small-area estimator of the mean may be taken as in (2.17), where the unobserved value for population unit is predicted using:

where and are the coefficient vectors of the parametric and spline portion respectively of the fitted P-splines MQ regression function at . In the case of P-splines and MQ regression models, the bias-adjusted estimator for the mean is given by:

(2.28)


where denotes the predicted values for the population units in and in . The use of bivariate P-spline approximations to fit non-parametric unit-level nested error and MQ regression models makes it possible to reflect spatial variation in the data and then to use these non-parametric models for SAE.

2.6.4 GWEBLUP

An alternative approach to incorporating spatial information in the model is to assume that the regression coefficients vary spatially across the geography of interest. Models of this type can be fitted using GWR (see section 2.5.4) and are suitable for modelling spatial non-stationarity (Brunsdon et al., 1998; Fotheringham et al., 2002). Chandra et al. (2012) proposed a GWEBLUP for a small-area average and an estimator of its conditional MSE. GWEBLUP is based on a mixed model that allows for spatially non-stationary linear fixed effects as well as random area effects. It is obtained by local linear fitting of an LMM using weights that are a function of the distance between the sample data points. Parameter estimation for the GWEBLUP is performed by extending the maximum likelihood estimation of the conventional LMM to incorporate the geographical information contained in these distances.

2.6.5 MBDE and SMBDE

Chandra and Chambers (2005) proposed an alternative approach to SAE based on the use of MBDE in the small areas. In this case an estimate for a small area of interest corresponds to a weighted linear combination of the sample data for that area, with weights based on a population-level version of the LMM. These weights borrow strength through this model, which includes random area effects. Provided the assumed small-area model is true, the EBLUP is asymptotically the most efficient estimator for a particular small area. In practice, however, the “true” model for the data is unknown, and the EBLUP can be inefficient if the model is wrongly specified. Chandra and Chambers (2005) noted that in such circumstances MBDE offers an alternative to a potentially unstable EBLUP. In particular, MBDE is easy to implement, produces sensible estimates when the sample data exhibit patterns of variability that are inconsistent with the assumed model – for example containing too many zeros – and generates robust MSE estimates.

Under the population-level LMM, the sample weights that define the EBLUP for the population total of y are:

(2.29)

where , and .

(see Royall, 1976). The MBDE (see Chandra and Chambers, 2005) of the ith small-area mean is then defined as:

(2.30)

Chandra et al. (2007) proposed an SMBDE in which the ith small-area mean is given by (2.30), with the weights (2.29) there replaced by the spatial EBLUP weights $w^{SEBLUP}$, defined as in (2.29) but where now:

and .
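Once a vector of sample weights is available, a model-based direct estimate of an area mean is a weighted combination of the sample values in that area. The sketch below assumes the ratio form, in which the weighted sum is divided by the sum of the weights in the area, and it takes the weights as given; deriving the EBLUP or spatial EBLUP weights themselves is outside its scope. The function name and all values are hypothetical.

```python
def mbde_area_means(y, weights, area):
    """Weighted mean of y within each area: sum(w * y) / sum(w)."""
    num, den = {}, {}
    for yj, wj, aj in zip(y, weights, area):
        num[aj] = num.get(aj, 0.0) + wj * yj
        den[aj] = den.get(aj, 0.0) + wj
    return {a: num[a] / den[a] for a in num}


# Hypothetical sample of four farms in two areas, with given weights.
y = [10.0, 20.0, 0.0, 40.0]
weights = [2.0, 1.0, 3.0, 1.0]
area = ["A", "A", "B", "B"]
means = mbde_area_means(y, weights, area)
```

Because only the sample values of the area itself enter the sum, zero values in an area shift the estimate without destabilising it, which is in line with the behaviour reported for areas containing many zeros.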

2.6.6 A note on Bayesian SAE methods

Bayesian alternatives of the non-spatial and spatial mixed effects models for SAE include Datta and Ghosh (1991 and 2012), Ghosh et al. (1998) and Rao (2003). In particular, Bayesian small-area spatial modelling has been successful in similar contexts such as the estimation of rates of disease in different geographic regions (Best et al., 2005). Complex mixed effects and correlation between areas can be easily handled and modelled hierarchically in different layers of the model.


Although implementation of complex Bayesian models requires computationally intensive Markov Chain Monte Carlo (MCMC) simulation algorithms (Gilks et al., 1995), there are a number of potential benefits of the Bayesian approach for SAE. Gomez-Rubio et al. (2010) present these advantages:

1. It offers a coherent framework that can handle different types of target variable – continuous, dichotomous and categorical, for example – different random-effect structures such as independent and spatially correlated, areas with no direct survey information and models to smooth the survey sample variance estimates in a consistent way using the same computational methods and software whatever the model.

2. Uncertainty about all model parameters is automatically captured by the posterior distribution of the small-area estimates and any functions of these such as their rank, and by the predictive distribution of estimates for small areas not included in the survey sample.

3. Bayesian methods are particularly suited to sparse-data problems such as when the survey sample size per area is small, because Bayesian posterior inference is exact and does not rely on asymptotic arguments.

4. The posterior distribution obtained from a Bayesian model provides a richer output than the traditional point-and-interval estimates from a corresponding likelihood-based model. In particular, the ability to make direct probability statements about unknown quantities – for example the probability that the target variable exceeds some specified threshold in each area – and to quantify all sources of uncertainty in the model make Bayesian SAE suitable for informing and evaluating policy decisions.
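The direct probability statements mentioned in point 4 are obtained by simple counting once posterior draws of the area parameters are available. The sketch below is illustrative: the draws are simulated from normal distributions as a stand-in for real MCMC output, the interval is a rough equal-tailed approximation, and the function names, area labels and threshold are hypothetical.

```python
import random


def exceedance_probability(draws, threshold):
    """Share of posterior draws above the threshold, for each area."""
    return {a: sum(d > threshold for d in ds) / len(ds) for a, ds in draws.items()}


def posterior_summary(draws):
    """Posterior mean and approximate 95% equal-tailed interval, for each area."""
    out = {}
    for a, ds in draws.items():
        s = sorted(ds)
        lower = s[int(0.025 * (len(s) - 1))]
        upper = s[int(0.975 * (len(s) - 1))]
        out[a] = (sum(s) / len(s), lower, upper)
    return out


# Stand-in for MCMC output: simulated draws of two area means.
rng = random.Random(1)
draws = {
    "area_1": [rng.gauss(10.0, 1.0) for _ in range(4000)],
    "area_2": [rng.gauss(14.0, 1.0) for _ in range(4000)],
}
p_exceed = exceedance_probability(draws, threshold=12.0)
summary = posterior_summary(draws)
```

The same draws could also be ranked within each iteration to obtain the posterior distribution of area ranks referred to in point 2.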

2.6.7 SAE for binary and count data

Let $y_{ij}$ be the value of the outcome of interest, a discrete or a categorical variable, for unit j in area i, and let $\mathbf{x}_{ij}$ denote a vector of unit-level covariates, including an intercept. Working within a frequentist paradigm one can follow Jiang and Lahiri (2001), who propose an empirical best predictor for a binary response, or Jiang (2003), who extends these results to generalized linear mixed models. Nevertheless, use of the empirical best predictor can be computationally challenging (Molina and Rao, 2010). Despite their attractive properties as far as modelling non-normal outcomes is concerned, fitting generalized LMMs requires numerical approximations. In particular, the likelihood function defined by a generalized LMM can involve high-dimensional integrals which cannot be evaluated analytically (see McCulloch, 1994 and 1997; Song et al., 2005). In such cases, numerical approximations can be used as for example in the R function glmer in the package lme4. Alternatively, estimation of the model parameters can be obtained by using an iterative procedure that combines maximum penalized quasi-likelihood and REML estimation (Saei and Chambers, 2003). Estimates of generalized LMM parameters can be sensitive to outliers or departures from underlying distributional assumptions. Large deviations from the expected response and outlying points in the space of the explanatory variables are known to have a significant influence on classical maximum-likelihood inference based on generalized linear models.

Nonetheless, in the case of discrete outcomes model-based SAE conventionally employs a generalized LMM for of the form:

(2.31)

where g is a link function. When is binary-valued, a popular choice for g is the logistic link function, and the individual values in area i are taken to be independent Bernoulli outcomes with:

and . When is a count outcome, the logarithmic link function is commonly used and the individual values in area i are assumed to be independent Poisson random variables with:


and . The q-dimensional vector is generally assumed to be independently distributed between areas according to a normal distribution with mean 0 and covariance matrix . This matrix depends on parameters

which are referred to as the variance components, and in (2.31) is the vector of fixed effects. If the target of inference is the small-area i mean (proportion), and the Poisson or Bernoulli generalized LMM is assumed, the approximation to the minimum MSE predictor of is . Since

depends on and , a further stage of approximation is required where unknown parameters are replaced by suitable estimates. This leads to the conditional expectation predictor for the area i mean (proportion) under logarithmic or logistic:

(2.32)

where or , , is the vector of the estimated fixed effects and denotes the vector of the predicted area-specific random effects. We refer to (2.32) in this case as a “random intercepts” conditional expectation predictor. For details, see Saei and Chambers (2003), Jiang and Lahiri (2006a) and Gonzalez-Manteiga et al. (2007). Note, however, that (2.32) is not taken to be the proper empirical best predictor by Jiang (2003). The proper empirical best predictor does not have closed form and needs to be computed by numerical approximations. For this reason the conditional expectation predictor version (2.32) is used in practice, as with the small-area estimates of labour force activity currently produced by the United Kingdom Office for National Statistics.
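A plug-in predictor of an area proportion under a logistic random-intercepts model can be sketched as follows. The form used here is an assumption made for illustration: observed values are kept for sampled units, fitted probabilities expit(x'beta_hat + u_hat_i) are used for non-sampled units, and the total is divided by the area population size. The estimates `beta_hat` and `u_hat_i` are taken as given (for example from a fitted generalized LMM), and all names and values are hypothetical.

```python
from math import exp


def expit(eta):
    """Inverse of the logistic link."""
    return 1.0 / (1.0 + exp(-eta))


def predict_area_proportion(y_sampled, x_nonsampled, beta_hat, u_hat_i):
    """Plug-in predictor of the proportion in one area.

    y_sampled    : observed 0/1 outcomes of the sampled units
    x_nonsampled : covariate vectors (including the intercept) of non-sampled units
    beta_hat     : estimated fixed effects
    u_hat_i      : predicted random intercept of the area
    """
    fitted = [
        expit(sum(b * x for b, x in zip(beta_hat, xj)) + u_hat_i)
        for xj in x_nonsampled
    ]
    n_pop = len(y_sampled) + len(x_nonsampled)
    return (sum(y_sampled) + sum(fitted)) / n_pop


# Toy area: three sampled units and two non-sampled units whose linear
# predictor is zero, so that each fitted probability is 0.5.
p_hat = predict_area_proportion(
    y_sampled=[1, 0, 1],
    x_nonsampled=[[1.0, 0.0], [1.0, 0.0]],
    beta_hat=[0.0, 1.0],
    u_hat_i=0.0,
)
```

For a count outcome with a logarithmic link, the same structure would apply with `exp` in place of `expit`.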

2.7 Geostatistical methods

Geostatistics is concerned with the problem of producing a map of a quantity of interest over a particular geographical region based on usually “noisy” measurements taken at a set of locations in the region. The aim is to describe and analyse the geographical pattern of the phenomenon of interest. Geostatistical methods are developed and applied in areas such as environmental studies and epidemiology, where spatial information is recorded and available. In recent years the diffusion of spatially detailed statistical data has increased considerably, and this kind of procedure – with modifications as appropriate – can be used in other fields of application such as studies of demographic and socio-economic characteristics of a population in a particular region.

To obtain a surface estimate, one can exploit the exact knowledge of the latitude and longitude of the studied phenomenon by using bivariate smoothing techniques such as kernel estimates or kriging. Bivariate smoothing deals with the flexible smoothing of point clouds to obtain surface estimates that can be used to produce maps. The geographical application, however, is not the only use of bivariate smoothing because the method can be applied to handle the non-linear relation between any two continuous predictors and a response variable (Cressie, 1993; Ruppert et al., 2003). Kriging, a widely used method for interpolating or smoothing spatial data, has a close connection with P-spline smoothing. Its aims appear to be akin to non-parametric regression, and the understanding of spatial estimates can be enriched through their interpretation as smoothing estimates (Nychka, 2000).

The spatial information alone, however, does not properly explain the pattern of the response variable: one therefore needs to introduce some covariates in a more complex model.


2.7.1 Geoadditive models

Geoadditive models were introduced by Kammann and Wand (2003) to analyse the spatial distribution of the study variable while accounting for possible covariate effects through an LMM representation. The first half of the model formulation involves a low-rank mixed-model representation of additive models; the geographical component is then added by expressing kriging as an LMM. This is then merged with the additive model to obtain a single mixed model, the geoadditive model.

The model is specified as:

(2.33)

where in the first part , and represent measurements on two predictors s and t and a response variable y for unit i, f and g are smooth but otherwise unspecified functions of s and t respectively. The second part of the model is the simple universal kriging model with representing the geographical location, and is a stationary zero-mean stochastic process. Because the first and the second part of model (2.33) can be specified as an LMM, the whole model (2.33) can also be formulated as a single LMM that can be fitted using standard mixed-model software. It can therefore be said that in a geoadditive model the LMM structure enables the inclusion of the area-specific effect as an additional random component. In particular, a geoadditive SAE model has two random effect components: the area-specific effects, and the spatial effects (Bocci, 2009). Kammann and Wand (2003) provide more details on geoadditive model specifications. Having a mixed-model specification, geoadditive models can be used to obtain small-area estimators under a non-parametric approach (Opsomer et al., 2008; see also Part II).

In this respect, Bocci et al. (2012) use a two-part geoadditive SAE model to estimate the per-farm average grape production, specified as a semi-continuous skewed variable, at the agrarian region level using data from the fifth Italian agricultural census. To provide more detail, the response variable, which is assumed to have a significant spatial pattern, has a semi-continuous structure: this means that the variable has a fraction of values equal to zero and a continuous skewed distribution among the remaining values. Hence the variable can be recorded as:

(2.34)

and

(2.35)

For this variable, Bocci et al. (2012) specify two uncorrelated geoadditive small-area models, one for the logit probability of and one for the conditional mean of the logarithm of the response .
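The recoding of a semi-continuous response into the two parts of such a model can be sketched as follows. The sketch assumes that the first part is an indicator of a positive value and the second part is the logarithm of the positive values; the function name and the production figures are hypothetical, and fitting the two geoadditive models is not attempted here.

```python
from math import log


def two_part_recode(y):
    """Split a semi-continuous variable into an indicator of a positive value
    and the logarithm of the positive values (None where y is zero)."""
    indicator = [1 if v > 0 else 0 for v in y]
    log_positive = [log(v) if v > 0 else None for v in y]
    return indicator, log_positive


# Hypothetical per-farm grape production, with zeros for farms that grow none.
production = [0.0, 12.5, 0.0, 1.0, 300.0]
indicator, log_positive = two_part_recode(production)
```

The indicator would feed the model for the logit of the probability of a positive value, and the non-missing logarithms the model for the conditional mean.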

Another extension to the work of Kammann and Wand (2003) is the geoadditive model proposed by Cafarelli and Castrignanò (2011), which was used to analyse the spatial distribution of grain weight, a common indicator of wheat production, taking into account its non-linear relations with other crop features.


2.7.2 Kriging

The principles of geostatistics and interpolation by kriging are described in a large body of literature that includes Burrough (1986), Cressie (1993), Deutsch and Journel (1992), Isaaks and Srivastava (1989), Journel and Huijbregts (1978), Matheron (1963) and Webster and Oliver (2001). Only the basic notions are outlined here. An early introduction to the origins of kriging is given by Cressie (1990).

Kriging is based on a concept of random functions: the surface or volume is assumed to be one realisation of a random function with a certain spatial covariance (Journel and Huijbregts, 1978; Matheron, 1963).

In this sense kriging is a form of weighted average where the weights depend upon the location and structure of covariance or semivariogram of observed points (Hemyari and Nofziger, 1987). The weights are chosen so that the prediction error variance is smaller than that of any other unbiased linear combination of the observations. A semivariogram is a function used to indicate spatial correlation in observations measured at sample locations. The literature on kriging provides a choice of functions that can be used as theoretical semivariograms – spherical, exponential, Gaussian or Bessel, for example. The parameters of these functions are then optimized for the best fit of the experimental semivariogram.
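The experimental semivariogram and one candidate theoretical model can be sketched as follows. This is a minimal illustration under stated assumptions: the experimental values are averages of half squared differences over distance bins, the theoretical model is the exponential one with a nugget, and the function names, parameter names and data are hypothetical. Fitting the parameters to the experimental values, for example by least squares, is not shown.

```python
from math import exp, hypot


def empirical_semivariogram(coords, values, bin_width, n_bins):
    """Average of 0.5 * (z_i - z_j)^2 over the pairs whose separation distance
    falls in each bin [b * bin_width, (b + 1) * bin_width)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            h = hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            b = int(h // bin_width)
            if b < n_bins:
                sums[b] += 0.5 * (values[i] - values[j]) ** 2
                counts[b] += 1
    # Bins without any pair are reported as None.
    return [s / c if c else None for s, c in zip(sums, counts)]


def exponential_semivariogram(h, nugget, partial_sill, range_par):
    """Exponential model: 0 at h = 0, otherwise
    nugget + partial_sill * (1 - exp(-h / range_par))."""
    if h <= 0:
        return 0.0
    return nugget + partial_sill * (1.0 - exp(-h / range_par))


# Four hypothetical sample points on a line, one unit apart.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 4.0, 7.0]
gamma = empirical_semivariogram(coords, values, bin_width=1.0, n_bins=3)
```

In this toy example the semivariogram increases with distance, which is the pattern expected when nearby observations are more alike than distant ones.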

Kriging is used extensively to produce contour maps (Dowd, 1985; Sabin, 1985), for example to predict the values of soil attributes at non-sampled locations.

All kriging estimators are variants of the basic equation:

$$\hat{Z}(x_0) - \mu = \sum_{i=1}^{N} \lambda_i \left[ Z(x_i) - \mu(x_0) \right] \qquad (2.36)$$

where μ is a known stationary mean, assumed to be constant over the whole domain and calculated as the average of the data (Wackernagel, 2003); λi is the kriging weight; N is the number of sampled points used to make the estimation, which depends on the size of the search window; and μ(x0) is the mean of samples within the search window.

The kriging weights are estimated by minimizing the variance, as follows:

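$$\mathrm{Var}[\hat{Z}(x_0)] = E\left[\{\hat{Z}(x_0) - Z(x_0)\}^2\right] = \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_i \lambda_j C(x_i, x_j) + C(x_0, x_0) - 2 \sum_{i=1}^{N} \lambda_i C(x_i, x_0) \qquad (2.37)$$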

where Z(x0) is the true value expected at point x0, N represents the number of observations to be included in the estimation and C(xi,xj) = Cov[Z(xi), Z(xj)] (Isaaks and Srivastava, 1989).

The main strengths of kriging are the statistical quality of its predictions – its unbiasedness, for example – and the ability to predict the spatial distribution of uncertainty. It has been less successful in applications where local geometry and smoothness are the key issues, and other methods prove to be competitive or even better (Deutsch and Journel, 1992; Hardy, 1990).

In ordinary kriging, the value of the attribute is estimated using equation (2.36) by replacing μ with a local mean μ(x0) – that is the mean of samples within the search window – and forcing $1 - \sum_{i=1}^{N} \lambda_i = 0$, that is $\sum_{i=1}^{N} \lambda_i = 1$, which is achieved by plugging it into equation (2.36) (Clark and Harper, 2001; Goovaerts, 2010). Ordinary kriging estimates the local constant mean, then performs simple kriging on the corresponding residuals; it only requires the stationary mean of the local search window (Goovaerts, 2010).
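The ordinary kriging weights can be obtained by solving a linear system in which a Lagrange multiplier enforces the constraint that the weights sum to one. The sketch below uses NumPy and assumes an exponential covariance function with hypothetical sill and range parameters; the function names and the four sample points are illustrative only.

```python
import numpy as np


def ordinary_kriging(coords, values, target, cov_fn):
    """Ordinary kriging prediction at `target`.

    Solves [[C, 1], [1', 0]] [lambda, m]' = [c0, 1]', which minimises the
    prediction variance subject to the weights summing to one."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Covariances between all pairs of sample points.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    lhs = np.ones((n + 1, n + 1))
    lhs[:n, :n] = cov_fn(dists)
    lhs[n, n] = 0.0
    # Covariances between the sample points and the target location.
    d0 = np.linalg.norm(coords - np.asarray(target, dtype=float), axis=1)
    rhs = np.append(cov_fn(d0), 1.0)
    solution = np.linalg.solve(lhs, rhs)
    weights = solution[:n]          # the last element is the Lagrange multiplier
    return float(weights @ values), weights


def exp_cov(h, sill=1.0, range_par=2.0):
    """Assumed exponential covariance function."""
    return sill * np.exp(-h / range_par)


# Four hypothetical sample points at the corners of a unit square.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [1.0, 2.0, 3.0, 4.0]
pred_centre, w_centre = ordinary_kriging(coords, values, (0.5, 0.5), exp_cov)
pred_at_obs, w_at_obs = ordinary_kriging(coords, values, (1.0, 0.0), exp_cov)
```

At the centre of the square the four weights are equal by symmetry, and at a sampled location the predictor reproduces the observed value because no nugget effect is included.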


2.7.3 GWR

GWR has its roots in a linear-regression framework. Standard regression assumes that observations are independent, which is clearly not true for spatial data where the defining characteristic is that nearby observations are more similar than those far apart. Another assumption in regression is that the parameters of the model remain constant over the domain – in other words there is no local change in the parameter values (Fotheringham et al., 2002). As an illustration, a simple example of GWR on a two-dimensional dataset is considered. To accommodate spatial dependence, GWR assumes a linear model whose parameters are allowed to vary with the spatial coordinates. The parameters of the GWR model depend on a weight function w(hl,h), which is chosen so that points near the prediction locations have more influence than points far away. Some common weight functions are the bi-square and Gaussian functions.
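The local fitting step of GWR can be sketched as a weighted least squares regression in which the weights decrease with the distance from the location of interest. The example below uses NumPy and a Gaussian weight function with a fixed bandwidth; the function name, the bandwidth value and the artificial data, in which the true slope drifts along one coordinate, are assumptions made for illustration.

```python
import numpy as np


def gwr_coefficients(coords, X, y, target, bandwidth):
    """Local regression coefficients at `target`, from weighted least squares
    with Gaussian weights w_l = exp(-0.5 * (d_l / bandwidth)^2)."""
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.linalg.norm(coords - np.asarray(target, dtype=float), axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    xtw = X.T * w                   # X'W without forming the diagonal matrix
    return np.linalg.solve(xtw @ X, xtw @ y)


# Twenty hypothetical points on a west-east line; the covariate alternates
# between 0 and 1 and the true slope increases from west to east.
coords = [(float(u), 0.0) for u in range(20)]
x = np.array([u % 2 for u in range(20)], dtype=float)
X = np.column_stack([np.ones(20), x])
true_slope = 1.0 + 0.2 * np.arange(20)
y = true_slope * x

beta_west = gwr_coefficients(coords, X, y, target=(0.0, 0.0), bandwidth=3.0)
beta_east = gwr_coefficients(coords, X, y, target=(19.0, 0.0), bandwidth=3.0)
```

The estimated slope is smaller at the western end than at the eastern end, which is the kind of spatial non-stationarity that a single global regression would average away.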

GWR is a popular spatial interpolation method. It is designed for spatial interpolation of a single dataset. There is no provision for incorporating multiple data sources, though such an extension might include additional equations for additional datasets in the model. The parameters must be identical across datasets. The method also assumes that data are at point-level support. Little work has been done to address change of support in GWR, though studies that apply GWR to modifiable areal unit problems, a class of change-of-support problem where continuous spatial processes are aggregated into districts, found extreme variation in GWR regression parameters (Fotheringham and Wong, 1991).

To use GWR the parameters at a set of locations must be estimated, typically locations associated with the data themselves. Computational order for this process is usually O(N^3), where N is the number of data points. Hence GWR does not scale well with increases in data size (Grose et al., 2008). Modifications for large datasets include choosing a fixed number p of locations, with p << N, where the model parameters are evaluated. Another possible approach is to separate GWR into several non-interacting processes, which can be solved in parallel using grid computing methods (Grose et al., 2008).


References (Part I)

Anselin, L. 1992. Spatial Econometrics: Method and Models. Boston, USA, Kluwer Academic Publishers.

Battese, G., Harter, R. & Fuller, W. 1988. An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data. Journal of the American Statistical Association 83, 28–36.

Banerjee, S., Carlin, B.P. & Gelfand, A.E. 2004. Hierarchical Modelling and Analysis for Spatial Data. New York, Chapman & Hall.

Benedetti, R., Pratesi, M. & Salvati, N. 2012. Local stationarity in small-area estimation models. Statistical Methods and Applications 22(1).

Best, N., Richardson, S. & Thomson, A. 2005. A comparison of Bayesian spatial models for disease mapping. Statistical Methods in Medical Research 14(1), 35–59.

Bocci, C. 2009. Geoadditive models for data with spatial information. PhD Thesis, Department of Statistics, University of Florence.

Bocci, C., Petrucci A. & Rocco E. 2012. Small-Area Methods for Agricultural Data: a Two-Part Geoadditive Model to Estimate the Agrarian Region Level Means of the Grapevines Production in Tuscany. Journal of the Indian Society of Agricultural Statistics 66(1), 135–144.

Breidenbach, J. & Astrup, R. 2012. Small-area estimation of forest attributes in the Norwegian National Forest Inventory. European Journal of Forest Research 131:1255–1267.

Brunsdon, C., Fotheringham, A.S. & Charlton, M. 1998. Geographically weighted regression-modelling spatial non-stationarity. Journal of the Royal Statistical Society, Series D 47 (3) 431–443.

Burrough, P.A. 1986. Principles of Geographical Information Systems for Land Resources Assessment. Oxford, UK, Oxford University Press.

Cafarelli, B. & Castrignanò, A. 2011. The use of geoadditive models to estimate the spatial distribution of grain weight in an agronomic field: a comparison with kriging with external drift. Environmetrics 22, 769–780.

Chambers, R. L. 1986. Outlier-robust finite population estimation. Journal of the American Statistical Association 81, 1063–1069.

Chambers, R. & Dunstan, P. 1986. Estimating distribution function from survey data. Biometrika 73, 597–604.

Chambers, R. & Tzavidis, N. 2006. M-quantile models for small area estimation. Biometrika 93, 255–268.

Chambers, R., Chandra, H. & Tzavidis, N. 2011. On bias-robust mean squared error estimation for pseudo-linear small area estimators. Survey Methodology 37, 153–170.

Chambers, R., Chandra, H., Salvati, N. & Tzavidis, N. 2014. Outlier-robust small-area estimation. Journal of the Royal Statistical Society, Series B 76 (1) 47–69.


Chandra, H. & Chambers, R.L. 2005. Comparing EBLUP and C-EBLUP for small-area estimation. Statistics in Transition 7, 637–648.

Chandra, H., Salvati, N. & Chambers, R. 2007. Small-area estimation for spatially correlated populations: a comparison of direct and indirect model-based methods. Statistics in Transition 8, 331–350.

Chandra, H., Salvati, N., Chambers, R. & Tzavidis, N. 2012. Small-area estimation under spatial non-stationarity. Computational Statistics and Data Analysis 56, 2875–2888.

Clark, I. & Harper, W.V. 2001. Practical Geostatistics 2000. Alloa, Scotland, UK, Geostokos (Ecosse) Ltd.

Cochran, W.G. 1977. Sampling Techniques, 3rd edn. New York, Wiley.

Coelho, P.S. & Pereira, L.N. 2011. A spatial unit-level model for small-area estimation. RevStat Statistical Journal 9(2): 155–180.

Comber, A., Proctor, C. & Anthony, S. 2008. The creation of a national agricultural land-use dataset: combining pycnophylactic interpolation with dasymetric mapping techniques. Transactions in GIS 12(6): 775–791.

Cressie, N. 1990. The origins of kriging. Mathematical Geology 22 (3), 239–252.

Cressie, N. 1993. Statistics for Spatial Data. New York, Wiley.

Datta, G.S. & Ghosh, M. 1991. Bayesian prediction in linear models: Applications to small-area estimation. The Annals of Statistics 19, 1748–1770.

Datta, G. & Ghosh, M. 2012. Small-area shrinkage estimation. Statist. Sci. 27, 95–114.

De Belém Costa Freitas Martins, M., de Sousa Xavier, A.M. & de Sousa Fragoso, R.M. 2012. Redistributing agricultural data by a dasymetric mapping methodology. Agricultural and Resource Economics Review 41(3): 351–366.

Demidenko, E. 2004. Mixed Models: Theory and Applications. New York, Wiley.

Deutsch, C.V. & Journel, A.G. 1992. Geostatistical Software Library and User's Guide. New York, Oxford University Press. 340 pp.

Do, V.H., Thomas-Agnan, C. & Vanhemsz, A. 2013. Spatial Reallocation of Areal Data: a Review. Toulouse, France, Toulouse School of Economics.

Dowd, P.A. 1985. A Review of Geostatistical Techniques for Contouring. Earnshaw, R.A. (ed.). Fundamental Algorithms for Computer Graphics. NATO ASI Series, vol. F17. Berlin, Springer-Verlag.

Duchesne, P. 1999. Robust calibration estimators. Survey Methodology 25, 43–56.

Eicher, C.L. & Brewer, C.A. 2001. Dasymetric mapping and areal interpolation, implementation and evaluation. Cartography and Geographic Information Science 28(2), 125–138.

Estevao, V.M. & Särndal, C.E. 2004. Borrowing strength is not the best technique within a wide class of design-consistent domain estimators. Journal of Official Statistics 20, 645–669.

Fabrizi, E., Salvati, N., Tzavidis, N. & Pratesi, M. 2014. Outlier-robust model-assisted small-area estimation. Biometrical Journal 56, 157–175; doi:10.1002/bimj.201200095.

Fay, R.E. & Herriot, R.A. 1979. Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association 74, 269–277.

FAO. 2005. A System of Integrated Agricultural Censuses and Surveys. Statistical Development Series, no. 11. Rome.

FAO. 2010. World Census of Agriculture. Statistical Development Series, no. 12. Rome.

FAO. 2013. World Census of Agriculture 1996–2005: Methodological Review. Statistical Development Series, no. 14. Rome.

Fasulo, A., D’Alò, M., Di Consiglio, L., Falorsi, S. & Solari, F. 2013. SMART2: a new web system for small-area estimation. Paper in Book of Abstract of ITACOSM2013, pp. 65–66.

Flowerdew, R. & Green, M. 1992. Statistical methods for inference between incompatible zonal systems. In: Goodchild, M.F. & Gopal, S. (eds.), The Accuracy of Spatial Databases, pp. 239–247. London, Taylor and Francis.

Fotheringham, A.S., Brunsdon, C. & Charlton, M. 2002. Geographically Weighted Regression. Bognor Regis, UK, John Wiley and Sons.

Fotheringham, A.S. & Wong, D.W.S. 1991. The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning 23(7).

Fuller, W.A. 1981. Regression estimation for small areas. In: Gilford, D.M., Nelson, G.L. & Ingram, L. (eds.), Rural America in Passage: Statistics for Policy, pp. 572–586. Washington DC, National Academy Press.

Fuller, W.A. 1999. Environmental surveys over time. Journal of Agricultural, Biological and Environmental Statistics 4, 331–345.

Gallego, F.J. 2010. A population density grid of the European Union. Population and Environment 31:460–473.

Ghosh, M. & Rao, J.N.K. 1994. Small-area estimation: an appraisal (with discussion). Statistical Science 9(1): 55–93.

Ghosh, M., Natarajan, K., Stroud, T.W.F. & Carlin, B.P. 1998. Generalized linear models for small-area estimation. Journal of the American Statistical Association 93, 273–282.

Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. 1995. Markov Chain Monte Carlo in Practice. Boca Raton, FL, USA. Chapman and Hall/CRC.

Giusti, C., Marchetti, S., Pratesi, M. & Salvati, N. 2012. Semi-parametric Fay-Herriot model using penalized splines. Journal of the Indian Society of Agricultural Statistics 66, 1–14.

Giusti, C., Tzavidis, N., Pratesi, M. & Salvati, N. 2014. Resistance to Outliers of M-Quantile and Robust Random Effects Small-Area Models. Communication in Statistics: Simulation and Computation 43(3).

Gomez-Rubio, V., Best, N., Richardson, S., Li, G. & Clarke, P. 2010. Bayesian Statistics Small-Area Estimation. Technical Report. London, Imperial College.

Goovaerts, P. 2010. Combining areal and point data in geostatistical interpolation: applications to soil science and medical geography. Mathematical Geosciences 42, 535–554.

Ghosh, M. & Meeden, G. 1997. Bayesian Methods for Finite Population Sampling. London, Chapman and Hall.

Gregory, I.N. & Paul, S.E. 2005. Breaking the boundaries: geographical approaches to integrating 200 years of the census. Journal of the Royal Statistical Society 168, 419–437.

Grose, D.J., Harris, R., Brunsdon, C. & Kilham, D. 2008. Grid enabling geographically weighted regression. Available at: http://www.merc.ac.uk/sites/default/files/events/conference/2007/papers/paper147.pdf

Hansen, M.H., Hurwitz, W.N. & Madow, W.G. 1953. Sample Survey Methods and Theory. New York, Wiley.

Hardy, R.L. 1990. Theory and applications of the multiquadric-biharmonic method. Computers and Mathematics with Applications 19, 163–208.

Hemyari, P. & Nofziger, D.L. 1987. Analytical solution for punctual kriging in one dimension. Soil Science Society of America Journal 51, 268–269.

Henderson, C. 1975. Best linear unbiased estimation and prediction under a selection model. Biometrics 31, 423–447.

Jiang, J. & Lahiri, P. 2001. Empirical best prediction for small-area inference with binary data. Annals of the Institute of Statistical Mathematics 53, 217–243.

Jiang, J. 2003. Empirical best prediction for small-area inference based on generalized linear mixed models. Journal of Statistical Planning and Inference 111, 117–127.

Jiang, J. & Lahiri, P. 2006a. Mixed model prediction and small-area estimation. TEST 15, 1–96.

Jiang, J. & Lahiri, P. 2006b. Estimation of finite population domain means: a model-assisted empirical best prediction approach. Journal of the American Statistical Association 101, 301–311.

Jiang, J., Nguyen, T. & Rao, J.S. 2011. Best predictive small-area estimation. Journal of the American Statistical Association 106, 732–745.

Journel, A.G. & Huijbregts, C.J. 1978. Mining Geostatistics, vol. 600. London, Academic Press.

Kammann, E.E. & Wand, M.P. 2003. Geoadditive models. Journal of Applied Statistics 52, 1–18.

Kaspar, T.C., Colvin, T.S., Jaynes, D.B., Karlen, D.L., James, D.E., Meek, D.W., Pulido, D. & Butler, H. 2003. Relationships between six years of corn yields and terrain attributes. Precision Agriculture 4, 87–101.

Kim, H. & Yao, X. 2010. Pycnophylactic interpolation revisited: integration with the dasymetric mapping method. International Journal of Remote Sensing 31(21): 5657–5671.

Kish, L. 1965. Survey Sampling. New York, Wiley.

Kott, P. 1989. Robust small domain estimation using random effects modelling. Survey Methodology 15, 1–12.

Isaaks, E.H. & Srivastava, R.M. 1989. Applied Geostatistics. New York, Oxford University Press.

Lam, N. 1983. Spatial interpolation methods: a review. American Cartographer 10, 129–149.

Langford, M. 2003. Refining methods for dasymetric mapping. In: Mesev, V. (ed.), Remotely Sensed Cities, pp. 181–205. London, Taylor and Francis.

Langford, M. 2006. Obtaining population estimations in non-census reporting zones: an evaluation of the three-class dasymetric method. Computers, Environment and Urban Systems 30, 161–180.

Langford, M. & Harvey, J.T. 2001. The use of remotely sensed data for spatial disaggregation of published census population counts. IEEE/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas, La Sapienza University, Rome.

Langford, M., Maguire, D. & Unwin, D. 1991. The areal interpolation problem: estimating population using remote sensing in a GIS framework. In: Masser, E. & Blakemore, M. (eds.), Handling Geographic Information: Methodology and Potential Applications, pp. 55–77. London, Longman.

Langford, M. & Fisher, P.F. 1996. Modelling sensitivity to accuracy in classification imagery: a study of areal interpolation by dasymetric mapping. Professional Geographer 48(3), 299–309.

Langford, M. & Unwin, D.J. 1994. Generating and mapping population density surfaces within a GIS. Cartographic Journal 31, 21–26.

Larsen, T., Nagoda, D. & Anderson, J.R. 2001. The Barents Sea Ecoregion: a Biodiversity Assessment. Oslo, World Wildlife Fund.

Lehtonen, R. & Veijanen, A. 1999. Domain estimation with logistic generalized regression and related estimators. IASS Satellite Conference on Small-Area Estimation. Riga, Latvian Council of Science.

Lehtonen, R., Särndal, C.E. & Veijanen, A. 2003. The effect of model choice in estimation for domains, including small domains. Survey Methodology 29, 33–44.

Lehtonen, R. & Pahkinen, E. 2004. Practical Methods for Design and Analysis of Complex Surveys. New York, Wiley.

Li, T., Pullar, D., Corcoran, J. & Stimson, R. 2007. A comparison of spatial disaggregation techniques as applied to population estimation for South East Queensland, Australia. Applied GIS 3(9): 1–16.

Longford, N.T. 1995. Random Coefficient Models. London, Clarendon Press.

Marchetti, S., Tzavidis, N. & Pratesi, M. 2012. Non-parametric bootstrap mean squared error estimation for M-quantile estimators of small-area averages, quantiles and poverty indicators. Computational Statistics and Data Analysis 56, 2889–2902.

Matheron, G.F. 1963. Principles of geostatistics. Economic Geology 58, 1246–1266.

McCulloch, C.E. 1994. Maximum likelihood variance components estimation for binary data. Journal of the American Statistical Association 89, 330–335.

McCulloch, C.E. 1997. Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association 92, 162–170.

McCulloch, C.E. & Searle, S.R. 2001. Generalized, Linear and Mixed Models. New York, Wiley.

Mennis, J. 2003. Generating surface models of population using dasymetric mapping. Professional Geographer 55(1): 31–42.

Mennis, J. & Hultgren, T. 2006. Intelligent dasymetric mapping and its application to areal interpolation. Cartography and Geographic Information Science 33, 179–194.

Mohammed, J.I., Comber, A. & Brunsdon, C. 2012. Population estimation in small areas: combining dasymetric mapping with pycnophylactic interpolation. GIS Research UK Conference, Lancaster University.

Molina, I., Salvati, N. & Pratesi, M. 2009. Bootstrap for estimating the MSE of the Spatial EBLUP. Computational Statistics and Data Analysis 24, 441–458.

Molina, I. & Rao, J.N.K. 2010. Small-area estimation of poverty indicators. Canadian Journal of Statistics 38, 369–385.

Nychka, D. 2000. Spatial process estimates as smoothers. In: Schimek, M.G. (ed.), Smoothing and Regression: Approaches, Computation and Application, New York, Wiley, pp. 393–424.

Opsomer, J.D., Claeskens, G., Ranalli, M.G., Kauermann, G. & Breidt, F. J. 2008. Non-parametric small-area estimation using penalized spline regression. Journal of the Royal Statistical Society, Series B 70, 265–286.

Petrucci, A., Pratesi, M. & Salvati, N. 2005. Geographic information in small-area estimation: small-area models and spatially correlated random area effects. Statistics in Transition 7, 609–623.

Petrucci, A. & Salvati, N. 2006. Small-area estimation for spatial correlation in watershed erosion assessment. Journal of Agricultural, Biological and Environmental Statistics 11, 169–182.

Pfeffermann, D. 2002. Small-area estimation: new developments and directions. International Statistical Review 70(1): 125–143.

Pfeffermann, D. 2013. New important developments in small-area estimation. Statistical Science 28(1): 40–68.

Prasad, N. & Rao, J. 1990. The estimation of mean squared error of small-area estimators. Journal of the American Statistical Association 85, 163–171.

Prasad, N. & Rao, J. 1999. On robust small-area estimation using a simple random-effects model. Survey Methodology 25, 67–72.

Pratesi, M. & Salvati, N. 2008. Small-area estimation: the EBLUP estimator based on spatially correlated random area effects. Statistical Methods and Applications 17, 113–141.

Pratesi, M. & Salvati, N. 2009. Small-area estimation in the presence of correlated random area effects. Journal of Official Statistics 25, 37–53.

Pratesi, M., Ranalli, M.G. & Salvati, N. 2008. Semi-parametric M-quantile regression for estimating the proportion of acidic lakes in 8-digit HUCs of the north-eastern United States. Environmetrics 19, 687–701.

Rao, J.N.K., Kovar, J.G. & Mantel, H.J. 1990. On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika 77, 365–375.

Rao, J.N.K. 2003. Small-Area Estimation. New York, Wiley.

Rao, J.N.K. 2010. Small-area estimation with applications to agriculture. In: Benedetti, R., Bee, M., Espa, G. & Piersimoni, F. (eds.) Agricultural Survey Methods, London, John Wiley and Sons.

Reibel, M. & Aditya, A. 2006. Land use weighted areal interpolation. GIS Planet 2005 International Conference, Estoril, Portugal.

Royall, R.M. 1976. Current advances in sampling theory: implications for human observational studies. American Journal of Epidemiology 104, 463–474.

Ruppert, D., Wand, M.P. & Carroll, R. 2003. Semiparametric Regression. Cambridge, UK and New York, Cambridge University Press.

Sabin, M.A. 1985. Contouring: the state of the art. In: Earnshaw, R.A. (ed.), Fundamental Algorithms for Computer Graphics, NATO ASI series, vol. F17. Heidelberg, Springer-Verlag.

Saei, A. & Chambers, R. 2003. Small-area estimation under linear and generalized linear mixed models with time and area effects. In: University of Southampton Statistical Sciences Research Institute, S3RI Methodology Working Papers, Southampton, UK, pp. 1–35.

Saei, A. & Chambers, R. 2005a. Empirical best linear unbiased prediction for out-of-sample areas. Working paper M05/03. Southampton, University of Southampton Statistical Sciences Research Institute.

Saei, A. & Chambers, R. 2005b. Out-of-sample estimation for small areas using area-level data. Working paper M05/011. Southampton, University of Southampton Statistical Sciences Research Institute.

Salvati, N. 2004. Small-area estimation by spatial models: the spatial empirical best linear unbiased predictor (Spatial EBLUP). Working paper no. 2004/04. Florence, Italy, University of Florence Department of Statistics.

Salvati, N., Pratesi, M., Tzavidis, N. & Chambers, R. 2009. Spatial M-quantile models for small-area estimation. Statistics in Transition 10(2): 251–267.

Salvati, N., Tzavidis, N., Pratesi, M. & Chambers, R. 2012. Small-area estimation via M-quantile geographically weighted regression. TEST 21, 1–28.

Särndal, C.E. 1984. Design-consistent versus model-dependent estimation for small domains. Journal of the American Statistical Association 79, 624–631.

Särndal, C.E., Swensson, B. & Wretman, J. 1992. Model-Assisted Survey Sampling. New York, Springer Verlag.

Searle, S.R., Casella, G. & McCulloch, C.E. 1992. Variance Components. New York, Wiley.

Shu, Y. & Lam, N.S.N. 2011. Spatial disaggregation of carbon dioxide emissions from road traffic based on multiple linear regression model. Atmospheric Environment 45, 634–640.

Shu, Y., Lam, N.S.N. & Reams, M. 2010. A new method for estimating carbon dioxide emissions from transportation at fine spatial scales. Environmental Research Letters 5.

Singh, B.B., Shukla, G.K. & Kundu, D. 2005. Spatio-temporal models in small-area estimation. Survey Methodology 31, 183–195.

Schmid, T. & Münnich, R. 2013. Spatial-robust small-area estimation. Statistical Papers. DOI: 10.1007/s00362-013-0517-y.

Singh, M.P., Gambino, J. & Mantel, H.J. 1994. Issues and strategies for small-area data. Survey Methodology 20, 3–22.

Sinha, S.K. & Rao, J.N.K. 2009. Robust small-area estimation. Canadian Journal of Statistics 37, 381–399.

Song, P. X., Fan, Y. & Kalbfleisch, J. 2005. Maximization by parts in likelihood inference (with discussion). Journal of the American Statistical Association 100, 1145–1158.

Sud, U.C., Bhatia, V.K., Chandra, H. & Srivastava, A.K. 2011. Crop yield estimation at district level by combining improvement of crop statistics scheme data and census data. Wye City Group on Rural Statistics and Agricultural Household Income, 4th meeting, Rio de Janeiro.

Tassone, E.C., Miranda, M.L. & Gelfand, A.E. 2010. Disaggregated spatial modelling for areal unit categorical data. Journal of the Royal Statistical Society 59(1): 175–190.

Tobler, W. 1979. Smooth pycnophylactic interpolation for geographical regions. Journal of the American Statistical Association 74, 519–529.

Tzavidis, N., Marchetti, S. & Chambers, R. 2010. Robust estimation of small-area means and quantiles. Australian and New Zealand Journal of Statistics 52, 167–186.

Ugarte, M.D., Goicoa, T., Militino, A.F. & Durban, M. 2009. Spline smoothing in small area trend estimation and forecasting. Computational Statistics and Data Analysis 53, 3616–3629.

Valliant, R., Dorfman, A.H. & Royall, R.M. 2000. Finite Population Sampling and Inference: a Prediction Approach. New York, Wiley.

You, L. & Wood, S. 2006. An entropy approach to spatial disaggregation of agricultural production. Agricultural Systems 90, 329–347.

You, Y. & Rao, J.N.K. 2002. A pseudo-empirical best linear unbiased prediction approach to small-area estimation using survey weights. Canadian Journal of Statistics 30, 431–439.

Yuan, Y., Smith, R.M. & Limp, W.F. 1997. Remodelling census population with spatial information from Landsat TM imagery. Computers, Environment and Urban Systems 21, 245–258.

Wackernagel, H. 2003. Multivariate Geostatistics: an Introduction with Applications. Berlin, Springer Verlag.

Wang, J. & Fuller, W.A. 2003. Mean squared error of small-area predictors constructed with estimated area variances. Journal of the American Statistical Association 92, 716–723.

Webster, R. & Oliver, M. 2001. Geostatistics for Environmental Scientists. Chichester, UK, John Wiley and Sons.

Welsh, A.H. & Ronchetti, E. 1998. Bias-calibrated estimation from sample surveys containing outliers. Journal of the Royal Statistical Society, Series B 60, 413–428.

Wolter, K. 1985. Introduction to Variance Estimation. New York, Springer-Verlag.

Wright, J.K. 1936. A method of mapping densities of population. Geographical Review 26, 103–110.

Wu, S., Qiu, X. & Wang, L. 2005. Population estimation methods in GIS and remote sensing: a review. GI Science and Remote Sensing 42(1): 58–74.

Part II. Resilience of SAE Methods to Non-Standard Situations

Introduction

Part II discusses several open issues concerning the resilience of SAE methods in non-standard situations that may occur in agricultural surveys, particularly with regard to assessing the quality of small-area estimates and to applying the methods used for official statistics. Their relevance in agro-environmental applications is also discussed.

Chapter 3: Sensitivity of SAE predictors to spatial model specifications

SAE estimators are model-based: they rely on the specification of an operational model that links the study variable to the auxiliary variables. The model can accurately represent the real spatial distribution of the study variable, or it can simply mimic it. Because the spatial distribution of crops and land use in small areas is likely to be non-stationary and likely to show specific levels of spatial correlation, it is important to assess the extent to which the model’s goodness-of-fit affects the quality of small-area estimates.

Chapter 4: The impact of the modifiable areal unit

Point-based census or survey data may, of course, be aggregated into areas or regular cells such as enumeration districts, administrative areas or any other spatial partition; the areal units are hence modifiable. The problem is that, when the spatial or other relation between variables is analysed, the result can differ when the same relation is measured in areal units at different scales. This can give misleading results in the specification of SAE models, and affects the quality of SAE itself.

Chapter 5: The robustness of the predictors to departures from normality and the robustness of small-area estimators to outlier observations

Many traditional SAE models assume that the study variable has a normal distribution, an assumption that can rarely be accepted when the distribution of agro-environmental variables is studied. Even when a normal distribution can be achieved by transforming the original data, the presence of outliers can compromise the efficiency of the estimates. In particular, when the data-production process can cause errors – as may happen in statistical agencies – the use of robust estimators is suggested to minimize bias in making estimates.

Chapter 6: The effect on target variables of the complexity of sample designs in a survey

The sampling design of a survey can affect disaggregation methods because it can influence the small-area estimators. Stratification, clustering and varying probabilities of inclusion can alter the properties of statistical models, which are in general developed on the assumption of simple random sampling from an entire population. When selection follows a more complex design, the effects on the estimates produced by the model must be assessed. The chapter discusses cases in which the design can affect the estimators and presents two alternative small-area estimators that account for the sample design. The impact of sample design is assessed using a design-based simulation based on real agricultural data.

Chapter 7: Missing data in spatial datasets

Complete knowledge of the spatial distribution of an auxiliary variable correlated with a study or target variable and knowledge of the exact location of all the units can be useful in SAE. The performance of an SAE predictor is likely to be impaired to the extent that such knowledge is affected by errors, missing geographical data and missing values in the study and auxiliary variables. This often occurs in practice, and it is therefore important to assess proposed ways of protecting the validity of SAE.

Chapter 8: The excess of zeros in survey data

This problem is relevant when the target variable is skewed and non-negative, and is also characterized by a large number of zeros. This is likely to happen when the study variable is crop production. In survey data, zero values for crop production can be observed at many sampled farms because, for example, a crop is not cultivated over a wide area or because the land was not used to cultivate that crop in the survey period.

3. Sensitivity of SAE Predictors to Spatial Model Specifications

3.1 Introduction

Many target parameters in agricultural and rural statistics can be expressed in the form of means and percentages. In Europe, many agro-environmental indicators are expressed in percentages and combine different kinds of data with arable land, usually expressed as the utilized agricultural area – the total area occupied by arable land, permanent grassland, permanent crops and kitchen gardens. This is the case in the LUCAS surveys, for example.¹⁷

Using survey data to estimate these quantities of interest for sub-populations – domains – is a common practice. There are, however, geographic domains for which direct estimates of adequate precision cannot be produced: these are known as “small areas”. Survey designs usually focus on achieving a particular degree of precision in estimates at a higher level of aggregation than the small area, so sample sizes for small areas are typically small.

To explain the setting and objectives of the experiment described in this chapter and to interpret its main findings, attention is drawn to three major issues identified in the literature review in Part I:

• Small-area estimates are obtained by fitting statistical models to survey data, and then applying them to auxiliary information available for the small-area population of interest (see Chapter 2). These data can be administrative and geographic, but they must always refer to the same domains or population units, which leads to the more general problem of usage of misaligned data and their spatial integration as discussed in Chapter 4.

• Small-area estimates are new statistics that are not otherwise available from surveys or administrative data sources. Often, a number of potential models are considered that involve various combinations of the auxiliary variables (see Chapter 2). Because they are obtained by fitting a model to the data, it is obvious that the robustness of the results with regard to departures from the hypothesis of normal distribution of the population values becomes an important issue.

• Various quality diagnostics must be examined to determine which of the potential small-area predictors to use. Once the model is chosen, users must be given an assessment of its quality and the quality of the small-area estimates produced from it. Of the various diagnostics used to assess the accuracy, validity and consistency of small-area estimates (Brown et al., 2001) the most common are: i) a bias test that compares the small-area predictions with the direct estimates, usually by comparing the absolute relative bias of the small-area estimates with direct estimates in a simulation study; and ii) the RRMSE test, which is analogous to sampling errors calculated for survey estimates: it is a measure of the efficiency in terms of accuracy of the small-area estimates.

With these issues in mind, this section gives the results of a simulation in which the performance of different predictors of the small-area mean is compared in alternative scenarios relevant to the production of agricultural and rural statistics at the small-area level.

The objectives of the study are:
i. to provide evidence of the sensitivity of SAE predictors to the specification of a model to describe the spatial structure of the available data;
ii. to discuss the properties of SAE predictors at different levels of availability of survey and auxiliary data; and
iii. to start discussion of the robustness of SAE predictors to departures from the hypothesis of normal distribution of population values and their resilience to outliers (see Chapter 5).

The experiment involves two simulation studies. The first, a model-based experiment, analyses the sensitivity of the estimators to different specifications of the spatial structure of the data. The sample remains fixed, and many realizations of the same spatial population model are simulated; the properties of the predictors are studied separately for each spatial model, and the results obtained in three spatial models are compared (see section 3.2). The second, a design-based experiment, is pseudo-real in that it is based on real data collected by the United States Environmental Protection Agency; because the population is real, the properties of the predictors are evaluated on the basis of replicates of the sampling design applied to the population (see section 3.3).

17 Since 2006, EUROSTAT has carried out a survey every three years of the state and the dynamics of changes in land use and cover in the European Union – the LUCAS surveys, which are based on observations made and registered on the ground. The most recent, in 2012, covered all 27 European Union countries and made observations at 270,000 points.

In both experiments the performance of the SAE predictors is evaluated in terms of bias by the values of average relative bias – AvRBias:

$$\mathrm{AvRBias}_i = \left( T^{-1} \sum_{t=1}^{T} m_{it} \right)^{-1} \left\{ T^{-1} \sum_{t=1}^{T} \left( \hat{m}_{it} - m_{it} \right) \right\}$$

The relative bias is calculated for each small area $i$ and averaged over the $T$ replicates of the simulation study. In the expression, $m_{it}$ is the actual average for area $i$ at simulation $t$ and $\hat{m}_{it}$ is the estimated small-area average.

The efficiency is evaluated by calculating the average RRMSE – AvRRMSE:

$$\mathrm{AvRRMSE}_i = \left( T^{-1} \sum_{t=1}^{T} m_{it} \right)^{-1} \sqrt{ T^{-1} \sum_{t=1}^{T} \left( \hat{m}_{it} - m_{it} \right)^{2} }$$
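The two performance measures can be computed directly from the simulated true and estimated area means. The following is a minimal sketch for a single small area, assuming the bias measure is the mean estimation error over the T replicates divided by the mean true value, and the efficiency measure is the root of the mean squared error divided by the same quantity. The function names and the toy data are illustrative assumptions, not part of the report.

```python
import math


def av_rbias(true_means, est_means):
    """Average relative bias for one small area over T replicates."""
    T = len(true_means)
    mean_true = sum(true_means) / T
    mean_error = sum(e - m for e, m in zip(est_means, true_means)) / T
    return mean_error / mean_true


def av_rrmse(true_means, est_means):
    """Average relative root mean squared error for one small area."""
    T = len(true_means)
    mean_true = sum(true_means) / T
    mse = sum((e - m) ** 2 for e, m in zip(est_means, true_means)) / T
    return math.sqrt(mse) / mean_true


# Toy data (hypothetical): T = 4 replicates for one area.
true_i = [10.0, 10.0, 10.0, 10.0]
est_i = [11.0, 9.0, 11.0, 9.0]

print(av_rbias(true_i, est_i))  # errors cancel out, so the bias is 0.0
print(av_rrmse(true_i, est_i))  # sqrt(1) / 10 = 0.1
```

The toy example shows why both measures are reported: a predictor can be unbiased on average while still being inefficient.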

The results obtained for the traditional small-area predictors are compared with those of the so-called spatial predictors. We consider the following estimators: EBLUP; GREG, which is called “modified GREG” on p. 136 of Rao (2003); MBDE (Chandra and Chambers, 2005); MQ (see Chapter 2); SEBLUP (Petrucci and Salvati, 2006); SMBDE (Chandra et al., 2007); GWEBLUP (Chandra et al., 2012); and two common predictors used in spatial interpolation, GWR (Fotheringham et al., 2002) and the predictor based on ordinary kriging interpolation (see Cressie, 1993). Here, for GWEBLUP and GWR, a Gaussian specification for the weighting function is used:

$$w(u_l, u) = \exp\left\{ -\frac{d_{u_l,u}^{2}}{2b^{2}} \right\},$$

where $d_{u_l,u}$ denotes the Euclidean distance between $u_l$ and $u$ and $b$ is the bandwidth. As the distance between $u_l$ and $u$ increases, the spatial weight decreases exponentially. The bandwidth $b$ is a measure of the rate at which the weighting function decays with increasing distance, and so determines the “roughness” of the fitted GWR function. The bandwidth is chosen by minimizing the cross-validation (CV) criterion proposed by Fotheringham et al. (2002), who also discuss other weighting functions and the computation of the bandwidth.
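To illustrate how a Gaussian weighting function behaves, the sketch below computes spatial weights for a few locations relative to a target location. It assumes the common Gaussian kernel form exp(-0.5 (d/b)^2); the function names, coordinates and bandwidth value are hypothetical, and bandwidth selection by cross-validation is not shown.

```python
import math


def gaussian_weight(d, b):
    """Gaussian kernel weight for distance d and bandwidth b (assumed form)."""
    return math.exp(-0.5 * (d / b) ** 2)


def gwr_weights(points, target, b):
    """Weights of all sample locations relative to one target location."""
    return [gaussian_weight(math.dist(p, target), b) for p in points]


# Hypothetical sample locations and bandwidth.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
weights = gwr_weights(points, target=(0.0, 0.0), b=2.0)

for p, w in zip(points, weights):
    print(p, round(w, 4))
```

A location that coincides with the target receives weight 1, and the weight shrinks towards 0 as the distance grows relative to the bandwidth, so a smaller bandwidth gives a more local (rougher) fit.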

The SEBLUP, SMBDE and GWEBLUP are built to integrate geo-referenced information and to model it at the small-area level (see Chapter 2). This report also considers standard MQ predictors, which are naturally robust to outliers and well suited to skewed distributions (see Chapter 5).

3.2 Model-based simulation experiment

Three alternative spatial models are specified to generate the population values of the study variable, $y_{ij}$, for unit $j$ of area $i$. A sample is drawn from each realization of the population. The number of small areas is fixed at A = 20 in each replication, following a simulation setting by Chandra et al. (2012). The level of availability of survey and auxiliary variables is high. In other words, survey and auxiliary data are not misaligned, population values of the study and auxiliary variables are known for each unit $j$ and each area $i$, and the spatial coordinates of the sampled and non-sampled population units identify their location. In GIS, the spatial coordinates of the centroids of the small areas are provided for sampled and non-sampled areas. The latter, also called out-of-sample areas, are those where no sample observations are made and, in practice, where direct estimators cannot be computed. In field applications the study variable is observed in sampled units, but only auxiliary information is available for out-of-sample units; the coordinates of the locations of out-of-sample units may be unknown, which is why the centroid of the small area to which they belong is used in their place.¹⁸

1. Spatially stationary model

This model generates a spatially stationary set of population values. Crop yield, for example, is generated under a spatial random process whose properties do not vary by location. The population values of and are generated according to the two-level model: , where , and , with the random area effects generated as and with level-one errors distributed as , corresponding to an intra-area correlation of 0.20.
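A two-level population of this kind can be sketched as follows. Only the number of areas (20), the range of area sizes and the intra-area correlation of 0.20 are taken from the text; the intercept, slope, distribution of the auxiliary variable and the two variance components are illustrative assumptions chosen so that var_u / (var_u + var_e) = 0.20, and they should not be read as the report's actual settings.

```python
import random


def simulate_two_level(area_sizes, beta0, beta1, var_u, var_e, rng):
    """Generate (area, x, y) triples from y = beta0 + beta1*x + u_i + e_ij."""
    population = []
    for i, n_i in enumerate(area_sizes):
        u_i = rng.gauss(0.0, var_u ** 0.5)        # random area effect
        for _ in range(n_i):
            x = rng.uniform(0.0, 1.0)             # auxiliary variable (assumed)
            e = rng.gauss(0.0, var_e ** 0.5)      # level-one error
            population.append((i, x, beta0 + beta1 * x + u_i + e))
    return population


# Hypothetical variance components giving an intra-area correlation of 0.20.
var_u, var_e = 0.04, 0.16
icc = var_u / (var_u + var_e)

rng = random.Random(2015)
area_sizes = [rng.randint(450, 500) for _ in range(20)]
pop = simulate_two_level(area_sizes, beta0=1.0, beta1=2.0,
                         var_u=var_u, var_e=var_e, rng=rng)

print(round(icc, 2), len(pop))
```

Because the area effect is drawn once per area and shared by all of its units, units in the same area are correlated while units in different areas are independent, which is what makes the process spatially stationary.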

2. SAR stationary model

This model is used to generate the population corresponding to a nested error regression model, with random area effects for neighbouring areas distributed according to an SAR spatial correlation structure. In this case the distribution of $y_i$ is not conditional: the marginal distributions for all $y_i$ are specified as a system of simultaneous equations. It is not unusual, for example, for the crop yield of a point to be linked with neighbouring y. An alternative is the conditional auto-regressive (CAR) spatial correlation structure – that is, the distribution of $y_i$ conditional on all the other y-values is normal. In this simulation an SAR process is preferred as being likely for agricultural data (Pratesi and Salvati, 2005). The model assumes the form: and , in which W is a proximity matrix of order A, I is the identity matrix of order A, and is the SAR coefficient, which is set at 0.75 – high spatial correlation. Element (k, l) of the contiguity matrix W takes the value 1 if area k shares an edge with area l, and 0 otherwise. The distribution for and for random area effects is the same as in the spatially stationary model.
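Spatially correlated area effects of the SAR type can be sketched as below. The sketch assumes the usual SAR formulation v = rho * W v + u, so that v = (I - rho W)^{-1} u, with rho = 0.75 as stated in the text. The 4 x 5 grid layout of the 20 areas, the row-standardization of W and the standard deviation of u are assumptions made for illustration. The system is solved by fixed-point iteration, which converges here because rho < 1 and each row of the standardized W sums to 1.

```python
import random


def rook_contiguity(n_rows, n_cols):
    """Binary contiguity matrix: 1 if two grid cells share an edge."""
    A = n_rows * n_cols
    W = [[0.0] * A for _ in range(A)]
    for r in range(n_rows):
        for c in range(n_cols):
            k = r * n_cols + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    W[k][rr * n_cols + cc] = 1.0
    return W


def row_standardize(W):
    """Scale each row to sum to 1 (an assumption, common in practice)."""
    return [[w / sum(row) for w in row] for row in W]


def sar_effects(W, rho, u, tol=1e-12, max_iter=10_000):
    """Solve v = rho * W v + u by fixed-point iteration."""
    v = list(u)
    for _ in range(max_iter):
        v_new = [u_k + rho * sum(w * v_l for w, v_l in zip(row, v))
                 for row, u_k in zip(W, u)]
        if max(abs(a - b) for a, b in zip(v_new, v)) < tol:
            return v_new
        v = v_new
    raise RuntimeError("fixed-point iteration did not converge")


rng = random.Random(42)
W = row_standardize(rook_contiguity(4, 5))      # 20 areas on a hypothetical grid
u = [rng.gauss(0.0, 0.2) for _ in range(20)]    # independent innovations
v = sar_effects(W, rho=0.75, u=u)               # spatially correlated effects
print([round(x, 3) for x in v[:5]])
```

For larger numbers of areas a direct linear solve would normally be preferred; the iteration is used here only to keep the sketch free of external dependencies.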

3. Spatially non-stationary model
The third model uses the same distribution for $e_{ij}$ and for the random area effects as the first, but it also allows the intercept and the slope of the linear model for $y_{ij}$ to vary according to longitude and latitude. This leads to a spatially non-stationary set of population values. Crop yield, for example, is generated in this case under a spatial random process whose properties vary by location. The two-level model in this case is $y_{ij} = \beta_{0ij} + \beta_{1ij} x_{ij} + u_i + e_{ij}$, with the intercept $\beta_{0ij}$ and the slope $\beta_{1ij}$ specified as functions of the longitude and latitude of unit j in area i, and with the location coordinates for each unit of the population generated independently (see Salvati et al., 2012).
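The data-generating mechanism of the SAR stationary model can be sketched in a few lines of Python. The sketch is illustrative only: the regression coefficients, the variance components (chosen so that, without spatial correlation, the intra-area correlation is 0.04/(0.04 + 0.16) = 0.20), the uniform covariate, the 4 × 5 grid layout of the 20 areas and the row-standardization of W are assumptions made for the example and are not taken from the simulation design. Setting rho to zero gives the spatially stationary model.

```python
import numpy as np


def rook_contiguity(n_rows, n_cols):
    """Binary contiguity matrix for areas on a regular grid:
    W[k, l] = 1 if areas k and l share an edge, 0 otherwise."""
    n_areas = n_rows * n_cols
    W = np.zeros((n_areas, n_areas))
    for r in range(n_rows):
        for c in range(n_cols):
            k = r * n_cols + c
            if c + 1 < n_cols:            # neighbour to the right
                W[k, k + 1] = W[k + 1, k] = 1.0
            if r + 1 < n_rows:            # neighbour below
                W[k, k + n_cols] = W[k + n_cols, k] = 1.0
    return W


def simulate_population(rng, N, W, rho, beta0=1.0, beta1=2.0,
                        sigma2_u=0.04, sigma2_e=0.16):
    """Nested error regression population with SAR area effects.
    All default parameter values are hypothetical."""
    n_areas = len(N)
    W_std = W / W.sum(axis=1, keepdims=True)               # row-standardised W (assumption)
    u = rng.normal(0.0, np.sqrt(sigma2_u), n_areas)        # independent area-level shocks
    v = np.linalg.solve(np.eye(n_areas) - rho * W_std, u)  # v = (I - rho W)^{-1} u
    area = np.repeat(np.arange(n_areas), N)                # area label of every unit
    x = rng.uniform(0.0, 1.0, area.size)                   # auxiliary variable (assumed U[0, 1])
    e = rng.normal(0.0, np.sqrt(sigma2_e), area.size)      # level-one errors
    y = beta0 + beta1 * x + v[area] + e
    return area, x, y


rng = np.random.default_rng(2015)
W = rook_contiguity(4, 5)                    # A = 20 areas on a 4 x 5 grid
N = rng.integers(450, 501, size=20)          # integer area sizes in [450, 500]
area, x, y = simulate_population(rng, N, W, rho=0.75)
```

Solving the linear system instead of inverting I − ρW explicitly is a numerical convenience; with a row-standardized W the system is non-singular for any |ρ| < 1.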

The small-area population sizes $N_i$ were randomly drawn from a uniform distribution on [450, 500] and kept fixed over the simulations. A sample of size $n$ was selected from each simulated population, with small-area sample sizes $n_i$ proportional to the fixed small-area population sizes, resulting in an average area sample size of $n/A$. These area-specific sample sizes were kept fixed in the simulations and the small areas were treated as strata, with the final sample selection carried out by random sampling in each small area. A total of T = 500 simulations were carried out.
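The sampling step described above can be sketched as follows. The function names, the largest-remainder rounding rule, the four example area sizes and the total sample size of 40 are hypothetical choices made for illustration; the report does not state how fractional allocations were rounded.

```python
import numpy as np


def proportional_allocation(N, n):
    """Split a total sample size n across areas in proportion to their
    population sizes N, rounding with the largest-remainder rule."""
    N = np.asarray(N)
    quota = n * N / N.sum()                  # exact (fractional) allocation
    n_i = np.floor(quota).astype(int)
    shortfall = n - n_i.sum()                # units still to be assigned
    order = np.argsort(-(quota - n_i), kind="stable")
    n_i[order[:shortfall]] += 1              # largest remainders get one more unit
    return n_i


def stratified_srs(rng, area, n_i):
    """Simple random sampling without replacement inside each area (stratum)."""
    picks = [rng.choice(np.flatnonzero(area == i), size=k, replace=False)
             for i, k in enumerate(n_i)]
    return np.concatenate(picks)


N = np.array([450, 475, 500, 460])           # hypothetical area sizes
n_i = proportional_allocation(N, n=40)       # -> [9, 10, 11, 10]
area = np.repeat(np.arange(len(N)), N)
sample = stratified_srs(np.random.default_rng(0), area, n_i)
```

Because the area sizes are fixed over the replications, the allocation needs to be computed only once; only the within-area selection is repeated at each simulation.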

The small-area estimators compared in the simulations are the EBLUP (Rao, 2003), GREG (Rao, 2003), SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2009), MBDE (Chandra and Chambers, 2005), SMBDE (Chandra et al., 2007), GWEBLUP (Chandra et al., 2012) and MQ regression small-area estimator (Chambers and Tzavidis, 2006).

Note that in these model-based simulations the NPEBLUP (Opsomer et al., 2008) and small-area estimator based on the MQGWR (Salvati et al., 2012) are not evaluated because they perform well if the spatial coordinates are available for sampled and out-of-sample units, which is not the case in this scenario, where the availability of the auxiliary information is restricted to the sampled areas. A drawback of MQGWR is that it is computationally intensive (see Salvati et al., 2012). The NPEBLUP can capture the spatial relationship through P-splines, which are useful when the functional form of the relationship between the variable of interest and the covariates is unspecified and the data are characterized by complex patterns of spatial dependence. It does not guarantee good performance when only the spatial coordinates of centroids are available: this is because a non-parametric model cannot use a large number of knots. In the proposed simulation experiments in 20 small areas, the NPEBLUP could use 10 or 11 knots.

18 An example of availability of geographical auxiliary data for out-of-sample data is the Baseline project of the Italian Ministry of Agricultural, Food and Forestry Policies, MIPAAF, which integrates data available from the AGRIT project with the POPOLUS spatial frame after it has been refreshed on the basis of CORINE land cover (see the LUCAS project).

In these simulations, estimators based on interpolation methods – GWR (Fotheringham et al., 2002) and kriging (Cressie, 1993) – are also evaluated (see Chapter 2).

Note that any method for taking spatial information into account must include some geographic covariates for each small area by considering data regarding the spatial location such as the centroid coordinates and/or GIS-generated auxiliary geographical variables referring to the same area.
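A minimal sketch of the idea of adding geographic covariates is given below. It fits only the fixed part of a model by ordinary least squares, with and without longitude and latitude, on synthetic data whose intercept drifts with location; it is not an implementation of the EBLUP, GREG or MQ estimators, and every numerical value in it is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
lon = rng.uniform(0, 50, n)                  # hypothetical coordinates
lat = rng.uniform(0, 50, n)
x = rng.uniform(0, 1, n)                     # auxiliary variable
# Intercept drifts linearly with location (assumed form of non-stationarity).
y = (1 + 0.02 * lon + 0.02 * lat) + 2 * x + rng.normal(0, 0.4, n)


def ols_residual_sd(X, y):
    """Residual standard deviation of a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid.std(ddof=X.shape[1])


X_plain = np.column_stack([np.ones(n), x])           # covariate only
X_gc = np.column_stack([np.ones(n), x, lon, lat])    # covariate + coordinates

sd_plain = ols_residual_sd(X_plain, y)
sd_gc = ols_residual_sd(X_gc, y)
print(f"residual SD without coordinates: {sd_plain:.3f}, with coordinates: {sd_gc:.3f}")
```

When the spatial drift is not linear in the coordinates, this simple augmentation captures only part of it, which is one motivation for locally weighted approaches such as GWR.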

Table 3.1. Definitions of models and small-area predictors used in the simulation studies

Acronym     Predictor
EBLUP       Empirical best linear unbiased predictor
EBLUP-GC    EBLUP + geographical coordinates
GREG        Generalized regression estimator
GREG-GC     GREG + geographical coordinates
SEBLUP      Spatial EBLUP
MBDE        Model-based direct estimator
SMBDE       Spatial MBDE
GWEBLUP     Geographically weighted EBLUP
MQ          M-quantile model
MQ-GC       MQ + geographical coordinates
GWR         Geographically weighted regression
KRIG        Kriging

The covariates should be able to take spatial interaction into account when it results from the covariates themselves: for this reason we have evaluated the performance of small-area estimators based on EBLUP, GREG and MQ, adding longitude and latitude as covariates. The models and the estimators considered in our empirical evaluations are summarized in Table 3.1. The performance of the different small-area estimators is evaluated by computing the average relative bias (AvRBias) and the average relative root mean squared error (AvRRMSE) for each small area as follows:

,


.

and summarizing the results over the T = 500 realizations of the population model. Here $m_{it}$ denotes the actual average for area i at simulation t, with $\hat{m}_{it}$ denoting the estimated small-area average.
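The following sketch shows how performance measures of this kind can be computed from simulation output. It assumes one common definition, in which the bias and the root mean squared error of each area are taken over the T replications, scaled by the mean of the true area averages and then averaged across areas; this definition, the function name and the toy input are assumptions and may differ in detail from the formulas used in the report.

```python
import numpy as np


def relative_bias_and_rrmse(m_hat, m_true):
    """Per-area relative bias and relative RMSE, both in percent.

    m_hat, m_true: arrays of shape (T, A) holding estimated and true
    area means for T simulations and A areas.
    """
    m_hat = np.asarray(m_hat, dtype=float)
    m_true = np.asarray(m_true, dtype=float)
    err = m_hat - m_true
    scale = m_true.mean(axis=0)                       # mean true value per area
    rb = 100 * err.mean(axis=0) / scale
    rrmse = 100 * np.sqrt((err ** 2).mean(axis=0)) / scale
    return rb, rrmse


# Toy input: T = 2 simulations, A = 2 areas.
m_true = [[10.0, 20.0], [10.0, 20.0]]
m_hat = [[11.0, 20.0], [9.0, 22.0]]
rb, rrmse = relative_bias_and_rrmse(m_hat, m_true)
av_rbias, av_rrmse = rb.mean(), rrmse.mean()          # averages across areas
```

In the toy input the errors of the first area cancel, so its relative bias is zero although its relative RMSE is 10 percent; this is why bias and RRMSE are reported side by side.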

Table 3.2. Summary of results from model based simulations

Predictor AvRBias%                                AvRRMSE%
          Spatially   SAR         Spatially       Spatially     SAR           Spatially
          stationary  stationary  non-stationary  stationary    stationary    non-stationary
EBLUP     0.017       0.017       -0.054          1.511 (1.00)  1.539 (1.00)  2.639 (1.00)
EBLUP-GC  0.017       0.017       -0.112          1.539 (1.02)  1.534 (1.00)  1.688 (0.64)
GREG      0.017       0.018       0.044           1.623 (1.07)  1.619 (1.05)  2.664 (1.01)
GREG-GC   0.018       0.018       0.023           1.624 (1.07)  1.622 (1.05)  2.082 (0.79)
SEBLUP    0.017       0.017       -0.050          1.534 (1.02)  1.505 (0.98)  2.559 (0.97)
MBDE      0.017       0.017       -0.015          2.226 (1.47)  2.221 (1.44)  4.768 (1.81)
SMBDE     0.017       0.017       -0.015          2.226 (1.47)  2.219 (1.44)  4.593 (1.74)
GWEBLUP   0.019       0.020       -0.062          1.667 (1.10)  1.645 (1.07)  1.489 (0.56)
MQ        0.013       0.011       0.443           1.737 (1.15)  1.870 (1.22)  2.728 (1.03)
MQ-GC     0.014       0.018       0.175           1.729 (1.14)  1.708 (1.11)  1.586 (0.60)
KRIG      0.017       0.016       0.085           2.219 (1.47)  2.007 (1.30)  1.875 (0.71)
GWR       0.019       0.021       0.012           2.326 (1.54)  2.052 (1.33)  1.598 (0.61)

Note: Values are expressed as percentages. The values of the ratios of AvRRMSE to EBLUP are given in parentheses.
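The ratios in parentheses can be reproduced directly from the tabulated AvRRMSE values. The short check below does so for the spatially non-stationary column of Table 3.2; the dictionary and variable names are chosen for the example.

```python
# AvRRMSE% under the spatially non-stationary model, copied from Table 3.2.
avrrmse_nonstationary = {
    "EBLUP": 2.639, "EBLUP-GC": 1.688, "GREG": 2.664, "GREG-GC": 2.082,
    "SEBLUP": 2.559, "MBDE": 4.768, "SMBDE": 4.593, "GWEBLUP": 1.489,
    "MQ": 2.728, "MQ-GC": 1.586, "KRIG": 1.875, "GWR": 1.598,
}

# Ratio of each predictor's AvRRMSE to that of the EBLUP, to two decimals.
baseline = avrrmse_nonstationary["EBLUP"]
ratios = {name: round(value / baseline, 2)
          for name, value in avrrmse_nonstationary.items()}

for name, ratio in ratios.items():
    print(f"{name:<9} {ratio:.2f}")
```

A ratio below 1 indicates a predictor that is more efficient than the EBLUP in that scenario; a ratio of 0.56 for GWEBLUP, for example, corresponds to an AvRRMSE that is 44 percent lower.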

Table 3.2 shows the mean of the distribution of values of AvRBias and AvRRMSE over simulations for spatially stationary, SAR stationary and the spatially non-stationary population models. In the stationary case, all the estimators show small average relative bias (0.013 for MQ) and, as one would expect, the EBLUP has a lower RRMSE than the other estimators. Things change, however, when one looks at the results for the spatially non-stationary case, where there is evidence of a substantial gain in efficiency, as measured by a lower RRMSE, when the GWEBLUP and GWR are compared with the other small-area predictors. The small-area estimators that take into account the spatial coordinates in the model increase in efficiency.

Under the spatially non-stationary model, the mean value of the AvRRMSE of the EBLUP is 2.639, and that of the EBLUP-GC is 1.688; the mean value of the AvRRMSE of the MQ-GC is 1.586, which is lower than the corresponding value of 2.728 for MQ. This happens for the GREG-GC estimator as well, with AvRRMSE equal to 2.082 against 2.664 for GREG.

Under the SAR stationary population model, the mean value of the AvRRMSE of the SEBLUP estimator is 1.505 compared with 1.539 for the EBLUP: the SEBLUP is hence the most efficient estimator in this scenario. In the SAR stationary scenario, the models that use spatial information generally perform better than those that do not: the AvRRMSE, for example, is 1.708 for MQ-GC and 1.870 for MQ. The GREG-GC, with 1.622 against 1.619 for GREG, is an exception.

From the results of the simulation experiments it is evident that better estimates can be obtained by using the spatial information in both the fixed and the random parts of the models, or even by specifying models with spatially correlated random area effects. The evidence from the case studies is that SEBLUP with correlated random area effects following a SAR process performs better when the spatial correlation in the study variable is high. But the inclusion of covariates that capture the spatial effects may be useful when the process is spatially non-stationary. In view of the results obtained by MQ with spatial coordinates, fitting the MQGWR should improve the efficiency of the small-area estimator (see Salvati et al., 2012).

3.3 Design-based simulation experiment

The aims were: i) to compare the performance of the different small-area predictors and interpolation methods of the mean in each small area; and ii) to evaluate the performance of the different predictors for estimating the mean for out-of-sample areas. The level of availability of spatial information for survey and auxiliary variables is higher than in the model-based simulation; the spatial coordinates of the population units are available for sampled and non-sampled areas.

The data are drawn from the Environmental Monitoring and Assessment Program of the Space Time Aquatic Resources Modelling and Analysis Program at Colorado State University in the United States. The data set has been studied intensively in SAE experiments with spatial data (see Opsomer et al., 2008 and Salvati et al., 2012).

The survey data used in this design-based simulation come from the United States Environmental Protection Agency’s northeast lakes survey (Larsen et al., 2001), and are the same as those used in some examples in Chapter 2. To recap: between 1991 and 1995 researchers from the Environmental Protection Agency conducted an environmental health study of the lakes in the north-eastern states using a sample of 334 lakes from the population of 21,026, which were grouped according to 113 eight-digit HUCs, of which 64 contained fewer than 5 observations and 27 had no observations. The variable of interest was ANC, an indicator of the acidification risk of water bodies. Because some lakes were visited several times during the study and because some of these were measured at more than one site, the total number of observed sites was 349, with 551 measurements. The EMAP data set also contained the elevation and geographical coordinates of the centroid of each lake in the target area. For sampled locations, the exact spatial coordinates of the location are known; for non-sampled locations, the centroid of the lake is known. Detailed information on the spatial coordinates therefore exists for non-sampled locations as well, because the geography defined by the lakes is finer than the geography of interest defined by the HUCs.

The aims of the simulation were: i) to compare the performance of the different small-area predictors and interpolation methods for the mean of ANC in each HUC; and ii) to evaluate the performance of the different predictors for estimating the mean ANC for out-of-sample HUCs. To do this, a population of ANC values was created with spatial characteristics similar to those of the lakes sampled by EMAP, with the value of the estimated spatial correlation equal to 0.7.

A total of 200 independent random samples were taken from each HUC sampled by EMAP, with sample sizes set equal to where is the sample size of each HUC in the original EMAP dataset. No observations were taken from HUCs that had not been sampled by EMAP. This process resulted in a sample of 652 ANC values from 86 HUCs. For details on the generation of the population see Salvati et al. (2012).

In the simulations, the small-area predictors evaluated in Part I are compared; in this case the performance of the MQGWR and the NPEBLUP is also evaluated. The relative bias (RB) and the RRMSE of estimates of the mean value of ANC in each HUC were computed. The summaries of the across-area distributions of RB and RRMSE are set out in Tables 3.3 and 3.4 for sampled areas, and in Tables 3.5 and 3.6 for out-of-sample areas. GREG, GREG-GC, MBDE and SMBDE cannot be computed for out-of-sample areas because they require observed values of y in the area. For this reason the results for these estimators are not shown in Tables 3.5 and 3.6.


In Tables 3.3 and 3.4, all small-area predictors based on variants of the MQGWR model have significantly lower RB than the EBLUP, SEBLUP and NPEBLUP; the MQGWR predictor performs best. With regard to performance in terms of RRMSE, the small-area predictors that account for the spatial structure of the data have on average smaller root mean squared errors; GREG-GC is an exception. The NPEBLUP, SEBLUP and the MQGWR predictor perform best.

These results show that there is a substantial number of in-sample HUCs where the MQGWR predictor has lower RRMSE than the NPEBLUP and SEBLUP. The results also confirm that the MQGWR predictor is a good competitor of NPEBLUP and SEBLUP in sampled areas: in other words the MQGWR predictor is not expected to be uniformly better than the SEBLUP, but it is expected to be more efficient in some HUCs.

The results for the GWR and kriging interpolation methods show smaller RRMSE than the small-area predictors. This could be because of the small – about 15 percent – intra-class correlation coefficient in the data. This value is the ratio $\sigma_u^2/(\sigma_u^2 + \sigma_e^2)$ of the between-area variance to the total variance; it measures the presence of area effects in the data. When there is little heterogeneity across the areas, a synthetic predictor such as those based on GWR and kriging can perform better than the predictors that take area effects into account. Note that the GWR is a particular case of MQGWR with a high tuning constant at quantile 0.5; this is the expectile version of MQGWR.
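As an illustration of how an intra-class correlation can be estimated, the sketch below applies the one-way analysis-of-variance (method of moments) estimator to balanced synthetic data. The balanced design, the variance components of 0.15 and 0.85 and the function name are assumptions made for the example; the EMAP data are unbalanced and the report does not state which estimator was used.

```python
import numpy as np


def anova_icc(y):
    """One-way ANOVA estimate of the intra-class correlation.

    y: array of shape (A, k), one row per area and k units per area
    (balanced design assumed).
    """
    y = np.asarray(y, dtype=float)
    n_areas, k = y.shape
    msb = k * y.mean(axis=1).var(ddof=1)          # between-area mean square
    msw = y.var(axis=1, ddof=1).mean()            # within-area mean square
    sigma2_u = max((msb - msw) / k, 0.0)          # truncated at zero
    return sigma2_u / (sigma2_u + msw)


rng = np.random.default_rng(0)
u = rng.normal(0.0, np.sqrt(0.15), size=(2000, 1))    # area effects
e = rng.normal(0.0, np.sqrt(0.85), size=(2000, 10))   # unit-level errors
icc = anova_icc(u + e)
print(f"estimated intra-class correlation: {icc:.3f}")
```

A value close to zero means that the area means differ little beyond what unit-level noise alone would produce, which is the situation in which synthetic predictors tend to do well.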

For out-of-sample areas, MQGWR-based small-area predictors have lower relative bias and lower root mean squared errors than the EBLUP, NPEBLUP and SEBLUP. It seems that the MQGWR model offers a straightforward approach for improving synthetic estimation for out-of-sample areas. The performance of the SEBLUP in this case may be surprising, but it should be borne in mind that there is evidence in this case of spatial non-stationary behaviour of the study variable. A synthetic SEBLUP was also used for out-of-sample areas. A more elaborate method for out-of-sample areas in the SAR model was proposed by Saei and Chambers (2005).

Another result to consider is that the small-area methods that take into account the spatial information in the covariates – EBLUP-GC and MQ-GC – perform better in terms of RRMSE than the corresponding small-area predictors that do not – EBLUP and MQ.

The interpolation methods also work well for out-of-sample areas: in particular, GWR shows low bias and RRMSE with values close to those of MQGWR.

It can be concluded from the design-based simulation experiment that if the intra-class correlation is small and spatial information is available and shows spatial non-stationarity, the interpolation methods can be used for estimation, and the MQGWR is the preferred predictor in the class of small-area estimators.


Table 3.3. Design-based simulation results using the EMAP data for 86 sampled areas

Predictor Summary of across-area distribution
          Min      Q1       Median   Mean     Q3       Max
EBLUP     -23.31   0.39     10.79    12.55    21.43    83.22
EBLUP-GC  -15.60   -4.41    3.88     5.69     12.54    106.00
GREG      -7.56    -1.97    -0.28    -0.30    1.20     6.35
GREG-GC   -8.48    -1.95    -0.08    -0.19    1.65     7.03
SEBLUP    -16.87   -5.12    2.50     5.27     12.33    62.04
MBDE      -6.21    -1.29    0.15     0.08     1.54     7.45
SMBDE     -6.94    -2.13    -0.33    -0.47    1.02     6.74
GWEBLUP   -52.69   -9.14    2.28     6.59     17.94    132.24
MQ        -11.09   -2.34    -0.42    -0.83    1.32     4.79
MQ-GC     -54.48   -28.11   -21.15   -20.79   -13.27   3.01
MQGWR     -8.87    -1.69    0.06     0.22     1.79     14.40
NPEBLUP   -10.75   -1.23    10.45    12.19    23.70    37.17
GWR       -27.43   -7.89    1.14     3.41     10.51    77.45
KRIG      -14.20   -4.61    -0.05    4.04     11.20    52.96

Note: Results show across-area distribution of RB% over simulations.

Table 3.4. Design-based simulation results using the EMAP data for 86 sampled areas



Predictor Summary of across-area distribution
          Min      Q1       Median   Mean     Q3       Max
EBLUP     14.20    23.95    35.18    38.05    49.49    99.00
EBLUP-GC  9.54     23.19    31.84    33.79    42.47    118.70
GREG      7.38     24.65    35.16    36.82    45.59    82.08
GREG-GC   12.46    25.66    36.59    38.49    48.33    85.66
SEBLUP    8.08     20.46    29.01    31.50    38.61    75.44
MBDE      4.56     25.05    34.18    37.10    47.51    78.95
SMBDE     4.62     25.23    34.47    37.00    48.20    79.56
GWEBLUP   4.33     17.67    24.64    29.51    35.39    136.56
MQ        6.64     25.81    35.49    39.45    49.71    119.07
MQ-GC     15.17    26.71    33.28    33.54    40.47    56.46
MQGWR     4.97     21.49    29.84    33.61    43.22    83.71
NPEBLUP   16.53    25.84    32.05    33.72    41.24    75.29
GWR       4.13     13.50    18.22    21.97    25.15    83.74
KRIG      7.44     16.85    22.33    24.98    30.19    66.18

Note: Results show across-area distribution of RRMSE% over simulations.


Table 3.5. Design-based simulation results using the EMAP data for 27 out of sample areas



Predictor Summary of across-area distribution
          Min      Q1       Median   Mean     Q3       Max
EBLUP     -72.50   -57.29   -36.59   -2.47    38.14    288.11
EBLUP-GC  -60.10   -34.99   -5.49    5.07     17.27    135.10
SEBLUP    -68.46   -51.05   -27.35   11.80    58.49    345.09
GWEBLUP   -141.56  -13.26   10.58    5.58     34.57    103.02
MQ        -85.57   -73.27   -66.29   -47.46   -31.32   106.96
MQ-GC     -65.16   -36.25   -11.33   -2.07    10.19    136.90
MQGWR     -48.98   -11.89   -3.69    -3.37    4.88     40.61
NPEBLUP   -18.09   -7.63    12.33    13.38    29.50    59.99
GWR       -32.99   -9.02    -2.64    0.78     5.99     57.69
KRIG      -25.07   -14.51   -5.47    -0.21    7.99     42.20

Note: Results show across areas distribution of RB% over simulations.

Table 3.6. Design-based simulation results using the EMAP data for 27 out of sample areas



Predictor Summary of across-area distribution
          Min      Q1       Median   Mean     Q3       Max
EBLUP     5.75     40.14    53.76    60.44    62.21    288.61
EBLUP-GC  6.27     12.41    34.18    40.41    52.11    135.90
SEBLUP    16.03    37.71    53.81    66.21    68.13    346.34
GWEBLUP   13.33    19.84    31.37    43.73    49.68    156.63
MQ        6.56     37.63    68.65    57.26    74.83    107.69
MQ-GC     17.10    24.28    30.96    46.00    54.77    145.30
MQGWR     10.21    14.88    17.50    22.93    23.29    78.24
NPEBLUP   14.39    18.13    31.39    34.58    38.80    65.10
GWR       10.12    14.16    16.88    22.71    23.32    78.87
KRIG      12.43    16.04    20.62    23.88    28.07    57.82

Note: Results show across-area distribution of RRMSE% over simulations.

3.4 Remarks and findings

The main findings for each objective of the study are summarized below.

i. With regard to the sensitivity of the SAE predictors to different spatial models, it can be concluded that: i) provided the spatial correlation is high – greater than 0.5 in absolute value – good results are obtained using spatial information in the fixed and the random parts of the models; ii) when the process is spatially non-stationary, the inclusion of covariates that capture the spatial effects can be useful because they improve the efficiency of the predictors; and iii) when spatial heterogeneity is relevant across the areas, the SAE approach performs better than synthetic estimates from the kriging and GWR interpolation methods.

ii. The properties of the SAE predictors change with different levels of availability of survey and auxiliary data, and the recommended estimators change. The design-based experiment shows that MQGWR competes with the other predictors and the interpolation methods, especially when estimates for out-of-sample areas are required and the coordinates of the population units are available.

iii. When data are generated under the assumption of normality, the EBLUP family of predictors – EBLUP, EBLUP-GC and GWEBLUP – shows a substantial gain in efficiency as measured by a lower RRMSE for the spatially stationary, SAR stationary and spatially non-stationary models. Of the interpolation methods, only GWR shows a competitive performance in terms of efficiency. The MQ approach does not gain in efficiency in normality scenarios; its resilience to departures from normality and to outliers is considered in Chapter 5.

It should be noted with regard to finding (ii) above that the availability of auxiliary spatial information is a crucial issue in the application of SAE predictors. Auxiliary information can consist of geo-coded GIS data about the spatial distribution of these domains and units. Such information can, for example, be obtained from digital maps that cover the domains of interest and so enable the calculation of their centroids, borders, perimeters and areas and the distances between them. Alternatively, spatial coordinates may be available for all sampled and non-sampled population units, including those in out-of-sample areas, as in the design-based case study. These attributes are commonly available in statistical agencies, and they are helpful in the analysis of socio-economic data relating to these domains because these often show spatial structure – that is, they are correlated with the geography of the landscape.

In this context it is useful to recall Tobler’s (1970) first law of geography: “Everything is related to everything else, but near things are more related than distant things.” The law is also valid for small geographical areas: nearby areas are more likely than widely separated areas to have similar values of the target parameter. This suggests that appropriate use of geographical information and geographical modelling can help to produce accurate estimates for small-area parameters.

In fact, the spatially-based estimators presented above make it possible to use all components of survey data, including geographical data. This is an advantage for environmental and agro-environmental studies in which geographical information is fundamental in understanding the spatial pattern of the phenomena being analysed (Petrucci et al., 2005).

The spatial approach presented here has its limitations. The models and estimators presented are variable-specific solutions: geographical information that is relevant to one study variable may not be relevant to another. Nevertheless, even if geographical information is not informative by itself, it must be accepted that the spatial conformation of a study area – land use, elevation and percentages of hill, mountain and plain – is likely to have a strong influence on many environmental and socio-economic phenomena and their distribution by small area of interest (Petrucci et al., 2005).

Another source of sensitivity is the definition of the geographical units under analysis. The modifiable areal unit problem (MAUP) (Unwin, 1996) is a potential source of error that can affect spatial studies, which utilize aggregate data sources and SAE results. The MAUP occurs in spatial analysis of aggregated data in which the results differ when the same analysis is applied to the same data but different aggregation schemes are used. It takes two forms: the scale effect, and the zone effect.

The scale effect arises when the same analysis is applied to the same data but the scale of the aggregation units is changed, and the results differ. Analysis using data aggregated by county, for example, will differ from analysis using data aggregated by census tract. This difference in results is often valid in that each analysis asks a different question because each evaluates the data from a different perspective or a different scale.

The zone effect is observed when the scale of analysis is fixed, but the shape of the aggregation units is changed. Analysis using data aggregated into one-mile grid cells, for example, will differ from analysis using one-mile hexagonal cells. The zone effect is a problem because it is an analysis, at least in part, of the aggregation scheme rather than the data themselves. A simple strategy to deal with MAUP in SAE is to carry out analyses at several scales or in several zones. But this can conflict with budget and time issues, which often constrain the production of small-area statistics (see Chapter 4).

Sensitivity of SAE predictors

Spatial models
• Use of spatial information is recommended in the fixed and random parts of the models when the spatial correlation is high (greater than 0.5 in absolute value).
• Use of covariates that can capture the spatial effects is recommended when the process is spatially non-stationary.
• Use of the SAE approach is recommended when spatial heterogeneity is relevant across the areas compared with the kriging and GWR approaches.

Availability of survey and auxiliary data
• The MQGWR competes with the other predictors and the interpolation methods when estimates for out-of-sample areas are required and coordinates of the population units are available.

Normality assumption
• When the assumption is satisfied, the EBLUPs show a substantial gain in efficiency for spatially stationary, SAR stationary and spatially non-stationary models.
• Among the interpolation methods, only GWR has competitive performance in terms of efficiency. The M-quantile approach does not gain in efficiency in normality scenarios.


4. The Modifiable Areal Unit Problem

4.1 Introduction

The availability of desktop computing power and GIS software has created interest in and a need to learn more about the MAUP, which has been discussed in spatial analysis literature since the 1930s (Unwin, 1996). The term is due to Openshaw and Taylor (1979) and it has long been recognized as a potentially troublesome feature of aggregated data.

The MAUP is a source of statistical bias that can radically affect the results of statistical analysis. It affects results when point-based measures of spatial phenomena such as population density are aggregated into larger areas, in that the resulting summary values – totals, rates and proportions – are influenced by the choice of the area boundaries. Point-based census or survey data, for example, may be aggregated into census enumeration districts, postcode areas or any other spatial partition and hence the areal units are modifiable.

The problem is particularly relevant in the production and analysis of agro-environmental data and in the analysis of socio-economic data in general. But it seems to be far from being solved, as indicated in the next two sections.

This section provides evidence of the MAUP effect in the application of SAE predictors and interpolation methods. Knowledge of the spatial distribution of the localities of the sampled units in the small areas, and the corresponding point-based measures of the study variable y, are assumed. The auxiliary variables are also available for out-of-sample units, as specified in Chapter 3.

To explain the rationale of the simulation experiment, the two forms of the MAUP – the scale effect and the zone effect – must be recalled:

i. The scale effect gives different results when the same analysis is applied to the same data but there are changes in the scale of the aggregation units. Analysis of average crop production carried out with data aggregated by county, for example, will differ from analysis using data aggregated by agrarian zone.

ii. The zone effect is observed when the scale of analysis is fixed but the shape of the aggregation units is changed. Analysis of data aggregated into one-mile grid cells, for example, will give different results from analysis based on one-mile hexagonal cells. The zone effect is a problem because it is an analysis, at least in part, of the aggregation scheme rather than the data themselves.
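The scale effect can be made concrete with a toy example. In the sketch below, two variables share a smooth spatial surface but carry large independent unit-level noise; their correlation is then computed on the individual points and on means taken over square grid cells of two sizes. The surface, the noise level, the grid sizes and the function name are all assumptions made for the illustration.

```python
import numpy as np


def cell_mean_correlation(lon, lat, x, y, cells_per_side):
    """Correlation between x and y after averaging both within square grid cells."""
    col = np.minimum((lon * cells_per_side).astype(int), cells_per_side - 1)
    row = np.minimum((lat * cells_per_side).astype(int), cells_per_side - 1)
    cell = row * cells_per_side + col
    n_cells = cells_per_side ** 2
    counts = np.bincount(cell, minlength=n_cells)
    keep = counts > 0                                   # ignore empty cells
    mean_x = np.bincount(cell, weights=x, minlength=n_cells)[keep] / counts[keep]
    mean_y = np.bincount(cell, weights=y, minlength=n_cells)[keep] / counts[keep]
    return np.corrcoef(mean_x, mean_y)[0, 1]


rng = np.random.default_rng(42)
n = 10_000
lon, lat = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
trend = np.sin(2 * np.pi * lon) + np.cos(2 * np.pi * lat)   # smooth surface shared by x and y
x = trend + rng.normal(0, 3, n)                              # independent unit-level noise
y = trend + rng.normal(0, 3, n)

r_points = np.corrcoef(x, y)[0, 1]
r_fine = cell_mean_correlation(lon, lat, x, y, cells_per_side=50)
r_coarse = cell_mean_correlation(lon, lat, x, y, cells_per_side=10)
print(f"points: {r_points:.2f}, 50 x 50 cells: {r_fine:.2f}, 10 x 10 cells: {r_coarse:.2f}")
```

Averaging within cells removes unit-level noise but keeps the shared surface, so the correlation rises as the cells grow; the same data therefore support different conclusions at different scales.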

In current spatial analysis, in fact, the level of aggregation of point-based measures will often have been decided, and data gathered for particular areal units such as census tracts, enumeration areas, municipalities or other aggregated geographical zones of interest are used. When the values are averaged in the process of aggregation, variability in the dataset is lost as a result of the scale effect, and the results of the same statistical technique will tend to vary according to the level of spatial resolution (Openshaw and Taylor, 1979). This difference in results may be valid nonetheless in that each analysis asks a different question because each evaluates the data from a different perspective or different scale.

In our opinion, the zone effect is a secondary problem because the level of aggregation of point-based measures and the shape (square, hexagon, triangle cells) of the aggregated unit is often imposed by the goals of the analysis in real life applications.

The focus of this section is hence limited to the study of the scale effect on SAE predictors. As far as we know, no previous studies have been devoted to the topic. In their search for a method to overcome the MAUP, Tobler (1989) and Fotheringham (1989) looked for statistical methods whose results would be relatively robust to the definition of the spatial units for which data are recorded. Following their approach, evidence is given here from which to assess the robustness of SAE methods to different scales of aggregation of point-based measures in particular small areas or domains of interest. The rationale of the simulation is to determine the extent to which one can aggregate individual values in small areas and still achieve an acceptably accurate estimate of the small-area parameter. It is recognized that point-based measures, even as the best option, are costly and that geographical and statistical aggregation can be an easier alternative.

4.2 An evaluation of the impact of the scale effect on SAE predictors and interpolation methods

Sensitivity to the level of aggregation of point-based measures is taken into account when the small-area parameter – the area mean – is predicted by EBLUP (Rao, 2003), the generalized regression estimator (Rao, 2003), SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2009), MBDE (Chandra and Chambers, 2005), SMBDE (Chandra et al., 2007), the MQ regression small-area estimator (Chambers and Tzavidis, 2006) and interpolation methods – GWR (Fotheringham et al., 2002) and ordinary kriging (Cressie, 1993).

The simulation experiment is based on the second model presented in Section 3.2, where the population is generated by a nested-error regression model with random area effects for neighbouring areas distributed according to a simultaneous autoregressive (SAR) spatial correlation structure, with the spatial autoregressive coefficient set equal to 0.75 – high spatial correlation.

It is based on about 10,000 points, each representing an individual unit located randomly in 20 small areas. The small areas are in the form of quadrats; their population sizes are randomly drawn from a uniform distribution on [450, 500] and kept fixed over the simulations. The location coordinates for each unit of the population are independently generated from a uniform distribution. It is assumed that the only spatial information available is the spatial coordinates of the sampled units and the spatial coordinates of the centroids of the small areas to which they belong.

To examine the scale effect, the points are aggregated into a mean of 101 areal units or clusters in each small area. Spatial aggregation is carried out by aggregating a number of contiguous point spatial units into a single cluster unit whose boundaries are irregular as defined by a stopping rule of 100 individual units. The extension and shape of the cluster of 100 individuals depend on the random distribution of the point locations of the individual units.

The small-area sizes of the aggregated population vary between 89 and 108 clusters. A sample of $n$ clusters is selected from each simulated population, with small-area sample sizes proportional to the fixed small-area population sizes, giving an average area sample size of $n/A$ clusters. These area-specific sample sizes are kept fixed in the simulations and the small areas are treated as strata, with the final sample selection carried out by random sampling in each small area. A total of T = 500 simulations is carried out. The models used in the simulation study are presented in Section 3, Table 3.1. The performance of the different small-area estimators is evaluated by computing for each small area the average relative bias (AvRBias) and the AvRRMSE, as in Section 3.2.

Table 4.1 gives the results for the original simulation experiment in Section 3.2 and the results for the aggregated population. The results for the latter show that the MQ-type, EBLUP-type and SEBLUP estimators perform best in terms of RRMSE. Kriging and GWR show less bias than most of the small-area predictors.

Table 4.1. Results from model-based simulations in 20 areas; SAR-stationary process

Predictor Original population     Aggregated population
          AvRBias%    AvRRMSE%    AvRBias%    AvRRMSE%
EBLUP     0.017       1.539       0.035       1.828 (+18.7%)
EBLUP-GC  0.017       1.534       0.036       1.814 (+18.2%)
GREG      0.018       1.619       0.035       1.957 (+20.8%)
GREG-GC   0.018       1.622       0.037       1.961 (+20.9%)
SEBLUP    0.017       1.505       0.036       1.829 (+21.5%)
MBDE      0.017       2.221       0.033       2.689 (+21.0%)
SMBDE     0.017       2.219       0.033       2.687 (+21.1%)
MQ        0.011       1.870       0.025       1.878 (+0.4%)
MQ-GC     0.018       1.708       0.028       1.828 (+7.0%)
KRIG      0.016       2.007       0.029       1.958 (-2.5%)
GWR       0.021       2.052       0.033       2.371 (+15.5%)

Note: Values are expressed as percentages. The percentage increase in RRMSE for each predictor from the original population to the aggregated population is given in parentheses.
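The percentage changes in parentheses can be recomputed from the two AvRRMSE columns of Table 4.1, as in the short check below; the dictionary layout and variable names are chosen for the example. Because the check starts from the rounded table entries, the recomputed figures can differ from the reported ones by up to about 0.1 percentage points.

```python
# (original AvRRMSE%, aggregated AvRRMSE%, reported % change), from Table 4.1.
table_4_1 = {
    "EBLUP": (1.539, 1.828, 18.7), "EBLUP-GC": (1.534, 1.814, 18.2),
    "GREG": (1.619, 1.957, 20.8), "GREG-GC": (1.622, 1.961, 20.9),
    "SEBLUP": (1.505, 1.829, 21.5), "MBDE": (2.221, 2.689, 21.0),
    "SMBDE": (2.219, 2.687, 21.1), "MQ": (1.870, 1.878, 0.4),
    "MQ-GC": (1.708, 1.828, 7.0), "KRIG": (2.007, 1.958, -2.5),
    "GWR": (2.052, 2.371, 15.5),
}

# Percentage change of AvRRMSE from the original to the aggregated population.
pct_change = {name: 100 * (aggregated / original - 1)
              for name, (original, aggregated, _) in table_4_1.items()}

largest = max(pct_change, key=pct_change.get)    # predictor most affected by aggregation
smallest = min(pct_change, key=pct_change.get)   # predictor least affected

for name, change in pct_change.items():
    print(f"{name:<9} {change:+.1f}%")
```

Sorting the predictors by this percentage change gives a simple ranking of their sensitivity to the scale of aggregation in this experiment.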

To evaluate the scale effect, the last column of Table 4.1 shows in parentheses the percentage increase of RRMSE for each predictor from the original population to the aggregated population. The SEBLUP predictor shows the largest increase in RRMSE. The reason could be the decrease in the value of the spatial auto-correlation parameter: variables, parameters and processes that are important at one scale or unit are frequently not important or predictive at another scale or unit. The MQ-type estimators have the lowest increase of RRMSE. Kriging has the best performance, with a reduction of 2.5 percent of RRMSE. The performance of MQ can be explained by the fact that the changes in geography do not affect the MQ coefficients at the area level. GWR also performs well, which suggests that locally varying models may be less influenced by MAUP issues than traditional linear regression and linear mixed models.
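The percentage changes in the last column of Table 4.1 can be recomputed directly from the two AvRRMSE% columns. The following is a minimal sketch for a subset of the predictors; the dictionary layout and the function name `pct_change` are illustrative choices, not part of the original study.

```python
# AvRRMSE% values copied from Table 4.1:
# (original population, aggregated population)
rrmse = {
    "EBLUP": (1.539, 1.828),
    "SEBLUP": (1.505, 1.829),
    "MQ": (1.870, 1.878),
    "KRIG": (2.007, 1.958),
    "GWR": (2.052, 2.371),
}


def pct_change(original, aggregated):
    """Percentage change in RRMSE when moving to the aggregated population."""
    return 100.0 * (aggregated - original) / original


changes = {name: pct_change(*values) for name, values in rrmse.items()}

# Print predictors from the smallest to the largest change
for name, value in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{name:7s} {value:+.2f}%")
```

Sorting by the change makes the ranking discussed in the text easy to read off: kriging is the only predictor in this subset with a negative change, and SEBLUP has the largest increase.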

4.3 Remarks and findings

The MAUP has been studied for univariate statistics – mean, variance and Moran coefficient – and for bivariate and multivariate statistics, using datasets or simulation studies. Qi and Wu (1996) noted that the Moran coefficient, Geary ratio and Cliff-Ord statistic are scale-dependent: the spatial correlation values decline with scale, and are dependent on the zoning system used in the aggregation. In the case of bivariate statistics, Gehlke and Biehl (1934) noted that the coefficient of correlation increases as regions are aggregated into smaller numbers of larger regions. Openshaw and Taylor (1979) discovered that they could obtain almost any value of the correlation between voting behaviour and age in Iowa merely by aggregating counties in different ways. Fotheringham and Wong (1991) presented the results of an analysis of the effects of aggregation on linear regression and logit models, and demonstrated that some relationships can be relatively stable to data aggregation while others appear to be highly sensitive.

Many authors have tried to overcome the MAUP even though it has traditionally been written off as intractable. Steel and Holt (1996), Holt et al. (1996) and Tranmer and Steel (1998) propose a model structure that includes an extra set of grouping variables – z – that can be measured at the individual level and that are in some way related to the processes being measured at the aggregate level. The grouping variables are used to adjust the aggregate-level variance–covariance matrix for the model so that it approximates the unknown individual-level variance–covariance matrix more closely.

Following Tobler (1989) and Fotheringham (1989) in search of SAE methods that are relatively robust to the definition of the spatial units for which data are recorded, we have obtained evidence of the effect of changing scale in most SAE predictors of small-area means. The results were also compared with those obtained by kriging and GWR.

Two main results stem from Table 4.1:

i. The more the operational model underlying SAE is linked to a defined spatial structure of the data, the worse the performance when changing scale. This is what happened to SEBLUP, EBLUP, GREG and MBDE, which registered the best performance in the original population. Methods based on spatial auto-correlation are directly affected by the MAUP and by scale-dependent spatial correlation coefficients (Qi and Wu, 1996).

ii. Methods that are naturally robust to outliers and not linked to distributional assumptions about the study variable, as in the MQ and MQ-GC models, seem to perform better and to be more resilient to changing scales of analysis. Their performance, which is worse than that of the EBLUP-based methods in the original population, does not become any worse at the new scale of analysis, probably because the changes in geography do not affect the MQ coefficients at the area level.

The performance of kriging and GWR comes somewhere between these. The GWR estimator is worse than kriging, probably because the measure of spatial correlation in the definition of local regression parameters is scale-dependent.

Sensitivity of SAE predictors to MAUP

• SAE models linked to a defined spatial structure – SEBLUP, EBLUP, GREG and MBDE – perform worst when the scale changes.
• SAE models based on spatial auto-correlation suffer because the spatial correlation coefficient is scale-dependent.
• SAE models that are naturally robust to outliers and not linked to distributional assumptions about the study variable – MQ and MQ-GC – are more resistant to changes in the scale of analysis.
• GWR performs worse than kriging.


5. The Robustness of SAE Predictors

5.1 Introduction

This section proposes robust estimators for small areas. The presence of outliers in agricultural data is common and should be taken into account in the estimation process. Two approaches are presented here: the MQ and the robust mixed model.

The literature contains numerous proposals for small-area estimators, particularly for means or totals. The most commonly used is the EBLUP estimator (see Chapter 2). If the LMM assumptions on which the EBLUP is based are respected, then it is the best available estimator in its class in terms of efficiency. But in many real applications the presence of outliers and the skewness of the data cause the EBLUP to become biased and inefficient. A significant body of literature establishes the effect that outliers can have on the parameter estimates of random-effects models (Huggins, 1993; Richardson and Welsh, 1995), so many different small-area estimators that take into account departures from normality assumptions in the LMM and the possible presence of outliers have been developed in recent years.

In this section, robust estimators are presented that are alternatives to the EBLUP. The initial tools for robust SAE come from Chambers and Tzavidis (2006) and Tzavidis et al. (2010), who suggested the use of MQ linear models to obtain small-area estimates. Sinha and Rao (2009) studied the effect of outliers on the widely used EBLUP of the small-area mean.

In practical applications, the use of robust estimators is suggested with a view to protecting estimates from bias induced by outliers in the data.

5.2 Small-area robust estimators

This section focuses on two different approaches to outlier-robust SAE. The first is based on the MQ linear model, the second on the robust random-effects model.

5.2.1 MQ estimators

The MQ small-area estimator of the mean was described in Chapter 2. This estimator is an M-type estimator, so outlier-robust estimation is automatically achieved. But it is possible to improve resistance to outliers by introducing a robust function in the MQ-CD estimator presented in Chapter 2, defining a Welsh-Ronchetti (1998) MQ estimator, MQ-WR:

$$\hat{m}_i^{MQ\text{-}WR} = N_i^{-1}\left\{ \sum_{j\in s_i} y_{ij} + \sum_{j\in r_i} x_{ij}^T\hat{\beta}_\psi(\hat{q}_i) + \frac{N_i-n_i}{n_i}\sum_{j\in s_i}\omega_{ij}\,\phi\!\left(\frac{y_{ij}-x_{ij}^T\hat{\beta}_\psi(\hat{q}_i)}{\omega_{ij}}\right)\right\}, \qquad (5.1)$$

where $\hat{\beta}_\psi(\hat{q}_i)$ is the MQ regression coefficient vector estimated at the area-specific MQ coefficient $\hat{q}_i$ (see Chapter 2). As usual, the population units are identified by j and the small areas by i. The data consist of values yij of the outcome and values xij of a vector of p auxiliary variables, which includes the constant term as first component. A sample s is drawn and the area-specific samples si of size ni ≥ 0 are available for each area or domain; the set ri contains the Ni − ni indices of the non-sampled units in small area i. Values of yij are known only for sampled units, while for the p-vector of auxiliary variables it is assumed that unit-level data are accurately known from external sources. The difference from the MQ-CD estimator is in the third addend on the right-hand side of equation (5.1), where $\omega_{ij}$ is a robust estimate of scale such as the median absolute deviation of the residuals, and $\phi$ is the influence function associated with the MQ (see Tzavidis et al., 2010). Analytic and bootstrap MSE estimation for the MQ small-area estimators is described in Chambers and Tzavidis (2006), Chambers et al. (2011) and Tzavidis et al. (2010).
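The robust scale estimate mentioned above can be illustrated with the median absolute deviation (MAD) of a set of residuals. The sketch below is only an illustration: the function name `mad_scale`, the toy residuals and the consistency factor 1.4826 (which makes the MAD comparable to a standard deviation under normality) are assumptions, not details taken from the report.

```python
import statistics


def mad_scale(residuals, consistency=1.4826):
    """Median absolute deviation of the residuals, rescaled by `consistency`.

    The default factor is the usual normal-consistency constant; it is an
    assumption of this sketch.
    """
    centre = statistics.median(residuals)
    deviations = [abs(r - centre) for r in residuals]
    return consistency * statistics.median(deviations)


# Hypothetical residuals with one gross outlier (8.0)
residuals = [-1.2, 0.3, 0.1, -0.4, 8.0]
scale = mad_scale(residuals)
print(round(scale, 4))
```

Because both the centre and the spread are medians, the single outlying residual has no effect on the resulting scale, which is the property that makes the MAD suitable for the bias-correction term.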


5.2.2 Robust EBLUP

There is ample documentation to show that the generalized least-squares estimator of β and the ML or REML estimators of the variance components are sensitive to outliers (Fellner, 1986; Huggins, 1993; Richardson and Welsh, 1995). This in turn can affect the small-area estimates. Sinha and Rao (2009) recognized this problem and proposed an estimator of the small-area mean using an outlier-robust version of the LMM that is in practice an extension of the EBLUP estimator of the mean (see Chapter 2). In particular, Sinha and Rao (2009) suggested obtaining fixed effects and variance components by using the robust ML proposal II defined by Richardson and Welsh (1995):

$$X^T V^{-1} U^{1/2}\psi(r) = 0, \qquad \psi(r)^T U^{1/2} V^{-1}\frac{\partial V}{\partial\theta_l}V^{-1}U^{1/2}\psi(r) - \mathrm{tr}\!\left(K V^{-1}\frac{\partial V}{\partial\theta_l}\right) = 0, \quad l = 1,\dots,q, \qquad (5.2)$$

where $r = U^{-1/2}(y - X\beta)$ is the vector of unit-level residuals, U is a diagonal matrix with its elements equal to the diagonal elements of V, K is a diagonal matrix of the form K = bI with $b = E[\psi^2(u)]$ for a standard normal variable u, and ψ is Huber's function $\psi(u) = \min\{c, \max(u, -c)\}$ with c = 1.345, applied componentwise. V is the covariance matrix of the LMM, and θl is the l-th component of θ = (θ1,...,θq)T, the vector of variance components of the covariance matrix V. For estimating outlier-robust random effects v, Sinha and Rao (2009) suggested the use of the equation of Fellner (1986):
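Huber's function leaves small standardized residuals unchanged and clips large ones at ±c, which is what bounds the influence of outliers in the estimating equations. A minimal sketch follows; the function name `huber_psi` and the example residuals are illustrative assumptions.

```python
def huber_psi(u, c=1.345):
    """Huber's psi: the identity on [-c, c], clipped to -c or c outside."""
    return max(-c, min(u, c))


# Hypothetical standardized residuals, two of them outlying
residuals = [-3.0, -0.5, 0.0, 1.0, 4.2]
clipped = [huber_psi(r) for r in residuals]
print(clipped)
```

With the tuning constant c = 1.345 quoted in the text, only the residuals −3.0 and 4.2 are modified in this example; the remaining values pass through unchanged.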

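$$Z^T R^{-1/2}\psi\!\left\{R^{-1/2}(y - X\beta - Zv)\right\} - G^{-1/2}\psi\!\left(G^{-1/2}v\right) = 0, \qquad (5.3)$$

where Z is the design matrix of the random effects, and G and R are the covariance matrices of the random area effects and of the unit-level errors in the LMM.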

The outlier-robust predictor REBLUP of the small-area mean is then obtained by substituting in the equation of the EBLUP mean estimator the robust estimates obtained through (5.2) and (5.3). Denoting by subscript M the robust estimates of the fixed and random effects, the robust version of the EBLUP is then:

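$$\hat{m}_i^{REBLUP} = N_i^{-1}\left\{\sum_{j\in s_i} y_{ij} + \sum_{j\in r_i}\left(x_{ij}^T\hat{\beta}_M + \hat{v}_{i,M}\right)\right\}. \qquad (5.4)$$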

See Sinha and Rao (2009) for further details and for estimation of the MSE of the robust EBLUP estimator in equation (5.4).

5.3 Assessment of the robustness of the EBLUP, MQ and robust EBLUP

Various empirical SAE studies assess the influence of outliers on the EBLUP, REBLUP and MQ small-area estimators. To compare the efficiency of the REBLUP with that of the EBLUP, Sinha and Rao (2009) generated different populations according to the LMM, introducing outliers in the area random error, the unit-level error, or both. Their model-based and design-based simulations show the superiority of the REBLUP over the EBLUP in terms of RRMSE, that is, efficiency.

Giusti et al. (2014) compared robust estimators in the small-area framework – the EBLUP, the REBLUP (equation 5.4), the MQ and the MQ-WR (equation 5.1) among others – using different model-based simulations. Their results show that the presence of outliers can significantly affect small-area estimates, suggesting that outlier-robust small-area methods should be used in real data applications.


In the Giusti et al. (2014) comparison of the MQ and REBLUP approaches to estimating the small-area mean, the two performed similarly. In their model-based simulation experiment the REBLUP performed well in the estimation of small-area means, followed by the MQ-WR and MQ estimators. Their comparison of the precision and efficiency of the MSE estimator showed that the bootstrap of the bias-corrected MQ estimators performed best in the worst scenarios – that is, with outliers affecting area-level and unit-level residuals. The bootstrap of the REBLUP performed quite well, but it is a more computer-intensive technique.

Giusti et al. (2014) also carried out an application to real income data from the 2008 Italian survey of income and living conditions and 2001 census, in which income data was not collected. The estimates obtained with the REBLUP, MQ-CD and MQ-WR estimators turned out to be similar, and the computation of a goodness-of-fit diagnostic suggested that none of these estimators can be considered statistically different from direct estimates. Note that in this case good results were also obtained for the EBLUP, though preliminary diagnostics suggested that the hypotheses of normality did not hold for these data.

5.4 The RSEBLUP: robust SAE using geo-referenced information in the mixed-model approach

As stated previously, in economic, environmental and epidemiological applications spatially close observations tend to be more alike than observations made further apart, and it can be important to include the available geographic information in models used for SAE. Examples of small-area predictors that explicitly incorporate spatial information in the mixed-model approach to SAE are the SEBLUP (Petrucci and Salvati, 2006; Singh et al., 2005; Pratesi and Salvati, 2008) and the GWEBLUP (Chandra et al., 2012).

In the MQ approach to SAE, Salvati et al. (2012) proposed an MQ-GWR model to define a bias-robust predictor of the small-area characteristic of interest that also accounts for spatial association in the data. Schmid and Munnich (2013) proposed the SREBLUP extension to cover spatial area effects in a simultaneous auto-regressive model. Like the MQ-GWR model, the SREBLUP integrates the concepts of bias-robust SAE and a unified framework to enhance spatial accuracy.

As explained in the previous section, the estimator (5.4) is robust to model mis-specifications or outliers, but it does not consider spatial dependencies in the data. This can be significant in many applications. Following Salvati (2004), Petrucci et al. (2005) and Singh et al. (2005), Schmid (2011) and Schmid and Munnich (2013) proposed the introduction of a simultaneously auto-regressive (SAR) model in the REBLUP to obtain the SREBLUP.

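Denoting by ρ the SAR parameter defining the strength of the spatial dependencies, and by W the proximity matrix between the areas, the vector v of spatially correlated random area effects is given by:

$$v = \rho W v + u \quad\Longrightarrow\quad v = (I - \rho W)^{-1}u, \qquad (5.5)$$

where u is the vector of independent area errors with variance $\sigma_u^2$. Given this, the model with spatially correlated random effects is:

$$y = X\beta + Dv + e, \qquad (5.6)$$

where D is a matrix of known positive constants. The covariance matrix of y is given by:

$$V = R + D G D^T, \qquad G = \sigma_u^2\left[(I - \rho W)^T(I - \rho W)\right]^{-1}. \qquad (5.7)$$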


Following Sinha and Rao (2009), Schmid and Munnich (2013) suggested estimating and using outlier-robust ML-equations solved with a Newton-Raphson algorithm. Having thus obtained the estimates , and , estimates of the spatially robust random effects can be obtained using Fellner's equation (Fellner, 1986). The SREBLUP of small-area means is then:

(5.8)

where . Like Sinha and Rao (2009), Schmid and Munnich (2013) proposed a parametric bootstrap estimator for the MSE of the REBLUP.

5.4.1 Evaluating the Spatial REBLUP estimator using simulation studies

Schmid (2011) carried out simulations to investigate and compare the performance of the EBLUP, REBLUP and SREBLUP point and MSE estimators in the absence or presence of outlying observations and spatial dependence in the data. Different models were used to generate the data, simulating several scenarios to allow for violations of the normality assumptions of the EBLUP and to include spatial dependencies between areas.

The population data were generated taking m=100 areas of size , with covariate values normally distributed with . With regard to the spatial structure of the areas, the latitude and longitude of each were generated independently from a uniform distribution , and they were assigned to each unit inside the area. From this population, a simple random sample of size was selected in each area.

The values of the dependent variable y were generated from the model with the error terms and generated from the following distributions:

where the parameters and determine the amount of contamination in the data, with being the situation without outliers; if , then there are no spatial dependencies in the data and the standard EBLUP is obtained. Spatial dependence is introduced into the model through the G matrix, setting , and with the W matrix obtained with the nearest-neighbour approach. A small area j is defined as a neighbour of small area i if the Euclidean distance between them is less than 0.15, a value chosen because it determines the realistic situation in which each small area has on average 5 neighbours.
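The distance-based neighbour rule described above can be sketched in a few lines. The example below is illustrative only: it assumes that the area coordinates are drawn uniformly on the unit square and that W is row-standardised, neither of which can be confirmed from the text; the function name `neighbour_matrix` and the seed are likewise hypothetical.

```python
import math
import random


def neighbour_matrix(coords, cutoff=0.15):
    """Proximity matrix W: areas closer than `cutoff` are neighbours.

    Rows are standardised to sum to one (an assumption of this sketch);
    areas without neighbours keep an all-zero row.
    """
    m = len(coords)
    W = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j and math.dist(coords[i], coords[j]) < cutoff:
                W[i][j] = 1.0
    for row in W:
        total = sum(row)
        if total > 0:
            for j in range(m):
                row[j] /= total
    return W


# 100 areas with coordinates on the unit square (assumed bounds)
rng = random.Random(42)
coords = [(rng.random(), rng.random()) for _ in range(100)]
W = neighbour_matrix(coords)
avg_neighbours = sum(sum(1 for w in row if w > 0) for row in W) / len(W)
print(avg_neighbours)
```

Printing the average number of neighbours gives a quick check of whether a chosen cut-off produces a neighbourhood structure of the intended density for a given set of coordinates.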

With regard to the presence of outlier observations, the simulation study considers the values , that is 5 percent of contaminated errors, under four different scenarios with no spatial dependence ( ): (0,0) scenario, no contamination; (e,0) scenario, contamination only in the individual errors e; (0,u) scenario, contamination only in the area errors u; (e,u) scenario, contamination in both errors. To evaluate the effect of the presence of outlying observations at the area level more accurately, in scenarios (0,u) and (e,u) the areas with the contamination in u are always the same – areas 96, 97, 98, 99 and 100.

These four scenarios are also considered together with the presence of spatial correlation between the areas, that is with : {(0,0)ρ, (e,0)ρ, (0,u)ρ, (e,u)ρ}.

The study simulated each scenario R = 1,000 times and estimated during each run the small-area mean with the following estimators: GREG, standard EBLUP, SEBLUP, robust EBLUP (REBLUP), SREBLUP, naïve MQ, MQ-CD and MQ-GWR. To evaluate and compare the performance of these estimators, the following statistics were computed for each area i: the RRMSE:

and the RB:

where A represents the estimator, is the index over the replications, is the estimate of the mean of y in area i, replication r using estimator A and is the true mean of y in area i. The RRMSE is a measure of the accuracy of the estimators; the RB is a measure of the bias of the estimators.
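The two performance measures can be sketched as follows. The exact definitions used in the study are not reproduced here, so this example assumes one common convention: the RB is the mean relative error over replications and the RRMSE is the square root of the mean squared relative error, both expressed as percentages. The function names and the toy values are hypothetical.

```python
import math


def relative_bias(estimates, truths):
    """Mean of (estimate - truth) / truth over replications, in percent."""
    ratios = [(e - t) / t for e, t in zip(estimates, truths)]
    return 100.0 * sum(ratios) / len(ratios)


def rrmse(estimates, truths):
    """Root of the mean squared relative error over replications, in percent."""
    squares = [((e - t) / t) ** 2 for e, t in zip(estimates, truths)]
    return 100.0 * math.sqrt(sum(squares) / len(squares))


# Hypothetical estimates of one area mean over R = 4 replications
est = [10.2, 9.9, 10.1, 9.8]
true = [10.0, 10.0, 10.0, 10.0]
rb = relative_bias(est, true)
rr = rrmse(est, true)
print(f"RB = {rb:.3f}%, RRMSE = {rr:.3f}%")
```

In this toy case the positive and negative errors cancel, so the RB is essentially zero while the RRMSE is not; this is why the two measures are reported side by side in the tables that follow.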

Table 5.1. Mean values of the RRMSE (%) in the non-spatial scenarios, with 5% symmetric outlier contamination

Scenarios: [0,0] no outliers; [e,0] individual outliers; [0,u] area outliers; [e,u] both.

| Estimator | [0,0] (areas 1–100) | [0,u] (areas 1–95) | [e,0] (areas 1–100) | [e,u] (areas 1–95) | [0,u] (areas 96–100) | [e,u] (areas 96–100) |
|---|---|---|---|---|---|---|
| GREG | 6.328 | 6.411 | 6.215 | 6.489 | 4.332 | 3.859 |
| EBLUP | 0.428 | 0.587 | 0.578 | 0.685 | 1.036 | 3.425 |
| SEBLUP | 0.437 | 0.592 | 0.598 | 0.673 | 1.026 | 3.413 |
| REBLUP | 0.468 | 0.578 | 0.505 | 0.556 | 0.606 | 1.037 |
| SREBLUP | 0.466 | 0.576 | 0.509 | 0.556 | 0.582 | 1.000 |
| MQ | 0.451 | 0.587 | 0.539 | 0.592 | 1.074 | 1.383 |
| MQ-CD | 0.573 | 0.727 | 1.134 | 1.251 | 0.826 | 0.939 |
| MQGWR | 0.457 | 0.618 | 0.543 | 0.618 | 0.846 | 1.219 |


Table 5.2. Mean values of the RB (%) in the non-spatial scenarios with 5% symmetric outlier contamination

| Estimator | [0,0] (areas 1–100) | [0,u] (areas 1–95) | [e,0] (areas 1–100) | [e,u] (areas 1–95) | [0,u] (areas 96–100) | [e,u] (areas 96–100) |
|---|---|---|---|---|---|---|
| GREG | 0.000 | 0.000 | 0.002 | -0.011 | 0.035 | 0.011 |
| EBLUP | 0.003 | 0.003 | 0.007 | -0.010 | 0.046 | 0.146 |
| SEBLUP | 0.003 | 0.004 | 0.007 | -0.010 | 0.046 | 0.145 |
| REBLUP | 0.002 | 0.006 | 0.003 | -0.004 | 0.022 | 0.046 |
| SREBLUP | 0.002 | 0.006 | 0.003 | -0.005 | 0.024 | 0.045 |
| MQ | 0.002 | 0.004 | 0.001 | -0.004 | 0.046 | 0.060 |
| MQ-CD | -0.001 | 0.001 | 0.001 | -0.011 | -0.002 | -0.024 |
| MQGWR | 0.001 | 0.005 | 0.001 | -0.004 | 0.010 | 0.041 |

Tables 5.1 and 5.2 show that when outlier contamination is present in non-spatial scenarios, the robust estimators outperform the traditional ones, as expected. The REBLUP and the SREBLUP perform better than the MQ estimators, especially the MQ-CD, which suffers from the individual outliers. This holds for both the RRMSE and the RB. Because the scenarios are non-spatial in this case, no particular gain is expected from the spatial estimators.

The positive effect of spatial modelling is evident in Tables 5.3 and 5.4, which show the results from the spatial scenarios. The SEBLUP outperforms the EBLUP in the settings with no contamination, and the SREBLUP outperforms the REBLUP. The MQ-GWR performs better than the MQ and MQ-CD. A comparison of the SREBLUP and the MQGWR shows that the latter has a slightly lower RB, whereas the former has a slightly lower RRMSE. In general there are no large differences in the results of the small-area models in the scenarios with symmetric contamination, because symmetric outliers depart less from the assumptions of the underlying model than non-symmetric outliers do.

Table 5.3. Mean values of the RRMSE (%) in the spatial scenarios with 5% symmetric outlier contamination

| Estimator | [0,0] (areas 1–100) | [0,u] (areas 1–95) | [e,0] (areas 1–100) | [e,u] (areas 1–95) | [0,u] (areas 96–100) | [e,u] (areas 96–100) |
|---|---|---|---|---|---|---|
| GREG | 5.642 | 6.185 | 5.700 | 6.090 | 3.925 | 3.157 |
| EBLUP | 0.496 | 0.641 | 0.782 | 0.827 | 0.659 | 1.989 |
| SEBLUP | 0.451 | 0.589 | 0.696 | 0.772 | 0.892 | 2.646 |
| REBLUP | 0.535 | 0.668 | 0.582 | 0.650 | 0.651 | 0.529 |
| SREBLUP | 0.480 | 0.581 | 0.527 | 0.567 | 0.852 | 0.825 |
| MQ | 0.532 | 0.681 | 0.616 | 0.651 | 0.776 | 0.816 |
| MQ-CD | 0.574 | 0.694 | 1.188 | 1.241 | 0.737 | 1.100 |
| MQGWR | 0.549 | 0.626 | 0.575 | 0.673 | 1.071 | 0.838 |


Table 5.4. Mean values of the RB (%) in the spatial scenarios with 5% symmetric outlier contamination

| Estimator | [0,0] (areas 1–100) | [0,u] (areas 1–95) | [e,0] (areas 1–100) | [e,u] (areas 1–95) | [0,u] (areas 96–100) | [e,u] (areas 96–100) |
|---|---|---|---|---|---|---|
| GREG | 0.002 | 0.004 | 0.000 | -0.016 | 0.019 | -0.017 |
| EBLUP | 0.006 | 0.009 | 0.009 | -0.009 | 0.012 | 0.058 |
| SEBLUP | 0.005 | 0.008 | 0.006 | -0.012 | 0.028 | 0.087 |
| REBLUP | 0.006 | 0.011 | 0.004 | -0.004 | 0.001 | 0.008 |
| SREBLUP | 0.005 | 0.010 | 0.003 | -0.006 | 0.007 | 0.029 |
| MQ | 0.007 | 0.011 | 0.003 | -0.004 | 0.025 | 0.031 |
| MQ-CD | 0.001 | 0.005 | -0.002 | -0.015 | -0.017 | -0.050 |
| MQGWR | 0.002 | 0.009 | 0.000 | -0.006 | 0.003 | -0.016 |

Schmid (2011) also considers four alternative non-symmetric simulation scenarios with the area-specific random effects ui and random errors eij generated using the model:

Results for these simulations are presented in Tables 5.5 and 5.6 for the non-spatial settings, and 5.7 and 5.8 for the spatial settings.

Table 5.5 shows that all robust small-area estimators suffer from high RRMSE except for the MQ-CD. This estimator captures the effect of the individual outliers eij in the sample, which helps with the estimation of the area means in the population. Further research is needed to enhance understanding of the behaviour of the other robust estimators. The EBLUP performs moderately well in the scenarios with individual non-symmetric representative outliers in the data, but weakly when contamination occurs at the area level.

With regard to the RB in the non-spatial scenarios with non-symmetric outlier contamination, it is evident from Table 5.6 that the robust estimators – REBLUP, SREBLUP, MQ and MQGWR – suffer from a negative bias of approximately 1 percent caused by the fact that these estimators treat individual outliers in the sample as unique in the population. The MQ-CD on the other hand corrects the bias in corresponding settings.

When considering the spatial scenarios in Tables 5.7 and 5.8, the results are much as in the non-spatial scenarios – that is, the spatial effect in this setting does not improve the performance of the spatial estimators as it did in the case of symmetric outliers.


Table 5.5. Mean values of the RRMSE (%) in the non-spatial scenarios with 5% non-symmetric outlier contamination

| Estimator | [0,0] (areas 1–100) | [0,u] (areas 1–95) | [e,0] (areas 1–100) | [e,u] (areas 1–95) | [0,u] (areas 96–100) | [e,u] (areas 96–100) |
|---|---|---|---|---|---|---|
| GREG | 6.328 | 6.495 | 5.360 | 5.667 | 4.307 | 4.099 |
| EBLUP | 0.428 | 1.298 | 0.693 | 7.634 | 20.699 | 100.910 |
| SEBLUP | 0.437 | 1.309 | 45.995 | 7.510 | 20.931 | 103.494 |
| REBLUP | 0.468 | 1.416 | 15.012 | 15.030 | 14.363 | 86.396 |
| SREBLUP | 0.466 | 1.731 | 15.025 | 14.861 | 14.745 | 88.721 |
| MQ | 0.451 | 0.842 | 15.291 | 12.879 | 30.818 | 18.331 |
| MQ-CD | 0.573 | 0.850 | 1.697 | 2.012 | 5.665 | 2.950 |
| MQGWR | 0.457 | 1.132 | 15.264 | 12.671 | 19.287 | 18.549 |

Table 5.6. Mean values of the RB (%) in the non-spatial scenarios with 5% non-symmetric outlier contamination

| Estimator | [0,0] (areas 1–100) | [0,u] (areas 1–95) | [e,0] (areas 1–100) | [e,u] (areas 1–95) | [0,u] (areas 96–100) | [e,u] (areas 96–100) |
|---|---|---|---|---|---|---|
| GREG | 0.000 | 0.009 | -0.001 | 0.005 | 0.007 | -0.114 |
| EBLUP | 0.003 | 0.058 | 0.006 | -0.445 | -0.947 | -5.885 |
| SEBLUP | 0.003 | 0.058 | -0.009 | -0.437 | -0.957 | -6.036 |
| REBLUP | 0.002 | 0.065 | -0.780 | -0.877 | -0.657 | -5.039 |
| SREBLUP | 0.002 | 0.007 | -0.782 | -0.867 | -0.674 | -5.174 |
| MQ | 0.002 | 0.029 | -0.795 | -0.751 | -1.410 | -1.069 |
| MQ-CD | -0.001 | 0.006 | -0.002 | 0.002 | -0.120 | -0.119 |
| MQGWR | 0.001 | 0.008 | -0.794 | -0.739 | -0.882 | -1.082 |

Table 5.7. Mean values of the RRMSE (%) in the spatial scenarios with 5% non-symmetric outlier contamination

| Estimator | [0,0] (areas 1–100) | [0,u] (areas 1–95) | [e,0] (areas 1–100) | [e,u] (areas 1–95) | [0,u] (areas 96–100) | [e,u] (areas 96–100) |
|---|---|---|---|---|---|---|
| GREG | 5.642 | 5.548 | 4.593 | 5.944 | 3.233 | 4.077 |
| EBLUP | 0.496 | 1.058 | 0.878 | 3.779 | 14.771 | 62.748 |
| SEBLUP | 0.451 | 1.249 | 6.204 | 3.981 | 16.422 | 66.909 |
| REBLUP | 0.535 | 1.102 | 12.697 | 13.237 | 10.544 | 25.896 |
| SREBLUP | 0.480 | 1.854 | 12.915 | 13.066 | 15.144 | 31.481 |
| MQ | 0.532 | 1.024 | 12.595 | 13.055 | 25.90 | 19.966 |
| MQ-CD | 0.574 | 0.885 | 1.855 | 1.940 | 4.266 | 3.761 |
| MQGWR | 0.549 | 1.096 | 12.609 | 12.938 | 16.086 | 19.821 |


Table 5.8. Mean values of the RB (%) in the spatial scenarios with 5% non-symmetric outlier contamination

| Estimator | [0,0] (areas 1–100) | [0,u] (areas 1–95) | [e,0] (areas 1–100) | [e,u] (areas 1–95) | [0,u] (areas 96–100) | [e,u] (areas 96–100) |
|---|---|---|---|---|---|---|
| GREG | 0.002 | 0.013 | -0.008 | 0.009 | -0.006 | -0.125 |
| EBLUP | 0.006 | 0.054 | 0.006 | 0.205 | -0.802 | -3.403 |
| SEBLUP | 0.005 | 0.059 | 0.003 | 0.216 | -0.892 | -3.629 |
| REBLUP | 0.006 | 0.005 | -0.771 | -0.718 | -0.573 | -1.404 |
| SREBLUP | 0.005 | 0.071 | -0.785 | -0.709 | -0.823 | -1.707 |
| MQ | 0.007 | 0.046 | -0.765 | -0.708 | -1.407 | -1.083 |
| MQ-CD | 0.001 | 0.012 | -0.009 | 0.008 | -0.130 | -0.135 |
| MQGWR | 0.02 | 0.014 | -0.766 | -0.702 | -0.874 | -1.075 |

5.5 Remarks and findings

Analysis of the literature on SAE does not reveal a dominant robust estimator among the REBLUP, the MQ and the MQ-WR. What emerges is the inefficiency of the EBLUP when there are outliers in the data. In the light of the studies described in the literature, the practitioner is advised to use one of the suggested robust estimators – REBLUP or MQ-WR – if there is evidence or a suspicion of the presence of outliers in the data.

Sensitivity of small-area estimators to outliers

• EBLUP is inefficient in the presence of outliers.
• REBLUP and MQ-WR are to be preferred when there is evidence of outliers.
• Neither REBLUP nor MQ-WR dominates the other.


6. The Complexity of Sample Design

6.1 Introduction

This section considers the problem of sample design in SAE. First, the effect of a design on the estimators is discussed through a design-based simulation on real agricultural data; second, two alternative small-area estimators are suggested that take the sample design into account.

SAE techniques generally focus on model-based and model-assisted estimators. The most commonly used model-based small-area estimators do not make use of sample weights, and they are not design-consistent unless the sampling design is self-weighting within areas. But design consistency is a desired property for a model-based estimator in that it guarantees that estimates make sense, at least for large domains, even if the model fails.

With regard to the effect of a particular sampling system on small-area estimators, there are two categories of design: ignorable and non-ignorable (Sugden and Smith, 1984; see Rubin, 1987 for a discussion of ignorability). In the field of small-area research, a design is considered ignorable if all the variables contributing to the calculation of sampling weights are included in the model, and non-ignorable otherwise. Hence as far as SAE methods are concerned the design itself does not matter – the only issue is whether it is ignorable or non-ignorable.

The effect of non-ignorable sample designs on SAE is assessed in the following paragraphs, and alternative estimators are presented that take sample design into account. In particular, the expansion (see 6.2.1), GREG, pseudo-EBLUP and weighted-MQ small-area estimators are considered. The effect of ignorable and non-ignorable designs on these estimators is evaluated in a design-based simulation based on real data – see Fabrizi et al. (2013), You and Rao (2002) and Särndal (1982).

The first part of this section introduces some design-consistent small-area estimators. This is followed by a design-based simulation study to assess the effect of the designs on small-area estimators.

6.2 Design-consistent small-area estimators

Suppose that a population U of size N is partitioned into m subsets Ui – domains of study or areas – of size Ni, i = 1,…,m. The population units are identified by j and the small areas by i. The population data consist of values yij of the variable of interest, and values xij of a vector of p auxiliary variables that include the constant term as the first component. Suppose that a sample s is drawn according to some possibly complex sampling design such that the inclusion probability of unit j in area i is given by πij, and that area-specific samples si ⊆ Ui of size ni ≥ 0 are available for each area or domain. Note that it is possible to have non-sample areas, so ni = 0, in which case si is the empty set. The set ri ⊆ Ui contains the Ni − ni indices of the non-sampled units in small area i. Values of yij and xij are known only for sampled units; for the p-vector of auxiliary variables it is assumed that the area-level totals Xi, or the corresponding means, are accurately known from external sources.

6.2.1 Expansion estimator

The expansion estimator, also known as the Horvitz and Thompson (HT) estimator (Horvitz and Thompson, 1952), is defined as:

(6.1)


A popular variance estimator for (6.1) is

(6.2)

which involves the joint probability of including both units j and k of area i in the sample si. Many alternative estimators of the variance are available (see Särndal et al., 2003).
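As an illustration of the expansion principle, the sketch below weights each sampled value by the inverse of its inclusion probability πij and divides by the area population size Ni. This particular form (known Ni in the denominator, rather than an estimated size) is an assumption of the example, as are the function name `expansion_mean` and the toy data.

```python
def expansion_mean(y, pi, area_size):
    """Expansion (Horvitz-Thompson type) estimate of an area mean.

    y         -- sampled values of the target variable in the area
    pi        -- inclusion probabilities of the same units
    area_size -- population size N_i of the area, assumed known
    """
    if len(y) != len(pi):
        raise ValueError("one inclusion probability per sampled unit is required")
    weighted_total = sum(value / prob for value, prob in zip(y, pi))
    return weighted_total / area_size


# Hypothetical area with N_i = 40 units, three of which are sampled
area_mean = expansion_mean([12.0, 20.0, 8.0], [0.1, 0.2, 0.05], 40)
print(area_mean)
```

Units with a small inclusion probability receive a large weight, which is why this estimator can be unstable when the area-specific sample size is small.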

6.2.2 Modified GREG estimators

The GREG estimators have the following structure:

(6.3)

where is the estimate of the mean in area i. The class of estimators in (6.3) changes according to the model used to fit the target variable. The most popular choice for fitting the target variable is the linear regression model:

(6.4)

where and (see Rao, 2003 section 2.5). This is the linear version of the GREG. By generalizing the model on which the linear GREG is based, different alternative estimators can be obtained such as a GREG based on a random-intercept model. This provides the advantage of taking between-area variability into account (Lehtonen and Veijanen, 1999):

(6.5)

where the parameters and ui are estimated by generalized least squares and restricted maximum likelihood (see Lehtonen and Pahkinen, 2004 section 6.3). MSE estimation is also suggested in Lehtonen and Pahkinen (2004).

6.2.3 The Pseudo-EBLUP

The pseudo-EBLUP is a design-consistent small-area estimator for the area mean proposed by You and Rao (2002). It is based on the random-intercept regression model, with the assumption that the sample design is ignorable given the auxiliary variables included in the model. The design-consistent pseudo-EBLUP estimator of the i-th area mean is then given by:

(6.6)

where , , , and are the regression coefficients vector and the area effect estimates from the fitting of a random-intercept model; , ,

, , and are the estimates of the variance component of the random-intercept model obtained, for example, by the restricted maximum likelihood method. Prasad and Rao (1999) and You and Rao (2002) provided formulae for the model-based MSE associated with the pseudo-EBLUP estimators of the area mean. An alternative similar design-consistent estimator was proposed by Jiang and Lahiri (2006).


6.2.4 Weighted MQ estimators

Using the M-quantile approach to small-area estimation, Fabrizi et al. (2013) proposed a design-consistent small-area estimator of the mean:

(6.7)

in which the regression coefficient vector is estimated according to the MQ linear model accounting for the sample weights; in particular, for a quantile q where X is the n × p design matrix of auxiliary variables, y is the n-vector of the sample y values, W is the diagonal sampling weight matrix of order n and C(q) is a diagonal matrix of order n defined by the weights obtained from the iteratively re-weighted least squares algorithm used to fit the design-weighted M-quantile regression coefficient at q (see Fabrizi et al., 2013 for details). The estimator in (6.7) was proved to be design-consistent under some assumptions by Fabrizi et al. (2013). It offers several advantages with respect to the GREG estimator, given that the use of an area-specific coefficient in MQ regression accounts for area characteristics that are not explained by the auxiliary variables. The use of M-estimation offers outlier-robust estimation.

An analytic MSE estimator was proposed by Fabrizi et al. (2013), but it underestimated the actual mean squared error of (6.7), particularly when the overall sample size is not moderate and the sampling variance of the yij and xij does not dominate the variance associated with the uncertainty in estimating . Alternative estimators of the MSE of (6.7) based on the bootstrap were also proposed by Fabrizi et al. (2013).

6.3 Simulation study of the impact of ignorable and non-ignorable designs

A simulation study carried out to assess the effect of a design on small-area estimates is based on a real dataset from the Australian Agricultural and Grazing Industries Survey. A sample of 1,652 broad-acre farms in 29 regions is studied. A population of N = 81,982 farms is generated by bootstrapping the original survey sample: that is, the 1,652 farms in the original sample are themselves sampled, with replacement, using selection probabilities proportional to a farm's survey sample weight, where the sum of survey sample weights is equal to 81,982 (Fabrizi et al., 2013). Because the interest is in the design-based properties of estimators, this population is kept fixed and repeatedly sampled according to a sampling design. To assess the impact of ignorable and non-ignorable designs on the small-area estimates, a comparison between the bias and MSE of the proposed estimators and the bias and MSE of selected alternatives is carried out.
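The population-generation step described above (resampling the survey units with replacement, with probabilities proportional to their survey weights) can be sketched with the standard library. The toy sample, the function name `bootstrap_population` and the seed below are hypothetical; only the resampling principle is taken from the text.

```python
import random


def bootstrap_population(sample, weights, size, seed=0):
    """Draw `size` units with replacement, with probability proportional to weight."""
    if len(sample) != len(weights):
        raise ValueError("one weight per sampled unit is required")
    rng = random.Random(seed)
    return rng.choices(sample, weights=weights, k=size)


# Toy survey sample: (farm id, survey weight); the weights sum to the
# size of the synthetic population to be generated
farms = [("farm_a", 50.0), ("farm_b", 30.0), ("farm_c", 20.0)]
units = [farm_id for farm_id, _ in farms]
weights = [weight for _, weight in farms]

population = bootstrap_population(units, weights, size=int(sum(weights)))
print(len(population))
```

Once generated, the synthetic population is held fixed, so that repeated samples drawn from it reflect only the variability induced by the sampling design.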

6.3.1 Description of the simulation experiment

The synthetic survey population consists of 15 variables for 81,982 farms. In this simulation, farms define the lower level (level 1) and the 29 Australian regions define the small areas of interest (level 2). The size of the regions in terms of farms ranges from 79 to 10,930. The target variable is total cash costs (TCC; Y), that is, payments made by the business for materials and services and for permanent and casual labour, excluding owner-managers, partners and family labour; its distribution shows strong positive skewness. For each farm, two auxiliary variables (X) are available: the total revenues received by the business during the financial year (TTR) and the total area of the farm in hectares (FarmArea). A group of six binary variables is also available for each farm, cross-classifying farms by climatic zone and size (SizeZone). The six levels of SizeZone are defined as: i) pastoral zone, and area of 50,000 ha or less; ii) pastoral zone, and area of more than 50,000 ha; iii) wheat/sheep zone, and area of 1,500 ha or less; iv) wheat/sheep zone, and area of more than 1,500 ha; v) high-rainfall zone, and area of 750 ha or less; and vi) high-rainfall zone, and area of more than 750 ha. Three sets of auxiliary variables are used to create three models with different values of R², calculated by fitting an ordinary linear-regression model: i) weak linear relationship between Y and X1 = [SizeZone], characterized by R² = 0.16 (low scenario); ii) medium linear relationship between Y and X2 = [SizeZone, FarmArea], characterized by R² = 0.40 (medium scenario); and iii) strong linear relationship between Y and X3 = [SizeZone, TTR], with R² = 0.90 (high scenario).

To check model diagnostics and the characteristics of the synthetically generated population, a two-level mixed model with area-specific random effects is fitted in the different scenarios, using the population data. In all cases, analysis of the residuals shows that the normality assumption fails; the lack of normality for the model residuals is probably caused by several outliers in the regions. This situation can penalize the GREG and pseudo-EBLUP estimators, but it represents a realistic agricultural scenario.

Samples are selected according to a fixed size unequal probability without replacement sampling design, using the maximum entropy method (Tillé, 2006 chapter 5). The sample size is set at 578, corresponding approximately to a 0.7 percent sampling rate. Two alternative sets of inclusion probabilities are defined to be proportional to two size variables Z: i) livestock – beef, sheep and wool; and ii) a uniform variable on the interval (1, 20). In case i), πj = 0.2 × zj + 0.05 was defined for all j in U to minimize the number of inclusion probabilities equal to 1 (Fabrizi et al., 2013).
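As an illustration of how first-order inclusion probabilities proportional to a size variable can be obtained for a fixed sample size, the sketch below uses the usual iterative rule: units whose probability would reach 1 are taken with certainty and the remaining sample size is redistributed over the other units. It is an assumption-laden sketch: the function name `inclusion_probabilities` is hypothetical, the maximum entropy selection step itself is not implemented, and the rule shown is the generic one rather than the specific transformation of zj used by Fabrizi et al. (2013).

```python
import numpy as np

def inclusion_probabilities(z, n):
    """First-order inclusion probabilities proportional to size z for a
    fixed sample size n; probabilities are capped at 1 iteratively."""
    z = np.asarray(z, dtype=float)
    pi = np.zeros_like(z)
    certain = np.zeros(len(z), dtype=bool)    # units taken with certainty
    while True:
        n_left = n - certain.sum()
        pi[~certain] = n_left * z[~certain] / z[~certain].sum()
        over = (~certain) & (pi >= 1.0)
        if not over.any():
            break
        certain |= over
    pi[certain] = 1.0
    return pi

# One very large unit is selected with certainty; the others share the rest
pi = inclusion_probabilities([100, 1, 1, 1, 1], n=2)
weights = 1.0 / pi    # design weights are the inverse inclusion probabilities
```

The probabilities always sum to the sample size n, which is what a fixed-size design requires.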

The design is non-ignorable when, conditionally on the covariates X, livestock is used as a size variable. The correlation computed on the population between TCC and livestock given X1 is equal to 0.24. When conditioning on X2 the correlation falls to 0.16, and on X3 to 0.11. The design would become ignorable if livestock were included as a covariate in the model (Fabrizi et al., 2013). This option is not considered here because the aim of the simulation is to mimic situations where not all design variables are available to the analyst. When inclusion probabilities are generated proportional to the uniform variable, the design is ignorable given X. The first scenario is called a non-ignorable design; the second is called an ignorable design.

The estimators compared are the weighted MQ (WMQ), the HT, the pseudo-EBLUP, the GREG-S and the GREG-LV (see pages 27 and 28).

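The Monte Carlo experiment consists of drawing R = 5,000 samples from this population and calculating small-area estimates of the mean of TCC. The performance of the small-area estimators is evaluated using the relative bias (RB) and the relative root mean squared error (RRMSE) of the estimates of the small-area means. The RB for small area i is computed as:

\[ \mathrm{RB}_i = \frac{1}{R} \sum_{r=1}^{R} \frac{\hat{m}_{ir} - m_i}{m_i}, \tag{6.8} \]

and the RRMSE for area i is computed as:

\[ \mathrm{RRMSE}_i = \frac{1}{m_i} \sqrt{\frac{1}{R} \sum_{r=1}^{R} \left( \hat{m}_{ir} - m_i \right)^2}. \tag{6.9} \]

In (6.8) and (6.9) the subscript r = 1,…,R indexes the Monte Carlo replications, i indexes the areas, \(\hat{m}_{ir}\) is the estimate for area i in replication r and \(m_i\) represents the true value of the parameter (the mean) in area i.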
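The two performance measures can be computed from a matrix of simulated estimates as in the sketch below. The function name `rb_rrmse` and the toy numbers are illustrative assumptions; results are expressed in percent, as in the tables of this section.

```python
import numpy as np

def rb_rrmse(estimates, truth):
    """estimates: (R, m) array of Monte Carlo estimates for m areas.
    truth: length-m array of true area means.
    Returns the per-area relative bias and relative RMSE, in percent."""
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    err = estimates - truth                  # broadcast over replications
    rb = 100 * err.mean(axis=0) / truth
    rrmse = 100 * np.sqrt((err ** 2).mean(axis=0)) / truth
    return rb, rrmse

# Two replications, two areas with true means 10 and 20
est = np.array([[9.0, 21.0],
                [11.0, 19.0]])
rb, rrmse = rb_rrmse(est, truth=[10.0, 20.0])
# rb is [0, 0]; rrmse is [10, 5]
```

Averaging `rb` and `rrmse` over areas gives summary figures of the kind reported for each estimator and scenario.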

6.3.2 Simulation results

Table 6.1 shows results for the mean RB and the RRMSE for the three model scenarios and for the two possible designs, ignorable and non-ignorable. Results are, as usual, averaged over areas and simulations.

The results in Table 6.1 show that the proposed design-consistent small-area estimators are approximately unbiased. In particular, the WMQ and the HT estimator (labelled Expansion in the table) show very little bias even when the model has a poor fit (the low scenario), whereas the other small-area estimators show some bias in both the ignorable and the non-ignorable cases. As expected, the HT estimator has a very large RRMSE with respect to the WMQ and the pseudo-EBLUP, particularly when the model holds.

Table 6.1. Design-based simulation results; population generated using the AAGIS data

                 Non-ignorable design        Ignorable design
Predictors       Low     Medium  High        Low     Medium  High

Average RB%
WMQ               1.37   -0.41    0.84      -0.67   -0.58   -0.59
Pseudo-EBLUP      9.46    2.60    0.84       6.80    3.06   -0.43
GREG-S            4.56   -0.21    1.54       1.57   -1.23   -0.22
GREG-LV          11.78    4.56    1.82       1.36   -0.96   -0.42
Expansion         1.04    1.04    1.04      -0.91   -0.91   -0.91

Average RRMSE%
WMQ              32.07   26.79   15.09      28.75   21.72   15.94
Pseudo-EBLUP     37.18   35.46   15.33      28.71   24.88   15.32
GREG-S           35.74   35.59   16.73      30.93   25.40   16.86
GREG-LV          50.73   38.70   24.43      33.06   24.38   18.56
Expansion        40.35   40.35   40.35      35.43   35.43   35.43

Results show the RB% and the RRMSE% averaged over areas in the three model scenarios.

Results for estimators that are not design-consistent such as the EBLUP (Rao, 2003) and the MQ (Chambers and Tzavidis, 2006) are not given here, but they show greater bias, particularly when the model does not hold – the low scenario – in the case of the non-ignorable design; they do show good results for the ignorable design, however, as expected.

The WMQ estimator performs best in terms of RRMSE for the non-ignorable design; this is particularly evident for the low and medium scenarios, whereas in the high scenario its advantage over the pseudo-EBLUP is small. The GREG-S is generally less efficient than the pseudo-EBLUP and WMQ because it does not allow for area-specific regression coefficients.

Ignorable designs can be handled using EBLUP-based and MQ-based estimators that do not use survey weights.

6.4 Investigating the impact of sampling designs on data interpolation

This section investigates the effect of the sample design on interpolation methods. A brief introduction on the effect of the design on data interpolation is followed by a simulation experiment to assess the impact of three sample designs (simple random, two-stage and stratified two-stage) on two interpolation models, the GWR and ordinary kriging.

6.4.1 A short introduction about the design effect on data interpolation

In the mathematical field of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points. In real applications, the known data points are usually a sample of a finite or infinite population. The way in which the sample is drawn is known as the sample design. Sampling weights weigh sample data to correct for the disproportionality of the sample with respect to the target population of interest; they reflect unequal sample inclusion probabilities and compensate for differential non-response and frame under-coverage. They are routinely included in survey data files released to analysts.

Sampling weights can be vital in two aspects of the modelling process: i) they can be used to test and protect against non-ignorable sampling designs that could cause selection bias; and ii) they can be used to protect against mis-specification of the model holding in the population.

When the design is non-ignorable, the estimation process for the model parameter should take sample weights into account. When a model is chosen to interpolate a set of sampled points, a desirable property of the model parameters is their design consistency. In classical statistics theory, consistency refers to the limiting behaviour of a sample statistic as the sample size is increased to infinity: hence defining the concept of consistency in finite population sampling requires that the population size is also allowed to increase. This raises the question, however, of a suitable formulation of the way in which the population and the sample increase such that their structure is preserved.

A sample statistic ts(n) is said to be design-consistent for a descriptive population quantity T(N) if

\[ \operatorname*{plim}_{n \to \infty,\; N \to \infty} \left[ t_s(n) - T(N) \right] = 0, \]

where “plim” stands for “limit in probability” under the randomization distribution, n is the sample size and N is the population size (Pfeffermann, 1993).

Why is design consistency a desirable property for an estimator? The answer is robustness. If the model holds in the population and the estimation technique used yields a corresponding descriptive population quantity that is consistent under the model, then as the population size increases the corresponding descriptive population quantity will converge to the model parameter. The following paragraphs give a short definition of the corresponding descriptive population quantity.

Let the population values Y be generated from a distribution indexed by a vector θ of unknown parameters. Let U(Y, θ) = 0 define a set of estimating equations for θ obtained by a given estimation rule. The solution T(Y) such that U(Y, T(Y)) = 0 is the corresponding descriptive population quantity for θ under that rule.

When the sample is selected by simple random sampling, the model holding for the sample data is the same as the model holding in the population before sampling. With the complex sampling designs often used in practice, however, the two models can be very different, and failure to account for the sample selection process might bias the inference on the target parameters. Incorporating the sampling weights in the analysis is the preferred way of dealing with the effects of the design (Pfeffermann, 1993; Kish, 1990). In some cases, the sample design becomes ignorable. The definition of ignorability and the conditions in which the design is ignorable are discussed in the literature, for example by Little (1982), Rubin (1976), Scott (1977) and Sugden and Smith (1984).

The ignorability of the design refers to the information provided by the selection scheme beyond what is already provided by the design variables. The ignorability conditions (Rubin, 1976) are clearly satisfied in sampling schemes that depend only on the design variables. Because analysts often do not know all the design variables, Sugden and Smith (1984) explore the conditions under which a sampling scheme that depends only on the design variables is ignorable, given partial information on the design.

The ignorability of the sampling design depends on the design and the available design information, and also on the model and the parameters of interest. Hence if the regressor variables in a regression model include all the design variables, the sampling design is ignorable for estimating the regression model. If, however, the design variable values are only known for units in the sample, the sampling design is non-ignorable for estimating the unconditional mean and variance of the regression dependent variables. This last case is not relevant in this section.


In a study of the effects of ignoring the sample selection process when fitting models to survey data, Skinner et al. (1989) conclude that failure to account for all the important design variables, or incorrectly specifying the conditional distribution of the survey variables given the design information, can have severe effects on the inference process. Frequently, however, the analyst has only limited knowledge about the actual sampling process: in such cases, the sampling weights come into play. Estimators of model parameters are modified so that they are design-consistent for the corresponding descriptive population quantity in the finite population from which the sample was drawn. Consider, for example, as a descriptive population quantity the B parameter of the simple linear model yi = xiB + ei, i = 1,…,N, where yi is the variable of interest for population unit i, xi is the vector of p auxiliary variables for unit i and ei is the error term that fulfils the standard assumptions of the linear model; N is the size of the population. The descriptive population quantity B is then:

\[ B = \left( \sum_{i=1}^{N} x_i^{T} x_i \right)^{-1} \sum_{i=1}^{N} x_i^{T} y_i. \tag{6.10} \]

When the design is non-ignorable, the ordinary least squares estimator of B is not design-consistent, so a different estimator of B is defined using the sample weights:

\[ \hat{B}_w = \left( \sum_{i \in s} w_i x_i^{T} x_i \right)^{-1} \sum_{i \in s} w_i x_i^{T} y_i, \tag{6.11} \]

that is, the solution of the equation \(\sum_{i \in s} w_i x_i^{T} (y_i - x_i B) = 0\). Here, s is the set of sample units drawn from the population of interest following a complex design, and wi is the sample weight for unit i.

For more sophisticated models the principle is the same: when the design is non-ignorable the sample weights should be included in the estimation process.
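The difference between the unweighted and the weighted least squares estimators under a non-ignorable design can be illustrated with a small simulation. This is a sketch under stated assumptions rather than part of the original study: the population, the logistic selection rule and the use of Poisson sampling are all hypothetical choices made to keep the example short, and the helper names `census_B` and `weighted_B` are illustrative.

```python
import numpy as np

def census_B(X, y):
    """Descriptive population quantity B: least squares on the whole population."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def weighted_B(X, y, w):
    """Weighted least squares: solves sum_i w_i x_i'(y_i - x_i B) = 0."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

rng = np.random.default_rng(42)
N = 20000
x = rng.uniform(0, 10, N)
e = rng.normal(0, 2, N)
y = 1.0 + 0.5 * x + e
X = np.column_stack([np.ones(N), x])

# Non-ignorable selection (assumed): the inclusion probability grows with the
# error term, so units with large y given x are over-represented in the sample.
pi = 0.02 + 0.16 / (1 + np.exp(-e))
s = rng.random(N) < pi                 # Poisson sampling, for simplicity
w = 1.0 / pi[s]                        # design weights

B = census_B(X, y)
B_ols = weighted_B(X[s], y[s], np.ones(s.sum()))   # ignores the design
B_w = weighted_B(X[s], y[s], w)                    # uses the sample weights
```

In this setup the unweighted intercept drifts away from the census value because selection depends on the error term, whereas the weighted estimate stays close to B.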

6.4.2 A simulation experiment to assess the impact of the design effect on spatial interpolation

This section describes a simulation experiment to assess the impact of simple random sampling, two-stage cluster sampling and stratified two-stage cluster sampling designs on spatial interpolation models such as the GWR and ordinary kriging. The experiment consists of generating populations with different spatial structures and drawing samples using different designs. For each sample, model parameters are estimated and predictions are made for all population units using GWR and ordinary kriging models. The effect of the design on the performance of the model is evaluated with the bias and the RMSE of the predicted values. First, the focus is on ignorable sample designs.

The experiment settings are inspired by the work of Crainiceanu et al. (2005), Marley and Wand (2010) and Bocci and Rocco (2011). Three different populations are generated and kept fixed in the simulations. Each population is generated using the following model:

where , , , s is an vector that represents the spatial location generated by a different spatial point process in each population, the function is obtained as a bivariate normal mixture density and

with . The bivariate normal mixture density is obtained as the weighted mean of the following bivariate normal density:


,

with weights 3/7 for Wa and Wb and 1/7 for Wc. The resulting density is shown in Figure 6.1.

Figure 6.1 Bivariate normal mixture density

Each population has N = 3,000 units located in the unit square O = [0,1] × [0,1]. This square is divided into four equal squares: O1 = [0,0.5] × [0,0.5], O2 = [0,0.5] × [0.5,1], O3 = [0.5,1] × [0,0.5] and O4 = [0.5,1] × [0.5,1]. The auxiliary variable (x) is common to the three populations, as is the individual error term. The difference between the three populations lies in the spatial point process used to generate the location of the units and, consequently, the values of the target variable. The three generating processes are: A: non-homogeneous Poisson process on O; B: non-homogeneous Poisson process on Oi, i = 1,2,3,4; and C: Matérn cluster process on Oi, i = 1,2,3,4. The surface O is divided into 64 clusters and four strata so that simple random, two-stage cluster and stratified two-stage cluster sample designs can be carried out. Figure 6.2 shows the three populations, the strata and the clusters.


Figure 6.2 Spatial distribution of the population units

A: non-homogeneous Poisson process on O region. B: non-homogeneous Poisson process in Oi regions (i=1,…,4). C: Matern cluster process on Oi regions (i=1,…,4). Red lines identify the strata and black lines identify the clusters. The Oi regions are not drawn.

In each population, the strata contain 9, 15, 15 and 25 clusters. Table 6.1 shows the distribution of the units across the clusters in each generated population. Populations A and B are quite similar in terms of units in each cluster, whereas population C has a high concentration of units in the clusters, 25 percent of which have more than 93 units.

Table 6.1 Distribution over clusters of the population A, B and C and number of void clusters

                1st quartile   Median   Mean    3rd quartile   Void clusters
Population A    18.75          34.50    46.88   67.00          0
Population B    26.00          40.50    46.88   59.00          0
Population C    16.50          47.50    60.00   93.00          14

The Poisson process and the Matern cluster process used to generate unit locations are similar to those described in Bocci and Rocco (2011).

For each population of N = 3,000 units, three sample designs have been carried out:

1. Simple random sampling without replacement: 150 units are drawn from the target population.

2. Two-stage sampling: a simple random sample of 30 clusters is drawn from the 64 clusters, and a simple random sample of 5 units is drawn from each cluster; void clusters, which are present only in population C, are dropped from the sample and replaced randomly with another cluster until a non-void cluster is sampled.

3. Stratified two-stage sampling: each population is divided into 4 spatial strata; the strata are fixed for populations A, B and C. Then in each stratum j (j = 1,…,4), kj clusters are sampled, and from each sampled cluster hj units are selected. The selection of clusters and units is obtained with simple random sampling. Void clusters are treated as described in (2). The clusters and units drawn from each population are:

A: k = (4, 7, 7, 20) and h = (5, 5, 5, 3)
B: k = (4, 7, 7, 12) and h = (5, 5, 5, 5)
C: k = (4, 7, 12, 15) and h = (10, 5, 1, 4)
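The stratified two-stage design in (3), together with the resulting design weights, can be sketched as follows. Several elements are assumptions made for illustration: the function name `stratified_two_stage` is hypothetical, the toy population is uniform on the unit square with a regular 8 × 8 grid of clusters (so each stratum holds 16 clusters rather than the 9, 15, 15 and 25 used in the experiment), and void clusters are handled by sampling only from non-void clusters, which simplifies the replacement rule described above.

```python
import numpy as np

def stratified_two_stage(stratum, cluster, k, h, seed=0):
    """stratum, cluster: integer labels for each population unit.
    k[j], h[j]: clusters per stratum and units per cluster in stratum j.
    Returns sampled unit indices and their design weights."""
    rng = np.random.default_rng(seed)
    stratum = np.asarray(stratum)
    cluster = np.asarray(cluster)
    idx, wts = [], []
    for j in np.unique(stratum):
        in_j = stratum == j
        clusters_j = np.unique(cluster[in_j])       # non-void clusters only
        k_j = min(k[j], len(clusters_j))
        for c in rng.choice(clusters_j, size=k_j, replace=False):
            units = np.flatnonzero(in_j & (cluster == c))
            h_c = min(h[j], len(units))
            chosen = rng.choice(units, size=h_c, replace=False)
            # weight = 1 / (P(cluster selected) * P(unit selected | cluster))
            w = (len(clusters_j) / k_j) * (len(units) / h_c)
            idx.extend(chosen)
            wts.extend([w] * h_c)
    return np.array(idx), np.array(wts)

# Toy population: 3,000 uniform points, 4 quadrant strata, 8 x 8 grid clusters
rng = np.random.default_rng(1)
N = 3000
xy = rng.random((N, 2))
cluster = (np.floor(xy[:, 0] * 8) * 8 + np.floor(xy[:, 1] * 8)).astype(int)
stratum = (2 * (xy[:, 0] >= 0.5) + (xy[:, 1] >= 0.5)).astype(int)
idx, w = stratified_two_stage(stratum, cluster, k=[4, 7, 7, 12], h=[5, 5, 5, 5])
```

The weights are the inverse of the first-order inclusion probabilities, so their sum is an estimate of the population size N.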

The allocation of the sample in the clusters follows the proportions in Bocci and Rocco (2011). For each sample, GWR and ordinary kriging models are estimated, and the target variable y is predicted for all N = 3,000 population units. The GWR model parameters are estimated with and without the sample weights, which are, as usual, the multiplicative inverse of the first-order inclusion probabilities. The prediction is made considering the x values and the location (s) of all population units as known. The structures of the GWR and ordinary kriging models are:


The Monte Carlo experiment has been carried out with L = 1,000 replications.
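One possible way of fitting a GWR with and without sample weights is sketched below. It should be read as an assumption-laden illustration, not as the specification used in the experiment: the Gaussian kernel, the fixed bandwidth, the single covariate and the choice of multiplying the kernel weights by the sample weights are all assumptions, and the function name `gwr_predict` is hypothetical.

```python
import numpy as np

def gwr_predict(coords_s, x_s, y_s, coords_new, x_new, bandwidth, w=None):
    """Predict y at new locations with a geographically weighted regression
    y = b0(s) + b1(s) * x, fitted by kernel-weighted least squares.
    If sampling weights w are given, they multiply the kernel weights."""
    coords_s = np.asarray(coords_s, dtype=float)
    coords_new = np.asarray(coords_new, dtype=float)
    x_s = np.asarray(x_s, dtype=float)
    y_s = np.asarray(y_s, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    w = np.ones(len(x_s)) if w is None else np.asarray(w, dtype=float)
    X = np.column_stack([np.ones(len(x_s)), x_s])
    preds = np.empty(len(coords_new))
    for i, s0 in enumerate(coords_new):
        d2 = ((coords_s - s0) ** 2).sum(axis=1)      # squared distances to s0
        kern = np.exp(-0.5 * d2 / bandwidth ** 2)    # Gaussian kernel (assumed)
        XtA = X.T * (kern * w)                       # X' K(s0) W
        beta = np.linalg.solve(XtA @ X, XtA @ y_s)   # local coefficients at s0
        preds[i] = beta[0] + beta[1] * x_new[i]
    return preds

# Toy data: an exactly linear surface is reproduced with or without weights
rng = np.random.default_rng(0)
coords = rng.random((100, 2))
x = rng.uniform(0, 1, 100)
y = 2.0 + 3.0 * x
w = rng.uniform(1, 20, 100)
new_coords = rng.random((10, 2))
new_x = rng.uniform(0, 1, 10)
pred_gwr = gwr_predict(coords, x, y, new_coords, new_x, bandwidth=0.2)
pred_gwr_w = gwr_predict(coords, x, y, new_coords, new_x, bandwidth=0.2, w=w)
```

With equal sample weights the two fits coincide, which is consistent with the GWR and GWR-W giving the same results under simple random sampling.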

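The performance of the interpolation models, namely Geographically Weighted Regression (GWR), Geographically Weighted Regression with sample Weights (GWR-W) and ordinary kriging, is evaluated in terms of bias and RMSE as follows:

\[ \mathrm{BIAS} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{L} \sum_{l=1}^{L} \left( \hat{y}_{il} - y_i \right), \qquad \mathrm{RMSE} = \frac{1}{N} \sum_{i=1}^{N} \sqrt{\frac{1}{L} \sum_{l=1}^{L} \left( \hat{y}_{il} - y_i \right)^2}, \]

where \(y_i\) is the true value of unit i and \(\hat{y}_{il}\) is the predicted value for unit i in replication l under GWR, GWR-W or ordinary kriging.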

Tables 6.2, 6.3 and 6.4 show the results for each population and each sample design.

Table 6.2 Results of the Monte Carlo experiment for Population A

          BIAS                        RMSE
          GWR      GWR-W    KRIG      GWR      GWR-W    KRIG
SRS*       0.027    0.027    0.004    0.332    0.332    0.236
TS**       0.009    0.021   -0.003    0.445    0.452    0.325
STS***    -0.001    0.016   -0.006    0.388    0.398    0.274

* Simple random sampling
** Two-stage sampling
*** Stratified two-stage sampling

Table 6.3 Results of the Monte Carlo experiment for Population B

          BIAS                        RMSE
          GWR      GWR-W    KRIG      GWR      GWR-W    KRIG
SRS        0.026    0.026    0.005    0.332    0.332    0.238
TS         0.021    0.030    0.003    0.409    0.413    0.299
STS        0.015    0.026    0.001    0.374    0.381    0.268

Table 6.4 Results of the Monte Carlo experiment for Population C

          BIAS                        RMSE
          GWR      GWR-W    KRIG      GWR      GWR-W    KRIG
SRS       -0.001   -0.001    0.000    0.311    0.311    0.241
TS         0.048    0.029    0.030    0.375    0.383    0.297
STS        0.035    0.012    0.019    0.332    0.368    0.252


Results for populations A and B are similar, showing the superior performance of ordinary kriging in terms of bias and RMSE. In terms of bias, kriging is not influenced by the design, whereas in terms of RMSE it is: Tables 6.2 and 6.3 show a higher RMSE in the two-stage sample design for ordinary kriging. The GWR and GWR-W show the same results under simple random sampling, as expected. In the two-stage and stratified two-stage designs, results for the GWR and GWR-W are similar, with the GWR slightly dominant. This result should not be a surprise, given that the designs are ignorable. In terms of bias, the GWR gains in the two-stage and stratified two-stage sample designs. The GWR-W behaves in the same way as the GWR for population A, whereas for population B its performance in terms of bias is similar in the three designs. For populations A and B, the RMSE of the GWR and GWR-W increases from simple random to stratified two-stage to two-stage sampling. The conclusion is that complex designs negatively affect this interpolation method in the cases analysed.

Population C – where units are more clustered – shows slightly different results. Ordinary kriging is superior with respect to GWR and GWR-W in terms of RMSE, but in terms of bias there are similar results in simple random and two-stage sampling, and a small predominance of the GWR-W in stratified two-stage sampling. In this population the GWR-W shows less bias than the GWR and a higher RMSE in the two-stage and stratified two-stage sampling designs. Therefore, as expected, the weights have a positive effect on bias and a small negative effect on RMSE.

It should be noted that only ignorable sample designs are considered here. Within this setting, ordinary kriging is negatively influenced by complex sample designs, whereas the performance of the GWR and GWR-W depends on the design and on the spatial structure of the population. Further investigation is needed into other spatial interpolation models and other population spatial structures; research with a view to including sample weights in spatial interpolation models would be a valuable contribution.

6.5 Remarks and findings

It is clear that design-consistent small-area estimators improve upon the efficiency of traditional estimators (the EBLUP and MQ) when the sample design is non-ignorable. It is strongly recommended that one of the estimators presented above be adopted to obtain small-area estimates in cases where a design is non-ignorable.

A common problem in agricultural statistics is the presence of outliers. In such cases the use of the WMQ estimator is recommended, or other robust estimators that are design-consistent.

Specific sampling designs do not significantly influence the behaviour of small-area design-consistent estimators; what does influence the estimates is large variation in survey weights (Gelman, 2007). Münnich and Burgard (2012) assess the effects of large variation in survey weights on some small-area estimators as a result of different sampling designs. Their suggestions agree with the analysis here: i) design-consistent estimators such as the pseudo-EBLUP must be used to reduce the negative effect on the stability of the estimates caused by variability in sample weights; and ii) robust small-area estimators such as the WMQ must be used when outliers are present.

Practitioners should note that a design is considered non-ignorable when no variables contributing to the calculation of sampling weights are included in the model used. In such cases the pseudo-EBLUP estimator should be used if there is no evidence of outliers; if outliers are present in the data, the WMQ estimator should be used. In summary, this section identifies the cases in which the design affects small-area estimates (that is, when the design is non-ignorable), proposes two design-consistent estimators for those cases, and addresses the problem of outliers with the WMQ.

With regard to interpolation, the literature shows that ignoring the sample selection process can lead to failure in the inference process; this occurs when design variables are not included in the model. In such cases sampling weights should be used. Estimators of model parameters are modified to be design-consistent for the corresponding descriptive population quantity in the finite population from which the sample has been drawn.

The simulations show that ordinary kriging is negatively influenced by complex sample designs, whereas the performance of the GWR and GWR-W depends on the design and on the spatial structure of the population. Further investigation is needed, and indeed encouraged.

Effect of sample design on small-area estimators

• When the design is non-ignorable (that is, when all the variables used to obtain sample weights are excluded from the model), small-area design-consistent estimators should be used.
• The influence of the design on design-consistent small-area estimators is related more closely to the variability of sample weights than to the design scheme.
• Pseudo-EBLUP and WMQ estimators are design-consistent small-area estimators.
• WMQ should be preferred when outliers are present.

Effect of the sample design on spatial interpolation

• When the design is non-ignorable, sample weights should be used in the process of estimating parameters.
• Inferences about parameters can fail if sample weights are ignored.


7. Missing Data in Spatial Datasets

7.1 Introduction

This section addresses the issue of missing data in spatial datasets. An introduction to the problem highlights the main definitions and suggestions for handling cases of missing data in general datasets, and gives an overview of the concept of missing information from a measurement error perspective, with a focus on spatial datasets. Given the general relevance of the problem of missing values in spatial datasets, two particular problems are considered: missing data in geographical information, and missing data in study and auxiliary variables.

With regard to the treatment of missing spatial information in statistical models with spatial effects, the effect of missing point locations for out-of-sample units is considered when MQGWR or geo-additive models are used to estimate the parameter of interest in some geographical domains. Point locations are usually available for sampled units, whereas for the population units not included in the sample only the area they belong to is usually known. But if a geostatistical model is to be applied to these data, the missing locations must be filled in. The classic approach is to locate all the units belonging to the same area by the coordinates of the geographical centroid of their area; this solution is an approximation that can affect the final estimates.

This chapter evaluates this effect in two simulation studies. In the first, which is based on the design-based simulation in Chapter 3, the effect on MQGWR small-area estimates is evaluated in terms of bias and variability when the exact location of each unit, known in this case for sampled and out-of-sample units, is replaced by the location of the centroid of its area. In the second, an imputation method recently proposed in the literature on geoadditive models as an improvement on the classic centroid imputation is considered, and its performance is compared with that of the classical approach.

Then, with regard to the general issue of missing data in study variables and auxiliary variables in spatial datasets, some recommendations and case studies from the literature are presented, with a focus on crop yield data and SAR models. This is followed by consideration of a particularly relevant issue: the effect of an informative unit non-response on small-area estimates. As noted by Giusti and Rocco (2010), a possible solution when values for the study variable are missing is to use a weighting approach with a weight function for the response probabilities. Because these probabilities are usually unknown, they need to be estimated. Some of the simulation results proposed by Giusti and Rocco (2010) are presented with a view to evaluating the effect on the small-area mean estimator of the study variable resulting from different missing-data mechanisms –homogeneous or non-homogeneous between areas – and different estimation techniques for the response probabilities such as weighting within cells or the logit model.

7.2 Missing values in datasets: general concepts and solutions

Missing data are a pervasive problem in applied research. Typically, a researcher is interested in analysing the data in a rectangular dataset: a matrix where each row represents a unit (case, observation or subject) and each column represents a variable, which may for example be continuous or categorical, measured for each unit. Conventional statistical methods and software presume that all the values of this matrix are observed. Unfortunately, it is often the case that some values are missing. If data are missing on all the variables for some cases, we have what is commonly known in sample surveys as “unit non-response”, as opposed to “item non-response”, in which some but not all the values of the variables are missing for a given unit.

It has been established that when values are missing in a given dataset, then any method chosen to treat them, or even simply ignoring them, can have a significant effect on the results of the analysis of interest (Little and Rubin, 2002). This is true in any applied field, regardless of the analysis to be carried out on the dataset. Missing data can introduce bias into estimates derived from statistical models, for example (Schafer, 1997; Allison, 2002), and they can cause a loss of information and of statistical power (Little and Rubin, 2002).

The effect of missing data on the methods applied and the subsequent results depends on the pattern of missing data and on the mechanism that led to missing data.

Figure 7.1. Representation of three multivariate missing data patterns: (i) and (ii) monotone, (iii) general and (iv) matching datasets

In a multivariate setting the pattern of the data indicates which values are missing, that is, which of the situations in Figure 7.1 applies. Schemes (i) and (ii) in Figure 7.1 are “monotone” missing-data patterns: (i) has only one variable subject to missingness, (ii) has more variables. Scheme (iii) is a general pattern of missingness where the variables cannot be ordered to obtain a monotone pattern. Scheme (iv) is the typical pattern obtained after datasets have been matched, with some of the variables never jointly observed. In some cases the pattern of missingness can help in deciding how to treat the missing values.

With regard to the missing-data mechanism, it must be noted that no method for handling missing data can be expected to perform well unless there are some restrictions in relation to how the data came to be missing. If we indicate with Y the data in the dataset of interest, they can be partitioned into two parts, observed and missing: Y = (Y_obs, Y_mis). Then let R be the response indicator, which records whether each value of Y is missing or observed. The missing-data mechanism concerns the distribution of R given Y: that is, it specifies a model for the response probabilities. A variable is said to be “missing completely at random” (MCAR) when:

\[ P(R \mid Y_{obs}, Y_{mis}) = P(R), \tag{7.1} \]

that is, when the probability that Y is missing depends neither on the observed data Y_obs nor on the missing values Y_mis. The concept can be generalized to more than one variable with missing data, in which case data are said to be missing completely at random if the probability that any variable is missing cannot depend on any other variable in the model of interest, or on the potentially missing values themselves (Little and Rubin, 2002). For most datasets, the MCAR assumption is unlikely to be precisely satisfied because it requires the strong assumption that the missingness of Y does not depend on the observed variables included in the model. A weaker assumption is the missing-at-random (MAR) hypothesis:


\[ P(R \mid Y_{obs}, Y_{mis}) = P(R \mid Y_{obs}). \tag{7.2} \]

In this case the missingness of Y may depend on the observed data but not on the values of Y itself. As with MCAR, the extension to more than one variable with missing data requires care in stating the assumption (Rubin, 1976), but the basic idea is the same: the probability that a variable is missing may depend on anything that is observed, but it cannot depend on any of the unobserved values of the variables with missing data, even after adjusting for observed values. This means that the MAR hypothesis can be made more plausible by including as many observed variables as possible in the model to be estimated, because in this way the residual dependence of the missingness on Y itself can be reduced or eliminated.

There is an additional technical definition linked to the MAR assumption. The missing-data mechanism – the process whereby missingness was generated – is said to be “ignorable” if the data are MAR and the parameters governing the missing-data mechanism are distinct from the parameters in the model to be estimated. This last condition is usually satisfied in real-world situations, so it is commonplace to use the terms MAR and ignorability interchangeably. As the name suggests, if the missing-data mechanism is ignorable then it is possible to obtain valid optimal estimates of parameters without directly modelling the missing-data mechanism.

A final possibility for the missing-data mechanism is the following: if the MAR assumption is violated, the data are said to be “missing not at random” (MNAR). In this case, the missingness of Y depends on the missing values themselves: a classic example is non-response to personal income questions in sample surveys. When the data are MNAR, the missing-data mechanism is not ignorable, and valid estimation requires that the missing-data mechanism be modelled as part of the estimation process. Because every MNAR situation is different, the model for the missing-data mechanism must be adapted to each situation, which is why standard statistical analysis software usually assumes that the missing-data mechanism is ignorable.
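The practical difference between the three mechanisms can be illustrated with a small simulation. The sketch below is not taken from the report: the data-generating model, the logistic response probabilities and all variable names are assumptions chosen for illustration. It compares the complete-case mean of y with a regression-adjusted mean that uses the fully observed covariate x.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data: x is always observed, y is subject to missingness.
x = rng.normal(size=n)
y = 2.0 + x + rng.normal(size=n)          # true mean of y is 2.0

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Missingness flags (True means y is missing); the models are assumptions.
r_mcar = rng.random(n) < 0.3               # depends on nothing
r_mar = rng.random(n) < sigmoid(x - 1.0)   # depends on the observed x only
r_mnar = rng.random(n) < sigmoid(y - 3.0)  # depends on y itself

def complete_case_mean(y, r):
    """Mean of y over the units where y is observed."""
    return y[~r].mean()

def regression_adjusted_mean(x, y, r):
    """Fit y on x among respondents, then average predictions over all units."""
    slope, intercept = np.polyfit(x[~r], y[~r], 1)
    return (intercept + slope * x).mean()

results = {
    name: (complete_case_mean(y, r), regression_adjusted_mean(x, y, r))
    for name, r in [("MCAR", r_mcar), ("MAR", r_mar), ("MNAR", r_mnar)]
}
for name, (cc, adj) in results.items():
    print(f"{name}: complete-case {cc:.3f}, regression-adjusted {adj:.3f}")
```

In this setting the complete-case mean is close to the truth only under MCAR; conditioning on x repairs the estimate under MAR, whereas under MNAR both estimates remain biased because the response model is not accounted for.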

The methods proposed here for the treatment of data subject to missingness can be grouped into four partially overlapping categories (Little and Rubin, 2002):

i. Procedures based on completely recorded units. These include the “available case analysis” and the “complete case analysis”, which involve discarding the units with missing values from the analysis of interest. These methods can cause a substantial reduction in sample size, and they require the hypothesis of MCAR data. If this is not the case, estimates from the analysis model can be severely biased.

ii. Imputation-based procedures. In these methods the missing values are filled in with plausible values, and standard methods of analysis are applied to the completed dataset. Examples include hot-deck, mean and regression imputation, each of which has its advantages and disadvantages. Imputation is the solution commonly used for item non-response in sample surveys. A possible drawback of applying a model to a dataset completed using single imputation, that is, imputing one value for each missing value, is underestimation of the variability of the resulting estimates; for this reason, Rubin (1987) proposed multiple imputation, which involves imputing more than one value for each missing value.

iii. Weighting procedures. Randomization inference from sample surveys without non-response is usually based on design weights, which are inversely proportional to the probability of selection. Weighting procedures, which modify the design weights to adjust for non-response, represent the class of methods commonly used to treat unit non-response.

iv. Model-based procedures. These are generated by defining a model for partially missing data and then basing inferences on the likelihood in that model, with parameters estimated by methods such as maximum likelihood. These procedures are usually flexible and satisfy the three desirable properties of a good missing-data method: minimizing bias, maximizing the available information and yielding good estimates of uncertainty (Allison, 2002). Some methods in this category are intended for monotone patterns of missing data (see Figure 7.1). The EM algorithm, on the other hand, which is a general technique for finding maximum likelihood estimates for incomplete data, is applicable to more general missing-data patterns, but it can involve a noticeable increase in computing.

7.2.1 Multiple imputation

Multiple imputation (MI) has become more popular in recent years as a method for addressing the problem of missing data. Introduced by Rubin (1987) in the context of complex sample surveys, its main objective is to overcome the main limit of single imputation, namely that imputed values are treated as observed ones, with consequent underestimation of the variability resulting from the imputation step. MI has its drawbacks, of course: in official statistics, for example, consideration must be given to the additional burden for users deriving from the release of many completed datasets and from the need to use special formulas to obtain estimates of interest. From the methodological point of view, MI has another drawback in that, in practice, multiply imputed datasets from complex sample designs are typically imputed under simple random sampling assumptions and then analysed using methods that account for the design features. Methods for accounting for complex sample designs directly in the multiple imputation procedure were proposed by Zhou (2014).

With MI, m imputations are created for each missing value; the variability of the different imputations reflects the uncertainty arising from the original missing values. Let $Q$ be the quantity of interest. After the m imputations have been computed, each of the m completed datasets is analysed with traditional statistical techniques to obtain estimates $\hat{Q}_j$ and the corresponding estimated variances $U_j$, $j = 1, \dots, m$. The MI estimate can then be computed as:

$$\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}_j$$

whereas the estimate of its total variance is:

$$T = \bar{U} + \left(1 + \frac{1}{m}\right) B, \qquad \bar{U} = \frac{1}{m} \sum_{j=1}^{m} U_j, \qquad B = \frac{1}{m-1} \sum_{j=1}^{m} \left(\hat{Q}_j - \bar{Q}\right)^2$$

Thus, the total MI variance is the mean of the m variances, $\bar{U}$, plus $(1 + 1/m)$ times the variance between the MI estimates, $B$. This last quantity measures the increase in variance arising from the missing values. Thus the quantity:

$$\lambda = \frac{(1 + 1/m)\, B}{T}$$

is called the “fraction of missing information”, which measures the contribution of the missing values to the inferential uncertainty about Q (Schafer, 1997). This quantity depends on the corresponding percentage of missing values, but it is usually lower because it benefits from the information blended into the imputation model (Schenker et al., 2006).

Rubin (1987) demonstrates the properties of MI from a Bayesian perspective and gives the conditions to obtain valid inferences through the randomization theory. In both theories, if the MI procedure has some basic desirable properties, the fraction of missing information can be used to evaluate the quantity $(1 + \lambda/m)^{-1}$, the relative efficiency of an estimate based on m multiple imputations with respect to one based on an infinite number of imputations. For example, if λ = 0.3, that is 30 percent of missing information, the relative efficiency with m = 5 MIs is already 94 percent (Schafer and Olsen, 1998). Therefore even a small number of MIs can lead to efficient estimates.
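The combining step can be sketched in a few lines. The code below assumes the standard form of Rubin's rules (pooled estimate as the mean of the m estimates, total variance as within-imputation variance plus (1 + 1/m) times between-imputation variance) and the simple ratio form of the fraction of missing information; the function names `pool_rubin` and `relative_efficiency` are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their variances.

    Returns the pooled estimate, its total variance and the
    (simple ratio form of the) fraction of missing information.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                      # MI point estimate
    u_bar = u.mean()                      # within-imputation variance
    b = q.var(ddof=1)                     # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b   # total variance
    lam = (1.0 + 1.0 / m) * b / total     # fraction of missing information
    return q_bar, total, lam

def relative_efficiency(lam, m):
    """Efficiency of m imputations relative to infinitely many."""
    return 1.0 / (1.0 + lam / m)

q_bar, total, lam = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(q_bar, total, lam)
print(round(relative_efficiency(0.3, 5), 3))
```

The last line reproduces the figure quoted in the text: with λ = 0.3 and m = 5 the relative efficiency is about 0.94.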


To achieve this efficiency in actual application the MI procedure should have some desirable characteristics, which can be summarized as follows (Giusti, 2009):

i. introduce into the imputation model all the covariates potentially influencing the missing-data mechanism, to “enhance” the MAR hypothesis;

ii. include as covariates the variables related to the sampling scheme of the survey, which is particularly important for improper MIs (Rubin, 1996);

iii. consider as covariates the variables that are likely to be used in analysis of the imputed datasets, because incoherence between the imputation and the analysis model can lead to the uncongeniality problem (Meng, 1994, 2002; Fay, 1992; Rubin, 1996); and

iv. make the imputations under different models to study the sensitivity of the final results to the imputation model.

MI is a highly flexible tool because it can be used in different settings and models and hence can be used in the special case of missing values in spatial datasets and surveys.

7.2.2 Missing values in spatial data as measurement error

A research area of geographical information science has recently been developed: i) to investigate the ways in which uncertainty in spatial data arises and is distributed through GIS operations; and ii) to assess the probable effects on subsequent decision-making (Heuvelink, 1998; Zhang and Goodchild, 2002).

Leung et al. (2004) observe: “With the ever increasing volume of geo-referenced data being generated, transferred and utilized, the amount of uncertainty embedded in spatial databases has become a major issue of crucial theoretical importance and practical consideration.” Uncertainty about attribute values and positions in spatial databases generally reflects the accuracy, statistical precision and bias in initial values or in estimated coefficients. Spatial uncertainty also includes the estimation of errors in the final output that results from the propagation of external and internal uncertainty. It is therefore important to be able to track the occurrence and propagation of uncertainties (Goodchild, 1991).

Research on accuracy is closely associated with the study of errors in GIS, and the literature on this subject is extensive: see Goodchild and Gopal (1989), Heuvelink (1998), Leung and Yan (1998), Mowrer and Congalton (2000), Stanislawski et al. (1996), Wolf and Ghilani (1997) and Zhang and Goodchild (2002).

The error taxonomy of Veregin (1989) recognizes that different classes of spatial data exhibit different types of errors, and that errors may be introduced and propagated in various stages of data manipulation and spatial processing. Errors in spatial databases are generally divided into “inherent” errors and “operational” errors: inherent errors are those present in source documents and include errors in maps used as input to a GIS; operational errors occur throughout data manipulation and spatial modelling and are introduced during the processes of data entry or capture and manipulation functions of a GIS (Leung et al., 2004). From the modelling point of view, the errors can also be classified as “systematic” or “random”. The systematic component can usually be removed by modifying the model, but it is impossible to avoid random errors in measurements entirely (Wolf and Ghilani, 1997). Dealing with such measurement error is one of the most important problems in the use of geo-referenced data.

To support the determination of error structures in GIS location coordinates, the concept of a measurement-based GIS was proposed by Goodchild (1999): “…a system that provides access to measurements used to determine the locations of objects, to the geographical procedures (transformation functions) that link measurements to quantities to be measured, and to the rules used to determine interpolated positions”. The basic idea is to retain details of measurements so that error can be analysed. Leung et al. (2004) also propose a framework in which error propagation and the statistical approach to the analysis of measurement error can be formulated.


The measurement-error analysis approach is a geographical science approach involving some statistical tools and concepts, but it is mainly a “technical” approach.

In statistics, the measurement-error problem is concerned with the influence on regression models where some of the independent variables are contaminated with errors or otherwise not measured accurately on all subjects. The literature establishes that disregarding measurement error in a predictor distorts its estimated relationship with the response variable and produces biased estimates of the regression coefficients in linear (Buonaccorsi, 1995; Fuller, 1987, Chapter 1) and non-linear models (Carroll et al., 2006, Chapter 3). Hence most of measurement error analysis is concerned with correcting for such effects.

Measurement-error models usually have two components: i) an underlying model for the response variable y in terms of some predictors, distinguishing between predictors measured without error, z, and predictors that cannot be observed exactly, x; and ii) a variable, w, that is observed and is related to the unobservable x. The parameters in the model relating y and (z,x) cannot be estimated directly because x is not observed. The aim of measurement-error modelling is to obtain nearly unbiased estimates of these parameters indirectly by fitting a model for y in terms of (z,w). In assessing measurement error, attention must be given to the type and nature of the error and to the sources of data that enable modelling of the error (Carroll et al., 2006).

A fundamental prerequisite for analysis of a measurement-error problem is the specification of a model for the measurement-error process. The two general types are: i) the classical error model, where the conditional distribution of w given (x,z) is modelled; and ii) the Berkson error model, where the conditional distribution of x given (w,z) is modelled (Berkson, 1950). In their simplest form, the two models correspond to:

• classical error model: $w_i = x_i + u_i$, with $E(u_i \mid x_i) = 0$
• Berkson error model: $x_i = w_i + u_i$, with $E(u_i \mid w_i) = 0$

where u can be distributed in various ways.

For details of these specifications, see Fuller (1987) and Carroll et al. (2006). But the basic difference between the two types of error model is that the classical model is to be used if the error-prone variable has to be measured uniquely for every individual, whereas the Berkson model is preferable if all individuals in a small group or stratum are given the same value of the error-prone covariate.
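The different consequences of the two error structures for a simple linear regression can be checked numerically. The following sketch is an illustration under assumed values (standard normal variables, slope 2, unit error variance) and is not part of the cited works: regressing y on the error-prone w attenuates the slope under the classical model, while under the Berkson model the slope of a linear regression remains approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta0, beta1 = 1.0, 2.0   # assumed true regression coefficients
sigma_u = 1.0             # assumed measurement-error standard deviation

# Classical error: the true x exists first, and w = x + u is observed.
x_c = rng.normal(size=n)
w_c = x_c + rng.normal(scale=sigma_u, size=n)
y_c = beta0 + beta1 * x_c + rng.normal(scale=0.5, size=n)
slope_classical = np.polyfit(w_c, y_c, 1)[0]

# Berkson error: w is assigned first (e.g. a group value), and x = w + u.
w_b = rng.normal(size=n)
x_b = w_b + rng.normal(scale=sigma_u, size=n)
y_b = beta0 + beta1 * x_b + rng.normal(scale=0.5, size=n)
slope_berkson = np.polyfit(w_b, y_b, 1)[0]

# Under the classical model the slope is shrunk by var(x) / (var(x) + var(u)).
attenuation = 1.0 / (1.0 + sigma_u**2)
print(f"classical: {slope_classical:.3f} (expected about {beta1 * attenuation:.1f})")
print(f"Berkson:   {slope_berkson:.3f} (expected about {beta1:.1f})")
```

The Berkson case is the relevant one when every unit of an area is given the same value of the covariate, such as the area centroid used in place of the unit coordinates.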

The literature on statistical measurement error analysis is enormous. Examples include Carroll et al. (1993), Bollinger (1998), Richardson et al. (2002), Chesher and Schluter (2002), Wang (2004), Carroll et al. (2004), Ganguli et al. (2005), Ybarra and Lohr (2008) and Torabi et al. (2009). Recently published applications of models with spatial measurement error include Zhuly et al. (2003), Gryparis et al. (2007), Madsen et al. (2008), Goovaerts (2009) and Gryparis et al. (2009).

7.3 Missing data in spatial analysis

All the concepts introduced in the previous sections are relevant in the particular case of data missing from a dataset containing spatial data. Spatial data raise additional issues that must be taken into account (Haining, 2003, chapters 2 and 4). Missing values in geographical information such as coordinates of units under study and missing values in SAE study variables and auxiliary variables are particular problems in geographical data analysis.

7.3.1 Missing spatial information

Implementation of geostatistical methods requires that the statistical units are referenced at point locations. If the aim is to analyse the spatial pattern or to produce a spatial interpolation of a studied phenomenon, then such spatial information is required only for the sampled statistical units. But if GWR or a geoadditive model is used to produce estimates of a parameter of interest for some geographical domains, the spatial information is required for all population units.

This information is not always easily available, especially when socio-economic data are involved. The coordinates for sampled units, which could be specially collected for the analysis, are usually known, but the exact location of all the non-sampled population units is not known – only the areas to which they belong such as census districts or municipalities.

In such situations, the classic approach that allows the use of geostatistical techniques is to locate all the units belonging to the same area by the latitude and longitude coordinates of the centroid of each area. This is obviously an approximation based on a geometrical property, and the strength of its effect on the estimates will depend on the level of non-linearity in the spatial pattern and on the area dimension.

To evaluate the effect on small-area estimates of imputing missing unit locations with centroids, the M-quantile Geographically Weighted Regression model based on the Chambers-Dunstan correction (MQGWR-CD) and the design-based simulation study on EMAP data in chapter 3 are considered. To check the effect of missing unit locations on the MQGWR-CD estimator, the same model was fitted with each unit located at the centroid of its area, both for sampled units and for out-of-sample units. Table 7.1 shows the median values over the areas of the percentage RB and RRMSE for the MQGWR-CD predictor, as presented in chapter 3, and for the MQGWR-CD centroid predictor, which uses the centroids of the areas.

Table 7.1 Median values of the percentage RB and percentage RRMSE for the MQGWR-CD and MQGWR-CD centroid predictors in design-based simulations (Chapter 3)

| Areas | Predictor | RB (%) | RRMSE (%) |
|---|---|---|---|
| 86 sampled HUCs | MQGWR-CD | 0.06 | 29.84 |
| 86 sampled HUCs | MQGWR-CD centroid | -5.46 | 31.38 |
| 27 out-of-sample HUCs | MQGWR-CD | -3.69 | 17.50 |
| 27 out-of-sample HUCs | MQGWR-CD centroid | -4.92 | 20.75 |

When all locations at the unit level are missing, it is clear that replacing them by using the centroids can increase the bias and the variability of the final small-area estimates. In the design-based simulation study this happens for the sampled small areas and for the out-of-sample areas. These results suggest that when a large number of unit locations is missing, alternative small-area models such as area-level models or an alternative imputation method should be considered.

Instead of using the centroids, lack of geographical information can be dealt with through a measurement-error approach, whereby a distribution for the locations inside each area is imposed. Let $x_{ij}$ be the vector of the exact spatial coordinates of unit i belonging to area j, and $w_j$ the coordinates of the centroid of area j. The hypothesis can then be formulated as a Berkson-type error model: $x_{ij} = w_j + u_{ij}$, where $E(u_{ij} \mid w_j) = 0$ and u can assume distributions with different parameters in each area.

A proposal by Little and Rubin (1987) for filling gaps in geographical information follows a stochastic imputation approach instead of the classic deterministic approach using the centroids. In another interesting approach, Bocci and Rocco (2011) proposed to deal with the absence of point referenced geographical data in a geoadditive model – which requires the location of all units to be known – by imposing a distribution to locate the units inside each area.


The intention is to make an improvement with respect to imputing the locations using the area centroids. Because this idea could be extended to other small-area and interpolation models, it is presented here with the main results obtained by Bocci and Rocco (2011).

The proposal by Bocci and Rocco (2011) is realized through a hierarchical Bayesian formulation of a geoadditive model in which a prior distribution of the spatial coordinates is defined, and the performance of the imputation approach is evaluated through various MCMC experiments in different scenarios: true distribution of the spatial coordinates – homogeneous Poisson process, non-homogeneous Poisson process and beta distribution – and a priori coordinate distribution used in the hierarchical Bayesian formulation – centroid, uniform and beta. The model is not a “complete” measurement-error model in that it is assumed that the measurement error does not influence the estimation of the parameters of the geoadditive model – the spatial information is available for the sample – but it does occur when the parameter of interest for the areas with the whole population covariates is predicted.

As stated in chapter 2, exact knowledge of the spatial coordinates of the studied phenomenon can be exploited to obtain a surface estimate by using bivariate smoothing techniques such as kernel estimates or kriging (Cressie, 1993; Ruppert et al., 2003). But spatial information alone does not properly explain the pattern of the response variable, and some covariates must be introduced in a more complex model.

Geoadditive models, introduced by Kammann and Wand (2003), answer this problem because they analyse the spatial distribution of the study variable while accounting for possible linear or non-linear covariate effects. Under the additivity assumption they can handle such covariate effects by merging an additive model that accounts for the relationship between the variables and a kriging model that accounts for the spatial correlation, and by expressing both as an LMM. The LMM representation is a useful instrument because it enables estimation with mixed-model methods and software (see Kammann and Wand, 2003). The addition of other explanatory variables is straightforward: smoothing components are added in the random effects term, and linear components can be incorporated as fixed effects. The mixed model structure provides a unified modular framework that enables straightforward extension of the model to include various kinds of generalization and evolution (Ruppert et al., 2009).

The mixed model could be fitted in a frequentist framework using a best linear unbiased predictor or penalized quasi-likelihood estimation. A Bayesian inferential perspective can also be adopted by placing priors on the model parameters and simulating their joint posterior distribution. The posterior density is often analytically unavailable, but it can be simulated using MCMC. The posterior distribution of any explicit function of the model parameters can be obtained as a by-product of the simulation algorithm.

Let a population of N units be divided into Q regions, with interest in estimating the regional mean of a study variable y. A sample of n units is taken, for which the response variable y, the location s and possibly some other covariates, which are known without error for all the population units, are recorded. To obtain the regional mean, a model-based mean estimator is required:

$$\hat{\bar{y}}_q = \frac{1}{N_q} \left( \sum_{i \in S_q} y_i + \sum_{i \in R_q} \hat{y}_i \right)$$

where $N_q$ is the total number of units in region q, $S_q$ and $R_q$ indicate the sets of the sampled and non-sampled units belonging to region q, and $\hat{y}_i$ is the model prediction for the non-sampled unit i.

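The estimated parameters are obtained from the sampled units, but if s is not known for the non-sampled units the estimator above cannot be used directly. To show the problem more clearly, consider a single linear predictor $x_i$ of $y_i$ at spatial location $s_i$ and the following spline-regression model:

$$y_i = \alpha + \beta_x x_i + \beta_s^{T} s_i + \sum_{k=1}^{K_s} u_k z_k(s_i) + \varepsilon_i$$

where a low-rank thin-plate spline with $K_s$ knots, with basis functions $z_k(\cdot)$ and coefficients $u_k$, is used to represent the unspecified bivariate smooth function of s.

The model-based mean estimator becomes:

$$\hat{\bar{y}}_q = \frac{1}{N_q} \left[ \sum_{i \in S_q} y_i + \sum_{i \in R_q} \left( \hat{\alpha} + \hat{\beta}_x x_i + \hat{\beta}_s^{T} s_i + \sum_{k=1}^{K_s} \hat{u}_k z_k(s_i) \right) \right] \qquad (7.3)$$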

The relevant issue now is: how can this estimator still be applied if si for the non-sampled units Rq is not known? In the classic approach the si values are replaced with the region centroid cq, which is a constant for all the units in region q. But this can, as stated above, have drawbacks with regard to the final estimates of interest.

As suggested by Bocci and Rocco (2011), lack of geographical information can be treated as a particular problem of missing data: instead of using the same coordinates cq for all the units in region q, which may be defined as a particular case of deterministic imputation, they suggest the use of a stochastic Bayesian imputation approach, including in the hierarchical Bayesian formulation of the geoadditive model (Ruppert et al., 2003 chapter 16), a prior distribution for si inside each region q, and then the use of the joint posterior distribution of all parameters given the data as the basis of inference (see Bocci and Rocco, 2011). Some of the simulation results are given here to evaluate the performance of this approach in comparison with the classic centroid approach.
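The contrast between deterministic (centroid) and stochastic imputation of the missing locations can be illustrated with a deliberately simplified, one-dimensional sketch. It is not the hierarchical Bayesian procedure of Bocci and Rocco (2011): the fitted model is replaced by the true mean function (an assumption made to isolate the effect of the location imputation), the region, the sample size and the error standard deviation are hypothetical, and the stochastic imputation is a single uniform draw per unit.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(s):
    # Non-linear spatial trend, borrowed from the univariate example in this chapter
    return np.sin(3 * np.pi * s**3)

def predict(s, alpha=10.0):
    # Stand-in for the fitted model: assumed equal to the true mean function
    return alpha + f(s)

# One hypothetical region (an interval) with N_q units, n_q of them sampled
a, b = 0.5, 0.82
N_q, n_q = 1000, 100
s = rng.uniform(a, b, size=N_q)                        # true locations
y = predict(s) + rng.normal(scale=0.1, size=N_q)       # error sd is assumed
true_mean = y.mean()

sampled = np.zeros(N_q, dtype=bool)
sampled[rng.choice(N_q, size=n_q, replace=False)] = True
n_nonsampled = N_q - n_q

def mean_estimator(s_imputed):
    """Observed y for sampled units plus predictions at imputed locations."""
    y_hat = predict(s_imputed)
    return (y[sampled].sum() + y_hat.sum()) / N_q

# Deterministic imputation: every non-sampled unit sits at the centroid
est_centroid = mean_estimator(np.full(n_nonsampled, (a + b) / 2))
# Stochastic imputation: locations drawn from a uniform distribution on the region
est_uniform = mean_estimator(rng.uniform(a, b, size=n_nonsampled))

print(f"true {true_mean:.3f}  centroid {est_centroid:.3f}  uniform {est_uniform:.3f}")
```

Because the trend is strongly non-linear inside the region, evaluating it at the centroid does not reproduce its average over the region, which is the source of the bias discussed above.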

In the experiments, Bocci and Rocco (2011) follow the settings and examples presented in Crainiceanu et al. (2005) and Marley and Wand (2010).

All scenarios are characterized by the following setting, with the study variable simulated by the model:

$$y_i = \alpha + \beta_x x_i + f(s_i) + \varepsilon_i$$

where $\varepsilon_i$ is a random error term, $\alpha$ and $\beta_x$ are fixed coefficients with the values set in Bocci and Rocco (2011), $x_i$ is a dummy variable known for the whole population, $s_i$ represents the spatial location, which is generated by a different spatial point process in each scenario, and the function f(s) is obtained as a bivariate normal mixture density. The population consists of N = 3,000 units located in the unit square O = [0, 1] × [0, 1], which is divided into Q = 9 rectangular regions that can be represented by their vertices [(l1q; m1q); (l2q; m1q); (l2q; m2q); (l1q; m2q)]. The regions are obtained by using a random binary splitting procedure.

Each scenario differs from the others in the spatial point process used to generate s. Four data-generating processes are considered by Bocci and Rocco (2011), as shown in Figure 7.2.


Figure 7.2 Spatial distributions of population units

Legend: (a) homogeneous Poisson process; (b) non-homogeneous Poisson process; (c) non-homogeneous Poisson process on each region; (d) independent bivariate beta distribution on each region. Source: Bocci and Rocco (2011).

For each population setting three MCMC experiments are performed to estimate the mean of y in the 9 regions applying the estimator (7.3) and using the complete hierarchical Bayesian formulation of the geoadditive model. They are characterized by three different choices of the prior distribution for si inside each region q, that is by three different imputation models: centroid imputation, uniform imputation and beta imputation.

The results of the simulation studies are presented in Tables 7.2, 7.3, 7.4 and 7.5, where the performance of the small-area mean estimator is evaluated in terms of RB and RRMSE in the three imputation approaches.
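For reference, the two performance measures can be computed from the replicated estimates as sketched below. The definitions used here are the usual ones (mean relative error and root mean squared error relative to the true value, both in percent) and are assumed to match those adopted in the simulation studies; the function name `rb_rrmse` and the toy numbers are hypothetical.

```python
import numpy as np

def rb_rrmse(estimates, true_values):
    """Percentage relative bias and relative root MSE per area.

    estimates:   array of shape (replicates, areas)
    true_values: array of shape (areas,) with the true area means
    """
    est = np.asarray(estimates, dtype=float)
    true = np.asarray(true_values, dtype=float)
    err = est - true                                   # broadcast over replicates
    rb = 100.0 * err.mean(axis=0) / true
    rrmse = 100.0 * np.sqrt((err**2).mean(axis=0)) / true
    return rb, rrmse

# Toy check: two replicates, two areas
rb, rrmse = rb_rrmse([[9.0, 21.0], [11.0, 19.0]], [10.0, 20.0])
print(rb, rrmse)
```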

From Tables 7.2 and 7.5 it is evident that the stochastic imputation approach produces better estimates than the classic centroid approach when the imputation distribution corresponds to the population spatial distribution. This is the case with the uniform approach in scenario (a) and with the beta approach in scenario (d). The beta imputation approach also works well in scenario (a) because the true spatial distribution in each region is a special case of the bivariate beta distribution, but it produces less precise estimates than the uniform imputation because the beta parameters need to be estimated in the fitting process.


In scenarios (b) and (c) (Tables 7.3 and 7.4) none of the imputation models corresponds to the population spatial distribution, but the beta approach still performs well. This is because the beta distribution has the advantage of modelling different shapes depending on the values of the parameters. In the approach presented here these parameters are estimated directly in MCMC, exploiting the spatial distribution of the sampled units and producing a posterior bivariate beta distribution that is as similar as possible to the sample spatial distribution. The good performance of this approach obviously relies on the representativeness of the sample.
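The idea of letting the sampled locations determine the shape of the imputation distribution can be sketched outside the Bayesian framework. In the example below the beta parameters are estimated by the method of moments rather than within MCMC, which is a simplification and not the procedure of Bocci and Rocco (2011); the functions `beta_moment_fit` and `impute_locations`, the region bounds and the shape values are hypothetical. For bivariate locations with independent coordinates, the same steps would be applied to each coordinate separately.

```python
import numpy as np

def beta_moment_fit(u):
    """Method-of-moments estimates of beta shape parameters for data in (0, 1).

    Requires the sample variance to be smaller than mean * (1 - mean).
    """
    mean, var = u.mean(), u.var(ddof=1)
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def impute_locations(s_sample, lower, upper, size, rng):
    """Draw locations in [lower, upper] from a beta fitted to the sampled ones."""
    u = (s_sample - lower) / (upper - lower)       # rescale the region to (0, 1)
    a_hat, b_hat = beta_moment_fit(u)
    return lower + (upper - lower) * rng.beta(a_hat, b_hat, size=size)

rng = np.random.default_rng(3)
lower, upper = 0.2, 0.5                            # hypothetical region bounds
# Hypothetical sampled locations, concentrated towards the lower bound
s_sample = lower + (upper - lower) * rng.beta(2.0, 5.0, size=5000)

a_hat, b_hat = beta_moment_fit((s_sample - lower) / (upper - lower))
s_imputed = impute_locations(s_sample, lower, upper, size=1000, rng=rng)
print(f"estimated shapes: {a_hat:.2f}, {b_hat:.2f}")
```

The imputed locations reproduce the asymmetry of the sampled ones, which is helpful only to the extent that the sample is spatially representative of the population.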

As a final remark on the classic centroid approach, the results suggest that in almost all cases it performs worse than the beta imputation, even if there are particular situations in which it seems a good choice. This depends strictly on the spatial distribution of specific units and the values of y in that region. This consideration also applies to the behaviour of the uniform distribution in scenarios (b), (c) and (d): generally it does not work well, but it may be good in particular situations. The good performance of the beta imputation in all the scenarios is reflected in the mean estimation for the overall area O (see Bocci and Rocco, 2011).

Table 7.2 Empirical RB % and RRMSE % of the model-based mean estimator for the three imputation approaches. Scenario (a): homogeneous Poisson process

| Region | Centroid RB % | Centroid RRMSE % | Uniform RB % | Uniform RRMSE % | Beta RB % | Beta RRMSE % |
|---|---|---|---|---|---|---|
| 1 | 0.7934 | 1.0726 | -0.1060 | 0.3926 | 0.0748 | 0.5429 |
| 2 | 0.1790 | 0.6245 | 0.0380 | 0.4737 | 0.0169 | 0.5396 |
| 3 | -3.1145 | 3.1856 | 0.0317 | 0.2641 | -0.0834 | 0.4517 |
| 4 | 1.6969 | 1.8209 | 0.0565 | 0.3141 | 0.0632 | 0.4521 |
| 5 | 0.3569 | 0.8211 | -0.0051 | 0.4036 | -0.0537 | 0.7527 |
| 6 | -0.3597 | 0.7326 | 0.1338 | 0.3857 | 0.0177 | 0.4059 |
| 7 | 0.0116 | 0.5783 | 0.2422 | 0.5127 | 0.1608 | 0.6099 |
| 8 | 0.2409 | 0.7920 | -0.1407 | 0.3687 | -0.1274 | 0.5344 |
| 9 | 2.0669 | 2.1835 | -0.0713 | 0.3975 | -0.1114 | 0.7074 |
| Overall | -0.2861 | 0.3860 | 0.0089 | 0.1254 | -0.0269 | 0.1979 |

Table 7.3 Empirical RB % and RRMSE % of the model-based mean estimator for the three imputation approaches. Scenario (b): non-homogeneous Poisson process

| Region | Centroid RB % | Centroid RRMSE % | Uniform RB % | Uniform RRMSE % | Beta RB % | Beta RRMSE % |
|---|---|---|---|---|---|---|
| 1 | 1.5427 | 1.7086 | 0.3019 | 0.4796 | 0.0770 | 0.5067 |
| 2 | -0.1183 | 0.6995 | -0.1747 | 0.5283 | 0.0219 | 0.5200 |
| 3 | -5.8033 | 5.8560 | -3.0559 | 3.0820 | -0.1910 | 0.4458 |
| 4 | 2.1125 | 2.2369 | 0.5721 | 0.6848 | 0.1740 | 0.5280 |
| 5 | -0.7650 | 1.0841 | -1.3633 | 1.4199 | -0.1384 | 0.6976 |
| 6 | -0.5418 | 0.9037 | -0.1433 | 0.4055 | -0.0390 | 0.3751 |
| 7 | -0.7888 | 1.0004 | -0.6332 | 0.8045 | 0.0422 | 0.6082 |
| 8 | 1.3448 | 1.5056 | 0.7700 | 0.8457 | 0.0124 | 0.4686 |
| 9 | 1.3178 | 1.5541 | -0.6146 | 0.7459 | 0.1635 | 0.7659 |
| Overall | -1.1534 | 1.1905 | -0.9406 | 0.9535 | -0.0207 | 0.1979 |


Table 7.4 Empirical RB % and RRMSE % of the model-based mean estimator for the three imputation approaches. Scenario (c): non-homogeneous Poisson process on each region

| Region | Centroid RB % | Centroid RRMSE % | Uniform RB % | Uniform RRMSE % | Beta RB % | Beta RRMSE % |
|---|---|---|---|---|---|---|
| 1 | 1.5210 | 1.9031 | -0.2256 | 0.7465 | 0.1777 | 0.9244 |
| 2 | 0.0809 | 0.6108 | -0.1959 | 0.4660 | 0.0475 | 0.4581 |
| 3 | -4.8635 | 4.9351 | -2.0413 | 2.0646 | -0.3337 | 0.5723 |
| 4 | 1.1259 | 1.2053 | -0.6395 | 0.6768 | -0.1398 | 0.3759 |
| 5 | -0.3290 | 0.8681 | -0.8983 | 0.9951 | -0.1369 | 0.7468 |
| 6 | 0.4805 | 1.4733 | -0.2036 | 1.1369 | -0.0393 | 0.8779 |
| 7 | -0.6250 | 1.0207 | -0.9846 | 1.1835 | -0.0861 | 0.8967 |
| 8 | 0.2472 | 0.7276 | -0.2919 | 0.4204 | -0.0261 | 0.4347 |
| 9 | 1.1119 | 1.4059 | -1.2302 | 1.3163 | -0.1010 | 0.8539 |
| Overall | -0.8551 | 0.9073 | -0.9873 | 0.9953 | -0.1425 | 0.2483 |

Table 7.5 Empirical RB % and RRMSE % of the model-based mean estimator for the three imputation approaches. Scenario (d): bivariate beta distribution on each region

| Region | Centroid RB % | Centroid RRMSE % | Uniform RB % | Uniform RRMSE % | Beta RB % | Beta RRMSE % |
|---|---|---|---|---|---|---|
| 1 | 2.3203 | 2.4269 | 1.3440 | 1.4220 | -0.0046 | 0.4289 |
| 2 | 0.4227 | 0.8285 | -0.0076 | 0.6944 | 0.0336 | 0.5121 |
| 3 | -3.0580 | 3.1006 | -0.7285 | 0.8775 | 0.0863 | 0.3261 |
| 4 | -0.6350 | 0.8042 | -2.4948 | 2.5406 | -0.0298 | 0.3874 |
| 5 | 1.0137 | 1.3327 | 0.5416 | 0.7010 | 0.2488 | 0.8755 |
| 6 | -0.4144 | 0.6593 | 0.0342 | 0.5208 | -0.0778 | 0.4076 |
| 7 | -1.2892 | 1.4489 | -1.1110 | 1.2835 | 0.1120 | 0.6043 |
| 8 | 1.3736 | 1.5183 | 0.6886 | 0.7977 | -0.0116 | 0.4258 |
| 9 | 0.0750 | 0.7751 | -2.2929 | 2.3339 | -0.0230 | 0.6983 |
| Overall | -0.5710 | 0.6144 | -0.5872 | 0.6204 | 0.0403 | 0.1710 |

To show the relevance of the spatial representativeness property of the sample, Bocci and Rocco (2011) present other MCMC experiments. In this case, s is assumed to be univariate so that the regions are actually intervals. In the new simulations the study variable y is simulated by the model

$$y_i = \alpha + \beta_x x_i + f(s_i) + \varepsilon_i$$

where $\varepsilon_i$ is a random error term, $\alpha = 10$, $\beta_x = 0.4$, $x_i$ is a dummy variable known for the whole population, $s_i$ represents the spatial location and is generated from a uniform distribution in every region, and $f(s) = \sin(3\pi s^3)$. The population of N = 3,000 units is located in the interval O = [0, 1], which is divided into Q = 4 intervals: [0, 0.2], [0.2, 0.5], [0.5, 0.82] and [0.82, 1].


The population obtained is shown in Figure 7.3(a), where the green dots correspond to the units with xi=0, and the black dots to the units with xi=1, the vertical dashed lines indicate the regions, and the red lines indicate the deterministic component of the model.

Figure 7.3 Scenario settings

Legend: (a) simulated population; green dots correspond to units with xi=0, black dots to units with xi=1; vertical dashed lines indicate regions, red lines indicate the deterministic component of the model; (b) distribution of a representative sample; (c) distribution of a type-1 non-representative sample; (d) distribution of a type-2 non-representative sample. Source: Bocci and Rocco (2011).

Three scenarios are considered, each with a different type of sample selected from the population. For each scenario, three MCMC experiments are performed to estimate the mean of y in the four regions. The three types of sample are stratified samples of n=500 units, with strata corresponding to the four regions and proportional allocation of sampled units in each stratum. They differ in the sampling design used to select the units in each stratum:

• representative sample: a simple random sample is selected in each stratum;
• type-1 non-representative sample: in each stratum 70 percent of the sample is randomly selected among the units with s values lower than the centroid; the remaining 30 percent is randomly selected among the units with s values greater than the centroid; and
• type-2 non-representative sample: in each stratum the units are selected with probability proportional to the inverse of the y values.
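The three designs can be mimicked with a short script. This is a sketch under stated assumptions rather than the original simulation code: locations are drawn uniformly on [0, 1] and then assigned to the four intervals, the error standard deviation (0.5) and the distribution of the dummy variable are hypothetical, and the type-2 design is approximated by sequential draws without replacement with probabilities proportional to 1/y.

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 3000, 500
breaks = np.array([0.0, 0.2, 0.5, 0.82, 1.0])      # region boundaries from the text

# Hypothetical population following the univariate simulation model
s = rng.uniform(0, 1, size=N)
x = rng.integers(0, 2, size=N)                      # dummy covariate (assumed 50/50)
y = 10 + 0.4 * x + np.sin(3 * np.pi * s**3) + rng.normal(scale=0.5, size=N)
region = np.searchsorted(breaks, s, side="right") - 1   # region labels 0..3

def draw(design):
    """Stratified sample with proportional allocation under the given design."""
    chosen = []
    for q in range(4):
        idx = np.flatnonzero(region == q)
        n_q = int(round(n * idx.size / N))          # proportional allocation
        if design == "representative":
            pick = rng.choice(idx, size=n_q, replace=False)
        elif design == "type1":
            centroid = (breaks[q] + breaks[q + 1]) / 2
            low, high = idx[s[idx] < centroid], idx[s[idx] >= centroid]
            n_low = int(round(0.7 * n_q))           # 70 percent below the centroid
            pick = np.concatenate([
                rng.choice(low, size=n_low, replace=False),
                rng.choice(high, size=n_q - n_low, replace=False),
            ])
        elif design == "type2":
            p = 1.0 / y[idx]                        # inverse of the y values
            pick = rng.choice(idx, size=n_q, replace=False, p=p / p.sum())
        else:
            raise ValueError(f"unknown design: {design}")
        chosen.append(pick)
    return np.concatenate(chosen)

samples = {d: draw(d) for d in ("representative", "type1", "type2")}
for d, idx in samples.items():
    print(d, idx.size, round(float(s[idx].mean()), 3), round(float(y[idx].mean()), 3))
```

Comparing the sample means of s and y across the three designs gives a quick impression of how far each sample departs from the population.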


Examples of spatial distribution in the three samples are shown in Figure 7.3. The MCMC experiments follow the settings previously described in this section and are replicated m=100 times to take into account variability in the model and the sampling design. Function f(s) is modelled with a low-rank truncated linear spline with Ks = 30 knots located on the quantiles of the sample distribution of s.

The posterior densities of the regional model-based mean estimator in the three scenarios are presented in Figure 7.4. It is evident that when a simple random sample is selected in each stratum, the uniform imputation, which corresponds to the true spatial distribution, and the beta imputation work well as in the bivariate scenario (a).

In the other two scenarios the performance of the two imputation approaches deteriorates, with the beta imputation more affected. This is because the beta imputation uses the spatial distribution of the sampled units to estimate its parameters: whenever the spatial distribution of the sample does not reflect that of the population, the estimated parameters produce a posterior spatial distribution that differs from the true distribution.

On the other hand, the uniform imputation does not exploit any sample information and so it correctly imputes the coordinates of the non-sampled units. But because the selection of sampled units depends on their location or on their y value, which is connected to s by f(s), the joint spatial distribution of sampled and imputed units will not be uniform. Similar considerations apply to the classic centroid imputation approach. Hence whichever imputation approach is used, the mean estimator will be affected by the non-representativity of the sample.

It is important to note that the effect of sample non-representativity arises mainly in the imputation step of the analysis presented. Its semi-parametric spline structure makes the geoadditive model robust to sample non-representativity, and the model fitting step is hardly influenced by it.


Figure 7.4. Posterior density of the regional model-based mean estimator in the three scenarios and for the three imputation approaches

Legend: centroid = green line; uniform = red line; beta = blue line. Vertical lines indicate true mean values.
Source: Bocci and Rocco (2011).

7.3.2 Missing values in auxiliary and target variables

With regard to the issue of missing auxiliary variables in geographic datasets, an initial consideration is that the MAR hypothesis does not necessarily imply that the missing values are geographically distributed “at random”. Observations can be missing at regular intervals in the region under study, or they can be clustered in some sub-regions; in the latter case the remaining data will have a strong influence on the fit of models used to describe surface variation. As with other data, it is important to consider why the values are missing before deciding on the approach to be adopted. Missing remotely sensed data, for example, does not necessarily violate the MAR assumption because sensor failure along a scan line is not usually related to the underlying surface. Similarly, if unemployment data are not recorded in some areas as a result of strike action, this does not imply a non-ignorable missing-data mechanism. By contrast, the number of non-responses to questions on crime is often higher in inner-city areas as a result of a Missing Not At Random (MNAR) mechanism linked to crime levels in those areas.

The spatial continuity of the observations should be used to impute missing values. In the case of clustered missing data in a sub-region, therefore, imputations could be based largely on the values observed in the nearest areas.

And if the aim is to map the spatial variability of given characteristics, an estimate of prediction error should be included to check the effect of the missing values; this could be represented in a map as well.


Apart from situations where the number of missing data items is low, analysis using complete cases only should be avoided because it can severely bias the results of the analysis of interest, and removing a relevant covariate from the analysis because of the missing values could cause a mis-specification of the model. The usual solution for treating missing data in datasets containing spatial data is, therefore, to impute the missing information.

All the imputation techniques defined for general datasets can be used for spatial data as well, with all extra information provided by the spatial distribution of the values included in the imputation process. Of course, with spatial data there may be particular circumstances to take into account. Some values of covariates may be missing for some observations, for example, but the total at the area level will be known. If data are missing at the small-area level, the total for a wider area including all the small areas may be known. In these cases, the missing values should be filled in so that the final estimates benchmark the wider area total.

Haining (2003) suggests that missing variable values should be imputed in spatial data matrixes using one of the following approaches:
i. Spatial mean imputation with equal or unequal weights assigned to each data value. The idea is to impute the missing value with the arithmetical mean (strictly speaking the median) of the data values in a spatial window defined round the area with the missing value. The mean could be weighted to avoid cluster effects in the distribution of irregularly shaped areas.
ii. Spatial hot-deck imputation. In this approach a missing value is imputed by drawing it from the empirical distribution of the variable, considering the values obtained from a given spatial window.
iii. Spatial regression imputation. This approach extends regression imputation by including among the predictors the neighbouring values of a fully observed covariate, weighted using a contiguity matrix of the areas.
iv. Maximum likelihood approach. This involves the iterative estimation of model parameters and prediction of missing values; it is similar to the EM algorithm and to simple and universal kriging.
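Approach (i) can be illustrated with a short Python sketch. It is only an example under stated assumptions: areas are represented by centroid coordinates, the spatial window is a circle of a chosen radius, the unequal weights are inverse distances, and no two areas share the same coordinates. The function name `spatial_mean_impute` is hypothetical and is not taken from Haining (2003).

```python
import numpy as np


def spatial_mean_impute(coords, values, radius, weighted=False):
    """Fill NaN entries with the mean of observed values inside a circular window.

    coords   : (n, 2) array of area centroids (assumed representation)
    values   : (n,) array with NaN marking missing values
    radius   : window radius, in the same units as coords
    weighted : if True, use inverse-distance weights instead of equal weights
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    out = values.copy()
    observed = ~np.isnan(values)
    for i in np.flatnonzero(~observed):
        d = np.linalg.norm(coords - coords[i], axis=1)
        window = observed & (d <= radius)
        if not window.any():
            continue                      # no observed neighbour: leave missing
        if weighted:
            w = 1.0 / d[window]           # closer areas count more
            out[i] = np.sum(w * values[window]) / np.sum(w)
        else:
            out[i] = values[window].mean()
    return out


coords = [(0, 0), (1, 0), (0, 2), (5, 5)]
values = [np.nan, 2.0, 4.0, 10.0]
equal = spatial_mean_impute(coords, values, radius=2.5)
inverse = spatial_mean_impute(coords, values, radius=2.5, weighted=True)
```

Replacing the mean by `np.median(values[window])` gives the median variant, and drawing one value at random from `values[window]` gives a simple form of the spatial hot-deck in approach (ii).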

Lokupitiya et al. (2006) compared the effect of four techniques to impute missing crop yield data for barley in the 1997 database of the National Agricultural Statistical Survey. The data considered were crop yields aggregated at the county level, as entered into the National Agricultural Statistical Survey and the Census of Agriculture. The National Agricultural Statistical Survey crop-yield data are produced annually from sample surveys of selected farms in a county; Census of Agriculture crop-yield estimates, produced every five years, are based on a survey covering almost all farms in a county. Both datasets present missing data, but the aim of Lokupitiya et al. (2006) was to fill the gaps in the National Agricultural Statistical Survey database because it reports yields every year. A major source of missingness is that the survey only covers states that produce 90 percent to 95 percent of the national total for each crop.

Lokupitiya et al. (2006) compared the following imputation methods: regression, kernel smoothing, universal kriging and MI. As covariate information in these models, they used data on crop yields from the Census of Agriculture. In the multiple imputation procedure, an MCMC method was used to impute the missing values. Mean vector and covariance matrixes for the data that did not have missing values were computed as starting values and considered as the prior distribution. Filling missing values with the random numbers drawn from the available distribution created a complete dataset. The mean vector and covariance matrixes were re-computed for the complete dataset to obtain the posterior distribution. The missing values were then imputed again by generating random numbers from the posterior distribution. This procedure was repeated until the mean vector and covariance matrixes were stable. Imputations from the final iteration were taken to form a dataset with no missing values.
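The iterative scheme just described can be sketched as follows. This is a simplified illustration and not the procedure run by Lokupitiya et al. (2006): it assumes multivariate normal data, draws each missing entry from its conditional normal distribution given the observed entries of the same row, uses a fixed number of iterations instead of a stability check, and requires every row to have at least one observed value. The function name `impute_mvn` and the simulated data are hypothetical.

```python
import numpy as np


def impute_mvn(X, n_iter=50, seed=0):
    """Iteratively fill NaN entries with draws from a fitted multivariate normal."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    rows_with_gaps = np.flatnonzero(miss.any(axis=1))

    # Starting values: mean vector and covariance matrix of the complete rows.
    complete = X[~miss.any(axis=1)]
    mu = complete.mean(axis=0)
    cov = np.cov(complete, rowvar=False)

    filled = X.copy()
    for _ in range(n_iter):
        for i in rows_with_gaps:
            m, o = miss[i], ~miss[i]
            # Conditional normal of the missing block given the observed block.
            c_oo_inv = np.linalg.pinv(cov[np.ix_(o, o)])
            c_mo = cov[np.ix_(m, o)]
            cond_mu = mu[m] + c_mo @ c_oo_inv @ (X[i, o] - mu[o])
            cond_cov = cov[np.ix_(m, m)] - c_mo @ c_oo_inv @ c_mo.T
            cond_cov = (cond_cov + cond_cov.T) / 2          # enforce symmetry
            filled[i, m] = rng.multivariate_normal(cond_mu, cond_cov)
        # Re-compute the mean vector and covariance matrix on the completed data.
        mu = filled.mean(axis=0)
        cov = np.cov(filled, rowvar=False)
    return filled


# Simulated example: three correlated yield series with 40 gaps.
rng = np.random.default_rng(0)
true_cov = [[1.0, 0.8, 0.5], [0.8, 1.0, 0.3], [0.5, 0.3, 1.0]]
X_full = rng.multivariate_normal([50, 45, 40], true_cov, size=200)
X_obs = X_full.copy()
holes = rng.choice(200, size=40, replace=False)
X_obs[holes, rng.integers(0, 3, size=40)] = np.nan          # one gap per affected row
completed = impute_mvn(X_obs)
```

Running the function with several seeds and pooling the results would give multiple completed datasets, which is the sense in which the approach is a multiple imputation.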

In the simulation studies Lokupitiya et al. (2006) used the omit-one cross-validation method and the deleting-k multifold cross-validation method with k=5 to compare the performance of the different techniques. The first method of validation worked by fitting the model to sub-samples of the original dataset, each of which included all but one observation. For a sample of size n, the model is fitted n times on n sub-samples, where the ith sub-sample contains all the observations except the ith (e.g. the 1st sub-sample has obs. 2,3,...,n; the 2nd sub-sample has obs. 1,3,4,...,n; and so on until the nth sub-sample, which has obs. 1,2,3,...,n-1). The omitted observation changed with each sub-sample so that every observation was held out exactly once; in each case the sub-sample was used to estimate the omitted observation and to compare the estimated value with the omitted observation. In multifold cross-validation, on the other hand, several (k>1) observations were deleted in each sub-sample. Table 7.6 shows the mean absolute prediction error obtained with the two cross-validation methods.

Table 7.6 Mean absolute prediction errors for each imputation method under the two cross-validation approaches

Method                 MAPE* (omit-one)    MAPE* (deleting-5 multifold)
Regression             2.8272              2.8370
Multiple imputation    2.8003              3.0450
Universal kriging      8.8162              9.2356
Kernel smoothing       25.1922             25.5886

* Mean absolute prediction error.
Source: Lokupitiya et al. (2006).

The results of the simulations show that regression and multiple imputation performed best, followed by universal kriging and kernel smoothing. Lokupitiya et al. (2006) suggested that the main problem of kernel smoothing was over-estimation because it is a distance-based method; it could occur when estimating a low crop-yield datum in a zone surrounded by high crop values. They suggested that universal kriging performed poorly because it depended on the hypothesis of isotropy; estimation could be improved in this case by correcting for anisotropy. Nevertheless, the final suggestion was to use regression imputation when data from the Census of Agriculture were available, and to use MI otherwise.
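The two validation schemes behind Table 7.6 can be sketched for the regression imputation case. The example below is a minimal illustration on simulated data, not a reproduction of the published figures: it assumes a simple linear regression of the survey yield on the census yield, and it reads "deleting-5" as holding out five observations at a time. The function name `cv_mape` and the variable names are hypothetical.

```python
import numpy as np


def cv_mape(x, y, fold_size, seed=0):
    """Mean absolute prediction error of a simple linear regression under
    delete-`fold_size` cross-validation (fold_size=1 is the omit-one method)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.random.default_rng(seed).permutation(len(y))
    abs_err = np.empty(len(y))
    for start in range(0, len(y), fold_size):
        held_out = order[start:start + fold_size]
        train = np.setdiff1d(order, held_out)
        slope, intercept = np.polyfit(x[train], y[train], deg=1)
        abs_err[held_out] = np.abs(y[held_out] - (intercept + slope * x[held_out]))
    return abs_err.mean()


# Simulated county yields (assumed values, for illustration only).
rng = np.random.default_rng(1)
census = rng.uniform(40, 80, 60)
survey = 2 + 0.95 * census + rng.normal(0, 3, 60)

mape_omit_one = cv_mape(census, survey, fold_size=1)
mape_delete_5 = cv_mape(census, survey, fold_size=5)
```

Every observation is predicted exactly once in both schemes, so the two values are directly comparable, as in the two columns of the table.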

Several studies consider the issue of missing values in spatial datasets, where the SAR or CAR hypothesis can be used to specify the model of interest. Wang and Lee (2013) consider SAR panel models with randomly missing data in the dependent variable: they suggest that missing data can occur even more frequently in spatial-panel data because temporal and spatial missingness may occur across sectional dimensions. To deal with this type of missing data, they consider three approaches: a generalized method of moments estimation based on linear moments; a non-linear least-squares estimation, in which the reduced form of the panel SAR model is used; and a two-stage least-squares estimation with imputation. Wang and Lee (2013) also propose the use of the spatial Mundlak approach (Mundlak, 1978) if individual effects are correlated with the included regressors.

Polasek et al. (2010) proposed a spatial extension of the Chow and Lin (1971) method, the first to develop a unified framework for three problems – interpolation, extrapolation and distribution – of predicting time series by related series. This model predicts unobserved dependent data using indicators observed at the same disaggregated regional level and a spatial SAR model specification. In a similar approach, Horabik and Nahorski (2011) proposed a method derived from areal to areal data realignment for imputing missing data, accounting for spatial clustering in a CAR specification.

In both studies, the primary interest is to allocate or estimate the variable of interest at a finer geographical scale with respect to that currently available. Because this is the primary objective of SAE, the issue of missing data is addressed with a focus on unit non-response.

In addition to the difficulty associated with small sample sizes, an SAE problem can be further complicated by the fact that not all the units in the sample respond to the survey and the probability that a sampled unit responds may be related to the study variable. Giusti and Rocco (2010) proposed a probability-weighted estimation procedure to adjust for the effect of a non-ignorable non-response mechanism on the small-area mean predictor when a small-area model at the unit level is adopted. Consider a one-fold nested error linear regression model:

where is a fixed covariate vector, is a fixed vector of parameters, is a known constant, and are normally distributed mutually independent terms of error at area and unit level with mean zero and variances and . There is an informative non-response at the first level when some of the values of the target variable are missing, and the associated non-response probabilities are related to the target variable even after conditioning on the covariates.
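For reference, a model matching this description can be written in commonly used notation as below. The symbols are assumptions introduced here, since several conventions exist, and the placement of the known constant as a scale factor on the unit-level error is one common convention rather than a confirmed detail of Giusti and Rocco (2010).

```latex
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i + k_{ij}\,\varepsilon_{ij},
\qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2),
\qquad i = 1, \dots, m, \; j = 1, \dots, N_i
```

Here i indexes the small areas and j the units within an area, the area effects and unit errors are mutually independent, and setting every known constant equal to 1 gives the homoscedastic nested error model.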

In this context the method suggested by Giusti and Rocco (2010) is the pseudo maximum likelihood approach introduced by Skinner (1989) to adjust for informative sample designs, extended to the case of informative unit non-response in SAE problems. This extension requires consideration of the population as two-level, with the individuals (the first-level units) nested in the small areas (the second-level units). To compensate for the effects of an informative non-response at the first level, it is possible to use a multi-level pseudo maximum likelihood approach, which requires knowledge of the survey weights at every level of the population structure. When the sample design is self-weighting and non-response concerns only the first-level units, the first-level survey weight for unit j belonging to small area i can be specified as the inverse of the response probability of the unit. Because response probabilities are usually unknown, they must be estimated using the available information. The simplest and perhaps most common way to estimate individual response probabilities is to partition the sampled units into “weighting classes”, assumed to be homogeneous with respect to the response mechanism, and then to estimate response probabilities as the rates of respondent units in each class. Another common way to estimate individual response probabilities is to express them as a logit function of a set of known variables.
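The weighting-class estimate of the response probabilities, and the resulting first-level weights, can be sketched as follows. This is a minimal illustration under the assumption of a self-weighting design, so that the weight reduces to the inverse of the estimated response probability; the function name `weighting_class_weights` and the toy data are hypothetical.

```python
import numpy as np


def weighting_class_weights(classes, responded):
    """Estimate response probabilities as class response rates.

    Returns the estimated probability for every sampled unit and the
    inverse-probability weight for respondents (0 for non-respondents).
    """
    classes = np.asarray(classes)
    responded = np.asarray(responded, dtype=bool)
    p_hat = np.empty(classes.size)
    for c in np.unique(classes):
        in_class = classes == c
        p_hat[in_class] = responded[in_class].mean()   # rate of respondents in the class
    safe_p = np.where(p_hat > 0, p_hat, np.nan)        # classes with no respondents
    weights = np.where(responded, 1.0 / safe_p, 0.0)
    return p_hat, weights


# Toy sample: class "a" has 3 respondents out of 4, class "b" has 1 out of 4.
classes = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
responded = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=bool)
p_hat, weights = weighting_class_weights(classes, responded)
```

Within each class the respondent weights add up to the class sample size, so the respondents stand in for the non-respondents of their own class; this is the sense in which classes are assumed to be homogeneous with respect to the response mechanism.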

Hence the expression of the small-area mean estimator used by Giusti and Rocco (2010) is:

(7.4)

where and are the set of respondents and the set of non-sampled plus non-respondent units in area i, and with and obtained using the multi-level pseudo maximum likelihood (MPML) estimation with

weights and where the response probabilities may be true or estimated.

To illustrate the bias of the small-area mean estimator that can occur when ignoring an informative response mechanism, and to assess the performance of the MPML estimation procedure on the basis of the true or estimated response probabilities, Giusti and Rocco (2010) designed three simulation studies, A, B and C, each consisting of the following steps:
i. Generate area indexes , and population sizes , with generated from truncated below by and above by ; for the lie in the range [70, 126].
ii. Generate the population random area effects, and the covariates , assuming . This rather complicated formula for generating the auxiliary variables follows Pfeffermann and Sverchkov (2007) and guarantees that the covariates are the same in each of the three groups of areas, except for the random disturbances . The three groups consist of areas , areas and areas
iii. Generate the values according to the model defined in section 2, with and
iv. Associate with each level-1 unit a response probability as follows: in study A for each unit in each area the response probability is obtained through an exponential function of ; in study B the areas are split into four groups using the quartiles of the random area effects distribution, and in each group the response probabilities are generated through an exponential function of the values; but the parameters of this function change from one group to another. In study C the procedure is the same as in study B, but the exponential function used to generate the non-response is assumed to depend only on the individual random effects . In all the studies, the parameters of the non-response generating function are chosen to produce an expected overall population response rate of about 0.7.
v. Select a stratified sample of the first-level units with strata equal to the second-level units and a sampling fraction equal to 0.1 in each stratum.
vi. Classify each level-1 unit in the sample as respondent or non-respondent, carrying out a Bernoulli experiment for each of them.
vii. Repeat steps ii–vi 1,000 times.
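The effect of steps iv and vi can be seen in a single simplified replication. The sketch below is not the simulation design of Giusti and Rocco (2010): it ignores the area structure, draws the target variable from an assumed normal distribution, and uses an assumed exponential response function whose coefficients were chosen only to give an overall response rate of roughly 0.7.

```python
import numpy as np

rng = np.random.default_rng(2010)

n = 20_000
y = rng.normal(10.0, 2.0, n)                      # target variable (assumed distribution)

# Response probability as an exponential function of y, capped at 1 (assumed form).
p = np.minimum(1.0, 0.7 * np.exp(-0.15 * (y - 10.0)))

responded = rng.random(n) < p                     # one Bernoulli experiment per unit

true_mean = y.mean()
naive_mean = y[responded].mean()                  # ignores the response mechanism
w = 1.0 / p[responded]                            # inverse response-probability weights
weighted_mean = np.sum(w * y[responded]) / np.sum(w)
response_rate = responded.mean()
```

Because units with large y respond less often, the unweighted respondent mean falls below the true mean, while weighting by the inverse of the true response probabilities brings the estimate back close to it. This mirrors, in a much simpler setting, the comparison between the unweighted predictor and the predictor based on true response probabilities.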

In study A, for each set of respondents the following six predictors of the area means were computed:
(a) the standard unweighted EBLUP estimator calculated on the set of respondents;
(b) the MPML predictor (7.4) with weights computed using the true response probabilities;
(c) the MPML predictor (7.4) with weights computed using response probabilities estimated with the weighting-within-cells method and using the zij values to define the cells;
(d) the MPML predictor (7.4) with weights computed using response probabilities estimated with a logit model function of the zij, supposed known for all the population units;
(e) the MPML predictor (7.4) with weights computed using response probabilities estimated as in point (d), but assuming as explicative in the logit model , with ;
(f) the standard unweighted EBLUP estimator calculated on the entire sample.

For study B and study C, the same predictors are computed except that the estimator described at point (e) is replaced by:
(g) the MPML predictor (7.4) with weights computed using response probabilities estimated with a logit model assuming as covariate not only zij but also a categorical variable that identifies the groups of areas with different response mechanisms.

Figures 7.5, 7.6 and 7.7 show the area percentage relative biases of the six predictors, (a) to (d), (e) or (g), and (f), by study.

Figure 7.5. Study A


Figure 7.6. Study B

Figure 7.7. Study C

In all the figures the predictor (f), which corresponds to the hypothesis of complete responses, is considered and shown as benchmark. It is evident from all the figures that an informative response mechanism may induce a significant bias in the estimation of the small-area means if the hierarchical regression model is fitted using the standard ML estimation method (case a). The bias can be reduced effectively by the MPML, assuming unrealistically that the response probabilities are known (case b).

Figure 7.5 also provides evidence of the reduction of bias that occurs if auxiliary variables predictive of the response behaviour are available, and the unknown response probabilities are estimated through a logit model (cases d and e). Obviously, the more predictive the available auxiliary variables, the greater is the bias reduction (case d versus case e). When the response mechanism conditional on the auxiliary variables becomes fully ignorable, the estimated response probabilities produce a bias reduction equivalent to that obtained with the true response probabilities. The performance of the weighting-within-cells method (case c) is equivalent to the performance of true response probabilities when based on auxiliary variables predictive of the response behaviour.


The two response probability estimation methods that use only the zij as covariates, the parametric logit model and the non-parametric weighting-class method, appear equivalent in study A, where they perform well, and in studies B and C, where they do not significantly reduce the bias of the traditional EBLUP estimator.

In these two studies, good performance of the suggested MPML predictor requires the inclusion of a categorical variable that identifies the groups of areas with different response mechanisms in the estimation process for the response probabilities. The advantage of introducing this categorical variable to estimate the response probabilities in studies B and C is obvious. The point in question is the following: the response estimation procedures that use only the zij values almost remove the bias of the whole population mean direct estimator (see Table 7.7).

Table 7.7. Percentage relative bias of the whole-population mean direct estimator

         Not adjusted        Response estimation method
Study    for non-response    Logit model    Weighting within cell
A        -1.833              -0.140         -0.045
B        -1.205              -0.153         -0.182
C        -0.473              -0.075         -0.038

It follows that if the researcher who calculates the survey weights is not interested in the SAE problem, s/he may not realize this advantage. In other words, compensating for non-response using a method that works well for the estimation of the overall population mean, without considering the estimation at the small-area level, may or may not reduce the bias of the small-area mean predictions (predictors c and d). This depends on the compensation method, but also on the response mechanism.

From Table 7.7 it is also evident that in study C the bias of the whole-population direct estimator not adjusted for non-response is less than half of the bias of the corresponding small-area mean unweighted predictor (see Figure 7.7). This result indicates that an informative response mechanism may have a modest effect on population estimators, but a significant effect on small-area estimators.

This study shows that the unit non-response and the SAE problems should probably be addressed simultaneously, because the non-response probabilities may depend on individual unit characteristics and also on area-level issues such as administrative problems in conducting the survey in certain areas.

7.3.3 Missing information in methods of data integration

Merging datasets from multiple sources can cause data to be missing from the resulting integrated dataset. A statistical matching problem is a missing-data problem with a non-monotone missingness pattern (see Figure 7.1); it is often described as a missing-by-design pattern. The treatment of missing data in this approach is usually more problematic than in a “standard” non-monotone pattern when the interest is in analysing the variables that have not been jointly observed. The inherent identification problem in statistical matching requires a conditional independence assumption between variables that have not been jointly observed given the variables jointly observed (Rassler, 2002). The analysis can be further complicated because the matched subset of data can be affected by additional missing-data mechanisms such as unit non-response. In such situations, even if there is a missing-by-design pattern, the missing-data mechanism is not MCAR because there is another underlying missing-data problem. Assuming conditional independence, the hypothesis can be maintained that the missing-data mechanism is still ignorable because conditional independence should include ignorability (Koller-Meinfelder, 2009).

With regard to spatial analysis, the data in an integrated dataset can suffer from spatial-temporal misalignment because the location and time characteristics of the original data do not align well. To investigate health effects of air pollution, for example, data on air pollution and health are needed: but the location and time stamps of air pollution data may be imperfectly aligned with the location and time stamps of aggregated, disaggregated or individual-level health data. To address such a misalignment problem, environmental data have to be imputed to the spatial-temporal stamps of health data (Liang and Kumar, 2013).

It should be noted that data gaps deriving from the matching of spatial datasets fall into the problem category of “incompatible spatial data”. The problem has been described in various ways: the ecological inference problem, the MAUP, spatial data transformation, the scaling problem, inference between incompatible zonal systems, block kriging, pycnophylactic geographic interpolation, the polygonal overlay problem, areal interpolation, inference with spatially misaligned data, contour re-aggregation, multi-scale and multi-resolution modelling, and the change-of-support problem (Gotway and Young, 2002). Hence data gaps deriving from misaligned spatial datasets can sometimes be treated using the methods suggested for corresponding “incompatible spatial data” problems, such as the areal interpolation method using the expectation-maximization algorithm.

Analysis of time-space datasets that come from different sources requires that: i) they are aligned with respect to location and time; ii) they are arranged on the same spatial-temporal scales; and iii) missing values are filled. If adequate data points across geographic space and time are available, different methods of interpolation can be employed to impute values at a given location and time.

A recently suggested method is time-space kriging, which can be attractive because it minimizes mean squared prediction errors among linear unbiased predictors. Time-space kriging can therefore address multiple problems arising from the convergence of time and space domains, misalignment, missing values and mismatches in spatial-temporal resolutions (Kumar, 2012). Liang and Kumar (2013) proposed a Bayesian hierarchical spatial-temporal method of interpolation – Markov cube kriging – to deal with spatial-temporal misalignment, mismatches in spatial-temporal scales and missing values across space and time in large spatial-temporal datasets (see also the MAUP in chapter 4).

7.4 Remarks and findings

With regard to the problem of missing geographical information, this chapter has highlighted what happens when a geostatistical model is to be applied that requires knowledge of the exact locations of all population units, information that is seldom available. When estimates for geographical domains are needed, locating all the units in the centroid of the corresponding area can be a strong approximation. This was highlighted in the first simulation study, where an MQGWR model was fitted to estimate the mean of the variable of interest in areas in two alternative settings: using the exact coordinates for all units, or locating them on the centroid of their area. With this approximation, the performance of the MQGWR estimator was affected by an increase in bias and variability. The other option presented was the imputation technique suggested by Bocci and Rocco (2011) in the context of geoadditive models in a Bayesian framework. The simulation results showed that in the absence of prior knowledge of the spatial distribution, an approach that imputes spatial coordinates using a Beta prior distribution is certainly preferable to the classic approach that locates each unit with its corresponding area centroid. The proposal made by Bocci and Rocco (2011) is promising, and it could be extended to other settings and geostatistical models.

The chapter also described the effects on small-area estimates of data gaps resulting from an informative non-response mechanism. Giusti and Rocco (2010) suggested that when data are missing for the units in a sample and the interest is in producing small-area estimates, a possible solution is to use an MPML approach in which the units are assigned weights that are functions of the estimated response probabilities. The simulation studies presented in this chapter lead to the important conclusion that the issues of missing data and SAE should be addressed together if possible. This is because the unit non-response mechanism could be different between areas or between groups of areas, and so a weighting approach that reduces the non-response bias for the whole population may not reduce it for small-area estimators. These observations should be taken into account in further contributions to the study of missing data in SAE problems.

Missing geographical information
• Applying a geostatistical model requires knowledge of the exact locations of all population units.
• Locating all the units in the centroid of the corresponding area can be a strong approximation.
• In the context of geoadditive models in a Bayesian framework, a spatial coordinates imputation approach using a Beta prior distribution is certainly preferable.

Missing target/auxiliary variables
• Apart from situations where the amount of missing data is small, analysis using complete cases only should be avoided.
• Removing a relevant covariate from the analysis because of the missing values could cause a mis-specification of the model.
• The usual solution for treating missing data in datasets containing spatial data is to impute the missing information.
• All the imputation techniques defined for general datasets can also be used for spatial data; this includes the imputed information on spatial distribution.

Missing non-random target data
• The issues of missing data and SAE should be addressed together.
• A possible solution is to use a multi-level pseudo-maximum likelihood approach where the units are assigned weights that are functions of the estimated response probabilities.


8. Analysis of Zero-Inflated Data in SAE

8.1 Introduction

This section addresses the problem of the “zero-inflated dataset”. In many agricultural datasets there can be a large number of zeros in quantitative variables of interest, which leads to problems in the inference process. The expression “zero-inflated data” is used here to mean data that have a larger proportion of zeros than expected from pure-count Poisson data (see for example Barry and Welsh, 2002). Estimates for this particular type of dataset can be obtained by following a Bayesian or a frequentist approach; both are presented in this section.

As stated earlier, SAE techniques are usually based on the LMM. If the LMM holds, alternative estimators will not be as efficient as the EBLUP when spatial information is not available. But the small-area estimator based on the LMM can be inefficient in zero-inflated data situations. In effect, zero inflation in the data invalidates the assumptions of the LMM (McCullagh and Nelder, 1989) and so problems with inference can occur if this feature of the data is not taken into account. When the focus of the inference is on small areas, the presence of excess zeros in small areas will be more influential than in the overall sample.

This kind of mixed distribution has been considered recently in the SAE literature. In particular, the problem of zero-inflated data has been considered following the Bayesian paradigm (Pfeffermann et al., 2008; Dreassi et al., 2013) and the frequentist paradigm (Chandra and Chambers, 2014; Chandra and Sud, 2012).

The zero-inflation problem can be addressed in both paradigms using a two-part mixed model, of which the first part is a logistic function used to model the probability of a positive outcome, and the second is a linear model with normal error terms fitted to the non-zero responses. Both models include individual-level and area-level covariates and area-random effects that account for variations not explained by the covariates.

8.2 Bayesian small-area estimator for zero-inflated data

Suppose that a population U of size N is partitioned into m subsets Ui (domains of study or areas) of size Ni, i = 1,…,m. The population units are identified by j and the small areas by i. The population data consist of values yij of the variable of interest and values xij of a vector of p auxiliary variables that includes the constant term as first component. Suppose that a sample s is drawn and that area-specific samples si ⊂ Ui of size ni ≥ 0 are available for each area or domain. Note that it is possible to have non-sample areas, so ni = 0, in which case si is the empty set. The set ri ⊂ Ui contains the Ni − ni indices of the non-sampled units in small area i. Values of yij are known only for sampled units, while for the p-vector of auxiliary variables it is assumed that area-level totals Xi or their means are accurately known from external sources.

Given y the response variable and Z the covariate variables and random effects (Pfeffermann et al., 2008):

(8.1)

For the classical SAE problem, a random-intercept model can be applied with areas or domains defining the first level and units defining the second level. For a unit j in area i with covariates Zij = z, the following relationship holds:

(8.2)

The two parts in the right-hand side of (8.2) can be modelled separately. For units with positive target values, a random intercept model is assumed:


(8.3)

where y+ij and x+ij are the positive outcome and the vector of covariates for units with positive outcomes, vi is the area-level error and eij is the unit-level error; the standard mixed-model assumptions are taken to hold.

To model the probability of positive outcomes – the second part of equation (8.2) – the generalized LMM is used:

(8.4)

where xij is the vector of covariates for unit j in area i and ui represents the random area effect not accounted for by the covariates; the standard assumptions are taken to hold. The Bayesian model framework also allows for a non-zero correlation between the area random effects of the two parts, vi and ui (see Pfeffermann et al., 2008).
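The structure of the two-part model can be illustrated with a small simulation. The sketch below is not the specification used by Pfeffermann et al. (2008): the coefficient names `beta` and `gamma`, the variance values and the single covariate are assumptions introduced only for illustration, and the two area effects are generated independently. It shows how a logistic part and a linear part with normal errors combine into a zero-inflated outcome whose conditional mean is the product of the probability of a positive outcome and the mean of the positive part.

```python
import numpy as np


def expit(t):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-t))


def simulate_two_part(m=20, n_per_area=50, beta=(10.0, 2.0), gamma=(-0.5, 0.8),
                      sigma_v=1.0, sigma_u=0.5, sigma_e=2.0, seed=0):
    """Simulate zero-inflated data from a two-part random-intercept model.

    All parameter names and default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    area = np.repeat(np.arange(m), n_per_area)      # area label of each unit
    x = rng.uniform(0.0, 5.0, size=area.size)       # one auxiliary variable
    u = rng.normal(0.0, sigma_u, size=m)            # area effects, logistic part
    v = rng.normal(0.0, sigma_v, size=m)            # area effects, linear part
    # Part 1: probability of a positive outcome (logit link).
    p = expit(gamma[0] + gamma[1] * x + u[area])
    is_positive = rng.random(area.size) < p
    # Part 2: linear mixed model with normal errors for the non-zero responses.
    mu = beta[0] + beta[1] * x + v[area]
    e = rng.normal(0.0, sigma_e, size=area.size)
    y = np.where(is_positive, mu + e, 0.0)
    return area, x, y, p, mu


area, x, y, p, mu = simulate_two_part()
expected = p * mu  # conditional mean of y given covariates and area effects
print("share of zeros:", float(np.mean(y == 0.0)))
print("mean of y:", float(y.mean()), "mean of p * mu:", float(expected.mean()))
```

With normal errors the positive part can in principle produce negative values; the default parameters make this very unlikely, but a skewed distribution may be more appropriate for real data.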

Given that the parameter of interest is the small-area mean or total, unobserved outcomes should be predicted:

(8.5)

where define appropriate sample estimates. Adding estimates of the unit-level errors to the estimated mean values reflects the variability of the positive responses more closely. In the Bayesian approach, the missing scores are predicted by drawing at random from their predictive distribution. The proportion of non-zero outcomes is predicted in the frequentist approach as:

(8.6)

where:

(8.7)

A Bayesian solution consists of predicting the indicators I(y>0) by drawing at random from their predictive distribution.

Methods for estimating fixed and random effects when fitting LMMs or generalized LMMs alone have been developed in the last two decades in the frequentist and the Bayesian paradigms. These methods make it possible to compute estimators of the MSE or the Bayes risk of the small-area predictors that account for hyper-parameter estimation to the correct order (see Rao, 2003; Jiang and Lahiri, 2005).

The use of Bayesian methods requires specification of prior distributions for the fixed parameters underlying the two-part model. With the aid of MCMC simulations, this approach permits sampling from the posterior distribution of the fixed parameters and the random effects, and hence sampling from the predictive distribution of the unobserved responses. The approach therefore yields the whole posterior distribution of the small-area parameters of interest, thereby enabling computation of correct MSE or posterior variance measures and of confidence intervals that account for all the sources of variation (Pfeffermann et al., 2008). The MSE or Bayes risk is estimated by computing the empirical variance of the sampled values. Credibility intervals with coverage rates of (1 − α) are defined by the α/2 and (1 − α/2) level quantiles of the empirical posterior distribution.
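The posterior summaries described above can be computed directly from the sampled values. The following sketch assumes that a vector of MCMC draws of one small-area parameter is already available; the function name `posterior_summary` and the stand-in draws are hypothetical and serve only to show the calculation of the empirical variance and of the quantile-based interval.

```python
import numpy as np


def posterior_summary(draws, alpha=0.05):
    """Summarise posterior draws of one small-area parameter.

    Returns the posterior mean, the empirical variance of the draws and the
    interval bounded by the alpha/2 and (1 - alpha/2) empirical quantiles.
    """
    draws = np.asarray(draws, dtype=float)
    point = draws.mean()
    variance = draws.var(ddof=1)
    lower, upper = np.quantile(draws, [alpha / 2.0, 1.0 - alpha / 2.0])
    return point, variance, (float(lower), float(upper))


# Stand-in values used in place of real MCMC output (illustration only).
draws = np.arange(1.0, 101.0)
point, variance, interval = posterior_summary(draws, alpha=0.05)
print(point, variance, interval)
```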

Dreassi et al. (2013) suggest a hierarchical Bayesian approach to SAE for dealing with semi-continuous, skewed and spatially structured data, which occur frequently in agricultural applications. None of the methods mentioned earlier appear to be directly applicable to this problem, however, because of the nature of the response variable: its distribution is zero-inflated, highly skewed for the non-zero values and presents a spatial trend. To describe these features, a suitable extension to current methods is proposed that considers the highly skewed distribution of the positive responses. This justifies the choice of the gamma model in the second part of the model, whose effectiveness is confirmed by the results. In the SAE framework, the skewness of data is usually treated using the log-normal distribution. Because it is highly flexible, the gamma distribution could be a valid alternative (see for example Firth, 1988). When the target variable shows a spatial trend, appropriate use of geographical information makes it possible to achieve more accurate SAE.

Further investigation is needed, even though the suggested approach provides encouraging results: in particular, the conditions that make the full two-part model preferable to the separate ones need to be evaluated.

8.3 Frequentist SAE for zero-inflated data

The frequentist approach is similar to the Bayesian approach: the difference lies in the parameter estimation of the two models (the linear model for positive outcomes and the model for zero and non-zero outcomes) and in the estimation of the MSE. The models involved in the estimation of small-area means for zero-inflated data are:

(8.8)

where the indicator of a positive outcome is a binary variable assumed to follow a generalized LMM with logit link function. In the model that links the probability of positive values with the covariates, the regression coefficients are unknown fixed-effect parameters and ui is the random area effect associated with area i, which is assumed to be normal with zero mean and constant variance. In the model for positive outcomes, the regression coefficients are again unknown fixed-effect parameters, vi is the random area effect associated with area i, which is also assumed to be normal with zero mean and constant variance, x+ij is the vector of auxiliary variables for positive-outcome unit j in area i, and pij is the probability of observing a positive outcome. In the frequentist approach it is difficult to take into account a non-zero correlation between ui and vi, so they are treated as uncorrelated. Estimates of the unknown parameters and the variance components of the linear random-effect model can be obtained by maximum likelihood or restricted maximum likelihood estimation, while the generalized LMM parameters and random effects can be estimated with the penalized quasi-likelihood method combined with restricted maximum likelihood estimation (Saei and Chambers, 2003).

An approximately model-unbiased estimate of the small-area mean is:


(8.9)

where

(8.10)

The quantities in (8.10) are the estimate of the fixed-effect coefficient of the logit model, the estimate of the fixed-effect coefficient of the linear random-effect model for positive outcomes, and the predicted random area effect in the generalized LMM. The MSE is estimated using the bootstrap scheme described in Chandra and Sud (2012).
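A plug-in predictor of this kind can be sketched as follows. The code assumes a commonly used form for two-part predictors, in which the observed values of the sampled units are added to the predicted values of the non-sampled units, each prediction being the product of an estimated probability of a positive outcome and an estimated mean of the positive part. This form, the function name `zi_area_mean`, the argument names and the availability of unit-level covariates for non-sampled units are all assumptions; the sketch does not reproduce expressions (8.9) and (8.10) exactly.

```python
import numpy as np


def expit(t):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-t))


def zi_area_mean(y_s, x_r, beta_hat, gamma_hat, v_hat, u_hat):
    """Plug-in predictor of one small-area mean under a two-part model.

    y_s       : observed outcomes of the sampled units in the area
    x_r       : covariate values of the non-sampled units (assumed known)
    beta_hat  : (intercept, slope) of the linear model for positive outcomes
    gamma_hat : (intercept, slope) of the logit model
    v_hat     : predicted area effect in the linear model
    u_hat     : predicted area effect in the logit model
    """
    y_s = np.asarray(y_s, dtype=float)
    x_r = np.asarray(x_r, dtype=float)
    design = np.column_stack([np.ones(x_r.size), x_r])
    mu_hat = design @ np.asarray(beta_hat, dtype=float) + v_hat
    p_hat = expit(design @ np.asarray(gamma_hat, dtype=float) + u_hat)
    n_pop = y_s.size + x_r.size
    # Observed part plus predicted part, divided by the area population size.
    return float((y_s.sum() + np.sum(p_hat * mu_hat)) / n_pop)


# Toy check: p_hat = 0.5 and mu_hat = 2 for both non-sampled units.
estimate = zi_area_mean(y_s=[0.0, 4.0], x_r=[1.0, 1.0],
                        beta_hat=(2.0, 0.0), gamma_hat=(0.0, 0.0),
                        v_hat=0.0, u_hat=0.0)
print(estimate)  # (4 + 2 * 0.5 * 2) / 4 = 1.5
```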

Note that if the positive outcomes do not follow an LMM, or if for any other reason a log-transformation of the outcome variable is necessary, then the small-area estimator must be corrected so that transformation back to the original scale of the outcome does not introduce substantial bias. Work on this issue is in progress.

8.4 Empirical evaluation for the frequentist approach

The Bayesian and frequentist approaches are similar, but the estimation process for the target statistics is different. The focus here is on the frequentist approach.

An empirical evaluation using data from the 2000 Italian farm census covering the 54 agrarian regions of Tuscany considered a large number of available variables for subjects such as crops, land use and employment for each farm. A population of 137,413 farms was generated using an LMM in which the target variable was the production of olives and the auxiliary variable was the surface used for this production, which was known from the farm census. Area effects were generated under a normal model with mean 0 and variance 36, the unit-level random error was generated with a normal model with mean 0 and variance 100, and 1,000 samples of fixed size were drawn from the generated population. Sample size in the 54 agrarian regions varied between 5 and 20 units (mean 11.56, median 10) and the population size varied between 45 and 8,661 (mean 2,545, median 2,018). The direct, EBLUP (see chapter 2) and Zero-Inflated EBLUP (ZIEBLUP) (see equation 8.9) estimators were computed for each sample. The direct estimator in this case was the sample mean. The results are summarized in Table 8.1 using the RB and the RRMSE.

Table 8.1. Mean and median percentage RB and percentage RRMSE of different estimators in the design-based simulation

              Direct    EBLUP    ZIEBLUP

% RB
  Mean           0.2      1.8       -0.2
  Median        -1.2      0.5       -0.5

% RRMSE
  Mean          91.8      7.0        6.8
  Median        84.7      4.5        4.6
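The two performance measures reported in Table 8.1 can be computed from simulation output as sketched below. The definitions used here are the ones commonly adopted in SAE simulation studies and are assumed rather than taken from the report: for each area, the RB is the average error over the replicates divided by the true value, and the RRMSE is the root of the average squared error divided by the true value, both expressed as percentages. The function name `rb_rrmse` and the toy numbers are hypothetical.

```python
import numpy as np


def rb_rrmse(estimates, truth):
    """Percentage relative bias and relative RMSE for each small area.

    estimates : array of shape (replicates, areas) with simulated estimates
    truth     : array of shape (areas,) with the true area values
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    error = estimates - truth
    rb = 100.0 * error.mean(axis=0) / truth
    rrmse = 100.0 * np.sqrt((error ** 2).mean(axis=0)) / truth
    return rb, rrmse


# Toy example: two replicates and two areas.
estimates = [[9.0, 22.0],
             [11.0, 18.0]]
truth = [10.0, 20.0]
rb, rrmse = rb_rrmse(estimates, truth)
# Table 8.1 reports the mean and median of such area-level values.
print(rb.mean(), np.median(rb), rrmse.mean(), np.median(rrmse))
```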


The results confirm that direct estimates are unreliable (their RRMSE was more than 10 times that of the EBLUP and ZIEBLUP) and that the zero-inflated estimator offers a small gain in bias and in variability compared with the EBLUP estimator: the mean RB falls in absolute value from 1.8 to 0.2 percent and the mean RRMSE from 7.0 to 6.8 percent, while the median RRMSE is practically unchanged. In other situations the gain in variability with respect to the EBLUP can be more substantial, but further research is needed. These results are consistent with those obtained in the design-based simulation in Chandra and Sud (2012).

8.5 Remarks and findings

SAE with zero-inflated data requires further research. Practitioners should check how well the model fits the whole dataset, including zeros; if the fit is good they can proceed as usual, using appropriate small-area estimators in line with data availability and the aims of the analysis. The alternative is to use either the frequentist or the Bayesian estimator. Neither dominates the other in terms of efficiency, so practitioners should use the one that is easiest to implement with the available resources. Note that the auxiliary variables used to model the probability of positive outcomes can be different from the auxiliary variables used to model positive outcomes.

Small-area estimation with zero-inflated target variable

• Zero-inflated data can be handled in a Bayesian or frequentist framework.

• A recommended practical approach is to fit models to zero and non-zero data and proceed as usual if the prediction is good.

• When the model fit is not good, the Bayesian or frequentist approach can be used to obtain small-area estimates.


9. Final Remarks and Recommendations

The objective of this report is to assess the use of simple spatial-disaggregation techniques based on digital maps and of SAE methods in the production of agricultural and rural statistics at the local or small-area level, which is the geographical level at which data are requested in order to plan sub-regional policies or evaluate the results of policy implementation.

With regard to the spatial disaggregation techniques, the availability of digital maps makes it possible to apply simple areal interpolation methods to disaggregate the original data of a region into smaller target zones of the same region. The integration of several maps of the original source zone provides disaggregated data at the local level. These easily obtained data can then be combined with information on environmental conditions, actual observations and measurements of agriculture, population distribution and other remotely sensed data. However, the variety of observation data and processing models required is considerable, and in a number of countries several of these sources of evidence on environment, agriculture and population are not currently available.
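The simplest of these areal interpolation methods, area weighting, can be sketched in a few lines. The example assumes that the variable is spread uniformly over the source zone, so that each target zone receives a share of the source total proportional to its area of overlap with the source zone; the function name `area_weighted_disaggregation` and the numbers are hypothetical.

```python
import numpy as np


def area_weighted_disaggregation(source_total, overlap_areas):
    """Split a source-zone total among target zones in proportion to area.

    source_total  : value of the variable for the whole source zone
    overlap_areas : areas of the target zones that partition the source zone
    """
    overlap_areas = np.asarray(overlap_areas, dtype=float)
    weights = overlap_areas / overlap_areas.sum()
    return source_total * weights


# A source zone with a total of 100 units split into target zones
# covering 1 and 3 units of area.
allocated = area_weighted_disaggregation(100.0, [1.0, 3.0])
print(allocated)
```

The allocated values add up to the source total, so the disaggregation preserves the original figure for the region.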

Graphics and tables are provided to show how each method works, with examples of mapping of agricultural data for crop production data and the construction of population density grids.

The main observations on SAE are:

• The term “small area” does not refer only to geographical extent. The literature qualifies a small area as a region or a sub-population for which the sample size or the number of observations is not enough to guarantee the statistical accuracy of estimates. In this sense, a state or a country can be considered a small area.

• Provided that auxiliary information linked to the study variable is available, it is possible to reinforce the validity of evidence from the small sample, in other words to borrow strength from it. How to model the link, and which auxiliary information (spatial or non-spatial) is most effective for increasing the efficiency of the estimates, are the themes treated in the report. The report analyses the different availability and typology of outcome and auxiliary variables and tries to suggest which is the "best" method for estimation.

• The small-area estimates produced by these techniques are “new statistics” that are not available from survey or administrative data sources. They integrate survey data collected on sampled units and administrative data, and can be derived for out-of-sample small areas if auxiliary information is available. The examples of obtaining such new statistics cover estimations of agro-environmental parameters: forest biomass per hectare in Norway, mean agrarian surface area for production of grapes in Italy and the acid neutralizing capacity of lakes in the north-eastern United States.

• Auxiliary spatial information is crucial in many applications of SAE methods, particularly when predicting target parameters for out-of-sample areas. In this case, digital GIS maps can provide basic spatial data on the geography of the study zone such as the centroids of the areas and the distances between them, which are helpful in capturing variability among areas to produce more credible estimates.

• Small-area methods are analytical. They can help significantly in the production of low-cost official statistics, but methods and assumptions need to be clarified. The application of the traditional and more recent SAE methods and their assumptions and properties are described in detail for users in Chapter 2 of the report.

• To produce SAE estimates, the minimum information requirement is direct estimates for the sampled areas and auxiliary information, including spatial information on the areas, for sampled and non-sampled areas. The FH-EBLUP is useful when there is no spatial correlation in the distribution of the study variable. With moderate values of spatial correlation, say 0.5 in absolute value, the FH-SEBLUP is recommended because it is more accurate.

• When data are available at the unit level (survey micro-data for sampled units and micro-data for population units, for example) several SAE models adapt well to the distribution of the study variable, particularly the EBLUP and its spatial version the SEBLUP. When the distribution of the study variable is not normal, the MQ and MQGWR estimators are more versatile and have more useful properties than the standard solutions in real-life applications.

Several open issues are identified relating to the resilience of these SAE methods to non-standard situations that occur in agricultural surveys.

The following recommendations are listed in the order of the chapters of Part II, where they are treated in detail.

Sensitivity of the SAE predictors to different spatial models

The spatial distribution of crop and land use in small areas may be non-stationary and may show spatial correlation. It should be noted that: i) when spatial correlation is high and the process is stationary (SAR), the SEBLUP is the best choice: high spatial correlation should be included in the fixed and random parts of the model through the SEBLUP estimator; ii) when the process is spatially non-stationary, the SAE model should include some covariates that capture the spatial effects, otherwise the efficiency of the predictors is compromised; and iii) when spatial heterogeneity is relevant across the areas, the SAE approach is useful because it performs better than synthetic estimates from interpolation methods such as kriging and GWR.

The MAUP

The MAUP (Unwin, 1996) is a potential source of error that can affect SAE results and spatial studies that utilize aggregate data sources. The problem lies in the definition of the small area of interest, because the relation between variables can be different when measured on sets of small areas at different scales. A simple way to deal with the MAUP in SAE is to carry out analyses at several scales or with several zoning systems. In the simulation it was shown that the SAE estimators in the MQ approach are the most resilient to the scale effect of the MAUP; the performance of the kriging estimator is acceptable. The recommendation is to apply the MQ approach whenever new small-area statistics have to be produced at different scales of aggregation.
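The scale effect can be illustrated with a small simulation that repeats the same analysis at several levels of aggregation. In the sketch below two variables share a smooth spatial trend on a regular grid and are otherwise independent; the grid, the trend, the noise level and the function name `correlation_by_scale` are assumptions made only for illustration and do not reproduce the simulation carried out for the report.

```python
import numpy as np


def correlation_by_scale(x, y, block_sizes):
    """Correlation between two gridded variables after block averaging.

    x, y        : square arrays of unit-level values on the same grid
    block_sizes : side lengths of the square blocks used as areas
    """
    n = x.shape[0]
    result = {}
    for b in block_sizes:
        k = n // b  # number of blocks per side; leftover cells are dropped
        xa = x[:k * b, :k * b].reshape(k, b, k, b).mean(axis=(1, 3))
        ya = y[:k * b, :k * b].reshape(k, b, k, b).mean(axis=(1, 3))
        result[b] = float(np.corrcoef(xa.ravel(), ya.ravel())[0, 1])
    return result


rng = np.random.default_rng(1)
n = 64
# Shared north-south trend plus independent unit-level noise.
trend = np.linspace(0.0, 1.0, n)[:, None] * np.ones((1, n))
x = trend + rng.normal(0.0, 1.0, size=(n, n))
y = trend + rng.normal(0.0, 1.0, size=(n, n))

result = correlation_by_scale(x, y, [1, 2, 4, 8])
for b, r in result.items():
    print(f"block side {b}: correlation {r:.2f}")
```

Averaging over larger blocks removes unit-level noise but keeps the shared trend, so the measured correlation grows with the size of the areas even though the unit-level relation is unchanged.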

Robustness of the predictors

The assumption of normality can rarely be accepted when studying the distribution of agro-environmental variables. When data are generated under the normality assumption, the EBLUP, EBLUP-GC and GWEBLUP predictors show a substantial gain in efficiency as measured by a lower RRMSE in the spatially stationary, SAR stationary and spatially non-stationary models. Of the interpolation methods, only GWR is competitive in terms of efficiency. The MQ approach does not gain in efficiency under normality scenarios, but its resilience to departures from normality and to outliers is better than that of the EBLUP family of predictors; when it is compared with other robust estimators, however, the results are different. In the comparison of the MQ approach with the REBLUP based on the mixed-model approach, neither dominated the other in terms of efficiency and robustness, and it is not easy to advise on which to use. The only recommendations are not to ignore the presence of outliers in the data, and to use one of the two proposed estimators instead of the EBLUP estimator or another non-robust estimator.

Complexity of the sampling design of the survey on the target variable

With regard to the effect of complex sampling design on small-area estimators, it is crucial to determine when the design can be considered ignorable. Practitioners should note that a design is considered non-ignorable when some of the variables contributing to the calculation of sampling weights cannot be included in the model used in the model-based or model-assisted small-area estimators. In such cases the pseudo-EBLUP SAE estimator should be used if there is no evidence of outliers; if outliers are present in the data the WMQ estimator should be used.

Missing data in spatial datasets

Two particular problems were considered: missing data in geographical information, and missing data in the study or auxiliary variables. With regard to the former, the investigation looked into the effect of missing unit locations when the interest is in applying a geostatistical model that requires knowledge of the exact locations of all the population units, which is seldom available. When the interest is in producing estimates for geographical domains, the classic solution is to locate all the units at the centroid of the corresponding area, which is a rough approximation. If prior knowledge of the spatial distribution is not available, a spatial coordinate imputation approach using a beta prior distribution is certainly preferable to the classic approach that locates each unit at its corresponding area centroid. Consideration of the effect of missing data deriving from an informative non-response mechanism on small-area estimates led to the conclusion that the issues of missing data and SAE should be addressed together if possible.

The excess of zeros in survey data

In agricultural data there are numerous zeros in the quantitative variables of interest: most crops, for example, are only harvested in one or two quarters of a year, which leads to a recurring pattern of zeros in the original data and subsequent problems in the inference process and in the validity of SAE estimates. The proposed Bayesian and frequentist approaches are cumbersome and require the specification of complex models. In short, SAE for zero-inflated data requires further research. The recommendation is to check how well the model fits the whole dataset, including zeros: if the fit is satisfactory, practitioners can proceed as usual in SAE, without special treatment of the zeros, choosing appropriate small-area estimators in line with data availability and the aims of the analysis. Alternatively, the frequentist or Bayesian estimator can be applied: neither dominates the other in terms of efficiency, so practitioners should use the one that is easiest to implement with the available resources.

In all the case studies oversampling of the study area to increase the sample size might be a feasible alternative to SAE, but it is likely to be expensive and time-consuming.

Spatial information helped to validate the specification of the operational model underlying the SAE predictors of the target parameters, which are generally expressed in means and percentages. As these are correlated with the geography of the landscape, the introduction of spatial information in the SAE operational model improved the accuracy of the estimates and gave a more realistic mapping of the indicators.

It is essential that the core set of statistical indicators and data on agro-environmental phenomena be defined at the international level. FAO has done this, but it is outside the scope of this report to define what is local and what is not local in any country. That said, it is noted that the availability of data at the local level and their statistical quality is far from homogeneous among countries. But when survey data are few and costs and other constraints prevent additional surveys or over-sampling of study areas, the available information must be integrated into generally applicable approaches such as SAE methods.


References (Part II)

Allison, P.D. 2002. Missing Data. Thousand Oaks, CA, USA, Sage.

Barry, S.C. & Welsh, A.H. 2002. Generalized additive modelling and zero-inflated count data. Ecological Modelling 157, 179–188.

Bocci, C. & Rocco, E. 2011. Estimates for geographical domains through geoadditive models in presence of missing geographical information. Working paper 2011/01. Florence, Italy, Department of Statistics, University of Florence.

Bollinger, C.R. 1998. Measurement error in the current population survey: a non-parametric look. Journal of Labor Economics 16, 576–594.

Brown, G., Chambers, R., Heady, P. & Heasman, D. 2001. Evaluation of small area estimation methods: an application to unemployment estimates from the UK Labour Force Survey. Proceedings of Statistics Canada symposium, 2001. Quebec, Canada.

Buonaccorsi, J.P. 1995. Prediction in the presence of measurement error: General discussion and an example predicting defoliation. Biometrics 51, 1562–1569.

Carroll, R., Eltinge, J. L. & Ruppert, D. 1993. Robust linear regression in replicated measurement error models. Statistics and Probability Letters 16, 169–175.

Carroll, R.J., Ruppert, D., Tosteson, T., Crainiceanu, C. & Karagas, M. 2004. Non-parametric regression and instrumental variables. Journal of the American Statistical Association 99, 736–750.

Carroll, R.J., Ruppert, D., Stefanski, L.A. & Crainiceanu, C.M. 2006. Measurement Error in Nonlinear Models. A Modern Perspective. Second edition. Chapman and Hall/CRC, Boca Raton, FL, USA.

Chambers, R., Chandra, H. & Tzavidis, N. 2011. On bias-robust mean squared error estimation for pseudo-linear small area estimators. Survey Methodology 37, 153–170.

Chambers, R. & Tzavidis, N. 2006. M-quantile models for small area estimation. Biometrika 93, 255–268.

Chandra, H. & Chambers, R. 2005. Comparing EBLUP and C-EBLUP for small area estimation. Statistics in Transition 7, 637–648.

Chandra, H., Salvati, N. & Chambers, R. 2007. Small area estimation for spatially correlated populations: a comparison of direct and indirect model-based methods. Statistics in Transition 8, 331–350.

Chandra, H., Salvati, N., Chambers, R. & Tzavidis, N. 2012. Small area estimation under spatial non-stationarity. Computational Statistics and Data Analysis 56, 2875–2888.

Chandra, H. & Sud, U.C. 2012. Small area estimation for zero-inflated data. Communications in Statistics – Simulation and Computation 41(5), 632–643.

Chandra, H. & Chambers, R. 2014. Small area estimation for semi-continuous data. Biometrical Journal 06/2014 DOI: 10.1002/bimj.201300233.


Chesher, A. & Schluter, C. 2002. Welfare measurement and measurement error. Review of Economic Studies 69, 357–378.

Chow, G.C. & Lin, A. 1971. Best linear unbiased interpolation, distribution and extrapolation of time series by related series. Review of Economics and Statistics 53, 372–375.

Crainiceanu, C., Ruppert, D. & Wand, M.P. 2005. Bayesian analysis for penalized spline regression using WinBUGS. Journal of Statistical Software 14(14), 1–24.

Cressie, N. 1993. Statistics for spatial data. New York, Wiley.

Dreassi, E., Petrucci, A. & Rocco, E. 2013. Small area estimation for semi-continuous skewed spatial data: an application to the grape wine production in Tuscany. Biometrical Journal, doi:10.1002/bimj.201200271.

Fabrizi, E., Salvati, N., Tzavidis, N. & Pratesi, M. 2013. Outlier-robust model-assisted small area estimation. Biometrical Journal, DOI:10.1002/bimj.201200095.

Fay, R.E. 1992. When are inferences from multiple imputation valid? Proceedings of the Survey Research Methods Section. American Statistical Association 81(1), 227–32.

Fellner, W. H. 1986. Robust estimation of variance components. Technometrics 28, 51–60.

Firth, D. 1988. Multiplicative errors: log-normal or gamma? Journal of the Royal Statistical Society 50(2), 266–268.

Fotheringham, A.S. 1989. Scale-independent spatial analysis. In: Goodchild, M. & Gopal, S. (eds.) Accuracy of Spatial Database, London, Taylor and Francis, pp. 221–228.

Fotheringham, A.S. & Wong, D.W.S. 1991. The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning 23(7).

Fotheringham, A.S., Brunsdon, C. & Charlton, M. 2002. Geographically Weighted Regression. Bognor Regis, UK, John Wiley and Sons.

Fuller, W.A. 1987. Measurement Error Models. New York, Wiley.

Ganguli, B., Staudenmayer, J. & Wand, M.P. 2005. Additive models with predictors subject to measurement error. Australia and New Zealand Journal of Statistics 47, 193–202.

Gehlke, C.E. & Biehl, K. 1934. Certain effects of grouping upon the size of the correlation coefficient in census tract material. Journal of the American Statistical Association 29, 169–170.

Gelman, A. 2007. Struggles with survey weighting and regression modelling. Statistical Science 22(2): 153–164.

Giusti, C. 2009. Multiple imputation of missing income data in the survey on income and living conditions. Rivista di Statistica Ufficiale 2(3): 63.

Giusti, C., Tzavidis, N., Salvati, N. & Pratesi, M. 2014. Resistance to outliers of M-quantile and robust random effect small area models. Communications in Statistics – Simulation and Computation 43: 549–568.


Giusti, C. & Rocco, E. 2010. Small area estimation in presence of non-response. Working paper 2010/13. Florence, Italy, Department of Statistics, University of Florence.

Goodchild, M.F. 1991. Issues of quality and uncertainty. In: Muller, J.C. (ed.) Advances in Cartography, New York, Elsevier Science, pp. 113–139.

Goodchild, M.F. 1999. Measurement-based GIS. In: Shi, W., Goodchild, M.F. & Fisher, P.F. (eds.) Proceedings of the International Symposium on Spatial Data Quality 99, Hong Kong, Hong Kong Polytechnic University, pp. 1–9.

Goodchild, M.F. & Gopal, S. (eds.) 1989. Accuracy of Spatial Databases. London, Taylor and Francis.

Goovaerts, P. 2009. Combining area-based and individual-level data in the geostatistical mapping of late-stage cancer incidence. Spatial and Spatio-Temporal Epidemiology 1, 61–71.

Gotway, C.A. & Young, L.J. 2002. Combining incompatible spatial data. Journal of the American Statistical Association 97, 632–648.

Gryparis, A., Coull, B.A., Schwartz, J. & Suh, H.H. 2007. Semi-parametric latent variable regression models for spatio-temporal modelling of mobile source particles in the greater Boston area. Applied Statistics 56, 183–209.

Gryparis, A., Paciorek, C.J., Zeka, A., Schwartz, J. & Coull, B.A. 2009. Measurement error caused by spatial misalignment in environmental epidemiology. Biostatistics 10, 258–274.

Haining, R. 2003. Spatial Data Analysis: Theory and Practice. Cambridge, UK, Cambridge University Press.

Heuvelink G.B.M. 1998. Error Propagation in Environmental Modelling with GIS. London, Taylor and Francis.

Holt, D., Steel, D.G., Tranmer, M. & Wrigley, N. 1996. Aggregation and ecological effects in geographically based data. Geographical Analysis 28, 244–261.

Horabik, J. & Nahorski, Z. 2011. Spatial disaggregation of pollutant concentration data. Proceedings of the Spatial 2 Conference: Spatial Data Methods for Environmental and Ecological Processes. Foggia, Italy.

Horvitz, D.G. & Thompson, D.J. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 663–685.

Huggins, R.M. 1993. A robust approach to the analysis of repeated measures. Biometrics 49, 255–268.

Jiang, J. & Lahiri, P. 2005. Mixed model prediction and small area estimation. TEST 15, 65–72.

Jiang, J. & Lahiri, P. 2006. Estimation of finite population domain means: a model-assisted empirical best prediction approach. Journal of the American Statistical Association 101, 301–311.

Kammann, E.E. & Wand, M.P. 2003. Geoadditive models. Applied Statistics 52, 1–18.

Kish, L. 1990. Weighting: why, when and how? American Statistical Association, Proceedings of the section on survey research methods, pp. 121–130.


Koller-Meinfelder, F. 2009. Analysis of incomplete survey data: multiple imputation via Bayesian bootstrap predictive mean matching. PhD Dissertation. Bamberg, Germany, Otto Friedrich Universität.

Kumar, N. 2012. Uncertainty in the relationship between criteria pollutants and low birthweight in Chicago. Atmospheric Environment 49, 171–179.

Larsen, T., Nagoda, D. & Anderson, J.R. 2001. The Barents Sea Ecoregion: a Biodiversity Assessment. Oslo, World Wildlife Fund.

Lehtonen, R. & Pahkinen, E. 2004. Practical Methods for Design and Analysis of Complex Surveys. New York, Wiley.

Lehtonen, R. & Veijanen, A. 1999. Domain estimation with logistic generalized regression and related estimators, pp. 121–128. IASS Satellite Conference on Small Area Estimation. Riga, Latvian Council of Science.

Leung, Y., Ma, J.-H. & Goodchild, M.F. 2004. A general framework for error analysis in measurement-based GIS: parts 1–4. Journal of Geographical Systems 6, 325–428.

Leung, Y. & Yan, J.P. 1998. A locational error model for spatial features. International Journal of Geographical Information Science 12, 607–620.

Liang, D. & Kumar, N. 2013. Time-space kriging to address the spatio-temporal misalignment in the large datasets. Atmospheric Environment 72, 60–69.

Little, R.J.A. 1982. Models for non-response in sample surveys. Journal of the American Statistical Association 77, 237–250.

Little, R.J.A. & Rubin, D.B. 1987. Statistical Analysis with Missing Data. Cambridge, Mass., USA, Wiley.

Little, R.J.A. & Rubin, D.B. 2002. Statistical Analysis with Missing Values. Cambridge, Mass., USA, Wiley.

Lokupitiya, R.S., Lokupitiya, E. & Paustian, K. 2006. Comparison of missing value imputation methods for crop yield data. Environmetrics 17, 339–349.

Madsen, L., Ruppert, D. & Altman, N.S. 2008. Regression with spatially misaligned data. Environmetrics 19, 453–467.

Marley, J. & Wand, M.P. 2010. Non-standard semiparametric regression via BRugs. Journal of Statistical Software 37(5): 1–30.

McCullagh, P. & Nelder, J.A. 1989. Generalized Linear Models. New York, Chapman and Hall.

Meng, X.L. 2002. A congenial overview and investigation of multiple imputation inferences under uncongeniality. In: Groves, R.M., Dillman, D.A., Eltinge, J.L. & Little, R.J.A. (eds.). Survey Nonresponse, New York, Wiley, pp. 357–371.

Meng, X.L. 1994. Multiple-imputation inferences with uncongenial sources of input. Statistical Science 9, 538–573.


Mowrer, H.T. & Congalton, R.G. (eds.). 2000. Quantifying spatial uncertainty in natural resources: Theory and applications for GIS and remote sensing. Chelsea, MI, USA, Ann Arbor Press.

Mundlak, Y. 1978. On the pooling of time series and cross-section data. Econometrica 46, 69–85.

Münnich, R. & Burgard, J.P. 2012. On the influence of sampling design on small area estimates. Journal of the Indian Society of Agricultural Statistics 66(1): 145–156.

Openshaw, S. & Taylor, P.J. 1979. A million or so correlation coefficients: three experiments on the modifiable areal unit problem. In: Wrigley, N. (ed.), Statistical Applications in the Spatial Sciences, London, Pion, pp. 127–144.

Opsomer, J.D., Claeskens, G., Ranalli, M.G., Kauermann, G., & Breidt, F.J. 2008. Nonparametric small area estimation using penalized spline regression. Journal of the Royal Statistical Society, Series B, 70, 265–286.

Petrucci, A., Pratesi, M., & Salvati, N. 2005. Geographic information in small area estimation: small area models and spatially correlated random area effects. Statistics in Transition 7, 609–623.

Petrucci, A. & Salvati, N. 2006. Small area estimation for spatial correlation in watershed erosion assessment. Journal of Agricultural, Biological and Environmental Statistics 11, 169–182.

Pfeffermann, D. 1993. The role of sampling weights when modelling survey data. International Statistical Review 61(2): 317–337.

Pfeffermann, D. & Sverchkov, M. 2007. Small-area estimation under informative probability sampling of areas and within the selected areas. Journal of the American Statistical Association 102, 1427–1439.

Pfeffermann, D., Terryn, B. & Moura, F.A.S. 2008. Small area estimation under a two-part random effects model with application to estimation of literacy in developing countries. Survey Methodology 34, 235–249.

Polasek, W., Llano, C. & Sellner, R. 2010. Bayesian methods for completing data in spatial models. Review of Economic Analysis 2(2): 194–294.

Prasad, N. & Rao, J. 1990. The estimation of mean squared error of small-area estimators. Journal of the American Statistical Association 85, 163–171.

Pratesi, M. & Salvati, N. 2005. Sampling strategies and multifunctionality in agricultural surveys. Atti del Convegno Intermedio SIS 2005 Statistica e Ambiente, Messina. ISBN: 88-7178-531-2, CLEUP.

Pratesi, M. & Salvati, N. 2008. Small area estimation: the EBLUP estimator based on spatially correlated random area effects. Statistical Methods and Applications 17, 113–141.

Pratesi, M. & Salvati, N. 2009. Small area estimation in the presence of correlated random area effects. Journal of Official Statistics 25, 37–53.

Qi, Y. & Wu, J. 1996. Effects of changing spatial resolution on the results of landscape pattern analysis using spatial autocorrelation indices. Landscape Ecology 11, 39–49.

Rao, J.N.K. 2003. Small Area Estimation. New York, Wiley.

Rassler, S. 2002. Statistical Matching - Lecture Notes in Statistics. New York, Springer.

Richardson, A.M. & Welsh, A.H. 1995. Robust estimation in the mixed linear model. Biometrics 51, 1429–1439.

Richardson, S., Leblond, L., Jaussent, I. & Green, P. 2002. Mixture models in measurement error problems, with reference to epidemiological studies. Journal of the Royal Statistical Society, Series A, 165, 549–556.

Rubin, D.B. 1996. Multiple imputation after 18+ years. Journal of the American Statistical Association 91, 473–489.

Rubin, D.B. 1987. Multiple Imputation for Nonresponse in Surveys. New York, Wiley.

Rubin, D.B. 1976. Inference and missing data. Biometrika 63, 581–592.

Ruppert, D., Wand, M.P. & Carroll, R.J. 2003. Semiparametric Regression. Cambridge, UK, Cambridge University Press.

Ruppert, D., Wand, M.P. & Carroll, R.J. 2009. Semiparametric regression during 2003–2007. Electronic Journal of Statistics 3, 1193–1256.

Saei, A. & Chambers, R. 2003. Small area estimation under linear and generalized linear mixed models with time and area effects. Methodology working paper n. M03/15. Southampton, UK, University of Southampton.

Saei, A. & Chambers, R. 2005. Empirical best linear unbiased prediction for out-of-sample areas. Working paper M05/03. Southampton, UK, University of Southampton.

Salvati, N. 2004. Small area estimation by spatial models: the spatial empirical best linear unbiased prediction (Spatial EBLUP). Working paper 2004/04. Florence, Italy, University of Florence Department of Statistics.

Salvati, N., Tzavidis, N., Pratesi, M. & Chambers, R. 2012. Small area estimation via M-quantile geographically weighted regression. TEST 21, 1–28.

Särndal, C.E. 1982. Implications of survey design for generalized regression estimation of linear functions. Journal of Statistical Planning and Inference 7, 155–170.

Särndal, C.E., Swensson, B. & Wretman, J. 2003. Model Assisted Survey Sampling. New York, Springer.

Schafer, J.L. 1997. Analysis of Incomplete Multivariate Data. London, Chapman and Hall.

Schafer, J.L. & Olsen, M.K. 1998. Multiple imputation for multivariate missing-data problems: a data analyst’s perspective. Multivariate Behavioral Research 33, 545–571.

Schenker, N., Raghunathan, T.E., Chiu, P., Makuc, D., Zhang, G. & Cohen, A.J. 2006. Multiple imputation of missing income data in the National Health Interview Survey. Journal of the American Statistical Association 101, 924–933.

Schmid, T. 2011. Spatial robust small area estimation applied on business data. University of Trier, Ph.D. thesis.

Schmid, T. & Münnich, R. 2013. Spatial robust small area estimation. Statistical Papers. DOI: 10.1007/s00362-013-0517-y, accepted.

Scott, A.J. 1977. Some comments on the problem of randomization in surveys. Indian Journal of Statistics, Series C, 39, 1–9.

Singh, B.B., Shukla, G.K. & Kundu, D. 2005. Spatio-temporal models in small area estimation. Survey Methodology 31, 183–195.

Sinha, S.K. & Rao, J.N.K. 2009. Robust small area estimation. Canadian Journal of Statistics 37, 381–399.

Skinner, C.J. 1989. Domain means, regression and multivariate analysis. In: Skinner, C.J., Holt, D. & Smith, T.M.F. (eds.), Analysis of Complex Surveys, Chichester, UK, John Wiley and Sons, pp. 59–87.

Skinner, C.J., Holt, D. & Smith, T.M.F. 1989. Analysis of Complex Surveys. New York, Wiley.

Stanislawski, L.V., Dewitt, B.A. & Shrestha, R.S. 1996. Estimating positional accuracy of data layers within a GIS through error propagation. Photogrammetric Engineering and Remote Sensing 62, 429–433.

Steel, D.G. & Holt, D. 1996. Rules for random aggregation. Environment and Planning A 28, 957–978.

Sugden, R.A. & Smith, T.M.F. 1984. Ignorable and informative designs in survey sampling inference. Biometrika 71(3): 495–506.

Tillé, Y. 2006. Sampling Algorithms. New York, Springer.

Tobler, W.R. 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46, 234–240.

Tobler, W.R. 1989. Frame independent spatial analysis. In: Goodchild, M. & Gopal, S. (eds.), Accuracy of Spatial Databases, London, Taylor and Francis, pp. 115–122.

Torabi, M., Datta, G.S. & Rao, J.N.K. 2009. Empirical Bayes estimation of small area means under a nested error linear regression model with measurement errors in the covariates. Scandinavian Journal of Statistics 36, 355–368.

Tranmer, M. & Steel, D.G. 1998. Using census data to investigate the causes of ecological fallacy. Environment and Planning A 30, 817–831.

Tzavidis, N., Marchetti, S. & Chambers, R. 2010. Robust estimation of small-area means and quantiles. Australian and New Zealand Journal of Statistics 52, 167–186.

Unwin, D.J. 1996. GIS, spatial analysis and spatial statistics. Progress in Human Geography 20(4): 540–551.

Veregin, H. 1989. A taxonomy of error in spatial databases. Technical paper 89–12. Santa Barbara, CA, USA, National Center for Geographic Information and Analysis. University of California Geography Department.

Wang, L. 2004. Estimation of nonlinear models with Berkson measurement errors. Annals of Statistics 32(6): 2559–2579.

Wang, W. & Lee, L.-F. 2013. Estimation of spatial panel data models with randomly missing data in the dependent variable. Regional Science and Urban Economics 43, 521–538.

Welsh, A.H. & Ronchetti, E. 1998. Bias-calibrated estimation from sample surveys containing outliers. Journal of the Royal Statistical Society, Series B, 60, 413–428.

Wolf, P.R. & Ghilani, C.D. 1997. Adjustment computations: statistics and least squares in surveying and GIS. New York, Wiley.

Ybarra, L.M.R. & Lohr, S.L. 2008. Small area estimation when auxiliary information is measured with error. Biometrika 95, 919–931.

You, Y. & Rao, J.N.K. 2002. A pseudo-empirical best linear unbiased prediction approach to small area estimation using survey weights. Canadian Journal of Statistics 30, 431–439.

Zhang, J. & Goodchild, M. 2002. Uncertainty in Geographical Information. Boca Raton, FL, USA, CRC press.

Zhou, H. 2014. Accounting for complex sample designs in multiple imputation using the finite population Bayesian bootstrap. PhD Dissertation. Michigan, USA, University of Michigan.

Zhu, L., Carlin, B.P. & Gelfand, A.E. 2003. Hierarchical regression with misaligned spatial data: relating ambient ozone and pediatric asthma ER visits in Atlanta. Environmetrics 14, 537–557.

General Summary

This report examines the most commonly used disaggregation methods based on mapping and SAE methods. After a review of areal interpolation techniques, SAE methods are considered in detail. The basic data for applying them are available in many countries and they can be adapted to different agricultural and rural data.

The basic data are direct estimates (means, totals and percentages) of the target parameters at the area level drawn from a sample survey, and the values of area means, totals and percentages for auxiliary variables known from agricultural censuses or administrative archives. The latter values must be available for sampled areas. The data are used to produce statistically sound estimates when the sample size in the target area of interest is too small, or even zero, to produce reliable direct estimates of the target parameter. With richer and more detailed auxiliary information, for example when auxiliary variables are available for each unit of the target population, the SAE methods become more complex and the prediction of the target parameters becomes more accurate.

The report then focuses on indirect SAE estimates, which use auxiliary information or variables to improve quality and accuracy and to break down known values related to larger areas by using regression-type models.

The indirect estimates are obtained in model-assisted and model-based approaches, where a statistical model, generally a regression model, is specified to obtain validation from the auxiliary variables. In the model-assisted approach, estimators generally have design-based properties and their accuracy, as measured by the MSE, is derived under the sampling design used to collect the survey data. In the model-based approach, the properties of the estimators and their accuracy are evaluated under the specified model.

GREG estimators

In the model-assisted approach the GREG estimator is the most popular and is, with its modifications, applied by many statistical agencies. It is design-consistent, which guarantees, at least for large domains, that the estimates make sense even if the model fails. It is based on linearity in the relation of the study variable with the auxiliary variables, but it allows for several extensions of the basic regression model.
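As an illustration of the idea, the following minimal sketch computes a GREG estimate of a domain total with a single auxiliary variable. It assumes a linear working model with intercept fitted by design-weighted least squares on the whole sample, and known population count and auxiliary total for the domain. The function names, the tuple layout and the toy data are hypothetical and are not taken from the report.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x using design weights w."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def greg_domain_total(sample, domain, pop_count, pop_x_total):
    """GREG estimate of the domain total of y.

    sample: list of (domain_label, x, y, design_weight) tuples (assumed layout).
    pop_count, pop_x_total: known number of units and known total of x
    in the domain, e.g. from a census.
    """
    b0, b1 = weighted_line_fit(
        [r[1] for r in sample], [r[2] for r in sample], [r[3] for r in sample]
    )
    in_domain = [r for r in sample if r[0] == domain]
    # Direct (Horvitz-Thompson) estimates of the domain totals.
    ht_y = sum(w * y for _, _, y, w in in_domain)
    ht_n = sum(w for _, _, _, w in in_domain)
    ht_x = sum(w * x for _, x, _, w in in_domain)
    # Regression adjustment: correct the direct estimate using the gap
    # between the known and the estimated auxiliary totals.
    return ht_y + b0 * (pop_count - ht_n) + b1 * (pop_x_total - ht_x)


# Toy data in which y = 2 + 3x holds exactly (illustrative values only).
sample = [
    ("A", 1.0, 5.0, 10.0),
    ("A", 2.0, 8.0, 10.0),
    ("B", 3.0, 11.0, 20.0),
    ("B", 5.0, 17.0, 20.0),
]
estimate = greg_domain_total(sample, "A", pop_count=25, pop_x_total=40.0)
```

When the working model fits poorly, the regression adjustment contributes little and the estimate stays close to the direct estimate, which is the practical meaning of design consistency for large domains.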

In the model-based approach the characteristics of the data available for the study determine the specification of area-level and unit-level models.

Specification at the area level is mandatory when the target or auxiliary variables are known only at area level. In this case the Fay and Herriot model is widely used.

FH-EBLUP estimators

This predictor uses the base level of available information: direct estimates for the sampled areas and auxiliary information for the sampled and non-sampled areas. The predictor can be extended to incorporate geographic information about the areas of interest, as in the FH-SEBLUP. Even when the spatial correlation has moderate values, the FH-SEBLUP is recommended for its accuracy. The model assumes the linearity of the relation between the study variable and the auxiliary information; it is not recommended in cases of mis-specification and outlying observations.
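The way the area-level predictor combines the two sources of information can be sketched as a weighted average of the direct estimate and a regression-synthetic estimate. The sketch below assumes the standard Fay and Herriot form with known sampling variances; the regression coefficients and the random-effect variance are taken as given here, whereas in practice they are estimated from the data. All names and numbers are hypothetical.

```python
def fh_predictor(direct, sampling_var, x, beta, sigma_u2):
    """Area-level predictor under a Fay and Herriot type model.

    direct: direct survey estimate for the area, or None if the area is not sampled.
    sampling_var: sampling variance of the direct estimate (assumed known).
    x, beta: area-level auxiliary values and regression coefficients (assumed given).
    sigma_u2: variance of the area random effect (assumed given).
    """
    synthetic = sum(b * xj for b, xj in zip(beta, x))
    if direct is None:
        # Non-sampled area: only the regression-synthetic part is available.
        return synthetic
    # Shrinkage factor: close to 1 when the direct estimate is precise.
    gamma = sigma_u2 / (sigma_u2 + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic


sampled_area = fh_predictor(10.0, 4.0, [1.0, 2.0], [4.0, 8.0], sigma_u2=4.0)
non_sampled_area = fh_predictor(None, None, [1.0, 2.0], [4.0, 8.0], sigma_u2=4.0)
```

In this toy call the sampling variance equals the random-effect variance, so the direct and synthetic estimates receive equal weight.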

When data are available at the unit level, there is scope for modelling the behaviour of population units within the areas of interest. The generalized regression models can include individual covariates and individual random effects; area random effects can also be included to allow for differences between areas. The clustering of units in the areas can be modelled without assuming normality of the study variables, as in the MQ approach.

EBLUP estimators

In the generalized linear regression model the most used predictor is the EBLUP. The model assumes normal and uncorrelated random effects, and it is sensitive to spatial mis-specification. It is not robust against outlying observations. Although it assumes that the observations are a simple random sample from the population, it can be extended to take into account the complexity of the sampling design. The model can also be extended as the SEBLUP to include area random effects that are spatially correlated; another extension, the REBLUP, accounts for outliers.

MQ estimators

A recent approach to SAE is based on the use of MQ models specified at the unit level: the MQ estimators. Differences between areas can be captured through quantile coefficients. The model assumes linearity in the relation between the quantiles of the study variables and the auxiliary information; it does not assume normality in the distribution of the study variable. It is resilient against outliers. This approach can be extended to model quantiles with MQGWR models, and to account for non-linearity in the model.
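To give a feel for why M-quantiles are resilient against outliers, the following simplified sketch computes the M-quantile of order q of a set of values, with no covariates. It assumes a Huber influence function with the commonly used tuning constant 1.345, a scale estimated by the median absolute deviation, and an iteratively reweighted mean as the solver. These choices and all names are assumptions made for illustration; the MQ small-area estimator itself fits a regression on auxiliary variables at each q.

```python
import statistics


def huber_psi(u, c=1.345):
    """Huber influence function: linear in the centre, clipped at +/- c."""
    return max(-c, min(c, u))


def m_quantile(y, q=0.5, c=1.345, tol=1e-10, max_iter=500):
    """M-quantile of order q of the values y, by iteratively reweighted means."""
    centre = statistics.median(y)
    # Robust scale: median absolute deviation, rescaled for the normal case.
    scale = statistics.median([abs(v - centre) for v in y]) / 0.6745
    if scale == 0:
        scale = 1.0
    theta = centre
    for _ in range(max_iter):
        weights = []
        for v in y:
            r = v - theta
            # Asymmetric weighting: q above the current value, 1 - q below.
            side = q if r > 0 else 1.0 - q
            if r == 0:
                weights.append(2.0 * side / scale)
            else:
                weights.append(2.0 * side * huber_psi(r / scale, c) / r)
        new_theta = sum(w * v for w, v in zip(weights, y)) / sum(weights)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


values = [1.0, 2.0, 3.0, 10.0]          # the last value acts as an outlier
robust_centre = m_quantile(values, q=0.5)
plain_mean = m_quantile(values, q=0.5, c=1e9)  # no clipping: ordinary mean
```

With a very large tuning constant the influence function is never clipped and the q = 0.5 M-quantile coincides with the ordinary mean; with the default constant the outlying value is down-weighted and the result stays closer to the bulk of the data.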

All SAE models must reflect the type of the underlying data (continuous, count or categorical) and must take account of specific characteristics of the distribution of the study variable, such as spatial distribution, presence of outliers, non-normality and asymmetry. They must also take account of the quality and types of auxiliary information on sampled and non-sampled areas, including the presence of outliers and missing data. All the model-based methods are specified under the assumption that the observations come from a simple random sample, but the complexity of the sampling design can affect the estimates.

The second part of the report considers the resilience of model-assisted and model-based methods in non-standard situations that may occur in agricultural surveys.

Spatial correlation, non-stationary process, heterogeneity in the study variable

In general, it is recommended to use the spatially extended version of the EBLUP to address spatial correlation, and to include covariates in the model to capture non-stationarity and heterogeneity. Even with moderate spatial correlation, the SEBLUP is more efficient.

The MAUP: when area units are measured at different scales

The MQ estimator and geostatistical models such as the geo-additive, kriging and GWR models are competitive when the same relation is measured for areal units at different scales.

Robustness of the predictor against deviation from normality and in the presence of outliers

When the LMM errors are not normal, the MQ and MQGWR estimators work better than the other solutions. Their resilience to departures from normality and to outliers is better than that of the EBLUP family of predictors. Use of the REBLUP is recommended in the presence of outliers.

Complexity of sampling design of the survey on the target variables

Most model-based small-area estimators are based on the assumption that the sample observations come from a simple random sample. This assumption does not hold when the sample design is more complex; the sampling design can then be non-ignorable, and the pseudo-EBLUP estimator should be used. If outliers are also present in the data, the weighted MQ estimator should be used to deal with the complexity of the sampling design.

Missing data in spatial datasets

In the case of a large number of missing locations, the units can be located at the centroid of the corresponding area. However, use of a beta prior distribution to impute the missing coordinates is recommended because it gives better SAE estimates than placing all the units at the centroid. When the missing data are in the study variable, for example as a result of an informative non-response mechanism, the SAE estimators are biased. This bias can be reduced by weighting for the estimated response probabilities.
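The weighting adjustment for non-response can be sketched as follows. The sketch assumes that a response probability has already been estimated for each respondent, for instance from a response model on auxiliary variables, and that a weighted mean is the quantity of interest. The function name and the numbers are hypothetical.

```python
def response_weighted_mean(y, design_weights, response_probs):
    """Weighted mean over respondents, with design weights inflated by the
    inverse of the estimated response probabilities (assumed given)."""
    adjusted = [w / p for w, p in zip(design_weights, response_probs)]
    return sum(a * v for a, v in zip(adjusted, y)) / sum(adjusted)


# Two respondents with equal design weights; the first belongs to a group
# in which only half of the units are assumed to respond.
y = [10.0, 20.0]
naive_mean = response_weighted_mean(y, [1.0, 1.0], [1.0, 1.0])
adjusted_mean = response_weighted_mean(y, [1.0, 1.0], [0.5, 1.0])
```

Respondents from groups that respond less often receive larger weights, so the adjusted mean moves towards the values that are under-represented among respondents.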

Excess of zeros in the survey data

Recurring patterns of zeros in the original data compromise the validity of SAE estimates. Further research is needed into SAE estimation for zero-inflated data. Because the proposed adjusted estimators are equivalent in terms of efficiency, any one of them can be chosen, provided resources are available to implement it.
