
Optimal Sampling Strategies for the Multidomain, Multivariate Case with Different Amounts of Auxiliary Information

Piero Demetrio Falorsi, Paolo Righi

Italian National Statistical Institute

Seminar UNECE, 12 June 2012

Outline

Aim of the talk

Statement of the problem

(The unified approach for) sampling design

(Mgreg) Estimator

Experimental results

Conclusions

Aim of the talk: An overall strategy

1. Pre-processing: the frame should be built according to Schema A1

2. Sample design

3. Throughput

4. Estimation

A new proposal of a unified strategy that overcomes the limits of the traditional approach

A proposal of the Mgreg estimator, which fully exploits the existing auxiliary information

Statement of the problem

Large-scale surveys in Official Statistics usually produce estimates of a set of parameters for a very large number of highly detailed estimation domains

Example from the Italian survey on Structural Business Statistics

Partitions defining the domains: 4-digit NACE; 3-digit NACE x Size (10); 2-digit NACE x NUTS2

Variables: 4-digit NACE, Size, NUTS2

Domains of estimate: 1,800

Statement of the problem: Challenging informative context. Multiple sources of auxiliary information

Informative context

Schema A1. Example of the partition of the population register with two administrative data sources

Auxiliary variables: register variables (source b=0), $\mathbf{x}_{0k}$; additional data sources: source b=1, $\mathbf{x}_{1k}$, and source b=2, $\mathbf{x}_{2k}$

Subpopulations $U^{(a)}$ of units and the auxiliary vector available in each:

$U^{(1)}$: no additional administrative source is available. $\mathbf{x}^{(1)}_k = \mathbf{x}_{0k}$

$U^{(2)}$: only administrative source 1 is available. $\mathbf{x}^{(2)}_k = (\mathbf{x}_{0k}, \mathbf{x}_{1k})$

$U^{(3)}$: both administrative sources 1 and 2 are available. $\mathbf{x}^{(3)}_k = (\mathbf{x}_{0k}, \mathbf{x}_{1k}, \mathbf{x}_{2k})$

$U^{(4)}$: only administrative source 2 is available. $\mathbf{x}^{(4)}_k = (\mathbf{x}_{0k}, \mathbf{x}_{2k})$
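The partition in the schema can be sketched in a few lines of Python. This is only an illustration: the unit identifiers, the `coverage` mapping, and the function names are hypothetical, and the sketch assumes that the register variables (source 0) are available for every unit.

```python
from collections import defaultdict

# Hypothetical register: unit id -> set of additional administrative
# sources (1 and/or 2) covering the unit. Source 0 (the register) is
# assumed to cover every unit.
coverage = {
    "k1": set(),
    "k2": {1},
    "k3": {1, 2},
    "k4": {2},
    "k5": {1},
}

# Labels follow the schema: U(1) no additional source, U(2) source 1 only,
# U(3) sources 1 and 2, U(4) source 2 only.
PATTERN_TO_A = {
    frozenset(): 1,
    frozenset({1}): 2,
    frozenset({1, 2}): 3,
    frozenset({2}): 4,
}


def partition_register(coverage):
    """Group unit ids into subpopulations U(a) by availability pattern."""
    subpops = defaultdict(list)
    for unit, sources in coverage.items():
        subpops[PATTERN_TO_A[frozenset(sources)]].append(unit)
    return dict(subpops)


def auxiliary_sources(a):
    """Sources whose variables enter the auxiliary vector of U(a)."""
    pattern = next(p for p, label in PATTERN_TO_A.items() if label == a)
    return [0] + sorted(pattern)


subpops = partition_register(coverage)
```

Here `subpops` maps each label a to the units of U(a), and `auxiliary_sources(a)` lists the sources that contribute to the auxiliary vector in that subpopulation.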

Statement of the problem: Design

The standard design solution for fixing the sample sizes at domain level is a stratified sample in which the strata are given by the cross-classification of the variables defining the different partitions (cross-classified or one-way stratified design)

Italian SBS survey example

Partitions defining the domains: 4-digit NACE; 3-digit NACE x Size (10); 2-digit NACE x NUTS2

Stratification variables: 4-digit NACE, Size, NUTS2

Domains of estimate: 1,800

Cross-classified strata: 37,000

Too detailed stratification:

Risk of sample size explosion;

Inefficient sample allocation (constraint of 2 units per stratum);

Risk of statistical burden (e.g. repeated business surveys);

Treatment of stratum nonresponse that is not fully coherent for estimation and for the calculation of the sampling variance;

Difficulty in taking into account a priori information and the pattern of auxiliary information.

Statement of the problem: Estimation

The standard solution for estimation (calibration estimators) may allow calibration at domain level only for the register variables; it does not calibrate on the existing domain totals deriving from the auxiliary data sources

Main drawback: the sample size is too small for some domains

Risk that the estimates of variables that could be derived from an administrative data source differ significantly from the known totals

Biased estimation for small domains

Effect of non response or measurement error

It is necessary to guarantee a sufficient sample size for

each planned domain

each subpopulation $U^{(a)}$ (a=1,…,A) characterized by the same amount of auxiliary information

The strategy is to consider as domains of interest the original D domains $U_d$ (d=1,…,D) with the addition of the A subpopulations $U^{(a)}$ (a=1,…,A).

Sampling Design: Multiple sources of auxiliary information

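Stratification is overcome: the inclusion probabilities $\pi_k$ are determined by solving

$$
\begin{cases}
\min \sum_{k \in U} c_k \pi_k \\
AV(\hat{t}_{(dr)} \mid \boldsymbol{\pi}) \le V_{(dr)} & (d = 1, \ldots, D;\ r = 1, \ldots, R) \\
AV(\hat{t}_{(dr)} \mid \boldsymbol{\pi}) \le V_{(dr)} & (d = D+1, \ldots, D+A) \\
0 < \pi_k \le 1 & (k = 1, \ldots, N)
\end{cases}
$$

where $c_k$ is the cost attached to unit $k$ and $V_{(dr)}$ are the given variance thresholds.

The first set of constraints refers to the domains of interest.

The second set contains additional constraints that are instrumental for the subpopulations $U^{(a)}$.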
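A minimal numerical sketch of this kind of problem is given below. It rests on simplifying assumptions that are not taken from the slides: a single variable of interest, a variance in each domain approximated by the sum over its units of sigma2_k * (1/pi_k - 1) with known sigma2_k > 0, every unit belonging to at least one domain, and a Chromy-style multiplicative update of the Lagrange multipliers whose convergence is not guaranteed in general. All function and argument names are hypothetical.

```python
import math


def anticipated_variances(pi, sigma2, domains):
    """Assumed approximation: sum over k in U_d of sigma2_k * (1/pi_k - 1)."""
    return [
        sum(sigma2[k] * (1.0 / pi[k] - 1.0) for k in members)
        for members in domains
    ]


def inclusion_probabilities(cost, sigma2, domains, v_max, n_iter=200):
    """Heuristic for: min sum(cost*pi) s.t. variance_d <= v_max[d], pi <= 1."""
    n = len(cost)
    lam = [1e-6] * len(domains)  # one multiplier per domain constraint
    unit_domains = [
        [d for d, members in enumerate(domains) if k in members]
        for k in range(n)
    ]
    pi = [1.0] * n
    for _ in range(n_iter):
        # Stationarity condition of the Lagrangian, clipped at 1.
        pi = [
            min(1.0, math.sqrt(
                sum(lam[d] for d in unit_domains[k]) * sigma2[k] / cost[k]))
            for k in range(n)
        ]
        # Scale each multiplier by the squared ratio achieved / target.
        for d, members in enumerate(domains):
            achieved = sum(sigma2[k] / pi[k] for k in members)
            target = v_max[d] + sum(sigma2[k] for k in members)
            lam[d] *= (achieved / target) ** 2
    return pi


# Toy population: 4 units in a 2 x 2 layout; the domains are the two rows
# and the two columns, so every unit belongs to two overlapping domains.
cost = [1.0, 1.0, 1.0, 1.0]
sigma2 = [1.0, 1.0, 1.0, 1.0]
domains = [{0, 1}, {2, 3}, {0, 2}, {1, 3}]
v_max = [2.0, 2.0, 2.0, 2.0]

pi = inclusion_probabilities(cost, sigma2, domains, v_max)
variances = anticipated_variances(pi, sigma2, domains)
```

In this symmetric toy case every constraint is active and all units receive the same inclusion probability. With heterogeneous costs and variances the iteration should be checked for convergence before the result is used.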

Option: Planned domains

1: Original domains plus the subpopulations $U^{(a)}$

2: Cross-classification of $U_d$ by $U^{(a)}$

Sampling Design: Multiple sources of auxiliary information

Estimation: Multiple sources of auxiliary information

Informative context

Schema 5.1. Example of the partition of the population register with two administrative data sources

Auxiliary variables: register variables (source b=0), $\mathbf{x}_{0k}$; additional data sources: source b=1, $\mathbf{x}_{1k}$, and source b=2, $\mathbf{x}_{2k}$

Subpopulations $U^{(a)}$ of units and the auxiliary vector available in each:

$U^{(1)}$: no additional administrative source is available. $\mathbf{x}^{(1)}_k = \mathbf{x}_{0k}$

$U^{(2)}$: only administrative source 1 is available. $\mathbf{x}^{(2)}_k = (\mathbf{x}_{0k}, \mathbf{x}_{1k})$

$U^{(3)}$: both administrative sources 1 and 2 are available. $\mathbf{x}^{(3)}_k = (\mathbf{x}_{0k}, \mathbf{x}_{1k}, \mathbf{x}_{2k})$

$U^{(4)}$: only administrative source 2 is available. $\mathbf{x}^{(4)}_k = (\mathbf{x}_{0k}, \mathbf{x}_{2k})$

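Estimation: The working model

For the units belonging to $U^{(a)}$ (a=1,…,A), the following working superpopulation model is introduced:

$$y_{rk} = \mathbf{x}^{(a)\prime}_k \boldsymbol{\beta}^{(a)}_r + u^{(a)}_{rk}$$

Estimation: The Mgreg estimator

$$\hat{t}_{mgreg,(dr)} = \sum_{a=1}^{A} \sum_{k \in U^{(a)}_d} \tilde{y}_{rk} + \sum_{a=1}^{A} \sum_{k \in s^{(a)}_d} \hat{u}_{rk} / \pi_k$$

where $\tilde{y}_{rk} = \mathbf{x}^{(a)\prime}_k \hat{\boldsymbol{\beta}}^{(a)}_r$ is the prediction of $y_{rk}$ and $\hat{u}_{rk} = y_{rk} - \mathbf{x}^{(a)\prime}_k \hat{\boldsymbol{\beta}}^{(a)}_r$ is the sample residual, with

$$\hat{\boldsymbol{\beta}}^{(a)}_r = \Big( \sum_{j \in s^{(a)}} \mathbf{x}^{(a)}_j \mathbf{x}^{(a)\prime}_j / (\pi_j v_j) \Big)^{-1} \sum_{k \in s^{(a)}} \mathbf{x}^{(a)}_k y_{rk} / (\pi_k v_k)$$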
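An estimator of this type (population sum of model predictions plus expanded sample residuals, with a separate regression fitted in each subpopulation) can be sketched in code. The sketch is an illustrative simplification, not the authors' implementation: it assumes a single scalar auxiliary variable x per unit with no intercept, at least one sampled unit in every subpopulation, and toy data. The names `fit_slopes`, `mgreg_total`, and the dictionary keys are hypothetical.

```python
from collections import defaultdict


def fit_slopes(units, sample):
    """Weighted least-squares slope of y on x, one per subpopulation a."""
    num, den = defaultdict(float), defaultdict(float)
    for k, (y, pi) in sample.items():
        u = units[k]
        w = 1.0 / (pi * u["v"])  # weight 1 / (pi_k * v_k)
        num[u["a"]] += w * u["x"] * y
        den[u["a"]] += w * u["x"] ** 2
    return {a: num[a] / den[a] for a in den}


def mgreg_total(units, sample, domain=None):
    """Predictions summed over the population units of the domain plus the
    expanded sample residuals; domain=None gives the whole population."""
    beta = fit_slopes(units, sample)
    total = 0.0
    for k, u in units.items():
        if domain is not None and u["d"] != domain:
            continue
        prediction = u["x"] * beta[u["a"]]
        total += prediction
        if k in sample:
            y, pi = sample[k]
            total += (y - prediction) / pi
    return total


# Toy frame: a = subpopulation label, d = domain, x = auxiliary value,
# v = variance factor of the working model (all values are made up).
units = {
    0: {"a": 1, "d": "A", "x": 2.0, "v": 1.0},
    1: {"a": 1, "d": "A", "x": 4.0, "v": 1.0},
    2: {"a": 1, "d": "B", "x": 6.0, "v": 1.0},
    3: {"a": 2, "d": "A", "x": 1.0, "v": 1.0},
    4: {"a": 2, "d": "B", "x": 3.0, "v": 1.0},
    5: {"a": 2, "d": "B", "x": 5.0, "v": 1.0},
}
# Sampled units: unit id -> (observed y, inclusion probability).
sample = {1: (13.0, 0.5), 2: (17.0, 0.5), 3: (6.0, 0.5), 5: (24.0, 0.5)}

t_A = mgreg_total(units, sample, "A")
t_B = mgreg_total(units, sample, "B")
t_U = mgreg_total(units, sample)
```

Because the same fitted slopes are used for every domain, `t_A + t_B` agrees with `t_U` up to floating-point rounding, and feeding x itself as the observed variable returns the known domain total of x.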

Estimation: Properties

The estimator is efficient: its variance is based on the squared residuals of the working model

The estimates are calibrated for each subset $U^{(a)}_d$. Thus, the sample estimates of the totals of the auxiliary variables $\mathbf{x}^{(a)}_k$ reproduce the known totals $\mathbf{t}^{(a)}_{\mathbf{x}d}$ at domain level.

The sums of the estimates $\hat{t}_{mgreg,(\tilde{h}r)}$ over the planned domains $\tilde{h}$ that form a partition of a given domain of interest $d$ are consistent at domain level:

$$\sum_{\tilde{h} \subset d} \hat{t}_{mgreg,(\tilde{h}r)} = \hat{t}_{mgreg,(dr)}$$

The estimates are consistent at population level; that is, the sum of the domain estimates over any partition of the population $U$ always reproduces the same estimate of the total for $U$.

Data Warehouse strategy

Estimation: Properties

The fundamental result: if an auxiliary variable coincides with a variable of interest, say $r$, the estimate of the total of the variable of interest $r$ for the subpopulations $U^{(a)}_d$ coincides with the total of the variable known from the auxiliary data source, and it is estimated without sampling error.
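A short derivation may make the result concrete. It is a sketch that assumes the Mgreg estimator has the usual GREG form (population sum of predictions plus expanded sample residuals, with a weighted least-squares coefficient fitted within each $U^{(a)}$) and that the matrix being inverted is nonsingular. The symbols $x_{jk}$ (the $j$-th component of the auxiliary vector) and $\mathbf{e}_j$ (the $j$-th unit vector) are notation introduced here.

```latex
% Suppose y_{rk} = x_{jk} for every unit k of U^{(a)}.
\begin{align*}
\hat{\boldsymbol{\beta}}^{(a)}_r
  &= \Big(\sum_{k \in s^{(a)}} \frac{\mathbf{x}^{(a)}_k \mathbf{x}^{(a)\prime}_k}{\pi_k v_k}\Big)^{-1}
     \sum_{k \in s^{(a)}} \frac{\mathbf{x}^{(a)}_k x_{jk}}{\pi_k v_k}
   = \mathbf{e}_j , \\
\tilde{y}_{rk} &= \mathbf{x}^{(a)\prime}_k \mathbf{e}_j = x_{jk} = y_{rk},
\qquad \hat{u}_{rk} = y_{rk} - \tilde{y}_{rk} = 0 , \\
\hat{t}_{mgreg} &= \sum_{k \in U^{(a)}_d} \tilde{y}_{rk}
   + \sum_{k \in s^{(a)}_d} \frac{\hat{u}_{rk}}{\pi_k}
   = \sum_{k \in U^{(a)}_d} x_{jk} .
\end{align*}
```

The second sum in the first line is the $j$-th column of the matrix being inverted, which is why the coefficient reduces to $\mathbf{e}_j$. The resulting estimate no longer depends on which sample was drawn.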

Estimation: Properties - auxiliary=interest

Empirical Results: Simulation population. 1999 Italian enterprises with 1 to 99 employees, Computer and related economic activities (2-digit NACE Rev.1)

ITACOSM 2011 - 27-29 June 2011, Pisa, Italy - 12

Population size    Number of cross-classified strata    Cumulative distribution (%)
1                   68                                    18.89
2                   37                                    29.17
3-5                 63                                    46.67
6-10                50                                    60.56
11-100             119                                    93.61
More than 100       23                                   100.00
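The cumulative column can be recomputed from the stratum counts, as in the sketch below. The variable names are hypothetical. The last line adds a derived quantity that is not on the slide: a lower bound on the sample size when every stratum must contribute 2 units (or its only unit when the stratum population size is 1).

```python
from itertools import accumulate

size_classes = ["1", "2", "3-5", "6-10", "11-100", "More than 100"]
strata_counts = [68, 37, 63, 50, 119, 23]

total_strata = sum(strata_counts)

# Cumulative percentage of strata up to and including each size class.
cumulative_pct = [
    round(100 * c / total_strata, 2) for c in accumulate(strata_counts)
]

# Lower bound under a "2 units per stratum" rule: strata with a single
# population unit can contribute only 1 unit, all the others at least 2.
min_sample_two_per_stratum = strata_counts[0] * 1 + sum(strata_counts[1:]) * 2
```

About 60% of the cross-classified strata contain at most 10 enterprises, which is the situation in which a minimum number of units per stratum drives the sample size.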

The domains of interest (44):

(1) geographical region with 20 marginal domains (DOM1);

(2) economic activity group by Size class (24 domains)

Empirical Results: Simulation: allocation comparison between the one-way and multi-way design

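Prediction models M1 and M2: for $k \in U_{d_1} \cap U_{d_2}$, each model expresses $y_{rk}$ as a fixed part determined by the domains $d_1$ and $d_2$ plus an error term $u_{rk}$, with

$$E_M(u_{rk}) = 0, \quad E_M(u_{rk}^2) = \sigma^2_{r(d_1 d_2)}; \quad E_M(u_{rk}, u_{rl}) = 0 \quad (k \ne l)$$

$R^2$ (%) by model

Model    Value added    Labour cost
M1       64.1           68.1
M2       61.0           65.1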

Units included in the sample with certainty: frequency, and size of the enterprises (number of employed persons)

Sampling design                            Sample size    Frequency    Average size    Minimum size
One-way (with M1 model)
  no stratum sample size constraint              716.6           10            47.0            23.0
  at least 1 sample unit per stratum             944            119            24.0             2.0
  at least 2 sample units per stratum          1,042            175            20.6             2.0
Multi-way with M1 model                          936             30            50.1            17.0
Multi-way with M2 model                          991             40            42.9            16.0

Sampling distributions over the partition with different auxiliary information

Empirical Results: multiple sources of auxiliary information: example – efficiency of the proposed strategy

Sample allocation

Subpopulation    Number of enterprises    One-way str. 2 units    HT using pattern info    Mgreg using pattern info    HT using employed persons
1                  500                       81.2                    60.5                     19.7                        80.1
2                  500                       75.5                    50.9                     14.1                        69.0
3                3,000                      345.0                   246.7                    157.4                       313.0
4                6,392                      540.3                   333.8                    268.8                       478.3
Tot             10,392                    1,042                     691.9                    460.0                       940.5

% Sample allocation

Subpopulation    One-way str. 2 units    HT using pattern info    Mgreg using pattern info    HT using employed persons
1                  7.79                    8.75                     4.29                        8.52
2                  7.25                    7.36                     3.06                        7.34
3                 33.11                   35.65                    34.22                       33.29
4                 51.85                   48.24                    58.44                       50.86
Tot              100.00                  100.00                   100.00                      100.00
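The total sample sizes in the table can be compared directly. The sketch below, with hypothetical variable names, expresses each strategy's total as a fraction of the one-way design with 2 units per stratum; the input figures are the totals reported in the table.

```python
# Total sample sizes reported in the table ("Tot" row).
totals = {
    "One-way str. 2 units": 1042.0,
    "HT using pattern info": 691.9,
    "Mgreg using pattern info": 460.0,
    "HT using employed persons": 940.5,
}

baseline = totals["One-way str. 2 units"]

# Sample size of each strategy relative to the one-way design.
relative_size = {name: round(n / baseline, 3) for name, n in totals.items()}
```

On these figures the Mgreg strategy using the pattern of auxiliary information needs less than half the sample of the one-way design.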

Conclusions

We propose a sampling strategy, based on balanced sampling and the Mgreg estimator, that is practical and easy to implement and that may represent a general and unified approach for defining the optimal inclusion probabilities

The method, depending on how it is parameterized, can define a standard cross-classified design or a multi-way stratified design.

The sampling algorithm defines an optimal solution (minimizing the costs or the sample sizes) which guarantees that the sampling errors of the domain estimates are lower than given thresholds

The estimation exploits all the existing auxiliary information

Conclusions

The unified approach is the latest result of research that has lasted almost 6 years

Survey Methodology (2008)

Statistics in Transition (2006)

2 books published by Franco Angeli illustrating the main findings of a research project of strategic interest financed by the Ministry of University and Research

Presentations: NTTS (2011), Neuchatel (2011)

Invited talk at the next scientific conference of the Italian Society of Statistics

Accepted talk for the ICES

References

Bethel J. (1989) Sample Allocation in Multivariate Surveys, Survey Methodology, 15, 47-57.

Chromy J. (1987) Design Optimization with Multiple Objectives, Proceedings of the Survey Research Methods Section, American Statistical Association, 194-199.

Deville J.-C., Tillé Y. (2004) Efficient Balanced Sampling: the Cube Method, Biometrika, 91, 893-912.

Deville J.-C., Tillé Y. (2005) Variance approximation under balanced sampling, Journal of Statistical Planning and Inference, 128, 569-591

Falorsi P. D., Righi P. (2008) A Balanced Sampling Approach for Multi-way Stratification Designs for Small Area Estimation, Survey Methodology, 34, 223-234

Falorsi P. D., Orsini D., Righi P., (2006) Balanced and Coordinated Sampling Designs for Small Domain Estimation, Statistics in Transition, 7, 1173-1198

Isaki C.T., Fuller W.A. (1982) Survey design under a regression superpopulation model, Journal of the American Statistical Association, 77, 89-96
