Optimal Sampling Strategies for Multidomain, Multivariate Case with different amount of auxiliary information
Piero Demetrio Falorsi, Paolo Righi
Italian National Statistical Institute
Seminar UNECE, 12 June 2012
Outline
Aim of the talk
Statement of the problem
(The unified approach for) sampling design
(Mgreg) Estimator
Experimental results
Conclusions
Aim of the talk: An overall strategy
1. Pre-processing: the frame should be built according to Schema A1
2. Sample design
3. Throughput
4. Estimation
New proposal of a Unified Strategy that overcomes the limits of the traditional approach
Proposal of an Mgreg estimator that fully exploits the existing auxiliary information
Statement of the problem
Large-scale surveys in Official Statistics usually produce estimates for a set of parameters over a very large number of highly detailed estimation domains
Example from the Italian survey on Structural Business Statistics. Partitions defining the domains:
4-digit NACE
3-digit NACE x Size (10)
2-digit NACE x NUTS2
[Figure: the classification variables 4-digit NACE, Size and NUTS2]
Domains of estimate: 1,800
Statement of the problem: Challenging informative context. Multiple sources of auxiliary information
Informative context, Schema A1. Example of the partition of the population register with two administrative data sources
Auxiliary variables by source: register variables (source b=0), $\mathbf{x}_{0k}$; additional data sources, source b=1 with $\mathbf{x}_{1k}$ and source b=2 with $\mathbf{x}_{2k}$
Subpopulations $U_{(a)}$ of units and the auxiliary vector available in each:
$U_{(1)}$: no additional administrative source available. $\mathbf{x}_{(1)k} = \mathbf{x}_{0k}$
$U_{(2)}$: only the administrative source 1 is available. $\mathbf{x}_{(2)k} = (\mathbf{x}_{0k}, \mathbf{x}_{1k})$
$U_{(3)}$: both administrative sources 1 and 2 are available. $\mathbf{x}_{(3)k} = (\mathbf{x}_{0k}, \mathbf{x}_{1k}, \mathbf{x}_{2k})$
$U_{(4)}$: only the administrative source 2 is available. $\mathbf{x}_{(4)k} = (\mathbf{x}_{0k}, \mathbf{x}_{2k})$
Statement of the problem: Design
The standard design solution for fixing the sample sizes at domain level is a stratified sample whose strata are given by the cross-classification of the variables defining the different partitions (cross-classified or one-way stratified design)
Italian SBS survey example:
4-digit NACE
3-digit NACE x Size (10)
2-digit NACE x NUTS2
[Figure: the classification variables 4-digit NACE, Size and NUTS2]
Domains of estimate: 1,800
Cross-classified strata: 37,000
Too detailed stratification:
Risk of sample size explosion;
Inefficient sample allocation (2 units per stratum constraint);
Risk of statistical burden (e.g. repeated business surveys);
Treatment of stratum non-response not fully coherent for estimation and for the calculation of the sampling variance;
Difficulty in taking into account a priori information and the pattern of auxiliary information
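A minimal sketch of why cross-classification inflates the number of strata. The category counts are hypothetical (they are not the SBS figures), and the 4-digit NACE codes are assumed to be nested inside the 3-digit and 2-digit codes, so that the cross-classified strata are the cells of 4-digit NACE x Size x NUTS2.

```python
# Hypothetical numbers of categories, for illustration only (not the SBS values).
n_nace4, n_nace3, n_nace2 = 6, 3, 2   # 4-digit codes nested in 3- and 2-digit codes
n_size, n_nuts2 = 3, 4

# Domains of estimate: one set of domains per partition
# (4-digit NACE; 3-digit NACE x Size; 2-digit NACE x NUTS2).
domains = n_nace4 + n_nace3 * n_size + n_nace2 * n_nuts2

# Cross-classified strata: every combination of the finest classifications.
strata = n_nace4 * n_size * n_nuts2

# With at least 2 sample units per stratum the sample cannot be smaller than this
# (assuming every stratum contains at least 2 units).
min_sample = 2 * strata

print(domains, strata, min_sample)
```

The number of domains grows with the sum of the partition sizes, while the number of strata grows with their product.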
Statement of the problem: Estimation
The standard solution for estimation (calibration estimators) may allow calibration at domain level only on the register variables; it does not calibrate on the known domain totals deriving from the auxiliary data sources
Main drawback: too small a sample size for some domains
Risk that the estimates of variables that could be derived from an administrative data source differ significantly from the known totals
Biased estimation for small domains
Effect of non-response or measurement error
Sampling Design: Multiple sources of auxiliary information
It is necessary to guarantee a sufficient sample size for
each planned domain
each subpopulation $U_{(a)}$ (a=1,…,A) characterized by the same amount of auxiliary information
The strategy is to consider as domains of interest the original D domains $U_d$ (d=1,…,D) with the addition of the A subpopulations $U_{(a)}$ (a=1,…,A).
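The stratification is overcome: the inclusion probabilities $\pi_k$ are determined by solving
$$
\begin{cases}
\min \sum_{k \in U} c_k \pi_k \\
AV(\hat{t}_{(dr)} \mid \boldsymbol{\pi}) \le V_{(dr)} & (d = 1, \dots, D;\ r = 1, \dots, R) \\
AV(\hat{t}_{(dr)} \mid \boldsymbol{\pi}) \le V_{(dr)} & (d = D+1, \dots, D+A;\ r = 1, \dots, R) \\
0 < \pi_k \le 1 & (k = 1, \dots, N)
\end{cases}
$$
The first set of constraints refers to the domains of interest
The second set contains the additional constraints instrumental for the subpopulations $U_{(a)}$
Options for the planned domains:
Option 1: original domains plus the subpopulations $U_{(a)}$
Option 2: cross-classification of the domains $U_d$ by the subpopulations $U_{(a)}$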
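A minimal sketch of how such a problem can be attacked numerically. It is not the algorithm of the talk: it assumes a simple variance approximation, $AV_d = \sum_{k \in U_d} \sigma_k^2 (1/\pi_k - 1)$, a Chromy-type multiplier update, and a final rescaling by bisection that guarantees the thresholds are met. The names `sigma2`, `cost`, `member` and `thresholds` and the toy data are hypothetical, and every unit is assumed to belong to at least one domain.

```python
import math

def inclusion_probs(lam, sigma2, cost, member, scale=1.0):
    """Stationarity condition: pi_k = min(1, sqrt(scale * sum_d lam_d * sigma2_k / c_k))."""
    pis = []
    for s2, c, doms in zip(sigma2, cost, member):
        weight = scale * sum(lam[d] for d in doms)
        pis.append(min(1.0, math.sqrt(weight * s2 / c)))
    return pis

def anticipated_variances(pis, sigma2, member, n_domains):
    """Assumed approximation: AV_d = sum over k in U_d of sigma2_k * (1/pi_k - 1)."""
    av = [0.0] * n_domains
    for p, s2, doms in zip(pis, sigma2, member):
        for d in doms:
            av[d] += s2 * (1.0 / p - 1.0)
    return av

def optimal_probs(sigma2, cost, member, thresholds, iters=200):
    n_domains = len(thresholds)
    lam = [1.0] * n_domains
    # Chromy-type update: raise the multiplier of a violated constraint,
    # lower it when the constraint has slack.
    for _ in range(iters):
        pis = inclusion_probs(lam, sigma2, cost, member)
        av = anticipated_variances(pis, sigma2, member, n_domains)
        lam = [max(l * (a / v) ** 2, 1e-12) for l, a, v in zip(lam, av, thresholds)]

    def feasible(scale):
        p = inclusion_probs(lam, sigma2, cost, member, scale)
        a = anticipated_variances(p, sigma2, member, n_domains)
        return all(x <= v for x, v in zip(a, thresholds))

    # Bisection on a common scale factor: the variances decrease as the scale grows,
    # so the smallest feasible scale can be bracketed and refined.
    lo, hi = 0.0, 1.0
    while not feasible(hi):
        lo, hi = hi, hi * 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return inclusion_probs(lam, sigma2, cost, member, hi)

# Toy population: 8 units, two partitions (domains 0-1 and domains 2-3).
sigma2 = [4.0, 4.0, 9.0, 9.0, 1.0, 1.0, 16.0, 16.0]
cost = [1.0] * 8
member = [[0, 2], [0, 3], [0, 2], [0, 3], [1, 2], [1, 3], [1, 2], [1, 3]]
thresholds = [30.0, 30.0, 40.0, 40.0]

pis = optimal_probs(sigma2, cost, member, thresholds)
av = anticipated_variances(pis, sigma2, member, len(thresholds))
print("expected sample size:", sum(pis))
```

Because the unknowns are the unit-level inclusion probabilities rather than stratum sample sizes, overlapping domains are handled without forming their cross-classification.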
Estimation: Multiple sources of auxiliary information
Informative context, Schema 5.1. Example of the partition of the population register with two administrative data sources
Auxiliary variables by source: register variables (source b=0), $\mathbf{x}_{0k}$; additional data sources, source b=1 with $\mathbf{x}_{1k}$ and source b=2 with $\mathbf{x}_{2k}$
Subpopulations $U_{(a)}$ of units and the auxiliary vector available in each:
$U_{(1)}$: no additional administrative source available. $\mathbf{x}_{(1)k} = \mathbf{x}_{0k}$
$U_{(2)}$: only the administrative source 1 is available. $\mathbf{x}_{(2)k} = (\mathbf{x}_{0k}, \mathbf{x}_{1k})$
$U_{(3)}$: both administrative sources 1 and 2 are available. $\mathbf{x}_{(3)k} = (\mathbf{x}_{0k}, \mathbf{x}_{1k}, \mathbf{x}_{2k})$
$U_{(4)}$: only the administrative source 2 is available. $\mathbf{x}_{(4)k} = (\mathbf{x}_{0k}, \mathbf{x}_{2k})$
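Estimation: The working model
For the units belonging to $U_{(a)}$ (a=1,…,A), the following working superpopulation model is introduced
$$ y_{rk} = \mathbf{x}'_{(a)k} \boldsymbol{\beta}_{(a)r} + \varepsilon_{(a)rk} $$
Estimation: The Mgreg estimator
$$ \hat{t}_{mgreg,dr} = \sum_{a=1}^{A} \sum_{k \in U_{(da)}} \tilde{y}_{rk} + \sum_{a=1}^{A} \sum_{k \in s_{(da)}} \hat{\varepsilon}_{rk} / \pi_k $$
where $\tilde{y}_{rk} = \mathbf{x}'_{(a)k} \hat{\boldsymbol{\beta}}_{(a)r}$ is the prediction of $y_{rk}$ and $\hat{\varepsilon}_{rk} = y_{rk} - \mathbf{x}'_{(a)k} \hat{\boldsymbol{\beta}}_{(a)r}$ is the sample residual, with
$$ \hat{\boldsymbol{\beta}}_{(a)r} = \Big( \sum_{j \in s_{(a)}} \mathbf{x}_{(a)j} \mathbf{x}'_{(a)j} / (\pi_j v_{(a)j}) \Big)^{-1} \sum_{j \in s_{(a)}} \mathbf{x}_{(a)j} y_{rj} / (\pi_j v_{(a)j}) $$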
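A minimal sketch of an estimator of this type for one domain and one variable of interest, under simplifying assumptions that are not from the talk: a single scalar auxiliary variable per subpopulation, a working model without intercept, and a constant `v`. The function name `mgreg_total`, the data layout and the toy figures are hypothetical.

```python
def mgreg_total(pop_x_domain, sample, v=1.0):
    """
    pop_x_domain: {a: [x_k for every unit k of the domain that lies in U_(a)]}
    sample:       {a: [(x_k, y_k, pi_k, in_domain) for every sampled unit of U_(a)]}
    Returns the sum over a of (predicted domain total + weighted sample residuals).
    """
    total = 0.0
    for a, xs in pop_x_domain.items():
        obs = sample[a]
        # Weighted least-squares slope fitted on the whole sample of subpopulation a.
        num = sum(x * y / (pi * v) for x, y, pi, _ in obs)
        den = sum(x * x / (pi * v) for x, y, pi, _ in obs)
        beta = num / den
        # Prediction summed over the known auxiliary values of the domain units.
        predicted = beta * sum(xs)
        # Residuals of the sampled domain units, expanded by 1/pi.
        residual = sum((y - beta * x) / pi for x, y, pi, in_d in obs if in_d)
        total += predicted + residual
    return total

# Toy data: two subpopulations, each with its own auxiliary source.
pop_x_domain = {1: [10.0, 12.0, 8.0, 15.0], 2: [20.0, 25.0, 30.0]}
sample = {
    1: [(10.0, 21.0, 0.5, True), (15.0, 29.0, 0.5, True), (9.0, 18.5, 0.25, False)],
    2: [(25.0, 40.0, 0.8, True)],
}
estimate = mgreg_total(pop_x_domain, sample)
print(estimate)
```

Fitting a separate coefficient in each subpopulation is what lets every unit contribute through the richest auxiliary vector available for it.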
Estimation: Properties
The estimator is efficient: the variance is based on the squared residuals of the working model
The estimates are calibrated for each subset $U_{(da)}$: the sample estimates of the totals of the auxiliary variables $\mathbf{x}_{(a)k}$ reproduce the known totals $\mathbf{t}_{(a)\mathbf{x}d}$ at domain level.
The estimates $\hat{t}_{mgreg,\tilde{h}r}$ for the planned domains $\tilde{h}$ that form a partition of a given domain of interest d are consistent at domain level: $\sum_{\tilde{h} \subset d} \hat{t}_{mgreg,\tilde{h}r} = \hat{t}_{mgreg,dr}$.
The estimates are consistent at population level: the sum of the domain estimates over any partition of the population U always reproduces the same estimate of the total for U.
Data Warehouse strategy
Estimation: Properties - auxiliary = interest
The fundamental result: if an auxiliary variable coincides with a variable of interest, say the r-th, the estimate of the total of that variable for the subpopulations $U_{(da)}$ coincides with the total known from the auxiliary data source, and it is estimated without sampling error.
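A short derivation sketch of this result. It assumes that the working-model coefficients are fitted by weighted least squares on the sample $s_{(a)}$ with generic positive weights $w_k$, that the matrix being inverted is nonsingular, and that $\mathbf{e}_j$ (a symbol introduced here only for illustration) is the vector selecting the component of $\mathbf{x}_{(a)k}$ that coincides with the variable of interest.

```latex
\begin{align*}
y_{rk} &= \mathbf{x}'_{(a)k}\,\mathbf{e}_j \quad \text{for all } k \in U_{(a)} \\
\hat{\boldsymbol{\beta}}_{(a)r}
  &= \Big(\sum_{k \in s_{(a)}} w_k\,\mathbf{x}_{(a)k}\mathbf{x}'_{(a)k}\Big)^{-1}
     \sum_{k \in s_{(a)}} w_k\,\mathbf{x}_{(a)k}\mathbf{x}'_{(a)k}\,\mathbf{e}_j
   = \mathbf{e}_j \\
y_{rk} - \mathbf{x}'_{(a)k}\hat{\boldsymbol{\beta}}_{(a)r} &= 0
  \quad \text{for every sampled unit} \\
\hat{t}_{mgreg,dr}
  &= \sum_{a=1}^{A} \sum_{k \in U_{(da)}} \mathbf{x}'_{(a)k}\,\mathbf{e}_j + 0
   = \sum_{a=1}^{A} \sum_{k \in U_{(da)}} y_{rk}
\end{align*}
```

The sample enters only through residuals that are identically zero, so the estimate does not vary from sample to sample.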
Empirical Results: Population of simulation: 1999 Italian enterprises from 1 to 99 employees, Computer and related economic activities (2-digit NACE Rev.1)
ITACOSM 2011, 27-29 June 2011, Pisa, Italy
Stratum population size   Number of cross-classified strata   Cumulative distribution (%)
1                         68                                  18.89
2                         37                                  29.17
3-5                       63                                  46.67
6-10                      50                                  60.56
11-100                    119                                 93.61
More than 100             23                                  100.00
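As a check on the table, the cumulative percentages can be recomputed from the strata counts. This is a small sketch; the variable names are chosen here for illustration.

```python
from itertools import accumulate

size_classes = ["1", "2", "3-5", "6-10", "11-100", "More than 100"]
n_strata = [68, 37, 63, 50, 119, 23]

total = sum(n_strata)  # number of cross-classified strata in the simulation
cumulative_pct = [round(100 * c / total, 2) for c in accumulate(n_strata)]

for label, n, pct in zip(size_classes, n_strata, cumulative_pct):
    print(f"{label:>13}  {n:>4}  {pct:6.2f}")
```

Almost 30% of the strata contain at most two enterprises, which is where a minimum number of units per stratum forces a census.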
The domains of interest (44):
(1) geographical region with 20 marginal domains (DOM1);
(2) economic activity group by Size class (24 domains)
Empirical Results: Simulation: allocation comparison between the one-way and multi-way designs
Prediction models M1 and M2: for $k \in U_{d_1} \cap U_{d_2}$,
$$ y_{rk} = \ldots + u_{rk} $$
with
$$ E_M(u_{rk}) = 0; \quad E_M(u_{rk}^2) = \sigma^2_{r(d_1 d_2)}; \quad E_M(u_{rk} u_{rl}) = 0 \ \text{for } k \ne l $$
Model   R² (%), Value added   R² (%), Labour cost
M1      61.0                  65.1
M2      64.1                  68.1
Units included in the sample with certainty: frequency, and size of the enterprises (number of employed persons), average and minimum
Sampling design                               Sample size   Frequency   Average size   Minimum size
One-way (with M1 model)
  no stratum sample size constraint           716.6         10          47.0           23.0
  at least 1 sample unit per stratum          944           119         24.0           2.0
  at least 2 sample units per stratum         1,042         175         20.6           2.0
Multi-way with M1 model                       936           30          50.1           17.0
Multi-way with M2 model                       991           40          42.9           16.0
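A rough sketch of how much of the one-way sample is forced by the per-stratum constraint alone, using the strata counts by population size reported for the simulated population (68 strata with 1 enterprise, 37 with 2, and 255 with 3 or more). The function names are hypothetical, and the bound is valid only for constraints of at most 3 units per stratum.

```python
# Strata counts by population size, from the table of cross-classified strata.
small_strata = {1: 68, 2: 37}            # stratum population size -> number of strata
n_larger_strata = 63 + 50 + 119 + 23     # strata with 3 or more enterprises

def forced_sample_size(min_units):
    """Smallest sample compatible with `min_units` per stratum (min_units <= 3)."""
    small = sum(min(min_units, size) * n for size, n in small_strata.items())
    return small + min_units * n_larger_strata

def forced_certainty_units(min_units):
    """Units in strata that must be enumerated completely to meet the constraint."""
    return sum(size * n for size, n in small_strata.items() if size <= min_units)

for m in (1, 2):
    print(m, forced_sample_size(m), forced_certainty_units(m))
```

With 2 units per stratum, at least 652 of the 1,042 sampled units and at least 142 of the 175 certainty units are dictated by the constraint rather than by the precision targets.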
Empirical Results: multiple sources of auxiliary information: example – efficiency of the proposed strategy
Sampling distributions over the partition with different auxiliary information
Sample allocation
Subpopulation   Number of enterprises   One-way str. 2 units   HT using pattern info   Mgreg using pattern info   HT using employed persons
1               500                     81.2                   60.5                    19.7                       80.1
2               500                     75.5                   50.9                    14.1                       69.0
3               3,000                   345.0                  246.7                   157.4                      313.0
4               6,392                   540.3                  333.8                   268.8                      478.3
Tot             10,392                  1,042                  691.9                   460.0                      940.5
% Sample allocation
Subpopulation   One-way str. 2 units   HT using pattern info   Mgreg using pattern info   HT using employed persons
1               7.79                   8.75                    4.29                       8.52
2               7.25                   7.36                    3.06                       7.34
3               33.11                  35.65                   34.22                      33.29
4               51.85                  48.24                   58.44                      50.86
Tot             100.00                 100.00                  100.00                     100.00
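To read the efficiency gain from the totals, each strategy's sample size can be expressed relative to the one-way design with 2 units per stratum. This is a small sketch; the dictionary layout is chosen here for illustration.

```python
# Total sample sizes from the "Tot" row of the allocation table.
allocation = {
    "One-way str. 2 units": 1042.0,
    "HT using pattern info": 691.9,
    "Mgreg using pattern info": 460.0,
    "HT using employed persons": 940.5,
}

baseline = allocation["One-way str. 2 units"]
relative = {name: round(100 * n / baseline, 1) for name, n in allocation.items()}

for name, pct in relative.items():
    print(f"{name:<28} {pct:5.1f}% of the one-way sample")
```

In this example the Mgreg strategy needs well under half of the one-way sample.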
Conclusions
We propose a sampling strategy based on balanced sampling and the Mgreg estimator. It is practical and easy to implement, and may represent a general and unified approach for defining the optimal inclusion probabilities
The method, depending on how it is parameterized, can define a standard cross-classified or a multi-way stratified design.
The sampling algorithm defines an optimal solution, minimizing the costs or the sample sizes, which guarantees that the sampling errors of the domain estimates stay below given thresholds
The estimation exploits all the existing auxiliary information
Conclusions
The unified approach is the latest result of a research programme that has lasted almost 6 years:
Survey Methodology (2008)
Statistics in Transition (2006)
2 books published by Franco Angeli illustrating the main findings of a research project of strategic interest financed by the Ministry of University and Research
Presentations: NTTS (2011), Neuchatel (2011)
Invited talk at the next scientific conference of the Italian Society of Statistics
Accepted talk for the ICES
References
Bethel J. (1989). Sample Allocation in Multivariate Surveys. Survey Methodology, 15, 47-57.
Chromy J. (1987). Design Optimization with Multiple Objectives. Proceedings of the Survey Research Methods Section, American Statistical Association, 194-199.
Deville J.-C., Tillé Y. (2004). Efficient Balanced Sampling: the Cube Method. Biometrika, 91, 893-912.
Deville J.-C., Tillé Y. (2005). Variance approximation under balanced sampling. Journal of Statistical Planning and Inference, 128, 569-591.
Falorsi P. D., Righi P. (2008). A Balanced Sampling Approach for Multi-way Stratification Designs for Small Area Estimation. Survey Methodology, 34, 223-234.
Falorsi P. D., Orsini D., Righi P. (2006). Balanced and Coordinated Sampling Designs for Small Domain Estimation. Statistics in Transition, 7, 1173-1198.
Isaki C.T., Fuller W.A. (1982). Survey design under a regression superpopulation model. Journal of the American Statistical Association, 77, 89-96.