Post on 18-Dec-2015

Federica Piersimoni, ISTAT - Italian National Institute of Statistics

Roberto Benedetti, University “G. d’Annunzio” of Chieti-Pescara, Italy

Giuseppe Espa, University of Trento, Italy

On the use of auxiliary variables in agricultural surveys design

Contents

• Actual situation
• Proposal
  • Estimators
  • Sampling designs
• Data description
• Simulation
• Analysis of the results
• Conclusions

Actual situation

[Figure: population units 1, …, N and sample units 1, …, n, observed in months 1 to 12 of 1999, 2000 and 2001.]

Use of the auxiliary information in sample surveys:

• ex ante: setting up an efficient and/or optimal stratification; specification of efficient sample designs
• ex post: sampling weights calibration, post-stratification, etc.

2001 scatter plot matrix

tc1 = cattle slaughterings 2001
tc2 = sheep and goats slaughterings 2001
tc3 = pigs slaughterings 2001
tc4 = equines slaughterings 2001

[Figure: frequency histogram of each of tc1, tc2, tc3 and tc4, and scatter plots of every pair of these variables.]

2000 scatter plot matrix

tc10 = cattle slaughterings 2000
tc20 = sheep and goats slaughterings 2000
tc30 = pigs slaughterings 2000
tc40 = equines slaughterings 2000

[Figure: frequency histogram of each of tc10, tc20, tc30 and tc40, and scatter plots of every pair of these variables.]

1999 scatter plot matrix

tc19 = cattle slaughterings 1999
tc29 = sheep and goats slaughterings 1999
tc39 = pigs slaughterings 1999
tc49 = equines slaughterings 1999

[Figure: frequency histogram of each of tc19, tc29, tc39 and tc49, and scatter plots of every pair of these variables.]

SCATTER PLOTS

[Figure: scatter plots of each 2001 variable against the same variable in 2000 and in 1999: tc1 against tc10 and tc19, tc2 against tc20 and tc29, tc3 against tc30 and tc39, tc4 against tc40 and tc49.]

tc1: cattle slaughterings 2001
tc2: sheep and goats slaughterings 2001
tc3: pigs slaughterings 2001
tc4: equines slaughterings 2001

tc10: cattle slaughterings 2000
tc20: sheep and goats slaughterings 2000
tc30: pigs slaughterings 2000
tc40: equines slaughterings 2000

tc19: cattle slaughterings 1999
tc29: sheep and goats slaughterings 1999
tc39: pigs slaughterings 1999
tc49: equines slaughterings 1999

Correlation Matrix, Year 2001

          tc1       tc2       tc3       tc4
tc1    1.0000    0.0039   -0.0096    0.0566
tc2    0.0039    1.0000    0.0084    0.0551
tc3   -0.0096    0.0084    1.0000   -0.0045
tc4    0.0566    0.0551   -0.0045    1.0000

Correlation Matrix, Year 2000

          tc10      tc20      tc30      tc40
tc10   1.0000   -0.0003   -0.0111    0.0428
tc20  -0.0003    1.0000    0.0068    0.0486
tc30  -0.0111    0.0068    1.0000   -0.0077
tc40   0.0428    0.0486   -0.0077    1.0000

Correlation Matrix, Year 1999

          tc19      tc29      tc39      tc49
tc19   1.0000   -0.0019   -0.0098    0.0378
tc29  -0.0019    1.0000    0.0082    0.0598
tc39  -0.0098    0.0082    1.0000   -0.0053
tc49   0.0378    0.0598   -0.0053    1.0000

Sampling frame: N = 2,211 units (enterprises) and 12 variables: the number of

• cattle,
• pigs,
• sheep and goats,
• equines

slaughtered, as recorded in the census surveys of 1999, 2000 and 2001.

2000 samples of size n = 200…

…using the complete frame for 1999 and for 2000 as auxiliary information to obtain estimates for 2001!
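The simulation loop described above can be sketched as follows. This is only an illustration under stated assumptions: the population is a synthetic lognormal stand-in (not the ISTAT frame), the design is plain SRS with the expansion estimator, the RMSE is expressed relative to the true total, and the function name `simulate_relative_rmse` is hypothetical.

```python
import numpy as np

def simulate_relative_rmse(y, n, n_samples, rng):
    """Monte Carlo RMSE of the expansion estimator under SRS, relative to the true total."""
    N = y.size
    true_total = y.sum()
    estimates = np.empty(n_samples)
    for r in range(n_samples):
        sample = rng.choice(N, size=n, replace=False)   # SRS without replacement
        estimates[r] = N / n * y[sample].sum()           # Horvitz-Thompson estimator under SRS
    rmse = np.sqrt(np.mean((estimates - true_total) ** 2))
    return rmse / true_total

rng = np.random.default_rng(0)
# Assumption: synthetic skewed population with the same N as the frame
y = rng.lognormal(mean=3.0, sigma=1.0, size=2211)
rel_rmse = simulate_relative_rmse(y, n=200, n_samples=2000, rng=rng)
print(f"relative RMSE: {rel_rmse:.2%}")
```

The same loop applies to any other design or estimator by swapping the two lines that draw the sample and compute the estimate.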

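Estimates obtained through the Horvitz-Thompson expansion estimator and the calibration estimator (PV) by Deville and Särndal (1992):

$$\min_{w} \sum_{k \in s} G_k(w_k, d_k) \quad \text{s.to} \quad \sum_{k \in s} w_k x_k = t_x$$

where $G_k$ is the distance function between the calibrated weights $w_k$ and the design weights $d_k$, and $t_x$ is the vector of the totals of the auxiliary variables.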
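As an illustration of the calibration problem, the sketch below assumes the chi-square distance $G_k(w_k, d_k) = (w_k - d_k)^2 / (2 d_k)$, one of the distance functions considered by Deville and Särndal (1992), for which the weights have a closed form. The slides do not say which distance was used; the function name `linear_calibration` and the synthetic data are hypothetical.

```python
import numpy as np

def linear_calibration(d, X, t_x):
    """Calibrated weights under the chi-square distance (closed form).

    d   : (n,) design weights
    X   : (n, p) auxiliary variables of the sampled units
    t_x : (p,) known population totals of the auxiliary variables
    """
    # The Lagrange multipliers solve (sum_k d_k x_k x_k^T) lam = t_x - sum_k d_k x_k
    A = (X * d[:, None]).T @ X
    lam = np.linalg.solve(A, t_x - d @ X)
    return d * (1.0 + X @ lam)

rng = np.random.default_rng(1)
N, n = 2211, 200
X_pop = rng.gamma(shape=2.0, scale=50.0, size=(N, 2))      # synthetic auxiliary variables
y_pop = 1.1 * X_pop[:, 0] + rng.normal(0.0, 5.0, size=N)   # synthetic variable of interest
s = rng.choice(N, size=n, replace=False)                   # SRS sample
d = np.full(n, N / n)                                      # SRS design weights
w = linear_calibration(d, X_pop[s], X_pop.sum(axis=0))
calibrated_total = w @ y_pop[s]                            # calibration estimate of the total of y
```

By construction the calibrated weights reproduce the known totals of the auxiliary variables exactly. Other distance functions generally require an iterative solution.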

Samples selection

• simple random sampling (SRS)
• stratified sampling (ST)
• ranked set sampling (RSS)
• probability proportional to size (PS)
• balanced sampling
• PS + balanced sampling

SRS: the direct estimate doesn’t use auxiliary information.

ST: auxiliary information is used ex ante to set up the strata; five planned strata; multivariate allocation model by Bethel (1989).

RSS: original formulation:

• Selection by SRS without replacement of a first sample of n units;
• Ranking of the n sample units in increasing order with respect to an auxiliary variable x known for every population unit;
• The variable of interest y is measured on the first unit only;
• A second SRS is drawn and ranked;
• The variable of interest y is measured on the second unit only;
• … and so on, up to n replications.
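The steps above can be sketched as follows. It is an illustration only: the sets are assumed to be drawn independently of each other (so a unit may be selected more than once), the data are synthetic, and the function name `ranked_set_sample` is hypothetical.

```python
import numpy as np

def ranked_set_sample(x, n, rng):
    """Return the indices of a ranked set sample of size n.

    For r = 0, ..., n-1: draw an SRS of n units without replacement,
    rank it by the auxiliary variable x, and keep the unit of rank r.
    """
    N = x.size
    selected = np.empty(n, dtype=int)
    for r in range(n):
        candidates = rng.choice(N, size=n, replace=False)
        order = np.argsort(x[candidates], kind="stable")   # increasing order of x
        selected[r] = candidates[order[r]]                 # y is measured on this unit only
    return selected

rng = np.random.default_rng(2)
x = rng.lognormal(3.0, 1.0, size=2211)       # synthetic auxiliary variable
y = x * rng.uniform(0.8, 1.2, size=2211)     # synthetic variable of interest
idx = ranked_set_sample(x, n=200, rng=rng)
rss_mean = y[idx].mean()                     # RSS estimate of the population mean of y
```

Only one auxiliary variable can drive the ranking in this formulation, which is why a choice of ranking variable is needed when several auxiliary variables are available.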

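Ranking variable:

$$\pi_{ik,t} = \frac{n\, x_{ik,t}}{\sum_{j \in U} x_{ij,t}}$$

with k = 1, …, N, i = 1, …, 4 and t = 1999, 2000.

For the units k with $\pi_{ik,t} > 1$: $\pi_{ik,t} = 1$.

PS:

If a positive auxiliary variable x exists, selection is made with probability proportional to x.

Such ex ante probability is $\pi_k$.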
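A minimal sketch of inclusion probabilities proportional to size, $\pi_k = n x_k / \sum_{j \in U} x_j$, with units whose value would exceed 1 set to 1. Redistributing the remaining sample size among the other units, so that the probabilities still sum to n, is a common convention assumed here rather than something stated in the slides. The function name `inclusion_probabilities` is hypothetical.

```python
import numpy as np

def inclusion_probabilities(x, n):
    """pi_k proportional to x_k, capped at 1, summing to n (x must be positive)."""
    x = np.asarray(x, dtype=float)
    pi = np.zeros_like(x)
    certain = np.zeros(x.size, dtype=bool)       # units selected with certainty
    while True:
        rest = ~certain
        # share the sample size still available among the non-certain units
        pi[rest] = (n - certain.sum()) * x[rest] / x[rest].sum()
        over = rest & (pi > 1.0)
        if not over.any():
            break
        certain |= over
        pi[certain] = 1.0
    return pi

x = np.array([1.0, 1.0, 1.0, 1.0, 96.0])
pi = inclusion_probabilities(x, n=2)             # the large unit gets probability 1

# Horvitz-Thompson estimate of a total from a sample s drawn with these probabilities
y = np.array([2.0, 3.0, 1.0, 4.0, 150.0])        # illustrative values
s = np.array([1, 4])                             # illustrative sample including the certain unit
ht_total = np.sum(y[s] / pi[s])
```

In the toy example the four small units each receive probability 0.25 and the large unit is taken with certainty.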

BALANCED SAMPLING and PS + BALANCED SAMPLING:

The balance constraint

$$\sum_{k \in s} w_k x_k = \sum_{k \in U} x_k$$

has been imposed for the four variables to be estimated.

The difference between the two criteria: in the second case the constraint is imposed ex post on PS samples.
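One simple way to obtain samples that satisfy a balance constraint of the form $\sum_{k \in s} w_k x_k = \sum_{k \in U} x_k$ approximately is a rejective procedure: draw candidate samples and keep the first one within a tolerance. The sketch below uses SRS candidates with weights N/n. The slides do not state which balancing algorithm was used, so this is an assumed stand-in, and the names `balance_gap` and `rejective_balanced_srs`, the tolerance and the data are hypothetical.

```python
import numpy as np

def balance_gap(w, X_s, t_x):
    """Relative deviation of sum_k w_k x_k from the population totals t_x."""
    return np.abs(w @ X_s - t_x) / t_x

def rejective_balanced_srs(X, n, tol, rng, max_tries=100_000):
    """Draw SRS samples until every balance constraint holds within tol."""
    N = X.shape[0]
    t_x = X.sum(axis=0)
    w = np.full(n, N / n)                        # expansion weights under SRS
    for _ in range(max_tries):
        s = rng.choice(N, size=n, replace=False)
        if np.all(balance_gap(w, X[s], t_x) <= tol):
            return s
    raise RuntimeError("no sample met the balance tolerance")

rng = np.random.default_rng(3)
X = rng.gamma(shape=4.0, scale=25.0, size=(2211, 2))   # synthetic auxiliary variables
s = rejective_balanced_srs(X, n=200, tol=0.02, rng=rng)
```

Rejecting samples changes the inclusion probabilities of the units, so the expansion weights are only approximately correct for the accepted samples. Replacing the SRS draw with a PS draw gives the ex post variant applied to PS samples.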

RMSE (as % of the estimate)

Selection criterion   Estimator     Aux. var. year   CATTLE    SHEEP AND GOATS   PIGS      EQUINES
SRS                   Direct        -----            40.00%    51.25%            51.65%    57.70%
SRS                   Calibration   2000             14.12%    25.31%            27.43%    26.45%
SRS                   Calibration   1999             20.31%    31.66%            26.73%    35.12%
Stratified            Direct        2000             28.04%    32.76%            34.43%    33.84%
Stratified            Direct        1999             26.47%    32.17%            33.92%    30.36%
Stratified            Calibration   2000              4.71%     9.04%            13.62%     9.77%
Stratified            Calibration   1999             12.95%    12.43%            14.53%    12.87%
Ranked                Direct        2000             39.33%    47.91%            47.33%    51.23%
Ranked                Direct        1999             38.82%    47.82%            46.12%    52.43%
Ranked                Calibration   2000             13.52%    21.80%            22.36%    25.68%
Ranked                Calibration   1999             18.18%    31.77%            22.80%    35.18%
PPS                   Direct        2000              6.17%     6.38%            15.61%     3.74%
PPS                   Direct        1999              7.28%    10.18%            17.20%     6.79%
PPS                   Calibration   2000              4.52%     5.04%            14.87%     2.57%
PPS                   Calibration   1999              6.04%     9.28%            16.67%     6.48%
Balanced              Direct        2000              6.24%    13.48%            17.26%    15.05%
Balanced              Direct        1999             23.57%    20.05%            17.46%    19.43%
Bal/PPS               Direct        2000              5.55%     5.08%            14.60%     2.50%
Bal/PPS               Direct        1999              6.37%     9.45%            16.72%     6.96%

CATTLE - RMSE (as % of the estimate)

[Figure: bar chart of the RMSE for each combination of selection criterion (SRS, Stratified, Ranked, PPS, Balanced, Bal/PPS), estimator (direct, calibration) and auxiliary variable year (2000, 1999); vertical axis from 0% to 45%.]

SHEEP AND GOATS - RMSE (as % of the estimate)

[Figure: bar chart of the RMSE for each combination of selection criterion (SRS, Stratified, Ranked, PPS, Balanced, Bal/PPS), estimator (direct, calibration) and auxiliary variable year (2000, 1999); vertical axis from 0% to 60%.]

PIGS - RMSE (as % of the estimate)

[Figure: bar chart of the RMSE for each combination of selection criterion (SRS, Stratified, Ranked, PPS, Balanced, Bal/PPS), estimator (direct, calibration) and auxiliary variable year (2000, 1999); vertical axis from 0% to 60%.]

HORSES - RMSE (as % of the estimate)

[Figure: bar chart of the RMSE for each combination of selection criterion (SRS, Stratified, Ranked, PPS, Balanced, Bal/PPS), estimator (direct, calibration) and auxiliary variable year (2000, 1999); vertical axis from 0% to 70%.]

RMSE (as % of the estimate)

[Figure: bar chart comparing the RMSE for CATTLE, SHEEP AND GOATS, PIGS, HORSES and TOTAL; vertical axis from 0% to 70%.]

Conclusions

• It is better to impose the balance constraints in the design phase than ex post (cf. RMSE SRS - RMSE BAL).
• Best performances: balanced PS selections and PS with calibration.
• A joint use of complex estimators together with efficient sampling designs may considerably reduce the variability of the estimates.

but…

• The PS and PS with calibration selection criteria are more efficient but less robust than the others when outliers are present.
• Bad performance of the RSS method: forced univariate use of the auxiliary information in setting up the ranking when linear independence is present.

[Figure: simulated sampling distribution of the tc2 estimates in the case of PS, with calibration estimator based on auxiliary variables of 2000; the true value is marked.]

[Figure: simulated sampling distribution of the tc3 estimates in the case of PS, with calibration estimator based on auxiliary variables of 1999; the true value is marked.]

[Figure: simulated sampling distribution of the tc4 direct estimates in the case of balanced PS, based on auxiliary variables of 1999; the true value is marked.]

[Figure: simulated sampling distribution of the tc2 direct estimates in the case of balanced PS, based on auxiliary variables of 2000; the true value is marked.]

References

Al-Saleh M.F., Al-Omari A.I. (2002) Multistage ranked set sampling, Journal of Statistical Planning and Inference, 102, 273-286.
Bai Z., Chen Z. (2003) On the theory of ranked-set sampling and its ramifications, Journal of Statistical Planning and Inference, 109, 81-99.
Bethel J. (1989) Sample allocation in multivariate surveys, Survey Methodology, 15, 47-57.
Deville J.C., Särndal C.E. (1992) Calibration Estimators in Survey Sampling, Journal of the American Statistical Association, 87, 418, 376-382.
Dorfman A.H., Valliant R. (2000) Stratification by size revisited, Journal of Official Statistics, 16, 2, 139-154.
Espa G., Benedetti R., Piersimoni F. (2001) Prospettive e soluzioni per il data editing nelle rilevazioni in agricoltura, Statistica Applicata, 13, 4, 363-391.
Hidiroglou M.A. (1986) The construction of a self-representing stratum of large units in survey design, The American Statistician, 40, 1, 27-31.
Li D., Sinha B.K., Perron F. (1999) Random selection in ranked set sampling and its applications, Journal of Statistical Planning and Inference, 76, 185-201.
McIntyre G.A. (1952) A method for unbiased selective sampling, using ranked sets, Australian Journal of Agricultural Research, 3, 385-390.
Patil G.P., Sinha A.K., Taillie C. (1994a) Ranked set sampling, in G.P. Patil and C.R. Rao (eds) Handbook of Statistics, Volume 12, Environmental Statistics, North Holland Elsevier, New York, 167-200.
Patil G.P., Sinha A.K., Taillie C. (1994b) Ranked set sampling for multiple characteristics, International Journal of Ecology and Environmental Sciences, 20, 94-109.
Ridout M.S. (2003) On ranked set sampling for multiple characteristics, Environmental and Ecological Statistics, 10, 255-262.
Rosén B. (1997) On sampling with probability proportional to size, Journal of Statistical Planning and Inference, 62, 159-191.
Royall R.M. (1970) On finite population sampling theory under certain linear regression models, Biometrika, 57, 2, 377-387.
Royall R.M. (1992) Robustness and optimal design under prediction models for finite populations, Survey Methodology, 18, 179-185.
Royall R.M., Herson J. (1973a) Robust estimation in finite populations I, Journal of the American Statistical Association, 68, 344, 880-889.
Royall R.M., Herson J. (1973b) Robust estimation in finite populations II: stratification on a size variable, Journal of the American Statistical Association, 68, 344, 890-893.
Särndal C.E., Swensson B., Wretman J. (1992) Model Assisted Survey Sampling, Springer Verlag, New York.

THANK YOU FOR YOUR ATTENTION!