
Stat472/572 Sampling: Theory and Practice

Instructor: Yan Lu


Chapter 3: Stratified Sampling

Example: A population contains 1000 males and 100 females.

• Take an SRS of size 55 from the population. The sample might contain no females at all.

-- Most people would not consider such a sample representative of the population, since men and women might respond differently on the item of interest.

• With a stratified sample, we can take 50 males and 5 females.

-- A sample with no or few females cannot be selected, so we are protected from the possibility of obtaining a really bad sample.

-- Stratification usually increases the precision of the estimators.
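To put a number on the risk described in this example, the chance that an SRS of 55 contains no females is a hypergeometric probability. The sketch below is an illustration only; the variable names are ours, not the lecture's.

```python
from math import comb

N_MALE, N_FEMALE, n = 1000, 100, 55

# An SRS with no females is one of the C(1000, 55) all-male samples
# out of the C(1100, 55) equally likely samples of size 55.
p_no_female = comb(N_MALE, n) / comb(N_MALE + N_FEMALE, n)

print(f"P(no females in an SRS of {n}) = {p_no_female:.4f}")
```

The probability is small but not zero, whereas a stratified design with 5 females makes it exactly zero.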


Stratified Sampling

• Divide population into H subpopulations, called strata. The

strata do not overlap and they constitute the whole population

• Each sampling unit belongs to exactly one stratum

• Draw an independent probability sample from each stratum

• Pool the information to obtain overall population estimates


Figure 1: Stratification


Example 3.2: Agriculture survey (Refer to Example 2.5)

• In Example 2.5, we generated a random sample. But some areas were

overrepresented, and others not represented at all

• Part of the large variability arises because counties in the western United

States are larger, and thus tend to have larger values of y, than counties in

the eastern United States

• Taking a stratified sample can provide some balance in the sample on the

stratifying variable

• We use the four census regions of the United States: Northeast (NE), North

Central (NC), South (S), and West (W) strata, and sample about 10% of the

counties in each stratum.


Figure 2: Boxplot of data from example 3.2. The thick line for each region is the median of

the sample data from that region; the other horizontal lines in the boxes are the 25th and 75th

percentiles. The Northeast region has a relatively small median and small variance; the West

region, however, has a much higher median and variance. The distribution of farm acreage

appears to be positively skewed in each of the regions.

[Figure 2 plot: boxplots of farm acreage in millions of acres (vertical axis, 0.0 to 2.0) by region (horizontal axis: NC, NE, S, W)]


Stratum          Number of counties in stratum   Number of counties in sample
Northeast                  220                              21
North Central             1054                             103
South                     1382                             135
West                       422                              41
Total                     3078                             300


Table 1: Summary statistics for each stratum

Region          Stratum size N_h   Sample size n_h   Sample average     Sample variance s_h^2
Northeast             220                21              97,629.8          7,647,472,708
North Central        1054               103             300,504.2         29,618,183,543
South                1382               135             211,315.0         53,587,487,856
West                  422                41             662,295.5        396,185,950,266

• We took an SRS in each stratum. For the Northeast region,

\[ \hat{t}_1 = (220)(97{,}629.81) = 21{,}478{,}558.2 \]

\[ \hat{V}(\hat{t}_1) = (220)^2 \left(1 - \frac{21}{220}\right) \frac{7{,}647{,}472{,}708}{21} = 1.594316 \times 10^{13} \]


Table 2: Estimates of the total number of farm acres and estimated variance of the total for

each of the four strata

Region          Estimated total     Estimated variance of the total
Northeast        21,478,558.2       1.59432 × 10^13
North Central   316,731,379.4       2.88232 × 10^14
South           292,037,390.8       6.84076 × 10^14
West            279,488,706.1       1.55365 × 10^15
Total           909,736,034.4       2.5419 × 10^15
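The stratum-by-stratum arithmetic behind Table 2 can be sketched in a few lines. This is an illustration under stated assumptions: the inputs are the rounded summary statistics of Table 1, with N_h = 1054 for North Central (the value in the stratum-size table), so the totals agree with Table 2 only up to rounding. The variable names are ours.

```python
from math import sqrt

# (N_h, n_h, sample mean, sample variance) per stratum, from Table 1
strata = {
    "Northeast":     (220,  21,  97_629.8,  7_647_472_708),
    "North Central": (1054, 103, 300_504.2, 29_618_183_543),
    "South":         (1382, 135, 211_315.0, 53_587_487_856),
    "West":          (422,  41,  662_295.5, 396_185_950_266),
}

t_hat = 0.0   # estimated population total
v_hat = 0.0   # estimated variance of the total
for name, (N_h, n_h, ybar_h, s2_h) in strata.items():
    t_h = N_h * ybar_h                             # stratum total estimate
    v_h = N_h**2 * (1 - n_h / N_h) * s2_h / n_h    # SRS variance within stratum
    print(f"{name:14s} t_h = {t_h:15,.1f}  V_h = {v_h:.5e}")
    t_hat += t_h
    v_hat += v_h

se = sqrt(v_hat)
print(f"t_str = {t_hat:,.1f}, V = {v_hat:.4e}, SE = {se:,.0f}")
```

Because the strata are sampled independently, the stratum variances simply add.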


Table 3: Comparison between SRS and stratified random sampling for agriculture data

Design           Sample size   Estimated total   SE
SRS                  300         916,927,110     58,169,381
Stratification       300         909,736,034     50,417,248

• Observations within many strata tend to be more homogeneous than observations in the

population as a whole. Reduction in variance in the individual strata often leads to a

reduced variance for the population estimate

• Ratio of the estimated variances:

\[ \frac{\text{estimated variance from stratified sample, } n = 300}{\text{estimated variance from SRS, } n = 300} = \frac{2.5419 \times 10^{15}}{3.3837 \times 10^{15}} = 0.75 \]

• If these were the population variances, we would expect that we would need only (300)(0.75) =

225 observations with a stratified sample to obtain the same precision as from an SRS of

300 observations.


Comments:

• Reduces variability by eliminating possible bad samples

• May want data of known precision for subgroups

• May lower cost and be more convenient to administer

• Usually reduces variability when estimating quantities for the whole population


Theory of Stratified Sampling:

Stratum            1     2     ···   H     Total
Population size    N_1   N_2   ···   N_H   \sum_{h=1}^{H} N_h = N
Sample size        n_1   n_2   ···   n_H   \sum_{h=1}^{H} n_h = n
Population total   t_1   t_2   ···   t_H   \sum_{h=1}^{H} t_h = t

• Take an SRS of size n_h from stratum h, for h = 1, ..., H


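• Estimated total:

\[ \hat{t}_{str} = \hat{t}_1 + \hat{t}_2 + \cdots + \hat{t}_H = N_1\bar{y}_1 + N_2\bar{y}_2 + \cdots + N_H\bar{y}_H \]

• Estimated variance of the estimated total:

\[ \hat{V}(\hat{t}_{str}) = \hat{V}(\hat{t}_1) + \hat{V}(\hat{t}_2) + \cdots + \hat{V}(\hat{t}_H) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h} \]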


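\[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \frac{\sum_{h=1}^{H} \hat{t}_h}{N} = \frac{\sum_{h=1}^{H} N_h \bar{y}_h}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

This is a weighted average of the stratum sample means.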


• Confidence intervals for stratified samples

-- If either (1) the sample sizes within each stratum are large,

-- or (2) the sampling design has a large number of strata,

then by the central limit theorem (Krewski and Rao 1981), an approximate 100(1 − α)% confidence interval for the population mean \bar{y}_U is

\[ \bar{y}_{str} \pm z_{\alpha/2}\, SE(\bar{y}_{str}) \]

Some survey software packages use the percentile of a t distribution with n − H degrees of freedom rather than the percentile of the normal distribution.
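As a hedged illustration of the normal-based interval, the sketch below applies it to the agriculture example, dividing the stratified total and its standard error (Table 3) by N = 3078 to get the mean acreage per county. The variable names are ours; the t-based variant would need a t quantile from a statistics library and is not shown.

```python
from statistics import NormalDist

N = 3078                                  # counties in the population
t_hat, se_t = 909_736_034, 50_417_248     # stratified estimate and SE from Table 3

ybar_str = t_hat / N                      # estimated mean acres per county
se_ybar = se_t / N                        # SE of the mean is SE of the total over N

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 normal percentile

ci = (ybar_str - z * se_ybar, ybar_str + z * se_ybar)
print(f"ybar_str = {ybar_str:,.1f}, 95% CI = ({ci[0]:,.1f}, {ci[1]:,.1f})")
```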


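Population quantities and the corresponding sample quantities (y_{hj}: value of the jth unit in stratum h; \mathcal{S}_h: the set of sampled units in stratum h)

Stratum total:
\[ t_h = \sum_{j=1}^{N_h} y_{hj} \qquad \hat{t}_h = \frac{N_h}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} = N_h \bar{y}_h \]

Population total:
\[ t = \sum_{h=1}^{H} t_h \qquad \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h \]

Stratum mean:
\[ \bar{y}_{hU} = \frac{\sum_{j=1}^{N_h} y_{hj}}{N_h} \qquad \bar{y}_h = \frac{1}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} \]

Population mean:
\[ \bar{y}_U = \frac{t}{N} = \frac{\sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj}}{N} \qquad \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

Stratum variance:
\[ S_h^2 = \frac{\sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hU})^2}{N_h - 1} \qquad s_h^2 = \frac{\sum_{j \in \mathcal{S}_h} (y_{hj} - \bar{y}_h)^2}{n_h - 1} \]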


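\[ \hat{t}_{str} = \hat{t}_1 + \hat{t}_2 + \cdots + \hat{t}_H = N_1\bar{y}_1 + N_2\bar{y}_2 + \cdots + N_H\bar{y}_H \]

\[ \hat{V}(\hat{t}_{str}) = \hat{V}(\hat{t}_1) + \hat{V}(\hat{t}_2) + \cdots + \hat{V}(\hat{t}_H) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h} \]

\[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

\[ \hat{V}(\bar{y}_{str}) = \frac{1}{N^2} \hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{s_h^2}{n_h} \]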


Properties of the estimators:

• E[\hat{t}_{str}] = t

• E[\bar{y}_{str}] = \bar{y}_U

• \hat{V}(\hat{t}_{str}) is an unbiased estimator of V(\hat{t}_{str})

• \hat{V}(\bar{y}_{str}) is an unbiased estimator of V(\bar{y}_{str})


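\[ E[\hat{t}_{str}] = E\left[\sum_{h=1}^{H} N_h \bar{y}_h\right] = \sum_{h=1}^{H} N_h E(\bar{y}_h) = \sum_{h=1}^{H} N_h \bar{y}_{hU} = \sum_{h=1}^{H} t_h = t \]

\[ E[\bar{y}_{str}] = E\left[\frac{\hat{t}_{str}}{N}\right] = \frac{t}{N} = \bar{y}_U \]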


Stratified sampling for proportions

A proportion is the special case of a mean in which

\[ y_i = \begin{cases} 1 & \text{if the unit has the characteristic} \\ 0 & \text{otherwise} \end{cases} \]


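\[ \bar{y}_h = \hat{p}_h \]

\[ s_h^2 = \frac{n_h}{n_h - 1}\, \hat{p}_h (1 - \hat{p}_h) \]

\[ \hat{p}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \hat{p}_h \]

\[ \hat{V}(\hat{p}_{str}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1} \]

\[ \hat{t}_{str} = \sum_{h=1}^{H} N_h \hat{p}_h \]

\[ \hat{V}(\hat{t}_{str}) = N^2\, \hat{V}(\hat{p}_{str}) \]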


Example 3.4. The American Council of Learned Societies (ACLS) used a stratified random sample of selected ACLS societies in seven disciplines to study publication patterns and computer and library use among scholars who belong to one of the member organizations of the ACLS. The data are shown in the following table.


Discipline          Membership N_h   Number mailed   Valid returns n_h   Female members (%)
Literature               9100             915              636                  38
Classics                 1950             633              451                  27
Philosophy               5500             658              481                  18
History                 10850             855              611                  19
Linguistics              2100             667              493                  36
Political Science        5500             833              575                  13
Sociology                9000             824              588                  26
Totals                  44000            5385             3835

• Want to estimate the percentage and number of female members of the major societies in those seven disciplines


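• Ignoring the nonresponse and assuming no duplicate memberships,

\[ \hat{p}_{str} = \sum_{h=1}^{7} \frac{N_h}{N} \hat{p}_h = \frac{9100}{44000} \times .38 + \cdots + \frac{9000}{44000} \times .26 = .2465 \]

\[ SE(\hat{p}_{str}) = \sqrt{\sum_{h=1}^{7} \left(1 - \frac{n_h}{N_h}\right) \left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h (1 - \hat{p}_h)}{n_h - 1}} = .0071 \]

The estimated total number of female members in the societies is

\[ \hat{t}_{str} = 44000 \times .2465 = 10847 \]

with

\[ SE(\hat{t}_{str}) = 44000 \times .0071 = 312 \]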
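The ACLS figures can be reproduced with a short script. This is a sketch under one stated assumption: n_h is taken to be the number of valid returns in each discipline, which is the choice that matches the reported SE of .0071. The variable names are ours.

```python
from math import sqrt

# (N_h = membership, n_h = valid returns, p_h = proportion female)
acls = [
    (9100, 636, 0.38),   # Literature
    (1950, 451, 0.27),   # Classics
    (5500, 481, 0.18),   # Philosophy
    (10850, 611, 0.19),  # History
    (2100, 493, 0.36),   # Linguistics
    (5500, 575, 0.13),   # Political Science
    (9000, 588, 0.26),   # Sociology
]

N = sum(N_h for N_h, _, _ in acls)

# Stratified proportion: weighted average of the stratum proportions
p_str = sum(N_h / N * p_h for N_h, _, p_h in acls)

# Estimated variance, using p_h(1 - p_h)/(n_h - 1) within each stratum
v_p = sum(
    (1 - n_h / N_h) * (N_h / N) ** 2 * p_h * (1 - p_h) / (n_h - 1)
    for N_h, n_h, p_h in acls
)
se_p = sqrt(v_p)

t_hat = N * p_str    # estimated number of female members
se_t = N * se_p

print(f"p_str = {p_str:.4f}, SE = {se_p:.4f}")
print(f"t_str = {t_hat:.0f}, SE = {se_t:.0f}")
```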


Review: Stratified random sampling

Stratum            1     2     ···   H     Total
Population size    N_1   N_2   ···   N_H   \sum_{h=1}^{H} N_h = N
Sample size        n_1   n_2   ···   n_H   \sum_{h=1}^{H} n_h = n
Population total   t_1   t_2   ···   t_H   \sum_{h=1}^{H} t_h = t


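Population quantities and the corresponding sample quantities (y_{hj}: value of the jth unit in stratum h, the same in both columns; \mathcal{S}_h: the set of sampled units in stratum h)

Stratum total:
\[ t_h = \sum_{j=1}^{N_h} y_{hj} \qquad \hat{t}_h = \frac{N_h}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} = N_h \bar{y}_h \]

Population total:
\[ t = \sum_{h=1}^{H} t_h \qquad \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h \]

Stratum mean:
\[ \bar{y}_{hU} = \frac{\sum_{j=1}^{N_h} y_{hj}}{N_h} \qquad \bar{y}_h = \frac{1}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} \]

Population mean:
\[ \bar{y}_U = \frac{t}{N} = \frac{\sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj}}{N} \qquad \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h \]

Stratum variance:
\[ S_h^2 = \frac{\sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hU})^2}{N_h - 1} \qquad s_h^2 = \frac{\sum_{j \in \mathcal{S}_h} (y_{hj} - \bar{y}_h)^2}{n_h - 1} \]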


Properties of the estimators:

• E[\hat{t}_{str}] = t

• E[\bar{y}_{str}] = \bar{y}_U

Confidence intervals for stratified samples

-- If either (1) the sample sizes within each stratum are large,

-- or (2) the sampling design has a large number of strata,

then by the central limit theorem (Krewski and Rao 1981), an approximate 100(1 − α)% confidence interval for the population mean \bar{y}_U is

\[ \bar{y}_{str} \pm z_{\alpha/2}\, SE(\bar{y}_{str}) \]

Some survey software packages use the percentile of a t distribution with n − H degrees of freedom rather than the percentile of the normal distribution.


Using Weights

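Sampling weight: the number of units in the population represented by sample member (h, j), where h indexes the stratum and j the element within the stratum.

\[ \hat{t}_{str} = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} \frac{N_h}{n_h} y_{hj} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} y_{hj}, \qquad \text{where } w_{hj} = \frac{N_h}{n_h} \]

\[ \bar{y}_{str} = \frac{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} y_{hj}}{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}} \]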


Example: Suppose a population has 2000 units, 1600 of them

are males (stratum 1), and 400 are females (stratum 2). If the

sample has 400 units, 200 units from each stratum, then,

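\[ \pi_{1j} = \frac{200}{1600} = \frac{1}{8} \quad \text{and} \quad w_{1j} = \frac{1}{\pi_{1j}} = 8 \]

\[ \pi_{2j} = \frac{200}{400} = \frac{1}{2} \quad \text{and} \quad w_{2j} = \frac{1}{\pi_{2j}} = 2 \]

• each man in the sample represents 8 men in the population

• each woman in the sample represents 2 women in the population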


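• \pi_{hj} = n_h / N_h

• w_{hj} = N_h / n_h

• \[ \hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} \frac{N_h}{n_h} y_{hj} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} y_{hj} \]

• \[ V(\hat{t}_{str}) = \sum_{h=1}^{H} V(\hat{t}_h) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \]

• \[ \bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = \frac{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} y_{hj}}{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}} \]

• \[ V(\bar{y}_{str}) = \frac{V(\hat{t}_{str})}{N^2} = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \]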


Comments:

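• Let \pi_{hj} be the probability of selecting unit j from stratum h. Then w_{hj} = 1/\pi_{hj} = N_h/n_h

• \[ \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} \frac{N_h}{n_h} = \sum_{h=1}^{H} N_h = N \]

-- The whole sample represents the entire population, and the sum of the weights is equal to the population size

• \[ \hat{t}_{str} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} y_{hj} \]

• \[ \bar{y}_{str} = \frac{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} y_{hj}}{\sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj}} \]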
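The weight formulas can be checked numerically. In the sketch below, the first part uses the 1600/400 population with 200 sampled units per stratum from the example; the second part uses a tiny made-up sample (all y-values are hypothetical) to show that the weighted sums agree with the stratum-by-stratum formulas. The function name `weighted_estimates` is ours.

```python
def weighted_estimates(sample):
    """Return (estimated total, estimated mean) from (weight, y) pairs."""
    total = sum(w * y for w, y in sample)
    sum_w = sum(w for w, _ in sample)
    return total, total / sum_w

# Part 1: weights in the 1600 males / 400 females example, 200 sampled from each
strata = {"male": (1600, 200), "female": (400, 200)}
weights = {name: N_h / n_h for name, (N_h, n_h) in strata.items()}
sum_w_example = sum(weights[name] * n_h for name, (_, n_h) in strata.items())
print(weights, sum_w_example)   # the weights add up to the population size

# Part 2: hypothetical data, stratum 1 (N=8, n=2) and stratum 2 (N=6, n=3)
sample = [(8 / 2, 10), (8 / 2, 14), (6 / 3, 1), (6 / 3, 2), (6 / 3, 3)]
t_hat, ybar_str = weighted_estimates(sample)

# Same total via N_1 * ybar_1 + N_2 * ybar_2 = 8 * 12 + 6 * 2
t_check = 8 * 12 + 6 * 2
print(t_hat, t_check, ybar_str)
```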


Back to the previous example. Suppose a population has 2000

units, 1600 of them are males (stratum 1), and 400 are females

(stratum 2). If we randomly select 160 males from stratum 1

and 40 women from stratum 2,

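\[ \pi_{1j} = \frac{160}{1600} = \frac{1}{10} \quad \text{and} \quad w_{1j} = \frac{1}{\pi_{1j}} = 10 \]

\[ \pi_{2j} = \frac{40}{400} = \frac{1}{10} \quad \text{and} \quad w_{2j} = \frac{1}{\pi_{2j}} = 10 \]

The number of sampled units in each stratum is proportional to the size of the stratum. We call this allocation method proportional allocation.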


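Proportional Allocation: the number of sampled units in each stratum is proportional to the size of the stratum

\[ \frac{n_h}{N_h} = \frac{n}{N}, \qquad n_h = N_h \frac{n}{N} \]

\[ \pi_{hj} = \frac{n_h}{N_h} = \frac{n}{N} \quad \text{and} \quad w_{hj} = \frac{1}{\pi_{hj}} = \frac{N}{n} \]

The sample is self-weighting:

\[ \bar{y}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = \sum_{h=1}^{H} \frac{N_h}{N} \frac{\sum_{j \in \mathcal{S}_h} y_{hj}}{n_h} = \sum_{h=1}^{H} \frac{1}{n} \sum_{j \in \mathcal{S}_h} y_{hj} = \frac{1}{n} \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} y_{hj} = \bar{y} \]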
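As an illustration of proportional allocation, the sketch below applies n_h = N_h n/N to the four census regions of the agriculture example (stratum sizes 220, 1054, 1382, 422 and n = 300) and rounds to whole counties. The rounding rule is our assumption; in general, rounded allocations need not add up to n exactly and may need a manual adjustment.

```python
N_h = {"Northeast": 220, "North Central": 1054, "South": 1382, "West": 422}
n = 300
N = sum(N_h.values())

# n_h = N_h * n / N, rounded to the nearest whole sampling unit
alloc = {region: round(n * size / N) for region, size in N_h.items()}

# Under proportional allocation every unit has (approximately) weight N / n
common_weight = N / n

print(alloc, sum(alloc.values()), round(common_weight, 2))
```

Here the rounded allocation happens to sum to 300 and matches the sample sizes used in the example.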


Variances:

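\[ V_{prop}(\bar{y}_{str}) = \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{h=1}^{H} \frac{N_h}{N} S_h^2 \]

\[ V_{prop}(\hat{t}_{str}) = \left(1 - \frac{n}{N}\right) \frac{N}{n} \sum_{h=1}^{H} N_h S_h^2 \]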


ANOVA Table

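Source: Between strata
df: H − 1
\[ SSB = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (\bar{y}_{hU} - \bar{y}_U)^2 = \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2 \]

Source: Within strata
df: N − H
\[ SSW = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_{hU})^2 = \sum_{h=1}^{H} (N_h - 1) S_h^2 \]

Source: Total, corrected for the mean
df: N − 1
\[ SSTO = \sum_{h=1}^{H} \sum_{j=1}^{N_h} (y_{hj} - \bar{y}_U)^2 = (N - 1) S^2 \]

\[ SSTO = SSB + SSW \]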
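The decomposition SSTO = SSB + SSW can be verified on any finite population. The sketch below uses a tiny made-up population with three strata; all values are hypothetical and chosen only for illustration.

```python
# Hypothetical population: stratum label -> values y_hj for every unit
population = {"A": [2, 4, 6], "B": [10, 12, 14, 16], "C": [20, 25]}

N = sum(len(ys) for ys in population.values())
ybar_U = sum(sum(ys) for ys in population.values()) / N   # population mean

# Between strata: N_h times squared distance of each stratum mean from ybar_U
ssb = sum(len(ys) * (sum(ys) / len(ys) - ybar_U) ** 2 for ys in population.values())

# Within strata: squared distance of each unit from its own stratum mean
ssw = sum((y - sum(ys) / len(ys)) ** 2 for ys in population.values() for y in ys)

# Total: squared distance of each unit from the population mean
ssto = sum((y - ybar_U) ** 2 for ys in population.values() for y in ys)

print(f"SSB = {ssb:.3f}, SSW = {ssw:.3f}, SSTO = {ssto:.3f}")
```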


Comparison between SRS and proportional allocation

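Under proportional allocation, n_h = n N_h / N, so

\[ V(\hat{t}_{str}) = V\left(\sum_{h=1}^{H} N_h \bar{y}_h\right) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \]

\[ = \sum_{h=1}^{H} \left(1 - \frac{n}{N}\right) \frac{N}{n N_h} N_h^2 S_h^2 = \sum_{h=1}^{H} \left(1 - \frac{n}{N}\right) \frac{N}{n} N_h S_h^2 \]

\[ = \left(1 - \frac{n}{N}\right) \frac{N}{n} \left[ SSW + \sum_{h=1}^{H} S_h^2 \right] \]

For an SRS of the same size,

\[ V(\hat{t}_{srs}) = \left(1 - \frac{n}{N}\right) N^2 \frac{S^2}{n} = \left(1 - \frac{n}{N}\right) \frac{N^2}{n} \frac{1}{N - 1} (SSW + SSB) \approx \left(1 - \frac{n}{N}\right) \frac{N}{n} (SSW + SSB) \]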


Proportional stratification is more efficient if

\[ \sum_{h=1}^{H} S_h^2 < SSB, \qquad \text{where } SSB = \sum_{h=1}^{H} N_h (\bar{y}_{hU} - \bar{y}_U)^2. \]

This is usually true, since the large population sizes of the strata will force N_h (\bar{y}_{hU} - \bar{y}_U)^2 > S_h^2.

Comments

• In general, the variance of the estimator of t from a stratified sample with proportional

allocation will be smaller than the variance of the estimator of t from SRS with the same

number of observations

• The more unequal the stratum means yhU , the more homogeneous the within stratum

units, the more precision you will gain by using proportional allocation.
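A small numerical comparison makes the comments above concrete. The sketch below uses a made-up population of two strata whose means differ greatly (all values hypothetical) and evaluates the exact variance formulas for an SRS and for proportional allocation with n = 5, so that n_h = 2 and 3.

```python
from statistics import mean, variance

# Hypothetical population with very different stratum means
population = {"A": [2, 4, 6, 8], "B": [20, 22, 24, 26, 28, 30]}
n = 5

all_y = [y for ys in population.values() for y in ys]
N = len(all_y)
fpc = 1 - n / N                      # finite population correction

# SRS: N^2 (1 - n/N) S^2 / n, with S^2 the population variance (divisor N - 1)
v_srs = N**2 * fpc * variance(all_y) / n

# Proportional allocation: (1 - n/N) (N/n) sum_h N_h S_h^2
v_prop = fpc * (N / n) * sum(len(ys) * variance(ys) for ys in population.values())

# Condition from the text: sum of S_h^2 compared with SSB
sum_s2 = sum(variance(ys) for ys in population.values())
ssb = sum(len(ys) * (mean(ys) - mean(all_y)) ** 2 for ys in population.values())

print(f"V_srs = {v_srs:.2f}, V_prop = {v_prop:.2f}")
print(f"sum S_h^2 = {sum_s2:.2f}, SSB = {ssb:.2f}")
```

With stratum means this far apart, nearly all of the variability is between strata, and the stratified variance is a small fraction of the SRS variance.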


Optimal Allocation

Example: We want to take a sample of American corporations to estimate the amount of trade with Europe

• The variation among large corporations would be greater than the variation among small

ones

—-often, large units are more variable than small units

• Need to sample a higher percentage of the large corporations

• Proportional allocation won’t work well in this situation

-- Proportional allocation uses the same sampling fraction within each stratum

-- If the variances S_h^2 are similar, proportional allocation is a good choice

-- If the variances S_h^2 vary substantially, we may want to sample more units from the strata with larger variances


Cost function

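\[ c = c_0 + \sum_{h=1}^{H} c_h n_h \]

where c_0 is the overhead cost, such as maintaining an office, and c_h is the cost of sampling an observation in stratum h

• Want to minimize V(\hat{t}_{str}) for a fixed cost c, or minimize c for a fixed V(\hat{t}_{str})

\[ V(\hat{t}_{str}) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} = \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} - \sum_{h=1}^{H} N_h S_h^2 \]

-- The second sum does not depend on the n_h, so this is the same as minimizing

\[ \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} \]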


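Using a Lagrange multiplier λ,

\[ f = \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} + \lambda \left( c_0 + \sum_{h=1}^{H} c_h n_h - c \right) \]

\[ \frac{\partial f}{\partial n_h} = -\frac{N_h^2 S_h^2}{n_h^2} + \lambda c_h = 0 \]

\[ n_h = \frac{N_h S_h}{\sqrt{c_h \lambda}} \]

By the fact that \sum_h n_h = n, we have

\[ \frac{1}{\sqrt{\lambda}} = \frac{n}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \]

\[ n_{h,opt} = n \times \left( \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \right) \]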


\[ n_{h,opt} \propto \frac{N_h S_h}{\sqrt{c_h}} \]

We take a larger sample from stratum h if

• The stratum size N_h is large

• The variability within the stratum, measured by S_h, is large

• Sampling within the stratum is inexpensive (c_h is small)
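The allocation rule can be sketched as a small function. The function name `optimal_allocation` and the two-stratum inputs are hypothetical, chosen so the arithmetic is easy to follow; the function returns fractional sample sizes and leaves rounding to the user.

```python
from math import sqrt

def optimal_allocation(n, N, S, c):
    """Split n across strata in proportion to N_h * S_h / sqrt(c_h)."""
    scores = [N_h * S_h / sqrt(c_h) for N_h, S_h, c_h in zip(N, S, c)]
    total = sum(scores)
    return [n * score / total for score in scores]

# Hypothetical strata: stratum 2 is smaller but more variable and costlier
N = [1000, 500]      # stratum sizes
S = [10.0, 40.0]     # stratum standard deviations
c = [1.0, 4.0]       # cost per sampled unit

n_opt = optimal_allocation(100, N, S, c)

# With equal costs the rule reduces to Neyman allocation
n_neyman = optimal_allocation(100, N, S, [1.0, 1.0])

print(n_opt, n_neyman)
```

In this made-up case the higher cost in stratum 2 exactly offsets its larger N_h S_h, so the cost-aware allocation splits the sample evenly, while Neyman allocation puts two thirds of the sample in stratum 2.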


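\[ n_{h,opt} = n \times \left( \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}} \right) \]

Neyman allocation: the c_h's are all equal

\[ n_{h,Neyman} = n \times \left( \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l} \right) \]

Let

\[ a = \frac{n}{\sum_{l=1}^{H} N_l S_l} \]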


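so that n_{h,Neyman} = a × N_h S_h, where a = n / \sum_{l=1}^{H} N_l S_l. Then

\[ V(\hat{t}_{str,Neyman}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{S_h^2}{n_h} \]

\[ = \sum_{h=1}^{H} \left(1 - \frac{a N_h S_h}{N_h}\right) \frac{N_h^2 S_h^2}{a N_h S_h} \]

\[ = \sum_{h=1}^{H} (1 - a S_h) \frac{N_h S_h}{a} \]

\[ = \sum_{h=1}^{H} \left(1 - \frac{n}{\sum_{l=1}^{H} N_l S_l} S_h\right) N_h S_h \frac{\sum_{l=1}^{H} N_l S_l}{n} \]

\[ = \frac{\sum_{h=1}^{H} N_h S_h \sum_{l=1}^{H} N_l S_l}{n} - \sum_{h=1}^{H} N_h S_h^2 \]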


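\[ V(\hat{t}_{str,Prop}) = \sum_{h=1}^{H} \left(1 - \frac{n_h}{N_h}\right) \frac{N_h^2}{n_h} S_h^2 \]

\[ = \sum_{h=1}^{H} \left(1 - \frac{n}{N}\right) \frac{N}{n} N_h S_h^2 \]

\[ = \sum_{h=1}^{H} \frac{N}{n} N_h S_h^2 - \sum_{h=1}^{H} N_h S_h^2 \]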


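\[ \sum_{h=1}^{H} N_h S_h \sum_{l=1}^{H} N_l S_l = \sum_{h=1}^{H} N_h^2 S_h^2 + 2 \sum_{i=1}^{H} \sum_{j>i} N_i N_j S_i S_j \]

\[ \sum_{h=1}^{H} N N_h S_h^2 = \sum_{h=1}^{H} N_h^2 S_h^2 + \sum_{i=1}^{H} \sum_{j>i} N_i N_j (S_i^2 + S_j^2) \]

Since S_i^2 + S_j^2 ≥ 2 S_i S_j for every pair of strata, the first expression is no larger than the second, and therefore

\[ V(\hat{t}_{str,Neyman}) \le V(\hat{t}_{str,Prop}) \]

Relative precision of stratification and SRS

\[ V(\hat{t}_{str,Neyman}) \le V(\hat{t}_{str,Prop}) \le V(\hat{t}_{srs}) \]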
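The inequality between Neyman and proportional allocation can be checked numerically with the general variance formula. The two strata below are hypothetical, and the helper name `var_total` is ours; fractional sample sizes are used so the closed forms hold exactly.

```python
def var_total(N, S, n_h):
    """V(t_str) = sum_h N_h^2 (1 - n_h/N_h) S_h^2 / n_h for a given allocation."""
    return sum(
        Nh**2 * (1 - nh / Nh) * Sh**2 / nh
        for Nh, Sh, nh in zip(N, S, n_h)
    )

# Hypothetical strata with unequal variability
N = [1000, 500]
S = [10.0, 40.0]
n = 100

# Proportional allocation: n_h = n * N_h / N
n_prop = [n * Nh / sum(N) for Nh in N]

# Neyman allocation: n_h = n * N_h S_h / sum_l N_l S_l
total_NS = sum(Nh * Sh for Nh, Sh in zip(N, S))
n_neyman = [n * Nh * Sh / total_NS for Nh, Sh in zip(N, S)]

v_prop = var_total(N, S, n_prop)
v_neyman = var_total(N, S, n_neyman)

print(f"V_prop = {v_prop:,.0f}, V_neyman = {v_neyman:,.0f}")
```

The gap between the two variances grows as the S_h become more unequal; when all S_h are equal, the two allocations coincide.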


Example 3.9. Dollar stratification is often used in accounting. The recorded book amounts are used to stratify the population. When auditing the loan amounts for a financial institution:

-- stratum 1 might consist of all loans of more than $1 million; S_h^2 will be much larger in this stratum, so it needs a higher sampling fraction

-- stratum 2 might consist of loans between $500,000 and $999,999

-- · · ·

-- the smallest stratum might consist of loans of less than $10,000

• Optimal allocation is often an efficient strategy for such a stratification

-- If the goal of the audit is to estimate the dollar discrepancy between the audited amounts and the amounts in the institution's books, an error in the recorded amount of one of the $3,000,000 loans is likely to contribute more to the audited difference than an error in the recorded amount of one of the $3,000 loans. In a survey such as this, you may even want to use sample size N_1 in stratum 1, that is, audit every loan in that stratum.


Some design issues of stratified random sampling

• Allocating observations to strata

-- Proportional allocation: n_h / N_h = n / N

-- Optimal allocation. Neyman allocation is the special case in which the c_h's are all equal:

\[ n_{h,Neyman} = n \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l} \]

• Sample size

• Defining strata: variables and number of strata


Determining sample size

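\[ V(\hat{t}_{str}) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \le \sum_{h=1}^{H} N_h^2 \frac{S_h^2}{n_h} = \frac{1}{n} \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = \frac{v}{n} \]

where v = \sum_{h=1}^{H} (n / n_h) N_h^2 S_h^2.

• v depends on the stratum sizes N_h, the variances S_h^2, and the relative sample sizes n_h / n

• v can be thought of as the "average" variability per observation unit in a stratified random sample with the specified allocation

95% CI (α = 0.05): \hat{t}_{str} \pm z_{\alpha/2} \sqrt{v/n}

Setting the margin of error z_{\alpha/2} \sqrt{v/n} equal to e and solving for n gives

\[ n = \frac{z_{\alpha/2}^2\, v}{e^2} \]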
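The sample-size formula can be sketched as two small helpers. The function names, the two hypothetical strata, and the margin of error e = 5000 (on the scale of the total) are our assumptions for illustration; because the formula ignores the finite population correction, it errs on the side of a larger n.

```python
from math import ceil
from statistics import NormalDist

def v_for_allocation(N, S2, rel):
    """v = sum_h (n / n_h) N_h^2 S_h^2, where rel[h] = n_h / n."""
    return sum(N_h**2 * s2 / r for N_h, s2, r in zip(N, S2, rel))

def sample_size(v, e, z):
    """Smallest integer n with z * sqrt(v / n) <= e."""
    return ceil(z**2 * v / e**2)

# Hypothetical strata with proportional allocation, so n_h / n = N_h / N
N = [1000, 500]
S2 = [100.0, 1600.0]                 # stratum variances S_h^2
rel = [N_h / sum(N) for N_h in N]

v = v_for_allocation(N, S2, rel)
z = NormalDist().inv_cdf(0.975)      # z_{alpha/2} for a 95% interval
n_needed = sample_size(v, e=5000.0, z=z)

print(f"v = {v:.4g}, n = {n_needed}")
```

After computing n, the stratum sample sizes follow from the chosen relative allocation, n_h = n × rel[h].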


Defining Strata:

1. Variables for stratification

• Highly associated with variables of interest

-- For estimating total business expenditures on advertising, we might stratify by number of employees or size of the business and by the type of product or service

-- For farm income, we might use the size of the farm as a stratifying variable, since we expect that larger farms would have higher incomes

• Known for all sampling units in the population


2. Number of strata:

• Depends upon many factors, such as the difficulty in constructing a sampling frame with stratifying information, and the cost of stratifying

• Formulas in literature

• Pilot study

• General rule: the more information you have about the population, the more strata you should use. You should use an SRS when little prior information about the target population is available.


Recall: Relative precision of stratification and SRS

\[ V(\hat{t}_{str,opt}) \le V(\hat{t}_{str,prop}) \le V(\hat{t}_{srs}) \]

1. If stratified sampling provides higher precision than SRS, why conduct an SRS?

• Stratification adds complexity to the survey, which may not be worth a small gain in precision

• Need to know which units, and how many units, belong to each stratum

2. When is stratified sampling efficient?

• SSB is large (stratum means differ greatly)

• SSW is small (variability within strata is small)


Example: National Pesticide Survey (NPS)

The US Environmental Protection Agency (EPA) sampled drinking water wells between 1988 and 1990 to estimate the prevalence of pesticides and nitrate.

• Want a sample that is representative of drinking water wells in the United States

• Want to guarantee that wells in the sample have a wide range of levels of pesticide use and susceptibility to ground-water pollution

• Want to study two categories of wells:

(1) Community water systems (CWS)

-- systems of piped drinking water with at least 15 connections and/or 25 or more permanent residents, and with at least one working well

(2) Rural domestic wells

-- wells supplying occupied housing in rural areas, not on government property


1. Frame issue: how many drinking wells exist in the United States?

• For CWS, a list with addresses is in the Federal Reporting Data System (FRDS), maintained by EPA. There are approximately 51,000 CWSs.

• The 1980 census data are used to estimate the number of rural domestic wells. There are about 13 million rural domestic wells.


2. Stratification issue: EPA chose a stratified design. Which variables are used to construct the strata?

• EPA developed criteria for separating the population of CWS wells and rural domestic wells into four categories of pesticide use and three relative ground-water vulnerability measures. This design ensures that the range of variability that exists nationally with respect to the agricultural use of pesticides and ground-water vulnerability is reflected in the sample of wells.

• Pesticide use obtained from

-- marketing research

-- proportion of county in agricultural use

• Ground-water vulnerability measures (by DRASTIC)

• Four categories of pesticide use (high, moderate, low, uncommon) and three categories of groundwater vulnerability (high, moderate, low) give 12 strata


Table 4: Strata for National Pesticide Survey

Stratum   Pesticide use   Groundwater vulnerability    Number of counties
                          (estimated by DRASTIC)
   1      high            high                               106
   2      high            moderate                           234
   3      high            low                                129
   4      moderate        high                               110
   5      moderate        moderate                           204
   6      moderate        low                                267
   7      low             high                               193
   8      low             moderate                           375
   9      low             low                                404
  10      uncommon        high                               186
  11      uncommon        moderate                           513
  12      uncommon        low                                416


3. Design considerations

-- For CWS, assume 0.5% of wells contain pesticides; choose n so that the probability of detection is 90%.

-- For rural wells, there were some subgroups of particular interest; assume a 1% rate and 97% probability of detection.

-- n = 564 public wells and 734 private rural wells


4. Rural wells

-- Each county (N = 3137) is categorized according to the stratification variables.

-- Sample counties.

-- Characterize pesticide use and groundwater vulnerability for subcounty areas.

-- No subcounty area selection for CWS wells


Model-based inference for stratified sampling

• The one-way ANOVA model with fixed effects provides an underlying structure for stratified sampling.

\[ y_{hj} = \mu_h + \varepsilon_{hj} \qquad (1) \]

where the \varepsilon_{hj} are independent with mean 0 and variance \sigma_h^2.

• The least squares estimator of \mu_h is \bar{y}_h, the average in stratum h


Estimators and Properties:

• T_h = \sum_{j=1}^{N_h} y_{hj}: the total in stratum h

• T = \sum_{h=1}^{H} T_h: the overall total

• Note that both T_h and T are random variables

• The best linear unbiased estimator for T_h is \hat{T}_h = \frac{N_h}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj}.

• E_M[\hat{T}_h - T_h] = 0

• E_M[(\hat{T}_h - T_h)^2] = N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{\sigma_h^2}{n_h}


By the fact that observations in different strata are independent under the model

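\[ E_M[(\hat{T} - T)^2] = E_M\left[\left\{\sum_{h=1}^{H} (\hat{T}_h - T_h)\right\}^2\right] \]

\[ = E_M\left[\sum_{h=1}^{H} (\hat{T}_h - T_h)^2 + \sum_{h=1}^{H} \sum_{k \ne h} (\hat{T}_h - T_h)(\hat{T}_k - T_k)\right] \]

\[ = E_M\left[\sum_{h=1}^{H} (\hat{T}_h - T_h)^2\right] \]

\[ = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{\sigma_h^2}{n_h} \]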


Comments:

• The theoretical variance \sigma_h^2 can be estimated by s_h^2

• Adopting the model in (1) results in the same estimator of t and the same standard error as found under randomization theory.

• If a different model is used, however, then different estimators are obtained.
