Chapter 5: Cluster Sampling - Stat UBC



Chapter 5: Cluster Sampling

1 Introduction

AIDS is one of the most pressing issues in global health today. Today, if an HIV positive woman is pregnant, treatments can reduce the chance of infection in the child to almost zero (efficacy of more than 99% with intention to treat). The health system in South Africa is set up for accidental pregnancies among HIV+ women, but what about those who plan their pregnancies? Should a different system be set up for them? A researcher is interested in studying these populations. Among other things, she is interested in the vital statistics of these mothers (age, socioeconomic status, ethnicity, tribe, etc.).

Not having a sampling frame for this population, she obtains a list of all the hospitals in South Africa. Hospitals are naturally occurring groups, called clusters, in the population which contain the elements of interest. From this sampling frame, the researcher takes a simple random sample of 10 hospitals. Over the course of the summer, she visits each hospital and uses their records as sampling frames to sample individuals from both sub-populations (accidental and planned pregnancies).

1.1 What is Cluster Sampling

Cluster sampling is the first sampling method we see that comes in many flavors (one-stage, two-stage, three-stage or more). It is often referred to as multi-stage sampling. We will only consider one- and two-stage cluster sampling, but the ideas presented here are easily generalized to more complex scenarios.


In essence, cluster sampling begins with the sampling of units which are not the observation units. The first units to be sampled are the clusters. Clusters are also called the primary sampling units or psu’s. Once the clusters are selected, the observation units are selected from the selected clusters. We call the units that are selected from within the clusters secondary sampling units or ssu’s. In this chapter, the ssu’s will be the elements we wish to study, but in situations where there are further sampling stages this wouldn’t be the case.

One-Stage Cluster Sampling:

• Obtain a sampling frame of clusters.

• Collect a SRS of clusters.

• Sample all elements within the selected clusters.


Two-Stage Cluster Sampling:

• Obtain a sampling frame of clusters.

• Run a SRS of clusters.

• Run a SRS of elements within the selected clusters.
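The two procedures above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the frame, the cluster labels and the function names `one_stage_sample` and `two_stage_sample` are hypothetical and not part of the notes.

```python
import random

def one_stage_sample(clusters, n, rng):
    """SRS of n clusters, then keep every element of each selected cluster."""
    chosen = rng.sample(sorted(clusters), n)
    return {c: list(clusters[c]) for c in chosen}

def two_stage_sample(clusters, n, m, rng):
    """SRS of n clusters, then an SRS of (at most) m elements within each."""
    chosen = rng.sample(sorted(clusters), n)
    return {c: rng.sample(clusters[c], min(m, len(clusters[c]))) for c in chosen}

# Hypothetical sampling frame: 6 clusters (psu's), each listing its elements (ssu's).
frame = {f"cluster_{i}": [f"c{i}_e{j}" for j in range(5 + i)] for i in range(6)}
rng = random.Random(1)  # seeded so the draw can be reproduced

one_stage = one_stage_sample(frame, n=2, rng=rng)
two_stage = two_stage_sample(frame, n=2, m=3, rng=rng)

for name, sample in (("one-stage", one_stage), ("two-stage", two_stage)):
    print(name, {c: len(units) for c, units in sample.items()})
```

In both designs only a frame of clusters is needed up front; the element lists are consulted only for the clusters that were actually drawn.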

How do stratified sampling and cluster sampling differ? While both methods divide the population into groups first, there are fundamental differences in the design, analysis and motivations for cluster sampling and stratified sampling. At the design level, cluster sampling requires that we take a SRS of the groups (clusters) while stratified sampling requires that we sample from all the groups (strata). This leads to different sample properties and estimation strategies, which we expand on throughout the chapter.


Exercise: For each of the following:

• Determine if cluster sampling is being used.

– If not, explain why.

– If so, provide the different units (psu, ssu, etc.).

• In every case specify the characteristic and the statistic.

1. In order to determine the average number of hours Vancouver high school students actively spend on the internet, researchers sampled 7 high schools in the city using a SRS. They then sampled 50 students from each school by offering candy to the first students to answer the questionnaire in the cafeteria. It was found that in this sample, the average time spent on the internet was 10.7 hours per week.

2. Researchers interested in the spending habits of Canadian city dwellers sampled 5 cities. From these they sampled 30 city blocks each and went door to door interviewing the inhabitants. They asked people about their financial situation; of particular interest was the percentage of net income spent on consumer goods.

3. In studying the recreational reading habits of UBC students, researchers first separated the students according to faculty. In all, 30 students from each faculty were contacted through a SRS. It was found that there is a large discrepancy between readers and that there was a large portion of students who did not read beyond the requirements.

4. An employer wishes to determine the proportion of employees who could benefit from a new daycare on company premises. Since each employee has a number, he randomly selects the number 7 as a starting point and then selects every 20th employee on the list from there. He finds that 34% of employees would make use of such services.


1.2 Why and When to Use Cluster Sampling

Having discussed how to obtain a cluster sample, some of the motivations for its use may be obvious. We take a closer look at the motivations.

Difficult Sampling Frame: Key to probability sampling is the sampling frame. In many cases, obtaining the sampling frame of all elements may be difficult if not impossible. Nonetheless, it may be easy to obtain a sampling frame of clusters which contain the population of interest. Often, obtaining a sampling frame of elements in the selected clusters is simpler. If it isn’t, it may not be important or we may sample all units in the cluster.

Simpler/Cheaper: The population may be widely distributed geographically or may occur in natural clusters. When doing in-person interviews or other forms of on-the-spot measurements, choosing a few natural clusters to sample from can save much time and money.

Why would we not use cluster sampling? With the advantage of ease comes a trade-off. In most cases, cluster sampling leads to less precise estimation than do SRS and stratified sampling.

We favor cluster sampling when:

• The cluster means are very similar.

• The ssu’s are heterogeneous within clusters (variance within the clusters is large).

We favor stratified sampling when:

• The strata means differ much.

• The strata are homogeneous (small variance).


2 Notation

With multiple sampling stages, the notation required for this chapter is even more complex than what we’ve seen thus far. Before we can discuss estimation we need to establish notation, so we do it all at once. (The notation is exactly as specified in Lohr’s “Sampling: Design and Analysis” to avoid confusion.)

Population Quantities:

• $N$ = number of psu’s in the population

• $M_i$ = number of ssu’s in the $i$th psu

• $K = \sum_{i=1}^{N} M_i$ = total number of ssu’s in the population

• $t_i = \sum_{j=1}^{M_i} y_{ij}$ = total in the $i$th psu

• $t = \sum_{i=1}^{N} t_i = \sum_{i=1}^{N}\sum_{j=1}^{M_i} y_{ij}$ = population total

• $S_t^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(t_i - \frac{t}{N}\right)^2$ = population variance of the psu totals

• $\bar{y}_U = \sum_{i=1}^{N}\sum_{j=1}^{M_i} \frac{y_{ij}}{K}$ = population mean

• $\bar{y}_{iU} = \sum_{j=1}^{M_i} \frac{y_{ij}}{M_i} = \frac{t_i}{M_i}$ = population mean in the $i$th psu

• $S^2 = \sum_{i=1}^{N}\sum_{j=1}^{M_i} \frac{(y_{ij} - \bar{y}_U)^2}{K-1}$ = population variance (per ssu)

• $S_i^2 = \sum_{j=1}^{M_i} \frac{(y_{ij} - \bar{y}_{iU})^2}{M_i - 1}$ = population variance within the $i$th psu


We can quickly relate these to the example of the researcher trying to estimate the vital statistics of HIV+ women with planned pregnancies. Consider family income to be the variable of interest. Then, $N$ is the number of hospitals in South Africa. The $M_i$ are the numbers of such women who visit the $i$th hospital and $t_i$ is the total family income for the women at the $i$th hospital. $K$ is the number of such women in the entire population. $\bar{y}_U$ is the average family income for all such women in South Africa, and so on.
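To make the population quantities concrete, here is a small Python sketch that evaluates each of them on a toy population. The three psu's and their y-values are made up for illustration only; the variable names simply mirror the notation.

```python
# Hypothetical population: 3 psu's, each listing the y-value of every ssu.
population = [[2.0, 4.0, 6.0], [1.0, 3.0], [5.0, 5.0, 5.0, 5.0]]

N = len(population)                        # number of psu's
M = [len(psu) for psu in population]       # M_i, ssu's in each psu
K = sum(M)                                 # total number of ssu's
t = [sum(psu) for psu in population]       # psu totals t_i
t_total = sum(t)                           # population total t
ybar_U = t_total / K                       # population mean
ybar_iU = [t_i / M_i for t_i, M_i in zip(t, M)]   # psu means

# Variance of the psu totals around their average t / N.
S2_t = sum((t_i - t_total / N) ** 2 for t_i in t) / (N - 1)
# Per-ssu variance around the population mean.
S2 = sum((y - ybar_U) ** 2 for psu in population for y in psu) / (K - 1)
# Within-psu variances around each psu mean.
S2_i = [sum((y - mean_i) ** 2 for y in psu) / (len(psu) - 1)
        for psu, mean_i in zip(population, ybar_iU)]

print(N, M, K, t, t_total, ybar_U, ybar_iU, S2_t, S2, S2_i)
```

Note that the third psu is completely homogeneous, so its within-psu variance is zero even though the psu totals differ a great deal.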

Sampled Quantities

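• $n$ = number of psu’s in the sample

• $m_i$ = number of elements in the sample from the $i$th psu

• $\bar{y}_i = \sum_{j \in S_i} \frac{y_{ij}}{m_i}$ = sample mean (per ssu) for the $i$th psu

• $\hat{t}_i = \sum_{j \in S_i} \frac{M_i}{m_i} y_{ij}$ = estimated total for the $i$th psu

• $\hat{t}_{unb} = \sum_{i \in S} \frac{N}{n}\hat{t}_i$ = unbiased estimator of population total

• $s_t^2 = \frac{1}{n-1}\sum_{i \in S}\left(\hat{t}_i - \frac{\hat{t}_{unb}}{N}\right)^2$ = estimated variance of psu totals

• $s_i^2 = \sum_{j \in S_i} \frac{(y_{ij} - \bar{y}_i)^2}{m_i - 1}$ = sample variance within the $i$th psu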

As expected, the sampled quantities are the sample analogs of the population quantities. We’ll need these throughout our estimation process.


3 One-Stage Cluster Sampling

We begin by exploring the estimation of one-stage sample surveys and then move on to the more complex two-stage sample surveys. One-stage sample surveys can further be divided into two categories: those with clusters of equal size and those with clusters of unequal size.

3.1 Estimation for Clusters of Equal Size

One-stage cluster sampling with equal size clusters is considered for two reasons: the estimation equations are simplified and it allows us to cover some theory. The theory gets much more involved for other forms of cluster sampling. Specifically, we can use an ANOVA approach, as we did in stratified sampling, but for two-stage cluster sampling we need to use random effects models. Those details are not covered here. Interested students are encouraged to read sections 5.7 (using random effects) and 11.4 (Mixed models) to learn more about the theory.

Example: A fruit importer was informed that his latest shipment of passion fruit may have been contaminated with an insect. The pest control company wants to estimate the number of insects in the shipment. They randomly open 12 boxes (using a SRS, say), inspect all 24 passion fruits in each box and count the number of insects in each one. The insects are dormant due to the cold transportation temperatures. The following data were collected:

Box 1 2 3 4 5 6 7 8 9 10 11 12

Insects 0 14 0 32 1 16 0 5 21 39 0 9

Problem: In this shipment, there are 400 boxes. Given this data, give a 95% confidence interval for the total number of insects in the shipment.


Solution: We begin by listing what we know.

$N = 400$ boxes
$M = 24$ fruits per box and hence,
$K = 24 \times 400 = 9600$ passion fruit in total. And,
$n = 12$

Now we estimate the total,

$$\hat{t} = \frac{N}{n}\sum_{i \in S} t_i = \frac{400}{12}\,[0 + 14 + 0 + 32 + 1 + 16 + 0 + 5 + 21 + 39 + 0 + 9] = 4566.667$$

Next we obtain the estimated variance of the psu totals,

$$s_t^2 = \frac{1}{n-1}\sum_{i \in S}\left(t_i - \frac{\hat{t}}{N}\right)^2 = \frac{1}{12-1}\left[(0 - 4566.667/400)^2 + \dots + (9 - 4566.667/400)^2\right] = 180.083$$

When sampling all the units in clusters of equal size, we are, in effect, simply dealing with a SRS of cluster totals, so we can use the same equations as developed in chapter 2 with the notation of chapter 5. Therefore,

$$SE(\hat{t}) = N\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n}} = 400\sqrt{\left(1 - \frac{12}{400}\right)\frac{180.083}{12}} = 1526.13$$

In order to construct a confidence interval based on the normal distribution, we need the CLT. The conditions required for the CLT are not met here. Why? Instead we provide a crude interval by using 2 rather than 1.96.

$$95\%\ CI[t] = \hat{t} \pm 2\,SE(\hat{t}) = 4566.667 \pm 2(1526.13) = [1514.40,\ 7618.93]$$
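The arithmetic in this example is easy to check by machine. The following Python sketch recomputes the estimated total, the variance of the psu totals, the standard error and the crude interval directly from the formulas and the box counts given above; only the variable names are ours.

```python
import math

N = 400                                                 # boxes in the shipment
insects = [0, 14, 0, 32, 1, 16, 0, 5, 21, 39, 0, 9]     # psu totals t_i
n = len(insects)                                        # boxes sampled

t_hat = N / n * sum(insects)                            # estimated total
mean_total = t_hat / N                                  # average sampled psu total
s2_t = sum((t_i - mean_total) ** 2 for t_i in insects) / (n - 1)
se_t = N * math.sqrt((1 - n / N) * s2_t / n)            # SE with finite population correction

# Crude interval using a multiplier of 2 instead of 1.96, as in the text.
ci = (t_hat - 2 * se_t, t_hat + 2 * se_t)

print(round(t_hat, 3), round(s2_t, 3), round(se_t, 3))
print(tuple(round(x, 2) for x in ci))
```

The finite population correction here is $1 - 12/400 = 0.97$, which is worth checking by hand since it is an easy term to mistype.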


It follows that the SRS theory is also used to find the equations for means rather than totals when dealing with clusters of equal sizes.

• $\hat{\bar{y}} = \frac{\hat{t}_{unb}}{NM}$

• $V(\hat{\bar{y}}) = \left(1 - \frac{n}{N}\right)\frac{S_t^2}{nM^2}$

• $SE(\hat{\bar{y}}) = \frac{1}{M}\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n}}$

Note: As in stratified sampling, units are weighted in cluster sampling. For one-stage cluster sampling,

$$w_{ij} = \frac{1}{P(\text{ssu } j \text{ of psu } i \text{ is in the sample})} = \frac{N}{n}$$

Here, the weights for all units are the same, so each sampled unit represents the same number of units in the population. As in proportional allocation, we say that the sample is self-weighting.

Exercise: The fruit vendor would like to estimate the proportion of fruits in his shipment that contain insects. Provide an estimate along with a SE.

Box 1 2 3 4 5 6 7 8 9 10 11 12

Fruits with Insects 0 4 0 11 1 8 0 5 15 19 0 3


3.2 Estimation for Clusters of Unequal Size

Not much changes in the equations used for estimation here. The equation for estimating the total is the same. Here are the equations that are modified:

• $\hat{\bar{y}}_{unb} = \frac{\hat{t}_{unb}}{K}$

• $SE(\hat{\bar{y}}_{unb}) = \frac{N}{K}\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n}}$

• $SE(\hat{t}_{unb}) = N\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n}}$

Notice the use of $K$ in the equations for the mean. Often the value of $K$ is unknown. What are we to do in such cases?


The Little Cow that Could: Monsanto is a company which produces a bovine drug which helps cows triple their daily milk production. As a result, the cows often suffer, the nutritional value of the milk decreases and the bacteria count can go through the roof. The state of Iowa has outlawed the use of this drug, but suspects that many farmers still use the product. The average cow produces between 4 and 6 litres of milk a day. There are 281 dairy farms in the state of Iowa. Ten farms were randomly selected and the following data were obtained:

Farm              1    2    3    4    5    6    7    8    9    10
Number of Cows    33   55   19   15   43   31   41   77   32   23
Litres of Milk    301  500  140  255  511  302  260  950  280  299

Construct a confidence interval for the average daily milk production of cows in the state of Iowa. Is there any evidence that the state’s suspicion is correct?


3.3 Some Theory

3.3.1 Variance of Estimators

As always, we’re interested in comparing the precision of the sampling method at hand to the results obtained in SRS. In general, cluster sampling is simpler and less precise than SRS. We want to express this mathematically. We return to clusters of equal size to look at some of the theory behind cluster sampling. We approach the problem in similar fashion to what we saw in stratified sampling, by first producing an ANOVA table.

Source           d.f.         Sum of Squares                                                       Mean Squares
Between psu’s    $N - 1$      $SSB = \sum_{i=1}^{N}\sum_{j=1}^{M}(\bar{y}_{iU} - \bar{y}_U)^2$     $MSB$
Within psu’s     $N(M - 1)$   $SSW = \sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij} - \bar{y}_{iU})^2$        $MSW$
Total            $NM - 1$     $SSTo = \sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij} - \bar{y}_U)^2$          $S^2$

Suppose we took a sample of $n$ primary sampling units to collect a total of $nM$ secondary sampling units. Then taking a SRS of equivalent size would lead to the following variance for an estimated total:

$$Var(\hat{t}_{srs}) = N^2 M\left(1 - \frac{n}{N}\right)\frac{S^2}{n}.$$

On the other hand, we can re-express the variance of the cluster estimate using ANOVA terms as well. First we note that,

$$S_t^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(t_i - \frac{t}{N}\right)^2 = \frac{1}{N-1}\sum_{i=1}^{N} M^2(\bar{y}_{iU} - \bar{y}_U)^2 = M(MSB)$$

which implies that

$$Var(\hat{t}_{cluster}) = N^2 M\left(1 - \frac{n}{N}\right)\frac{MSB}{n}$$


3.3.2 Intraclass Correlation Coefficient and Design Effect.

We already know that we desire heterogeneous clusters for cluster sampling. The Intraclass Correlation Coefficient (ICC) measures the homogeneity within clusters, that is, how similar the secondary sampling units within clusters are. We can express it in ANOVA terms

$$ICC = 1 - \left(\frac{M}{M-1}\right)\left(\frac{SSW}{SSTo}\right).$$

In the extreme cases, complete homogeneity within the clusters leads to $ICC = 1$, while equal cluster means lead to a negative value, $ICC = -\frac{1}{M-1}$. What’s preferable for cluster sampling?

From the definition of the ICC above, and using $SSW = SSTo - SSB$, we can re-express the MSB.

$$\begin{aligned}
ICC = 1 - \left(\frac{M}{M-1}\right)\left(\frac{SSW}{SSTo}\right)
&\Rightarrow SSTo[M-1][1 - ICC] = M(SSTo) - M(SSB) \\
&\Rightarrow M(SSB) = M(SSTo) - (M-1)SSTo + SSTo[M-1][ICC] \\
&\Rightarrow M(N-1)MSB = SSTo[1 + (M-1)ICC] \\
&\Rightarrow MSB = \frac{NM-1}{M(N-1)}S^2[1 + (M-1)ICC]
\end{aligned}$$

$$\therefore\ \frac{Var(\hat{t}_{cluster})}{Var(\hat{t}_{srs})} = \frac{NM-1}{M(N-1)}[1 + (M-1)ICC]$$


The design effect is the ratio of the variances for two different designs having the same number of sampled units. Typically, the variance of the SRS is in the denominator. Here the design effect is

$$\frac{NM-1}{M(N-1)}[1 + (M-1)ICC]$$

and it is the factor by which we would have to multiply the sample size of an SRS to obtain an equivalently efficient sample using cluster sampling.

Wrapping our heads around this: Consider a population of 12 completely homogeneous clusters, each of size 200. Sample 1 is a single-stage cluster sample with n = 3 and Sample 2 is a SRS of size 600. How much bigger would we need the cluster sample to be in order for it to be as precise as the SRS?
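To get a feel for the design effect, here is a short Python sketch of the formula for equal-size clusters. The function name `design_effect` is ours, and reading "completely homogeneous" as $ICC = 1$ is an assumption made for illustration; the values $N = 12$ and $M = 200$ come from the question above.

```python
def design_effect(N, M, icc):
    """Var(cluster) / Var(SRS) for one-stage sampling of N equal clusters of size M."""
    return (N * M - 1) / (M * (N - 1)) * (1 + (M - 1) * icc)

N, M = 12, 200

deff_homogeneous = design_effect(N, M, 1.0)          # clusters internally identical
deff_no_corr = design_effect(N, M, 0.0)              # ICC = 0
deff_equal_means = design_effect(N, M, -1 / (M - 1))  # smallest possible ICC

print(deff_homogeneous, deff_no_corr, deff_equal_means)
```

With $ICC = 1$ every additional ssu in a cluster repeats information we already have, which is why the ratio grows roughly like $M$. At the other extreme, equal cluster means make every cluster total identical, so the cluster estimate of the total has no variance at all.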

Example: The ICC and Systematic Sampling

A population of 400 values which are more or less uniformly distributed over [0, 100] is divided into 4 clusters of equal size. Three populations result from three different clustering methods.

• Population 1 consists of random clusters.

• Population 2 consists of clusters which would be obtained through systematic sampling using the ordered list of elements.

• Finally, Population 3 consists of the clusters which would be obtained if the list were cyclical.

The following table summarizes some results.


Population 1 - Random

            Pop                $\bar{y}_{iU}$   $S_i^2$
Cluster 1   (53,23,...,4)      44.25            752.92
Cluster 2   (59,66,...,40)     47.12            759.93
Cluster 3   (20,7,...,59)      48.90            942.58
Cluster 4   (40,82,...,93)     50.72            863.36

Population 2 - Ordered

            Pop                $\bar{y}_{iU}$   $S_i^2$
Cluster 1   (0,1,...,99)       47.37            834.08
Cluster 2   (0,2,...,99)       47.62            834.99
Cluster 3   (0,2,...,99)       47.86            834.91
Cluster 4   (1,2,...,100)      48.14            837.46

Population 3 - Cyclical

            Pop                $\bar{y}_{iU}$   $S_i^2$
Cluster 1   (0,0,...,22)       11.90            42.19
Cluster 2   (22,22,...,45)     33.58            54.12
Cluster 3   (46,46,...,73)     59.20            52.10
Cluster 4   (73,73,...,100)    86.31            58.05

> apply(Pop1,2,mean)

[1] 44.25786 47.12303 48.90761 50.72092

> apply(Pop2,2,mean)

[1] 47.37295 47.62213 47.86799 48.14635

> apply(Pop3,2,mean)

[1] 11.90341 33.58594 59.20705 86.31302

> apply(Pop1,2,var)

[1] 752.9273 759.9304 942.5885 863.3697

> apply(Pop2,2,var)

[1] 834.0849 834.9975 834.9170 837.4679

> apply(Pop3,2,var)

[1] 42.19472 54.12697 52.10297 58.05298


> fact <- as.factor(rep(1:4,each=100))

> summary(aov(as.vector(Pop1)~fact))

Df Sum Sq Mean Sq F value Pr(>F)

fact 3 2275 758 0.9142 0.4341

Residuals 396 328563 830

> summary(aov(as.vector(Pop2)~fact))

Df Sum Sq Mean Sq F value Pr(>F)

fact 3 33 11 0.0131 0.998

Residuals 396 330805 835

> summary(aov(as.vector(Pop3)~fact))

Df Sum Sq Mean Sq F value Pr(>F)

fact 3 310397 103466 2004.4 < 2.2e-16 ***

Residuals 396 20441 52

Compare the homogeneity of clusters in each population.
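One way to make the comparison quantitative is to plug the sums of squares from the ANOVA output above into the ICC formula, with $M = 100$ elements per cluster. The Python sketch below does this; the function name `icc_from_anova` is ours, and the inputs are the rounded sums of squares printed by R, so the results are approximate.

```python
def icc_from_anova(ssb, ssw, M):
    """ICC = 1 - (M / (M - 1)) * (SSW / SSTo), with SSTo = SSB + SSW."""
    ssto = ssb + ssw
    return 1 - (M / (M - 1)) * (ssw / ssto)

M = 100  # 400 values split into 4 clusters of equal size

# (SSB, SSW) read from the "fact" and "Residuals" rows of each aov summary.
anova = {
    "Population 1 (random)": (2275, 328563),
    "Population 2 (ordered)": (33, 330805),
    "Population 3 (cyclical)": (310397, 20441),
}

icc = {name: icc_from_anova(ssb, ssw, M) for name, (ssb, ssw) in anova.items()}

for name, value in icc.items():
    print(f"{name}: ICC = {value:.4f}")
```

Populations 1 and 2 give an ICC close to zero or slightly negative (for Population 2 it is near the lower bound $-1/(M-1)$), while Population 3 gives an ICC close to 1, which matches the very small within-cluster variances in its table.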


4 Two-Stage Cluster Sampling

The first step in all forms of cluster sampling is to collect a SRS of primary sampling units. Up to now we’ve followed the first stage by sampling all units in the selected clusters, but in some circumstances this may be wasteful. If clusters are households, it may make sense to sample all of a household’s inhabitants because it is small. On the other hand, if the clusters are schools, sampling all of the 2000 students may not be realistic. Size of clusters isn’t the only thing to consider here. If the items within a cluster are very similar, then it will be wasteful to measure all of them.

Two-stage cluster sampling avoids this issue by introducing a second sampling stage. In the second stage a SRS of secondary sampling units is collected from each of the sampled psu’s. In the example pertaining to sampling HIV+ pregnant women, the researcher first sampled hospitals and then used the hospital records as a sampling frame to sample individuals from each hospital.

Question: In simple random sampling we concluded that this method was appropriate when very little was known about the structure of the population. Cluster sampling requires some additional structure, yet there are situations in which cluster sampling can be done and SRS can’t. Explain how this can be and provide an example.

Adding an extra stage of sampling certainly complicates the analysis and design of the survey. We begin with estimation.

It turns out that estimation does not change much: use $\hat{t}_i$ instead of $t_i$. However, replacing a fixed value, $t_i$, by an estimate (the result of a random process) does complicate the standard error calculations.


Two-Stage Clustering Estimates

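• $\hat{t}_i = \sum_{j \in S_i} \frac{M_i}{m_i} y_{ij} = M_i \bar{y}_i$

• $\hat{t}_{unb} = \sum_{i \in S} \frac{N}{n}\hat{t}_i = \frac{N}{n}\sum_{i \in S} M_i \bar{y}_i$

• $\hat{V}(\hat{t}_{unb}) = N^2\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n} + \frac{N}{n}\sum_{i \in S}\left(1 - \frac{m_i}{M_i}\right)M_i^2\frac{s_i^2}{m_i}$

• $\hat{\bar{y}}_{unb} = \frac{\hat{t}_{unb}}{K}$

• $SE(\hat{\bar{y}}_{unb}) = \frac{SE(\hat{t}_{unb})}{K}$.

This changes the weight of each observation to $\frac{N M_i}{n m_i}$. How could we make this self-weighting?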

Consumer Survey Bureau (some ideas borrowed from Tryfos)

Once each year, the CSB conducts an extensive survey of household expenditures and attitudes in a city. The city is divided into Enumeration Areas (EAs). For its survey, CSB selects at random a number of EAs, and then, also at random, a number of households from each selected EA.

Each selected household is visited, and the head of the household is asked to complete a questionnaire. The city is divided into 200 EAs. The latest census shows that there are 60,000 households in the city. Suppose 4 EAs are selected in the first stage and let $\bar{y}_i$ be the average household expenditure on clothing in the $i$th sampled EA. Using the table below, calculate a confidence interval for mean expenditures.

EA     $M_i$   $M_i/K$   $m_i$   $\bar{y}_i$   $S_i^2$
29     250     0.0042    50      95.0          110
67     310     0.0052    60      84.0          80
102    340     0.0057    70      75.5          124
143    280     0.0047    55      90.3          105
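As a check on the hand calculation, the two-stage formulas can be applied to the CSB table in a few lines of Python. This is a sketch under two assumptions of ours: the last column of the table is treated as the within-EA sample variance $s_i^2$, and a crude multiplier of 2 is used for the interval since only 4 psu's were sampled.

```python
import math

N, K = 200, 60000          # EAs in the city, households in the city
# One tuple per sampled EA: (M_i, m_i, ybar_i, s2_i), copied from the table.
sample = [
    (250, 50, 95.0, 110),
    (310, 60, 84.0, 80),
    (340, 70, 75.5, 124),
    (280, 55, 90.3, 105),
]
n = len(sample)

t_hat = [M_i * ybar_i for M_i, _, ybar_i, _ in sample]     # estimated EA totals
t_unb = N / n * sum(t_hat)                                 # unbiased estimate of the total
s2_t = sum((t - t_unb / N) ** 2 for t in t_hat) / (n - 1)  # variance of estimated totals

# Two variance components: between psu's and within psu's.
between = N ** 2 * (1 - n / N) * s2_t / n
within = N / n * sum((1 - m_i / M_i) * M_i ** 2 * s2_i / m_i
                     for M_i, m_i, _, s2_i in sample)
se_t = math.sqrt(between + within)

ybar_unb = t_unb / K
se_ybar = se_t / K
ci = (ybar_unb - 2 * se_ybar, ybar_unb + 2 * se_ybar)

print(round(ybar_unb, 3), round(se_ybar, 3), tuple(round(x, 2) for x in ci))
```

Printing `between` and `within` separately shows that almost all of the estimated variance comes from the first stage, which is typical when few psu's are sampled.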


Ratio Estimation in Cluster Sampling

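Ratio estimation readily lends itself to cluster sampling. As we’ve already seen, the auxiliary variable is the cluster size $M_i$.

• $\hat{\bar{y}}_r = \frac{\sum_{i \in S} t_i}{\sum_{i \in S} M_i}$ is the ratio estimate for the mean in one-stage sampling

• $\hat{\bar{y}}_r = \frac{\sum_{i \in S} \hat{t}_i}{\sum_{i \in S} M_i} = \frac{\sum_{i \in S} M_i \bar{y}_i}{\sum_{i \in S} M_i}$ is the ratio estimate for the mean in two-stage sampling

• $\hat{t}_r = K\hat{\bar{y}}_r$ is the ratio estimate for the total

• $SE(\hat{\bar{y}}_r) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{N^2}{nK^2}\frac{\sum_{i \in S} M_i^2(\bar{y}_i - \hat{\bar{y}}_r)^2}{n-1}}$ for one-stage clustering.

• $SE(\hat{t}_r) = N\sqrt{\left(1 - \frac{n}{N}\right)\frac{1}{n}\frac{\sum_{i \in S} M_i^2(\bar{y}_i - \hat{\bar{y}}_r)^2}{n-1}}$ also for single-stage clustering.

• $SE(\hat{\bar{y}}_r) = \frac{1}{\bar{M}}\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_r^2}{n} + \frac{1}{nN}\sum_{i \in S} M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{s_i^2}{m_i}}$ is the two-stage cluster sampling ratio estimate standard error. Here $\bar{M}$ is the population or sample average cluster size.

• $s_r^2 = \frac{\sum_{i \in S}(M_i\bar{y}_i - M_i\hat{\bar{y}}_r)^2}{n-1}$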

Remarks:

1. Ratio estimation should be used when $N$ and/or $K$ are unknown or when the cluster sizes $M_i$ vary a lot.

2. The unbiased estimator for the total can be quite inefficient if there are large discrepancies in cluster sizes. Ratio estimation can greatly improve this.

3. The above issue doesn’t apply to means.
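As an illustration of the one-stage formulas, the Python sketch below applies ratio estimation to the dairy farm data from section 3.2, where $K$ (the number of cows in the state) is not given. Two points are our own assumptions: the sample average herd size stands in for $K/N$ in the standard error, and the identity $M_i^2(\bar{y}_i - \hat{\bar{y}}_r)^2 = (t_i - M_i\hat{\bar{y}}_r)^2$ is used to work with farm totals directly.

```python
import math

N = 281                                                      # dairy farms in the state
cows = [33, 55, 19, 15, 43, 31, 41, 77, 32, 23]              # M_i
milk = [301, 500, 140, 255, 511, 302, 260, 950, 280, 299]    # t_i, litres per day
n = len(cows)

# Ratio estimate of the mean: total milk over total cows in the sample.
ybar_r = sum(milk) / sum(cows)

# Sample average cluster size, used in place of the unknown K / N (assumption).
M_bar = sum(cows) / n

# Sum of squared residuals (t_i - M_i * ybar_r)^2.
ss_resid = sum((t_i - M_i * ybar_r) ** 2 for t_i, M_i in zip(milk, cows))

se_ybar_r = math.sqrt((1 - n / N) * ss_resid / ((n - 1) * n * M_bar ** 2))
ci = (ybar_r - 2 * se_ybar_r, ybar_r + 2 * se_ybar_r)

print(round(ybar_r, 3), round(se_ybar_r, 3), tuple(round(x, 2) for x in ci))
```

Because the estimator only needs the sampled herd sizes, it can be computed even though neither $K$ nor the herd sizes of the unsampled farms are known.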


The Non-Existent Students

With increases in school shootings in recent years, a school board is considering the addition of metal detectors in their schools. Among other things, they want to know how many students would be opposed to this. There are five schools within the board. Two schools were selected at random and the total was estimated.

School                     1    2     3    4    5
Number of students (Mi)    450  2216  600  721  1900
Number sampled (mi)        350  300
Number opposed             321  288

Estimate the total using both ratio and non-ratio estimates.

4.1 Design

The design issues in cluster sampling are more complicated than what we’ve seen thus far. There are two main issues we must face:

1. Defining the cluster to be used.

2. Determining sample size and sample allocation to obtain a desired precision.


4.1.1 Cluster Size

In many cases the clusters are naturally occurring so there is no choice in the size of the clusters. Occasionally, we create clusters using an auxiliary variable. In such cases we may be able to determine the sizes of the psu’s. Examples of this would be dividing area or time, for example creating clusters over an area of forest or creating time intervals from which to sample clients. There is no formula to help us decide which psu size to opt for, but the following principles can be helpful.

• The larger the cluster, the smaller the ICC tends to be (that’s good!)

• Clusters which are made too large can remove the cost/simplicity benefits which motivated cluster sampling to begin with.

Hence, forming clusters in the absence of natural clusters is the art of making clusters as large as possible while keeping the cost and simplicity advantage.

4.1.2 Sampling Size

Here we work in the opposite direction from what we did in stratified sampling. Thus we start by choosing the sample size within each cluster first. Of course the degree of homogeneity will play a big role in the subsampling size. The closer to one the ICC is, the smaller the sample size per cluster will be. Thus, key to our discussion here is the relative size of $MSB$ and $MSW$. That means we need a previous study or a pilot study in order to be able to determine the required sample and subsample sizes.

Here we begin with the simpler scenario of clusters of equal size, i.e. $M_i = M$ and $m_i = m$ for all $i$. Furthermore, we’ll consider the simplest and most common cost function, which we’ll use to determine allocation:

$$\text{total cost} = C = c_1 n + c_2 n m$$


which through optimization leads to:

$$n = \frac{C}{c_1 + c_2 m}$$

and

$$m = \sqrt{\frac{c_1 M(MSW)}{c_2(MSB - MSW)}}$$

This will minimize the variance for the given cost constraint. Note that we can replace $m$ and $M$ by $\bar{m}$ and $\bar{M}$ in the case of unequal samples.
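The allocation formulas are straightforward to evaluate once pilot values are available. The Python sketch below uses made-up costs and mean squares, chosen only so the arithmetic is easy to follow; the function names are ours, and the guard for $MSB \le MSW$ is our addition since the square root is not defined in that case.

```python
import math

def optimal_subsample_size(c1, c2, M, msb, msw):
    """m = sqrt(c1 * M * MSW / (c2 * (MSB - MSW))), for equal-size clusters."""
    if msb <= msw:
        raise ValueError("formula requires MSB > MSW")
    return math.sqrt(c1 * M * msw / (c2 * (msb - msw)))

def affordable_psus(C, c1, c2, m):
    """n = C / (c1 + c2 * m): psu's that fit the budget at subsample size m."""
    return C / (c1 + c2 * m)

# Hypothetical inputs: cost per psu, cost per ssu, budget, and pilot-study values.
c1, c2, C = 36.0, 1.0, 3300.0
M, msb, msw = 100, 500.0, 100.0

m_opt = optimal_subsample_size(c1, c2, M, msb, msw)
n_opt = affordable_psus(C, c1, c2, m_opt)

print(m_opt, n_opt)
```

In practice both values would be rounded to whole numbers, and $m$ cannot exceed $M$; the sketch leaves that step to the reader.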

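If it’s precision we wish to set the sample size for, then we use:

$$n = \left(\frac{z_{\alpha/2}}{ME}\right)^2\left[\frac{MSB}{M} + \left(1 - \frac{m}{M}\right)\frac{MSW}{m}\right]$$

where $ME$ is the desired margin of error. This assumes that $MSW$, $MSB$ and $m$ are all known, so it is of limited use.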
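For completeness, here is a small Python sketch of the precision-based calculation. It reads the formula as $n = (z_{\alpha/2}/ME)^2\,[MSB/M + (1 - m/M)\,MSW/m]$, which is our interpretation of the expression above, and all numerical inputs are hypothetical pilot values.

```python
import math

def psus_for_precision(z, me, M, m, msb, msw):
    """Number of psu's needed for margin of error `me` (equal-size clusters)."""
    per_psu_variance = msb / M + (1 - m / M) * msw / m
    return (z / me) ** 2 * per_psu_variance

# Hypothetical inputs: 95% confidence, margin of error 0.5 on the mean scale.
z, me = 1.96, 0.5
M, m = 100, 30
msb, msw = 500.0, 100.0

n_exact = psus_for_precision(z, me, M, m, msb, msw)
n_needed = math.ceil(n_exact)   # round up to a whole number of psu's

print(round(n_exact, 2), n_needed)
```

Halving the margin of error quadruples the required number of psu's, which is the usual price of precision.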
