7.1 Sampling Distribution of X̄
Definition 1 The population distribution is the probability distribution of the population data.
Example 1 Suppose there are only five students in an advanced statistics class and the midterm
scores of these five students are
70 78 80 80 95
Solution: Let X denote the score of a student. Using single-valued classes, the frequency
distribution of scores is as follows, where f is the frequency and f(x) is the relative frequency:

X     f    f(x)
70    1    0.2
78    1    0.2
80    2    0.4
95    1    0.2
      N = 5    ∑ f(x) = 1.0
The values of the mean and standard deviation calculated for the probability distribution give
the values of the population parameters µ and σ. These values are µ = 80.60 and σ = 8.09.
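The two parameters quoted above can be checked with a minimal Python sketch. It assumes only the five scores of Example 1; the variable names are illustrative.

```python
from statistics import mean, pstdev

scores = [70, 78, 80, 80, 95]   # population from Example 1

mu = mean(scores)        # population mean
sigma = pstdev(scores)   # population standard deviation (divides by N, not N - 1)

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
```

Note that `pstdev` is used rather than `stdev` because the five scores are treated as the whole population.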
Definition 2 The probability distribution of X̄ is called the sampling distribution of X̄. It lists
the various values that X̄ can assume and the probability of each value of X̄.
Example 2 Reconsider the population of midterm scores of five students given in Example 1. Consider
all possible samples of three scores each that can be selected, without replacement, from that
population. The total number of possible samples is given by the combinations formula:
$$\text{Total number of samples} = \binom{5}{3} = \frac{5!}{3!\,2!} = 10$$
Solution: Suppose we assign letters A, B, C, D, and E to the scores of five students so that A
= 70, B = 78, C = 80, D = 80, E = 95. Then the 10 possible samples of three scores each are ABC,
ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE. These 10 samples and their respective
means are listed in the following table.
Sample   Scores in the Sample   X̄
ABC      70, 78, 80             76.00
ABD      70, 78, 80             76.00
ABE      70, 78, 95             81.00
ACD      70, 80, 80             76.67
ACE      70, 80, 95             81.67
ADE      70, 80, 95             81.67
BCD      78, 80, 80             79.33
BCE      78, 80, 95             84.33
BDE      78, 80, 95             84.33
CDE      80, 80, 95             85.00
By using the values of X̄ given in the above table, we record the frequency distribution of X̄ as
follows:

X̄        f    f(X̄)
76.00    2    0.2
76.67    1    0.1
79.33    1    0.1
81.00    1    0.1
81.67    2    0.2
84.33    2    0.2
85.00    1    0.1
              ∑ f(X̄) = 1.0
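The enumeration in Example 2 can be reproduced with a short Python sketch. It assumes the five scores of Example 1 and keeps the sample means as exact fractions so that equal means are grouped reliably; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

scores = [70, 78, 80, 80, 95]   # population from Example 1

# All C(5, 3) = 10 samples of size 3, drawn without replacement.
# combinations() works on positions, so the two scores of 80 count as distinct students.
samples = list(combinations(scores, 3))

# Exact sample means, so that equal means fall into the same group.
means = [Fraction(sum(s), 3) for s in samples]
counts = Counter(means)

# Sampling distribution: each distinct mean with its relative frequency.
sampling_dist = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

for m, prob in sampling_dist.items():
    print(f"{float(m):6.2f}  {float(prob):.1f}")
```

The printed rows should match the table of X̄ and f(X̄) above, one row per distinct sample mean.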
• Sampling Error – Sampling error is the difference between the value of a sample statistic
and the value of the corresponding population parameter. In the case of the mean,
Sampling error = X̄ − µ
assuming that the sample is random and no nonsampling error has been made.
It is important to remember that a sampling error occurs because of chance. The errors that
occur for other reasons, such as errors made during collection, recording, and tabulation of data, are
called nonsampling errors. Such errors occur because of human mistakes and not chance. Note
that there is only one kind of sampling error–the error that occurs due to chance. However, there
is not just one nonsampling error but many nonsampling errors that may occur due to different
reasons.
Definition 3 The errors that occur in the collection, recording, and tabulation of data are called
nonsampling errors.
The following paragraph, reproduced from the Current Population Reports of the U.S. Bureau
of the Census, explains how nonsampling errors can occur.
Nonsampling errors can be attributed to many sources, e.g., inability to obtain information about
all cases in the sample, differences in the interpretation of questions, inability or unwillingness on
the part of the respondents to provide correct information, inability to recall information, errors
made in collection such as in recording or coding the data, errors made in processing the data, errors
made in estimating values for missing data, biases resulting from the differing recall periods caused
by the interviewing pattern used, and failure of all units in the universe to have some probability
of being selected for the sample (undercoverage).
The following are the main reasons for the occurrence of nonsampling errors.
1. If a sample is nonrandom (and, hence, nonrepresentative), the sample results may be too different
from the census results. The following quote from U.S. News & World Report describes
how even a randomly selected sample can become nonrandom if some of the members included
in the sample cannot be contacted.
A test poll conducted in the 1984 presidential election found that if the poll were halted after
interviewing only those subjects who could be reached on the first try, Reagan showed a 3-
percentage-point lead over Mondale. But when interviewers made a determined effort to reach
everyone on their lists of randomly selected subjects—calling some as many as 30 times before
finally reaching them—Reagan showed a 13 percent lead, much closer to the actual election
result. As it turned out, people who were planning to vote Republican were simply less likely
to be at home. (“The Numbers Racket: How Polls and Statistics Lie,” U.S. News & World
Report, July 11, 1988. Copyright 1988 by U.S. News & World Report, Inc. Reprinted with
permission.)
2. The questions may be phrased in such a way that they are not fully understood by the
members of the sample or population. As a result, the answers obtained are not accurate.
3. The respondents may intentionally give false information in response to some sensitive questions.
For example, people may not tell the truth about drinking habits, incomes, or opinions
about minorities. Sometimes the respondents may give wrong answers because of ignorance.
For example, a person may not remember the exact amount he spent on clothes during the
last year. If asked in a survey, he may give an inaccurate answer.
4. The poll taker may make a mistake and enter a wrong number in the records or make an error
while entering the data on a computer.
Note that nonsampling errors can occur both in a sample survey and in a census, whereas sampling
error occurs only when a sample survey is conducted. Nonsampling errors can be minimized
by preparing the survey questionnaire carefully and handling the data cautiously. However, it is
impossible to avoid sampling error.
Example 3 Reconsider the population of five scores given in Example 1. The scores of the five
students are 70, 78, 80, 80, and 95. The population mean is
$$\mu = \frac{70 + 78 + 80 + 80 + 95}{5} = 80.60$$
Now suppose we take a random sample of three scores from this population. Assume that this sample
includes the scores 70, 80, and 95. The mean for this sample is
$$\bar{X} = \frac{70 + 80 + 95}{3} = 81.67$$
Consequently,
Sampling error = X̄ − µ = 81.67 − 80.60 = 1.07
That is, the mean score estimated from the sample is 1.07 higher than the mean score of the population.
Note that this difference occurred due to chance, that is, because we used a sample instead
of the population.
Now suppose that, when we select the above-mentioned sample, we mistakenly record the second
score as 82 instead of 80. As a result, we calculate the sample mean as
$$\bar{X} = \frac{70 + 82 + 95}{3} = 82.33$$
Consequently, the difference between this sample mean and the population mean is
X̄ − µ = 82.33 − 80.60 = 1.73
However, this difference between the sample mean and the population mean does not represent the
sampling error. As we calculated earlier, only 1.07 of this difference is due to the sampling error.
The remaining portion, which is equal to 1.73 − 1.07 = 0.66, represents the nonsampling error
because it occurred due to the error we made in recording the second score in the sample. Thus,
Sampling error = 1.07 and Nonsampling error = 0.66.
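The split of the total difference into sampling and nonsampling error can be sketched in Python as follows. The variable names are illustrative. Because the sketch works with unrounded means, the nonsampling error comes out as 2/3 ≈ 0.67, whereas the 0.66 above results from subtracting the rounded values 1.73 and 1.07.

```python
from statistics import mean

population = [70, 78, 80, 80, 95]
mu = mean(population)                 # population mean, 80.6

true_sample = [70, 80, 95]            # scores actually drawn
recorded_sample = [70, 82, 95]        # second score mis-recorded as 82

sampling_error = mean(true_sample) - mu          # due to chance alone
total_error = mean(recorded_sample) - mu         # chance plus recording mistake
nonsampling_error = total_error - sampling_error

print(f"sampling error    = {sampling_error:.2f}")
print(f"nonsampling error = {nonsampling_error:.2f}")
```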
7.2 Mean and Standard Deviation of X̄
The mean and standard deviation calculated for the sampling distribution of X̄ are called the
mean µX̄ and standard deviation σX̄ of X̄. Actually, the mean and standard deviation of X̄ are,
respectively, the mean and standard deviation of the means of all samples of the same size selected
from a population. The standard deviation σX̄ is also called the standard error of X̄.
• Mean of the Sampling Distribution of X̄
The mean of the sampling distribution of X̄ is equal to the mean of the population. Thus,
µX̄ = µ
• Standard Deviation of the Sampling Distribution of X̄
The standard deviation of the sampling distribution of X̄ is
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
where σ is the standard deviation of the population and n is the sample size. This formula is used
when n/N ≤ 0.05, where N is the population size. If this condition is not satisfied, we use the
following formula to calculate σX̄:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$
where the factor $\sqrt{\frac{N - n}{N - 1}}$ is called the finite population correction factor.
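As a check on the corrected formula, the following Python sketch compares it with the standard deviation of all 10 sample means from Example 2, where n/N = 3/5 is far above 0.05. It assumes sampling without replacement, as in that example; the variable names are illustrative.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

population = [70, 78, 80, 80, 95]
N, n = len(population), 3

sigma = pstdev(population)

# Standard error from the formula with the finite population correction factor.
se_formula = sigma / sqrt(n) * sqrt((N - n) / (N - 1))

# Standard error computed directly from all 10 possible sample means.
sample_means = [mean(s) for s in combinations(population, n)]
se_enumerated = pstdev(sample_means)

print(f"formula: {se_formula:.4f}  enumeration: {se_enumerated:.4f}")
print(f"without correction: {sigma / sqrt(n):.4f}")
```

The two values should agree, while the uncorrected σ/√n overstates the spread of X̄ for such a small population. The mean of the 10 sample means also equals µ, in line with µX̄ = µ.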
7.3 Shape of the Sampling Distribution of X̄
The shape of the sampling distribution of X̄ relates to the following two cases.
1. The population from which samples are drawn has a normal distribution.
2. The population from which samples are drawn does not have a normal distribution.
• Sampling from a Normally Distributed Population
When the population from which samples are drawn is normally distributed with mean equal
to µ and standard deviation equal to σ, then
1. The mean of X̄, µX̄, is equal to the mean of the population, µ.
2. The standard deviation of X̄, σX̄, is equal to σ/√n, assuming n/N ≤ 0.05.
3. The shape of the sampling distribution of X̄ is normal, whatever the value of n.
• Sampling from a population that is not Normally Distributed but n ≥ 30
Most of the time the population from which the samples are selected is not normally distributed.
In such cases, the shape of the sampling distribution of X̄ is inferred from a very important theorem
called the central limit theorem.
• Central Limit Theorem
For a large sample size, the sampling distribution of the sample mean X̄ is approximately normal,
irrespective of the shape of the population distribution. The mean and standard deviation of the
sampling distribution of X̄ are
$$\mu_{\bar{X}} = \mu \quad \text{and} \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}.$$
The sample size is usually considered to be large if n ≥ 30.
Note: If the population distribution is fairly symmetrical, the sampling distribution of the
sample mean is approximately normal if samples of at least 15 observations are selected.
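The central limit theorem can be illustrated by simulation. The sketch below is only an illustration under assumed settings: an exponential population with rate 1 (right-skewed, with µ = 1 and σ = 1), n = 30, and 20,000 repeated samples. None of these choices comes from the text.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)    # fixed seed so the run is repeatable

n = 30            # sample size
reps = 20_000     # number of simulated samples (arbitrary choice)

# Exponential population with rate 1: strongly right-skewed, mu = 1, sigma = 1.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

mean_of_means = mean(sample_means)    # should be close to mu = 1
sd_of_means = pstdev(sample_means)    # should be close to sigma / sqrt(n)

# Share of sample means within 1.96 standard errors of mu;
# roughly 0.95 if the normal approximation is reasonable.
se = 1 / sqrt(n)
coverage = sum(abs(m - 1) <= 1.96 * se for m in sample_means) / reps

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {se:.3f})")
print(f"share within 1.96 se: {coverage:.3f}")
```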
Sampling Distribution of X̄

                 Normal Population       Non-normal Population
Mean             µX̄ = µ                  µX̄ = µ
Standard error   σX̄ = σ/√n               σX̄ = σ/√n
Shape            Normal                  Approximately normal if n ≥ 30
Notation         X̄ ∼ N(µ, (σ/√n)²)       X̄ ∼ N(µ, (σ/√n)²)
7.4 Applications of the Sampling Distribution of X̄
Example 4 A company that manufactures drink-dispensing machines sets the fill level at 198cc.
The standard deviation is 4cc. Assume that the fill levels have a normal distribution.
(a) If a drink is randomly selected, what is the probability that it contains less than 195cc?
(b) What is the probability that a random sample of 50 drinks has a mean value greater than
199cc?
Solution: (a) Let X be the fill level and µ be the mean fill level. Given X ∼ N(198, 4²),
$$P(X < 195) = P\left(\frac{X - \mu}{\sigma} < \frac{195 - 198}{4}\right) = P(Z < -0.75) = 0.2266$$
(b) Let X̄ be the sample mean. Since the population is normally distributed, the
sampling distribution of X̄ is normal. The mean and standard deviation of X̄ are
$$\mu_{\bar{X}} = \mu = 198 \quad \text{and} \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{4}{\sqrt{50}}.$$
That is, $\bar{X} \sim N\left(198, \left(\frac{4}{\sqrt{50}}\right)^2\right)$. Hence
$$P(\bar{X} > 199) = P\left(\frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} > \frac{199 - 198}{4/\sqrt{50}}\right) = P(Z > 1.77) = 0.0384$$
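Both probabilities in Example 4 can be checked without a normal table. The following sketch uses `NormalDist` from Python's standard library; the variable names are illustrative. Because Z is not rounded to two decimals here, the result for (b) may differ from the table-based 0.0384 in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 198, 4, 50

# (a) A single drink: X ~ N(198, 4^2).
p_a = NormalDist(mu, sigma).cdf(195)

# (b) Mean of 50 drinks: X-bar ~ N(198, (4/sqrt(50))^2).
x_bar = NormalDist(mu, sigma / sqrt(n))
p_b = 1 - x_bar.cdf(199)

print(f"(a) P(X < 195)     = {p_a:.4f}")
print(f"(b) P(X-bar > 199) = {p_b:.4f}")
```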
Example 5 Suppose that the mean and standard deviation of the weights of all packages of a certain
brand of cookies are 32 grams and 0.8 grams, respectively.
(a) Find the probability that the mean weight of a random sample of 40 packages of this brand of
cookies will be between 31.8 and 31.9 grams.
(b) 97.5% of the sample means will be less than what value?
Solution: (a) Since the sample size is large (n ≥ 30), by the central limit theorem the sampling
distribution of X̄ is approximately normal. The mean and standard deviation of X̄ are
$$\mu_{\bar{X}} = \mu = 32 \quad \text{and} \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.8}{\sqrt{40}}.$$
Thus, $\bar{X} \sim N\left(32, \frac{0.8^2}{40}\right)$. We are to compute the probability that the value of X̄ calculated for one
randomly drawn sample of 40 packages is between 31.8 and 31.9 grams, that is,
$$P(31.8 < \bar{X} < 31.9)$$
This probability is given by the area under the normal curve for X̄ between the points X̄ = 31.8
and X̄ = 31.9. The first step in finding this area is to convert the two X̄ values to their respective Z
values.
The probability that X̄ is between 31.8 and 31.9 is given by the area under the standard normal
curve between Z = −1.58 and Z = −0.79. Thus, the required probability is
$$P(31.8 < \bar{X} < 31.9) = P\left(\frac{31.8 - 32}{0.8/\sqrt{40}} < \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} < \frac{31.9 - 32}{0.8/\sqrt{40}}\right) = P(-1.58 < Z < -0.79) = 0.1577.$$
Therefore, the probability is 0.1577 that the mean weight of a sample of 40 packages will be between
31.8 and 31.9 grams.
(b) Let A be the required value. Given P(X̄ < A) = 0.975, from the normal table, we have
P(Z < 1.96) = 0.975. Hence
$$A = \mu_{\bar{X}} + Z\sigma_{\bar{X}} = 32 + 1.96 \times \frac{0.8}{\sqrt{40}} = 32.248 \text{ grams}$$
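Example 5 can be verified in the same way. This sketch again relies on `NormalDist` from the standard library and treats the sampling distribution of X̄ as exactly normal, which is itself the approximation supplied by the central limit theorem. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 32, 0.8, 40
x_bar = NormalDist(mu, sigma / sqrt(n))   # approximate, by the central limit theorem

# (a) P(31.8 < X-bar < 31.9)
p_a = x_bar.cdf(31.9) - x_bar.cdf(31.8)

# (b) Value A such that P(X-bar < A) = 0.975
a = x_bar.inv_cdf(0.975)

print(f"(a) {p_a:.4f}")
print(f"(b) {a:.3f} grams")
```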
7.5 Population and Sample Proportions
Definition 4 The population proportion, denoted by p, is obtained by taking the ratio of the
number of elements in a population with a specific characteristic to the total number of elements in
the population. The sample proportion, denoted by p̄, gives a similar ratio for a sample.
The population and sample proportions, denoted by p and p̄, respectively, are calculated as
$$p = \frac{x}{N} \quad \text{and} \quad \bar{p} = \frac{x}{n}$$
where
N = Total number of elements in the population
n = Total number of elements in the sample
x = Number of elements in the population or sample that possess a specific characteristic.
Example 6 Suppose a total of 789,654 families live in a city and 563,282 of them own homes.
Then, N = Population size = 789,654, x = Families in the population who own homes = 563,282.
Solution: The proportion of all families in this city who own homes is
$$p = \frac{x}{N} = \frac{563{,}282}{789{,}654} = 0.71.$$
Now, suppose a sample of 240 families is taken from this city and 158 of them are homeowners.
Then, n = Sample size = 240, x = Families in the sample who own homes = 158. The sample
proportion is
$$\bar{p} = \frac{x}{n} = \frac{158}{240} = 0.66.$$
As in the case of the mean, the difference between the sample proportion and the corresponding
population proportion gives the sampling error, assuming that the sample is random and no
nonsampling error has been made. That is, in case of the proportion,
Sampling error = p̄ − p
For instance, sampling error = p̄ − p = 0.66 − 0.71 = −0.05.
7.6 Mean, Standard Deviation, and Shape of the Sampling Distribution of p̄
Definition 5 The probability distribution of the sample proportion p̄ is called the sampling distribution
of p̄. It gives the various values that p̄ can assume and their probabilities.
Example 7 Boe Consultant Associates has five employees. The following table gives the names of
these five employees and information concerning their knowledge of statistics.
Name Knows Statistics
Ally yes
John no
Susan no
Peter yes
Tom yes
Solution: If we define the population proportion p as the proportion of employees who know
statistics, then,
p = 3/5 = 0.60
Now, suppose we draw all possible samples of three employees each and compute the proportion of
employees, for each sample, who know statistics. The total number of samples of size three that
can be drawn from the population of five employees is
$$\text{Total number of samples} = \binom{5}{3} = 10$$
The following table lists these 10 possible samples and the proportion of employees who know statistics for each
of those samples.
Sample                 Proportion who know statistics (p̄)
Ally, John, Susan      1/3 = 0.33
Ally, John, Peter      2/3 = 0.67
Ally, John, Tom        2/3 = 0.67
Ally, Susan, Peter     2/3 = 0.67
Ally, Susan, Tom       2/3 = 0.67
Ally, Peter, Tom       3/3 = 1.00
John, Susan, Peter     1/3 = 0.33
John, Susan, Tom       1/3 = 0.33
John, Peter, Tom       2/3 = 0.67
Susan, Peter, Tom      2/3 = 0.67
The sampling distribution of p̄ is recorded as follows:

p̄       f(p̄)
0.33    0.30
0.67    0.60
1.00    0.10
        ∑ f(p̄) = 1.00
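The sampling distribution of p̄ in Example 7 can be rebuilt by enumerating the 10 samples. The sketch below assumes the yes/no table of the five employees given above; the dictionary and variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# True means the employee knows statistics (from the table in Example 7).
knows_statistics = {"Ally": True, "John": False, "Susan": False, "Peter": True, "Tom": True}

# All C(5, 3) = 10 samples of three employees.
samples = list(combinations(knows_statistics, 3))

# Sample proportion for each sample, kept exact as a fraction.
p_bars = [Fraction(sum(knows_statistics[name] for name in s), 3) for s in samples]

counts = Counter(p_bars)
sampling_dist = {p: Fraction(c, len(samples)) for p, c in sorted(counts.items())}

for p_bar, prob in sampling_dist.items():
    print(f"{float(p_bar):.2f}  {float(prob):.2f}")
```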
• Mean of the sampling distribution of p̄
The mean of the sample proportion p̄ is denoted by µp̄ and is equal to the population proportion
p. Thus,
µp̄ = p
• Standard Deviation of the sampling distribution of p̄
The standard deviation of the sample proportion p̄ is denoted by σp̄ and is given by the formula
$$\sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}}$$
where p is the population proportion, and n is the sample size. This formula is used when n/N ≤
0.05, where N is the population size. However, if n/N is greater than 0.05, then σp̄ is calculated as
follows.
$$\sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} \sqrt{\frac{N - n}{N - 1}}$$
where the factor $\sqrt{\frac{N - n}{N - 1}}$ is the finite population correction factor.
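The choice between the two formulas can be sketched as a small Python helper. The function name `se_proportion` and the large-population numbers are hypothetical and used only for illustration. For the five-employee population of Example 7 (p = 0.6, n = 3, N = 5), the corrected formula gives exactly the standard deviation of the sampling distribution listed there, namely 0.2.

```python
from math import sqrt

def se_proportion(p, n, N=None):
    """Standard error of the sample proportion (hypothetical helper).

    Applies the finite population correction factor when the population
    size N is given and n/N exceeds 0.05.
    """
    se = sqrt(p * (1 - p) / n)
    if N is not None and n / N > 0.05:
        se *= sqrt((N - n) / (N - 1))
    return se

# Example 7: n/N = 3/5 = 0.6 > 0.05, so the correction applies.
se_small = se_proportion(0.6, 3, N=5)

# Illustrative large population: n/N is far below 0.05, so no correction.
se_large = se_proportion(0.6, 300, N=1_000_000)

print(f"small population: {se_small:.4f}")
print(f"large population: {se_large:.4f}")
```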
• Shape of the Sampling Distribution of p̄
Central limit theorem – The sampling distribution of p̄ is approximately normal for a sufficiently
large sample size. In the case of proportion, the sample size n is considered to be sufficiently
large if np and n(1 − p) are both at least 5, that is, if np ≥ 5 and n(1 − p) ≥ 5.
Sampling Distribution of p̄

Mean             µp̄ = p
Standard error   σp̄ = √(p(1 − p)/n)
Shape            Approximately normal if np ≥ 5 and n(1 − p) ≥ 5
Notation         p̄ ∼ N(p, (√(p(1 − p)/n))²)
7.7 Applications of the Sampling Distribution of p̄
Example 8 The election returns showed that a certain candidate received 46% of the votes.
(a) Determine the probability that a poll of 200 people selected at random from the voting population
would have shown a majority (over 50%) of votes in favor of the candidate.
(b) 95% of the sample proportions will be greater than what value?
Solution: (a) From the given information,
$$p = 0.46 \quad \text{and} \quad 1 - p = 1 - 0.46 = 0.54$$
The mean and standard deviation of the sample proportion p̄ are
$$\mu_{\bar{p}} = p = 0.46 \quad \text{and} \quad \sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{(0.46)(0.54)}{200}} = 0.0352.$$
As np = 200(0.46) = 92 and n(1 − p) = 200(0.54) = 108 are both greater than 5, we can infer
from the central limit theorem that the sampling distribution of p̄ is approximately normal. Thus,
$\bar{p} \sim N\left(0.46, (0.0352)^2\right)$. A majority is indicated in the sample if the proportion in favor of the
candidate is greater than 0.5.
$$P(\bar{p} > 0.5) = P\left(\frac{\bar{p} - \mu_{\bar{p}}}{\sigma_{\bar{p}}} > \frac{0.5 - 0.46}{0.0352}\right) = P(Z > 1.14) = 0.1271$$
(b) Let A be the required value. Given P (p̄ > A) = 0.95, from the normal table, we have
P (Z > −1.645) = 0.95. Hence
A = µp̄ + Zσp̄ = 0.46 + (−1.645) × 0.0352 = 0.4021
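The calculations of Example 8 can be checked with the following sketch, which uses `NormalDist` from Python's standard library and the normal approximation to the distribution of p̄. The variable names are illustrative. Since Z is not rounded to 1.14 here, the probability in (a) comes out slightly above the table-based 0.1271.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.46, 200
se = sqrt(p * (1 - p) / n)            # standard error of p-bar

# Normal approximation is reasonable here: np = 92 and n(1 - p) = 108.
p_bar = NormalDist(p, se)

p_majority = 1 - p_bar.cdf(0.5)       # (a) P(p-bar > 0.5)
a = p_bar.inv_cdf(0.05)               # (b) value exceeded by 95% of sample proportions

print(f"standard error = {se:.4f}")
print(f"(a) {p_majority:.4f}")
print(f"(b) {a:.4f}")
```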
7.8 Sample Surveys and Sampling Techniques
7.8.1 Why Sample?
Most of the time surveys are conducted by using samples and not a census of the population. Some
of the main reasons for conducting a sample survey instead of a census are as follows.
Time In most cases, the size of the population is quite large. Consequently, conducting a
census will take a long time, whereas a sample survey can be conducted very quickly. It will be
time consuming to interview or contact hundreds of thousands or even millions of members of a
population. On the other hand, a survey of a sample of a few hundred elements may be completed
in little time. In fact, because of the amount of time needed to conduct a census, by the time the
census is completed the results may be obsolete.
Cost The cost of collecting information from all members of a population may easily fall
outside the limited budget of most, if not all, surveys. Consequently, to stay within the available
resources, conducting a sample survey may be the best approach.
Impossibility of Conducting a Census Sometimes it is impossible to conduct a census.
First, it may not be possible to identify and access each member of the population. For example,
if a researcher wants to conduct a survey about homeless people, it will not be possible to locate
each member of the population and include him/her in the survey. Second, sometimes conducting
a survey means destroying the items included in the survey. For example, to estimate the mean
life of light bulbs would necessitate burning out all the bulbs included in the survey. In such cases,
only a portion of the population can be selected for the survey.
7.8.2 Random and Nonrandom Samples
Depending on how a sample is drawn, it may be a random sample or a nonrandom sample.
Definition 6 A random sample is a sample drawn in such a way that each member of the
population has some chance of being selected in the sample. In a nonrandom sample, some
members of the population may not have any chance of being selected in the sample.
Suppose we have a list of 100 students and we want to select 10 of them. If we write the names
of all 100 students on pieces of paper, put them in a box, mix them, and then draw 10 names,
the result will be a random sample of 10 students. However, if we arrange the names of these 100
students alphabetically and pick the first 10 names, it will be a nonrandom sample because the
students who are not among the first 10 have no chance of being selected in the sample.
Note that for a random sample, each member of the population may or may not have the same
chance of being included in the sample.
Three types of nonrandom samples are a convenience sample, a judgment sample and a
quota sample.
In a convenience sample, the most accessible members of the population are selected to obtain
the results quickly. For example, an opinion poll may be conducted in a few hours by collecting
information from certain shoppers at a single shopping mall.
In a judgment sample, the members are selected from the population based on the judgment and
prior knowledge of an expert. Although such a sample may happen to be a representative sample,
the chances of it being so are small. If the population is large, it is not an easy task to select a
representative sample based on judgment.
In quota sampling, randomness is forfeited in the interests of cheapness and administrative
simplicity. Investigators are told to interview all the people they meet up to a certain quota. A
large degree of bias could be introduced accidentally. For example, an interviewer in a shopping
centre may fill his quota by only meeting people who can go shopping during the week. In practice,
this problem can be partly overcome by subdividing the quota into different types of people, for
example on the basis of age, sex and income, to ensure that the sample mirrors the structure or
stratification of the population.
7.8.3 Random Sampling Techniques
There are many ways to select a random sample. Five of these techniques are discussed below.
Simple Random Sampling A sample that assigns the same probability of being selected to
each member of the population is called a simple random sample.
Definition 7 A simple random sample is a sample that is selected in such a way that each
member of the population has the same chance of being included in the sample and, more strictly,
each possible sample of the same size has the same chance of being selected.
One way to select a simple random sample is by a lottery or drawing. For example, if we need
to select five students from a class of 50, we write each of the 50 names on separate pieces of paper.
Then, we place all 50 names in a box and mix them thoroughly. Next, we draw one name randomly
from the box. We repeat this draw four more times. The five drawn names comprise a simple
random sample.
Tables of Random Numbers
The most commonly used inanimate device for introducing chance into the sampling process
is a table of random numbers (or table of random digits). Such a table, which typically has been
created with a computer random-number-generating function, consists of thousands of digits, each
of which is any one of the ten numbers from 0 to 9. Every digit has, in essence, been selected by
a simple random sample from the numbers 0 to 9. Consequently, the numbers 0 to 9 are equally
likely to appear in any digit-position in the table, and there are no systematic connections between
digits. Table E.1 (Random Numbers) in the Appendix is such a table of random numbers.
The second procedure to select a simple random sample is to use a table of random numbers.
Table E.1 in the Appendix lists random numbers. These numbers are generated by a random process.
Suppose we have a group of 1000 persons and we need to select 12 persons randomly from this
group. To select a simple random sample, we arrange the names of all 1000 persons in alphabetic
order and assign a three-digit number, from 000 to 999, to each person.
Next, we use the table of random numbers to select 12 persons. The random numbers in Table
E.1 are recorded in blocks of five digits. To use this table, we can start anywhere.
One way to do so is to close our eyes and put a finger anywhere on the page and start at that
point. From there, we can move in any direction. We need to pick three-digit numbers from the
table because we have assigned three-digit numbers to the 1000 persons in our population.
Suppose we start at Row 08 and Column 6 of Table E.1. The first block of three numbers in
Table E.1 is 938. We use the three digits of this block to select the first person from the population.
Hence, the first person selected is the one with the number 938. Suppose we move along the row
to the right to make the next selection. The second block of three numbers in Table E.1 is 090.
Consequently, the second person selected is the one with the number 090. We continue this process
until all 12 required persons are selected. This gives us a simple random sample of 12 persons. Since
we are sampling without replacement, any random number that repeats (for example, a second
occurrence of 090) is discarded, so that a sample of 12 unique persons is obtained.
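In software, the random number table is usually replaced by a pseudorandom number generator. The sketch below is one way to draw the 12 persons in Python. The seed and variable names are arbitrary, and `random.sample` already draws without replacement, so no repeated labels need to be discarded.

```python
import random

rng = random.Random(2024)   # seeded only so the example is repeatable

# Labels 000 to 999 for the 1000 persons, as in the text.
labels = [f"{i:03d}" for i in range(1000)]

# random.sample draws without replacement, so no label can repeat.
chosen = rng.sample(labels, 12)

print(chosen)
```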
Systematic Random Sampling The simple random sampling procedure will become very tedious
if the size of the population is large. For example, if we need to select 150 households from
a list of 45,000, it will be very time consuming either to write the 45,000 names on pieces of paper
and then select 150 households or to assign a five-digit number to each of the 45,000 households and
then select 150 households using the table of random numbers. In such cases, it is more convenient
to use systematic random sampling.
The procedure to select a systematic random sample is as follows. In the example just mentioned,
we would arrange all 45,000 households alphabetically (or based on some other characteristic). Since
the sample size should equal 150, the ratio of population size to sample size is 45,000/150 = 300. Using
this ratio, we randomly select one household from the first 300 households in the arranged list either
by using the lottery system or by using a table of random numbers. Suppose by using either of
these methods, we select the 210th household. We then select the 210th household from each
successive block of 300 households in the list. In other words, our sample includes the households
with numbers 210, 510, 810, 1110, 1410, 1710, and so on.
Definition 8 In systematic random sampling, we first randomly select one member from the
first k units. Then every kth member, starting with the first selected member, is included in the
sample.
Note that systematic random sampling does not give a simple random sample because we cannot
select two adjacent elements: only k different samples are possible. Each member of the population
still has the same chance (1/k) of being selected, but not every possible sample of the same size
is equally likely.
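The systematic procedure can be sketched in a few lines of Python. The function name `systematic_sample` is hypothetical, and the sketch assumes that the population size is an exact multiple of the sample size, as in the 45,000-household example.

```python
import random

def systematic_sample(population_size, sample_size, rng=random):
    """Return 1-based list positions chosen by systematic sampling.

    Assumes population_size is an exact multiple of sample_size.
    """
    k = population_size // sample_size      # sampling interval
    start = rng.randint(1, k)               # random start among the first k units
    return list(range(start, population_size + 1, k))

positions = systematic_sample(45_000, 150, random.Random(1))
print(positions[:6])

# With a start of 210, the positions match the text: 210, 510, 810, ...
print(list(range(210, 45_001, 300))[:6])
```

Only the starting position is random, so there are exactly k = 300 possible samples.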
Stratified Random Sampling Suppose we need to select a sample from the population of a
city and we want households with different income levels to be proportionately represented in the sample.
In this case, instead of selecting a simple random sample or a systematic random sample, we may
prefer to apply a different technique. First, we divide the whole population into different groups
based on income levels. For example, we may form three groups of low-, medium-, and high-income
households. We will now have three subpopulations, which are usually called strata.
We then select one sample from each subpopulation or stratum. The collection of all three
samples selected from three strata gives the required sample, called the stratified random sample.
Usually, the sizes of the samples selected from different strata are proportionate to the sizes of the
subpopulations in these strata. Note that the elements of each stratum are similar with regard to
the possession of a characteristic.
Definition 9 In a stratified random sample, we first divide the population into subpopulations,
which are called strata. Then, one sample is selected from each of these strata. The collection of
all samples from all strata gives the stratified random sample.
Thus, whenever we observe that a population differs widely in the possession of a characteristic,
we may prefer to divide it into different strata and then select one sample from each stratum.
We can divide the population on the basis of any characteristic, such as income, expenditure, sex,
education, race, employment, or family size.
Cluster Sampling Sometimes the target population is scattered over a wide geographical area.
Consequently, if a simple random sample is selected, it may be costly to contact each member of
the sample. In such a case, we divide the population into different geographical groups or clusters
and as a first step select a random sample of certain clusters from all clusters. We then take all
elements from each selected cluster. For example, suppose we are to conduct a survey of households
in Hong Kong. First, we divide the whole of Hong Kong into, say, 40 regions, which will be called
clusters or primary units. We make sure that all clusters are similar and, hence, representative
of the population.
We then select at random, say, 5 clusters from 40. Next, we survey all households in each of
these 5 clusters. This is called cluster sampling. (Surveying only a random sample of households
within each selected cluster would instead be a two-stage cluster sample.) Note that all clusters
must be representative of the population.
Definition 10 In cluster sampling, the whole population is first divided into (geographical)
groups called clusters. Each cluster is representative of the population. Then a random sample of
clusters is selected. Finally, all elements in each of the selected clusters are selected.
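A one-stage cluster sample along the lines of Definition 10 can be sketched as follows. The sampling frame is entirely hypothetical: the 40 regions, the household identifiers, and the cluster sizes are made up for illustration.

```python
import random

rng = random.Random(7)   # seeded only for repeatability

# Hypothetical frame: 40 clusters (regions), each a list of household IDs.
# Cluster sizes are made up for illustration.
clusters = {
    region: [f"R{region:02d}-H{h:03d}" for h in range(rng.randint(50, 100))]
    for region in range(1, 41)
}

# Step 1: simple random sample of 5 clusters out of 40.
selected_regions = rng.sample(sorted(clusters), 5)

# Step 2: include every household in each selected cluster.
sample = [hh for region in selected_regions for hh in clusters[region]]

print(selected_regions, len(sample))
```

Because whole clusters are taken, the final sample size depends on which clusters happen to be drawn.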
Example 9 A company owning a chain of newsagents wishes to undertake a customer service
survey. Interviewers will be despatched to a sample of 100 branches to question customers in the
shops. The number of newsagents owned in each area is as follows.
Central 360
North West 240
North 200
North East 100
Greater London 700
Scotland 200
Wales 140
South 60
Explain how the 100 newsagents might be chosen, giving the relative advantages and disadvantages
of each method, if the survey is to be performed using the following sampling methods.
(a) Stratified random sampling
(b) Systematic sampling
(c) Cluster sampling
Solution:
(a) Stratified sampling
Each area could be taken as a stratum. There are 2,000 newsagents altogether, so 100/2,000 =
5% of the newsagents in each area must be included in the sample.
In the Central area, for example, 360 × 5% = 18 newsagents would be included. These should
be selected randomly from the 360 newsagents. The newsagents should be numbered from 000 to
359, and three-digit random numbers used to select the sample.
Stratified sampling has the advantage that the same proportion of newsagents from each area
will be included in the sample. It gives a closer approximation to random sampling than cluster
sampling, and may give a closer approximation to random sampling than systematic sampling.
Statistical calculations generally require that the samples on which they are based be random or
nearly so.
The disadvantage of stratified sampling is that it is likely to be more expensive than the other
two methods.
(b) Systematic sampling
In order to take a systematic sample, the 2,000 newsagents must first be arranged in some order.
Any order will do, but orders which might produce some cyclical patterns (such as by size within
one area, then by size within the next area, and so on) should be avoided if possible. The sample
then comprises every 2,000/100 = 20th newsagent in the order, with the first newsagent chosen at
random from among the first 20 in the order.
The advantage of systematic sampling is that, provided a complete list of newsagents is available,
it is cheap and easy to obtain the sample. The disadvantage is that if there is any cyclical pattern
in the ordering, an unrepresentative sample may be obtained.
(c) Cluster sampling
Each area could be divided into clusters (perhaps towns and their surrounding areas), and a
random sample of clusters from all areas would then be taken. The sample would comprise all the
newsagents within the selected clusters.
This method is fairly cheap to use, but the clusters must first be identified. It is possible that
a sample smaller or larger than 100 newsagents will be obtained.
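For part (a), the proportional allocation across all eight areas can be worked out with a short Python sketch. It uses the branch counts given in the example; the variable names are illustrative. In this example every allocation is a whole number, but in general rounding may leave the total slightly off the target sample size.

```python
branches = {
    "Central": 360, "North West": 240, "North": 200, "North East": 100,
    "Greater London": 700, "Scotland": 200, "Wales": 140, "South": 60,
}
sample_size = 100

total = sum(branches.values())            # 2,000 newsagents in all

# Proportional allocation: the same 5% sampling fraction in every stratum.
allocation = {
    area: round(count * sample_size / total)
    for area, count in branches.items()
}

for area, k in allocation.items():
    print(f"{area:15s} {k:3d}")
print(f"{'Total':15s} {sum(allocation.values()):3d}")
```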
7.8.4 Sampling and Nonsampling Errors
The results obtained from a sample survey may contain two types of errors: the sampling and
nonsampling errors. The sampling error is also called the chance error, and the non-sampling errors
are also called the systematic errors.
Sampling or Chance Error Usually, all samples taken from the same population will give
different results because they contain different elements of the population. Moreover, the results
obtained from any one sample will not be exactly the same as the ones obtained from a census.
The difference between a sample result and the result we would have obtained by conducting a
census is called the sampling error, assuming that the sample is random and no nonsampling error
has been made.
Definition 11 The sampling error is the difference between the result obtained from a sample
survey and the result that would have been obtained if the whole population had been included in the
survey.
The sampling error occurs because of chance, and it cannot be avoided. A sampling error can
occur only in a sample survey. It does not occur in a census.
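As an illustration, the sampling error can be computed for every sample in Example 2, using the population of five midterm scores from Example 1. The sketch below assumes the sign convention "sample mean minus population mean"; the variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

population = [70, 78, 80, 80, 95]  # midterm scores from Example 1
mu = mean(population)              # population mean, 80.6

# Sampling error of each of the 10 samples of size 3 (assumed sign: x_bar - mu).
sampling_errors = [mean(s) - mu for s in combinations(population, 3)]

for s, err in zip(combinations(population, 3), sampling_errors):
    print(s, round(mean(s), 2), round(err, 2))
```

The first sample (70, 78, 80) has mean 76.00, so its sampling error is 76.00 - 80.60 = -4.60. The errors differ from sample to sample and average to zero over all 10 samples, which is what makes them chance errors rather than systematic ones.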
Nonsampling or Systematic Errors Nonsampling errors can occur both in a sample survey
and in a census. Such errors occur because of human mistakes and not chance.
Definition 12 The errors that occur in the collection, recording, and tabulation of data are called
nonsampling errors.
Nonsampling errors can be minimized if questions are prepared carefully and data are handled
cautiously. There are many types of systematic errors or biases that can occur in a survey,
including coverage error, nonresponse error, response error, and voluntary response error.
Coverage Error When we select a sample, we draw it from a list of elements, and this list
usually omits some members of the target population, because most of the time it is not
feasible to include every member of the target population in the list. This list
of members of the population that is used to select a sample is called the sampling frame.
For example, if we use a telephone directory to select a sample, the list of names that appears in
this directory constitutes the sampling frame. In this case we will miss the people who are not
listed in the telephone directory. The people we miss will include, for example, poor people
(including homeless people) who do not have telephones and people who do not want to be listed
in the directory. Thus, the sampling frame that is used to select a sample may not be
representative of the population. This may cause the sample results to be different from the
population results. The error that occurs because the sampling frame is not representative of
the population is called the coverage error.
Definition 13 The list of members of the target population that is used to select a sample is called
the sampling frame. The error that occurs because the sampling frame is not representative of
the population is called the coverage error.
If a sample is nonrandom (and, hence, nonrepresentative), the sample results may be quite
different from the census results.
Nonresponse Error Even if our sampling frame and, consequently, the sample are representative
of the population, the nonresponse error may occur because many of the people included in
the sample do not respond to the survey.
Definition 14 The error that occurs because many of the people included in the sample do not
respond to a survey is called the nonresponse error.
This type of error occurs especially when a survey is conducted by mail, because many people do
not return the questionnaires. It has been observed that families with low and high incomes
tend not to respond to surveys by mail. Consequently, such surveys overrepresent middle-income
families. This kind of error occurs in other types of surveys, too. For instance, in a
face-to-face survey where the interviewer interviews people at their homes, many people may not
be home when the interviewer visits. The people who are home at that time and the ones who are
not may differ in many respects, causing a bias in the survey results. This kind of error may
also occur in a telephone survey: many people may not be home when the interviewer calls, and
this may distort the results. To reduce the nonresponse error, every effort should be made to
contact all people included in the survey.
Response Error The response error occurs when the answer given by a person included in the
survey is not correct. This may happen for many reasons. One reason is that the respondent may not
have understood the question. Thus, the wording of the question may have caused the respondent
to answer incorrectly. It has been observed that when the same question is worded differently,
many people do not respond the same way. Usually such an error on the part of respondents is not
intentional.
Definition 15 The response error occurs when people included in the survey do not provide
correct answers.
Sometimes the respondents do not want to give correct information when answering a question.
For example, many respondents will not disclose their true incomes on questionnaires or in
interviews. When information on income is provided, it is almost always biased in the upward
direction.
Sometimes the race of the interviewer may affect the answers of respondents. This is especially
true if the questions asked are about race relations. The answers given by respondents will differ
depending on whether the interviewer is white or nonwhite.
Voluntary Response Error Another source of systematic error is a survey based on a vol-
untary response sample.
Definition 16 Voluntary response error occurs when a survey is not conducted on a randomly
selected sample but a questionnaire is published in a magazine or newspaper and people are invited
to respond to that questionnaire.
The polls conducted based on samples of readers of magazines and newspapers suffer from
voluntary response error or bias. Usually only those readers who have very strong opinions about
the issues involved respond to such surveys. Surveys in which the respondents are required to
call 900 telephone numbers also suffer from this type of error. Here, in order to participate, a
respondent must pay for the call, and many people do not want to bear this cost. Consequently, the
sample is usually neither random nor representative of the target population because participation
is voluntary.
Example 10 Why is the following true story an example of nonrandom sampling, with both selec-
tion bias and voluntary response bias?
The 1936 presidential election in the United States had two major candidates: the Republican, Al-
fred M. Landon, and the Democrat, the incumbent president, Franklin D. Roosevelt. Several weeks
before the election, Literary Digest magazine tried to predict the outcome by mailing 10 million ques-
tionnaires to people selected from three sources: the subscription list for the magazine, telephone
directories, and automobile registration records. The magazine received back approximately 2.5 mil-
lion answers, and of these some 57% favored Landon. From these results the magazine predicted a
landslide victory for Landon. A few weeks later, however, in the actual election, it was Roosevelt
who got the majority of the votes (62%).
Solution: This is an example of nonrandom sampling because, by limiting the sample to magazine
subscribers and to owners of telephones and automobiles, most of the voting population had a zero
probability of being included in the sample. The time was 1936, in the depths of the Depression,
and this judgment selection limited the sample to a relatively prosperous stratum of the
population. Besides this severe selection bias, produced by a discrepancy between the target
population and the sampling frame, there was also a voluntary response bias. This voluntary
response bias, also called self-selection bias, occurred because only about 25% of the selected
sample (2.5 million of the 10 million questionnaires mailed) returned their questionnaires. Thus
even for this chosen stratum of the population, the probabilities for inclusion in the sample
were unknown before sampling.