7.1 Sampling Distribution of X̄

Definition 1 The population distribution is the probability distribution of the population data.

Example 1 Suppose there are only five students in an advanced statistics class and the midterm

scores of these five students are

70 78 80 80 95

Solution: Let X denote the score of a student. Using single-valued classes, the frequency

distribution of scores is as follows:

X      f        f(x)
70     1        0.2
78     1        0.2
80     2        0.4
95     1        0.2
       N = 5    ∑f(x) = 1.0

The values of the mean and standard deviation calculated for the probability distribution give

the values of the population parameters µ and σ. These values are µ = 80.60 and σ = 8.09.
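As a quick check of these two parameters, the following minimal Python sketch recomputes them from the five scores. It assumes only the standard library; `pstdev` is used because it divides by N, which matches the population standard deviation σ (the variable names are illustrative).

```python
from statistics import fmean, pstdev

scores = [70, 78, 80, 80, 95]

mu = fmean(scores)      # population mean
sigma = pstdev(scores)  # population standard deviation (divides by N, not N - 1)

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")  # mu = 80.60, sigma = 8.09
```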

Definition 2 The probability distribution of X̄ is called the sampling distribution of X̄. It lists

the various values that X̄ can assume and the probability of each value of X̄.

Example 2 Reconsider the population of midterm scores of five students given in Example 1. Con-

sider all possible samples of three scores each that can be selected, without replacement, from that

population. The total number of possible samples is given by the combinations formula:

$$\text{Total number of samples} = \binom{5}{3} = \frac{5!}{3!\,2!} = 10$$

Solution: Suppose we assign letters A, B, C, D, and E to the scores of five students so that A

= 70, B = 78, C = 80, D = 80, E = 95. Then the 10 possible samples of three scores each are ABC,

ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE. These 10 samples and their respective

means are listed in the following table.

Sample    Scores in the Sample    X̄
ABC       70, 78, 80              76.00
ABD       70, 78, 80              76.00
ABE       70, 78, 95              81.00
ACD       70, 80, 80              76.67
ACE       70, 80, 95              81.67
ADE       70, 80, 95              81.67
BCD       78, 80, 80              79.33
BCE       78, 80, 95              84.33
BDE       78, 80, 95              84.33
CDE       80, 80, 95              85.00

By using the values of X̄ given in the above table, we record the frequency distribution of X̄ as

follows:

X̄        f     f(X̄)
76.00    2     0.2
76.67    1     0.1
79.33    1     0.1
81.00    1     0.1
81.67    2     0.2
84.33    2     0.2
85.00    1     0.1
               ∑f(X̄) = 1.0
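The enumeration in Example 2 can also be reproduced programmatically. The sketch below is one possible way to do it in Python, assuming the letter labels A to E from the solution; the variable names are illustrative, and the means are rounded to two decimals as in the table.

```python
from collections import Counter
from itertools import combinations

scores = {"A": 70, "B": 78, "C": 80, "D": 80, "E": 95}

# All samples of size 3 drawn without replacement, with their (rounded) means.
sample_means = {
    "".join(labels): round(sum(scores[s] for s in labels) / 3, 2)
    for labels in combinations(scores, 3)
}

# Relative frequency of each distinct sample mean.
counts = Counter(sample_means.values())
distribution = {
    mean: count / len(sample_means) for mean, count in sorted(counts.items())
}

for mean, prob in distribution.items():
    print(f"{mean:6.2f}  {prob:.1f}")
```

The printed pairs correspond to the rows of the frequency distribution of X̄ above.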

• Sampling Error – Sampling error is the difference between the value of a sample statistic

and the value of the corresponding population parameter. In the case of the mean,

Sampling error = X̄ − µ

assuming that the sample is random and no nonsampling error has been made.

It is important to remember that a sampling error occurs because of chance. The errors that

occur for other reasons, such as errors made during collection, recording, and tabulation of data, are

called nonsampling errors. Such errors occur because of human mistakes and not chance. Note

that there is only one kind of sampling error–the error that occurs due to chance. However, there

is not just one nonsampling error but many nonsampling errors that may occur due to different

reasons.

Definition 3 The errors that occur in the collection, recording, and tabulation of data are called

nonsampling errors.

The following paragraph, reproduced from the Current Population Reports of the U.S. Bureau

of the Census, explains how nonsampling errors can occur.

Nonsampling errors can be attributed to many sources, e.g., inability to obtain information about

all cases in the sample, differences in the interpretation of questions, inability or unwillingness on

the part of the respondents to provide correct information, inability to recall information, errors

made in collection such as in recording or coding the data, errors made in processing the data, errors

made in estimating values for missing data, biases resulting from the differing recall periods caused

by the interviewing pattern used, and failure of all units in the universe to have some probability

of being selected for the sample (undercoverage).

The following are the main reasons for the occurrence of nonsampling errors.

1. If a sample is nonrandom (and, hence, nonrepresentative), the sample results may differ

substantially from the census results. The following quote from U.S. News & World Report describes

how even a randomly selected sample can become nonrandom if some of the members included

in the sample cannot be contacted.

A test poll conducted in the 1984 presidential election found that if the poll were halted after

interviewing only those subjects who could be reached on the first try, Reagan showed a 3-

percentage-point lead over Mondale. But when interviewers made a determined effort to reach

everyone on their lists of randomly selected subjects—calling some as many as 30 times before

finally reaching them—Reagan showed a 13 percent lead, much closer to the actual election

result. As it turned out, people who were planning to vote Republican were simply less likely

to be at home. (“The Numbers Racket: How Polls and Statistics Lie,” U.S. News & World

Report, July 11, 1988. Copyright 1988 by U.S. News & World Report, Inc. Reprinted with

permission.)

2. The questions may be phrased in such a way that they are not fully understood by the

members of the sample or population. As a result, the answers obtained are not accurate.

3. The respondents may intentionally give false information in response to some sensitive questions.

For example, people may not tell the truth about drinking habits, incomes, or opinions

about minorities. Sometimes the respondents may give wrong answers because of ignorance.

For example, a person may not remember the exact amount he spent on clothes during the

last year. If asked in a survey, he may give an inaccurate answer.

4. The poll taker may make a mistake and enter a wrong number in the records or make an error

while entering the data on a computer.

Note that nonsampling errors can occur both in a sample survey and in a census, whereas sam-

pling error occurs only when a sample survey is conducted. Nonsampling errors can be minimized

by preparing the survey questionnaire carefully and handling the data cautiously. However, it is

impossible to avoid sampling error.

Example 3 Reconsider the population of five scores given in Example 1. The scores of the five

students are 70, 78, 80, 80, and 95. The population mean is

$$\mu = \frac{70 + 78 + 80 + 80 + 95}{5} = 80.60$$

Now suppose we take a random sample of three scores from this population. Assume that this sample

includes the scores 70, 80, and 95. The mean for this sample is

$$\bar{X} = \frac{70 + 80 + 95}{3} = 81.67$$

Consequently,

Sampling error = X̄ − µ = 81.67 − 80.60 = 1.07

That is, the mean score estimated from the sample is 1.07 higher than the mean score of the popu-

lation. Note that this difference occurred due to chance, that is, because we used a sample instead

of the population.

Now suppose that, when we select the above-mentioned sample, we mistakenly record the second

score as 82 instead of 80. As a result, we calculate the sample mean as

$$\bar{X} = \frac{70 + 82 + 95}{3} = 82.33$$

Consequently, the difference between this sample mean and the population mean is

X̄ − µ = 82.33 − 80.60 = 1.73

However, this difference between the sample mean and the population mean does not represent the

sampling error. As we calculated earlier, only 1.07 of this difference is due to the sampling error.

The remaining portion, which is equal to 1.73 − 1.07 = 0.66, represents the nonsampling error

because it occurred due to the error we made in recording the second score in the sample. Thus,

Sampling error = 1.07 and Nonsampling error = 0.66.
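The decomposition in Example 3 can be retraced with a few lines of Python. This is only a sketch of the arithmetic: the means are rounded to two decimals at each step, as in the text, which is why the nonsampling error comes out as 0.66 rather than the unrounded 2/3. The variable names are illustrative.

```python
population = [70, 78, 80, 80, 95]
mu = round(sum(population) / len(population), 2)     # 80.60

correct_sample = [70, 80, 95]
recorded_sample = [70, 82, 95]   # second score misrecorded as 82 instead of 80

mean_correct = round(sum(correct_sample) / 3, 2)     # 81.67
mean_recorded = round(sum(recorded_sample) / 3, 2)   # 82.33

sampling_error = round(mean_correct - mu, 2)         # due to chance alone
total_difference = round(mean_recorded - mu, 2)      # chance plus recording mistake
nonsampling_error = round(total_difference - sampling_error, 2)

print(sampling_error, total_difference, nonsampling_error)  # 1.07 1.73 0.66
```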

7.2 Mean and Standard Deviation of X̄

The mean and standard deviation calculated for the sampling distribution of X̄ are called the

mean µX̄ and standard deviation σX̄ of X̄. Actually, the mean and standard deviation of X̄ are,

respectively, the mean and standard deviation of the means of all samples of the same size selected

from a population. The standard deviation σX̄ of X̄ is also called the standard error of X̄.

• Mean of the Sampling Distribution of X̄

The mean of the sampling distribution of X̄ is equal to the mean of the population. Thus,

µX̄ = µ

• Standard Deviation of the Sampling Distribution of X̄

The standard deviation of the sampling distribution of X̄ is

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

where σ is the standard deviation of the population and n is the sample size. This formula is used

when n/N ≤ 0.05, where N is the population size. If this condition is not satisfied, we use the

following formula to calculate σX̄

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$

where the factor $\sqrt{\frac{N - n}{N - 1}}$ is called the finite population correction factor.
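Both results can be checked numerically on the five-score population of Example 1, where n/N = 3/5 exceeds 0.05 and the finite population correction therefore applies. The Python sketch below enumerates all 10 samples of size 3 and compares the mean and standard deviation of their means with the formulas; it is an illustration under those assumptions, not a proof.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean, pstdev

population = [70, 78, 80, 80, 95]
N, n = len(population), 3

mu = fmean(population)
sigma = pstdev(population)

# Means of all C(5, 3) = 10 samples drawn without replacement.
sample_means = [fmean(sample) for sample in combinations(population, n)]

mean_of_means = fmean(sample_means)   # should equal mu
sd_of_means = pstdev(sample_means)    # should equal the corrected formula

# n/N = 0.6 > 0.05, so the finite population correction factor is included.
sd_formula = sigma / sqrt(n) * sqrt((N - n) / (N - 1))

print(f"{mean_of_means:.2f} {sd_of_means:.4f} {sd_formula:.4f}")
```

Without the correction factor, σ/√n would be about 4.67 here, noticeably larger than the standard deviation actually observed across the 10 sample means.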

7.3 Shape of the Sampling Distribution of X̄

The shape of the sampling distribution of X̄ relates to the following two cases.

1. The population from which samples are drawn has a normal distribution.

2. The population from which samples are drawn does not have a normal distribution.

• Sampling from a Normally Distributed Population

When the population from which samples are drawn is normally distributed with mean equal

to µ and standard deviation equal to σ, then

1. The mean of X̄, µX̄ , is equal to the mean of the population, µ.

2. The standard deviation of X̄, σX̄, is equal to σ/√n, assuming n/N ≤ 0.05.

3. The shape of the sampling distribution of X̄ is normal, whatever the value of n.

• Sampling from a population that is not Normally Distributed but n ≥ 30

Most of the time the population from which the samples are selected is not normally distributed.

In such cases, the shape of the sampling distribution of X̄ is inferred from a very important theorem

called the central limit theorem.

• Central Limit Theorem

For a large sample size, the sampling distribution of the sample mean X̄ is approximately normal,

irrespective of the shape of the population distribution. The mean and standard deviation of the

sampling distribution of X̄ are

$$\mu_{\bar{X}} = \mu \quad\text{and}\quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}.$$

The sample size is usually considered to be large if n ≥ 30.

Note: If the population distribution is fairly symmetrical, the sampling distribution of the

sample mean is approximately normal when samples of at least 15 observations are selected.
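The central limit theorem can be illustrated by simulation. The sketch below is a hypothetical experiment, not part of the original notes: it draws 5,000 samples of size n = 30 from an exponential population with µ = 1 and σ = 1 (a strongly right-skewed distribution) and compares the mean and standard deviation of the sample means with µ and σ/√n. The seed and the number of samples are arbitrary choices.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(0)  # arbitrary fixed seed so the run is repeatable

n = 30              # sample size
num_samples = 5000  # number of simulated samples

# Exponential population with rate 1: mu = 1 and sigma = 1.
sample_means = [
    fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

print(f"mean of sample means: {fmean(sample_means):.3f} (mu = 1)")
print(f"sd of sample means:   {pstdev(sample_means):.3f} "
      f"(sigma/sqrt(n) = {1 / sqrt(n):.3f})")
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is far from normal.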

Sampling Distribution of X̄

                  Normal Population        Non-normal Population
Mean              µX̄ = µ                   µX̄ = µ
Standard error    σX̄ = σ/√n                σX̄ = σ/√n
Shape             Normal                   Approximately normal if n ≥ 30
Notation          X̄ ∼ N(µ, (σ/√n)²)        X̄ ∼ N(µ, (σ/√n)²)

7.4 Applications of the Sampling Distribution of X̄

Example 4 A company that manufactures drink-dispensing machines sets the mean fill level at 198 cc.

The standard deviation is 4 cc. Assume that the fill levels have a normal distribution.

(a) If a drink is randomly selected, what is the probability that it will contain less than 195 cc?

(b) What is the probability that a random sample of 50 drinks has a mean fill level greater than 199 cc?

Solution: (a) Let X be the fill level and µ be the mean fill level. Given X ∼ N(198, 4²),

$$P(X < 195) = P\left(\frac{X - \mu}{\sigma} < \frac{195 - 198}{4}\right) = P(Z < -0.75) = 0.2266$$

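(b) Let X̄ be the sample mean. Since the population is normally distributed, the shape of

the sampling distribution of X̄ is normal. The mean and standard deviation of X̄ are

$$\mu_{\bar{X}} = \mu = 198 \quad\text{and}\quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{4}{\sqrt{50}}.$$

That is, $\bar{X} \sim N\left(198, \left(\frac{4}{\sqrt{50}}\right)^2\right)$. Hence

$$P(\bar{X} > 199) = P\left(\frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} > \frac{199 - 198}{4/\sqrt{50}}\right) = P(Z > 1.77) = 0.0384$$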
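Both probabilities in Example 4 can be computed without a normal table. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative. Because it does not round Z to two decimals, the result for part (b) may differ from the table-based 0.0384 in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 198, 4, 50

# (a) A single drink: X ~ N(198, 4^2).
p_single = NormalDist(mu, sigma).cdf(195)

# (b) Mean of 50 drinks: X-bar ~ N(198, (4 / sqrt(50))^2).
p_mean = 1 - NormalDist(mu, sigma / sqrt(n)).cdf(199)

print(f"P(X < 195)     = {p_single:.4f}")
print(f"P(X-bar > 199) = {p_mean:.4f}")
```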

Example 5 Suppose that the mean and standard deviation of the weights of all packages of a certain

brand of cookies are 32 grams and 0.8 grams, respectively.

(a) Find the probability that the mean weight of a random sample of 40 packages of this brand of

cookies will be between 31.8 and 31.9 grams.

(b) 97.5% of the sample means will be less than what value?

Solution: (a) Since the sample size is large (n ≥ 30), by the central limit theorem the sampling

distribution of X̄ is approximately normal. The mean and standard deviation of X̄ are

$$\mu_{\bar{X}} = \mu = 32 \quad\text{and}\quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.8}{\sqrt{40}}.$$

Thus, approximately, $\bar{X} \sim N\left(32, \frac{0.8^2}{40}\right)$. We are to compute the probability that the value of X̄ calculated for one

randomly drawn sample of 40 packages is between 31.8 and 31.9 grams, that is,

$$P(31.8 < \bar{X} < 31.9)$$

This probability is given by the area under the normal curve for X̄ between the points X̄ = 31.8

and X̄ = 31.9. The first step in finding this area is to convert the two X̄ values to respective Z

values.

The probability that X̄ is between 31.8 and 31.9 is given by the area under the standard normal

curve between Z = −1.58 and Z = −0.79. Thus, the required probability is

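$$\begin{aligned}
P(31.8 < \bar{X} < 31.9) &= P\left(\frac{31.8 - 32}{0.8/\sqrt{40}} < \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} < \frac{31.9 - 32}{0.8/\sqrt{40}}\right) \\
&= P(-1.58 < Z < -0.79) \\
&= 0.1577.
\end{aligned}$$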

Therefore, the probability is 0.1577 that the mean weight of a sample of 40 packages will be between

31.8 and 31.9 grams.

(b) Let A be the required value. Given P(X̄ < A) = 0.975, from the normal table, we have

P(Z < 1.96) = 0.975. Hence

$$A = \mu_{\bar{X}} + Z\sigma_{\bar{X}} = 32 + 1.96 \times \frac{0.8}{\sqrt{40}} = 32.248 \text{ grams}$$
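Example 5 can be retraced in the same way. The Python sketch below treats X̄ as normal with mean 32 and standard error 0.8/√40, which is the central limit theorem approximation used in the solution; `inv_cdf` plays the role of reading the normal table in reverse. The variable names are illustrative, and small differences from the table-based answers come from not rounding Z.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 32, 0.8, 40

# Approximate sampling distribution of the sample mean (CLT, n >= 30).
x_bar = NormalDist(mu, sigma / sqrt(n))

# (a) Probability that the sample mean lies between 31.8 and 31.9 grams.
p_between = x_bar.cdf(31.9) - x_bar.cdf(31.8)

# (b) Value below which 97.5% of the sample means fall.
a = x_bar.inv_cdf(0.975)

print(f"P(31.8 < X-bar < 31.9) = {p_between:.4f}")
print(f"A = {a:.3f} grams")
```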

7.5 Population and Sample Proportions

Definition 4 The population proportion, denoted by p, is obtained by taking the ratio of the

number of elements in a population with a specific characteristic to the total number of elements in

the population. The sample proportion, denoted by p̄, gives a similar ratio for a sample.

The population and sample proportions, denoted by p and p̄, respectively, are calculated as

$$p = \frac{x}{N} \quad\text{and}\quad \bar{p} = \frac{x}{n}$$

where

N = Total number of elements in the population

n = Total number of elements in the sample

x = Number of elements in the population or sample that possess a specific characteristic.

Example 6 Suppose a total of 789,654 families live in a city and 563,282 of them own homes.

Then, N = population size = 789,654 and x = families in the population who own homes = 563,282.

Solution: The proportion of all families in this city who own homes is

$$p = \frac{x}{N} = \frac{563{,}282}{789{,}654} = 0.71.$$

Now, suppose a sample of 240 families is taken from this city and 158 of them are homeowners.

Then, n = sample size = 240 and x = families in the sample who own homes = 158. The sample

proportion is

$$\bar{p} = \frac{x}{n} = \frac{158}{240} = 0.66.$$

As in the case of the mean, the difference between the sample proportion and the correspond-

ing population proportion gives the sampling error, assuming that the sample is random and no

nonsampling error has been made. That is, in case of the proportion,

Sampling error = p̄ − p

For instance, sampling error = p̄ − p = 0.66 − 0.71 = −0.05.

7.6 Mean, Standard Deviation, and Shape of the Sampling Distribution of p̄

Definition 5 The probability distribution of the sample proportion p̄ is called the sampling dis-

tribution of p̄. It gives the various values that p̄ can assume and their probabilities.

Example 7 Boe Consultant Associates has five employees. The following table gives the names of

these five employees and information concerning their knowledge of statistics.

Name Knows Statistics

Ally yes

John no

Susan no

Peter yes

Tom yes

Solution: If we define the population proportion p as the proportion of employees who know

statistics, then,

p = 3/5 = 0.60

Now, suppose we draw all possible samples of three employees each and compute the proportion of

employees, for each sample, who know statistics. The total number of samples of size three that

can be drawn from the population of five employees is

$$\text{Total number of samples} = \binom{5}{3} = 10$$

The following table lists these 10 possible samples and the proportion of employees who know statistics in each

of those samples.

Sample                  Proportion who know statistics, p̄
Ally, John, Susan       1/3 = 0.33
Ally, John, Peter       2/3 = 0.67
Ally, John, Tom         2/3 = 0.67
Ally, Susan, Peter      2/3 = 0.67
Ally, Susan, Tom        2/3 = 0.67
Ally, Peter, Tom        3/3 = 1.00
John, Susan, Peter      1/3 = 0.33
John, Susan, Tom        1/3 = 0.33
John, Peter, Tom        2/3 = 0.67
Susan, Peter, Tom       2/3 = 0.67

The sampling distribution of p̄ is recorded as follows:

p̄       f(p̄)
0.33    0.30
0.67    0.60
1.00    0.10
        ∑f(p̄) = 1.00
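The sampling distribution of p̄ in Example 7 can be generated by enumerating the 10 samples. The following Python sketch does so with exact fractions so that 1/3 and 2/3 are not affected by rounding; the dictionary and variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

knows_statistics = {
    "Ally": True, "John": False, "Susan": False, "Peter": True, "Tom": True,
}

# Sample proportion for each of the C(5, 3) = 10 possible samples.
proportions = [
    Fraction(sum(knows_statistics[name] for name in sample), 3)
    for sample in combinations(knows_statistics, 3)
]

counts = Counter(proportions)
distribution = {
    p_bar: Fraction(count, len(proportions))
    for p_bar, count in sorted(counts.items())
}

for p_bar, prob in distribution.items():
    print(f"{float(p_bar):.2f}  {float(prob):.2f}")
```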

• Mean of the Sampling Distribution of p̄

The mean of the sample proportion p̄ is denoted by µp̄ and is equal to the population proportion

p. Thus,

µp̄ = p

• Standard Deviation of the Sampling Distribution of p̄

The standard deviation of the sample proportion p̄ is denoted by σp̄ and is given by the formula

$$\sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}}$$

where p is the population proportion, and n is the sample size. This formula is used when n/N ≤

0.05 where N is the population size. However, if n/N is greater than 0.05, then σp̄ is calculated as

follows.

$$\sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} \sqrt{\frac{N - n}{N - 1}}$$

where the factor $\sqrt{\frac{N - n}{N - 1}}$ is the finite population correction factor.
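As a numerical check, the two formulas can be compared with a direct enumeration for the five employees of Example 7, where p = 0.60, n = 3, and N = 5, so that n/N exceeds 0.05. The Python sketch below encodes "knows statistics" as 1 and 0; this coding and the variable names are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean, pstdev

knows = [1, 0, 0, 1, 1]   # Ally, John, Susan, Peter, Tom (1 = knows statistics)
N, n = len(knows), 3
p = sum(knows) / N        # population proportion, 0.60

# Sample proportion for every possible sample of three employees.
sample_proportions = [sum(sample) / n for sample in combinations(knows, n)]

mean_p_bar = fmean(sample_proportions)   # should equal p
sd_p_bar = pstdev(sample_proportions)    # should equal the corrected formula

# n/N = 0.6 > 0.05, so the finite population correction factor is included.
sd_formula = sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))

print(f"{mean_p_bar:.2f} {sd_p_bar:.2f} {sd_formula:.2f}")  # 0.60 0.20 0.20
```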

• Shape of the Sampling Distribution of p̄

Central limit theorem – The sampling distribution of p̄ is approximately normal for a sufficiently

large sample size. In the case of proportion, the sample size n is considered to be sufficiently

large if np and n(1 − p) are both at least 5, that is, if np ≥ 5 and n(1 − p) ≥ 5.

Sampling Distribution of p̄

Mean              µp̄ = p
Standard error    σp̄ = √(p(1 − p)/n)
Shape             Approximately normal if np ≥ 5 and n(1 − p) ≥ 5
Notation          p̄ ∼ N(p, (√(p(1 − p)/n))²)

7.7 Applications of the Sampling Distribution of p̄

Example 8 The election returns showed that a certain candidate received 46% of the votes.

(a) Determine the probability that a poll of 200 people selected at random from the voting popula-

tion would have shown a majority (over 50%) of votes in favor of the candidate.

(b) 95% of the sample proportions will be greater than what value?

Solution: (a) From the given information,

p = 0.46 and 1 − p = 1 − 0.46 = 0.54

The mean and standard deviation of the sample proportion p̄ are

$$\mu_{\bar{p}} = p = 0.46 \quad\text{and}\quad \sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{(0.46)(0.54)}{200}} = 0.0352.$$

As np = 200(0.46) = 92 and n(1 − p) = 200(0.54) = 108 are both greater than 5, we can infer

from the central limit theorem that the sampling distribution of p̄ is approximately normal. Thus,

approximately, $\bar{p} \sim N\left(0.46, (0.0352)^2\right)$. A majority is indicated in the sample if the proportion in favor of the

candidate exceeds 0.5.

$$P(\bar{p} > 0.5) = P\left(\frac{\bar{p} - \mu_{\bar{p}}}{\sigma_{\bar{p}}} > \frac{0.5 - 0.46}{0.0352}\right) = P(Z > 1.14) = 0.1271$$

(b) Let A be the required value. Given P (p̄ > A) = 0.95, from the normal table, we have

P (Z > −1.645) = 0.95. Hence

A = µp̄ + Zσp̄ = 0.46 + (−1.645) × 0.0352 = 0.4021
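The calculations in Example 8 can be sketched in Python as follows, again using the normal approximation justified by np ≥ 5 and n(1 − p) ≥ 5. The variable names are illustrative, and the results may differ slightly from the table-based answers because Z and σp̄ are not rounded along the way.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.46, 200

# Conditions for the normal approximation to the distribution of p-bar.
large_enough = n * p >= 5 and n * (1 - p) >= 5

sigma_p_bar = sqrt(p * (1 - p) / n)
p_bar = NormalDist(p, sigma_p_bar)

# (a) Probability that the poll shows a majority for the candidate.
p_majority = 1 - p_bar.cdf(0.5)

# (b) Value exceeded by 95% of sample proportions (the 5th percentile).
a = p_bar.inv_cdf(0.05)

print(f"sigma = {sigma_p_bar:.4f}, P(p-bar > 0.5) = {p_majority:.4f}, A = {a:.4f}")
```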

7.8 Sample Surveys and Sampling Techniques

7.8.1 Why Sample?

Most of the time surveys are conducted by using samples and not a census of the population. Some

of the main reasons for conducting a sample survey instead of a census are as follows.

Time In most cases, the size of the population is quite large. Consequently, conducting a

census will take a long time, whereas a sample survey can be conducted very quickly. It will be

time consuming to interview or contact hundreds of thousands or even millions of members of a

population. On the other hand, a survey of a sample of a few hundred elements may be completed

in little time. In fact, because of the amount of time needed to conduct a census, by the time the

census is completed the results may be obsolete.

Cost The cost of collecting information from all members of a population may easily fall

outside the limited budget of most, if not all, surveys. Consequently, to stay within the available

resources, conducting a sample survey may be the best approach.

Impossibility of Conducting a Census Sometimes it is impossible to conduct a census.

First, it may not be possible to identify and access each member of the population. For example,

if a researcher wants to conduct a survey about homeless people, it will not be possible to locate

each member of the population and include him/her in the survey. Second, sometimes conducting

a survey means destroying the items included in the survey. For example, to estimate the mean

life of light bulbs would necessitate burning out all the bulbs included in the survey. In such cases,

only a portion of the population can be selected for the survey.

7.8.2 Random and Nonrandom Samples

Depending on how a sample is drawn, it may be a random sample or a nonrandom sample.

Definition 6 A random sample is a sample drawn in such a way that each member of the

population has some chance of being selected in the sample. In a nonrandom sample, some

members of the population may not have any chance of being selected in the sample.

Suppose we have a list of 100 students and we want to select 10 of them. If we write the names

of all 100 students on pieces of paper, put them in a box, mix them, and then draw 10 names,

the result will be a random sample of 10 students. However, if we arrange the names of these 100

students alphabetically and pick the first 10 names, it will be a nonrandom sample because the

students who are not among the first 10 have no chance of being selected in the sample.

Note that for a random sample, each member of the population may or may not have the same

chance of being included in the sample.

Three types of nonrandom samples are a convenience sample, a judgment sample and a

quota sample.

In a convenience sample, the most accessible members of the population are selected to obtain

the results quickly. For example, an opinion poll may be conducted in a few hours by collecting

information from certain shoppers at a single shopping mall.

In a judgment sample, the members are selected from the population based on the judgment and

prior knowledge of an expert. Although such a sample may happen to be a representative sample,

the chances of it being so are small. If the population is large, it is not an easy task to select a

representative sample based on judgment.

In quota sampling, randomness is forfeited in the interests of cheapness and administrative

simplicity. Investigators are told to interview all the people they meet up to a certain quota. A

large degree of bias could be introduced accidentally. For example, an interviewer in a shopping

centre may fill his quota by only meeting people who can go shopping during the week. In practice,

this problem can be partly overcome by subdividing the quota into different types of people, for

example on the basis of age, sex and income, to ensure that the sample mirrors the structure or

stratification of the population.

7.8.3 Random Sampling Techniques

There are many ways to select a random sample. Four of these techniques are discussed below.

Simple Random Sampling A sample that assigns the same probability of being selected to

each member of the population is called a simple random sample.

Definition 7 A simple random sample is a sample that is selected in such a way that each

member of the population has the same chance of being included in the sample.

One way to select a simple random sample is by a lottery or drawing. For example, if we need

to select five students from a class of 50, we write each of the 50 names on separate pieces of paper.

Then, we place all 50 names in a box and mix them thoroughly. Next, we draw one name randomly

from the box. We repeat this experiment four more times. The five drawn names comprise a simple

random sample.

Tables of Random Numbers

The most commonly used inanimate device for introducing chance into the sampling process

is a table of random numbers (or table of random digits). Such a table, which typically has been

created with a computer random-number-generating function, consists of thousands of digits, each

of which is any one of the ten numbers from 0 to 9. Every digit has, in essence, been selected by

a simple random sample from the numbers 0 to 9. Consequently, the numbers 0 to 9 are equally

likely to appear in any digit-position in the table, and there are no systematic connections between

digits. Table E.1 (Random Numbers) in the Appendix is such a table of random numbers.

The second procedure to select a simple random sample is to use a table of random numbers.

Table E.1 in the Appendix lists random numbers. These numbers are generated by a random process.

Suppose we have a group of 1000 persons and we need to select 12 persons randomly from this

group. To select a simple random sample, we arrange the names of all 1000 persons in alphabetic

order and assign a three-digit number, from 000 to 999, to each person.

Next, we use the table of random numbers to select 12 persons. The random numbers in Table

E.1 are recorded in blocks of five digits. To use this table, we can start anywhere.

One way to do so is to close our eyes and put a finger anywhere on the page and start at that

point. From there, we can move in any direction. We need to pick three-digit numbers from the

table because we have assigned three-digit numbers to the 1000 persons in our population.

Suppose we start at Row 08 and Column 6 of Table E.1. The first block of three numbers in

Table E.1 is 938. We use the three digits of this block to select the first person from the population.

Hence, the first person selected is the one with the number 938. Suppose we move along the row

to the right to make the next selection. The second block of three numbers in Table E.1 is 090.

Consequently, the second person selected is the one with the number 090. We continue this process

until all 12 required persons are selected. This gives us a simple random sample of 12 persons. Since

we are sampling without replacement, any repeated random number (such as a second occurrence of 090) is

discarded, so that a sample of 12 unique persons is obtained.
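The same procedure can be mimicked in software, with a pseudorandom number generator standing in for Table E.1. The sketch below is a hypothetical illustration: the seed is arbitrary, and `random.randrange(1000)` plays the role of reading the next three-digit block from the table.

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable

population_size = 1000  # persons numbered 000 to 999
sample_size = 12

selected = []
while len(selected) < sample_size:
    number = random.randrange(population_size)  # next three-digit number
    if number not in selected:                  # without replacement: discard repeats
        selected.append(number)

print([f"{number:03d}" for number in selected])
```

In practice, `random.sample(range(1000), 12)` produces a simple random sample without replacement in a single call; the explicit loop is shown only because it mirrors the table-based steps.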

Systematic Random Sampling The simple random sampling procedure will become very te-

dious if the size of the population is large. For example, if we need to select 150 households from

a list of 45,000, it will be very time consuming either to write the 45,000 names on pieces of paper

and then select 150 households or to assign a five-digit number to each of the 45,000 households and

then select 150 households using the table of random numbers. In such cases, it is more convenient

to use systematic random sampling.

The procedure to select a systematic random sample is as follows. In the example just mentioned,

we would arrange all 45,000 households alphabetically (or based on some other characteristic). Since

the sample size should equal 150, the ratio of population to sample size is 45,000/150 = 300. Using

this ratio, we randomly select one household from the first 300 households in the arranged list either

by using the lottery system or by using a table of random numbers. Suppose by using either of

these methods, we select the 210th household. We then select the 210th household from each

subsequent block of 300 households in the list. In other words, our sample includes the households with numbers 210,

510, 810, 1110, 1410, 1710, and so on.

Definition 8 In systematic random sampling, we first randomly select one member from the

first k units. Then every kth member, starting with the first selected member, is included in the

sample.

Note that systematic random sampling does not give a simple random sample because we cannot

select two adjacent elements. Hence, not every possible sample of a given size has the same

probability of being selected.
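The household example can be expressed as a short Python function. This is a minimal sketch that assumes the population size is a multiple of the sample size; the function name `systematic_sample` and its parameters are illustrative, and positions are 1-based to match the numbering in the text.

```python
import random


def systematic_sample(population_size, sample_size, start=None):
    """Return 1-based list positions chosen by systematic random sampling."""
    k = population_size // sample_size    # sampling interval (300 in the example)
    if start is None:
        start = random.randint(1, k)      # random start among the first k units
    return list(range(start, population_size + 1, k))[:sample_size]


# Reproduce the example in the text, where the random start happens to be 210.
positions = systematic_sample(45_000, 150, start=210)
print(positions[:6])    # [210, 510, 810, 1110, 1410, 1710]
print(len(positions))   # 150
```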

Stratified Random Sampling Suppose we need to select a sample from the population of a

city and we want households with different income levels to be equally represented in the sample.

In this case, instead of selecting a simple random sample or a systematic random sample, we may

prefer to apply a different technique. First, we divide the whole population into different groups

based on income levels. For example, we may form three groups of low-, medium-, and high-income

households. We will now have three subpopulations, which are usually called strata.

We then select one sample from each subpopulation or stratum. The collection of all three

samples selected from three strata gives the required sample, called the stratified random sample.

Usually, the sizes of the samples selected from different strata are proportionate to the sizes of the

subpopulations in these strata. Note that the elements of each stratum are identical with regard to

the possession of a characteristic.

Definition 9 In a stratified random sample, we first divide the population into subpopulations,

which are called strata. Then, one sample is selected from each of these strata. The collection of

all samples from all strata gives the stratified random sample.

Thus, whenever we observe that a population differs widely in the possession of a characteristic,

we may prefer to divide it into different strata and then select one sample from each stratum.

We can divide the population on the basis of any characteristic, such as income, expenditure, sex,

education, race, employment, or family size.

Cluster Sampling Sometimes the target population is scattered over a wide geographical area.

Consequently, if a simple random sample is selected, it may be costly to contact each member of

the sample. In such a case, we divide the population into different geographical groups or clusters

and as a first step select a random sample of certain clusters from all clusters. We then take all

elements from each selected cluster. For example, suppose we are to conduct a survey of households

in Hong Kong. First, we divide the whole of Hong Kong into, say, 40 regions, which will be called

clusters or primary units. We make sure that all clusters are similar and, hence, representative

of the population.

We then select at random, say, 5 clusters from the 40. Next, we conduct a survey of all

households in each of these 5 clusters. This is called

cluster sampling. Note that all clusters must be representative of the population.

Definition 10 In cluster sampling, the whole population is first divided into (geographical)

groups called clusters. Each cluster is representative of the population. Then a random sample of

clusters is selected. Finally, all elements in each of the selected clusters are included in the sample.
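Definition 10 can be sketched in Python as follows. Everything about the sampling frame here is hypothetical: 40 regions with 50 household IDs each, made-up ID strings, and an arbitrary seed. The point is only the two steps, namely a random sample of clusters followed by inclusion of every element in the chosen clusters.

```python
import random

random.seed(2)  # arbitrary seed so the run is repeatable

# Hypothetical frame: 40 regions (clusters), each holding 50 household IDs.
clusters = {
    region: [f"R{region:02d}-H{house:03d}" for house in range(50)]
    for region in range(1, 41)
}

# Step 1: a simple random sample of 5 clusters out of the 40.
chosen_regions = random.sample(sorted(clusters), 5)

# Step 2: include every household in each chosen cluster.
sample = [
    household for region in chosen_regions for household in clusters[region]
]

print(sorted(chosen_regions), len(sample))
```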

Example 9 A company owning a chain of newsagents wishes to undertake a customer service

survey. Interviewers will be despatched to a sample of 100 branches to question customers in the

shops. The number of newsagents owned in each area is as follows.

Central 360

North West 240

North 200

North East 100

Greater London 700

Scotland 200

Wales 140

South 60

Explain how the 100 newsagents might be chosen, given the relative advantages and disadvantages

of each method, if the survey is to be performed using the following sampling methods.

(a) Stratified random sampling

(b) Systematic sampling

(c) Cluster sampling

Solution:

(a) Stratified sampling

Each area could be taken as a stratum. There are 2,000 newsagents altogether, so 100/2,000 =

5% of the newsagents in each area must be included in the sample.

In the Central area, for example, 360 × 5% = 18 newsagents would be included. These should

be selected randomly from the 360 newsagents. The newsagents should be numbered from 000 to

359, and three-digit random numbers used to select the sample.
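The proportional allocation for all eight areas, and the random selection within each stratum, can be sketched in Python as below. The area counts come from the example; the seed, the 0-based branch numbering, and the variable names are illustrative assumptions.

```python
import random

random.seed(3)  # arbitrary seed so the run is repeatable

branches = {
    "Central": 360, "North West": 240, "North": 200, "North East": 100,
    "Greater London": 700, "Scotland": 200, "Wales": 140, "South": 60,
}

total = sum(branches.values())   # 2,000 newsagents
sample_size = 100

# Proportional allocation: 5% of each stratum. Integer division is exact here
# because every share is a whole number; in general the shares need rounding.
allocation = {
    area: count * sample_size // total for area, count in branches.items()
}

# Within each stratum, pick branch numbers 0 .. count - 1 by simple random sampling.
sample = {
    area: sorted(random.sample(range(branches[area]), allocation[area]))
    for area in branches
}

print(allocation)
```

The allocation gives 18 branches for Central, 12 for North West, 10 for North, 5 for North East, 35 for Greater London, 10 for Scotland, 7 for Wales, and 3 for South, which sums to the required 100.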

Stratified sampling has the advantage that the same proportion of newsagents from each area

will be included in the sample. It gives a closer approximation to random sampling than cluster

sampling, and may give a closer approximation to random sampling than systematic sampling.

Statistical calculations generally require that the samples on which they are based be random or

nearly so.

The disadvantage of stratified sampling is that it is likely to be more expensive than the other

two methods.

(b) Systematic sampling

In order to take a systematic sample, the 2,000 newsagents must first be arranged in some order.

Any order will do, but orders which might produce some cyclical patterns (such as by size within

one area, then by size within the next area, and so on) should be avoided if possible. The sample

then comprises every 2,000/100 = 20th newsagent in the order, with the first newsagent chosen at

random from among the first 20 in the order.

The advantage of systematic sampling is that, provided a complete list of newsagents is available,

it is cheap and easy to obtain the sample. The disadvantage is that if there is any cyclical pattern

in the ordering, an unrepresentative sample may be obtained.

(c) Cluster sampling

Each area could be divided into clusters (perhaps towns and their surrounding areas), and a

random sample of clusters from all areas would then be taken. The sample would comprise all the

newsagents within the selected clusters.

This method is fairly cheap to use, but the clusters must first be identified. It is possible that

a sample smaller or larger than 100 newsagents will be obtained.
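
The point about the sample size can be illustrated with a small simulation. The cluster sizes below are hypothetical (the example does not give any); they are chosen only so that 100 clusters hold 2,000 newsagents in total.

```python
import random

# Hypothetical cluster sizes (NOT from the example): 100 clusters, 2,000 newsagents.
cluster_sizes = [10, 15, 20, 25, 30] * 20
total_newsagents = sum(cluster_sizes)

average_size = total_newsagents / len(cluster_sizes)  # 20 newsagents per cluster
clusters_to_pick = round(100 / average_size)          # 5 clusters aim at about 100

rng = random.Random(3)  # arbitrary fixed seed (an assumption, for repeatability)
realised_sizes = []
for trial in range(5):
    # Choose whole clusters at random; every newsagent in them joins the sample.
    chosen = rng.sample(range(len(cluster_sizes)), clusters_to_pick)
    realised = sum(cluster_sizes[c] for c in chosen)
    realised_sizes.append(realised)
    print(f"trial {trial}: clusters {sorted(chosen)} -> {realised} newsagents")
```

With these assumed sizes, five clusters can yield anywhere from 50 to 150 newsagents, so the realised sample only averages 100.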


7.8.4 Sampling and Nonsampling Errors

The results obtained from a sample survey may contain two types of errors: sampling errors and

nonsampling errors. The sampling error is also called the chance error, and nonsampling errors

are also called systematic errors.

Sampling or Chance Error Usually, all samples taken from the same population will give

different results because they contain different elements of the population. Moreover, the results

obtained from any one sample will not be exactly the same as the ones obtained from a census.

The difference between a sample result and the result we would have obtained by conducting a

census is called the sampling error, assuming that the sample is random and no nonsampling error

has been made.

Definition 11 The sampling error is the difference between the result obtained from a sample

survey and the result that would have been obtained if the whole population had been included in the

survey.

The sampling error occurs because of chance, and it cannot be avoided. A sampling error can

occur only in a sample survey. It does not occur in a census.
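
To make the definition concrete, the sketch below reuses the five midterm scores from Example 1 and computes the sampling error (sample mean minus population mean) for each of the 10 possible samples of three scores from Example 2. It assumes no nonsampling error is present; the variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

population = [70, 78, 80, 80, 95]  # midterm scores from Example 1
mu = mean(population)              # population mean (the census result), 80.6

# One sampling error per possible sample of size 3, drawn without replacement.
samples = list(combinations(population, 3))
sampling_errors = [mean(s) - mu for s in samples]

for s, e in zip(samples, sampling_errors):
    print(s, f"mean = {mean(s):6.2f}", f"sampling error = {e:+.2f}")
```

Every sample gives a different error, some positive and some negative, and across all 10 equally likely samples the errors average to zero.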

Nonsampling or Systematic Errors Nonsampling errors can occur both in a sample survey

and in a census. Such errors occur because of human mistakes and not chance.

Definition 12 The errors that occur in the collection, recording, and tabulation of data are called

nonsampling errors.

Nonsampling errors can be minimized if questions are prepared carefully and data are handled

cautiously. There are

many types of systematic errors or biases that can occur in a survey including coverage error,

nonresponse error, response error, and voluntary response error.

Coverage Error When we need to select a sample, we use a list of elements from which we

draw a sample, and this list usually omits some members of the target population. Most

of the time it is not feasible to include every member of the target population in this list. This list

of members of the population that is used to select a sample is called the sampling frame.


For example, if we use a telephone directory to select a sample, the list of names that appears in

this directory makes the sampling frame. In this case we will miss the people who are not listed in

the telephone directory. The people we miss, for example, will be poor people (including homeless

people) who do not have telephones and people who do not want to be listed in the directory. Thus,

the sampling frame that is used to select a sample may not be representative of the population.

This may cause the sample results to be different from the population results. The error that occurs

because the sampling frame is not representative of the population is called the coverage error.

Definition 13 The list of members of the target population that is used to select a sample is called

the sampling frame. The error that occurs because the sampling frame is not representative of

the population is called the coverage error.

If a sample is nonrandom (and, hence, nonrepresentative), the sample results may be quite

different from the census results.

Nonresponse Error Even if our sampling frame and, consequently, the sample are representative

of the population, the nonresponse error may occur because many of the people included in

the sample do not respond to the survey.

Definition 14 The error that occurs because many of the people included in the sample do not

respond to a survey is called the nonresponse error.

This type of error occurs especially when a survey is conducted by mail. A lot of people do not

return the questionnaires. It has been observed that families with low and high incomes do not

respond to surveys by mail. Consequently, such surveys overrepresent middle-income families. This

kind of error occurs in other types of surveys, too. For instance, in a face-to-face survey where the

interviewer interviews people at their homes, many people may not be home when the interviewer

visits their homes. The people who are home at the time the interviewer visits their homes and

the ones who are not home at that time may differ in many respects, causing a bias in the survey

results. This kind of error may also occur in a telephone survey. Many people may not be home

when the interviewer calls. This may distort the results. To avoid the nonresponse error, every

effort should be made to contact all people included in the survey.
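
The overrepresentation of middle-income families can be illustrated with a short calculation. All figures below are hypothetical, chosen only to show the mechanism; they are not survey results.

```python
# Hypothetical mail survey (all numbers are assumptions for illustration).
mailed = {"low": 300, "middle": 400, "high": 300}        # questionnaires sent
return_rate_pct = {"low": 20, "middle": 60, "high": 20}  # assumed return rates (%)

# Questionnaires actually returned by each income group.
returned = {g: mailed[g] * return_rate_pct[g] // 100 for g in mailed}

# Compare each group's share of the selected sample with its share of the responses.
share_mailed = {g: mailed[g] / sum(mailed.values()) for g in mailed}
share_returned = {g: returned[g] / sum(returned.values()) for g in returned}

for g in mailed:
    print(f"{g:6s} selected: {share_mailed[g]:.0%}  responded: {share_returned[g]:.0%}")
```

Under these assumed rates, middle-income families make up 40% of the selected sample but about two thirds of the responses, so results computed from the responses alone lean toward that group.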


Response Error The response error occurs when the answer given by a person included in the

survey is not correct. This may happen for many reasons. One reason is that the respondent may not

have understood the question. Thus, the wording of the question may have caused the respondent

to answer incorrectly. It has been observed that when the same question is worded differently,

many people do not respond the same way. Usually such an error on the part of respondents is not

intentional.

Definition 15 The response error occurs when people included in the survey do not provide

correct answers.

Sometimes the respondents do not want to give correct information when answering a question.

For example, many respondents will not disclose their true incomes on questionnaires or in

interviews. When information on income is provided, it is almost always biased in the upward

direction.

Sometimes the race of the interviewer may affect the answers of respondents. This is especially

true if the questions asked are about race relations. The answers given by respondents will differ

depending on whether the interviewer is white or nonwhite.

Voluntary Response Error Another source of systematic error is a survey based on a vol-

untary response sample.

Definition 16 Voluntary response error occurs when a survey is not conducted on a randomly

selected sample but a questionnaire is published in a magazine or newspaper and people are invited

to respond to that questionnaire.

The polls conducted based on samples of readers of magazines and newspapers suffer from

voluntary response error or bias. Usually only those readers who have very strong opinions about

the issues involved respond to such surveys. Surveys in which the respondents are required to

call 900 telephone numbers also suffer from this type of error. Here, in order to participate, a

respondent must pay for the call, and many people do not want to bear this cost. Consequently, the

sample is usually neither random nor representative of the target population because participation

is voluntary.

Example 10 Why is the following true story an example of nonrandom sampling, with both selec-

tion bias and voluntary response bias?


The 1936 presidential election in the United States had two major candidates: the Republican, Al-

fred M. Landon, and the Democrat, the incumbent president, Franklin D. Roosevelt. Several weeks

before the election, Literary Digest magazine tried to predict the outcome by mailing 10 million ques-

tionnaires to people selected from three sources: the subscription list for the magazine, telephone

directories, and automobile registration records. The magazine received back approximately 2.5 mil-

lion answers, and of these some 57% favored Landon. From these results the magazine predicted a

landslide victory for Landon. A few weeks later, however, in the actual election, it was Roosevelt

who got the majority of the votes (62%).

Solution: This is an example of nonrandom sampling because by limiting the sample to magazine

subscribers and to owners of telephones and automobiles, most of the voting population had a zero

probability of being included in the sample. The time was 1936, in the depths of the Depression,

and this judgement selection limited the sample to a relatively prosperous stratum of the

population. Besides this severe selection bias, produced by a discrepancy between the target population

and the sampling frame, there was also a voluntary response bias. This voluntary response bias,

also called self-selection bias, occurred because only about 25% of the selected sample returned their

questionnaires. Thus even for this chosen stratum of the population, the probabilities for inclusion

in the sample were unknown before sampling.
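
The figures quoted in the example can be checked with a few lines of arithmetic. The sketch uses only the numbers given above; the bound on Landon's actual share assumes nothing more than that he cannot have received votes that went to Roosevelt.

```python
mailed = 10_000_000      # questionnaires sent out
returned = 2_500_000     # approximate number of answers received
predicted_landon = 0.57  # share of returned answers favouring Landon
actual_roosevelt = 0.62  # Roosevelt's share in the actual election

response_rate = returned / mailed  # about 0.25, the 25% quoted in the solution

# Landon's actual share is at most whatever Roosevelt did not receive.
max_actual_landon = 1 - actual_roosevelt                 # at most 0.38
min_overestimate = predicted_landon - max_actual_landon  # at least 0.19

print(f"response rate: {response_rate:.0%}")
print(f"Landon overestimated by at least {min_overestimate:.0%} of the vote")
```

Despite roughly 2.5 million answers, the poll overstated Landon's support by at least 19 percentage points, which shows that a large sample does not compensate for selection bias and self-selection.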
