146 05 sampling_methods online

MATH& 146

Lesson 5

Section 1.4

Sampling Methods

1

Data Collection

There are two primary types of data collection:

observational studies and experiments.

Generally, data in observational studies are

collected only by monitoring what occurs, while

experiments require the primary explanatory

variable in a study be assigned for each subject by

the researchers.

2

Observational Studies

Researchers perform an observational study when

they collect data in a way that does not directly

interfere with how the data arise.

For instance, researchers may collect information

via surveys, review medical or company records,

or follow a cohort of many similar individuals to

study why certain diseases might develop.

3

Observational Studies

In each of these situations, researchers merely

observe the data that arise.

In general, observational studies can provide

evidence of a naturally occurring association

between variables, but they cannot by themselves

show a causal connection.

4

Experiments

When researchers want to investigate the possibility of

a causal connection, they conduct an experiment.

For instance, we may suspect administering a drug will

reduce mortality in heart attack patients over the

following year. To check if there really is a causal

connection, researchers will collect a sample of

individuals and split them into groups. The individuals

in each group are assigned a treatment.

5

Experiments

When individuals are randomly assigned to a group,

the experiment is called a randomized experiment.

For example, each heart attack patient in the drug trial

could be randomly assigned into one of two groups,

perhaps by flipping a coin: the first group (called the

control group) receives a placebo and the second

group (called the treatment group) receives the drug.

6

Association ≠ Causation

In data analysis, association does not imply causation,

and causation can only be inferred from a randomized

experiment.

Making causal conclusions based on experiments is

often reasonable. However, making the same causal

conclusions based on observational data can be

treacherous and is not recommended. Thus,

observational studies are generally only sufficient to

show associations.

7

Example 1

Suppose an observational study tracked

sunscreen use and skin cancer, and it was found

that the more sunscreen someone used, the more

likely the person was to have skin cancer. Does

this mean sunscreen causes skin cancer?

8

Confounding Variables

Some previous research tells us that using

sunscreen actually reduces skin cancer risk, so

maybe there is another variable that can explain

this association between sunscreen usage and

skin cancer.

One important piece of information that is absent is

sun exposure.

9


If someone is out in the sun all day, they would be

more likely to use sunscreen and more likely to get

skin cancer. Exposure to the sun is unaccounted for in

this investigation.

Sun exposure is what is called a confounding

variable, which is a variable that is correlated with

both the explanatory and response variables, but was

not accounted for in the study.10


A drawback of observational studies is that we can

never know whether we have accounted for all

possible confounding variables. We can search

very hard for it, but just because we don't find a

confounding variable does not mean it isn't there.

For this reason, we can never make cause-and-

effect conclusions from observational studies.

11

Example 2

In a paper titled "Kimchi and Soybean Pastes Are Risk

Factors of Gastric Cancer," researchers noted that

gastric cancer rates were ten times higher in Korea

and Japan than in the United States. Koreans and

Japanese also eat many times more kimchee than

those living in the United States.

On the basis of this information, can we conclude that

kimchee causes gastric cancer? If yes, explain why. If

no, state a potential confounding variable.

12

What English can do to you …

The Japanese eat very little fat and suffer fewer heart

attacks than the British or Americans.

On the other hand, the French eat a lot of fat and also

suffer fewer heart attacks than the British or Americans.

The Japanese drink very little red wine and suffer fewer

heart attacks than the British or Americans.

On the other hand, the Italians drink excessive amounts of

red wine and also suffer fewer heart attacks than the British

or Americans.

Conclusion: Eat and drink what you like. It's speaking

English that kills you.

13

Sampling Methods

Almost all statistical methods are based on the notion

of implied randomness. If observational data are not

collected in a random framework from a population,

results from these statistical methods are not reliable.

Here we consider four random sampling techniques:

• Simple random sampling

• Stratified sampling

• Cluster sampling (and multistage sampling)

• Systematic sampling (sometimes)

14

Simple Random

Simple random sampling is probably the most

intuitive form of random sampling. Cases are

randomly selected from the population until the desired

sample size is reached.

15

Simple Random

Consider the salaries of Major League Baseball (MLB)

players, where each player is a member of one of the

league's 30 teams.

To take a simple random sample of 120 baseball

players and their salaries from the 2010 season, we

could write the names of that season's 828 players

onto slips of paper, drop the slips into a bucket, shake

the bucket around until we are sure the names are all

mixed up, then draw out slips until we have the sample

of 120 players.

16

Simple Random

In general, a sample is referred to as "simple random"

if each case in the population has an equal chance of

being included in the final sample and knowing that a

case is included in a sample does not provide useful

information about which other cases are included.

17

Stratified

Stratified sampling is a divide-and-conquer strategy.

The population is first divided into groups, called

strata.

18

Stratified

The strata are chosen so that similar cases are

grouped together, then a second sampling method,

usually simple random sampling, is employed within

each stratum.

In the baseball salary example, each position (pitcher,

catcher, shortstop, etc.) could represent a strata. We

might randomly survey 10 players from each position

to make up our sample.

19

Stratified

Stratified sampling is especially useful when the

cases in each stratum are very similar with respect

to the outcome of interest.

The downside is that analyzing data from a

stratified sample is a more complex task than

analyzing data from a simple random sample. A

second course in statistics might even be needed.

20

Example 3

Why would it be good for cases within each

stratum to be very similar?

Answer: We might get a more stable estimate for

the subpopulation in a stratum if the cases are

very similar. These improved estimates for each

subpopulation will help us build a reliable estimate

for the full population.

21

Cluster

In cluster sampling, we group observations into

clusters, then randomly choose some of the

clusters.

22

Cluster

Sometimes cluster sampling can be a more

economical (cheaper) technique than the alternatives.

Also, unlike stratified sampling, cluster sampling is

most helpful when there is a lot of case-to-case

variability within a cluster but the clusters themselves

don't look very different from one another.

23

Cluster

In the baseball salary example, each of the 30 teams

could represent a cluster. We might randomly sample

4 of the teams, then survey every player on those

teams.

A downside of cluster sampling is that more advanced

analysis techniques are typically required, though the

methods in this course can be extended to handle

such data.

24

Example 4

Suppose we are interested in estimating the

malaria rate in a densely tropical portion of rural

Indonesia. We learn there are 30 villages in that

part of the Indonesian jungle, each more or less

similar to the next. What sampling method should

be employed?

25

Example 4 Answer

A simple random sample would likely draw

individuals from all 30 villages, which could make

data collection extremely expensive.

Stratified sampling would be a challenge since it is

unclear how we would build strata of similar

individuals.

26

Example 4 Answer

However, cluster sampling seems like a very good

idea. We might randomly select a small number of

villages.

This would probably reduce our data collection

costs substantially in comparison to a simple

random sample and would still give us helpful

information.

27

Multistage Sampling

Multistage sampling is similar to cluster sampling,

except that we take a simple random sample (as

opposed to taking everyone) within each selected

cluster.

For instance, if we sampled neighborhoods using

cluster sampling, we would next sample a subset

of homes within each selected neighborhood if we

were using multistage sampling.

28

Systematic

In a systematic sample, a starting point is randomly

selected and then every kth piece of data from a listing

of the population is added to the sample.

29

Systematic

For example, suppose you have to do a phone survey.

Your phone book contains 20,000 residence listings.

You must choose 400 names for the sample. Since

20,000 ÷ 400 = 50, you could randomly choose

between 1 and 50 to represent the first name, then

take every 50th name after that.

Assuming that no patterns exist in the list, your sample

should be representative of the population.

30

Systematic

The advantage of systematic samples is that it can be

fast and convenient when you already have a list of

your population. But the disadvantage is that a

systematic sample might lead to bias. If the population

is arranged in any kind of repeating pattern, then you

should use systematic samples with caution.

For example, suppose you choose every 4th person

from a list that is arranged boy-girl-boy-girl. Your

sample will be either all boys or all girls, and not

representative of the overall population.

31

Biased Samples

When samples are selected without any regard to

random selection, a biased sample may result. A

biased sample causes problems because any statistic

computed from that sample has the potential to be

consistently erroneous.

Two commonly used sampling methods that result in

biased samples are the convenience and volunteer

samples.

32

Convenience

Convenience sampling is a sampling technique

where subjects are selected because of their

convenient accessibility and proximity to the

researcher. The results of convenience sampling may

be very good in some cases and highly biased (favors

certain outcomes) in others.

For example, did you ever buy a basket of fruit based

on the "good appearance" of the fruit on top, only to

later discover the rest of the fruit was not as fresh? It

was too inconvenient to inspect the bottom fruit, so

you trusted a convenience sample.

33

Volunteer

A volunteer sample consists of results collected from

those elements of the population that chose to

contribute the needed information on their own

initiative.

Voluntary response samples are always biased: they

only include people who choose to volunteer (those

who tend to have strong opinions on the subject),

whereas a random sample would need to include

people whether or not they choose to volunteer.

34

Example 5

Administrators of the fire department are concerned

about the possibility of implementing a new property

tax to raise moneys needed to replace old equipment.

They decide to check on public opinion by taking

sample of the city's population. Several plans for

choosing the sample are proposed.

Determine the sampling design (simple random,

stratified, cluster, systematic, convenience) next to

each of the following plans.

35

continued

Example 5 continued

a) The city has five property classifications: single

family homes, apartments, condominiums,

temporary housing (hotel and campgrounds), and

retail property. Randomly select ten residents from

each category.

b) Each property owner has a five-digit ID number.

Use random numbers to choose fifty numbers.

c) Go to the city park and survey every tenth person

who arrives. Stop after the sample has reached

fifty.

36

continued

Example 5 continued

d) Randomly select three neighborhoods and survey

all the people who live in those neighborhoods.

e) Have ten firefighters each survey five of his/her

neighbors.

f) Set up a table outside the local supermarket, then

offer a gift to everyone who agrees to fill out the

survey.

37

Example 6

Describe the sampling method used for each sampling

plan.

a) A quality control engineer at a manufacturing

company samples every tenth production lot in

order to estimate the proportion of defective items

being produced.

b) A journalist attends a town hall meeting and

surveys the attendees regarding the upcoming

election for mayor.

continued38

Example 6 continued

c) Students at a state university are sorted according

to class (freshman, sophomore, junior, senior).

The registrar then obtains a random sample of size

150 from each class.

d) A cable company divides its service area into

neighborhoods, randomly selects 30 of these

neighborhoods and then within these 30

neighborhoods, surveys each residence to

determine which of the cable company’s services

the residence is using.

39

146 05 sampling_methods online

Education