146 05 sampling_methods online
TRANSCRIPT
MATH& 146
Lesson 5
Section 1.4
Sampling Methods
1
Data Collection
There are two primary types of data collection:
observational studies and experiments.
Generally, data in observational studies are
collected only by monitoring what occurs, while
experiments require the primary explanatory
variable in a study be assigned for each subject by
the researchers.
2
Observational Studies
Researchers perform an observational study when
they collect data in a way that does not directly
interfere with how the data arise.
For instance, researchers may collect information
via surveys, review medical or company records,
or follow a cohort of many similar individuals to
study why certain diseases might develop.
3
Observational Studies
In each of these situations, researchers merely
observe the data that arise.
In general, observational studies can provide
evidence of a naturally occurring association
between variables, but they cannot by themselves
show a causal connection.
4
Experiments
When researchers want to investigate the possibility of
a causal connection, they conduct an experiment.
For instance, we may suspect administering a drug will
reduce mortality in heart attack patients over the
following year. To check if there really is a causal
connection, researchers will collect a sample of
individuals and split them into groups. The individuals
in each group are assigned a treatment.
5
Experiments
When individuals are randomly assigned to a group,
the experiment is called a randomized experiment.
For example, each heart attack patient in the drug trial
could be randomly assigned into one of two groups,
perhaps by flipping a coin: the first group (called the
control group) receives a placebo and the second
group (called the treatment group) receives the drug.
6
Association ≠ Causation
In data analysis, association does not imply causation,
and causation can only be inferred from a randomized
experiment.
Making causal conclusions based on experiments is
often reasonable. However, making the same causal
conclusions based on observational data can be
treacherous and is not recommended. Thus,
observational studies are generally only sufficient to
show associations.
7
Example 1
Suppose an observational study tracked
sunscreen use and skin cancer, and it was found
that the more sunscreen someone used, the more
likely the person was to have skin cancer. Does
this mean sunscreen causes skin cancer?
8
Confounding Variables
Some previous research tells us that using
sunscreen actually reduces skin cancer risk, so
maybe there is another variable that can explain
this association between sunscreen usage and
skin cancer.
One important piece of information that is absent is
sun exposure.
9
Confounding Variables
If someone is out in the sun all day, they would be
more likely to use sunscreen and more likely to get
skin cancer. Exposure to the sun is unaccounted for in
this investigation.
Sun exposure is what is called a confounding
variable, which is a variable that is correlated with
both the explanatory and response variables, but was
not accounted for in the study.10
Confounding Variables
A drawback of observational studies is that we can
never know whether we have accounted for all
possible confounding variables. We can search
very hard for it, but just because we don't find a
confounding variable does not mean it isn't there.
For this reason, we can never make cause-and-
effect conclusions from observational studies.
11
Example 2
In a paper titled "Kimchi and Soybean Pastes Are Risk
Factors of Gastric Cancer," researchers noted that
gastric cancer rates were ten times higher in Korea
and Japan than in the United States. Koreans and
Japanese also eat many times more kimchee than
those living in the United States.
On the basis of this information, can we conclude that
kimchee causes gastric cancer? If yes, explain why. If
no, state a potential confounding variable.
12
What English can do to you …
The Japanese eat very little fat and suffer fewer heart
attacks than the British or Americans.
On the other hand, the French eat a lot of fat and also
suffer fewer heart attacks than the British or Americans.
The Japanese drink very little red wine and suffer fewer
heart attacks than the British or Americans.
On the other hand, the Italians drink excessive amounts of
red wine and also suffer fewer heart attacks than the British
or Americans.
Conclusion: Eat and drink what you like. It's speaking
English that kills you.
13
Sampling Methods
Almost all statistical methods are based on the notion
of implied randomness. If observational data are not
collected in a random framework from a population,
results from these statistical methods are not reliable.
Here we consider four random sampling techniques:
• Simple random sampling
• Stratified sampling
• Cluster sampling (and multistage sampling)
• Systematic sampling (sometimes)
14
Simple Random
Simple random sampling is probably the most
intuitive form of random sampling. Cases are
randomly selected from the population until the desired
sample size is reached.
15
Simple Random
Consider the salaries of Major League Baseball (MLB)
players, where each player is a member of one of the
league's 30 teams.
To take a simple random sample of 120 baseball
players and their salaries from the 2010 season, we
could write the names of that season's 828 players
onto slips of paper, drop the slips into a bucket, shake
the bucket around until we are sure the names are all
mixed up, then draw out slips until we have the sample
of 120 players.
16
Simple Random
In general, a sample is referred to as "simple random"
if each case in the population has an equal chance of
being included in the final sample and knowing that a
case is included in a sample does not provide useful
information about which other cases are included.
17
Stratified
Stratified sampling is a divide-and-conquer strategy.
The population is first divided into groups, called
strata.
18
Stratified
The strata are chosen so that similar cases are
grouped together, then a second sampling method,
usually simple random sampling, is employed within
each stratum.
In the baseball salary example, each position (pitcher,
catcher, shortstop, etc.) could represent a strata. We
might randomly survey 10 players from each position
to make up our sample.
19
Stratified
Stratified sampling is especially useful when the
cases in each stratum are very similar with respect
to the outcome of interest.
The downside is that analyzing data from a
stratified sample is a more complex task than
analyzing data from a simple random sample. A
second course in statistics might even be needed.
20
Example 3
Why would it be good for cases within each
stratum to be very similar?
Answer: We might get a more stable estimate for
the subpopulation in a stratum if the cases are
very similar. These improved estimates for each
subpopulation will help us build a reliable estimate
for the full population.
21
Cluster
In cluster sampling, we group observations into
clusters, then randomly choose some of the
clusters.
22
Cluster
Sometimes cluster sampling can be a more
economical (cheaper) technique than the alternatives.
Also, unlike stratified sampling, cluster sampling is
most helpful when there is a lot of case-to-case
variability within a cluster but the clusters themselves
don't look very different from one another.
23
Cluster
In the baseball salary example, each of the 30 teams
could represent a cluster. We might randomly sample
4 of the teams, then survey every player on those
teams.
A downside of cluster sampling is that more advanced
analysis techniques are typically required, though the
methods in this course can be extended to handle
such data.
24
Example 4
Suppose we are interested in estimating the
malaria rate in a densely tropical portion of rural
Indonesia. We learn there are 30 villages in that
part of the Indonesian jungle, each more or less
similar to the next. What sampling method should
be employed?
25
Example 4 Answer
A simple random sample would likely draw
individuals from all 30 villages, which could make
data collection extremely expensive.
Stratified sampling would be a challenge since it is
unclear how we would build strata of similar
individuals.
26
Example 4 Answer
However, cluster sampling seems like a very good
idea. We might randomly select a small number of
villages.
This would probably reduce our data collection
costs substantially in comparison to a simple
random sample and would still give us helpful
information.
27
Multistage Sampling
Multistage sampling is similar to cluster sampling,
except that we take a simple random sample (as
opposed to taking everyone) within each selected
cluster.
For instance, if we sampled neighborhoods using
cluster sampling, we would next sample a subset
of homes within each selected neighborhood if we
were using multistage sampling.
28
Systematic
In a systematic sample, a starting point is randomly
selected and then every kth piece of data from a listing
of the population is added to the sample.
29
Systematic
For example, suppose you have to do a phone survey.
Your phone book contains 20,000 residence listings.
You must choose 400 names for the sample. Since
20,000 ÷ 400 = 50, you could randomly choose
between 1 and 50 to represent the first name, then
take every 50th name after that.
Assuming that no patterns exist in the list, your sample
should be representative of the population.
30
Systematic
The advantage of systematic samples is that it can be
fast and convenient when you already have a list of
your population. But the disadvantage is that a
systematic sample might lead to bias. If the population
is arranged in any kind of repeating pattern, then you
should use systematic samples with caution.
For example, suppose you choose every 4th person
from a list that is arranged boy-girl-boy-girl. Your
sample will be either all boys or all girls, and not
representative of the overall population.
31
Biased Samples
When samples are selected without any regard to
random selection, a biased sample may result. A
biased sample causes problems because any statistic
computed from that sample has the potential to be
consistently erroneous.
Two commonly used sampling methods that result in
biased samples are the convenience and volunteer
samples.
32
Convenience
Convenience sampling is a sampling technique
where subjects are selected because of their
convenient accessibility and proximity to the
researcher. The results of convenience sampling may
be very good in some cases and highly biased (favors
certain outcomes) in others.
For example, did you ever buy a basket of fruit based
on the "good appearance" of the fruit on top, only to
later discover the rest of the fruit was not as fresh? It
was too inconvenient to inspect the bottom fruit, so
you trusted a convenience sample.
33
Volunteer
A volunteer sample consists of results collected from
those elements of the population that chose to
contribute the needed information on their own
initiative.
Voluntary response samples are always biased: they
only include people who choose to volunteer (those
who tend to have strong opinions on the subject),
whereas a random sample would need to include
people whether or not they choose to volunteer.
34
Example 5
Administrators of the fire department are concerned
about the possibility of implementing a new property
tax to raise moneys needed to replace old equipment.
They decide to check on public opinion by taking
sample of the city's population. Several plans for
choosing the sample are proposed.
Determine the sampling design (simple random,
stratified, cluster, systematic, convenience) next to
each of the following plans.
35
continued
Example 5 continued
a) The city has five property classifications: single
family homes, apartments, condominiums,
temporary housing (hotel and campgrounds), and
retail property. Randomly select ten residents from
each category.
b) Each property owner has a five-digit ID number.
Use random numbers to choose fifty numbers.
c) Go to the city park and survey every tenth person
who arrives. Stop after the sample has reached
fifty.
36
continued
Example 5 continued
d) Randomly select three neighborhoods and survey
all the people who live in those neighborhoods.
e) Have ten firefighters each survey five of his/her
neighbors.
f) Set up a table outside the local supermarket, then
offer a gift to everyone who agrees to fill out the
survey.
37
Example 6
Describe the sampling method used for each sampling
plan.
a) A quality control engineer at a manufacturing
company samples every tenth production lot in
order to estimate the proportion of defective items
being produced.
b) A journalist attends a town hall meeting and
surveys the attendees regarding the upcoming
election for mayor.
continued38
Example 6 continued
c) Students at a state university are sorted according
to class (freshman, sophomore, junior, senior).
The registrar then obtains a random sample of size
150 from each class.
d) A cable company divides its service area into
neighborhoods, randomly selects 30 of these
neighborhoods and then within these 30
neighborhoods, surveys each residence to
determine which of the cable company’s services
the residence is using.
39