Estimation in Sampling
GTECH 201 Lecture 15
Conceptual Setting
How do we come to conclusions from empirical evidence?
Isn’t common sense enough? Why?
Systematic methods for drawing conclusions from data: statistical inference
Inductive versus deductive reasoning
Drawing Conclusions
Statistical inference
  Based on the laws of probability
What would happen if?
  You ran your experiment hundreds of times
  You repeated your survey over and over again
Statistic and Parameter
The proportion of the population who are <disabled> is a parameter, usually denoted by $p$
In an SRS of 1000 people, the proportion of the people who are <disabled> is a statistic, usually denoted by $\hat{p}$ (p-hat)
Estimating with Confidence
Say you are conducting an opinion poll…
SRS of 1000 adult television viewers
You ask these folks if they trust Walter Cronkite when he delivers the nightly news
Out of 1000, 570 say they trust him
57% of the people trust Walter: $\hat{p}$ is 0.57
If you collect another set of 1000 television viewers, what will the rating be?
Confidence Statement
We need to add a confidence statement
We need to say something about the margin of error
Confidence statements are based on the distribution of the values of the sample proportion $\hat{p}$ that would occur if many independent SRSs were taken from the same population
This is the sampling distribution of the statistic $\hat{p}$
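The "what would happen if you repeated your survey over and over" idea can be imitated with a short simulation. This is only a sketch: it assumes a hypothetical true proportion `p_true = 0.57` (in a real poll the true value is unknown), and the function name `simulate_sample_proportions`, the seed, and the number of repeats are invented for illustration.

```python
import random
import statistics

def simulate_sample_proportions(p_true, n, repeats, seed=0):
    """Draw `repeats` independent SRSs of size n and return each p-hat."""
    rng = random.Random(seed)
    p_hats = []
    for _ in range(repeats):
        # Each respondent says "yes" with probability p_true (an assumption).
        yes = sum(1 for _ in range(n) if rng.random() < p_true)
        p_hats.append(yes / n)
    return p_hats

p_hats = simulate_sample_proportions(p_true=0.57, n=1000, repeats=500)
print("mean of p-hat:", round(statistics.mean(p_hats), 3))
print("std. dev. of p-hat:", round(statistics.stdev(p_hats), 4))
```

The spread of the 500 simulated values is what the sampling distribution of $\hat{p}$ describes: individual polls differ from one another, but they cluster around the true proportion.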
Terminology Review
Sample
Population
Statistic: a numerical characteristic associated with a sample
Parameter: a numerical characteristic associated with the population
Sampling error
The need for interval estimation
Point Estimation
A point estimate of a parameter is the value of a statistic that is used to estimate the parameter
Compute a statistic (e.g., the mean)
Use it to estimate the corresponding population parameter
Point estimators of population parameters (see next slide)
Point Estimators for Population Parameters
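Population parameter | Sample statistic | Calculating formula
Mean $\mu$ | $\bar{x}$ | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Standard deviation $\sigma$ | $s$ | $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$
Total | $T$ | $T = N\bar{x} = \frac{N}{n}\sum_{i=1}^{n} x_i$
Proportion | $p$ | $p = x/n$
Here n = sample size, N = population size, and in the proportion formula x = number of sample units having the characteristic of interest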
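As a rough illustration of these point estimators, the sketch below computes all four from one small sample. The data, the population size of 3500, the "4 or more persons" trait, and the helper name `point_estimates` are hypothetical values chosen only for the example.

```python
import math

def point_estimates(sample, population_size, has_trait):
    """Return the sample mean, sample standard deviation, estimated
    population total, and sample proportion for one sample."""
    n = len(sample)
    mean = sum(sample) / n                                     # x-bar
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)  # s squared
    std_dev = math.sqrt(variance)                              # s
    total = population_size * mean                             # T = N * x-bar
    proportion = sum(1 for x in sample if has_trait(x)) / n    # p = x / n
    return mean, std_dev, total, proportion

# Hypothetical data: persons per household in 8 sampled households,
# taken from a hypothetical population of N = 3500 households.
household_sizes = [2, 3, 1, 4, 2, 5, 3, 2]
mean, std_dev, total, proportion = point_estimates(
    household_sizes,
    population_size=3500,
    has_trait=lambda size: size >= 4,  # hypothetical trait: 4+ persons
)
print(mean, round(std_dev, 3), total, proportion)
```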
Interval Estimation
Sample point estimators are usually not absolutely precise
How close or how distant is the calculated sample statistic from the population parameter?
We can say that the sample statistic is within a certain range or interval of the population parameter
The determination of this range is the basis for interval estimation
Interval Estimation (2)
A confidence interval (CI) represents the level of precision associated with a population estimate
Width of the interval is determined by:
  Sample size
  Variability of the population
  The probability level, or level of confidence, selected
Sampling Distribution of the Mean
The distribution of all possible sample means for a sample of a given size
Use the mean of a sample to estimate and draw conclusions about the mean of the entire population
So we have samples of a particular size
We need formulas to determine the mean and the standard deviation of all possible sample means for samples of a given size from a population
Sample and Population Mean
For samples of size n, the mean of the variable $\bar{x}$ is equal to the mean of the variable under consideration
The mean of all possible sample means is equal to the population mean:
$\mu_{\bar{x}} = \mu$
Sample Standard Deviation
For samples of size n, the standard deviation of the variable $\bar{x}$ is equal to the standard deviation of the variable under consideration, divided by the square root of the sample size
For each sample size, the standard deviation of all possible sample means equals the population standard deviation divided by the square root of the sample size:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Central Limit Theorem
Suppose all possible random samples of size n are drawn from an infinitely large, normally distributed population having a mean $\mu$ and a standard deviation $\sigma$
The frequency distribution of these sample means will have:
  A mean of $\mu$ (the population mean)
  A normal distribution around this population mean
  A standard deviation of $\sigma_{\bar{x}} = \sigma/\sqrt{n}$
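These properties can be checked numerically. The sketch below draws many samples from a normal population and compares the mean and standard deviation of the sample means with $\mu$ and $\sigma/\sqrt{n}$. The population values (`mu = 10`, `sigma = 3`), the sample size, the number of repeats, and the function name `sample_means` are all hypothetical choices for the illustration.

```python
import math
import random
import statistics

def sample_means(mu, sigma, n, repeats, seed=1):
    """Return the means of `repeats` random samples of size n
    drawn from a normal population with mean mu and std. dev. sigma."""
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    return means

mu, sigma, n = 10.0, 3.0, 25          # hypothetical population and sample size
means = sample_means(mu, sigma, n, repeats=2000)

print("mean of sample means:", round(statistics.mean(means), 3))
print("std. dev. of sample means:", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):", sigma / math.sqrt(n))
```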
Sampling Error
Standard Error of the Mean (SEM) is a basic measure of the amount of sampling error
SEM indicates how much a typical sample mean is likely to differ from the true population mean
Sample size and population standard deviation affect the sampling error:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Sampling Error (2)
The larger the sample size, the smaller the amount of sampling error
The larger the standard deviation, the greater the amount of sampling error
[Figure: amount of sampling error by sample size (n) and standard deviation of population ($\sigma$), each ranging from small to large]
Finite Population Correction Factor
The frequency distribution of the sample means is approximately normal if the sample size is large
n < 30 (small sample); n > 30 (large sample)
If you have a finite population, then you need to introduce a correction, i.e., the fpc rule/factor, in the estimation process:
$fpc = \sqrt{\frac{N - n}{N - 1}}$
where fpc = finite population correction; n = sample size; N = population size
Standard Error of the Mean for Finite Populations
When including the fpc, the standard error of the mean should be:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\,(fpc)$
In general, you include the fpc in the population estimates only when the ratio of sample size to population size exceeds 5%, i.e., when n / N > 0.05
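A minimal sketch of this rule, assuming the fpc form $\sqrt{(N - n)/(N - 1)}$ and the 5% threshold given in these slides. The function names `fpc` and `standard_error` and the numbers in the usage lines are hypothetical.

```python
import math

def fpc(n, N):
    """Finite population correction: sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

def standard_error(sigma, n, N=None):
    """Standard error of the mean; the fpc is applied only when the
    population size N is given and n / N exceeds 0.05."""
    se = sigma / math.sqrt(n)
    if N is not None and n / N > 0.05:
        se *= fpc(n, N)
    return se

# Hypothetical values: sigma = 3, n = 50
print(round(standard_error(3.0, 50), 4))         # no population size given
print(round(standard_error(3.0, 50, N=400), 4))  # n/N = 0.125, fpc applied
print(round(standard_error(3.0, 50, N=10000), 4))  # n/N = 0.005, fpc skipped
```

Applying the fpc always shrinks the standard error, because a sample that covers a sizeable share of a finite population leaves less room for sampling error.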
Constructing Confidence Intervals
A random sample of 50 commuters reveals that their average journey-to-work distance was 9.6 miles
A recent study has determined that the std. deviation of journey-to-work distance is approximately 3 miles
What is the CI around this sample mean of 9.6 that guarantees with 90% certainty that the true population mean is enclosed within that interval?
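Confidence Interval for the Mean
Z value associated with a 90% confidence level (Z = 1.65)
The sample mean is the best estimate of the true population mean
$CI = \bar{x} \pm z\,\sigma_{\bar{x}}$, with $\bar{x} = 9.6$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{50}}$
Upper limit: $9.6 + 1.65\,(3/\sqrt{50}) = 10.30$ miles
Lower limit: $9.6 - 1.65\,(3/\sqrt{50}) = 8.90$ miles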
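The commuter calculation can be reproduced in a few lines. The numbers (mean 9.6, standard deviation 3, n = 50, Z = 1.65) come from the slides; the function name `confidence_interval` is a hypothetical helper introduced for the example.

```python
import math

def confidence_interval(mean, sigma, n, z):
    """Return (lower, upper) limits of mean +/- z * sigma / sqrt(n)."""
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin

lower, upper = confidence_interval(mean=9.6, sigma=3.0, n=50, z=1.65)
print(f"{lower:.2f} to {upper:.2f} miles")  # 8.90 to 10.30 miles
```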
Confidence Interval
We say that the sample statistic is within a certain range or interval of the population parameter
e.g., in our sample, 57% of the viewers thought Walter Cronkite is trustworthy
In the general population, between 54% and 60% of viewers think that Walter Cronkite is trustworthy
Or, in our sample, the average commuting distance was 9.6 miles
In the population, we calculated that the average commute is likely to be somewhere between 8.9 miles and 10.3 miles
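The slides do not say how the 54% to 60% range for the poll was obtained. One common way to get such a range, shown here purely as an assumption, is the normal-approximation interval $\hat{p} \pm z\sqrt{\hat{p}(1 - \hat{p})/n}$ at a 95% confidence level (z = 1.96). The function name `proportion_interval` is hypothetical.

```python
import math

def proportion_interval(p_hat, n, z):
    """Normal-approximation interval for a proportion (an assumed method,
    not one stated in the slides)."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin

# 570 of 1000 viewers trust the newscaster; z = 1.96 assumes 95% confidence.
lower, upper = proportion_interval(p_hat=0.57, n=1000, z=1.96)
print(f"{lower:.0%} to {upper:.0%}")
```

Under these assumptions the interval rounds to the 54% to 60% range quoted above.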
Confidence Level
Gives you an understanding of how reliable your previous statement regarding the confidence interval is
The probability that the interval actually includes the population parameter
For example, the confidence level refers to the probability that the interval (8.9 miles to 10.3 miles) actually encompasses the TRUE population mean (90%, 95%, 99.7%)
Confidence level probability is $1 - \alpha$
Significance Level
$\alpha$ (alpha)
The probability that the interval that surrounds the sample statistic DOES NOT include the population parameter
E.g., the probability that the average commuting distance does not fall between 8.9 miles and 10.3 miles
$\alpha$ = 0.10 (90%); 0.05 (95%); 0.01 (99%)
As the confidence level increases, the confidence interval width increases
Sampling Error
Total sampling error = $\alpha$
Probability that the sample statistic will fall into either tail of the distribution is $\alpha/2$
If you want 99.7% confidence (i.e., low error), then you have to settle for giving a less precise estimate (the CI is wider)
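The link between confidence level, $\alpha$, and the Z value can be sketched with Python's standard library, which places $\alpha/2$ in each tail of the standard normal curve. The helper name `z_critical` is hypothetical, and the margin column reuses the commuter figures ($\sigma = 3$, n = 50) only as an illustration. Note that the computed 90% value is about 1.645, which tables often round to the 1.65 used in these slides.

```python
import math
from statistics import NormalDist

def z_critical(confidence_level):
    """Two-tailed Z value: alpha / 2 is left in each tail."""
    alpha = 1 - confidence_level
    return NormalDist().inv_cdf(1 - alpha / 2)

standard_error = 3.0 / math.sqrt(50)  # commuter example: sigma = 3, n = 50

for level in (0.90, 0.95, 0.99):
    z = z_critical(level)
    margin = z * standard_error
    print(f"{level:.0%} confidence: z = {z:.3f}, margin = {margin:.2f} miles")
```

The margin grows with the confidence level, which is the trade-off described above: more confidence means a wider, less precise interval.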
If the Standard Deviation is Unknown
If we don’t know the population mean, it’s likely we don’t know the standard deviation
What you are likely to have is the variance and standard deviation of your sample
Also, if you have a small population, you have to use the finite population correction factor that was discussed earlier
Once you have the formula for standard error, you can proceed as before to determine the confidence interval
Standard Error
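With a known population standard deviation: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\,(fpc)$, where $fpc = \sqrt{\frac{N - n}{N - 1}}$
Estimated from the sample variance $s^2$: $\sigma_{\bar{x}} = \sqrt{\frac{s^2}{n}}\,\sqrt{\frac{N - n}{N}}$
$CI = \bar{x} \pm z\,\sigma_{\bar{x}}$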
Student’s t Distribution
William Gosset (1876-1937)
Published his contributions to statistical theory under a pseudonym
Student’s t distribution is used in performing inferences for a population mean when:
  The population being sampled is approximately normally distributed
  The population standard deviation is unknown
  And the sample size is small (n < 30)
Characteristics of the t-Distribution
A t curve is symmetric, bell shaped
Exact shape of distribution varies with sample size
When n nears 30, the value of t approaches the standard normal Z value
A particular distribution is identified by defining its degrees of freedom (df)
For a t distribution, df = (n - 1)
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
Properties of t Curves
The total area under a t curve = 1
A t curve extends indefinitely in both directions, approaching, but never touching, the horizontal axis
A t curve is symmetrical about 0
As the degrees of freedom become larger, t curves look increasingly like the standard normal curve
We need to use a t-table and look for values of t, instead of Z, to determine the confidence interval
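To see how t values approach Z as the degrees of freedom grow, the sketch below compares a few entries from a standard t-table with the corresponding Z value. The table excerpt is hard-coded (two-tailed 90% confidence, i.e. 0.05 in each tail), since the Python standard library has no t distribution; the dictionary name `T_CRITICAL_90` is hypothetical.

```python
from statistics import NormalDist

# Excerpt from a standard t-table: two-tailed 90% critical values,
# keyed by degrees of freedom (df = n - 1).
T_CRITICAL_90 = {5: 2.015, 10: 1.812, 15: 1.753, 20: 1.725, 24: 1.711, 29: 1.699}

# Matching Z value: 0.05 in each tail of the standard normal curve.
z_90 = NormalDist().inv_cdf(0.95)

for df, t_value in T_CRITICAL_90.items():
    print(f"df = {df:2d}: t = {t_value:.3f}, t - z = {t_value - z_90:.3f}")
```

Because t is always larger than Z for the same confidence level, a small-sample interval built with t is wider than one built with Z.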
Calculating various CIs
Sampling
  SRS, systematic, or stratified
Parameters
  Mean, total, or proportion
Six situations
Consider whether to use the fpc (when n/N > 0.05)
Consider whether to use Z or t (t when n < 30)
If Random or Systematic Sample
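Estimate of population mean
  Best estimate is the sample mean, $\bar{x}$
Estimate of sampling error
  Standard error of the mean (inc. fpc): $\sigma_{\bar{x}} = \sqrt{\frac{s^2}{n}}\,\sqrt{\frac{N - n}{N}}$
$CI = \bar{x} \pm z\,\sigma_{\bar{x}}$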
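A minimal sketch of this case, assuming the sample-variance form of the standard error with the fpc term $\sqrt{(N - n)/N}$ applied when n/N > 0.05. The function name `mean_interval_from_sample` and all numbers in the usage lines are hypothetical.

```python
import math

def mean_interval_from_sample(mean, variance, n, N, z):
    """CI for the population mean from a random or systematic sample,
    using the sample variance and, when n / N > 0.05, the fpc."""
    standard_error = math.sqrt(variance / n)
    if n / N > 0.05:
        standard_error *= math.sqrt((N - n) / N)
    margin = z * standard_error
    return mean - margin, mean + margin, standard_error

# Hypothetical survey: n = 50 of N = 400 units, sample mean 9.6,
# sample variance 9.0, 90% confidence (z = 1.65).
lower, upper, se = mean_interval_from_sample(9.6, 9.0, n=50, N=400, z=1.65)
print(f"standard error = {se:.4f}, CI = {lower:.2f} to {upper:.2f}")
```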
If Stratified Sample
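Estimate of population mean
  Still equal to sample mean but…
  $\bar{X} = \frac{1}{N}\sum_{i=1}^{m} N_i \bar{X}_i$
Std. error of the mean (inc. fpc)
  $\sigma_{\bar{x}} = \sqrt{\frac{1}{N^2}\sum_{i=1}^{m} N_i^2 \left(\frac{s_i^2}{n_i}\right)\left(\frac{N_i - n_i}{N_i}\right)}$
Where m = number of strata; i refers to a particular stratum; $N_i$, $n_i$, $\bar{X}_i$, and $s_i^2$ are the population size, sample size, sample mean, and sample variance of stratum i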
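The stratified formulas weight each stratum by its population size. The sketch below applies them to two made-up strata; the function name `stratified_estimate`, the dictionary keys, and every number are hypothetical.

```python
import math

def stratified_estimate(strata):
    """Return (mean, standard error) for a stratified sample.
    Each stratum is a dict with keys: N (stratum population size),
    n (stratum sample size), mean, and variance (sample variance)."""
    N = sum(s["N"] for s in strata)
    # Weighted mean: (1 / N) * sum of N_i * mean_i
    mean = sum(s["N"] * s["mean"] for s in strata) / N
    # Sum of N_i^2 * (s_i^2 / n_i) * ((N_i - n_i) / N_i)
    weighted_variance = sum(
        s["N"] ** 2 * (s["variance"] / s["n"]) * ((s["N"] - s["n"]) / s["N"])
        for s in strata
    )
    standard_error = math.sqrt(weighted_variance / N ** 2)
    return mean, standard_error

# Two hypothetical strata
strata = [
    {"N": 200, "n": 20, "mean": 2.0, "variance": 1.0},
    {"N": 300, "n": 30, "mean": 3.0, "variance": 2.0},
]
mean, standard_error = stratified_estimate(strata)
print(round(mean, 3), round(standard_error, 4))
```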
Minimum Sample Size
Before going out to the field, you want to know how big the sample ought to be for your research problem
Sample must be large enough to achieve the precision and CI width that you desire
Formulas exist for each of the three basic population parameters (mean, total, proportion) with random sampling
Sample Size Selection - Mean
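Your goal is to determine the minimum sample size
You want to situate the estimated population mean in a specified CI: $CI = \bar{x} \pm z\,\sigma_{\bar{x}}$
E = amount of error you are willing to tolerate
$E = Z\,\sigma_{\bar{x}} = \frac{Z\sigma}{\sqrt{n}}$
Solving for n: $n = \left(\frac{Z\sigma}{E}\right)^2$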
Example 1
We are looking at Neighborhood X
3,500 households
Sample size = 25 households
Sample mean = 2.73
Sample variance = 2.6
Confidence level = 90%
Find the mean number of people per household
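One possible way to work Example 1, following the rules given earlier in these slides: n/N = 25/3500 is below 0.05, so the fpc is skipped, and n < 30 with an unknown population standard deviation, so t is used instead of Z. The critical value 1.711 is the standard t-table entry for df = 24 at 90% confidence; the variable names are hypothetical.

```python
import math

N = 3500                 # households in Neighborhood X
n = 25                   # sampled households
sample_mean = 2.73
sample_variance = 2.6
t_critical = 1.711       # t-table value for df = n - 1 = 24, 90% confidence

use_fpc = n / N > 0.05   # False here, since 25 / 3500 is about 0.007
standard_error = math.sqrt(sample_variance / n)
if use_fpc:
    standard_error *= math.sqrt((N - n) / N)

margin = t_critical * standard_error
lower, upper = sample_mean - margin, sample_mean + margin
print(f"{lower:.2f} to {upper:.2f} persons per household")  # 2.18 to 3.28
```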
Example 2
Sample of 30 households
Sample standard deviation is 1.25
What sample size is needed to estimate the mean number of persons per household in neighborhood X and be 90% confident that your estimate will be within 0.3 persons of the true population mean?
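A sketch of one way to work Example 2 with the minimum sample size formula $n = (Z\sigma/E)^2$. It assumes the sample standard deviation of 1.25 stands in for $\sigma$ and uses Z = 1.65 for 90% confidence, as elsewhere in these slides; the result is rounded up because a fraction of a household cannot be sampled. The function name `minimum_sample_size` is hypothetical.

```python
import math

def minimum_sample_size(z, sigma, error):
    """n = (z * sigma / E)^2, rounded up to a whole sampling unit."""
    return math.ceil((z * sigma / error) ** 2)

n_required = minimum_sample_size(z=1.65, sigma=1.25, error=0.3)
print(n_required)  # 48
```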