Lecture 3: The Normal Distribution and Statistical Inference

Ani Manichaikul [email protected]

19 April 2007


Slides: people.virginia.edu/~am3xa/BiostatII/slides/lecture3.pdf (2007-04-19)


A Review and Some Connections

The Normal Distribution

The Central Limit Theorem

Estimates of means and proportions: uses and properties

Confidence intervals and Hypothesis tests


The Normal Distribution

Probability distribution for continuous data

Under certain conditions, can be used to approximate binomial probabilities

np > 5

n(1 − p) > 5

Characterized by a symmetric bell-shaped curve (Gaussian curve)

Symmetric about its mean µ


Normal Distribution

Takes on values between −∞ and +∞

Mean = Median = Mode

Area under curve equals 1

Parameters

µ = mean

σ = standard deviation


Normal Distribution

[Figure: normal density curve centered at µ, with the horizontal axis running from −∞ to +∞]

Notation for Normal random variable: X ∼ N(µ, σ²)


Formula: Normal Distribution

The normal probability distribution is given by:

$$ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \cdot e^{-(x-\mu)^2 / (2\sigma^2)}, \quad -\infty < x < +\infty $$

π ≈ 3.14 and e ≈ 2.72 are mathematical constants

µ,σ are mean and SD parameters of the distribution
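
As a quick numerical check of the density formula, the following minimal Python sketch evaluates f(x) term by term and compares it with the standard library's `statistics.NormalDist`. The function name `normal_pdf` and the parameter values are illustrative assumptions, not part of the lecture.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, written out as in the formula."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Illustrative parameters (assumed for this example only).
mu, sigma = 10.0, 2.0
for x in (6.0, 10.0, 13.0):
    by_formula = normal_pdf(x, mu, sigma)
    by_library = NormalDist(mu, sigma).pdf(x)
    print(x, by_formula, by_library)  # the two columns should agree
```

The density is largest at x = µ and falls off symmetrically on either side.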


Standard Normal

The standard normal distribution has parameters µ = 0 and σ = 1

Its density function is written as:

$$ f(x) = \frac{1}{\sqrt{2\pi}} \cdot e^{-x^2/2}, \quad -\infty < x < +\infty $$

We typically use the letter Z to denote a standard normal random variable (Z ∼ N(0, 1))

If X ∼ N(µ, σ²), then (X − µ)/σ ∼ N(0, 1)


68-95-99.7 Rule I

68% of density is within one standard deviation of the mean


68-95-99.7 Rule II

95% of density is within two standard deviations of the mean


68-95-99.7 Rule III

99.7% of density is within three standard deviations of the mean
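
The three percentages in the 68-95-99.7 rule are rounded values of P(−k < Z < k) for k = 1, 2, 3. A minimal sketch using Python's `statistics.NormalDist` (the helper name `prob_within` is an assumption made for this example):

```python
from statistics import NormalDist

Z = NormalDist(0.0, 1.0)  # standard normal

def prob_within(k):
    """P(-k < Z < k): probability of falling within k SDs of the mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))  # 0.6827, 0.9545, 0.9973
```

Because any normal variable can be standardized, the same proportions hold for every µ and σ.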


Different Means

[Figure: three normal densities with different means µ1 < µ2 < µ3]


Different Standard Deviations

[Figure: three normal densities with different standard deviations σ1 < σ2 < σ3]


Standard Normal

[Figure: standard normal density with µ = 0 and σ = 1, horizontal axis from −4 to 4]


Example: Birthweights I

Birthweights (in grams) of infants in a population


Example: Birthweights II

Continuous data

Mean = Median = Mode = 3000 = µ

Standard deviation = 1000 = σ

The area under the curve represents the probability (proportion) of infants with birthweights between certain values


Normal Probabilities


Calculating Probabilities

Equivalent to finding area under the curve

Continuous distribution, so we cannot use sums to find probabilities

Performing the integration is not necessary since tables and computers are available


Z Tables


Normal Table


Looking up z=2.22


Looking up z=-0.67


Example: Birthweights


Question I

What is the probability of an infant weighing more than 5000g?

$$
\begin{aligned}
P(X > 5000) &= P\left(\frac{X - \mu}{\sigma} > \frac{5000 - 3000}{1000}\right) \\
&= P(Z > 2) \\
&= 0.0228
\end{aligned}
$$


Question II

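What is the probability of an infant weighing between 2500 and 4000g?

$$
\begin{aligned}
P(2500 < X < 4000) &= P\left(\frac{2500 - 3000}{1000} < \frac{X - \mu}{\sigma} < \frac{4000 - 3000}{1000}\right) \\
&= P(-0.5 < Z < 1) \\
&= 1 - P(Z > 1) - P(Z < -0.5) \\
&= 1 - 0.1587 - 0.3085 \\
&= 0.5328
\end{aligned}
$$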


Question III

What is the probability of an infant weighing less than 3500g?

$$
\begin{aligned}
P(X < 3500) &= P\left(\frac{X - \mu}{\sigma} < \frac{3500 - 3000}{1000}\right) \\
&= P(Z < 0.5) \\
&= 1 - P(Z > 0.5) \\
&= 1 - 0.3085 \\
&= 0.6915
\end{aligned}
$$
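
The three birthweight questions can also be answered by computer instead of a Z table. The sketch below assumes Python's `statistics.NormalDist` with the example's µ = 3000 and σ = 1000; the variable names are chosen for illustration.

```python
from statistics import NormalDist

birthweight = NormalDist(mu=3000, sigma=1000)  # grams, as in the example

# Question I: P(X > 5000) is the upper-tail area beyond 5000 g.
p_over_5000 = 1 - birthweight.cdf(5000)

# Question II: P(2500 < X < 4000) is the area between the two cutoffs.
p_2500_to_4000 = birthweight.cdf(4000) - birthweight.cdf(2500)

# Question III: P(X < 3500) is the lower-tail area below 3500 g.
p_under_3500 = birthweight.cdf(3500)

print(round(p_over_5000, 4))     # 0.0228
print(round(p_2500_to_4000, 4))  # 0.5328
print(round(p_under_3500, 4))    # 0.6915
```

`cdf` works on the original scale, so the standardization step to Z happens implicitly.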


Statistical Inference

Populations and samples

Sampling distributions


Definitions

Statistical inference is “the attempt to reach a conclusion concerning all members of a class from observations of only some of them.” (Runes 1959)

A population is a collection of observations

A parameter is a numerical descriptor of a population

A sample is a part or subset of a population

A statistic is a numerical descriptor of the sample


Population

Population size = N

µ = mean, a measure of center

σ² = variance, a measure of dispersion

σ = standard deviation


Sample Estimates

Sample size = n

X̄ = sample mean

s² = sample variance

s = sample standard deviation

Population: parameters

Sample: statistics


Estimating µ

Usually µ is unknown and we would like to estimate it

We use X̄ to estimate µ

We know the sampling distribution of X̄


Sampling Distribution

The distribution of all possible values of some statistic, computed from samples of the same size randomly drawn from the same population, is called the sampling distribution of that statistic


Sampling Distribution of X̄

When sampling from a normally distributed population

X̄ will be normally distributed

The mean of the distribution of X̄ is equal to the true mean µ of the population from which the samples were drawn

The variance of the distribution is σ²/n, where σ² is the variance of the population and n is the sample size

We can write: X̄ ∼ N(µ, σ²/n)

When sampling is from a population whose distribution is not normal and the sample size is large, use the Central Limit Theorem


The Central Limit Theorem (CLT)

Given a population of any distribution with mean µ and variance σ², the sampling distribution of X̄, computed from samples of size n from this population, will be approximately N(µ, σ²/n) when the sample size is large

In general, this applies when n ≥ 25

The approximation of normality becomes better as n increases
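
A small simulation can make the theorem concrete. The sketch below is only an illustration under assumed settings: a skewed Exponential(1) population (mean 1, variance 1), samples of size n = 25, and 20000 repetitions. None of these choices come from the lecture.

```python
import math
import random
from statistics import fmean, pvariance

rng = random.Random(0)   # fixed seed so the run is repeatable
n, reps = 25, 20000      # sample size and number of samples (assumed)
mu, sigma = 1.0, 1.0     # mean and SD of the Exponential(1) population

# Draw many samples of size n and record each sample mean.
sample_means = [
    fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = fmean(sample_means)     # theory: about mu = 1
var_of_means = pvariance(sample_means)  # theory: about sigma^2 / n = 0.04

# If X-bar is roughly normal, about 95% of sample means should lie
# within 1.96 standard errors of mu.
se = sigma / math.sqrt(n)
frac_within = sum(abs(m - mu) <= 1.96 * se for m in sample_means) / reps

print(mean_of_means, var_of_means, frac_within)
```

Increasing n in this sketch should bring the fraction closer to 0.95, in line with the approximation improving as n grows.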


What about for Binomial RVs? I

First, recall that a Binomial variable is just the sum of n Bernoulli variables:

$$ S_n = \sum_{i=1}^{n} X_i $$

Notation:

Sn ∼ Binomial(n, p)

Xi ∼ Bernoulli(p) = Binomial(1, p) for i = 1, ..., n


What about for Binomial RVs? II

In this case, we want to estimate p by p̂ where

$$ \hat{p} = \frac{S_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n} = \bar{X} $$

p̂ is just a sample mean!

So we can use the central limit theorem when n is large


Binomial CLT

For a Bernoulli variable

µ = mean = p

σ² = variance = p(1 − p)

X̄ ≈ N(µ, σ²/n) as before

Equivalently, p̂ ≈ N(p, p(1 − p)/n)
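
To see how close the normal approximation is, one can compare it with the exact binomial probability. The values n = 100, p = 0.3, and the cutoff of 35 successes below are assumptions chosen for illustration, and no continuity correction is applied.

```python
import math
from statistics import NormalDist

n, p = 100, 0.3      # illustrative values; np = 30 and n(1 - p) = 70
k_max = 35           # compare P(S_n <= 35), i.e. P(p_hat <= 0.35)
cutoff = k_max / n

# Exact probability: sum the Binomial(n, p) probabilities for 0..k_max.
exact = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k)
    for k in range(k_max + 1)
)

# Normal approximation: p_hat is roughly N(p, p(1 - p)/n).
approx = NormalDist(p, math.sqrt(p * (1 - p) / n)).cdf(cutoff)

print(exact, approx)  # the two values should be close, not identical
```

The gap between the two numbers shrinks as n grows, which is the CLT at work for a sample proportion.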


Notation I

Often we are interested in detecting a difference between two populations

Differences in average income by neighborhood

Differences in disease cure rates by age


Notation II

Population 1:

Size = N1

Mean = µ1

Standard deviation = σ1

Population 2:

Size = N2

Mean = µ2

Standard deviation = σ2

Samples of size n1 from Population 1:

Mean = $\mu_{\bar{X}_1} = \mu_1$

Standard deviation = $\sigma_1/\sqrt{n_1} = \sigma_{\bar{X}_1}$

Samples of size n2 from Population 2:

Mean = $\mu_{\bar{X}_2} = \mu_2$

Standard deviation = $\sigma_2/\sqrt{n_2} = \sigma_{\bar{X}_2}$


Notation III

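Now by CLT, for large n:

$$ \bar{X}_1 \sim N(\mu_1, \sigma_1^2/n_1) $$

$$ \bar{X}_2 \sim N(\mu_2, \sigma_2^2/n_2) $$

and

$$ \bar{X}_1 - \bar{X}_2 \approx N\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right) $$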


Difference in proportions?

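We’re done if the underlying variable is continuous. What if the underlying variable is Binomial?

Then

$$ \bar{X}_1 - \bar{X}_2 \approx N\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right) $$

is replaced by:

$$ \hat{p}_1 - \hat{p}_2 \approx N\left(p_1 - p_2,\ \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}\right) $$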


Sampling Distributions

Statistic | Mean of sampling distribution | Variance of sampling distribution
X̄ | µ | σ²/n
X̄1 − X̄2 | µ1 − µ2 | σ1²/n1 + σ2²/n2
p̂ | p | pq/n
np̂ | np | npq
p̂1 − p̂2 | p1 − p2 | p1q1/n1 + p2q2/n2

Here q = 1 − p.


Statistical inference

Two methods

Estimation

Hypothesis testing

Both make use of sampling distributions

Remember to use CLT


Estimation

Point estimation

An estimator of a population parameter: a statistic (e.g. x̄ , p̂)

An estimate of a population parameter: the value of the estimator for a particular sample

Interval estimation

A point estimate plus an interval that expresses the uncertainty or variability associated with the estimate


Hypothesis Testing

Given the observed data, do we reject or accept a pre-specified null hypothesis in favor of an alternative?

“Significance testing”


Point Estimation

X̄ is a point estimator of µ

X̄1 − X̄2 is a point estimator of µ1 − µ2

p̂ is a point estimator of p

p̂1 − p̂2 is a point estimator of p1 − p2

We know the sampling distribution of these statistics, e.g.

$$ \bar{X} \sim N\left(\mu_{\bar{X}} = \mu,\ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\right) $$

If σ is not known, we can use s, the sample standard deviation, as a point estimator of σ


Interval Estimation

100(1− α)% Confidence interval:

estimate ± (tabled value of z or t) · (standard error)

Plugging in the values, we get

$$ \bar{X} \pm z_{\alpha/2} \times \sigma_{\bar{X}} = [L, U] $$


Confidence Interval

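We are saying that

$$ P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha $$

$$ P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \le z_{\alpha/2}\right) = 1 - \alpha $$

$$ P(-z_{\alpha/2} \cdot \sigma_{\bar{X}} \le \bar{X} - \mu \le z_{\alpha/2} \cdot \sigma_{\bar{X}}) = 1 - \alpha $$

After some algebra:

$$ P(\bar{X} - z_{\alpha/2} \cdot \sigma_{\bar{X}} \le \mu \le \bar{X} + z_{\alpha/2} \cdot \sigma_{\bar{X}}) = 1 - \alpha $$

$$ P(L \le \mu \le U) = 1 - \alpha $$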


CI for mean

A confidence interval for µ is given by the interval estimate

$$ \bar{X} \pm z_{\alpha/2} \cdot \sigma_{\bar{X}} $$

when the population variance σ² is known
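
A minimal sketch of this interval in Python, assuming the population standard deviation is known. The function name `z_interval` and the birthweight values are hypothetical and serve only to illustrate the arithmetic.

```python
import math
from statistics import NormalDist, fmean

def z_interval(data, sigma, alpha=0.05):
    """100(1 - alpha)% CI for the mean when the population SD is known."""
    n = len(data)
    x_bar = fmean(data)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, about 1.96 for alpha = 0.05
    se = sigma / math.sqrt(n)                # standard error of the sample mean
    return x_bar - z * se, x_bar + z * se

# Hypothetical birthweights in grams; sigma = 1000 is treated as known.
weights = [2900, 3400, 3100, 2600, 3800, 3000, 3300, 2700, 3500]
lower, upper = z_interval(weights, sigma=1000)
print(lower, upper)
```

The interval is centered at the sample mean, and its half-width depends only on σ, n, and the chosen α.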


Interpretation

Before the data are observed, the probability is at least (1 − α) that [L, U] will contain µ, the population parameter

In repeated sampling from a normally distributed population, 100(1 − α)% of all intervals of the form above will include the population mean µ

After the data are observed, the constructed interval [L, U] either contains the true mean or it does not (no probability involved anymore)
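
The repeated-sampling interpretation can be checked by simulation. The sketch below assumes a normal population with µ = 3000 and σ = 1000, samples of size 16, and 10000 repetitions; these settings are illustrative and not taken from the lecture.

```python
import math
import random
from statistics import NormalDist, fmean

rng = random.Random(1)                # fixed seed so the run is repeatable
mu, sigma, n = 3000.0, 1000.0, 16     # assumed population and sample size
z = NormalDist().inv_cdf(0.975)       # z_{alpha/2} for alpha = 0.05
se = sigma / math.sqrt(n)

reps = 10000
covered = 0
for _ in range(reps):
    # Draw one sample, build its interval, and check whether it contains mu.
    x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
    if x_bar - z * se <= mu <= x_bar + z * se:
        covered += 1

coverage = covered / reps
print(coverage)  # close to 0.95 across repeated samples
```

Each individual interval either contains µ or does not; the 95% describes the long-run fraction of intervals that do.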


Known Variance

Sampling from a normally distributed population with known variance (σ² known)

Confidence interval: $\bar{X} \pm z_{\alpha/2} \cdot \sigma_{\bar{X}}$

What if σ² is unknown?


The t-distribution

[Figure: t densities for df = 2, df = 5, and df = 20]

$$ t = \frac{\bar{X} - \mu}{s/\sqrt{n}} $$


Use Sample Variance I

Sampling from a normally distributed population with population variance unknown

We can make use of the sample variance s²

Now we construct the confidence interval as:

$\bar{X} \pm z_{\alpha/2} \cdot s_{\bar{X}}$ when n is “large”

$\bar{X} \pm t_{\alpha/2,\,n-1} \cdot s_{\bar{X}}$ when n is “small”


Use Sample Variance II

Estimate σ² with s²

Here, $s_{\bar{X}} = s/\sqrt{n}$

and $t_{\alpha/2}$ has n − 1 degrees of freedom

The standardized statistic $(\bar{X} - \mu)/s_{\bar{X}}$ is not quite normal, so we need the t-distribution
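
Python's standard library has no t quantile function, so the sketch below computes one numerically (Simpson's rule for the CDF, bisection for the quantile) and uses it for a t-based interval. This is one possible approach, not the lecture's method; the names `t_cdf`, `t_quantile`, and `t_interval` and the data are hypothetical. In practice a statistics package or a t table would normally supply the critical value.

```python
import math
from statistics import fmean, stdev

def t_cdf(t, df, steps=2000):
    """P(T <= t) for Student's t, via Simpson's rule on [0, |t|]."""
    # Normalizing constant of the t density (math.gamma limits df to a few hundred).
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    a = abs(t)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3                      # area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def t_quantile(prob, df):
    """Value t with P(T <= t) = prob, for prob >= 0.5, found by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_interval(data, alpha=0.05):
    """100(1 - alpha)% CI for the mean using s and n - 1 degrees of freedom."""
    n = len(data)
    x_bar = fmean(data)
    s_xbar = stdev(data) / math.sqrt(n)       # estimated standard error
    t_crit = t_quantile(1 - alpha / 2, n - 1)
    return x_bar - t_crit * s_xbar, x_bar + t_crit * s_xbar

# Hypothetical birthweights in grams.
weights = [2900, 3400, 3100, 2600, 3800, 3000, 3300, 2700, 3500]
print(t_quantile(0.975, 9))   # about 2.262
print(t_interval(weights))
```

The t critical value exceeds 1.96 for small samples, so the resulting interval is wider than the corresponding z interval.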


Properties of the t-distribution

mean = median = mode = 0

Symmetric about the mean

t ranges from −∞ to +∞

Family of distributions determined by n − 1, the degrees of freedom

The t distribution approaches the normal distribution as n − 1 approaches ∞


Comparing t with normal

[Figure: standard normal density compared with the t density with df = 2]


Confidence intervals for means

Population distribution | Sample size | Population variance | 95% confidence interval
Normal | Any | σ² known | X̄ ± 1.96 σ/√n
Normal | Any | σ² unknown, use s² | X̄ ± t(0.025, n−1) s/√n
Not Normal/Unknown | Large | σ² known | X̄ ± 1.96 σ/√n
Not Normal/Unknown | Large | σ² unknown, use s² | X̄ ± 1.96 s/√n
Not Normal/Unknown | Small | Any | Non-parametric methods
Binomial | Large | - | p̂ ± 1.96 √(p̂(1 − p̂)/n)
Binomial | Small | - | Exact methods
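
The large-sample binomial row of the table can be sketched as follows. The function name `proportion_interval` and the counts (40 successes out of 100) are hypothetical values used only to show the calculation.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Large-sample 95% CI for a proportion: p_hat +/- z * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Hypothetical data: 40 successes in 100 trials.
lower, upper = proportion_interval(40, 100)
print(lower, upper)
```

For small samples this formula is unreliable, which is why the table points to exact methods in that case.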


Confidence Intervals for Differences in Means

This is a bit tricky

Recall that formulas for CIs for a single mean depend on

whether or not σ² is known

the sample size

For a difference in means, the formula for a CI depends on

whether or not the variances are assumed to be equal when variances are unknown

sample sizes in each group


Equal Variances I

When variances are assumed to be equal:

The standard error of the difference is estimated by:

$$ \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} $$

Here, $s_p^2$ is the pooled variance


Equal Variances II

$$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$

where df = n1 + n2 − 2

Recall, n1 is the size of sample 1, and n2 is the size of sample 2
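
The pooled variance and the resulting standard error can be computed directly from two samples. In this sketch the function name `pooled_se` and both data sets are hypothetical.

```python
import math
from statistics import variance

def pooled_se(sample1, sample2):
    """Return (pooled variance, SE of the difference in means, df)."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq, s2_sq = variance(sample1), variance(sample2)  # sample variances
    # Weighted average of the two sample variances, weights n_i - 1.
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    se = math.sqrt(sp_sq / n1 + sp_sq / n2)
    df = n1 + n2 - 2
    return sp_sq, se, df

# Hypothetical measurements for two groups.
group1 = [5.1, 4.9, 6.2, 5.8, 5.5]
group2 = [4.2, 4.8, 5.0, 4.4, 4.6, 4.1]
print(pooled_se(group1, group2))
```

The pooled variance always lies between the two sample variances, closer to the one from the larger sample.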


Unequal Variances

When variances are assumed to be unequal:

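The standard error of the difference is estimated by:

$$ \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} $$

Here, df = ν and

$$ \nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} $$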
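
A minimal sketch of the unequal-variance calculation, using the standard Welch-Satterthwaite form of ν in which the numerator is the squared sum of the two variance terms. The function name `welch_se_df` and the data are hypothetical.

```python
import math
from statistics import variance

def welch_se_df(sample1, sample2):
    """Return (SE of the difference in means, approximate df nu)."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1   # s1^2 / n1
    v2 = variance(sample2) / n2   # s2^2 / n2
    se = math.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom; usually not an integer.
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, nu

# Hypothetical measurements for two groups with different spread.
group1 = [5.1, 4.9, 6.2, 5.8, 5.5]
group2 = [4.2, 7.8, 3.0, 6.4, 2.6, 8.1]
print(welch_se_df(group1, group2))
```

The value of ν falls between the smaller of n1 − 1 and n2 − 1 and the pooled value n1 + n2 − 2.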


Confidence intervals for difference of means

Population distribution | Sample size | Population variances | 95% confidence interval
Normal | Any | known | (X̄1 − X̄2) ± 1.96 √(σ1²/n1 + σ2²/n2)
Normal | Any | unknown, σ1² = σ2² | (X̄1 − X̄2) ± t(0.025, n1+n2−2) √(sp²/n1 + sp²/n2)
Normal | Any | unknown, σ1² ≠ σ2² | (X̄1 − X̄2) ± t(0.025, ν) √(s1²/n1 + s2²/n2)
Not Normal/Unknown | Large | known | (X̄1 − X̄2) ± 1.96 √(σ1²/n1 + σ2²/n2)
Not Normal/Unknown | Large | unknown, σ1² = σ2² | (X̄1 − X̄2) ± 1.96 √(sp²/n1 + sp²/n2)
Not Normal/Unknown | Large | unknown, σ1² ≠ σ2² | (X̄1 − X̄2) ± 1.96 √(s1²/n1 + s2²/n2)
Not Normal/Unknown | Small | Any | Non-parametric methods


Confidence intervals for difference of proportions

Population distribution | Sample size | 95% confidence interval
Binomial | Large | (p̂1 − p̂2) ± 1.96 √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)
Binomial | Small | Exact methods
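
The large-sample interval for a difference of proportions can be sketched in the same way. The function name `diff_proportion_interval` and the counts below are hypothetical.

```python
import math

def diff_proportion_interval(x1, n1, x2, n2, z=1.96):
    """Large-sample 95% CI for p1 - p2 from success counts x1, x2."""
    p1, p2 = x1 / n1, x2 / n2
    # Variances of the two sample proportions add for independent samples.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Hypothetical data: 60 of 100 successes in group 1, 45 of 100 in group 2.
lower, upper = diff_proportion_interval(60, 100, 45, 100)
print(lower, upper)
```

An interval that excludes 0 suggests the two population proportions differ, assuming both samples are large and independent.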
