Statistics: Unlocking the Power of Data Lock5
STAT 101
Dr. Kari Lock Morgan
ANOVA
SECTION 8.1
• Testing for a difference in means across multiple categories
Review: Chi-Square Tests
The χ² goodness-of-fit test tests whether one categorical variable differs from a null distribution
The χ² test for association tests for an association between two categorical variables
For both, you compute the expected counts in each cell (assuming H0) and the χ² statistic:
$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
Find the proportion above the χ² statistic in a randomization distribution or a χ²-distribution (if all expected counts > 5)
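As a quick refresher, the χ² statistic can be computed in a few lines. This is only a sketch: the die-roll counts and the function name `chi_square_statistic` are made up for illustration and do not come from the course data.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical goodness-of-fit data: 60 rolls of a die,
# with H0 saying that all six faces are equally likely.
observed = [8, 12, 9, 11, 6, 14]
n = sum(observed)
expected = [n / 6] * 6  # 10 expected rolls per face under H0

stat = chi_square_statistic(observed, expected)
print(round(stat, 2))
```

The p-value would then be the proportion of a randomization distribution (or of the χ²-distribution) at or above `stat`.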
Multiple Categories
• So far, we’ve learned how to do inference for a difference in means IF the categorical variable has only two categories
• Today, we’ll learn how to do hypothesis tests for a difference in means across multiple categories
Hypothesis Testing
1. State Hypotheses
2. Calculate a test statistic, based on your sample data
3. Create a distribution of this statistic, as it would be observed if the null hypothesis were true
4. Measure how extreme your test statistic from (2) is, as compared to the distribution generated in (3)
Cuckoo Birds
• Cuckoo birds lay their eggs in the nests of other birds
• When the cuckoo baby hatches, it kicks out all the original eggs/babies
• If the cuckoo is lucky, the mother will raise the cuckoo as if it were her own
http://opinionator.blogs.nytimes.com/2010/06/01/cuckoo-cuckoo/
• Do cuckoo birds found in nests of different species differ in size?
[Figure: Length of Cuckoo Eggs]
Notation
• $k$ = number of groups
• $n_j$ = number of units in group $j$
• $n$ = overall number of units = $n_1 + n_2 + \dots + n_k$
Cuckoo Eggs
$k = 5$
$n_1 = 15$, $n_2 = 60$, $n_3 = 16$, $n_4 = 14$, $n_5 = 15$
$n = 120$

Bird           Sample Mean   Sample SD   Sample Size
Pied Wagtail   22.90         1.07        15
Pipit          22.50         0.97        60
Robin          22.58         0.68        16
Sparrow        23.12         1.07        14
Wren           21.13         0.74        15
Overall        22.46         1.07        120
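The overall row can be checked against the group rows: the overall sample size is the sum of the group sizes, and the overall mean is the size-weighted average of the group means. A minimal sketch, using the summary values from the table (the variable names are just for illustration):

```python
# (sample mean, sample size) for each host species, from the table above
groups = {
    "Pied Wagtail": (22.90, 15),
    "Pipit": (22.50, 60),
    "Robin": (22.58, 16),
    "Sparrow": (23.12, 14),
    "Wren": (21.13, 15),
}

k = len(groups)                                  # number of groups
n = sum(size for _, size in groups.values())     # overall sample size

# Overall mean = weighted average of group means, weighted by group size
overall_mean = sum(mean * size for mean, size in groups.values()) / n

print(k, n, round(overall_mean, 2))
```

Because the group means are rounded to two decimals, the recomputed overall mean only matches the table up to rounding.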
Hypotheses
To test for a difference in means across k groups:
$H_0: \mu_1 = \mu_2 = \dots = \mu_k$
$H_a:$ At least one $\mu_i \ne \mu_j$
Test Statistic
Why can’t we use the familiar formula
$$\frac{\text{sample statistic} - \text{null value}}{SE}$$
to get the test statistic?
We need something a bit more complicated…
Difference in Means
Whether or not two means are significantly different depends on
• How far apart the means are
• How much variability there is within each group
[Figure: three pairs of side-by-side boxplots comparing group1 and group2.
Panel 1: sample means 5 and 6, $s_1 = s_2 = 2$.
Panel 2: sample means 5 and 9, $s_1 = s_2 = 2$.
Panel 3: sample means 5 and 6, $s_1 = s_2 = 0.2$.]
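Both ideas (distance between the means, variability within each group) show up in the familiar two-sample t-statistic. The sketch below uses purely illustrative summary values and an assumed sample size of 20 per group; none of these numbers are course data.

```python
import math


def two_sample_t(mean1, mean2, sd1, sd2, n1, n2):
    """t-statistic for a difference in two means (null value 0)."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se


n = 20  # hypothetical sample size per group

baseline = two_sample_t(6, 5, 2, 2, n, n)        # means 1 apart, SD 2
farther = two_sample_t(9, 5, 2, 2, n, n)         # means 4 apart, SD 2
tighter = two_sample_t(6, 5, 0.2, 0.2, n, n)     # means 1 apart, SD 0.2

print(round(baseline, 2), round(farther, 2), round(tighter, 2))
```

Moving the means farther apart or shrinking the within-group spread both make the statistic more extreme, which is the intuition ANOVA generalizes to more than two groups.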
Analysis of Variance
• Analysis of Variance (ANOVA) compares the variability between groups to the variability within groups
Total Variability = Variability Between Groups + Variability Within Groups
Analysis of Variance
If the groups are actually different, then
a) the variability between groups should be higher than the variability within groups
b) the variability within groups should be higher than the variability between groups
Discoveries for Today
• How to measure variability between groups?
• How to measure variability within groups?
• How to compare the two measures?
• How to determine significance?
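Sums of Squares
• We will measure variability as sums of squared deviations (aka sums of squares)
• Sound familiar? The sample variance is built from the same kind of sum of squared deviations from the mean.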
Sums of Squares
• Total Variability: $\sum_{i=1}^{n} (X_i - \bar{X})^2$
  Sum over all data values; $X_i$ = data value $i$, $\bar{X}$ = overall mean
• Variability Between Groups: $\sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2$
  Sum over all groups; $\bar{X}_j$ = mean in group $j$, $\bar{X}$ = overall mean
• Variability Within Groups: $\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{i,j} - \bar{X}_j)^2$
  Sum over all data values; $X_{i,j}$ = $i$th data value in group $j$, $\bar{X}_j$ = mean in group $j$
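The three sums of squares can be computed directly from raw data. The sketch below uses a tiny made-up data set (three hypothetical groups of three values) and an illustrative function name; it also checks that the total variability equals the between-group plus within-group variability.

```python
def sums_of_squares(groups):
    """Return (total, between-group, within-group) sums of squares."""
    all_values = [x for g in groups for x in g]
    overall_mean = sum(all_values) / len(all_values)

    # Total: every data value compared to the overall mean
    total = sum((x - overall_mean) ** 2 for x in all_values)

    # Between: each group mean compared to the overall mean, weighted by n_j
    between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups)

    # Within: every data value compared to its own group mean
    within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    return total, between, within


# Hypothetical data, not from the cuckoo study
data = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
total, between, within = sums_of_squares(data)
print(total, between, within)
```

For this toy data most of the total variability sits between the groups, which is the pattern ANOVA looks for.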
Deviations
[Figure: data values for Group 1 and Group 2, with the overall mean $\bar{X}$ and the Group 1 mean $\bar{X}_1$ marked. Labeled deviations: Total = $X_i - \bar{X}$; Within = $X_{i,j} - \bar{X}_j$; Between = $\bar{X}_1 - \bar{X}$]
Sums of Squares
• Total Variability: SST (Total sum of squares) = $\sum_{i=1}^{n} (X_i - \bar{X})^2$
• Variability Between Groups: SSG (sum of squares due to groups) = $\sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2$
• Variability Within Groups: SSE (“Error” sum of squares) = $\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{i,j} - \bar{X}_j)^2$
Cuckoo Birds
$SST = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 137.19$
$SSG = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2 = 35.90$
$SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{i,j} - \bar{X}_j)^2 = 101.29$
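Without the raw egg lengths, SSG and SSE can still be approximated from the summary table: SSG needs only the group means and sizes, and SSE equals the sum of $(n_j - 1) s_j^2$ over groups, since each $s_j^2$ is a within-group sum of squares divided by $n_j - 1$. This is a rough sketch; because the table values are rounded, the results come out near, not exactly at, 35.90 and 101.29.

```python
# (sample mean, sample SD, sample size) per host species, from the summary table
summary = [
    (22.90, 1.07, 15),  # Pied Wagtail
    (22.50, 0.97, 60),  # Pipit
    (22.58, 0.68, 16),  # Robin
    (23.12, 1.07, 14),  # Sparrow
    (21.13, 0.74, 15),  # Wren
]

n = sum(size for _, _, size in summary)
overall_mean = sum(mean * size for mean, _, size in summary) / n

# Between-group variability: group means vs. overall mean, weighted by n_j
ssg = sum(size * (mean - overall_mean) ** 2 for mean, _, size in summary)

# Within-group variability: (n_j - 1) * s_j^2 recovers each group's sum of squares
sse = sum((size - 1) * sd ** 2 for _, sd, size in summary)

print(round(ssg, 2), round(sse, 2), round(ssg + sse, 2))
```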
ANOVA Table

Source   df      Sum of Squares   Mean Square
Groups   k − 1   SSG              MSG = SSG/(k − 1)
Error    n − k   SSE              MSE = SSE/(n − k)
Total    n − 1   SST

The “mean square” is the sum of squares divided by the degrees of freedom
(sum of squares = variability; mean square = average variability)
ANOVA Table
• Fill in the beginnings of the ANOVA table based on the Cuckoo birds data.

Source   df      Sum of Squares   Mean Square
Groups   k − 1   SSG              MSG = SSG/(k − 1)
Error    n − k   SSE              MSE = SSE/(n − k)
Total    n − 1   SST

Bird           Sample Mean   Sample SD   Sample Size
Pied Wagtail   22.90         1.07        15
Pipit          22.50         0.97        60
Robin          22.58         0.68        16
Sparrow        23.12         1.07        14
Wren           21.13         0.74        15
Overall        22.46         1.07        120

SSG = 35.90
SSE = 101.29
F-Statistic
• The F-statistic is a ratio of the average variability between groups to the average variability within groups
$$F = \frac{MSG}{MSE} = \frac{\text{average between group variability}}{\text{average within group variability}}$$
ANOVA Table

Source   df      Sum of Squares   Mean Square         F Statistic
Groups   k − 1   SSG              MSG = SSG/(k − 1)   MSG/MSE
Error    n − k   SSE              MSE = SSE/(n − k)
Total    n − 1   SST
Cuckoo Eggs

Source   df    Sum of Squares   Mean Square           F Statistic
Groups   4     35.90            35.9/4 = 8.97         8.97/0.88 = 10.19
Error    115   101.29           101.29/115 = 0.88
Total    119   137.19
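The arithmetic in the table can be reproduced from SSG, SSE, k, and n alone. A minimal sketch, with an illustrative function name and dictionary keys that are not part of the course materials:

```python
def anova_table(ssg, sse, k, n):
    """Degrees of freedom, mean squares, and F-statistic for one-way ANOVA."""
    df_groups = k - 1
    df_error = n - k
    msg = ssg / df_groups   # average variability between groups
    mse = sse / df_error    # average variability within groups
    return {
        "df_groups": df_groups,
        "df_error": df_error,
        "df_total": n - 1,
        "SST": ssg + sse,
        "MSG": msg,
        "MSE": mse,
        "F": msg / mse,
    }


# Cuckoo egg values from the slides: SSG = 35.90, SSE = 101.29, k = 5, n = 120
table = anova_table(35.90, 101.29, k=5, n=120)
for name, value in table.items():
    print(name, round(value, 2))
```

Small differences in the last digit can appear depending on whether the mean squares are rounded before dividing.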
F-statistic
If there really is a difference between the groups, we would expect the F-statistic to be
a) Higher than we would observe by random chance
b) Lower than we would observe by random chance
How to determine significance?
We have a test statistic. What else do we need to perform the hypothesis test?
A distribution of the test statistic assuming H0 is true
How do we get this? Two options:
1) Simulation
2) Distributional Theory
Simulation
www.lock5stat.com/statkey
Because a difference would make the F-statistic higher, calculate the proportion in the upper tail
An F-statistic this large would be very unlikely to happen just by random chance if the means were all equal, so we have strong evidence that the mean lengths of cuckoo eggs in nests of different species are not all equal.
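One way to build such a randomization distribution by hand is to shuffle the data values across the groups many times and recompute the F-statistic each time. The sketch below is not how StatKey is implemented; it is an assumed, simplified version using made-up data, hypothetical function names, and only Python's standard library.

```python
import random


def f_statistic(groups):
    """F = MSG / MSE for a list of groups of data values."""
    values = [x for g in groups for x in g]
    n, k = len(values), len(groups)
    overall_mean = sum(values) / n
    ssg = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups)
    sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssg / (k - 1)) / (sse / (n - k))


def randomization_p_value(groups, reps=2000, seed=0):
    """Proportion of shuffled F-statistics at least as large as the observed one."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # reassign values to groups as if H0 were true
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed - 1e-9:  # upper tail only
            count += 1
    return count / reps


# Hypothetical data with one clearly different group
data = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
p_value = randomization_p_value(data)
print(f_statistic(data), p_value)
```

Only the upper tail is counted because a real difference between groups pushes the F-statistic up, never down.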
F-distribution
[Figure: histogram of the randomization distribution (x-axis: F-statistic, 0 to 10; y-axis: Frequency, 0 to 600), shown first on its own and then with the F-distribution overlaid]
F-Distribution
If the following conditions hold,
1. Sample sizes in each group are large (each $n_j \ge 30$) OR the data are relatively normally distributed
2. Variability is similar in all groups
3. The null hypothesis is true
then the F-statistic follows an F-distribution
• The F-distribution has two degrees of freedom, one for the numerator of the ratio (k – 1) and one for the denominator (n – k)
Equal Variance
• The F-distribution assumes equal within group variability for each group
• As a rough rule of thumb, this assumption is violated if the standard deviation of one group is more than double the standard deviation of another group
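The rule of thumb amounts to comparing the largest and smallest group standard deviations. A small sketch, with a hypothetical helper name and made-up standard deviations:

```python
def similar_variability(sds):
    """Rule of thumb: OK unless one group's SD is more than double another's."""
    return max(sds) <= 2 * min(sds)


# Hypothetical sample standard deviations for three groups
print(similar_variability([1.2, 0.9, 1.5]))  # largest SD is under twice the smallest
print(similar_variability([1.2, 0.9, 2.5]))  # 2.5 is more than double 0.9
```

This is only a rough screen; it does not replace looking at plots of the data in each group.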
F-distribution
Can we use the F-distribution to calculate the p-value for the Cuckoo bird eggs?
a) Yes
b) No
c) Need more information

Bird           Sample Mean   Sample SD   Sample Size
Pied Wagtail   22.90         1.07        15
Pipit          22.50         0.97        60
Robin          22.58         0.68        16
Sparrow        23.12         1.07        14
Wren           21.13         0.74        15
Overall        22.46         1.07        120
ANOVA Table

Source   df      Sum of Squares   Mean Square         F Statistic   p-value
Groups   k − 1   SSG              MSG = SSG/(k − 1)   MSG/MSE       Use $F_{k-1,\,n-k}$
Error    n − k   SSE              MSE = SSE/(n − k)
Total    n − 1   SST
Cuckoo Eggs: ANOVA Table

Source   df    Sum of Squares   Mean Square   F Statistic   p-value
Groups   4     35.90            8.97          10.19         $4.3 \times 10^{-7}$
Error    115   101.29           0.88
Total    119   137.19

Conditions checked: equal variability, normal(ish) data
We have very strong evidence that the average length of cuckoo eggs differs for nests of different species
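For readers who want to see where a p-value like this comes from, the upper-tail area of an F-distribution can be approximated with the standard library alone. The sketch below is an assumed approach, not the course's method: it rewrites the tail area as a regularized incomplete beta integral and evaluates it with Simpson's rule. It assumes a positive F-statistic and more than 2 denominator degrees of freedom. With SciPy installed, `scipy.stats.f.sf(f_stat, df_num, df_den)` is the more usual route.

```python
import math


def f_upper_tail(f_stat, df_num, df_den, steps=2000):
    """Approximate P(F >= f_stat) for an F(df_num, df_den) distribution.

    Uses P(F >= x) = I_z(df_den/2, df_num/2) with z = df_den / (df_den + df_num*x),
    where I_z is the regularized incomplete beta function, integrated numerically.
    Assumes f_stat > 0, df_den > 2, and an even number of steps.
    """
    a, b = df_den / 2, df_num / 2
    z = df_den / (df_den + df_num * f_stat)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def density(t):
        # Beta(a, b) density, computed on the log scale to avoid overflow
        if t <= 0.0:
            return 0.0
        return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_beta)

    # Composite Simpson's rule on [0, z]
    h = z / steps
    total = density(0.0) + density(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3


# Cuckoo eggs: F = 10.19 with 4 and 115 degrees of freedom
p_value = f_upper_tail(10.19, 4, 115)
print(f"{p_value:.2e}")
```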
Study Hours by Class Year
Can we use the F-distribution to calculate the p-value for whether there is a difference in average hours spent studying per week by class year at Duke?
a) Yes
b) No
c) Need more information

Year         Sample Mean   Sample SD   Sample Size
First Year   16.06         10.33       72
Sophomore    17.51         9.29        74
Upperclass   19.31         14.74       52
Study Hours by Class Year
Is there a difference in the average hours spent studying per week by class year at Duke?
(a) Yes
(b) No
(c) Cannot tell from this data
(d) I didn’t finish

SSG = 318
SSE = 24984

Year         Sample Mean   Sample SD   Sample Size
First Year   16.06         10.33       72
Sophomore    17.51         9.29        74
Upperclass   19.31         14.74       52
ANOVA Table

Source   df   Sum of Squares   Mean Square   F-Statistic   p-value
Groups
Error
Total
Summary
• Analysis of variance is used to test for a difference in means between groups by comparing the variability between groups to the variability within groups
• Sums of squares are used to measure variability
• The F-statistic is the ratio of average variability between groups to average variability within groups
• The F-statistic follows an F-distribution, if sample sizes are large (or data is normal), variability is equal across groups, and the null hypothesis is true
To Do
• Read Section 8.1 (we are skipping 8.2)
• Do Homework 6 (due Monday, 3/24)