anova

Statistics: Unlocking the Power of Data STAT 101 Dr. Kari Lock Morgan 11/1/12 ANOVA SECTION 8.1 Testing for a difference in means across multiple categories


Statistics: Unlocking the Power of Data Lock5

STAT 101
Dr. Kari Lock Morgan
11/1/12

ANOVA

SECTION 8.1
• Testing for a difference in means across multiple categories

What Next?

• If you have enjoyed learning how to analyze data, and want to learn more:
  • take STAT 210 (Regression Analysis)
  • Applied, focused on data analysis
  • Recommended for any major involving data analysis
  • Only prerequisite is STAT 101

• If you like math and want to learn more of the mathematical theory behind what we’ve learned:
  • take STAT 230 (Probability)
  • and then STAT 250 (Mathematical Statistics)
  • Prerequisite: multivariable calculus

Two Options for p-values

We have learned two ways of calculating p-values. The only difference is how to create a distribution of the statistic, assuming the null is true:

1) Simulation (Randomization Test):
• Directly simulate what would happen, just by random chance, if the null were true

2) Formulas and Theoretical Distributions:
• Use a formula to create a test statistic for which we know the theoretical distribution when the null is true, if sample sizes are large enough
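A minimal sketch of option 1 for a difference in two means, written in Python rather than StatKey. The function name `randomization_p_value` and both data lists are hypothetical, made up only for illustration. The values are shuffled between the groups many times, as they would be if the null were true, and the p-value is the proportion of shuffles with a difference at least as extreme as the observed one.

```python
import random

def randomization_p_value(group_a, group_b, reps=5000, seed=0):
    """Two-sided randomization p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(reps):
        # Reallocate the values to the two groups as if the null were true.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Small tolerance so ties with the observed difference are counted.
        if abs(mean_a - mean_b) >= observed - 1e-12:
            extreme += 1
    return extreme / reps

# Hypothetical measurements, made up for illustration only.
group_a = [5.1, 6.3, 5.8, 6.9, 6.1, 5.5]
group_b = [4.2, 4.9, 5.0, 4.4, 5.3, 4.7]
p_value = randomization_p_value(group_a, group_b)
print(p_value)
```

More shuffles (a larger `reps`) give a more stable p-value, at the cost of more computing time.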

Two Options for Intervals

We have learned two ways of calculating intervals:

1) Simulation (Bootstrap):
• Assess the variability in the statistic by creating many bootstrap statistics

2) Formulas and Theoretical Distributions:
• Use a formula to calculate the standard error of the statistic, and use the normal or t-distribution to find z* or t*, if sample sizes are large enough
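The two options can be put side by side for a single mean. This is only a sketch: the data are hypothetical, the helper name `bootstrap_se` is an assumption, and the multiplier 2 stands in for z* or t* as a rough 95% value.

```python
import math
import random
import statistics

def bootstrap_se(data, reps=5000, seed=0):
    """Standard error of the mean, estimated from many bootstrap means."""
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap sample draws n values from the data with replacement.
    boot_means = [statistics.mean(rng.choices(data, k=n)) for _ in range(reps)]
    return statistics.stdev(boot_means)

# Hypothetical sample, made up for illustration only.
data = [12.1, 9.8, 11.4, 10.2, 13.5, 8.9, 10.7, 12.8, 9.5, 11.1,
        10.9, 12.3, 9.2, 11.8, 10.4, 13.0, 9.9, 11.6, 10.1, 12.6]

se_boot = bootstrap_se(data)                                 # option 1: simulation
se_formula = statistics.stdev(data) / math.sqrt(len(data))   # option 2: formula s / sqrt(n)

mean = statistics.mean(data)
interval = (mean - 2 * se_boot, mean + 2 * se_boot)  # statistic +/- 2 * SE (rough 95%)
print(round(se_boot, 3), round(se_formula, 3), interval)
```

With a reasonably large sample and many bootstrap samples, the two standard errors should be close.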

Inference

Which way did you prefer to learn inference?

a) Simulation methods
b) Formulas and theoretical distributions

Inference

Which way gave you a better conceptual understanding of confidence intervals and p-values?

a) Simulation methods
b) Formulas and theoretical distributions

Inference

Which way do you prefer to do inference?

a) Simulation methods
b) Formulas and theoretical distributions

Pros and Cons

1) Simulation Methods

PROS:
• Methods tied directly to concepts, emphasizing conceptual understanding
• Same procedure for every statistic
• No formulas or theoretical distributions to learn and distinguish between
• Works for any sample size
• Minimal math needed

CONS:
• Need entire dataset (if quantitative variables)
• Need a computer
• Newer approach, so different from the way most people do statistics

Pros and Cons

2) Formulas and Theoretical Distributions

PROS:
• Only need summary statistics
• Only need a calculator
• The approach most people take

CONS:
• Plugging numbers into formulas does little for conceptual understanding
• Many different formulas and distributions to learn and distinguish between
• Harder to see the big picture when the details are different for each statistic
• Doesn’t work for small sample sizes
• Requires more math and background knowledge


Two Options

• If the sample size is small, you have to use simulation methods

• If the sample size is large, you can use whichever method you prefer

• It is redundant to use both methods, unless you want to check your answers

Accuracy

• The accuracy of simulation methods depends on the number of simulations (more simulations = more accurate)

• The accuracy of formulas and theoretical distributions depends on the sample size (larger sample size = more accurate)

• If the sample size is large and you have generated many simulations, the two methods should give essentially the same answer


Multiple Categories

•So far, we’ve learned how to do inference for a difference in means IF the categorical variable has only two categories

•Today, we’ll learn how to do hypothesis tests for a difference in means across multiple categories

Hypothesis Testing

1. State hypotheses

2. Calculate a statistic (the test statistic), based on your sample data

3. Create a distribution of this statistic, as it would be observed if the null hypothesis were true

4. Measure how extreme your test statistic from (2) is, as compared to the distribution generated in (3)

Cuckoo Birds

• Cuckoo birds lay their eggs in the nests of other birds
• When the cuckoo baby hatches, it kicks out all the original eggs/babies
• If the cuckoo is lucky, the mother will raise the cuckoo as if it were her own
• Do cuckoo birds found in nests of different species differ in size?

http://opinionator.blogs.nytimes.com/2010/06/01/cuckoo-cuckoo/

Length of Cuckoo Eggs

Notation

• k = number of groups
• n_j = number of units in group j
• n = overall number of units = n_1 + n_2 + … + n_k

Cuckoo Eggs

k = 5
n_1 = 15, n_2 = 60, n_3 = 16, n_4 = 14, n_5 = 15
n = 120

| Bird | Sample Mean | Sample SD | Sample Size |
|---|---|---|---|
| Pied Wagtail | 22.90 | 1.07 | 15 |
| Pipit | 22.50 | 0.97 | 60 |
| Robin | 22.58 | 0.68 | 16 |
| Sparrow | 23.12 | 1.07 | 14 |
| Wren | 21.13 | 0.74 | 15 |
| Overall | 22.46 | 1.07 | 120 |
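As a quick check on the notation, k, n and the overall mean can be recovered from the group rows of the table. The sketch below assumes only the group means and sample sizes shown above; the variable names are illustrative. The overall mean is the mean of the group means weighted by sample size.

```python
# Group summaries from the cuckoo egg table: name -> (sample mean, sample size)
groups = {
    "Pied Wagtail": (22.90, 15),
    "Pipit": (22.50, 60),
    "Robin": (22.58, 16),
    "Sparrow": (23.12, 14),
    "Wren": (21.13, 15),
}

k = len(groups)                                   # number of groups
n = sum(size for _, size in groups.values())      # n = n_1 + n_2 + ... + n_k
# Weight each group mean by its sample size to get the overall mean.
overall_mean = sum(mean * size for mean, size in groups.values()) / n

print(k, n, round(overall_mean, 2))  # 5 120 22.46
```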

Hypotheses

To test for a difference in means across k groups:

$H_0: \mu_1 = \mu_2 = \dots = \mu_k$

$H_a: \text{At least one } \mu_i \neq \mu_j$

Test Statistic

Why can’t we use the familiar formula

$\dfrac{\text{sample statistic} - \text{null value}}{SE}$

to get the test statistic?

• More than one sample statistic
• More than one null value

We need something a bit more complicated…

Difference in Means

Whether or not two means are significantly different depends on

• How far apart the means are

• How much variability there is within each group

[Figure: three side-by-side plots comparing group1 and group2, each annotated with the sample means $\bar{X}_1$, $\bar{X}_2$ and the standard deviations $s_1$, $s_2$.]

Analysis of Variance

•Analysis of Variance (ANOVA) compares the variability between groups to the variability within groups

Total Variability = Variability Between Groups + Variability Within Groups

Analysis of Variance

If the groups are actually different, then

a) the variability between groups should be higher than the variability within groups

b) the variability within groups should be higher than the variability between groups

If the groups are different, there will be high variability between the groups.

Discoveries for Today

•How to measure variability between groups?

•How to measure variability within groups?

•How to compare the two measures?

•How to determine significance?


Sums of Squares

•We will measure variability as sums of squared deviations (aka sums of squares)

•familiar?

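Total Variability = Variability Between Groups + Variability Within Groups

$\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2 + \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{i,j} - \bar{X}_j)^2$

• Total: sum over all data values, where $X_i$ is data value i and $\bar{X}$ is the overall mean
• Between groups: sum over all groups, where $\bar{X}_j$ is the mean in group j and $\bar{X}$ is the overall mean
• Within groups: sum over all data values, where $X_{i,j}$ is the ith data value in group j and $\bar{X}_j$ is the mean in group j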

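Deviations

[Figure: data values for Group 1 and Group 2, with the overall mean $\bar{X}$ and the Group 1 mean $\bar{X}_1$ marked.]

• Total deviation: $X_i - \bar{X}$
• Within-group deviation: $X_{i,j} - \bar{X}_j$
• Between-group deviation: $\bar{X}_1 - \bar{X}$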

Sums of Squares

• SST (total sum of squares): $SST = \sum_{i=1}^{n} (X_i - \bar{X})^2$
• SSG (sum of squares due to groups): $SSG = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2$
• SSE (“error” sum of squares): $SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{i,j} - \bar{X}_j)^2$

$SST = SSG + SSE$
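The three sums of squares can be computed directly from raw data. The sketch below uses a tiny hypothetical dataset (three groups of three values, made up for illustration), and the function name `sums_of_squares` is an assumption. It also confirms numerically that SST = SSG + SSE.

```python
def sums_of_squares(groups):
    """Return (SST, SSG, SSE) for a dict mapping group name -> list of values."""
    all_values = [x for values in groups.values() for x in values]
    overall_mean = sum(all_values) / len(all_values)

    # SST: squared deviations of every data value from the overall mean.
    sst = sum((x - overall_mean) ** 2 for x in all_values)

    ssg = 0.0
    sse = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        # SSG: squared deviation of the group mean from the overall mean, weighted by n_j.
        ssg += len(values) * (group_mean - overall_mean) ** 2
        # SSE: squared deviations of each value from its own group mean.
        sse += sum((x - group_mean) ** 2 for x in values)
    return sst, ssg, sse

# Hypothetical data, made up for illustration only.
data = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
sst, ssg, sse = sums_of_squares(data)
print(sst, ssg, sse)  # 60.0 54.0 6.0
```

Here most of the total variability (54 of 60) is between groups, which is what we would expect when the groups really differ.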

Cuckoo Birds

$SST = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 137.19$

$SSG = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2 = 35.90$

$SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{i,j} - \bar{X}_j)^2 = 101.29$
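These values can be approximately reproduced from the summary table alone, without the raw egg lengths. The sketch below assumes the rounded group means, standard deviations and sample sizes from the cuckoo table, and uses the identity $SSE = \sum_j (n_j - 1) s_j^2$, which follows from the definition of the sample standard deviation. Because the table entries are rounded, the results match the slide values only approximately.

```python
# Cuckoo egg summaries: name -> (sample mean, sample SD, sample size)
summary = {
    "Pied Wagtail": (22.90, 1.07, 15),
    "Pipit": (22.50, 0.97, 60),
    "Robin": (22.58, 0.68, 16),
    "Sparrow": (23.12, 1.07, 14),
    "Wren": (21.13, 0.74, 15),
}

n = sum(size for _, _, size in summary.values())
overall_mean = sum(mean * size for mean, _, size in summary.values()) / n

# SSG: n_j times the squared distance of each group mean from the overall mean.
ssg = sum(size * (mean - overall_mean) ** 2 for mean, _, size in summary.values())
# SSE: (n_j - 1) * s_j^2 recovers each group's within-group sum of squares.
sse = sum((size - 1) * sd ** 2 for _, sd, size in summary.values())

# Rounded inputs give values close to, but not exactly, 35.90 and 101.29.
print(round(ssg, 2), round(sse, 2))
```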

ANOVA Table

| Source | df | Sum of Squares | Mean Square |
|---|---|---|---|
| Groups | k-1 | SSG | MSG = SSG/(k-1) |
| Error | n-k | SSE | MSE = SSE/(n-k) |
| Total | n-1 | SST | |

The “mean square” is the sum of squares divided by the degrees of freedom. The sum of squares measures variability; the mean square measures average variability.

ANOVA Table

• Fill in the beginnings of the ANOVA table based on the Cuckoo birds data.

| Source | df | Sum of Squares | Mean Square |
|---|---|---|---|
| Groups | k-1 | SSG | MSG = SSG/(k-1) |
| Error | n-k | SSE | MSE = SSE/(n-k) |
| Total | n-1 | SST | |

| Bird | Sample Mean | Sample SD | Sample Size |
|---|---|---|---|
| Pied Wagtail | 22.90 | 1.07 | 15 |
| Pipit | 22.50 | 0.97 | 60 |
| Robin | 22.58 | 0.68 | 16 |
| Sparrow | 23.12 | 1.07 | 14 |
| Wren | 21.13 | 0.74 | 15 |
| Overall | 22.46 | 1.07 | 120 |

SSG = 35.90
SSE = 101.29

ANOVA Table

| Source | df | Sum of Squares | Mean Square |
|---|---|---|---|
| Groups | 4 | 35.90 | 35.9/4 = 8.97 |
| Error | 115 | 101.29 | 101.29/115 = 0.88 |
| Total | 119 | 137.19 | |


F-Statistic

•The F-statistic is a ratio of the average variability between groups to the average variability within groups

$F = \dfrac{MSG}{MSE} = \dfrac{\text{average between-group variability}}{\text{average within-group variability}}$

ANOVA Table

| Source | df | Sum of Squares | Mean Square | F Statistic |
|---|---|---|---|---|
| Groups | k-1 | SSG | MSG = SSG/(k-1) | MSG/MSE |
| Error | n-k | SSE | MSE = SSE/(n-k) | |
| Total | n-1 | SST | | |

Cuckoo Eggs

| Source | df | Sum of Squares | Mean Square | F Statistic |
|---|---|---|---|---|
| Groups | 4 | 35.90 | 35.9/4 = 8.97 | 8.97/0.88 = 10.19 |
| Error | 115 | 101.29 | 101.29/115 = 0.88 | |
| Total | 119 | 137.19 | | |
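The arithmetic in the table can be reproduced in a few lines. This is a sketch only: the function name `anova_table` and the dictionary layout are illustrative choices, while the inputs (SSG = 35.90, SSE = 101.29, k = 5, n = 120) are the cuckoo values from the slides.

```python
def anova_table(ssg, sse, k, n):
    """Degrees of freedom, mean squares and F-statistic from the sums of squares."""
    df_groups, df_error = k - 1, n - k
    msg = ssg / df_groups   # average variability between groups
    mse = sse / df_error    # average variability within groups
    return {
        "df": (df_groups, df_error, n - 1),
        "SS": (ssg, sse, ssg + sse),
        "MS": (msg, mse),
        "F": msg / mse,
    }

table = anova_table(ssg=35.90, sse=101.29, k=5, n=120)
print(table["df"])           # (4, 115, 119)
print(round(table["F"], 2))  # 10.19
```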

F-statistic

If there really is a difference between the groups, we would expect the F-statistic to be

a) Higher than we would observe by random chance

b) Lower than we would observe by random chance

The numerator of the F-statistic measures between-group variability, and the denominator measures within-group variability. If there is a difference, we expect the between-group variability to be higher.

If the null hypothesis is true, what kind of F-statistics would we observe just by random chance?


How to determine significance?

We have a test statistic. What else do we need to perform the hypothesis test?

A distribution of the test statistic assuming H0 is true

How do we get this? Two options:
1) Simulation
2) Distributional Theory

Simulation

www.lock5stat.com/statkey

Because a difference would make the F-statistic higher, calculate proportion in the upper tail

An F-statistic this large would be very unlikely to happen just by random chance if the means were all equal, so we have strong evidence that the mean lengths of cuckoo birds in nests of different species are not all equal.
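StatKey does this simulation for you. The sketch below shows the same idea in Python under stated assumptions: the raw cuckoo data are not reproduced in these slides, so the three groups are hypothetical values made up for illustration, and the function names are illustrative. Group labels are shuffled many times, an F-statistic is computed for each shuffle, and the p-value is the proportion in the upper tail.

```python
import random

def f_statistic(groups):
    """F = MSG / MSE for a list of groups (each a list of values)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    overall_mean = sum(sum(g) for g in groups) / n
    ssg = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups)
    sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssg / (k - 1)) / (sse / (n - k))

def randomization_f_p_value(groups, reps=2000, seed=0):
    """Observed F and the upper-tail proportion of shuffled F-statistics."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    extreme = 0
    for _ in range(reps):
        # Shuffle values across groups, as if group membership did not matter.
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # A difference makes F larger, so count only the upper tail.
        if f_statistic(shuffled) >= observed - 1e-9:
            extreme += 1
    return observed, extreme / reps

# Hypothetical data, made up for illustration only.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
f_obs, p_value = randomization_f_p_value(groups)
print(f_obs, p_value)
```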

F-distribution

[Figure: Randomization Distribution histogram (x-axis: F-statistic, 0 to 10; y-axis: Frequency, 0 to 600), shown with the F-distribution labeled for comparison.]

F-Distribution

If the following conditions hold,

1. Sample sizes in each group are large (each n_j ≥ 30) OR the data are relatively normally distributed

2. Variability is similar in all groups

3. The null hypothesis is true

then the F-statistic follows an F-distribution.

• The F-distribution has two degrees of freedom, one for the numerator of the ratio (k – 1) and one for the denominator (n – k)

Equal Variance

• The F-distribution assumes equal within group variability for each group

•As a rough rule of thumb, this assumption is violated if the standard deviation of one group is more than double the standard deviation of another group
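The rule of thumb amounts to comparing the largest and smallest group standard deviations. A minimal sketch, using the sample SDs from the cuckoo egg table (the variable names are illustrative):

```python
# Sample standard deviations from the cuckoo egg table.
sample_sds = {
    "Pied Wagtail": 1.07,
    "Pipit": 0.97,
    "Robin": 0.68,
    "Sparrow": 1.07,
    "Wren": 0.74,
}

# The assumption is violated if one SD is more than double another,
# so it is enough to compare the largest SD with the smallest.
ratio = max(sample_sds.values()) / min(sample_sds.values())
equal_variance_ok = ratio <= 2

print(round(ratio, 2), equal_variance_ok)  # 1.57 True
```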

F-distribution

Can we use the F-distribution to calculate the p-value for the Cuckoo bird eggs?

a) Yes
b) No
c) Need more information

| Bird | Sample Mean | Sample SD | Sample Size |
|---|---|---|---|
| Pied Wagtail | 22.90 | 1.07 | 15 |
| Pipit | 22.50 | 0.97 | 60 |
| Robin | 22.58 | 0.68 | 16 |
| Sparrow | 23.12 | 1.07 | 14 |
| Wren | 21.13 | 0.74 | 15 |
| Overall | 22.46 | 1.07 | 120 |

The equal variability condition is satisfied, but the sample sizes are small, so we can only use the F-distribution if the data are normal.

Length of Cuckoo Eggs

F-distribution p-values

1. StatKey – simulation or theoretical

2. RStudio: tail.p("f", stat, df1, df2, tail="upper")

3. TI-83: 2nd DISTR 7:Fcdf(lower bound, upper bound, df1, df2)

For F-statistics, the p-value (the area as extreme or more extreme) is always the upper tail
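For readers without StatKey, RStudio or a TI-83, the upper-tail area can also be approximated with the Python standard library. This is a sketch under stated assumptions, not the course's method: the function name `f_upper_tail` is hypothetical, it assumes a positive F-statistic and a denominator df of at least 2, and it uses the identity $P(F \ge f) = I_x(df_2/2,\, df_1/2)$ with $x = df_2 / (df_2 + df_1 f)$, evaluating the regularized incomplete beta integral by Simpson's rule. If SciPy is installed, `scipy.stats.f.sf(f_stat, df1, df2)` computes the same upper-tail area.

```python
import math

def f_upper_tail(f_stat, df1, df2, steps=20000):
    """Approximate P(F >= f_stat) for an F(df1, df2) distribution.

    Assumes f_stat > 0, df2 >= 2, and an even number of steps.
    """
    a, b = df2 / 2.0, df1 / 2.0
    x = df2 / (df2 + df1 * f_stat)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def density(t):
        # Beta(a, b) density; handle the left endpoint explicitly.
        if t == 0.0:
            return 0.0 if a > 1.0 else math.exp(-log_beta)
        return math.exp((a - 1.0) * math.log(t)
                        + (b - 1.0) * math.log(1.0 - t) - log_beta)

    # Simpson's rule on [0, x].
    h = x / steps
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3.0

# Upper-tail area for an F-statistic of 10.19 with 4 and 115 degrees of freedom.
p_value = f_upper_tail(10.19, 4, 115)
print(p_value)
```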

ANOVA Table

| Source | df | Sum of Squares | Mean Square | F Statistic | p-value |
|---|---|---|---|---|---|
| Groups | k-1 | SSG | MSG = SSG/(k-1) | MSG/MSE | Use $F_{k-1,\,n-k}$ |
| Error | n-k | SSE | MSE = SSE/(n-k) | | |
| Total | n-1 | SST | | | |

Cuckoo Eggs

ANOVA Table

| Source | df | Sum of Squares | Mean Square | F Statistic | p-value |
|---|---|---|---|---|---|
| Groups | 4 | 35.90 | 8.97 | 10.19 | $4.3 \times 10^{-7}$ |
| Error | 115 | 101.29 | 0.88 | | |
| Total | 119 | 137.19 | | | |

Conditions: equal variability, normal(ish) data

We have very strong evidence that the average length of cuckoo eggs differs for nests of different species.

Study Hours by Class Year

Can we use the F-distribution to calculate the p-value for whether there is a difference in average hours spent studying per week by class year at Duke?

a) Yes
b) No
c) Need more information

| Year | Sample Mean | Sample SD | Sample Size |
|---|---|---|---|
| First Year | 16.06 | 10.33 | 72 |
| Sophomore | 17.51 | 9.29 | 74 |
| Upperclass | 19.31 | 14.74 | 52 |

The equal variability condition is satisfied, and the sample sizes are large enough (>30).

Study Hours by Class Year

Is there a difference in the average hours spent studying per week by class year at Duke?

(a) Yes
(b) No
(c) Cannot tell from this data
(d) I didn’t finish

SSG = 318
SSE = 24984

| Year | Sample Mean | Sample SD | Sample Size |
|---|---|---|---|
| First Year | 16.06 | 10.33 | 72 |
| Sophomore | 17.51 | 9.29 | 74 |
| Upperclass | 19.31 | 14.74 | 52 |

The p-value is 0.29, which means we cannot reject the null, so we cannot determine whether there is a difference.

ANOVA Table

| Source | df | Sum of Squares | Mean Square | F-Statistic | p-value |
|---|---|---|---|---|---|
| Groups | 2 | 318 | 159 | 1.24 | 0.29 |
| Error | 195 | 24984 | 128.1 | | |
| Total | 197 | 25302 | | | |

Summary

• Analysis of variance is used to test for a difference in means between groups by comparing the variability between groups to the variability within groups

• Sums of squares are used to measure variability

• The F-statistic is the ratio of average variability between groups to average variability within groups

• The F-statistic follows an F-distribution, if sample sizes are large (or data is normal), variability is equal across groups, and the null hypothesis is true

To Do

Read Section 8.1

Complete the anonymous midterm evaluation by Monday, 11/5, 5pm

Do Homework 6 (due Tuesday, 11/6)