
Copyright 2012 Pearson Education, Inc. Publishing as Addison-Wesley.

CHAPTER 16

Analysis of Variance (ANOVA)

GENERAL OBJECTIVE

In Chapter 10, we studied inferential methods for comparing the means of two populations. Now we will study analysis of variance, or ANOVA, which provides methods for comparing two or more population means. You should be familiar with the chapter that discusses analysis of variance in your textbook before beginning this chapter.

LESSON OUTLINE

16.1 The F-distribution
16.2 One-Way ANOVA: The Logic
16.3 One-Way ANOVA: The Procedure
16.4 Multiple Comparisons*
16.5 The Kruskal-Wallis Test*
16.6 Problems




16.1 The F-distribution

Analysis of variance procedures rely on a distribution called the F-distribution, named in honor of Sir Ronald Fisher (1890-1962). A variable is said to have an F-distribution if its distribution has a special type of right-skewed curve, called an F-curve. There are infinitely many F-distributions, which we identify by stating two numbers of degrees of freedom: the degrees of freedom for the numerator and the degrees of freedom for the denominator. We will now study how SPSS can be used to find an F-value, Fα, from this distribution.

Finding the F-Value Having a Specified Area to Its Right

Example 16.1 For an F-curve with degrees of freedom df = (4, 12), find F0.05; that is, find the F-value having area 0.05 to its right for an F-distribution with 4 degrees of freedom in the numerator and 12 degrees of freedom in the denominator.

Solution The SPSS function IDF.F(prob, df1, df2) returns the value from the F-distribution, with the specified degrees of freedom df = (df1, df2), for which the area to the left is prob. As when computing a t-score, we will use the Compute Variable dialog box.

The F-value having area 0.05 to its right has area 0.95 to its left, since the total area under the probability curve is one. In the Numeric Expression box, type IDF.F(0.95, 4, 12). SPSS returns the F-value that has area 0.05 to its right as F = 3.26.
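For readers who want to check this value outside SPSS, the following is a minimal Python sketch of the same left-area lookup. It is not how SPSS computes IDF.F; the function names f_pdf, f_cdf, and idf_f are hypothetical, and the sketch assumes df1 > 2 so that the F-curve starts at height zero. It integrates the F-density numerically with Simpson's rule and then searches by bisection for the value whose left area is prob.

```python
import math

def f_pdf(x, df1, df2):
    """Height of the F-curve with (df1, df2) degrees of freedom at x."""
    if x <= 0:
        return 0.0  # valid at x = 0 only under the assumption df1 > 2
    log_beta = (math.lgamma(df1 / 2) + math.lgamma(df2 / 2)
                - math.lgamma((df1 + df2) / 2))
    log_density = (
        0.5 * df1 * math.log(df1 / df2)
        + (0.5 * df1 - 1) * math.log(x)
        - 0.5 * (df1 + df2) * math.log(1 + df1 * x / df2)
        - log_beta
    )
    return math.exp(log_density)

def f_cdf(x, df1, df2, steps=2000):
    """Area to the left of x, by Simpson's rule on [0, x] (steps is even)."""
    if x <= 0:
        return 0.0
    h = x / steps
    total = f_pdf(0.0, df1, df2) + f_pdf(x, df1, df2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, df1, df2)
    return total * h / 3

def idf_f(prob, df1, df2):
    """F-value with area prob to its left (hypothetical stand-in for IDF.F)."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, df1, df2) < prob:   # widen the bracket until it holds the answer
        lo, hi = hi, hi * 2
    for _ in range(50):                 # bisection on the left area
        mid = (lo + hi) / 2
        if f_cdf(mid, df1, df2) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Area 0.05 to the right means area 0.95 to the left, as in Example 16.1.
print(round(idf_f(0.95, 4, 12), 2))  # 3.26
```

As in the SPSS solution, the right-tail area 0.05 is converted to the left-tail area 0.95 before the lookup.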

16.2 One-Way ANOVA: The Logic

Analysis of Variance (ANOVA) provides methods for comparing several population means, that is, the means of a single variable from several populations. In this chapter, we study one-way analysis of variance. This type of ANOVA is called one-way analysis of variance because it compares the means of a variable for populations that result from a classification by one variable, called the factor. The possible values of the factor are referred to as the levels of the factor.

One-way ANOVA is the generalization to more than two populations of the pooled t-procedure. As in the pooled t-procedure, we make the following assumptions.




Assumptions (Conditions) for One-Way ANOVA

1. Simple Random Samples: The samples taken from the populations under consideration are simple random samples.

2. Independent Samples: The samples taken from the populations under consideration are independent of one another.

3. Normal Populations: For each population, the variable under consideration is normally distributed.

4. Equal Standard Deviations: The standard deviations of the variable under consideration are the same for all the populations.

16.3 One-Way ANOVA: The Procedure

The One-Way ANOVA Test

Example 16.3 Energy Consumption: The U.S. Energy Information Administration gathers data on residential energy consumption and expenditures and publishes its findings in Residential Energy Consumption Survey: Consumption and Expenditures. Table 16 - 1 shows last year's energy consumptions for four independent random samples of households in the four U.S. regions.

Table 16 - 1
Energy consumption for samples of households in four U.S. regions

Northeast   Midwest   South   West
   15          17       11     10
   10          12        7     12
   13          18        9      8
   14          13       13      7
   13          15               9
               12

At the 5% level of significance, do the data provide sufficient evidence to

conclude that a difference exists in mean annual energy consumption by

households in the four U.S. regions?





Solution Type the data into two variables named ENERGY and REGION. ENERGY should contain all 20 data values in the four samples. REGION should take on the four values 1, 2, 3, and 4, which associate each case with a region. The values of REGION, 1, 2, 3, and 4, should be associated with the value labels Northeast, Midwest, South, and West, respectively.

Step 1: State the null and alternative hypotheses.

Let μ1, μ2, μ3, and μ4 denote last year's mean energy consumptions for households in the Northeast, Midwest, South, and West, respectively. The null and alternative hypotheses are:

H0: μ1 = μ2 = μ3 = μ4 (mean consumptions are all equal)
Ha: Not all the mean consumptions are equal.

Step 2: Decide on the significance level, α.

The test is to be performed at the 5% significance level. Thus α = 0.05.

Step 3: Compute the value of the test statistic.

1. Test the hypotheses by choosing Analyze > Compare Means > One-Way ANOVA to open the One-Way ANOVA dialog box (Figure 16 - 1).

Figure 16 - 1
One-Way ANOVA dialog box

2. Paste the variable ENERGY into the Dependent List box and the variable REGION into the Factor box.

3. Click the OK button to display the results of the one-way ANOVA in the Viewer window.

The ANOVA table (Figure 16 - 2) shows several statistics used in analysis of variance.




Figure 16 - 2
ANOVA table from One-Way ANOVA procedure

The test statistic is F = 6.318. It has an F-distribution with df = (3, 16).

Step 4: Obtain the p-value.

The test statistic has an associated p-value = 0.005, which is given under the column titled Sig.

Step 5: If P < α, reject H0; otherwise, do not reject H0.

The p-value is less than the specified significance level of 0.05; therefore, we reject the null hypothesis.

Step 6: Interpret the results of the hypothesis test.

At the 5% significance level, the data provide sufficient evidence to conclude that a difference exists in last year's mean energy consumption by households among the four U.S. regions. That is, at least two of the regions have different mean energy consumptions.

The ANOVA Table

The layout of the ANOVA table in SPSS is similar to the layout in the textbook chapter, with the following exceptions. SPSS denotes Treatment by Between Groups and Error by Within Groups. This is because SSTR can be thought of as the error between the sample means and SSE can be thought of as the error within the samples. The values SSTR = 97.5, SSE = 82.3, and SST = 179.8 can be read from the second column in the ANOVA table (Figure 16 - 2).

The one-way ANOVA identity,

SST = SSTR + SSE = 97.5 + 82.3 = 179.8,

shows that the total variation among all the sample data can be partitioned into a component representing variation among the sample means and a component representing variation within samples. The associated degrees of freedom and mean squares are also reported in the ANOVA table.
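The sums of squares and the F statistic can also be recomputed directly from the data in Table 16 - 1. The Python sketch below is only an illustration of the defining formulas, not of SPSS's internal method, and the function name one_way_anova is hypothetical. It computes SSTR from the sample means, SSE from the deviations within each sample, SST from the deviations about the grand mean, and F as the ratio of the two mean squares.

```python
# Energy consumption samples from Table 16 - 1
samples = {
    "Northeast": [15, 10, 13, 14, 13],
    "Midwest":   [17, 12, 18, 13, 15, 12],
    "South":     [11, 7, 9, 13],
    "West":      [10, 12, 8, 7, 9],
}

def one_way_anova(groups):
    """Return (SSTR, SSE, SST, F) for a list of samples."""
    all_values = [x for g in groups for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation among the sample means (Between Groups)
    sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation within the samples (Within Groups)
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Total variation, computed independently to check the identity
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    f_stat = (sstr / (k - 1)) / (sse / (n - k))
    return sstr, sse, sst, f_stat

sstr, sse, sst, f_stat = one_way_anova(list(samples.values()))
print(f"SSTR = {sstr:.1f}, SSE = {sse:.1f}, SST = {sst:.1f}, F = {f_stat:.3f}")
# SSTR = 97.5, SSE = 82.3, SST = 179.8, F = 6.318
```

Because SST is computed separately rather than as SSTR + SSE, the output also serves as a numerical check of the one-way ANOVA identity.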





16.4 Multiple Comparisons*

When the null hypothesis is rejected in a one-way ANOVA, the conclusion is that the means are not all equal. Once you make that decision, you may also want to know which means are different, which is the largest, or, more generally, the relations among all the means. Methods for answering such questions are called multiple comparisons.

SPSS provides several multiple comparison methods, including the Tukey multiple comparison method. In multiple comparisons, it is important to distinguish between the individual confidence level and the family confidence level. The individual confidence level is the confidence that any particular confidence interval contains the true difference between the corresponding population means; the family confidence level is the confidence that all the confidence intervals simultaneously contain their respective true differences.

The Tukey multiple comparison method is based on the studentized range distribution. The Tukey multiple comparison method for obtaining confidence intervals for the differences between means is similar to the pooled t-interval formula. The essential difference is that, in the Tukey multiple comparison method, the percentile of a studentized range distribution is used instead of the percentile of a t-distribution. The effect of this is that the (1 - α)-level confidence intervals constructed by the Tukey multiple comparison method have a family confidence level of 1 - α. Each of the (1 - α)-level confidence intervals constructed by the pooled t-interval formula has an individual confidence level of 1 - α; the family confidence level for this set of confidence intervals is smaller than 1 - α.
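To get a feel for how far the family confidence level can fall below the individual level, the short Python sketch below counts the pairwise differences among four populations and applies Bonferroni's inequality. Bonferroni's inequality is not part of the original discussion; it is used here only as an assumption-free lower bound. The exact family confidence level of a set of pooled t-intervals lies somewhere between this bound and 1 - α.

```python
from math import comb

k = 4              # number of populations, e.g. the four U.S. regions
alpha = 0.05       # each pooled t-interval has individual confidence 1 - alpha
pairs = comb(k, 2) # number of pairwise differences between means

# Bonferroni's inequality: family confidence >= 1 - pairs * alpha
family_lower_bound = 1 - pairs * alpha
print(pairs, round(family_lower_bound, 2))  # 6 0.7
```

With six intervals at individual confidence 95%, the family confidence level is only guaranteed to be at least 70%, which is why a method that controls the family level directly, such as Tukey's, is preferred.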

The Tukey Multiple-Comparison Method

Example 16.6 Energy Consumption: Apply the Tukey multiple comparison method to the energy consumption data in Table 16 - 1. Use a family confidence level of 95%.

Solution To perform Tukey multiple comparisons in SPSS:

1. Click the Post Hoc... button in the One-Way ANOVA dialog box (Figure 16 - 1) to open the One-Way ANOVA: Post Hoc Multiple Comparisons dialog box (Figure 16 - 3).




Figure 16 - 3
One-Way ANOVA: Post Hoc Multiple Comparisons dialog box

2. Select the Tukey checkbox.

3. A 95% family confidence level corresponds to a 5% significance level. Therefore, enter 0.05 into the Significance level box.

4. Click the Continue button to close the dialog box and then click the OK button to display the results in the Viewer window.

The Multiple Comparisons table (Figure 16 - 4) shows 95% confidence intervals for the differences using the Tukey multiple comparison method.

Figure 16 - 4
Multiple Comparisons table for Tukey multiple comparison method





For example, the confidence interval for the mean difference between the Northeast and Midwest regions is -5.429 to 2.429. Two population means are significantly different if the confidence interval for their difference does not include 0. This is true for the Midwest and South regions, for example. SPSS provides another table, the Homogeneous Subsets table (Figure 16 - 5), to help decipher which population means are different and which are equal.
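The Northeast versus Midwest interval can be approximated by hand from quantities already reported in the example. The Python sketch below applies the Tukey interval formula for unequal sample sizes. The critical value q = 4.046 is an assumption taken from a studentized range table (4 groups, 16 error degrees of freedom, α = 0.05) and does not appear in the original text; the sample means 13.0 and 14.5 are computed from Table 16 - 1.

```python
import math

# Mean square error from the ANOVA table: SSE = 82.3 on n - k = 16 df
mse = 82.3 / 16

n_northeast, n_midwest = 5, 6
mean_northeast, mean_midwest = 13.0, 14.5  # sample means from Table 16 - 1

# ASSUMPTION: tabulated studentized range critical value for 4 groups,
# 16 degrees of freedom, alpha = 0.05; not given in the original text.
q = 4.046

# Tukey margin of error for the difference between two means
margin = (q / math.sqrt(2)) * math.sqrt(mse) * math.sqrt(
    1 / n_northeast + 1 / n_midwest
)
diff = mean_northeast - mean_midwest
lower, upper = diff - margin, diff + margin
print(f"{lower:.3f} to {upper:.3f}")
```

The result should land close to the interval of -5.429 to 2.429 reported by SPSS; any small discrepancy would come from rounding in the tabulated value of q. Since the interval contains 0, the Northeast and Midwest means are not judged different.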

Figure 16 - 5
Homogeneous Subsets table from Tukey multiple comparison procedure

Means that are lined up together in a column under Subset for alpha = 0.05 are judged equal by the Tukey multiple comparison method. Means that do not share a column are judged not equal. That is, the data do not provide sufficient evidence of a difference among the population means for the West, South, and Northeast regions, nor between the population means for the Northeast and Midwest regions. Further, since West and Midwest are in different columns, there is sufficient evidence that their population means are not equal. These results have a 95% family confidence level.

16.5 The Kruskal-Wallis Test*

The Kruskal-Wallis test is a nonparametric alternative to the one-way ANOVA procedure. It tests whether several independent samples are from the same population. The Kruskal-Wallis test applies when the distributions (one for each population) of the variable under consideration have the same shape, but it does not require that they be normal or have any other specific shape. Like the Mann-Whitney test, the Kruskal-Wallis test is based on ranks.




The Kruskal-Wallis Test

Example 16.8 Vehicle Miles: The U.S. Federal Highway Administration conducts annual surveys on motor vehicle travel by type of vehicle and publishes its findings in Highway Statistics. Independent simple random samples of cars, buses, and trucks were chosen, and the data on number of miles driven, in thousands, by each sampled vehicle last year are shown in Table 16 - 2.

Table 16 - 2
Number of miles driven (1000s) last year for independent samples of cars, buses, and trucks

Cars   Buses   Trucks
19.9     1.8    24.6
15.3     7.2    37.0
 2.2     7.2    21.2
 6.8     6.5    23.6
34.2    13.3    23.0
 8.3    25.4    15.3
12.0            57.1
 7.0            14.5
 9.5            26.0
 1.1

Preliminary data analysis (not shown) suggests that the distributions of miles driven have roughly the same shape for cars, buses, and trucks but that those distributions are far from normal. Thus the appropriate test is the Kruskal-Wallis procedure. At the 5% significance level, do the data provide sufficient evidence to conclude that a difference exists in last year's mean number of miles driven among cars, buses, and trucks?

Solution The Kruskal-Wallis test is performed through the Tests for Several Independent Samples dialog box. Type the data into two variables named MILES and VEHICLE in a new data file. MILES should contain all 25 data values in the three samples. VEHICLE should take on the values 1, 2, and 3, associated with the value labels Cars, Buses, and Trucks, respectively.

Step 1: State the null and alternative hypotheses.

Let μ1, μ2, and μ3 denote last year's mean number of miles driven for cars, buses, and trucks, respectively. The null and alternative hypotheses are:

H0: μ1 = μ2 = μ3 (mean miles driven are all equal)
Ha: Not all the means are equal.

Step 2: Decide on the significance level, α.

The test is to be performed at the 5% significance level. Thus α = 0.05.





Step 3: Compute the value of the test statistic.

1. Test the hypotheses by choosing Analyze > Nonparametric Tests > Legacy Dialogs > K Independent Samples to open the Tests for Several Independent Samples dialog box (Figure 16 - 6).

2. Paste the variable MILES into the Test Variable List box and the variable VEHICLE into the Grouping Variable box.

Figure 16 - 6
Tests for Several Independent Samples dialog box

Next, we need to specify the minimum and maximum integer values for the grouping variable. The minimum value must be less than the maximum value. Cases associated with values outside the bounds are excluded during the analysis. This option is supplied so that the Kruskal-Wallis procedure can be performed on a subset of the samples.

3. Click the Define Range button to open the Several Independent Samples: Define Range dialog box (Figure 16 - 7).




Figure 16 - 7
Several Independent Samples: Define Range dialog box

We require all the cases to be analyzed, so enter 1, the minimum value in VEHICLE, into the Minimum box and 3, the maximum value in VEHICLE, into the Maximum box.

4. Click the Continue button to close the dialog box and update the grouping variable information in the Tests for Several Independent Samples dialog box.

5. Click the OK button to display the results in the Viewer window.

The Ranks table (Figure 16 - 8) displays the mean rank for each of the three samples. If the population means are equal, we would expect the mean ranks to be approximately equal.

Figure 16 - 8
Ranks table from Kruskal-Wallis procedure

The Test Statistics table (Figure 16 - 9) gives the chi-square test statistic, the degrees of freedom associated with the test statistic, and the p-value of the hypothesis test.

Figure 16 - 9
Test Statistics table from Kruskal-Wallis procedure





The test statistic is H = 9.93, which has a χ²-distribution with 2 degrees of freedom.

Step 4: Obtain the p-value.

The test statistic has an associated p-value = 0.007, which is given in the row titled Asymp. Sig.

Step 5: If P < α, reject H0; otherwise, do not reject H0.

The p-value is less than the specified significance level of 0.05; therefore, we reject the null hypothesis.

Step 6: Interpret the results of the hypothesis test.

At the 5% significance level, the data provide sufficient evidence to conclude that at least one of the means is not equal to the others.
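The rank-based calculation behind these results can be sketched in a few lines of Python. This is an illustration of the standard Kruskal-Wallis formula with the usual correction for ties, which is assumed here to be what SPSS reports; the function name kruskal_wallis is hypothetical. The p-value uses the fact that, for exactly 2 degrees of freedom, the area to the right of H under the chi-square curve equals exp(-H/2). That shortcut does not apply for other degrees of freedom.

```python
import math

# Miles driven (1000s) from Table 16 - 2
samples = {
    "Cars":   [19.9, 15.3, 2.2, 6.8, 34.2, 8.3, 12.0, 7.0, 9.5, 1.1],
    "Buses":  [1.8, 7.2, 7.2, 6.5, 13.3, 25.4],
    "Trucks": [24.6, 37.0, 21.2, 23.6, 23.0, 15.3, 57.1, 14.5, 26.0],
}

def kruskal_wallis(groups):
    """Return (H corrected for ties, list of mean ranks)."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the average of ranks i+1 through j
        rank_of[pooled[i]] = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sums = [sum(rank_of[x] for x in g) for g in groups]
    h = (12 / (n * (n + 1))
         * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
         - 3 * (n + 1))
    h /= 1 - tie_term / (n ** 3 - n)  # correction for ties
    mean_ranks = [r / len(g) for r, g in zip(rank_sums, groups)]
    return h, mean_ranks

h, mean_ranks = kruskal_wallis(list(samples.values()))
p_value = math.exp(-h / 2)  # chi-square right-tail area; valid only for 2 df
print(f"H = {h:.2f}, p = {p_value:.3f}")  # H = 9.93, p = 0.007
```

The list mean_ranks holds the mean rank for cars, buses, and trucks in that order, which corresponds to the quantities displayed in the Ranks table.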

16.6 Problems

Problem 16.8 For the F-curve with df = (12, 5), find

a. F0.05
b. F0.01
c. F0.025

Problem 16.10 For the F-curve with df = (6, 10), find

a. F0.05
b. F0.01
c. F0.025




Problem 16.48 Movie fans use the annual Leonard Maltin Movie Guide for facts, cast members, and reviews of over 21,000 films. The movies are rated from 4 stars (4*), indicating a very good movie, to 1 star (1*), which Leonard Maltin refers to as a BOMB. Table 16 - 3 gives the running times, in minutes, of a random sample of films listed in one year's guide. At the 1% significance level, do the data provide sufficient evidence to conclude that a difference exists in mean running times among the four rating groups?

Table 16 - 3
Running times, in minutes

1* or 1.5*   2* or 2.5*   3* or 3.5*    4*
    75           97          101       101
    95           70           89       135
    84          105           97        93
    86          119          103       117
    58           87           86       126
    85           95          100       119

Problem 16.49 Copepods are tiny crustaceans that are an essential link in the estuarine food web. Marine scientists G. Weiss, G. McManus, and H. Harvey at the Chesapeake Biological Laboratory in Maryland designed an experiment to determine whether dietary lipid (fat) content is important in the population growth of a Chesapeake Bay copepod. Their findings were published as the paper "Development and Lipid Composition of the Harpacticoid Copepod Nitocra Spinipes Reared on Different Diets" (Marine Ecology Progress Series, vol. 132, pp. 57-61). Independent random samples of copepods were placed in containers containing lipid-rich diatoms, bacteria, or leafy macroalgae. There were 12 containers total, four replicates per diet. Five gravid (egg-bearing) females were placed in each container. Table 16 - 4 shows the number of copepods in each container after 14 days.

Table 16 - 4
Number of copepods

Diatoms   Bacteria   Macroalgae
  426       303         277
  467       301         324
  438       293         302
  497       328         272

a. Obtain the one-way ANOVA table for the data.

b. Verify the one-way ANOVA identity.

c. At the 5% significance level, do the data provide sufficient evidence to conclude that a difference exists in the mean number of copepods among the three different diets?





Problem 16.95 Refer to Problem 16.49. Apply the Tukey multiple comparison method to the data in Table 16 - 3. Use a family confidence level of 95%.

Problem 16.129 Indications are that Americans have become more aware of the dangers of excessive fat intake in their diets, although some reversal of this awareness appears to have developed in recent years. The U.S. Department of Agriculture publishes data on annual consumption of selected beverages in Food Consumption, Prices, and Expenditures. Independent random samples of lowfat-milk consumptions, measured in gallons, for 1980, 1995, and 2005 are given in Table 16 - 5.

Table 16 - 5
Lowfat milk consumptions, in gallons, for 1980, 1995, and 2005

1980 1995 2005
11.1 15.5 11.2
10.7 16.0 12.7
8.6 16.1 17.4
9.4 14.7 17.1
9.2 11.5 13.4
15.1 17.1 11.4
11.6 16.2 13.9
8.3 14.6
15.2

At the 1% level of significance, do the data provide sufficient evidence to

conclude that there is a difference in mean (per capita) consumption of lowfat

milk for the years 1980, 1995, and 2005? Use the Kruskal-Wallis Test.
