8/10/2019 16 Analysis of Variance (ANOVA).pdf
Copyright 2012 Pearson Education, Inc. Publishing as Addison-Wesley.
CHAPTER 16
Analysis of Variance (ANOVA)
GENERAL OBJECTIVE
In Chapter 10, we studied inferential methods for comparing the means of two
populations. Now we will study analysis of variance, or ANOVA, which
provides methods for comparing two or more population means. You should
be familiar with the chapter that discusses analysis of variance in your
textbook before beginning this chapter.
LESSON OUTLINE
16.1 The F-distribution
16.2 One-Way ANOVA: The Logic
16.3 One-Way ANOVA: The Procedure
16.4 Multiple Comparisons*
16.5 The Kruskal-Wallis Test*
16.6 Problems
16.1 The F-distribution
Analysis of variance procedures rely on a distribution called the
F-distribution, named in honor of Sir Ronald Fisher (1890-1962). A variable
is said to have an F-distribution if its distribution has a special type of
right-skewed curve, called an F-curve. There are infinitely many F-distributions,
which we identify by stating two associated degrees of freedom: a degrees of
freedom for the numerator and a degrees of freedom for the denominator. We
will now study how SPSS can be used to find an F-value from this distribution.
Finding the F-Value Having a Specified Area to Its Right
Example 16.1 For an F-curve with degrees of freedom df = (4, 12), find F0.05; that is, find
the F-value having area 0.05 to its right for an F-distribution with 4 degrees of
freedom in the numerator and 12 degrees of freedom in the denominator.
Solution The SPSS function IDF.F(prob, df1, df2) returns the value from the
F-distribution with the specified degrees of freedom, df = (df1, df2), for
which the area to the left is prob. As when computing a t-score, we will use
the Compute Variable dialog box.
The F-value having area 0.05 to its right has area 0.95 to its left, since the
total area under the probability curve is one. In the Numeric Expression box,
type IDF.F(0.95, 4, 12). SPSS returns the F-value that has area 0.05 to its
right as F = 3.26.
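For readers who want to check the value outside SPSS, the following is a minimal Python sketch of the same lookup. It is not how SPSS computes IDF.F; the helper names beta_cdf and idf_f are hypothetical. The sketch assumes both degrees of freedom are greater than 2, uses the standard identity P(F <= x) = I_z(df1/2, df2/2) with z = df1*x / (df1*x + df2), integrates the beta density by Simpson's rule, and solves for the quantile by bisection.

```python
import math


def beta_cdf(z, a, b, n=2000):
    """Regularized incomplete beta function I_z(a, b) by composite Simpson's rule.

    ASSUMPTION: a > 1 and b > 1 (both degrees of freedom above 2), so the
    beta density is zero at t = 0 and needs no special handling there.
    """
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def density(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

    h = z / n  # n must be even for Simpson's rule
    total = density(0.0) + density(z)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3


def idf_f(prob, df1, df2):
    """F-value with area `prob` to its left (the role IDF.F plays in SPSS)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection on the transformed variable z
        mid = (lo + hi) / 2
        if beta_cdf(mid, df1 / 2, df2 / 2) < prob:
            lo = mid
        else:
            hi = mid
    z = (lo + hi) / 2
    return df2 * z / (df1 * (1 - z))  # invert z = df1*x / (df1*x + df2)


# Area 0.05 to the right means area 0.95 to the left.
f_value = idf_f(0.95, 4, 12)
print(round(f_value, 2))
```

The printed value should agree with the F = 3.26 reported above. The same call with other arguments, for example idf_f(0.99, 6, 10), gives the F-value with area 0.01 to its right for df = (6, 10).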
16.2 One-Way ANOVA: The Logic
Analysis of Variance (ANOVA) provides methods for comparing several
population means, that is, the means of a single variable from several
populations. In this chapter, we study one-way analysis of variance. This
type of ANOVA is called one-way analysis of variance because it compares
the means of a variable for populations that result from a classification by one
variable, called the factor. The possible values of the factor are referred to as
the levels of the factor.
One-way ANOVA is the generalization to more than two populations of the
pooled t-procedure. As in the pooled t-procedure, we make the following
assumptions.
Assumptions (Conditions) for One-Way ANOVA
1. Simple random samples: The samples taken from the populations under
consideration are simple random samples.
2. Independent samples: The samples taken from the populations under
consideration are independent of one another.
3. Normal populations: For each population, the variable under
consideration is normally distributed.
4. Equal standard deviations: The standard deviations of the variable
under consideration are the same for all the populations.
16.3 One-Way ANOVA: The Procedure
The One-Way ANOVA Test
Example 16.3 Energy Consumption: The U.S. Energy Information Administration gathers
data on residential energy consumption and expenditures and publishes its
findings in Residential Energy Consumption Survey: Consumption and
Expenditures. Table 16 - 1 shows last year's energy consumptions for four
independent random samples of households in the four U.S. regions.
Table 16 - 1
Energy consumption for samples of households in four U.S. regions
Northeast  Midwest  South  West
15         17       11     10
10         12        7     12
13         18        9      8
14         13       13      7
13         15               9
           12
At the 5% level of significance, do the data provide sufficient evidence to
conclude that a difference exists in mean annual energy consumption by
households in the four U.S. regions?
Solution Type the data into two variables named ENERGY and REGION. ENERGY
should contain all 20 data values in the four samples. REGION should take
on the four values 1, 2, 3, and 4, which associate each case with a region. The
values of REGION, 1, 2, 3, and 4, should be associated with the value labels
Northeast, Midwest, South, and West, respectively.
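The stacked layout described here (one column of measurements, one column of group codes) can be illustrated with a short Python sketch. This is only an illustration of the data arrangement, not SPSS syntax; the dictionary names samples and value_labels are hypothetical, and the values are those of Table 16 - 1.

```python
# Energy consumption samples from Table 16 - 1, keyed by region.
samples = {
    "Northeast": [15, 10, 13, 14, 13],
    "Midwest": [17, 12, 18, 13, 15, 12],
    "South": [11, 7, 9, 13],
    "West": [10, 12, 8, 7, 9],
}

# Value labels: the numeric codes that REGION takes in the data file.
value_labels = {1: "Northeast", 2: "Midwest", 3: "South", 4: "West"}

# Build the two stacked columns, one entry per household (case).
ENERGY = []
REGION = []
for code, label in value_labels.items():
    for value in samples[label]:
        ENERGY.append(value)
        REGION.append(code)

print(len(ENERGY), len(REGION))        # both columns hold one entry per case
print(list(zip(REGION, ENERGY))[:5])   # the first five (REGION, ENERGY) cases
```

Each case carries its region code alongside its measurement, which is what allows a single Factor variable to split ENERGY into the four samples.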
Step 1: State the null and alternative hypotheses.
Let μ1, μ2, μ3, and μ4 denote last year's mean energy consumptions for
households in the Northeast, Midwest, South, and West, respectively. The
null and alternative hypotheses are:
H0: μ1 = μ2 = μ3 = μ4 (mean consumptions are all equal)
Ha: Not all the mean consumptions are equal
Step 2: Decide on the significance level, α.
The test is to be performed at the 5% significance level. Thus
α = 0.05.
Step 3: Compute the value of the test statistic.
1. Test the hypotheses by choosing Analyze > Compare Means > One-Way ANOVA
to open the One-Way ANOVA dialog box (Figure 16 - 1).
Figure 16 - 1
One-Way ANOVA dialog box
2. Paste the variable ENERGY into the Dependent List box and the
variable REGION into the Factor box.
3. Click the OK button to display the results of the one-way ANOVA
in the Viewer window.
The ANOVA table (Figure 16 - 2) shows several statistics used in analysis of
variance.
Figure 16 - 2
ANOVA table from One-Way ANOVA procedure
The test statistic is F = 6.318. It has an F-distribution with df = (3, 16).
Step 4: Obtain the p-value.
The test statistic has an associated p-value = 0.005, which is given under the
column titled Sig.
Step 5: If P < α, reject H0; otherwise, do not reject H0.
The p-value is less than the specified significance level of 0.05; therefore, we
reject the null hypothesis.
Step 6: Interpret the results of the hypothesis test.
At the 5% significance level, the data provide sufficient evidence to conclude
that a difference exists in last year's mean energy consumption by households
among the four U.S. regions. That is, at least two of the regions have different
mean energy consumptions.
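The p-value in Step 4 and the decision in Step 5 can be checked with a minimal Python sketch. It is not the SPSS algorithm; the function name f_upper_tail is hypothetical. It assumes both degrees of freedom exceed 2 and uses the identity P(F <= x) = I_z(df1/2, df2/2) with z = df1*x / (df1*x + df2), evaluated by Simpson's rule.

```python
import math


def f_upper_tail(f_stat, df1, df2, n=4000):
    """Area to the right of f_stat under the F-curve with df = (df1, df2).

    ASSUMPTION: df1 > 2 and df2 > 2, so the beta density is zero at t = 0.
    """
    a, b = df1 / 2, df2 / 2
    z = df1 * f_stat / (df1 * f_stat + df2)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def density(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

    h = z / n  # n must be even for Simpson's rule
    total = density(0.0) + density(z)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * density(i * h)
    return 1.0 - total * h / 3  # upper tail = 1 - area to the left


alpha = 0.05
p_value = f_upper_tail(6.318, 3, 16)  # F and df taken from the ANOVA table
reject_h0 = p_value < alpha           # Step 5 decision rule

print(round(p_value, 3), reject_h0)
```

The rounded p-value should agree with the 0.005 shown under Sig., and reject_h0 is True because the p-value is below 0.05.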
The ANOVA Table
The layout of the ANOVA table in SPSS is similar to the layout in the chapter
with the following exceptions. SPSS denotes Treatment by Between Groups
and Error by Within Groups. This is because SSTR can be thought of as the
error between the sample means and SSE can be thought of as the error within
the samples. The values of SSTR = 97.5, SSE = 82.3, and SST = 179.8 can
be read from the second column in the ANOVA table (Figure 16 - 2).
The one-way ANOVA identity,
SST = SSTR + SSE = 97.5 + 82.3 = 179.8,
shows that the total variation among all the sample data can be partitioned into
a component representing variation among the sample means and a
component representing variation within samples. The associated degrees of
freedom and mean squares are also reported in the ANOVA table.
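The entries of the ANOVA table can be reproduced by hand from the data in Table 16 - 1. The Python sketch below is one way to do so, using the usual definitions of SSTR, SSE, and SST; the variable names are hypothetical and the layout of the printed rows only imitates the SPSS table.

```python
# Energy consumption samples from Table 16 - 1.
samples = {
    "Northeast": [15, 10, 13, 14, 13],
    "Midwest": [17, 12, 18, 13, 15, 12],
    "South": [11, 7, 9, 13],
    "West": [10, 12, 8, 7, 9],
}

n = sum(len(v) for v in samples.values())  # total number of observations
k = len(samples)                           # number of populations
grand_mean = sum(sum(v) for v in samples.values()) / n

# SSTR: variation among the sample means (SPSS "Between Groups").
sstr = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in samples.values())
# SSE: variation within the samples (SPSS "Within Groups").
sse = sum((x - sum(v) / len(v)) ** 2 for v in samples.values() for x in v)
# SST: total variation, computed directly about the grand mean.
sst = sum((x - grand_mean) ** 2 for v in samples.values() for x in v)

mstr = sstr / (k - 1)  # mean square for treatments
mse = sse / (n - k)    # mean square for error
f_stat = mstr / mse

print(f"Between Groups  SS={sstr:.1f}  df={k - 1}  MS={mstr:.3f}  F={f_stat:.3f}")
print(f"Within Groups   SS={sse:.1f}  df={n - k}  MS={mse:.3f}")
print(f"Total           SS={sst:.1f}  df={n - 1}")
```

Because SST is computed separately rather than as a sum, comparing it with sstr + sse is a direct check of the one-way ANOVA identity.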
16.4 Multiple Comparisons*
When the null hypothesis is rejected in a one-way ANOVA, the conclusion is
that the means are not all equal. Once you make that decision, you may also
want to know which means are different, which is the largest, or, more
generally, the relation among all the means. Methods for solving
such problems are called multiple comparisons.
SPSS provides several multiple comparison methods, including the Tukey
multiple comparison method. In multiple comparisons, it is important to
distinguish between the individual confidence level and the family confidence
level. The individual confidence level is the confidence that any particular
confidence interval contains the true difference between the corresponding
population means; the family confidence level is the confidence that all the
confidence intervals simultaneously contain their respective true differences.
The Tukey multiple comparison method is based on the studentized range
distribution. The Tukey multiple comparison method for obtaining
confidence intervals for the differences between means is similar to the pooled
t-interval formula. The essential difference is that, in the Tukey multiple
comparison method, the percentile of a studentized range distribution is used
instead of the percentile of a t-distribution. The effect of this is that the
(1-α)-level confidence intervals constructed by the Tukey multiple
comparison method have a family confidence level of 1-α. By contrast, each of the
(1-α)-level confidence intervals constructed by the pooled t-interval formula
has an individual confidence level of 1-α, so the family confidence level for this set of
confidence intervals is smaller than 1-α.
The Tukey Multiple-Comparison Method
Example 16.6 Energy Consumption: Apply the Tukey multiple comparison method to the
energy consumption data in Table 16 - 1. Use a family confidence level of
95%.
Solution To perform Tukey multiple comparisons in SPSS:
1. Click the Post Hoc... button in the One-Way ANOVA dialog box
(Figure 16 - 1) to open the One-Way ANOVA: Post Hoc
Multiple Comparisons dialog box (Figure 16 - 3).
Figure 16 - 3
One-Way ANOVA: Post Hoc Multiple Comparisons dialog box
2. Choose the checkbox for Tukey.
3. A 95% family confidence level corresponds to a 5% significance level.
Therefore, enter 0.05 into the Significance level box.
4. Click the Continue button to close the dialog box and then click the OK
button to display the results in the Viewer window.
The Multiple Comparisons table (Figure 16 - 4) shows 95% confidence
intervals for the differences using the Tukey multiple comparison method.
Figure 16 - 4
Multiple Comparisons table for Tukey multiple comparison method
For example, the confidence interval for the mean difference between the
Northeast and Midwest regions is -5.429 to 2.429. Two population means are
significantly different if the confidence interval for their difference does not
include 0. This is true for the Midwest and South regions, for example. SPSS
provides another table, the Homogeneous Subsets table (Figure 16 - 5), to help
decipher which population means are different and which are not.
Figure 16 - 5
Homogeneous Subsets table from Tukey multiple comparison procedure
Means that are lined up together in a column under Subset for alpha = 0.05
are judged equal by the Tukey multiple comparison method. Means that
appear only in separate columns are judged not equal. That is, the data do not
provide sufficient evidence of a difference among the population means for the
West, South, and Northeast regions, nor between the population means for the
Northeast and Midwest regions. Further, since West and Midwest are in
different columns, there is sufficient evidence that their population means are
not equal. These results have a 95% family confidence level.
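As a check on the Northeast-Midwest interval quoted above, here is a hedged Python sketch of the Tukey interval in its usual form, (mean_i - mean_j) ± (q/√2) · s · √(1/n_i + 1/n_j), where s is the square root of MSE. The critical value q_crit = 4.05 is an assumption read from a printed studentized range table for 4 groups and 16 error degrees of freedom; SPSS uses a more precise value, so the endpoints agree only to about two decimal places.

```python
import math

# Summary statistics from Table 16 - 1 and the ANOVA table.
mean_ne, n_ne = 13.0, 5   # Northeast sample mean and size
mean_mw, n_mw = 14.5, 6   # Midwest sample mean and size
mse = 82.3 / 16           # SSE / (n - k)

# ASSUMPTION: q_0.05 for 4 groups and 16 error df, from a printed table.
q_crit = 4.05

s_pooled = math.sqrt(mse)
margin = (q_crit / math.sqrt(2)) * s_pooled * math.sqrt(1 / n_ne + 1 / n_mw)
diff = mean_ne - mean_mw
lower, upper = diff - margin, diff + margin

# The interval contains 0, so these two means are not judged different.
contains_zero = lower < 0 < upper
print(round(lower, 2), round(upper, 2), contains_zero)
```

Replacing the two means and sample sizes with those of any other pair of regions gives the corresponding row of the Multiple Comparisons table, up to the rounding of q_crit.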
16.5 The Kruskal-Wallis Test*
The Kruskal-Wallis test is a nonparametric alternative to the one-way
ANOVA procedure. The Kruskal-Wallis test examines whether several independent
samples are from the same population. The Kruskal-Wallis test applies when
the distributions (one for each population) of the variable under consideration
have the same shape, but does not require that they be normal or have any
other specific shape. Like the Mann-Whitney test, the Kruskal-Wallis test is
based on ranks.
The Kruskal-Wallis Test
Example 16.8 Vehicle Miles: The U.S. Federal Highway Administration conducts annual
surveys on motor vehicle travel by type of vehicle and publishes its findings
in Highway Statistics. Independent simple random samples of cars, buses, and
trucks were chosen, and the data on number of miles driven, in thousands, by
each sampled vehicle last year are shown in Table 16 - 2.
Table 16 - 2
Number of miles driven (1000s) last year for independent samples of cars, buses, and trucks
Cars   Buses  Trucks
19.9    1.8   24.6
15.3    7.2   37.0
 2.2    7.2   21.2
 6.8    6.5   23.6
34.2   13.3   23.0
 8.3   25.4   15.3
12.0          57.1
 7.0          14.5
 9.5          26.0
 1.1
Preliminary data analysis (not shown) suggests that the distributions of miles
driven have roughly the same shape for cars, buses, and trucks but that those
distributions are far from normal. Thus the appropriate test is the Kruskal-Wallis
procedure. At the 5% significance level, do the data provide sufficient
evidence to conclude that a difference exists in last year's mean number of
miles driven among cars, buses, and trucks?
Solution The Kruskal-Wallis test is performed through the Tests for Several Independent
Samples dialog box. Type the data into two variables named MILES and
VEHICLE in a new data file. MILES should contain all 25 data values in
the three samples. VEHICLE should take on the values 1, 2, and 3,
associated with the value labels Cars, Buses, and Trucks, respectively.
Step 1: State the null and alternative hypotheses.
Let μ1, μ2, and μ3 denote last year's mean number of miles driven for cars,
buses, and trucks, respectively. The null and alternative hypotheses are:
H0: μ1 = μ2 = μ3 (mean miles driven are all equal)
Ha: Not all the means are equal
Step 2: Decide on the significance level, α.
The test is to be performed at the 5% significance level. Thus α = 0.05.
Step 3: Compute the value of the test statistic.
1. Test the hypotheses by choosing Analyze > Nonparametric Tests >
Legacy Dialogs > K Independent Samples to open the Tests for
Several Independent Samples dialog box (Figure 16 - 6).
2. Paste the variable MILES into the Test Variable List box and the
variable VEHICLE into the Grouping Variable box.
Figure 16 - 6
Tests for Several Independent Samples dialog box
Next, we need to specify the minimum and maximum integer values for the
grouping variable. The minimum value must be less than the maximum value.
Cases associated with values outside the bounds are excluded during the
analysis. This option is supplied so that the Kruskal-Wallis procedure can be
performed on a subset of the samples.
3. Click the Define Range button to open the Several Independent
Samples: Define Range dialog box (Figure 16 - 7).
Figure 16 - 7
Several Independent Samples: Define Range dialog box
We require all the cases to be analyzed; consequently, enter 1, the minimum
value in VEHICLE, into the Minimum box and 3, the maximum value in
VEHICLE, into the Maximum box.
4. Click the Continue button to close the dialog box and update the grouping
variable information in the Tests for Several Independent Samples dialog box.
5. Click the OK button to display the results in the Viewer window.
The Ranks table (Figure 16 - 8) displays the mean ranks for each of the three
samples. If the population means are equal, we would expect the mean ranks to be
approximately equal.
Figure 16 - 8
Ranks table from Kruskal-Wallis procedure
The Test Statistics table (Figure 16 - 9) gives the chi-square test statistic,
the degrees of freedom associated with the test statistic, and the p-value of the
hypothesis test.
Figure 16 - 9
Test Statistics table from Kruskal-Wallis procedure
The test statistic is H = 9.93, which has approximately a chi-square distribution
with 2 degrees of freedom.
Step 4: Obtain the p-value.
The test statistic has an associated p-value = 0.007, which is given in the row
titled Asymp. Sig.
Step 5: If P < α, reject H0; otherwise, do not reject H0.
The p-value is less than the specified significance level of 0.05; therefore, we
reject the null hypothesis.
Step 6: Interpret the results of the hypothesis test.
At the 5% significance level, the data provide sufficient evidence to conclude
that at least one of the means is not equal to the others.
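The mean ranks, the test statistic H, and the p-value can be reproduced from Table 16 - 2 with a short Python sketch. It follows the usual Kruskal-Wallis formula with the standard correction for ties, which is an assumption about how the reported figures were obtained (the corrected value does agree with the reported H = 9.93). The p-value line uses the fact that the chi-square upper-tail area with 2 degrees of freedom is exp(-H/2), so it applies only to three-sample problems like this one.

```python
import math

# Miles driven (1000s) from Table 16 - 2.
samples = {
    "Cars": [19.9, 15.3, 2.2, 6.8, 34.2, 8.3, 12.0, 7.0, 9.5, 1.1],
    "Buses": [1.8, 7.2, 7.2, 6.5, 13.3, 25.4],
    "Trucks": [24.6, 37.0, 21.2, 23.6, 23.0, 15.3, 57.1, 14.5, 26.0],
}

# Pool the observations and sort them, remembering each one's group.
pooled = sorted((x, g) for g, xs in samples.items() for x in xs)
n = len(pooled)

# Assign ranks; tied observations share the mean of the ranks they span.
rank_sums = {g: 0.0 for g in samples}
tie_term = 0  # sum of (t^3 - t) over each run of t tied values
i = 0
while i < n:
    j = i
    while j < n and pooled[j][0] == pooled[i][0]:
        j += 1
    t = j - i
    mean_rank = (i + 1 + j) / 2  # average of ranks i+1, ..., j
    for _, g in pooled[i:j]:
        rank_sums[g] += mean_rank
    tie_term += t ** 3 - t
    i = j

mean_ranks = {g: rank_sums[g] / len(samples[g]) for g in samples}

# Kruskal-Wallis statistic, then the correction for ties.
h = 12 / (n * (n + 1)) * sum(
    rank_sums[g] ** 2 / len(samples[g]) for g in samples
) - 3 * (n + 1)
h /= 1 - tie_term / (n ** 3 - n)

df = len(samples) - 1
p_value = math.exp(-h / 2)  # chi-square upper tail; valid only when df == 2

print({g: round(r, 2) for g, r in mean_ranks.items()})
print(round(h, 2), df, round(p_value, 3))
```

The mean rank for trucks is roughly twice that of cars and buses, which is the pattern the test statistic summarizes.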
16.6 Problems
Problem 16.8 For the F-curve with df = (12, 5), find
a. F0.05
b. F0.01
c. F0.025
Problem 16.10 For the F-curve with df = (6, 10), find
a. F0.05
b. F0.01
c. F0.025
Problem 16.48 Movie fans use the annual Leonard Maltin Movie Guide for facts, cast
members, and reviews of over 21,000 films. The movies are rated from 4
stars (4*), indicating a very good movie, to 1 star (1*), which Leonard Maltin
refers to as a BOMB. Table 16 - 3 gives the running times, in minutes, of a
random sample of films listed in one year's guide. At the 1% significance
level, do the data provide sufficient evidence to conclude that a difference
exists in mean running times among the four rating groups?
Table 16 - 3
Running times, in minutes
1* or 1.5*  2* or 2.5*  3* or 3.5*  4*
75          97          101         101
95          70           89         135
84          105          97          93
86          119         103         117
58          87           86         126
85          95          100         119
Problem 16.49 Copepods are tiny crustaceans that are an essential link in the estuarine food
web. Marine scientists G. Weiss, G. McManus, and H. Harvey at the
Chesapeake Biological Laboratory in Maryland designed an experiment to
determine whether dietary lipid (fat) content is important in the population
growth of a Chesapeake Bay copepod. Their findings were published as the
paper "Development and Lipid Composition of the Harpacticoid Copepod
Nitocra Spinipes Reared on Different Diets" (Marine Ecology Progress
Series, vol. 132, pp. 57-61). Independent random samples of copepods were
placed in containers containing lipid-rich diatoms, bacteria, or leafy
macroalgae. There were 12 containers total, four replicates per diet. Five
gravid (egg-bearing) females were placed in each container. Table 16 - 4
shows the number of copepods in each container after 14 days.
Table 16 - 4
Number of copepods
Diatoms  Bacteria  Macroalgae
426      303       277
467      301       324
438      293       302
497      328       272
a. Obtain the one-way ANOVA table for the data.
b. Verify the one-way ANOVA identity.
c. At the 5% significance level, do the data provide sufficient evidence to
conclude that a difference exists in the mean number of copepods among
the three different diets?
Problem 16.95 Refer to Problem 16.49. Apply the Tukey multiple comparison method to the
data in Table 16 - 4. Use a family confidence level of 95%.
Problem 16.129 Indications are that Americans have become more aware of the dangers of
excessive fat intake in their diets, although some reversal of this awareness
appears to have developed in recent years. The U.S. Department of
Agriculture publishes data on annual consumption of selected beverages in
Food Consumption, Prices, and Expenditures. Independent random samples
of lowfat-milk consumptions, measured in gallons, for 1980, 1995, and 2005
are given in Table 16 - 5.
Table 16 - 5
Lowfat milk consumptions, in gallons, for 1980, 1995, and 2005
1980 1995 2005
11.1 15.5 11.2
10.7 16.0 12.7
8.6 16.1 17.4
9.4 14.7 17.1
9.2 11.5 13.4
15.1 17.1 11.4
11.6 16.2 13.9
8.3 14.6
15.2
At the 1% level of significance, do the data provide sufficient evidence to
conclude that there is a difference in mean (per capita) consumption of lowfat
milk for the years 1980, 1995, and 2005? Use the Kruskal-Wallis test.