
Analysis of Variance 1

Dennis L. Jackson, Ph. D. September, 2006

1. Introduction: One-Way Completely Randomized ANOVA

The most frequently used method of analyzing data from experiments involving more than two groups is Analysis of Variance (ANOVA). Basic ANOVA allows the researcher to compare the means of several groups simultaneously to determine whether there is significant variation among those means. There are numerous adaptations and ancillary procedures for further exploring the nature of any significant differences. If you find yourself needing a deep understanding of ANOVA, there are several excellent textbooks; a few very good ones are listed in the reference section of this document and mentioned throughout.

The first model to be discussed is the basic One-Way Completely Randomized ANOVA. Completely randomized refers to the assumption that participants are randomly assigned to one and only one condition. One-way means there is only one independent variable (factor). Such a design yields one overall F-ratio, where F is the ratio of two variance estimates. This ratio, sometimes called the "omnibus" test, indicates whether there is significant variation among the treatment means. If F is significant, generally at least two means will differ significantly from one another, though this need not be the case (Winer, Brown, & Michels, 1991, provide an example where the difference involves a particular contrast or linear combination).

Basic Logic: Definitional Formulae

ANOVA proceeds by partitioning the total variance of the observations into variance associated with error (within-group variance) and variance associated with treatment effects (between-group variance). This is done by first calculating Sums of Squares (SS), which can be thought of as cumulative measures of variance. Starting with total variance, the definitional formula for arriving at this number is

$SS_{TOT} = \sum_{i=1}^{n} (X_i - \bar{G})^2$    (1.1)

where n is the total number of observations, $X_i$ is subject i's score, and $\bar{G}$ is the Grand Mean, the mean of all observations. In words, one subtracts the Grand Mean from each individual score, squares that value, and sums the resulting "squares" across all individuals. This is a measure of the total variance of all observations. Variation around the grand mean is due both to random error, including individual differences, and to treatment effects, due to the experimental design. This is the total variance of all scores from participants in the study, and we seek to partition, or divide, this variance into variance we can explain and variance we cannot explain.

To get the SS associated with individual differences, one uses a similar approach, but instead of taking each score's deviation from the Grand Mean, one takes its deviation from the mean of that score's particular treatment group.

$SS_W = \sum_j \sum_i (X_{ij} - \bar{T}_j)^2$    (1.2)


where $SS_W$ is the sum of squares within groups, $X_{ij}$ is subject i in treatment condition j, and $\bar{T}_j$ is the mean for treatment condition j. Hence, each score is "centered" around its treatment condition mean, so the deviations within each treatment condition sum to zero. Summing these squared deviations yields a cumulative measure of variance that is independent of treatment effects (i.e., of differences between the experimental conditions). This is the variance we cannot explain: because these subjects were all treated the same way (i.e., received the same treatment), we cannot readily explain why their scores differ.

The SS associated with treatment effects can be obtained in a similar fashion. This SS, which will be denoted $SS_M$ (M for methods), is an estimate of the variation associated with treatment effects.

$SS_M = n \sum_j (\bar{T}_j - \bar{G})^2$    (1.3)

Again, $SS_M$ stands for the Sums of Squares due to Method (or treatment), $\bar{G}$ stands for the Grand Mean, and $\bar{T}_j$ is the mean for treatment method j. In words, this formula indicates that $SS_M$ is calculated by taking the difference between each treatment condition mean and the grand mean, squaring that difference, and weighting it by the number of observations within that group. This is variance that we can explain, since it is variance due to our treatment effect.

At this point, two things are hopefully clear. One is that these terms sum, so that $SS_{TOT} = SS_W + SS_M$. The second is that the variance represented by $SS_{TOT}$ has been partitioned into two parts. The first, $SS_W$, is variation due to error. This is variation among observations where everyone was treated the same (i.e., was in the same experimental condition), arrived at by pooling across the various experimental conditions. $SS_M$ is a cumulative variance measure that captures the variation due to treatment effects. It ignores variation due to error (i.e., variation within conditions) and captures variation among the means of the treatment conditions, which are our best estimates of any treatment effect.¹

Briefly, there is a structural model assumed to underlie data obtained from an experiment. The model is

$X_{ij} = \mu + \tau_j + \varepsilon_{ij}$    (1.4)

In English, it is assumed that a person's score (person i in treatment condition j) is composed of three parts. The first part, µ, is an overall parameter that can be thought of as the population mean of all treatments, or the average treatment condition effect.

¹ Technically, this term does contain error. When the null hypothesis is true (i.e., all treatment means are equal, so no treatment effect exists), there will still be error due to sampling. Hence, this term actually contains treatment effects plus error.


The second term, $\tau_j$, is the effect of the treatment condition to which this particular individual was assigned. The final term, $\varepsilon_{ij}$, is the experimental error associated with this person. Experimental error is assumed to be independent of the treatment effect (through random assignment), normally distributed, and to average out to zero across individuals. It can be due to personal factors, such as individual differences, or to random experimental factors, such as the fact that this person happened to participate right after lunch, etc.

These terms map back to our Sums of Squares discussion. The score $X_{ij}$ represents total variation ($SS_{TOT}$); the second term, µ, is estimated by the Grand Mean; the third term is captured in $SS_M$ and is estimated by the treatment mean for that particular condition; and the error term, $\varepsilon_{ij}$, is captured by $SS_W$.

Computational Formulae

Definitional formulas were used above because they afford a better opportunity to comprehend the process of partitioning variance. However, they are cumbersome to work with and more prone to error, since actually using them involves many intermediate steps and opportunities for rounding error. Therefore, computational formulas are presented below, and they should be used when conducting an ANOVA by hand. These formulas are presented without extensive discussion. At the end of this section, I present a computational example using something known as bracket terms. That procedure is easy to use for hand calculations, though not terribly informative: it makes use of the fact that many expressions in the computational formulas are repeated over and over, so one merely computes each term first and then arrives at the sums of squares through simple arithmetic. I highly recommend using the bracket-term method for hand calculations.

$SS_{TOT} = \sum X^2 - \frac{(\sum X)^2}{N}$    (1.5)

Where N is the total number of observations.

$SS_W = \sum\sum X_j^2 - \frac{\sum T_j^2}{n}$    (1.6)

where n is the number of participants within each condition.² The double summation implies that the squared values of $X_j$ are summed within each group j and then summed across groups. Likewise, $T_j^2$, the squared sum of scores for each condition, is summed across conditions.

$SS_M = \frac{\sum T_j^2}{n} - \frac{G^2}{nk}$    (1.7)

where, on the left, the total score for each condition j is squared, summed, and divided by the number of observations within each condition, and, on the right, the grand total G, which is the sum of all scores, is squared and divided by the number within each condition times the number of conditions (i.e., the total number of observations).

² Some of these formulas have to be altered to handle unbalanced designs, unbalanced meaning a design where each condition doesn't have the same number of observations.

Numerical Example

The following example, along with the formulas above, is primarily taken from Winer et al. (1991). Assume that one is testing the effectiveness of three teaching methods. The dependent variable (DV) is the score on an exam covering material taught using the three methods. Assume also that the 24 subjects were randomly assigned to one of the three treatment conditions and were randomly selected from the population of interest. This model is known as a fixed effects model. The name denotes the idea that all of the treatment levels of interest are represented in the experiment; put more realistically, one is not attempting to generalize back to all possible treatments (such models will be only briefly addressed in this class). Instead, the intended generalization here is with respect to these three methods and back to the population of interest. In other words, this experiment allows us to determine whether there is likely to be any difference among the effectiveness of these methods in the population of interest.

One-Way Analysis of Variance Example

Subject #   Method 1   Method 2   Method 3
    1           3          4          6
    2           5          4          7
    3           2          3          8
    4           4          8          6
    5           8          7          7
    6           4          4          9
    7           3          2         10
    8           9          5          9

            Method 1   Method 2   Method 3    Total
T              38         37         62        137       (n = 8, k = 3)
ΣX²           224        199        496        919
n               8          8          8         24
SS           43.50      27.88      15.50     136.96
Mean         4.75       4.63       7.75       5.71

Source Table

Source        SS       df      MS       F      p <
Method       50.08      2     25.04    6.05    .01
Within       86.88     21      4.14
Total       136.96     23      5.95

Example taken from Winer et al. (1991), page 75.

$SS_W = \sum SS_j$, or $\sum\sum X_j^2 - \frac{\sum T_j^2}{n}$

$SS_W = 224 + 199 + 496 - \frac{38^2 + 37^2 + 62^2}{8} = 919 - 832.125 = 86.875$


$SS_M = n\sum_j (\bar{T}_j - \bar{G})^2$, or $\frac{\sum T_j^2}{n} - \frac{G^2}{nk}$

$SS_M = \frac{38^2 + 37^2 + 62^2}{8} - \frac{137^2}{8(3)} = 832.125 - 782.04 = 50.08$

While not necessarily a requirement for solving our problem, we can go on to calculate SSTOT as follows (using equation 1.5).

$SS_{TOT} = 224 + 199 + 496 - \frac{137^2}{8(3)} = 919 - 782.04 = 136.96$

As a quick check, $SS_{TOT} = SS_W + SS_M$: 136.96 = 50.08 + 86.875, within rounding error.

As introduced above, when one solves an ANOVA problem, one is partitioning variance. Toward that end, a structural model was presented above (equation 1.4). Now we are in a position to make the conversation more concrete by looking at the score of subject 1 in condition 1. Using the above logic, that score is

3 = 5.71 − 0.96 − 1.75, where µ = 5.71, $\tau_j$ = −0.96, and $\varepsilon_{ij}$ = −1.75.

Graphically, picture a number line running from 1 to 10 with $X_{ij}$ = 3, $\bar{T}_1$ = 4.75, and $\bar{G}$ = 5.71 marked on it. The distance between $\bar{T}_1$ and $\bar{G}$ is the treatment effect for condition 1, and the distance between $X_{ij}$ and $\bar{T}_1$ is the experimental error for this subject. Finally, the distance between $X_{ij}$ and $\bar{G}$ represents this person's contribution to total variance ($SS_{TOT}$). Put yet another way, several factors affect person $X_{ij}$'s score.

• There's an overall parameter for the treatments.
• There's a treatment effect, which is assumed to be equal for everyone within a given treatment. The extent to which it appears not to be equal is attributed to…
• Experimental error, which is a collection of different types of error including measurement error, individual differences, etc. This error is assumed to be random.
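Before moving on, the partition and the F test for this example can be reproduced in a few lines of code. This is a minimal sketch assuming Python with NumPy and SciPy installed; the variable names are illustrative rather than anything used by Winer et al.

```python
# Sketch: reproduce the one-way ANOVA for the teaching-method data above.
import numpy as np
from scipy import stats

method1 = np.array([3, 5, 2, 4, 8, 4, 3, 9])
method2 = np.array([4, 4, 3, 8, 7, 4, 2, 5])
method3 = np.array([6, 7, 8, 6, 7, 9, 10, 9])
groups = [method1, method2, method3]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()                                      # 5.71
k, n = len(groups), len(method1)

ss_tot = ((all_scores - grand_mean) ** 2).sum()                     # 136.96  (eq. 1.1)
ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)             # 86.875  (eq. 1.2)
ss_m = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # 50.08   (eq. 1.3)

df_m, df_w = k - 1, (n - 1) * k                                     # 2 and 21
ms_m, ms_w = ss_m / df_m, ss_w / df_w                               # 25.04 and 4.14
F = ms_m / ms_w                                                     # 6.05
p = stats.f.sf(F, df_m, df_w)                                       # about .008, i.e. p < .01

# The same F and p come out of SciPy's built-in routine:
F_check, p_check = stats.f_oneway(method1, method2, method3)
```

The sums of squares, mean squares, and F match the source table above, and the exact p-value makes the table-lookup step described below unnecessary when software is available.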

Degrees of Freedom, Mean Squares and the F Ratio

Completing our example, we can now take the SS terms above and form variance estimates from them. They are cumulative measures of variance at this point; dividing them by their degrees of freedom (df) converts them into variances, which are average measures of variation. There are three degrees of freedom terms:

How to get df           Our example
df_tot = nk − 1         df_tot = 23   [(8)(3) − 1]
df_W = (n − 1)k         df_W = 21     [(8 − 1)(3)]
df_M = k − 1            df_M = 2      [3 − 1]

Mean Square terms are easily obtained as follows:

$MS_W = \frac{SS_W}{df_W} = \frac{86.88}{21} = 4.14$    (1.8)

$MS_M = \frac{SS_M}{df_M} = \frac{50.08}{2} = 25.04$    (1.9)

$MS_{TOT} = \frac{SS_{TOT}}{df_{TOT}} = \frac{136.96}{23} = 5.95$    (1.10)

In practice, the MSTOT isn’t reported. It is calculated here just to show that variance for all observations can be easily calculated. The F-ratio is the actual statistical test associated with ANOVA. It is the ratio of the variance due to methods (plus error) to the variance due to error, or

$F = \frac{MS_M}{MS_W}$    (1.11)

which is also why $MS_{TOT}$ is not ordinarily calculated; it isn't necessary for the analysis. The rationale for the F ratio goes something like this.

1. One would expect $MS_M$ to have some positive value. In other words, even if there is absolutely no difference in treatment effects in the population, by pure random error we would end up with some differences among the treatment means; those differences just wouldn't be particularly meaningful. Hence, $MS_M$ does contain some experimental error.
2. $MS_W$ is another (independent) estimate of experimental error.
3. If there is no treatment effect, the expected value of F is somewhere around 1.0.
4. If there is a treatment effect, F will be greater than 1.0, and whether it is significant can be evaluated based on whether it exceeds some critical value at a given level of probability.

I think it is useful to compare the F-ratio to the t-ratio. In the t-ratio for independent samples (below), the simple difference between the means of the two groups is divided by an estimate of the standard error of the difference between two means drawn from the same population.


$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$    (1.12)

As in F, the numerator reflects the effect of interest and the denominator is a measure of error. Further, there is a direct relationship between t and F when ANOVA is applied to a problem with only two treatment conditions: t² = F. Notice that we have been working in squared values for the ANOVA problem, whereas the t-statistic is expressed in the original metric, i.e., not squared.

Evaluating the F-Ratio

Evaluating the significance of F is straightforward. Almost any introductory statistics book contains a table of critical F values. One needs the calculated F value (6.05 in our example), the df for the numerator (2 in our example), and the df for the denominator (21 in our example). The degrees of freedom were calculated above; the numerator df pertains to the variance estimate for the method or intervention ($MS_M$), since that is our numerator, and the denominator df pertains to the variance estimate for experimental error ($MS_W$). The F-table I am looking at does not have a denominator df of 21, but it has 20 and 22. The critical F for 2 and 20 degrees of freedom is 3.49 (at p = .05), and for 2 and 22 it is 3.44. Interpolating, we can take the critical F for 2 and 21 to be 3.465 (the average of the two). Our obtained value of 6.05 exceeds 3.465, so our results are significant. This means that we reject, at the .05 level, the hypothesis that there is no difference among the treatment means. In other words, we would expect to see results like these less than 5 times in 100 if there were no difference among teaching methods in the population.

Experiment-wise Error and Mean Comparisons

Oftentimes one wishes to compare means across different treatment conditions; such comparisons are either planned ahead of time or arise after the data are collected and analyzed. The former are called planned (or a priori) comparisons and the latter are called unplanned, post-hoc, or a posteriori comparisons. Additionally, a researcher may not be particularly interested in the omnibus F-test to begin with, but may have a particular contrast in mind. For instance, the researcher might have had good reason to believe that treatment condition 3 would be superior (it might, say, include elements of both A1 and A2), so that they planned in advance to test the hypothesis that the mean for A3 is greater than the mean of A1 & A2.

Contrasts

Either using software, such as SPSS, or using formulas, contrasts representing specific hypotheses can be set up in advance. Contrasts are said to be orthogonal if they convey non-overlapping information about the differences between treatment means. If there are k levels in an experiment, there are k − 1 possible orthogonal comparisons; beyond that, the comparisons become linearly dependent upon one another. Let's say that, for our example, we wanted to specifically test the following contrast.


$C_1 = \frac{\bar{A}_1 + \bar{A}_2}{2} - \bar{A}_3$

This contrast could be set up by using coefficients, such as

$C_1 = 1\bar{A}_1 + 1\bar{A}_2 - 2\bar{A}_3$

This is an equivalent representation and one can use any coefficients provided they sum to zero (e.g., .5, .5, -1; or 20, 20, -40). A comparison which is orthogonal to C1 would be

$C_2 = 1\bar{A}_1 - 1\bar{A}_2 + 0\bar{A}_3$

or a test to see whether A1 and A2 differ significantly from one another. The zero coefficient implies that the mean for A3 does not take part in the C2 comparison. The check to see if two comparisons are orthogonal is to multiply the coefficients of comparison one by comparison two and sum them. If they sum to zero, they are orthogonal. In our case, (1x1)+(1x-1)+(-2x0)=0. The usefulness of the coefficients comes in constructing Sums of Squares for specific comparisons. The formula for this is as follows.3

$SS_C = \frac{(\sum_j c_j \bar{T}_j)^2}{\sum_j c_j^2 / n_j}$    (1.13)

Where the numerator is the sum of coefficients multiplied by the treatment means, squared. The denominator is the sum of the squared coefficients divided by the number of observations within each cell. The numerator df for a contrast is always one. Applying this to C1, we get the following.

$SS_{C_1} = \frac{[(1)(4.75) + (1)(4.62) - (2)(7.75)]^2}{(1^2 + 1^2 + 2^2)/8} = \frac{(-6.13)^2}{0.75} = 50.103$

And, applying it to C2 we get the following.

$SS_{C_2} = \frac{[(1)(4.75) - (1)(4.62) + (0)(7.75)]^2}{(1^2 + 1^2 + 0^2)/8} = \frac{(0.13)^2}{0.25} = 0.068$

Since the comparisons are orthogonal and complete (i.e., exhaustive), these Sums of Squares should total approximately $SS_M$; they may not do so exactly because the coefficients are not least-squares estimates. In this instance, they are very close (within rounding error). Furthermore, it can be seen immediately that most of the effect of the three treatments is due to the difference between A3 and the other two conditions.

3 This particular formula works for comparisons involving unequal n’s.
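Continuing the Python sketch from earlier (names illustrative, and using the rounded mean of 4.62 for Method 2 exactly as the text does), equation 1.13 and the contrast F described next can be computed directly:

```python
# Sketch of equation 1.13: sums of squares for the two contrasts above.
import numpy as np
from scipy import stats

means = np.array([4.75, 4.62, 7.75])   # Method 1, 2, 3
n = 8                                  # observations per condition
ms_w, df_w = 4.14, 21                  # error term from the one-way ANOVA

def contrast_ss(coefs, means, n):
    """(sum of c_j * mean_j)^2 divided by (sum of c_j^2 / n_j)."""
    coefs = np.asarray(coefs, dtype=float)
    return (coefs @ means) ** 2 / (coefs ** 2 / n).sum()

ss_c1 = contrast_ss([1, 1, -2], means, n)   # about 50.10
ss_c2 = contrast_ss([1, -1, 0], means, n)   # about 0.068

# Each contrast has one df, so MS = SS and F = SS / MS_W.
F_c1 = ss_c1 / ms_w                         # about 12.10
p_c1 = stats.f.sf(F_c1, 1, df_w)            # about 0.0022
```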


To finish this example off, one can form MS terms and an F-ratio. The MS terms will be the same as the SS terms above because each contrast has only one degree of freedom. The error term for F should be the error term calculated above, 4.14. Hence, the F-ratio for the first comparison is 50.103/4.14 = 12.102 with 1 and 21 degrees of freedom; the probability value is approximately 0.0022. It should be noted that when making multiple comparisons such as these, one may choose to adjust the alpha level in order not to inflate the Type I error rate (see the section below). Whether to do so is somewhat debatable, depending on the number of contrasts, whether they are a priori, the consequences of committing a Type I error, who is on your committee, and so on.

Howell (2002) presents a useful heuristic for finding orthogonal contrasts. The F-test is viewed as a contrast among all means and forms the trunk of a tree. The researcher then forms a contrast of interest, with one side forming a branch on the right and the other side forming a branch on the left. From that point on, all contrasts remain on the same side of the tree. For instance, if one has five groups, the tree might look as follows.

F = (1, 2, 3, 4, 5)

C1 = (1, 2) vs. (3, 4, 5)
C2 = (1) vs. (2)
C3 = (3) vs. (4, 5)
C4 = (4) vs. (5)

Hence, after the first contrast, groups 1 or 2 cannot be compared in any way to groups 3, 4, or 5. A set of contrasts is orthogonal if all pair-wise comparisons of the contrasts are orthogonal (using the aforementioned test of multiplying the coefficients together and summing them). The contrasts we have been considering are user-defined contrasts and can be run within SPSS using the SPECIAL subcommand. Some useful standard contrasts are built directly into SPSS. Below are descriptions of each.

• Helmert Contrasts: these contrasts involve testing each mean against the average of the remaining groups. If in our example, A1 represents the mean for a control group and both A2 and A3 represent the means for different treatments, then Helmert would contrast A1 against A2 & A3, then A2 against A3. This would be akin to answering the questions of 1) is any treatment more effective than no treatment? and 2) is there a significant difference between treatments 2 & 3?

• Difference Contrasts: also known as reverse Helmert contrasts; each mean is compared to the mean(s) of the preceding group(s).

• Repeated Contrasts: these are straightforward contrasts in which each cell mean is compared to the previous one. They are particularly useful in repeated measures designs, where one is interested in seeing whether measures taken over time differ significantly from the previous measure.

• Polynomial Contrasts: these contrasts involve attempts to fit a curve to the data. While they do not necessarily have to be based on repeated measures, that is often where they make the most sense. One simply requests that SPSS fit polynomial curves, and significance indicates that the curve being tested (i.e., powers of 2, 3, etc.) fits significantly better than the previous-order curve. The highest-order curve that is significant is the one that provides the best trend line for the data. Care must be exercised, as there comes a point at which the curve will fit perfectly (a curve of order k − 1, where k is the number of groups, will fit perfectly). These tests are also additional tests and thus increase the chances of Type I error. An example where these contrasts might be useful is in looking at skill acquisition, where subjects are tested on tasks over time as they learn a new skill; the polynomial equation can be useful for mathematically characterizing skill acquisition.

• Deviation Contrasts: a straightforward type of contrast in which each cell mean is compared to the grand mean.

• Simple Contrasts: each cell mean is compared to the last cell mean (or to some user-defined cell); commonly used in designs that include a control group.
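As a small illustration (a sketch, not SPSS output; with equal group sizes the orthogonality check reduces to the multiply-and-sum rule described above), two of these built-in schemes for three groups can be written as coefficient vectors and checked in a few lines:

```python
# Sketch: contrast coefficient vectors for k = 3 groups and the orthogonality check
# (multiply corresponding coefficients and sum; a total of zero means orthogonal).
import numpy as np

helmert = np.array([[1.0, -0.5, -0.5],    # A1 vs. the average of (A2, A3)
                    [0.0,  1.0, -1.0]])   # A2 vs. A3

repeated = np.array([[1.0, -1.0,  0.0],   # A1 vs. A2
                     [0.0,  1.0, -1.0]])  # A2 vs. A3

def orthogonal(c1, c2):
    return np.isclose(np.dot(c1, c2), 0.0)

print(orthogonal(helmert[0], helmert[1]))    # True:  Helmert contrasts are orthogonal
print(orthogonal(repeated[0], repeated[1]))  # False: repeated contrasts are not
```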

Experiment-wise Error

The probability of rejecting the null hypothesis when it is true is controlled directly by the experimenter and is denoted α (alpha). If α = .05, then we are saying that if we see results that are sufficiently unlikely under the assumption that the null hypothesis is true, where "sufficiently unlikely" means less than .05, we will reject the null hypothesis. The probability of a Type I error for the omnibus significance test will be .05 if the assumptions of ANOVA are met. Furthermore, if we approach our ANOVA using contrasts, as above, the Type I error rate for C1 will be .05, and the Type I error rate for C2 will also be .05. The problem is that, if the goal is to keep the overall probability of a Type I error at .05, we run into trouble by doing all the contrasts.

According to probability theory, if two events are independent, the probability that they will both occur is the product of their individual probabilities. So the probability of flipping a fair coin twice and getting two heads is .5 × .5 (or .5²) = .25. Applying this to our situation: if we test two contrasts (ignoring the overall omnibus test for now) which are independent,⁴ the probability of making the correct decision on the first one is .95 (1 − α) and the probability of making the correct decision on the second is also .95. However, the probability of being correct on both of them when the null hypothesis is true is .95² = .9025. This means the actual Type I error rate for both comparisons combined is 1 − .9025 = .0975; our alpha level has almost doubled. This necessitates adjusting the alpha level for multiple comparisons. The formula for accomplishing this is

$\alpha_{ind} = 1 - \sqrt[m]{1 - \alpha_{overall}}$    (1.14)

where $\alpha_{ind}$ is the alpha level needed for each individual comparison in order to maintain the overall alpha level (e.g., .05), here denoted $\alpha_{overall}$, and m is the number of comparisons to be made. The formula assumes all comparisons are independent (orthogonal). In our example above, one needs to use the .0253 level of significance for the two contrasts to stay within an experiment-wise error rate of .05.

⁴ They are not completely independent, since they rely on the same error term, but the fact that they are orthogonal makes them more nearly independent, especially if sample sizes are large.

The necessary alpha cut-off to maintain a given experiment-wise error rate can also be approximated by dividing the desired alpha by the number of comparisons. In fact, it can be shown that if one uses this approach, the overall alpha level will never exceed the desired cut-off. Thus, if we wish to make five comparisons and keep our overall α at .05, we would only interpret as significant those tests that yielded a statistic greater than the critical value for α = .05/5 = .01.

Mean Comparison Tests

Bonferroni t (Dunn's Test): this approach utilizes the usual t-test but uses an adjusted alpha based on the number of comparisons the researcher will conduct. Tables are available that provide the appropriate cut-offs of t for a given alpha (e.g., Howell, 2002), but this is largely unnecessary since most software packages can print exact probabilities, and applications such as Excel can compute them. In this approach, if the assumption of homogeneity of variance has not been violated, the usual t-test is used, substituting the pooled error estimate for the error term. This formula will work with unequal n.

$t' = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{\dfrac{MS_W}{n_i} + \dfrac{MS_W}{n_j}}}$, or, simplifying the denominator for equal n, $t' = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{\dfrac{2MS_W}{n}}}$
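The per-comparison alpha levels themselves are quick to compute. A minimal sketch (plain Python; the values simply restate equation 1.14 and the .05/m approximation from the text):

```python
# Sketch: per-comparison alpha needed to hold the experiment-wise rate at .05,
# by the exact adjustment of equation 1.14 and the simpler Bonferroni division.
alpha_overall = 0.05

for m in (2, 5):
    exact = 1 - (1 - alpha_overall) ** (1 / m)   # equation 1.14
    bonferroni = alpha_overall / m
    print(m, round(exact, 4), round(bonferroni, 4))
# m = 2: 0.0253 vs. 0.0250
# m = 5: 0.0102 vs. 0.0100
```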

Dunn-Šidák test: This variation on the above procedure uses a more precise correction for the inflation of alpha. The adjusted alpha is based on the formula α′ = 1 − (1 − α)^(1/c), where c is the number of comparisons. This test has slightly more power than the Dunn-Bonferroni test, though some might consider the power difference so slight as to be trivial. It should also be noted that these procedures can be combined with the logic of contrasts to form more complex comparisons; in other words, they can be used to compare any two means from an experiment, even means formed from more than one group, because a t-test is a special instance of a contrast. Additionally, once one knows the overall level of alpha one wishes to test at, one could conceivably divide the protection up among several comparisons. For instance, if we wish to keep alpha < .05 and we have five comparisons, we could test one at .02 and the other four at .03/4 = .0075. One might do this in order to have more power for some comparisons.

Holm Test: Here we use a similar approach, except that we order the treatment differences by magnitude, from largest to smallest. We test the first at the alpha rate protected for c comparisons (using Dunn's table). If it is significant, we test the next at the rate for c − 1 comparisons, again using Dunn's table. Put another way, we evaluate the largest difference at α/c, the next largest at α/(c − 1), the next at α/(c − 2), etc. We repeat this until we come across a nonsignificant comparison, at which point we stop. This approach is more powerful than those above, since the cut-off value shrinks with each successive test. The idea is that once one has protected the first comparison against all possible comparisons and found it to be significant, one only needs to protect the remaining comparisons at the cut-off for c − 1; if that one is significant, the remaining comparisons are protected at c − 2, and so on. This test has also been shown to control Type I error well. According to Howell (2002), it is especially useful when one knows a priori that a number of the null hypotheses will be false, and hence there is little reason to protect against Type I error when the researcher already knows (perhaps from previous research) that some of the null hypotheses are false.

The tests discussed thus far are best applied when the researcher has a limited, known number of comparisons to make, because these adjustments are more powerful in that situation. This contrasts with the following approaches, which are post hoc and adjust for all possible comparisons. With the above approaches, especially the Bonferroni-type approach, one can correct alpha for just the number of comparisons actually to be made, rather than for all possible comparisons.

Post-Hoc Comparisons

Scheffé: Another approach to comparing means is through post-hoc analyses, in which the significance tests are adjusted for the number of possible comparisons. There are numerous post-hoc significance tests. The first reported here is the Scheffé test, which has the advantage of not requiring an additional table, as many others do. It also has the advantage, or disadvantage, of being very conservative: it protects for an infinite number of comparisons. The Scheffé test uses an F-ratio to test the significance of the difference between two means, where the numerator of the F-ratio is the MS for only the effects of interest. The form of the equation is identical to equation 1.7, again with df = k − 1.

$SS_{Scheffé} = \sum \frac{T_j^2}{n_j} - \frac{G^2}{N}$    (1.15)

The difference is that only those terms one wishes to compare are used to calculate this SS term. For our example, if we were wishing to make a simple comparison between A2 and A3, we would proceed as below.

$SS_{Scheffé} = \frac{37^2}{8} + \frac{62^2}{8} - \frac{99^2}{16} = 171.125 + 480.5 - 612.5625 = 39.0625$

$MS_{Scheffé} = \frac{39.0625}{2} = 19.53125$

The error term to be used is the error term calculated from all of the data, i.e. 4.14. Therefore,


$F_{A_2 \text{ vs } A_3} = \frac{19.53125}{4.14} = 4.7177$

which is significant, p = .0203 at df (2, 21). The adjustment in Scheffé's method comes from the fact that, even though only one degree of freedom is used in the comparison, $MS_{Scheffé}$ is divided by two (the numerator df for the omnibus test rather than the df for the comparison), thus reducing F. Additionally, the obtained $F_{Scheffé}$ is evaluated against the same critical value as the overall F statistic for the ANOVA (2 and 21 df), again making it harder to achieve significance. Scheffé can offer advantages when conducting post-hoc complex comparisons. Generally one would use another method for simple post-hoc comparisons, as Scheffé will be too conservative in most instances.

Fisher's Least Significant Difference (LSD) test: This test offers no real protection against Type I error, but it is powerful. It requires the rejection of the overall F, at which point the researcher proceeds with individual (unadjusted) t-tests. This procedure is generally not recommended, but it is acceptable with a design containing only three conditions, provided the consequences of a Type I error are not great.

Tukey's HSD: This is one of the more popular tests, and it is easy to see why. Using special tables of the studentized range statistic, one can use the results of an analysis to find the minimum difference between two means that would be significant. Using our ongoing example from Winer et al. (1991) and the following formula, we will conduct this test.

$HSD = q_{.05(r, df_W)} \sqrt{\frac{MS_W}{n}}$

where the value of q is found in special tables and r is the number of groups. Hence, our value of q is 3.58. Recall that $MS_W$ is 4.14 and n = 8.

$HSD = 3.58\sqrt{\frac{4.14}{8}} = 2.58$

Any two means that differ by more than 2.58 would be considered significant. If cell sizes are unequal, HSD is still robust to type 1 error provided that the harmonic mean is used for n. The harmonic mean is defined as

$\bar{X}_h = \frac{k}{\sum \frac{1}{X_i}}$

where k is the number of means. For our example, the harmonic mean will be 8, since n is equal across groups. However, if we had 6, 7, and 9 observations for groups 1-3 respectively, it would be 3/(1/6 + 1/7 + 1/9) = 3/(.167 + .143 + .111) = 3/.421 = 7.126. This approach may not work well if the homogeneity of variance assumption has been violated and/or cell sizes are vastly different. In that case there are other procedures, which typically involve computing a different critical value for each mean comparison, such as the Games and Howell procedure.

The Newman-Keuls Test: This is an example of a test that takes order into account when performing post-hoc comparisons. We begin by arranging all means in ascending order.

T2 = 4.63, T1 = 4.75, T3 = 7.75

Next we define the number of steps between two means as i − j + 1, where i is the rank of the larger mean being compared and j is the rank of the smaller mean (literally, the rank order from smallest to largest). Thus, the T1-T2 comparison spans two steps (2 − 1 + 1) and the T3-T2 comparison spans three steps (3 − 1 + 1). Next we construct minimum significant differences for 2 and 3 steps (the only possibilities here),⁵ using the Tukey approach but computing once for r = 2 and once for r = 3. We already computed the value for r = 3 and came up with 2.58. Repeating for r = 2 gives 2.502. This means that any comparison spanning r = 3 steps must exceed 2.58 to be significant, and any comparison spanning r = 2 steps must exceed 2.502 to be significant.

There is also a procedure that must be followed when using this approach, one that keeps us from finding bewildering results and also controls familywise error. The procedure cannot be adequately demonstrated with a three-group study, so we will switch to another study for the time being. Here, A through G represent groups in a one-way experiment. The numbers in the top row are the means for each group, and the numbers in the matrix are mean differences. The column r indicates the number of steps apart for the comparisons on that diagonal, and the final column gives the minimum significant difference (interval) for comparisons at that r. Note the groups are ordered (C, A, B, and so on) from the smallest to the largest mean. Also note that the diagonals represent the same r; for instance, there is only one comparison for which r = 7, but there are two comparisons for which r = 6 (E vs. C, and F vs. A).

        C     A     B     D     G     E     F      r    Intrvl.
Mean   10    12    13    18    22    24    25
C             2     3     8    12    14    15      7    10.90
A                   1     6    10    12    13      6    10.56
B                         5     9    11    12      5    10.18
D                               4     6     7      4     9.68
G                                     2     3      3     8.96
E                                           1      2     7.82
F

5 Generally speaking, this procedure is more useful when several treatment groups are being compared. We use it here to stick with the current example.


        C     A     B     D     G     E     F      r
C                              **    **    **      7
A                              **    **    **      6
B                              **    **    **      5
D                                                   4
G                                                   3
E                                                   2

The testing procedure proceeds as follows (example taken from Winer et al., 1991).

1. The first test is made on the difference in the upper right-hand corner, for the maximum value of r. Since the difference, 15, exceeds the critical value, 10.90, we place two asterisks in the cell for that comparison in the 2nd table. If it were not significant, we would not proceed with any other tests. This constitutes our one and only comparison for r = 7.

2. We now move down and make comparisons for r = 6. We find that both values, 14 and 13, exceed the cutoff of 10.56. Thus we place asterisks in the appropriate cells.

3. We now move down to r = 5 and find three mean differences to compare against our cutoff of 10.18, they are 12, 12, and 12. All exceed our cutoff, thus we add asterisks in for these cells.

4. Moving to r = 4, we have a critical value of 9.68 and four values, 8, 10, 11, and 7. The values of 8 and 7 are not significant. For 8, no further comparisons are made for the triangle which is formed by the value of 8 in the upper right hand corner. This region is represented with strike-throughs above.

5. The same is done for 7, which is also not significant.
6. We now proceed to r = 3 and make any comparisons that do not involve the regions blocked out by the previous comparisons. There is only one comparison left to make, 9, which exceeds the critical value of 8.96.

This procedure is sometimes taken to another level, where one graphically depicts which values are significantly different vs. which aren’t. Thus…

C A D B G E F

Treatments underlined by a common line do not differ from each other, but those underlined by different lines do. Thus, C, A, & D do not differ from each other, but do differ from G, E & F. The idea behind this approach is to divide up the risk according to how far apart the means are, rather than creating a simultaneous interval for all possible comparisons. One problem is that under some circumstances, namely when the null hypothesis is only partially true, this procedure can lead to higher levels of Type I error. It can also be more powerful than the Tukey approach.

The Ryan Procedure (REGWQ): this procedure is similar to the Newman-Keuls test but controls Type I error better. Basically it stays within the framework of Newman-Keuls by taking order into account, but it adjusts alpha for each successive comparison to control Type I error. This procedure is available in SPSS and SAS and is recommended as one that balances control of Type I error with power very well.

Dunnett's Test for Comparing All Treatments with a Control: When there are several treatment groups that one wishes to compare with a single control group, this test is the most powerful. Special tables are required, from which one can derive a minimum significant difference. Back in our ongoing example, let's assume that treatment 1 is a control group. We need to know the error df and the number of groups (including the control); from that point we look up the appropriate statistic $t_d$, which is 2.38 in our case. Our critical value then becomes

$D_t = t_d \sqrt{\frac{MS_W}{n}}$, which for our case is

$D_t = 2.38\sqrt{\frac{4.14}{8}} = 1.71$

So, any mean difference from the control that exceeds 1.71 would be significant. Still, only group three differs significantly.

Effect Sizes

Finally, when F is significant it is appropriate to report effect sizes. Two common effect sizes are η² and ω², pronounced eta-squared and omega-squared. These two indices are analogous to R² and adjusted R² in regression; in the case of η², it is literally the same formula as R². The formulas for determining these two indices are provided below.

$\eta^2 = \frac{SS_M}{SS_{TOT}}$    (1.16)

$\omega^2 = \frac{SS_M - (k-1)MS_W}{SS_{TOT} + MS_W}$    (1.17)

These two approaches will yield similar results, and as sample size increases they converge. The latter index, ω², is designed to correct for the fact that η² is positively biased in small samples. In our example,

$\eta^2 = \frac{50.08}{136.96} = 0.366$, and

$\omega^2 = \frac{50.08 - (3-1)(4.14)}{136.96 + 4.14} = \frac{50.08 - 2(4.14)}{141.1} = 0.296$
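Both indices come directly from quantities already in hand; a minimal sketch in plain Python, using the values from the example above:

```python
# Sketch of equations 1.16 and 1.17 using the one-way example values.
ss_m, ss_tot, ms_w, k = 50.08, 136.96, 4.14, 3

eta_sq = ss_m / ss_tot                                  # about 0.366
omega_sq = (ss_m - (k - 1) * ms_w) / (ss_tot + ms_w)    # about 0.296
```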


In other words, if we use our ω² estimate, which is a better estimate of the effect size in the population, our teaching method accounts for approximately 30% of the variation in test scores.

Assumptions of ANOVA

Central to the use of ANOVA are several assumptions. The three main assumptions are that, within each group, the observations are drawn from a population that is normally distributed on the dependent variable; that there is homogeneity of variance; and that the observations are independent.

The first assumption stems from the fact that the mean may inaccurately reflect central tendency in nonnormally distributed data, and the variance may inadequately reflect variation under the same circumstances. A logical extension of the normality assumption is that one's data should be examined for outliers, since a distribution with extreme scores can't be "normal." Homogeneity of variance means the variation within conditions is roughly equivalent. The rationale for this assumption follows directly from the derivation of the error term in ANOVA: we pool within-group variation (variation we cannot explain) across the experimental groups. If one group has drastically different variance, then pooling the variances, which is meant to form a more stable estimate of error, may be suspect. The last main assumption is that observations are independent, meaning that within the sample, one person's score on the DV shouldn't be related to another person's score.

Additionally, there are other assumptions, or at least consequences of the above assumptions. The following also apply, but are not as frequently mentioned or evaluated.

• It is assumed that the F ratio is the ratio of two independent estimates of the same population variance ($\sigma^2_\varepsilon$) when the null hypothesis is true.
• In the fixed effects model, it is assumed that all levels of the factor of interest are represented in the experiment.
• It is assumed that random samples are taken from normally distributed populations.
  o Violating random assignment and sampling can invalidate the conclusions that can typically be drawn from experimental designs. Random assignment is the mechanism by which bias is removed from the assignment of subjects to conditions, and random sampling helps to ensure a normal distribution of error and observed terms, hence preserving the independence of the F numerator and denominator.
• Good experimental control is assumed.
• Some of these assumptions imply that the treatment has no influence over the shape of the population distributions, e.g., their variances.
• Homogeneity of variance is assumed. This means that error variance is homogeneous within each treatment condition.
  o It is often said, and sometimes true, that the F-statistic is "robust" to violations of this homogeneity assumption. In general, if this assumption is violated modestly and sample size is large, there is little difference in the outcome. For instance, when the ratio of variances for three treatment conditions is 1:2:3, the observed probability of F when the null hypothesis is true is .059 instead of .05. Also, according to Stevens, skewness seems to have little impact. However, when variances are markedly different, e.g., 1:1:9, the observed probabilities can be closer to .15. Flatter distributions (low kurtosis) do seem to affect power. Overall, keeping group sizes the same, or very nearly the same, helps keep ANOVA robust to violations of its assumptions.


o It should be noted there are at least a couple of different ways to test for homogeneity of variance. The simplest involves forming a ratio of the MSW where the numerator is the treatment condition with the largest MSW and the denominator is the treatment condition with the smallest MSW. This is evaluated as an F ratio where a special table (not the usual F table) has to be consulted. There are more robust methods also, but this gives you an idea of how this assumption is tested. Also, SPSS provides a test, Levene’s Statistic. In this test, failing to reject the null hypothesis results in concluding that this assumption has not been violated.
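As an illustration of both ideas, a short sketch (assuming NumPy and SciPy are available; SciPy's levene defaults to median-centering, a Brown-Forsythe-style variant of the Levene test SPSS reports):

```python
# Sketch: two quick homogeneity-of-variance checks on the teaching-method data.
import numpy as np
from scipy import stats

groups = [np.array([3, 5, 2, 4, 8, 4, 3, 9]),
          np.array([4, 4, 3, 8, 7, 4, 2, 5]),
          np.array([6, 7, 8, 6, 7, 9, 10, 9])]

variances = [g.var(ddof=1) for g in groups]   # about 6.21, 3.98, 2.21
f_max = max(variances) / min(variances)       # largest-to-smallest ratio (compare to an F-max table)

stat, p = stats.levene(*groups)               # failing to reject (p > .05) -> assumption not violated
```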

Robustness of ANOVA to Violations of Its Assumptions

• Normality: ANOVA is typically robust to violations of this assumption. Kurtosis can affect power (platykurtic distributions lower power, and leptokurtic distributions can produce a greater incidence of Type I errors), but skewness generally has little effect on the results. One caveat is that outliers, observations with z-scores greater than 2.5 in absolute value, should be removed to see whether the results of the analysis change (Kirk, 1995).

• Homogeneity of Variance: If sample sizes are equal or nearly equal (largest-to-smallest ratio < 1.5) and the violation is not too severe (largest-to-smallest variance ratio of roughly 4:1 or less), then ANOVA is robust.

• Independence of Observations: ANOVA is not robust to violations of this assumption. Dependence among observations can have a profound effect on Type I error, increasing it substantially. It is best controlled through appropriate experimental design, or by using a different technique that explicitly models the dependence structure.

Alternatives: There are three classes of alternatives: use a different statistical model; transform the data to ameliorate the violation of assumptions; or adjust alpha, either directly or by adjusting the df to require a more stringent cutoff. Below are some options:

• Evaluate F at more conservative df, such as (1, N − 1). Another option is the procedure by Welch, which is somewhat cumbersome but can be done by hand; it also adjusts the df so that the critical F value is higher, thus making the effective α lower.

• Simply test at a more stringent alpha if worried about the violation of assumptions, such as .01, or .001.

• Transforming variables is also a possibility. Here, one needs to find the appropriate transformation to bring the distributions to within reasonable limits of the assumptions of ANOVA.

• Utilize a nonparametric technique, namely the Kruskal-Wallis One-Way ANOVA.
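For the last option, a short sketch with SciPy (illustrative only; a transformation or the Welch adjustment would be set up separately):

```python
# Sketch: Kruskal-Wallis one-way ANOVA on ranks, applied to the teaching-method data.
from scipy import stats

method1 = [3, 5, 2, 4, 8, 4, 3, 9]
method2 = [4, 4, 3, 8, 7, 4, 2, 5]
method3 = [6, 7, 8, 6, 7, 9, 10, 9]

H, p = stats.kruskal(method1, method2, method3)
print(H, p)   # a small p leads to the same substantive conclusion as the parametric F test
```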


2. Factorial ANOVA

A basic understanding of a couple of concepts greatly aids one's ability to comprehend a host of ANOVA designs. These two concepts are:

1. Partitioning of Variance / Sums of Squares
2. The ratio of effect to error.

Armed with this understanding, more complicated designs can be picked apart by understanding the effects that these other designs yield and finding the correct error term. The first extension of the basic design already described will be a between groups factorial design, again a fixed effects design. The simple case is a two-way ANOVA. In the two-way design, the one-way design is extended by the addition of a second factor, so now there are two between-groups independent variables. This particular design is very common and allows for the test of more than one effect. In the case of a Two-Way ANOVA, where the letter A denotes the first factor (independent variable) and B denotes the second, one can get a Main Effect for A, a Main Effect for B and an Interaction between A and B (A x B).

• A main effect is a significant effect for that variable – put another way, there are significant differences among the marginal means of different levels for that particular factor, ignoring levels of the other factor. Main effects are known as additive effects.

• A significant interaction means that the effect of one variable varies depending upon the level of the second variable. These are known as multiplicative effects.

Given that there are three effects obtainable from a Two-Way ANOVA, one needs to be able to partition the variance into these three effects. The three terms (A, B, & A x B) denote three potential treatment effects. The structural model for this design is as follows.

$X_{ijk} = \mu + \alpha_i + \beta_j + \alpha\beta_{ij} + \varepsilon_{ijk}$    (2.0)

where $X_{ijk}$ is person k's score in level i of Factor A and level j of Factor B, µ is the grand mean of all treatment conditions, $\alpha_i$ is the effect of level i of Factor A, $\beta_j$ is the effect of level j of Factor B, $\alpha\beta_{ij}$ is the interaction effect, or the joint effect of level i of Factor A and level j of Factor B, and $\varepsilon_{ijk}$ is the error term associated with this observation. As before, the error term represents variation we cannot explain.

The calculation of the total sum of squares is no different in this design; it is still defined as the sum of the squared deviations of each observation from the grand mean. We begin partitioning this total variation by estimating the overall treatment effect for all three effects (A, B, & A x B) together. This is $SS_{BG}$, the Sums of Squares Between Groups (sometimes written $SS_{AB}$).

$SS_{BG} = n\sum(\overline{AB}_{ij} - \bar{G})^2$    (2.1)


This is the variation of each cell mean around the grand mean, hence it represents a total treatment effect. A computational formula is given as…

$SS_{BG} = \sum\frac{T^2}{n} - \frac{G^2}{N}$    (2.2)

with df = (number of cells) − 1, where T is the sum of scores within each cell, n is the number of participants within each cell, G is the sum of all scores, and N is the total number of participants. Before moving on, note that the following relationship holds.

$SS_{BG} = SS_A + SS_B + SS_{A \times B}$    (2.3)

In words, the total treatment effect can be partitioned into an effect for A, an effect for B, and an effect for A x B. The next step is to further partition $SS_{BG}$ into these component parts. First, the main effect for A is as follows.

$SS_A = nq\sum_i(\bar{A}_i - \bar{G})^2$    (2.4)

where n is the number of observations within a cell, q is the number of cells collapsed over to form each A mean (i.e., the number of levels of B), and $\bar{G}$ is the grand mean. In words, this is the sum of the squared deviations of each treatment-level mean of A (collapsing over levels of B) from the grand mean, weighted by the number of observations that go into that mean (i.e., nq). A computational formula for $SS_A$ follows.

$SS_A = \sum\frac{T_{ROW}^2}{n_{ROW}} - \frac{G^2}{N}$    (2.5)

where $T_{ROW}$ is the total score for the row and $n_{ROW}$ is the total number of participants for the row. The degrees of freedom for $SS_A$ are $k_A - 1$, where $k_A$ is the number of levels of A. Similarly, $SS_B$ is calculated as

$SS_B = np\sum_j(\bar{B}_j - \bar{G})^2$    (2.6)

where n is the number of observations per cell and p is the number of cells collapsed over to form each B mean (i.e., the number of levels of A). Computationally,

SS_B = \sum \frac{T_{COL}^2}{n_{COL}} - \frac{G^2}{N}    2.7

with df = kB − 1, where TCOL is the total score for a column and nCOL is the total number of participants in that column.


The variance associated with the interaction term can be defined by the following formula.

SS_{A \times B} = n \sum (\overline{AB}_{ij} - \overline{A}_i - \overline{B}_j + \overline{G})^2    2.8

Hence, the variance associated with the interaction term can be thought of as the sum of the squared deviations of each treatment cell, with the appropriate level A and B means subtracted out and the grand mean added back in. This number is weighted by the number of observations within each cell. The degrees of freedom for this term are dfbetween − dfA − dfB.

Because of the additive nature of our partitioned Sums of Squares, SSAxB can also be arrived at with the following.

SS_{A \times B} = SS_{Between} - SS_A - SS_B    2.9

Finally, the within condition variance, or error variance, can be arrived at as follows.

SS_W = \sum (X_{ijk} - \overline{AB}_{ij})^2    2.10

This is simply the sum of the squared deviations of each score from its respective treatment (cell) mean. It can also be calculated by adding the SS calculated separately from each treatment condition. The degrees of freedom for this term are k(n − 1), or the number of people within each cell minus one, times the number of cells.

SS_W = \sum SS_{W_{ij}}    2.11

A more convenient computational formula for SSW is…

SS_W = \sum X_{ijk}^2 - \sum \frac{T^2}{n}    2.12

Numerical Example

The following example is taken from Gravetter & Wallnau (2002). It is a simple between subjects Two-Way ANOVA problem. Factor A is Task Difficulty and Factor B is Anxiety Level. The various Sums of Squares, Mean Squares and F ratios are already provided. Without excessive commentary, the computation of a Two-Way ANOVA will be carried out using the computational formulas.


                                    Factor B: Anxiety Level
Factor A:
Task Difficulty      Low               Medium            High              Marginals for Task Difficulty
Easy                 3 1 1 6 4         2 5 9 7 7         9 9 13 6 8
                     T = 15            T = 30            T = 45            T = 90
                     X̄ = 3             X̄ = 6             X̄ = 9             X̄ = 6
                     SS = 18           SS = 28           SS = 26           SS = 162
Difficult            0 2 0 0 3         3 8 3 3 3         0 0 0 5 0
                     T = 5             T = 20            T = 5             T = 30
                     X̄ = 1             X̄ = 4             X̄ = 1             X̄ = 2
                     SS = 8            SS = 20           SS = 20           SS = 78
Marginals for        T = 20            T = 50            T = 50            G = 120
Anxiety              X̄ = 2             X̄ = 5             X̄ = 5             X̄ = 4
                     SS = 36           SS = 58           SS = 206          ΣX² = 840

Source      SS     df    MS       F     p(F)
SSBG        240     5    48
SSA         120     1    120     24     0.0000537
SSB          60     2    30       6     0.0077073
SSAxB        60     2    30       6     0.0077073
SSW         120    24    5
SSTOT       360    29    12.41
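To make the computational formulas concrete, here is a minimal Python sketch (my own check, not part of the Gravetter & Wallnau example) that reproduces the source table above directly from the raw scores; the array layout and variable names are assumptions of mine.

import numpy as np

# Raw scores transcribed from the data table above.
# First index: Task Difficulty (0 = Easy, 1 = Difficult)
# Second index: Anxiety (0 = Low, 1 = Medium, 2 = High)
cells = np.array([
    [[3, 1, 1, 6, 4], [2, 5, 9, 7, 7], [9, 9, 13, 6, 8]],   # Easy
    [[0, 2, 0, 0, 3], [3, 8, 3, 3, 3], [0, 0, 0, 5, 0]],    # Difficult
], dtype=float)                                              # shape (2, 3, 5)

n = cells.shape[2]                 # observations per cell (5)
N = cells.size                     # total observations (30)
G = cells.sum()                    # grand total (120)

ss_tot = (cells**2).sum() - G**2 / N                             # 360
ss_bg  = (cells.sum(axis=2)**2).sum() / n - G**2 / N             # 240  (eq. 2.2)
ss_a   = (cells.sum(axis=(1, 2))**2).sum() / (3 * n) - G**2 / N  # 120  (eq. 2.5)
ss_b   = (cells.sum(axis=(0, 2))**2).sum() / (2 * n) - G**2 / N  #  60  (eq. 2.7)
ss_axb = ss_bg - ss_a - ss_b                                     #  60  (eq. 2.9)
ss_w   = ss_tot - ss_bg                                          # 120
ms_w   = ss_w / (2 * 3 * (n - 1))                                # 5.0

print(ss_bg, ss_a, ss_b, ss_axb, ss_w, ms_w)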

Using equation 2.2, SSBG is calculated as follows.

SS_{BG} = \frac{15^2 + 30^2 + 45^2 + 5^2 + 20^2 + 5^2}{5} - \frac{120^2}{30} = \frac{3600}{5} - \frac{14,400}{30} = 720 - 480 = 240

Using equation 2.5, SSA is…

SS_A = \frac{90^2 + 30^2}{15} - \frac{120^2}{30} = \frac{9,000}{15} - \frac{14,400}{30} = 600 - 480 = 120

Similarly, from equation 2.7, SSB is…

SS_B = \frac{20^2 + 50^2 + 50^2}{10} - \frac{120^2}{30} = \frac{5,400}{10} - \frac{14,400}{30} = 540 - 480 = 60


SSAxB = 240 − 120 − 60 = 60 (from equation 2.9). Finally, we will use equation 2.11 to form the error term, SSW, which is the sum of the within cell SS terms.

SSW = 18 + 28 + 26 + 8 + 20 + 20 = 120

Dividing these SS terms by the appropriate df for each one yields the MS terms. For each effect, the F ratio is the ratio of that particular MS to the MSW term (SSW divided by its df).

Comparison of Means and Interpretation of Effects

As in the case of a One-Way ANOVA, planned and unplanned comparisons can be carried out on the main effects for A and B. In addition, a significant interaction can be investigated in a couple of different ways. Sometimes researchers will plot the interaction in order to interpret it. An interaction is evident when lines representing different groups (levels of one of the factors) are not parallel. While much can be learned by merely graphing the interaction, researchers will often go beyond plotting and describing the interaction and seek out more quantitative tests. One such set of tests is known as tests of simple main effects and simple effects. A simple main effect is when a particular row or column (level of one of the factors) is examined across levels of the second factor. In order to do this, a special F-Ratio must be constructed, beginning with constructing special Sums of Squares. These simple effects comparison Sums of Squares can be obtained as follows.

SS_j = \frac{\sum T_{ij}^2}{n} - \frac{T_j^2}{n_j}    2.13

Where SSj is the Sums of Squares for a given level of a factor, call it row j (i.e., level 1 of Factor A), Tij represents the cell total for column i in row j, n is the number of observations per cell, Tj is the total for row j, and nj is the total number of observations in row j. The df for SSj is equal to the number of cells in the row/column minus 1. This formula can be easily adapted for testing columns within rows. We will do it both ways for our example. First, let's examine level 1 of task difficulty (easy) over the three levels of anxiety. Applying equation 2.13, we have…

SS_{j=1} = \frac{15^2 + 30^2 + 45^2}{5} - \frac{90^2}{15} = 630 - 540 = 90

Next, let’s repeat this for level 2 of task difficulty (hard) over the three levels of anxiety.


SS_{j=2} = \frac{5^2 + 20^2 + 5^2}{5} - \frac{30^2}{15} = 90 - 60 = 30

We can adapt the formula again to look at the columns, comparing across the rows within each column. In other words, the effect of task difficulty within the low anxiety condition is…

SS_{k=1} = \frac{15^2 + 5^2}{5} - \frac{20^2}{10} = 50 - 40 = 10

And, medium anxiety across task difficulty…

SS_{k=2} = \frac{30^2 + 20^2}{5} - \frac{50^2}{10} = 260 - 250 = 10

And, finally, high anxiety across task difficulty.

SS_{k=3} = \frac{45^2 + 5^2}{5} - \frac{50^2}{10} = 410 - 250 = 160

These Sums of Squares are converted to Mean Squares by dividing by the appropriate degrees of freedom. The F-Ratio is formed by dividing those MS values by our error term (MSW). The table below summarizes the results.

Simple Effects Tests

Source      SSc    df   MSc     F     p(F)       Description
Row 1        90     2    45     9     0.001212   Effect of anxiety on performance within level 1 of task difficulty (easy)
Row 2        30     2    15     3     0.068719   Effect of anxiety on performance within level 2 of task difficulty (difficult)
Column 1     10     1    10     2     0.170142   Effect of task difficulty on performance within level 1 of anxiety (low)
Column 2     10     1    10     2     0.170142   Effect of task difficulty on performance within level 2 of anxiety (medium)
Column 3    160     1   160    32     0.000008   Effect of task difficulty on performance within level 3 of anxiety (high)
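As a check on these simple-effects calculations, the following short Python sketch (mine, not part of the original handout) applies equation 2.13 to the cell totals from the data table and reproduces the SS, MS, and F values above; the helper function and its name are my own.

def simple_effect_ss(cell_totals, n):
    # Sum of squared cell totals over n, minus squared marginal total over its n (eq. 2.13).
    marginal = sum(cell_totals)
    return sum(t**2 for t in cell_totals) / n - marginal**2 / (n * len(cell_totals))

n = 5                                              # observations per cell
ms_w = 5.0                                         # error term from the full design
effects = {
    'Row 1 (easy)':       [15, 30, 45],            # anxiety within easy tasks
    'Row 2 (difficult)':  [5, 20, 5],              # anxiety within difficult tasks
    'Column 1 (low)':     [15, 5],                 # difficulty within low anxiety
    'Column 2 (medium)':  [30, 20],                # difficulty within medium anxiety
    'Column 3 (high)':    [45, 5],                 # difficulty within high anxiety
}
for label, totals in effects.items():
    ss = simple_effect_ss(totals, n)
    df = len(totals) - 1
    print(label, ss, ss / df, (ss / df) / ms_w)    # SS, MS, F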

Clearly, these tests are redundant in that they yield the same information. Looking at the graph below, and comparing it to the results of our simple effects test, there are two equivalent interpretations of the interaction.

• The interaction is the result of lack of a significant effect for anxiety on performance in difficult tasks (contrasted with the fact that anxiety serves to increase performance levels with easy tasks), or

• The interaction is the result of a significant effect for task difficulty on performance in the high anxiety condition.


[Figure: "Anxiety & Task Difficulty" - mean performance (0-10) plotted against Anxiety Level (Low, Medium, High), with separate lines for Easy and Difficult tasks.]

If an interaction is significant, there is said to be a moderator effect; that is, anxiety moderates the effect of task difficulty on performance.

Another point should be made with respect to simple effects analysis. These can be carried out in the form of contrasts, and if the contrasts involve only certain cells (say, a comparison of performance under medium and high anxiety within the difficult-task condition, ignoring the low-anxiety cell), then one is really conducting a simple effects analysis rather than a simple main effects analysis. The computation of these effects is not terribly difficult, but getting SPSS to do it is.

Another approach to examining interactions is to conduct simple interaction comparisons. In this approach, we collapse or drop certain rows or columns. As an example, say we anticipate that the interaction will take place as a result of high anxiety and high task difficulty. We could do one interaction test which ignores this effect and just looks at low and high task difficulty and low and medium anxiety. Theory would tell us that this interaction should not be significant and, in fact, the graph above tells us the lines are parallel for this analysis. So, we might plan such an analysis assuming that the interaction will not be significant. Second, we may choose in advance, based on theory, to do a second simple interaction analysis either 1) comparing high and medium (or low) anxiety conditions using low and high task difficulty, or 2) collapsing low and medium anxiety to do a 2 x 2 analysis of high vs. low task difficulty and low/medium vs. high anxiety. We would anticipate that this interaction would be significant, hence aiding us in quantifying our interpretation of the location of the interaction.

Finally, contrasts such as those discussed with respect to the 1-way ANOVA can also be accomplished on main effects in a factorial design. These contrasts would typically be done on the main effects of the design, using the marginal means, though they could be done within rows or columns; if so, they become variations on the theme of simple main effects analysis. It seems reasonable, however, not to embark on contrasts involving the marginal means (i.e., means for one factor ignoring the other factor) if there is a significant interaction. This would imply the potential to misinterpret the contrasts, since


the results of the analysis would be uninterpretable without taking the interaction into account.

As the reader may have noticed by this point, the interaction in our example boils down to a difference in performance when high anxiety is crossed with high task difficulty. If this one cell had followed the general pattern of the other cells, the interaction would not have been significant. Put another way, if participants had performed better under high anxiety than under low and medium anxiety within the high-difficulty condition, we would have had only significant main effects. From my experience, this is not atypical: the source of an interaction is often isolated to one or two cells.

Other Post-Hoc Tests

A visual inspection of the graph above, plus an examination of the simple main effects, might give rise to other comparisons. In this particular example, it is obvious: performance in the high anxiety condition differs significantly depending upon the difficulty of the task. Tukey's HSD could be used to make this comparison, or a Scheffé post-hoc test could be conducted (Keppel, 1982). Following the previous examples, the Scheffé test will be presented below. Using formula 1.15, we could obtain the following…

SS_{Scheffé} = \frac{45^2}{5} + \frac{5^2}{5} - \frac{50^2}{10} = 405 + 5 - 250 = 160

Since there is only one degree of freedom associated with this comparison (i.e., two levels of Task Difficulty), the MS is 160. Thus, FScheffé would be 160/5 = 32. With 1 and 24 degrees of freedom, this F value would clearly be significant. Finally, specific contrasts could also be used in comparing means, keeping in mind that these contrasts (as discussed above) should be presented with a correction for experiment-wise error.

One can also use other procedures, such as Tukey's, to compare cell means. If it is deemed enlightening to do so, one uses the tabled q statistic, substituting the number of cells for r, i.e., treating the factorial design as a large one-way design (Howell, 2002). When using Tukey's HSD, it is recommended that comparisons be made within a row or column, which is the natural tendency anyway, since crossing rows and columns (e.g., comparing medium anxiety and low task difficulty with high anxiety and high task difficulty) will result in an ambiguous interpretation. Finally, one could use a standard t-test to compare means between cells, making a correction for inflation of Type 1 error. In general, SPSS does not allow for simple comparisons automatically; one must use a special contrast statement. A brief summary of contrast statements is presented next.

Testing Simple Main Effects and Making Contrasts in SPSS

Some types of contrasts are built into SPSS; for instance, SPSS offers simple, deviation, and some other types of contrasts. Additionally, the documentation on these contrasts (in the SPSS Syntax Guide) is fairly straightforward. Specialized, or custom, contrasts,


however, are not. They are difficult to understand, and the documentation offers little help, in my opinion. Some tricks, however: first set up your model using the GLM procedure, choosing Univariate. Declare your independent variables as fixed factors and your dependent variable as the dependent variable. Choose Post Hoc to specify any post-hoc tests for main effects. Finally, go to Options and request that SPSS print the contrast coefficient matrix, as well as means and any comparisons/alpha corrections that are called for in the analysis. Printing the contrast coefficient matrix allows you to see what contrasts you are actually requesting. When finished, press the Paste button; from this point on, you will need to work within the syntax window to get simple effects contrasts.

In the SPSS syntax, insert an "/lmatrix" subcommand; the "/" indicates it is a subcommand under the UNIANOVA heading. Follow the subcommand with a descriptive label in tick marks, as below. Finally, express your contrast. In this example there are three contrasts: 1) looking at low anxiety within the level of low difficulty, 2) medium anxiety within the level of low difficulty, and 3) high anxiety within the level of low difficulty. The contrast matrix for the first 'lmatrix' command is pasted below. As can be seen, the contrast is isolating low anxiety within easy task difficulty. The purpose of requesting this matrix is to determine that you have specified the contrast you wish to specify.

UNIANOVA
  score BY difficlt anxiety
  /METHOD = SSTYPE(3)
  /INTERCEPT = INCLUDE
  /POSTHOC = anxiety ( TUKEY )
  /PLOT = PROFILE( anxiety*difficlt )
  /EMMEANS = TABLES(difficlt*anxiety)
  /EMMEANS = TABLES(difficlt)
  /EMMEANS = TABLES(anxiety)
  /CRITERIA = ALPHA(.05)
  /DESIGN = difficlt anxiety difficlt*anxiety
  /lmatrix 'low anxiety within low difficulty' difficlt*anxiety 1 0 0 -1 0 0 difficlt 1 -1
  /lmatrix 'medium anxiety within low difficulty' difficlt*anxiety 0 1 0 0 -1 0 difficlt 1 -1
  /lmatrix 'high anxiety within low difficulty' difficlt*anxiety 0 0 1 0 0 -1 difficlt 1 -1.

Contrast Coefficients (L' Matrix)(a)

                               Contrast
Parameter                      L1
Intercept                       0
[DIFFICLT=1]                    1
[DIFFICLT=2]                   -1
[ANXIETY=1]                     0
[ANXIETY=2]                     0
[ANXIETY=3]                     0


[DIFFICLT=1] * [ANXIETY=1]      1
[DIFFICLT=1] * [ANXIETY=2]      0
[DIFFICLT=1] * [ANXIETY=3]      0
[DIFFICLT=2] * [ANXIETY=1]     -1
[DIFFICLT=2] * [ANXIETY=2]      0
[DIFFICLT=2] * [ANXIETY=3]      0

The default display of this matrix is the transpose of the corresponding L matrix.
a  low anxiety within low difficulty

In the current example, SPSS is interpreting the design as a 2 x 3, because difficulty was listed first in the UNIANOVA statement (score BY difficlt anxiety). Hence, the first three numbers in the contrast represent the levels of anxiety within the first level of difficulty, the second three represent anxiety within the 2nd level of difficulty, and the two numbers following difficlt correspond to the first and 2nd levels of difficulty. The positive numbers represent the levels being isolated. Had this been a 3 x 3 design, for instance, there would be three sets of numbers after the interaction. Say, for instance, that you wanted to contrast level 2 with level 3 of Factor B, while holding Factor A constant at level 3. The following contrast would allow this.

/lmatrix 'B level 2 & 3 with A=3' A*B 0 0 0 0 0 0 0 -1 1 B 0 -1 1

Cells at level 1 of B are assigned zeros because they do not take part in the contrast. Finally, one more example: let's say that we want a slightly more complicated contrast, where we want to compare B level 3 against the average of B levels 1 and 2, again holding A constant at 3. The following syntax would allow for this.

/lmatrix 'B level 1 & 2 vs. 3 with A=3' A*B 0 0 0 0 0 0 -1 -1 2 B -1 -1 2

Other Approaches to Carrying out Contrasts

At this point, one might reasonably ask why a simple main effects contrast cannot be carried out by specifying a one-way ANOVA for just the level one is interested in. For instance, say we wanted to do two simple main effects analyses on anxiety level, across the two levels of task difficulty (or three simple effects across task difficulty within each level of anxiety). One approach would be to run two separate one-way ANOVAs, one for easy tasks and a second for difficult tasks. This is not exactly a bad approach, with some modification. The main problem has to do with the error term, MSW. When we do this, we get the following in SPSS for Low Task Difficulty:

ANOVA(a)


Performance

                  Sum of Squares    df    Mean Square      F      Sig.
Between Groups        90.000         2       45.000      7.500    .008
Within Groups         72.000        12        6.000
Total                162.000        14

a  Task Difficulty = Low

And the following for High Task Difficulty:

ANOVA(a)

Performance

                  Sum of Squares    df    Mean Square      F      Sig.
Between Groups        30.000         2       15.000      3.750    .054
Within Groups         48.000        12        4.000
Total                 78.000        14

a  Task Difficulty = High

In other words, the MS numerator is correct both times in that it agrees with what we calculated above. However, the denominator MSW is different for the two analyses. This means that, if the assumption of homogeneity of variance is tenable, we can do this type of analysis, but we should re-compute the F-value using the overall MSW calculated above as the error term, MSW = 5.00. If the assumption of homogeneity is not tenable, then perhaps it makes sense to take these separate one-ways as simple main effects analyses without recalculating F using the pooled error term.

Simple interaction comparisons can also be accomplished through contrast coefficients. However, here we will trick SPSS into doing them for us, recalling that we should recalculate F using the correct error term. The syntax pasted below will provide the output just described above as well as the simple interaction comparisons. First, the syntax:

* this block of code filters out the low anxiety condition, yielding a 2 x 2 design
  for medium and high anxiety vs. high and low task difficulty.
USE ALL.
COMPUTE filter_$=(anxiety=2 or anxiety=3).
VARIABLE LABEL filter_$ 'anxiety=2 or anxiety=3 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE .
UNIANOVA
  score BY difficlt anxiety
  /METHOD = SSTYPE(3)
  /INTERCEPT = INCLUDE
  /PLOT = PROFILE( anxiety*difficlt )
  /CRITERIA = ALPHA(.05)
  /DESIGN = difficlt anxiety difficlt*anxiety .


The output from this design demonstrates the expected significance of this two-way interaction.

Tests of Between-Subjects Effects
Dependent Variable: Performance

Source                Type III Sum of Squares    df    Mean Square       F       Sig.
Corrected Model             170.000(a)            3       56.667       9.645     .001
Intercept                   500.000               1      500.000      85.106     .000
Difficlt                    125.000               1      125.000      21.277     .000
Anxiety                        .000                1         .000        .000    1.000
difficlt * anxiety           45.000               1       45.000       7.660     .014
Error                        94.000              16        5.875
Total                       764.000              20
Corrected Total             264.000              19

a  R Squared = .644 (Adjusted R Squared = .577)

* this block of code filters out high anxiety yielding a 2 x 2 design for low and
  medium anxiety and high vs. low task difficulty.
FILTER OFF.
USE ALL.
EXECUTE .
USE ALL.
COMPUTE filter_$=(anxiety=1 or anxiety=2).
VARIABLE LABEL filter_$ 'anxiety=1 or anxiety=2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE .
UNIANOVA
  score BY difficlt anxiety
  /METHOD = SSTYPE(3)
  /INTERCEPT = INCLUDE
  /PLOT = PROFILE( anxiety*difficlt )
  /CRITERIA = ALPHA(.05)
  /DESIGN = difficlt anxiety difficlt*anxiety .
FILTER OFF.
USE ALL.
EXECUTE .

The output from this design demonstrates the expected non-significance of this two-way interaction.

Tests of Between-Subjects Effects
Dependent Variable: Performance

Source                Type III Sum of Squares    df    Mean Square       F       Sig.
Corrected Model              65.000(a)            3       21.667       4.685     .016
Intercept                   245.000               1      245.000      52.973     .000
difficlt                     20.000               1       20.000       4.324     .054


anxiety                      45.000               1       45.000       9.730     .007
difficlt * anxiety             .000                1         .000        .000    1.000
Error                        74.000              16        4.625
Total                       384.000              20
Corrected Total             139.000              19

a  R Squared = .468 (Adjusted R Squared = .368)

* finally, this block of code combines low & medium anxiety to contrast this condition
  against high anxiety, across high and low task difficulty. Again, it is a 2 x 2 design,
  this time unbalanced since the low and medium anxiety conditions are combined.
TEMPORARY.
RECODE anxiety (LO THRU 2 = 2).
UNIANOVA
  score BY difficlt anxiety
  /METHOD = SSTYPE(3)
  /INTERCEPT = INCLUDE
  /PLOT = PROFILE( anxiety*difficlt )
  /CRITERIA = ALPHA(.05)
  /DESIGN = difficlt anxiety difficlt*anxiety .
TITLE 'collapsing low & medium anxiety for simple interaction test'.

Finally, as would be expected, the output from this simple interaction comparison is significant.

Tests of Between-Subjects Effects
Dependent Variable: Performance

Source                Type III Sum of Squares    df    Mean Square       F       Sig.
Corrected Model             195.000(a)            3       65.000      10.242     .000
Intercept                   481.667               1      481.667      75.899     .000
difficlt                    166.667               1      166.667      26.263     .000
anxiety                      15.000               1       15.000       2.364     .136
difficlt * anxiety           60.000               1       60.000       9.455     .005
Error                       165.000              26        6.346
Total                       840.000              30
Corrected Total             360.000              29

a R Squared = .542 (Adjusted R Squared = .489)

It should be noted, again, that if the assumption of homogeneity is tenable, the actual F for the interaction (and any other effects that are to be interpreted) should be recomputed using the MSW of 5.00 from the entire design. Hence, for the three summary tables here, the F's for the interactions should be (in order): 45.00 / 5.00 = 9.00; 0.00 / 5.00 = 0.00; and 60.00 / 5.00 = 12.00. The significance of the F's should be evaluated at the df associated with the specific analysis, so for the last example, 1 and 26. For this particular example, which is clearly contrived, the choice of denominator didn't matter. In other instances it may.
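The re-computation described above is easy to script. The sketch below is my own (it assumes SciPy is available); it divides each simple-interaction MS by the pooled MSW of 5.00 and evaluates the resulting F at the df suggested in the text.

from scipy.stats import f

ms_w = 5.00                                    # pooled within-cell error from the full 2 x 3 design
# (interaction MS from the sub-analysis, error df used to evaluate the recomputed F)
comparisons = {
    'medium & high anxiety, 2 x 2':       (45.0, 16),
    'low & medium anxiety, 2 x 2':        ( 0.0, 16),
    'low/medium vs. high anxiety, 2 x 2': (60.0, 26),
}
for label, (ms_int, df_error) in comparisons.items():
    F = ms_int / ms_w                          # 9.00, 0.00, 12.00
    print(label, F, f.sf(F, 1, df_error))      # p-value for F(1, df_error)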


Effect Sizes

Effect sizes can be computed using equations 1.16 and 1.17. One difference is that there are now up to four different effect size terms that can be calculated. The formulas can be used just as they are and all terms remain the same, with the exception that the numerator changes depending upon the effect one wishes to describe. So, for instance, if one wants to express the total effect, one uses SSBG; if one wants to express the effect of the interaction, one uses SSAxB, etc.

Some have argued for the use of a partial effect size calculation here. The rationale is that part of the variance in the denominator is not error, but has actually been assigned to another effect, such as the other factor or the interaction term. Hence, the usual η2 value understates what is known about the effect being described. A partial η2 value is calculated by using the SS for the effect plus SS error in the denominator.

Partial\ \eta^2 = \frac{SS_E}{SS_E + SS_W}    2.14

Where SSE is the Sum of Squares for the effect of interest; hence this yields a larger effect size estimate. For our example, if we wish to compute the effect for the interaction using an η2 approach and a partial η2 approach, we would observe the following: η2 = .167 and partial η2 = .333. Similarly, a partial ω2 can be computed with formula 2.15.

Partial\ \omega^2 = \frac{df_{effect}(MS_{effect} - MS_W)/N}{[df_{effect}(MS_{effect} - MS_W)/N] + MS_W}    2.15

Everything should be familiar here, the subscript effect refers to the effect for which a partial ω2 is being computed (i.e., it can be computed for a main effect or interaction). N represents the total number of observations in the study. For example, we could compute a partial ω2 for our interaction term.

Partial\ \omega^2 = \frac{2(30 - 5)/30}{[2(30 - 5)/30] + 5} = \frac{1.667}{6.667} = 0.250
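These effect sizes amount to a few lines of arithmetic; the following check (mine, not part of the handout) reproduces the η2, partial η2, and partial ω2 values for the interaction from the quantities in the source table.

ss_axb, ss_w, ss_tot = 60.0, 120.0, 360.0
ms_axb, ms_w = 30.0, 5.0
df_axb, N = 2, 30

eta_sq = ss_axb / ss_tot                       # .167
partial_eta_sq = ss_axb / (ss_axb + ss_w)      # .333  (eq. 2.14)

num = df_axb * (ms_axb - ms_w) / N             # numerator of eq. 2.15
partial_omega_sq = num / (num + ms_w)          # .250

print(round(eta_sq, 3), round(partial_eta_sq, 3), round(partial_omega_sq, 3))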

As expected, this value is lower than the partial η2 computed above.

Extensions

It should be noted that the Two-Way Between Groups ANOVA we have been discussing can be extended to an N-Way design. For instance, Three-Way ANOVAs are sometimes reported in the literature, and one could use a four- or five-way ANOVA as well. The problem is that the number of subjects required for a three-, four-, or five-way ANOVA can be very large. If one has a 3x3x4x4 design, that is, three levels of Factor A, 3 of B, 4 of C


and 4 of D, and one wishes to have 10 observations per cell, one needs first to realize that there are a total of 3x3x4x4 = 144 cells, so 10 per cell works out to 1,440 participants in the study. Consequently, you are more likely to see these designs in non-experimental contexts where large data sets may be easier to come by.

An example might be if one wished to test different effects on questionnaire administration. Say the first factor (A) had to do with the instructions given, where one group was told how results would be used and the other group was told nothing. The second factor (B) might have to do with wording effects, such that in one condition wording might be slanted so as to try to obtain positive or negative responses, and in the other condition the wording might be more neutral. Finally, the last factor (C) might have to do with the focus of the questionnaire: in one condition respondents are asked to reveal information about themselves, in another condition people are asked to answer questions about the same content, only referring to their roommate, and in the third condition the focus would be on a significant other. Such a design will yield several effects in total: first, there will be three main effects, A, B, & C. Second, there will be three possible 2-way interactions, A x B, A x C, and B x C. Finally, there will be a three-way interaction effect, A x B x C.

Briefly, the following example comes from Tabachnick and Fidell (2001). It is a silly example, but works for our purposes. It is a 3-way between subjects ANOVA, a 2 x 2 x 2 design. The dependent variable is the eagerness of a bull in the presence of a cow. And, I don't want to know how it was measured. Factor A is the level of deprivation of the bull, long or short. Factor B is the familiarity of the cow to the bull, familiar or unfamiliar. Finally, Factor C is the decoration of the stall where the meeting is to take place. The stall is either undecorated, or decorated with a floral pattern. The syntax for analyzing these data is presented below.

TITLE '3-Way ANOVA Example from T & F'.
DATA LIST FREE/DEPRIV FAMIL DECOR EAGER.
BEGIN DATA.
1 1 1 4   1 1 1 3   1 1 1 2   1 1 1 3   1 1 1 0
1 1 2 3   1 1 2 0   1 1 2 2   1 1 2 1   1 1 2 4
1 2 1 0   1 2 1 2   1 2 1 2   1 2 1 0   1 2 1 2
1 2 2 6   1 2 2 5   1 2 2 5   1 2 2 5   1 2 2 3
2 1 1 0   2 1 1 5   2 1 1 4   2 1 1 1   2 1 1 3
2 1 2 6   2 1 2 7   2 1 2 6   2 1 2 6   2 1 2 6
2 2 1 9   2 2 1 6   2 2 1 6   2 2 1 6   2 2 1 7
2 2 2 6   2 2 2 10  2 2 2 9


2 2 2 9   2 2 2 8
END DATA.
VARIABLE LABELS DEPRIV 'BULL DEPRIVATION' FAMIL 'FAMILIARITY' DECOR 'STALL DECORATION'.
VALUE LABELS DEPRIV 1 'SHORT' 2 'LONG'
  /FAMIL 1 'FAMILIAR' 2 'UNFAMILIAR'
  /DECOR 1 'UNDECORATED' 2 'FLORAL'.
UNIANOVA
  EAGER BY DEPRIV FAMIL DECOR
  /METHOD = SSTYPE(3)
  /INTERCEPT = INCLUDE
  /PLOT = PROFILE( DEPRIV*FAMIL*DECOR )
  /CRITERIA = ALPHA(.05)
  /DESIGN = DEPRIV FAMIL DECOR DEPRIV*FAMIL DEPRIV*DECOR FAMIL*DECOR DEPRIV*FAMIL*DECOR .

Tests of Between-Subjects Effects
Dependent Variable: EAGER

Source                      Type III Sum of Squares    df    Mean Square       F       Sig.
Corrected Model                   241.600(a)            7       34.514      17.587     .000
Intercept                         739.600               1      739.600     376.866     .000
DEPRIV                            115.600               1      115.600      58.904     .000
FAMIL                              40.000               1       40.000      20.382     .000
DECOR                              44.100               1       44.100      22.471     .000
DEPRIV * FAMIL                     14.400               1       14.400       7.338     .011
DEPRIV * DECOR                      2.500               1        2.500       1.274     .267
FAMIL * DECOR                       2.500               1        2.500       1.274     .267
DEPRIV * FAMIL * DECOR             22.500               1       22.500      11.465     .002
Error                              62.800              32        1.963
Total                            1044.000              40
Corrected Total                   304.400              39

a R Squared = .794 (Adjusted R Squared = .749)
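As a cross-check on the table (a sketch of my own, not Tabachnick and Fidell's analysis), the usual balanced-design formulas reproduce each sum of squares directly from the raw data; the array indexing below is an assumption of mine.

import numpy as np

# Y[deprivation, familiarity, decoration, replicate]; values transcribed from the data above.
Y = np.zeros((2, 2, 2, 5))
Y[0, 0, 0] = [4, 3, 2, 3, 0];  Y[0, 0, 1] = [3, 0, 2, 1, 4]
Y[0, 1, 0] = [0, 2, 2, 0, 2];  Y[0, 1, 1] = [6, 5, 5, 5, 3]
Y[1, 0, 0] = [0, 5, 4, 1, 3];  Y[1, 0, 1] = [6, 7, 6, 6, 6]
Y[1, 1, 0] = [9, 6, 6, 6, 7];  Y[1, 1, 1] = [6, 10, 9, 9, 8]

n = 5
grand = Y.mean()
cell = Y.mean(axis=3)
mA = Y.mean(axis=(1, 2, 3)); mB = Y.mean(axis=(0, 2, 3)); mC = Y.mean(axis=(0, 1, 3))
mAB = Y.mean(axis=(2, 3));   mAC = Y.mean(axis=(1, 3));   mBC = Y.mean(axis=(0, 3))

ss_A = 2 * 2 * n * ((mA - grand)**2).sum()                               # 115.6
ss_B = 2 * 2 * n * ((mB - grand)**2).sum()                               #  40.0
ss_C = 2 * 2 * n * ((mC - grand)**2).sum()                               #  44.1
ss_AB = 2 * n * ((mAB - mA[:, None] - mB[None, :] + grand)**2).sum()     #  14.4
ss_AC = 2 * n * ((mAC - mA[:, None] - mC[None, :] + grand)**2).sum()     #   2.5
ss_BC = 2 * n * ((mBC - mB[:, None] - mC[None, :] + grand)**2).sum()     #   2.5
ss_ABC = n * ((cell - mAB[:, :, None] - mAC[:, None, :] - mBC[None, :, :]
               + mA[:, None, None] + mB[None, :, None] + mC[None, None, :]
               - grand)**2).sum()                                        #  22.5
ss_error = ((Y - cell[..., None])**2).sum()                              #  62.8

print(ss_A, ss_B, ss_C, ss_AB, ss_AC, ss_BC, ss_ABC, ss_error)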

All the main effects are significant, as well as one two-way interaction and the three-way interaction. Two plots are required to comprehend the nature of the three-way interaction, and they are presented below.


[Figure: Estimated Marginal Means of EAGER plotted against BULL DEPRIVATION (Short, Long), with separate lines for FAMILIARITY (Familiar, Unfamiliar). Left panel: at STALL DECORATION = UNDECORATED; right panel: at STALL DECORATION = FLORAL.]

To interpret: the effects of familiarity and deprivation vary depending upon how the stall is decorated. In an undecorated stall, deprivation has no effect on the eagerness of the bull if the cow is familiar, but if the cow is unfamiliar, the bull is more eager if the deprivation time is longer. However, this changes if the stall is decorated with a floral pattern. Here, deprivation has a similar effect on eagerness for both familiar and unfamiliar cows: bulls who have been deprived longer are more eager for both familiar and unfamiliar cows when they are in a floral stall. When they are in a plain stall, bulls who have been deprived longer are only more eager in the presence of an unfamiliar cow.


3. Repeated Measures 1-Way ANOVA

A repeated measures design can be used in experimental work for a couple of different reasons. For instance, some types of experimentation make it necessary to use such designs, such as when we desire to test people under varying circumstances. Another reason is that one may wish to use a repeated measures design in order to ease the burden of requiring the numerous participants needed for a between subjects design.

Repeated measures designs offer some distinct advantages with regard to reducing error variance in an experimental design. For instance, a two-way ANOVA is sometimes used where a 2nd variable is included as a blocking variable (Stevens, 2002). For instance, a program evaluation of outcomes might include results from three different locations. Consequently, if there are expected differences between the centers, the researcher may choose to enter a Center ID as an IV, not because it is necessarily of interest, but because it serves to improve chances of observing some effect. In a RM ANOVA, this concept is taken to its extreme in that each subject serves as their own between subjects condition; thus the design is blocked on each subject. This allows for between subjects variability to be estimated and (generally) discarded.

Partitioning Variance

In order to decompose the variance associated with a 1-way within subjects design we need to understand the effects that the design yields and be able to identify the appropriate error term. First, this type of design is used to test for significant differences across the treatment conditions. For instance, let's say we wish to test the effects of certain drugs (say, antihistamines) on reaction time. Perhaps the rationale is that this is part of an effort to identify whether certain types of antihistamines should be banned from use within certain occupations. In such a design, one is primarily interested in whether reaction times vary consistently between the different antihistamines used. Data are typically laid out so that each row represents a person and each column represents a measurement on that person.

First, we are interested in partitioning out the variance associated with treatment conditions. As in previous analyses, this involves finding the variance of the treatment means around the grand mean. The definitional and computational formulas follow, below.

SS_{TREAT} = n \sum (\overline{T}_j - \overline{G})^2    3.1

Where T̄j denotes the mean for treatment condition j. In our example this would be the column mean for column j. The degrees of freedom for this term are k − 1, where k is the number of treatment levels. The computational formula is as follows.

SS_{TREAT} = \frac{\sum T_j^2}{n} - \frac{G^2}{kn}    3.2


Where n is the number of observations per treatment (i.e., the number of participants) and k is the number of treatment levels. Note that instead of using the symbol M to designate treatment or method (e.g., SSM), I am now using the word TREAT to indicate that it is a slightly different term, since it is a within subjects design.

Next, there are two variance terms associated with people: the variance between people and the variance within people. The variance between people is the variance of people's mean scores over treatment conditions, around the grand mean. Converting this to a definitional formula yields…

SS_{B.PEOPLE} = k \sum (\overline{P}_i - \overline{G})^2    3.3

Where P̄i is the mean score for person i over all treatment conditions. Computationally, this becomes

SS_{B.PEOPLE} = \frac{\sum P_i^2}{k} - \frac{G^2}{kn}    3.4

with degrees of freedom equal to n − 1. The sum of squares within person is the sum of the squared deviations for each person around their own mean, P̄i. The definitional formula for this term is as follows.

SS_{W.PEOPLE} = \sum_i \sum_j (X_{ij} - \overline{P}_i)^2    3.5

The double summation indicates that one needs to sum across all conditions for each person, then across people. The df for this term is n(k-1). The computational formula is as follows.

SS_{W.PEOPLE} = \sum X_{ij}^2 - \frac{\sum P_i^2}{k}    3.6

In this equation, ΣX²ij is the sum of each squared score in the data matrix, and ΣP²i is the sum of each person's squared total score. Finally, the error term for this particular model is called the residual. It is the variation left over after the Treatment effects and Individual Differences have been removed. In other words, we know people within an experiment will naturally vary around whatever our dependent variable is. Since each person participates in all conditions, we can remove this source of error (which, as stated before, is a great advantage of repeated measures designs). Whatever is left over is our best estimate of experimental error. A definitional formula for this term is…


SS_{RES} = \sum_i \sum_j [(X_{ij} - \overline{G}) - (\overline{P}_i - \overline{G}) - (\overline{T}_j - \overline{G})]^2    3.7

In words, one takes the distance of a particular score from the grand mean, then subtracts out the distance due to how much that person varies, on average, around the grand mean (individual differences), then subtracts out how much of that distance is attributable to the treatment effect. The remaining deviation is squared and summed for each score across treatment conditions and again over people. It is a measure of the interaction between subjects and conditions. Computationally, this is best accomplished as

SS_{RES} = SS_{W.PEOPLE} - SS_{TREAT}    3.8

The degrees of freedom for this particular term are (n − 1)(k − 1).

The Structural Model

There are two representations of the structural model for this design. The first assumes no treatment by person interaction and can be characterized as follows.

X_{ij} = \mu + \pi_i + \tau_j + \varepsilon_{ij}    3.9

The second does assume an interaction between person and treatment.

X_{ij} = \mu + \pi_i + \tau_j + \pi\tau_{ij} + \varepsilon_{ij}    3.10

Equation 3.9 assumes that a person's observed score is a function of a population parameter (mean) µ, their specific ability π, the treatment level to which they were exposed τ, and random error ε. Equation 3.10 preserves the possibility of an interaction between individuals and the treatment condition. In other words, if our example is antihistamines, some people might be only mildly tired after taking benadryl, while others become very groggy after taking it. It happens that the only practical difference (according to Howell, 2002) has to do with the test for between subjects effects, which we rarely care about. If the test for between subjects differences is significant, then people's performances were significantly different. If the test was not significant, then it could be that people were significantly different but the degree to which they differed did not surpass the interaction variation, or that they are not significantly different. Since we rarely care about this test, this difficulty rarely becomes a concern.

Numerical Example

Sticking with our antihistamine example, here are some data from Winer et al. (1991, page 228).

Person    Drug 1   Drug 2   Drug 3   Drug 4      Pi      Mean
1            30       28       16       34       108     27.0
2            14       18       10       22        64     16.0
3            24       20       18       30        92     23.0
4            38       34       20       44       136     34.0
5            26       28       14       30        98     24.5
Tj          132      128       78      160     G = 498
Mean       26.4     25.6     15.6     32.0     GM = 24.9

First, let us calculate some intermediate terms that will come in handy later.

\sum X_{ij}^2 = 30^2 + 28^2 + 16^2 + 34^2 + 14^2 + \dots + 30^2 = 13,892

\frac{G^2}{kn} = \frac{498^2}{4(5)} = 12,400.20

\frac{\sum T_j^2}{n} = \frac{132^2 + 128^2 + 78^2 + 160^2}{5} = 13,098.40

\frac{\sum P_i^2}{k} = \frac{108^2 + 64^2 + 92^2 + 136^2 + 98^2}{4} = 13,081.00

Now, we can quickly generate our Sums of Squares terms.

From equation 3.4: SSB.PEOPLE = 13,081.00 − 12,400.20 = 680.80
From equation 3.6: SSW.PEOPLE = 13,892.00 − 13,081.00 = 811.00
From equation 3.2: SSTREAT = 13,098.40 − 12,400.20 = 698.20
From equation 3.8: SSRES = 811.00 − 698.20 = 112.80

For those keeping track, the total sum of squares is…

SSTOT = 13,892.00 − 12,400.20 = 1,491.80

We can use the discussion above to get our degrees of freedom for each term and construct the following source table.

Source          SS          df      MS        F         p
SSB.PEOPLE      680.80       4     170.20
SSW.PEOPLE      811.00      15      54.07
SSTREAT         698.20       3     232.73    24.759    0.000020
SSRES           112.80      12       9.40


SSTOT         1,491.80      19      78.52

Therefore, there is a significant effect for antihistamines on reaction time.

Differences Among Means

Contrasts

The contrasts discussed under the One-Way Completely Randomized Design ANOVA can also be used in the within subjects design. One needs simply to use SSRES as the error term and also use the df associated with that term. Apparently there is some disagreement about the appropriate error term to use in post-hoc tests with repeated measures. However, it seems common for researchers to use the Scheffé post-hoc test, as well as Tukey's HSD. When using Scheffé's test, it is necessary to use the SSRES and dfRES when forming the F-Ratio. The Tukey procedure is as follows.

|\bar{y}_i - \bar{y}_j| > q_{.05;\, k,\, (n-1)(k-1)} \sqrt{\frac{MS_{RES}}{n}}    3.9

Here q is derived from a table, readily available in most introductory statistics texts. This value, based on k (the number of treatments) and (n − 1)(k − 1) (the error df), is multiplied by the square root of the ratio of MSRES to n, where n is the number of participants. For our example, n is 5 and MSRES is 9.40. The studentized range statistic (q) for α = .05 is 4.199. Thus,

Tukey\ HSD = 4.199 \sqrt{\frac{9.40}{5}} = 5.757
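A short Python sketch (mine; it assumes SciPy 1.7 or later for the studentized range distribution) reproduces the source table and the Tukey HSD criterion from the raw drug data above.

import numpy as np
from scipy.stats import f, studentized_range

# Rows = persons, columns = drugs (data from the table above).
X = np.array([[30, 28, 16, 34],
              [14, 18, 10, 22],
              [24, 20, 18, 30],
              [38, 34, 20, 44],
              [26, 28, 14, 30]], dtype=float)
n, k = X.shape
G = X.sum()

ss_treat    = (X.sum(axis=0)**2).sum() / n - G**2 / (k * n)   # 698.20  (eq. 3.2)
ss_b_people = (X.sum(axis=1)**2).sum() / k - G**2 / (k * n)   # 680.80  (eq. 3.4)
ss_w_people = (X**2).sum() - (X.sum(axis=1)**2).sum() / k     # 811.00  (eq. 3.6)
ss_res      = ss_w_people - ss_treat                          # 112.80  (eq. 3.8)

ms_treat = ss_treat / (k - 1)
ms_res = ss_res / ((n - 1) * (k - 1))
F = ms_treat / ms_res                                         # 24.76
print(F, f.sf(F, k - 1, (n - 1) * (k - 1)))                   # p = .00002

q = studentized_range.ppf(0.95, k, (n - 1) * (k - 1))         # ~4.199
print(q * np.sqrt(ms_res / n))                                # HSD ~ 5.757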

Any two treatment means that differ by more than 5.757 are deemed to be significantly different.

Trend Analysis

When we discussed the one-way fixed effects ANOVA model, the notion of polynomial contrasts was presented briefly. Such contrasts are also known as trend analysis in experimental methods. Some more detail is provided here since trend analysis is probably more commonly associated with repeated measures designs, though this is not a requirement. Formally, the requirement is that the ordering of measures (or groups) must be quantitative, such that group 2 is associated with a greater or lesser magnitude in some way than group 1. It also is important that one use more than 2 groups; otherwise the linear contrast will fit perfectly, since one is trying to fit a function to means and there are only two. Another situation is that the categories or levels of the factor could be true repeated measures over time. As such, the increase or decrease over time could reflect a change in behavior, such as a worsening of a psychological condition, an increase in agility associated with training, etc. There may be other situations that I am not thinking of, but the basic rule is that if the order of categories is nominal, or doesn't represent some underlying continuum, then trend analysis wouldn't make sense. Finally, one may


have to find a way to take into account levels of the independent variable if they are not evenly spaced. For instance, if one is conducting follow-up research and gathers data upon release from treatment, at 30 days, again at 90 days, 1 year, 2 years and 5 years, the data collection points are not evenly spaced. I won't cover this problem here, but it is covered in texts referenced in this paper.

In SPSS, one can choose "contrasts" and request that SPSS try to fit several different lines to the repeated measures means, such as a linear relationship, quadratic, etc. An example where this might be used is an experiment on forgetting: does forgetting follow a linear trend or a quadratic trend? Typically, the researcher is warned not to use such procedures without some a priori theory, as the nominal alpha will increase substantially from the repeated tests. These tests are carried out as contrasts, where appropriate contrast coefficients are tabled and can be obtained in more extensive treatments of ANOVA, such as Winer et al. (1991). Trend analysis answers the general question of whether some order of trend fits the data. Order here is 1 (linear), 2 (quadratic), 3 (cubic), and so on. One must exercise caution, as for k groups a trend of order k − 1 will fit the data perfectly. In trend analysis, the question shifts away from whether there is significant variance among means to what the trend is among levels of the IV.

Some contrived data are presented below. These data represent testing from a memory task over a period of six weeks. Certain paired associations are learned completely, i.e., participants practice until they get a perfect score of 50 out of 50. The subjects are then brought in the next day for the first of six trials. Each trial thereafter is separated by one week, and the numbers represent the number of correct responses. There are a total of six participants in this study. Prior to collecting the data, we might assume that a quadratic trend would likely fit best, because forgetting tends to follow a concave curve, whereby a great deal of forgetting occurs quickly, with people being able to recall a fraction of what was remembered for longer periods. In other words, the trend is for a quick decay followed by a flattening out of forgetting.

Subject #   Time 1   Time 2   Time 3   Time 4   Time 5   Time 6
1            43.00    31.00    10.00     9.00     4.00     4.00
2            40.00    30.00     9.00     6.00     4.00     5.00
3            49.00    33.00    12.00     5.00     6.00     3.00
4            39.00    26.00    11.00     8.00     5.00     5.00
5            41.00    28.00     8.00     6.00     5.00     5.00
6            44.00    34.00     9.00     7.00     7.00     6.00
Mean         42.67    30.33     9.83     6.83     5.17     4.67

Conducting polynomial contrasts boils down to finding the appropriate set of contrast coefficients and constructing the sum of squares associated with the particular contrast. For our design, with k = 6 repeated measures, the linear and quadratic contrast coefficients are:

linear      -5  -3  -1   1   3   5


quadratic    5  -1  -4  -4  -1   5

The coefficients are orthogonal and sum to zero, just as in the case of the orthogonal contrasts in the first section of this paper. Exactly how the contrasts are derived is not evident, which is why it is best to either leave it to your statistical software program⁶ or look them up in a table. These came from Winer et al. (1991), where cubic and quartic contrasts are also provided but are not of interest to us in this example. Let's begin by computing Llinear, which is the weighted sum of the means, using the linear contrast coefficients.

Llinear = -5(42.67) - 3(30.33) - 1(9.83) + 1(6.83) + 3(5.17) + 5(4.67) = -268.48

The formula for the SS for this contrast is presented below.

SS_{linear} = \frac{n L_{linear}^2}{\sum C_j^2}    3.10

SS_{linear} = \frac{6(-268.48)^2}{70} = \frac{432,489.06}{70} = 6,178.42

The denominator is the sum of the squared linear coefficients. All contrasts have one degree of freedom, so SSlinear = MSlinear. Thus, F can be computed as 6178.42 / 4.223 = 1463.04 (I got the error term from the ANOVA table below, which was computed in SPSS). Furthermore, SSlinear can be subtracted from the SS for the effect (see ANOVA table below) to determine whether there is enough residual variance to warrant testing higher-order contrasts. The difference could be divided by the error term from the ANOVA to see if it exceeds the critical value based on the remaining degrees of freedom.

Now, to run this in SPSS, we have to enter our data such that all repeated measures are on the same row. We could call them time1, time2, etc. At this point, follow the steps below.

1. From the main menu, pick Analyze, General Linear Model, Repeated Measures…
2. A dialogue window will come up; one field is labeled "Within-Subject Factor Name". It should already be called factor1; you can change it to any eight-letter name.
3. Enter the number of levels, six in our example.
4. Press the "Add" button.
5. Press the "Define" button.
6. You will see your variables on the left (e.g., time1, time2, etc.) and on the right will be a window containing fields such as __(?)__(1) through __(?)__(6).
7. Highlight your six measures on the left, and push the arrow to move them over to the right.
8. The fields such as __(?)__(1) should now be replaced with fields such as "time1(1)".
9. You can now pick various options, such as contrasts, options, etc., and hit OK when you are ready to run your model.

6 Statistical software will not necessarily use the values used here. Generally they are fractional values, but yield the same result.

Submitting these data to SPSS yields the following analysis of variance (the assumption of sphericity holds: Mauchly's W = .004, approximate χ2(14) = 16.948, p > .05).

Tests of Within-Subjects Effects
Measure: MEASURE_1

Source            Type III Sum of Squares    df    Mean Square       F       Sig.
factor1                  7694.250             5      1538.850     364.369    .000
Error(factor1)            105.583            25         4.223

Requesting polynomial contrasts provides the following table.

Tests of Within-Subjects Contrasts
Measure: MEASURE_1

Source           factor1       Type III Sum of Squares    df    Mean Square       F       Sig.
factor1          Linear               6179.336              1      6179.336     532.505    .000
                 Quadratic            1292.161              1      1292.161     502.740    .000
                 Cubic                    .112               1          .112        .125    .738
                 Order 4               143.006              1       143.006      39.040    .002
                 Order 5                79.636              1        79.636      33.393    .002
Error(factor1)   Linear                 58.021              5        11.604
                 Quadratic              12.851              5         2.570
                 Cubic                   4.471              5          .894
                 Order 4                18.315              5         3.663
                 Order 5                11.924              5         2.385
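The linear and quadratic entries in this table can be reproduced with a few lines of Python (my own check, not part of the handout); working from the raw scores rather than the rounded means gives 6179.34 for the linear SS, which is why the hand calculation above (6178.42) differs slightly from the SPSS value.

import numpy as np

scores = np.array([[43, 31, 10, 9, 4, 4],
                   [40, 30,  9, 6, 4, 5],
                   [49, 33, 12, 5, 6, 3],
                   [39, 26, 11, 8, 5, 5],
                   [41, 28,  8, 6, 5, 5],
                   [44, 34,  9, 7, 7, 6]], dtype=float)   # 6 subjects x 6 time points
n = scores.shape[0]
means = scores.mean(axis=0)

coeffs = {'linear':    np.array([-5, -3, -1, 1, 3, 5], dtype=float),
          'quadratic': np.array([ 5, -1, -4, -4, -1, 5], dtype=float)}

for name, c in coeffs.items():
    L = (c * means).sum()                   # weighted sum of the time means
    ss = n * L**2 / (c**2).sum()            # eq. 3.10
    print(name, round(L, 2), round(ss, 3))  # linear: 6179.336, quadratic: 1292.161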

Notice that SPSS provides all possible polynomial contrasts, i.e., powers of 1 (linear) up through k − 1 = 5. Notice also that the sums of squares for the individual contrasts add up to the total sum of squares for the repeated measures effect in the ANOVA summary table; that is, the sum of the linear, quadratic, cubic, order 4 and order 5 sums of squares equals 7694.250. Notice also that all the trends are significant except for the cubic trend. As was cautioned above, one should have some a priori notion of what is being tested in order to avoid an inflated Type 1 error and overfitting the data. Hence, we might choose to pay the most attention to the first two, since we hypothesized a quadratic trend to begin with, but might not want to immediately rule out the simpler linear trend. In order to gain a visual representation, the two trends are graphed below.


[Figure: Repeated Measures Trend Analysis - number of correct recalls plotted over Time 1 through Time 6, with a fitted linear trend (y = -7.6714x + 43.433, R² = 0.8031).]

[Figure: Repeated Measures Trend Analysis - the same data with a fitted quadratic trend (y = 2.4018x² - 24.484x + 65.85, R² = 0.971).]

Clearly, the quadratic trend does follow the curve for forgetting better than the linear trend, though one can see that the linear trend does capture a great deal of the essence of forgetting. In all, it seems worthwhile to use a quadratic equation to represent the trend.

Repeated Measures Factorial Design

As with between subjects designs, it is possible to have factorial within-subjects designs. They are not terribly common in my experience, but I also think they could be used more often than they are. A brief illustration of such a design is offered below, in order to provide some sense of what can be done using repeated measures. Consider an experiment where a researcher wants to assign people to only read certain types of books during a three month period, then a different genre of books during another three month


period. So, let's say that for the first three months participants can only read science fiction books; during the second three months, participants can only read mystery novels.⁷ There are two repeated measures factors in this experiment: literature genre, and month (month1 through month3). The dependent variable will be the number of books read during each month. It is therefore possible to get an effect for month, an effect for genre, or an interaction between the two. For example, interest in reading may trail off or pick up over the three months. Similarly, participants may have more of a fondness for mysteries than for science fiction. Finally, one may find that interest in mystery is fairly sustained over the three months, but there is an interaction which implies that interest in science fiction changes over the three months. The data for such an experiment are presented below.

Number of books read in each of three months

                 Genre = Science Fiction          Genre = Mystery
Subject #       Month1   Month2   Month3       Month1   Month2   Month3
1                  1        3        6            3        1        0
2                  1        4        8            4        4        2
3                  3        3        6            5        3        2
4                  5        5        7            4        2        0
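Before turning to SPSS, a two-line check (mine) of the cell means makes the pattern in these data visible; these are the values that appear in the profile plot further below.

import numpy as np

sci_fi  = np.array([[1, 3, 6], [1, 4, 8], [3, 3, 6], [5, 5, 7]], dtype=float)  # months 1-3
mystery = np.array([[3, 1, 0], [4, 4, 2], [5, 3, 2], [4, 2, 0]], dtype=float)  # months 1-3

print(sci_fi.mean(axis=0))    # [2.5, 3.75, 6.75] - science fiction reading rises
print(mystery.mean(axis=0))   # [4.0, 2.5, 1.0]   - mystery reading falls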

Submitting this problem to SPSS is somewhat tricky. Basically, one must use a similar approach to the one-way repeated measures analysis, but declare two within-subjects factors, one with two levels (genre) and one with three levels (month). The output from SPSS for these data is pasted below.

7 Clearly one would want to counterbalance this experiment so half the participants read science fiction books for the first three months and the other half read mystery books for the first three months.


Tests of Within-Subjects Effects
Measure: MEASURE_1

Source               Correction            Type III SS     df      Mean Square      F       Sig.
genre                Sphericity Assumed       20.167       1         20.167       7.408     .072
                     Greenhouse-Geisser       20.167       1.000     20.167       7.408     .072
                     Huynh-Feldt              20.167       1.000     20.167       7.408     .072
                     Lower-bound              20.167       1.000     20.167       7.408     .072
Error(genre)         Sphericity Assumed        8.167       3          2.722
                     Greenhouse-Geisser        8.167       3.000      2.722
                     Huynh-Feldt               8.167       3.000      2.722
                     Lower-bound               8.167       3.000      2.722
month                Sphericity Assumed        2.583       2          1.292       1.000     .422
                     Greenhouse-Geisser        2.583       1.118      2.311       1.000     .396
                     Huynh-Feldt               2.583       1.313      1.967       1.000     .404
                     Lower-bound               2.583       1.000      2.583       1.000     .391
Error(month)         Sphericity Assumed        7.750       6          1.292
                     Greenhouse-Geisser        7.750       3.354      2.311
                     Huynh-Feldt               7.750       3.939      1.967
                     Lower-bound               7.750       3.000      2.583
genre * month        Sphericity Assumed       53.583       2         26.792      77.160     .000
                     Greenhouse-Geisser       53.583       1.443     37.123      77.160     .001
                     Huynh-Feldt              53.583       2.000     26.792      77.160     .000
                     Lower-bound              53.583       1.000     53.583      77.160     .003
Error(genre*month)   Sphericity Assumed        2.083       6           .347
                     Greenhouse-Geisser        2.083       4.330       .481
                     Huynh-Feldt               2.083       6.000       .347
                     Lower-bound               2.083       3.000       .694

Although it is not shown here, the assumption of sphericity does hold for these data. The table above indicates that there is not a significant main effect for genre or month (genre is close, p = .072), but there is a significant interaction. The graph below reveals the nature of the interaction. Basically, over time people's interest in science fiction reading seems to have increased while their interest in mystery reading decreased.


[Figure: Estimated Marginal Means of MEASURE_1. Estimated marginal means of books read, plotted by month (1 to 3), with a separate line for each genre.]
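
Because the plot itself does not survive in this text version, the following minimal matplotlib sketch (assuming matplotlib is installed) redraws the same estimated marginal means, using the cell means computed from the four-subject table above.

import matplotlib.pyplot as plt

months = [1, 2, 3]
# Cell means (books read), averaged over the four subjects in the data table
scifi_means   = [2.50, 3.75, 6.75]
mystery_means = [4.00, 2.50, 1.00]

plt.plot(months, scifi_means, marker="o", label="Science Fiction")
plt.plot(months, mystery_means, marker="s", label="Mystery")
plt.xticks(months)
plt.xlabel("month")
plt.ylabel("Estimated Marginal Means (books read)")
plt.title("Estimated Marginal Means of MEASURE_1")
plt.legend(title="genre")
plt.show()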

Effect Sizes and Assumptions of the Model

Effect Sizes

As with the other designs, effect sizes can be calculated. The main effect size of interest here, going back to our drug example, would be for Treatment (i.e., Drug). Hence, using the formula for omega-squared…

ω² = [698.20 − (4 − 1)(9.40)] / (1,491.80 + 9.40) = (698.20 − 28.20) / 1,501.20 = .446

However, it is often customary to compute a partial effect size for repeated measures designs. The reason for this is that we may be uninterested in any between subjects effects. Put another way, we know that subjects differ, so are we really interested in that effect and in including variation from that effect as variation that cannot be explained? Hence, it might be more appropriate to compute the partial omega-squared effect size.

ω²_partial = [3(232.73 − 9.40) / 20] / {[3(232.73 − 9.40) / 20] + 9.40} = 33.4995 / 42.8995 = .781

Note that N from equation 2.15 has been replaced by the number of measurements, that is, the number of subjects times the number of repeated measures (5 participants × 4 drugs = 20).
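
The same arithmetic is easy to check with a few lines of Python. This is only a sketch of the two calculations above, plugging in the values from the earlier drug example (SS_treatment = 698.20, MS_treatment = 232.73, SS_total = 1,491.80, MS_error = 9.40, k = 4 drug conditions, n = 5 participants); the function names are illustrative, not part of any library.

def omega_squared(ss_effect, ss_total, ms_error, df_effect):
    # Omega-squared with subject variation left in the denominator
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

def partial_omega_squared(ms_effect, ms_error, df_effect, n_measurements):
    # Partial omega-squared: individual-difference variation is excluded
    num = df_effect * (ms_effect - ms_error) / n_measurements
    return num / (num + ms_error)

k, n = 4, 5                                                  # 4 drugs, 5 participants
print(omega_squared(698.20, 1491.80, 9.40, k - 1))           # approximately .446
print(partial_omega_squared(232.73, 9.40, k - 1, k * n))     # approximately .781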


Assumptions

Clearly, a first assumption is that the structural model for this design holds. An additional, and important, assumption for this model is sphericity: the assumption that the variance of the difference scores for each pair of conditions is the same in the population. As it turns out, repeated measures ANOVA is not particularly robust to violations of this assumption, yielding higher levels of Type I error than would be expected based on the tabled values of F. Upon encountering a violation of the assumption of sphericity, the researcher has a few possible alternatives. First, the test for sphericity should probably be taken with a grain of salt, as it has been shown to be sensitive to departures from multivariate normality (MVN); it may therefore reject the null hypothesis because of a lack of MVN even when that lack of MVN is not a particularly serious problem. Second, the degree of violation of sphericity is captured by a parameter known as ε (epsilon), which cannot be calculated directly, only estimated. If the condition of sphericity is met, ε equals 1.0; its minimum value is 1/(k − 1). There are two common estimates of ε, the Greenhouse-Geisser estimate and the Huynh-Feldt estimate, and values under .70 typically represent a more serious violation. Fortunately, there are a couple of corrections for this violation. They are as follows:

• Change the df from (k − 1), (k − 1)(n − 1) to 1, (n − 1). This is an extreme (lower-bound) correction, which will decrease power considerably.

• Change the df from (k − 1), (k − 1)(n − 1) to ε̂(k − 1), ε̂(k − 1)(n − 1), where ε̂ is the Greenhouse-Geisser estimate of epsilon. This is less conservative than the lower-bound correction, but can still be overly conservative when ε̂ > .70.

• Take the average of the Greenhouse-Geisser ε̂ and the Huynh-Feldt ε̂; the latter tends to overestimate the true epsilon and the former tends to underestimate it, especially when the true epsilon is > .70. If one chooses this option, a new critical value of F has to be derived, either with a spreadsheet function or by looking up a value in an F table, which requires rounding the df to the nearest whole number. The Excel function is FDIST; entering "=FDIST(5.32,1,7.982)" returns the probability value associated with an obtained F of 5.32 at 1 and 7.982 degrees of freedom.

Note that all of these solutions involve adjusting the degrees of freedom downward, which makes the critical value of F higher and keeps α closer to its nominal level. Also note that if one has violated the assumption of sphericity, yet all estimates of epsilon yield significant results, the exact correction is a moot point. Otherwise, it is probably most reasonable to rely on the third option as a default position; a short computational sketch of that adjustment follows.
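
For readers working outside Excel, the same right-tail probability can be obtained in Python. This is a minimal sketch assuming scipy is installed; the F of 5.32 at 1 and 7.982 df is the FDIST example above, while the nominal df of 2 and 16 and the averaged ε̂ of .75 in the second call are purely hypothetical values used for illustration.

from scipy.stats import f

# Equivalent of the Excel call =FDIST(5.32,1,7.982): right-tail probability of F
print(f.sf(5.32, 1, 7.982))

def epsilon_adjusted_p(f_obtained, df1, df2, eps_hat):
    # p-value for an obtained F after scaling both df by an epsilon estimate
    return f.sf(f_obtained, df1 * eps_hat, df2 * eps_hat)

# Hypothetical example: nominal df of (k - 1) = 2 and (k - 1)(n - 1) = 16
print(epsilon_adjusted_p(5.32, 2, 16, 0.75))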


4. Groups by Trials Repeated Measures Designs

As you can guess, the designs discussed thus far can be combined so that one has an experiment where one or more factors are between-subjects factors and one or more are within-subjects factors. A simple case will be discussed here, where a researcher has one between-subjects factor and one repeated measures factor. You might use this type of design when you have several trials you wish subjects to complete, but also have several conditions, such as an experimental vs. a control condition. In this design, people serve in only one between-subjects condition (i.e., they are randomly assigned to experimental or control), but they participate in all trials. These designs are sometimes referred to as mixed designs; however, it is probably better to be specific when naming them, and "groups by trials" is one way of being more specific. These designs are also sometimes called Split-Plot Factorials, Mixed Randomized-Repeated Designs, or Between and Within designs. The problem with calling these mixed designs is that some authors use that term to denote designs that include both fixed and random factors, which is an entirely different matter.

Partitioning Variance

The partitioning of Sums of Squares can become quite complex in these higher-order designs. We will keep our example rather simple so as not to get lost in the menagerie of symbols. Basically, variance can be partitioned into the following components:

• Between Groups Variance – the variance associated with the between-groups independent variable.

• Subjects within Groups Variance – the variance associated with people's overall scores within their particular condition, collapsed across trials. This is the variation among scores once the variation due to treatment condition has been removed; it serves as the error term for the between-groups effect.

• Repeated Measures Variance – the variance associated with the repeated measures factor (i.e., different trials).

• Group by Trials Variance – This is the variance associated with the interaction between experimental group (treatment condition) and trial.

• Subjects within Groups within Trials Variance (residual) – this is the variation of trial observations with the effect of individual differences, treatment and trials removed. This becomes the error term for both Trials and the interaction between Groups and Trials.

All of these sources of variance add up to the total variance. Beginning with the easiest, the definitional and computational formulas will be presented for each variance partition.
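
Written out in the notation used below, the additivity of this partition, and of its degrees of freedom, is:

SS_TOT = SS_A + SS_s(g) + SS_B + SS_AxB + SS_RES

jkn − 1 = (j − 1) + j(n − 1) + (k − 1) + (j − 1)(k − 1) + (k − 1)j(n − 1)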

SS_A = nk Σ( Ā_j − Ḡ )²   4.1

where A is the symbol for Factor A, which is our between-subjects (Group) factor; the subscript j denotes the condition, so that Ā_j is the mean for the jth condition of Factor A. The degrees of freedom for this term is j − 1, where j is the number of levels of A. Computationally, this becomes:

SS_A = ΣA_j² / (nk) − G² / (njk)   4.2


where n is the number of participants within a condition, k is the number of levels of the repeated measures variable, j is the number of levels of A, and A_j is now the sum of the scores in level j of A. G, as with previous examples, is the sum of all scores. Subjects within groups variance is defined as…

SS_subj. w. groups = k Σ( P̄_i − Ā_j )²   4.3

where P̄_i represents person i's average score over all trials and k represents the number of trials. So, it is a person's average score over trials minus the group mean for their particular level of A, squared and then multiplied by the number of trials that went into computing their mean. The degrees of freedom for this term is j(n − 1), which is the number of levels of A times one less than the number of observations per cell. Computationally, this becomes:

SS_subj. w. groups = ΣP_i² / k − ΣA_j² / (nk)   4.4

where P_i² is person i's total score over all trials, squared. The variation associated with the repeated measures factor, trials, is defined as:

SS_B = nj Σ( B̄_k − Ḡ )²   4.5

In words, the grand mean is subtracted from the mean for each trial (there are k trials), collapsing across levels of the between-subjects factor; that difference is squared and then multiplied by nj, which is the number of observations that went into the mean B̄_k. Computationally, this becomes:

SS_B = ΣB_k² / (nj) − G² / (njk)   4.6

The only new term here is B_k, the sum of the scores for level k of this factor (B is now used to denote the second factor, which in this case is our repeated measures factor). The degrees of freedom for this particular term is k − 1. Group by Trials variance can be defined by the following equation:

SS_A×B = n Σ( ĀB̄_jk − Ā_j − B̄_k + Ḡ )²   4.7

In words, for each cell formed by crossing a group with a trial, the Factor A mean and the trial mean are subtracted from the cell mean, the grand mean is added back in, and the result is squared, weighted by n, and summed across the cells. If there is no interaction, the effects of group assignment (A) and trial (B) are additive, and on balance this term will sum to zero.


The degrees of freedom for this term is (j − 1)(k − 1). Computationally, this becomes:

SS_A×B = ΣAB_jk² / n − ΣA_j² / (nk) − ΣB_k² / (nj) + G² / (njk)   4.8

At last, the error term associated with the within-subjects factor (B) and the interaction term is defined as follows:

SS_B×Subj. w. Groups = Σ_j Σ_k Σ_i ( X_ijk − ĀB̄_jk − P̄_i + Ā_j )²   4.9

For each person's score on each trial, the cell mean for that group-by-trial combination and the person's own average across trials are subtracted, and the group mean for that person's between-subjects condition is added back in; the result is squared and summed. The degrees of freedom for this term is (k − 1) × j(n − 1).

SS_B×Subj. w. Groups = ΣX_ijk² − ΣAB_jk² / n − ΣP_i² / k + ΣA_j² / (nk)   4.10

Numerical Example

The following numerical example will take advantage of a strategy I call bracket terms (which I'm sure I stole from Keppel, 1982). It breaks the quantities needed for the Sums of Squares down into a few reusable pieces for ease of calculation.


Group A1: Experimental
  Pre-Test:      13 19 21 15 12 26 25 11 16 21   (sum = 179)
  Post-Test:     26 27 28 26 17 30 29 15 25 27   (sum = 250)
  Person totals: 39 46 49 41 29 56 54 26 41 48   (sum = 429)

Group A2: Control
  Pre-Test:      15 18 26 23 24 19 14 20 11 20   (sum = 190)
  Post-Test:     19 20 26 23 26 19 16 20  9 25   (sum = 203)
  Person totals: 34 38 52 46 50 38 30 40 20 45   (sum = 393)

Marginals for B (trials): Pre-Test = 369, Post-Test = 453; Grand total G = 822

n = 10, the number within each cell
k = 2, the number of trials
j = 2, the number of groups
i = subscript for person
X_ijk = score for person i, in group j, on trial k

Step 1: Arrange into summary table

                   B1: Pre-Test   B2: Post-Test   Marginals for A
A1: Experimental        179            250              429
A2: Control             190            203              393
Marginals for B         369            453              822


Step 2: Form Bracket Terms

[1] = G² / (njk) = 822² / (10 × 2 × 2) = 16,892.1

[2] = ΣA_j² / (nk) = (429² + 393²) / (10 × 2) = 16,924.5

[3] = ΣB_k² / (nj) = (369² + 453²) / (10 × 2) = 17,068.5

[4] = ΣP_i² / k = (39² + 46² + 49² + 41² + 29² + 56² + 54² + 26² + 41² + 48² + 34² + 38² + 52² + 46² + 50² + 38² + 30² + 40² + 20² + 45²) / 2 = 17,781

[5] = ΣAB_jk² / n = (179² + 250² + 190² + 203²) / 10 = 17,185

[6] = ΣX_ijk² = 13² + 19² + … + 26² + 27² + … + 15² + 18² + … + 19² + 20² + … + 9² + 25² = 18,106

Step 3: Calculate Sums of Squares

SS_A = [2] − [1] = 16,924.5 − 16,892.1 = 32.4
SS_s(g) (subjects within groups) = [4] − [2] = 17,781 − 16,924.5 = 856.5
SS_B (trials) = [3] − [1] = 17,068.5 − 16,892.1 = 176.4
SS_AxB = [5] − [2] − [3] + [1] = 17,185 − 16,924.5 − 17,068.5 + 16,892.1 = 84.1
SS_RES (subjects within groups within trials) = [6] − [5] − [4] + [2] = 18,106 − 17,185 − 17,781 + 16,924.5 = 64.5
SS_TOT = [6] − [1] = 18,106 − 16,892.1 = 1,213.9

Step 4: Determine degrees of freedom for each term

df_A = j − 1 = 1
df_s(g) = j(n − 1) = 2(9) = 18
df_B = k − 1 = 1
df_AxB = (j − 1)(k − 1) = 1


df_RES = (k − 1) × j(n − 1) = 18
df_TOT = jkn − 1 = 39

Step 5: Form Summary Table, compute MS & F

Source                                       SS       df      MS        F      p(F)
SS_A (Between Groups)                        32.4      1     32.40     0.68    0.4201
SS_s(g) (Subjects within Groups)            856.5     18     47.58
SS_B (Between Trials)                       176.4      1    176.40    49.23    0.0000
SS_AxB (Groups by Trials Interaction)        84.1      1     84.10    23.47    0.0001
SS_RES                                       64.5     18      3.58
SS_TOT                                    1,213.9     39     31.13

Note the following:

• MS_s(g) is the error term for A (between groups), so 32.40 is divided by 47.58 to get the F for this effect.

• MS_RES is the error term for the trials effect and for the groups by trials interaction; hence it is the denominator for those F-ratios. The short script below reproduces these calculations from the raw data.
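
This is a sketch assuming numpy is available; it simply mirrors the bracket terms and the computational formulas 4.2 through 4.10 for the data above, so the printed values should match the summary table.

import numpy as np

# Rows = 10 participants per group; columns = Pre-Test, Post-Test
experimental = np.array([[13, 26], [19, 27], [21, 28], [15, 26], [12, 17],
                         [26, 30], [25, 29], [11, 15], [16, 25], [21, 27]])
control      = np.array([[15, 19], [18, 20], [26, 26], [23, 23], [24, 26],
                         [19, 19], [14, 16], [20, 20], [11,  9], [20, 25]])

data = np.stack([experimental, control])        # shape: (j groups, n subjects, k trials)
j, n, k = data.shape
G = data.sum()

b1 = G**2 / (n * j * k)                          # [1]
b2 = (data.sum(axis=(1, 2))**2).sum() / (n * k)  # [2]  group totals A_j
b3 = (data.sum(axis=(0, 1))**2).sum() / (n * j)  # [3]  trial totals B_k
b4 = (data.sum(axis=2)**2).sum() / k             # [4]  person totals P_i
b5 = (data.sum(axis=1)**2).sum() / n             # [5]  cell totals AB_jk
b6 = (data**2).sum()                             # [6]  sum of squared raw scores

print("SS_A    =", b2 - b1)             # 32.4
print("SS_s(g) =", b4 - b2)             # 856.5
print("SS_B    =", b3 - b1)             # 176.4
print("SS_AxB  =", b5 - b2 - b3 + b1)   # 84.1
print("SS_RES  =", b6 - b5 - b4 + b2)   # 64.5
print("SS_TOT  =", b6 - b1)             # 1,213.9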

Assumptions: This particular design has all the assumptions of a repeated measures design, but since it includes a between-subjects effect, there is an additional assumption of homogeneity of the covariance matrices across groups. Put another way, we must assume that the covariances among the repeated measures are equivalent for each group. As long as group sizes are kept equal, violations of this assumption are typically not a concern. Sphericity, however, remains a concern, and any significance tests associated with the repeated measures should involve correcting for lack of sphericity.

Post Hoc Tests: There are several potential post hoc approaches, namely Tukey's test, a Bonferroni correction, and a multivariate testing approach. Tukey's can be trusted as long as group sizes are equal. Another simple approach is the Bonferroni adjustment, which holds up well under most violations of assumptions; it has somewhat less power than Tukey's, though, so Tukey's is preferred when group sizes are equal. The multivariate approach lacks power in most conditions. If the choice is to use Tukey's, one must substitute the correct error term for MS_RES. One could also choose to use contrasts or simple main effects analyses when a significant groups by trials interaction exists. As mentioned previously for other designs, it is probably most efficient to conduct these simple main effects analyses by running different portions of the model separately and substituting in the appropriate (pooled) error term if the assumptions hold.
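
As one illustration of the follow-up logic, the Bonferroni route can be sketched with scipy. This is only a sketch, assuming numpy and scipy are installed; it runs each follow-up as an ordinary t-test on the example data and multiplies the p-values by the number of comparisons, rather than substituting the pooled error term discussed above.

import numpy as np
from scipy.stats import ttest_rel, ttest_ind

# Same Pre-Test / Post-Test data as in the previous sketch
experimental = np.array([[13, 26], [19, 27], [21, 28], [15, 26], [12, 17],
                         [26, 30], [25, 29], [11, 15], [16, 25], [21, 27]])
control      = np.array([[15, 19], [18, 20], [26, 26], [23, 23], [24, 26],
                         [19, 19], [14, 16], [20, 20], [11,  9], [20, 25]])

comparisons = {
    # Within-group change from Pre-Test to Post-Test (paired)
    "Experimental: Pre vs Post": ttest_rel(experimental[:, 0], experimental[:, 1]),
    "Control: Pre vs Post":      ttest_rel(control[:, 0], control[:, 1]),
    # Between-group differences at each trial (independent groups)
    "Pre-Test: Exp vs Control":  ttest_ind(experimental[:, 0], control[:, 0]),
    "Post-Test: Exp vs Control": ttest_ind(experimental[:, 1], control[:, 1]),
}

n_tests = len(comparisons)
for label, res in comparisons.items():
    p_adjusted = min(res.pvalue * n_tests, 1.0)   # Bonferroni adjustment
    print(f"{label}: t = {res.statistic:.2f}, adjusted p = {p_adjusted:.4f}")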


Power

Little has been said so far in this document regarding power. This is not because it isn't important; it reflects a limitation of the document, which is a work in progress. While it may seem like too little too late, it is important to consider power when designing an experiment. Good, readable discussions of power are readily available; the Howell text referenced below is especially readable. Howell shows how to estimate a proper sample size for a given level of power once one has some indication of the error term and some notion of the effect size. Spending a short time with that discussion makes it evident how to estimate power, and at that point it is useful to take advantage of a free program to do the power calculations. Howell recommends G*Power and I have found it to be very useful as well. It is available at http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/.

Another point worth making is that observed power can be calculated and is available as an option in SPSS. Observed power is especially useful for pilot studies, or for studies where a result was not quite strong enough to be significant but looked like it might be. It indicates what the power of the experiment was, given the effect size and sample size that were observed; hence, it shows the probability of rejecting the null hypothesis, if it is false, under the conditions observed in the experiment. It can also be used to estimate the sample size required for a subsequent study.
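
For a quick calculation without leaving Python, statsmodels offers an analogue to part of what G*Power does. The sketch below handles only the simplest case, a one-way between-subjects ANOVA, and assumes statsmodels is installed; the four-group design and the effect size (Cohen's f = .25, a conventional "medium" value) are illustrative choices, not values taken from this document.

from statsmodels.stats.power import FTestAnovaPower

analysis = FTestAnovaPower()

# Total sample size needed for a one-way ANOVA with 4 groups,
# a medium effect (Cohen's f = .25), alpha = .05, and power = .80
n_total = analysis.solve_power(effect_size=0.25, alpha=0.05,
                               power=0.80, k_groups=4)
print(round(n_total))       # total N across all groups

# Power achieved for an effect of that size with only 20 participants in total
print(analysis.power(effect_size=0.25, nobs=20, alpha=0.05, k_groups=4))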

Good References for ANOVA:

Howell, D. C. (2002). Statistical methods for psychology (5th ed.). Pacific Grove, CA: Duxbury.

Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.

Kirk, R. E. (1995). Experimental design: Procedures for the behavioral sciences (3rd ed.). Monterey, CA: Brooks/Cole.

Tabachnick, B. G., & Fidell, L. S. (2001). Computer-assisted research design and analysis. Pearson/Allyn & Bacon.

Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design (3rd ed.). New York: McGraw-Hill.