Analysis of Variance (Stony Brook, zhu/ams394/lab11.pdf)

Post on 28-Jul-2020


ANOVA: Analysis of Variance

One-way Analysis of Variance (ANOVA)

Comparing k Populations

The F test – for comparing k means

Situation

• We have k normal populations.

• Let $\mu_i$ and $\sigma^2$ denote the mean and variance of population i, for i = 1, 2, 3, …, k.

• Note: we assume that the variance of each population is unknown but the same:

$$\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2 = \sigma^2$$

We want to test

$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$$

against

$$H_A: \mu_i \ne \mu_j \text{ for at least one pair } i, j$$

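The F statistic

$$F = \frac{\frac{1}{k-1}\sum_{i=1}^{k} n_i\,(\bar{x}_i - \bar{x})^2}{\frac{1}{N-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}$$

where $x_{ij}$ = the jth observation in the ith sample, $i = 1, 2, \dots, k$ and $j = 1, 2, \dots, n_i$,

$$\bar{x}_i = \frac{\sum_{j=1}^{n_i} x_{ij}}{n_i} = \text{mean for the } i\text{th sample}, \quad i = 1, 2, \dots, k$$

$$N = \sum_{i=1}^{k} n_i = \text{Total sample size}$$

$$\bar{x} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}}{N} = \text{Overall mean}$$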
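The definition of the F statistic can be checked numerically. The following is a minimal Python sketch (standard library only); the function name `one_way_f` and the toy data are illustrative assumptions, not part of the original slides.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-sample sum of squares: n_i * (group mean - overall mean)^2
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-sample sum of squares: deviations from each group's own mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return f_stat, k - 1, n_total - k


toy = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical data, k = 3
f_stat, df1, df2 = one_way_f(toy)
print(round(f_stat, 4), df1, df2)
```

For the toy data the group means are 2, 3 and 6, so the between sum of squares is 26 on 2 d.f. and the within sum of squares is 6 on 6 d.f., giving F = 13.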

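The ANOVA table

Source    S.S.     d.f.    M.S.     F
Between   $SS_B$   k - 1   $MS_B$   $MS_B / MS_W$
Within    $SS_W$   N - k   $MS_W$

$$SS_B = \sum_{i=1}^{k} n_i\,(\bar{x}_i - \bar{x})^2, \qquad MS_B = \frac{1}{k-1}\sum_{i=1}^{k} n_i\,(\bar{x}_i - \bar{x})^2$$

$$SS_W = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2, \qquad MS_W = \frac{1}{N-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$

$$F = \frac{MS_B}{MS_W}$$

The ANOVA table is a tool for displaying the computations for the F test. It is very important when the Between Sample variability is due to two or more factors.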

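Computing Formulae:

Compute

1) $\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2$

2) $T_i = \sum_{j=1}^{n_i} x_{ij}$ = Total for sample i

3) $G = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}$ = Grand Total

4) $N = \sum_{i=1}^{k} n_i$ = Total sample size

5) $\sum_{i=1}^{k} \dfrac{T_i^2}{n_i}$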

The data

• Assume we have collected data from each of k populations.

• Let $x_{i1}, x_{i2}, x_{i3}, \dots$ denote the $n_i$ observations from population i, for i = 1, 2, 3, …, k.

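Then

1) $SS_{Within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \dfrac{T_i^2}{n_i}$

2) $SS_{Between} = \sum_{i=1}^{k} \dfrac{T_i^2}{n_i} - \dfrac{G^2}{N}$

3) $F = \dfrac{SS_{Between}/(k-1)}{SS_{Within}/(N-k)}$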

Anova Table

Source    d.f.    Sum of Squares   Mean Square    F-ratio
Between   k - 1   SS_Between       MS_Between     MS_B / MS_W
Within    N - k   SS_Within        MS_Within
Total     N - 1   SS_Total

where MS = SS / df.

Example

In the following example we are comparing weight

gains resulting from the following six diets

1. Diet 1 - High Protein , Beef

2. Diet 2 - High Protein , Cereal

3. Diet 3 - High Protein , Pork

4. Diet 4 - Low protein , Beef

5. Diet 5 - Low protein , Cereal

6. Diet 6 - Low protein , Pork

11

Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

Diet           1        2        3        4        5        6
              73       98       94       90      107       49
             102       74       79       76       95       82
             118       56       96       90       97       73
             104      111       98       64       80       86
              81       95      102       86       98       81
             107       88      102       51       74       97
             100       82      108       72       74      106
              87       77       91       90       67       70
             117       86      120       95       89       61
             111       92      105       78       58       82
Mean       100.0     85.9     99.5     79.2     83.9     78.7
Std. Dev.  15.14    15.02    10.92    13.89    15.71    16.55
Σx          1000      859      995      792      839      787
Σx²       102062    75819   100075    64462    72613    64401

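Thus

$$SS_{Within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i} = 479432 - 467846 = 11586$$

$$SS_{Between} = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N} = 467846 - \frac{5272^2}{60} = 4612.933$$

$$F = \frac{SS_{Between}/(k-1)}{SS_{Within}/(N-k)} = \frac{4612.933/5}{11586/54} = \frac{922.6}{214.56} = 4.3$$

$F_{0.05} = 2.386$ with $\nu_1 = 5$ and $\nu_2 = 54$.

Thus, since F > 2.386, we reject $H_0$.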
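The arithmetic of this example can be reproduced from the raw weight gains with the computing formulae (totals $T_i$, grand total G). This is a minimal Python sketch using only the standard library; the variable names are illustrative choices, and the data are copied from the diet table.

```python
diets = {
    1: [73, 102, 118, 104, 81, 107, 100, 87, 117, 111],
    2: [98, 74, 56, 111, 95, 88, 82, 77, 86, 92],
    3: [94, 79, 96, 98, 102, 102, 108, 91, 120, 105],
    4: [90, 76, 90, 64, 86, 51, 72, 90, 95, 78],
    5: [107, 95, 97, 80, 98, 74, 74, 67, 89, 58],
    6: [49, 82, 73, 86, 81, 97, 106, 70, 61, 82],
}

k = len(diets)
n_total = sum(len(g) for g in diets.values())            # N
sum_sq = sum(x * x for g in diets.values() for x in g)   # sum of x_ij^2
totals = {d: sum(g) for d, g in diets.items()}           # T_i
grand_total = sum(totals.values())                       # G
sum_t2_over_n = sum(totals[d] ** 2 / len(diets[d]) for d in diets)

ss_within = sum_sq - sum_t2_over_n
ss_between = sum_t2_over_n - grand_total ** 2 / n_total
f_ratio = (ss_between / (k - 1)) / (ss_within / (n_total - k))

print(round(ss_within, 3), round(ss_between, 3), round(f_ratio, 2))
```

The three printed values should agree with the hand computation: 11586 for the within sum of squares, 4612.933 for the between sum of squares, and an F-ratio of about 4.30.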

Anova Table

Source    d.f.   Sum of Squares   Mean Square   F-ratio
Between   5      4612.933         922.587       4.3** (p = 0.0023)
Within    54     11586.000        214.556
Total     59     16198.933

* - Significant at 0.05 (not 0.01)

** - Significant at 0.01

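Equivalence of the F-test and the t-test when k = 2

the t-test

$$t = \frac{\bar{x} - \bar{y}}{s_{Pooled}\sqrt{\frac{1}{n} + \frac{1}{m}}}, \qquad s_{Pooled} = \sqrt{\frac{(n-1)s_x^2 + (m-1)s_y^2}{n + m - 2}}$$

the F-test

$$F = \frac{s_{Between}^2}{s_{Pooled}^2} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 / (k-1)}{\sum_{i=1}^{k} (n_i - 1)s_i^2 \Big/ \left(\sum_{i=1}^{k} n_i - k\right)} = \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2}{\left[(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right] / (n_1 + n_2 - 2)}$$

$$\text{denominator} = s_{Pooled}^2, \qquad \text{numerator} = n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2$$

Since $\bar{x} = (n_1\bar{x}_1 + n_2\bar{x}_2)/(n_1 + n_2)$,

$$n_1(\bar{x}_1 - \bar{x})^2 = n_1\left(\bar{x}_1 - \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}\right)^2 = \frac{n_1 n_2^2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2$$

$$n_2(\bar{x}_2 - \bar{x})^2 = n_2\left(\bar{x}_2 - \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}\right)^2 = \frac{n_1^2 n_2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2$$

$$n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 = \frac{n_1 n_2^2 + n_1^2 n_2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{x}_1 - \bar{x}_2)^2 = \frac{(\bar{x}_1 - \bar{x}_2)^2}{\frac{1}{n_1} + \frac{1}{n_2}}$$

Hence

$$F = \frac{(\bar{x}_1 - \bar{x}_2)^2}{s_{Pooled}^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = t^2$$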
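The identity F = t² for k = 2 can also be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and the two samples are simply the first five Diet 1 gains and the first six Diet 4 gains from the rat data, chosen to show that unequal sample sizes do not matter.

```python
from math import sqrt


def pooled_t(x, y):
    """Two-sample t statistic with the pooled standard deviation."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    ss_x = sum((v - xbar) ** 2 for v in x)   # (n - 1) * s_x^2
    ss_y = sum((v - ybar) ** 2 for v in y)   # (m - 1) * s_y^2
    s_pooled = sqrt((ss_x + ss_y) / (n + m - 2))
    return (xbar - ybar) / (s_pooled * sqrt(1 / n + 1 / m))


def two_group_f(x, y):
    """One-way ANOVA F statistic for k = 2 samples."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    grand = (sum(x) + sum(y)) / (n + m)
    ss_between = n * (xbar - grand) ** 2 + m * (ybar - grand) ** 2
    ss_within = (sum((v - xbar) ** 2 for v in x)
                 + sum((v - ybar) ** 2 for v in y))
    return (ss_between / (2 - 1)) / (ss_within / (n + m - 2))


x = [73, 102, 118, 104, 81]    # first five gains under Diet 1
y = [90, 76, 90, 64, 86, 51]   # first six gains under Diet 4
t_stat = pooled_t(x, y)
f_stat = two_group_f(x, y)
print(abs(t_stat ** 2 - f_stat) < 1e-9)
```

The two statistics agree up to floating-point rounding, so the comparison prints `True`.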


SAS Code for one-way ANOVA

20

Data oneway;

Input diet $ weight_gain;

Datalines;

1 73

1 102

1 118

1 104

1 81

1 107

1 100

1 87

1 117

1 111

2 98

2 74

2 56

2 111

2 95

2 88

2 82

2 77

2 86

2 92

3 94

3 79

3 96

3 98

3 102

3 102

3 108

3 91

3 120

3 105

4 90

4 76

4 90

4 64

4 86

4 51

4 72

4 90

4 95

4 78

5 107

5 95

5 97

5 80

5 98

5 74

5 74

5 67

5 89

5 58

6 49

6 82

6 73

6 86

6 81

6 97

6 106

6 70

6 61

6 82

;

Run;

Note: there are easier ways to enter the data. We will come to that later.

To test our hypothesis, we use the following code in SAS:

PROC ANOVA DATA = oneway;

class diet;

model weight_gain = diet;

RUN;

QUIT;

• “class” tells SAS the classification variable. In general, this is going to be the effect that you are studying. In this case, the effect is “diet.”

• “model” tells SAS the dependent variable. The general format is “model Y = X” where Y is the dependent variable, and X is the independent variable. In this case, weight_gain is dependent on diet.

• Often a “quit” statement is necessary, because SAS may continue to run a procedure until either another one has been run, or SAS has been told to quit.

SAS Output

The ANOVA Procedure

Class Level Information

Class Levels Values

diet 6 1 2 3 4 5 6

Number of Observations Read 60

Number of Observations Used 60

The ANOVA Procedure

Dependent Variable: weight_gain

Sum of

Source DF Squares Mean Square F Value Pr > F

Model 5 4612.93333 922.58667 4.30 0.0023

Error 54 11586.00000 214.55556

Corrected Total 59 16198.93333

R-Square Coeff Var Root MSE weight_gain Mean

0.284768 16.67039 14.64772 87.86667

Source DF Anova SS Mean Square F Value Pr > F

diet 5 4612.933333 922.586667 4.30 0.0023

Factorial Experiments

Analysis of Variance

24

• Dependent variable Y

• k Categorical independent variables A, B,

C, … (the Factors)

• Let

– a = the number of categories of A

– b = the number of categories of B

– c = the number of categories of C

– etc.

25

The Completely Randomized Design

• We form the set of all treatment combinations

– the set of all combinations of the k factors

• Total number of treatment combinations

– t = abc…

• In the completely randomized design, n experimental units (test animals, test plots, etc.) are randomly assigned to each treatment combination.

– Total number of experimental units N = nt = nabc…
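The random assignment can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `completely_randomized_design`, the unit numbering 1..N, and the fixed seed are all hypothetical choices, and the factor levels are borrowed from the protein example.

```python
import random
from itertools import product


def completely_randomized_design(factor_levels, n, seed=0):
    """Randomly assign n experimental units to each treatment combination.

    factor_levels maps each factor name to its list of levels.
    Units are numbered 1..N, where N = n * t and t is the number of
    treatment combinations.
    """
    combos = list(product(*factor_levels.values()))   # all t combinations
    units = list(range(1, n * len(combos) + 1))
    rng = random.Random(seed)   # fixed seed only to make the sketch repeatable
    rng.shuffle(units)
    return {combo: sorted(units[i * n:(i + 1) * n])
            for i, combo in enumerate(combos)}


levels = {"Protein": ["High", "Low"], "Source": ["Beef", "Cereal", "Pork"]}
design = completely_randomized_design(levels, n=10)
for combo, assigned in design.items():
    print(combo, assigned)
```

With a = 2 and b = 3 there are t = 6 treatment combinations, so N = 60 units are split into six random groups of 10.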

26

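The treatment combinations can be thought of as arranged in a k-dimensional rectangular block.

[Figure: two-factor layout with rows for the levels of A (1, 2, …, a) and columns for the levels of B (1, 2, …, b).]

[Figure: three-factor layout with axes A, B and C.]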

• With the same number n of observations for every treatment combination, the Completely Randomized Design is called balanced.

• If the number of observations per treatment combination is unequal, the design is called unbalanced. (This results in a mathematically more complex analysis and more complex computations.)

• If for some of the treatment combinations there are no observations, the design is called incomplete. (In this case it may happen that some of the parameters, main effects and interactions, cannot be estimated.)

Example: Two-way ANOVA

(two-factor experiment)

In this example we are examining the effect of the level of protein A (High or Low) and the source of protein B (Beef, Cereal, or Pork) on weight gains (grams) in rats.

We have n = 10 test animals randomly assigned to each of the k = 6 diets.

The k = 6 diets are the 6 = 3×2 Level-Source combinations

1. High - Beef

2. High - Cereal

3. High - Pork

4. Low - Beef

5. Low - Cereal

6. Low - Pork

31

Treatment combinations

                        Source of Protein
Level of Protein   Beef     Cereal   Pork
High               Diet 1   Diet 2   Diet 3
Low                Diet 4   Diet 5   Diet 6

Summary Table of Means

                        Source of Protein
Level of Protein   Beef     Cereal   Pork     Overall
High               100.00   85.90    99.50    95.13
Low                79.20    83.90    78.70    80.60
Overall            89.60    84.90    89.10    87.87

Table: Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

Level of Protein    High     High     High     Low      Low      Low
Source of Protein   Beef     Cereal   Pork     Beef     Cereal   Pork
Diet                1        2        3        4        5        6
                    73       98       94       90       107      49
                    102      74       79       76       95       82
                    118      56       96       90       97       73
                    104      111      98       64       80       86
                    81       95       102      86       98       81
                    107      88       102      51       74       97
                    100      82       108      72       74       106
                    87       77       91       90       67       70
                    117      86       120      95       89       61
                    111      92       105      78       58       82
Mean                100.0    85.9     99.5     79.2     83.9     78.7
Std. Dev.           15.14    15.02    10.92    13.89    15.71    16.55

Data twoway;

Input Protein $ Source $ weight_gain;

Datalines;

High Beef 73

High Beef 102

High Beef 118

High Beef 104

High Beef 81

High Beef 107

High Beef 100

High Beef 87

High Beef 117

High Beef 111

High Cereal 98

High Cereal 74

High Cereal 56

High Cereal 111

High Cereal 95

High Cereal 88

High Cereal 82

High Cereal 77

High Cereal 86

High Cereal 92

High Pork 94

High Pork 79

High Pork 96

High Pork 98

High Pork 102

High Pork 102

High Pork 108

High Pork 91

High Pork 120

High Pork 105

Low Beef 90

Low Beef 76

Low Beef 90

Low Beef 64

Low Beef 86

Low Beef 51

Low Beef 72

Low Beef 90

Low Beef 95

Low Beef 78

Low Cereal 107

Low Cereal 95

Low Cereal 97

Low Cereal 80

Low Cereal 98

Low Cereal 74

Low Cereal 74

Low Cereal 67

Low Cereal 89

Low Cereal 58

Low Pork 49

Low Pork 82

Low Pork 73

Low Pork 86

Low Pork 81

Low Pork 97

Low Pork 106

Low Pork 70

Low Pork 61

Low Pork 82

;

Run;

SAS Code for two-way ANOVA

To test our hypotheses, we use the following code in SAS:

PROC ANOVA DATA = twoway;

class Protein Source;

model weight_gain = Protein Source Protein*Source;

RUN;

QUIT;

• “class” tells SAS the two classification variables, which are generally going to be the effects that you are studying. In this case, the effects are “Protein” and “Source”.

• “model” tells SAS the dependent variable. The general format is “model Y = X1 X2 X1*X2” where Y is the dependent variable, and X1 and X2 are independent variables. X1*X2 means the interaction of X1 and X2.

• Often a “quit” statement is necessary, because SAS may continue to run a procedure until either another one has been run, or SAS has been told to quit.

SAS Output

The ANOVA Procedure

Class Level Information

Class Levels Values

Protein 2 High Low

Source 3 Beef Cereal Pork

Number of Observations Read 60

Number of Observations Used 60

The ANOVA Procedure

Dependent Variable: weight_gain

Sum of

Source DF Squares Mean Square F Value Pr > F

Model 5 4612.93333 922.58667 4.30 0.0023

Error 54 11586.00000 214.55556

Corrected Total 59 16198.93333

R-Square Coeff Var Root MSE weight_gain Mean

0.284768 16.67039 14.64772 87.86667

Source DF Anova SS Mean Square F Value Pr > F

Protein 1 3168.266667 3168.266667 14.77 0.0003

Source 2 266.533333 133.266667 0.62 0.5411

Protein*Source 2 1178.133333 589.066667 2.75 0.0732

Profiles of the response relative to a factor

A graphical representation of the effect of a factor on a response variable (dependent variable).

[Figure: Profile of Y for A. Vertical axis: Y; horizontal axis: levels of A (1, 2, 3, …, a). Each plotted value could be for an individual case or averaged over a group of cases, and could be for a specific level of another factor or averaged over the levels of another factor.]

[Figure: Profiles of Weight Gain for Source and Level of Protein. Vertical axis: Weight Gain (70 to 110); horizontal axis: Beef, Cereal, Pork; lines: High Protein, Low Protein, Overall.]

[Figure: Profiles of Weight Gain for Source and Level of Protein. Vertical axis: Weight Gain (70 to 110); horizontal axis: High Protein, Low Protein; lines: Beef, Cereal, Pork, Overall.]

Example – Four factor experiment

Four factors are studied for their effect on Y (luster of paint film). The four factors are:

1) Film Thickness (1 or 2 mils)

2) Drying conditions (Regular or Special)

3) Length of wash (20, 30, 40 or 60 minutes), and

4) Temperature of wash (92 °C or 100 °C)

Two observations of film luster (Y) are taken for each treatment combination.

The data are tabulated below:

                 Regular Dry         Special Dry
Minutes          92 °C    100 °C     92 °C    100 °C
1-mil Thickness
20               3.4      3.4        19.6     14.5
                 2.1      3.8        17.2     13.4
30               4.1      4.1        17.5     17.0
                 4.0      4.6        13.5     14.3
40               4.9      4.2        17.6     15.2
                 5.1      3.3        16.0     17.8
60               5.0      4.9        20.9     17.1
                 8.3      4.3        17.5     13.9
2-mil Thickness
20               5.5      3.7        26.6     29.5
                 4.5      4.5        25.6     22.5
30               5.7      6.1        31.6     30.2
                 5.9      5.9        29.2     29.8
40               5.5      5.6        30.5     30.2
                 5.5      5.8        32.6     27.4
60               7.2      6.0        31.4     29.6
                 8.0      9.9        33.5     29.5

Definition:

A factor is said to not affect the response if

the profile of the factor is horizontal for all

combinations of levels of the other factors:

No change in the response when you change

the levels of the factor (true for all

combinations of levels of the other factors)

Otherwise the factor is said to affect the

response:

45

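[Figure: Profile of Y for A when A affects the response. Vertical axis: Y; horizontal axis: levels of A (1, 2, 3, …, a); one line per level of B.]

[Figure: Profile of Y for A when A has no effect on the response. Vertical axis: Y; horizontal axis: levels of A (1, 2, 3, …, a); one line per level of B.]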

Definition:

• Two (or more) factors are said to interact if changes in the response when you change the level of one factor depend on the level(s) of the other factor(s).

• Profiles of the factor for different levels of the other factor(s) are not parallel

• Otherwise the factors are said to be additive .

• Profiles of the factor for different levels of the other factor(s) are parallel.
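The parallel-profile idea can be made concrete with the cell means from the Summary Table of Means. In this Python sketch (variable names are illustrative), the gap between the High and Low profiles is computed at each protein source; perfectly additive factors would give the same gap everywhere. Sample profiles are never exactly parallel, so this is only a descriptive check and the interaction F test remains the formal criterion.

```python
cell_means = {
    "High": {"Beef": 100.0, "Cereal": 85.9, "Pork": 99.5},
    "Low": {"Beef": 79.2, "Cereal": 83.9, "Pork": 78.7},
}

# Vertical gap between the two profiles at each level of Source.
gaps = {source: cell_means["High"][source] - cell_means["Low"][source]
        for source in cell_means["High"]}

# Zero spread would mean exactly parallel profiles (additive factors).
spread = max(gaps.values()) - min(gaps.values())

for source, gap in gaps.items():
    print(f"{source}: High - Low = {gap:.1f}")
print(f"spread of the gaps = {spread:.1f}")
```

Here the gap is 20.8 grams for Beef and Pork but only 2.0 grams for Cereal, so the sample profiles are visibly non-parallel.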

48

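[Figure: Interacting factors A and B. Vertical axis: Y; horizontal axis: levels of A (1, 2, 3, …, a); one line per level of B.]

[Figure: Additive factors A and B. Vertical axis: Y; horizontal axis: levels of A (1, 2, 3, …, a); one line per level of B.]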

• If two (or more) factors interact, each factor affects the response.

• If two (or more) factors are additive, it still remains to be determined whether the factors affect the response.

• In factorial experiments we are interested in determining

– which factors affect the response and

– which groups of factors interact.

Order of testing in factorial experiments

1. Test first the higher order interactions.

2. If an interaction is present there is no need to test lower order interactions or main effects involving those factors. All factors in the interaction affect the response and they interact

3. The testing continues for lower order interactions and main effects for factors which have not yet been determined to affect the response.
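For a two-factor experiment the testing order reduces to a short decision rule. The sketch below is a hypothetical helper (the function name and the 0.05 level are assumptions), applied to the p-values reported in the two-way SAS output for the protein example.

```python
def factors_affecting_response(p_interaction, p_main, alpha=0.05):
    """Apply the testing order to a two-factor experiment.

    p_interaction: p-value of the A*B interaction test.
    p_main: dict mapping each factor name to its main-effect p-value.
    """
    if p_interaction < alpha:
        # Interaction present: every factor in it affects the response,
        # so the main effects need no separate test.
        return {"interaction": True, "affecting": sorted(p_main)}
    # No interaction detected: fall back to the main-effect tests.
    affecting = sorted(f for f, p in p_main.items() if p < alpha)
    return {"interaction": False, "affecting": affecting}


result = factors_affecting_response(
    0.0732, {"Protein": 0.0003, "Source": 0.5411})
print(result)
```

With these p-values the interaction is not significant at the 0.05 level, so the main effects are tested and only Protein is flagged as affecting weight gain.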

52

More SAS Program: Proc GLM

The ANOVA procedure is one of several procedures available in SAS/STAT software for analysis of variance. The ANOVA procedure is designed to handle balanced data (that is, data with equal numbers of observations for every combination of the classification factors), whereas the GLM procedure can analyze both balanced and unbalanced data. Because PROC ANOVA takes into account the special structure of a balanced design, it is faster and uses less storage than PROC GLM for balanced data.

Proc GLM

PROC GLM DATA = twoway;

class Protein Source;

model weight_gain = Protein Source Protein*Source;

lsmeans Protein Source Protein*Source /out=outmns;

*gives least square means and outputs them into another data set called 'outmns';

means Protein Source /cldiff bon;

*ask SAS for the confidence limits for the difference of means and the type of comparison;

output out=resout p=preds rstudent=exstdres;

*outputs the residuals and predicted value to a data set called 'resout';

RUN;

QUIT;

Proc GLM, continued

title 'Profile/Interaction Plots';

symbol i=j;

*tells SAS to draw lines between joint means;

proc gplot data=outmns;

where Protein ne ' ' and Source ne ' ';

*remove the marginal means from the data set since we only wish to plot joint means (Protein and Source are character variables, so a missing value is a blank);

plot lsmean*Protein=Source;

plot lsmean*Source=Protein;

run; quit;

goptions reset=all; *resets PROC GPLOT options;

title 'Residual Plot';

proc gplot data=resout;

plot exstdres*preds;

run; quit;

Mean versus LS Mean (LSM)

Note: for balanced designs, as is true for our examples, the mean and the LSM are the same.
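A small numeric illustration of the difference, assuming a two-factor model with interaction in which the LS mean of a level of A is the unweighted average of its cell means, while the ordinary mean averages all observations at that level. The function name and both data sets are hypothetical.

```python
def raw_and_ls_mean(cells):
    """cells: one sample per cell (level of B) at a fixed level of A.

    Returns (ordinary mean of all observations,
             unweighted average of the cell means).
    """
    all_obs = [x for cell in cells for x in cell]
    raw_mean = sum(all_obs) / len(all_obs)
    ls_mean = sum(sum(c) / len(c) for c in cells) / len(cells)
    return raw_mean, ls_mean


balanced = [[10, 12], [20, 22], [30, 32]]                    # 2 per cell
unbalanced = [[10, 12], [20, 22], [30, 32, 31, 33, 29, 31]]  # 2, 2, 6

print(raw_and_ls_mean(balanced))
print(raw_and_ls_mean(unbalanced))
```

In the balanced case both values are 21. In the unbalanced case the ordinary mean moves to 25 because the third cell contributes six observations, while the LS mean stays at 21.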

Bonferroni Pairwise Mean Comparisons

The GLM Procedure

Bonferroni (Dunn) t Tests for weight_gain

NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type

II error rate than Tukey's for all pairwise comparisons.

Alpha 0.05

Error Degrees of Freedom 54

Error Mean Square 214.5556

Critical Value of t 2.00488

Minimum Significant Difference 7.5825

Comparisons significant at the 0.05 level are indicated by ***.

Difference

Protein Between Simultaneous 95%

Comparison Means Confidence Limits

High - Low 14.533 6.951 22.116 ***

Low - High -14.533 -22.116 -6.951 ***

The GLM Procedure

Bonferroni (Dunn) t Tests for weight_gain

NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type

II error rate than Tukey's for all pairwise comparisons.

Alpha 0.05

Error Degrees of Freedom 54

Error Mean Square 214.5556

Critical Value of t 2.47085

Minimum Significant Difference 11.445

Comparisons significant at the 0.05 level are indicated by ***.

Difference Simultaneous

Source Between 95% Confidence

Comparison Means Limits

Beef - Pork 0.500 -10.945 11.945

Beef - Cereal 4.700 -6.745 16.145

Pork - Beef -0.500 -11.945 10.945

Pork - Cereal 4.200 -7.245 15.645

Cereal - Beef -4.700 -16.145 6.745

Cereal - Pork -4.200 -15.645 7.245

Tukey pairwise mean comparisons

PROC GLM DATA = twoway;

class Protein Source;

model weight_gain = Protein Source Protein*Source;

means Protein Source /tukey;

RUN;

QUIT;

The GLM Procedure

Tukey's Studentized Range (HSD) Test for weight_gain

NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type

II error rate than REGWQ.

Alpha 0.05

Error Degrees of Freedom 54

Error Mean Square 214.5556

Critical Value of Studentized Range 2.83533

Minimum Significant Difference 7.5825

Means with the same letter are not significantly different.

Tukey Grouping Mean N Protein

A 95.133 30 High

B 80.600 30 Low

The GLM Procedure

Tukey's Studentized Range (HSD) Test for weight_gain

NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type

II error rate than REGWQ.

Alpha 0.05

Error Degrees of Freedom 54

Error Mean Square 214.5556

Critical Value of Studentized Range 3.40823

Minimum Significant Difference 11.163

Means with the same letter are not significantly different.

Tukey Grouping Mean N Source

A 89.600 20 Beef

A

A 89.100 20 Pork

A

A 84.900 20 Cereal
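The "Minimum Significant Difference" lines in the Bonferroni and Tukey outputs can be reproduced from the Error Mean Square and the critical values printed by SAS. The sketch below assumes the usual equal-sample-size formulas, t·sqrt(2·MSE/n) for Bonferroni and q·sqrt(MSE/n) for Tukey; the critical values are copied from the output rather than computed, and the function names are illustrative.

```python
from math import sqrt

MSE = 214.5556  # Error Mean Square from the output (54 d.f.)


def bonferroni_msd(t_crit, mse, n_per_mean):
    """Minimum significant difference for Bonferroni t tests."""
    return t_crit * sqrt(2 * mse / n_per_mean)


def tukey_msd(q_crit, mse, n_per_mean):
    """Minimum significant difference for Tukey's HSD test."""
    return q_crit * sqrt(mse / n_per_mean)


msd = {
    "Bonferroni, Protein (n = 30)": bonferroni_msd(2.00488, MSE, 30),
    "Bonferroni, Source (n = 20)": bonferroni_msd(2.47085, MSE, 20),
    "Tukey, Protein (n = 30)": tukey_msd(2.83533, MSE, 30),
    "Tukey, Source (n = 20)": tukey_msd(3.40823, MSE, 20),
}
for label, value in msd.items():
    print(f"{label}: {value:.4f}")
```

The results should agree with the printed values 7.5825, 11.445, 7.5825 and 11.163. With only two Protein levels there is a single comparison, which is why Bonferroni and Tukey give the same difference there.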

Models for Factorial Experiments

Part I. Factor Effects Model

The Single Factor Experiment (One-way ANOVA)

Situation

• We have t = a treatment combinations.

• Let $\mu_i$ and $\sigma^2$ denote the mean and variance of treatment (population) i, for i = 1, 2, 3, …, a.

• Note: we assume that the variance of each population is unknown but the same:

$$\sigma_1^2 = \sigma_2^2 = \dots = \sigma_a^2 = \sigma^2$$

The data

• Assume we have collected data for each of the a treatments.

• Let $y_{i1}, y_{i2}, y_{i3}, \dots, y_{in}$ denote the n observations for treatment i, for i = 1, 2, 3, …, a.

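The model

Note:

$$y_{ij} = \mu_i + (y_{ij} - \mu_i) = \mu_i + \varepsilon_{ij}$$

$$= \mu + (\mu_i - \mu) + \varepsilon_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$

where $\varepsilon_{ij} = y_{ij} - \mu_i$ has a $N(0, \sigma^2)$ distribution,

$$\mu = \frac{1}{a}\sum_{i=1}^{a} \mu_i \quad \text{(overall mean effect)}$$

$$\alpha_i = \mu_i - \mu \quad \text{(Effect of Factor A)}$$

Note: $\sum_{i=1}^{a} \alpha_i = 0$ by their definition.

Model 1:

$y_{ij}$ (i = 1, …, a; j = 1, …, n) are independent Normal with mean $\mu_i$ and variance $\sigma^2$.

Model 2:

$$y_{ij} = \mu_i + \varepsilon_{ij}$$

where $\varepsilon_{ij}$ (i = 1, …, a; j = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$.

Model 3:

$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$

where $\varepsilon_{ij}$ (i = 1, …, a; j = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$ and

$$\sum_{i=1}^{a} \alpha_i = 0$$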

The Two Factor Experiment

Situation

• We have t = ab treatment combinations

• Let $\mu_{ij}$ and $\sigma^2$ denote the mean and variance of observations from the treatment combination when A = i and B = j.

• i = 1, 2, 3, …, a; j = 1, 2, 3, …, b.

The data

• Assume we have collected data (n observations)

for each of the t = ab treatment combinations.

• Let $y_{ij1}, y_{ij2}, y_{ij3}, \dots, y_{ijn}$ denote the n observations for the treatment combination A = i, B = j.

• i = 1, 2, 3, …, a; j = 1, 2, 3, …, b.

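The model

Note:

$$y_{ijk} = \mu_{ij} + (y_{ijk} - \mu_{ij}) = \mu_{ij} + \varepsilon_{ijk}$$

$$= \mu + (\mu_{i\cdot} - \mu) + (\mu_{\cdot j} - \mu) + (\mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu) + \varepsilon_{ijk}$$

$$= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$

where $\varepsilon_{ijk} = y_{ijk} - \mu_{ij}$ follows a $N(0, \sigma^2)$ distribution,

$$\mu = \frac{1}{ab}\sum_{i=1}^{a}\sum_{j=1}^{b} \mu_{ij}, \qquad \mu_{i\cdot} = \frac{1}{b}\sum_{j=1}^{b} \mu_{ij}, \qquad \mu_{\cdot j} = \frac{1}{a}\sum_{i=1}^{a} \mu_{ij}$$

$$\alpha_i = \mu_{i\cdot} - \mu, \qquad \beta_j = \mu_{\cdot j} - \mu, \qquad (\alpha\beta)_{ij} = \mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu$$

Note: $\sum_{i=1}^{a} \alpha_i = 0$ by their definition.

Model:

$$y_{ijk} = \underbrace{\mu}_{\text{Mean}} + \underbrace{\alpha_i + \beta_j}_{\text{Main effects}} + \underbrace{(\alpha\beta)_{ij}}_{\text{Interaction effect}} + \underbrace{\varepsilon_{ijk}}_{\text{Error}}$$

where $\varepsilon_{ijk}$ (i = 1, …, a; j = 1, …, b; k = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$ and

$$\sum_{i=1}^{a} \alpha_i = 0, \qquad \sum_{j=1}^{b} \beta_j = 0, \qquad \text{and} \qquad \sum_{i=1}^{a} (\alpha\beta)_{ij} = \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0$$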

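Maximum Likelihood Estimates

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$

where $\varepsilon_{ijk}$ (i = 1, …, a; j = 1, …, b; k = 1, …, n) are independent Normal with mean 0 and variance $\sigma^2$.

$$\hat{\mu} = \bar{y}_{\cdot\cdot\cdot} = \frac{1}{abn}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} y_{ijk}$$

$$\hat{\alpha}_i = \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot} = \frac{1}{bn}\sum_{j=1}^{b}\sum_{k=1}^{n} y_{ijk} - \bar{y}_{\cdot\cdot\cdot}$$

$$\hat{\beta}_j = \bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot} = \frac{1}{an}\sum_{i=1}^{a}\sum_{k=1}^{n} y_{ijk} - \bar{y}_{\cdot\cdot\cdot}$$

$$\widehat{(\alpha\beta)}_{ij} = \bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot} = \frac{1}{n}\sum_{k=1}^{n} y_{ijk} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot}$$

$$\hat{\sigma}^2 = \frac{1}{nab}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij\cdot})^2 = \frac{1}{nab}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} \left(y_{ijk} - \left[\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \widehat{(\alpha\beta)}_{ij}\right]\right)^2$$

This is not an unbiased estimator of $\sigma^2$ (usually the case when estimating variance). The unbiased estimator results when we divide by ab(n - 1) instead of abn.

The unbiased estimator of $\sigma^2$ is

$$s^2 = \frac{1}{ab(n-1)}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij\cdot})^2 = \frac{1}{ab(n-1)}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} \left(y_{ijk} - \left[\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \widehat{(\alpha\beta)}_{ij}\right]\right)^2$$

$$= \frac{1}{ab(n-1)} SS_{Error} = MS_{Error}$$

where

$$SS_{Error} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij\cdot})^2$$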
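To see the size of the bias in the protein example, the two variance estimates can be computed from the error sum of squares reported by SAS (11586 with a = 2, b = 3, n = 10). This is a small arithmetic sketch in Python; the variable names are illustrative.

```python
a, b, n = 2, 3, 10      # levels of Protein, levels of Source, rats per cell
ss_error = 11586.0      # Error sum of squares from the SAS output

sigma2_mle = ss_error / (a * b * n)        # divides by abn
s2_unbiased = ss_error / (a * b * (n - 1)) # divides by ab(n - 1) = MS_Error

print(round(sigma2_mle, 4), round(s2_unbiased, 4))
print(round(sigma2_mle / s2_unbiased, 4))  # ratio equals (n - 1) / n
```

The maximum likelihood estimate is 193.1, the unbiased estimate is the Error Mean Square 214.5556, and their ratio is (n - 1)/n = 0.9.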

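Testing for Interaction:

We want to test:

$H_0$: $(\alpha\beta)_{ij} = 0$ for all i and j, against

$H_A$: $(\alpha\beta)_{ij} \ne 0$ for at least one i and j.

The test statistic

$$F = \frac{MS_{AB}}{MS_{Error}} = \frac{\frac{1}{(a-1)(b-1)} SS_{AB}}{MS_{Error}}$$

where

$$SS_{AB} = n\sum_{i=1}^{a}\sum_{j=1}^{b} \widehat{(\alpha\beta)}_{ij}^2 = n\sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot})^2$$

We reject $H_0$: $(\alpha\beta)_{ij} = 0$ for all i and j, if

$$F = \frac{MS_{AB}}{MS_{Error}} > F_{\alpha}\big((a-1)(b-1),\ ab(n-1)\big)$$

Testing for the Main Effect of A:

We want to test:

$H_0$: $\alpha_i = 0$ for all i, against

$H_A$: $\alpha_i \ne 0$ for at least one i.

The test statistic

$$F = \frac{MS_A}{MS_{Error}} = \frac{\frac{1}{a-1} SS_A}{MS_{Error}}$$

where

$$SS_A = nb\sum_{i=1}^{a} \hat{\alpha}_i^2 = nb\sum_{i=1}^{a} (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2$$

We reject $H_0$: $\alpha_i = 0$ for all i, if

$$F = \frac{MS_A}{MS_{Error}} > F_{\alpha}\big(a-1,\ ab(n-1)\big)$$

Testing for the Main Effect of B:

We want to test:

$H_0$: $\beta_j = 0$ for all j, against

$H_A$: $\beta_j \ne 0$ for at least one j.

The test statistic

$$F = \frac{MS_B}{MS_{Error}} = \frac{\frac{1}{b-1} SS_B}{MS_{Error}}$$

where

$$SS_B = na\sum_{j=1}^{b} \hat{\beta}_j^2 = na\sum_{j=1}^{b} (\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot})^2$$

We reject $H_0$: $\beta_j = 0$ for all j, if

$$F = \frac{MS_B}{MS_{Error}} > F_{\alpha}\big(b-1,\ ab(n-1)\big)$$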

The ANOVA Table

Source   S.S.       d.f.             MS = SS/df   F
A        SS_A       a - 1            MS_A         MS_A / MS_Error
B        SS_B       b - 1            MS_B         MS_B / MS_Error
AB       SS_AB      (a - 1)(b - 1)   MS_AB        MS_AB / MS_Error
Error    SS_Error   ab(n - 1)        MS_Error
Total    SS_Total   abn - 1
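For a balanced design, the sums of squares for A, B and AB depend on the data only through the cell means, so the rows of this table can be rebuilt for the protein example from the Summary Table of Means. The sketch below is illustrative: the variable names are assumptions, the cell means and n = 10 come from the example, and the error mean square is taken from the one-way analysis (11586 / 54).

```python
cell_means = [
    [100.0, 85.9, 99.5],   # High protein: Beef, Cereal, Pork
    [79.2, 83.9, 78.7],    # Low protein:  Beef, Cereal, Pork
]
a, b, n = 2, 3, 10
ms_error = 11586.0 / 54    # Error mean square, ab(n - 1) = 54 d.f.

grand = sum(sum(row) for row in cell_means) / (a * b)
row_means = [sum(row) / b for row in cell_means]                   # levels of A
col_means = [sum(cell_means[i][j] for i in range(a)) / a
             for j in range(b)]                                    # levels of B

ss_a = n * b * sum((m - grand) ** 2 for m in row_means)
ss_b = n * a * sum((m - grand) ** 2 for m in col_means)
ss_ab = n * sum((cell_means[i][j] - row_means[i] - col_means[j] + grand) ** 2
                for i in range(a) for j in range(b))

f_a = (ss_a / (a - 1)) / ms_error
f_b = (ss_b / (b - 1)) / ms_error
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_error

print(round(ss_a, 3), round(ss_b, 3), round(ss_ab, 3))
print(round(f_a, 2), round(f_b, 2), round(f_ab, 2))
```

The values should match the SAS output for the two-way analysis: sums of squares of about 3168.267, 266.533 and 1178.133, with F values of about 14.77, 0.62 and 2.75.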

The ANOVA Procedure

Dependent Variable: weight_gain

Sum of

Source DF Squares Mean Square F Value Pr > F

Model 5 4612.93333 922.58667 4.30 0.0023

Error 54 11586.00000 214.55556

Corrected Total 59 16198.93333

R-Square Coeff Var Root MSE weight_gain Mean

0.284768 16.67039 14.64772 87.86667

Source DF Anova SS Mean Square F Value Pr > F

Protein 1 3168.266667 3168.266667 14.77 0.0003

Source 2 266.533333 133.266667 0.62 0.5411

Protein*Source 2 1178.133333 589.066667 2.75 0.0732

Part II. General Linear Model

88

One-way ANOVA

The ANOVA is indeed a special case of the

general linear model (GLM) when all the

predictors are categorical variables.

For one-way ANOVA, we have only one

categorical predictor. As shown in the

following slides, we can easily translate the

ANOVA into a GLM using dummy variables.

89

Dummy Variables

• Dummy coding

• 0s and 1s

– For a categorical predictor with k categories, k-1 dummy variables will go into the regression equation, leaving out one reference category (e.g. control)

• Coefficients are interpreted as change with respect to the reference category (the one with all zeros)

– In this case group 3

Group   D1   D2
1       1    0
2       0    1
3       0    0

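GLM representation and interpretations

• GLM model:

$$Y = \mu + \beta_1 D_1 + \beta_2 D_2 + \varepsilon$$

• Relation to category/group means:

$$\text{Group 1: } \mu_1 = \mu + \beta_1 D_1 + \beta_2 D_2 = \mu + \beta_1 \cdot 1 + \beta_2 \cdot 0 = \mu + \beta_1$$

$$\text{Group 2: } \mu_2 = \mu + \beta_1 D_1 + \beta_2 D_2 = \mu + \beta_1 \cdot 0 + \beta_2 \cdot 1 = \mu + \beta_2$$

$$\text{Group 3: } \mu_3 = \mu + \beta_1 D_1 + \beta_2 D_2 = \mu + \beta_1 \cdot 0 + \beta_2 \cdot 0 = \mu$$

• Therefore the ANOVA hypothesis:

$$H_0: \mu_1 = \mu_2 = \mu_3$$

• Can be expressed as:

$$H_0: \beta_1 = \beta_2 = 0$$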
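The link between dummy coding and group means can be verified on a small data set. In this Python sketch the data, the helper `dummy_code`, and the coefficient symbols (intercept equal to the reference-group mean, slopes equal to differences from it) are illustrative assumptions. The check uses the normal equations: a coefficient vector is a least-squares solution exactly when the residuals are orthogonal to every column of the design matrix.

```python
def dummy_code(group, levels, reference):
    """Return the k - 1 dummy variables for one observation."""
    return [1 if group == lv else 0 for lv in levels if lv != reference]


data = {1: [4.0, 6.0], 2: [7.0, 9.0], 3: [1.0, 3.0]}   # hypothetical data
levels = [1, 2, 3]
reference = 3

means = {g: sum(ys) / len(ys) for g, ys in data.items()}
intercept = means[reference]                      # mean of reference group 3
slopes = [means[1] - intercept, means[2] - intercept]
coef = [intercept] + slopes

# Design rows: [1, D1, D2] for every observation, paired with its response.
rows = [([1] + dummy_code(g, levels, reference), y)
        for g, ys in data.items() for y in ys]
residuals = [y - sum(c * x for c, x in zip(coef, xrow)) for xrow, y in rows]

# Normal equations: each design column must be orthogonal to the residuals.
normal_eq = [sum(r * xrow[j] for (xrow, _), r in zip(rows, residuals))
             for j in range(3)]
print(coef, normal_eq)
```

For these data the coefficients are 2 (group 3 mean), 3 and 6 (differences of groups 1 and 2 from group 3), and all three normal-equation sums are zero, so testing that both slopes are zero is the same as testing equality of the three group means.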

Two-way ANOVA

We will revisit the two-way ANOVA example on the effect on weight_gain of two factors:

(1)Protein level (denoted as Protein) – it has

two levels: High/Low

(2)Protein source (denoted as Source) – it has

three levels: Beef/Cereal/Pork

92

Dummy Variables

Protein   D
High      1
Low       0

Source    D1   D2
Beef      1    0
Cereal    0    1
Pork      0    0

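GLM representation and interpretations

• GLM model:

$$Y = \mu + \beta_1 D + \beta_2 D_1 + \beta_3 D_2 + \beta_4 D{*}D_1 + \beta_5 D{*}D_2 + \varepsilon$$

• Relation to category/group means:

$$\text{High/Beef: } \mu_1 = \mu + \beta_1 \cdot 1 + \beta_2 \cdot 1 + \beta_3 \cdot 0 + \beta_4 \cdot 1{*}1 + \beta_5 \cdot 1{*}0 = \mu + \beta_1 + \beta_2 + \beta_4$$

$$\text{High/Cereal: } \mu_2 = \mu + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 1 + \beta_4 \cdot 1{*}0 + \beta_5 \cdot 1{*}1 = \mu + \beta_1 + \beta_3 + \beta_5$$

$$\text{High/Pork: } \mu_3 = \mu + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 1{*}0 + \beta_5 \cdot 1{*}0 = \mu + \beta_1$$

$$\text{Low/Beef: } \mu_4 = \mu + \beta_1 \cdot 0 + \beta_2 \cdot 1 + \beta_3 \cdot 0 + \beta_4 \cdot 0{*}1 + \beta_5 \cdot 0{*}0 = \mu + \beta_2$$

$$\text{Low/Cereal: } \mu_5 = \mu + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 1 + \beta_4 \cdot 0{*}0 + \beta_5 \cdot 0{*}1 = \mu + \beta_3$$

$$\text{Low/Pork: } \mu_6 = \mu + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 0{*}0 + \beta_5 \cdot 0{*}0 = \mu$$

GLM representation and interpretations

• Test for Interaction:

$$H_0: \beta_4 = \beta_5 = 0$$

• Test for Protein (level) main effect:

$$H_0: \beta_1 = 0$$

• Test for (protein) Source main effect:

$$H_0: \beta_2 = \beta_3 = 0$$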
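Because this model has as many parameters as there are cells, its coefficients can be solved directly from the six cell means. The sketch below assumes the coding D = 1 for High, D1 = 1 for Beef, D2 = 1 for Cereal (Low/Pork as the reference cell) and reads the model as Y = μ + β1·D + β2·D1 + β3·D2 + β4·D·D1 + β5·D·D2 + ε; the variable names are illustrative, and the cell means are those of the rat example.

```python
cell_mean = {
    ("High", "Beef"): 100.0, ("High", "Cereal"): 85.9, ("High", "Pork"): 99.5,
    ("Low", "Beef"): 79.2, ("Low", "Cereal"): 83.9, ("Low", "Pork"): 78.7,
}

mu = cell_mean[("Low", "Pork")]                       # reference cell
b1 = cell_mean[("High", "Pork")] - mu                 # Protein dummy D
b2 = cell_mean[("Low", "Beef")] - mu                  # Source dummy D1
b3 = cell_mean[("Low", "Cereal")] - mu                # Source dummy D2
b4 = cell_mean[("High", "Beef")] - mu - b1 - b2       # D * D1
b5 = cell_mean[("High", "Cereal")] - mu - b1 - b3     # D * D2


def fitted(protein, source):
    """Cell mean implied by the dummy-variable model."""
    d = 1 if protein == "High" else 0
    d1 = 1 if source == "Beef" else 0
    d2 = 1 if source == "Cereal" else 0
    return mu + b1 * d + b2 * d1 + b3 * d2 + b4 * d * d1 + b5 * d * d2


print([round(v, 1) for v in (mu, b1, b2, b3, b4, b5)])
```

Under these assumptions the estimates are roughly μ = 78.7, β1 = 20.8, β2 = 0.5, β3 = 5.2, β4 = 0.0 and β5 = -18.8. The large β5 reflects the High/Cereal cell, where the High-versus-Low gap shrinks from 20.8 to 2.0 grams.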

Acknowledgement:

• We thank colleagues who posted their lecture notes on the internet!

• Please note that in SAS, we have several procedures that will enable you to perform ANOVA. These include Proc ANOVA and Proc GLM, plus several other procedures such as Proc Mixed, etc. The ANOVA procedures we have learned so far are just the basic fixed effect ANOVAs. In the future we will also learn those with random effect, and mixed effects. See the following websites for a review and preview:

• http://www.ats.ucla.edu/stat/sas/library/SASAnova_mf.htm

• http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#mixed_toc.htm

• http://www.hawaii.edu/hisug/pdf/AnnMariaprocmixed.pdf
