Review of ANOVA and linear regression

Post on 19-Dec-2015


Page 1: Review of ANOVA and linear regression. Review of simple ANOVA

Review of ANOVA and linear regression

Page 2: Review of ANOVA and linear regression. Review of simple ANOVA

Review of simple ANOVA

Page 3: Review of ANOVA and linear regression. Review of simple ANOVA

ANOVA for comparing means between more than 2 groups

Page 4: Review of ANOVA and linear regression. Review of simple ANOVA

Hypotheses of One-Way ANOVA

$$H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_c$$

All population means are equal, i.e., no treatment effect (no variation in means among groups).

$H_1$: Not all of the population means are the same.

At least one population mean is different, i.e., there is a treatment effect. This does not mean that all population means are different (some pairs may be the same).

Page 5: Review of ANOVA and linear regression. Review of simple ANOVA

The F-distribution

A ratio of variances follows an F-distribution:

$$\frac{\sigma^2_{between}}{\sigma^2_{within}} \sim F_{n,m}$$

The F-test tests the hypothesis that two variances are equal. F will be close to 1 if sample variances are equal.

$$H_0: \sigma^2_{between} = \sigma^2_{within}$$

$$H_a: \sigma^2_{between} \neq \sigma^2_{within}$$

Page 6: Review of ANOVA and linear regression. Review of simple ANOVA

How to calculate ANOVAs by hand…

| Treatment 1 | Treatment 2 | Treatment 3 | Treatment 4 |
|---|---|---|---|
| y1,1 | y2,1 | y3,1 | y4,1 |
| y1,2 | y2,2 | y3,2 | y4,2 |
| y1,3 | y2,3 | y3,3 | y4,3 |
| y1,4 | y2,4 | y3,4 | y4,4 |
| y1,5 | y2,5 | y3,5 | y4,5 |
| y1,6 | y2,6 | y3,6 | y4,6 |
| y1,7 | y2,7 | y3,7 | y4,7 |
| y1,8 | y2,8 | y3,8 | y4,8 |
| y1,9 | y2,9 | y3,9 | y4,9 |
| y1,10 | y2,10 | y3,10 | y4,10 |

n=10 obs./group

k=4 groups

The group means (one per treatment group i):

$$\bar{y}_{i\cdot} = \frac{\sum_{j=1}^{10} y_{ij}}{10}, \quad i = 1, 2, 3, 4$$

The (within) group variances:

$$s_i^2 = \frac{\sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2}{10 - 1}, \quad i = 1, 2, 3, 4$$

Page 7: Review of ANOVA and linear regression. Review of simple ANOVA

Sum of Squares Within (SSW), or Sum of Squares Error (SSE)

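The (within) group variances, where $\bar{y}_{i\cdot}$ is the mean of group i:

$$\frac{\sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2}{10 - 1}, \quad i = 1, 2, 3, 4$$

Adding up the four numerators gives the Sum of Squares Within (SSW) (or SSE, for chance error):

$$SSW = \sum_{j=1}^{10} (y_{1j} - \bar{y}_{1\cdot})^2 + \sum_{j=1}^{10} (y_{2j} - \bar{y}_{2\cdot})^2 + \sum_{j=1}^{10} (y_{3j} - \bar{y}_{3\cdot})^2 + \sum_{j=1}^{10} (y_{4j} - \bar{y}_{4\cdot})^2 = \sum_{i=1}^{4} \sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2$$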

Page 8: Review of ANOVA and linear regression. Review of simple ANOVA

Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)

Sum of Squares Between (SSB). Variability of the group means compared to the grand mean (the variability due to the treatment).

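Overall mean of all 40 observations (“grand mean”):

$$\bar{y}_{\cdot\cdot} = \frac{\sum_{i=1}^{4} \sum_{j=1}^{10} y_{ij}}{40}$$

With $\bar{y}_{i\cdot}$ denoting the mean of group i:

$$SSB = 10 \times \sum_{i=1}^{4} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$$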

Page 9: Review of ANOVA and linear regression. Review of simple ANOVA

Total Sum of Squares (SST)

Total sum of squares (TSS): squared difference of every observation from the overall mean $\bar{y}_{\cdot\cdot}$ (the numerator of the variance of Y!).

$$TSS = \sum_{i=1}^{4} \sum_{j=1}^{10} (y_{ij} - \bar{y}_{\cdot\cdot})^2$$

Page 10: Review of ANOVA and linear regression. Review of simple ANOVA

Partitioning of Variance

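$$\sum_{i=1}^{4} \sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2 + 10 \times \sum_{i=1}^{4} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{4} \sum_{j=1}^{10} (y_{ij} - \bar{y}_{\cdot\cdot})^2$$

SSW + SSB = TSS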
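The partition can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `sums_of_squares` and the toy data are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
from statistics import mean


def sums_of_squares(groups):
    """Return (SSW, SSB, TSS) for a list of groups of observations."""
    all_values = [y for g in groups for y in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Within: each observation against its own group mean.
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    # Between: each group mean against the grand mean, weighted by group size.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Total: each observation against the grand mean.
    tss = sum((y - grand_mean) ** 2 for y in all_values)
    return ssw, ssb, tss


# Hypothetical toy data (three groups of three), not from the slides.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [5.0, 6.0, 7.0]]
ssw, ssb, tss = sums_of_squares(toy_groups)
print(ssw, ssb, tss)  # SSW + SSB should equal TSS
```

For these toy numbers the group means are 2, 4 and 6 and the grand mean is 4, so SSW = 12, SSB = 24 and TSS = 36.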

Page 11: Review of ANOVA and linear regression. Review of simple ANOVA

ANOVA Table

| Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Between (k groups) | k-1 | SSB (sum of squared deviations of group means from grand mean) | SSB/(k-1) | [SSB/(k-1)] / [SSW/(nk-k)] | Go to F(k-1, nk-k) chart |
| Within (n individuals per group) | nk-k | SSW (sum of squared deviations of observations from their group mean) | s² = SSW/(nk-k) | | |
| Total variation | nk-1 | TSS (sum of squared deviations of observations from grand mean) | | | |

TSS = SSB + SSW

Page 12: Review of ANOVA and linear regression. Review of simple ANOVA

Example

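All values are in inches.

| Treatment 1 | Treatment 2 | Treatment 3 | Treatment 4 |
|---|---|---|---|
| 60 | 50 | 48 | 47 |
| 67 | 52 | 49 | 67 |
| 42 | 43 | 50 | 54 |
| 67 | 67 | 55 | 67 |
| 56 | 67 | 56 | 68 |
| 62 | 59 | 61 | 65 |
| 64 | 67 | 61 | 65 |
| 59 | 64 | 60 | 56 |
| 72 | 63 | 59 | 60 |
| 71 | 65 | 64 | 65 |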

Page 13: Review of ANOVA and linear regression. Review of simple ANOVA

Example


Step 1) Calculate the sum of squares between groups:

Mean for group 1 = 62.0

Mean for group 2 = 59.7

Mean for group 3 = 56.3

Mean for group 4 = 61.4

Grand mean = 59.85

SSB = [(62 − 59.85)² + (59.7 − 59.85)² + (56.3 − 59.85)² + (61.4 − 59.85)²] × n per group = 19.65 × 10 = 196.5
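Step 1 can be reproduced in a few lines. This is a small sketch that starts from the four group means quoted on the slide; the variable names are illustrative, and averaging the group means to get the grand mean relies on the groups being equal-sized.

```python
group_means = [62.0, 59.7, 56.3, 61.4]  # group means quoted on the slide
n_per_group = 10

# With equal group sizes, the grand mean is the plain average of the group means.
grand_mean = sum(group_means) / len(group_means)

# SSB: squared distance of each group mean from the grand mean, times n per group.
ssb = n_per_group * sum((m - grand_mean) ** 2 for m in group_means)

print(round(grand_mean, 2), round(ssb, 1))  # 59.85 196.5
```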

Page 14: Review of ANOVA and linear regression. Review of simple ANOVA

Example


Step 2) Calculate the sum of squares within groups:

(60−62)² + (67−62)² + (42−62)² + (67−62)² + (56−62)² + (62−62)² + (64−62)² + (59−62)² + (72−62)² + (71−62)² + (50−59.7)² + (52−59.7)² + (43−59.7)² + (67−59.7)² + (67−59.7)² + (59−59.7)² + … (sum of 40 squared deviations) = 2060.6
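The sum of 40 squared deviations is tedious by hand, so here is a short sketch that computes it from the example table. The list name `treatments` is illustrative; the numbers are the heights from the slide.

```python
from statistics import mean

# Heights (inches) from the example table, one list per treatment group.
treatments = [
    [60, 67, 42, 67, 56, 62, 64, 59, 72, 71],
    [50, 52, 43, 67, 67, 59, 67, 64, 63, 65],
    [48, 49, 50, 55, 56, 61, 61, 60, 59, 64],
    [47, 67, 54, 67, 68, 65, 65, 56, 60, 65],
]

ssw = 0.0
for group in treatments:
    group_mean = mean(group)
    # Add this group's squared deviations from its own mean.
    ssw += sum((y - group_mean) ** 2 for y in group)

print(round(ssw, 1))  # 2060.6
```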

Page 15: Review of ANOVA and linear regression. Review of simple ANOVA

Step 3) Fill in the ANOVA table

| Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Between | 3 | 196.5 | 65.5 | 1.14 | .344 |
| Within | 36 | 2060.6 | 57.2 | | |
| Total | 39 | 2257.1 | | | |

Page 16: Review of ANOVA and linear regression. Review of simple ANOVA


INTERPRETATION of ANOVA:

How much of the variance in height is explained by treatment group?

R² = “Coefficient of Determination” = SSB/TSS = 196.5/2257.1 = 9%
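The remaining table entries follow mechanically from the two sums of squares. A minimal sketch, with illustrative variable names, that derives the degrees of freedom, mean squares, F-statistic and R²:

```python
ssb, ssw = 196.5, 2060.6   # sums of squares from Steps 1 and 2
k, n = 4, 10               # number of groups, observations per group

df_between = k - 1         # 3
df_within = n * k - k      # 36
ms_between = ssb / df_between
ms_within = ssw / df_within
f_stat = ms_between / ms_within
r_squared = ssb / (ssb + ssw)   # SSB / TSS

print(round(ms_between, 1), round(ms_within, 1), round(f_stat, 2), round(r_squared, 3))
# 65.5 57.2 1.14 0.087
```

The p-value is not computed here; it comes from an F chart with (3, 36) degrees of freedom, or, if SciPy happens to be available, from something like `scipy.stats.f.sf(f_stat, df_between, df_within)`.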

Page 17: Review of ANOVA and linear regression. Review of simple ANOVA

Coefficient of Determination

$$R^2 = \frac{SSB}{SSB + SSE} = \frac{SSB}{SST}$$

The amount of variation in the outcome variable (dependent variable) that is explained by the predictor (independent variable).

Page 18: Review of ANOVA and linear regression. Review of simple ANOVA

ANOVA example

Table 6. Mean micronutrient intake from the school lunch by school

| Nutrient | Statistic | S1 (a), n=25 | S2 (b), n=25 | S3 (c), n=25 | P-value (d) |
|---|---|---|---|---|---|
| Calcium (mg) | Mean | 117.8 | 158.7 | 206.5 | 0.000 |
| | SD (e) | 62.4 | 70.5 | 86.2 | |
| Iron (mg) | Mean | 2.0 | 2.0 | 2.0 | 0.854 |
| | SD | 0.6 | 0.6 | 0.6 | |
| Folate (μg) | Mean | 26.6 | 38.7 | 42.6 | 0.000 |
| | SD | 13.1 | 14.5 | 15.1 | |
| Zinc (mg) | Mean | 1.9 | 1.5 | 1.3 | 0.055 |
| | SD | 1.0 | 1.2 | 0.4 | |

a School 1 (most deprived; 40% subsidized lunches).
b School 2 (medium deprived; <10% subsidized).
c School 3 (least deprived; no subsidization, private school).
d ANOVA; significant differences are highlighted in bold (P<0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England-are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.

Page 19: Review of ANOVA and linear regression. Review of simple ANOVA

Answer

Step 1) Calculate the sum of squares between groups:

Mean for School 1 = 117.8

Mean for School 2 = 158.7

Mean for School 3 = 206.5

Grand mean: 161

SSB = [(117.8 − 161)² + (158.7 − 161)² + (206.5 − 161)²] × 25 per group = 3941.78 × 25 = 98,545

Page 20: Review of ANOVA and linear regression. Review of simple ANOVA

Answer

Step 2) Calculate the sum of squares within groups:

S.D. for S1 = 62.4

S.D. for S2 = 70.5

S.D. for S3 = 86.2

Therefore, sum of squares within is:

(24)[62.4² + 70.5² + 86.2²] = 391,066

Page 21: Review of ANOVA and linear regression. Review of simple ANOVA

Answer

Step 3) Fill in your ANOVA table

| Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Between | 2 | 98,545 | 49,272 | 9.1 | <.05 |
| Within | 72 | 391,066 | 5,431 | | |
| Total | 74 | 489,611 | | | |

**R² = 98,545/489,611 = 20%

School explains 20% of the variance in lunchtime calcium intake in these kids.
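This answer needs only the published means, standard deviations and group size, because each group's sum of squared deviations equals (n − 1) times its variance. The sketch below recomputes the F-statistic and R² from the rounded calcium values in Table 6; the variable names are illustrative.

```python
means = [117.8, 158.7, 206.5]   # calcium (mg) means for schools 1-3, Table 6
sds = [62.4, 70.5, 86.2]        # corresponding standard deviations
n = 25                          # children per school
k = len(means)

# Equal group sizes, so the grand mean is the plain average of the group means.
grand_mean = sum(means) / k
ssb = n * sum((m - grand_mean) ** 2 for m in means)
# Each within-group sum of squares is (n - 1) times that group's variance.
ssw = (n - 1) * sum(sd ** 2 for sd in sds)

f_stat = (ssb / (k - 1)) / (ssw / (n * k - k))
r_squared = ssb / (ssb + ssw)
print(round(f_stat, 1), round(r_squared, 2))
```

The result is an F-statistic of about 9 on (2, 72) degrees of freedom and an R² of about 20%, matching the conclusion on the slide.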

Page 22: Review of ANOVA and linear regression. Review of simple ANOVA

Beyond one-way ANOVA

Often, you may want to test more than 1 treatment. ANOVA can accommodate more than 1 treatment or factor, so long as they are independent. Again, the variation partitions beautifully!

 TSS = SSB1 + SSB2 + SSW  

Page 23: Review of ANOVA and linear regression. Review of simple ANOVA

Linear regression review

Page 24: Review of ANOVA and linear regression. Review of simple ANOVA

What is “Linear”?

Remember this: Y = mX + B?

[Figure: a straight line with slope m and y-intercept B]

Page 25: Review of ANOVA and linear regression. Review of simple ANOVA

What’s Slope?

A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

Page 26: Review of ANOVA and linear regression. Review of simple ANOVA

Regression equation…

Expected value of y at a given level of x:

$$E(y_i \mid x_i) = \alpha + \beta x_i$$

Page 27: Review of ANOVA and linear regression. Review of simple ANOVA

Predicted value for an individual…

$$y_i = \alpha + \beta x_i + \text{random error}_i$$

$\alpha + \beta x_i$: fixed, exactly on the line

$\text{random error}_i$: follows a normal distribution

Page 28: Review of ANOVA and linear regression. Review of simple ANOVA

Assumptions (or the fine print)

Linear regression assumes that…

1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)
4. The observations are independent**

**When we talk about repeated measures starting next week, we will violate this assumption and hence need more sophisticated regression models!

Page 29: Review of ANOVA and linear regression. Review of simple ANOVA

The standard error of Y given X is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

[Figure: regression line with the same spread, Sy/x, marked at several values of X]

Page 30: Review of ANOVA and linear regression. Review of simple ANOVA

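Regression Picture

[Figure: scatter plot of y against x showing the fitted line $\hat{y}_i = \beta x_i + \alpha$ and the naïve mean $\bar{y}$. For an observation $y_i$, A is its distance from $\bar{y}$, B is the distance from the fitted line to $\bar{y}$, and C is its distance from the fitted line.]

*Least squares estimation gave us the line (β) that minimized C²

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

A² (SStotal): total squared distance of observations from naïve mean of y. Total variation.

B² (SSreg): distance from regression line to naïve mean of y. Variability due to x (regression).

C² (SSresidual): variance around the regression line. Additional variability not explained by x, which is what the least squares method aims to minimize.

R² = SSreg/SStotal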
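The decomposition in the regression picture can be verified on a small dataset. The sketch below uses the standard closed-form least squares estimates for one predictor; the helper name `fit_line` and the five data points are hypothetical and not taken from the slides.

```python
from statistics import mean


def fit_line(x, y):
    """Closed-form least squares estimates (intercept, slope) for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data, not from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]
y_bar = mean(y)

ss_total = sum((yi - y_bar) ** 2 for yi in y)                    # A^2
ss_reg = sum((fi - y_bar) ** 2 for fi in y_hat)                  # B^2
ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # C^2
r_squared = ss_reg / ss_total

print(round(ss_total, 3), round(ss_reg, 3), round(ss_residual, 3), round(r_squared, 3))
```

For these points the fitted line is ŷ = 2.2 + 0.6x, and the three sums of squares work out to 6, 3.6 and 2.4, so R² = 0.6.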

Page 31: Review of ANOVA and linear regression. Review of simple ANOVA

Recall example: cognitive function and vitamin D

Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men. Cognitive function is measured by the Digit Symbol Substitution Test (DSST).

1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.

Page 32: Review of ANOVA and linear regression. Review of simple ANOVA

Distribution of vitamin D

Mean= 63 nmol/L

Standard deviation = 33 nmol/L

Page 33: Review of ANOVA and linear regression. Review of simple ANOVA

Distribution of DSST

Normally distributed

Mean = 28 points

Standard deviation = 10 points

Page 34: Review of ANOVA and linear regression. Review of simple ANOVA

Four hypothetical datasets

I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST):

0
0.5 points per 10 nmol/L
1.0 points per 10 nmol/L
1.5 points per 10 nmol/L

Page 35: Review of ANOVA and linear regression. Review of simple ANOVA

Dataset 1: no relationship

Page 36: Review of ANOVA and linear regression. Review of simple ANOVA

Dataset 2: weak relationship

Page 37: Review of ANOVA and linear regression. Review of simple ANOVA

Dataset 3: weak to moderate relationship

Page 38: Review of ANOVA and linear regression. Review of simple ANOVA

Dataset 4: moderate relationship

Page 39: Review of ANOVA and linear regression. Review of simple ANOVA

The “Best fit” line

Regression equation:

E(Yi) = 28 + 0*vit Di (in 10 nmol/L)

Page 40: Review of ANOVA and linear regression. Review of simple ANOVA

The “Best fit” line

Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is!

Regression equation:

E(Yi) = 26 + 0.5*vit Di (in 10 nmol/L)

Page 41: Review of ANOVA and linear regression. Review of simple ANOVA

The “Best fit” line

Regression equation:

E(Yi) = 22 + 1.0*vit Di (in 10 nmol/L)

Page 42: Review of ANOVA and linear regression. Review of simple ANOVA

The “Best fit” line

Regression equation:

E(Yi) = 20 + 1.5*vit Di (in 10 nmol/L)

Note: all the lines go through the point (63, 28)!
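The common point is no coincidence: it is the point of means (mean vitamin D, mean DSST). A short derivation, assuming the usual least squares estimates for a single predictor (the hat notation $\hat{\alpha}$, $\hat{\beta}$ is introduced here and is not on the slide):

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \quad\Longrightarrow\quad \hat{y}(\bar{x}) = \hat{\alpha} + \hat{\beta}\,\bar{x} = \bar{y}$$

So whatever slope is estimated, the fitted line passes through $(\bar{x}, \bar{y})$, and a steeper slope is offset by a lower intercept.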

Page 43: Review of ANOVA and linear regression. Review of simple ANOVA

Significance testing…Slope

Distribution of slope: $\hat{\beta} \sim T_{n-2}(\beta, \text{s.e.}(\hat{\beta}))$

H0: β1 = 0 (no linear relationship)

H1: β1 ≠ 0 (linear relationship does exist)

$$T_{n-2} = \frac{\hat{\beta} - 0}{\text{s.e.}(\hat{\beta})}$$

Page 44: Review of ANOVA and linear regression. Review of simple ANOVA

Example: dataset 4

Slope estimate (beta) = 0.15 points per nmol/L (i.e., 1.5 points per 10 nmol/L)

Standard error (beta) = 0.03

T98 = 0.15/0.03 = 5, p<.0001

95% Confidence interval = 0.09 to 0.21
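The test statistic and interval above can be reproduced from the slope and its standard error. This sketch assumes a normal approximation (multiplier 1.96) for the 95% interval; the exact t critical value for 98 degrees of freedom is slightly larger (about 1.98), which gives the same interval after rounding. Variable names are illustrative.

```python
slope = 0.15        # estimated slope for dataset 4 (points per nmol/L)
std_error = 0.03    # standard error of the slope, from the slide
n = 100             # men in the hypothetical study

df = n - 2
t_stat = (slope - 0) / std_error   # test of H0: slope = 0

z = 1.96            # assumption: normal approximation to the t critical value
ci = (slope - z * std_error, slope + z * std_error)

print(df, round(t_stat, 2), round(ci[0], 2), round(ci[1], 2))  # 98 5.0 0.09 0.21
```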

Page 45: Review of ANOVA and linear regression. Review of simple ANOVA

Multiple linear regression…

What if age is a confounder here?

Older men have lower vitamin D
Older men have poorer cognition

“Adjust” for age by putting age in the model:

DSST score = intercept + slope1 × vitamin D + slope2 × age

Page 46: Review of ANOVA and linear regression. Review of simple ANOVA

2 predictors: age and vit D…

Page 47: Review of ANOVA and linear regression. Review of simple ANOVA

Different 3D view…

Page 48: Review of ANOVA and linear regression. Review of simple ANOVA

Fit a plane rather than a line…

On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

Page 49: Review of ANOVA and linear regression. Review of simple ANOVA

Equation of the “Best fit” plane…

DSST score = 53 + 0.0039 × vitamin D (in 10 nmol/L) − 0.46 × age (in years)

P-value for vitamin D >> .05

P-value for age < .0001

Thus, relationship with vitamin D was due to confounding by age!
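Confounding of this kind is easy to reproduce with made-up numbers. In the sketch below, DSST is constructed to depend on age only, while vitamin D simply falls with age. Everything here is hypothetical: the data, the helper names `solve` and `least_squares`, and the choice to fit via the normal equations with plain Gaussian elimination rather than a statistics package.

```python
def solve(a, b):
    """Solve the square system a * coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):                     # back-substitution
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def least_squares(columns, y):
    """Fit y on the predictor columns (intercept added) via the normal equations."""
    design = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(u, v)) for v in design] for u in design]
    xty = [sum(p * yi for p, yi in zip(u, y)) for u in design]
    return solve(xtx, xty)


# Hypothetical data: DSST depends on age only; vitamin D merely falls with age.
age = [45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0]
wiggle = [3.0, -2.0, 1.0, -3.0, 2.0, -1.0, 3.0, -3.0]
vit_d = [100.0 - a + w for a, w in zip(age, wiggle)]
dsst = [53.0 - 0.46 * a for a in age]

crude = least_squares([vit_d], dsst)           # [intercept, vitamin D slope]
adjusted = least_squares([vit_d, age], dsst)   # [intercept, vitamin D slope, age slope]
print(round(crude[1], 3), abs(round(adjusted[1], 3)), round(adjusted[2], 3))
```

The crude vitamin D slope is clearly positive, but once age enters the model the vitamin D slope drops to essentially zero and the age slope recovers the −0.46 built into the data.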

Page 50: Review of ANOVA and linear regression. Review of simple ANOVA

Multiple Linear Regression: more than one predictor…

$$E(y) = \alpha + \beta_1 X + \beta_2 W + \beta_3 Z + \cdots$$

Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.

 

Page 51: Review of ANOVA and linear regression. Review of simple ANOVA

Functions of multivariate analysis:

Control for confounders
Test for interactions between predictors (effect modification)
Improve predictions

Page 52: Review of ANOVA and linear regression. Review of simple ANOVA

ANOVA is linear regression!

Divide vitamin D into three groups:

Deficient (<25 nmol/L)
Insufficient (>=25 and <50 nmol/L)
Sufficient (>=50 nmol/L), reference group

$$DSST = \alpha\ (\text{= value for sufficient}) + \beta_{insufficient} \times (1 \text{ if insufficient}) + \beta_{deficient} \times (1 \text{ if deficient})$$

This is called “dummy coding”, where multiple binary variables are created to represent being in each category (or not) of a categorical variable.

Page 53: Review of ANOVA and linear regression. Review of simple ANOVA

The picture…

[Figure: the two comparisons against the reference group, Sufficient vs. Insufficient and Sufficient vs. Deficient]

Page 54: Review of ANOVA and linear regression. Review of simple ANOVA

Results…

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 40.07407 | 1.47817 | 27.11 | <.0001 |
| deficient | 1 | -9.87407 | 3.73950 | -2.64 | 0.0096 |
| insufficient | 1 | -6.87963 | 2.33719 | -2.94 | 0.0041 |

Interpretation:

The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group.

The insufficient group has a mean DSST 6.88 points lower than the reference (sufficient) group.
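The reason these estimates read as differences in group means can be checked directly. In the sketch below the scores are hypothetical (not the data behind the output above): it builds the two dummy variables, sets the intercept to the reference-group mean and each coefficient to a difference in means, and then confirms that this choice satisfies the least squares normal equations (residuals orthogonal to every column of the design).

```python
from statistics import mean

# Hypothetical DSST scores by vitamin D group (not the data behind the slide).
scores = {
    "sufficient": [42.0, 38.0, 40.0, 44.0],   # reference group
    "insufficient": [35.0, 31.0, 33.0],
    "deficient": [28.0, 32.0, 30.0],
}

# Dummy coding: one 0/1 indicator per non-reference category.
rows = []  # each row is (insufficient, deficient, dsst)
for group, values in scores.items():
    for v in values:
        rows.append((int(group == "insufficient"), int(group == "deficient"), v))

# Candidate solution: intercept = reference mean,
# each coefficient = that group's mean minus the reference mean.
intercept = mean(scores["sufficient"])
b_insufficient = mean(scores["insufficient"]) - intercept
b_deficient = mean(scores["deficient"]) - intercept

# Normal equations: residuals must sum to zero against every design column.
residuals = [y - (intercept + b_insufficient * i + b_deficient * d) for i, d, y in rows]
orthogonality = (
    sum(residuals),                                        # intercept column
    sum(r * i for r, (i, _, _) in zip(residuals, rows)),   # insufficient dummy
    sum(r * d for r, (_, d, _) in zip(residuals, rows)),   # deficient dummy
)
print(intercept, b_insufficient, b_deficient, orthogonality)
```

Because all three orthogonality sums are zero, the group-mean differences are the least squares coefficients, which is why one-way ANOVA and regression on dummy variables give the same fit.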