Statistics 262: Intermediate Biostatistics · statweb.stanford.edu/.../spring.2004/notes/week2.pdf ·...

Page 1: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/.../spring.2004/notes/week2.pdf · 2005-11-23 · variable from the model. For instance, suppose we want to see if education

Statistics 262: Intermediate Biostatistics

Regression, ANOVA, Random Effects

Jonathan Taylor & Kristin Cobb

Statistics 262: Intermediate Biostatistics – p.1/??

Overview of today’s class

Multiple regression models

Analysis of Variance: Fixed effects

Analysis of Variance: Random effects

Example: mortality rates

Pollution study: for $n = 59$ metropolitan areas, we record median education, $X_1$; % nonwhite, $X_2$; median income, $X_3$; and pollution, $X_4$.

Aim is to predict mortality rates $Y$ based on $X_1, \dots, X_4$.

“Simplest model”

$$Y \mid X_1, \dots, X_4 = \beta_0 + \sum_{i=1}^{4} \beta_i \cdot X_i + \varepsilon.$$

Example: mortality rates

As in the simple regression model, the $\varepsilon$’s are assumed independent (conditional on all of the observed $X$’s) and homoskedastic (equal variance).

Model is fit using least squares, as usual.

Questions of possible interest: is pollution correlated with mortality?

Or, $H_0 : \beta_4 = 0$?

Testing hypotheses:

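Assuming $\varepsilon \sim N(0, \sigma^2)$, the least squares estimates satisfy

$$\hat\beta = (X^T X)^{-1} X^T Y \sim N\left(\beta, \sigma^2 (X^T X)^{-1}\right)$$

where $X$ is the design matrix

$$X = \begin{pmatrix} 1 & X_{1,1} & \cdots & X_{1,4} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{59,1} & \cdots & X_{59,4} \end{pmatrix}.$$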
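A minimal sketch of the least squares estimate in plain Python, solving the normal equations $(X^T X)\hat\beta = X^T Y$. The data are made up for illustration (they are not the pollution data), and `solve` and `least_squares` are hypothetical helper names, not part of the course material.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, y):
    """Return beta-hat given predictor rows (no intercept column) and responses y."""
    X = [[1.0] + list(r) for r in rows]  # design matrix with a leading column of 1s
    p = len(X[0])
    xtx = [[sum(xi[j] * xi[k] for xi in X) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)  # normal equations: (X^T X) beta = X^T y


# Made-up, noise-free data generated from Y = 1 + 2*X1 - 3*X2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
beta_hat = least_squares(rows, y)
```

Because the made-up responses contain no noise, `beta_hat` recovers the generating coefficients up to rounding error. With real data a numerically stabler method (for example a QR decomposition) is usually preferred over forming $X^T X$ explicitly.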

Distributions

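For any linear combination of $\beta$’s

$$\langle a, \hat\beta \rangle \sim N\left(\langle a, \beta \rangle, \sigma^2 a (X^T X)^{-1} a^T\right).$$

We also “know” that

$$\hat\sigma^2 = \frac{1}{54} \sum_{i=1}^{59} \Bigl(Y_i - \bigl(\hat\beta_0 + \sum_{j=1}^{4} \hat\beta_j X_{i,j}\bigr)\Bigr)^2 \sim \sigma^2 \cdot \frac{\chi^2_{54}}{54}.$$

Why “54”? Because we estimated 5 parameters, leaving $59 - 5 = 54$ degrees of freedom.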

t-tests

These facts tell us that

$$\frac{\langle a, \hat\beta \rangle - \langle a, \beta \rangle}{\sqrt{\hat\sigma^2 a (X^T X)^{-1} a^T}} \sim t_{54}.$$

This gives a way to test whether $\beta_4 = 0$ (and get a CI for $\beta_4$): compute

$$T = \frac{\hat\beta_4}{\sqrt{\hat\sigma^2 (X^T X)^{-1}_{5,5}}}.$$

If $|T| > t_{54,0.975} \approx 2.00$ then we reject $H_0$ at level 0.05.
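The confidence interval mentioned above comes from inverting the same t statistic. As a sketch, using only the quantities already defined on this slide, the 95% interval for $\beta_4$ is:

```latex
\hat\beta_4 \;\pm\; t_{54,0.975} \sqrt{\hat\sigma^2 \,(X^T X)^{-1}_{5,5}}
```

$H_0 : \beta_4 = 0$ is rejected at level 0.05 exactly when 0 falls outside this interval.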

F-tests

A t-test is a “partial” regression test because it tests for the effect of one variable “allowing for” the effects of the remaining variables.

Sometimes, it may be of interest to see whether we can “safely” drop more than one variable from the model.

For instance, suppose we want to see if % nonwhite and median income can be dropped from the model, that is, test $H_0 : \beta_2 = \beta_3 = 0$.

Test is based on the difference in SSE between the two models and the SSE of the “full” model.

Error sums of squares

Consider the two models

$$Y = \beta_0 + \beta_1 \cdot X_1 + \beta_4 \cdot X_4 + \varepsilon$$

$$Y = \beta_0 + \sum_{i=1}^{4} \beta_i \cdot X_i + \varepsilon$$

Each model has an SSE, say $SSE(R)$ (reduced) and $SSE(F)$ (full).

F-test procedure

If the $\varepsilon$’s are normally distributed and homoskedastic, then under $H_0$

$$F = \frac{\bigl(SSE(R) - SSE(F)\bigr) / (df_R - df_F)}{SSE(F) / df_F} \sim F_{df_R - df_F,\, df_F}.$$

Reject $H_0 : \beta_2 = \beta_3 = 0$ if $F > F_{df_R - df_F,\, df_F,\, 0.95}$.

That is, reject if the full model explains significantly more variability than the reduced model.
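The arithmetic of the F statistic can be sketched in a few lines of Python. The degrees of freedom follow the pollution example ($59 - 3 = 56$ for the reduced model, $59 - 5 = 54$ for the full model); the SSE values are invented for illustration, and `partial_f` is a hypothetical helper name.

```python
def partial_f(sse_reduced, sse_full, df_reduced, df_full):
    """Return (F, numerator df, denominator df) for nested linear models."""
    df_num = df_reduced - df_full
    mean_sq_drop = (sse_reduced - sse_full) / df_num  # extra SSE per dropped parameter
    mean_sq_full = sse_full / df_full                 # error variance estimate, full model
    return mean_sq_drop / mean_sq_full, df_num, df_full


# Invented SSE values; only the degrees of freedom come from the example.
f_stat, df_num, df_den = partial_f(sse_reduced=6000.0, sse_full=5400.0,
                                   df_reduced=56, df_full=54)
```

The resulting `f_stat` would then be compared with the 0.95 quantile of the $F_{2,54}$ distribution, taken from a table or a statistics package.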

SAS code

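PROC REG DATA=DATADIR.pollution;
  MODEL MORTALITY = INCOME POLLUTION NONWHITE EDUCATION / PARTIAL;
  /* test that the POLLUTION coefficient is zero */
  POLLUTION: TEST POLLUTION=0;
  /* joint F-test that INCOME and EDUCATION can both be dropped */
  FTEST: TEST INCOME=0, EDUCATION=0;
RUN;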

Diagnostics

QQ-plot as in simple regression.

Partial residual plots: for each variable

Measures of influence: Cook’s distance.

Variance inflation factors.

Cook’s distance, VIF

Cook’s distance is a measure of how much a particular observation influences the regression model.

Measures the difference in the predicted means when the i-th observation is deleted from the dataset.

The variance inflation factor of a predictor measures how much collinearity with the other predictors inflates the variance of its estimated coefficient, and hence how accurately you can estimate that coefficient.
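For reference, the standard textbook definitions of the two diagnostics can be written as follows. The symbols $\hat Y_{j(i)}$, $p$ and $R_k^2$ are notation introduced for this note, not taken from the slides.

```latex
D_i = \frac{\sum_{j=1}^{n} \bigl(\hat Y_j - \hat Y_{j(i)}\bigr)^2}{p \,\hat\sigma^2},
\qquad
\mathrm{VIF}_k = \frac{1}{1 - R_k^2}
```

Here $\hat Y_{j(i)}$ is the fitted value for observation $j$ when observation $i$ is deleted, $p$ is the number of regression parameters (5 in the pollution model), and $R_k^2$ is the $R^2$ from regressing $X_k$ on the other predictors. A VIF near 1 means $X_k$ is nearly uncorrelated with the other predictors; large values signal collinearity.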

SAS code: VIF, Cook’s distance

PROC REG DATA=DATADIR.pollution;
  MODEL MORTALITY = INCOME POLLUTION NONWHITE EDUCATION / VIF INFLUENCE;
  /* save residuals, Cook's distance and fitted values */
  OUTPUT OUT=DATADIR.pdiag R=RESID COOKD=COOK P=YHAT;
RUN;

/* plot Cook's distance against the fitted values */
PROC PLOT DATA=DATADIR.pdiag;
  PLOT COOK*YHAT;
RUN;

Using PROC GLM

PROC GLM DATA=DATADIR.pollution;
  MODEL MORTALITY = INCOME POLLUTION NONWHITE EDUCATION;
  /* single-coefficient test */
  CONTRAST 'Pollution' POLLUTION 1;
  /* joint test of two coefficients */
  CONTRAST 'Income & Education' INCOME 1, EDUCATION 1;
RUN;

Analysis of Variance

All variables in the pollution dataset are continuous.

In clinical settings, often there are categorical variables.

Simplest example: comparing two populations with a two-sample t-test.

One-way ANOVA

First generalization: more than two levels (groups).

One-way ANOVA model: observations $(Y_{ij})$, $1 \le i \le r$, $1 \le j \le n_i$: $r$ groups and $n_i$ samples in the $i$-th group.

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2).$$

Constraint: $\sum_{i=1}^{r} \alpha_i = 0$.

Simplest question: is there any group effect?

$$H_0 : \alpha_1 = \dots = \alpha_r = 0.$$

ANOVA tables: One-way

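| Source | SS | df | E(MS) |
|---|---|---|---|
| Treatments | $SSTR = \sum_{i=1}^{r} n_i (\bar Y_{i\cdot} - \bar Y_{\cdot\cdot})^2$ | $r - 1$ | $\sigma^2 + \frac{\sum_{i=1}^{r} n_i \alpha_i^2}{r - 1}$ |
| Error | $SSE = \sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{i\cdot})^2$ | $\sum_{i=1}^{r} n_i - r$ | $\sigma^2$ |

Notation: $\bar Y_{i\cdot}$ is the $i$-th group mean, $\bar Y_{\cdot\cdot}$ is the overall mean.

By looking at the ANOVA table, we can construct tests very easily.

For instance, we see that under $H_0 : \alpha_1 = \dots = \alpha_r = 0$, the expected values of the two mean squares, $SSTR / (r - 1)$ and $SSE / (\sum_{i=1}^{r} n_i - r)$, are both $\sigma^2$.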
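The table entries can be computed directly from the definitions. The sketch below uses a tiny made-up dataset with three groups (not the course data), and `one_way_anova` is a hypothetical helper name.

```python
def one_way_anova(groups):
    """Return (SSTR, SSE, df_TR, df_E, F) for a list of groups of observations."""
    all_obs = [y for g in groups for y in g]
    grand_mean = sum(all_obs) / len(all_obs)
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares, weighted by group size n_i.
    sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares around each group mean.
    sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_tr = len(groups) - 1
    df_e = len(all_obs) - len(groups)
    f_stat = (sstr / df_tr) / (sse / df_e)  # ratio of the two mean squares
    return sstr, sse, df_tr, df_e, f_stat


# Made-up observations for three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]
sstr, sse, df_tr, df_e, f_stat = one_way_anova(groups)
```

For these numbers the group means are 2, 3 and 7 around a grand mean of 4, which gives SSTR = 42 on 2 degrees of freedom and SSE = 6 on 6 degrees of freedom.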

Example: rehab surgery

How does prior fitness affect recovery from surgery? Observations: 24 subjects’ recovery time.

Three fitness levels: below average, average, above average.

If you are in better shape before surgery, does it take less time to recover?

Group effect

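Full model:

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$

Reduced model:

$$Y_{ij} = \mu + \varepsilon_{ij}.$$

F-statistic (here $r = 3$ groups and 24 subjects, so $df_{TR} = 2$ and $df_E = 21$):

$$F = \frac{\Bigl(\sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{\cdot\cdot})^2 - \sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{i\cdot})^2\Bigr) / 2}{\sum_{i=1}^{r} \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{i\cdot})^2 / 21} = \frac{SSTR / df_{TR}}{SSE / df_E}$$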

Test for trend

Does increased fitness correspond to a decrease in recovery time?

One way to test this: test

$$H_0 : \sum_{j=1}^{3} (j - 2) \cdot (\mu_j - \mu) = 0.$$

Rationale: if the means $\mu_j$ are of definite order then they will be correlated with the vector $(1, 2, 3)$. If the means are “symmetric” around $\mu_2$, then the test “should not” reject $H_0$.
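Written out, the sum in $H_0$ collapses to a comparison of the two extreme groups. This is a worked expansion using only the weights $j - 2$ defined above:

```latex
\sum_{j=1}^{3} (j-2)(\mu_j - \mu)
  = (-1)(\mu_1 - \mu) + 0 \cdot (\mu_2 - \mu) + (+1)(\mu_3 - \mu)
  = \mu_3 - \mu_1 .
```

The weights $(-1, 0, 1)$ sum to zero, which is why $\mu$ drops out. Assuming the levels are numbered in the order listed (below average, average, above average), $H_0$ says the above-average and below-average groups have the same mean recovery time.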

SAS code

PROC GLM DATA=DATADIR.rehab;
  CLASS FITNESS;
  MODEL TIME = FITNESS;
  /* contrast between the first and last fitness levels */
  ESTIMATE 'TREND' FITNESS 1 0 -1;
RUN;

Two-way ANOVA

Second generalization: more than one grouping variable.

Two-way ANOVA model (equal sample sizes): observations $(Y_{ijk})$, $1 \le i \le r$, $1 \le j \le m$, $1 \le k \le n$: $r$ groups in the first grouping variable, $m$ groups in the second and $n$ samples per “cell.”

$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}, \quad \varepsilon_{ijk} \sim N(0, \sigma^2).$$

Again: just a regression model.

Constraints on the parameters

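$\sum_{i=1}^{r} \alpha_i = 0$

$\sum_{j=1}^{m} \beta_j = 0$

$\sum_{j=1}^{m} (\alpha\beta)_{ij} = 0, \quad 1 \le i \le r$

$\sum_{i=1}^{r} (\alpha\beta)_{ij} = 0, \quad 1 \le j \le m.$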

Questions of interest

Are there main effects for the grouping variables?

$$H_0 : \alpha_1 = \dots = \alpha_r = 0, \qquad H_0 : \beta_1 = \dots = \beta_m = 0.$$

Are there interaction effects?

$$H_0 : (\alpha\beta)_{ij} = 0, \quad 1 \le i \le r, \; 1 \le j \le m.$$

ANOVA tables: Two-way (fixed)

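| Source | SS | df | E(MS) |
|---|---|---|---|
| A | $SSA = nm \sum_{i=1}^{r} (\bar Y_{i\cdot\cdot} - \bar Y_{\cdot\cdot\cdot})^2$ | $r - 1$ | $\sigma^2 + nm \frac{\sum_{i=1}^{r} \alpha_i^2}{r - 1}$ |
| B | $SSB = nr \sum_{j=1}^{m} (\bar Y_{\cdot j \cdot} - \bar Y_{\cdot\cdot\cdot})^2$ | $m - 1$ | $\sigma^2 + nr \frac{\sum_{j=1}^{m} \beta_j^2}{m - 1}$ |
| AB | $SSAB = n \sum_{i=1}^{r} \sum_{j=1}^{m} (\bar Y_{ij\cdot} - \bar Y_{i\cdot\cdot} - \bar Y_{\cdot j \cdot} + \bar Y_{\cdot\cdot\cdot})^2$ | $(m - 1)(r - 1)$ | $\sigma^2 + n \frac{\sum_{i=1}^{r} \sum_{j=1}^{m} (\alpha\beta)_{ij}^2}{(r - 1)(m - 1)}$ |
| Error | $SSE = \sum_{i=1}^{r} \sum_{j=1}^{m} \sum_{k=1}^{n} (Y_{ijk} - \bar Y_{ij\cdot})^2$ | $(n - 1)mr$ | $\sigma^2$ |

For instance, we see that under $H_0 : (\alpha\beta)_{ij} = 0, \forall i, j$ the expected values of the mean squares $SSAB / ((m - 1)(r - 1))$ and $SSE / ((n - 1)mr)$ are both $\sigma^2$: use these for an F-test.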
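As an illustration of the table, the sketch below computes the four sums of squares for a balanced layout. The 2 × 2 data with $n = 2$ observations per cell are made up, and `two_way_anova` is a hypothetical helper name.

```python
def two_way_anova(cells):
    """Sums of squares for a balanced two-way layout; cells[i][j] holds n observations."""
    r, m, n = len(cells), len(cells[0]), len(cells[0][0])
    cell_mean = [[sum(c) / n for c in row] for row in cells]
    row_mean = [sum(cm) / m for cm in cell_mean]                       # Ybar_i..
    col_mean = [sum(cell_mean[i][j] for i in range(r)) / r for j in range(m)]  # Ybar_.j.
    grand = sum(row_mean) / r                                          # Ybar_...
    ssa = n * m * sum((a - grand) ** 2 for a in row_mean)
    ssb = n * r * sum((b - grand) ** 2 for b in col_mean)
    ssab = n * sum((cell_mean[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
                   for i in range(r) for j in range(m))
    sse = sum((y - cell_mean[i][j]) ** 2
              for i in range(r) for j in range(m) for y in cells[i][j])
    return {"SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse,
            "dfAB": (r - 1) * (m - 1), "dfE": (n - 1) * m * r}


# Made-up data: 2 levels of each factor, 2 observations per cell.
cells = [[[1.0, 3.0], [3.0, 5.0]],
         [[5.0, 7.0], [11.0, 13.0]]]
ss = two_way_anova(cells)
# Interaction F statistic: MSAB / MSE.
f_interaction = (ss["SSAB"] / ss["dfAB"]) / (ss["SSE"] / ss["dfE"])
```

The four sums of squares add up to the total sum of squares around the grand mean, which is a convenient check on any implementation.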

Example: kidney failure

Time of stay in hospital depends on weight gain between treatments and duration of treatment.

Two levels of duration, three levels of weight gain.

Is there an interaction? Main effects?

SAS code

PROC GLM DATA=DATADIR.kidney;
  CLASS DURATION WEIGHT;
  MODEL DAYS = DURATION WEIGHT DURATION*WEIGHT;
  MEANS DURATION WEIGHT / LSD CLDIFF;
RUN;

Random vs. fixed effects

In the two ANOVA examples, the categorical variables are well-defined categories: below average fitness, long duration, etc.

In some designs, the categorical variable identifies the subject, so each subject is its own level.

Simplest example: repeated measures, where more than one (identical) measurement is taken on the same individual.

In this case, the “group” effect $\alpha_i$ is best thought of as random.

When to use random effects?

Example (two-way): suppose we were studying the variability of an assay to determine viral load in HIV+ patients. We are also interested in the variability across subtype.

We might collect data from many different centers on a few of the most prevalent subtypes.

Ignoring possible confounding, the “center” effect can be thought of as a random effect, and subtype as a fixed effect.

Example: sodium content in beer

How much sodium is there in North American beer? How much does this vary by brand?

Observations: for 6 brands of beer, we recorded the sodium content of eight 12-ounce bottles.

Questions of interest: what is the “grand mean” sodium content? How much variability is there from brand to brand?

“Individuals” in this case are brands, repeated measures are the 8 bottles.

One-way random effects model

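$Y_{ij} = \mu_\cdot + \alpha_i + \varepsilon_{ij}, \quad 1 \le i \le r, \; 1 \le j \le n$

$\varepsilon_{ij} \sim N(0, \sigma^2), \quad 1 \le i \le r, \; 1 \le j \le n$

$\alpha_i \sim N(0, \sigma^2_\mu), \quad 1 \le i \le r.$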

One-way random (equal sample sizes)

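| Source | SS | df | E(MS) |
|---|---|---|---|
| Treatments | $SSTR = \sum_{i=1}^{r} n (\bar Y_{i\cdot} - \bar Y_{\cdot\cdot})^2$ | $r - 1$ | $\sigma^2 + n\sigma^2_\mu$ |
| Error | $SSE = \sum_{i=1}^{r} \sum_{j=1}^{n} (Y_{ij} - \bar Y_{i\cdot})^2$ | $(n - 1)r$ | $\sigma^2$ |

Only change here is the expectation of the treatment mean square, which reflects the randomness of the $\alpha_i$’s.

ANOVA table is still useful to set up tests: the same F statistics for fixed or random will work here.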

Inference for µ·

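We know that $E(\bar Y_{\cdot\cdot}) = \mu_\cdot$, and can show that

$$\mathrm{Var}(\bar Y_{\cdot\cdot}) = \frac{n\sigma^2_\mu + \sigma^2}{rn}.$$

Therefore,

$$\frac{\bar Y_{\cdot\cdot} - \mu_\cdot}{\sqrt{\dfrac{SSTR}{(r - 1)rn}}} \sim t_{r-1}.$$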
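The variance formula can be derived in one line. The sketch assumes, as is standard for this model, that all the $\alpha_i$ and $\varepsilon_{ij}$ are mutually independent; $\bar\alpha$ and $\bar\varepsilon$ denote their averages and are notation introduced here.

```latex
\bar Y_{\cdot\cdot} = \mu_\cdot + \bar\alpha + \bar\varepsilon
\quad\Longrightarrow\quad
\mathrm{Var}(\bar Y_{\cdot\cdot})
  = \frac{\sigma^2_\mu}{r} + \frac{\sigma^2}{rn}
  = \frac{n\sigma^2_\mu + \sigma^2}{rn}.
```

Since the treatment mean square $SSTR / (r - 1)$ has expectation $\sigma^2 + n\sigma^2_\mu$, dividing it by $rn$ gives an unbiased estimate of this variance, which is the quantity under the square root in the t statistic.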

Inference for µ·

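Why $r - 1$ degrees of freedom? Imagine we could record an infinite number of observations for each individual, so that $\bar Y_{i\cdot} \to \mu_i = \mu_\cdot + \alpha_i$.

To learn anything about $\mu_\cdot$ we still only have $r$ observations $(\mu_1, \dots, \mu_r)$.

Sampling more within an individual cannot narrow the CI for $\mu_\cdot$ beyond the limit set by the between-individual variance $\sigma^2_\mu / r$.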

Inference for $\sigma^2_\mu / (\sigma^2_\mu + \sigma^2)$

This quantity describes the relative contribution of the random effects variance to the total variance of one observation.

We use the fact that

$$F = \frac{\sigma^2}{\sigma^2 + n\sigma^2_\mu} \times \frac{SSTR / (r - 1)}{SSE / ((n - 1)r)} \sim F_{r-1,(n-1)r}.$$

Manipulate the inequalities:

$$P\left(F_{r-1,(n-1)r,\alpha/2} \le F \le F_{r-1,(n-1)r,1-\alpha/2}\right) = 1 - \alpha.$$
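One way to carry out the manipulation: writing $\theta = \sigma^2_\mu / \sigma^2$, the scaled ratio of the observed mean squares is $F_{obs} / (1 + n\theta)$, so the two inequalities can be solved for $\theta$ and then mapped to $\theta / (1 + \theta) = \sigma^2_\mu / (\sigma^2_\mu + \sigma^2)$. The sketch below does this arithmetic. The F quantiles must be supplied by the caller (from a table or a statistics package); the numbers used here are placeholders, not real quantiles, and `variance_ratio_interval` is a hypothetical helper name.

```python
def variance_ratio_interval(mstr, mse, n, f_lower, f_upper):
    """Interval for sigma_mu^2 / (sigma_mu^2 + sigma^2) in the balanced one-way model.

    f_lower and f_upper are the alpha/2 and 1 - alpha/2 quantiles of the
    F distribution with (r - 1, (n - 1) r) degrees of freedom.
    """
    f_obs = mstr / mse
    # Solve f_lower <= f_obs / (1 + n * theta) <= f_upper for theta.
    theta_lo = (f_obs / f_upper - 1.0) / n
    theta_hi = (f_obs / f_lower - 1.0) / n
    # A variance ratio cannot be negative, so truncate at zero.
    theta_lo, theta_hi = max(theta_lo, 0.0), max(theta_hi, 0.0)
    # Map theta to theta / (1 + theta), which is increasing in theta.
    return theta_lo / (1.0 + theta_lo), theta_hi / (1.0 + theta_hi)


# Placeholder inputs for illustration only (not real data or real F quantiles).
lo, hi = variance_ratio_interval(mstr=20.0, mse=2.0, n=8, f_lower=0.5, f_upper=2.5)
```

Truncating at zero is a common convention rather than part of the derivation; it only matters when the observed F ratio is small.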

Estimating $\sigma^2_\mu$

From the ANOVA table

$$\sigma^2_\mu = \frac{E\bigl(SSTR / (r - 1)\bigr) - E\bigl(SSE / ((n - 1)r)\bigr)}{n}.$$

Natural estimate:

$$S^2_\mu = \frac{SSTR / (r - 1) - SSE / ((n - 1)r)}{n}$$

Problem: this estimate can be negative! One of the difficulties in random effects models.
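A short sketch of the estimate $S^2_\mu$ for balanced data, including a case where it comes out negative. Both datasets are made up, and `variance_component` is a hypothetical helper name.

```python
def variance_component(groups):
    """Balanced one-way random effects: return (MSTR, MSE, S2_mu)."""
    r, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]
    grand = sum(means) / r
    mstr = n * sum((m - grand) ** 2 for m in means) / (r - 1)
    mse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g) / ((n - 1) * r)
    return mstr, mse, (mstr - mse) / n


# Group means differ a lot relative to within-group noise: positive estimate.
spread = variance_component([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]])

# Identical group means but noisy observations: MSTR = 0 < MSE, negative estimate.
flat = variance_component([[1.0, 5.0], [2.0, 4.0], [3.0, 3.0]])
```

In the second dataset every group mean equals 3, so the treatment mean square is 0 and the estimate is negative even though a variance cannot be. A common workaround is to report 0 in that case.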

SAS code

PROC GLM DATA=DATADIR.beer;
  CLASS BRAND BOTTLE;
  MODEL SODIUM = BRAND;
  MEANS BRAND / LSD CLDIFF;
RUN;

Two-way random effects model

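$Y_{ijk} = \mu_{\cdot\cdot} + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}, \quad 1 \le i \le r, \; 1 \le j \le m, \; 1 \le k \le n$

$\varepsilon_{ijk} \sim N(0, \sigma^2), \quad 1 \le i \le r, \; 1 \le j \le m, \; 1 \le k \le n$

$\alpha_i \sim N(0, \sigma^2_\alpha), \quad 1 \le i \le r.$

$\beta_j \sim N(0, \sigma^2_\beta), \quad 1 \le j \le m.$

$(\alpha\beta)_{ij} \sim N(0, \sigma^2_{\alpha\beta}), \quad 1 \le j \le m, \; 1 \le i \le r.$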

ANOVA tables: Two-way (random)

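| Source | SS | df | E(MS) |
|---|---|---|---|
| A | $SSA = nm \sum_{i=1}^{r} (\bar Y_{i\cdot\cdot} - \bar Y_{\cdot\cdot\cdot})^2$ | $r - 1$ | $\sigma^2 + nm\sigma^2_\alpha + n\sigma^2_{\alpha\beta}$ |
| B | $SSB = nr \sum_{j=1}^{m} (\bar Y_{\cdot j \cdot} - \bar Y_{\cdot\cdot\cdot})^2$ | $m - 1$ | $\sigma^2 + nr\sigma^2_\beta + n\sigma^2_{\alpha\beta}$ |
| AB | $SSAB = n \sum_{i=1}^{r} \sum_{j=1}^{m} (\bar Y_{ij\cdot} - \bar Y_{i\cdot\cdot} - \bar Y_{\cdot j \cdot} + \bar Y_{\cdot\cdot\cdot})^2$ | $(m - 1)(r - 1)$ | $\sigma^2 + n\sigma^2_{\alpha\beta}$ |
| Error | $SSE = \sum_{i=1}^{r} \sum_{j=1}^{m} \sum_{k=1}^{n} (Y_{ijk} - \bar Y_{ij\cdot})^2$ | $(n - 1)mr$ | $\sigma^2$ |

To test $H_0 : \sigma^2_\alpha = 0$ use SSA and SSAB.

To test $H_0 : \sigma^2_{\alpha\beta} = 0$ use SSAB and SSE.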
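The pairing of sums of squares follows from the expected mean squares: under $H_0 : \sigma^2_\alpha = 0$ the mean squares for A and AB have the same expectation, and under $H_0 : \sigma^2_{\alpha\beta} = 0$ the same holds for AB and Error. A minimal sketch of the two F ratios, with made-up sums of squares and a hypothetical helper name `random_effects_f`:

```python
def random_effects_f(ssa, ssab, sse, r, m, n):
    """F ratios for the balanced two-way random effects model.

    Returns (F_A, F_AB): F_A tests sigma_alpha^2 = 0 against the interaction
    mean square, F_AB tests sigma_alphabeta^2 = 0 against the error mean square.
    """
    msa = ssa / (r - 1)
    msab = ssab / ((r - 1) * (m - 1))
    mse = sse / ((n - 1) * m * r)
    return msa / msab, msab / mse


# Made-up sums of squares for a 2 x 2 layout with 2 observations per cell.
f_a, f_ab = random_effects_f(ssa=72.0, ssab=8.0, sse=8.0, r=2, m=2, n=2)
```

Note the difference from the fixed effects case: the main effect of A is compared with the interaction mean square, not with the error mean square, so `f_a` is referred to an $F_{r-1,(r-1)(m-1)}$ distribution.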
