TRANSCRIPT
PubH8401 Linear Models
Revised from Yuan Zhang’s version
Fall 2019
Contents
1 Generalized, Linear, and Mixed Models 2
1.1 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Review: Likelihood Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Exponential Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.4 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.6 Goodness of Fit (GOF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.7 Over-dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.8 Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.9 Count Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.10 Quasi-likelihood Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Correlated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.2 Linear Mixed Models (LMM) . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.3 Generalized Linear Mixed Models (GLMM) . . . . . . . . . . . . . . . . . 34
1.2.4 Generalized Estimating Equations (GEE) . . . . . . . . . . . . . . . . . . . 37
1.2.5 Population-average (PA) Model vs Subject-specific (SS) Model . . . . . . 41
1.2.6 Comparison between GEE and GLMM . . . . . . . . . . . . . . . . . . . . 42
1.2.7 Missing Data Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 1
Generalized, Linear, and Mixed Models
Reference: McCulloch C., Searle S., Neuhaus J (2008). Generalized, Linear, and Mixed Models,
Second Edition. John Wiley & Sons, Inc., Hoboken, New Jersey.
1.1 Generalized Linear Models
1.1.1 Review: Likelihood Theory
1. X random: Assume that X and Θ are independent. The likelihood function is
L(Θ∣Y,X) = f(Y,X∣Θ) = ∏_{i=1}^n f(Y_i∣X_i^T, Θ) f(X_i^T),
and the log-likelihood function is
l(Θ∣Y,X) = log L(Θ∣Y,X).
2. X deterministic: The likelihood function is
L(Θ∣Y,X) = f(Y∣Θ,X) = ∏_{i=1}^n f(Y_i∣x_i^T, Θ),
and the log-likelihood function is
l(Θ∣Y,X) = log L(Θ∣Y,X).
Let θ̂ be the maximum likelihood estimator of θ, so that L(θ̂) = max_{θ∈Ω} L(θ).
The score function is
U(θ) = ∂l(θ)/∂θ,
and the score equation is
U(θ) = 0.
The (expected) information is
I_n(θ) = E[U(θ)U(θ)^T] = E[−∂²l(θ)/∂θ∂θ^T].
1.1.2 Exponential Family
The exponential family has the form
f(y∣θ, φ) = exp{ [yθ − b(θ)]/a(φ) + c(y, φ) },
where
θ is the parameter of interest, called the canonical parameter;
φ is a nuisance parameter, called the scale parameter.
Note 1: The formulation here might differ from what you learned in your statistical inference class, which allows for multiple θ's. Note 2: For the models we discuss in this class, we assume the parameter φ is known (or can be estimated separately).
Example 1.1.1.
1. Normal distribution: Y ∼ N(µ, σ²).
f(y∣θ, φ) = (2πσ²)^{−1/2} exp{ −(y − µ)²/(2σ²) } = exp{ (yµ − µ²/2)/σ² − y²/(2σ²) − (1/2) log(2πσ²) }
⟹ θ = µ, φ = σ², a(φ) = φ, b(θ) = θ²/2, c(y, φ) = −y²/(2φ) − (1/2) log(2πσ²)
2. Binomial distribution: Y ∼ Binomial(m, p). Writing C(m, y) for the binomial coefficient,
f(y∣θ, φ) = C(m, y) p^y (1 − p)^{m−y} = exp{ y log p + (m − y) log(1 − p) + log C(m, y) }
= exp{ y log[p/(1 − p)] + m log(1 − p) + log C(m, y) }
⟹ θ = log[p/(1 − p)], φ = 1, a(φ) = 1, b(θ) = −m log(1 − p) = m log(1 + e^θ), c(y, φ) = log C(m, y)
3. Poisson distribution: Y ∼ Poisson(λ).
f(y∣θ, φ) = λ^y e^{−λ}/y! = exp{ y log λ − λ − log y! }
⟹ θ = log λ, b(θ) = λ = e^θ, c(y, φ) = −log y!
Properties of the Exponential Family: If Y ∼ f(y∣θ, φ) with φ known, then:
(i) µ ≡ E(Y) = b′(θ)
(ii) Var(Y) = b′′(θ) a(φ)
Proof.
l(θ) = log f(Y∣θ, φ) = [Yθ − b(θ)]/a(φ) + c(Y, φ)
U(θ) = ∂l/∂θ = [Y − b′(θ)]/a(φ)
U′(θ) = ∂²l/∂θ² = −b′′(θ)/a(φ)
Since E[U(θ)] = 0, we have E[Y − b′(θ)] = 0 ⟹ E(Y) = b′(θ).
Under certain regularity conditions,
I(θ) = Var[U(θ)] = E[U²(θ)] = −E(∂²l/∂θ²).
Since
E[U²(θ)] = E[Y − b′(θ)]²/a²(φ) = E[(Y − EY)²]/a²(φ) = Var(Y)/a²(φ)
and
−E(∂²l/∂θ²) = b′′(θ)/a(φ),
we obtain
Var(Y) = a(φ) b′′(θ).
Note: b′′(θ) is called the variance function, denoted by V(µ). The variance function indicates how the variance of Y depends on the mean of Y.
Example 1.1.2.
1. Normal distribution: θ = µ, φ = σ², b(θ) = θ²/2, a(φ) = φ.
E(Y) = b′(θ) = θ = µ, Var(Y) = a(φ) b′′(θ) = a(φ) = φ = σ²
2. Binomial distribution: θ = log[p/(1 − p)], b(θ) = m log(1 + e^θ).
b′(θ) = m e^θ/(1 + e^θ) = mp, b′′(θ) = m e^θ/(1 + e^θ)² = mp(1 − p)
3. Poisson distribution: θ = log λ, b(θ) = e^θ.
b′(θ) = b′′(θ) = e^θ = λ ⟹ E(Y) = Var(Y) = λ
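As a quick numerical check of properties (i) and (ii) (a sketch, not part of the original notes), the cumulant functions above can be evaluated and compared against the known moments:

```python
import math

# Poisson: theta = log(lambda), b(theta) = exp(theta), a(phi) = 1
lam = 3.5
theta = math.log(lam)
mean_pois = math.exp(theta)        # b'(theta) = E(Y)
var_pois = math.exp(theta) * 1.0   # b''(theta) * a(phi) = Var(Y)

# Binomial: theta = log(p/(1-p)), b(theta) = m*log(1 + exp(theta))
m, p = 10, 0.3
theta_b = math.log(p / (1 - p))
mean_binom = m * math.exp(theta_b) / (1 + math.exp(theta_b))      # b'(theta) = m*p
var_binom = m * math.exp(theta_b) / (1 + math.exp(theta_b)) ** 2  # b''(theta) = m*p*(1-p)
```

Both derivatives reproduce the textbook moments E(Y) = λ, Var(Y) = λ for the Poisson case and E(Y) = mp, Var(Y) = mp(1 − p) for the binomial case.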
1.1.3 Generalized Linear Models
Generalized Linear Models (GLM):
i) Replace the linear model µ = E(Y) = Xβ with a linear model for g(µ).
ii) Replace the constant variance assumption Var(ε_i) = σ² with a mean-variance relation:
E(Y_i) = b′(θ_i), Var(Y_i) = b′′(θ_i) a(φ)
iii) Replace the normal distribution assumption with the exponential family, but still assume independence.
Random component:
Y_i ∼ f(y_i; θ_i, φ) = exp{ [y_i θ_i − b(θ_i)]/a(φ) + c(y_i, φ) }.
Note: Randomness comes from the distribution; e.g., for y_i ∈ {0, 1}, we model the randomness not by whether the patient has a negative or positive outcome, but through the underlying parameter p_i.
Systematic component: η_i = x_i^T β, where x_i^T denotes the ith row of the matrix of predictors X.
Link function: η_i = g(µ_i), where µ_i = b′(θ_i). The default link function is typically the canonical link, η_i = θ_i.
Example 1.1.3.
1. Linear model (a special case of GLM): Y_i = X_i^T β + ε_i.
The random component is Y_i ∼ N(X_i^T β, σ²) and the systematic component is µ_i ≡ E(Y_i) = X_i^T β = η_i. Here η_i = µ_i is the identity link.
2. Logistic regression: Y_i ∼ Bernoulli(p_i),
f(y∣θ, φ) = exp{ y log[p/(1 − p)] + log(1 − p) }
θ = log[p/(1 − p)], b(θ) = log(1 + e^θ) ⟹ µ = E(Y) = b′(θ) = e^θ/(1 + e^θ) = p, η_i = X_i^T β
The logistic regression formula is
log[p_i/(1 − p_i)] = X_i^T β.
The link function is the logit function η = g(µ) = log[µ/(1 − µ)]. This is the canonical link function because η = g(µ) = θ.
3. Poisson regression: Y_i = number of events ∼ Poisson(λ_i),
f(y∣θ, φ) = exp{ y log λ − λ − log y! }
θ = log λ, b(θ) = λ = e^θ, µ = b′(θ) = e^θ = λ
The Poisson regression formula is
log(µ_i) = X_i^T β.
The canonical link is η = θ = g(µ) = log(µ).
In summary, the parameter chain is θ ↦ µ = b′(θ) ↦ η = g(µ) = Xβ; when η = θ, g(·) is the canonical link.
1.1.4 Maximum Likelihood Estimation
Iterated Weighted Least Squares
The log-likelihood function of a distribution from the exponential family is
l = ∑_{i=1}^n l_i = ∑_{i=1}^n { [Y_i θ_i − b(θ_i)]/a_i(φ) + c(Y_i, φ) }.
The score function can be written as
U_j(β) = ∂l/∂β_j = ∑_{i=1}^n (∂l_i/∂θ_i)(∂θ_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_j),
where µ_i = b′(θ_i) and ∂η_i/∂β_j = X_ij. Since
∂l_i/∂θ_i = [Y_i − b′(θ_i)]/a_i(φ), ∂θ_i/∂µ_i = 1/(∂µ_i/∂θ_i) = 1/b′′(θ_i), ∂µ_i/∂η_i = 1/g′(µ_i),
U_j(β) = ∑_{i=1}^n { [Y_i − b′(θ_i)]/a_i(φ) } · [1/V(µ_i)] · (∂µ_i/∂η_i) · X_ij
= ∑_{i=1}^n X_ij · { 1/[(∂η_i/∂µ_i)² a_i(φ) V(µ_i)] } · (Y_i − µ_i)(∂η_i/∂µ_i)
≡ ∑_{i=1}^n X_ij W_i (Z_i − η_i),
where
W_i^{−1} = (∂η_i/∂µ_i)² a_i(φ) V(µ_i) = [g′(µ_i)]² a_i(φ) V(µ_i)
and
Z_i = η_i + (Y_i − µ_i)(∂η_i/∂µ_i) = g(µ_i) + (Y_i − µ_i) g′(µ_i).
Z_i is called the adjusted dependent variable.
Rationale:
∑_i X_ij W_i Z_i = (X^T W Z)_j is a weighted linear regression. Note that W is a diagonal matrix and Z is a vector. Then we can write
U(β) = X^T W (Z − η), where W = diag(W_1, …, W_n) and η = Xβ.
Assume X has full rank; then U(β) = 0 ⟺ X^T W Z = X^T W X β ⟺ β = (X^T W X)^{−1} X^T W Z.
Iterated Weighted Least Squares (IWLS)
1. Initialize: η = g(Y), i.e., use Y to initialize µ.
2. Calculate:
µ = g^{−1}(η)
Z = η + (Y − µ) g′(µ)
W = diag{ [g′(µ)² a(φ) V(µ)]^{−1} }
β̂ = (X^T W X)^{−1} X^T W Z
η = X β̂
3. Repeat the second step until convergence, e.g., ∣β^{(k+1)} − β^{(k)}∣ < ε.
Intuition: g(µ_i) = x_i^T β = η_i; by Taylor expansion,
g(Y_i) ≈ g(µ_i) + (Y_i − µ_i) g′(µ_i) = η_i + (Y_i − µ_i)(∂η_i/∂µ_i) = Z_i.
If we use g(Y_i) directly as the outcome in a linear regression, we may violate the constant variance assumption, since
Var(Z_i) = Var(Y_i)(∂η_i/∂µ_i)² = a_i(φ) V(µ_i)(∂η_i/∂µ_i)² = W_i^{−1}.
Given the independence of samples, W^{−1} is diagonal and we can apply a weighted least squares (WLS) approach to fit the model. Note: As in a WLS model, we can estimate the variance by
V̂ar(β̂) = (X^T Ŵ X)^{−1}.
Example 1.1.4 (Poisson regression). η = log(µ), V(µ) = µ, a(φ) = 1.
1. Initialize: η = log(Y); in practice, η = log(Y + 0.5), since Y can be 0.
2. Repeat until convergence:
µ = exp(η)
Z = η + (Y − µ)/µ
W = diag(µ)
β̂ = (X^T W X)^{−1} X^T W Z
η = X β̂
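The Poisson recipe above can be sketched in code (a minimal illustration on simulated data, not from the notes; numpy is assumed available):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
Y = rng.poisson(np.exp(X @ beta_true))

# IWLS for the Poisson GLM with log link: V(mu) = mu, a(phi) = 1
eta = np.log(Y + 0.5)                    # initialize; guards against Y = 0
for _ in range(50):
    mu = np.exp(eta)
    Z = eta + (Y - mu) / mu              # adjusted dependent variable
    W = mu                               # W_i = 1 / [g'(mu_i)^2 V(mu_i)] = mu_i
    beta = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (W * Z))
    eta_new = X @ beta
    if np.max(np.abs(eta_new - eta)) < 1e-10:
        eta = eta_new
        break
    eta = eta_new

mu = np.exp(eta)
score = X.T @ (Y - mu)   # canonical-link score X^T(Y - mu); ~0 at the MLE
```

At convergence the weighted normal equations reproduce the GLM score equation, so `score` is numerically zero and `beta` is the MLE.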
Numerical Methods for Solving U(β) = 0
1. Fisher scoring: β^{(k+1)} = β^{(k)} + I^{−1} U, where I = −E(∂²l/∂β∂β^T). Equivalent to IWLS when a(φ) = φ.
2. Newton-Raphson: β^{(k+1)} = β^{(k)} + i^{−1} U, where i = −∂²l/∂β∂β^T evaluated at β = β^{(k)}. Equivalent to Fisher scoring for the canonical link.
3. Simulation, MCMC approaches.
The score function is
U_j(β) = ∂l/∂β_j = ∑_{i=1}^n X_ij W_i (∂η_i/∂µ_i)(Y_i − µ_i).
The observed information is
I_n(β)_jk = −∂²l/(∂β_j ∂β_k)
= −∑_{i=1}^n X_ij [∂/∂β_k (W_i ∂η_i/∂µ_i)](Y_i − µ_i) + ∑_{i=1}^n X_ij W_i (∂η_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_k)
= −∑_{i=1}^n X_ij [∂/∂β_k (W_i ∂η_i/∂µ_i)](Y_i − µ_i) + ∑_{i=1}^n X_ij W_i X_ik.
Since E(Y_i − µ_i) = 0,
E(−∂²l/(∂β_j ∂β_k)) = ∑_{i=1}^n X_ij W_i X_ik,
i.e., the expected information is E(I_n(β)) = X^T W X. So Cov(β̂) ≈ (X^T Ŵ X)^{−1}.
For the canonical link η = θ,
∂η/∂µ = ∂θ/∂µ = 1/b′′(θ) = 1/V(µ).
So
W (∂η/∂µ) = (∂η/∂µ)/[(∂η/∂µ)² a(φ) V(µ)] = 1/[(∂η/∂µ) a(φ) V(µ)] = 1/a(φ).
Hence
∂/∂β_k [W (∂η/∂µ)] = ∂/∂β_k [1/a(φ)] = 0 ⟹ −∂²l/(∂β_j ∂β_k) = E(−∂²l/(∂β_j ∂β_k)) = ∑_i X_ij W_i X_ik.
Therefore, Fisher scoring is equivalent to Newton-Raphson.
1.1.5 Inference
Let β̂ be the maximum likelihood estimator of β, so that U(β̂) = 0.
The asymptotic distribution of β̂ is β̂ ∼ MVN(β, I_n^{−1}(β)), where I_n = E(−∂²l/∂β∂β^T).
Hypothesis Test
Partition β = (β_1^T, β_2^T)^T with β_1 of dimension q and β_2 of dimension p − q, and test H_0: β_2 = β_2^{(0)} (a fixed value, often 0).
1. Likelihood ratio test: 2[l(β̂) − l(β̃)] →_d χ²_{p−q} under H_0, where β̃ is the MLE under H_0.
2. Score test: U(β̃_1, β_2^{(0)})^T I_n(β̃_1, β_2^{(0)})^{−1} U(β̃_1, β_2^{(0)}) →_d χ²_{p−q} under H_0.
3. Wald test: (β̂_2 − β_2^{(0)})^T V̂ar(β̂_2)^{−1} (β̂_2 − β_2^{(0)}) →_d χ²_{p−q} under H_0.
1.1.6 Goodness of Fit (GOF)
1. Deviance:
a_i(φ) = φ (Normal), 1 (Poisson), 1/m_i (Binomial); in all three cases a_i(φ) = φ/m_i for a suitable m_i.
Rewrite:
l(β) = ∑_{i=1}^n log f_i(Y_i∣θ_i, φ) = ∑_{i=1}^n { m_i [Y_i θ_i − b(θ_i)]/φ + c_i(Y_i, φ) }.
Define the deviance as
D(Y, µ̂) = 2φ [l(µ̃, φ∣Y) − l(µ̂, φ∣Y)], where µ̃ is the MLE of the saturated model,
= 2 ∑_{i=1}^n m_i { Y_i (θ̃_i − θ̂_i) − [b(θ̃_i) − b(θ̂_i)] }.
Hence, the deviance depends on the specified distribution (likelihood), but not on the nuisance parameter φ.
Note: In the saturated model, µ̃_i = Y_i, i.e., Y_i is used as the estimate of µ_i. So θ̃_i is derived from µ̃_i = Y_i, and θ̂_i is derived from η̂_i and µ̂_i.
We compare two nested models:
Model 1 (full): β = (β_1^T, β_2^T)^T with β_1 q×1 and β_2 (p−q)×1, fitted mean µ̂^{(1)}.
Model 2 (reduced): β = (β_1^T, 0^T)^T, i.e., β_2 = 0, fitted mean µ̂^{(2)}.
For linear models, the test statistic is
[RSS^{(2)} − RSS^{(1)}]/σ² ∼ χ²_{p−q}.
For generalized linear models, the test statistic is
LR = [D(Y, µ̂^{(2)}) − D(Y, µ̂^{(1)})]/φ ∼ χ²_{p−q}.
Example 1.1.5.
(a) Normal: log f(y∣θ, φ) = −(y − µ)²/(2φ) + c(y, φ), θ = µ, φ = σ², m_i = 1.
D(Y, µ̂) = 2 ∑_{i=1}^n { Y_i (θ̃_i − θ̂_i) − [b(θ̃_i) − b(θ̂_i)] }
= 2 ∑_{i=1}^n { Y_i (Y_i − µ̂_i) − [Y_i²/2 − µ̂_i²/2] }
= 2 ∑_{i=1}^n ( Y_i²/2 − Y_i µ̂_i + µ̂_i²/2 )
= ∑_{i=1}^n (Y_i − µ̂_i)²
= RSS
(b) Binomial: log f(y∣θ, φ) = m{ y log[µ/(1 − µ)] + log(1 − µ) } + c,
θ = log[µ/(1 − µ)] ⟺ µ = e^θ/(1 + e^θ), b(θ) = log(1 + e^θ), y_i = (# events)/m_i.
D(Y, µ̂) = 2 ∑_{i=1}^n m_i { y_i [log(y_i/(1 − y_i)) − log(µ̂_i/(1 − µ̂_i))] − [log(1 + y_i/(1 − y_i)) − log(1 + µ̂_i/(1 − µ̂_i))] }
= 2 ∑_{i=1}^n m_i { y_i log(y_i/µ̂_i) + (1 − y_i) log[(1 − y_i)/(1 − µ̂_i)] }
(c) Poisson: log f(y∣θ, φ) = y log µ − µ + c, θ = log µ, b(θ) = e^θ, m_i = 1.
D(Y, µ̂) = 2 ∑_{i=1}^n { Y_i (log Y_i − log µ̂_i) − (Y_i − µ̂_i) }
= 2 ∑_{i=1}^n { Y_i log(Y_i/µ̂_i) − (Y_i − µ̂_i) }
2. Pearson χ²: does not depend on the distributional assumption.
χ² = ∑_{i=1}^n (Y_i − µ̂_i)² / [V(µ̂_i)/m_i]
Criterion:
χ²/(n − p) = ∑_{i=1}^n m_i (Y_i − µ̂_i)² / [(n − p) V(µ̂_i)] ≈ φ, where a_i(φ) = φ/m_i.
Example 1.1.6.
(a) Normal: b(θ) = θ²/2, V(µ) = b′′(θ) = 1.
χ² = ∑_{i=1}^n (Y_i − µ̂_i)² = RSS, ∑_{i=1}^n (Y_i − µ̂_i)²/(n − p) = σ̂²_OLS ≈ σ²
(b) Binomial: b(θ) = m log(1 + e^θ), µ = e^θ/(1 + e^θ), b′′(θ) = m e^θ/(1 + e^θ)² = mµ(1 − µ).
χ² = ∑_{i=1}^n m_i (Y_i − µ̂_i)² / [µ̂_i(1 − µ̂_i)], ∑_{i=1}^n m_i (Y_i − µ̂_i)² / [(n − p) µ̂_i(1 − µ̂_i)] ∼ χ²_{n−p}/(n − p) → 1
(c) Poisson: b(θ) = e^θ, µ = e^θ, b′′(θ) = e^θ = µ.
χ² = ∑_{i=1}^n (Y_i − µ̂_i)²/µ̂_i, ∑_{i=1}^n (Y_i − µ̂_i)² / [(n − p) µ̂_i] ≈ 1 when µ → ∞, since (Y − µ)/√µ → N(0, 1)
Types of Residuals:
1. Pearson residual: r_i^P = (Y_i − µ̂_i)/√V(µ̂_i), so χ² = ∑_{i=1}^n (r_i^P)². R: resid(fit, "pearson")
2. Deviance residual: r_i^D = √d_i · sign(Y_i − µ̂_i), where d_i = 2 m_i { Y_i (θ̃_i − θ̂_i) − [b(θ̃_i) − b(θ̂_i)] }.
3. Working residual: z_i = η̂_i + (Y_i − µ̂_i)(∂η_i/∂µ_i), r_i^W = z_i − η̂_i = (Y_i − µ̂_i)(∂η_i/∂µ_i). R: fit$resid
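For the Poisson case (V(µ) = µ, m_i = 1) the Pearson and deviance residuals can be computed directly; the illustrative counts and fitted means below are hypothetical:

```python
import math

Y  = [4, 0, 7, 3, 5]            # hypothetical observed counts
mu = [3.2, 1.1, 5.8, 2.9, 4.6]  # hypothetical fitted means

# Pearson residuals: r_P = (Y - mu)/sqrt(V(mu)), with V(mu) = mu for Poisson
r_pearson = [(y - m) / math.sqrt(m) for y, m in zip(Y, mu)]
chi2 = sum(r * r for r in r_pearson)

# Poisson deviance contribution d_i = 2[Y log(Y/mu) - (Y - mu)], with 0*log(0) := 0
def d_i(y, m):
    term = y * math.log(y / m) if y > 0 else 0.0
    return 2.0 * (term - (y - m))

deviance = sum(d_i(y, m) for y, m in zip(Y, mu))
r_deviance = [math.copysign(math.sqrt(d_i(y, m)), y - m) for y, m in zip(Y, mu)]
```

The squared Pearson residuals sum to the Pearson χ², and the squared deviance residuals sum to the deviance.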
1.1.7 Over-dispersion
Over-dispersion: Var(Y) > a(φ)V(µ) → se(β̂) underestimated → z-statistic inflated → inflated Type I error.
Under-dispersion: Var(Y) < a(φ)V(µ) → conservative test results → power decreases.
Example 1.1.7. Seed data. Crowder MJ (1978). Appl. Statist. 27, 34-37.
Study of germination of 2 types of seeds with 2 root extracts.
Variables:
seed type = 1,2
extract = 1,2
r = # germinated seeds
m = # seeds on plate
Let Y = r/m be the data. Let µ = E(Y ) be the probability of germination.
Data (r/m per plate):
seed 1, extract 1: 10/39, 23/62, 23/81, 26/51, 17/39
seed 1, extract 2: 5/6, 53/74, 55/72, 32/51, 46/79, 10/13
seed 2, extract 1: 8/16, 10/30, 8/28, 23/45, 0/4
seed 2, extract 2: 3/12, 22/41, 15/30, 32/51, 3/7
(21 plates in total)
The model is
logit(µ) = log[µ/(1 − µ)] = β_0 + β_1·seed + β_2·extract + β_3·seed·extract.
Deviance: D = 33.28 with df = n − p = 21 − 4 = 17; D/(n − p) ≈ 2 ≫ 1.
Pearson χ²: χ² = 31.65 with df = n − p = 21 − 4 = 17.
Heterogeneous population (over-dispersion):
Let π be the germination probability within a plate, and assume m is constant.
Since µ = E(Y) = E[E(Y∣π)] = E(π),
Var(Y) = E[Var(Y∣π)] + Var[E(Y∣π)] = E[π(1 − π)/m] + Var(π)
= E(π)[1 − E(π)]/m + Var(π)(1 − 1/m) = µ(1 − µ)/m + Var(π)(1 − 1/m) > µ(1 − µ)/m.
If Var(π) = τ²µ(1 − µ), then
Var(Y) = [µ(1 − µ)/m][1 + (m − 1)τ²] = σ² µ(1 − µ)/m = σ² V(µ).
Model-based Approach to Over-dispersed GLM
Assume that Var(Y) = σ² V(µ) a(φ), where V(µ) a(φ) is the naive variance from the distributional assumption (exponential family). We have
β̂ ∼̇ N(β, σ²(X^T W X)^{−1}), where ∼̇ indicates "asymptotically".
Estimate
σ̂² = χ²/(n − p) or D/(n − p),
and inflate the standard errors accordingly: se → se · √[χ²/(n − p)].
In this case, Var(Y) = σ² µ(1 − µ)/m and bias(σ̂²) ∼ O(1/m) ∝ 1/m.
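With the seed-data numbers above, the over-dispersion correction amounts to scaling the naive standard errors (a sketch; the naive se and coefficient below are hypothetical, for illustration only):

```python
import math

# From the seed example: Pearson chi^2 = 31.65 on n - p = 21 - 4 = 17 df
chi2, df = 31.65, 17
sigma2_hat = chi2 / df                 # estimated dispersion, roughly 1.86

se_naive = 0.30                        # hypothetical naive se(beta_j)
se_corrected = se_naive * math.sqrt(sigma2_hat)
z_naive = 0.65 / se_naive              # hypothetical coefficient of 0.65
z_corrected = 0.65 / se_corrected      # corrected z shrinks toward 0
```

Since σ̂² > 1 here, the corrected standard error is larger and the corrected z-statistic smaller, counteracting the inflated Type I error.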
1.1.8 Binary Data
For binary outcomes, some commonly used link functions are:
1. Logistic model: log[p_i/(1 − p_i)] = x_i^T β, η = g(p) = logit(p) ⟹ p = e^η/(1 + e^η).
Note: This is a symmetric link function, which means that switching case and control leads to the same result except for a sign switch on β.
It corresponds to the cdf of a logistic distribution: F(x) = e^{(x−µ)/r}/(1 + e^{(x−µ)/r}), with E(X) = µ and Var(X) = π²r²/3.
2. Probit model: p = Φ(η), η = Φ^{−1}(p) → probit link.
Note: This is also a symmetric link function, since Φ^{−1}(p) = −Φ^{−1}(1 − p). The probit model is also referred to as the liability model.
Comparison between logistic and probit models: for logistic models, β is easier to interpret (log odds ratio); probit models are easier to fit, but the change in p related to β is not proportional.
3. Complementary log-log link, based on the extreme value (Gumbel) distribution:
p = F(η) = 1 − exp[−exp(η)], η = g(p) = log[−log(1 − p)].
It represents the distribution of maxima, with an underlying normal/exponential distribution (e.g., competing risks).
4. Log link: η = log(p), p = e^η = e^{Xβ}. This link function can be used to estimate the relative risk (risk ratio) instead of the odds ratio.
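The four inverse links can be compared numerically (a small sketch, using only Python's math library; the probit symmetry noted above is checked directly):

```python
import math

eta = -0.7
p_logit   = math.exp(eta) / (1 + math.exp(eta))       # inverse logit
p_probit  = 0.5 * (1 + math.erf(eta / math.sqrt(2)))  # Phi(eta)
p_cloglog = 1 - math.exp(-math.exp(eta))              # inverse complementary log-log
p_log     = math.exp(eta)                             # log link; valid only while p <= 1

# Symmetry of the probit link: Phi^{-1}(p) = -Phi^{-1}(1-p) <=> Phi(eta) + Phi(-eta) = 1
p_probit_neg = 0.5 * (1 + math.erf(-eta / math.sqrt(2)))
```

All four values are valid probabilities at this η; the probit satisfies the symmetry identity exactly, while the complementary log-log link does not.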
Case-control vs cohort study
1. Population-based model: P(Y = 1∣X).
Since
OR(X; β) = [P(Y = 1∣X = x) P(Y = 0∣X = 0)] / [P(Y = 1∣X = 0) P(Y = 0∣X = x)],
we can write
P(Y = 1∣X) = exp[α + log OR(X; β)] / { 1 + exp[α + log OR(X; β)] }.
2. Case-control sampling: let S = 1 if sampled and S = 0 if not sampled. Then
P(X, Y∣S = 1) = P(X∣Y, S = 1) P(Y∣S = 1) = P(X∣Y) n_Y/n,
where n_Y = n_0 (controls) or n_1 (cases), and n = n_0 + n_1.
⟹ P(X∣Y) = (n/n_Y) P(X, Y∣S = 1) = (n/n_Y) P(Y∣X, S = 1) P(X∣S = 1)
= (n/n_Y) · { exp[Y(α* + Xβ)] / [1 + exp(α* + Xβ)] } · q(X),
since
P(Y = 1∣X, S = 1) = P(Y = 1, X, S = 1)/P(X, S = 1) = P(S = 1∣Y = 1, X) P(Y = 1∣X) P(X) / ∑_{Y=0}^{1} P(S = 1∣Y, X) P(Y∣X) P(X)
= π_1 P(Y = 1∣X) / [π_0 P(Y = 0∣X) + π_1 P(Y = 1∣X)]
= { exp(log π_1 + α + Xβ)/[1 + exp(α + Xβ)] } / { [exp(log π_0) + exp(log π_1 + α + Xβ)]/[1 + exp(α + Xβ)] }
= (π_1/π_0) exp(α + Xβ) / [1 + (π_1/π_0) exp(α + Xβ)]
= exp(α* + Xβ) / [1 + exp(α* + Xβ)], where α* = α + log(π_1/π_0).
Note: α reflects the population prevalence. In practice, π_1 ≫ π_0.
3. Maximum likelihood estimation:
L ∝ L_1 × L_2 = ∏_{i=1}^n P(Y_i∣X_i, S = 1) · ∏_{i=1}^n P(X_i∣S = 1) = ∏_{i=1}^n { exp[Y_i(α* + X_i^T β)] / [1 + exp(α* + X_i^T β)] } · ∏_{i=1}^n q(X_i),
maximizing L subject to the constraints
n_0/n = P(Y = 0∣S = 1) = ∫ P(Y = 0∣X = x, S = 1) q(x) dx
n_1/n = P(Y = 1∣S = 1) = ∫ P(Y = 1∣X = x, S = 1) q(x) dx.
This is a semi-parametric model, with parametric part P(Y∣X, S = 1) and non-parametric part P(X∣S = 1) = q(X).
Proposition 1: The unconstrained maximizer (α̂*, β̂) with q̂(x) = (1/n) #{i : X_i = x}, which also maximizes L, satisfies the constraints and therefore also solves the constrained optimization.
Proposition 2: The estimating equations from L_1 are unbiased w.r.t. the q(·) distribution.
Reference: Prentice & Pyke (1979).
1.1.9 Count Data
Poisson regression model:
1. log µ_i = X_i^T β, µ_i = E(Y_i): models counts; but as the population size increases, counts increase.
2. log λ_i = X_i^T β with λ_i = µ_i/n_i: models rates, so log µ_i = log n_i + X_i^T β. The term log n_i is called the offset.
Alternative regression model: a variance-stabilizing transformation for count data.
1.1.10 Quasi-likelihood Theory
For generalized linear models, µ = E(Y) = b′(θ) and Var(Y) = a(φ)V(µ). In the case of over-dispersion, Var(Y) ≠ a(φ)V(µ). In the model-based correction, we assume Var(Y) = σ² a(φ) V(µ). Here, we further relax the assumption on Var(Y_i) and allow any form.
Note: The assumed variance function has little impact on β̂, but more impact on se(β̂).
Recall that in a GLM,
Y_i ∼ f(Y_i∣θ_i, φ) = exp{ [Y_i θ_i − b(θ_i)]/a(φ) + c(Y_i, φ) },
η_i = X_i^T β = g(µ_i), l(β) = ∑_{i=1}^n l_i(β) = ∑_{i=1}^n log f(Y_i∣β).
The score equation is
∑_{i=1}^n (∂/∂β_j) l_i(β) = 0 ⟹ ∑_{i=1}^n [(∂µ_i/∂β_j)/Var(Y_i)] (Y_i − µ_i) = 0, j = 1, …, p,
since
(∂/∂µ_i) l_i(β) = (∂/∂θ_i) l_i(β) · 1/(∂µ_i/∂θ_i) = [(Y_i − µ_i)/a(φ)] · 1/b′′(θ_i).
Quasi-likelihood
Suppose that instead of knowing the likelihood (i.e., the distribution), we only know the first two moments:
E(Y_i) = µ_i = µ_i(β)
Var(Y_i) = a(φ)V(µ_i) (exponential family), or V(µ_i) user-specified (more flexibility; V(µ_i) ≠ b′′(θ) is allowed here).
We mimic the score equations for the GLM and estimate β by solving the estimating equations
∑_{i=1}^n [(∂µ_i/∂β_j)/V(µ_i)] (Y_i − µ_i) = 0.
Note: This is no longer the derivative of a true log-likelihood, but of a quasi-likelihood. Since the contribution to the log-likelihood from Y_i is the integral with respect to µ_i of ∂ log f(Y_i∣β)/∂µ_i, we define the log quasi-likelihood via the contribution of Y_i:
Q_i(Y_i∣µ_i) = ∫_{Y_i}^{µ_i} (Y_i − t)/V(t) dt ⟹ ∂Q_i/∂β_j = (∂µ_i/∂β_j)(∂Q_i/∂µ_i) = (∂µ_i/∂β_j)(Y_i − µ_i)/V(µ_i).
Q = ∑_{i=1}^n Q_i behaves like a log-likelihood function, and β is estimated by IWLS.
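As a check (a sketch, not in the notes): taking V(t) = t in the integral gives the Poisson-type quasi-likelihood Q = Y log(µ/Y) − (µ − Y), which matches the Poisson log-likelihood up to a term free of µ; numerical integration confirms the closed form.

```python
import math

def Q_numeric(y, mu, V, steps=20000):
    # midpoint rule for Q(y|mu) = integral from y to mu of (y - t)/V(t) dt
    h = (mu - y) / steps
    total = 0.0
    for k in range(steps):
        t = y + (k + 0.5) * h
        total += (y - t) / V(t)
    return total * h

y, mu = 3.0, 4.5
q_closed = y * math.log(mu / y) - (mu - y)  # closed form when V(t) = t
q_num = Q_numeric(y, mu, lambda t: t)
```

Other variance functions (e.g., V(t) = t(1 − t) for binary-type data) can be plugged into `Q_numeric` the same way.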
Properties of Quasi-likelihood
1. E(∂Q_i/∂µ_i) = 0.
Proof. E(∂Q_i/∂µ_i) = E[(Y_i − µ_i)/V(µ_i)] = [E(Y_i) − µ_i]/V(µ_i) = 0.
2. E(∂Q_i/∂β_j) = 0.
Proof. E(∂Q_i/∂β_j) = E(∂Q_i/∂µ_i)(∂µ_i/∂β_j) = 0.
3. E[(∂Q_i/∂µ_i)²] = −E(∂²Q_i/∂µ_i²).
Proof.
E[(∂Q_i/∂µ_i)²] = E{ [(Y_i − µ_i)/V(µ_i)]² } = [E(Y_i²) − 2µ_i E(Y_i) + µ_i²]/V(µ_i)² = [E(Y_i²) − µ_i²]/V(µ_i)² = 1/V(µ_i), since Var(Y_i) = V(µ_i).
−∂²Q_i/∂µ_i² = 1/V(µ_i) + [(Y_i − µ_i)/V²(µ_i)] ∂V(µ_i)/∂µ_i ⟹ −E(∂²Q_i/∂µ_i²) = 1/V(µ_i).
4. E[(∂Q_i/∂β_j)(∂Q_i/∂β_k)] = −E(∂²Q_i/∂β_j∂β_k).
Proof.
(∂Q_i/∂β_j)(∂Q_i/∂β_k) = (∂Q_i/∂µ_i)² (∂µ_i/∂β_j)(∂µ_i/∂β_k)
∂²Q_i/∂β_j∂β_k = (∂/∂β_k)(∂Q_i/∂β_j) = (∂/∂β_k)[(∂Q_i/∂µ_i)(∂µ_i/∂β_j)]
= (∂²Q_i/∂µ_i²)(∂µ_i/∂β_k)(∂µ_i/∂β_j) + (∂Q_i/∂µ_i)(∂²µ_i/∂β_j∂β_k),
where the last term has expectation 0; combining with Property 3 gives the result.
Properties 1-4 are analogous to those of a true likelihood.
Maximum Quasi-likelihood Estimators (MQLE)
Under the usual regularity conditions:
the maximum likelihood estimator (MLE) β̂ satisfies √n(β̂ − β) →_d N(0, I^{−1}(β));
the maximum quasi-likelihood estimator (MQLE) β̂ satisfies √n(β̂ − β) →_d N(0, I^{−1}(β)),
where
I(β) = E[ (1/n) ∑_{i=1}^n (∂Q_i/∂β)(∂Q_i/∂β^T) ] = −E[ (1/n) ∑_{i=1}^n ∂²Q_i/∂β∂β^T ] = (1/n) ∑_{i=1}^n (∂µ_i/∂β) V(µ_i)^{−1} (∂µ_i/∂β^T) = (1/n)(∂µ/∂β) V(µ)^{−1} (∂µ/∂β^T).
Example 1.1.8 (Over-dispersion). Var(Y_i) = σ² V(µ_i) a(φ).
The estimating equation is
∑_{i=1}^n [(∂µ_i/∂β_j) / (σ² a(φ) V(µ_i))] (Y_i − µ_i) = 0.
1. σ² constant → β̂ unaffected.
2. Var(β̂) = (1/n) I(β)^{−1} = σ² [(∂µ/∂β) V(µ)^{−1} (∂µ/∂β^T)]^{−1}, where V(µ) is the variance function from the GLM and σ² is a scale on V(µ).
To estimate σ²:
σ̂² = (Y − µ̂)^T V(µ̂)^{−1} (Y − µ̂)/(n − p) = [1/(n − p)] ∑_{i=1}^n [(Y_i − µ̂_i)/√V(µ̂_i)]² = χ²/(n − p).
This formula can be applied under both the likelihood and the quasi-likelihood.
Estimating the MQLE β̂ by IWLS:
1. Initialize: η = g(Y).
2. Repeat until convergence:
µ = g^{−1}(η)
Z = η + (Y − µ) g′(µ)
W = diag{ [g′(µ)² Var(Y)]^{−1} }
β̂ = (X^T W X)^{−1} X^T W Z
η = X β̂
R: family = quasi(link = ..., variance = ...), e.g., quasi(link = logit, variance = "mu(1-mu)")
Summary
1. Quasi-likelihood (QL) is used when the assumption of a standard exponential family is invalid.
2. The β̂'s are not affected much.
3. The MQLE has the same asymptotic properties as the MLE.
4. The MQLE is obtained by IWLS.
Note: There is little guidance on which V(µ) to use. Robust variance estimates (see below) are always valid for large samples.
Robust Variance Estimator
The quasi-score function is
U_i = ∂Q_i/∂β = (∂µ_i/∂β)(Y_i − µ_i)/V(µ_i).
By Taylor expansion,
0 = U(β̂) ≈ U(β) + (∂U/∂β)(β̂ − β)
⟹ √n(β̂ − β) ≈ √n(−∂U/∂β)^{−1} U(β) = [−(1/n) ∑_{i=1}^n ∂U_i/∂β]^{−1} (1/√n) ∑_{i=1}^n U_i →_d N(0, Σ),
where Σ = A^{−1} B A^{−1} and
A = −lim_{n→∞} (1/n) ∑_{i=1}^n E(∂U_i/∂β) = −lim_{n→∞} (1/n) A_n
B = lim_{n→∞} (1/n) ∑_{i=1}^n Var(U_i) = lim_{n→∞} (1/n) B_n.
Since Var(∂Q_i/∂β) = E[(∂Q_i/∂β)(∂Q_i/∂β^T)] = −E(∂²Q_i/∂β∂β^T) when Var(Y_i) is correctly specified, we have B = A, so
Σ = A^{−1} A A^{−1} = A^{−1} = B^{−1}.
Then
Var(β̂) = [E(−∂²Q/∂β∂β^T)]^{−1}.
If Var(Y_i) ≠ V(µ_i), i.e., the assumed variance is incorrect: since E(U) = 0, β̂ is still consistent, and we still have
√n(β̂ − β) ≈ [−(1/n) ∑_{i=1}^n ∂U_i/∂β]^{−1} (1/√n) ∑_{i=1}^n U_i →_d N(0, Σ), where U_i = (∂µ_i/∂β)(Y_i − µ_i)/V(µ_i),
∂U_i/∂β^T = (∂/∂β^T)[(∂µ_i/∂β)/V(µ_i)] (Y_i − µ_i) − (∂µ_i/∂β)(∂µ_i/∂β^T)/V(µ_i).
Hence,
−A_n = ∑_{i=1}^n E(−∂U_i/∂β^T) = ∑_{i=1}^n (∂µ_i/∂β)(∂µ_i/∂β^T)/V(µ_i) = X^T W X, where W = diag{ [g′(µ_i)² V(µ_i)]^{−1} },
B_n = ∑_{i=1}^n Var(U_i) = ∑_{i=1}^n E(U_i U_i^T).
Robust (sandwich) variance estimator:
V̂ar(β̂) = (1/n) Â^{−1} B̂ Â^{−1} = Â_n^{−1} B̂_n Â_n^{−1}, where Â_n = X^T Ŵ X and B̂_n = ∑_{i=1}^n U_i(β̂) U_i(β̂)^T,
i.e.,
V̂ar(β̂) = (X^T Ŵ X)^{−1} [ ∑_{i=1}^n U_i(β̂) U_i(β̂)^T ] (X^T Ŵ X)^{−1}.
Note: a large n is needed to estimate B_n well.
Different variance estimators:
1. Naive: V̂ar(β̂) = (X^T Ŵ X)^{−1}, under the user-specified V(µ_i). In a GLM, V(µ_i) is determined by the distributional assumption.
2. Model-based: assume Var(Y_i) = σ² V(µ_i); V̂ar(β̂) = σ̂² (X^T Ŵ X)^{−1}.
3. Robust: only assume that the mean is correctly specified, so
V̂ar(β̂) = (X^T Ŵ X)^{−1} ∑_{i=1}^n U_i(β̂) U_i(β̂)^T (X^T Ŵ X)^{−1}.
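The naive and robust estimators can be compared on over-dispersed count data (a simulation sketch, assuming numpy; the Poisson fit reuses the IWLS scheme from Section 1.1.4):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu_true = np.exp(X @ np.array([0.2, 0.4]))
# over-dispersed counts via a Poisson-gamma mixture: Var(Y) = mu + mu^2/2 > mu
Y = rng.poisson(mu_true * rng.gamma(shape=2.0, scale=0.5, size=n))

# fit a working Poisson GLM (log link) by IWLS
eta = np.log(Y + 0.5)
for _ in range(50):
    mu = np.exp(eta)
    Z = eta + (Y - mu) / mu
    beta = np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (mu * Z))
    eta = X @ beta

mu = np.exp(eta)
A_n = X.T @ (X * mu[:, None])             # X^T W X with W = diag(mu)
U = X * (Y - mu)[:, None]                 # U_i = x_i (Y_i - mu_i) (canonical link)
B_n = U.T @ U

var_naive = np.linalg.inv(A_n)            # naive (X^T W X)^{-1}
var_robust = var_naive @ B_n @ var_naive  # sandwich A_n^{-1} B_n A_n^{-1}
```

Because the data are over-dispersed relative to the working Poisson variance, the robust diagonal entries exceed the naive ones by roughly the dispersion factor.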
Mathematical Rationale for the Estimating Equations
An estimating function g(Y; θ) is a function of the data Y and the parameter θ having zero mean for all parameter values, i.e., E[g(Y; θ)] = 0. To obtain parameter estimates, solve g(Y; θ) = 0. E.g., under the identity link, Y − µ(β) = 0 ⟹ Y_i − µ_i(β) = 0.
Challenge: reduce an n-vector to a p-vector with minimal sacrifice of information (n = # samples, p = # parameters). For linear models,
E[g_i(Y; θ)∣A_i] = 0, where A_i ≡ A_i(Y; θ).
1. For regressions, A_i = X_i (the set of covariates);
2. For time series, A_i = Y_{i−1}.
Let D_ij = −E[∂g_i(Y_i; θ)/∂θ_j ∣ A_i]. E.g., for g = Y − µ(β), D = ∂µ/∂β.
The estimating functions are
U(β, y) = (∂µ/∂β)[Y − µ(β)]/V(µ) = D^T V^{−1} g, where V = diag{Var(g_1∣A_1), …, Var(g_n∣A_n)}.
Following the quasi-likelihood results, the asymptotic variance is
Cov(θ̂) = E[∂U(θ)/∂θ]^{−1} = (D^T V^{−1} D)^{−1} = [(∂µ^T/∂θ) V^{−1} (∂µ/∂θ)]^{−1}.
Optimality: Consider linear estimating equations h(y; β) = H^T [y − µ(β)], where H (n × p) may be a function of β, but not of y. Estimate β̃ through h(y; β̃) = 0.
Under the usual asymptotic regularity conditions,
0 = h(β̃) ≈ h(β) + (∂h/∂β)(β̃ − β)
⟹ β̃ − β ≈ −(∂h/∂β)^{−1} h(β) = −[ (∂H^T/∂β)(y − µ) − H^T (∂µ/∂β) ]^{−1} h(β) ≈ (H^T D)^{−1} h(β),
since E[(∂H^T/∂β)(y − µ)] = 0. Thus,
Cov(β̃) ≈ (H^T D)^{−1} Cov(h) (D^T H)^{−1}
= (H^T D)^{−1} H^T Cov(y) H (D^T H)^{−1}
= (H^T D)^{−1} H^T V H (D^T H)^{−1}
≥ Cov(β̂) = (D^T V^{−1} D)^{−1}, where β̂ is the MQLE,
since Cov(β̂)^{−1} − Cov(β̃)^{−1} ≈ D^T [V^{−1} − H(H^T V H)^{−1} H^T] D ≥ 0, where V^{−1} − H(H^T V H)^{−1} H^T = V^{−1/2}[I − V^{1/2} H (H^T V H)^{−1} H^T V^{1/2}] V^{−1/2} and the bracketed matrix is symmetric and idempotent, hence nonnegative definite.
Fact: For positive definite matrices A, B of the same order, A^{−1} − B^{−1} ≥ 0 (nonnegative definite) ⟹ B − A ≥ 0 (nonnegative definite).
Optimality: For any vector a, Var(a^T β̃) ≥ Var(a^T β̂), similar to the proof of the Gauss-Markov theorem.
Note: Cov(β̂)^{−1} − Cov(β̃)^{−1} is the residual variance from regressing D^T V^{−1} Y on H^T.
Quasi-likelihood (QL) Extensions
1. Parameters in the variance function: E(Y_i) = µ_i = g^{−1}(X_i^T β), Var(Y_i) = V(µ_i, φ), e.g., σ² V(µ_i).
Reference: Breslow (1990). JASA 85, 565-571.
2. Correlated data: Suppose Y_i is a vector of correlated observations, with
E(Y_i) = µ_i = g^{−1}(X_i^T β), Var(Y_i) = V(µ_i, φ) = V^{1/2} ρ V^{1/2},
where ρ is the correlation matrix and V is the (diagonal) variance matrix.
This is the GEE model.
Reference: Diggle, Liang, Zeger (1994). Sections 4.6 & 7.1.
1.2 Correlated Data
1.2.1 Introduction
Correlated data comes in various formats:
Longitudinal data: measurements over time (t as a covariate)
Repeated measures
Clustered data
Panel data (econometrics)
Example 1.2.1. Subjects i = 1, …, m; times t_ij; data (Y_ij, X_ij), j = 1, …, n.
E(Y_ij) = µ_ij ⟹ E(Y_i) = µ_i; Var(Y_ij) = V_ij ⟹ Var(Y_i) = V_i.
Models:
1. Y_ij = β_0 + β X_ij + ε_ij ⟹ Y_ij = β_0 + β X_i1 (baseline profile) + β(X_ij − X_i1) (change profile) + ε_ij
2. Y_ij = β_0i + β(X_ij − X_i1) + ε_ij
3. Y_ij = β_0 + β_C X_i1 + β_L(X_ij − X_i1) + ε_ij, where β_C is the cross-sectional (baseline) component and β_L is the longitudinal component.
β̂_OLS = (X^T X)^{−1} X^T Y
β̂_GLS = (X^T V^{−1} X)^{−1} X^T V^{−1} Y
Y_ij = X_ij^T β + ε_ij, ε_i = (ε_i1, …, ε_in)^T ∼ MVN(0, σ² V_0).
1. For independent data, V_0 = I.
2. For correlated data, V_0 is not diagonal, but the ε_i's are still independent across subjects.
Y ∼ N(Xβ, σ² V), where V = blockdiag(V_1, V_2, …, V_m).
The main reason to consider the correlation is efficiency: β̂ has smaller variance than the estimator that assumes uncorrelated data.
Coefficient Estimation
1. (a) WLS: min (Y − Xβ)^T W (Y − Xβ) ⟹ β̂_W = (X^T W X)^{−1} X^T W Y
(b) OLS: W = I ⟹ β̂_I = (X^T X)^{−1} X^T Y
(c) BLUE: W = V^{−1} ⟹ β̂ = (X^T V^{−1} X)^{−1} X^T V^{−1} Y
All of these estimators are consistent:
E(β̂_W) = (X^T W X)^{−1} X^T W E(Y) = (X^T W X)^{−1} X^T W X β = β.
Var(β̂_W) = σ²(X^T W X)^{−1} X^T W V W X (X^T W X)^{−1}
Var(β̂_I) = σ²(X^T X)^{−1} X^T V X (X^T X)^{−1}
Var(β̂) = σ²(X^T V^{−1} X)^{−1}
If V is unknown, we can model and estimate V (1) parametrically or (2) non-parametrically (requires m ≫ n).
If V is misspecified, W is not optimal. Efficiency is defined by
efficiency = Var(β̂)/Var(β̂_W).
2. MLE:
l(β, σ², V_0) = −(1/2){ (1/σ²)(Y − Xβ)^T V_0^{−1}(Y − Xβ) + nm log σ² + m log∣V_0∣ }
Estimate β, σ, V_0 by MLE (the estimator of σ is biased).
3. REML (Restricted Maximum Likelihood):
The maximum likelihood estimator of σ² is biased, but sometimes we want an unbiased estimate, e.g., for the variance components of a mixed effects model.
E.g., with V = I: σ̂²_MLE = RSS/(nm); the unbiased estimator is σ̂² = RSS/(nm − p).
Key idea of REML: let A be a linear transformation of Y, Y* = AY, such that the distribution of Y* does not depend on β.
E.g., A = I − X(X^T X)^{−1} X^T = I − P ⟹ Y* = AY ∼ N(0, σ² A V A^T).
⟹ Applying MLE to Y* leads to the REML estimate σ̂² = RSS/(nm − p).
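A small numerical illustration of the estimators above (a sketch, assuming numpy): GLS with the true V is at least as efficient as OLS, and the REML projection A = I − X(X^T X)^{−1} X^T annihilates X.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=N)])

# an AR-1 covariance as the true V (sigma^2 = 1, rho = 0.6)
idx = np.arange(N)
V = 0.6 ** np.abs(idx[:, None] - idx[None, :])
Vinv = np.linalg.inv(V)

# Var(beta_I) = (X^T X)^{-1} X^T V X (X^T X)^{-1};  Var(beta_GLS) = (X^T V^{-1} X)^{-1}
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ V @ X @ XtX_inv
var_gls = np.linalg.inv(X.T @ Vinv @ X)

# REML transformation: A X = 0, so Y* = AY carries no information about beta
A = np.eye(N) - X @ XtX_inv @ X.T
```

The diagonal of `var_ols` dominates that of `var_gls` (GLS is BLUE), and `A` is a symmetric idempotent projection with `A @ X` equal to zero.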
Class of Correlation Models (Extensions of GLMs)
1. Marginal model (extension of quasi-likelihood):
(1) E(Y_i) = µ_i, g(µ_i) = X_i β
(2) Var(Y_ij) = V(µ_ij) φ
(3) Corr(Y_ij, Y_ik) = ρ(µ_ij, µ_ik∣α)
⟹ GEE (generalized estimating equations): specify g(·), V(·), ρ(·).
Advantage: only needs the first and second moments; the correlation is taken into account only through the working correlation.
Disadvantage: the correlation function may be misspecified.
2. Random effects model / mixed effects model / conditional model (extension of GLM, MLE):
Given b_i, (Y_i1, …, Y_in_i) are mutually independent and follow a GLM with density
f(y_ij∣b_i) = exp{ [y_ij θ_ij − ψ(θ_ij)]/φ + c(y_ij, φ) }.
The conditional mean and variance are
µ_ij = E(Y_ij∣b_i) = ψ′(θ_ij), V_ij = Var(Y_ij∣b_i) = ψ′′(θ_ij) φ,
since conditional on b_i (the conditioning factor), the data are independent.
The model satisfies
g(µ_ij) = X_ij^T β + Z_ij^T b_i,
where Z_ij is a subset of X_ij and Z_ij^T b_i models the correlation (a shared property within a family).
⟹ GLMM (generalized linear mixed-effects model).
Assume the b_i's are mutually independent, with a common multivariate distribution F (e.g., a normal distribution). The likelihood is
L(β, α∣Y) = ∏_{i=1}^m ∫ ∏_{j=1}^{n_i} f(Y_ij, b_i) db_i = ∏_{i=1}^m ∫ ∏_{j=1}^{n_i} f(Y_ij∣b_i) f(b_i∣α) db_i,
where α parameterizes F, f(Y_ij∣b_i) comes from the GLM, and f(b_i∣α) comes from F (e.g., MVN).
For generalized linear mixed models, the integration is difficult.
For Gaussian data (Y∣b_i), use MLE or REML; for non-Gaussian data (Y∣b_i), use numerical integration.
Advantage: can identify the correlation factors and model them.
Disadvantage: the correlation factors need to fully explain the correlation (otherwise the observations are not conditionally independent). Let m be the number of clusters; m needs to be large for the asymptotic distribution to approximate well.
3. Transition (Markov) model:
The link function is
g( E[y_ij∣y_i1, …, y_i(j−1)] ) = X_ij^T β + ∑_{r=1}^s f_r(y_i1, …, y_i(j−1); α),
with
Var[y_ij∣y_i1, …, y_i(j−1)] = φ · V( E[y_ij∣y_i1, …, y_i(j−1)] ),
where V(·) is a variance function. We need to know the order of the observations.
If g is the linear/identity link, then all three models reduce to the same model.
1.2.2 Linear Mixed Models (LMM)
The linear mixed model is represented by
y_i = X_i β + Z_i b_i + ε_i,
where X_i β is the fixed component and Z_i b_i is the random component.
Theoretically, Z_i need not be a subset of X_i; in practice, Z_i ⊆ X_i.
Example 1.2.2. Measurements over time (t), treatment groups (T); t_ij is time-varying and T_i is time-invariant.
(1) Random intercept, fixed slope: Z_i = (1, …, 1)^T,
y_ij = β_0 + β_1 t_ij + β_2 T_i + b_0i + ε_ij
(allows different baseline values).
(2) Random intercept, random slope: Z_i has rows (1, t_ij) and X_i has rows (1, t_ij, T_i),
y_ij = β_0 + β_1 t_ij + β_2 T_i + b_0i + b_1i t_ij + ε_ij
(allows different baselines, i.e., starting points, and different responses to treatment).
Assumptions:
1. b_i ∼ N(0, D). In Example 1.2.2:
(1) b_i = b_0i ∼ N(0, σ_b²);
(2) b_i = (b_0i, b_1i)^T ∼ N(0, D) with D = [σ_0², σ_01; σ_10, σ_1²].
σ_01 = Cov(b_0i, b_1i) accounts for shared properties between the random effects that we do not model explicitly. When the number of correlation factors goes to infinity, σ_01 → 0.
2. Var(ε_i) = R_i, e.g., σ² I. R_i can be non-diagonal, representing residual correlation within a cluster.
3. b_i ⟂ ε_i (independence).
Two-component representation of the LMM:
Level 1 (measurement level): the finest unit, e.g., t_ij, (lipid)_ij, which vary across time.
Level 2 (personal/upper level): patient/subject i, e.g., T_i, (gender)_i, which are invariant across time.
Example 1.2.3.
Level 1 model: Y_ij = b_0i + β_1 t_ij + ε_ij, ε_ij ∼ N(0, σ²) (1)
Level 2 model: b_0i = β_00 + β_01 T_i + U_0i, U_0i ∼ N(0, σ_b²), where the U_0i's are the level-2 residuals (2)
Substituting (2) into (1) gives model (1) of Example 1.2.2.
For model (2) of Example 1.2.2:
Level 1 model: Y_ij = b_0i + b_1i t_ij + ε_ij
Level 2 model: b_0i = β_00 + β_01 T_i + U_0i, b_1i = β_10 + β_11 T_i + U_1i
Common Structures for D or R_i
The dimension of D depends on the number of random-effect risk factors; D characterizes the correlation between random effects. The dimension of R_i depends on the number of observations within cluster i; R_i characterizes the (residual) correlation between samples.
E.g., with a random intercept and a random slope, D = Cov((b_0i, b_1i)^T) is a 2 × 2 matrix; with three observations per cluster, R_i = Cov((y_i1, y_i2, y_i3)^T) is a 3 × 3 matrix.
1. Exchangeable / compound symmetry (C-S): σ² times the matrix with 1 on the diagonal and ρ everywhere off the diagonal.
2. Variance components (VC), independent (usually applied to D, not R_i): D = diag(σ_1², σ_2²).
3. One-dependent: σ² times the matrix with 1 on the diagonal, ρ on the first off-diagonals, and 0 elsewhere.
4. k-dependent: σ² times the banded matrix with 1 on the diagonal, ρ_1, …, ρ_k on the first k off-diagonals, and 0 beyond; (k + 1) parameters: σ², ρ_1, …, ρ_k.
5. AR-1: R_i has (j, k) entry σ² ρ^∣j−k∣.
6. Markov-chain type (AR with unequal intervals): Corr(Y_ij, Y_ik) = ρ^{γ∣t_j−t_k∣} or ρ^{γ·f(t_j−t_k)}.
7. Unstructured: no assumption on D or R_i ⟹ n(n + 1)/2 parameters (inefficient if we have information on the structure of D or R_i).
Note: Except for the exchangeable structure, the order of the data matters.
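The exchangeable and AR-1 structures can be generated programmatically (a sketch, assuming numpy):

```python
import numpy as np

def exchangeable(n, sigma2, rho):
    # sigma^2 on the diagonal, sigma^2 * rho everywhere else
    R = np.full((n, n), rho)
    np.fill_diagonal(R, 1.0)
    return sigma2 * R

def ar1(n, sigma2, rho):
    # (j, k) entry is sigma^2 * rho^{|j - k|}
    idx = np.arange(n)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])
```

For AR-1 the correlation decays with lag, so the order of the observations matters; for the exchangeable structure every pair has the same correlation, so it does not.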
Estimation of β
$$\mathrm{Var}(Y_i) = Z_iDZ_i^T + R_i = V_i,\qquad \mathrm{Var}(Y) = ZDZ^T + R = V = \begin{pmatrix} V_1 & 0 & \cdots & 0\\ 0 & V_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & V_m \end{pmatrix}$$
1. If V is known, then Yi ∼ MVN(µi, Vi).
$$f(Y_i) = (2\pi)^{-n_i/2}|V_i|^{-1/2}\exp\left\{-\tfrac{1}{2}(Y_i - X_i\beta)^T V_i^{-1}(Y_i - X_i\beta)\right\},\qquad f(Y) = \prod_{i=1}^m f(Y_i)$$
$$\Longrightarrow \text{MLE: } \hat\beta = \left(\sum_{i=1}^m X_i^T V_i^{-1} X_i\right)^{-1}\left(\sum_{i=1}^m X_i^T V_i^{-1} Y_i\right)$$
Stacking $X = (X_1^T,\ldots,X_m^T)^T$ and $V = \mathrm{diag}(V_1,\ldots,V_m)$,
$$\Longrightarrow \hat\beta = (X^T V^{-1} X)^{-1} X^T V^{-1} Y \quad \text{(GLS)}$$
$$\mathrm{Var}(\hat\beta) = \left(\sum_{i=1}^m X_i^T V_i^{-1} X_i\right)^{-1} = (X^T V^{-1} X)^{-1}$$
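As a sketch (in Python rather than the course's R/SAS; simulated data and illustrative names), the GLS estimator and its model-based variance are one line of linear algebra each:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix
v = rng.uniform(0.5, 2.0, size=n)                       # known variances (diagonal V)
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(v)

Vinv = np.diag(1.0 / v)                                 # V^{-1}
A = X.T @ Vinv @ X
beta_gls = np.linalg.solve(A, X.T @ Vinv @ y)           # (X'V^{-1}X)^{-1} X'V^{-1} Y
var_beta = np.linalg.inv(A)                             # model-based Var(beta_hat)
```

With clustered data V is block-diagonal rather than diagonal, but the formulas are identical.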
2. If V is unknown, then there is no guarantee that $\hat\beta$ is MVN.
Let ω denote the parameters of the covariance matrix.
E.g., $D = \begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{12} & \sigma_2^2\end{pmatrix}$, $R_i = \sigma_0^2 I \Longrightarrow \omega = (\sigma_1^2, \sigma_2^2, \sigma_{12}, \sigma_0^2)$.
(a) MLE of V(ω):
$$\frac{\partial l(\beta,\omega)}{\partial\omega} = 0 \Longrightarrow \hat V = V(\hat\omega_{MLE}) \Longrightarrow \hat\beta = (X^T\hat V^{-1}X)^{-1}(X^T\hat V^{-1}Y)$$
Problems:
① ω̂MLE is biased.
② ω̂MLE may have negative solutions (boundary of the parameter space).
(b) REML of V(ω):
Let A be a transformation matrix such that U = AY contains no β, i.e., AX = 0.
A = I − X(X^TX)^{−1}X^T and Y ∼ N(Xβ, V) ⟹ U ∼ N(0, AVA^T).
Reference: Harville (1974). Biometrika.
MLE estimates β and σ² (or V) simultaneously; σ̂²MLE is biased downwards because of the degrees of freedom lost in estimating β.
Intuition behind REML: maximize a modified likelihood free of β.
For a full-rank design matrix X (N × p), there exist at most N − p linearly independent vectors a with a^TX = 0.
⟸ In an N-dimensional space, we can find at most N − p linearly independent vectors (through the origin) orthogonal to the subspace spanned by the p columns of X.
Define A = [a1 a2 ⋯ a_{N−p}]:
⟹ A^TX = 0
⟹ y* = A^Ty = A^T(Xβ + ε) = A^Tε ∼ N(0, A^TVA), free of β.
Derivation of REML: Partition the likelihood into a likelihood for the mean parameter µ (and σ²) and a residual likelihood for σ² only.
Example 1.2.4. n independent normal random variables:
$$\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n \frac{(y_i-\mu)^2}{2\sigma^2}$$
$$= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n \frac{(y_i-\bar y)^2}{2\sigma^2} - \frac{n(\bar y-\mu)^2}{2\sigma^2}$$
$$= \left[-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{n(\bar y-\mu)^2}{2\sigma^2}\right] + \left[-\frac{n-1}{2}\log(2\pi) - \frac{n-1}{2}\log(\sigma^2) - \sum_{i=1}^n \frac{(y_i-\bar y)^2}{2\sigma^2}\right]$$
where the first component is the log-likelihood for ȳ (up to a constant), and the second component is the log-likelihood of n − 1 independent random variables.
Matrix transformation: For an MVN vector y, find an orthogonal matrix P, u = Py, such that u = (u1, u2, ⋯, un) are uncorrelated and normally distributed (orthogonal transformation).
⟹ Let the first row of P be (1, 1, ⋯, 1)/√n
⟹ u1 = √n ȳ, and u2, ⋯, un i.i.d. ∼ N(0, σ²)
Example 1.2.5. Linear model: y = Xβ + ε.
OLS estimate: b = (X^TX)^{−1}X^Ty
$$\log L(y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{(y-X\beta)^T(y-X\beta)}{2\sigma^2}$$
Rewrite:
$$y - X\beta = (y - Xb) + (Xb - X\beta) = (y - Xb) + X(b-\beta)$$
$$\Longrightarrow (y-X\beta)^T(y-X\beta) = (y-Xb)^T(y-Xb) + 2(y-Xb)^TX(b-\beta) + (b-\beta)^TX^TX(b-\beta)$$
$$= (y-Xb)^T(y-Xb) + (b-\beta)^TX^TX(b-\beta)$$
since y − Xb = (I − P)y, where P = X(X^TX)^{−1}X^T is the orthogonal projection onto the linear space spanned by the columns of X, so the cross term vanishes. And,
$$(y-Xb)^T(y-Xb) = (y-Py)^T(y-Py) = y^T(I-P)^T(I-P)y = y^T(I-P)y$$
since I − P is symmetric and idempotent. So,
$$\log L(y) = \left[-\frac{p}{2}\log(2\pi) - \frac{p}{2}\log(\sigma^2) - \frac{(b-\beta)^TX^TX(b-\beta)}{2\sigma^2}\right] + \left[-\frac{n-p}{2}\log(2\pi) - \frac{n-p}{2}\log(\sigma^2) - \frac{y^T(I-P)y}{2\sigma^2}\right]$$
where the second component is the residual log-likelihood, free of β.
Example 1.2.6. Linear mixed model: y = Xβ + Zb + ε, where b ∼ N(0, D), ε ∼ N(0, R).
$$H = \mathrm{Var}(y) = ZDZ^T + R$$
Find a transformation $L^Ty = \begin{pmatrix} y_1^*\\ y_2^*\end{pmatrix}$, L = (L1, L2), such that
$$L_1^TX = I_p,\qquad L_2^TX = 0.$$
Here L1 is an n × p matrix and L2 is n × (n − p). Therefore,
$$y_1^* \sim N(\beta,\ L_1^THL_1),\qquad y_2^* \sim N(0,\ L_2^THL_2),\qquad \mathrm{Cov}(y_1^*, y_2^*) = L_1^THL_2.$$
Using the conditional-distribution property of the MVN, we get:
$$y_1^*\mid y_2^* \sim N\!\left(\beta + L_1^THL_2(L_2^THL_2)^{-1}y_2^*,\ \ L_1^THL_1 - L_1^THL_2(L_2^THL_2)^{-1}L_2^THL_1\right)$$
And, writing B = (H^{−1}X, L2) (an n × n matrix),
$$B^{-1}H^{-1}(B^T)^{-1} = (B^THB)^{-1} = \begin{pmatrix} X^TH^{-1}X & X^TL_2\\ L_2^TX & L_2^THL_2\end{pmatrix}^{-1} = \begin{pmatrix} (X^TH^{-1}X)^{-1} & 0\\ 0 & (L_2^THL_2)^{-1}\end{pmatrix}$$
$$\Longrightarrow H^{-1} = B\begin{pmatrix}(X^TH^{-1}X)^{-1} & 0\\ 0 & (L_2^THL_2)^{-1}\end{pmatrix}B^T = H^{-1}X(X^TH^{-1}X)^{-1}X^TH^{-1} + L_2(L_2^THL_2)^{-1}L_2^T$$
On both sides, pre-multiply by L1^TH and post-multiply by HL1:
$$\Longrightarrow L_1^THL_1 = L_1^TX(X^TH^{-1}X)^{-1}X^TL_1 + (L_1^THL_2)(L_2^THL_2)^{-1}(L_2^THL_1)$$
$$= (X^TH^{-1}X)^{-1} + (L_1^THL_2)(L_2^THL_2)^{-1}(L_2^THL_1)$$
since L1^TX = Ip. Therefore,
$$y_1^*\mid y_2^* \sim N\!\left(\beta + L_1^THL_2(L_2^THL_2)^{-1}y_2^*,\ (X^TH^{-1}X)^{-1}\right)$$
Because y2* ∼ N(0, L2^THL2), the residual likelihood is:
$$l_R = C - \frac{1}{2}\left(\log|L_2^THL_2| + y_2^{*T}(L_2^THL_2)^{-1}y_2^*\right) = C - \frac{1}{2}\left(\log|L_2^THL_2| + y^TL_2(L_2^THL_2)^{-1}L_2^Ty\right)$$
Let P = H^{−1} − H^{−1}X(X^TH^{−1}X)^{−1}X^TH^{−1} = L2(L2^THL2)^{−1}L2^T. And we have:
$$|L^THL| = |L_2^THL_2|\left|L_1^THL_1 - (L_1^THL_2)(L_2^THL_2)^{-1}(L_2^THL_1)\right| = |L_2^THL_2|\,|(X^TH^{-1}X)^{-1}|$$
$$\log|L^THL| = \log|LL^TH| = \log|LL^T| + \log|H|$$
$$\Longrightarrow \log|L_2^THL_2| = \log|L^THL| + \log|X^TH^{-1}X| = \log|LL^T| + \log|H| + \log|X^TH^{-1}X|$$
Therefore,
$$l_R = C^* - \frac{1}{2}\left(\log|H| + \log|X^TH^{-1}X| + y^TPy\right)$$
which is the REML log-likelihood for the parameters in H; the log-likelihood for the fixed effects is based on y1*∣y2*. (Note: log|LL^T| is merged into the constant term.)
Comparison between MLE and REML:
① REML removes the bias of ω̂MLE.
② REML does not estimate β directly, so it is invariant to the value of β.
③ REML is less sensitive to outliers than MLE.
④ The difference between REML and ML → 0 as m → ∞, where m is the number of clusters (e.g., independent families), not the overall sample size.
⑤ From a Bayesian perspective, using u = A^TY is equivalent to ignoring any prior information on β. In the absence of information on β, no information about ω is lost by using u = A^TY instead of Y.
⑥ Var(ω̂REML) ≥ Var(ω̂MLE) ⟹ REML is less efficient than MLE.
REML: unbiased, less efficient;
MLE: biased, more efficient.
MSE = Var + bias². Let p be the number of β's.
If p ≤ 4, then MSE(ω̂MLE) < MSE(ω̂REML); if p > 4, MSE(ω̂MLE) > MSE(ω̂REML).
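In the fixed-effects-only case the contrast is concrete: σ̂²_MLE = RSS/n is biased downward by the factor (n − p)/n, while the REML-type estimator RSS/(n − p) is unbiased. A quick simulation (Python sketch; all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 20, 3, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual projection I - P

mle, reml = [], []
for _ in range(2000):
    y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)
    rss = y @ M @ y
    mle.append(rss / n)          # E = sigma2 (n - p)/n = 3.4: biased downward
    reml.append(rss / (n - p))   # E = sigma2 = 4.0: unbiased

mean_mle, mean_reml = np.mean(mle), np.mean(reml)
```

With n = 20 and p = 3 the MLE averages near 3.4 rather than the true 4.0; the bias shrinks as n grows, matching point ④ above.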
Inference for β
Simple test: H0: βj = 0 vs HA: βj ≠ 0. $\hat V(\hat\beta) = (X^T\hat V^{-1}X)^{-1}$.
Wald test: $\hat\beta_j/\hat\sigma_j \stackrel{\cdot}{\sim} N(0,1)$ → t-test; $\hat\beta_j^2/\hat\sigma_j^2 \stackrel{\cdot}{\sim} \chi^2(1)$ → F-test.
Debate on degrees of freedom: for linear models, df = n − p; for longitudinal data, $n = \sum_{i=1}^m n_i$, so df $= \sum_{i=1}^m n_i - p$ (?). Use an approximation in practice, e.g., Satterthwaite.
Contrast test: H0: Lβ = 0. Var(Lβ̂) = L Var(β̂) L^T.
Wald test: $T_L = (L\hat\beta)^T\left[L\hat V(\hat\beta)L^T\right]^{-1}(L\hat\beta) \stackrel{\cdot}{\sim} \chi^2(r)$ under H0, r = rank(L).
Model comparison:
① nested models: LRT ∼ χ²_k
② non-nested models:
AIC = −2(l − p), where l is the log-likelihood and p = #(β, ω) (penalizes the number of parameters)
BIC = −2[l − (p/2) log N]
Robust Variance Estimator
If V is known, the generalized least squares (GLS) estimator of β is β̂ = (X^TV^{−1}X)^{−1}X^TV^{−1}Y.
$$\mathrm{Var}(\hat\beta) = (X^TV^{-1}X)^{-1}X^TV^{-1}\,\mathrm{Var}(Y)\,V^{-1}X(X^TV^{-1}X)^{-1}$$
$$\approx (X^TV^{-1}X)^{-1}X^TV^{-1}\underbrace{(Y-\hat Y)(Y-\hat Y)^T}_{\widehat{\mathrm{Var}}(Y),\ \text{the empirical variance of } Y}V^{-1}X(X^TV^{-1}X)^{-1}$$
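A sketch of this sandwich in Python (simulated clusters with a deliberately misspecified working-independence V, to show the estimator's point; all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, ni = 50, 4
data = []
for _ in range(m):
    Xi = np.column_stack([np.ones(ni), rng.normal(size=ni)])
    yi = Xi @ np.array([1.0, 0.5]) + rng.normal() + rng.normal(size=ni)  # shared cluster effect
    data.append((Xi, yi))

Vinv = np.eye(ni)                                  # working V^{-1}: independence (wrong here)
A = sum(Xi.T @ Vinv @ Xi for Xi, _ in data)        # bread: X'V^{-1}X
beta = np.linalg.solve(A, sum(Xi.T @ Vinv @ yi for Xi, yi in data))

meat = np.zeros((2, 2))
for Xi, yi in data:
    r = yi - Xi @ beta                             # (Y_i - Y_hat_i)
    meat += Xi.T @ Vinv @ np.outer(r, r) @ Vinv @ Xi
Ainv = np.linalg.inv(A)
var_robust = Ainv @ meat @ Ainv                    # the sandwich above, cluster by cluster
```

Even though the working V ignores the shared cluster effect, the sandwich uses the empirical residual products, so the variance estimate remains valid for large m.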
Example 1.2.7. Let Yij be the pain score for woman i at time tj, and let Ti = 0 for control, 1 for treatment.
1. Model (fixed slope): Yij = β0 + β1tj + β2Ti + b2Ti + eij, with b2 ∼ N(0, σ²_b), eij ∼ N(0, σ²), b2 ⊥ eij.
E(Yij) = β0 + β1tj + β2Ti, and
$$\mathrm{Var}(Y_{ij}) = T_i^2\mathrm{Var}(b_2) + \mathrm{Var}(e_{ij}) = \begin{cases}\sigma_b^2 + \sigma^2 & \text{if } T_i = 1\\ \sigma^2 & \text{if } T_i = 0.\end{cases}$$
Hence, the variation of Yij in the treatment group > the control group.
There is no interaction term: no matter which treatment is given, the pain score progresses at the same rate over time ⟹ the same slope.
2. Model: Yij = β0 + β1tj + β2Ti + b2Si + eij, where Si = 1 − Ti.
Hence, the variation of Yij in the treatment group < the control group.
3. Model (random slope): Yij = β0 + β1tj + β2Ti + β3Titj + b2Ti + b3Titj + eij.
Assuming independent random effects, Var(Yij) = T_i² Var(b2) + T_i² t_j² Var(b3) + Var(eij).
Hence, the variation of Yij in the treatment group > the control group.
Two-stage model:
Stage I: Yij = β0i + β1itj + eij models the individual measurements (e.g., repeated measures from patients)
Stage II: $\begin{pmatrix}\beta_{0i}\\ \beta_{1i}\end{pmatrix} = \begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix} + \begin{pmatrix}b_{0i}\\ b_{1i}\end{pmatrix}$ models the differences between patients (normality assumption)
⟹ Yij = (β0 + b0i) + (β1 + b1i)tj + eij = β0 + β1tj + b0i + b1itj + eij.
Assume E(bi) = 0, $\mathrm{Var}(b_i) = D = \begin{pmatrix} d_{11} & d_{12}\\ d_{12} & d_{22}\end{pmatrix}$.
Yi = Xiβ + Zibi + ei ⟹ Var(Yi) = ZiDZ_i^T + Ri
Var(b0i) = d11 ⟹ a 95% plausible range for the intercept is β̂0 ± 1.96√d11
Var(b1i) = d22 ⟹ a 95% plausible range for the slope is β̂1 ± 1.96√d22
① H0: $\mathrm{Var}(b_i) = \begin{pmatrix} d_{11} & 0\\ 0 & d_{22}\end{pmatrix}$ ⟹ d12 = 0, i.e., b0i ⊥ b1i
② H0: $\mathrm{Var}(b_i) = \begin{pmatrix} d_{11} & 0\\ 0 & 0\end{pmatrix}$ ⟹ d12 = 0 and d22 = 0. Problem: this tests on the boundary of the parameter space, so the χ²₂ distribution no longer holds. Usually LRT ∼ χ²_k under H0, but in this case LRT ∼ a 50:50 mixture of χ²₁ and χ²₂. (Ref: Verbeke & Molenberghs, Ch. 6.3)
Best Linear Unbiased Predictor (BLUP)
Subjects with "large" b0i are very "different". It would be nice to have values of b0i.
Difficulty: b0i is random, so it can be any one of an infinite number of values. For prediction, we often use expected values. E.g., given Y1, . . . , Yn, predict Yn+1 by Ȳn ≈ E(Y).
For b01, . . . , b0m the problem is that we cannot use the marginal expectation, since E(b0i) = 0.
Solution: instead of the marginal distribution, use the expectation under the conditional distribution.
Let b̂0i = E(b0i∣Yi), the function c(Yi) that minimizes the MSE E[b0i − c(Yi)]².
Simple case: Yij = µ + bi + eij, bi ∼ N(0, D) (random-effect one-way ANOVA).
Here D is a scalar, eij ∼ N(0, σ²), eij ⊥ bi.
Yij∣bi ∼ N(µ + bi, σ²)
$$f(b_i\mid Y_i) = \frac{f(Y_i\mid b_i)f(b_i)}{f(Y_i)} \propto f(Y_i\mid b_i)f(b_i) \sim \text{normal distribution.}$$
$$f(b_i\mid Y_i) \propto (\sigma^2)^{-n_i/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\mu-b_i)^2\right\}D^{-1/2}\exp\left\{-\frac{b_i^2}{2D}\right\}$$
$$\propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^{n_i}\left[(Y_{ij}-\mu)^2 - 2b_i(Y_{ij}-\mu) + b_i^2\right] - \frac{b_i^2}{2D}\right\}$$
$$= \exp\left\{-\left[\frac{n_i}{2\sigma^2} + \frac{1}{2D}\right]b_i^2 + \frac{b_i}{\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\mu) - \frac{1}{2\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\mu)^2\right\}$$
$$\Longrightarrow b_i\mid Y_i \sim N(\mu_{b\mid Y},\ \sigma^2_{b\mid Y}),\qquad f(b_i\mid Y_i) \propto \exp\left\{-\frac{1}{2\sigma^2_{b\mid Y}}\left(b_i^2 - 2b_i\mu_{b\mid Y}\right) - \frac{\mu_{b\mid Y}^2}{2\sigma^2_{b\mid Y}}\right\}$$
Matching coefficients,
$$\frac{1}{2\sigma^2_{b\mid Y}} = \frac{n_i}{2\sigma^2} + \frac{1}{2D} \Longrightarrow \sigma^2_{b\mid Y} = \left(\frac{n_iD+\sigma^2}{\sigma^2 D}\right)^{-1} = \frac{\sigma^2 D}{n_iD+\sigma^2}$$
$$\frac{\mu_{b\mid Y}}{\sigma^2_{b\mid Y}} = \frac{1}{\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\mu) \Longrightarrow \mu_{b\mid Y} = \frac{\sigma^2_{b\mid Y}}{\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\mu) = \frac{n_iD}{n_iD+\sigma^2}(\bar Y_i - \mu) < \bar Y_i - \mu$$
In the model Yij = β0 + b0i + eij, a naive estimate would be b̃0i = Ȳi − µ̂, where Ȳi is the cluster mean and µ̂ the population average. Instead,
$$\hat b_i = E(b_i\mid Y_i) = \frac{n_iD}{n_iD+\sigma^2}(\bar Y_i - \mu)$$
is the shrinkage estimate (shrunk toward the population mean). b̂i is called the shrinkage estimator, a weighted deviation of Ȳi from µ.
b̂i is unbiased: $E(\hat b_i) = \frac{n_iD}{n_iD+\sigma^2}E(\bar Y_i - \mu) = 0$.
In practice, plugging in estimates, $\hat b_i = \frac{n_i\hat D}{n_i\hat D+\hat\sigma^2}(\bar Y_i - \hat\mu)$ ⟶ Empirical BLUP (EBLUP).
Define µi = E(Yij∣bi) = µ + bi.
Predict: $\hat\mu_i = \hat\mu + \hat b_i = \frac{n_iD}{n_iD+\sigma^2}\bar Y_i + \frac{\sigma^2}{n_iD+\sigma^2}\hat\mu$ vs µ̂i = Ȳi.
① If D ≫ σ², then µ̂i → Ȳi.
② If σ² ≫ D, then µ̂i → µ̂.
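The shrinkage weight niD/(niD + σ²) and the two limiting cases are easy to see numerically (Python sketch; all values hypothetical):

```python
def blup(ybar_i, n_i, mu, D, sigma2):
    # b_hat_i = n_i D / (n_i D + sigma^2) * (ybar_i - mu)
    w = n_i * D / (n_i * D + sigma2)
    return w * (ybar_i - mu)

# cluster mean 3.0, population mean 1.0, so the unshrunken deviation is 2.0
b_big_D = blup(3.0, n_i=5, mu=1.0, D=100.0, sigma2=1.0)   # D >> sigma^2: almost no shrinkage
b_big_s = blup(3.0, n_i=5, mu=1.0, D=0.01, sigma2=100.0)  # sigma^2 >> D: shrunk nearly to 0
```

The first call returns nearly the raw deviation 2.0; the second shrinks it almost entirely to 0, as in cases ① and ② above.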
Summary: Yi = Xiβ + Zibi + εi, bi ∼ N(0, D), ei ∼ N(0, Ri), bi ⊥ ei.
Hence, Yi ∼ N(Xiβ, Σi) where Σi = ZiDZ_i^T + Ri.
$$\hat b_i = E(b_i\mid Y_i) = DZ_i^T\Sigma_i^{-1}(Y_i - X_i\hat\beta)$$
Marginal mean: $\hat\mu_i = \widehat{E(Y_i)} = X_i\hat\beta$
Conditional mean:
$$\widehat{E(Y_i\mid b_i)} = X_i\hat\beta + Z_i\hat b_i = Z_iDZ_i^T\Sigma_i^{-1}Y_i + (I_{n_i} - Z_iDZ_i^T\Sigma_i^{-1})X_i\hat\beta = W_{i1}Y_i + W_{i2}\hat\mu_i,$$
a weighted average of the data Yi and the marginal mean µ̂i.
Bayesian interpretation:
E(bi) = 0 is the prior mean of bi, before data collection.
E(bi∣Yi) = b̂i is the posterior mean, after data collection.
EBLUP → Empirical Bayes estimator
Variance Components
Yi = Xiβ + Zibi + ei = Xiβ + εi ⟹ β̂ = (X^TΣ^{−1}X)^{−1}X^TΣ^{−1}Y, where Σ = ZDZ^T + R.
Σ: covariance matrix of Y; D: random-effects covariance; R: residual covariance.
If Zi = Z and Ri = σ²I_{ni}, then β̂ is invariant to the choice of Z: β̂ = (X^TX)^{−1}X^TY.
Example 1.2.8. Yi = Xiβ + Zibi + ei
1. Random intercept: $Z_i = \begin{pmatrix}1\\ 1\\ 1\end{pmatrix}$, Var(bi) = d11 = D, Var(ei) = σ²I.
Assume ni = 3:
$$\mathrm{Var}(Y_i) = Z_iDZ_i^T + R_i = \begin{pmatrix}1\\1\\1\end{pmatrix}d_{11}\begin{pmatrix}1\\1\\1\end{pmatrix}^T + \sigma^2 I_3 = \begin{pmatrix} d_{11}+\sigma^2 & d_{11} & d_{11}\\ d_{11} & d_{11}+\sigma^2 & d_{11}\\ d_{11} & d_{11} & d_{11}+\sigma^2\end{pmatrix} = \sigma_1^2\begin{pmatrix}1 & \rho & \rho\\ \rho & 1 & \rho\\ \rho & \rho & 1\end{pmatrix}$$
where σ₁² = d11 + σ² and ρ = d11/(d11 + σ²).
2. Random intercept and random slope:
$$b_i = \begin{pmatrix}b_{0i}\\ b_{1i}\end{pmatrix},\quad D = \begin{pmatrix}d_{11} & 0\\ 0 & d_{22}\end{pmatrix},\quad R_i = \sigma^2 I_{n_i},\quad Z_i = \begin{pmatrix}1 & t_1\\ 1 & t_2\\ 1 & t_3\end{pmatrix}$$
$$\Sigma_i = Z_iDZ_i^T + R_i = \begin{pmatrix} d_{11}+t_1^2d_{22}+\sigma^2 & d_{11}+t_1t_2d_{22} & d_{11}+t_1t_3d_{22}\\ d_{11}+t_1t_2d_{22} & d_{11}+t_2^2d_{22}+\sigma^2 & d_{11}+t_2t_3d_{22}\\ d_{11}+t_1t_3d_{22} & d_{11}+t_2t_3d_{22} & d_{11}+t_3^2d_{22}+\sigma^2\end{pmatrix}$$
$$\mathrm{Corr}(Y_{i1},Y_{i2}) = \frac{d_{11}+t_1t_2d_{22}}{\sqrt{d_{11}+t_1^2d_{22}+\sigma^2}\sqrt{d_{11}+t_2^2d_{22}+\sigma^2}}$$
Hence, t ↑ ⟹ Var(Yit) ↑, and ∣t1 − t2∣ ↑ ⟹ Corr ↓.
Let $R_i = \sigma^2\begin{pmatrix}1 & \rho & \rho^2\\ \rho & 1 & \rho\\ \rho^2 & \rho & 1\end{pmatrix}$. Then $\mathrm{Corr}(Y_{i1},Y_{i2}) = \dfrac{d_{11}+t_1t_2d_{22}+\rho\sigma^2}{\sqrt{d_{11}+t_1^2d_{22}+\sigma^2}\sqrt{d_{11}+t_2^2d_{22}+\sigma^2}}$.
3. (a) $Z_i = \begin{pmatrix}T_i\\ T_i\\ T_i\end{pmatrix}$, with Ti = 0 for control, 1 for treatment.
$$\mathrm{Var}(Y_i) = d_{11}T_i\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix} + \sigma^2I_3 = \begin{cases}\sigma^2 I & \text{if } T_i = 0\\[6pt] \begin{pmatrix}d_{11}+\sigma^2 & d_{11} & d_{11}\\ d_{11} & d_{11}+\sigma^2 & d_{11}\\ d_{11} & d_{11} & d_{11}+\sigma^2\end{pmatrix} & \text{if } T_i = 1\end{cases}$$
If we want correlation in both groups, then:
(b) Let Ri be compound-symmetric:
$$\mathrm{Var}(Y_i) = d_{11}T_iJ_{3\times3} + \sigma^2\begin{pmatrix}1&\rho&\rho\\ \rho&1&\rho\\ \rho&\rho&1\end{pmatrix} = \begin{cases}\sigma^2\begin{pmatrix}1&\rho&\rho\\ \rho&1&\rho\\ \rho&\rho&1\end{pmatrix} & \text{if } T_i = 0\\[6pt] \begin{pmatrix}d_{11}+\sigma^2 & d_{11}+\rho\sigma^2 & d_{11}+\rho\sigma^2\\ d_{11}+\rho\sigma^2 & d_{11}+\sigma^2 & d_{11}+\rho\sigma^2\\ d_{11}+\rho\sigma^2 & d_{11}+\rho\sigma^2 & d_{11}+\sigma^2\end{pmatrix} & \text{if } T_i = 1\end{cases}$$
(c) $Z_i = \begin{pmatrix}1 & T_i\\ 1 & T_i\\ 1 & T_i\end{pmatrix}$, Ri = σ²I.
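Case 1 above (random intercept ⟹ compound symmetry) can be verified numerically; a Python sketch with hypothetical values d11 = 2, σ² = 1:

```python
import numpy as np

d11, sigma2, ni = 2.0, 1.0, 3
Z = np.ones((ni, 1))                                   # random-intercept design
V = Z @ np.array([[d11]]) @ Z.T + sigma2 * np.eye(ni)  # Z D Z' + R

sigma1_sq = d11 + sigma2                               # common variance d11 + sigma^2
rho = d11 / (d11 + sigma2)                             # intra-class correlation
CS = sigma1_sq * ((1 - rho) * np.eye(ni) + rho * np.ones((ni, ni)))
# V equals CS exactly: a random intercept induces exchangeable correlation
```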
Resampling Methods
Model: Yi = Xiβ + Zibi + ei, bi ∼ N(0, D), ei ∼ N(0, Ri), bi ⊥ ei.
Model-based estimate: β̂ = (X^TΣ̂^{−1}X)^{−1}X^TΣ̂^{−1}Y, where Σ = ZDZ^T + R, Var(β̂) = (X^TΣ̂^{−1}X)^{−1}.
If the model is misspecified, then we can use the empirical/robust Var(β̂).
1. Jackknife
We have observations X1, . . . , Xn i.i.d. ∼ F(x; θ). From the data we compute an estimate of θ, e.g., θ̂ = X̄ or $\hat\theta = \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2$.
Let θ̂(i) be the estimate of θ leaving out Xi. So there are n such estimates.
$$\hat\theta_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n\hat\theta_{(i)},\qquad \widehat{\mathrm{bias}}(\hat\theta) = (n-1)\left[\hat\theta_{(\cdot)} - \hat\theta\right]$$
Example 1.2.9. θ = E(X), θ̂ = X̄:
$$\hat\theta_{(i)} = \frac{n\bar X - X_i}{n-1} \Longrightarrow \hat\theta_{(\cdot)} = \frac{\sum_{i=1}^n(n\bar X - X_i)}{n(n-1)} = \frac{n}{n-1}\bar X - \frac{1}{n(n-1)}\sum_{i=1}^n X_i = \bar X \Longrightarrow \widehat{\mathrm{bias}}(\hat\theta) = 0$$
Define the pseudo-values θ̃i = θ̂ + (n − 1)[θ̂ − θ̂(i)].
The jackknife estimator is $\tilde\theta = \frac{1}{n}\sum_{i=1}^n\tilde\theta_i$.
bias(θ̂) = O(1/n) and bias(θ̃) = O(1/n²) ⟹ the bias of θ̃ drops faster than the bias of θ̂.
$$\widehat{\mathrm{Var}}(\hat\theta) = \frac{n-1}{n}\sum_{i=1}^n\left[\hat\theta_{(i)} - \hat\theta_{(\cdot)}\right]^2 = \frac{1}{n(n-1)}\sum_{i=1}^n(\tilde\theta_i - \tilde\theta)^2.\quad \text{Reference: Tukey (1958).}$$
Application of the jackknife to repeated measures: re-estimate β after removing each independent cluster.
$$\hat\beta_{(i)} = \left(\sum_{j=1,j\neq i}^m X_j^T\Sigma_j^{-1}X_j\right)^{-1}\left(\sum_{j=1,j\neq i}^m X_j^T\Sigma_j^{-1}Y_j\right),\qquad \hat\beta_{(\cdot)} = \frac{1}{m}\sum_{i=1}^m\hat\beta_{(i)}$$
$$\widehat{\mathrm{Var}}(\hat\beta) = \frac{m-1}{m}\sum_{i=1}^m\left[\hat\beta_{(i)} - \hat\beta_{(\cdot)}\right]^2 = \frac{(m-1)^2}{m}\cdot\frac{1}{m-1}\sum_{i=1}^m\left[\hat\beta_{(i)} - \hat\beta_{(\cdot)}\right]^2$$
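The basic jackknife above can be sketched in a few lines (Python; `stat` is any estimator of θ):

```python
import numpy as np

def jackknife(x, stat):
    # leave-one-out estimates, bias estimate, and Tukey's variance estimate
    n = len(x)
    theta_hat = stat(x)
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    bias = (n - 1) * (loo.mean() - theta_hat)
    var = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
    return bias, var

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
bias_mean, var_mean = jackknife(x, np.mean)
# for the sample mean: bias exactly 0 (Example 1.2.9) and var = s^2/n
```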
2. Bootstrap
In the real world, the observed data X = (X1, . . . , Xn) give θ̂ = S(X). Ideally we would resample from the population many times, with no selection bias; in practice, experiments are expensive.
In the bootstrap world, the sample X is viewed as the population. We resample from X: X* = (X*₁, . . . , X*ₙ) ⟹ θ̂* = S(X*).
There are $\binom{2n-1}{n}$ unique bootstrap samples ⟹ $\hat\theta^{*(1)},\cdots,\hat\theta^{*\left(\binom{2n-1}{n}\right)}$.
⟹ Bootstrap distribution of θ̂* ≈ true distribution of θ̂.
$$\hat\theta^{*(\cdot)} = \frac{1}{B}\sum_{b=1}^B\hat\theta^{*(b)},\quad B = \binom{2n-1}{n},\qquad \widehat{\mathrm{Var}}(\hat\theta) = \frac{1}{B-1}\sum_{b=1}^B\left[\hat\theta^{*(b)} - \hat\theta^{*(\cdot)}\right]^2$$
Application of the bootstrap to repeated measures:
(1) Resample unique subjects/clusters (IDs): sample one unique patient ID at a time; the whole cluster then goes into the new sample. Otherwise, e.g., drawing 5 observations from cluster 1 and 7 from cluster 2 would create bias. Here the assumption is that cluster sizes are equal. If cluster sizes vary, one possible solution is to split the clusters into equal-size groups.
(2) Compute β̂ and denote it β̂*(b). We do not consider the standard error estimate at this step; we use the bootstrap estimate of variance instead.
(3) Repeat steps (1) and (2) B times, where B = $\binom{2m-1}{m}$ (the total number of unique draws from the sample) or a pre-specified number, e.g., 500.
$$\Longrightarrow \widehat{\mathrm{Var}}(\hat\beta) = \frac{1}{B-1}\sum_{b=1}^B\left[\hat\beta^{*(b)} - \hat\beta^{*(\cdot)}\right]\left[\hat\beta^{*(b)} - \hat\beta^{*(\cdot)}\right]^T \quad\text{where } \hat\beta^{*(\cdot)} = \frac{1}{B}\sum_{b=1}^B\hat\beta^{*(b)}$$
Estimate the confidence intervals:
i. $\hat\beta_j^{*(\cdot)} \pm 1.96\,\widehat{se}(\hat\beta_j^{*(\cdot)})$
ii. the (2.5th, 97.5th) percentiles of the bootstrap distribution
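Steps (1)-(3) can be sketched as follows (Python; toy clusters and a grand-mean "fit" stand in for the real mixed-model fit):

```python
import numpy as np

def cluster_bootstrap(clusters, fit, B=500, seed=0):
    # resample whole clusters with replacement, refit, return the B estimates
    rng = np.random.default_rng(seed)
    m = len(clusters)
    stats = []
    for _ in range(B):
        ids = rng.integers(0, m, size=m)        # step (1): draw m cluster IDs
        sample = [clusters[i] for i in ids]     # each drawn cluster enters whole
        stats.append(fit(sample))               # step (2): refit on the new sample
    return np.array(stats)

rng = np.random.default_rng(1)
clusters = [rng.normal() + rng.normal(size=4) for _ in range(30)]
boot = cluster_bootstrap(clusters, lambda cs: np.concatenate(cs).mean())
var_boot = boot.var(ddof=1)                     # step (3): bootstrap variance
ci = np.percentile(boot, [2.5, 97.5])           # percentile interval (method ii)
```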
1.2.3 Generalized Linear Mixed Models (GLMM)
GLMM extends LMM to non-normal outcomes through a link function (e.g., the logit). It models correlation through random effects:
E(Yij∣bi) = µij, with the linear predictor Xijβ + Zijbi entering through a link, e.g., logit(µij) = Xijβ + Zijbi, bi ∼ N(0, D)
Objectives:
1. interested in the relationship between X and Y: Yij ∼ Xij
2. interested in the correlations as well
3. the β's have a subject-specific interpretation
Recall GLM: g(µi) = Xiβ. GLMM: E(Yij∣bi) = µij, g(µij) = X_ij^Tβ + Z_ij^Tbi, bi ∼ N(0, D(θ)).
$$L(Y_i\mid \beta,\theta) = \int\prod_{j=1}^{n_i}L(Y_{ij}\mid b_i)\,L(b_i)\,db_i$$
Remarks:
1. L(·) does not have a closed form except when Yij is normal.
2. We need to evaluate the integral numerically, which is difficult in high dimensions. A Riemann (grid) approximation with, say, 100 grid points per dimension costs 100^q evaluations for a q-dimensional random effect, which is very slow.
3. Methods: (1) approximation, (2) Gauss-Hermite quadrature, (3) Gibbs sampling - BUGS.
Reference: Breslow & Clayton (1993). JASA; DLZ (1994).
The quasi-likelihood of (β, θ) is
$$L(\beta,\theta) \propto |D|^{-1/2}\int\exp\left\{\sum_{i=1}^n l_i(y_i\mid\beta,b) - \frac{1}{2}b^TD^{-1}b\right\}db \quad\text{where } l_i(y_i\mid\beta,b) \propto \int_{y_i}^{\mu_i^b}\frac{y_i-u}{\phi V(u)}\,du.$$
Example 1.2.10.
1. Clustered binomial data:
$$\mathrm{logit}(\mu_{ij}^b) = X_{ij}^T\beta + Z_{ij}^Tb_i,\quad E(Y_{ij}\mid b_i) = \mu_{ij}^b,\quad \mathrm{Var}(Y_{ij}\mid b_i) = \phi\, m_{ij}^{-1}\mu_{ij}^b(1-\mu_{ij}^b)$$
2. Clustered Poisson data:
$$\log(\mu_{ij}^b) = X_{ij}^T\beta + Z_{ij}^Tb_i,\quad E(Y_{ij}\mid b_i) = \mu_{ij}^b,\quad \mathrm{Var}(Y_{ij}\mid b_i) = \phi\,\mu_{ij}^b$$
Inference in GLMM:
1. Conditional inference:
Idea: find the sufficient statistics for b and make inference using the likelihood conditional on those sufficient statistics.
Advantages:
(a) robustness: no distributional assumption on b
(b) the likelihood has a closed form
Disadvantages:
(a) Only works in special cases, e.g., logistic GLMM; Poisson log-linear GLMM with random intercept
(b) Some covariate effects cannot be estimated, e.g., cluster-level covariate effects in conditional logistic regression
Details: Treat bi as fixed, θij = θij(β, bi):
$$\prod_{i=1}^m\prod_{j=1}^{n_i}f(Y_{ij}\mid\beta,b_i) \propto \prod_{i=1}^m\prod_{j=1}^{n_i}\exp\left(\theta_{ij}Y_{ij} - \Psi(\theta_{ij})\right)$$
Using the exponential family with canonical link, θij = X_ij^Tβ + Z_ij^Tbi, the sufficient statistics are
$$a_i = \sum_{i,j}x_{ij}y_{ij} \text{ for } \beta,\qquad \tilde b_i = \sum_j Z_{ij}y_{ij} \text{ for } b_i.$$
See Diggle, Heagerty, Liang, and Zeger, "Analysis of Longitudinal Data", 2002. Conditioning on the sufficient statistics for the bi,
$$\prod_{i=1}^m f\left(Y_i\,\Big|\,\sum_j Z_{ij}y_{ij} = \tilde b_i,\ \beta\right) = \cdots = \prod_{i=1}^m\frac{\sum_{R_{i1}}\exp(\beta^Ta_i)}{\sum_{R_{i2}}\exp\left(\beta^T\sum_{j=1}^{n_i}x_{ij}y_{ij}\right)}$$
2. Approximate inference:
Idea: approximate l(β, θ) using various approximations.
(a) Laplace approximation: expand the integrand of l(β, θ) about the mode, b = b̂, in a Taylor series before integration.
Ref: Tierney and Kadane, 1986, JASA; Lin and Pierce, 1993, Biometrika; Breslow and Lin, 1995, Biometrika.
(b) Solomon-Cox approximation: expand the integrand of l(β, θ) about the mean of the random effects.
Ref: Barndorff-Nielsen and Cox, 1989, Chapter 3.3; Solomon and Cox, 1992, Biometrika; Breslow and Lin, 1995, Biometrika.
(c) Penalized quasi-likelihood (PQL): a modified Laplace approach; iteratively fit a linear mixed model using the GLM working weight and working vector, i.e., repeatedly call Proc Mixed in SAS (GLIMMIX).
Performs poorly (biased β̂) when: 1) Poisson with a small mean, or binomial with small p or n; 2) the random-effect variances are large or highly correlated.
Ref: Schall, 1991, Biometrika; Breslow and Clayton, 1993, JASA.
(d) Corrected PQL (cPQL): PQL does not work well for sparse data (the Laplace approximation does not perform well). Use simple correction terms for β̂PQL and θ̂PQL to reduce bias.
Ref: Breslow and Lin, 1995, Biometrika; Lin and Breslow, 1996, JASA.
Note: All these approximation procedures generally do not give consistent estimates of β and θ except in the normal case.
3. E-M algorithm: convert the problem to a missing-data scenario, i.e., the random effects are missing.
Complete data: Y, b
Incomplete data: Y
E-step: Use expectations to "impute" the missing values. Specifically, calculate the expected value of the sufficient statistic, conditional on the observed data:
$$t = E\left[bb^T\mid Y,\theta\right]$$
which involves the same integration we would like to avoid in likelihood inference.
(a) Gaussian approximation. Ref: Stiratelli, Laird, Ware, 1982, Biometrics.
(b) Second-order Laplace approximation. Ref: Steele, 1996, Biometrics.
(c) Monte Carlo simulation (using the Metropolis method). Ref: McCulloch, 1994, 1997, JASA; Wallner, et al., 1997, JASA.
M-step: estimate θ using the imputed data (sufficient statistic).
4. Gibbs sampling.
Prior for β: flat prior;
Prior for D(θ): Gamma (the Jeffreys prior does not work since the posterior is not proper).
Objective: generate the joint distribution [β, θ, b∣y]
Idea: generate a series of conditional distributions: [β∣θ, b, y], [b∣β, θ, y], [θ∣β, b, y]
Ref: Zeger and Karim, 1991, JASA; McCulloch, 1994, JASA; Gelfand, Sahu, Carlin, 1995, Bayesian Stat.
1.2.4 Generalized Estimating Equations (GEE)
Generalized estimating equations give a marginal model for the nonlinear/non-normal case.
Reference: Liang & Zeger, 1986. Biometrika. p13-22.
Longitudinal/clustered data:
* K independent subjects/clusters
* ni observations over time/within a cluster
$$Y_i = (Y_{i1},\ldots,Y_{in_i})^T,\qquad X_i = (X_{i1},\ldots,X_{in_i})^T$$
Question: How can we extend GLM to correlated data?
Usually, Yi∣bi ∼ F, and L(Yi) = ∫L(Yi∣bi)f(bi)dbi is difficult to specify or compute.
Objective: we are only interested in Yi ∼ Xi, treating the correlations as nuisance parameters:
– make as few assumptions as possible;
– construct consistent and asymptotically normal regression coefficient estimates.
Distributional assumption: only specify the marginal distribution of Yij using quasi-likelihood:
$$E(Y_{ij}) = \mu_{ij},\quad \mathrm{Var}(Y_{ij}) = \phi V(\mu_{ij}) \Longrightarrow \text{QL: } l(Y_{ij}) = \int_{y_{ij}}^{\mu_{ij}}\frac{y_{ij}-u}{\phi V(u)}\,du.$$
Mean model: g(µij) = X_ij^Tβ.
Independent case: g(µi) = X_i^Tβ, with quasi-score
$$\sum_{i=1}^n D_i^TV_i^{-1}(y_i-\mu_i) = 0 \quad\text{where } D_i = \frac{\partial\mu_i}{\partial\beta^T} \text{ is a } 1\times p \text{ row vector and } V_i = \mathrm{Var}(y_i) \text{ is a scalar.}$$
Generalized estimating equations:
$$\sum_{i=1}^k \underbrace{D_i^T}_{p\times n_i}\ \underbrace{V_i^{-1}}_{n_i\times n_i}\ \underbrace{(y_i-\mu_i)}_{n_i\times 1} = 0 \quad\text{where } V_i = \mathrm{Cov}(y_i) \text{ and } D_i = \frac{\partial\mu_i}{\partial\beta^T} = \begin{pmatrix}\partial\mu_{i1}/\partial\beta^T\\ \vdots\\ \partial\mu_{in_i}/\partial\beta^T\end{pmatrix} = \Delta_i^{-1}X_i,\quad \Delta_i = \mathrm{diag}\{g'(\mu_{ij})\}.$$
$V_i = V_{m_i}^{1/2}R_i(\alpha)V_{m_i}^{1/2}$, where $V_{m_i} = \mathrm{diag}\{\phi V(\mu_{ij})\}$ is the marginal variance of Yi, Ri(α) is the working correlation matrix (which we must specify, and which could be wrong), and α is the nuisance correlation parameter(s).
Estimate β
1. Estimate β by IWLS: g(µ) = η = Xβ
(a) Initialize: η̂ = g(Y), and the variance/correlation parameters (φ, α)
(b) Update:
(i) µ = g^{−1}(η), Z = η + (Y − µ)g′(µ), W = diag{[g′(µ)² a(φ)V(µ)]^{−1}},
β̂ = (X^TWX)^{−1}X^TWZ, η = Xβ̂
(ii) Estimate V = V(φ̂, α̂)
Example 1.2.11. Pearson residuals + method of moments (MoM).
Recall that the Pearson residual is $r_k = \frac{Y_k - \hat\mu_k}{\sqrt{V(\hat\mu_k)}}$.
The scale parameter is $\hat\phi = \frac{\chi^2}{N-p} = \frac{\sum_{k=1}^N r_k^2}{N-p}$, where N is the total sample size.
The exchangeable correlation structure is Corr(Yij, Yik) = α, j ≠ k.
So, $E\left(\frac{r_{ij}r_{ik}}{\phi}\right) = \alpha \Longrightarrow \hat\alpha = \frac{\sum_i\sum_{j<k}r_{ij}r_{ik}}{\hat\phi\sum_i n_i(n_i-1)/2}$.
If φ̂ > 1, then we have over-dispersion; if φ̂ < 1, under-dispersion.
Alternative methods to estimate α:
* For binary data, use pairwise odds ratios instead of the Pearson correlation.
Reference: Lipsitz (1991). Biometrika; Liang (1992). JRSS(B).
* Use quasi-least squares (QLS) instead of MoM.
Reference: Chaganty (1997). Journal of Statistical Planning & Inference.
* Anscombe residuals (1953): A(Y) − A(µ̂), where A(·) is a transformation that makes the distribution of A(Y) more normal:
$$A = \int\frac{du}{V^{1/3}(u)};\quad\text{by the delta method, } \mathrm{Var}[A(Y)] \approx [A'(\mu)]^2\mathrm{Var}(Y).$$
E.g., for Poisson data, $r_{ij}^A = \frac{\tfrac{3}{2}\left(Y_{ij}^{2/3} - \hat\mu^{2/3}\right)}{\hat\mu^{1/6}}$.
This is an old method, with no clear advantage over the Pearson residual.
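The moment estimators for φ and the exchangeable α can be sketched directly from the Pearson residuals (Python; `resid` holds one residual vector per cluster):

```python
import numpy as np

def mom_phi_alpha(resid, p):
    # phi_hat = sum r^2 / (N - p); alpha_hat from within-cluster residual products
    r_all = np.concatenate(resid)
    N = len(r_all)
    phi = np.sum(r_all ** 2) / (N - p)
    num, pairs = 0.0, 0
    for r in resid:
        ni = len(r)
        num += (r.sum() ** 2 - np.sum(r ** 2)) / 2   # sum_{j<k} r_j r_k
        pairs += ni * (ni - 1) // 2
    alpha = num / (phi * pairs)
    return phi, alpha

resid = [np.array([1.0, 1.0]), np.array([-1.0, -1.0, -1.0])]
phi_hat, alpha_hat = mom_phi_alpha(resid, p=1)
```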
2. Estimate β by Fisher scoring:
(a) Quasi-likelihood approach: change the variance function in the denominator to the one with correlation.
(b) Fisher scoring: $U(\beta) = \sum_{i=1}^m D_i^TV_i^{-1}(y_i-\mu_i) = 0$.
k-th update:
$$0 = U(\beta^{k+1}) \doteq U(\beta^k) + \frac{\partial U}{\partial\beta^T}(\beta^{k+1}-\beta^k)$$
$$\Longrightarrow \beta^{k+1} = \beta^k + \left(-\frac{\partial U}{\partial\beta^T}\right)^{-1}U(\beta^k)$$
$$\Longrightarrow E\left(-\frac{\partial U}{\partial\beta^T}\right)\beta^{k+1} = E\left(-\frac{\partial U}{\partial\beta^T}\right)\beta^k + U(\beta^k)$$
$$\Longrightarrow \left(\sum_{i=1}^k D_i^TV_i^{-1}D_i\right)\beta^{k+1} = \left(\sum_{i=1}^k D_i^TV_i^{-1}D_i\right)\beta^k + \sum_{i=1}^k D_i^TV_i^{-1}(y_i-\mu_i)$$
Since $D_i = \frac{\partial\mu_i}{\partial\beta^T} = \Delta_i^{-1}X_i$ with $\Delta_i = \mathrm{diag}\{g'(\mu_{ij})\}$, let $W_i = \Delta_i^{-1}V_i^{-1}\Delta_i^{-1}$,
$$X = \begin{pmatrix}X_1\\ \vdots\\ X_m\end{pmatrix}\ (X_i \text{ is } n_i\times p),\qquad W = \begin{pmatrix}W_1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & W_m\end{pmatrix},\qquad Y = \begin{pmatrix}y_1\\ \vdots\\ y_m\end{pmatrix}$$
$$\Longrightarrow (X^TWX)\beta^{k+1} = X^TW\tilde Y,\quad\text{where the working vector } \tilde Y = X\beta^k + \Delta(Y-\mu),\quad \Delta = \begin{pmatrix}\Delta_1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \Delta_m\end{pmatrix}.$$
For large samples, $\hat\beta \stackrel{\cdot}{\sim} N\left(\beta,\ \phi(\Delta^T\Omega^{-1}\Delta)^{-1}\right)$. The estimated variance-covariance matrix is
$$\hat V_\beta = \phi(\Delta^T\Omega^{-1}\Delta)^{-1} = \phi\left(\sum_{j=1}^m\Delta_j^T\Omega_j^{-1}\Delta_j\right)^{-1}\quad\text{where here } \Delta_j = \frac{\partial\mu_j}{\partial\beta^T} \text{ and } \Omega_j = \mathrm{Var}(Y_j).$$
Robust standard error:
$$\hat V_\beta^R = \hat V_\beta\left(\sum_{j=1}^m\Delta_j^T\Omega_j^{-1}S_j\Omega_j^{-1}\Delta_j\right)\hat V_\beta/\phi^2 \quad\text{where } S_j = (Y_j-\hat\mu_j)(Y_j-\hat\mu_j)^T$$
and V̂β is the model-based/naive variance; β̂ is consistent even when V(φ) is misspecified.
Link to previous knowledge:
– a cluster of correlated normal observations → multivariate normal
– a cluster of correlated Poisson observations ↛ a "multivariate Poisson"
For non-normal data, the estimating equations ≠ ∂l(β)/∂β ⟶ QL.
The estimating equations are unbiased ⟶ β̂ is consistent.
Formal Asymptotic Results
Conditions: 1) φ̂, α̂ are √k-consistent (moment estimators).
2) ∂µ/∂β^T converges in probability, uniformly in an open neighborhood of β.
Results: 1) β̂ is consistent.
2) √k(β̂ − β) ∼ N(0, Σ), where
$$\Sigma = \lim_{k\to\infty}\left(\frac{1}{k}\sum_{i=1}^k D_i^TV_i^{-1}D_i\right)^{-1}\left[\frac{1}{k}\sum_{i=1}^k D_i^TV_i^{-1}(y_i-\mu_i)(y_i-\mu_i)^TV_i^{-1}D_i\right]\left(\frac{1}{k}\sum_{i=1}^k D_i^TV_i^{-1}D_i\right)^{-1} = \lim_{k\to\infty}A^{-1}GA^{-1}$$
Corollary. If $V_i = V_{m_i}^{1/2}R_i(\alpha)V_{m_i}^{1/2}$ is correctly specified (so $V_{m_i}$ and Ri are both correctly specified) and E(G) = A, then $\Sigma = \lim_{k\to\infty}A^{-1} = \lim_{k\to\infty}\left(\frac{1}{k}\sum_{i=1}^k D_i^TV_i^{-1}D_i\right)^{-1}$, and β̂ is efficient within the linear estimating function family.
GEE 1 vs GEE 2
* GEE 1: specify the first two moments
– assume β and α are orthogonal (α and β carry completely different information)
– so β̂ is consistent even when V(µ) is misspecified
* GEE 2: estimate β and α simultaneously
– requires modeling the 3rd and 4th moments of Yij
– gives consistent β̂ and α̂ when E(Yij) and V(Yi) are correct
References: Zhao & Prentice (1990). Biometrika; Zhao & Prentice (1991). Biometrika; Liang (1992). JRSS(B).
* Extended GEE (EGEE):
– estimates β and α simultaneously, but only makes assumptions on the 1st and 2nd moments
– estimates α efficiently when the correlation structure is correctly specified
– consistency of β̂ does not require a correct V(µ)
Reference: Hall & Severini (1998). JASA.
General Procedure for Generalized Estimating Equations
1. Define the marginal distribution of Yij:
E(Yij) = µij with link function g(µij) = Xijβ and variance function Var(Yij) = φV(µij).
2. Pick a "working" correlation structure: $\mathrm{Var}(Y_i) = \phi V_i = \phi V_{m_i}^{1/2}R_i(\alpha)V_{m_i}^{1/2}$.
To find Ri(α): 1) run a GLM;
2) get the residuals Yij − µ̂ij = eij;
3) look at the correlation in the eij:
subj   t1    t2    ⋯   t_ni
1      e11   e12   ⋯   e_{1n1}
⋮      ⋮     ⋮          ⋮
m      em1   em2   ⋯   e_{mnm}
* This works only for repeated measures with a small number of time points.
* The data should be ordered and lined up (repeated at the same time points).
Guidelines: 1) Simpler structures are easier to fit.
2) The loss of efficiency from using a simpler structure depends on:
i) the covariate pattern
ii) cluster size: larger size ⟶ greater loss of efficiency
iii) within-cluster correlation: larger correlation ⟶ greater loss of efficiency
If m ≫ n, then "unstructured" may be asymptotically efficient.
Reference: Prentice (1988). Biometrics.
3. Solve for β̂ using IWLS:
marginal distribution = normal ⟹ β̂ is the MLE
marginal distribution ≠ normal ⟹ not the MLE
$$\hat V_\beta = \phi\left(\sum_{i=1}^m\Delta_i^T\Omega_i^{-1}\Delta_i\right)^{-1}$$
is the naive/model-based variance. Alternatively, use the empirical/robust/sandwich/White variance
$$\hat V_\beta^R = \hat V_\beta\left(\sum_{i=1}^m\Delta_i^T\Omega_i^{-1}S_i\Omega_i^{-1}\Delta_i\right)\hat V_\beta/\phi^2 \quad\text{where } S_i = (Y_i-\hat\mu_i)(Y_i-\hat\mu_i)^T$$
4. Test: $H_0: \beta = \begin{pmatrix}\beta_1\\ \beta_2^0\end{pmatrix}$ vs $H_1: \beta = \begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix}$
i) Wald test: $W = (\hat\beta_2-\beta_2^0)^T\left[\widehat{\mathrm{Cov}}(\hat\beta_2)\right]^{-1}(\hat\beta_2-\beta_2^0) \stackrel{\cdot}{\sim} \chi^2_q$ under H0.
ii) Score test: $S = U_{\beta_2}^T(\hat\beta_1,\beta_2^0)\ \mathrm{Cov}\left[U_{\beta_2}(\hat\beta_1,\beta_2^0)\right]^{-1}U_{\beta_2}(\hat\beta_1,\beta_2^0)$
where
$$U_\beta = \begin{pmatrix}U_{\beta_1}\\ U_{\beta_2}\end{pmatrix} = \begin{pmatrix}\dfrac{\partial\mu^T}{\partial\beta_1}V^{-1}(Y-\mu)\\[6pt] \dfrac{\partial\mu^T}{\partial\beta_2}V^{-1}(Y-\mu)\end{pmatrix}.$$
① model-based covariance: Cov[U_{β2}] = I22, the block of the information matrix corresponding to β2.
② robust covariance: obtained from the robust covariance of β̂2 via Cov^R[U_{β2}] = I22 Cov^R(β̂2) I22.
Reference: Rotnitzky & Jewell, 1990. Biometrika.
R: library(gee)
gee(Y ~ X, id = id.var, corstr = "...", family = ..., data = ...)
SAS: proc genmod data = ...;
class id_var;
model ... / dist = ... link = ...;
repeated subject = id_var / type = ... modelse;
run;
1.2.5 Population-average (PA) Model vs Subject-specific (SS) Model
Population-average (PA) model: E(Yij) = µij models population means.
Subject-specific (SS) model: E(Yij∣bi) = µ_ij^b, e.g., β0i = β0 + b0i, β1i = β1 + b1i.
For normal data with the identity link,
$$\mu_i = E(\mu_i^b) = E(X_i\beta + Z_ib_i) = X_i\beta.$$
For non-normal data, this equality fails.
Example 1.2.12. Poisson population-average model vs subject-specific model.
For the population-average model, log(µi) = Xiβ.
For the subject-specific model, log(µ_i^b) = Xiβ + Zibi, bi ∼ N(0, D), so µ_i^b = exp(Xiβ + Zibi).
$$E(Y_i) = E[E(Y_i\mid b_i)] = \underbrace{E(\mu_i^b) = E[\exp(X_i\beta + Z_ib_i)]}_{\text{subject-specific model}} \neq \underbrace{\exp(X_i\beta) = E(Y_i) = \mu_i}_{\text{population-average model}}.$$
Interpretation of β
PA and SS models have different interpretations of the β's. Generally ∣βPA∣ < ∣βSS∣.
Interpretation of β in the PA model: change in the population average.
Interpretation of β in the SS model: conditional on the cluster, a comparison within the cluster.
Variance
In the PA model, we model Σi = Cov(Yi) directly, capturing variances and correlations from all sources with a single model.
In the SS model, Σi = ZiDZ_i^T + Ri, as in linear mixed models.
Example 1.2.13. LMM: Var(Yi) = ZiDZ_i^T + Ri.
Assume $D = \begin{pmatrix}d_{11} & d_{12}\\ d_{12} & d_{22}\end{pmatrix}$, $Z_i = \begin{pmatrix}1 & 1\\ 1 & 2\\ \vdots & \vdots\\ 1 & t\end{pmatrix}$, $R_i = \sigma^2I_{n_i}$.
$$\mathrm{Var}(Y_{ij}) = d_{11} + 2jd_{12} + j^2d_{22} + \sigma^2,\ j = 1,\ldots,t,\qquad \mathrm{Cov}(Y_{ij},Y_{ik}) = d_{11} + (j+k)d_{12} + jkd_{22}.$$
$$\frac{\partial\,\mathrm{Var}(Y_{ij})}{\partial j} = 2d_{12} + 2jd_{22},$$
so if j > −d12/d22, then Var(Yij) increases monotonically over time; if j < −d12/d22, then Var(Yij) decreases monotonically over time.
In generalized estimating equations, bias and efficiency in finite samples depend on:
– the number of clusters/units;
– the distribution of cluster sizes;
– the magnitude of the correlations;
– the number and type of covariates.
Reference: Davis (2002). Statistical Models for the Analysis of Repeated Measures.
Time-dependent Variables
– We have to assume E(Yij∣Xij) = E(Yij∣Xi1, . . . , Xini) or E(Yij∣Xi1, . . . , Xi(j−1), Xij).
Reference: Pepe & Anderson (1994).
– Alternatively, use a diagonal working covariance matrix - independent but with unequal variances.
1.2.6 Comparison between GEE and GLMM
GEE and GLMM give different interpretations of β and hence different estimates of β. Generally, ∣β̂GEE∣ < ∣β̂GLMM∣.
Since
$$\underbrace{g^{-1}(X_{ij}^T\beta^*) = \mu_{ij} = E(Y_{ij})}_{\text{GEE}} = \underbrace{E[E(Y_{ij}\mid b_i)] = E[g^{-1}(X_{ij}^T\beta + Z_{ij}^Tb_i)]}_{\text{GLMM}},$$
$$X_{ij}^T\beta^* = g\left\{E\left[g^{-1}(X_{ij}^T\beta + Z_{ij}^Tb_i)\right]\right\},\quad\text{where } \beta^* = \beta_{GEE} \text{ and } \beta = \beta_{GLMM}.$$
Example 1.2.14.
1. Binary data with probit link: g(·) = Φ^{−1}(·)
$$\Phi^{-1}(\mu_{ij}^b) = X_{ij}^T\beta + Z_{ij}^Tb_i \Longrightarrow \mu_{ij}^b = \Phi(X_{ij}^T\beta + Z_{ij}^Tb_i)$$
$$\mu_{ij} = E(\mu_{ij}^b) = E\left[\Phi(X_{ij}^T\beta + Z_{ij}^Tb_i)\right] = \Phi\left(\frac{X_{ij}^T\beta}{(1+Z_{ij}^TDZ_{ij})^{1/2}}\right)$$
$$X_{ij}^T\beta^* = \Phi^{-1}(\mu_{ij}) = \frac{X_{ij}^T\beta}{(1+Z_{ij}^TDZ_{ij})^{1/2}}$$
So β* ≠ β unless D = 0. Since 1 + Z_ij^TDZ_ij ≥ 1, ∣β∣ > ∣β*∣.
E.g., Φ^{−1}(µ_ij^b) = X_ij^Tβ + bi, bi ∼ N(0, θ). Then 1 + Z_ij^TDZ_ij = 1 + θ, so $\beta^* = \frac{\beta}{(1+\theta)^{1/2}}$.
2. Binary data with logit link: g(·) = logit(·)
$$\mu_{ij}^b = \frac{\exp(X_{ij}^T\beta + Z_{ij}^Tb_i)}{1+\exp(X_{ij}^T\beta + Z_{ij}^Tb_i)} = F(X_{ij}^T\beta + Z_{ij}^Tb_i),\quad\text{where } F \text{ is the cdf of the logistic distribution}$$
$$\mu_{ij} = E(Y_{ij}) = E(\mu_{ij}^b) \approx F\left(\frac{X_{ij}^T\beta}{(1+c^2Z_{ij}^TDZ_{ij})^{1/2}}\right),\quad\text{where } c = \frac{16\sqrt3}{15\pi} \approx 0.59$$
3. Count data with log link: g(·) = log(·)
$$\mu_{ij}^b = \exp(X_{ij}^T\beta + Z_{ij}^Tb_i)$$
$$\mu_{ij} = E(\mu_{ij}^b) = \exp\left(X_{ij}^T\beta + \tfrac{1}{2}Z_{ij}^TDZ_{ij}\right)$$
$$\log(\mu_{ij}) = X_{ij}^T\beta^* = X_{ij}^T\beta + \underbrace{\tfrac{1}{2}Z_{ij}^TDZ_{ij}}_{\text{offset}}$$
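The probit attenuation in case 1 — E[Φ(η + b)] = Φ(η/(1 + θ)^{1/2}) for b ∼ N(0, θ) — can be checked numerically (Python sketch with arbitrary η, θ):

```python
import numpy as np
from math import erf, sqrt, pi

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

eta, theta = 0.7, 2.0                       # linear predictor and random-intercept variance
b = np.linspace(-10 * sqrt(theta), 10 * sqrt(theta), 20001)
dens = np.exp(-b ** 2 / (2 * theta)) / sqrt(2 * pi * theta)   # N(0, theta) density
integrand = np.array([Phi(eta + bi) for bi in b]) * dens
marginal = np.sum(integrand) * (b[1] - b[0])    # quadrature for E[Phi(eta + b)]
attenuated = Phi(eta / sqrt(1 + theta))         # Phi(eta / (1 + theta)^{1/2})
# marginal and attenuated agree up to quadrature error
```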
1.2.7 Missing Data Pattern
Both GEE and LMM allow missing data, but under different assumptions.
* GEE assumes missing completely at random (MCAR): P(missing∣Yobs, Ymis) = α, a constant.
We can also extend GEE to allow MAR, e.g., Robins (1995), Paik (1997).
* LMM/GLMM (being likelihood-based) assumes missing at random (MAR):
P(missing∣Yobs, Ymis) = P(missing∣Yobs) ⟹ the missing data can still be inferred from the observed data.
Since P(Yobs, Ymis) = P(Ymis∣Yobs)P(Yobs), we can use imputation methods.
If we assume MCAR, then P(missing∣Yobs) = P(missing∣Yobs, Ymis) = α, so P(Yobs, Ymis) ∝ P(Yobs). If we only infer from P(Yobs), the estimator is still consistent, but we may lose some efficiency. In this complete-case analysis, we simply drop the NA observations.