TRANSCRIPT
PubH8401 Linear Models
Revised from Yuan Zhang’s version
Fall 2019
Contents
1 Generalized, Linear, and Mixed Models 2
1.1 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Review: Likelihood Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Exponential Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.4 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.6 Goodness of Fit (GOF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.7 Over-dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.8 Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.9 Count Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.10 Quasi-likelihood Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Correlated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.2 Linear Mixed Models (LMM) . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.3 Generalized Linear Mixed Models (GLMM) . . . . . . . . . . . . . . . . . 34
1.2.4 Generalized Estimating Equations (GEE) . . . . . . . . . . . . . . . . . . . 37
1.2.5 Population-average (PA) Model vs Subject-specific (SS) Model . . . . . . 41
1.2.6 Comparison between GEE and GLMM . . . . . . . . . . . . . . . . . . . . 42
1.2.7 Missing Data Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 1
Generalized, Linear, and Mixed Models
Reference: McCulloch C., Searle S., Neuhaus J (2008). Generalized, Linear, and Mixed Models,
Second Edition. John Wiley & Sons, Inc., Hoboken, New Jersey.
1.1 Generalized Linear Models
1.1.1 Review: Likelihood Theory
1. X random: Assume that X and Θ are independent. The likelihood function is
L(Θ∣Y,X) = f(Y,X∣Θ) = ∏_{i=1}^n f(Y_i∣X_i^T, Θ) f(X_i^T),
and the log-likelihood function is
l(Θ∣Y,X) = log L(Θ∣Y,X).
2. X deterministic: The likelihood function is
L(Θ∣Y,X) = f(Y∣Θ,X) = ∏_{i=1}^n f(Y_i∣x_i^T, Θ),
and the log-likelihood function is
l(Θ∣Y,X) = log L(Θ∣Y,X).
Let θ̂ be the maximum likelihood estimator of θ, so that L(θ̂) = max_{θ∈Ω} L(θ).
The score function is
U(θ) = ∂l(θ)/∂θ,
and the score equation is
U(θ) = 0.
The (expected) information is
I_n(θ) = E[U(θ)U(θ)^T] = E[−∂²l(θ)/∂θ∂θ^T].
1.1.2 Exponential Family
The exponential family has the form
f(y∣θ, φ) = exp{ [yθ − b(θ)]/a(φ) + c(y, φ) },
where
θ is the parameter of interest, called the canonical parameter;
φ is a nuisance parameter, called the scale parameter.
Note 1: The formulation here might differ from what you learned in your statistical inference class, which allows for multiple θ's. Note 2: For the models we discuss in this class, we assume the parameter φ is known (or can be estimated separately).
Example 1.1.1.
1. Normal distribution: Y ∼ N(µ, σ²).
f(y∣θ, φ) = (2πσ²)^{−1/2} exp{ −(y − µ)²/(2σ²) } = exp{ (yµ − µ²/2)/σ² − y²/(2σ²) − (1/2) log(2πσ²) }
⟹ θ = µ, φ = σ², a(φ) = φ, b(θ) = θ²/2, c(y, φ) = −y²/(2φ) − (1/2) log(2πσ²)
2. Binomial distribution: Y ∼ Binomial(m, p). Writing C(m, y) for the binomial coefficient,
f(y∣θ, φ) = C(m, y) p^y (1 − p)^{m−y} = exp{ y log p + (m − y) log(1 − p) + log C(m, y) }
= exp{ y log[p/(1 − p)] + m log(1 − p) + log C(m, y) }
⟹ θ = log[p/(1 − p)], φ = 1, a(φ) = 1, b(θ) = −m log(1 − p) = m log(1 + e^θ), c(y, φ) = log C(m, y)
3. Poisson distribution: Y ∼ Poisson(λ).
f(y∣θ, φ) = λ^y e^{−λ}/y! = exp{ y log λ − λ − log y! }
⟹ θ = log λ, b(θ) = λ = e^θ, c(y, φ) = −log y!
Properties of the Exponential Family: If Y ∼ f(y∣θ, φ) with φ known, then:
(i) µ ≡ E(Y) = b′(θ)
(ii) Var(Y) = b′′(θ) a(φ)
Proof.
l(θ) = log f(Y∣θ, φ) = [Yθ − b(θ)]/a(φ) + c(Y, φ)
U(θ) = ∂l/∂θ = [Y − b′(θ)]/a(φ)
U′(θ) = ∂²l/∂θ² = −b′′(θ)/a(φ)
Since E[U(θ)] = 0, we have E[Y − b′(θ)] = 0 ⟹ E(Y) = b′(θ).
Under certain regularity conditions,
I(θ) = Var[U(θ)] = E[U²(θ)] = −E(∂²l/∂θ²).
Since
E[U²(θ)] = E[Y − b′(θ)]²/a²(φ) = E[(Y − EY)²]/a²(φ) = Var(Y)/a²(φ)
and
−E(∂²l/∂θ²) = b′′(θ)/a(φ),
we obtain
Var(Y) = a(φ) b′′(θ).
Note: b′′(θ) is called the variance function, denoted by V(µ). The variance function indicates how the variance of Y depends on the mean of Y.
Example 1.1.2.
1. Normal distribution: θ = µ, φ = σ², b(θ) = θ²/2, a(φ) = φ.
E(Y) = b′(θ) = θ = µ, Var(Y) = a(φ) b′′(θ) = a(φ) = φ = σ²
2. Binomial distribution: θ = log[p/(1 − p)], b(θ) = m log(1 + e^θ).
b′(θ) = m e^θ/(1 + e^θ) = mp, b′′(θ) = m e^θ/(1 + e^θ)² = mp(1 − p)
3. Poisson distribution: θ = log λ, b(θ) = e^θ.
b′(θ) = b′′(θ) = e^θ = λ ⟹ E(Y) = Var(Y) = λ
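As a quick numerical check of properties (i) and (ii) (a sketch, not part of the original notes), the cumulant functions above can be evaluated and compared against the known moments:

```python
import math

# Poisson: theta = log(lambda), b(theta) = exp(theta), a(phi) = 1
lam = 3.5
theta = math.log(lam)
mean_pois = math.exp(theta)        # b'(theta) = E(Y)
var_pois = math.exp(theta) * 1.0   # b''(theta) * a(phi) = Var(Y)

# Binomial: theta = log(p/(1-p)), b(theta) = m*log(1 + exp(theta))
m, p = 10, 0.3
theta_b = math.log(p / (1 - p))
mean_binom = m * math.exp(theta_b) / (1 + math.exp(theta_b))      # b'(theta) = m*p
var_binom = m * math.exp(theta_b) / (1 + math.exp(theta_b)) ** 2  # b''(theta) = m*p*(1-p)
```

Both derivatives reproduce the textbook moments E(Y) = λ, Var(Y) = λ for the Poisson case and E(Y) = mp, Var(Y) = mp(1 − p) for the binomial case.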
1.1.3 Generalized Linear Models
Generalized Linear Models (GLM):
i) Replace the linear model µ = E(Y) = Xβ with a linear model for g(µ).
ii) Replace the constant variance assumption Var(ε_i) = σ² with a mean-variance relation:
E(Y_i) = b′(θ_i), Var(Y_i) = b′′(θ_i) a(φ)
iii) Replace the normal distribution assumption with the exponential family, but still assume independence.
Random component:
Y_i ∼ f(y_i; θ_i, φ) = exp{ [y_i θ_i − b(θ_i)]/a(φ) + c(y_i, φ) }.
Note: Randomness comes from the distribution; e.g., for y_i ∈ {0, 1}, we model the randomness not by whether the patient has a negative or positive outcome, but through the underlying parameter p_i.
Systematic component: η_i = x_i^T β, where x_i^T denotes the ith row of the matrix of predictors X.
Link function: η_i = g(µ_i), where µ_i = b′(θ_i). The default link function is typically the canonical link, η_i = θ_i.
Example 1.1.3.
1. Linear model (a special case of GLM): Y_i = X_i^T β + ε_i.
The random component is Y_i ∼ N(X_i^T β, σ²) and the systematic component is µ_i ≡ E(Y_i) = X_i^T β = η_i. Here η_i = µ_i is the identity link.
2. Logistic regression: Y_i ∼ Bernoulli(p_i),
f(y∣θ, φ) = exp{ y log[p/(1 − p)] + log(1 − p) }
θ = log[p/(1 − p)], b(θ) = log(1 + e^θ) ⟹ µ = E(Y) = b′(θ) = e^θ/(1 + e^θ) = p, η_i = X_i^T β
The logistic regression formula is
log[p_i/(1 − p_i)] = X_i^T β.
The link function is the logit function η = g(µ) = log[µ/(1 − µ)]. This is the canonical link function because η = g(µ) = θ.
3. Poisson regression: Y_i = number of events ∼ Poisson(λ_i),
f(y∣θ, φ) = exp{ y log λ − λ − log y! }
θ = log λ, b(θ) = λ = e^θ, µ = b′(θ) = e^θ = λ
The Poisson regression formula is
log(µ_i) = X_i^T β.
The canonical link is η = θ = g(µ) = log(µ).
In summary, the parameter chain is θ ↦ µ = b′(θ) ↦ η = g(µ) = Xβ; when η = θ, g(·) is the canonical link.
1.1.4 Maximum Likelihood Estimation
Iterated Weighted Least Squares
The log-likelihood function of a distribution from the exponential family is
l = ∑_{i=1}^n l_i = ∑_{i=1}^n { [Y_i θ_i − b(θ_i)]/a_i(φ) + c(Y_i, φ) }.
The score function can be written as
U_j(β) = ∂l/∂β_j = ∑_{i=1}^n (∂l_i/∂θ_i)(∂θ_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_j),
where µ_i = b′(θ_i) and ∂η_i/∂β_j = X_ij. Since
∂l_i/∂θ_i = [Y_i − b′(θ_i)]/a_i(φ), ∂θ_i/∂µ_i = 1/(∂µ_i/∂θ_i) = 1/b′′(θ_i), ∂µ_i/∂η_i = 1/g′(µ_i),
U_j(β) = ∑_{i=1}^n { [Y_i − b′(θ_i)]/a_i(φ) } · [1/V(µ_i)] · (∂µ_i/∂η_i) · X_ij
= ∑_{i=1}^n X_ij · { 1/[(∂η_i/∂µ_i)² a_i(φ) V(µ_i)] } · (Y_i − µ_i)(∂η_i/∂µ_i)
≡ ∑_{i=1}^n X_ij W_i (Z_i − η_i),
where
W_i^{−1} = (∂η_i/∂µ_i)² a_i(φ) V(µ_i) = [g′(µ_i)]² a_i(φ) V(µ_i)
and
Z_i = η_i + (Y_i − µ_i)(∂η_i/∂µ_i) = g(µ_i) + (Y_i − µ_i) g′(µ_i).
Z_i is called the adjusted dependent variable.
Rationale:
∑_i X_ij W_i Z_i = (X^T W Z)_j is a weighted linear regression. Note that W is a diagonal matrix and Z is a vector. Then we can write
U(β) = X^T W (Z − η), where W = diag(W_1, …, W_n) and η = Xβ.
Assume X has full rank; then U(β) = 0 ⟺ X^T W Z = X^T W X β ⟺ β = (X^T W X)^{−1} X^T W Z.
Iterated Weighted Least Squares (IWLS)
1. Initialize: η = g(Y), i.e., use Y to initialize µ.
2. Calculate:
µ = g^{−1}(η)
Z = η + (Y − µ) g′(µ)
W = diag{ [g′(µ)² a(φ) V(µ)]^{−1} }
β̂ = (X^T W X)^{−1} X^T W Z
η = X β̂
3. Repeat the second step until convergence, e.g., ∣β^{(k+1)} − β^{(k)}∣ < ε.
Intuition: g(µ_i) = x_i^T β = η_i; by Taylor expansion,
g(Y_i) ≈ g(µ_i) + (Y_i − µ_i) g′(µ_i) = η_i + (Y_i − µ_i)(∂η_i/∂µ_i) = Z_i.
If we use g(Y_i) directly as the outcome in a linear regression, we may violate the constant variance assumption, since
Var(Z_i) = Var(Y_i)(∂η_i/∂µ_i)² = a_i(φ) V(µ_i)(∂η_i/∂µ_i)² = W_i^{−1}.
Given the independence of samples, W^{−1} is diagonal and we can apply a weighted least squares (WLS) approach to fit the model. Note: As in a WLS model, we can estimate the variance by
V̂ar(β̂) = (X^T Ŵ X)^{−1}.
Example 1.1.4 (Poisson regression). η = log(µ), V(µ) = µ, a(φ) = 1.
1. Initialize: η = log(Y); in practice, η = log(Y + 0.5), since Y can be 0.
2. Repeat until convergence:
µ = exp(η)
Z = η + (Y − µ)/µ
W = diag(µ)
β̂ = (X^T W X)^{−1} X^T W Z
η = X β̂
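The Poisson recipe above can be sketched in code (a minimal illustration on simulated data, not from the notes; numpy is assumed available):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
Y = rng.poisson(np.exp(X @ beta_true))

# IWLS for the Poisson GLM with log link: V(mu) = mu, a(phi) = 1
eta = np.log(Y + 0.5)                    # initialize; guards against Y = 0
for _ in range(50):
    mu = np.exp(eta)
    Z = eta + (Y - mu) / mu              # adjusted dependent variable
    W = mu                               # W_i = 1 / [g'(mu_i)^2 V(mu_i)] = mu_i
    beta = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (W * Z))
    eta_new = X @ beta
    if np.max(np.abs(eta_new - eta)) < 1e-10:
        eta = eta_new
        break
    eta = eta_new

mu = np.exp(eta)
score = X.T @ (Y - mu)   # canonical-link score X^T(Y - mu); ~0 at the MLE
```

At convergence the weighted normal equations reproduce the GLM score equation, so `score` is numerically zero and `beta` is the MLE.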
Numerical Methods for Solving U(β) = 0
1. Fisher scoring: β^{(k+1)} = β^{(k)} + I^{−1} U, where I = −E(∂²l/∂β∂β^T). Equivalent to IWLS when a(φ) = φ.
2. Newton-Raphson: β^{(k+1)} = β^{(k)} + i^{−1} U, where i = −∂²l/∂β∂β^T evaluated at β = β^{(k)}. Equivalent to Fisher scoring for the canonical link.
3. Simulation, MCMC approaches.
The score function is
U_j(β) = ∂l/∂β_j = ∑_{i=1}^n X_ij W_i (∂η_i/∂µ_i)(Y_i − µ_i).
The observed information is
I_n(β)_jk = −∂²l/(∂β_j ∂β_k)
= −∑_{i=1}^n X_ij [∂/∂β_k (W_i ∂η_i/∂µ_i)](Y_i − µ_i) + ∑_{i=1}^n X_ij W_i (∂η_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_k)
= −∑_{i=1}^n X_ij [∂/∂β_k (W_i ∂η_i/∂µ_i)](Y_i − µ_i) + ∑_{i=1}^n X_ij W_i X_ik.
Since E(Y_i − µ_i) = 0,
E(−∂²l/(∂β_j ∂β_k)) = ∑_{i=1}^n X_ij W_i X_ik,
i.e., the expected information is E(I_n(β)) = X^T W X. So Cov(β̂) ≈ (X^T Ŵ X)^{−1}.
For the canonical link η = θ,
∂η/∂µ = ∂θ/∂µ = 1/b′′(θ) = 1/V(µ).
So
W (∂η/∂µ) = (∂η/∂µ)/[(∂η/∂µ)² a(φ) V(µ)] = 1/[(∂η/∂µ) a(φ) V(µ)] = 1/a(φ).
Hence
∂/∂β_k [W (∂η/∂µ)] = ∂/∂β_k [1/a(φ)] = 0 ⟹ −∂²l/(∂β_j ∂β_k) = E(−∂²l/(∂β_j ∂β_k)) = ∑_i X_ij W_i X_ik.
Therefore, Fisher scoring is equivalent to Newton-Raphson.
1.1.5 Inference
Let β̂ be the maximum likelihood estimator of β, so that U(β̂) = 0.
The asymptotic distribution of β̂ is β̂ ∼ MVN(β, I_n^{−1}(β)), where I_n = E(−∂²l/∂β∂β^T).
Hypothesis Test
Partition β = (β_1^T, β_2^T)^T with β_1 of dimension q and β_2 of dimension p − q, and test H_0: β_2 = β_2^{(0)} (a fixed value, often 0).
1. Likelihood ratio test: 2[l(β̂) − l(β̃)] →_d χ²_{p−q} under H_0, where β̃ is the MLE under H_0.
2. Score test: U(β̃_1, β_2^{(0)})^T I_n(β̃_1, β_2^{(0)})^{−1} U(β̃_1, β_2^{(0)}) →_d χ²_{p−q} under H_0.
3. Wald test: (β̂_2 − β_2^{(0)})^T V̂ar(β̂_2)^{−1} (β̂_2 − β_2^{(0)}) →_d χ²_{p−q} under H_0.
1.1.6 Goodness of Fit (GOF)
1. Deviance:
a_i(φ) = φ (Normal), 1 (Poisson), 1/m_i (Binomial); in all three cases a_i(φ) = φ/m_i for a suitable m_i.
Rewrite:
l(β) = ∑_{i=1}^n log f_i(Y_i∣θ_i, φ) = ∑_{i=1}^n { m_i [Y_i θ_i − b(θ_i)]/φ + c_i(Y_i, φ) }.
Define the deviance as
D(Y, µ̂) = 2φ [l(µ̃, φ∣Y) − l(µ̂, φ∣Y)], where µ̃ is the MLE of the saturated model,
= 2 ∑_{i=1}^n m_i { Y_i (θ̃_i − θ̂_i) − [b(θ̃_i) − b(θ̂_i)] }.
Hence, the deviance depends on the specified distribution (likelihood), but not on the nuisance parameter φ.
Note: In the saturated model, µ̃_i = Y_i, i.e., Y_i is used as the estimate of µ_i. So θ̃_i is derived from µ̃_i = Y_i, and θ̂_i is derived from η̂_i and µ̂_i.
We compare two nested models:
Model 1 (full): β = (β_1^T, β_2^T)^T with β_1 q×1 and β_2 (p−q)×1, fitted mean µ̂^{(1)}.
Model 2 (reduced): β = (β_1^T, 0^T)^T, i.e., β_2 = 0, fitted mean µ̂^{(2)}.
For linear models, the test statistic is
[RSS^{(2)} − RSS^{(1)}]/σ² ∼ χ²_{p−q}.
For generalized linear models, the test statistic is
LR = [D(Y, µ̂^{(2)}) − D(Y, µ̂^{(1)})]/φ ∼ χ²_{p−q}.
Example 1.1.5.
(a) Normal: log f(y∣θ, φ) = −(y − µ)²/(2φ) + c(y, φ), θ = µ, φ = σ², m_i = 1.
D(Y, µ̂) = 2 ∑_{i=1}^n { Y_i (θ̃_i − θ̂_i) − [b(θ̃_i) − b(θ̂_i)] }
= 2 ∑_{i=1}^n { Y_i (Y_i − µ̂_i) − [Y_i²/2 − µ̂_i²/2] }
= 2 ∑_{i=1}^n ( Y_i²/2 − Y_i µ̂_i + µ̂_i²/2 )
= ∑_{i=1}^n (Y_i − µ̂_i)²
= RSS
(b) Binomial: log f(y∣θ, φ) = m{ y log[µ/(1 − µ)] + log(1 − µ) } + c,
θ = log[µ/(1 − µ)] ⟺ µ = e^θ/(1 + e^θ), b(θ) = log(1 + e^θ), y_i = (# events)/m_i.
D(Y, µ̂) = 2 ∑_{i=1}^n m_i { y_i [log(y_i/(1 − y_i)) − log(µ̂_i/(1 − µ̂_i))] − [log(1 + y_i/(1 − y_i)) − log(1 + µ̂_i/(1 − µ̂_i))] }
= 2 ∑_{i=1}^n m_i { y_i log(y_i/µ̂_i) + (1 − y_i) log[(1 − y_i)/(1 − µ̂_i)] }
(c) Poisson: log f(y∣θ, φ) = y log µ − µ + c, θ = log µ, b(θ) = e^θ, m_i = 1.
D(Y, µ̂) = 2 ∑_{i=1}^n { Y_i (log Y_i − log µ̂_i) − (Y_i − µ̂_i) }
= 2 ∑_{i=1}^n { Y_i log(Y_i/µ̂_i) − (Y_i − µ̂_i) }
2. Pearson χ²: does not depend on the distributional assumption.
χ² = ∑_{i=1}^n (Y_i − µ̂_i)² / [V(µ̂_i)/m_i]
Criterion:
χ²/(n − p) = ∑_{i=1}^n m_i (Y_i − µ̂_i)² / [(n − p) V(µ̂_i)] ≈ φ, where a_i(φ) = φ/m_i.
Example 1.1.6.
(a) Normal: b(θ) = θ²/2, V(µ) = b′′(θ) = 1.
χ² = ∑_{i=1}^n (Y_i − µ̂_i)² = RSS, ∑_{i=1}^n (Y_i − µ̂_i)²/(n − p) = σ̂²_OLS ≈ σ²
(b) Binomial: b(θ) = m log(1 + e^θ), µ = e^θ/(1 + e^θ), b′′(θ) = m e^θ/(1 + e^θ)² = mµ(1 − µ).
χ² = ∑_{i=1}^n m_i (Y_i − µ̂_i)² / [µ̂_i(1 − µ̂_i)], ∑_{i=1}^n m_i (Y_i − µ̂_i)² / [(n − p) µ̂_i(1 − µ̂_i)] ∼ χ²_{n−p}/(n − p) → 1
(c) Poisson: b(θ) = e^θ, µ = e^θ, b′′(θ) = e^θ = µ.
χ² = ∑_{i=1}^n (Y_i − µ̂_i)²/µ̂_i, ∑_{i=1}^n (Y_i − µ̂_i)² / [(n − p) µ̂_i] ≈ 1 when µ → ∞, since (Y − µ)/√µ → N(0, 1)
Types of Residuals:
1. Pearson residual: r_i^P = (Y_i − µ̂_i)/√V(µ̂_i), so χ² = ∑_{i=1}^n (r_i^P)². R: resid(fit, "pearson")
2. Deviance residual: r_i^D = √d_i · sign(Y_i − µ̂_i), where d_i = 2 m_i { Y_i (θ̃_i − θ̂_i) − [b(θ̃_i) − b(θ̂_i)] }.
3. Working residual: z_i = η̂_i + (Y_i − µ̂_i)(∂η_i/∂µ_i), r_i^W = z_i − η̂_i = (Y_i − µ̂_i)(∂η_i/∂µ_i). R: fit$resid
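For the Poisson case (V(µ) = µ, m_i = 1) the Pearson and deviance residuals can be computed directly; the illustrative counts and fitted means below are hypothetical:

```python
import math

Y  = [4, 0, 7, 3, 5]            # hypothetical observed counts
mu = [3.2, 1.1, 5.8, 2.9, 4.6]  # hypothetical fitted means

# Pearson residuals: r_P = (Y - mu)/sqrt(V(mu)), with V(mu) = mu for Poisson
r_pearson = [(y - m) / math.sqrt(m) for y, m in zip(Y, mu)]
chi2 = sum(r * r for r in r_pearson)

# Poisson deviance contribution d_i = 2[Y log(Y/mu) - (Y - mu)], with 0*log(0) := 0
def d_i(y, m):
    term = y * math.log(y / m) if y > 0 else 0.0
    return 2.0 * (term - (y - m))

deviance = sum(d_i(y, m) for y, m in zip(Y, mu))
r_deviance = [math.copysign(math.sqrt(d_i(y, m)), y - m) for y, m in zip(Y, mu)]
```

The squared Pearson residuals sum to the Pearson χ², and the squared deviance residuals sum to the deviance.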
1.1.7 Over-dispersion
Over-dispersion: Var(Y) > a(φ)V(µ) → se(β̂) underestimated → z-statistic inflated → inflated Type I error.
Under-dispersion: Var(Y) < a(φ)V(µ) → conservative test results → power decreases.
Example 1.1.7. Seed data. Crowder MJ (1978). Appl. Statist. 27, 34-37.
Study of germination of 2 types of seeds with 2 root extracts.
Variables:
seed type = 1,2
extract = 1,2
r = # germinated seeds
m = # seeds on plate
Let Y = r/m be the data. Let µ = E(Y ) be the probability of germination.
Data (r/m per plate):
seed 1, extract 1: 10/39, 23/62, 23/81, 26/51, 17/39
seed 1, extract 2: 5/6, 53/74, 55/72, 32/51, 46/79, 10/13
seed 2, extract 1: 8/16, 10/30, 8/28, 23/45, 0/4
seed 2, extract 2: 3/12, 22/41, 15/30, 32/51, 3/7
(21 plates in total)
The model is
logit(µ) = log[µ/(1 − µ)] = β_0 + β_1·seed + β_2·extract + β_3·seed·extract.
Deviance: D = 33.28 with df = n − p = 21 − 4 = 17; D/(n − p) ≈ 2 ≫ 1.
Pearson χ²: χ² = 31.65 with df = n − p = 21 − 4 = 17.
Heterogeneous population (over-dispersion):
Let π be the germination probability within a plate, and assume m is constant.
Since µ = E(Y) = E[E(Y∣π)] = E(π),
Var(Y) = E[Var(Y∣π)] + Var[E(Y∣π)] = E[π(1 − π)/m] + Var(π)
= E(π)[1 − E(π)]/m + Var(π)(1 − 1/m) = µ(1 − µ)/m + Var(π)(1 − 1/m) > µ(1 − µ)/m.
If Var(π) = τ²µ(1 − µ), then
Var(Y) = [µ(1 − µ)/m][1 + (m − 1)τ²] = σ² µ(1 − µ)/m = σ² V(µ).
Model-based Approach to Over-dispersed GLM
Assume that Var(Y) = σ² V(µ) a(φ), where V(µ) a(φ) is the naive variance from the distributional assumption (exponential family). We have
β̂ ∼̇ N(β, σ²(X^T W X)^{−1}), where ∼̇ indicates "asymptotically".
Estimate
σ̂² = χ²/(n − p) or D/(n − p),
and inflate the standard errors accordingly: se → se · √[χ²/(n − p)].
In this case, Var(Y) = σ² µ(1 − µ)/m and bias(σ̂²) ∼ O(1/m) ∝ 1/m.
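With the seed-data numbers above, the over-dispersion correction amounts to scaling the naive standard errors (a sketch; the naive se and coefficient below are hypothetical, for illustration only):

```python
import math

# From the seed example: Pearson chi^2 = 31.65 on n - p = 21 - 4 = 17 df
chi2, df = 31.65, 17
sigma2_hat = chi2 / df                 # estimated dispersion, roughly 1.86

se_naive = 0.30                        # hypothetical naive se(beta_j)
se_corrected = se_naive * math.sqrt(sigma2_hat)
z_naive = 0.65 / se_naive              # hypothetical coefficient of 0.65
z_corrected = 0.65 / se_corrected      # corrected z shrinks toward 0
```

Since σ̂² > 1 here, the corrected standard error is larger and the corrected z-statistic smaller, counteracting the inflated Type I error.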
1.1.8 Binary Data
For binary outcomes, some commonly used link functions are:
1. Logistic model: log[p_i/(1 − p_i)] = x_i^T β, η = g(p) = logit(p) ⟹ p = e^η/(1 + e^η).
Note: This is a symmetric link function, which means that switching case and control leads to the same result except for a sign switch on β.
It corresponds to the cdf of a logistic distribution: F(x) = e^{(x−µ)/r}/(1 + e^{(x−µ)/r}), with E(X) = µ and Var(X) = π²r²/3.
2. Probit model: p = Φ(η), η = Φ^{−1}(p) → probit link.
Note: This is also a symmetric link function, since Φ^{−1}(p) = −Φ^{−1}(1 − p). The probit model is also referred to as the liability model.
Comparison between logistic and probit models: for logistic models, β is easier to interpret (log odds ratio); probit models are easier to fit, but the change in p related to β is not proportional.
3. Complementary log-log link, based on the extreme value (Gumbel) distribution:
p = F(η) = 1 − exp[−exp(η)], η = g(p) = log[−log(1 − p)].
It represents the distribution of maxima, with an underlying normal/exponential distribution (e.g., competing risks).
4. Log link: η = log(p), p = e^η = e^{Xβ}. This link function can be used to estimate the relative risk (risk ratio) instead of the odds ratio.
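The four inverse links can be compared numerically (a small sketch, using only Python's math library; the probit symmetry noted above is checked directly):

```python
import math

eta = -0.7
p_logit   = math.exp(eta) / (1 + math.exp(eta))       # inverse logit
p_probit  = 0.5 * (1 + math.erf(eta / math.sqrt(2)))  # Phi(eta)
p_cloglog = 1 - math.exp(-math.exp(eta))              # inverse complementary log-log
p_log     = math.exp(eta)                             # log link; valid only while p <= 1

# Symmetry of the probit link: Phi^{-1}(p) = -Phi^{-1}(1-p) <=> Phi(eta) + Phi(-eta) = 1
p_probit_neg = 0.5 * (1 + math.erf(-eta / math.sqrt(2)))
```

All four values are valid probabilities at this η; the probit satisfies the symmetry identity exactly, while the complementary log-log link does not.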
Case-control vs cohort study
1. Population-based model: P(Y = 1∣X).
Since
OR(X; β) = [P(Y = 1∣X = x) P(Y = 0∣X = 0)] / [P(Y = 1∣X = 0) P(Y = 0∣X = x)],
we can write
P(Y = 1∣X) = exp[α + log OR(X; β)] / { 1 + exp[α + log OR(X; β)] }.
2. Case-control sampling: let S = 1 if sampled and S = 0 if not sampled. Then
P(X, Y∣S = 1) = P(X∣Y, S = 1) P(Y∣S = 1) = P(X∣Y) n_Y/n,
where n_Y = n_0 (controls) or n_1 (cases), and n = n_0 + n_1.
⟹ P(X∣Y) = (n/n_Y) P(X, Y∣S = 1) = (n/n_Y) P(Y∣X, S = 1) P(X∣S = 1)
= (n/n_Y) · { exp[Y(α* + Xβ)] / [1 + exp(α* + Xβ)] } · q(X),
since
P(Y = 1∣X, S = 1) = P(Y = 1, X, S = 1)/P(X, S = 1) = P(S = 1∣Y = 1, X) P(Y = 1∣X) P(X) / ∑_{Y=0}^{1} P(S = 1∣Y, X) P(Y∣X) P(X)
= π_1 P(Y = 1∣X) / [π_0 P(Y = 0∣X) + π_1 P(Y = 1∣X)]
= { exp(log π_1 + α + Xβ)/[1 + exp(α + Xβ)] } / { [exp(log π_0) + exp(log π_1 + α + Xβ)]/[1 + exp(α + Xβ)] }
= (π_1/π_0) exp(α + Xβ) / [1 + (π_1/π_0) exp(α + Xβ)]
= exp(α* + Xβ) / [1 + exp(α* + Xβ)], where α* = α + log(π_1/π_0).
Note: α reflects the population prevalence. In practice, π_1 ≫ π_0.
3. Maximum likelihood estimation:
L ∝ L_1 × L_2 = ∏_{i=1}^n P(Y_i∣X_i, S = 1) · ∏_{i=1}^n P(X_i∣S = 1) = ∏_{i=1}^n { exp[Y_i(α* + X_i^T β)] / [1 + exp(α* + X_i^T β)] } · ∏_{i=1}^n q(X_i),
maximizing L subject to the constraints
n_0/n = P(Y = 0∣S = 1) = ∫ P(Y = 0∣X = x, S = 1) q(x) dx
n_1/n = P(Y = 1∣S = 1) = ∫ P(Y = 1∣X = x, S = 1) q(x) dx.
This is a semi-parametric model, with parametric part P(Y∣X, S = 1) and non-parametric part P(X∣S = 1) = q(X).
Proposition 1: The unconstrained maximizer (α̂*, β̂) with q̂(x) = (1/n) #{i : X_i = x}, which also maximizes L, satisfies the constraints and therefore also solves the constrained optimization.
Proposition 2: The estimating equations from L_1 are unbiased w.r.t. the q(·) distribution.
Reference: Prentice & Pyke (1979).
1.1.9 Count Data
Poisson regression model:
1. log µ_i = X_i^T β, µ_i = E(Y_i): models counts; but as the population size increases, counts increase.
2. log λ_i = X_i^T β with λ_i = µ_i/n_i: models rates, so log µ_i = log n_i + X_i^T β. The term log n_i is called the offset.
Alternative regression model: a variance-stabilizing transformation for count data.
1.1.10 Quasi-likelihood Theory
For generalized linear models, µ = E(Y) = b′(θ) and Var(Y) = a(φ)V(µ). In the case of over-dispersion, Var(Y) ≠ a(φ)V(µ). In the model-based correction, we assume Var(Y) = σ² a(φ) V(µ). Here, we further relax the assumption on Var(Y_i) and allow any form.
Note: The assumed variance function has little impact on β̂, but more impact on se(β̂).
Recall that in a GLM,
Y_i ∼ f(Y_i∣θ_i, φ) = exp{ [Y_i θ_i − b(θ_i)]/a(φ) + c(Y_i, φ) },
η_i = X_i^T β = g(µ_i), l(β) = ∑_{i=1}^n l_i(β) = ∑_{i=1}^n log f(Y_i∣β).
The score equation is
∑_{i=1}^n (∂/∂β_j) l_i(β) = 0 ⟹ ∑_{i=1}^n [(∂µ_i/∂β_j)/Var(Y_i)] (Y_i − µ_i) = 0, j = 1, …, p,
since
(∂/∂µ_i) l_i(β) = (∂/∂θ_i) l_i(β) · 1/(∂µ_i/∂θ_i) = [(Y_i − µ_i)/a(φ)] · 1/b′′(θ_i).
Quasi-likelihood
Suppose that instead of knowing the likelihood (i.e., the distribution), we only know the first two moments:
E(Y_i) = µ_i = µ_i(β)
Var(Y_i) = a(φ)V(µ_i) (exponential family), or V(µ_i) user-specified (more flexibility; V(µ_i) ≠ b′′(θ) is allowed here).
We mimic the score equations for the GLM and estimate β by solving the estimating equations
∑_{i=1}^n [(∂µ_i/∂β_j)/V(µ_i)] (Y_i − µ_i) = 0.
Note: This is no longer the derivative of a true log-likelihood, but of a quasi-likelihood. Since the contribution to the log-likelihood from Y_i is the integral with respect to µ_i of ∂ log f(Y_i∣β)/∂µ_i, we define the log quasi-likelihood via the contribution of Y_i:
Q_i(Y_i∣µ_i) = ∫_{Y_i}^{µ_i} (Y_i − t)/V(t) dt ⟹ ∂Q_i/∂β_j = (∂µ_i/∂β_j)(∂Q_i/∂µ_i) = (∂µ_i/∂β_j)(Y_i − µ_i)/V(µ_i).
Q = ∑_{i=1}^n Q_i behaves like a log-likelihood function, and β is estimated by IWLS.
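As a check (a sketch, not in the notes): taking V(t) = t in the integral gives the Poisson-type quasi-likelihood Q = Y log(µ/Y) − (µ − Y), which matches the Poisson log-likelihood up to a term free of µ; numerical integration confirms the closed form.

```python
import math

def Q_numeric(y, mu, V, steps=20000):
    # midpoint rule for Q(y|mu) = integral from y to mu of (y - t)/V(t) dt
    h = (mu - y) / steps
    total = 0.0
    for k in range(steps):
        t = y + (k + 0.5) * h
        total += (y - t) / V(t)
    return total * h

y, mu = 3.0, 4.5
q_closed = y * math.log(mu / y) - (mu - y)  # closed form when V(t) = t
q_num = Q_numeric(y, mu, lambda t: t)
```

Other variance functions (e.g., V(t) = t(1 − t) for binary-type data) can be plugged into `Q_numeric` the same way.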
Properties of Quasi-likelihood
1. E(∂Q_i/∂µ_i) = 0.
Proof. E(∂Q_i/∂µ_i) = E[(Y_i − µ_i)/V(µ_i)] = [E(Y_i) − µ_i]/V(µ_i) = 0.
2. E(∂Q_i/∂β_j) = 0.
Proof. E(∂Q_i/∂β_j) = E(∂Q_i/∂µ_i)(∂µ_i/∂β_j) = 0.
3. E[(∂Q_i/∂µ_i)²] = −E(∂²Q_i/∂µ_i²).
Proof.
E[(∂Q_i/∂µ_i)²] = E{ [(Y_i − µ_i)/V(µ_i)]² } = [E(Y_i²) − 2µ_i E(Y_i) + µ_i²]/V(µ_i)² = [E(Y_i²) − µ_i²]/V(µ_i)² = 1/V(µ_i), since Var(Y_i) = V(µ_i).
−∂²Q_i/∂µ_i² = 1/V(µ_i) + [(Y_i − µ_i)/V²(µ_i)] ∂V(µ_i)/∂µ_i ⟹ −E(∂²Q_i/∂µ_i²) = 1/V(µ_i).
4. E[(∂Q_i/∂β_j)(∂Q_i/∂β_k)] = −E(∂²Q_i/∂β_j∂β_k).
Proof.
(∂Q_i/∂β_j)(∂Q_i/∂β_k) = (∂Q_i/∂µ_i)² (∂µ_i/∂β_j)(∂µ_i/∂β_k)
∂²Q_i/∂β_j∂β_k = (∂/∂β_k)(∂Q_i/∂β_j) = (∂/∂β_k)[(∂Q_i/∂µ_i)(∂µ_i/∂β_j)]
= (∂²Q_i/∂µ_i²)(∂µ_i/∂β_k)(∂µ_i/∂β_j) + (∂Q_i/∂µ_i)(∂²µ_i/∂β_j∂β_k),
where the last term has expectation 0; combining with Property 3 gives the result.
Properties 1-4 are analogous to those of a true likelihood.
Maximum Quasi-likelihood Estimators (MQLE)
Under the usual regularity conditions:
the maximum likelihood estimator (MLE) β̂ satisfies √n(β̂ − β) →_d N(0, I^{−1}(β));
the maximum quasi-likelihood estimator (MQLE) β̂ satisfies √n(β̂ − β) →_d N(0, I^{−1}(β)),
where
I(β) = E[ (1/n) ∑_{i=1}^n (∂Q_i/∂β)(∂Q_i/∂β^T) ] = −E[ (1/n) ∑_{i=1}^n ∂²Q_i/∂β∂β^T ] = (1/n) ∑_{i=1}^n (∂µ_i/∂β) V(µ_i)^{−1} (∂µ_i/∂β^T) = (1/n)(∂µ/∂β) V(µ)^{−1} (∂µ/∂β^T).
Example 1.1.8 (Over-dispersion). Var(Y_i) = σ² V(µ_i) a(φ).
The estimating equation is
∑_{i=1}^n [(∂µ_i/∂β_j) / (σ² a(φ) V(µ_i))] (Y_i − µ_i) = 0.
1. σ² constant → β̂ unaffected.
2. Var(β̂) = (1/n) I(β)^{−1} = σ² [(∂µ/∂β) V(µ)^{−1} (∂µ/∂β^T)]^{−1}, where V(µ) is the variance function from the GLM and σ² is a scale on V(µ).
To estimate σ²:
σ̂² = (Y − µ̂)^T V(µ̂)^{−1} (Y − µ̂)/(n − p) = [1/(n − p)] ∑_{i=1}^n [(Y_i − µ̂_i)/√V(µ̂_i)]² = χ²/(n − p).
This formula can be applied under both the likelihood and the quasi-likelihood.
Estimating the MQLE β̂ by IWLS:
1. Initialize: η = g(Y).
2. Repeat until convergence:
µ = g^{−1}(η)
Z = η + (Y − µ) g′(µ)
W = diag{ [g′(µ)² Var(Y)]^{−1} }
β̂ = (X^T W X)^{−1} X^T W Z
η = X β̂
R: family = quasi(link = ..., variance = ...), e.g., quasi(link = logit, variance = "mu(1-mu)")
Summary
1. Quasi-likelihood (QL) is used when the assumption of a standard exponential family is invalid.
2. The β̂'s are not affected much.
3. The MQLE has the same asymptotic properties as the MLE.
4. The MQLE is obtained by IWLS.
Note: There is little guidance on which V(µ) to use. Robust variance estimates (see below) are always valid for large samples.
Robust Variance Estimator
The quasi-score function is
U_i = ∂Q_i/∂β = (∂µ_i/∂β)(Y_i − µ_i)/V(µ_i).
By Taylor expansion,
0 = U(β̂) ≈ U(β) + (∂U/∂β)(β̂ − β)
⟹ √n(β̂ − β) ≈ √n(−∂U/∂β)^{−1} U(β) = [−(1/n) ∑_{i=1}^n ∂U_i/∂β]^{−1} (1/√n) ∑_{i=1}^n U_i →_d N(0, Σ),
where Σ = A^{−1} B A^{−1} and
A = −lim_{n→∞} (1/n) ∑_{i=1}^n E(∂U_i/∂β) = −lim_{n→∞} (1/n) A_n
B = lim_{n→∞} (1/n) ∑_{i=1}^n Var(U_i) = lim_{n→∞} (1/n) B_n.
Since Var(∂Q_i/∂β) = E[(∂Q_i/∂β)(∂Q_i/∂β^T)] = −E(∂²Q_i/∂β∂β^T) when Var(Y_i) is correctly specified, we have B = A, so
Σ = A^{−1} A A^{−1} = A^{−1} = B^{−1}.
Then
Var(β̂) = [E(−∂²Q/∂β∂β^T)]^{−1}.
If Var(Y_i) ≠ V(µ_i), i.e., the assumed variance is incorrect: since E(U) = 0, β̂ is still consistent, and we still have
√n(β̂ − β) ≈ [−(1/n) ∑_{i=1}^n ∂U_i/∂β]^{−1} (1/√n) ∑_{i=1}^n U_i →_d N(0, Σ), where U_i = (∂µ_i/∂β)(Y_i − µ_i)/V(µ_i),
∂U_i/∂β^T = (∂/∂β^T)[(∂µ_i/∂β)/V(µ_i)] (Y_i − µ_i) − (∂µ_i/∂β)(∂µ_i/∂β^T)/V(µ_i).
Hence,
−A_n = ∑_{i=1}^n E(−∂U_i/∂β^T) = ∑_{i=1}^n (∂µ_i/∂β)(∂µ_i/∂β^T)/V(µ_i) = X^T W X, where W = diag{ [g′(µ_i)² V(µ_i)]^{−1} },
B_n = ∑_{i=1}^n Var(U_i) = ∑_{i=1}^n E(U_i U_i^T).
Robust (sandwich) variance estimator:
V̂ar(β̂) = (1/n) Â^{−1} B̂ Â^{−1} = Â_n^{−1} B̂_n Â_n^{−1}, where Â_n = X^T Ŵ X and B̂_n = ∑_{i=1}^n U_i(β̂) U_i(β̂)^T,
i.e.,
V̂ar(β̂) = (X^T Ŵ X)^{−1} [ ∑_{i=1}^n U_i(β̂) U_i(β̂)^T ] (X^T Ŵ X)^{−1}.
Note: a large n is needed to estimate B_n well.
Different variance estimators:
1. Naive: V̂ar(β̂) = (X^T Ŵ X)^{−1}, under the user-specified V(µ_i). In a GLM, V(µ_i) is determined by the distributional assumption.
2. Model-based: assume Var(Y_i) = σ² V(µ_i); V̂ar(β̂) = σ̂² (X^T Ŵ X)^{−1}.
3. Robust: only assume that the mean is correctly specified, so
V̂ar(β̂) = (X^T Ŵ X)^{−1} ∑_{i=1}^n U_i(β̂) U_i(β̂)^T (X^T Ŵ X)^{−1}.
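The naive and robust estimators can be compared on over-dispersed count data (a simulation sketch, assuming numpy; the Poisson fit reuses the IWLS scheme from Section 1.1.4):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu_true = np.exp(X @ np.array([0.2, 0.4]))
# over-dispersed counts via a Poisson-gamma mixture: Var(Y) = mu + mu^2/2 > mu
Y = rng.poisson(mu_true * rng.gamma(shape=2.0, scale=0.5, size=n))

# fit a working Poisson GLM (log link) by IWLS
eta = np.log(Y + 0.5)
for _ in range(50):
    mu = np.exp(eta)
    Z = eta + (Y - mu) / mu
    beta = np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (mu * Z))
    eta = X @ beta

mu = np.exp(eta)
A_n = X.T @ (X * mu[:, None])             # X^T W X with W = diag(mu)
U = X * (Y - mu)[:, None]                 # U_i = x_i (Y_i - mu_i) (canonical link)
B_n = U.T @ U

var_naive = np.linalg.inv(A_n)            # naive (X^T W X)^{-1}
var_robust = var_naive @ B_n @ var_naive  # sandwich A_n^{-1} B_n A_n^{-1}
```

Because the data are over-dispersed relative to the working Poisson variance, the robust diagonal entries exceed the naive ones by roughly the dispersion factor.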
Mathematical Rationale for the Estimating Equations
An estimating function g(Y; θ) is a function of the data Y and the parameter θ having zero mean for all parameter values, i.e., E[g(Y; θ)] = 0. To obtain parameter estimates, solve g(Y; θ) = 0. E.g., under the identity link, Y − µ(β) = 0 ⟹ Y_i − µ_i(β) = 0.
Challenge: reduce an n-vector to a p-vector with minimal sacrifice of information (n = # samples, p = # parameters). For linear models,
E[g_i(Y; θ)∣A_i] = 0, where A_i ≡ A_i(Y; θ).
1. For regressions, A_i = X_i (the set of covariates);
2. For time series, A_i = Y_{i−1}.
Let D_ij = −E[∂g_i(Y_i; θ)/∂θ_j ∣ A_i]. E.g., for g = Y − µ(β), D = ∂µ/∂β.
The estimating functions are
U(β, y) = (∂µ/∂β)[Y − µ(β)]/V(µ) = D^T V^{−1} g, where V = diag{Var(g_1∣A_1), …, Var(g_n∣A_n)}.
Following the quasi-likelihood results, the asymptotic variance is
Cov(θ̂) = E[∂U(θ)/∂θ]^{−1} = (D^T V^{−1} D)^{−1} = [(∂µ^T/∂θ) V^{−1} (∂µ/∂θ)]^{−1}.
Optimality: Consider linear estimating equations h(y; β) = H^T [y − µ(β)], where H (n × p) may be a function of β, but not of y. Estimate β̃ through h(y; β̃) = 0.
Under the usual asymptotic regularity conditions,
0 = h(β̃) ≈ h(β) + (∂h/∂β)(β̃ − β)
⟹ β̃ − β ≈ −(∂h/∂β)^{−1} h(β) = −[ (∂H^T/∂β)(y − µ) − H^T (∂µ/∂β) ]^{−1} h(β) ≈ (H^T D)^{−1} h(β),
since E[(∂H^T/∂β)(y − µ)] = 0. Thus,
Cov(β̃) ≈ (H^T D)^{−1} Cov(h) (D^T H)^{−1}
= (H^T D)^{−1} H^T Cov(y) H (D^T H)^{−1}
= (H^T D)^{−1} H^T V H (D^T H)^{−1}
≥ Cov(β̂) = (D^T V^{−1} D)^{−1}, where β̂ is the MQLE,
since Cov(β̂)^{−1} − Cov(β̃)^{−1} ≈ D^T [V^{−1} − H(H^T V H)^{−1} H^T] D ≥ 0, where V^{−1} − H(H^T V H)^{−1} H^T = V^{−1/2}[I − V^{1/2} H (H^T V H)^{−1} H^T V^{1/2}] V^{−1/2} and the bracketed matrix is symmetric and idempotent, hence nonnegative definite.
Fact: For positive definite matrices A, B of the same order, A^{−1} − B^{−1} ≥ 0 (nonnegative definite) ⟹ B − A ≥ 0 (nonnegative definite).
Optimality: For any vector a, Var(a^T β̃) ≥ Var(a^T β̂), similar to the proof of the Gauss-Markov theorem.
Note: Cov(β̂)^{−1} − Cov(β̃)^{−1} is the residual variance from regressing D^T V^{−1} Y on H^T.
Quasi-likelihood (QL) Extensions
1. Parameters in the variance function: E(Y_i) = µ_i = g^{−1}(X_i^T β), Var(Y_i) = V(µ_i, φ), e.g., σ² V(µ_i).
Reference: Breslow (1990). JASA 85, 565-571.
2. Correlated data: Suppose Y_i is a vector of correlated observations, with
E(Y_i) = µ_i = g^{−1}(X_i^T β), Var(Y_i) = V(µ_i, φ) = V^{1/2} ρ V^{1/2},
where ρ is the correlation matrix and V is the (diagonal) variance matrix.
This is the GEE model.
Reference: Diggle, Liang, Zeger (1994). Sections 4.6 & 7.1.
1.2 Correlated Data
1.2.1 Introduction
Correlated data comes in various formats:
Longitudinal data: measurements over time (t as a covariate)
Repeated measures
Clustered data
Panel data (econometrics)
Example 1.2.1. Subjects i = 1, …, m; times t_ij; data (Y_ij, X_ij), j = 1, …, n.
E(Y_ij) = µ_ij ⟹ E(Y_i) = µ_i; Var(Y_ij) = V_ij ⟹ Var(Y_i) = V_i.
Models:
1. Y_ij = β_0 + β X_ij + ε_ij ⟹ Y_ij = β_0 + β X_i1 (baseline profile) + β(X_ij − X_i1) (change profile) + ε_ij
2. Y_ij = β_0i + β(X_ij − X_i1) + ε_ij
3. Y_ij = β_0 + β_C X_i1 + β_L(X_ij − X_i1) + ε_ij, where β_C is the cross-sectional (baseline) component and β_L is the longitudinal component.
β̂_OLS = (X^T X)^{−1} X^T Y
β̂_GLS = (X^T V^{−1} X)^{−1} X^T V^{−1} Y
Y_ij = X_ij^T β + ε_ij, ε_i = (ε_i1, …, ε_in)^T ∼ MVN(0, σ² V_0).
1. For independent data, V_0 = I.
2. For correlated data, V_0 is not diagonal, but the ε_i's are still independent across subjects.
Y ∼ N(Xβ, σ² V), where V = blockdiag(V_1, V_2, …, V_m).
The main reason to consider the correlation is efficiency: β̂ has smaller variance than the estimator that assumes uncorrelated data.
Coefficient Estimation
1. (a) WLS: min (Y − Xβ)^T W (Y − Xβ) ⟹ β̂_W = (X^T W X)^{−1} X^T W Y
(b) OLS: W = I ⟹ β̂_I = (X^T X)^{−1} X^T Y
(c) BLUE: W = V^{−1} ⟹ β̂ = (X^T V^{−1} X)^{−1} X^T V^{−1} Y
All of these estimators are consistent:
E(β̂_W) = (X^T W X)^{−1} X^T W E(Y) = (X^T W X)^{−1} X^T W X β = β.
Var(β̂_W) = σ²(X^T W X)^{−1} X^T W V W X (X^T W X)^{−1}
Var(β̂_I) = σ²(X^T X)^{−1} X^T V X (X^T X)^{−1}
Var(β̂) = σ²(X^T V^{−1} X)^{−1}
If V is unknown, we can model and estimate V (1) parametrically or (2) non-parametrically (requires m ≫ n).
If V is misspecified, W is not optimal. Efficiency is defined by
efficiency = Var(β̂)/Var(β̂_W).
2. MLE:
l(β, σ², V_0) = −(1/2){ (1/σ²)(Y − Xβ)^T V_0^{−1}(Y − Xβ) + nm log σ² + m log∣V_0∣ }
Estimate β, σ, V_0 by MLE (the estimator of σ is biased).
3. REML (Restricted Maximum Likelihood):
The maximum likelihood estimator of σ² is biased, but sometimes we want an unbiased estimate, e.g., for the variance components of a mixed effects model.
E.g., with V = I: σ̂²_MLE = RSS/(nm); the unbiased estimator is σ̂² = RSS/(nm − p).
Key idea of REML: let A be a linear transformation of Y, Y* = AY, such that the distribution of Y* does not depend on β.
E.g., A = I − X(X^T X)^{−1} X^T = I − P ⟹ Y* = AY ∼ N(0, σ² A V A^T).
⟹ Applying MLE to Y* leads to the REML estimate σ̂² = RSS/(nm − p).
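A small numerical illustration of the estimators above (a sketch, assuming numpy): GLS with the true V is at least as efficient as OLS, and the REML projection A = I − X(X^T X)^{−1} X^T annihilates X.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=N)])

# an AR-1 covariance as the true V (sigma^2 = 1, rho = 0.6)
idx = np.arange(N)
V = 0.6 ** np.abs(idx[:, None] - idx[None, :])
Vinv = np.linalg.inv(V)

# Var(beta_I) = (X^T X)^{-1} X^T V X (X^T X)^{-1};  Var(beta_GLS) = (X^T V^{-1} X)^{-1}
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ V @ X @ XtX_inv
var_gls = np.linalg.inv(X.T @ Vinv @ X)

# REML transformation: A X = 0, so Y* = AY carries no information about beta
A = np.eye(N) - X @ XtX_inv @ X.T
```

The diagonal of `var_ols` dominates that of `var_gls` (GLS is BLUE), and `A` is a symmetric idempotent projection with `A @ X` equal to zero.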
Class of Correlation Models (Extensions of GLMs)
1. Marginal model (extension of quasi-likelihood):
(1) E(Y_i) = µ_i, g(µ_i) = X_i β
(2) Var(Y_ij) = V(µ_ij) φ
(3) Corr(Y_ij, Y_ik) = ρ(µ_ij, µ_ik∣α)
⟹ GEE (generalized estimating equations): specify g(·), V(·), ρ(·).
Advantage: only needs the first and second moments; the correlation is taken into account only through the working correlation.
Disadvantage: the correlation function may be misspecified.
2. Random effects model / mixed effects model / conditional model (extension of GLM, MLE):
Given b_i, (Y_i1, …, Y_in_i) are mutually independent and follow a GLM with density
f(y_ij∣b_i) = exp{ [y_ij θ_ij − ψ(θ_ij)]/φ + c(y_ij, φ) }.
The conditional mean and variance are
µ_ij = E(Y_ij∣b_i) = ψ′(θ_ij), V_ij = Var(Y_ij∣b_i) = ψ′′(θ_ij) φ,
since conditional on b_i (the conditioning factor), the data are independent.
The model satisfies
g(µ_ij) = X_ij^T β + Z_ij^T b_i,
where Z_ij is a subset of X_ij and Z_ij^T b_i models the correlation (a shared property within a family).
⟹ GLMM (generalized linear mixed-effects model).
Assume the b_i's are mutually independent, with a common multivariate distribution F (e.g., a normal distribution). The likelihood is
L(β, α∣Y) = ∏_{i=1}^m ∫ ∏_{j=1}^{n_i} f(Y_ij, b_i) db_i = ∏_{i=1}^m ∫ ∏_{j=1}^{n_i} f(Y_ij∣b_i) f(b_i∣α) db_i,
where α parameterizes F, f(Y_ij∣b_i) comes from the GLM, and f(b_i∣α) comes from F (e.g., MVN).
For generalized linear mixed models, the integration is difficult.
For Gaussian data (Y∣b_i), use MLE or REML; for non-Gaussian data (Y∣b_i), use numerical integration.
Advantage: can identify the correlation factors and model them.
Disadvantage: the correlation factors need to fully explain the correlation (otherwise the observations are not conditionally independent). Let m be the number of clusters; m needs to be large for the asymptotic distribution to approximate well.
3. Transition (Markov) model:
The link function is
g( E[y_ij∣y_i1, …, y_i(j−1)] ) = X_ij^T β + ∑_{r=1}^s f_r(y_i1, …, y_i(j−1); α),
with
Var[y_ij∣y_i1, …, y_i(j−1)] = φ · V( E[y_ij∣y_i1, …, y_i(j−1)] ),
where V(·) is a variance function. We need to know the order of the observations.
If g is the linear/identity link, then all three models reduce to the same model.
1.2.2 Linear Mixed Models (LMM)
The linear mixed model is represented by
y_i = X_i β + Z_i b_i + ε_i,
where X_i β is the fixed component and Z_i b_i is the random component.
Theoretically, Z_i need not be a subset of X_i; in practice, Z_i ⊆ X_i.
Example 1.2.2. Measurements over time (t), treatment groups (T); t_ij is time-varying and T_i is time-invariant.
(1) Random intercept, fixed slope: Z_i = (1, …, 1)^T,
y_ij = β_0 + β_1 t_ij + β_2 T_i + b_0i + ε_ij
(allows different baseline values).
(2) Random intercept, random slope: Z_i has rows (1, t_ij) and X_i has rows (1, t_ij, T_i),
y_ij = β_0 + β_1 t_ij + β_2 T_i + b_0i + b_1i t_ij + ε_ij
(allows different baselines, i.e., starting points, and different responses to treatment).
Assumptions:
1. b_i ∼ N(0, D). In Example 1.2.2:
(1) b_i = b_0i ∼ N(0, σ_b²);
(2) b_i = (b_0i, b_1i)^T ∼ N(0, D) with D = [σ_0², σ_01; σ_10, σ_1²].
σ_01 = Cov(b_0i, b_1i) accounts for shared properties between the random effects that we do not model explicitly. When the number of correlation factors goes to infinity, σ_01 → 0.
2. Var(ε_i) = R_i, e.g., σ² I. R_i can be non-diagonal, representing residual correlation within a cluster.
3. b_i ⟂ ε_i (independence).
Two-component representation of the LMM:
Level 1 (measurement level): the finest unit, e.g., t_ij, (lipid)_ij, which vary across time.
Level 2 (personal/upper level): patient/subject i, e.g., T_i, (gender)_i, which are invariant across time.
Example 1.2.3.
Level 1 model: Y_ij = b_0i + β_1 t_ij + ε_ij, ε_ij ∼ N(0, σ²) (1)
Level 2 model: b_0i = β_00 + β_01 T_i + U_0i, U_0i ∼ N(0, σ_b²), where the U_0i's are the level-2 residuals (2)
Substituting (2) into (1) gives model (1) of Example 1.2.2.
For model (2) of Example 1.2.2:
Level 1 model: Y_ij = b_0i + b_1i t_ij + ε_ij
Level 2 model: b_0i = β_00 + β_01 T_i + U_0i, b_1i = β_10 + β_11 T_i + U_1i
Common Structures for D or R_i
The dimension of D depends on the number of random-effect risk factors; D characterizes the correlation between random effects. The dimension of R_i depends on the number of observations within cluster i; R_i characterizes the (residual) correlation between samples.
E.g., with a random intercept and a random slope, D = Cov((b_0i, b_1i)^T) is a 2 × 2 matrix; with three observations per cluster, R_i = Cov((y_i1, y_i2, y_i3)^T) is a 3 × 3 matrix.
1. Exchangeable / compound symmetry (C-S): σ² times the matrix with 1 on the diagonal and ρ everywhere off the diagonal.
2. Variance components (VC), independent (usually applied to D, not R_i): D = diag(σ_1², σ_2²).
3. One-dependent: σ² times the matrix with 1 on the diagonal, ρ on the first off-diagonals, and 0 elsewhere.
4. k-dependent: σ² times the banded matrix with 1 on the diagonal, ρ_1, …, ρ_k on the first k off-diagonals, and 0 beyond; (k + 1) parameters: σ², ρ_1, …, ρ_k.
5. AR-1: R_i has (j, k) entry σ² ρ^∣j−k∣.
6. Markov-chain type (AR with unequal intervals): Corr(Y_ij, Y_ik) = ρ^{γ∣t_j−t_k∣} or ρ^{γ·f(t_j−t_k)}.
7. Unstructured: no assumption on D or R_i ⟹ n(n + 1)/2 parameters (inefficient if we have information on the structure of D or R_i).
Note: Except for the exchangeable structure, the order of the data matters.
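The exchangeable and AR-1 structures can be generated programmatically (a sketch, assuming numpy):

```python
import numpy as np

def exchangeable(n, sigma2, rho):
    # sigma^2 on the diagonal, sigma^2 * rho everywhere else
    R = np.full((n, n), rho)
    np.fill_diagonal(R, 1.0)
    return sigma2 * R

def ar1(n, sigma2, rho):
    # (j, k) entry is sigma^2 * rho^{|j - k|}
    idx = np.arange(n)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])
```

For AR-1 the correlation decays with lag, so the order of the observations matters; for the exchangeable structure every pair has the same correlation, so it does not.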
Estimation of β
$$\mathrm{Var}(Y_i) = Z_iDZ_i^T + R_i = V_i,\qquad \mathrm{Var}(Y) = ZDZ^T + R = V = \begin{pmatrix} V_1 & 0 & \cdots & 0\\ 0 & V_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & V_m \end{pmatrix}$$
1. If V is known, then Yi ∼ MVN(µi, Vi).
$$f(Y_i) = (2\pi)^{-n_i/2}|V_i|^{-1/2}\exp\left\{-\tfrac{1}{2}(Y_i - X_i\beta)^T V_i^{-1}(Y_i - X_i\beta)\right\},\qquad f(Y) = \prod_{i=1}^m f(Y_i)$$
$$\Longrightarrow \text{MLE: } \hat\beta = \left(\sum_{i=1}^m X_i^T V_i^{-1} X_i\right)^{-1}\left(\sum_{i=1}^m X_i^T V_i^{-1} Y_i\right)$$
Stacking $X = (X_1^T,\ldots,X_m^T)^T$ and $V = \mathrm{diag}(V_1,\ldots,V_m)$,
$$\Longrightarrow \hat\beta = (X^T V^{-1} X)^{-1} X^T V^{-1} Y \quad \text{(GLS)}$$
$$\mathrm{Var}(\hat\beta) = \left(\sum_{i=1}^m X_i^T V_i^{-1} X_i\right)^{-1} = (X^T V^{-1} X)^{-1}$$
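As a sketch (in Python rather than the course's R/SAS; simulated data and illustrative names), the GLS estimator and its model-based variance are one line of linear algebra each:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix
v = rng.uniform(0.5, 2.0, size=n)                       # known variances (diagonal V)
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(v)

Vinv = np.diag(1.0 / v)                                 # V^{-1}
A = X.T @ Vinv @ X
beta_gls = np.linalg.solve(A, X.T @ Vinv @ y)           # (X'V^{-1}X)^{-1} X'V^{-1} Y
var_beta = np.linalg.inv(A)                             # model-based Var(beta_hat)
```

With clustered data V is block-diagonal rather than diagonal, but the formulas are identical.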
2. If V is unknown, then there is no guarantee that $\hat\beta$ is MVN.
Let ω denote the parameters of the covariance matrix.
E.g., $D = \begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{12} & \sigma_2^2\end{pmatrix}$, $R_i = \sigma_0^2 I \Longrightarrow \omega = (\sigma_1^2, \sigma_2^2, \sigma_{12}, \sigma_0^2)$.
(a) MLE of V(ω):
$$\frac{\partial l(\beta,\omega)}{\partial\omega} = 0 \Longrightarrow \hat V = V(\hat\omega_{MLE}) \Longrightarrow \hat\beta = (X^T\hat V^{-1}X)^{-1}(X^T\hat V^{-1}Y)$$
Problems:
① ω̂MLE is biased.
② ω̂MLE may have negative solutions (boundary of the parameter space).
(b) REML of V(ω):
Let A be a transformation matrix such that U = AY contains no β, i.e., AX = 0.
A = I − X(X^TX)^{−1}X^T and Y ∼ N(Xβ, V) ⟹ U ∼ N(0, AVA^T).
Reference: Harville (1974). Biometrika.
MLE estimates β and σ² (or V) simultaneously; σ̂²MLE is biased downwards because of the degrees of freedom lost in estimating β.
Intuition behind REML: maximize a modified likelihood free of β.
For a full-rank design matrix X (N × p), there exist at most N − p linearly independent vectors a with a^TX = 0.
⟸ In an N-dimensional space, we can find at most N − p linearly independent vectors (through the origin) orthogonal to the subspace spanned by the p columns of X.
Define A = [a1 a2 ⋯ a_{N−p}]:
⟹ A^TX = 0
⟹ y* = A^Ty = A^T(Xβ + ε) = A^Tε ∼ N(0, A^TVA), free of β.
Derivation of REML: Partition the likelihood into a likelihood for the mean parameter µ (and σ²) and a residual likelihood for σ² only.
Example 1.2.4. n independent normal random variables:
$$\log L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n \frac{(y_i-\mu)^2}{2\sigma^2}$$
$$= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \sum_{i=1}^n \frac{(y_i-\bar y)^2}{2\sigma^2} - \frac{n(\bar y-\mu)^2}{2\sigma^2}$$
$$= \left[-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{n(\bar y-\mu)^2}{2\sigma^2}\right] + \left[-\frac{n-1}{2}\log(2\pi) - \frac{n-1}{2}\log(\sigma^2) - \sum_{i=1}^n \frac{(y_i-\bar y)^2}{2\sigma^2}\right]$$
where the first component is the log-likelihood for ȳ (up to a constant), and the second component is the log-likelihood of n − 1 independent random variables.
Matrix transformation: For an MVN vector y, find an orthogonal matrix P, u = Py, such that u = (u1, u2, ⋯, un) are uncorrelated and normally distributed (orthogonal transformation).
⟹ Let the first row of P be (1, 1, ⋯, 1)/√n
⟹ u1 = √n ȳ, and u2, ⋯, un i.i.d. ∼ N(0, σ²)
Example 1.2.5. Linear model: y = Xβ + ε.
OLS estimate: b = (X^TX)^{−1}X^Ty
$$\log L(y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{(y-X\beta)^T(y-X\beta)}{2\sigma^2}$$
Rewrite:
$$y - X\beta = (y - Xb) + (Xb - X\beta) = (y - Xb) + X(b-\beta)$$
$$\Longrightarrow (y-X\beta)^T(y-X\beta) = (y-Xb)^T(y-Xb) + 2(y-Xb)^TX(b-\beta) + (b-\beta)^TX^TX(b-\beta)$$
$$= (y-Xb)^T(y-Xb) + (b-\beta)^TX^TX(b-\beta)$$
since y − Xb = (I − P)y, where P = X(X^TX)^{−1}X^T is the orthogonal projection onto the linear space spanned by the columns of X, so the cross term vanishes. And,
$$(y-Xb)^T(y-Xb) = (y-Py)^T(y-Py) = y^T(I-P)^T(I-P)y = y^T(I-P)y$$
since I − P is symmetric and idempotent. So,
$$\log L(y) = \left[-\frac{p}{2}\log(2\pi) - \frac{p}{2}\log(\sigma^2) - \frac{(b-\beta)^TX^TX(b-\beta)}{2\sigma^2}\right] + \left[-\frac{n-p}{2}\log(2\pi) - \frac{n-p}{2}\log(\sigma^2) - \frac{y^T(I-P)y}{2\sigma^2}\right]$$
where the second component is the residual log-likelihood, free of β.
Example 1.2.6. Linear mixed model: y = Xβ + Zb + ε, where b ∼ N(0, D), ε ∼ N(0, R).
$$H = \mathrm{Var}(y) = ZDZ^T + R$$
Find a transformation $L^Ty = \begin{pmatrix} y_1^*\\ y_2^*\end{pmatrix}$, L = (L1, L2), such that
$$L_1^TX = I_p,\qquad L_2^TX = 0.$$
Here L1 is an n × p matrix and L2 is n × (n − p). Therefore,
$$y_1^* \sim N(\beta,\ L_1^THL_1),\qquad y_2^* \sim N(0,\ L_2^THL_2),\qquad \mathrm{Cov}(y_1^*, y_2^*) = L_1^THL_2.$$
Using the conditional-distribution property of the MVN, we get:
$$y_1^*\mid y_2^* \sim N\!\left(\beta + L_1^THL_2(L_2^THL_2)^{-1}y_2^*,\ \ L_1^THL_1 - L_1^THL_2(L_2^THL_2)^{-1}L_2^THL_1\right)$$
And, writing B = (H^{−1}X, L2) (an n × n matrix),
$$B^{-1}H^{-1}(B^T)^{-1} = (B^THB)^{-1} = \begin{pmatrix} X^TH^{-1}X & X^TL_2\\ L_2^TX & L_2^THL_2\end{pmatrix}^{-1} = \begin{pmatrix} (X^TH^{-1}X)^{-1} & 0\\ 0 & (L_2^THL_2)^{-1}\end{pmatrix}$$
$$\Longrightarrow H^{-1} = B\begin{pmatrix}(X^TH^{-1}X)^{-1} & 0\\ 0 & (L_2^THL_2)^{-1}\end{pmatrix}B^T = H^{-1}X(X^TH^{-1}X)^{-1}X^TH^{-1} + L_2(L_2^THL_2)^{-1}L_2^T$$
On both sides, pre-multiply by L1^TH and post-multiply by HL1:
$$\Longrightarrow L_1^THL_1 = L_1^TX(X^TH^{-1}X)^{-1}X^TL_1 + (L_1^THL_2)(L_2^THL_2)^{-1}(L_2^THL_1)$$
$$= (X^TH^{-1}X)^{-1} + (L_1^THL_2)(L_2^THL_2)^{-1}(L_2^THL_1)$$
since L1^TX = Ip. Therefore,
$$y_1^*\mid y_2^* \sim N\!\left(\beta + L_1^THL_2(L_2^THL_2)^{-1}y_2^*,\ (X^TH^{-1}X)^{-1}\right)$$
Because y2* ∼ N(0, L2^THL2), the residual likelihood is:
$$l_R = C - \frac{1}{2}\left(\log|L_2^THL_2| + y_2^{*T}(L_2^THL_2)^{-1}y_2^*\right) = C - \frac{1}{2}\left(\log|L_2^THL_2| + y^TL_2(L_2^THL_2)^{-1}L_2^Ty\right)$$
Let P = H^{−1} − H^{−1}X(X^TH^{−1}X)^{−1}X^TH^{−1} = L2(L2^THL2)^{−1}L2^T. And we have:
$$|L^THL| = |L_2^THL_2|\left|L_1^THL_1 - (L_1^THL_2)(L_2^THL_2)^{-1}(L_2^THL_1)\right| = |L_2^THL_2|\,|(X^TH^{-1}X)^{-1}|$$
$$\log|L^THL| = \log|LL^TH| = \log|LL^T| + \log|H|$$
$$\Longrightarrow \log|L_2^THL_2| = \log|L^THL| + \log|X^TH^{-1}X| = \log|LL^T| + \log|H| + \log|X^TH^{-1}X|$$
Therefore,
$$l_R = C^* - \frac{1}{2}\left(\log|H| + \log|X^TH^{-1}X| + y^TPy\right)$$
which is the REML log-likelihood for the parameters in H; the log-likelihood for the fixed effects is based on y1*∣y2*. (Note: log|LL^T| is merged into the constant term.)
Comparison between MLE and REML:
① REML removes the bias of ω̂MLE.
② REML does not estimate β directly, so it is invariant to the value of β.
③ REML is less sensitive to outliers than MLE.
④ The difference between REML and ML → 0 as m → ∞, where m is the number of clusters (e.g., independent families), not the overall sample size.
⑤ From a Bayesian perspective, using u = A^TY is equivalent to ignoring any prior information on β. In the absence of information on β, no information about ω is lost by using u = A^TY instead of Y.
⑥ Var(ω̂REML) ≥ Var(ω̂MLE) ⟹ REML is less efficient than MLE.
REML: unbiased, less efficient;
MLE: biased, more efficient.
MSE = Var + bias². Let p be the number of β's.
If p ≤ 4, then MSE(ω̂MLE) < MSE(ω̂REML); if p > 4, MSE(ω̂MLE) > MSE(ω̂REML).
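In the fixed-effects-only case the contrast is concrete: σ̂²_MLE = RSS/n is biased downward by the factor (n − p)/n, while the REML-type estimator RSS/(n − p) is unbiased. A quick simulation (Python sketch; all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 20, 3, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual projection I - P

mle, reml = [], []
for _ in range(2000):
    y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)
    rss = y @ M @ y
    mle.append(rss / n)          # E = sigma2 (n - p)/n = 3.4: biased downward
    reml.append(rss / (n - p))   # E = sigma2 = 4.0: unbiased

mean_mle, mean_reml = np.mean(mle), np.mean(reml)
```

With n = 20 and p = 3 the MLE averages near 3.4 rather than the true 4.0; the bias shrinks as n grows, matching point ④ above.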
Inference for β
Simple test: H0: βj = 0 vs HA: βj ≠ 0. $\hat V(\hat\beta) = (X^T\hat V^{-1}X)^{-1}$.
Wald test: $\hat\beta_j/\hat\sigma_j \stackrel{\cdot}{\sim} N(0,1)$ → t-test; $\hat\beta_j^2/\hat\sigma_j^2 \stackrel{\cdot}{\sim} \chi^2(1)$ → F-test.
Debate on degrees of freedom: for linear models, df = n − p; for longitudinal data, $n = \sum_{i=1}^m n_i$, so df $= \sum_{i=1}^m n_i - p$ (?). Use an approximation in practice, e.g., Satterthwaite.
Contrast test: H0: Lβ = 0. Var(Lβ̂) = L Var(β̂) L^T.
Wald test: $T_L = (L\hat\beta)^T\left[L\hat V(\hat\beta)L^T\right]^{-1}(L\hat\beta) \stackrel{\cdot}{\sim} \chi^2(r)$ under H0, r = rank(L).
Model comparison:
① nested models: LRT ∼ χ²_k
② non-nested models:
AIC = −2(l − p), where l is the log-likelihood and p = #(β, ω) (penalizes the number of parameters)
BIC = −2[l − (p/2) log N]
Robust Variance Estimator
If V is known, the generalized least squares (GLS) estimator of β is β̂ = (X^TV^{−1}X)^{−1}X^TV^{−1}Y.
$$\mathrm{Var}(\hat\beta) = (X^TV^{-1}X)^{-1}X^TV^{-1}\,\mathrm{Var}(Y)\,V^{-1}X(X^TV^{-1}X)^{-1}$$
$$\approx (X^TV^{-1}X)^{-1}X^TV^{-1}\underbrace{(Y-\hat Y)(Y-\hat Y)^T}_{\widehat{\mathrm{Var}}(Y),\ \text{the empirical variance of } Y}V^{-1}X(X^TV^{-1}X)^{-1}$$
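A sketch of this sandwich in Python (simulated clusters with a deliberately misspecified working-independence V, to show the estimator's point; all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, ni = 50, 4
data = []
for _ in range(m):
    Xi = np.column_stack([np.ones(ni), rng.normal(size=ni)])
    yi = Xi @ np.array([1.0, 0.5]) + rng.normal() + rng.normal(size=ni)  # shared cluster effect
    data.append((Xi, yi))

Vinv = np.eye(ni)                                  # working V^{-1}: independence (wrong here)
A = sum(Xi.T @ Vinv @ Xi for Xi, _ in data)        # bread: X'V^{-1}X
beta = np.linalg.solve(A, sum(Xi.T @ Vinv @ yi for Xi, yi in data))

meat = np.zeros((2, 2))
for Xi, yi in data:
    r = yi - Xi @ beta                             # (Y_i - Y_hat_i)
    meat += Xi.T @ Vinv @ np.outer(r, r) @ Vinv @ Xi
Ainv = np.linalg.inv(A)
var_robust = Ainv @ meat @ Ainv                    # the sandwich above, cluster by cluster
```

Even though the working V ignores the shared cluster effect, the sandwich uses the empirical residual products, so the variance estimate remains valid for large m.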
Example 1.2.7. Let Yij be the pain score for woman i at time tj, and let Ti = 0 for control, 1 for treatment.
1. Model (fixed slope): Yij = β0 + β1tj + β2Ti + b2Ti + eij, with b2 ∼ N(0, σ²_b), eij ∼ N(0, σ²), b2 ⊥ eij.
E(Yij) = β0 + β1tj + β2Ti, and
$$\mathrm{Var}(Y_{ij}) = T_i^2\mathrm{Var}(b_2) + \mathrm{Var}(e_{ij}) = \begin{cases}\sigma_b^2 + \sigma^2 & \text{if } T_i = 1\\ \sigma^2 & \text{if } T_i = 0.\end{cases}$$
Hence, the variation of Yij in the treatment group > the control group.
There is no interaction term: no matter which treatment is given, the pain score progresses at the same rate over time ⟹ the same slope.
2. Model: Yij = β0 + β1tj + β2Ti + b2Si + eij, where Si = 1 − Ti.
Hence, the variation of Yij in the treatment group < the control group.
3. Model (random slope): Yij = β0 + β1tj + β2Ti + β3Titj + b2Ti + b3Titj + eij.
Assuming independent random effects, Var(Yij) = T_i² Var(b2) + T_i² t_j² Var(b3) + Var(eij).
Hence, the variation of Yij in the treatment group > the control group.
Two-stage model:
Stage I: Yij = β0i + β1itj + eij models the individual measurements (e.g., repeated measures from patients)
Stage II: $\begin{pmatrix}\beta_{0i}\\ \beta_{1i}\end{pmatrix} = \begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix} + \begin{pmatrix}b_{0i}\\ b_{1i}\end{pmatrix}$ models the differences between patients (normality assumption)
⟹ Yij = (β0 + b0i) + (β1 + b1i)tj + eij = β0 + β1tj + b0i + b1itj + eij.
Assume E(bi) = 0, $\mathrm{Var}(b_i) = D = \begin{pmatrix} d_{11} & d_{12}\\ d_{12} & d_{22}\end{pmatrix}$.
Yi = Xiβ + Zibi + ei ⟹ Var(Yi) = ZiDZ_i^T + Ri
Var(b0i) = d11 ⟹ a 95% plausible range for the intercept is β̂0 ± 1.96√d11
Var(b1i) = d22 ⟹ a 95% plausible range for the slope is β̂1 ± 1.96√d22
① H0: $\mathrm{Var}(b_i) = \begin{pmatrix} d_{11} & 0\\ 0 & d_{22}\end{pmatrix}$ ⟹ d12 = 0, i.e., b0i ⊥ b1i
② H0: $\mathrm{Var}(b_i) = \begin{pmatrix} d_{11} & 0\\ 0 & 0\end{pmatrix}$ ⟹ d12 = 0 and d22 = 0. Problem: this tests on the boundary of the parameter space, so the χ²₂ distribution no longer holds. Usually LRT ∼ χ²_k under H0, but in this case LRT ∼ a 50:50 mixture of χ²₁ and χ²₂. (Ref: Verbeke & Molenberghs, Ch. 6.3)
Best Linear Unbiased Predictor (BLUP)
Subjects with "large" b0i are very "different". It would be nice to have values of b0i.
Difficulty: b0i is random, so it can be any one of an infinite number of values. For prediction, we often use expected values. E.g., given Y1, . . . , Yn, predict Yn+1 by Ȳn ≈ E(Y).
For b01, . . . , b0m the problem is that we cannot use the marginal expectation, since E(b0i) = 0.
Solution: instead of the marginal distribution, use the expectation under the conditional distribution.
Let b̂0i = E(b0i∣Yi), the function c(Yi) that minimizes the MSE E[b0i − c(Yi)]².
Simple case: Yij = µ + bi + eij, bi ∼ N(0, D) (random-effect one-way ANOVA).
Here D is a scalar, eij ∼ N(0, σ²), eij ⊥ bi.
Yij∣bi ∼ N(µ + bi, σ²)
$$f(b_i\mid Y_i) = \frac{f(Y_i\mid b_i)f(b_i)}{f(Y_i)} \propto f(Y_i\mid b_i)f(b_i) \sim \text{normal distribution.}$$
$$f(b_i\mid Y_i) \propto (\sigma^2)^{-n_i/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\mu-b_i)^2\right\}D^{-1/2}\exp\left\{-\frac{b_i^2}{2D}\right\}$$
$$\propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^{n_i}\left[(Y_{ij}-\mu)^2 - 2b_i(Y_{ij}-\mu) + b_i^2\right] - \frac{b_i^2}{2D}\right\}$$
$$= \exp\left\{-\left[\frac{n_i}{2\sigma^2} + \frac{1}{2D}\right]b_i^2 + \frac{b_i}{\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\mu) - \frac{1}{2\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\mu)^2\right\}$$
$$\Longrightarrow b_i\mid Y_i \sim N(\mu_{b\mid Y},\ \sigma^2_{b\mid Y}),\qquad f(b_i\mid Y_i) \propto \exp\left\{-\frac{1}{2\sigma^2_{b\mid Y}}\left(b_i^2 - 2b_i\mu_{b\mid Y}\right) - \frac{\mu_{b\mid Y}^2}{2\sigma^2_{b\mid Y}}\right\}$$
Matching coefficients,
$$\frac{1}{2\sigma^2_{b\mid Y}} = \frac{n_i}{2\sigma^2} + \frac{1}{2D} \Longrightarrow \sigma^2_{b\mid Y} = \left(\frac{n_iD+\sigma^2}{\sigma^2 D}\right)^{-1} = \frac{\sigma^2 D}{n_iD+\sigma^2}$$
$$\frac{\mu_{b\mid Y}}{\sigma^2_{b\mid Y}} = \frac{1}{\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\mu) \Longrightarrow \mu_{b\mid Y} = \frac{\sigma^2_{b\mid Y}}{\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\mu) = \frac{n_iD}{n_iD+\sigma^2}(\bar Y_i - \mu) < \bar Y_i - \mu$$
In the model Yij = β0 + b0i + eij, a naive estimate would be b̃0i = Ȳi − µ̂, where Ȳi is the cluster mean and µ̂ the population average. Instead,
$$\hat b_i = E(b_i\mid Y_i) = \frac{n_iD}{n_iD+\sigma^2}(\bar Y_i - \mu)$$
is the shrinkage estimate (shrunk toward the population mean). b̂i is called the shrinkage estimator, a weighted deviation of Ȳi from µ.
b̂i is unbiased: $E(\hat b_i) = \frac{n_iD}{n_iD+\sigma^2}E(\bar Y_i - \mu) = 0$.
In practice, plugging in estimates, $\hat b_i = \frac{n_i\hat D}{n_i\hat D+\hat\sigma^2}(\bar Y_i - \hat\mu)$ ⟶ Empirical BLUP (EBLUP).
Define µi = E(Yij∣bi) = µ + bi.
Predict: $\hat\mu_i = \hat\mu + \hat b_i = \frac{n_iD}{n_iD+\sigma^2}\bar Y_i + \frac{\sigma^2}{n_iD+\sigma^2}\hat\mu$ vs µ̂i = Ȳi.
① If D ≫ σ², then µ̂i → Ȳi.
② If σ² ≫ D, then µ̂i → µ̂.
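The shrinkage weight niD/(niD + σ²) and the two limiting cases are easy to see numerically (Python sketch; all values hypothetical):

```python
def blup(ybar_i, n_i, mu, D, sigma2):
    # b_hat_i = n_i D / (n_i D + sigma^2) * (ybar_i - mu)
    w = n_i * D / (n_i * D + sigma2)
    return w * (ybar_i - mu)

# cluster mean 3.0, population mean 1.0, so the unshrunken deviation is 2.0
b_big_D = blup(3.0, n_i=5, mu=1.0, D=100.0, sigma2=1.0)   # D >> sigma^2: almost no shrinkage
b_big_s = blup(3.0, n_i=5, mu=1.0, D=0.01, sigma2=100.0)  # sigma^2 >> D: shrunk nearly to 0
```

The first call returns nearly the raw deviation 2.0; the second shrinks it almost entirely to 0, as in cases ① and ② above.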
Summary: Yi = Xiβ + Zibi + εi, bi ∼ N(0, D), ei ∼ N(0, Ri), bi ⊥ ei.
Hence, Yi ∼ N(Xiβ, Σi) where Σi = ZiDZ_i^T + Ri.
$$\hat b_i = E(b_i\mid Y_i) = DZ_i^T\Sigma_i^{-1}(Y_i - X_i\hat\beta)$$
Marginal mean: $\hat\mu_i = \widehat{E(Y_i)} = X_i\hat\beta$
Conditional mean:
$$\widehat{E(Y_i\mid b_i)} = X_i\hat\beta + Z_i\hat b_i = Z_iDZ_i^T\Sigma_i^{-1}Y_i + (I_{n_i} - Z_iDZ_i^T\Sigma_i^{-1})X_i\hat\beta = W_{i1}Y_i + W_{i2}\hat\mu_i,$$
a weighted average of the data Yi and the marginal mean µ̂i.
Bayesian interpretation:
E(bi) = 0 is the prior mean of bi, before data collection.
E(bi∣Yi) = b̂i is the posterior mean, after data collection.
EBLUP → Empirical Bayes estimator
Variance Components
Yi = Xiβ + Zibi + ei = Xiβ + εi ⟹ β̂ = (X^TΣ^{−1}X)^{−1}X^TΣ^{−1}Y, where Σ = ZDZ^T + R.
Σ: covariance matrix of Y; D: random-effects covariance; R: residual covariance.
If Zi = Z and Ri = σ²I_{ni}, then β̂ is invariant to the choice of Z: β̂ = (X^TX)^{−1}X^TY.
Example 1.2.8. Yi = Xiβ + Zibi + ei
1. Random intercept: $Z_i = \begin{pmatrix}1\\ 1\\ 1\end{pmatrix}$, Var(bi) = d11 = D, Var(ei) = σ²I.
Assume ni = 3:
$$\mathrm{Var}(Y_i) = Z_iDZ_i^T + R_i = \begin{pmatrix}1\\1\\1\end{pmatrix}d_{11}\begin{pmatrix}1\\1\\1\end{pmatrix}^T + \sigma^2 I_3 = \begin{pmatrix} d_{11}+\sigma^2 & d_{11} & d_{11}\\ d_{11} & d_{11}+\sigma^2 & d_{11}\\ d_{11} & d_{11} & d_{11}+\sigma^2\end{pmatrix} = \sigma_1^2\begin{pmatrix}1 & \rho & \rho\\ \rho & 1 & \rho\\ \rho & \rho & 1\end{pmatrix}$$
where σ₁² = d11 + σ² and ρ = d11/(d11 + σ²).
2. Random intercept and random slope:
$$b_i = \begin{pmatrix}b_{0i}\\ b_{1i}\end{pmatrix},\quad D = \begin{pmatrix}d_{11} & 0\\ 0 & d_{22}\end{pmatrix},\quad R_i = \sigma^2 I_{n_i},\quad Z_i = \begin{pmatrix}1 & t_1\\ 1 & t_2\\ 1 & t_3\end{pmatrix}$$
$$\Sigma_i = Z_iDZ_i^T + R_i = \begin{pmatrix} d_{11}+t_1^2d_{22}+\sigma^2 & d_{11}+t_1t_2d_{22} & d_{11}+t_1t_3d_{22}\\ d_{11}+t_1t_2d_{22} & d_{11}+t_2^2d_{22}+\sigma^2 & d_{11}+t_2t_3d_{22}\\ d_{11}+t_1t_3d_{22} & d_{11}+t_2t_3d_{22} & d_{11}+t_3^2d_{22}+\sigma^2\end{pmatrix}$$
$$\mathrm{Corr}(Y_{i1},Y_{i2}) = \frac{d_{11}+t_1t_2d_{22}}{\sqrt{d_{11}+t_1^2d_{22}+\sigma^2}\sqrt{d_{11}+t_2^2d_{22}+\sigma^2}}$$
Hence, t ↑ ⟹ Var(Yit) ↑, and ∣t1 − t2∣ ↑ ⟹ Corr ↓.
Let $R_i = \sigma^2\begin{pmatrix}1 & \rho & \rho^2\\ \rho & 1 & \rho\\ \rho^2 & \rho & 1\end{pmatrix}$. Then $\mathrm{Corr}(Y_{i1},Y_{i2}) = \dfrac{d_{11}+t_1t_2d_{22}+\rho\sigma^2}{\sqrt{d_{11}+t_1^2d_{22}+\sigma^2}\sqrt{d_{11}+t_2^2d_{22}+\sigma^2}}$.
3. (a) $Z_i = \begin{pmatrix}T_i\\ T_i\\ T_i\end{pmatrix}$, with Ti = 0 for control, 1 for treatment.
$$\mathrm{Var}(Y_i) = d_{11}T_i\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix} + \sigma^2I_3 = \begin{cases}\sigma^2 I & \text{if } T_i = 0\\[6pt] \begin{pmatrix}d_{11}+\sigma^2 & d_{11} & d_{11}\\ d_{11} & d_{11}+\sigma^2 & d_{11}\\ d_{11} & d_{11} & d_{11}+\sigma^2\end{pmatrix} & \text{if } T_i = 1\end{cases}$$
If we want correlation in both groups, then:
(b) Let Ri be compound-symmetric:
$$\mathrm{Var}(Y_i) = d_{11}T_iJ_{3\times3} + \sigma^2\begin{pmatrix}1&\rho&\rho\\ \rho&1&\rho\\ \rho&\rho&1\end{pmatrix} = \begin{cases}\sigma^2\begin{pmatrix}1&\rho&\rho\\ \rho&1&\rho\\ \rho&\rho&1\end{pmatrix} & \text{if } T_i = 0\\[6pt] \begin{pmatrix}d_{11}+\sigma^2 & d_{11}+\rho\sigma^2 & d_{11}+\rho\sigma^2\\ d_{11}+\rho\sigma^2 & d_{11}+\sigma^2 & d_{11}+\rho\sigma^2\\ d_{11}+\rho\sigma^2 & d_{11}+\rho\sigma^2 & d_{11}+\sigma^2\end{pmatrix} & \text{if } T_i = 1\end{cases}$$
(c) $Z_i = \begin{pmatrix}1 & T_i\\ 1 & T_i\\ 1 & T_i\end{pmatrix}$, Ri = σ²I.
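Case 1 above (random intercept ⟹ compound symmetry) can be verified numerically; a Python sketch with hypothetical values d11 = 2, σ² = 1:

```python
import numpy as np

d11, sigma2, ni = 2.0, 1.0, 3
Z = np.ones((ni, 1))                                   # random-intercept design
V = Z @ np.array([[d11]]) @ Z.T + sigma2 * np.eye(ni)  # Z D Z' + R

sigma1_sq = d11 + sigma2                               # common variance d11 + sigma^2
rho = d11 / (d11 + sigma2)                             # intra-class correlation
CS = sigma1_sq * ((1 - rho) * np.eye(ni) + rho * np.ones((ni, ni)))
# V equals CS exactly: a random intercept induces exchangeable correlation
```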
Resampling Methods
Model: Yi = Xiβ + Zibi + ei, bi ∼ N(0, D), ei ∼ N(0, Ri), bi ⊥ ei.
Model-based estimate: β̂ = (X^TΣ̂^{−1}X)^{−1}X^TΣ̂^{−1}Y, where Σ = ZDZ^T + R, Var(β̂) = (X^TΣ̂^{−1}X)^{−1}.
If the model is misspecified, then we can use the empirical/robust Var(β̂).
1. Jackknife
We have observations X1, . . . , Xn i.i.d. ∼ F(x; θ). From the data we compute an estimate of θ, e.g., θ̂ = X̄ or $\hat\theta = \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2$.
Let θ̂(i) be the estimate of θ leaving out Xi. So there are n such estimates.
$$\hat\theta_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n\hat\theta_{(i)},\qquad \widehat{\mathrm{bias}}(\hat\theta) = (n-1)\left[\hat\theta_{(\cdot)} - \hat\theta\right]$$
Example 1.2.9. θ = E(X), θ̂ = X̄:
$$\hat\theta_{(i)} = \frac{n\bar X - X_i}{n-1} \Longrightarrow \hat\theta_{(\cdot)} = \frac{\sum_{i=1}^n(n\bar X - X_i)}{n(n-1)} = \frac{n}{n-1}\bar X - \frac{1}{n(n-1)}\sum_{i=1}^n X_i = \bar X \Longrightarrow \widehat{\mathrm{bias}}(\hat\theta) = 0$$
Define the pseudo-values θ̃i = θ̂ + (n − 1)[θ̂ − θ̂(i)].
The jackknife estimator is $\tilde\theta = \frac{1}{n}\sum_{i=1}^n\tilde\theta_i$.
bias(θ̂) = O(1/n) and bias(θ̃) = O(1/n²) ⟹ the bias of θ̃ drops faster than the bias of θ̂.
$$\widehat{\mathrm{Var}}(\hat\theta) = \frac{n-1}{n}\sum_{i=1}^n\left[\hat\theta_{(i)} - \hat\theta_{(\cdot)}\right]^2 = \frac{1}{n(n-1)}\sum_{i=1}^n(\tilde\theta_i - \tilde\theta)^2.\quad \text{Reference: Tukey (1958).}$$
Application of the jackknife to repeated measures: re-estimate β after removing each independent cluster.
$$\hat\beta_{(i)} = \left(\sum_{j=1,j\neq i}^m X_j^T\Sigma_j^{-1}X_j\right)^{-1}\left(\sum_{j=1,j\neq i}^m X_j^T\Sigma_j^{-1}Y_j\right),\qquad \hat\beta_{(\cdot)} = \frac{1}{m}\sum_{i=1}^m\hat\beta_{(i)}$$
$$\widehat{\mathrm{Var}}(\hat\beta) = \frac{m-1}{m}\sum_{i=1}^m\left[\hat\beta_{(i)} - \hat\beta_{(\cdot)}\right]^2 = \frac{(m-1)^2}{m}\cdot\frac{1}{m-1}\sum_{i=1}^m\left[\hat\beta_{(i)} - \hat\beta_{(\cdot)}\right]^2$$
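The basic jackknife above can be sketched in a few lines (Python; `stat` is any estimator of θ):

```python
import numpy as np

def jackknife(x, stat):
    # leave-one-out estimates, bias estimate, and Tukey's variance estimate
    n = len(x)
    theta_hat = stat(x)
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    bias = (n - 1) * (loo.mean() - theta_hat)
    var = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
    return bias, var

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
bias_mean, var_mean = jackknife(x, np.mean)
# for the sample mean: bias exactly 0 (Example 1.2.9) and var = s^2/n
```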
2. Bootstrap
In the real world, the observed data X = (X1, . . . , Xn) give θ̂ = S(X). Ideally we would resample from the population many times, with no selection bias; in practice, experiments are expensive.
In the bootstrap world, the sample X is viewed as the population. We resample from X: X* = (X*₁, . . . , X*ₙ) ⟹ θ̂* = S(X*).
There are $\binom{2n-1}{n}$ unique bootstrap samples ⟹ $\hat\theta^{*(1)},\cdots,\hat\theta^{*\left(\binom{2n-1}{n}\right)}$.
⟹ Bootstrap distribution of θ̂* ≈ true distribution of θ̂.
$$\hat\theta^{*(\cdot)} = \frac{1}{B}\sum_{b=1}^B\hat\theta^{*(b)},\quad B = \binom{2n-1}{n},\qquad \widehat{\mathrm{Var}}(\hat\theta) = \frac{1}{B-1}\sum_{b=1}^B\left[\hat\theta^{*(b)} - \hat\theta^{*(\cdot)}\right]^2$$
Application of the bootstrap to repeated measures:
(1) Resample unique subjects/clusters (IDs): sample one unique patient ID at a time; the whole cluster then goes into the new sample. Otherwise, e.g., drawing 5 observations from cluster 1 and 7 from cluster 2 would create bias. Here the assumption is that cluster sizes are equal. If cluster sizes vary, one possible solution is to split the clusters into equal-size groups.
(2) Compute β̂ and denote it β̂*(b). We do not consider the standard error estimate at this step; we use the bootstrap estimate of variance instead.
(3) Repeat steps (1) and (2) B times, where B = $\binom{2m-1}{m}$ (the total number of unique draws from the sample) or a pre-specified number, e.g., 500.
$$\Longrightarrow \widehat{\mathrm{Var}}(\hat\beta) = \frac{1}{B-1}\sum_{b=1}^B\left[\hat\beta^{*(b)} - \hat\beta^{*(\cdot)}\right]\left[\hat\beta^{*(b)} - \hat\beta^{*(\cdot)}\right]^T \quad\text{where } \hat\beta^{*(\cdot)} = \frac{1}{B}\sum_{b=1}^B\hat\beta^{*(b)}$$
Estimate the confidence intervals:
i. $\hat\beta_j^{*(\cdot)} \pm 1.96\,\widehat{se}(\hat\beta_j^{*(\cdot)})$
ii. the (2.5th, 97.5th) percentiles of the bootstrap distribution
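Steps (1)-(3) can be sketched as follows (Python; toy clusters and a grand-mean "fit" stand in for the real mixed-model fit):

```python
import numpy as np

def cluster_bootstrap(clusters, fit, B=500, seed=0):
    # resample whole clusters with replacement, refit, return the B estimates
    rng = np.random.default_rng(seed)
    m = len(clusters)
    stats = []
    for _ in range(B):
        ids = rng.integers(0, m, size=m)        # step (1): draw m cluster IDs
        sample = [clusters[i] for i in ids]     # each drawn cluster enters whole
        stats.append(fit(sample))               # step (2): refit on the new sample
    return np.array(stats)

rng = np.random.default_rng(1)
clusters = [rng.normal() + rng.normal(size=4) for _ in range(30)]
boot = cluster_bootstrap(clusters, lambda cs: np.concatenate(cs).mean())
var_boot = boot.var(ddof=1)                     # step (3): bootstrap variance
ci = np.percentile(boot, [2.5, 97.5])           # percentile interval (method ii)
```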
1.2.3 Generalized Linear Mixed Models (GLMM)
GLMM extends LMM to non-normal outcomes through a link function (e.g., the logit). It models correlation through random effects:
E(Yij∣bi) = µij, with the linear predictor Xijβ + Zijbi entering through a link, e.g., logit(µij) = Xijβ + Zijbi, bi ∼ N(0, D)
Objectives:
1. interested in the relationship between X and Y: Yij ∼ Xij
2. interested in the correlations as well
3. the β's have a subject-specific interpretation
Recall GLM: g(µi) = Xiβ. GLMM: E(Yij∣bi) = µij, g(µij) = X_ij^Tβ + Z_ij^Tbi, bi ∼ N(0, D(θ)).
$$L(Y_i\mid \beta,\theta) = \int\prod_{j=1}^{n_i}L(Y_{ij}\mid b_i)\,L(b_i)\,db_i$$
Remarks:
1. L(·) does not have a closed form except when Yij is normal.
2. We need to evaluate the integral numerically, which is difficult in high dimensions. A Riemann (grid) approximation with, say, 100 grid points per dimension costs 100^q evaluations for a q-dimensional random effect, which is very slow.
3. Methods: (1) approximation, (2) Gauss-Hermite quadrature, (3) Gibbs sampling - BUGS.
Reference: Breslow & Clayton (1993). JASA; DLZ (1994).
The quasi-likelihood of (β, θ) is
$$L(\beta,\theta) \propto |D|^{-1/2}\int\exp\left\{\sum_{i=1}^n l_i(y_i\mid\beta,b) - \frac{1}{2}b^TD^{-1}b\right\}db \quad\text{where } l_i(y_i\mid\beta,b) \propto \int_{y_i}^{\mu_i^b}\frac{y_i-u}{\phi V(u)}\,du.$$
Example 1.2.10.
1. Clustered binomial data:
$$\mathrm{logit}(\mu_{ij}^b) = X_{ij}^T\beta + Z_{ij}^Tb_i,\quad E(Y_{ij}\mid b_i) = \mu_{ij}^b,\quad \mathrm{Var}(Y_{ij}\mid b_i) = \phi\, m_{ij}^{-1}\mu_{ij}^b(1-\mu_{ij}^b)$$
2. Clustered Poisson data:
$$\log(\mu_{ij}^b) = X_{ij}^T\beta + Z_{ij}^Tb_i,\quad E(Y_{ij}\mid b_i) = \mu_{ij}^b,\quad \mathrm{Var}(Y_{ij}\mid b_i) = \phi\,\mu_{ij}^b$$
Inference in GLMM:
1. Conditional inference:
Idea: find the sufficient statistics for b and make inference using the likelihood conditional on those sufficient statistics.
Advantages:
(a) robustness: no distributional assumption on b
(b) the likelihood has a closed form
Disadvantages:
(a) Only works in special cases, e.g., logistic GLMM; Poisson log-linear GLMM with random intercept
(b) Some covariate effects cannot be estimated, e.g., cluster-level covariate effects in conditional logistic regression
Details: Treat bi as fixed, θij = θij(β, bi):
$$\prod_{i=1}^m\prod_{j=1}^{n_i}f(Y_{ij}\mid\beta,b_i) \propto \prod_{i=1}^m\prod_{j=1}^{n_i}\exp\left(\theta_{ij}Y_{ij} - \Psi(\theta_{ij})\right)$$
Using the exponential family with canonical link, θij = X_ij^Tβ + Z_ij^Tbi, the sufficient statistics are
$$a_i = \sum_{i,j}x_{ij}y_{ij} \text{ for } \beta,\qquad \tilde b_i = \sum_j Z_{ij}y_{ij} \text{ for } b_i.$$
See Diggle, Heagerty, Liang, and Zeger, "Analysis of Longitudinal Data", 2002. Conditioning on the sufficient statistics for the bi,
$$\prod_{i=1}^m f\left(Y_i\,\Big|\,\sum_j Z_{ij}y_{ij} = \tilde b_i,\ \beta\right) = \cdots = \prod_{i=1}^m\frac{\sum_{R_{i1}}\exp(\beta^Ta_i)}{\sum_{R_{i2}}\exp\left(\beta^T\sum_{j=1}^{n_i}x_{ij}y_{ij}\right)}$$
2. Approximate inference:
Idea: approximate l(β, θ) using various approximations.
(a) Laplace approximation: expand the integrand of l(β, θ) about the mode, b = b̂, in a Taylor series before integration.
Ref: Tierney and Kadane, 1986, JASA; Lin and Pierce, 1993, Biometrika; Breslow and Lin, 1995, Biometrika.
(b) Solomon-Cox approximation: expand the integrand of l(β, θ) about the mean of the random effects.
Ref: Barndorff-Nielsen and Cox, 1989, Chapter 3.3; Solomon and Cox, 1992, Biometrika; Breslow and Lin, 1995, Biometrika.
(c) Penalized quasi-likelihood (PQL): a modified Laplace approach; iteratively fit a linear mixed model using the GLM working weight and working vector, i.e., repeatedly call Proc Mixed in SAS (GLIMMIX).
Performs poorly (biased β̂) when: 1) Poisson with a small mean, or binomial with small p or n; 2) the random-effect variances are large or highly correlated.
Ref: Schall, 1991, Biometrika; Breslow and Clayton, 1993, JASA.
(d) Corrected PQL (cPQL): PQL does not work well for sparse data (the Laplace approximation does not perform well). Use simple correction terms for β̂PQL and θ̂PQL to reduce bias.
Ref: Breslow and Lin, 1995, Biometrika; Lin and Breslow, 1996, JASA.
Note: All these approximation procedures generally do not give consistent estimates of β and θ except in the normal case.
3. E-M algorithm: convert the problem to a missing-data scenario, i.e., the random effects are missing.
Complete data: Y, b
Incomplete data: Y
E-step: Use expectations to "impute" the missing values. Specifically, calculate the expected value of the sufficient statistic, conditional on the observed data:
$$t = E\left[bb^T\mid Y,\theta\right]$$
which involves the same integration we would like to avoid in likelihood inference.
(a) Gaussian approximation. Ref: Stiratelli, Laird, Ware, 1982, Biometrics.
(b) Second-order Laplace approximation. Ref: Steele, 1996, Biometrics.
(c) Monte Carlo simulation (using the Metropolis method). Ref: McCulloch, 1994, 1997, JASA; Wallner, et al., 1997, JASA.
M-step: estimate θ using the imputed data (sufficient statistic).
4. Gibbs sampling.
Prior for β: flat prior;
Prior for D(θ): Gamma (the Jeffreys prior does not work since the posterior is not proper).
Objective: generate the joint distribution [β, θ, b∣y]
Idea: generate a series of conditional distributions: [β∣θ, b, y], [b∣β, θ, y], [θ∣β, b, y]
Ref: Zeger and Karim, 1991, JASA; McCulloch, 1994, JASA; Gelfand, Sahu, Carlin, 1995, Bayesian Stat.
1.2.4 Generalized Estimating Equations (GEE)
Generalized estimating equations give a marginal model for the nonlinear/non-normal case.
Reference: Liang & Zeger, 1986. Biometrika. p13-22.
Longitudinal/clustered data:
* K independent subjects/clusters
* ni observations over time/within a cluster
$$Y_i = (Y_{i1},\ldots,Y_{in_i})^T,\qquad X_i = (X_{i1},\ldots,X_{in_i})^T$$
Question: How can we extend GLM to correlated data?
Usually, Yi∣bi ∼ F, and L(Yi) = ∫L(Yi∣bi)f(bi)dbi is difficult to specify or compute.
Objective: we are only interested in Yi ∼ Xi, treating the correlations as nuisance parameters:
– make as few assumptions as possible;
– construct consistent and asymptotically normal regression coefficient estimates.
Distributional assumption: only specify the marginal distribution of Yij using quasi-likelihood:
$$E(Y_{ij}) = \mu_{ij},\quad \mathrm{Var}(Y_{ij}) = \phi V(\mu_{ij}) \Longrightarrow \text{QL: } l(Y_{ij}) = \int_{y_{ij}}^{\mu_{ij}}\frac{y_{ij}-u}{\phi V(u)}\,du.$$
Mean model: g(µij) = X_ij^Tβ.
Independent case: g(µi) = X_i^Tβ, with quasi-score
$$\sum_{i=1}^n D_i^TV_i^{-1}(y_i-\mu_i) = 0 \quad\text{where } D_i = \frac{\partial\mu_i}{\partial\beta^T} \text{ is a } 1\times p \text{ row vector and } V_i = \mathrm{Var}(y_i) \text{ is a scalar.}$$
Generalized estimating equations:
$$\sum_{i=1}^k \underbrace{D_i^T}_{p\times n_i}\ \underbrace{V_i^{-1}}_{n_i\times n_i}\ \underbrace{(y_i-\mu_i)}_{n_i\times 1} = 0 \quad\text{where } V_i = \mathrm{Cov}(y_i) \text{ and } D_i = \frac{\partial\mu_i}{\partial\beta^T} = \begin{pmatrix}\partial\mu_{i1}/\partial\beta^T\\ \vdots\\ \partial\mu_{in_i}/\partial\beta^T\end{pmatrix} = \Delta_i^{-1}X_i,\quad \Delta_i = \mathrm{diag}\{g'(\mu_{ij})\}.$$
$V_i = V_{m_i}^{1/2}R_i(\alpha)V_{m_i}^{1/2}$, where $V_{m_i} = \mathrm{diag}\{\phi V(\mu_{ij})\}$ is the marginal variance of Yi, Ri(α) is the working correlation matrix (which we must specify, and which could be wrong), and α is the nuisance correlation parameter(s).
Estimate β
1. Estimate β by IWLS: g(µ) = η = Xβ
(a) Initialize: η̂ = g(Y), and the variance/correlation parameters (φ, α)
(b) Update:
(i) µ = g^{−1}(η), Z = η + (Y − µ)g′(µ), W = diag{[g′(µ)² a(φ)V(µ)]^{−1}},
β̂ = (X^TWX)^{−1}X^TWZ, η = Xβ̂
(ii) Estimate V = V(φ̂, α̂)
Example 1.2.11. Pearson residuals + method of moments (MoM).
Recall that the Pearson residual is $r_k = \frac{Y_k - \hat\mu_k}{\sqrt{V(\hat\mu_k)}}$.
The scale parameter is $\hat\phi = \frac{\chi^2}{N-p} = \frac{\sum_{k=1}^N r_k^2}{N-p}$, where N is the total sample size.
The exchangeable correlation structure is Corr(Yij, Yik) = α, j ≠ k.
So, $E\left(\frac{r_{ij}r_{ik}}{\phi}\right) = \alpha \Longrightarrow \hat\alpha = \frac{\sum_i\sum_{j<k}r_{ij}r_{ik}}{\hat\phi\sum_i n_i(n_i-1)/2}$.
If φ̂ > 1, then we have over-dispersion; if φ̂ < 1, under-dispersion.
Alternative methods to estimate α:
* For binary data, use pairwise odds ratios instead of the Pearson correlation.
Reference: Lipsitz (1991). Biometrika; Liang (1992). JRSS(B).
* Use quasi-least squares (QLS) instead of MoM.
Reference: Chaganty (1997). Journal of Statistical Planning & Inference.
* Anscombe residuals (1953): A(Y) − A(µ̂), where A(·) is a transformation that makes the distribution of A(Y) more normal:
$$A = \int\frac{du}{V^{1/3}(u)};\quad\text{by the delta method, } \mathrm{Var}[A(Y)] \approx [A'(\mu)]^2\mathrm{Var}(Y).$$
E.g., for Poisson data, $r_{ij}^A = \frac{\tfrac{3}{2}\left(Y_{ij}^{2/3} - \hat\mu^{2/3}\right)}{\hat\mu^{1/6}}$.
This is an old method, with no clear advantage over the Pearson residual.
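The moment estimators for φ and the exchangeable α can be sketched directly from the Pearson residuals (Python; `resid` holds one residual vector per cluster):

```python
import numpy as np

def mom_phi_alpha(resid, p):
    # phi_hat = sum r^2 / (N - p); alpha_hat from within-cluster residual products
    r_all = np.concatenate(resid)
    N = len(r_all)
    phi = np.sum(r_all ** 2) / (N - p)
    num, pairs = 0.0, 0
    for r in resid:
        ni = len(r)
        num += (r.sum() ** 2 - np.sum(r ** 2)) / 2   # sum_{j<k} r_j r_k
        pairs += ni * (ni - 1) // 2
    alpha = num / (phi * pairs)
    return phi, alpha

resid = [np.array([1.0, 1.0]), np.array([-1.0, -1.0, -1.0])]
phi_hat, alpha_hat = mom_phi_alpha(resid, p=1)
```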
2. Estimate β by Fisher scoring:
(a) Quasi-likelihood approach: change the variance function in the denominator to the one with correlation.
(b) Fisher scoring: $U(\beta) = \sum_{i=1}^m D_i^TV_i^{-1}(y_i-\mu_i) = 0$.
k-th update:
$$0 = U(\beta^{k+1}) \doteq U(\beta^k) + \frac{\partial U}{\partial\beta^T}(\beta^{k+1}-\beta^k)$$
$$\Longrightarrow \beta^{k+1} = \beta^k + \left(-\frac{\partial U}{\partial\beta^T}\right)^{-1}U(\beta^k)$$
$$\Longrightarrow E\left(-\frac{\partial U}{\partial\beta^T}\right)\beta^{k+1} = E\left(-\frac{\partial U}{\partial\beta^T}\right)\beta^k + U(\beta^k)$$
$$\Longrightarrow \left(\sum_{i=1}^k D_i^TV_i^{-1}D_i\right)\beta^{k+1} = \left(\sum_{i=1}^k D_i^TV_i^{-1}D_i\right)\beta^k + \sum_{i=1}^k D_i^TV_i^{-1}(y_i-\mu_i)$$
Since $D_i = \frac{\partial\mu_i}{\partial\beta^T} = \Delta_i^{-1}X_i$ with $\Delta_i = \mathrm{diag}\{g'(\mu_{ij})\}$, let $W_i = \Delta_i^{-1}V_i^{-1}\Delta_i^{-1}$,
$$X = \begin{pmatrix}X_1\\ \vdots\\ X_m\end{pmatrix}\ (X_i \text{ is } n_i\times p),\qquad W = \begin{pmatrix}W_1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & W_m\end{pmatrix},\qquad Y = \begin{pmatrix}y_1\\ \vdots\\ y_m\end{pmatrix}$$
$$\Longrightarrow (X^TWX)\beta^{k+1} = X^TW\tilde Y,\quad\text{where the working vector } \tilde Y = X\beta^k + \Delta(Y-\mu),\quad \Delta = \begin{pmatrix}\Delta_1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \Delta_m\end{pmatrix}.$$
For large samples, $\hat\beta \stackrel{\cdot}{\sim} N\left(\beta,\ \phi(\Delta^T\Omega^{-1}\Delta)^{-1}\right)$. The estimated variance-covariance matrix is
$$\hat V_\beta = \phi(\Delta^T\Omega^{-1}\Delta)^{-1} = \phi\left(\sum_{j=1}^m\Delta_j^T\Omega_j^{-1}\Delta_j\right)^{-1}\quad\text{where here } \Delta_j = \frac{\partial\mu_j}{\partial\beta^T} \text{ and } \Omega_j = \mathrm{Var}(Y_j).$$
Robust standard error:
$$\hat V_\beta^R = \hat V_\beta\left(\sum_{j=1}^m\Delta_j^T\Omega_j^{-1}S_j\Omega_j^{-1}\Delta_j\right)\hat V_\beta/\phi^2 \quad\text{where } S_j = (Y_j-\hat\mu_j)(Y_j-\hat\mu_j)^T$$
and V̂β is the model-based/naive variance; β̂ is consistent even when V(φ) is misspecified.
Link to previous knowledge:
– a cluster of correlated normal observations → multivariate normal
– a cluster of correlated Poisson observations ↛ a "multivariate Poisson"
For non-normal data, the estimating equations ≠ ∂l(β)/∂β ⟶ QL.
The estimating equations are unbiased ⟶ β̂ is consistent.
Formal Asymptotic Results
Conditions: 1) φ̂, α̂ are √k-consistent (moment estimators).
2) ∂µ/∂β^T converges in probability, uniformly in an open neighborhood of β.
Results: 1) β̂ is consistent.
2) √k(β̂ − β) ∼ N(0, Σ), where
$$\Sigma = \lim_{k\to\infty}\left(\frac{1}{k}\sum_{i=1}^k D_i^TV_i^{-1}D_i\right)^{-1}\left[\frac{1}{k}\sum_{i=1}^k D_i^TV_i^{-1}(y_i-\mu_i)(y_i-\mu_i)^TV_i^{-1}D_i\right]\left(\frac{1}{k}\sum_{i=1}^k D_i^TV_i^{-1}D_i\right)^{-1} = \lim_{k\to\infty}A^{-1}GA^{-1}$$
Corollary. If $V_i = V_{m_i}^{1/2}R_i(\alpha)V_{m_i}^{1/2}$ is correctly specified (so $V_{m_i}$ and Ri are both correctly specified) and E(G) = A, then $\Sigma = \lim_{k\to\infty}A^{-1} = \lim_{k\to\infty}\left(\frac{1}{k}\sum_{i=1}^k D_i^TV_i^{-1}D_i\right)^{-1}$, and β̂ is efficient within the linear estimating function family.
GEE 1 vs GEE 2
* GEE 1: specify the first two moments
– assume β and α are orthogonal (α and β carry completely different information)
– so β̂ is consistent even when V(µ) is misspecified
* GEE 2: estimate β and α simultaneously
– requires modeling the 3rd and 4th moments of Yij
– gives consistent β̂ and α̂ when E(Yij) and V(Yi) are correct
References: Zhao & Prentice (1990). Biometrika; Zhao & Prentice (1991). Biometrika; Liang (1992). JRSS(B).
* Extended GEE (EGEE):
– estimates β and α simultaneously, but only makes assumptions on the 1st and 2nd moments
– estimates α efficiently when the correlation structure is correctly specified
– consistency of β̂ does not require a correct V(µ)
Reference: Hall & Severini (1998). JASA.
General Procedure for Generalized Estimating Equations
1. Define the marginal distribution of Yij:
E(Yij) = µij with link function g(µij) = Xijβ and variance function Var(Yij) = φV(µij).
2. Pick a "working" correlation structure: $\mathrm{Var}(Y_i) = \phi V_i = \phi V_{m_i}^{1/2}R_i(\alpha)V_{m_i}^{1/2}$.
To find Ri(α): 1) run a GLM;
2) get the residuals Yij − µ̂ij = eij;
3) look at the correlation in the eij:
subj   t1    t2    ⋯   t_ni
1      e11   e12   ⋯   e_{1n1}
⋮      ⋮     ⋮          ⋮
m      em1   em2   ⋯   e_{mnm}
* This works only for repeated measures with a small number of time points.
* The data should be ordered and lined up (repeated at the same time points).
Guidelines: 1) Simpler structures are easier to fit.
2) The loss of efficiency from using a simpler structure depends on:
i) the covariate pattern
ii) cluster size: larger size ⟶ greater loss of efficiency
iii) within-cluster correlation: larger correlation ⟶ greater loss of efficiency
If m ≫ n, then "unstructured" may be asymptotically efficient.
Reference: Prentice (1988). Biometrics.
3. Solve for β̂ using IWLS:
marginal distribution = normal ⟹ β̂ is the MLE
marginal distribution ≠ normal ⟹ not the MLE
$$\hat V_\beta = \phi\left(\sum_{i=1}^m\Delta_i^T\Omega_i^{-1}\Delta_i\right)^{-1}$$
is the naive/model-based variance. Alternatively, use the empirical/robust/sandwich/White variance
$$\hat V_\beta^R = \hat V_\beta\left(\sum_{i=1}^m\Delta_i^T\Omega_i^{-1}S_i\Omega_i^{-1}\Delta_i\right)\hat V_\beta/\phi^2 \quad\text{where } S_i = (Y_i-\hat\mu_i)(Y_i-\hat\mu_i)^T$$
4. Test: $H_0: \beta = \begin{pmatrix}\beta_1\\ \beta_2^0\end{pmatrix}$ vs $H_1: \beta = \begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix}$
i) Wald test: $W = (\hat\beta_2-\beta_2^0)^T\left[\widehat{\mathrm{Cov}}(\hat\beta_2)\right]^{-1}(\hat\beta_2-\beta_2^0) \stackrel{\cdot}{\sim} \chi^2_q$ under H0.
ii) Score test: $S = U_{\beta_2}^T(\hat\beta_1,\beta_2^0)\ \mathrm{Cov}\left[U_{\beta_2}(\hat\beta_1,\beta_2^0)\right]^{-1}U_{\beta_2}(\hat\beta_1,\beta_2^0)$
where
$$U_\beta = \begin{pmatrix}U_{\beta_1}\\ U_{\beta_2}\end{pmatrix} = \begin{pmatrix}\dfrac{\partial\mu^T}{\partial\beta_1}V^{-1}(Y-\mu)\\[6pt] \dfrac{\partial\mu^T}{\partial\beta_2}V^{-1}(Y-\mu)\end{pmatrix}.$$
① model-based covariance: Cov[U_{β2}] = I22, the block of the information matrix corresponding to β2.
② robust covariance: obtained from the robust covariance of β̂2 via Cov^R[U_{β2}] = I22 Cov^R(β̂2) I22.
Reference: Rotnitzky & Jewell, 1990. Biometrika.
R: library(gee)
gee(Y ~ X, id = id.var, corstr = "...", family = ..., data = ...)
SAS: proc genmod data = ...;
class id_var;
model ... / dist = ... link = ...;
repeated subject = id_var / type = ... modelse;
run;
1.2.5 Population-average (PA) Model vs Subject-specific (SS) Model
Population-average (PA) model: E(Yij) = µij models population means.
Subject-specific (SS) model: E(Yij∣bi) = µ_ij^b, e.g., β0i = β0 + b0i, β1i = β1 + b1i.
For normal data with the identity link,
$$\mu_i = E(\mu_i^b) = E(X_i\beta + Z_ib_i) = X_i\beta.$$
For non-normal data, this equality fails.
Example 1.2.12. Poisson population-average model vs subject-specific model.
For the population-average model, log(µi) = Xiβ.
For the subject-specific model, log(µ_i^b) = Xiβ + Zibi, bi ∼ N(0, D), so µ_i^b = exp(Xiβ + Zibi).
$$E(Y_i) = E[E(Y_i\mid b_i)] = \underbrace{E(\mu_i^b) = E[\exp(X_i\beta + Z_ib_i)]}_{\text{subject-specific model}} \neq \underbrace{\exp(X_i\beta) = E(Y_i) = \mu_i}_{\text{population-average model}}.$$
Interpretation of β
PA and SS models have different interpretations of the β's. Generally ∣βPA∣ < ∣βSS∣.
Interpretation of β in the PA model: change in the population average.
Interpretation of β in the SS model: conditional on the cluster, a comparison within the cluster.
Variance
In the PA model, we model Σi = Cov(Yi) directly, capturing variances and correlations from all sources with a single model.
In the SS model, Σi = ZiDZ_i^T + Ri, as in linear mixed models.
Example 1.2.13. LMM: Var(Yi) = ZiDZ_i^T + Ri.
Assume $D = \begin{pmatrix}d_{11} & d_{12}\\ d_{12} & d_{22}\end{pmatrix}$, $Z_i = \begin{pmatrix}1 & 1\\ 1 & 2\\ \vdots & \vdots\\ 1 & t\end{pmatrix}$, $R_i = \sigma^2I_{n_i}$.
$$\mathrm{Var}(Y_{ij}) = d_{11} + 2jd_{12} + j^2d_{22} + \sigma^2,\ j = 1,\ldots,t,\qquad \mathrm{Cov}(Y_{ij},Y_{ik}) = d_{11} + (j+k)d_{12} + jkd_{22}.$$
$$\frac{\partial\,\mathrm{Var}(Y_{ij})}{\partial j} = 2d_{12} + 2jd_{22},$$
so if j > −d12/d22, then Var(Yij) increases monotonically over time; if j < −d12/d22, then Var(Yij) decreases monotonically over time.
In generalized estimating equations, bias and efficiency in finite samples depend on:
– the number of clusters/units;
– the distribution of cluster sizes;
– the magnitude of the correlations;
– the number and type of covariates.
Reference: Davis (2002). Statistical Models for the Analysis of Repeated Measures.
Time-dependent Variables
– We have to assume E(Yij∣Xij) = E(Yij∣Xi1, . . . , Xini) or E(Yij∣Xi1, . . . , Xi(j−1), Xij).
Reference: Pepe & Anderson (1994).
– Alternatively, use a diagonal working covariance matrix - independent but with unequal variances.
1.2.6 Comparison between GEE and GLMM
GEE and GLMM give different interpretations of β and hence different estimates of β. Generally, ∣β̂GEE∣ < ∣β̂GLMM∣.
Since
$$\underbrace{g^{-1}(X_{ij}^T\beta^*) = \mu_{ij} = E(Y_{ij})}_{\text{GEE}} = \underbrace{E[E(Y_{ij}\mid b_i)] = E[g^{-1}(X_{ij}^T\beta + Z_{ij}^Tb_i)]}_{\text{GLMM}},$$
$$X_{ij}^T\beta^* = g\left\{E\left[g^{-1}(X_{ij}^T\beta + Z_{ij}^Tb_i)\right]\right\},\quad\text{where } \beta^* = \beta_{GEE} \text{ and } \beta = \beta_{GLMM}.$$
Example 1.2.14.
1. Binary data with probit link: g(·) = Φ^{−1}(·)
$$\Phi^{-1}(\mu_{ij}^b) = X_{ij}^T\beta + Z_{ij}^Tb_i \Longrightarrow \mu_{ij}^b = \Phi(X_{ij}^T\beta + Z_{ij}^Tb_i)$$
$$\mu_{ij} = E(\mu_{ij}^b) = E\left[\Phi(X_{ij}^T\beta + Z_{ij}^Tb_i)\right] = \Phi\left(\frac{X_{ij}^T\beta}{(1+Z_{ij}^TDZ_{ij})^{1/2}}\right)$$
$$X_{ij}^T\beta^* = \Phi^{-1}(\mu_{ij}) = \frac{X_{ij}^T\beta}{(1+Z_{ij}^TDZ_{ij})^{1/2}}$$
So β* ≠ β unless D = 0. Since 1 + Z_ij^TDZ_ij ≥ 1, ∣β∣ > ∣β*∣.
E.g., Φ^{−1}(µ_ij^b) = X_ij^Tβ + bi, bi ∼ N(0, θ). Then 1 + Z_ij^TDZ_ij = 1 + θ, so $\beta^* = \frac{\beta}{(1+\theta)^{1/2}}$.
2. Binary data with logit link: g(·) = logit(·)
$$\mu_{ij}^b = \frac{\exp(X_{ij}^T\beta + Z_{ij}^Tb_i)}{1+\exp(X_{ij}^T\beta + Z_{ij}^Tb_i)} = F(X_{ij}^T\beta + Z_{ij}^Tb_i),\quad\text{where } F \text{ is the cdf of the logistic distribution}$$
$$\mu_{ij} = E(Y_{ij}) = E(\mu_{ij}^b) \approx F\left(\frac{X_{ij}^T\beta}{(1+c^2Z_{ij}^TDZ_{ij})^{1/2}}\right),\quad\text{where } c = \frac{16\sqrt3}{15\pi} \approx 0.59$$
3. Count data with log link: g(·) = log(·)
$$\mu_{ij}^b = \exp(X_{ij}^T\beta + Z_{ij}^Tb_i)$$
$$\mu_{ij} = E(\mu_{ij}^b) = \exp\left(X_{ij}^T\beta + \tfrac{1}{2}Z_{ij}^TDZ_{ij}\right)$$
$$\log(\mu_{ij}) = X_{ij}^T\beta^* = X_{ij}^T\beta + \underbrace{\tfrac{1}{2}Z_{ij}^TDZ_{ij}}_{\text{offset}}$$
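The probit attenuation in case 1 — E[Φ(η + b)] = Φ(η/(1 + θ)^{1/2}) for b ∼ N(0, θ) — can be checked numerically (Python sketch with arbitrary η, θ):

```python
import numpy as np
from math import erf, sqrt, pi

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

eta, theta = 0.7, 2.0                       # linear predictor and random-intercept variance
b = np.linspace(-10 * sqrt(theta), 10 * sqrt(theta), 20001)
dens = np.exp(-b ** 2 / (2 * theta)) / sqrt(2 * pi * theta)   # N(0, theta) density
integrand = np.array([Phi(eta + bi) for bi in b]) * dens
marginal = np.sum(integrand) * (b[1] - b[0])    # quadrature for E[Phi(eta + b)]
attenuated = Phi(eta / sqrt(1 + theta))         # Phi(eta / (1 + theta)^{1/2})
# marginal and attenuated agree up to quadrature error
```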
1.2.7 Missing Data Pattern
Both GEE and LMM allow missing data, but under different assumptions.
* GEE assumes missing completely at random (MCAR): P(missing∣Yobs, Ymis) = α, a constant.
We can also extend GEE to allow MAR, e.g., Robins (1995), Paik (1997).
* LMM/GLMM (being likelihood-based) assumes missing at random (MAR):
P(missing∣Yobs, Ymis) = P(missing∣Yobs) ⟹ the missing data can still be inferred from the observed data.
Since P(Yobs, Ymis) = P(Ymis∣Yobs)P(Yobs), we can use imputation methods.
If we assume MCAR, then P(missing∣Yobs) = P(missing∣Yobs, Ymis) = α, so P(Yobs, Ymis) ∝ P(Yobs). If we only infer from P(Yobs), the estimator is still consistent, but we may lose some efficiency. In this complete-case analysis, we simply drop the NA observations.