
Analysis of Repeated Measurements (Spring 2017)

J.P.Kim

Dept. of Statistics

Last modified: June 18, 2017


Preface & Disclaimer

This note is a summary of the lecture Analysis of Repeated Measurements (326.748A) held at Seoul National University, Spring 2017. The lecturer was MC Paik, and the note was summarized by J.P.Kim, a Ph.D. student. There are a few textbooks and references for this course, which are the following.

• Analysis of Longitudinal Data, Diggle, Heagerty, Liang & Zeger, Oxford University Press,

2002.

• Linear Mixed Models in Practice: An SAS-oriented Approach, Verbeke & Molenberghs,

Springer, 1997.

• Applied longitudinal analysis, Fitzmaurice, Laird & Ware, Wiley, 2004.

• Statistical Methods for the Analysis of Repeated Measurements, Charles S. Davis, Springer-

Verlag, 2002.

I also referred to the following books while writing this note. The list will be updated continuously.

• Applied Multivariate Statistical Analysis, Johnson & Wichern, Pearson, 2002.

Most of the content of this summary note comes from these textbooks and references. If you would like to report typos or mistakes, please contact: [email protected]


Chapter 1

Introduction

1.1 Introduction

Consider data that are repeatedly measured over time. Such data, the so-called longitudinal data, are contrasted with those from cross-sectional studies, because longitudinal studies offer unique information that other study designs cannot. While it is often possible to address the same scientific questions with a longitudinal or a cross-sectional study, the major advantage of the former is its capacity to separate what in the context of population studies are called cohort and age effects.

Example 1.1. Consider the data illustrated in Figure 1.1. In the left panel, score is plotted against age for a hypothetical cross-sectional study of children; data are collected from 8 individuals. With these data alone, little can be said beyond the overall trend. However, in the right panel, we suppose the same data were obtained in a longitudinal study in which each individual was measured twice (over 2 years). Now it is clear that while younger children began at a higher score, everyone improved with time. A longitudinal study gives unique information that a cross-sectional study cannot.

Figure 1.1: Cross-sectional study vs Longitudinal one

Notations. Here we define some notation that we will use throughout the course. Let Yij be the outcome (possibly a vector) for the ith individual at the jth time point, where i = 1, 2, · · · , K (index over individuals) and j = 1, 2, · · · , ni (index over time points or subunits). Also, let Xij = (xij1, xij2, · · · , xijp)> be the p-variate vector of covariates for the ith individual at the jth time point.


Some elements do not change over time and are called time-constant covariates. For example, age at

entry or gender is time-constant. Then the second subscript (of Xij) is not necessary. Some elements

change over time and are called time-varying covariates.

Example 1.2. If the covariate xij = xi is time-constant, e.g.

E(Yij | xi) = β0 + β1xi + β2j,

then β1 is the difference in Y per unit change in x. If the covariate is time-varying, we can write the model as

E(Yij | xi1, xij) = β0 + β1xi1 + β2(xij − xi1) + β3j.

Note that β2 can be estimated only if time-varying covariates are available. For example, if x denotes the exercise level at each month and all individuals in the study never exercise (just like the summarizer), then xij − xi1 ≡ 0 and we cannot estimate β2.

Remark 1.3. Note that repeated measurements within a unit are correlated. The most important

part in analysis of longitudinal data is how to deal with such correlation issue. Of course, we can avoid

the correlation issue by using summary measures (e.g. the average over time) for independent units (a two-stage model), or by a conditional analysis eliminating nuisance parameters associated with correlation. However, in this course, we will see some methods that handle the correlation directly. Of course, depending on the goal of the study, different analytical methods are called for. For

example, if the goal of the study is how the outcome varies for one group (e.g. smokers) vs. the other

(e.g. nonsmokers), a marginal approach such as a Generalized Estimating Equation fits well. In other words, if each individual belongs to one of the groups and the group membership does not change, the goal of the study might be to see the difference between groups, and hence we look at the marginal (integrated) properties of each group. On the other hand, if the goal of the study is how the outcome varies within a subject when a condition (e.g. smoking vs. non-smoking) is changed, a conditional approach such as a mixed effects model fits well (e.g. the smoking state of each individual changes over time).

Remark 1.4. In linear models, the two approaches (marginal vs. conditional) are congruent; in

nonlinear models, incongruent. For example, consider linear models. If one uses marginal approach,

then our model might be

E(Y |X) = β0 + βXX,

where X = 0 or X = 1. Then βX is related to the difference between two groups,

βX = E(Y |X = 1)− E(Y |X = 0).


However, if we consider a conditional approach, then outcome of each individual matters, and hence

the model becomes

E(Yi|Xi, bi) = β0 + bi + βXXi,

where bi denotes the "random effect" which distinguishes each individual (i.e., it is a subject-specific quantity). Then, once an individual is fixed (with random effect b),

βX = E(Y |X = 1, b)− E(Y |X = 0, b)

denotes the difference between two groups, and hence the two models do not ‘contradict’ each other,

even though interpretation of βX is different. However, in nonlinear case, two models ‘contradict’ each

other. For example, consider following two logistic models

logitP (Y = 1|X) = β0 + βXX

and

logitP (Y = 1|X, b) = β0 + b+ βXX.

The former one (a marginal approach) yields

βX = log [ P(Y=1|X=1) P(Y=0|X=0) ] / [ P(Y=1|X=0) P(Y=0|X=1) ],   ("log odds ratio")

or equivalently

P(Y=1|X) = exp(β0 + βX X) / (1 + exp(β0 + βX X)),

while the latter one (a conditional approach) yields

βX = log [ P(Y=1|X=1, b) P(Y=0|X=0, b) ] / [ P(Y=1|X=0, b) P(Y=0|X=1, b) ],

or

P(Y=1|X, b) = exp(β0 + b + βX X) / (1 + exp(β0 + b + βX X)).

Denoting the distribution (pdf) of b as g(b), we obtain

P(Y=1|X) = ∫ P(Y=1|X, b) g(b) db = ∫ [ exp(β0 + b + βX X) / (1 + exp(β0 + b + βX X)) ] g(b) db
         ≠ exp(β0* + βX* X) / (1 + exp(β0* + βX* X)),


and hence two approaches contradict each other.
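To see this incongruence numerically, here is a minimal sketch (not from the lecture; it assumes b ∼ N(0, σ_b²) with arbitrary hypothetical values of β0, βX, σ_b) that integrates the conditional logistic model over the random effect and shows that the implied marginal log odds ratio is attenuated relative to βX:

    import numpy as np

    beta0, betaX, sigma_b = -1.0, 1.0, 2.0   # hypothetical values for illustration

    def marginal_prob(x, n_grid=20001):
        # P(Y=1|X=x) = integral of expit(beta0 + b + betaX*x) * N(b; 0, sigma_b^2) db
        b = np.linspace(-8 * sigma_b, 8 * sigma_b, n_grid)
        dens = np.exp(-0.5 * (b / sigma_b) ** 2) / (sigma_b * np.sqrt(2 * np.pi))
        p_cond = 1.0 / (1.0 + np.exp(-(beta0 + b + betaX * x)))
        return np.trapz(p_cond * dens, b)

    p1, p0 = marginal_prob(1.0), marginal_prob(0.0)
    marginal_logOR = np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))
    print("conditional beta_X:", betaX)
    print("marginal log OR   :", round(marginal_logOR, 4))   # smaller in magnitude than betaX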

The previous remarks emphasize that we have to make the goal of the study clear before the analysis. Exploratory data analysis or descriptive statistics can be helpful for modeling before the formal data analysis.

Example 1.5. High correlation between measurements (within each individual) means that each individual is distinct from the others; we can distinguish individuals from one another. Low correlation gives a tangled plot over time; individuals are indistinguishable.

Figure 1.2: Example 1.5.

1.2 Review for multivariate analysis

In this section, we recall some multivariate data analysis methods.

Definition 1.6. Let Y = (Y1, Y2, · · · , Yp)> be a random vector. Then

E(Y) = (E(Y1), E(Y2), · · · , E(Yp))> = (µ1, µ2, · · · , µp)> = µ

and

Var(Y) = E[(Y − µ)(Y − µ)>] = Σ = [σjk]_{j,k=1,...,p} = E(YY>) − µµ>,

where the (j, k) entry is σjk = Cov(Yj, Yk).


Also, the correlation matrix is defined as R = V^{−1/2} Σ V^{−1/2}, where

V = diag(σ11, σ22, · · · , σpp).

Example 1.7. Estimation of the mean and variance. We estimate µ by

µ̂ = ȳ = (1/n) ∑_{i=1}^n Yi.

If µ is known, we can estimate Σ by

Σ̂ = (1/n) ∑_{i=1}^n (Yi − µ)(Yi − µ)>,

and if µ is unknown, then

Σ̂ = (1/(n − 1)) ∑_{i=1}^n (Yi − µ̂)(Yi − µ̂)>.

Now clearly

E(AY) = A E(Y) = Aµ   and   Var(Aµ̂) = (1/n) A Σ A>.

It can be easily obtained that

E(µ̂) = µ   and   Var(µ̂) = (1/n) Σ.

Also, note that

E[(Yi − µ̂)(Yi − µ̂)>] = E[(Yi − µ)(Yi − µ)>] − E[(Yi − µ)(µ̂ − µ)>] − E[(µ̂ − µ)(Yi − µ)>] + E[(µ̂ − µ)(µ̂ − µ)>] = (1 − 1/n) Σ,

and therefore

E[ (1/n) ∑_{i=1}^n (Yi − µ̂)(Yi − µ̂)> ] = ((n − 1)/n) Σ,   while   E(Σ̂) = E[ (1/(n − 1)) ∑_{i=1}^n (Yi − µ̂)(Yi − µ̂)> ] = Σ.

Example 1.8. The multivariate normal density uses a multivariate version of a standardized distance, the Mahalanobis distance,

f(Y) ∝ exp( −(1/2) (Y − µ)> Σ^{−1} (Y − µ) ).


Note that

Y ∼ Np(µ, Σ)  ⇔  Z = Σ^{−1/2}(Y − µ) ∼ Np(0, I)  ⇔  z1, z2, · · · , zp i.i.d. ∼ N(0, 1).

Example 1.9. Consider a hypothesis testing problem

H0 : µ = µ0  vs  H1 : µ ≠ µ0

based on Y1, Y2, · · · , Yn ∼ Np(µ, Σ). When Σ is known, we can use the fact that

n (µ̂ − µ0)> Σ^{−1} (µ̂ − µ0) ∼ χ²(p),   where µ̂ = Ȳ.

If Σ is unknown, we use the following Hotelling's T² distribution.

• The distribution of Z1Z1> + · · · + ZkZk>, where Z1, · · · , Zk i.i.d. ∼ Np(0, Σ), is called a Wishart distribution and is denoted Wp(Σ, k).

• If α can be written as m d>M^{−1}d, where d ⊥ M, d ∼ Np(0, I), and M ∼ Wp(I, m), then we say that α has the Hotelling T² distribution with parameter p and degrees of freedom m. We write α ∼ T²(p, m).

• Note that

T² = (µ̂ − µ0)> (S/n)^{−1} (µ̂ − µ0) = [√n Σ^{−1/2}(µ̂ − µ0)]> Σ^{1/2} S^{−1} Σ^{1/2} [√n Σ^{−1/2}(µ̂ − µ0)] ∼ T²(p, n − 1).

• Also note that

T²(p, n − 1) ≡_d ((n − 1)p / (n − p)) F(p, n − p).

These come from the following two facts: let Z1, · · · , Zn ∼ Np(0, I) and let SZ be their sample variance matrix. Then

(n − 1) SZ ∼ Wp(I, n − 1)

and

n Z̄> SZ^{−1} Z̄ ∼ ((n − 1)p / (n − p)) F(p, n − p).

For the first one, let Z := (Z1, Z2, · · · , Zn)> be the n × p data matrix. Then

(Z1 − Z̄, · · · , Zn − Z̄)> = (I − Π1)Z


where Π1 = 1(1>1)^{−1}1>. Hence we get

(n − 1) SZ = ∑_{i=1}^n (Zi − Z̄)(Zi − Z̄)> = Z>(I − Π1)Z,

and then the spectral decomposition of the idempotent matrix I − Π1 ends the proof.

For the second one, consider the following two properties of the Wishart distribution.

• Let S ∼ Wp(Σ, k) (k ≥ p), and partition

Σ = [ Σ11 σ12 ; σ21 σ22 ],   Σ^{−1} = [ Σ^{11} σ^{12} ; σ^{21} σ^{22} ],   S^{−1} = [ S^{11} s^{12} ; s^{21} s^{22} ],

where σ22, σ^{22}, s^{22} are scalars (the last diagonal entries). Then

σ^{22} / s^{22} ∼ χ²(k − p + 1).

The proof is as follows. Note that

1 / s^{22} = s22 − s21 S11^{−1} s12 =: s22·1.

Also note that S ≡_d Z1Z1> + · · · + ZkZk>, where Z1, · · · , Zk i.i.d. ∼ Np(0, Σ). Writing Zi = (Xi>, Yi)> with scalar Yi, we get

S = [ ∑ XiXi>   ∑ XiYi ;  ∑ YiXi>   ∑ Yi² ],

and hence, with Y = (Y1, · · · , Yk)> and X the k × (p − 1) matrix with rows Xi>,

1 / s^{22} = s22·1 = Y>(I − ΠX)Y.

Now Var(Yi | Xi) = σ22·1 gives

Y>(I − ΠX)Y | X ∼ σ22·1 χ²(k − (p − 1)),

which is equivalent to σ^{22} / s^{22} ∼ χ²(k − p + 1).


• If S ∼ Wp(Σ, k), then for any vector d,

(d>Σ^{−1}d) / (d>S^{−1}d) ∼ χ²(k − p + 1).

The proof is as follows. WLOG ‖d‖ = 1. Let

P = [ P1 ; d> ]

be an orthogonal matrix with d> as its last row. Then

P S P> ∼ Wp(P Σ P>, k)

holds. Now note that

(P Σ^{−1} P>)_{22} / (P S^{−1} P>)_{22} = (d>Σ^{−1}d) / (d>S^{−1}d) ∼ χ²(k − p + 1)

by the previous proposition, since P Σ^{−1} P> = (P Σ P>)^{−1} and P S^{−1} P> = (P S P>)^{−1}.

• Now come back to the original problem. Our goal is to show that

n Z̄> SZ^{−1} Z̄ ∼ ((n − 1)p / (n − p)) F(p, n − p).

Note that

(n − 1) Z̄>Z̄ / (Z̄> SZ^{−1} Z̄)  |  Z̄  ∼ χ²(n − p),

and hence

(n − 1) Z̄>Z̄ / (Z̄> SZ^{−1} Z̄) ∼ χ²(n − p)

and it is independent of Z̄. Therefore,

[ n Z̄>Z̄ / p ] / [ ( (n − 1) Z̄>Z̄ / (Z̄> SZ^{−1} Z̄) ) / (n − p) ] ∼ F(p, n − p),

i.e.,

n Z̄> SZ^{−1} Z̄ ∼ ((n − 1)p / (n − p)) F(p, n − p).

Now we can obtain our conclusion:

T² = (µ̂ − µ0)> (S/n)^{−1} (µ̂ − µ0)
   = (√n Σ^{−1/2}(µ̂ − µ0))> (Σ^{−1/2} S Σ^{−1/2})^{−1} (√n Σ^{−1/2}(µ̂ − µ0))
   ∼ T²(p, n − 1) ≡_d ((n − 1)p / (n − p)) F(p, n − p)

under H0 : µ = µ0. Here the additional notation S = Σ̂ is used.
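As a computational check, here is a minimal numpy/scipy sketch of the one-sample Hotelling T² test just derived; the data and the null value µ0 are made up for illustration:

    import numpy as np
    from scipy.stats import f

    rng = np.random.default_rng(0)
    n, p = 30, 3
    Y = rng.normal(size=(n, p)) + np.array([0.2, 0.0, -0.1])   # toy data
    mu0 = np.zeros(p)

    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)                  # divisor n-1, i.e. S = Sigma-hat
    T2 = n * (ybar - mu0) @ np.linalg.solve(S, ybar - mu0)
    F_stat = (n - p) / ((n - 1) * p) * T2        # T^2(p, n-1) = (n-1)p/(n-p) F(p, n-p)
    p_value = f.sf(F_stat, p, n - p)
    print(T2, F_stat, p_value)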

Remark 1.10. Note that Hotelling's T² statistic is invariant under nonsingular linear transformations: for Y = CX + d, where C is a p × p nonsingular matrix, we get

n (Ȳ − µY)> SY^{−1} (Ȳ − µY) = n (CX̄ − CµX)> (C>)^{−1} SX^{−1} C^{−1} (CX̄ − CµX) = n (X̄ − µX)> SX^{−1} (X̄ − µX).

Example 1.11. Testing for time trend can be expressed as linear hypothesis H0 : C>µ = 0.

(i) For example, testing no time change at all:

H0 : µ1 − µ2 = µ2 − µ3 = · · · = µK−1 − µK = 0

can be treated as linear hypothesis

H0 : C>µ = 0

with C> the (K − 1) × K difference matrix whose jth row has a 1 in position j, a −1 in position j + 1, and 0 elsewhere. For example, if K = 3, then

C> = [ 1 −1 0 ; 0 1 −1 ],

and if K = 4, then

C> = [ 1 −1 0 0 ; 0 1 −1 0 ; 0 0 1 −1 ].

We can say that C is a K × (K − 1) matrix. Now from

√n (µ̂ − µ) ∼ NK(0, Σ),

we get

√n C>(µ̂ − µ) ∼ NK−1(0, C>ΣC),


and from

(n − 1) Σ̂ ∼ WK(Σ, n − 1),

we get

(n − 1) C>Σ̂C ∼ WK−1(C>ΣC, n − 1).

Therefore, we get

n (C>µ̂)> (C>Σ̂C)^{−1} (C>µ̂) ∼_{H0} T²(K − 1, n − 1) ≡_d ((n − 1)(K − 1) / (n − K + 1)) F(K − 1, n − K + 1).

Or, to make the problem simple, we can assume Σ̂ ≈ Σ and use

n (C>µ̂)> (C>Σ̂C)^{−1} (C>µ̂) ≈_{H0} χ²(K − 1).

(ii) Next, we consider testing for equal change (with equal intervals), or testing for linearity:

H0 : µ2 − µ1 = µ3 − µ2 = · · · = µK − µK−1

is equivalent to

H0 : µ1 − 2µ2 + µ3 = 0, µ2 − 2µ3 + µ4 = 0, · · · ,

i.e.,

H0 : C>µ = 0

where C> is the (K − 2) × K second-difference matrix whose jth row has entries 1, −2, 1 in positions j, j + 1, j + 2 and 0 elsewhere. For example, if K = 3, then

C> = ( 1 −2 1 ),

and if K = 4, then

C> = [ 1 −2 1 0 ; 0 1 −2 1 ].

In this case, C is a K × (K − 2) matrix. Now from

√n C>(µ̂ − µ) ∼ NK−2(0, C>ΣC),


and

(n − 1) C>Σ̂C ∼ WK−2(C>ΣC, n − 1),

we get

n (C>µ̂)> (C>Σ̂C)^{−1} (C>µ̂) ∼_{H0} T²(K − 2, n − 1) ≡_d ((n − 1)(K − 2) / (n − K + 2)) F(K − 2, n − K + 2),

or approximately,

n (C>µ̂)> (C>Σ̂C)^{−1} (C>µ̂) ≈_{H0} χ²(K − 2).
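A small sketch of the two contrast tests in Example 1.11, using the chi-square approximation above; the data are hypothetical:

    import numpy as np
    from scipy.stats import chi2

    def contrast_test(Y, C_T):
        """Test H0: C'mu = 0 with n (C'mu_hat)'(C'Sigma_hat C)^{-1}(C'mu_hat) ~ chi2."""
        n = Y.shape[0]
        mu_hat = Y.mean(axis=0)
        Sigma_hat = np.cov(Y, rowvar=False)
        d = C_T @ mu_hat
        V = C_T @ Sigma_hat @ C_T.T / n
        stat = d @ np.linalg.solve(V, d)
        return stat, chi2.sf(stat, C_T.shape[0])

    K = 4
    C_no_change = np.array([[1, -1, 0, 0], [0, 1, -1, 0], [0, 0, 1, -1]])   # (K-1) x K
    C_linearity = np.array([[1, -2, 1, 0], [0, 1, -2, 1]])                  # (K-2) x K

    rng = np.random.default_rng(1)
    Y = rng.normal(size=(50, K)) + np.array([0.0, 0.5, 1.0, 1.5])           # toy linear trend
    print(contrast_test(Y, C_no_change))   # should reject "no time change"
    print(contrast_test(Y, C_linearity))   # linearity holds, so typically not rejected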

Example 1.12. Tukey's Simultaneous Confidence Intervals. Consider a random sample Y1, Y2, · · · , Yn ∼ Np(µ, Σ). In this example, we want to find a confidence interval for ℓ>µ where ℓ is NOT specified or predetermined in advance. In other words, we want to consider various values of ℓ simultaneously. Let

S = (1/(n − 1)) ∑_{i=1}^n (Yi − Ȳ)(Yi − Ȳ)>.

Note that for a particular choice of ℓ, a 100(1 − α)% C.I. is

{ µ : n (ℓ>Ȳ − ℓ>µ)² / (ℓ>Sℓ) ≤ t²_{α/2}(n − 1) },

i.e.,

P( n (ℓ>Ȳ − ℓ>µ)² / (ℓ>Sℓ) ≤ t²_{α/2}(n − 1) ) = 1 − α.

However, we want to make a statement such that for any choice of ℓ,

P( n (ℓ>Ȳ − ℓ>µ)² / (ℓ>Sℓ) ≤ c ) ≥ 1 − α.

It can be achieved if one chooses c s.t.

P( max_ℓ n (ℓ>Ȳ − ℓ>µ)² / (ℓ>Sℓ) ≤ c ) = 1 − α.

Note that

max_ℓ n (ℓ>Ȳ − ℓ>µ)² / (ℓ>Sℓ) = n (Ȳ − µ)> S^{−1} (Ȳ − µ) ∼ T²(p, n − 1)

from the Cauchy-Schwarz inequality ((ℓ>Sℓ)((Ȳ − µ)>S^{−1}(Ȳ − µ)) ≥ (ℓ>(Ȳ − µ))², with equality when ℓ ∝ S^{−1}(Ȳ − µ)). Thus we get

P( max_ℓ n (ℓ>Ȳ − ℓ>µ)² / (ℓ>Sℓ) ≤ ((n − 1)p / (n − p)) Fα(p, n − p) ) ≥ 1 − α

for any ℓ, and hence a 100(1 − α)% simultaneous C.I. for ℓ>µ is

ℓ>Ȳ ± √( ((n − 1)p / (n(n − p))) Fα(p, n − p) · ℓ>Sℓ ).

Example 1.13. Bonferroni’s Simultaneous Confidence Intervals. In this example, suppose that

we already prespecified our goals, and hence we want to find confidence intervals for m prespecified

linear combinations. Here, we control the overall error rate. Let Ci be the event that ℓi>µ is included in its confidence interval, and let P(Ci^c) = αi. Then the overall error rate is

1 − P(all m linear combinations are included in their C.I.'s) = P( ∪_{i=1}^m Ci^c ) ≤ ∑_{i=1}^m P(Ci^c) = ∑_{i=1}^m αi,

and hence we can achieve the desired overall error rate by letting ∑ αi = α. One choice is αi = α/m:

P( √n |ℓi>Ȳ − ℓi>µ| / √(ℓi>S ℓi) < t_{α/(2m)}(n − 1)  for all i ) ≥ 1 − α.

Remark 1.14. Note that usually the length of a Bonferroni C.I. is shorter than that of Tukey's. This is acceptable because Tukey's simultaneous C.I. ensures coverage of every linear combination with the predetermined error rate. However, if the number of combinations m becomes large, then the Bonferroni C.I. becomes poor.

Example 1.15. Multiple Comparison.

(i) Consider a multiple comparison problem: One should get a decision with various hypotheses

H01, H02, · · · , H0k. To make the overall error rate less than α, it is not sufficient to perform

an ordinary hypothesis testing to each hypothesis with significance level α. Rather, one should

control the probability of at least one incorrect rejection. For this, one can apply Bonferroni

correction: To test each hypothesis with significance level α/k, to make the probability of family-

wise error rate less than α.

(ii) One can apply another approach, called Holm's step-down procedure, or the Holm-Bonferroni method. The method is as follows. Let H1, H2, · · · , Hk be a family of hypotheses and P1, P2, · · · , Pk the corresponding P-values. Order the P-values from lowest to highest, P(1), P(2), · · · , P(k), and let the associated hypotheses be H(1), H(2), · · · , H(k). Now for a given significance level α, let i be the minimal index such that

P(i) > α / (k + 1 − i).

In other words, starting from the hypothesis with the smallest P-value, test each hypothesis H(j) with significance level α/(k − j + 1). Then,

reject H(j) if j = 1, 2, · · · , i − 1,
accept H(j) if j = i, i + 1, · · · , k.

Such a procedure ensures FWER ≤ α, where FWER denotes the family-wise error rate. Why? Let I0 be the set of indices corresponding to the (unknown) true null hypotheses, having k0 members. Further assume that we wrongly reject a true hypothesis. We have to prove that the probability of this event is at most α. Let h be the index of the first rejected true hypothesis (first in the ordering H(1), H(2), · · · , H(k)). So the h − 1 hypotheses rejected before it are false, and hence h − 1 + k0 ≤ k. From there, we get

1 / (k + 1 − h) ≤ 1 / k0.

Since H(h) is rejected, we have

P(h) ≤ α / (k + 1 − h)

by definition of the test, and hence

P(h) ≤ α / k0

is obtained. Therefore, if we wrongly reject a true hypothesis, there has to be a true hypothesis with P-value at most α/k0. Now define

A = { Pi ≤ α/k0 for some i ∈ I0 }.

Then whatever the (unknown) set of true hypotheses I0 is, we have

P (A) ≤ α

by the Bonferroni inequalities.
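A short sketch of the Holm step-down rule described above; the p-values are hypothetical:

    import numpy as np

    def holm(pvals, alpha=0.05):
        """Return a boolean array: True = reject, following Holm's step-down rule."""
        p = np.asarray(pvals, dtype=float)
        k = len(p)
        order = np.argsort(p)
        reject = np.zeros(k, dtype=bool)
        for step, idx in enumerate(order):       # step-th smallest p-value
            if p[idx] > alpha / (k - step):      # threshold alpha/(k+1-i) with i = step+1
                break                            # stop at the first non-rejection
            reject[idx] = True
        return reject

    print(holm([0.001, 0.01, 0.03, 0.4]))   # [ True  True False False] at alpha = 0.05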

Example 1.16. Now we consider two-sample case. Consider hypothesis comparing means of two

populations,

H0 : µ1 = µ2

with equal variances.

(i) Paired comparison. For example, let

xi be repeated measurements for one eye, and

yi be repeated measurements for the other eye.


Then xi and yi are paired, i.e., not independent. However, we are interested in the mean of

di = xi − yi. Denote E(di) = δ and Cov(d) = Σd. Then our interest is to test

H0 : δ = δ0,

where δ0 = 0 in this case. Applying the result of the one-sample case to the di's, we obtain

T² = n (d̄ − δ0)> Sd^{−1} (d̄ − δ0) ∼_{H0} T²(p, n − 1) ≡_d ((n − 1)p / (n − p)) F(p, n − p),

where

d̄ = (1/n) ∑_{i=1}^n di   and   Sd = (1/(n − 1)) ∑_{i=1}^n (di − d̄)(di − d̄)>.

Paired comparison might be used, for example, for ophthalmologic (안과학의) data or pre-post data.

(ii) Unpaired case. Now assume that observed data set is unpaired. Assume that

Xi ∼ Np(µx,Σ), i = 1, 2, · · · , n1, Yj ∼ Np(µy,Σ), j = 1, 2, · · · , n2.

Also assume that the X's and Y's are independent. Then

Σ^{−1/2}(X̄ − Ȳ) ∼_{H0} Np( 0, (1/n1 + 1/n2) I ),

and hence from

(n1 + n2 − 2) Sp ≡ ∑_{i=1}^{n1} (Xi − X̄)(Xi − X̄)> + ∑_{j=1}^{n2} (Yj − Ȳ)(Yj − Ȳ)> ∼ Wp(Σ, n1 + n2 − 2)

(the two sums being independently Wp(Σ, n1 − 1) and Wp(Σ, n2 − 1) distributed), we get

(1/n1 + 1/n2)^{−1} (X̄ − Ȳ)> Sp^{−1} (X̄ − Ȳ) ∼_{H0} T²(p, n1 + n2 − 2).

Example 1.17. (Unequal variance) Let

Xi ∼ Np(µx,Σx), Yj ∼ Np(µy,Σy), i = 1, 2, · · · , n1, j = 1, 2, · · · , n2.

Then under H0 : µx = µy,

X̄ − Ȳ ∼ Np( 0, n1^{−1} Σx + n2^{−1} Σy ),

and hence we get

(X̄ − Ȳ)> ( (1/n1) Σ̂x + (1/n2) Σ̂y )^{−1} (X̄ − Ȳ) ≈_{H0} χ²(p)

asymptotically.

Example 1.18. Multivariate ANOVA (MANOVA). Now we want to compare more than 2

populations. For this, assume that

X_i^{(1)} ∼ Np(µ1, Σ), i = 1, 2, · · · , n1,
X_i^{(2)} ∼ Np(µ2, Σ), i = 1, 2, · · · , n2,
· · ·
X_i^{(k)} ∼ Np(µk, Σ), i = 1, 2, · · · , nk.

Also assume that the X_i^{(j)} are independent, and that all populations have a common (unknown) variance Σ. Let n = ∑_{j=1}^k nj. Our goal is to test

H0 : µ1 = µ2 = · · · = µk.

For this, we use a likelihood ratio test; 2(ℓ1* − ℓ0*) is asymptotically χ²-distributed. The MLEs are obtained as

µ̂j = X̄^{(j)} = (1/nj) ∑_{i=1}^{nj} X_i^{(j)},   Σ̂ = (1/n) ∑_{j=1}^{k} ∑_{i=1}^{nj} (X_i^{(j)} − X̄^{(j)})(X_i^{(j)} − X̄^{(j)})>,

and

µ̂0 = X̄ = (1/n) ∑_{j=1}^{k} ∑_{i=1}^{nj} X_i^{(j)},   Σ̂0 = (1/n) ∑_{j=1}^{k} ∑_{i=1}^{nj} (X_i^{(j)} − X̄)(X_i^{(j)} − X̄)>

under H0. Now note that the log-likelihood function is obtained as

ℓ(µ1, · · · , µk, Σ) = −(n/2) log |2πΣ| − (1/2) ∑_{j=1}^{k} ∑_{i=1}^{nj} (X_i^{(j)} − µj)> Σ^{−1} (X_i^{(j)} − µj)
 = −(n/2) log |2πΣ| − (1/2) ∑_{j=1}^{k} ∑_{i=1}^{nj} tr( Σ^{−1} (X_i^{(j)} − µj)(X_i^{(j)} − µj)> )
 = −(n/2) log |2πΣ| − (1/2) tr( Σ^{−1} ∑_{j=1}^{k} ∑_{i=1}^{nj} (X_i^{(j)} − µj)(X_i^{(j)} − µj)> ),

and hence

ℓ1* = ℓ(µ̂1, · · · , µ̂k, Σ̂) = −(n/2) log |2πΣ̂| − (n/2) tr( Σ̂^{−1} Σ̂ ),
ℓ0* = ℓ(µ̂0, Σ̂0) = −(n/2) log |2πΣ̂0| − (n/2) tr( Σ̂0^{−1} Σ̂0 ).

Thus we obtain

2(ℓ1* − ℓ0*) = n log( |Σ̂0| / |Σ̂| ) ≈_{H0} χ²((k − 1)p).

Or we can also use

Λ = |Σ̂0| / |Σ̂| = ( |W| / |B + W| )^{−1},

where

W = n Σ̂ = ∑_{j=1}^{k} ∑_{i=1}^{nj} (X_i^{(j)} − X̄^{(j)})(X_i^{(j)} − X̄^{(j)})>   ("variation within treatments")

and

B = ∑_{j=1}^{k} nj (X̄^{(j)} − X̄)(X̄^{(j)} − X̄)>.   ("variation between treatments")

Note that this statistic is based on the following decomposition of the sum of variations,

n Σ̂0 = ∑_{j=1}^{k} nj (X̄^{(j)} − X̄)(X̄^{(j)} − X̄)> + n Σ̂,   i.e.,   T = B + W.

Also note that

Σ^{−1/2} W Σ^{−1/2} ∼ Wp(I, ∑ ni − k),   Σ^{−1/2} B Σ^{−1/2} ∼ Wp(I, k − 1),

and W ⊥ B. Differently from the lecture note, we will define

Λ = |Σ̂| / |Σ̂0| = |W| / |B + W|,

so that

2(ℓ1* − ℓ0*) = −n log Λ.

Λ is said to follow Wilks' lambda distribution, written Λ ∼ Λ(p, ∑ ni − k, k − 1), where

A ∼ Wp(Σ, m), B ∼ Wp(Σ, n) (independent)  ⇒  |A| / |A + B| ∼ Λ(p, m, n).

Or we can apply

• Roy’s greatest root test, which uses the largest eigenvalue of W−1B;

• Lawley-Hotelling trace test, which uses tr(W−1B);


• Pillai-Bartlett trace test, which uses tr((I +W−1B)−1).

Note that Wilks' lambda is equivalent to det(I + W^{−1}B)^{−1}, which motivates the Pillai-Bartlett test. Also note that Bartlett suggested a modified version of the likelihood ratio test,

−( n − 1 − (p + k)/2 ) log Λ = −( n − 1 − (p + k)/2 ) log( |W| / |B + W| ) ≈_{H0} χ²((k − 1)p).
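A numerical sketch of the one-way MANOVA quantities above (W, B, Wilks' Λ, and Bartlett's chi-square approximation) on toy data:

    import numpy as np
    from scipy.stats import chi2

    def manova_wilks(groups):
        """groups: list of (n_j x p) arrays. Returns Wilks' Lambda and Bartlett's test."""
        X = np.vstack(groups)
        n, p = X.shape
        k = len(groups)
        grand = X.mean(axis=0)
        W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
        B = sum(len(g) * np.outer(g.mean(axis=0) - grand, g.mean(axis=0) - grand)
                for g in groups)
        Lam = np.linalg.det(W) / np.linalg.det(B + W)
        stat = -(n - 1 - (p + k) / 2) * np.log(Lam)     # Bartlett's correction
        return Lam, stat, chi2.sf(stat, (k - 1) * p)

    rng = np.random.default_rng(2)
    groups = [rng.normal(loc=m, size=(20, 3)) for m in (0.0, 0.3, 0.6)]
    print(manova_wilks(groups))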


Chapter 2

Multivariate Regression

2.1 Introduction

2.1.1 Review of univariate regression

Recall the univariate regression model

Y = Zβ + ε,

where Y is n × 1 vector, Z is n × p design matrix, β is p × 1 coefficient vector, and ε is n × 1 error

vector. It is equivalent to the model for each unit,

yi = Ziβ + εi, i = 1, 2, · · · , n.

We assume that the εi's are independently (normally) distributed with constant variance. Then the OLS (ordinary least squares) estimator β̂ of β is a solution of the following normal equation

Z>(Y − Zβ̂) = 0,

or equivalently,

∑_{i=1}^n Zi>Zi β̂ = ∑_{i=1}^n Zi> yi.

With this, we can show that

Eβ̂ = β and Cov(β̂) = σ²(Z>Z)^{−1}, and β̂ is the BLUE. In the model checking step, we check whether the data (and the fitted model) satisfy the "assumptions" of the model, such as the mean structure (linearity; the residuals do not contain a systematic pattern),


homoskedasticity, distributional pattern (or outliers), and independence.

2.1.2 Multivariate Regression Model

Consider the following prolactin level data, measured 4 times every 15 minutes on 30 women (Figure 2.1). It is known that there exist 3 groups, one of which every woman belongs to, and our interest is the relationship between the prolactin levels of each group and covariates. For this, one may apply a regression model using dummy variables reflecting a group effect, assuming all observations are independent. However, since the data are repeatedly measured over time, it is not reasonable to assume independence. Rather, they are correlated; thus we might use a multivariate regression model which deals with the response vectors observed from each individual.

Figure 2.1: Prolactin levels over time

First we only consider the case of equal dimensions; the number of repeats is the same for every individual. Assume the following mean model

Yi = Xiβ + εi, i = 1, 2, · · · ,K,

where

Yi = (yi1, yi2, · · · , yin)>

and

εi = (εi1, εi2, · · · , εin)>

are n×1 vectors. In here, i is an index for independent units; n is the number of repeated measurements.

For instance, in the prolactin data, K = 30 and n = 4. Equal dimensions means that n is equal for

any i.


Example 2.1. If we analyze pre-post data, then n = 2. The following model can be considered:

pre_i  = β0 + β1 age_i^(pre) + εi1,
post_i = β0 + δ + β1 age_i^(post) + εi2,

i.e.,

(pre_i, post_i)> = Xi (β0, δ, β1)> + (εi1, εi2)>,   with   Xi = [ 1 0 age_i^(pre) ; 1 1 age_i^(post) ].

Example 2.2. Now consider an example of modeling measurements from 4 time points:

(yi1, yi2, yi3, yi4)> = Xi (β0, βT, βZ)> + (εi1, εi2, εi3, εi4)>,   with   Xi = [ 1 1 zi ; 1 2 zi ; 1 3 zi ; 1 4 zi ].   ("Reduced Model")

Note that here the only time-varying covariate is time itself; the other covariate Z is time-constant. Or we can also consider the following model, which assumes less than the previous one:

(yi1, yi2, yi3, yi4)> = Xi (α0, α1, α2, α3, αZ)> + (εi1, εi2, εi3, εi4)>,   with   Xi = [ 1 0 0 0 zi ; 1 1 0 0 zi ; 1 0 1 0 zi ; 1 0 0 1 zi ].   ("Full Model")

Example 2.3. Testing Linearity. If the time interval is the same and if the trend is linear, using time as a continuous variable is correct (and we select the reduced model). But using dummy variables still yields consistent estimates; under linearity the increments are equal,

α1 − α0 = α2 − α1 = α3 − α2,

or equivalently C>α = 0, where

C> = [ 1 −2 1 0 ; 0 1 −2 1 ].

Then we can test H0 : C>α = 0 with the test statistic

(C>α̂)> V̂ar(C>α̂)^{−1} (C>α̂) ∼_{H0} χ²(2).

Remark 2.4. Note that such repeatedly measured data are correlated. Is using the OLS estimator


for correlated data acceptable? Note that Var(Yi) is not a diagonal matrix; independence and homoskedasticity are violated, and therefore OLS is no longer BLUE. Actually, for the least squares estimator

β̂ = ( ∑_{i=1}^K Xi>Xi )^{−1} ∑_{i=1}^K Xi>Yi,

we get

E(β̂) = β,

but

Var(β̂) = V* = ( ∑_{i=1}^K Xi>Xi )^{−1} ( ∑_{i=1}^K Xi> Var(Yi) Xi ) ( ∑_{i=1}^K Xi>Xi )^{−1}
            = ( ∑_{i=1}^K Xi>Xi )^{−1} ( ∑_{i=1}^K Xi> Σ Xi ) ( ∑_{i=1}^K Xi>Xi )^{−1},    (2.1)

where Σ = Var(Yi | Xi). In statistical packages, such as R, an estimate of Var(β̂) is available, but such a program gives the standard error of β̂ regarding all of the data as independent,

V̂ar(β̂) = σ̂²(X>X)^{−1},

which is not acceptable in our case.

Remark 2.5. Note that we can express the model with stacked (augmented) matrices,

X (nK × p) = (X1>, X2>, · · · , XK>)>,   Y (nK × 1) = (Y1>, Y2>, · · · , YK>)>,

which gives the least squares estimator

β̂ = (X>X)^{−1} X>Y.

Note that this notation can be used even if the data are of unequal dimension, i.e., the number of repeats is not the same.

Example 2.6. Then what should we do to find the standard error of β, i.e., an estimate of V ar(β)?

There are various ways to achieve the goal.


• To make it simple, one can plug

Σ̂ = (1/K) ∑_{i=1}^K (Yi − Xiβ̂)(Yi − Xiβ̂)>

into (2.1), which yields

V̂* = ( ∑_{i=1}^K Xi>Xi )^{−1} ( ∑_{i=1}^K Xi> Σ̂ Xi ) ( ∑_{i=1}^K Xi>Xi )^{−1}.

The main advantage of this estimator is consistency (under technical assumptions). Furthermore, it is unbiased.

• An alternative empirical variance estimate is

V̂* = ( ∑_{i=1}^K Xi>Xi )^{−1} ( ∑_{i=1}^K Xi> (Yi − Xiβ̂)(Yi − Xiβ̂)> Xi ) ( ∑_{i=1}^K Xi>Xi )^{−1}.

• Which one would be preferred? If the data set has equal dimensions, i.e., the number of repeats is the same among all individuals, then the former is as nice an estimate as the latter. However, if we handle data with unequal dimensions, then the variance matrix of each individual is not the same; the dimensions do not even agree! To be precise, Σi = Var(Yi | Xi) is an ni × ni matrix, where ni is the number of repeated measurements of the ith individual. Therefore, the estimate

Σ̂ = (1/K) ∑_{i=1}^K (Yi − Xiβ̂)(Yi − Xiβ̂)>

is NOT available, and hence the latter estimate V̂* must be preferred.

• Another alternative is the following Jackknife estimate, which is asymptotically equivalent to V̂* (the second one). It is defined as

JV = ∑_{i=1}^K (β̂−i − β̂)(β̂−i − β̂)>
   = ∑_{i=1}^K { (X−i>X−i)^{−1} X−i>(Y−i − X−iβ̂) − (X>X)^{−1} X>(Y − Xβ̂) }
              × { (X−i>X−i)^{−1} X−i>(Y−i − X−iβ̂) − (X>X)^{−1} X>(Y − Xβ̂) }>
   ≈ (X>X)^{−1} ( ∑_{i=1}^K Xi>(Yi − Xiβ̂)(Yi − Xiβ̂)>Xi ) (X>X)^{−1}
   ≈ V̂*

from (X−i>X−i)^{−1} ≈ (X>X)^{−1} if K is large.

• Even though one has to compute each β̂−i, the Jackknife estimate can be obtained via iteration of a simple routine (i.e., further matrix multiplication or inverse computation may not be needed), and hence it can be preferred.

• In general, we can consider the following "model-based" and "robust" variance estimates. Consider a model with Var(Yi | Xi) = Vi = Vi(α) and let V̂i = Vi(α̂) (here α denotes a parameter, such as σk or ρ). Then we can use the following "naive" or "model-based" estimate,

V̂ar_mod(β̂) = ( ∑_{i=1}^K Xi> V̂i^{−1} Xi )^{−1}.

However, it might yield poor performance if the model (e.g. for the variance) is mis-specified. Thus we can also use the following "empirical" or "robust" estimate (i.e., "robust to model misspecification"),

V̂ar_rob(β̂) = ( ∑_{i=1}^K Xi> V̂i^{−1} Xi )^{−1} ( ∑_{i=1}^K Xi> V̂i^{−1} (Yi − µ̂i)(Yi − µ̂i)> V̂i^{−1} Xi ) ( ∑_{i=1}^K Xi> V̂i^{−1} Xi )^{−1}.

Such a "robust" estimate is motivated as follows. Let W^{−1} be a "working variance;" i.e., we "modeled" Var(Y | X) = W^{−1}. Then we get

β̂ = (X>WX)^{−1} X>W Y,

and hence

Var(β̂) = (X>WX)^{−1} X>W V W X (X>WX)^{−1},

where V = Var(Y | X). If we construct V̂ based on our model, then V̂ = W^{−1}, the middle factor collapses to X>WX, and we get the "model-based" estimate. In contrast, if one plugs in an estimate of the middle term that is consistent for it whatever the true variance structure is, then we obtain a "robust" variance estimate.
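A minimal sketch of the empirical ("robust," sandwich) variance estimate V̂* of (2.1) for the OLS estimator with clustered data, compared with the naive σ̂²(X>X)^{−1} estimate; the data-generating choices (intraclass covariance, coefficient values) are arbitrary:

    import numpy as np

    rng = np.random.default_rng(3)
    K, n, p = 200, 4, 2                                 # K individuals, n repeats, p covariates
    Sigma = 0.5 * np.ones((n, n)) + 0.5 * np.eye(n)     # true within-subject covariance
    L = np.linalg.cholesky(Sigma)
    X = [np.column_stack([np.ones(n), rng.normal(size=n)]) for _ in range(K)]
    beta_true = np.array([1.0, 2.0])
    Y = [Xi @ beta_true + L @ rng.normal(size=n) for Xi in X]

    XtX = sum(Xi.T @ Xi for Xi in X)
    beta_hat = np.linalg.solve(XtX, sum(Xi.T @ Yi for Xi, Yi in zip(X, Y)))
    bread = np.linalg.inv(XtX)
    meat = sum(Xi.T @ np.outer(Yi - Xi @ beta_hat, Yi - Xi @ beta_hat) @ Xi
               for Xi, Yi in zip(X, Y))
    V_robust = bread @ meat @ bread                     # empirical sandwich estimate

    rss = sum(np.sum((Yi - Xi @ beta_hat) ** 2) for Xi, Yi in zip(X, Y))
    V_naive = rss / (n * K - p) * bread                 # treats all n*K points as independent
    print("robust s.e.:", np.sqrt(np.diag(V_robust)))
    print("naive  s.e.:", np.sqrt(np.diag(V_naive)))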

Example 2.7. The following is the result of applying a regression model to the prolactin data. In the table, naive s.e. means the standard error that a statistical package gives (i.e., it is based on σ̂²(X>X)^{−1}), while s.e. means the standard error based on the theoretical result (2.1).

Note that the naive s.e. assumes all the data are independent even though our data are correlated, so it cannot estimate sd(β̂) well. For example, if the data are positively correlated, then the naive s.e. may underestimate the real one, since it assumes n × K independent points are observed, while the theoretical one is based on K correlated individuals. Note that the standard error (or variance) converges to 0 as the sample size becomes large, due to the consistency of the estimator.


               β̂         naive s.e.   s.e.     p-value
Intercept      3.902      0.256        0.306    <.0001
base           0.00680    0.00167      0.0025   <.0001
time           -0.481     0.0727       0.0970   <.0001
group 1        -0.112     0.350        0.3285   0.7492
group 2        -0.288     0.288        0.2356   0.3193
group 3        0.000      .            .        .
time*group1    -0.0794    0.126        0.1358   0.5291
time*group2    0.0741     0.103        0.1318   0.4722
time*group3    0.000      .            .        .

Example 2.8. Weighted Least Squares Estimator. Recall that OLS is consistent and asymptotically normally distributed, but not the BLUE because of the repeatedness of the data. If we can correctly figure out the overall correlation structure, then the weighted least squares (WLS) estimator is the BLUE, which gives better estimates. Consider the WLSE

β̂ = ( ∑_{i=1}^K Xi> Σ^{−1} Xi )^{−1} ∑_{i=1}^K Xi> Σ^{−1} Yi = (X>V^{−1}X)^{−1} X>V^{−1}Y,

where V = blockdiag(Σ, Σ, · · · , Σ), which minimizes the following weighted sum of squares

WSS(β) = ∑_{i=1}^K (Yi − Xiβ)> Σ^{−1} (Yi − Xiβ).

Here we assume that Σ does not depend on the data, i.e., the model is homoskedastic. We can easily obtain β̂ and

Var(β̂) = ( ∑_{i=1}^K Xi> Σ^{−1} Xi )^{−1}

if Σ is known. However, Σ is unknown in most cases, and hence we should replace Σ with an estimate Σ̂, where

Σ̂ = (1/K) ∑_{i=1}^K (Yi − Xiβ̂)(Yi − Xiβ̂)>.

Unfortunately, Σ̂ depends on β̂, while Σ̂ is needed to obtain β̂; thus, we cannot obtain β̂ or Σ̂ directly. Hence we use an iterative approach:


Step 1. First obtain the OLS estimate β̂ (as an initial value for the iteration). Then compute

Σ̂ = (1/K) ∑_{i=1}^K (Yi − Xiβ̂)(Yi − Xiβ̂)>.

Step 2. Using Σ̂, obtain β̂(1).

Step 3. Using β̂(1), obtain Σ̂(1). Go to Step 2, and repeat until convergence.

Actually, β̂(1) is already good enough.
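A sketch of the OLS → Σ̂ → WLS iteration of Example 2.8 for equal dimensions; the design and the intraclass error covariance used to generate data are made up for illustration:

    import numpy as np

    def wls_iterate(X, Y, n_iter=5):
        """X, Y: lists of (n x p) design matrices and (n,) response vectors."""
        # Step 1: OLS as the initial value
        beta = np.linalg.solve(sum(Xi.T @ Xi for Xi in X),
                               sum(Xi.T @ Yi for Xi, Yi in zip(X, Y)))
        for _ in range(n_iter):
            R = [Yi - Xi @ beta for Xi, Yi in zip(X, Y)]
            Sigma = sum(np.outer(r, r) for r in R) / len(X)      # unstructured Sigma-hat
            W = np.linalg.inv(Sigma)
            beta = np.linalg.solve(sum(Xi.T @ W @ Xi for Xi in X),
                                   sum(Xi.T @ W @ Yi for Xi, Yi in zip(X, Y)))
        V_beta = np.linalg.inv(sum(Xi.T @ W @ Xi for Xi in X))   # model-based variance
        return beta, Sigma, V_beta

    rng = np.random.default_rng(4)
    K, n = 100, 4
    Sigma_true = 0.7 * np.ones((n, n)) + 0.3 * np.eye(n)
    X = [np.column_stack([np.ones(n), np.arange(1, n + 1)]) for _ in range(K)]
    Y = [Xi @ np.array([1.0, 0.5]) + rng.multivariate_normal(np.zeros(n), Sigma_true)
         for Xi in X]
    print(wls_iterate(X, Y))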

Remark 2.9. Note that Σ was not assumed to have a specified structure in the previous paragraph. What if one would like to assume some structure on Σ? We rather consider a correlation structure with correlation matrix Ω, which satisfies

Σ = (diag(Σ))^{1/2} Ω (diag(Σ))^{1/2}.

Note that (diag(Σ))^{1/2} is the part reflecting heteroskedasticity. For example,

Σ = diag(σ1, σ2, σ3, σ4) Ω diag(σ1, σ2, σ3, σ4)

if Σ is 4 × 4. We can use several types of correlation structure, such as

• Intraclass:

Ω = (1 − ρ) I + ρ 1 1>   (all off-diagonal entries equal ρ);

• AR(1):

Ω = ( ρ^{|j−k|} )_{j,k}   (e.g., for 4 time points the first row is 1, ρ, ρ², ρ³);

• or Random effects (handled later).


Example 2.10. Note that depending on the structure, the estimators of the correlation coefficients are different. We can use a method-of-moments approach here. Let

rij = (Yij − µij) / σj

be the standardized residual of the ith individual at the jth time point. Then

E(rij rik) = ρjk

by definition, and hence we get a system of equations whose solution is the MME. For example, if one assumes intraclass correlation, then ρ satisfies

∑_i r̂ij r̂ik = ∑_i ρ

for any j and k, and hence

∑_i ∑_{j>k} r̂ij r̂ik = ∑_i ∑_{j>k} ρ

holds, which implies

ρ̂ = ( ∑_{i=1}^K ni(ni − 1)/2 )^{−1} ∑_{i=1}^K ∑_{j>k} r̂ij r̂ik.

On the other hand, if one assumes an AR(1) correlation structure, then

E(rij rik) = ρ^{|j−k|}.

This is achieved when

rij = ρ r_{i,j−1} + uij

for independent r.v.'s uij with suitable variance, and hence we can obtain ρ̂ as the regression coefficient where the outcome is r̂ij and the covariate is r̂_{i,j−1}, without intercept.
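A sketch of these two moment estimators for standardized residuals stored in a K × T array (equal dimensions assumed; the simulated residuals use an intraclass truth chosen arbitrarily):

    import numpy as np

    def rho_intraclass(r):
        """r: (K, T) array of standardized residuals."""
        K, T = r.shape
        num = sum(r[i, j] * r[i, k] for i in range(K) for j in range(T) for k in range(j))
        return num / (K * T * (T - 1) / 2)

    def rho_ar1(r):
        """Regress r_{i,j} on r_{i,j-1} without intercept."""
        y = r[:, 1:].ravel()
        x = r[:, :-1].ravel()
        return np.sum(x * y) / np.sum(x * x)

    rng = np.random.default_rng(5)
    K, T, rho = 500, 4, 0.6
    Omega = (1 - rho) * np.eye(T) + rho * np.ones((T, T))     # intraclass truth
    r = rng.multivariate_normal(np.zeros(T), Omega, size=K)
    print(rho_intraclass(r), rho_ar1(r))   # intraclass estimate should be near 0.6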

Example 2.11. We can also compute the MLE for β and Σ. Here, we assume normality of Y. Then the MLE for β is equal to the WLS estimator. Thus the remaining part is the ML estimate for Σ. Note that

log L = −(K/2) log |Σ| − (1/2) ∑_{i=1}^K (Yi − Xiβ)> Σ^{−1} (Yi − Xiβ),

and recall the matrix differentiation rules

∂ log |Σ| / ∂θk = tr( Σ^{−1} ∂Σ/∂θk ),
∂ Σ^{−1} / ∂θk = −Σ^{−1} (∂Σ/∂θk) Σ^{−1}.

Then we get

∂ log L / ∂θk = −(K/2) tr( Σ^{−1} ∂Σ/∂θk ) + (1/2) ∑_{i=1}^K (Yi − Xiβ)> Σ^{−1} (∂Σ/∂θk) Σ^{−1} (Yi − Xiβ)
             = −(K/2) tr( Σ^{−1} ∂Σ/∂θk ) + (1/2) ∑_{i=1}^K tr( Σ^{−1} (∂Σ/∂θk) Σ^{−1} (Yi − Xiβ)(Yi − Xiβ)> )
             = −(K/2) tr( Σ^{−1} ∂Σ/∂θk ) + (1/2) tr( Σ^{−1} (∂Σ/∂θk) Σ^{−1} ∑_{i=1}^K (Yi − Xiβ)(Yi − Xiβ)> ),

and then the likelihood equation becomes

tr( Σ^{−1} ∂Σ/∂θk ) = tr( Σ^{−1} (∂Σ/∂θk) Σ^{−1} K^{−1} ∑_{i=1}^K (Yi − Xiβ)(Yi − Xiβ)> ).

Here θ denotes the vector of parameters in Σ, for example, θ = (σ1, σ2, σ3, σ4, ρ)>.

Example 2.12. Similarly, we can deal with the heteroskedastic case. Here,

log L = −(1/2) ∑_{i=1}^K log |Σi| − (1/2) ∑_{i=1}^K (Yi − Xiβ)> Σi^{−1} (Yi − Xiβ),

and hence the likelihood equation is

∑_{i=1}^K tr( Σi^{−1} ∂Σi/∂θk ) = ∑_{i=1}^K tr( Σi^{−1} (∂Σi/∂θk) Σi^{−1} (Yi − Xiβ)(Yi − Xiβ)> ).

Remark 2.13. Note that

E[ ∂ log L / ∂θk ] = −(K/2) tr( Σ^{−1} ∂Σ/∂θk ) + (1/2) tr( Σ^{−1} (∂Σ/∂θk) Σ^{−1} E ∑_{i=1}^K (Yi − Xiβ)(Yi − Xiβ)> ) = 0

provided that Var(Yi | Xi) = Σ, i.e., the variance is correctly specified. In summary,

E[ ∂ log L / ∂θk ] = 0.    (2.2)

Note that (2.2) holds even if the normality of Y is violated; it comes only from Var(Yi | Xi) = Σ. However, recall that (2.2) is the key fact for consistency and asymptotic normality of the MLE (or of a minimum contrast estimator (MCE) with contrast function − log L). It tells us that "even when Y is not normal, the MLE for Σ obtained as if Y were normal is a consistent estimator of Σ" (under some technical assumptions). For more details, see the following Remark 2.14.


Remark 2.14. Do not assume any distributional characteristics of Y. We only know that E(Yi | Xi) = Xiβ and Var(Yi | Xi) = Σ. Our strategy to estimate Σ is to find the minimizer of the following contrast function

ρ(Σ) = (1/2) log |Σ| + (1/(2K)) ∑_{i=1}^K (Yi − Xiβ)> Σ^{−1} (Yi − Xiβ),

which is equivalent to the (minus) log-likelihood function when Y is normal. As K → ∞, ρ(Σ) converges in probability to Eρ(Σ), and correspondingly the minimizer Σ̂ = arg min ρ(Σ) converges in probability to Σ_true, which (∗) is the minimizer of Eρ(Σ). We can ensure the (∗) part from (2.2), i.e.,

∂_θ Eρ(Σ) = 0 at Σ = Σ_true,

which yields consistency. Similarly, we can derive asymptotic normality from (2.2). Recall that the key point of the derivation of the asymptotic distribution is a second-order approximation and the LLN.

Example 2.15. For certain correlation structures, the computation of the MLE becomes simple. For example, consider the following intraclass correlation (with unit variances)

Σ = (1 − ρ) I + ρ 1 1>.

Then

Σ^{−1} = (1/(1 − ρ)) ( I − (ρ / (1 + (p − 1)ρ)) 1 1> ),

and hence, since ∂Σ/∂ρ = 1 1> − I,

tr( Σ^{−1} ∂Σ/∂ρ ) = tr( Σ^{−1} (1 1> − I) ) = − ρ p (p − 1) / ( (1 − ρ)(1 + (p − 1)ρ) ).

Remark 2.16. There are some remarks.

• Note that the MLE is different from ad-hoc estimates (such as MME, ...). But plugging in


different estimates of ρ does not affect the asymptotic variance of β̂. In other words, you don't have to sweat over finding the "best estimate" of Σ (equivalently, ρ) if we are interested in obtaining the estimates of β (this will be handled in the appendix, Section 2.3).

• Consistency of the MLE is guaranteed even if the normality of Y is violated, but only when the mean and the variance are correctly specified, i.e. µi = Xiβ and Var(Yi | Xi) = Σ in truth.

• If the mean is mis-specified, neither OLS nor WLS is consistent. If the variance is mis-specified, OLS is still consistent (∵ it does not assume a variance structure), but WLS is not consistent. Thus, the efficiency of WLS depends on the true correlation structure and the number of time points.

• In other words, WLS carries the burden of "specifying the variance correctly," but in return you gain efficiency. In contrast, OLS is less efficient, but there is no burden to specify the correct variance structure.

Example 2.17. We saw that ignoring the correlation structure gives us a loss of efficiency, but then how much do we lose? One criterion is the asymptotic relative efficiency, or ARE in abbreviation, which is defined as a ratio of (asymptotic) variances. In this case, we can write

ARE = Var(β̂) Var(β̃)^{−1},

where β̂ and β̃ denote the WLS (with unstructured covariance) and OLS estimators, respectively. Then we easily obtain

ARE = (X>WX)^{−1} (X>X) (X>VX)^{−1} (X>X),

where W = V^{−1}. For example, see Figure 2.2, which gives a plot of the ARE when the correlation structure is intraclass or AR(1), respectively. The number of repeats is T = 4. Note that a higher ARE means that OLS is more efficient. As you can see, if there is no correlation, OLS is "the best," and hence the ARE is highest. You can also see that as the correlation becomes higher, OLS loses efficiency. In addition, since the correlation between measurements is lower when the data have an AR(1) structure, OLS is more efficient when the true correlation structure is AR(1) compared with the intraclass-correlated case.

Now we see how the number of repeated measurements T affects the performance of OLS. The following figure (Figure 2.3) gives the ARE-correlation plot for various values of T when the data are intraclass-correlated. As you can see, as T becomes larger, the correlation structure becomes more significant, and OLS becomes less efficient.

Note that, in a practical study, efficiency means a gain of money; if the estimator has lower variance, then the required sample size becomes smaller. In this sense, efficiency (ARE) might be an important issue


in the study.

Figure 2.2: ARE of OLS vs. WLS (intraclass and AR(1) correlation, T = 4)

Figure 2.3: ARE of OLS vs. WLS for various numbers of repeats T (intraclass correlation)
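A sketch that evaluates the matrix ARE formula above, ARE = (X>WX)^{−1}(X>X)(X>VX)^{−1}(X>X), under intraclass and AR(1) correlation; the covariate design (an intercept plus a randomly generated time-varying covariate) is an assumption of this sketch, not the design used for the lecture's figures:

    import numpy as np

    rng = np.random.default_rng(6)

    def are_matrix(Omega, K=200):
        """ARE = (X'WX)^{-1}(X'X)(X'VX)^{-1}(X'X), V block-diagonal with blocks Omega."""
        T = Omega.shape[0]
        W = np.linalg.inv(Omega)
        XtWX = XtX = XtVX = 0.0
        for _ in range(K):
            Xi = np.column_stack([np.ones(T), rng.normal(size=T)])   # hypothetical x_{ij}
            XtWX = XtWX + Xi.T @ W @ Xi
            XtX = XtX + Xi.T @ Xi
            XtVX = XtVX + Xi.T @ Omega @ Xi
        return np.linalg.inv(XtWX) @ XtX @ np.linalg.inv(XtVX) @ XtX

    def intraclass(rho, T=4): return (1 - rho) * np.eye(T) + rho * np.ones((T, T))
    def ar1(rho, T=4): return rho ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))

    for rho in (0.0, 0.3, 0.6, 0.9):
        print(rho,
              np.round(np.diag(are_matrix(intraclass(rho))), 3),
              np.round(np.diag(are_matrix(ar1(rho))), 3))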

2.2 Modeling Variance and Correlation

In many problems, the variance changes over time (e.g. see the prolactin data), and how the variability changes over time might be of interest. However, if there are many time points (e.g. T = 24), not only is estimating σ1, σ2, · · · , σ24 tedious, but such estimation cannot show the trend of the variance change effectively. In this section, we see some approaches for modeling variances or correlation coefficients, and the corresponding estimation methods.


2.2.1 Modeling Variance

Consider a model

E(Yij − µij)² = σij² = exp(γ0 + γ1Xij).    (2.3)

(Since this is a nonlinear model, we cannot use the vector notation Yi or Xi as in EYi = Xiβ.) Then we can treat (Yij − µij)² as the outcome and exp(γ0 + γ1Xij) as its mean, and find γ by least squares, i.e., find the minimizer of

∑_{i,j} εij² = ∑_{i,j} ( (Yij − µij)² − exp(γ0 + γ1Xij) )².

It is equivalent to finding a solution of

∑_{i,j} 2 (∂εij/∂γ) εij = −2 ∑_{i,j} (∂σij²/∂γ) ( (Yij − µij)² − exp(γ0 + γ1Xij) ) = 0,

i.e., finding a zero of

U(β, γ) := ∑_{i,j} (∂σij²/∂γ) ( (Yij − µij)² − exp(γ0 + γ1Xij) )
         = ∑_{i,j} (1, Xij)> exp(γ0 + γ1Xij) ( (Yij − µij)² − exp(γ0 + γ1Xij) ).

(Or we can rewrite U(β, γ) as

U(β, γ) = ∑_{i,j} (1, Xij)> (∂σij²/∂ηij) ( (Yij − µij)² − exp(γ0 + γ1Xij) ),

where ηij = γ0 + γ1Xij is the linear component.) Then we can solve the problem using a weighted least squares technique, if one treats ∂σij²/∂ηij as a weight. However, there are two main problems in such a procedure. First, the outcome (Yij − µij)² is unavailable since β is unknown. Thus one needs to obtain the estimate of β first, and then solve U(β̂, γ) = 0 after plugging in β̂. Or, equivalently, one should solve the stacked system

( ∑_{i=1}^K Xi>(Yi − Xiβ) ;  U(β, γ) ) = 0.

Second, even if we can obtain the outcome, we cannot solve U(β̂, γ) = 0 directly, since the weight

∂σij²/∂ηij = exp(γ0 + γ1Xij)

depends on the unknown parameter γ. We can easily solve this problem using an iterative algorithm such as Newton-Raphson,

γ(p+1) = γ(p) + ( −∂U(β̂, γ)/∂γ> )^{−1} U(β̂, γ) |_{γ=γ(p)},

or the scoring method,

γ(p+1) = γ(p) + E( −∂U(β̂, γ)/∂γ> )^{−1} U(β̂, γ) |_{γ=γ(p)}.
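A sketch of solving U(β̂, γ) = 0 for the log-linear variance model (2.3) by the scoring-type iteration; here the squared residuals s = (Y − µ̂)² and the covariate x are simulated with arbitrary true values of γ, and the expected information uses E[−∂U/∂γ>] = ∑ z z> exp(2η) with z = (1, x)>:

    import numpy as np

    rng = np.random.default_rng(7)
    g_true = np.array([0.5, -0.8])                     # hypothetical gamma
    x = rng.uniform(0, 2, size=2000)
    s = np.exp(g_true[0] + g_true[1] * x) * rng.chisquare(1, size=2000)   # E[s] = exp(eta)

    Z = np.column_stack([np.ones_like(x), x])
    gamma = np.zeros(2)                                 # initial value
    for _ in range(50):                                 # scoring iterations
        eta = Z @ gamma
        w = np.exp(eta)                                 # d sigma^2 / d eta
        U = Z.T @ (w * (s - np.exp(eta)))               # estimating function U(beta-hat, gamma)
        info = Z.T @ (Z * (w ** 2)[:, None])            # sum of z z' exp(2*eta)
        step = np.linalg.solve(info, U)
        gamma = gamma + step
        if np.max(np.abs(step)) < 1e-10:
            break
    print(gamma)                                        # should be close to g_true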

Remark 2.18. Note that such an estimator is asymptotically distributed as multivariate normal; details will be discussed later.

Remark 2.19. A consistent variance estimate can be obtained by the following Jackknife procedure.

(i) Delete the ith subject and obtain the OLS estimate, say β̂−i.

(ii) Given the estimate, compute (Yij − µ̂ij)² = (Yij − (Xiβ̂−i)j)².

(iii) Fit the variance model and obtain γ̂−i.

(iv) Repeat (i) to (iii) for i = 1, 2, · · · , K, and compute

∑_{i=1}^K ( β̂−i − β̂ ; γ̂−i − γ̂ ) ( β̂−i − β̂ ; γ̂−i − γ̂ )>.

Remark 2.20. Note that we can also use WLS instead of OLS. WLS should be used if the purpose of the variance modeling is to improve the efficiency of β̂, but since WLS needs the covariance to be computed, we should recompute β̂ at every iteration as γ is updated. Also note that, to obtain a more efficient estimate of γ, we may use the weighted criterion whose estimating function is

∑_{i=1}^K (∂εi>/∂γ) Ωi^{−1} εi,

where Ωi = Var(εi). If our interest is only in the (efficient) estimation of β, we may just use the least squares form; but we should employ the above (weighted) criterion if one wants to perform efficient inference on γ.

2.2.2 Modeling Correlation

In genetic studies (or family studies), modeling the correlation is of primary interest (e.g. "Is the correlation between siblings higher than that between cousins?"). Note that

E(Yij − µij)(Yik − µik) = ρijk = ( exp(α0 + α1Zijk) − 1 ) / ( exp(α0 + α1Zijk) + 1 ),

where Zijk is a covariate (e.g. whether the individuals are siblings or cousins). In correlation modeling, we can treat (Yij − µij)(Yik − µik) as the outcome and (exp(α0 + α1Zijk) − 1)/(exp(α0 + α1Zijk) + 1) as its mean, and find α so that ∑_{i,j,k} εijk² is minimized, where εijk = (Yij − µij)(Yik − µik) − ρijk. Then we should solve

U(β, α) = 0,

where

U(β, α) = ∑_{i,j,k} (∂ρijk/∂α) ( (Yij − µij)(Yik − µik) − ρijk ) = ∑_{i,j,k} (1, Zijk)> (∂ρijk/∂ηijk) ( (Yij − µij)(Yik − µik) − ρijk ).

Note that our 'outcome' is not known, so one needs to obtain the estimate of β first, plug it in, then solve U(β̂, α) = 0. The same arguments used in the variance modeling can be applied in the correlation modeling, such as the nonlinear regression routine or the jackknife procedure. Furthermore, one can combine the two models, and then the mean, variance, and correlation are modeled simultaneously.

2.3 Appendix: Asymptotic variance of WLS estimator

Consider a WLS estimator for β,

β̂(θ) = (X>V^{−1}(θ)X)^{−1} X>V^{−1}(θ)Y,

where θ is a (vector of) parameter(s). Then we get

Var(β̂(θ)) = (X>V^{−1}(θ)X)^{−1},

and hence we can estimate this variance by

V̂ar(β̂(θ)) = (X>V^{−1}(θ̂)X)^{−1}.

However, in practice we obtain β̂(θ̂) instead of β̂(θ), and then it may seem too optimistic to expect

Var(β̂(θ̂)) = Var(β̂(θ)).

The following theorem tells us that such an expectation is acceptable; if θ̂ is a sufficiently good estimator of θ (consistent, at least), the asymptotic variances are the same, i.e.,

Var(β̂(θ̂)) ≈ Var(β̂(θ)).


Theorem 2.21. Let θ̂ be a consistent estimator of θ, and let the subscript 0 denote the true value; e.g., β0 denotes the true β. Then

√K ( β̂(θ̂) − β0 ) = √K ( β̂(θ0) − β0 ) + OP(K^{−1/2})

as K → ∞.

Proof. Note that

√K ( β̂(θ̂) − β0 ) = ( (1/K) X>V^{−1}(θ̂)X )^{−1} K^{−1/2} X>V^{−1}(θ̂) ε
                  = ( (1/K) X>V^{−1}(θ0)X )^{−1} K^{−1/2} X>V^{−1}(θ̂) ε + oP(1)   (∵ consistency of θ̂)

holds. Under reasonable assumptions, K^{−1} X>V^{−1}(θ0)X converges to a limit, and hence the remaining part is K^{−1/2} X>V^{−1}(θ̂) ε. Now define S(θ) = X>V^{−1}(θ) ε. Then by Taylor's theorem, we get

K^{−1/2} S(θ̂) = K^{−1/2} S(θ0) + K^{−1/2} (∂S/∂θ)|_{θ=θ*} (θ̂ − θ0)
             = K^{−1/2} S(θ0) + K^{−1} (∂S/∂θ)|_{θ=θ*} · √K (θ̂ − θ0).

From consistency, under some technical assumptions, √K (θ̂ − θ0) = OP(1), and therefore it is enough to show

K^{−1} (∂S/∂θ)|_{θ=θ*} = OP(K^{−1/2}).

Note that for any component θk,

∂S/∂θk = −X>V^{−1}(θ) (∂V(θ)/∂θk) V^{−1}(θ) ε = −∑_{i=1}^K Xi> Σ^{−1} (∂Σ/∂θk) Σ^{−1} εi = OP(K^{1/2})

from E(εi) = 0.


Chapter 3

Mixed Effects Model

3.1 Motivation

In the previous chapter, we learned analysis methods for repeatedly measured data with equal dimensions. Then how do we handle data with unequal dimensions? That is, we observe

Yi ∼ N_{ni}(Xiβ, Σi), i = 1, 2, · · · , K.

(Since there are different numbers of observations, the variances also differ; at the very least, their dimensions are different.) If one estimates β and the Σi directly (without an assumption on the correlation structure), then there are too many parameters, which become difficult to estimate. If the maximum number of measurements, say n, is known, one can assume that each Σi is a submatrix of an n × n matrix Σ, but then there might be a problem of missing data or irregular time points (in fact, the two might coincide). Alternatively, one can consider specially structured models, called mixed effects models, which assume that the elements of Yi are correlated only because they share common characteristics, called random effects. For example, let yij be the observation of the ith individual at the jth time point. Then the mixed effects model

yij = β0 + bi + Xijβ + εij

can be considered, where εi and bi are independent random variables with

εi ∼ N(0, σ²Ii),   bi ∼ N(0, D),   bi ⊥ εi | Xi.

Note that all distributional characteristics and independence statements are conditional on X. Also, do not be confused by the notation; here σ² and D are scalars. The random effect term bi can be interpreted as a "specific characteristic or propensity of each individual." Furthermore, we can easily obtain the following


properties.

• E(yij |xij) = β0 + xijβ. (“Marginal mean response”)

• E(yij |bi, xij) = β0 + bi + xijβ. (“Conditional mean response”)

• V ar(yij |bi, xij) = σ2. (“Conditional independence”)

• V ar(yij |xij) = EV ar(yij |bi, xij) + V arE(yij |bi, xij) = σ2 +D.

• Cov(yij , yik|xij , xik) = D (“only sharing bi yields correlation”)

• Corr(yij, yik | xij, xik) = D / (σ² + D).

Remark 3.1. Random Intercept Model. Recall that the model is

E(yij) = β0 + xijβ.

However, if we consider the response of a specific individual, then we should look at

E(yij | bi) = β0 + xijβ + bi.

In other words, the mean response of each individual has a random intercept depending on the random effect bi. Note that the model yields a common slope, i.e., parallel mean response lines.

Figure 3.1: Random intercept model

Remark 3.2. Matrix notation. Consider the following notation.

• Yi = (Yi1, · · · , Y_{i,ni})>, and Xi is the ni × 2 matrix with rows (1, Xij), j = 1, · · · , ni.

Then our model can be rewritten as

Y = Xβ + Zb + ε,

where Y = (Y1>, · · · , YK>)>, X = (X1>, · · · , XK>)>, the coefficient vector is (β0, β)>, b = (b1, · · · , bK)>, ε = (ε1>, · · · , εK>)>, and Z = blockdiag(11, 12, · · · , 1K) with 1i the ni × 1 vector of ones. Then we can write

• E(Yi | Xi, bi) = Xiβ + 1i bi;
• E(Yi | Xi) = Xiβ;
• Var(Yi | Xi, bi) = σ²Ii;
• Var(Yi | Xi) = σ²Ii + D 1i1i>;
• Corr(Yi) = (D/(σ² + D)) Ji + (1 − D/(σ² + D)) Ii,   where Ji = 1i1i>.

Also, the marginal (log-)likelihood becomes

log L = −(1/2) ∑_{i=1}^K log |Vi| − (1/2) ∑_{i=1}^K (Yi − Xiβ)> Vi^{−1} (Yi − Xiβ),

where

Vi = σ²Ii + D 1i1i> = Var(Yi | Xi).

Note that

Vi^{−1} = (σ² + D)^{−1} (1/(1 − ρ)) ( Ii − (ρ / (1 + (ni − 1)ρ)) Ji ),

where ρ = D/(σ² + D). Note that if σ² increases, then the correlation ρ decreases; identifying which observations belong to which individual becomes more difficult. We can maximize this likelihood using the Newton-Raphson algorithm.
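To make the induced covariance structure concrete, here is a tiny numeric sketch (with arbitrary σ² and D) that builds Vi = σ²Ii + D·1i1i>, its intraclass correlation ρ = D/(σ² + D), and checks the closed-form inverse above:

    import numpy as np

    sigma2, D, n_i = 2.0, 3.0, 4                  # hypothetical variance components
    ones = np.ones((n_i, 1))
    J = ones @ ones.T
    V_i = sigma2 * np.eye(n_i) + D * J
    rho = D / (sigma2 + D)
    sd = np.sqrt(np.diag(V_i))
    Corr = V_i / np.outer(sd, sd)
    print(rho)         # 0.6
    print(Corr)        # off-diagonal entries all equal rho

    # closed-form inverse: V^{-1} = (sigma2+D)^{-1} (1/(1-rho)) (I - rho/(1+(n-1)rho) J)
    V_inv = (1 / (sigma2 + D)) * (1 / (1 - rho)) * (np.eye(n_i) - rho / (1 + (n_i - 1) * rho) * J)
    print(np.allclose(V_inv, np.linalg.inv(V_i)))   # True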


3.2 Model Estimation

Consider following mixed effects model

Yi = Xiβ + Zibi + εi, i = 1, 2, · · · ,K.

3.2.1 Notations

From now on, we use the following notation. Let

Yi = (yi1, yi2, · · · , y_{i,ni})>

be the ni × 1 vector of responses, and

εi = (εi1, εi2, · · · , ε_{i,ni})>

be the ni × 1 vector of errors. Also, let Xi be the ni × p covariate matrix and β the p × 1 coefficient vector. For the random effect terms, let Zi be an ni × q matrix and bi a q × 1 vector of random regression coefficients.

Example 3.3. (Random intercept and random slope) We can consider the following model with q = 2 and ni = 4:

(yi1, yi2, yi3, yi4)> = Xi (β0, β∼>)> + Zi (b0i, b1i)> + (εi1, εi2, εi3, εi4)>,

with

Xi = [ 1 1 xi1 ; 1 2 xi2 ; 1 3 xi3 ; 1 4 xi4 ],   Zi = [ 1 1 ; 1 2 ; 1 3 ; 1 4 ].

Here, the (under)tilde notation β∼ is used to emphasize that β∼ is a vector.

3.2.2 Likelihood approach

Now we assume distributional characteristics. Assume that εi ∼ Nni(0, σ2Ii) and bi ∼ Nq(0, D),

where D is q × q matrix. Also assume that bi and εi are (conditionally) independent. Then,

• Conditionally on bi,

E(Yi|Xi, bi) = Xiβ + Zibi, V ar(Yi|Xi, bi) = σ2Ii.

• Marginally,

E(Yi|Xi) = Xiβ, V ar(Yi|Xi) = σ2Ii + ZiDZ>i .

More precisely,

Yi ∼ N_{ni}(Xiβ, σ²Ii + Zi D Zi>).


Thus we get the marginal log-likelihood

logL = −(1/2) Σ_{i=1}^K log|Vi| − (1/2) Σ_{i=1}^K (Yi − Xiβ)^T Vi^{-1} (Yi − Xiβ),

where

Vi = Var(Yi | Xi) = σ2 Ii + Zi D Zi^T.

Then we can find the MLE (or equivalently, the WLSE) which maximizes the marginal likelihood. Note that the likelihood equation is

Σ_{i=1}^K Xi^T Vi^{-1} (Yi − Xiβ) = 0,

which is obtained from ∂logL/∂β = 0. For V, we can also find the likelihood equation from

−∂logL/∂θk = (1/2) Σ_{i=1}^K tr( Vi^{-1} ∂Vi/∂θk ) − (1/2) Σ_{i=1}^K (Yi − Xiβ)^T Vi^{-1} (∂Vi/∂θk) Vi^{-1} (Yi − Xiβ) = 0,

where θ is the vector of parameters in Vi, e.g., θ = (σ2, D11, D12, D22)^T in the case of q = 2.
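In practice the two likelihood equations are usually handled by profiling β out for a given θ (a weighted least squares solve) and optimizing the profiled likelihood over θ numerically. The following is a hedged sketch of that idea (not the lecture's code); the parameterisation of (σ2, D) via a log-variance and a lower-triangular Cholesky factor is an assumption made only for this illustration.

```python
# Sketch: profile beta via GLS given theta, then minimize the profiled negative log-likelihood.
import numpy as np
from scipy.optimize import minimize

def profile_beta(V_list, Y_list, X_list):
    A = sum(X.T @ np.linalg.solve(V, X) for V, X in zip(V_list, X_list))
    b = sum(X.T @ np.linalg.solve(V, Y) for V, Y, X in zip(V_list, Y_list, X_list))
    return np.linalg.solve(A, b)                    # solves sum_i Xi' Vi^{-1}(Yi - Xi beta) = 0

def neg_profile_loglik(theta, Y_list, X_list, Z_list):
    sigma2 = np.exp(theta[0])                       # log-parameterised for positivity
    q = Z_list[0].shape[1]
    L = np.zeros((q, q)); L[np.tril_indices(q)] = theta[1:]
    D = L @ L.T                                     # Cholesky parameterisation of D
    V_list = [sigma2 * np.eye(Z.shape[0]) + Z @ D @ Z.T for Z in Z_list]
    beta = profile_beta(V_list, Y_list, X_list)
    nll = 0.0
    for V, Y, X in zip(V_list, Y_list, X_list):
        r = Y - X @ beta
        nll += 0.5 * (np.linalg.slogdet(V)[1] + r @ np.linalg.solve(V, r))
    return nll

# usage sketch: res = minimize(neg_profile_loglik, theta0, args=(Y_list, X_list, Z_list),
#                              method="Nelder-Mead")
```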

Remark 3.4. In a mixed effects model there are a fixed effect term and a random effect term. If bi were available, the conditional mean response would give us more information once the individual is specified; but unfortunately, bi cannot be observed. Thus we should estimate or predict the bi terms.¹ Under our model, Yi and bi are jointly distributed as

  ( Yi )          ( ( Xiβ )   ( Vi        Zi D ) )
  ( bi ) ∼ N_{ni+q}( (  0  ) , ( D Zi^T   D    ) ).

Thus the conditional distribution of bi given Yi is also multivariate normal:

bi | Yi ∼ N_q( E(bi | Yi), Var(bi | Yi) ),

where

E(bi | Yi) = D Zi^T Vi^{-1} (Yi − Xiβ)   and   Var(bi | Yi) = D − D Zi^T Vi^{-1} Zi D.

Then E(bi | Yi) = D Zi^T Vi^{-1} (Yi − Xiβ) is the best linear unbiased predictor (BLUP), in the sense that

E(bi − E(bi | Yi))² ≤ E(bi − a^T Yi)²   for all a such that E(a^T Yi) = E(bi).

¹ "bi can be estimated": in this view, we regard bi as a "realization" of each random effect. "bi can be predicted": in this view, we regard bi as a random quantity itself. The latter term is more common in the literature.


(Actually, it is also the best predictor (BP).) It gives a heuristic interpretation: if the components of Yi tend to exceed those of Xiβ, then the predicted bi is positive.
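A minimal sketch of computing this predictor, assuming the variance components (σ2, D) and β have already been estimated (names are illustrative, not from the lecture):

```python
# Sketch: BLUP / empirical Bayes predictor E(bi | Yi) = D Zi' Vi^{-1} (Yi - Xi beta).
import numpy as np

def blup(Yi, Xi, Zi, beta, sigma2, D):
    Vi = sigma2 * np.eye(len(Yi)) + Zi @ D @ Zi.T
    resid = Yi - Xi @ beta
    b_hat = D @ Zi.T @ np.linalg.solve(Vi, resid)          # E(bi | Yi)
    var_b = D - D @ Zi.T @ np.linalg.solve(Vi, Zi @ D)     # Var(bi | Yi)
    return b_hat, var_b
```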

Remark 3.5. Can we maximize the joint likelihood over β and b1, b2, · · · , bK as if the bi's were fixed parameters? That is, minimize SS(β, b1, b2, · · · , bK), where

SS(β, b1, · · · , bK) = Σ_{i=1}^K (Yi − Xiβ − Zibi)^T Ri^{-1} (Yi − Xiβ − Zibi) + Σ_{i=1}^K bi^T D^{-1} bi,

which is, up to constants, −2 Σ_{i=1}^K ( log f(Yi | Xi, bi) + log g(bi | Xi) ), with Var(Yi | bi) = Ri. In general, this approach will not lead to correct inference about β, since the number of unknown bi's increases with the sample size, which implies that we would not be able to enjoy the good properties of the MLE. However, in linear mixed effects models this approach does give correct inference for β. Why? Note that the bi minimizing SS is

b̂i = (Zi^T Ri^{-1} Zi + D^{-1})^{-1} Zi^T Ri^{-1} (Yi − Xiβ),

since SS can be viewed as a regularized (weighted) sum of squared residuals with responses Yi − Xiβ and covariates Zi. However, from

(Zi^T Ri^{-1} Zi + D^{-1}) D Zi^T = Zi^T Ri^{-1} (Zi D Zi^T + Ri),

we obtain

(Zi^T Ri^{-1} Zi + D^{-1})^{-1} Zi^T Ri^{-1} = D Zi^T (Zi D Zi^T + Ri)^{-1},

which yields

b̂i = D Zi^T (Zi D Zi^T + Ri)^{-1} (Yi − Xiβ) = D Zi^T Vi^{-1} (Yi − Xiβ) = E(bi | Yi).

It is exactly equal to the BLUP of bi, and hence the β̂ based on b̂i will also be the same as that based on the BLUP, by the following identity:

Σ_{i=1}^K Xi^T Ri^{-1} (Yi − Xiβ − Zib̂i) = Σ_{i=1}^K Xi^T Vi^{-1} (Yi − Xiβ).   (3.1)

Why does (3.1) yield β̂ = β̂_BLUP? The LHS of (3.1) is the β-derivative of SS(β, b1, · · · , bK) with b̂i plugged in; the RHS is the (marginal) likelihood equation. Thus the remaining part is to prove (3.1). It is easily obtained from

Yi − Xiβ − Zib̂i = Yi − Xiβ − Zi D Zi^T (Zi D Zi^T + Ri)^{-1} (Yi − Xiβ)
               = Ri (Zi D Zi^T + Ri)^{-1} (Yi − Xiβ)

and

Σ_{i=1}^K Xi^T Ri^{-1} (Yi − Xiβ − Zib̂i) = Σ_{i=1}^K Xi^T (Zi D Zi^T + Ri)^{-1} (Yi − Xiβ)
                                         = Σ_{i=1}^K Xi^T Vi^{-1} (Yi − Xiβ).

Remark 3.6. If bi were observed, the problem would be easy. Construct the full data as (Y^o, b), the observed data as (Y^o), and the missing data as (b).² If bi were observed, the joint log-likelihood would be

logL = Σ_{i=1}^K ( log f(Yi | bi) + log g(bi) )
     = −(1/2) Σ_{i=1}^K log|σ2 Ii| − (1/2) Σ_{i=1}^K (Yi − Xiβ − Zibi)^T (σ2 Ii)^{-1} (Yi − Xiβ − Zibi) − (K/2) log|D| − (1/2) Σ_{i=1}^K bi^T D^{-1} bi,

and hence

D̂ = (1/K) Σ_{i=1}^K bi bi^T

would be obtained. But bi is unobserved, and hence such an estimate is unavailable.

² The term "missing" does not refer only to data that we intended to observe but failed to; it depends on how the full data set is defined. For example, in a mixed effects model, bi is never intended to be observed, so there is no failure in data collection, yet we often regard it as missing. In contrast (assuming max(n1, · · · , nK) = n1), Y_{2,n2+1}, · · · , Y_{2,n1} are "failed to observe" even though we intended to (∵ j denotes the time points), but we will NOT call such data missing.

A possible approach to such a missing-data problem is to implement an EM algorithm.

Theorem 3.7. EM algorithm. Repeat the following steps until the estimates converge:

β(p+1) = ( Σ_{i=1}^K Xi^T Xi )^{-1} Σ_{i=1}^K Xi^T (Yi − Zi b̂i),

D(p+1) = D(p) − (1/K) Σ_{i=1}^K D(p) Zi^T Vi^{-1} Zi D(p) + (1/K) Σ_{i=1}^K D(p) Zi^T Vi^{-1} (Yi − Xiβ(p)) (Yi − Xiβ(p))^T Vi^{-1} Zi D(p),

σ2(p+1) = (1/n) Σ_{i=1}^K [ (Yi − Xiβ(p+1) − Zi b̂i)^T (Yi − Xiβ(p+1) − Zi b̂i) + tr( Zi^T Zi Var(bi | Yi, θ(p)) ) ],

where

b̂i = D(p) Zi^T Vi^{-1} (Yi − Xiβ(p)),

and Vi is evaluated at θ(p).

Proof. Note that the complete-data log-likelihood is

log f(Yc | θ) = −(1/2) Σ_{i=1}^K log|σ2 Ii| − (1/2) Σ_{i=1}^K (Yi − Xiβ − Zibi)^T (σ2 Ii)^{-1} (Yi − Xiβ − Zibi) − (K/2) log|D| − (1/2) Σ_{i=1}^K bi^T D^{-1} bi.

Then we can obtain Q(θ | θ(p)) for the EM algorithm as

Q(θ | θ(p)) = E_{b|Y,θ(p)}[ log f(Yc | θ) ]
            = −(n/2) log σ2 − (K/2) log|D| − (1/2) Σ_{i=1}^K E_{b|Y,θ(p)}[ bi^T D^{-1} bi ] − (1/(2σ2)) Σ_{i=1}^K E_{b|Y,θ(p)}[ (Yi − Xiβ − Zibi)^T (Yi − Xiβ − Zibi) ],

where n = Σ_{i=1}^K ni. If we denote

µi^p := E(bi | Yi, θ(p)) = D(p) Zi^T (Zi D(p) Zi^T + σ2(p) Ii)^{-1} (Yi − Xiβ(p)),
Σi^p := Var(bi | Yi, θ(p)) = D(p) − D(p) Zi^T (Zi D(p) Zi^T + σ2(p) Ii)^{-1} Zi D(p),

we can see that

E_{b|Y,θ(p)}[ bi^T D^{-1} bi ] = tr(D^{-1} Σi^p) + µi^{pT} D^{-1} µi^p

and

E_{b|Y,θ(p)}[ (Yi − Xiβ − Zibi)^T (Yi − Xiβ − Zibi) ] = (Yi − Xiβ)^T (Yi − Xiβ) − 2 µi^{pT} Zi^T (Yi − Xiβ) + tr(Zi^T Zi Σi^p) + µi^{pT} Zi^T Zi µi^p.

Combining all of the above terms, we finally get

Q(θ | θ(p)) = −(n/2) log σ2 − (K/2) log|D| − (1/2) Σ_{i=1}^K ( tr(D^{-1} Σi^p) + µi^{pT} D^{-1} µi^p )
              − (1/(2σ2)) Σ_{i=1}^K ( (Yi − Xiβ)^T (Yi − Xiβ) − 2 µi^{pT} Zi^T (Yi − Xiβ) + tr(Zi^T Zi Σi^p) + µi^{pT} Zi^T Zi µi^p ).


In the M-step, we have to find θ(p+1) maximizing Q(θ | θ(p)).

• To find β(p+1), fix σ2 and D. β(p+1) is the minimizer of

Σ_{i=1}^K ( (Yi − Xiβ)^T (Yi − Xiβ) − 2 µi^{pT} Zi^T (Yi − Xiβ) + µi^{pT} Zi^T Zi µi^p ) = Σ_{i=1}^K (Yi − Xiβ − Zi µi^p)^T (Yi − Xiβ − Zi µi^p),

and hence

β(p+1) = ( Σ_{i=1}^K Xi^T Xi )^{-1} Σ_{i=1}^K Xi^T (Yi − Zi µi^p).

• Now we find D(p+1), which minimizes

K log|D| + Σ_{i=1}^K ( tr(D^{-1} Σi^p) + µi^{pT} D^{-1} µi^p ).

(Note that for any β and σ2, Q is maximized in D when this is minimized.) This is equal to

K log|D| + tr( D^{-1} Σ_{i=1}^K (Σi^p + µi^p µi^{pT}) ),

and hence

D(p+1) = (1/K) Σ_{i=1}^K (Σi^p + µi^p µi^{pT}) = (1/K) Σ_{i=1}^K E_{b|Y,θ(p)}[ bi bi^T ].

It can be rewritten as

D(p+1) = (1/K) Σ_{i=1}^K (Σi^p + µi^p µi^{pT})
       = (1/K) Σ_{i=1}^K ( D(p) − D(p) Zi^T Vi^{(p)-1} Zi D(p) + D(p) Zi^T Vi^{(p)-1} (Yi − Xiβ(p)) (Yi − Xiβ(p))^T Vi^{(p)-1} Zi D(p) )
       = D(p) − (1/K) Σ_{i=1}^K D(p) Zi^T Vi^{(p)-1} Zi D(p) + (1/K) Σ_{i=1}^K D(p) Zi^T Vi^{(p)-1} (Yi − Xiβ(p)) (Yi − Xiβ(p))^T Vi^{(p)-1} Zi D(p).

• Finally, plugging in β(p+1) and D(p+1), we find σ2(p+1), which maximizes the profile likelihood. Note that

Q(β(p+1), D(p+1), σ2 | θ(p)) = −(n/2) log σ2 − (K/2) log|D(p+1)| − (1/2) Σ_{i=1}^K E_{b|Y,θ(p)}[ bi^T D^{(p+1)-1} bi ]
                               − (1/(2σ2)) Σ_{i=1}^K E_{b|Y,θ(p)}[ (Yi − Xiβ(p+1) − Zibi)^T (Yi − Xiβ(p+1) − Zibi) ],

and hence maximizing Q(β(p+1), D(p+1), σ2 | θ(p)) over σ2 is equivalent to minimizing

(n/2) log σ2 + (1/(2σ2)) Σ_{i=1}^K E_{b|Y,θ(p)}[ (Yi − Xiβ(p+1) − Zibi)^T (Yi − Xiβ(p+1) − Zibi) ].

Note that

E_{b|Y,θ(p)}[ (Yi − Xiβ(p+1) − Zibi)^T (Yi − Xiβ(p+1) − Zibi) ]
  = (Yi − Xiβ(p+1))^T (Yi − Xiβ(p+1)) − 2 µi^{pT} Zi^T (Yi − Xiβ(p+1)) + E_{b|Y,θ(p)}[ bi^T Zi^T Zi bi ]
  = (Yi − Xiβ(p+1) − Zi µi^p)^T (Yi − Xiβ(p+1) − Zi µi^p) + tr(Zi^T Zi Σi^p).

So we obtain

σ2(p+1) = (1/n) Σ_{i=1}^K ( (Yi − Xiβ(p+1) − Zi µi^p)^T (Yi − Xiβ(p+1) − Zi µi^p) + tr(Zi^T Zi Σi^p) ).
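The updates above translate almost line-by-line into code. Below is a hedged sketch of one possible implementation (data passed as per-subject lists; all names are illustrative and convergence checking is omitted); it follows the E- and M-step formulas of Theorem 3.7 under the stated assumptions, nothing more.

```python
# Sketch of the EM iteration of Theorem 3.7 for the linear mixed effects model.
import numpy as np

def em_lmm(Y_list, X_list, Z_list, beta, D, sigma2, n_iter=200):
    n = sum(len(Y) for Y in Y_list)          # total number of observations
    K = len(Y_list)                          # number of subjects
    for _ in range(n_iter):
        # E-step: mu_i^p = E(bi | Yi, theta^(p)), Sigma_i^p = Var(bi | Yi, theta^(p))
        mus, Sigmas = [], []
        for Y, X, Z in zip(Y_list, X_list, Z_list):
            V = sigma2 * np.eye(len(Y)) + Z @ D @ Z.T
            r = Y - X @ beta
            mus.append(D @ Z.T @ np.linalg.solve(V, r))
            Sigmas.append(D - D @ Z.T @ np.linalg.solve(V, Z @ D))
        # M-step: beta^(p+1), D^(p+1), sigma2^(p+1)
        XtX = sum(X.T @ X for X in X_list)
        Xty = sum(X.T @ (Y - Z @ mu) for X, Y, Z, mu in zip(X_list, Y_list, Z_list, mus))
        beta = np.linalg.solve(XtX, Xty)
        D = sum(S + np.outer(mu, mu) for S, mu in zip(Sigmas, mus)) / K
        rss = 0.0
        for Y, X, Z, mu, S in zip(Y_list, X_list, Z_list, mus, Sigmas):
            e = Y - X @ beta - Z @ mu
            rss += e @ e + np.trace(Z.T @ Z @ S)
        sigma2 = rss / n
    return beta, D, sigma2
```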

Remark 3.8. Diagnostic tools. In model diagnostics, we should check all of the assumptions of the model. First, we should check whether the mean part (= fixed effect part) is correctly specified, i.e., whether the model includes the correct variables. (Two types of mean misspecification are possible: variables that should be in the model are omitted, or variables that should not be in the model are included.) There are several tools for this: we can test H0 : βi = 0 (for example) using a likelihood ratio, or examine the residuals V^{-1/2}(Y − µ̂). If the mean part is incorrectly specified, these residuals will show a systematic (nonzero-mean) trend in the covariates. We can also employ AIC = −2L + 2k or BIC = −2L + k log K (where k is the number of unknown parameters) in order to perform model selection.

The next task is to check the random effect part, i.e., the choice of Z. In contrast to the mean specification, since the random part Zb is not directly observable, we cannot "test" b = 0. Then how do we verify correctness? One can note that Z appears in the marginal variance of Y,

Var(Yi) = Zi D Zi^T + σ2 Ii.

Thus we can check the model specification through the variance term. If the model has been correctly specified, then (Yij − µ̂ij)² − V̂ar(Yij) should have mean zero, and V^{-1/2}(Y − µ̂) should be homoscedastic. For a more precise


analysis, we can test D12 = 0 or D22 = 0, where

Var(Yi | Xi) = σ2 Ii + Zi ( D11  D12 ; D21  D22 ) Zi^T.

How do we test this? We can apply a likelihood ratio test. The null D12 = 0 means that the random intercept and random slope are uncorrelated. However, testing D22 = 0 may cause a problem, because 0 is on the boundary of the parameter space for D22; here, standard likelihood inference may not perform well.

Finally, we can check the normality assumption using a P-P plot or Q-Q plot of the standardized residuals V^{-1/2}(Y − µ̂).

Remark 3.9. What happens if the mean or variance model is misspecified? Assume that E(Y|X) = Xβ is the true model but we mistakenly fit E(Y|X) = X*β*. Then β̂* is a solution of

Σ_{i=1}^K Xi*^T Vi^{-1} (Yi − Xi* β*) = 0.

Recall that the solution β̂ of

Σ_{i=1}^K Xi^T Vi^{-1} (Yi − Xiβ) = 0

converges to the solution of

E[ Σ_{i=1}^K Xi^T Vi^{-1} (Yi − Xiβ) ] = 0,

i.e., the optimizer of

E[ Σ_{i=1}^K (Yi − Xiβ)^T Vi^{-1} (Yi − Xiβ) ].

Under the true model, that solution is the true value of β. Similarly, under the misspecified mean model, β̂* converges to the solution β* of

E[ Σ_{i=1}^K Xi*^T Vi^{-1} (Yi − Xi* β*) ] = 0.

How about variance misspecification? In this case, too, the solutions V̂* and β̂* converge to the optimizer of

∫ ( − Σ_{i=1}^K log|Vi*(ρ)| − Σ_{i=1}^K (Yi − Xi*β*)^T Vi*^{-1}(ρ) (Yi − Xi*β*) )   [loss under the assumed model]
   × Π_{i=1}^K |2πVi|^{-1/2} exp{ −(1/2) (Yi − Xiβ)^T Vi^{-1} (Yi − Xiβ) }   [density under the true model]   dY.

From this we can also say that, if only the variance model is misspecified while the mean model is correctly specified, then β̂ remains consistent, because it converges to the solution of

E( Σ_{i=1}^K Xi^T Vi*^{-1} (Yi − Xiβ) ) = 0.

Example 3.10. Northern Manhattan Stroke Study (Cognitive Function Data). Stroke-free subjects in northern Manhattan are followed annually to detect the outcome of stroke. One of the goals is to examine whether changes in cognitive function over time depend on kidney function measured by creatinine levels. In total 2029 subjects were investigated, but as time goes by, the number of observed subjects decreases (n1 = 2029, n2 = 1695, n3 = 1348, n4 = 920, n5 = 491, n6 = 164). The observed outcome is the TICS (Telephone Interview for Cognitive Status) score, measured repeatedly. The following figures describe the data.

Figure 3.2: Box plot. Each box corresponds to one time point. There is a slightly increasing trend as time goes by.

Figure 3.3: Score-time plot. (Left) Subjects are divided into two groups with low and high creatinine levels, respectively; the median is used to determine whether the creatinine level is low or high. (Right) Subjects are clustered by the number of repeated measurements. We can find a tendency that subjects observed longer had higher scores.


Figure 3.4: (Left) Random intercept – random slope plot. Each dot corresponds to one individual. Dots lying on a line are from subjects with ni = 1 (i.e., observed only once). Note that b̂i = E(bi | Yi) = D Zi^T Vi^{-1} (Yi − µi), and ni = 1 yields Zi = (1, 1) (∵ Zi's first column is (1, 1, 1, 1)^T; its second is (1, 2, 3, 4)^T). (Right) Random intercept – subject plot.

              WLSE                         Mixed model
              β       s.e.    p-value      β       s.e.    p-value
(Intercept)   47.4    1.21    <.0001       47.8    1.22    <.0001
logCR         -0.88   0.53    0.096        -0.70   0.54    0.1963
time          0.081   0.036   0.0235       0.056   0.029   0.0592
AGE           -0.23   0.013   <.0001       -0.24   0.013   <.0001
woman         -0.68   0.25    0.0063       -0.68   0.26    0.0105
heduc         4.56    0.26    <.0001       4.54    0.26    <.0001
med           -2.00   0.24    <.0001       -2.11   0.24    <.0001
DIAB          -0.46   0.27    0.0892       -0.53   0.27    0.0522
etmod         0.84    0.22    0.0002       0.90    0.22    <.0001
NCAD          0.54    0.26    0.0412       0.52    0.26    0.0482
CR*time       -0.36   0.12    0.0049       -0.41   0.11    0.0001

Table 3.1: WLSE vs. LMM fit. In the WLSE, an intraclass correlation structure is assumed. In the LMM, a random intercept and slope model is used. We can verify that the two models give similar fits.

Remark 3.11. The conditional independence assumption can also be dropped. That is, we need not assume Var(Yi | bi) = σ2 Ii; a more complicated error structure can be used. For example, we can assume that εi | bi ∼ N_{ni}(0, σ2 Ωi), where

  Ωi = ( 1          α          α²        · · ·  α^{ni−1}
         α          1          α         · · ·  α^{ni−2}
         α²         α          1         · · ·  α^{ni−3}
         ⋮          ⋮          ⋮          ⋱     ⋮
         α^{ni−1}   α^{ni−2}   α^{ni−3}  · · ·  1        ).

3.2.3 Restricted Maximum Likelihood (REML)

Note that D̂_MLE (the MLE of the variance components) is criticized because it tends to be biased downward, since it ignores the degrees of freedom lost by estimating β. In this view, restricted maximum likelihood (REML) estimation is used, which is based on N − p linearly independent error contrasts.

Intuitively, it uses p-dimensional components of the N data points to estimate the main effect term β, and uses the remaining N − p error components to estimate the variance components.

Let S = I − X(X^T X)^{-1} X^T be the N × N residual projection matrix, and let A be an N × (N − p) matrix satisfying S = AA^T and A^T A = I (obtained from the spectral decomposition of S). Then w = A^T Y provides a particular set of N − p linearly independent error contrasts which is orthogonal (in the sense of statistical independence) to β̂ = G^T Y, where G = V^{-1} X (X^T V^{-1} X)^{-1}. Note that the orthogonality of w and β̂ comes from A^T V G = A^T X (X^T V^{-1} X)^{-1} = 0 (since A^T X = 0), and hence

Cov(A^T Y, G^T Y) = A^T V G = 0.

(When we assume that Y is multivariate normal, w = A^T Y and β̂ = G^T Y are independent.) This point has an important interpretation: since A^T Y is independent of β̂, and β may be viewed as a nuisance when we estimate the variance components, we can obtain a more efficient estimate by using A^T Y. Patterson and Thompson (1971) proposed the likelihood based on A^T Y, later named the restricted likelihood. It turns out that

|A^T V A| = |V| |X^T V^{-1} X|   (up to a multiplicative constant not involving V).

(This can be shown via a change of variables; for more details, see the tutorial document by Zhang (2015).) Also note the following.

Proposition 3.12.

Y^T A (A^T V A)^{-1} A^T Y = (Y − Xβ̂)^T V^{-1} (Y − Xβ̂).

Proof. The key point of the proof is that A^T V G = 0 holds. In other words, A* := V^{1/2} A and X* := V^{-1/2} X are orthogonal. (Motivation: since the column space of (A*, X*) is N-dimensional with A* ⊥ X*, the projection matrix constructed from (A*, X*) should be the identity.) Now let

T := I − (A*  X*) [ (A*  X*)^T (A*  X*) ]^{-1} (A*  X*)^T.

Then

T = I − (A*  X*) ( (A*^T A*)^{-1}  0 ; 0  (X*^T X*)^{-1} ) (A*  X*)^T = I − A*(A*^T A*)^{-1} A*^T − X*(X*^T X*)^{-1} X*^T

is obtained. Since A*(A*^T A*)^{-1} A*^T + X*(X*^T X*)^{-1} X*^T is an idempotent matrix, so is T; but from

tr(T T^T) = tr(T) = N − (N − p) − p = 0,

we obtain T = 0. Thus we get

A*(A*^T A*)^{-1} A*^T + X*(X*^T X*)^{-1} X*^T = I.

Rearranging the terms (pre- and post-multiplying by V^{-1/2}), we obtain

A (A^T V A)^{-1} A^T + V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1} = V^{-1}.

Thus we get

A (A^T V A)^{-1} A^T = V^{-1} − V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}
  = V^{-1/2} ( I − V^{-1/2} X (X^T V^{-1} X)^{-1} X^T V^{-1/2} ) V^{-1/2}
  = V^{-1/2} ( I − V^{-1/2} X (X^T V^{-1} X)^{-1} X^T V^{-1/2} ) ( I − V^{-1/2} X (X^T V^{-1} X)^{-1} X^T V^{-1/2} ) V^{-1/2}
  = ( V^{-1/2} − V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1/2} ) ( V^{-1/2} − V^{-1/2} X (X^T V^{-1} X)^{-1} X^T V^{-1} )
  = ( I − G X^T ) V^{-1} ( I − X G^T )
  = ( I − X G^T )^T V^{-1} ( I − X G^T ),

and therefore

Y^T A (A^T V A)^{-1} A^T Y = Y^T ( I − X G^T )^T V^{-1} ( I − X G^T ) Y = (Y − Xβ̂)^T V^{-1} (Y − Xβ̂).

Note that the (restricted) log-likelihood of A^T Y ∼ N(0, A^T V A) is, up to constants,

l* = −(1/2) log|A^T V A| − (1/2) Y^T A (A^T V A)^{-1} A^T Y
   = −(1/2) log|V| − (1/2) log|X^T V^{-1} X| − (1/2) (Y − Xβ̂)^T V^{-1} (Y − Xβ̂)
   = −(1/2) log| Σ_{i=1}^K Xi^T Vi^{-1} Xi | + l(β̂(D)),

where

l(β) = −(1/2) log|V| − (1/2) (Y − Xβ)^T V^{-1} (Y − Xβ)

is the marginal log-likelihood. Thus, the restricted likelihood can be viewed as a "proper likelihood for V when β is a nuisance." In other words, in the REML approach β is not of interest, so we "plug in" β̂ into the likelihood and find the estimate of the variance components; the restricted likelihood is free of β.
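A minimal sketch of evaluating this criterion (not the lecture's code; the function assumes the per-subject blocks Vi, Yi, Xi are supplied as lists) is the following; it would be maximized over the variance components only, with β profiled out by GLS.

```python
# Sketch: REML criterion l* = -1/2 log|sum_i Xi' Vi^{-1} Xi| + l(beta_hat(V)), up to constants.
import numpy as np

def reml_loglik(V_list, Y_list, X_list):
    XtVX = sum(X.T @ np.linalg.solve(V, X) for V, X in zip(V_list, X_list))
    XtVy = sum(X.T @ np.linalg.solve(V, Y) for V, Y, X in zip(V_list, Y_list, X_list))
    beta = np.linalg.solve(XtVX, XtVy)              # GLS estimate of beta given V
    ll = -0.5 * np.linalg.slogdet(XtVX)[1]          # REML adjustment term
    for V, Y, X in zip(V_list, Y_list, X_list):
        r = Y - X @ beta
        ll += -0.5 * (np.linalg.slogdet(V)[1] + r @ np.linalg.solve(V, r))
    return ll
```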

Remark 3.13. A Bayesian interpretation is also available. Note that the restricted likelihood can be written (up to constants) as

l* = (1/2) log|Var(β̂)| + l(β̂);

i.e., it can be viewed as the integrated (posterior) likelihood obtained when β is given a flat prior distribution.

3.2.4 Two-stage Modeling

We can also handle the data in two stages. Such a two-stage approach first summarizes the within-individual trend, and then performs inference across individuals. The model can be summarized as follows. First, "summarize the within-individual trend," i.e.,

Yi | βi ∼ N_{ni}(Xi βi, Vi).

Note that

β̂i = (Xi^T Vi^{-1} Xi)^{-1} Xi^T Vi^{-1} Yi,

and hence

β̂i | βi ∼ N_p( βi, (Xi^T Vi^{-1} Xi)^{-1} ).

Next, inference "across individuals" is performed, i.e.,

βi ∼ N_p(Zi α, D).

This can be viewed as follows. First, the individual-specific coefficient βi is drawn from βi ∼ N_p(Zi α, D); βi differs across individuals even if Zi α is the same for all of them. Next, for each individual, the response Yi is generated given βi. It is called "two-stage" modeling since the estimation procedure has two stages: the first stage is to obtain β̂i. Note that marginally

β̂i ∼ N( Zi α, D + (Xi^T Vi^{-1} Xi)^{-1} ).

Using this, one can infer α and D; this is the second stage.


Remark 3.14. Two-stage modeling is useful when the time series is long, as in diary studies. Note that Xi^T Vi^{-1} Xi becomes large if the individual is studied for a long time, i.e., if ni is large. Thus Var(β̂i) = (Xi^T Vi^{-1} Xi)^{-1} + D becomes small when the time series is long, and large when the time series is short.

Example 3.15. Recall the prolactin data. Let

  Xi = ( 1 1          Zi = ( 1  groupi  baselinei  0  0       0
         1 2                 0  0       0          1  groupi  baselinei );
         1 3
         1 4 ),

the first and second rows of Zi correspond to the (random) intercept and (random) slope, respectively (note that βi is random in this model). Then

β̂i | βi ∼ N( βi, (Xi^T Vi^{-1} Xi)^{-1} ).

Remark 3.16. The second stage is to estimate α and D. It can be achieved by maximum likelihood:

Σ_{i=1}^K Zi^T ( D + (Xi^T Vi^{-1} Xi)^{-1} )^{-1} (β̂i − Zi α) = 0

is the likelihood equation for α. It is easily obtained that

α̂ = ( Σ_{i=1}^K Zi^T (D + (Xi^T Vi^{-1} Xi)^{-1})^{-1} Zi )^{-1} ( Σ_{i=1}^K Zi^T (D + (Xi^T Vi^{-1} Xi)^{-1})^{-1} β̂i ).

For D̂_MLE, an iterative algorithm such as Newton-Raphson is required. Alternatively, an EM-type estimate can be computed:

D(p+1) = D(p) − (1/K) Σ_i D(p) Σi^{-1} D(p) + (1/K) Σ_i D(p) Σi^{-1} (β̂i − Zi α(p)) (β̂i − Zi α(p))^T Σi^{-1} D(p),

where Σi = D(p) + (Xi^T Vi^{-1} Xi)^{-1}.
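The two stages are straightforward to code. The sketch below (names illustrative, D taken as given; in practice D would be updated with the EM-type formula above) computes β̂i per subject and then the GLS combination α̂.

```python
# Sketch of the two-stage estimator: stage 1 per-subject WLS, stage 2 GLS across subjects.
import numpy as np

def stage1(Y_list, X_list, V_list):
    betas, covs = [], []
    for Y, X, V in zip(Y_list, X_list, V_list):
        XtVinvX = X.T @ np.linalg.solve(V, X)
        betas.append(np.linalg.solve(XtVinvX, X.T @ np.linalg.solve(V, Y)))  # beta_hat_i
        covs.append(np.linalg.inv(XtVinvX))                                  # Var(beta_hat_i | beta_i)
    return betas, covs

def stage2_alpha(betas, covs, Z_list, D):
    A = sum(Z.T @ np.linalg.solve(D + C, Z) for Z, C in zip(Z_list, covs))
    b = sum(Z.T @ np.linalg.solve(D + C, bi) for Z, C, bi in zip(Z_list, covs, betas))
    return np.linalg.solve(A, b)                                             # alpha_hat
```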

3.3 Appendix 1: EM algorithm

The Expectation-Maximization (EM) algorithm is a computational algorithm to find the maximum

likelihood estimate of the parameters of an underlying distribution from a given data set when the data

is incomplete or has missing values. Many statistical problems can be formulated as a missing data

problem. Examples include mixture models, cluster analysis, analysis with latent variables, random


effects models, and causal inference. Before we start, we introduce some notation that will be used throughout the section. Let Yo be the observed data and Ym the missing data. Then Yc = (Yo, Ym) is the complete data set. Also, let f(Yc|θ) and g(Yo|θ) be the densities of the complete data and of the observed data given θ, respectively. Then

g(Yo|θ) = ∫ f(Yc|θ) dYm

holds. Finally, let

k(Ym|Yo, θ) = f(Yc|θ) / g(Yo|θ)

be the density of the missing data Ym given the observed data. In fact, to cover a broader class of incomplete data, we can let the observed data be a function of the complete data, say T(Yc) = Yo, and let X(y) = {Yc ∈ X : T(Yc) = y} be the set of complete data whose observed data equal y. Then

g(y|θ) = ∫_{X(y)} f(x|θ) dx.

However, in this section we only consider the case where the missing data are explicit, so that Yc = (Yo, Ym).

The goal of the EM algorithm is to maximize ℓ(θ) = log g(Yo|θ). For this, at the (p+1)th iteration, one finds θ such that ℓ(θ) ≥ ℓ(θ(p)), which gives

ℓ(θ(p+1)) ≥ ℓ(θ(p)),

so that θ(p) eventually converges to θ̂_MLE. To do this, the EM algorithm introduces

Q(θ|θ(p)) = E_{Ym}[ log f(Yc|θ) | Yo, θ(p) ]

and maximizes Q(θ|θ(p)) at each iteration. It is motivated as follows. Note that

ℓ(θ) = log g(Yo|θ)
     = log ∫ f(Yc|θ) dYm
     = log ∫ [ f(Yc|θ) / k(Ym|Yo, θ(p)) ] k(Ym|Yo, θ(p)) dYm
     = log E_{Ym}[ f(Yc|θ) / k(Ym|Yo, θ(p)) | Yo, θ(p) ]
     ≥ E_{Ym}[ log { f(Yc|θ) / k(Ym|Yo, θ(p)) } | Yo, θ(p) ]

by Jensen's inequality. Also note that

ℓ(θ(p)) = log g(Yo|θ(p))
        = E_{Ym}[ log g(Yo|θ(p)) | Yo, θ(p) ]
        = E_{Ym}[ log { f(Yc|θ(p)) / k(Ym|Yo, θ(p)) } | Yo, θ(p) ]

holds. Combining both results gives

ℓ(θ) ≥ ℓ(θ(p)) + E_{Ym}[ log { f(Yc|θ) / f(Yc|θ(p)) } | Yo, θ(p) ].

Any θ that increases the right-hand side of this inequality therefore increases ℓ(θ). In order to get the greatest possible increase, choose

θ(p+1) = argmax_θ E_{Ym}[ log { f(Yc|θ) / f(Yc|θ(p)) } | Yo, θ(p) ].

Note that this is equivalent to

θ(p+1) = argmax_θ E_{Ym}[ log f(Yc|θ) | Yo, θ(p) ] = argmax_θ Q(θ|θ(p)).

Algorithm 1 EM algorithm

1: Repeat the following iteration until convergence.
2: for p = 1, 2, 3, · · · do
3:   E-step. Evaluate Q(θ|θ(p)).
4:   M-step. Find θ(p+1) = argmax_θ Q(θ|θ(p)).
5: end for
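As a concrete illustration of these two steps, here is a minimal sketch (entirely illustrative, not from the lecture) of EM for a two-component univariate Gaussian mixture, one of the classical "missing data" examples mentioned above: the unobserved component label plays the role of Ym.

```python
# Sketch: EM for a two-component Gaussian mixture; the component label is the missing data.
import numpy as np
from scipy.stats import norm

def em_gmm2(y, pi=0.5, mu=(0.0, 1.0), sd=(1.0, 1.0), n_iter=100):
    mu1, mu2 = mu
    sd1, sd2 = sd
    for _ in range(n_iter):
        # E-step: posterior probability that each observation comes from component 1
        d1 = pi * norm.pdf(y, mu1, sd1)
        d2 = (1 - pi) * norm.pdf(y, mu2, sd2)
        w = d1 / (d1 + d2)
        # M-step: weighted updates of mixing proportion, means and standard deviations
        pi = w.mean()
        mu1, mu2 = np.sum(w * y) / w.sum(), np.sum((1 - w) * y) / (1 - w).sum()
        sd1 = np.sqrt(np.sum(w * (y - mu1) ** 2) / w.sum())
        sd2 = np.sqrt(np.sum((1 - w) * (y - mu2) ** 2) / (1 - w).sum())
    return pi, (mu1, mu2), (sd1, sd2)
```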

It is well known that the EM algorithm finds a stationary point monotonically.

Theorem 3.17 (Monotonicity Theorem). Every EM algorithm increases ℓ(θ) at every iteration, that is,

ℓ(θ(p+1)) ≥ ℓ(θ(p)),

with equality if and only if

Q(θ(p+1)|θ(p)) = Q(θ(p)|θ(p)).

Proof. Only a sketch is given. Note that f(Yc|θ(p+1)) = g(Yo|θ(p+1)) k(Ym|Yo, θ(p+1)), and hence

Q(θ(p+1)|θ(p)) = ∫ log f(Yc|θ(p+1)) k(Ym|Yo, θ(p)) dYm
             = ∫ log g(Yo|θ(p+1)) k(Ym|Yo, θ(p)) dYm + ∫ log k(Ym|Yo, θ(p+1)) k(Ym|Yo, θ(p)) dYm
             = log g(Yo|θ(p+1)) + H(θ(p+1)|θ(p)),   where H(θ'|θ(p)) := ∫ log k(Ym|Yo, θ') k(Ym|Yo, θ(p)) dYm,

holds. Similarly, we can obtain

Q(θ(p)|θ(p)) = log g(Yo|θ(p)) + H(θ(p)|θ(p)).

Now combining both formulae gives

Q(θ(p+1)|θ(p)) − Q(θ(p)|θ(p)) = ℓ(θ(p+1)) − ℓ(θ(p)) + H(θ(p+1)|θ(p)) − H(θ(p)|θ(p)).

By definition of the EM algorithm, Q(θ(p+1)|θ(p)) − Q(θ(p)|θ(p)) ≥ 0. Thus if we show H(θ(p+1)|θ(p)) − H(θ(p)|θ(p)) ≤ 0, we can derive ℓ(θ(p+1)) ≥ ℓ(θ(p)).

Claim. H(θ(p+1)|θ(p)) − H(θ(p)|θ(p)) ≤ 0.

By the definition of H,

H(θ(p+1)|θ(p)) − H(θ(p)|θ(p)) = ∫ log { k(Ym|Yo, θ(p+1)) / k(Ym|Yo, θ(p)) } k(Ym|Yo, θ(p)) dYm
                             = E_{Ym}[ log { k(Ym|Yo, θ(p+1)) / k(Ym|Yo, θ(p)) } | Yo, θ(p) ].

Now Jensen's inequality gives

E_{Ym}[ log { k(Ym|Yo, θ(p+1)) / k(Ym|Yo, θ(p)) } | Yo, θ(p) ] ≤ log E_{Ym}[ k(Ym|Yo, θ(p+1)) / k(Ym|Yo, θ(p)) | Yo, θ(p) ] = log 1 = 0,

since the last expectation is ∫ k(Ym|Yo, θ(p+1)) dYm = 1.

References

• Lim, C. Y. (2016). Applied Statistics [LaTeX lecture document].

• Rencher, A. C., & Schaalje, G. B. (2008). Linear Models in Statistics. John Wiley & Sons.

• Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1-38.


3.4 Appendix 2: BLUP approach for mixed effects model

In class, the EM algorithm and the (RE)ML approach were covered for estimation (and prediction) in the mixed effects model. In this section, we introduce the BLUP approach, which was treated only briefly in class. Consider a linear mixed effects model

Yi = Xiβ + Zibi + εi,

or equivalently,

Y = Xβ + Zb + ε,

where

Var( (b, ε)^T ) = ( D  0 ; 0  R ).

(The D here differs from the D in the main text; in fact, D = I ⊗ D.) Then

Var(Y) = Z D Z^T + R =: V.

Let K^T β + M^T b be the function of interest to be predicted, and denote a linear predictor by H^T Y. Then the "unbiasedness" condition on H yields

E(H^T Y) = H^T X β = E(K^T β + M^T b) = K^T β,

i.e.,

K = X^T H.

Now our goal is to find the "best" such predictor, i.e., to minimize the variance of the "residual" (in fact, the prediction error). Let

PE = K^T β + M^T b − H^T Y

be the prediction error. Then

Var(PE) = Var(K^T β + M^T b − H^T Y)
        = Var( (M, −H)^T (b^T, Y^T)^T )
        = M^T D M + H^T V H − M^T D Z^T H − H^T Z D M.


Then Var(PE) should be minimized under the constraint K = X^T H. Therefore, our objective function becomes

Q = Var(PE) + (X^T H − K)^T Φ   ("penalized criterion"),

where Φ is a vector of Lagrange multipliers. The "normal equations" become

∂Q/∂H = 2 V H − 2 Z D M + X Φ = 0,
∂Q/∂Φ = X^T H − K = 0.

From now on, for convenience, let θ = Φ/2. Then the normal equations become

V H = Z D M − X θ,
X^T H = K.

From the first one, we get

H = V^{-1} (Z D M − X θ),

and hence

X^T V^{-1} (Z D M − X θ) = K,

i.e.,

X θ = X (X^T V^{-1} X)^{-} (X^T V^{-1} Z D M − K),

where A^{-} denotes a generalized inverse of A (recall that A = A(A^T A)^{-} A^T A). Finally, we get

H = V^{-1} Z D M − V^{-1} X (X^T V^{-1} X)^{-} (X^T V^{-1} Z D M − K),

which yields our conclusion: the BLUP of K^T β + M^T b is

M^T D Z^T V^{-1} ( Y − X (X^T V^{-1} X)^{-} X^T V^{-1} Y ) + K^T (X^T V^{-1} X)^{-} X^T V^{-1} Y.

Note that exactly the same logic applies when K^T β + M^T b is not a scalar; then we get

K^T β̂_BLUP = K^T (X^T V^{-1} X)^{-} X^T V^{-1} Y


and

M^T b̂_BLUP = M^T D Z^T V^{-1} (Y − X β̂_BLUP),

under the assumption that K^T β is estimable. In particular, if X^T V^{-1} X is nonsingular, then

β̂_BLUP = (X^T V^{-1} X)^{-1} X^T V^{-1} Y

is the same as the BLUE, and

b̂_BLUP = D Z^T V^{-1} (Y − X β̂)

is equal to E(b | Y). Note that the solutions X β̂_BLUE and Z b̂_BLUP are also obtained by minimizing

SS = (Y − Xβ − Zb)^T R^{-1} (Y − Xβ − Zb) + b^T D^{-1} b

(as we saw in Remark 3.5), which is an equivalent problem to solving

( X^T R^{-1} X    X^T R^{-1} Z          ) ( β )   ( X^T R^{-1} Y )
( Z^T R^{-1} X    Z^T R^{-1} Z + D^{-1} ) ( b ) = ( Z^T R^{-1} Y ).    (3.2)

In other words, if β̂ and b̂ are the solution of (3.2), then X β̂ is the BLUE of Xβ, and Z b̂ is the BLUP of Zb. Equation (3.2) is called the Mixed Model Equation (MME), and this method is called Henderson's MME approach.
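Because (3.2) is one block linear system, the BLUE of β and the BLUP of b can be obtained from a single solve. A minimal sketch under the assumption that X, Z, R, D are given dense arrays (names illustrative):

```python
# Sketch: Henderson's mixed model equations (3.2) solved as one linear system.
import numpy as np

def henderson_mme(Y, X, Z, R, D):
    Rinv, Dinv = np.linalg.inv(R), np.linalg.inv(D)
    top = np.hstack([X.T @ Rinv @ X, X.T @ Rinv @ Z])
    bottom = np.hstack([Z.T @ Rinv @ X, Z.T @ Rinv @ Z + Dinv])
    lhs = np.vstack([top, bottom])
    rhs = np.concatenate([X.T @ Rinv @ Y, Z.T @ Rinv @ Y])
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]          # beta_hat (BLUE), b_hat (BLUP)
```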

References

• Searle, S. R., & Gruber, M. H. (2016). Linear Models. John Wiley & Sons.

• Searle, S. R., Casella, G., & McCulloch, C. E. (2009). Variance Components (Vol. 391). John Wiley & Sons.

• Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric Regression (No. 12). Cambridge University Press.

• Searle, S. R., & Henderson, C. R. (1961). Computing procedures for estimating components of variance in the two-way classification, mixed model. Biometrics, 17(4), 607-616.


Chapter 4

Missing Data Problem

4.1 Introduction: Missing Mechanisms and Patterns

In a longitudinal study, when subjects are lost to follow-up, the outcome cannot be observed. If we can still observe baseline covariates after loss to follow-up (i.e., only the outcome and time-varying covariates are unobserved), then we need separate treatments for missing outcomes (drop-out) and missing covariates.

Definition 4.1. Before the start, we may define some notations.

• Yij be outcome at the jth measurement on the ith subject.

• Yi = (Yi1, Yi2, · · · , Yin)> be complete data.

• Xi be n× p design matrix for fixed effects.

• Zi be n× q design matrix for random effects.

• rij be an observation (non-missing) indicator for yij; rij = 1 if yij is observed.

• ti be the number of observations for i.

• Ri = (ri1, ri2, · · · , rin)> .

• Let Yi = (Yi^o, Yi^m), where Yi^o and Yi^m are the observed and missing outcomes, respectively.

Example 4.2. Let n = 3 and Ri = (1, 0, 1)^T. Then Yi^o = (Yi1, Yi3) and Yi^m = (Yi2).

Example 4.3. Why is proper treatment of missing outcomes important? Figure 4.1 shows randomly generated simulation data. The first plot shows the fitted line when all of the data are used. The fitted lines in the second and third plots are obtained when there are missing data that are simply ignored. In the second plot, the non-missing indicator Ri is randomly generated and depends on Xi. The third plot is generated under a similar situation; the only difference is that Ri depends on Xi × Yi. Even though the complete data are the same, how missingness arises can affect the fit if the missing data are not treated properly. In summary, "some missing is none; some is dangerous."

Figure 4.1: Simulation: Treatment of Missing Data

Definition 4.4 (Missing Mechanisms). (a) If missingness is (conditionally) independent of the complete data, i.e.,

R ⊥ (Y^o, Y^m) | (X, Z),

then the missing mechanism is called Missing Completely At Random (MCAR).

(b) If missingness is (conditionally) independent of the missing data given the observed data, i.e.,

R ⊥ Y^m | (Y^o, X, Z),

then the missing mechanism is called Missing At Random (MAR).

(c) If R depends on Y^m given (Y^o, X, Z), then the missing mechanism is called Nonignorable Missingness (NI).

Note that this definition is stated for the regression setting with missing outcomes.

Example 4.5. (a) Recall the cognitive function example. If the cognitive score of patient A is low, then A's cognitive function is questionable and further diagnostics are needed. Therefore, A will revisit with high probability, and missingness may not occur at a future time point. In contrast, if the cognitive score of patient B is high, then B may not have any problem with his/her cognitive function, and hence B's future data may be missing. In this case, "missingness R depends on the observed data (the status at the previous visit)," and hence we may suppose that the missing mechanism is MAR.

(b) (NI)

Definition 4.6. An array of observation indicators (r1, r2, r3) is called missing pattern. There are

seven possible patterns; (1,1,1), (1,1,0), (1,0,0), (1,0,1), (0,1,0), (0,1,1) and (0,0,1). Patterns (1,1,1),

(1,1,0) and (1,0,0) are non-increasing, and called monotone. In monotone data, P (rj = 1|rj−1 =

0) = 0.

(1, 1, 1, · · · , 1, 1, 0, 0, 0, · · · , 0, 0)   [first block: all observed; second block: all missed]

Missing patterns (1,0,1), (0,1,0), (0,1,1), and (0,0,1) are called non-monotone. In non-monotone data, P(rj = 1 | rj−1 = 0) ≠ 0.


Remark 4.7. Sometimes the nonresponse mechanism and the missing pattern are confounded. For example, the mechanism r3 ⊥ y3 | y1, y2, r2 = 1, i.e.,

P(r3 | y1, y2, y3, r2 = 1) = P(r3 | y1, y2, r2 = 1),

is MAR for monotonically missing data (∵ under monotonicity, r2 = 1 implies r1 = 1, and hence (y1, y2) = Y^o), but it is NI for non-monotonically missing data (∵ y1 could be missing). Therefore, we need a more complex set of assumptions for non-monotonically missing data to be MAR.

Remark 4.8. Note that we can estimate the missing mechanism via binary regression (e.g., logistic). For example, consider the situation of the previous remark and assume monotonicity. Then if P(r3 | y1, y2, r2 = 1) does not depend on y1 and y2, the missing mechanism is MCAR. In other words, by testing whether the coefficients corresponding to Y^o equal zero, we can test MCAR vs. MAR. In our example, fit the (logistic) regression model

logit P(r3 | y1, y2, r2 = 1) = γ0 + γ1 y1 + γ2 y2,

and if H0 : γ1 = γ2 = 0 is not rejected, we may accept MCAR. However, keep in mind that the sample size may not be powered to detect such an effect; since the study is not designed to verify the missing mechanism, the sample size may not be big enough to detect the difference between MCAR and MAR. Also remark that, to distinguish between MAR and NI, we need to construct a test statistic to check, for example, whether r3 is independent of y3 given the observed data. In our example, fit a binary regression for P(r3 | y1, y2, y3, r2 = 1) and see whether it depends on y3. Even though y3 is missing in practice, this can be handled as a missing covariate problem. Such an approach may be theoretically fine, but we cannot check the parametric model assumptions (i.e., model diagnostics are unavailable). In summary, such a test (MAR vs. NI) is doable but relies on unverifiable model assumptions.
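A hedged illustration of the MCAR-vs-MAR check above, using a likelihood ratio test for γ1 = γ2 = 0. The column names (y1, y2, r2, r3) are hypothetical, and this is just one way such a check could be coded.

```python
# Sketch: logistic regression check of MCAR vs. MAR among subjects with r2 = 1.
import numpy as np
import statsmodels.api as sm
from scipy import stats

def test_mcar_vs_mar(df):
    sub = df[df["r2"] == 1]
    full = sm.Logit(sub["r3"], sm.add_constant(sub[["y1", "y2"]])).fit(disp=0)
    null = sm.Logit(sub["r3"], np.ones((len(sub), 1))).fit(disp=0)
    lr = 2 * (full.llf - null.llf)            # LR statistic for H0: gamma1 = gamma2 = 0
    pval = stats.chi2.sf(lr, df=2)
    return full.params, lr, pval
```

As noted above, failing to reject H0 here is only weak evidence for MCAR, since the study is typically not powered for this comparison.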

4.2 Missing Outcomes

Remark 4.9. There are several approaches to handling missing outcomes; likelihood approaches, estimating equation approaches, imputation, and inverse probability weighting are all available. In this course, only the likelihood approach will be covered. The likelihood approach maximizes the joint likelihood of the observed data (R, Y^o):

Π_{i=1}^K ∫ f(Yi, Ri | Xi; β) dYi^m.


Let

L(θ) = Π_{i=1}^K ∫ f(Yi, Ri | Xi; θ) dYi^m

be the likelihood function given the observed data. To make this approach possible, the joint distribution of (Yi, Ri) must be available; after factoring it into a conditional and a marginal piece, we can model each one. There are two ways of factoring f(Yi, Ri | Xi; θ):

• Selection model: f(Y, R | X; θ) = P(R | Y, X; θ1) f(Y | X; θ2), or

• Pattern mixture model: f(Y, R | X; θ) = f(Y | X, R; θ3) P(R | X; θ4).

In the selection model, the distribution of R | Y must be characterized; it is about "which observations are selected to be missing." In the pattern mixture model, the distribution of Y | R must be characterized; given the missing pattern, the distribution of Y is specified.

Note that the parameter of interest is θ2, since we are interested in the relationship between X and Y. Thus the selection model has the advantage that P(R | Y, X; θ1) can sometimes be ignored in the estimation step. Also note that, for the selection model, the missing mechanism R | Y, X must be specified. On the other hand, in the pattern mixture model an additional step is needed to convert θ3 into θ2, but the notation can be convenient because the (non-)missing indicator is conditioned on when handling Y | R, X.

Example 4.10. Selection Model. Note that

L(θ) = Π_{i=1}^K ∫ P(Ri | Yi, Xi; θ1) f(Yi | Xi; θ2) dYi^m.

We have to examine P(Ri | Yi, Xi; θ1).

(i) Under MCAR, P(Ri | Yi, Xi; θ1) = P(Ri | Xi; θ1), and hence

L(θ) = Π_{i=1}^K ∫ P(Ri | Yi, Xi; θ1) f(Yi | Xi; θ2) dYi^m
     = Π_{i=1}^K ∫ P(Ri | Xi; θ1) f(Yi | Xi; θ2) dYi^m    (P(Ri | Xi; θ1) is constant in Yi)
     = Π_{i=1}^K P(Ri | Xi; θ1) ∫ f(Yi | Xi; θ2) dYi^m.

Thus the log-likelihood becomes

ℓ(θ) = Σ_{i=1}^K log P(Ri | Xi; θ1) + Σ_{i=1}^K log ∫ f(Yi | Xi; θ2) dYi^m.


Note that our interest is only in θ2; we can ignore P(Ri | Xi; θ1). In other words, we do not need to construct the missing model R | X; θ1!

(ii) Under MAR, P(Ri | Yi, Xi; θ1) = P(Ri | Yi^o, Xi; θ1), and hence

L(θ) = Π_{i=1}^K ∫ P(Ri | Yi, Xi; θ1) f(Yi | Xi; θ2) dYi^m
     = Π_{i=1}^K ∫ P(Ri | Yi^o, Xi; θ1) f(Yi | Xi; θ2) dYi^m    (P(Ri | Yi^o, Xi; θ1) is constant in Yi^m)
     = Π_{i=1}^K P(Ri | Yi^o, Xi; θ1) ∫ f(Yi | Xi; θ2) dYi^m.

Again, the log-likelihood becomes

ℓ(θ) = Σ_{i=1}^K log P(Ri | Yi^o, Xi; θ1) + Σ_{i=1}^K log ∫ f(Yi | Xi; θ2) dYi^m,

and since we are not interested in θ1, we do not need to construct the missing model here either.

In either of these missing mechanisms, the step of explicitly constructing the missing model (e.g., modeling P(r3 | y1, y2, y3), etc.) can be omitted, i.e., the missing part can be "ignored"; this is contrasted with the Nonignorable (NI) case.

Remark 4.11. Remark that we want to estimate E(Yi | Xi) in the regression. However, what we can directly estimate is E(Yi | Xi, Ri) rather than E(Yi | Xi) (under MCAR or MAR); if the missing pattern is given, it can be regarded as if the missing values were designed to be missed, and so we can "ignore" the missingness. It is then equivalent to solving the equation

Σ_{i=1}^K Xi^T ∆i (Yi − Xiβ) = 0,

where ∆i is a diagonal matrix whose jth diagonal element is rij. This is the same as solving

Σ_{i=1}^K Xi^T ∆i (Yi − Xiβ) = Σ_{i=1}^K Xio^T (Yio − Xio β) = 0.

It gives an inference for

E(Yi | Xi, Ri) = Xiβ.

However, if the missing mechanism is MCAR, then E(Yi | Xi, Ri) = E(Yi | Xi), and hence we can "ignore the missingness" and obtain an estimate of E(Yi | Xi) = Xiβ. This means that we do not need to conduct a likelihood approach in the MCAR case; we need a likelihood approach under the other missing mechanisms. Thus the rest of this section concentrates on the other missing mechanisms, MAR and NI.

The following proposition shows how to handle a regression model (e.g., a multivariate linear model) via the likelihood approach using a selection model.

Proposition 4.12. Maximizing the likelihood

L = Π_{i=1}^K ∫ f(Yi | Xi; β) P(Ri | Yi^o, Yi^m, Xi; θ1) dYi^m

is equivalent to solving the equation

Σ_{i=1}^K E[ (∂/∂β) log f(Yi | Xi; β) | Yi^o, Ri, Xi ] = 0.

Proof. The score equation from the likelihood is

(∂/∂β) logL = Σ_{i=1}^K (∂/∂β) log ∫ f(Yi | Xi; β) P(Ri | Yi^o, Yi^m, Xi; θ1) dYi^m
  = Σ_{i=1}^K [ ∫ (∂/∂β) f(Yi | Xi; β) P(Ri | Yi^o, Yi^m, Xi; θ1) dYi^m ] / [ ∫ f(Yi | Xi; β) P(Ri | Yi^o, Yi^m, Xi; θ1) dYi^m ]
  = Σ_{i=1}^K [ ∫ (∂/∂β) log f(Yi | Xi; β) · f(Yi | Xi; β) P(Ri | Yi^o, Yi^m, Xi; θ1) dYi^m ] / [ ∫ f(Yi | Xi; β) P(Ri | Yi^o, Yi^m, Xi; θ1) dYi^m ]
  = Σ_{i=1}^K ∫ (∂/∂β) log f(Yi | Xi; β) · f(Yi^m | Yi^o, Ri, Xi; β, θ1) dYi^m
     ( since f(Yi | Xi; β) P(Ri | Yi, Xi; θ1) / ∫ f(Yi | Xi; β) P(Ri | Yi, Xi; θ1) dYi^m = f(Yi^m | Yi^o, Ri, Xi; β, θ1) )
  = Σ_{i=1}^K E[ (∂/∂β) log f(Yi | Xi; β) | Yi^o, Ri, Xi ].


Remark 4.13. (i) Note that the previous proposition is valid even under NI.

(ii) If Ri ⊥ Yi^m | (Yi^o, Xi), i.e., the missing mechanism is MAR, then the likelihood can be factored as

L = Π_{i=1}^K ∫ f(Yi | Xi; β) P(Ri | Yi^o, Yi^m, Xi; θ1) dYi^m = Π_{i=1}^K [ ∫ f(Yi | Xi; β) dYi^m ] P(Ri | Yi^o, Xi; θ1).

Thus, under MAR, the observation probability P(Ri | Yi^o, Xi) does not have to be explicitly specified in the likelihood approach. It implies that standard mixed effects modeling is valid, since in the linear mixed model we maximize the "marginal likelihood," which is the same as ∫ f(Yi | Xi; β) dYi^m here (Yi^m includes the actual missing values and the random effect term bi).

(iii) Now consider the multivariate linear model with the normality assumption. Then

log f(Yi | Xi; β) = −(1/2) log|Vi| − (1/2) (Yi − Xiβ)^T Vi^{-1} (Yi − Xiβ).

Now let Yi = (Yi^o, Yi^m), Xi = (Xio, Xim), and partition

Vi = ( Vi11  Vi12 ; Vi21  Vi22 )

conformably. To apply Proposition 4.12, we need the gradient of log f(Yi | Xi; β). Note that

(∂/∂β) log f(Yi | Xi; β) = Xi^T Vi^{-1} (Yi − Xiβ)
  = ( Xio^T  Xim^T ) ( Vi11  Vi12 ; Vi21  Vi22 )^{-1} ( Yio − Xioβ ; Yim − Ximβ )
  = ( Xio^T  Xim^T ) ( I  −Vi11^{-1}Vi12 ; 0  I ) ( Vi11^{-1}  0 ; 0  Vi22·1^{-1} ) ( Yio − Xioβ ; Yim − Ximβ − Vi21 Vi11^{-1} (Yio − Xioβ) ),

where Vi22·1 = Vi22 − Vi21 Vi11^{-1} Vi12. Now, under MAR,

E(Yim − Ximβ | Yio, Ri, Xi) = E(Yim − Ximβ | Yio, Xi) = Vi21 Vi11^{-1} (Yio − Xioβ),

and hence we get

E[ (∂/∂β) log f(Yi | Xi; β) | Yio, Ri, Xi ]
  = ( Xio^T  Xim^T ) ( I  −Vi11^{-1}Vi12 ; 0  I ) ( Vi11^{-1}  0 ; 0  Vi22·1^{-1} ) ( Yio − Xioβ ; 0 )
  = Xio^T Vi11^{-1} (Yio − Xioβ).

Then the score equation becomes

Σ_{i=1}^K Xio^T Vio^{-1} (Yio − Xioβ) = 0   (with Vio = Vi11).

It implies that the likelihood approach with a selection model is equivalent to "ignoring the missing outcomes" in the multivariate linear regression model under the normality assumption.

(iv) In summary,

– both multivariate linear regression (MLR) and mixed linear regression models can accommodate different numbers of outcome measurements per subject;

– MLR (viewed as solving Σ Xi^T Vi^{-1} (Yi − Xiβ) = 0 as a mere estimating equation) is valid only under MCAR and biased under MAR and NI;

– however, if the data are from a multivariate normal distribution, MLR with correct variance specification is a likelihood analysis, and so it is valid under MAR;

– the mixed effects model is valid if the missing probability does not depend on any unobserved outcome values or random effects, i.e., R ⊥ (Y^m, b) | (Y^o, X, Z).

Example 4.14. Suppose that the outcome is 2-dimensional,

Yi = (Yi1, Yi2)^T,

and Xi is the covariate. Assume that Yi1 is completely observed, but Yi2 is not. Let Ri = 1 if Yi2 is observed, and Ri = 0 otherwise. For simplicity of notation, let Ri = 1 for i = 1, 2, · · · , K1 and Ri = 0 for i = K1 + 1, · · · , K. Then the likelihood is

L = Π_{i=1}^{K1} f(Yi | Xi; β) P(Ri | Yi, Xi; θ1) · Π_{i=K1+1}^{K} ∫ f(Yi | Xi; β) P(Ri | Yi, Xi; θ1) dYi2,

and hence the likelihood equation is

Σ_{i=1}^{K1} Xi^T Vi^{-1} (Yi − µi) + Σ_{i=K1+1}^{K} Xi^T Vi^{-1} ( E(Yi | Xi, Yi1, Ri) − µi ) = 0.


Plugging the observed values Xi, Yi1 and Ri = 0 into E(Yi | Xi, Yi1, Ri), we get

E(Yi | Xi, Yi1, Ri) = ( Yi1 ; E(Yi2 | Xi, Yi1, Ri = 0) )  (MAR)=  ( Yi1 ; µi2 + (σ21/σ11)(Yi1 − µi1) )

for

Vi = ( σ11  σ12 ; σ12  σ22 ),

and therefore, from

Vi^{-1} = 1/(1 − ρ²) ( 1/σ11  −σ12/(σ11σ22) ; sym.  1/σ22 ),

we get

Vi^{-1} ( E(Yi | Xi, Yi1, Ri = 0) − µi )
  = 1/(1 − ρ²) ( 1/σ11  −σ12/(σ11σ22) ; sym.  1/σ22 ) ( Yi1 − µi1 ; (σ21/σ11)(Yi1 − µi1) )
  = σ11σ22/(σ11σ22 − σ12²) · ( (σ11σ22 − σ12²)/(σ11² σ22) (Yi1 − µi1) ; 0 )    (∵ ρ² = σ12²/(σ11σ22))
  = ( (1/σ11)(Yi1 − µi1) ; 0 ).

Thus the likelihood equation becomes

Σ_{i=1}^{K1} Xi^T Vi^{-1} (Yi − µi) + Σ_{i=K1+1}^{K} Xi1^T (Yi1 − µi1)/σ11 = 0.

It is equivalent to "ignoring the missing outcomes."

When Ri is not independent of Yi2 given (Yi1, Xi) (NI), E(Yi2 | Xi, Yi1, Ri) ≠ E(Yi2 | Xi, Yi1) and

E(Yi2 | Xi, Yi1, Ri = 0) = [ ∫ Yi2 f(Yi2 | Yi1, Xi) P(Ri = 0 | Yi2, Yi1, Xi) dYi2 ] / [ ∫ f(Yi2 | Yi1, Xi) P(Ri = 0 | Yi2, Yi1, Xi) dYi2 ].

In practice this is evaluated via numerical integration. Note that the conditional expectation of the missing Y2 is a function of the observation probability P(Ri = 0 | · · · ) when the data are nonignorably missing. We should then posit a model for the observation probability, such as a logistic model.

Example 4.15. Missing covariate example. Again consider a 2-dimensional outcome, Yi = (Yi1, Yi2)^T, and let xi be a covariate. Here, assume that Yi is completely observed but xi is not. Let Ri = 1 if xi is observed, and Ri = 0 otherwise. Let Xi = (1, xi) be the covariates including the intercept. For simplicity of notation, let Ri = 1 for i = 1, 2, · · · , K1, and Ri = 0 for i = K1 + 1, · · · , K. Then the likelihood becomes

L = Π_{i=1}^{K1} f(Yi | Xi; β) g(Xi) P(Ri | Yi, Xi)   [joint likelihood of (Yi, Xi, Ri)]   × Π_{i=K1+1}^{K} ∫ f(Yi | Xi; β) g(Xi) P(Ri | Yi, Xi) dxi.

Recall the missing outcome problem: there it was enough to consider Yi | Yi^o, not Xi | Yi, and hence we could "plug in" the observed values in the score as

Xi^T Vi^{-1} ( Yi^o − Xi^o β ; E(Yi^m − Xi^m β | Yi^o, Ri) ).

However, in the missing covariate problem the score is no longer linear in the missing quantity: Xi^T Vi^{-1} Xi β is quadratic in Xi. Thus we cannot simply "plug in";

E( Xim^T Vi^{-1} Xim β | Yio, Xio, Ri ) ≠ E(Xim | Yio, Xio, Ri)^T Vi^{-1} E(Xim | Yio, Xio, Ri) β.

Thus the computation becomes difficult. In the rest of this example, we consider the case where xi is binary. Then the likelihood becomes

L = Π_{i=1}^{K1} f(Yi | Xi) g(Xi) P(Ri | Yi, Xi) × Π_{i=K1+1}^{K} Σ_{xi=0}^{1} f(Yi | Xi) g(Xi) P(Ri | Yi, Xi),

and hence the likelihood equation is

∂logL/∂β = Σ_{i=1}^{K1} Xi^T Vi^{-1} (Yi − µi) + Σ_{i=K1+1}^{K} (∂/∂β) log Σ_{xi=0}^{1} f(Yi | Xi) g(Xi) P(Ri | Yi, Xi) = 0.

Now, as in the proof of Proposition 4.12,

(∂/∂β) log Σ_{xi=0}^{1} f(Yi | Xi) g(Xi) P(Ri | Yi, Xi)
  = [ Σ_{xi=0}^{1} (∂/∂β) log f(Yi | Xi) · f(Yi | Xi) g(Xi) P(Ri | Yi, Xi) ] / [ Σ_{xi=0}^{1} f(Yi | Xi) g(Xi) P(Ri | Yi, Xi) ]
  = E( Xi^T Vi^{-1} (Yi − Xiβ) | Yi, Ri )

is obtained, and therefore the likelihood equation is

Σ_{i=1}^{K1} Xi^T Vi^{-1} (Yi − µi) + Σ_{i=K1+1}^{K} [ (1, 1)^T Vi^{-1} (Yi − (1, 1)β) pi + (1, 0)^T Vi^{-1} (Yi − (1, 0)β) (1 − pi) ] = 0,

where pi = E(xi | Yi, Ri) = P(xi = 1 | Yi, Ri). Solving this equation is not difficult when xi is binary. We may handle the problem as if each record with a missing covariate were "duplicated": plug in 1 and 0 for the covariate and take a weighted average, as in the following table.

Y          X        W
Y1         x1       1
Y2         x2       1
⋮          ⋮        ⋮
Y_{K1}     x_{K1}   1
Y_{K1+1}   1        p_{K1+1}
Y_{K1+1}   0        1 − p_{K1+1}
Y_{K1+2}   1        p_{K1+2}
Y_{K1+2}   0        1 − p_{K1+2}
⋮          ⋮        ⋮
Y_K        1        p_K
Y_K        0        1 − p_K

Example 4.16. Mixed Effects Models with a Selection Model. Here the observation probability can depend on unobservable random effects. The likelihood is

L = Π_{i=1}^{N} ∫∫ f(Yi | Xi, Zi, bi) g(bi) P(Ri | Yi^o, Yi^m, Xi, Zi, bi) dbi dYi^m,

where the inner integral over bi is the marginal likelihood appearing in the LMM.

(i) Under MAR, Ri ⊥ (Yi^m, bi) | (Yi^o, Xi, Zi). Then the likelihood becomes

L = Π_{i=1}^{N} ∫∫ f(Yi | Xi, Zi, bi) g(bi) P(Ri | Yi^o, Yi^m, Xi, Zi, bi) dbi dYi^m
  = Π_{i=1}^{N} [ ∫ ∫ f(Yi | Xi, Zi, bi; β) g(bi) dbi dYi^m ] P(Ri | Yi^o, Xi, Zi),

where ∫ f(Yi | Xi, Zi, bi; β) g(bi) dbi is the usual marginal likelihood. Hence we can "ignore" the P(Ri | Yi^o, Xi, Zi) part if our interest is inference for β.

(ii) Under NI, explicit modeling of the observation probability is required. Maximizing

L = Π_{i=1}^{N} ∫∫ f(Yi | Xi, Zi, bi) g(bi) P(Ri | Yi^o, Yi^m, Xi, Zi, bi) dbi dYi^m

amounts to solving

Σ_{i=1}^{N} E[ (∂/∂β) log f(Yi | Xi, Zi, bi; β) | Yi^o, Xi, Zi, Ri ] = 0,

i.e.,

Σ_{i=1}^{N} Xi^T ( E(Yi | Yi^o, Xi, Zi, Ri) − Xiβ ) = 0.

Note that

E(Yij | Yi^o, Xi, Zi, rij = 0) = [ ∫∫ y f(y | Y^o, Xi, Zi, bi) g(bi) P(R = 0 | y, Y^o, Xi, Zi, bi) db dy ] / [ ∫∫ f(y | Y^o, Xi, Zi, bi) g(bi) P(R = 0 | y, Y^o, Xi, Zi, bi) db dy ],

and in practice this can also be computed via numerical methods.


Chapter 5

Generalized Estimating Equation

5.1 Review on GLM

5.1.1 Basic Concepts

A GLM has three components (model assumptions): the random component, the systematic component, and the link function.

• Random Component

Y has a distribution in the exponential family, taking the form

f_Y(y; θ, φ) = exp( (yθ − b(θ))/φ + c(y, φ) ).

θ is called the "canonical parameter."

Example 5.1. (i) For N(µ, σ2): θ = µ, φ = σ2, and b(θ) = θ²/2 = µ²/2.

(ii) For Ber(p): θ = log{ p/(1 − p) } and b(θ) = − log(1 − p).

(iii) For Poi(µ): θ = log µ and b(θ) = µ.

Proposition 5.2. E(Y) = µ = b′(θ), Var(Y) = b′′(θ)φ.

Proof. Note that

∫ exp( (yθ − b(θ))/φ + c(y, φ) ) dy = 1.

Differentiating with respect to θ, we get

(∂/∂θ) ∫ exp( (yθ − b(θ))/φ + c(y, φ) ) dy = ∫ [ (y − b′(θ))/φ ] exp( (yθ − b(θ))/φ + c(y, φ) ) dy = 0,

and hence

E(Y) − b′(θ) = 0.

Furthermore, from

(∂²/∂θ²) ∫ exp( (yθ − b(θ))/φ + c(y, φ) ) dy = ∫ [ −b′′(θ)/φ + ( (y − b′(θ))/φ )² ] exp( (yθ − b(θ))/φ + c(y, φ) ) dy = 0,

we get

Var(Y) = b′′(θ) φ.

• Systematic component

µ := E(Y |X) is related to X through a linear combination of X, η = Xβ.

• Link function

Linear combination η is a function of µ = E(Y |X) via a “link function” g, i.e., η = g(µ). It is

assumed that g is twice differentiable. g is called “canonical link” if it satisfies θ = η.

Example 5.3.

(i) In the normal model, η = θ = µ is a canonical link.

(ii) In the binary model, η = θ = log{ µ/(1 − µ) } is a canonical link, i.e., the canonical link is the logit function.

(iii) In the Poisson model, η = θ = log µ is a canonical link, i.e., the canonical link is the log function.

Note that link function determines the interpretation of β. Also note that, canonical link makes

observed information and expected information the same.

5.1.2 Score and Information

• Consider

ℓ(β, φ) = (yθ − b(θ))/φ + c(y, φ).

Then the score is obtained as

∂ℓ/∂β (β, φ) = Σ_{i=1}^n (∂ηi/∂β)(∂µi/∂ηi)(∂θi/∂µi)(∂ℓ/∂θi)
            = Σ_{i=1}^n Xi^T · [1/g′(µi)] · [b′′(θi)]^{-1} · (Yi − b′(θi))/φ
            = Σ_{i=1}^n Xi^T g′(µi)^{-1} [Var(Yi | Xi)]^{-1} (Yi − µi)
            = Σ_{i=1}^n (∂µi/∂β) Var(Yi | Xi)^{-1} (Yi − µi).

If one uses the canonical link function, then

∂ℓ/∂β (β, φ) = Σ_{i=1}^n Xi^T (Yi − µi)/φ,

since (∂µi/∂ηi)(∂θi/∂µi) = ∂θi/∂ηi = 1.

• Now consider

I(θ) = − Σ_{i=1}^n ∂²ℓ(θ)/∂θ².

I(θ) is called the (observed) information. One can also consider the expected information, i(θ) = E I(θ). For the canonical link,

I(β) = Σ_{i=1}^n (Xi^T/φ) (∂µi/∂β^T)

is deterministic, and hence i(β) = I(β). It implies that, when we use the canonical link, the likelihood function is strictly concave; moreover, g′(µi)^{-1} = b′′(θi), and hence

i(β) = Σ_{i=1}^n Xi^T b′′(θi) Xi / φ.

5.1.3 Computation

• Newton-Raphson Method

In general, the equation U(θ) = 0 does not have a closed form solution. Then one should solve

the equation iteratively. Starting from initial value θ(0), one can iteratively update the estimate

by

θ(p+1) = θ(p) + I(θ(p))−1U(θ(p)).

When the update becomes very small, i.e., |θ(p+1) − θ(p)| < c, where c is pre-specified small

value, for example c = 10−8, the iteration is stopped and θ(p+1) is declared as the solution. This

computational algorithm is called Newton-Raphson algorithm.

• Iteratively Re-weighted Least Squares

The score function of a GLM has the form of a weighted linear regression:

U(β) = Σ_{i=1}^n Xi^T ( g′(µi) Var(Yi) )^{-1} (Yi − µi),   [weight: ( g′(µi) Var(Yi) )^{-1}]

If one could represent the "µi part" as a linear combination Xiβ, then one could conduct WLS estimation, updating the weights at each iteration. However, µi ≠ Xiβ makes the ordinary weighted least squares approach impossible. Thus we "linearize" g(µi) and find a pseudo-data variable whose mean is Xiβ.

Consider the pseudo-data variable

Zi = g(µi) + g′(µi)(Yi − µi) = ηi + (∂ηi/∂µi)(Yi − µi).

Then E(Zi) = ηi = Xiβ and

Var(Zi) = (∂ηi/∂µi)² Var(Yi) = g′(µi)² Var(Yi).

Hence the score equation becomes

U(β) = Σ_{i=1}^n Xi^T ( g′(µi)² Var(Yi) )^{-1} g′(µi) (Yi − µi) = Σ_{i=1}^n Xi^T Var(Zi)^{-1} (Zi − ηi),

and we can obtain the WLS estimate as

β̂ = ( Σ_{i=1}^n Xi^T Var(Zi)^{-1} Xi )^{-1} Σ_{i=1}^n Xi^T Var(Zi)^{-1} Zi,

updating Var(Zi) (the weights) iteratively. Such an approach is called Iteratively Re-weighted Least Squares (IRLS).

Remark 5.4. IRLS is equivalent to the Newton-Raphson method with Fisher scoring. The IRLS update is

β(p+1) = ( Σ_{i=1}^n Xi^T Var^{(p)}(Zi)^{-1} Xi )^{-1} Σ_{i=1}^n Xi^T Var^{(p)}(Zi)^{-1} Zi
       = ( Σ_{i=1}^n Xi^T ( g′(µi)² Var^{(p)}(Yi) )^{-1} Xi )^{-1} ( Σ_{i=1}^n Xi^T ( g′(µi)² Var^{(p)}(Yi) )^{-1} ( ηi^{(p)} + g′(µi^{(p)})(Yi − µi^{(p)}) ) )
       = β(p) + ( Σ_{i=1}^n Xi^T ( g′(µi)² Var^{(p)}(Yi) )^{-1} Xi )^{-1} ( Σ_{i=1}^n Xi^T ( g′(µi) Var^{(p)}(Yi) )^{-1} (Yi − µi^{(p)}) )
       = β(p) + i(β(p))^{-1} U(β(p)),

which is the same as the update in the Newton-Raphson algorithm (with the expected information).
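For concreteness, here is a minimal IRLS sketch for logistic regression with the canonical logit link (illustrative only; no convergence check), following the pseudo-data and weight definitions above.

```python
# Sketch: IRLS for logistic regression (canonical logit link).
import numpy as np

def irls_logistic(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        gprime = 1.0 / (mu * (1.0 - mu))          # g'(mu) for the logit link
        var = mu * (1.0 - mu)                     # Var(Y_i) for Bernoulli
        z = eta + gprime * (y - mu)               # pseudo-data Z_i
        w = 1.0 / (gprime ** 2 * var)             # weights 1 / {g'(mu)^2 Var(Y_i)} = mu(1 - mu)
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))   # weighted least squares update
    return beta
```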


5.2 Generalized Estimating Equations

5.2.1 Quasi-Likelihood

The quasi-likelihood approach is an extension of likelihood inference suggested by Wedderburn (1974). It is motivated by the score equation of the GLM. Recall that the score equation in the GLM is

U(β) = Σ_{i=1}^n Xi^T ( g′(µi) Var(Yi) )^{-1} (Yi − µi) = 0.

Note that the score equation is based on the likelihood function, and hence comes from a full distributional assumption, but it only requires the mean µi and the variance Var(Yi) of the response variable! Thus, if we only model the mean and the variance, we can "mimic" the GLM procedure even when there is no full distributional assumption.

Now consider only U(β) = 0, an estimating equation, not the likelihood ℓ(β). In other words, we find an estimator β̂ as a solution of U(β) = 0, and do not regard it as an optimizer of a likelihood. Such a method is called a quasi-likelihood approach. It can be used to model overdispersed binomial or count data. For example, the Poisson model is commonly used for count data, whose variance equals its mean; however, if it is known that the variance should be larger than the mean, the Poisson model may not be adequate. Here we can apply quasi-likelihood, which does not specify the full distribution but only the first and second moments. Note that we solve

(∂µ^T/∂β) V^{-1} (Y − µ) = 0

if E(Y) = µ and Var(Y) = V. The following asymptotic behavior is known.

Proposition 5.5. Let the solution of U(β) = 0 be β. Then as n→∞

√n(β − β0)

d−−−→n→∞

N(0,Ω−1),

where

Ω = limn→∞

1

n

∂µ>

∂βV −1 ∂µ

∂β>.

Proof. By definition,

U(β) = U(β0) +∂U

∂β>

∣∣∣∣β=β0

(β − β0) +OP (1)

75

Page 77: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

and note that

∂U

∂β>

∣∣∣∣β=β0

= − ∂µ>

∂βV −1 ∂µ

∂β>

∣∣∣∣β=β0

+∂2µ

∂β∂β>

∣∣∣∣β=β0

V −1(Y − µ)︸ ︷︷ ︸mean zero

= − ∂µ>

∂βV −1 ∂µ

∂β>

∣∣∣∣β=β0

+OP

(n1/2

).

There we get

√n(β − β0) =

(1

n

∂µ>

∂βV −1 ∂µ

∂β>

∣∣∣∣β=β0

+OP (n−1/2)

)−1(1√nU(β0) +OP (n−1/2)

)= Ω−1 1√

nU(β0)︸ ︷︷ ︸

d−−−→n→∞

N(0,Ω)

+oP (1).

5.2.2 Extension of quasi-likelihood method to multivariate outcome: GEE

Now consider a multivariate outcome Yi = (Yi1, Yi2, · · · , Yini)> and covariate Xij , for j = 1, 2, · · · , ni

and i = 1, 2, · · · ,K. There are two main approaches to treat longitudinal data; one is GEE, the other

one is GLMM. GEE (Generalized Estimating Equation) is a marginal approach: It does not assume

any full distributional properties. GLMM (Generalized Linear Mixed Model) is a conditional approach:

It controls the correlation via random effect term. In this chapter, GEE will be introduced.

GEE extends quasi-likelihood approach based on following argument. Assume that true parameter

is obtained as a solution of “real estimating equation.” If estimating equation is obtained by the

data converges to the real one as the number of the data increases, then estimates converge to the

true parameter under some technical assumptions. GEE approach extends such argument to the

longitudinal data, which is proposed by Liang and Zeger (1986) and Zeger and Liang (1986).

Figure 5.1: Estimating Equation

76

Page 78: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Definition 5.6. For each subunit, assume that each element Yij given Xij arises from an exponential

family distribution and follows GLM, i.e.,

E(Yij |Xij) = µij , V ar(Yij |Xij) = b′′(θij)φ, ηij = Xijβ = g(µij).

In here, we can specify

E(Yi|Xi) = µi and V ar(Yi|Xi) = Vi,

where

Vi = Ai(µi)1/2Ωi(α)Ai(µi)

1/2

for working correlation matrix Ωi.

Remark 5.7. Note that we do NOT specify the FULL distribution of Yi; just marginal distri-

bution of each Yij is specified, and only specified properties of Yi are mean and variance.

In GEE, we consider an estimating equation

U(β, α) =K∑i=1

∂µ>i∂β

V −1i (Yi − µi) = 0.

An equivalent representation is

K∑i=1

(∂µi1∂β

· · · ∂µini∂β

)(p×ni)

V −1i

(ni×ni)(Yi − µi)

(ni×1)

= 0.

Example 5.8. For repeated binary data, assume that

logP (Yij = 1|Xij)

P (Yij = 0|Xij)= log

pij1− pij

= β0 + β1Xij

and

E(Yi) = pi = (pi1, pi2, · · · , pini)>, V ar(Yij) = pij(1− pij).

We further assume “working correlation” for Yi. For example, if Ωi(α) = Ii (“working independence

correlation”), the GEE solution is the same as the MLE obtained as if Yij ’s are independent.

We can obtain estimates β and α as following iterative procedure. First, set α = 0 (=indepen-

dent working correlation), and obtain initial estimates of β, β(0). Then solve U(β(0), α) = 0 and let

the solution α(1). Then solve U(β, α(1)) = 0 and let the solution β(1). Repeat this procedure until

convergence.

77

Page 79: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Remark 5.9. (i) Recall the likelihood approach: In the likelihood inference, U =∂ logL

∂θ, where

logL = (Y −Xβ)>V (α)−1(Y −Xβ), and so via iterative approach, we can say that

L(β(0), α(0)) ≤ L(β(1), α(0)) ≤ L(β(1), α(1)) ≤ · · · .

It implies that “every single step guarantees that likelihood increases.” However, in GEE, there

is no objective function to optimize, and hence convergence of such iterative approach is not

ensured.

(ii) The GEE estimator β is asymptotically normal with mean β0 and variance W−10 W1W

−10 , where

W1(β0, α) = V ar(U) and W0(β, α) = E

(− ∂U

∂β>

).

In other words,√K(β − β0) ≈ N(0,KW−1

0 W1W−10 ).

For the proof, note following Taylor expansion

U(β, α) = 0 = U(β0, α) +∂U

∂β>

∣∣∣∣β=β∗

(β − β0), |β − β0| ≥ |β∗ − β0|.

Thus we get

β − β0 =

(− ∂U

∂β>(β∗)

)−1

U(β0, α).

Note that since∂U

∂β>is sum of K independent random variables,

−K−1 ∂U

∂β>(β∗) ≈ −K−1 ∂U

∂β>(β0) = −K−1W0 + oP (1)

as K →∞. Furthermore, since U is sum of K independent random variables,

K−1/2U(β0, α) = K−1/2(EU +OP

(√V ar(U)

)) = OP (1).

Thus we get

√K(β − β0) =

(K−1W0

)−1(K−1/2U(β0, α)

)+ oP (1) ≈ N(0,KW−1

0 W1W−10 ).

78

Page 80: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

(iii) Note that

V ar(U) =

K∑i=1

∂µ>i∂β

V −1i V ar(Yi)V

−1i

∂µi∂β>

=K∑i=1

∂µ>i∂β

V −1i

∂µi∂β>

if V ar(Yi) = Vi. Thus, if Ωi (and consequently Vi) is correctly specified, W1 = W0 holds. Even if

correlation structure is misspecified, variance of β has a sandwich form, whose estimate is robust.

(iv) By inverse function theorem (Foute, 1977), if θ is the solution of U(θ) = 0, then θ = U−1(0)

converges to (EU)−1(0). Thus, θ0 = (EU)−1(0), i.e., U(θ0) = 0 is the ONLY key part in the

consistency of θ (i.e.. only correct specification of mean model is required). Therefore, even when

Ωi(α) is misspecified, the estimate β is consistent. Variance misspecification does not affect to

the consistency of β.

(v) Recall that

W0 = E

(− ∂U

∂β>

)=

K∑i=1

∂µ>i∂β

V −1i

∂µi∂β>

and

W1 =

K∑i=1

∂µ>i∂β

V −1i V ar(Yi)V

−1i

∂µi∂β>

.

Vi might be misspecified, but for consistent estimate, V ar(Yi) should estimate the true one. Thus

we estimate W1 as

W1 =K∑i=1

∂µ>i∂β

∣∣∣∣β=β

Vi(α)−1(Yi − µi)(Yi − µi)>Vi(α)−1 ∂µi∂β>

∣∣∣∣β=β

for consistent estimate.

Remark 5.10. Estimation for α. Note that α is related to the “correlation coefficient,”

αjk =K∑i=1

(Yij − µij)(Yik − µik)σj σk

.

Since σj depends on β and φ, estimate of α depends on both terms, i.e.,

α = α(β, φ).

79

Page 81: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Also note that

σ2j =

1

K

K∑i=1

(Yij − µij)2

depends on φ, and so φ does. Thus φ = φ(β). Denoting α(β, φ(β)) = α∗(β), we have

K−1/2K∑i=1

Ui(β, α∗(β)) = K−1/2

K∑i=1

Ui(β, α)︸ ︷︷ ︸A∗

+

(1

K

K∑i=1

∂α>Ui(β, α)

)︸ ︷︷ ︸

B∗

√K(α∗ − α)︸ ︷︷ ︸

C∗

.

Now under some technical assumptions,

√K(α(β, φ)− α) = OP (1),

√K(β − φ) = OP (1),

∣∣∣∣∂α(β, φ)

∂φ

∣∣∣∣ ≤ H(Y, β) = OP (1),

we have

C∗ =√K(α(β, φ(β))− α(β, φ) + α(β, φ)− α

)

=√K

∂α(β, φ∗)

∂φ︸ ︷︷ ︸≤OP (K−1/2)

(φ− φ)︸ ︷︷ ︸=OP (K−1/2)

+ α(β, φ)− α︸ ︷︷ ︸=OP (K−1/2)

= OP (1)

and B∗ = oP (1) because

B∗k =1√K

K∑i=1

∂αkUi(β, α) = − 1√

K

(K∑i=1

∂µ>i∂β

V −1i

∂Vi∂αk

V −1i (Yi − µi)

)k

is sum of independent random variables with mean zero. Thus score “with α∗ plugged-in”

K−1/2K∑i=1

Ui(β, α∗(β))

is asymptotically equivalent to

A∗ = K−1/2K∑i=1

Ui(β, α).

in other words, the quality of α does not affect on the asymptotic behavior of β; only the required

condition for α is√K-consistency. Thus we don’t have to sweat on finding “best estimate” of α. We

may take MME approach for α.

80

Page 82: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Remark 5.11. (i) Choice of Ωi(α). Popular choices of Ωi(α) include intraclass,

Ωi(α) =

1 α α · · · α

α 1 α · · · α

α α 1 · · · α...

......

. . ....

α α α · · · 1

,

and AR(1) correlation structure,

Ωi(α) =

1 α α2 · · · αni−1

α 1 α · · · αni−2

α2 α 1 · · · αni−3

......

.... . .

...

αni−1 αni−2 αni−3 · · · 1

.

In intraclass, MME can be used for estimation of α, for example,

α =

K∑i=1

∑j>k

Yij − pij√pij(1− pij)

Yik − pik√pik(1− pik)

K∑i=1

ni(ni − 1)

2

,

in binary regression. Note that α is the solution of

W (α) :=

K∑i=1

ni(ni − 1)

2α−

K∑i=1

∑j>k

rij rik = 0,

and hence it converges to α∗, the solution of EW (α∗) = 0. It is same as the true α0 if Ωi(α) is

correctly specified.

(ii) Summary: Estimation of α. Since consistency of β is unaffected by misspecification of Ωi(α),

we can specify Ωi(α) as the identity matrix first. Then we can estimate (and we need to do) α

from residuals. Also note that, when the number of sub-units in the unit is the same for all units

(ni = n ∀i), we can assume that Ωi(α) = Ω(α), and estimate all n(n−1)/2 unknown parameters

without specifying a particular structure as follows:

Ω(α) =1

K

K∑i=1

A−1/2i (Yi − ρi)(Yi − ρi)>A−1/2

i .

81

Page 83: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Remark 5.12. Caution in using GEE.

(i) Since the asymptotic properties of GEE depends on large K and fixed ni (precisely, maxi ni =

O(1) as K →∞), data with large ni or small K are not suited to GEE. For example, study that

the data are collected everyday, or the data with ni increasing as a function of K, may not be

adequate to apply GEE.

(ii) For some “working correlation” structures such as m-dependence, there is a range of α that

yields non-positive definite matrix. Users should make sure the estimated correlation matrix is

positive definite.

(iii) The lack of definition of α: One of the assumptions for proof is that the estimate of α be√K-

consistent, i.e.,√K(α− α) = OP (1). However, α is not a true correlation, but just a parameter

from “working correlation,” which is commonly misspecified, and sometimes it is not clear what

α estimates. For example, it would converge to the solution α∗ of EW (α∗) = 0, but what α∗

means is not well specified, and even it is not guaranteed that such solution exists. In summary,

α is subject to “uncertainty of definition,” leading to breakdown of the asymptotic properties of

β (α may not be√K-consistent!).

(iv) (Cont’d) Therefore, it is a good practice to fit independent working correlation first, i.e., Ωi(α) =

Ii.

(v) Ωi(α) = Ii yields OLS fit. βOLS is still consistent even if variance structure is misspecified;

however, we lose efficiency by ignoring correlation. In contrast, if we specify correlation structure

correctly, we gain efficiency, but there is a risk that the variance model is misspecified; then

asymptotic behavior of β might be ruined. One should keep in mind that there is no such thing

as a free lunch when choosing the method to employ.

5.3 More topics on GEE

5.3.1 Hypothesis Testing

Remark 5.13. In the likelihood inference, we can test the hypothesis H0 : β1 = β10 via likelihood

ratio test and its asymptotic equivalences. Let β = (β>1 , β>2 )> be a parameter of interest, where β1 is

the parameter of interest of length q and β2 is a nuisance parameter with length p − q. Then for log

likelihood function `,

2(`(β1, β2)− `(β10, β2)) ≈H0

χ2(q),

82

Page 84: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

where β2 is MLE of β2 under H0. Also, we can consider Wald’s test statistic

(β − β0)>V ar(β)(β − β0)

and Rao’s test statistic (score test statistic)

U(β0)>i(β0)−1U(β0) = U(β0)>

(− ∂U

∂β

∣∣∣∣β=β0

)−1

U(β0) = U1(β0)>

((− ∂U

∂β

∣∣∣∣β=β0

)11·2

)−1

U1(β0).

Note that: ((− ∂U

∂β

∣∣∣∣β=β0

)11·2

)−1

is an upper left matrix of (− ∂U

∂β

∣∣∣∣β=β0

)−1

.

Using this, we can develop a “Wald-like” and “score-like” tests,

(β − β0)>V ar(β)(β − β0)

and

U(β0)>V ar(U)−1U(β0),

respectively. However, we cannot develop “LRT-like” test because there is no likelihood or objective

function that we have to maximize. In this case, the “likelihood based on working independence model”

plays an important role. Let the product of the densities of Yij ’s be L(θ;Y ), where θ> = (β>, φ).

Although L(θ : Y ) is not the likelihood function, we can consider “LRT-like statistic”

TL = 2(logL(β1, θ2)− logL(β10, θ2)),

where θ2 = (β>2 , φ)> and θ2 is the maximizer of L under the restriction β1 = β10. TL would be asymp-

totically χ2 distributed if all Yij’s are independent, but actually they are correlated! Following

theorem tells the distribution of TL under the null.

Theorem 5.14. Let P0 be a q × q upper left matrix of W−10 and P1 be the variance of U1(β10, β2),

which is the first q elements of the GEE function evaluated at β2 = β2, and β2 is the solution of the

GEE under the null. Then

TLd≈

q∑j=1

djχ2j ,

where d1 ≥ d2 ≥ · · · ≥ dq are eigenvalues of P0P1.

83

Page 85: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Proof. Note that

`(β0) ≈ `(β) +∂`(β)

∂β

∣∣∣∣β=β︸ ︷︷ ︸

=0

(β0 − β) +1

2(β0 − β)>

∂2`(β)

∂β∂β>

∣∣∣∣β=β

(β0 − β)

and hence

2(`(β)− `(β0)) ≈ (β − β0)>∂2`(β)

∂β∂β>

∣∣∣∣β=β

(β − β0).

Also recall that

β − β0 ≈

(− ∂U

∂β>

∣∣∣∣β=β0

)−1

U(β0),

where U =∂`

∂β. Then we get

2(`(β)− `(β0)) ≈ U(β0)>

E(− ∂U

∂β>

∣∣∣∣β=β0

)︸ ︷︷ ︸

=W0

−1

U(β0).

Similarly, we get

2(`(β0)− `(β0)) ≈ U2(β0)>E

(− ∂U2

∂β>

∣∣∣∣β=β0

)−1

U2(β0).

Therefore, letting W0 =

W011 W012

W021 W022

, we get

2(`(β)− `(β0)) ≈ U(β0)>W−10 U(β0)− U2(β0)>W−1

022U2(β0)

= (U1(β0)−W012W−1022U2(β0))>W−1

022·1(U1(β0)−W012W−1022U2(β0)).

Now note that:

V ar(U) = W0

only when the variance model is correctly specified ! Thus in general

V ar(U1(β0)−W012W−1022U2(β0)) 6= W022·1.

Hence in here, we should find the distribution of x>Ax, where x ∼ N(0, V ). Recall that mgf of x>Ax

isp∏i=1

(1− 2tλi)−1/2

84

Page 86: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

where λi’s are eigenvalues of AV . Note that t 7→ (1− 2tλi)−1/2 is mgf of λi · χ2(1), and hence

x>Axd≡ λ1χ

21 + λ2χ

22 + · · ·+ λpχ

2p,

where χ2i are independent χ2(1) distributed random variables. In our problem,

W−1022·1

is an upper left q × q matrix of W−10 , and

V ar(U1(β0)−W012W−1022U2(β0)) = P1.

It can be shown as following. First note that

U1(β10, β2)

U2(β10, β2)

=

U1(β10, β2)

0

≈ U(β0)−

(− ∂U

∂β>

) 0

β2 − β20

≈ U(β0)−W0

0

β2 − β20

x and

U1(β10, β2) ≈ U1(β0)−W012(β2 − β20)

0 ≈ U2(β0)−W022(β2 − β20)

hold. Now using

β2 − β20 ≈W−1022U2(β0),

we obtain

U1(β10, β2) ≈ U1(β0)−W012W−1022U2(β0).

85

Page 87: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

5.3.2 Model Selection

Remark 5.15. Review of AIC. Let ` be a log likelihood based on the specified modelM, andM∗

be the true model with true β∗. Let

∆(β, β∗) = EM∗(−2`(β))

be a Kullback-Leibler divergence between β and β∗. Then Akaike Information Criterion (AIC) is the

estimates of EM∗(∆(β, β∗)). (CAUTION: Note that we cannot plug-in directly as

∆(β, β∗) = EM∗(−2`(β))!

At ∆(β, β∗), expectation should be taken only for the likelihood.) Now note that

∆(β, β∗) = EM∗(−2`(β))

≈ EM∗(−2`(β∗)− 2(β − β∗)> ∂`

∂β

∣∣∣∣β=β∗

− (β − β∗)> ∂2`

∂β∂β>

∣∣∣∣β=β∗

(β − β∗)

)

= EM∗(−2`(β∗)) + 2(β − β∗)>EM∗(− ∂`∂β

∣∣∣∣β=β∗

)︸ ︷︷ ︸

=0

+(β − β∗)>EM∗(− ∂2`

∂β∂β>

∣∣∣∣β=β∗

)︸ ︷︷ ︸

=:I(β∗)

(β − β∗)

= EM∗(−2`(β∗)) + (β − β∗)>I(β∗)(β − β∗),

and hence

∆(β, β∗) ≈ EM∗(−2`(β∗)) + (β − β∗)>I(β∗)(β − β∗)

holds. Now note that

−2`(β∗) ≈ −2`(β)− 2∂`

∂β

∣∣∣∣β=β︸ ︷︷ ︸

=0

(β∗ − β)− (β∗ − β)>∂2`

∂β∂β>

∣∣∣∣β=β︸ ︷︷ ︸

≈−I(β)≈−I(β∗)

(β∗ − β)

= −2`(β) + (β − β∗)>I(β∗)(β − β∗),

and putting in this, we get

EM∗(∆(β, β∗)) ≈ EM∗(−2`(β)) + 2EM∗(

(β − β∗)>I(β∗)(β − β∗)).

Now from

β − β∗ ≈ I(β∗)−1 ˙(β∗),

86

Page 88: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

we get

(β − β∗)>I(β∗)(β − β∗) = ˙(β∗)>I(β∗)−1 ˙(β∗),

and therefore

EM∗(∆(β, β∗)) ≈ EM∗(−2`(β)) + 2EM∗(

˙(β∗)>I(β∗)−1 ˙(β∗))

= EM∗(−2`(β)) + 2tr

I(β∗)−1EM∗( ˙(β∗) ˙(β∗)>︸ ︷︷ ︸=:J(β∗)

)

= EM∗(−2`(β)) + 2EM∗(tr(J(β∗)I(β∗)−1))

holds. AIC is defined as an estimate of EM∗(∆(β, β∗)), which is defined as

AIC = −2`(β) + 2tr(J(β∗)I(β∗)−1

).

Remark 5.16. Extension to GEE. Lack of likelihood leads to lack of model selection devices

such as AIC. Quasi-likelihood under the Independence model Criterion, which is an AIC for GEE, is

suggested by Pan (2001). Let Q(β) be a “quasi-likelihood” under independent working criterion,

Q(β) =K∑i=1

ni∑j=1

log g(Yij ; θij).

Recall that score equation is U(β) = 0, where

U(β) =K∑i=1

∂µ>i∂β

V −1i (Yi − µi),

where Vi = A1/2i RiA

1/2i is a working variance. Let UR(β) be a score function with working correlation

R. Then∂Q

∂β= U I . Also, if we suppress the dependence of ∆(β, β∗) on the true model M∗, we get

EM∗

(−∂Q∂β

∣∣∣∣β=β∗

)= 0 and ΩI := EM∗

(− ∂Q

∂β∂β>

∣∣∣∣β=β∗

)=

K∑i=1

∂µ>i∂β

V −1i

∂µi∂β>

.

Thus we get

∆(β, β∗) = EM∗(−2Q(β))

≈ EM∗(−2Q(β∗)) + (β − β∗)>EM∗(− ∂2Q

∂β∂β>(β∗)

)(β − β∗)

= EM∗(−2Q(β∗)) + (β − β∗)>ΩI(β − β∗).

87

Page 89: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Hence

∆(β, β∗) ≈ EM∗(−2Q(β∗)) + (β − β∗)>ΩI(β − β∗)

holds. In here, β = β(R) is estimate based on the working correlation R. Now note that

−2Q(β∗) ≈ −2Q(β)− 2(β∗ − β)>∂Q

∂β

∣∣∣∣β=β

− (β∗ − β)>∂2Q

∂β∂β>

∣∣∣∣β=β

(β∗ − β).

Note that

− ∂2Q

∂β∂β>

∣∣∣∣β=β

≈ EM∗(− ∂2Q

∂β∂β>

∣∣∣∣β=β

).

However, to be cautious that∂Q

∂β

∣∣∣∣β=β

=∂Q

∂β

∣∣∣∣β=β(R)

6= 0!

Rather,∂Q

∂β

∣∣∣∣β=β(I)

= U I(β(I)) = 0.

Thus EM∗∆(β, β∗) is approximated as

EM∗∆(β, β∗) ≈ EM∗(−2Q(β)) + EM∗(2(β − β∗)>U I(β))︸ ︷︷ ︸=(♠)

+2EM∗(

(β − β∗)>ΩI(β − β∗)).

Ignoring (♠) part, we get

QIC := −2Q(β) + 2tr(ΩI V ar(β)).

Now note that

ΩI =

K∑i=1

∂µ>i∂β

V −1i

∂µi∂β>

= W0,

and hence ΩI = W0. On the other hand, we should estimate the variance V ar(β) robust with respect

to the model misspecification, and hence we use

V ar(β) = W−10 W1W

−10 .

Therefore, we obtain

QIC = −2Q(β) + 2tr(W1W−10 ).

88

Page 90: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Chapter 6

Generalized Linear Mixed Models

6.1 Introduction

Recall the extension from linear regression model to the linear mixed effects model. There is only

one way to extend linear model to the mixed effects model; considering “additive” random effect terms

which are Gaussian distributed. However, GLM has two ways of extension for longitudinal data:

(i) ηi = Xiβ + Zibi, where bi ∼ g(bi;D);

(ii) ηi = Xiβ + ν(ui), where ui ∼ g(ui;D) is random effect term.

The former one is called Generalized Linear Mixed Model (GLMM); the latter one can be viewed

as a generalization of GLMM. Due to its hierarchical structure, second model is called a Hierarchical

Generalized Linear Model (HGLM).

How to handle such random effect term? One can integrate over with respect to the random term,

and consider a marginal likelihood. Since integration does not give a closed form in usual, one should

approximate the integral. Depending on the level of approximation, various methods are available

such as Laplace approximation, Penalized quasi-likelihood, or Marginal quasi-likelihood. On the other

hands, conditional approach can be used; if we “condition” on each individual, we only think for each

individual and hence individual effect becomes ignorable. For this, conditional likelihood or pseudo-

likelihood is available.

6.2 Extension of GLM

As mentioned in previous section, there are two ways for extension; one is GLMM, and the other is

HGLM.

89

Page 91: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

6.2.1 GLMM

The correlation is induced by sharing an unobservable common factor, which is called random effect.

The model is

Yij |bi ∼ f(Yij |bi;β),

where f comes from exponential family, and bi ∼ g(bi;D) with known g(·; ·) indexed by unknown D.

Also, there is systematic component ηij = Xijβ+Zijbi and link function ηij = h(E(Yij |bi)). Typically,

bi is assumed to be distributed as multivariate normal, but it is not necessary; actually, it cannot give

any advantage in GLMM, different as classical LMM.

Remark 6.1. GLMM is called “subject-specific model”; whereas GEE is called “population average

model.” GEE gives “population average,” and hence it yields the same prediction value under same

covariate value. In contrast, due to the random effect term, GLMM yields different prediction value

even under the same covariate value. prediction becomes different as subject differs.

Example 6.2. Consider a logistic model. Given random effect bi, the response probability is assumed

to be

P (Yij = 1|Xij , bi) =exp(β0 + β1Xij + bi)

1 + exp(β0 + β1Xij + bi).

Then β1 becomes “log odds ratio”;

eβ1 =P (Yij = 1|Xij = 1, bi)/P (Yij = 0|Xij = 1, bi)

P (Yij = 1|Xij = 0, bi)/P (Yij = 0|Xij = 0, bi),

but it does not hold marginally,

eβ1 6= P (Yij = 1|Xij = 1)/P (Yij = 0|Xij = 1)

P (Yij = 1|Xij = 0)/P (Yij = 0|Xij = 0).

The right-hand-side of latter equality is “marginal” or “population average” odds ratio. Note that:

P (Yij = 1|Xij) =

∫P (Yij = 1|Xij , bi)dG(bi).

Remark 6.3. (i) Population average model can be used to compare two groups; for example,

“smoking population” vs. “non-smoke population.” On the other hands, subject-specific model

can be used compare when “specific individual” smokes or non-smokes.

(ii) GLMM implies that conditional log odds ratio (when increase of a unit of Xij) is same among

individuals (log odds ratioi≡ β1). It is consequence of “additive modeling” for random effects.

90

Page 92: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

(iii) Note that marginal mean is obtained as

E(Yij =

∫E(Yij |bi)dG(bi).

Thus, GLMM models conditional mean to determine the (marginal) mean structure of Yij , rather

than directly modeling marginal mean model (just as GEE).

(iv) Marginal mean does not have a logit form even if we assumed logit form for conditional mean in

usual. Otherwise, we can consider a distribution of random effect that makes both marginal and

conditional mean same. In precise, one can find G such that

∫H(βX + b)dG(b) = H(φβX + α),

where H is a function, for example, a logit function. Such G is called a “bridge distribution.”

(v) Usually conditional independence is assumed:

Cov(Yij , Yik) = ECov(Yij , Yik|bi) + Cov(E(Yij |bi), E(Yik|bi))

= 0 + Cov(E(Yij |bi), E(Yik|bi)).

(vi) In practice, both GEE and GLMM are fitted, results from both models are compared, and

whether they are far from each other or not may be seen.

6.2.2 HGLM

Some models are formulated in a slightly different way. Instead of imposing distributional assumption

on the random effects that is linear in η, one can impose distributional assumption on a function of

that random effect. This type of model is called hierarchical generalized linear models.

Specifically, let Yi = (Yi1, · · · , Yini)> be the response for the ith unit (i = 1, 2, · · · ,K) and νi be the

corresponding unobserved random effect. We restrict our attention to nested hierarchical structure in

that each outcome Yij , j = 1, 2, · · · , ni of Yi is repeatedly measured within unit i.

Definition 6.4. We assume that Yij arises from an exponential family distribution given random

effect νi and follows generalized linear models (GLMs) with the density f(Yij |νi;ψij , φ), where

log f(Yij |νi;ψij , φ) =Yijψij − b(ψij)

φ+ d(Yij , φ).

91

Page 93: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

In here, ψij denotes the canonical parameter, and φ is the dispersion parameter. We denote by

µij := E(Yij |νi) = b′(ψij) and ηij = q(µij),

with q(·) as the link function and ηij = Xijβ+νi with νi = ν(ui) for some strictly monotonic differen-

tiable function of ui. Xij = (1, X1ij , · · · , Xpij) is a 1× (p+ 1) covariate vector corresponding to fixed

effects β = (β0, · · · , βp)>.

HGLM imposes distributional assumption on “ui” with density

f(ui;α) = exp

(a1(α)ui − a2(α)

ϕ+ c(ui, ϕ)

).

Let θ = (β>, φ,α>)>, ν = (ν1, ν2, · · · , νK)>, u = (u1, u2, · · · , uK)>, and Y = (Y>1 ,Y>2 , · · · ,Y>K)>.

The joint likelihood (H-likelihood) for θ and ν is defined as

H(θ,ν(u); Y) =1

K

K∑i=1

nihi(θ, ν(ui); Yi),

where

hi(θ, ν(ui); Yi) = `1i(θ, ν(ui); Yi) + `2i(θ, ν(ui))

with

`1i(θ, ν(ui); Yi) =1

ni

ni∑j=1

log f(Yij |νi;ψij , φ)

=1

ni

ni∑j=1

(Yijψij − b(ψij)

φ+ d(Yij , φ)

)

and

`2i(θ, ν(ui)) =1

nilog f(ν(ui);α)

=1

ni

a1(α)ui − a2(α)

ϕ+ c(ui, ϕ)︸ ︷︷ ︸

ui part

+ logduidνi︸ ︷︷ ︸

Jacobian

.

Remark 6.5. (i) Note that there is a subtle difference between the h-likelihood and the joint likeli-

hood. The h-likelihood is the joint density of Y and the random effects ν = ν(u), and therefore

is a subclass of joint likelihood defined on a particular scale of u, ν(u), out of a class of scales

ν∗(u) (i.e., only think particular class of transformations which yields a convenient form).

92

Page 94: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

(ii) Obviously, maximizing β and νi jointly would lead to incorrect inference about β due to curse of

dimensionality (number of i increases!). However, “posterior mode” νi plays an important role

in inference and computation. Recall that

K∏i=1

k(νi|Yi1, · · · , Yini)︸ ︷︷ ︸posterior

∝K∏i=1

ni∏j=1

f(Yij |νi)× g(ui)duidνi

︸ ︷︷ ︸

h-likelihood

,

regarding g(ui)dui/dνi as if it is “prior” on ui, and hence maximizer νi of h-likelihood becomes a

“posterior mode.” Below, we examine some popular models and the corresponding joint likelihood

and νi.

Example 6.6 (Normal-Normal Model). Consider a normal-normal model

E(Yij |ν(ui)) = µij + ν(ui), µij = ηij = Xijβ, ν(ui) = ui,

f(ui;D) =1√

2πDexp

(− u

2i

2D

).

Then since

f(Yij |ν(ui);β,D) =1√

2πσ2exp

(−(Yij − µij)2

2σ2

),

we get the h-likelihood

hi(θ, ν(ui); Yi) =1

ni

ni∑j=1

(−1

2log 2πσ2 − 1

2σ2(Yij − µij − ui)2

)− 1

2log 2πD − u2

i

2D

.

Hence we get

h′i(θ, ν(ui); Yi) :=∂

∂uihi(θ, ν(ui); Yi) =

1

ni

ni∑j=1

1

σ2(Yij − µij − ui)−

uiD

,

and the solution of h′i(θ, ν(ui); Yi) = 0 becomes

ui =D

niD + σ2

ni∑j=1

(Yij − µij).

Remark 6.7. Note that ui = E(ui|Yi) is the same as BLUP of ui. It implies that

– posterior mode ui is same as posterior mean E(ui|Yi) in normal case;

– in normal linear mixed effects model, regarding ui’s as fixed parameters gives correct inference

for β.

93

Page 95: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Example 6.8 (Poisson-Gamma Model). Consider a Poisson-Gamma model

E(Yij |ν(ui)) = µij = exp(ηij), ηij = Xijβ + ν(ui), ν(ui) = log ui,

f(Yij |ν(ui);θ) =e−µijµ

Yijij

Yij !,

and

f(ui;λ, k) =uk−1i exp(−ui/λ)

Γ(k)λk.

Due to identifiability, we set E(ui) = 1, i.e., k = λ−1. Note that

µij = eηij = eXijβ+log ui = uieXijβ,

i.e.. ui is a “common multiplier” for ith unit. Then we have that H-likelihood is

K∏i=1

ni∏j=1

e−µijµYijij

Yij !·uk−1i e−ui/λ

Γ(k)λk· ui

=

K∏i=1

ni∏j=1

e−uieXijβ

(uieXijβ)Yij

Yij !· u

ki e−ui/λ

Γ(k)λk

.

Thus h-likelihood becomes

hi(θ, ν(ui); Yi) =1

ni

ni∑j=1

(−uieXijβ + Yij log ui + YijXijβ − log Yij !

)+ k log ui −

uiλ− log Γ(k)λk

,

and hence ui is the solution of

h′i(θ, ν(ui); Yi) =1

ni

ni∑j=1

(−eXijβ +

Yijui

)+k

ui− 1

λ

= 0,

i.e.,

ui =

ni∑j=1

Yij + k

ni∑j=1

eXijβ + λ−1

=

ni∑j=1

Yij + k

ni∑j=1

eXijβ + k

. (6.1)

Remark 6.9. (i) E(Yij |ui) = µij = uieXijβ. The assumption E(ui) = 1 means that marginal mean

becomes E(Yij) = eXijβ.

(ii) Predictor ui in (6.1) satisfies E(ui) = 1 = E(ui), and hence ui is an “unbiased predictor.” Note

that if we did not think ν(ui) = log ui and so Jacobian term were omitted in h-likelihood, ui

94

Page 96: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

should be

ui =

ni∑j=1

Yij + k − 1

ni∑j=1

eXijβ + k

,

which is biased predictor. Also note that bias becomes negligible when ni becomes large: “Prior

becomes negligible if sample size goes to ∞.”

(iii) Simple predictor of ui is a sample mean,

ui =

ni∑j=1

Yij

ni∑j=1

eXijβ

,

which is also unbiased. “Prior” gets a role “pulling ui towards to 1” as adding k to both numerator

and denominator.

Example 6.10 (Binomial-Beta Model). For binary outcome, model with canonical link function is

E(Yij |ν(ui)) = µij =eηij

1 + eηij.

There are two choices for modeling:

i) ηij = Xijβ + bi, µij =eXijβ+bi

1 + eXijβ+bi, bi ∼ N(0, D);

ii) ηij = Xijβ + ν(ui), µij =eXijβ+ν(ui)

1 + eXijβ+ν(ui), ui ∼ g(ui).

We consider second model in this example. That is,

ηij = logµij

1 + µij= Xijβ + ν(ui), ν(ui) = log

ui1− ui

,duidνi

=

(1

ui+

1

1− ui

)−1

= ui(1− ui),

f(Yij |ν(ui);θ) = µYijij (1− µij)1−Yij ,

and

f(ui;α1, α2) =uα1−1i (1− ui)α2−1

B(α1, α2),

where B(α1, α2) =Γ(α1)Γ(α2)

Γ(α1 + α2)is a Beta function. Here we set α1 = α2 = α to give E(ui) = 1/2. Then

hi(θ, ν(ui); Yi) =1

ni

ni∑j=1

(Yij logµij + (1− Yij) log(1− µij)) + loguα−1i (1− ui)α−1

B(α, α)ui(1− ui)

95

Page 97: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

=1

ni

ni∑j=1

(Yij log

µij1− µij

+ log(1− µij))

+ loguαi (1− ui)α

B(α, α)

=

1

ni

ni∑j=1

(Yij(Xijβ + ν(ui))− log(1 + eXijβ+ν(ui))

)+ log

uαi (1− ui)α

B(α, α)

and hence we get

h′i(θ, ν(ui); Yi) =1

ni

ni∑j=1

(Yij

ui(1− ui)−

1ui(1−ui)e

Xijβ+ν(ui)

1 + eXijβ+ν(ui)

)+α

ui− α

1− ui

=

1

ni

ni∑j=1

(Yij − µijui(1− ui)

)+α(1− 2ui)

ui(1− ui)

.

Therefore we get

ui =

ni∑j=1

(Yij − µij) + α

2α, µij =

eXij+ν(ui)

1 + eXij+ν(ui)=

ui1− ui

eXijβ

1 +ui

1− uieXijβ

=uie

Xijβ

(1− ui) + uieXijβ.

Example 6.11 (Gamma-Inverse gamma Model). Let

E(Yij |ν(ui)) = kµij , ηij = logµij , ηij = Xij(β) + ν(ui), ν(ui) = log ui,

f(Yij |ν(ui);β, k) =1

Γ(k)

(Yijµij

)kexp

(−Yijµij

)1

Yij,

and ui arises from an inverse-gamma density

f(ui;α) =1

Γ(α+ 1)

ui

)α+1

exp

(− αui

)1

ui

with E(ui) = 1. Then h-likelihood is

hi(θ, ν(ui); Yi) =1

ni

ni∑j=1

(k log

Yijµij− Yijµij− log Γ(k)− log Yij

)

+(α+ 1) logα

ui− α

ui− log ui − log Γ(α+ 1) + log ui

)

=1

ni

ni∑j=1

(k log

Yijµij− Yijµij− log Γ(k)− log Yij

)+ (α+ 1) log

α

ui− α

ui− log Γ(α+ 1)

96

Page 98: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

and hence

h′i(θ, ν(ui); Yi) =1

ni

ni∑j=1

(− k

µij+Yijµ2ij

)∂µij∂ui

+

(−α+ 1

ui+α

u2i

)=

1

ni

ni∑j=1

(− k

uieXijβ+

Yij

u2i e

2Xijβ

)eXijβ +

(−α+ 1

ui+α

u2i

)=

1

ni

ni∑j=1

Yij − kuieXijβ

u2i e

Xijβ+α− (α+ 1)ui

u2i

.

Thus predictor of ui becomes

ui =Yije

−Xijβ + α

nik + α+ 1.

Remark 6.12. Note that E(Yij) = E(kµij) = E(kuieXijβ) = keXijβ. It implies

E(ui) =nik + α

nik + α+ 16= 1 = E(ui).

Thus, all of HGLM setting does not yield unbiased predictor for ui.

Example 6.13 (Gamma-Normal Model: with log link). In here, we consider the same model as

previous example (example 6.11) for Yij , but assume different model for ui. The model is

E(Yij |ν(ui)) = kµij , ηij = logµij , ηij = Xij(β) + ui, ui ∼ N(0, λ).

Note that it is same setting as that of GLMM. Then

hi(θ, ui; Yi) =1

ni

ni∑j=1

((k − 1) log Yij −

Yijµij− k logµij − log Γ(k)

)− 1

2log 2πλ− u2

i

.

µij = eXijβ+ui yields that

h′i(θ, ui; Yi) =1

ni

ni∑j=1

(Yije

Xijβ+ui

e2(Xijβ+ui)− k)− uiλ

=1

ni

ni∑j=1

(Yij

eXijβ+ui− k)− uiλ

.

However, in here, h′i(θ, ui; Yi) = 0 does not have a closed form solution for ui.

97

Page 99: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

6.3 Marginal Likelihood Approach

In here, we consider a “marginal likelihood,” which averages the joint likelihood over bi,

LM (β,D; Yi) =

∫ ni∏j=1

f(Yij |Xij , bi)dG(bi),

and maximizes it. Since it is defined via integration, usually marginal likelihood does not have a closed

form, and hence we should approximate the integral via, for example, Laplace approximation.

6.3.1 Laplace Approximation

Let `i(θ; bi,Yi) be the joint log-likelihood of Yi and bi,

`i(θ; bi,Yi) =

ni∑j=1

log f(Yij |bi;β) + log g(bi;D),

where θ = (β>, D)>. We need to evaluate

∫e`i(θ;bi,Yi)dbi. When the integral is hard to evaluate, one

can use following Laplace approximation.

Proposition 6.14 (Laplace approximation). Let bi be the solution of `′i(bi) = 0, i.e., h′i(bi) = 0. Then

∫e`i(θ;bi,Yi)dbi = e`i(θ;bi.Yi)

(√2π(−`′′i (θ; bi,Yi)

)−1/2+ oP (1)

)

as ni →∞.

Proof. Note that

∫e`i(bi)dbi =

∫enihi(bi)dbi

=

∫exp

(ni

(hi(bi) + (bi − bi)h′i(bi) +

(bi − bi)2

2h′′i (bi) +

(bi − bi)3

6h

(3)i (bi) + · · ·

))dbi

= enihi(bi)∫eni(bi−bi)

2h′′i (bi)/2 exp

ni(bi − bi)3

6h

(3)i (bi) +

ni(bi − bi)4

24h

(4)i (bi) + · · ·︸ ︷︷ ︸

=:∆i

dbi

= enihi(bi)∫eni(bi−bi)

2h′′i (bi)/2

(1 + ∆i +

1

2∆2i +

1

6∆3i + · · ·

)dbi

= enihi(bi)∫eni(bi−bi)

2h′′i (bi)/2

(1 +

ni(bi − bi)3

6h

(3)i (bi) +

ni(bi − bi)4

24h

(4)i (bi) + · · ·

+1

2

n2i (bi − bi)6

36h

(3)i (bi)

2 + · · ·

)dbi.

98

Page 100: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Now recall that ∫e−ax

2x2mdx =

(2m)!

m!22m

√πa−

2m+12 .

Using this, we obtain ∫eni2

(bi−bi)2h′′i (bi)dbi =√π(−ni

2h′′i (bi)

)−1/2

∫eni2

(bi−bi)2h′′i (bi)(bi − bi)2m−1dbi = 0

and

∫eni2

(bi−bi)2h′′i (bi)ni

(2m)!(bi − bi)2mh

(2m)i (bi)dbi =

(2m)!

m!22m

√π(−ni

2h′′i (bi)

)− 2m+12 ni

(2m)!h

(2m)i (bi)

=

√π

m!22m

(−ni

2h′′i (bi)

)−1/2n1−mi (−2)m

h(2m)i (bi)

h′′i (bi)m

=(−ni

2h′′i (bi)

)−1/2· (−1)m

√π

m!2m`(2m)i (bi)

`′′i (bi)m.

Furthermore, we get

∫eni2

(bi−bi)2h′′i (bi)1

2

n2i

36(bi − bi)6h

(3)i (bi)

2dbi =(−ni

2h′′i (bi)

)−1/2 5

12× 24ni.

Then we get

∫enihi(bi)dbi = enihi(bi)

√π(−ni

2h′′i (bi)

)−1/2(1 + oP (1))

≈ enihi(bi)√π(−ni

2h′′i (bi)

)−1/2,

and therefore we get the approximation

∫e`i(θ;bi,Yi)dbi = e`i(θ;bi,Yi)

(√2π(−`′′i (θ; bi,Yi)

)−1/2+ oP (1)

),

i.e.,

log

∫e`i(θ;bi,Yi)dbi ≈ `i(θ; bi,Yi)−

1

2log(−`′′i (bi)

)+

1

2log 2π.

Remark 6.15. (i) That is, the marginal log-likelihood can be approximated by

logLM (β,D; Y) ≈K∑i=1

`i(θ; bi,Yi)︸ ︷︷ ︸“plug-in” bi

− 1

2

K∑i=1

log(−`′′i (θ; bi,Yi)

)︸ ︷︷ ︸

“correction term”

.

99

Page 101: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

(ii) Higher-order correction can be made by

logLM (β.D; Y) ≈K∑i=1

`i(θ; bi,Yi)−1

2

K∑i=1

log(−`′′i (θ; bi,Yi)

)+

K∑i=1

log(1− Cni(θ; bi,Yi))

where

Cni(θ; bi,Yi) =J!i(θ; bi,Yi)

8− J2i(θ; bi,Yi)

24,

J!i(θ; bi,Yi) = −`(4)i (θ; bi,Yi)

`′′2

i (θ; bi,Yi)= − 1

ni

h(4)i (θ; bi,Yi)

h′′2

i (θ; bi,Yi),

and

J2i(θ; bi,Yi) = −`(3)2

i (θ; bi,Yi)

`′′3

i (θ; bi,Yi)= − 1

ni

h(3)2

i (θ; bi,Yi)

h′′3

i (θ; bi,Yi).

Remark 6.16. (i) Goal of such computation is to obtain an inference for β. In normal-normal case,

h′′i (bi) does not depend on β, and hence likelihood function becomes

L(β) =K∏i=1

∫e`idbi ≈

K∏i=1

enihi(bi)√π(−ni

2h′′i (bi)

)−1/2

︸ ︷︷ ︸constant

.

Thus correction term would be a constant and hence it can be ignored. In other words, we can

just use `i(bi), i.e., just “plug-in” bi. It yields joint maximization of β and bi; jointly maximizing

β and bi as if bi’s are fixed parameters also works in normal-normal case.

(ii) Laplace approximation is valid only when ni’s are large. It would not be acceptable to apply

such approximation to the data with ni = 1 or ni = 2; it often fails to converge in practice.

Most commonly used methods in practice are penalized quasi-likelihood (PQL) and marginal

quasi-likelihood (MQL). From the rest part of this section, we describe Laplace approximation,

PQL and MQL when random effects arise from normal distribution.

Example 6.17. As before, let yi be the ni × 1 outcome vector, Yi = (Yi1, Yi2, · · · , Yini)>, and Xi, Zi

be ni × p, ni × q matrices of explanatory variable associated with the fixed and random effects. Also,

let bi be a q × 1 random effect, while α is a p× 1 fixed effect.

We assume that Yij |bi arise from f(yij |bi) and conditionally independent with E(Yij |bi) = µij(bi)

and V ar(Yij |bi) = φV (µij). Also assume that

ηi = g(µi(bi)) = Xiα+ Zibi,

or equivalently,

ηij = g(µij(bi)) = Xijα+ Zijbi.

100

Page 102: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Random effects bi are assumed to arise from N(0, D(θ)).

Then the marginal likelihood is

e`(α,θ) =

K∏i=1

∫ ni∏j=1

f(yij |bi)

f(bi)dbi

∝K∏i=1

|D|−1/2

∫exp

ni∑j=1

`ij(yij ;µij(bi))−1

2b>i D

−1bi

dbi,

where `ij(yij ;µij(bi)) = log f(yij |bi).

Now we apply Laplace approximation. Let

ni∑j=1

`ij(yij ;µij(bi))−1

2b>i D

−1bi = k(bi).

Then it gives

`(α, θ) ≈K∑i=1

(−1

2log |D|+ k(bi)−

1

2log | − k′′(bi)|

),

where bi = bi(α, θ) is the solution of k′(bi) = 0, where1

k′(bi) =

ni∑j=1

∂ηij∂bi

∂µij∂ηij

∂θij∂µij

∂`ij∂θij

−D−1bi

=

ni∑j=1

Z>ijg′(µij)

−1b′′(θij)−1 yij − b′(θij)

φ−D−1bi

=

ni∑j=1

Z>ij (yij − µij(bi))φV (µij)g′(µij)

−D−1bi.

Also note that

−k′′(bi) =

ni∑j=1

Z>ijZij

φV (µij)g′(µij)2−

ni∑j=1

Z>ij (yij − µij(bi))∂

∂bi

(1

φV (µij)g′(µij)

)︸ ︷︷ ︸

=:Ri

+D−1,

where E(Ri) = EE(Ri|bi) = 0. Thus, ignoring Ri (regarding Ri ≈ 0), we can express

−k′′(bi) = Z>i WiZi +D−1,

1To maintain the consistency of notation (from chapter GEE), I regarded Xij and Zij as row vectors

101

Page 103: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

where

Zi =

Zi1...

Zini

, Wi =

1

φV (µi1)g′(µi1)2

. . .

1φV (µini )g

′(µini )2

.

In conclusion, we get

`(α, θ) ≈K∑i=1

(−1

2log |D|+ k(bi)−

1

2log |Z>i WiZi +D−1|

)

=K∑i=1

ni∑j=1

`ij(yij ;µij(bi))−1

2b>i D

−1bi −1

2log |I + Z>i WiZiD|

6.3.2 Penalized Quasi-Likelihood

In Laplace approximation, Z>i WiZi + D−1 is the “correction term” of just replacing bi into bi

(i.e., e`(bi) ≈ e`(bi)). Note that our interest is to obtain an inference for “α.” Since ∂∂αW 6= 0, we

cannot ignore the correction term in general, but ∂∂αW = 0 in the normal case (i.e., LMM, with

µij(bi) = Xijα+ Zijbi, g(µi) = µi). In such case, approximation becomes

`(α, θ) ≈K∑i=1

(k(bi)−

1

2log |D|

).

It can be viewed as a “profile likelihood.” In other words, maximizing `(α, θ) is equivalent to max-

imizing∑K

i=1

(∑nij=1 `ij(yij ;µij(bi))−

12 log |D| − 1

2b>i D−1bi

)jointly over (α>, b1, b2, · · · , bK)> gives

the correct inference.2

Motivated by this, we can obtain the inference via other approaches; which is called Penalized

Quasi-Likelihood (PQL). PQL ignores the correction term −12

∑Ki=1 log |D + Z>i WiZi| for Laplace

approximation, hoping that ∂∂αW ≈ 0. That is to maximize

K∏i=1

ni∏j=1

f(yij |bi;α) · f(bi;D)

jointly over (α>, b1, · · · , bK).

The name “Penalized” comes from the fact that this method maximizes the likelihood function

K∏i=1

ni∏j=1

f(yij |bi)

2Actually, the correction term is related to the variance term b′′(θij). Thus, if one differentiates the correction term(so that we can find the maximization solution), we obtain the function of “3rd derivative” of b. It implies that, if thedistribution of Yi has high skewness, then the correction term highly affects to the maximization of the approximatedlikelihood `(α, θ).

102

Page 104: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

over (α>, b1, · · · , bK) with penalty∏Ki=1 f(bi;D), so that “estimation” of bi’s obtained as if they are

fixed are “pulled toward zero.”

Example 6.18 (Continued). Note that penalized quasi-likelihood function is

PQL =K∑i=1

ni∑j=1

`ij(yij ;µij(bi))−1

2b>i D

−1bi

if we used normal random effect. Thus we get

∂αPQL =

K∑i=1

ni∑j=1

X>ij (yij − µij(bi))φV (µij)g′(µij)

∂biPQL =

ni∑j=1

Z>ij (yij − µij(bi))φV (µij)g′(µij)

−D−1bi.

Remark 6.19. Note that problem in example 6.18 can be treated as a huge GLM with parameter

(α>, b>1 , · · · , b>K)> ∈ Rp+Kq. Let

N =

K∑i=1

ni and X =

X1

X2

...

XK

be an N × p matrix, where Xi =

Xi1

Xi2

...

Xini

.

Also, let

Z =

Z1

Z2

. . .

ZK

be an N ×Kq matrix where Zi =

Zi1

Zi2...

Zini

,

and b =

b1

b2...

bK

be a Kq × 1 vector. Then the information matrix is a (p+Kq)× (p+Kq) matrix

I = E

− ∂2`

∂α∂α>− ∂2`

∂α∂b>

− ∂2`

∂b∂α>− ∂2`

∂b∂b>

,

103

Page 105: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

where we use ` = PQL. Note that

E

[− ∂2`

∂α∂α>

]=

K∑i=1

ni∑j=1

X>ijXij

φV (µij)g′(µij)2=

K∑i=1

X>i WiXi = X>WX,

where

Wi = diag

(1

φV (µi1)g′(µi1)2, · · · , 1

φV (µini)g′(µini)

2

)and W =

W1

W2

. . .

WK

.

Similarly,

E

[− ∂2`

∂α∂b>i

]=

ni∑j=1

X>ijZij

φV (µij)g′(µij)2= X>i WiZi,

and hence we get

E

[− ∂2`

∂α∂b>

]=(X>1 W1Z1, · · · , X>KWKZK

)= X>WZ.

Finally, from

E

[− ∂2`

∂bi∂b>i

]=

ni∑j=1

Z>ijZij

φV (µij)g′(µij)2+D−1 = Z>i WiZi +D−1

and

E

[− ∂2`

∂bi∂b′>i

]= 0 for i 6= i′,

we get

E

[− ∂2`

∂b∂b>

]= Z>WZ + I ⊗D−1,

where ⊗ denotes the Kronecker product. Therefore we get the summarized form of information matrix,

I =

X>WX X>WZ

Z>WX Z>WZ + I ⊗D−1

.

Remark 6.20. We can find α and b with Newton-Raphson algorithm (with Fisher scoring)

α(p+1)

b(p+1)

=

α(p)

b(p)

+ I−1

∂`∂α

∣∣α=α(p)

∂`∂b

∣∣b=b(p)

.

104

Page 106: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Note that update rule at (α0, b0) can be written as

α0

b0

+ I−1

∂`∂α0

∂`∂b0

= I−1

Iα0

b0

+

∂`∂α0

∂`∂b0

= I−1

X>WX X>WZ

Z>WX Z>WZ + I ⊗D−1

α0

b0

+

∂`∂α0

∂`∂b0

= I−1

X>W (Xα0 + Zb0)

Z>W (Xα0 + Zb0) + (I ⊗D−1)b0

+

∂`∂α0

∂`∂b0

.

Each score can be rewritten as

∂`

∂α0=

K∑i=1

ni∑j=1

X>ij (yij − µij)φV (µij)g′(µij)

= X>W

g′(µ11)

. . .

g′(µKnK )

(Y − µ) = X>W∂η

∂µ(Y − µ)

∂`

∂b0=

K∑i=1

ni∑j=1

Z>ij (yij − µij)φV (µij)g′(µij)

D−1b01

...

D−1b0K

= Z>W

g′(µ11)

. . .

g′(µKnK )

(Y − µ)− (I ⊗D−1)b0

= Z>W∂η

∂µ(Y − µ)− (I ⊗D−1)b0,

and henceα0

b0

+ I−1

∂`∂α0

∂`∂b0

= I−1

X>W (Xα0 + Zb0)

Z>W (Xα0 + Zb0) + (I ⊗D−1)b0

+

∂`∂α0

∂`∂b0

= I−1

X>W (Xα0 + Zb0)

Z>W (Xα0 + Zb0) + (I ⊗D−1)b0

+

X>W ∂η∂µ(Y − µ)

Z>W ∂η∂µ(Y − µ)− (I ⊗D−1)b0

= I−1

X>W (Xα0 + Zb0 + ∂η∂µ(Y − µ))

Z>W (Xα0 + Zb0 + ∂η∂µ(Y − µ))

Therefore, as in GLM, we can see above that the Newton-Raphson algorithm is equivalent to IRLS-like

procedure as follows. Let

y∗i = Xiα+ Zibi +∂ηi∂µi

(yi − µi).

Note that

E(y∗i |bi) = Xiα+ Zibi and V ar(y∗i |bi) = ∆iV ar(yi|bi)∆i = W−1i ,

105

Page 107: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

where ∆i = diag

(∂ηi1∂µi1

, · · · , ∂ηini∂µini

). Thus, the update

α(p+1)

b(p+1)

=

X>WX X>WZ

Z>WX Z>WZ + I ⊗D−1

−1X>Wy∗

Z>Wy∗

is a weighted least square estimator of (α>, b1, b2, · · · , bK) with ridge penalty with pseudo-outcome

y∗, while W is re-evaluated at b(θ) = b(α(θ)) every iteration.3

Remark 6.21. Practical Remark. R function which performs PQL regards as X>WZ = 0, so that it

updates α and bi’s separately. Thus it may not give the exact solution.

Remark 6.22. Variance component estimation. Variance component θ (for b) can be obtained

by maximizing

`(α, θ) ≈K∑i=1

−1

2log |I + Z>WZD| −

ni∑j=1

`ij(yij ;µij(bi))−1

2b>i D

−1bi

,

which is an “approximated marginal likelihood over θ.” Again, the correction term −12

∑Ki=1 log |D +

Z>WZ| is ignored in PQL method. On the other hand, φ (variance component for α) can be estimated

by maximizing the PQL or by moments method using Pearson chi-squared statistics

φ =K∑i=1

ni∑j=1

(yij − µij(bi))2

V (µij(bi)).

Note that such estimator does not take into account penalizing term.

6.3.3 Marginal Quasi-Likelihood

Marginal Quasi-Likeliood (MQL) evaluates marginal moments and construct GEE-type esti-

mating equation, instead of evaluating marginal likelihood directly. This methods obtains approxi-

mate forms of marginal mean and variance using Taylor expansion around bi = 0 for all i’s, and use

estimating equations to draw inference. This approximation would be close when D ≈ 0 (i.e., bi’s are

almost all nearly zero), otherwise yields biased results. However, there are some special cases that this

approximation yields consistent results, including normal-normal and Poisson-gamma model.

From now on, let h = g−1 be an inverse link function.

3The fact that Fisher scoring in PQL is equivalent to IRLS with ridge penalty is not surprising; in fact, it is coherent,because in PQL we consider ridge-like penalty term of bi’s on the likelihood function.

106

Page 108: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Example 6.23 (Poisson model). Assume that

E(Yij |bi) = biµij , µij = eXijα, and V ar(Yij |bi) = biµij .

Also assume that E(bi) = 1 and V ar(bi) = φ. Then marginal moments become

E(Yij = EE(Yij |bi) = µij

V ar(Yij = V arE(Yij |bi) + EV ar(Yij |bi) = φµ2ij + µij .

(Note that in GLM V ar(Yij) = µij ; there are additional term φµ2ij due to the random effect) Also,

under the conditional independence assumption,

Cov(Yij , Yik) = Cov(E(Yij |bi), E(Yik|bi)) + ECov(Yij , Yik|bi) = µijµikφ.

Thus one can use GEE-type estimating equations to estimate α:

K∑i=1

∂µi∂α

V −1i (Yi − E(Yi|Xi)) = 0,

where

E(Yi|Xi) =

eXi1α

...

eXiniα

, Vi =

µi1 + µ2

i1φ µi1µi2φ · · ·...

. . ....

· · · µi,ni−1µi,niφ µini + µ2iniφ

= φµiµ>i +diag(µi1, · · · , µini).

(Note that α has different interpretation from β in GEE) Variance component φ can be estimated by,

for example, solving ∑i,j

µ2ij

((Yij − µij)2 − (µ2

ijφ+ µij))

= 0,

which is equivalent to minimizing∑

i,j

((Yij − µij)2 − (µ2

ijφ+ µij))

. (It is one of possible approaches

to estimate φ, and other methods are also available.)

Now we see some general form of MQL. MQL uses following approximation for the conditional

moments.

E(Yi|bi) = h(Xiα+ Zibi) ≈ h(Xiα) + h′(Xiα)Zibi =: µ∗i

V ar(Yi|bi) = diag(φV (µij(bi))) ≈ diag(φV (µ∗ij))

107

Page 109: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Based on these approximations, the marginal moments become

E(Yi) =: µi ≈ h(Xiα)

and

V ar(Yi) = V arE(Yi|bi) + EV ar(Yi|bi) ≈ V0i + ∆−1i ZiDZ

>i ∆−1

i .

where

V0i = diag(φV (µi)) = diag(φV (Eµ∗i )) ≈ EV ar(Yi|bi),

∆i = diag(g′(µi)).

(∆−1i = diag(g′(µi)

−1) ≈ diag(h′(Xiα))) So we can set up an estimating equation

U(α, θ) =∂µ>

∂αV ar(Y )−1(Y − µ) = 0,

which is to solve (∵ ∂µ>i /∂α = X>i ∆−1i )

X>∆−1(V0 + ∆−1ZDZ>∆−1

)−1(Y − µ) = 0,

which is again equivalent to

X>(∆V0∆ + ZDZ>)−1∆(Y − µ) = 0.

This also can be solved using pseudo-dependent variable setting

y∗ = η + ∆(Y − µ)

with weight matrix W = V −1 = (∆V0∆ + ZDZ>)−1.

6.4 Conditional Likelihood Approach

When the interest is only in the regression coefficient for time-varying covariate X, say β1, you can

“eliminate the individual-specific term” by conditioning each individual.

Example 6.24 (Motivation). Consider the case of binary Yij with logit link function and ηij =

Xijβ1 + Ziβ2 + bi. Note that

ni∑j=1

Yij =: Yi+ (“How many occurred in total?”)

108

Page 110: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

does not carry information about time-varying covariate coefficient β1, and hence conditioning on∑j Yij will only give the information of β1. (Comment for just a motivation: recall sufficiency of

statistics!) The conditional likelihood is

LC =K∏i=1

ni∏j=1

P (Yij = 1|Xij , bi)YijP (Yij = 0|Xij , bi)

1−Yij

∑pairs

ni∏j=1

P (Yij = 1|Xij , bi)YijP (Yij = 0|Xij , bi)

1−Yij

=K∏i=1

ni∏j=1

eYijβ1Xij

∑pairs

ni∏j=1

eYijβ1Xij

(“What actually happend?”)

(“Summation for all possible cases”)

where∑pairs

means summation over (Yi1, · · · , Yini) that satisfy∑

j Yij = Yi+. The last equality can be

easily verified. For example, assume that Yi1 = 1, Yi2 = Yi3 = 0 are observed. Then

LCi =P (Yi1 = 1|bi)P (Yi2 = 0|bi)P (Yi3 = 0|bi)

P (Yi1 = 1|bi)P (Yi2 = 0|bi)P (Yi3 = 0|bi) + P (Yi1 = 0|bi)P (Yi2 = 1|bi)P (Yi3 = 0|bi) + · · ·

=

eXi1β1+Ziβ2+bi

1 + eXi1β1+Ziβ2+bi

1

1 + eXi2β1+Ziβ2+bi

1

1 + eXi3β1+Ziβ2+bi

eXi1β1+Ziβ2+bi

1 + eXi1β1+Ziβ2+bi

1

1 + eXi2β1+Ziβ2+bi

1

1 + eXi3β1+Ziβ2+bi+ · · ·

=eXi1β1+Ziβ2+bi

eXi1β1+Ziβ2+bi + eXi2β1+Ziβ2+bi + eXi3β1+Ziβ2+bi=

eXi1β

eXi1β + eXi2β + eXi3β.

Note that β2 and bi are “eliminated” by conditioning. Also note that, such logic is valid only when we

use canonical link.

Following proposition gives a generalized version of example 6.24.

Proposition 6.25 (Kalbfleisch, 1978). Consider a model with canonical link and ηij = Xijβ1 +Ziβ2 +

bi. Conditioning on Y(i1), Y(i2), · · · , Y(ini) (i.e., we only knows the values but what matches to each one),

a conditional likelihood can be written as

LC :=

K∏i=1

ni∏j=1

f(Yij |Xij , bi)f(bi)

∑pairs

ni∏j=1

f(Y(ij)|Xij , bi)f(bi)︸ ︷︷ ︸=:LCi

=

K∏i=1

ni∏j=1

exp (Yijβ1Xij/φ)

∑pairs

ni∏j=1

exp(Y(ij)β1Xij/φ

) .

109

Page 111: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

Proof. Since

f(Yij |Xij , bi)f(bi) = exp

(Yijθij − b(θij)

φ+ c(Yij , φ)

)f(bi),

we get

LCi =

exp

ni∑j=1

(Yijθij − b(θij)

φ+ c(Yij , φ)

) f(bi)

∑pairs

exp

ni∑j=1

(Y(ij)θij − b(θij)

φ+ c(Y(ij), φ)

) f(bi)

=

exp

ni∑j=1

Yijθij − b(θij)φ

∑pairs

exp

ni∑j=1

Y(ij)θij − b(θij)φ

(∵∑j

c(Yij , φ) =∑j

c(Y(ij), φ))

=

exp

ni∑j=1

Yijθijφ

∑pairs

exp

ni∑j=1

Y(ij)θij

φ

=

exp

ni∑j=1

Yijηijφ

∑pairs

exp

ni∑j=1

Y(ij)ηij

φ

(∵ canonical link!)

=

exp

ni∑j=1

Yij(Xijβ1 + Ziβ2 + bi)

φ

∑pairs

exp

ni∑j=1

Y(ij)(Xijβ1 + Ziβ2 + bi)

φ

=

ni∏j=1

exp (Yijβ1Xij/φ)

∑pairs

ni∏j=1

exp(Y(ij)β1Xij/φ

) (∵∑j

Yij =∑j

Y(ij))

Remark 6.26 (Pseudo-likelihood). Evaluating the denominator can be computationally prohibitive,

because it requires exp(Y(ij)β1Xij) for all permutation of (Yi1, · · · , Yini). It means that, even if ni

is small as ni = 6, we have to evaluate more than 700 terms of exp(Y(ij)! One alternative is to use

“pairwise conditioning,” which leads to pseudo-likelihood (Liang and Qin, 2000):

K∏i=1

∏j>k

f(Yij |Xij , bi)f(Yik|Xik, bi)

f(Yij |Xij , bi)f(Yik|Xik, bi) + f(Yik|Xij , bi)f(Yij |Xik, bi)

=

K∏i=1

∏j>k

eYijβ1XijeYikβ1Xik

eYijβ1XijeYikβ1Xik + eYikβ1XijeYijβ1Xik.

110

Page 112: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

6.5 Applications and Further topics

6.5.1 Multi-level Modeling

Suppose we have repeated measures of systolic & diastolic blood pressure over time. Then we have

two levels of clustering, i.e., there are two factors which yield correlation; over times and across systolic

vs. diastolic.

Given b1i and b2i, suppose Y1ij (systolic) and Y2ij (diastolic) arise independently from an exponential

family distribvution, and

η1ij = X1ijβ1 + b1i

η2ij = X2ijβ2 + b2i

where bi = (b1i, b2i)> ∼ N(0, D) and D is a 2 × 2 matrix. Then under the assumptions Y1ij ⊥

Y2ij |(b1i, b2i); Y1ij |X1ij , X2ij , b1i, b2id≡ Y1ij |X1ij , b1i, the marginal likelihood becomes

K∏i=1

∫ ni∏j=1

(f(Y1ij |X1ij , b1i)f(Y2ij |X2ij , b2i)) g(bi;D)dbi.

If Ykij |Xkij , bki is distributed as normal with mean Xkijβ + bki, then Yi|Xi is distributed with multi-

variate normal and elements of variance are

Cov(Y1ij , Y1ik) = D11 (j 6= k), Cov(Y2ij , Y2ik) = D22 (j 6= k), and Cov(Y1ij , Y2ik) = D12.

6.5.2 Heavy-tail distribution

Suppose we have repeated measures on random effects whose distributions have heavier tails than

normal distribution. We may handle this by modeling the variance with random effects:

ηij = Xijβ + bi

where bi is the distributed with mean 0 and variance φi where log φi = logD + log ui. That is,

assume that bi|ui ∼ N(0, uiD) where ui is another random effect term, which is, e.g., inverse-gamma

distributed.

When Yij |bi is distributed with normal and ui follows the inverse-gamma distribution with E(ui) = 1

and V ar(ui) = τ/(1 − τ) for 0 ≤ τ ≤ 1. This model can be viewed as an extension of multivariate

111

Page 113: Analysis of Repeated Measurements (Spring 2017) of Repeated Measurements (Spring 2017) ... p1 ˙ p2 ˙ pp 1 C C C C C C A = E(YY> ... p(I;n 1) and nZ>S 1 Z Z˘ (n 1)p n p F(p;n p):

Analysis of Repeated Measurements J.P.Kim

t-distribution, where the degrees of freedom may not necessarily be integer.

\begin{align*}
&\prod_{i=1}^{K}\int\!\!\int \prod_{j=1}^{n_i} f(y_{ij}\mid b_i)\,
\frac{1}{\sqrt{2\pi u_iD}}\exp\!\left(-\frac{b_i^2}{2u_iD}\right)
\frac{1}{u_i^{\alpha+1}}\exp\!\left(-\frac{\alpha}{u_i}\right)db_i\,du_i\\
&\qquad=\prod_{i=1}^{K}\int \prod_{j=1}^{n_i} f(y_{ij}\mid b_i)\,
\underbrace{\int \frac{1}{\sqrt{2\pi u_iD}}\exp\!\left(-\frac{b_i^2}{2u_iD}\right)
\frac{1}{u_i^{\alpha+1}}\exp\!\left(-\frac{\alpha}{u_i}\right)du_i}_{\text{multivariate }t\text{-distribution}}\,db_i.
\end{align*}

In other words, the marginal distribution of $b_i$ is a multivariate $t$-distribution, and hence the model reflects heavy-tailed behavior of the random effects. When $\tau=0$, the model reduces to the usual mixed-effects model with normal random effects.
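A minimal simulation sketch of this scale-mixture construction, for a univariate random intercept with hypothetical values of $D$ and $\tau$. The inverse-gamma shape $(1+\tau)/\tau$ and scale $1/\tau$ used below are one parameterization that matches the stated moments $E(u_i)=1$ and $\operatorname{Var}(u_i)=\tau/(1-\tau)$; the sampled $b_i$ then show heavier-than-normal tails.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

D = 1.0          # random-effect variance scale (hypothetical)
tau = 0.3        # 0 <= tau <= 1 controls tail heaviness; tau -> 0 gives normal b_i
K = 100_000      # number of subjects to simulate

# Inverse-gamma mixing with E(u_i) = 1 and Var(u_i) = tau/(1 - tau):
# shape a = (1 + tau)/tau and scale 1/tau give exactly these two moments.
# If u ~ InvGamma(a, scale = 1/tau), then 1/u ~ Gamma(a, scale = tau).
a = (1 + tau) / tau
u = 1.0 / rng.gamma(shape=a, scale=tau, size=K)

# b_i | u_i ~ N(0, u_i * D): a scale mixture of normals, i.e. a scaled-t marginal.
b = rng.normal(0.0, np.sqrt(u * D))

excess_kurtosis = lambda z: np.mean((z - z.mean())**4) / np.var(z)**2 - 3
print("excess kurtosis of b_i:", excess_kurtosis(b))               # > 0: heavy tails
print("excess kurtosis, N(0,D):", excess_kurtosis(rng.normal(0, np.sqrt(D), K)))
\end{verbatim}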

6.5.3 Summarizing by individual or by time

Instead of one huge model, we can reduce the dimension by summarizing the data by individual or by time.

• Summarizing by individual. Consider diary data: asthma patients record in a "diary," once every three days, whether they had an attack. Because each series is very long, we summarize the data by individual. For this, we consider individual regression coefficients $(\beta_{0i},\beta_{1i})$, regard the individual-specific information as summarized in the fitted $(\hat\beta_{0i},\hat\beta_{1i})$, and base inference on these fitted coefficients. This gives a kind of two-stage modeling (a minimal numerical sketch appears after this list). More precisely, at the first stage we assume that
$$(\hat\beta_{0i},\hat\beta_{1i})^\top\mid \beta_{0i},\beta_{1i}\sim N\big((\beta_{0i},\beta_{1i})^\top,\,W_i(\beta_{0i},\beta_{1i})\big),$$
where $W_i(\beta_{0i},\beta_{1i})$ is the variance of the individual-specific fitted coefficients; it reflects the fact that this variance becomes small when the series is long. At the second stage, the unobservable coefficients $(\beta_{0i},\beta_{1i})^\top$ are assumed to be normally distributed, i.e.,
$$(\beta_{0i},\beta_{1i})^\top\sim N\big((\alpha_0,\alpha_1)^\top,\,D\big).$$
Combining the two stages, marginally we obtain
$$\begin{pmatrix}\hat\beta_{0i}\\ \hat\beta_{1i}\end{pmatrix}\sim N\!\left(\begin{pmatrix}\alpha_0\\ \alpha_1\end{pmatrix},\,W_i+D\right),$$
and we can find maximum likelihood estimators of $\alpha_0$, $\alpha_1$ and $D$ using the $(\hat\beta_{0i},\hat\beta_{1i})$'s.

• Summarizing by time. Several authors (Moulton and Zeger, 1989; Wei and Stram, 1988; Wei and Johnson, 1985) considered obtaining summary statistics by time and combining the time-specific statistics. These methods are useful when the number of time points is limited and the design is balanced over time, i.e., when each time point has a specific meaning. One then summarizes the data at each time point and draws a joint inference from the time-specific statistics. This approach fits into the GEE framework when interaction terms between time and all covariates are included in the model.
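Below is a minimal sketch of the two-stage "summarizing by individual" approach for a continuous outcome with hypothetical diary-like data: stage one fits a per-subject least squares line to obtain $(\hat\beta_{0i},\hat\beta_{1i})$ and $W_i$; stage two combines them across subjects. For brevity the combination shown is moment-based rather than the full maximum likelihood fit of the $N((\alpha_0,\alpha_1)^\top, W_i+D)$ model.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

# --- hypothetical diary-type data: one long series per subject --------------
K, n_i = 50, 120                       # subjects, observations per subject
alpha0, alpha1 = 1.0, -0.5             # population intercept and slope
D_true = np.diag([0.4, 0.1])           # between-subject covariance of (b0i, b1i)

beta_hat, W_list = [], []
for i in range(K):
    b0i, b1i = rng.multivariate_normal([alpha0, alpha1], D_true)
    t = np.linspace(0, 1, n_i)
    y = b0i + b1i * t + rng.normal(0, 1.0, n_i)

    # Stage 1: per-subject least squares; W_i is the usual OLS covariance
    # estimate, which shrinks as the series length n_i grows.
    X = np.column_stack([np.ones(n_i), t])
    coef, res, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2_hat = res[0] / (n_i - 2)
    beta_hat.append(coef)
    W_list.append(sigma2_hat * np.linalg.inv(X.T @ X))

beta_hat = np.array(beta_hat)

# Stage 2 (moment-based): marginally beta_hat_i ~ N((alpha0, alpha1), W_i + D),
# so the sample mean estimates (alpha0, alpha1), and the sample covariance of
# the beta_hat_i minus the average W_i gives a crude estimate of D.
alpha_est = beta_hat.mean(axis=0)
D_est = np.cov(beta_hat.T) - np.mean(W_list, axis=0)
print("alpha estimate:", alpha_est)
print("D estimate:\n", D_est)
\end{verbatim}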


Chapter 7

Nonparametric Longitudinal Data Analysis

7.1 Local Polynomial Regression

7.1.1 Review of Local Polynomial Regression

Recently many methods have been developed to analyze longitudinal data using nonparametric tools such as kernels and splines. Start with the model
$$y=g(t)+\varepsilon$$
and independent observations $(y_i,t_i)_{i=1}^n$ on a closed interval $[a,b]$. It is assumed that $g$ is unknown but expected to be smooth. Fix a point $s$ in the interior of $[a,b]$. Since $g$ is smooth, we can consider the Taylor expansion
$$g(t)\approx g(s)+g'(s)(t-s)+\cdots+\frac{g^{(p)}(s)}{p!}(t-s)^p\quad\text{for }t\text{ near }s.$$
Now estimate $g(s),g'(s),\cdots,g^{(p)}(s)$ by a least squares criterion: we minimize
$$\sum_{i=1}^{n}\big(y_i-\beta_0-\beta_1(t_i-s)-\cdots-\beta_p(t_i-s)^p\big)^2K_h(t_i-s)$$
with respect to $\beta_0,\beta_1,\cdots,\beta_p$, and as the estimate of $g(s)$ take $\hat g(s)=\hat\beta_0$. Note also that $\hat g'(s)=\hat\beta_1,\cdots,\hat g^{(p)}(s)=p!\,\hat\beta_p$. Here $K(\cdot)$ is a kernel function satisfying $K(\cdot)\ge 0$ and $\int K(t)\,dt=1$, and $K_h(\cdot)=K(\cdot/h)/h$. For example, the rectangular kernel $K(t)=I(-\tfrac12<t<\tfrac12)$ or the Gaussian kernel $K(t)=(2\pi)^{-1/2}\exp(-t^2/2)$ can be used. The bandwidth $h$ plays the role of the "size of the neighborhood": a small $h$ gives a tight neighborhood and makes the regression local.


The criterion can be rewritten in matrix form as
$$\sum_{i=1}^{n}\big(y_i-\beta_0-\beta_1(t_i-s)-\cdots-\beta_p(t_i-s)^p\big)^2K_h(t_i-s)=(y-X_s\beta)^\top K_{sh}(y-X_s\beta),$$
where $y=(y_1,\cdots,y_n)^\top$, $\beta=(\beta_0,\beta_1,\cdots,\beta_p)^\top$, $K_{sh}=\operatorname{diag}\big(K_h(t_1-s),\cdots,K_h(t_n-s)\big)$, and
$$X_s=\begin{pmatrix}x_1^\top\\ x_2^\top\\ \vdots\\ x_n^\top\end{pmatrix}
=\begin{pmatrix}1 & t_1-s & \cdots & (t_1-s)^p\\ 1 & t_2-s & \cdots & (t_2-s)^p\\ \vdots & \vdots & \ddots & \vdots\\ 1 & t_n-s & \cdots & (t_n-s)^p\end{pmatrix}.$$
Thus the estimating equation is
$$X_s^\top K_{sh}(y-X_s\beta)=0,$$
which gives the solution
$$\hat\beta=(X_s^\top K_{sh}X_s)^{-1}X_s^\top K_{sh}y.$$
Since $g(s)\approx\beta_0$, we get
$$\hat g(s)=\hat\beta_0=e_1^\top(X_s^\top K_{sh}X_s)^{-1}X_s^\top K_{sh}y,$$
where $e_1=(1,0,0,\cdots,0)^\top$.

Remark 7.1. (a) One WLS computation is needed for each evaluation point $s$. The computation is not heavy, however, since we only invert a $(p+1)\times(p+1)$ matrix, and in practice $p$ is set to $0$ or $1$.

(b) (Choice of order $p$) In practice $p$ is set to $0$ or $1$. $p=0$ yields the local constant approximation, or Nadaraya-Watson estimator,
$$\hat g(s)=\frac{\sum_{i=1}^{n}K_h(t_i-s)\,y_i}{\sum_{i=1}^{n}K_h(t_i-s)}.$$
$p=1$ yields the local linear approximation.

(c) (Choice of bandwidth $h$) Too large an $h$ (too much weight on far-off points) yields an inaccurate estimate (larger bias), while too small an $h$ (very few points used for estimation) yields larger variance. Usually we choose the $h$ that minimizes the MSE or MISE empirically, e.g., by cross-validation.
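A minimal sketch of the local polynomial estimator described above (here local linear, $p=1$, with a Gaussian kernel and hypothetical data): one small weighted least squares fit is performed at each evaluation point $s$, and $\hat g(s)$ is the fitted intercept.

\begin{verbatim}
import numpy as np

def local_poly(t, y, s_grid, h, p=1):
    """g_hat(s) = e1' (Xs' Ks Xs)^{-1} Xs' Ks y, one small WLS fit per point s."""
    kernel = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    g_hat = np.empty(len(s_grid))
    for m, s in enumerate(s_grid):
        Xs = np.column_stack([(t - s)**d for d in range(p + 1)])  # 1,(t-s),...,(t-s)^p
        w = kernel((t - s) / h) / h                               # K_h(t_i - s)
        XtW = Xs.T * w                                            # Xs' K_sh
        beta = np.linalg.solve(XtW @ Xs, XtW @ y)
        g_hat[m] = beta[0]                                        # g_hat(s) = beta0_hat
    return g_hat

# hypothetical data from y = g(t) + eps with g(t) = sin(2*pi*t)
rng = np.random.default_rng(2)
t = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * t) + rng.normal(0, 0.3, 200)

s_grid = np.linspace(0.05, 0.95, 19)
print(local_poly(t, y, s_grid, h=0.1))        # compare with np.sin(2*np.pi*s_grid)
\end{verbatim}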

7.1.2 Local Polynomial: Population mean model for longitudinal data

Suppose a longitudinal dataset $(y_{ij},t_{ij})$ is observed for $i=1,2,\cdots,n$ and $j=1,2,\cdots,m_i$.


Naive Population Mean model (NPM)

Naively we can just minimize
$$\sum_{i=1}^{n}\sum_{j=1}^{m_i}\big(y_{ij}-\beta_0-\beta_1(t_{ij}-s)-\cdots-\beta_p(t_{ij}-s)^p\big)^2K_h(t_{ij}-s)
=\sum_{i=1}^{n}(y_i-X_{s,i}\beta)^\top K_{sh,i}(y_i-X_{s,i}\beta)=(y-X_s\beta)^\top K_{sh}(y-X_s\beta),$$
where $\beta=(\beta_0,\cdots,\beta_p)^\top$, $y_i=(y_{i1},\cdots,y_{im_i})^\top$, $K_{sh,i}=\operatorname{diag}\big(K_h(t_{i1}-s),\cdots,K_h(t_{im_i}-s)\big)$,
$$X_{s,i}=\begin{pmatrix}1 & t_{i1}-s & \cdots & (t_{i1}-s)^p\\ 1 & t_{i2}-s & \cdots & (t_{i2}-s)^p\\ \vdots & \vdots & \ddots & \vdots\\ 1 & t_{im_i}-s & \cdots & (t_{im_i}-s)^p\end{pmatrix},$$
and
$$y=\begin{pmatrix}y_1\\ \vdots\\ y_n\end{pmatrix},\quad
X_s=\begin{pmatrix}X_{s,1}\\ \vdots\\ X_{s,n}\end{pmatrix},\quad
K_{sh}=\operatorname{blockdiag}(K_{sh,1},\cdots,K_{sh,n}).$$
Note that NPM does not take the correlation structure into account, but this naive LS still gives good (e.g., consistent) point estimation. The solution is
$$\hat\beta=(X_s^\top K_{sh}X_s)^{-1}X_s^\top K_{sh}y
=\left(\sum_{i=1}^{n}X_{s,i}^\top K_{sh,i}X_{s,i}\right)^{-1}\sum_{i=1}^{n}X_{s,i}^\top K_{sh,i}y_i.$$
From $f(s)\approx\beta_0$, we get
$$\hat f(s)=\hat\beta_0=(1,\mathbf{0}_p^\top)\big(X_s^\top K_{sh}X_s\big)^{-1}X_s^\top K_{sh}y.$$

GEE-type model (NPM-GEE)

We can reflect a "working correlation" in the model. Assume $m_i\equiv m$ and let $V_{m\times m}$ be a working covariance matrix. Then we minimize
$$\sum_{i=1}^{n}(y_i-X_{s,i}\beta)^\top K_{sh,i}^{1/2}V^{-1}K_{sh,i}^{1/2}(y_i-X_{s,i}\beta).$$


Then the solution is
$$\hat\beta=\left(\sum_{i=1}^{n}X_{s,i}^\top K_{sh,i}^{1/2}V^{-1}K_{sh,i}^{1/2}X_{s,i}\right)^{-1}\sum_{i=1}^{n}X_{s,i}^\top K_{sh,i}^{1/2}V^{-1}K_{sh,i}^{1/2}y_i,$$
and hence
$$\hat f(s)=(1,\mathbf{0}_p^\top)\left(\sum_{i=1}^{n}X_{s,i}^\top K_{sh,i}^{1/2}V^{-1}K_{sh,i}^{1/2}X_{s,i}\right)^{-1}\sum_{i=1}^{n}X_{s,i}^\top K_{sh,i}^{1/2}V^{-1}K_{sh,i}^{1/2}y_i.$$

Remark 7.2. It turns out that ignoring the correlation (i.e., using a working independence structure) gives a more efficient estimate than accounting for the correlation.
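A minimal sketch of the NPM-GEE local linear estimator $\hat f(s)$ above, assuming balanced hypothetical data ($m_i\equiv m$); taking $V=I_m$ recovers the working-independence (NPM) estimator, which by Remark 7.2 is actually the preferable choice for kernel methods.

\begin{verbatim}
import numpy as np

def npm_gee_local_linear(T, Y, V, s, h):
    """f_hat(s) = e1' (sum_i Xsi' Ki^{1/2} V^{-1} Ki^{1/2} Xsi)^{-1}
                       sum_i Xsi' Ki^{1/2} V^{-1} Ki^{1/2} y_i   (local linear, p = 1).
    T, Y: (n, m) arrays of time points and responses; V: (m, m) working covariance."""
    kernel = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    Vinv = np.linalg.inv(V)
    A, b = np.zeros((2, 2)), np.zeros(2)
    for t_i, y_i in zip(T, Y):
        Xsi = np.column_stack([np.ones(len(t_i)), t_i - s])
        Khalf = np.diag(np.sqrt(kernel((t_i - s) / h) / h))       # K_{sh,i}^{1/2}
        M = Khalf @ Vinv @ Khalf
        A += Xsi.T @ M @ Xsi
        b += Xsi.T @ M @ y_i
    return np.linalg.solve(A, b)[0]

# hypothetical balanced data with an exchangeable true correlation structure
rng = np.random.default_rng(3)
n, m = 100, 5
T = np.tile(np.linspace(0, 1, m), (n, 1))
Sigma = 0.09 * (0.5 * np.ones((m, m)) + 0.5 * np.eye(m))          # corr = 0.5
Y = np.sin(2 * np.pi * T) + rng.multivariate_normal(np.zeros(m), Sigma, size=n)

print(npm_gee_local_linear(T, Y, V=np.eye(m), s=0.5, h=0.15))     # working independence
print(npm_gee_local_linear(T, Y, V=Sigma,     s=0.5, h=0.15))     # true Sigma as working
\end{verbatim}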

7.2 Spline methods

7.2.1 Review: Spline

There are two types of spline regression:

• Spline regression or penalized spline regression (P-spline). This is just a basis expansion; e.g., the truncated power basis can be used.

• Smoothing spline (S-spline). This is performed via the natural cubic spline basis.

P-spline

Consider $y=f(t)+\varepsilon$ and independent observations $(y_i,t_i)_{i=1}^n$ on a closed interval $[a,b]$. Let the class $\mathcal{F}$ of functions be known but $f$ itself unknown. Denote the basis functions of $\mathcal{F}$ by $\phi_1(t),\phi_2(t),\cdots,\phi_l(t)$ and assume $f\in\mathcal{F}$, i.e.,
$$f(t)=\beta_1\phi_1(t)+\beta_2\phi_2(t)+\cdots+\beta_l\phi_l(t).$$
Then the coefficient vector $\beta=(\beta_1,\cdots,\beta_l)^\top$ is estimated by minimizing the least squares criterion
$$\sum_{i=1}^{n}\big(y_i-\beta_1\phi_1(t_i)-\beta_2\phi_2(t_i)-\cdots-\beta_l\phi_l(t_i)\big)^2$$
or the penalized criterion
$$\sum_{i=1}^{n}\big(y_i-\beta_1\phi_1(t_i)-\beta_2\phi_2(t_i)-\cdots-\beta_l\phi_l(t_i)\big)^2+\lambda P(\beta),$$


where $P(\beta)$ is a specified penalty function of $\beta$. There are several choices for the basis functions $\phi_i(\cdot)$. For a global basis: a polynomial basis (of order $p$), the Fourier basis, or the eigenfunctions of a covariance operator (FPCA; Karhunen-Loeve theorem). For a local basis: fix knots $a=\tau_0<\tau_1<\cdots<\tau_M<\tau_{M+1}=b$ on $[a,b]$ and consider the truncated power basis $1,t,\cdots,t^p,(t-\tau_1)_+^p,\cdots,(t-\tau_M)_+^p$, or the natural cubic spline basis $1,t,d_1(t)-d_{M-1}(t),\cdots,d_{M-2}(t)-d_{M-1}(t)$ where
$$d_k(t)=\frac{(t-\tau_k)_+^3-(t-\tau_M)_+^3}{\tau_M-\tau_k},$$
or B-splines, wavelets, etc. A small numerical P-spline sketch follows Remark 7.3 below.

Remark 7.3. In spline regression, we need to choose

• the number of knots;

• location of knots;

• degree p (if one uses polynomial basis);

• and smoothing parameter λ (if one uses P-spline).

Such quantities can be selected via, for example, CV or GCV. There are many other criteria as well, such as goodness-of-fit, model complexity, generalized maximum likelihood, AIC, BIC, etc.
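A minimal sketch of a P-spline fit using the truncated power basis listed above, on hypothetical data with equally spaced interior knots. The ridge-type penalty on the truncated-power coefficients used here is one common choice of $P(\beta)$, not the only one.

\begin{verbatim}
import numpy as np

def truncated_power_basis(t, knots, p=1):
    """Columns: 1, t, ..., t^p, (t - tau_1)_+^p, ..., (t - tau_M)_+^p."""
    cols = [t**d for d in range(p + 1)]
    cols += [np.clip(t - tau, 0, None)**p for tau in knots]
    return np.column_stack(cols)

def pspline_fit(t, y, knots, p=1, lam=1.0):
    """Penalized LS: beta_hat = (X'X + lam*G)^{-1} X'y, where G penalizes only the
    truncated-power (knot) coefficients.  Returns the fitted function."""
    X = truncated_power_basis(t, knots, p)
    G = np.diag(np.r_[np.zeros(p + 1), np.ones(len(knots))])
    beta = np.linalg.solve(X.T @ X + lam * G, X.T @ y)
    return lambda tnew: truncated_power_basis(np.asarray(tnew), knots, p) @ beta

# hypothetical data and equally spaced interior knots on [0, 1]
rng = np.random.default_rng(4)
t = np.sort(rng.uniform(0, 1, 150))
y = np.sin(2 * np.pi * t) + rng.normal(0, 0.3, 150)
f_hat = pspline_fit(t, y, knots=np.linspace(0.1, 0.9, 9), p=1, lam=0.1)
print(f_hat([0.25, 0.5, 0.75]))            # compare with sin(2*pi*t) at these points
\end{verbatim}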

Smoothing spline

Basically it comes from minimizing
$$\sum_{i=1}^{n}\big(y_i-f(t_i)\big)^2+\lambda\int_a^b\big(f''(t)\big)^2\,dt$$
with respect to $f\in W_2^2[a,b]$, where
$$W_2^2[a,b]=\left\{f:[a,b]\to\mathbb{R}\ :\ f,f'\text{ absolutely continuous and }f''\in L^2[a,b]\right\}$$
denotes a Sobolev space. The penalty term gives another way to control the roughness of $f$. It is known that the minimizer of the above criterion is a natural cubic spline with knots at $t_1,t_2,\cdots,t_n$ if the $t_i$'s are distinct. Based on this, the criterion can be rewritten as
$$\sum_{i=1}^{n}\big(y_i-\beta_1\phi_1(t_i)-\cdots-\beta_M\phi_M(t_i)\big)^2+\lambda\beta^\top M\beta=(y-X\beta)^\top(y-X\beta)+\lambda\beta^\top M\beta,$$
where $M$ is the number of distinct time points, $\phi_1(t),\cdots,\phi_M(t)$ denote the natural cubic spline basis functions, and (abusing notation) $M$ is also the $M\times M$ matrix with
$$M_{ij}=\int_a^b\phi_i''(t)\phi_j''(t)\,dt.$$


We also use the conventional notations $y=(y_1,\cdots,y_n)^\top$, $\beta=(\beta_1,\cdots,\beta_M)^\top$, and
$$X=\begin{pmatrix}\phi_1(t_1) & \phi_2(t_1) & \cdots & \phi_M(t_1)\\ \phi_1(t_2) & \phi_2(t_2) & \cdots & \phi_M(t_2)\\ \vdots & \vdots & \ddots & \vdots\\ \phi_1(t_n) & \phi_2(t_n) & \cdots & \phi_M(t_n)\end{pmatrix}.$$
Then
$$\hat\beta=(X^\top X+\lambda M)^{-1}X^\top y,$$
and hence
$$\hat f(t)=\Phi(t)^\top(X^\top X+\lambda M)^{-1}X^\top y,$$
where $\Phi(t)=(\phi_1(t),\cdots,\phi_M(t))^\top$.

7.2.2 Spline regression: Population mean model for longitudinal data

Suppose a longitudinal dataset $(y_{ij},t_{ij})$ is observed for $i=1,2,\cdots,n$ and $j=1,2,\cdots,m_i$. We assume that

• $E(y_{ij}\mid t_{ij})=f(t_{ij})$;

• $(y_i,t_i)_{i=1}^n$ are independent;

• $\operatorname{Var}(y_i\mid t_i)=\Sigma$ (working correlation).

Naive Population Mean model (NPM)

NPM pretends each $(y_{ij},t_{ij})$ is an independent observation, and

i) (P-spline) minimizes
$$\sum_{i=1}^{n}\sum_{j=1}^{m_i}\big(y_{ij}-\beta_0-\beta_1t_{ij}-\cdots-\beta_{p+M}(t_{ij}-\tau_M)_+^p\big)^2+\lambda\sum_{k=p+1}^{p+M}\beta_k^2
=\sum_{i=1}^{n}(y_i-X_i\beta)^\top(y_i-X_i\beta)+\lambda\beta^\top G\beta=(y-X\beta)^\top(y-X\beta)+\lambda\beta^\top G\beta.$$
Here $\beta=(\beta_0,\cdots,\beta_{p+M})^\top$, $G=\operatorname{diag}(\mathbf{0}_{p+1},\mathbf{1}_M)$, $y_i=(y_{i1},\cdots,y_{im_i})^\top$,
$$X_i=\begin{pmatrix}1 & t_{i1} & \cdots & (t_{i1}-\tau_M)_+^p\\ 1 & t_{i2} & \cdots & (t_{i2}-\tau_M)_+^p\\ \vdots & \vdots & \ddots & \vdots\\ 1 & t_{im_i} & \cdots & (t_{im_i}-\tau_M)_+^p\end{pmatrix},\quad
y=\begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{pmatrix},\quad\text{and}\quad
X=\begin{pmatrix}X_1\\ X_2\\ \vdots\\ X_n\end{pmatrix}.$$

Then the solution is
$$\hat\beta=(X^\top X+\lambda G)^{-1}X^\top y=\left(\sum_{i=1}^{n}X_i^\top X_i+\lambda G\right)^{-1}\sum_{i=1}^{n}X_i^\top y_i,$$
and hence
$$\hat f(t)=\Phi(t)^\top(X^\top X+\lambda G)^{-1}X^\top y=\Phi(t)^\top\left(\sum_{i=1}^{n}X_i^\top X_i+\lambda G\right)^{-1}\sum_{i=1}^{n}X_i^\top y_i.$$

ii) (S-spline) minimizes
$$\sum_{i=1}^{n}\sum_{j=1}^{m_i}\big(y_{ij}-\beta_1\phi_1(t_{ij})-\cdots-\beta_M\phi_M(t_{ij})\big)^2+\lambda\beta^\top M\beta
=\sum_{i=1}^{n}(y_i-X_i\beta)^\top(y_i-X_i\beta)+\lambda\beta^\top M\beta=(y-X\beta)^\top(y-X\beta)+\lambda\beta^\top M\beta,$$
where $M$ is the number of distinct time points among the $t_{ij}$'s and the other quantities are defined analogously.

NPM-GEE model

The NPM-GEE method for splines was also proposed by Welsh, Lin and Carroll (2002). It incorporates a working correlation structure into NPM and constructs a GEE-type estimation procedure. Assume $m_i\equiv m$ and let $W_{m\times m}$ be a working covariance matrix. We estimate via

i) (P-spline) minimizing
$$\sum_{i=1}^{n}(y_i-X_i\beta)^\top W^{-1}(y_i-X_i\beta)+\lambda\beta^\top G\beta.$$

The solution is
$$\hat\beta=\left(\sum_{i=1}^{n}X_i^\top W^{-1}X_i+\lambda G\right)^{-1}\sum_{i=1}^{n}X_i^\top W^{-1}y_i,$$
and hence we get
$$\hat f(t)=\Phi(t)^\top\left(\sum_{i=1}^{n}X_i^\top W^{-1}X_i+\lambda G\right)^{-1}\sum_{i=1}^{n}X_i^\top W^{-1}y_i.$$


ii) (S-spline) minimizing
$$\sum_{i=1}^{n}(y_i-X_i\beta)^\top W^{-1}(y_i-X_i\beta)+\lambda\beta^\top M\beta.$$
It gives the solution
$$\hat\beta=\left(\sum_{i=1}^{n}X_i^\top W^{-1}X_i+\lambda M\right)^{-1}\sum_{i=1}^{n}X_i^\top W^{-1}y_i.$$

Remark 7.4. Recall that in an ordinary GEE model, if the true covariance structure is used as the working one ($W=\Sigma$), the estimator is the most efficient. For spline NPM-GEE methods this tendency remains the same, but for kernel methods, using the true covariance as the working one is inefficient (Lin and Carroll, 2000); rather, working independence is preferable! (cf. Remark 7.2) One reason is that kernel methods are applied locally, so they need a local covariance structure, while the working correlation structure acts globally. NPM-GEE works for spline methods because splines are fit globally.
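Finally, a minimal sketch of the longitudinal NPM-GEE P-spline estimator of Section 7.2.2, assuming balanced hypothetical data ($m_i\equiv m$), a working covariance $W$, and the same truncated power basis and penalty matrix $G$ as before; setting $W=I_m$ gives the working-independence NPM fit.

\begin{verbatim}
import numpy as np

def design(t, knots, p=1):
    """Truncated power basis Phi(t)': 1, t, ..., t^p, (t - tau_k)_+^p."""
    cols = [t**d for d in range(p + 1)] + [np.clip(t - tau, 0, None)**p for tau in knots]
    return np.column_stack(cols)

def npm_gee_pspline(T, Y, W, knots, p=1, lam=0.1):
    """beta_hat = (sum_i Xi' W^{-1} Xi + lam*G)^{-1} sum_i Xi' W^{-1} y_i;
    returns f_hat(t) = Phi(t)' beta_hat.  W = I gives the working-independence NPM fit."""
    Winv = np.linalg.inv(W)
    q = p + 1 + len(knots)
    A = lam * np.diag(np.r_[np.zeros(p + 1), np.ones(len(knots))])   # lam * G
    b = np.zeros(q)
    for t_i, y_i in zip(T, Y):
        Xi = design(t_i, knots, p)
        A += Xi.T @ Winv @ Xi
        b += Xi.T @ Winv @ y_i
    beta = np.linalg.solve(A, b)
    return lambda tnew: design(np.asarray(tnew), knots, p) @ beta

# hypothetical balanced longitudinal data (m_i = m for all i)
rng = np.random.default_rng(5)
n, m = 80, 6
T = np.tile(np.linspace(0, 1, m), (n, 1))
Sigma = 0.09 * (0.5 * np.ones((m, m)) + 0.5 * np.eye(m))
Y = np.cos(2 * np.pi * T) + rng.multivariate_normal(np.zeros(m), Sigma, size=n)

f_hat = npm_gee_pspline(T, Y, W=Sigma, knots=np.linspace(0.1, 0.9, 9), p=1, lam=0.1)
print(f_hat([0.0, 0.25, 0.5]))              # compare with cos(2*pi*t) at these points
\end{verbatim}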
