
4. Generalized Linear Models (GLMs)

Neil Laws¹

¹ Building on previous versions of this material by Geoff Nicholls and Jen Rogers.

1

4.1 GLM setup

Similar notation to before: (yi, xi), i = 1, . . . , n, with
• yi the ith response
• xi = (xi1, . . . , xip)ᵀ a p-vector of explanatory variables for the ith case
• the n × p matrix X whose ith row is xiᵀ
• the parameter vector β = (β1, . . . , βp)ᵀ.

2

GLMs generalise LMs (linear models). Both have:

(i) stochastic part
• LM: Yi ∼ N(µi, σ²)
• GLM: Yi ∼ a more general pdf/pmf f

(ii) deterministic part, the same for LM and GLM: a linear predictor ηi = xiᵀβ, where usually xi1 = 1 for all i, an intercept term

(iii) link between stochastic and deterministic parts: let µi = E(Yi), then
• LM: µi = ηi
• GLM: g(µi) = ηi.

g is called the link function; it is a smooth invertible (increasing) function of the mean µi.

The linear predictor ηi = xiᵀβ can take any value in R:
• g and g⁻¹ are such that g⁻¹(ηi) takes only appropriate values for µi
• for example g⁻¹ : R → [0, 1] in the Bernoulli case where µi is a probability, and in this case g : [0, 1] → R.

3

4.2 Exponential families

An exponential dispersion family has a pdf/pmf of the form

  f(y; θ, φ) = exp{ (yθ − κ(θ))/φ + c(y; φ) }.

In general the parameters θ and φ are unknown, with φ > 0.

The case φ = 1 gives a natural exponential family (with natural parameter θ). We will mostly deal with binomial and Poisson models, where φ = 1.

For normal models φ = σ² (but we have handled normal LMs without the GLM framework).

If φ is known we have a one-parameter exponential family (i.e. without the "dispersion" addition, and the natural parameter is θ/φ). κ(·) and c(·) are assumed known.

4

Example: binomial

Y ∼ Binomial(m, π), where m is fixed and known.

  P(Y = y) = (m choose y) π^y (1 − π)^(m−y)
           = exp{ log(m choose y) + y log π + (m − y) log(1 − π) }
           = exp{ y log(π/(1 − π)) + m log(1 − π) + log(m choose y) }

This corresponds to:
• θ = log(π/(1 − π)), φ = 1, and so π = e^θ/(1 + e^θ)
• κ(θ) = −m log(1 − π) = m log(1 + e^θ)
• c(y; φ) = log(m choose y)

5
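
As a quick sanity check (our own illustration, not from the slides), the exponential-dispersion form above can be compared numerically with R's dbinom(); the values of m and π here are arbitrary.

  m <- 10; p <- 0.3
  theta <- log(p / (1 - p))                     # natural parameter
  kappa <- m * log(1 + exp(theta))              # cumulant function kappa(theta)
  y <- 0:m
  f <- exp(y * theta - kappa + lchoose(m, y))   # phi = 1, c(y; phi) = log(m choose y)
  all.equal(f, dbinom(y, m, p))                 # TRUE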


Example: Poisson

Y ∼ Poisson(λ)

  P(Y = y) = λ^y exp(−λ)/y! = exp( y log λ − λ − log(y!) )

This corresponds to:
• θ = log λ, φ = 1, and so λ = e^θ
• κ(θ) = λ = e^θ
• c(y; φ) = −log(y!)

6


Each Yi in our GLM has an exponential dispersion pdf/pmf as above with parameter θi.

That is, the parameters θi vary with i.

We didn't bother with subscripts of i above, and we also omit them in a few places below.

Ex 1: In the case of logistic regression for binary data, the model will be Yi ∼ Bernoulli(πi), where θi = log(πi/(1 − πi)) = β1xi1 + · · · + βpxip.

Ex 2: In the case of a log-linear model for count data, the model will be Yi ∼ Poisson(λi), where θi = log λi = β1xi1 + · · · + βpxip.

Note that in both of these examples the parameter θi is itself given in terms of the parameter vector β = (β1, . . . , βp)ᵀ.

7

4.2.1 Moments

  1 = ∫ exp{ (yθ − κ(θ))/φ + c(y; φ) } dy

where ∫ means ∑ in the discrete case.

Differentiate w.r.t. θ:

  0 = ∫ (1/φ)(y − κ′(θ)) exp{. . .} dy = (1/φ)( E(Y) − κ′(θ) ).

Letting µ(θ) = E(Y), we obtain µ(θ) = E(Y) = κ′(θ).

8

Differentiating a second time:

  0 = ∫ { (1/φ)(−κ′′(θ)) + (1/φ²)(y − κ′(θ))² } exp{. . .} dy
    = −κ′′(θ)/φ + (1/φ²) var(Y).

Hence var(Y) = φκ′′(θ).

Hence µ′(θ) = κ′′(θ) = (1/φ) var(Y) > 0, as φ > 0 and var(Y) > 0.

So µ is an increasing function of θ, so invertible, and we can write both µ = µ(θ) and θ = θ(µ). That is, we can parametrise using θ (canonical parametrisation) or using µ (mean parametrisation).

In our GLM we have η = g(µ), an invertible (increasing) function of µ, so using the above we can also write η = η(θ) and θ = θ(η).

9
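
These identities are easy to check numerically; the sketch below (ours, with arbitrary m and π) differentiates κ(θ) = m log(1 + e^θ) from the binomial example and recovers the familiar mean mπ and variance mπ(1 − π).

  m <- 10; p <- 0.3
  theta <- log(p / (1 - p))
  kappa <- function(t) m * log(1 + exp(t))
  h <- 1e-4
  (kappa(theta + h) - kappa(theta - h)) / (2 * h)    # kappa'(theta), approx m*p = 3
  (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h^2
                                                     # kappa''(theta), approx m*p*(1-p) = 2.1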

We define a variance function V(µ) by

  var(Y) = φV(µ).

Exercise 4.1
Using var(Y) = φκ′′(θ), show that V(µ) = κ′′( κ′⁻¹(µ) ).

Solution
V(µ) = var(Y)/φ = κ′′(θ), and µ = κ′(θ) so θ = κ′⁻¹(µ), hence V(µ) = κ′′( κ′⁻¹(µ) ). □

So in general the variance (function) depends on the mean µ – this is different to what we had for LMs.

10


Exercise 4.2
Show that dθ/dµ = 1/V(µ).

Solution
We have µ = κ′(θ). Differentiating w.r.t. θ:

  dµ/dθ = κ′′(θ)

and hence

  dθ/dµ = 1/κ′′(θ) = 1/V(µ), from Exercise 4.1. □

11


4.2.2 Canonical link

Recall that we have both η = η(θ) and θ = θ(η). One particularly simple possibility is that θ = η.

The canonical link function is the g for which θ = η. This arises when

  η = θ = κ′⁻¹(µ) = κ′⁻¹( g⁻¹(η) )

that is, when κ′(η) = g⁻¹(η), which corresponds to g⁻¹ = κ′, or g = κ′⁻¹.

12
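
In R, each family object records a default link, which for binomial and poisson is exactly the canonical choice derived on the next two slides; a quick check:

  binomial()$link          # "logit": g(mu) = log(mu/(1 - mu))
  poisson()$link           # "log":   g(mu) = log(mu)
  binomial()$linkinv(0)    # 0.5 = e^0/(1 + e^0), the inverse link g^{-1}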

Example: binomial

Y ∼ Binomial(m, π), for which κ(θ) = m log(1 + e^θ).

  µ = E(Y) = κ′(θ) = m e^θ/(1 + e^θ) = mπ

  V(µ) = κ′′(θ) = m[ e^θ/(1 + e^θ) − (e^θ)²/(1 + e^θ)² ] = mπ(1 − π) = µ(m − µ)/m.

For the canonical link: θ = η = g(µ) = κ′⁻¹(µ).

From above µ = κ′(θ) = m e^θ/(1 + e^θ), so solving this for θ will give us κ′⁻¹ = g, the canonical link.

Solving: θ = log( µ/(m − µ) ).

Hence the canonical link is g(µ) = log( µ/(m − µ) ).

13

Example: Poisson

Y ∼ Poisson(λ), for which κ(θ) = λ = e^θ.

  µ = E(Y) = κ′(θ) = e^θ = λ
  V(µ) = κ′′(θ) = e^θ = λ

Canonical link: θ = η = g(µ) = κ′⁻¹(µ), and from above µ = e^θ; solving gives θ = log µ, and hence the canonical link is g(µ) = log µ.

14

4.3 Inference

Recall we have (yi, xi), i = 1, . . . , n, etc.
• We start with one parameter θi for each observation yi, i = 1, . . . , n.
• Recall θ = θ(η).

We model θi = θ(xiᵀβ) in terms of the set of parameters β = (β1, . . . , βp)ᵀ. Equivalently θi = θ(ηi), where ηi = xiᵀβ.

• So we now have just p parameters β1, . . . , βp in the GLM (often p ≪ n).
• Since µi = E(Yi) = g⁻¹(xiᵀβ) we are back to the situation with p explanatory variables xi = (xi1, . . . , xip)ᵀ for the ith response, and if βj > 0 the mean response µi increases as xij increases.
• Significant explanatory variables generate a significant shift in the mean response.

15

4.3.1 Likelihood

Since Yi ∼ f(yi; θi, φ), the log-likelihood for β is

  ℓ(β) = ∑ᵢ₌₁ⁿ { (yiθi − κ(θi))/φ + c(yi; φ) }

where θi = θi(β). The MLE β̂ for β is the solution to the p equations

  ∂ℓ/∂βj = 0,  j = 1, . . . , p.

There is no closed form for the MLE. MLEs can be computed numerically using the iteratively re-weighted least squares (IRLS) algorithm.

16

4.3.2 IRLS

Recall the Newton–Raphson algorithm (e.g. see A9 Statistics notes) to compute β̂:

• start at some β(0) and iterate until convergence using

  β(k+1) = β(k) + J⁻¹ (∂ℓ/∂β)|β(k)   (4.1)

Here
• ∂ℓ/∂β is the column vector of partial derivatives (∂ℓ/∂β1, . . . , ∂ℓ/∂βp)ᵀ
• J is the observed information matrix, with (i, j) element −∂²ℓ/∂βi∂βj
• and in (4.1) both J⁻¹ and ∂ℓ/∂β are evaluated at β = β(k).

We can write J as J = −∂²ℓ/∂β∂βᵀ.

Note: β is a column vector and ββᵀ is a matrix, so the notation ∂ℓ/∂β and −∂²ℓ/∂β∂βᵀ makes sense.

17

Fisher scoring replaces J by I to give the iteration:

  β(k+1) = β(k) + I⁻¹ (∂ℓ/∂β)|β(k)   (4.2)

where I is the expected information

  I = −E( ∂²ℓ/∂β∂βᵀ ).

18

For a GLM we can simplify things a bit:

  ∂ℓ/∂βj = ∑ᵢ₌₁ⁿ (∂ℓ/∂ηi)(∂ηi/∂βj)   (4.3)
         = ∑ᵢ₌₁ⁿ (∂ℓ/∂θi)(dθi/dηi) xij   (4.4)
         = ∑ᵢ₌₁ⁿ ((yi − µi)/φ)(dθi/dµi)(dµi/dηi) xij   (4.5)
         = ∑ᵢ₌₁ⁿ (yi − µi)/( φV(µi)g′(µi) ) xij   (4.6)

where we have used the chain rule several times and ηi = ηi(θi), θi = θi(µi), g(µi) = ηi and Exercise 4.2.

19

In vector notation:

  ∂ℓ/∂β = Xᵀu   (4.7)

where u = ∂ℓ/∂η = (u1, . . . , un)ᵀ and where

  ui = ∂ℓ/∂ηi = (yi − µi)/( φV(µi)g′(µi) ).

Taking transposes in (4.7):

  ∂ℓ/∂βᵀ = uᵀX = (∂ℓ/∂ηᵀ)X.   (4.8)

20

Looking at (4.3) and replacing ℓ by a general function we have

  ∂/∂βj = ∑ᵢ₌₁ⁿ xij ∂/∂ηi,  that is  ∂/∂β = Xᵀ ∂/∂η.   (4.9)

Then using (4.9) to differentiate (4.8) gives

  ∂²ℓ/∂β∂βᵀ = (∂/∂β)( (∂ℓ/∂ηᵀ) X ) = Xᵀ (∂/∂η)(∂ℓ/∂ηᵀ) X = Xᵀ (∂²ℓ/∂η∂ηᵀ) X.

So the expected information matrix is I = XᵀWX where

  W = −E( ∂²ℓ/∂η∂ηᵀ ).

21

We have from (4.3)–(4.6) that ui = ∂ℓ/∂ηi is a function of µi only. Also µi = g⁻¹(ηi) is a function of ηi only. Hence all mixed partial derivatives are zero: ∂²ℓ/∂ηi∂ηj = 0 for i ≠ j.

So W is a diagonal matrix.

The diagonal entries are

  Wii = 1/( φV(µi)g′(µi)² ).

22

Exercise 4.3
Show that Wii = 1/( φV(µi)g′(µi)² ).

Solution
We have

  E[∂²ℓ/∂ηi²] = E[ (∂/∂ηi)(∂ℓ/∂ηi) ]
              = E[ (∂/∂ηi)( (∂ℓ/∂θi)(dθi/dηi) ) ]
              = E[ (∂/∂θi)( (∂ℓ/∂θi)(dθi/dηi) ) (dθi/dηi) ]
              = E[ (∂²ℓ/∂θi²)(dθi/dηi)² + {. . .} ∂ℓ/∂θi ]
              = (dθi/dηi)² E[∂²ℓ/∂θi²]   (4.10)

since dθi/dηi and {. . .} are non-random terms, and using the standard result E(∂ℓ/∂θi) = 0.

23


Also

  ∂²ℓ/∂θi² = (∂/∂θi)( (yi − κ′(θi))/φ ) = −κ′′(θi)/φ = −V(µi)/φ.   (4.11)

Now g(µi) = ηi, so g′(µi) = dηi/dµi and then dµi/dηi = 1/g′(µi). Hence

  dθi/dηi = (dθi/dµi)(dµi/dηi) = 1/( V(µi) g′(µi) )   (4.12)

using Exercise 4.2.

Using (4.10), (4.11), (4.12):

  Wii = −E(∂²ℓ/∂ηi²) = ( 1/( V(µi)g′(µi) ) )² V(µi)/φ = 1/( φV(µi)g′(µi)² ). □

24

Iteratively Reweighted Least Squares

The iteration (4.2) becomes

  β(k+1) = β(k) + I⁻¹ (∂ℓ/∂β)|β(k)
         = β(k) + (XᵀW(k)X)⁻¹ Xᵀu(k)
         = (XᵀW(k)X)⁻¹ XᵀW(k) ( Xβ(k) + (W(k))⁻¹u(k) )

where W(k) and u(k) are evaluated at β(k), and

  [ (W(k))⁻¹u(k) ]i = φV(µ(k)i) g′(µ(k)i)² × (yi − µ(k)i)/( φV(µ(k)i) g′(µ(k)i) ) = g′(µ(k)i)(yi − µ(k)i).

Set z(k)i = [Xβ(k)]i + g′(µ(k)i)(yi − µ(k)i) and z(k) = (z(k)1, . . . , z(k)n)ᵀ, so that

  β(k+1) = (XᵀW(k)X)⁻¹ XᵀW(k) z(k).

25

IRLS Algorithm

The IRLS algorithm for estimation of the MLE for a GLM, via the sequence β(k) → β̂ as k → ∞, is:

1. Start with
• µ(0) = y, so η(0) = g(µ(0)) = g(y)
• z(0) = g(y)
• W(0) = diag( φV(y)g′(y)² )⁻¹.

2. For k = 0, 1, 2, . . . :

  β(k+1) = (XᵀW(k)X)⁻¹ XᵀW(k) z(k)
  η(k+1) = Xβ(k+1),  µ(k+1) = g⁻¹(η(k+1))
  z(k+1) = η(k+1) + g′(µ(k+1))(y − µ(k+1))
  W(k+1) = diag( 1/( φV(µ(k+1))g′(µ(k+1))² ) ).

Note that β(k+1) is the MLE in the weighted regression (Section 2.3 of the course) of z(k) on X.

26
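
A minimal R sketch of this algorithm (our own illustration; glm() is the production implementation). It takes φ = 1 and borrows V, g⁻¹ and dµ/dη from a standard family object; the crude starting value suits count data, while glm() starts more carefully.

  irls <- function(X, y, family = poisson(), tol = 1e-8, maxit = 25) {
    mu   <- y + 0.1                          # mu(0) ~ y, nudged away from 0 so g(mu) is defined
    eta  <- family$linkfun(mu)
    beta <- rep(0, ncol(X))
    for (k in 1:maxit) {
      gprime <- 1 / family$mu.eta(eta)       # g'(mu) = 1 / (dmu/deta)
      z <- eta + gprime * (y - mu)           # working response z(k)
      W <- diag(as.vector(1 / (family$variance(mu) * gprime^2)))
      beta.new <- as.vector(solve(t(X) %*% W %*% X, t(X) %*% W %*% z))
      if (max(abs(beta.new - beta)) < tol) break
      beta <- beta.new
      eta  <- as.vector(X %*% beta)
      mu   <- family$linkinv(eta)
    }
    beta
  }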

IRLS example: Poisson

Recall φ = 1, µ = λ, V(µ) = λ = µ, η = Xβ.

Using the canonical link: g(µ) = log(µ) = η, µ = exp(η), g′(µ) = 1/µ.

We then have that:

  z = η + g′(µ)(y − µ) = η + (y − µ)/µ

  W = diag( 1/( φV(µ)g′(µ)² ) ) = diag(µ1, µ2, . . . , µn).

27
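
Trying the sketch above on simulated log-linear data (simulated, since no data set is given here), IRLS agrees with glm() to numerical precision:

  set.seed(1)
  x <- runif(50)
  X <- cbind(1, x)                        # intercept plus one covariate
  y <- rpois(50, lambda = exp(1 + 2 * x))
  irls(X, y, family = poisson())
  coef(glm(y ~ x, family = poisson))      # same estimates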

4.3.3 Variance of MLEs

Recall that if β̂ is an MLE then, asymptotically as n → ∞,

  (β̂ − β) →ᵈ N(0, I⁻¹).

For a GLM we have seen that I = XᵀWX.

So we have variances var(β̂j) ≈ I⁻¹jj = (XᵀWX)⁻¹jj.

This gives us a test for βj = 0, as the null distribution satisfies

  β̂j/√(I⁻¹jj) →ᵈ N(0, 1) as n → ∞

and also

  β̂j/√(Î⁻¹jj) →ᵈ N(0, 1) as n → ∞

where in the second case Î⁻¹jj is estimated using Î = I(β̂).

This test is a Wald test. Of course we can also use the above to find a confidence interval for βj.

28
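
In R the Wald statistics and the corresponding normal-approximation intervals come straight out of the fitted object; a sketch, reusing the simulated Poisson fit above:

  fit <- glm(y ~ x, family = poisson)
  summary(fit)$coefficients    # columns: Estimate, Std. Error, z value, Pr(>|z|)
  confint.default(fit)         # Wald intervals, beta_j +/- 1.96 * se(beta_j)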

4.4 Binomial and Poisson models

So far we have focused on:
• GLMs in general
  – setup
  – exponential families
  – canonical link function
• Estimation and inference
  – likelihood function, score equations
  – IRLS

We now focus on analysis and interpretation.

29

4.4.1 Binomial model – logistic regression

Challenger data setup:

We have yi = fail[i] and xi = (xi1, xi2)ᵀ = (1, temp[i])ᵀ. So we have an intercept and one other explanatory variable (temperature).

We model the binary outcomes Yi as

  Yi ∼ Bernoulli(πi)

where πi denotes the probability of obtaining yi = fail[i] = 1.

Since E(Yi) = πi we have µi = πi.

The linear predictor is ηi = β1 + β2xi2.

30

Using the canonical link function, which we have previously seen is g(µ) = log( µ/(1 − µ) ) for binary data (i.e. the binomial canonical link with m = 1), we obtain

  πi = exp(β1 + β2xi2) / ( 1 + exp(β1 + β2xi2) ).

This is logistic regression: we are modelling the probability of yi = 1 for a binary response as a logistic function of a linear predictor. The logistic function is e^η/(1 + e^η).

This makes sense: we have a mapping from ηi ∈ R to µi = πi ∈ [0, 1].

31

The log-likelihood is

  ℓ(β) = ∑ᵢ₌₁ⁿ { yiηi − log(1 + exp(ηi)) }

where ηi = β1 + β2xi2.

For the Bernoulli model we also have φ = 1, V(µi) = µi(1 − µi), g′(µ) = 1/( µ(1 − µ) ) and

  Wii = V(µi),  W = diag( µ1(1 − µ1), . . . , µn(1 − µn) )

  zi = [Xβ]i + (yi − µi)/V(µi) = [Xβ]i + (yi − µi)/( µi(1 − µi) ).

So we can run IRLS . . . glm() in R does this and more.

32
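
A sketch of the corresponding call, assuming a data frame challenger with a 0/1 column fail and a column temp in degrees F (these names are ours; the slides do not fix them):

  fit <- glm(fail ~ temp, family = binomial, data = challenger)
  summary(fit)    # fitted by IRLS; logit is the default (canonical) link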

Analysis and prediction

Fitted values: the fitted values of the probability at the observation points xi are

  µ̂i = g⁻¹(xiᵀβ̂) = exp(xiᵀβ̂) / ( 1 + exp(xiᵀβ̂) ).

We can use the function

  π̂(t) = exp(β̂1 + β̂2t) / ( 1 + exp(β̂1 + β̂2t) )

to get the estimated probability of failure as a function of temperature.

33

Consider a 50% probability of failure: for this we'd have

  π = 1/2 ⟺ η = log( π/(1 − π) ) = 0
          ⟺ β1 + β2t = 0
          ⟺ t = −β1/β2

and we would estimate this temperature by −β̂1/β̂2.

34
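
With the hypothetical fit above, this estimate is one line of R:

  b <- coef(fit)
  unname(-b[1] / b[2])    # estimated temperature at which P(failure) = 1/2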

Odds and the Odds Ratio

The quantity π/(1 − π) is the odds of success (here a Bernoulli success corresponds to an O-ring failure).

For logistic regression we have

  odds: O = π/(1 − π) = exp(xᵀβ)

which we can estimate by Ô = exp(xᵀβ̂).

Give an interpretation of β̂j in logistic regression:

• β̂j gives the estimated change in log odds of success when the corresponding explanatory variable xj increases by 1 unit.

35

Let x = (1, x2)ᵀ and x′ = (1, x2 + 1)ᵀ, so that x′ corresponds to a temperature increase of 1 degree F compared to x.

Consider Ô = exp(xᵀβ̂) and Ô′ = exp(x′ᵀβ̂). Then

  Ô′/Ô = exp(β̂2).

• β̂2 = −0.232 is the estimated change in log odds of an O-ring failure when temperature increases by 1 degree F
  – this change is negative, so the log odds of an O-ring failure decrease by this amount per degree F of temperature increase
• the odds of an O-ring failure are multiplied by a factor of exp(β̂2) = 0.793 per degree F of temperature increase.

Note: since η = log( π/(1 − π) ), an increase of x2 by 1 degree has an additive effect on the log odds, and a multiplicative effect on the odds.

36
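
Numerically this is just exponentiation of the fitted coefficients; a sketch, again with the hypothetical Challenger fit (the slides' estimate β̂2 = −0.232 gives exp(−0.232) = 0.793):

  exp(coef(fit))              # multiplicative change in odds per unit increase
  exp(confint.default(fit))   # Wald interval on the odds-ratio scale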

The Challenger data is Bernoulli: each observation corresponds to mi = 1 trials.

For binomial data Yi ∼ Binomial(mi, πi) with mi > 1, we saw that
• µi = miπi
• and the canonical link is ηi = log( µi/(mi − µi) ) = log( πi/(1 − πi) )
  – which is the same function of πi as in the Bernoulli case

and so the same interpretations of regression coefficients as on the previous couple of slides (in terms of odds, log-odds) apply in the mi > 1 case.

37

4.4.2 Poisson model – AIDS deaths

We model the number of deaths in period i as

  Yi ∼ Poisson(µi).

With an intercept xi1 = 1 for all i, and an explanatory variable xi2 denoting the time period, the linear predictor is

  ηi = β1 + β2xi2.

The canonical link is g(µ) = log µ, and with this link we have η = log µ, so µ = e^η.

So our Poisson mean is the quantity µi = exp(β1 + β2xi2).

38

The quantity µ = exp(β1 + β2x2) is the mean number of deaths.

Let x = (1, x2)ᵀ and x′ = (1, x2 + 1)ᵀ, so x′ corresponds to an increase of 1 time period compared to x.

Consider µ̂ = exp(xᵀβ̂) and µ̂′ = exp(x′ᵀβ̂), the estimated mean number of deaths at x and at x′. Then

  µ̂′/µ̂ = exp(β̂2).

We estimated β̂2 = 0.257.
• Interpretation: the estimated number of deaths increases by a multiplicative factor of exp(0.257) = 1.292 for an increase of one time period.
• As this is greater than 1, the fitted model is saying that the number of deaths increases with time.

39
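
A sketch of the corresponding fit, assuming a data frame aids with columns deaths and period (names are ours):

  fit <- glm(deaths ~ period, family = poisson, data = aids)
  exp(coef(fit))    # exp(beta2_hat): factor per period, 1.292 with the slides' estimate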

4.5 Asymptotics, analysis of deviance, model selection

We now look at model selection.
• We have already seen how to test for the significance of a single GLM parameter βj.
• How do we test for the significance of a group of variables? We use the likelihood ratio test (LRT).
  – For regression with the normal linear model we managed to get an exact test by writing the LRT statistic as a monotone function of a statistic F, where F has a distribution we know exactly.
  – For regression with GLMs we stick to the LRT statistic and accept a test based on the asymptotic distribution of this statistic.

40

Example: Tobacco budworm mortality

Example from Venables and Ripley (2002).

12 batches of 20 tobacco budworm moths were exposed for 3 days to different dose levels of a toxin; the numbers dead or disabled were recorded.

  Dose:               1    2    4    8   16   32
  Mortality, Male:    1    4    9   13   18   20
  Mortality, Female:  0    2    6   10   12   16

Categorical variable sex ∈ {M, F}.
Continuous variable dose, coded as ldose = log2(dose), as suggested by the scale of the variable.

We look for an effect due to ldose, and allow for possibly different intercepts and slopes in the linear predictors for the two sexes.

41
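
The table is small enough to enter directly; the sketch below fits the model written out on the next slide (R takes a two-column matrix of successes and failures as the binomial response):

  ldose <- rep(0:5, 2)                  # log2 of doses 1, 2, 4, 8, 16, 32
  dead  <- c(1, 4, 9, 13, 18, 20,       # male
             0, 2, 6, 10, 12, 16)       # female
  sex   <- factor(rep(c("M", "F"), each = 6))   # levels F, M: sexM matches g_i below
  fit <- glm(cbind(dead, 20 - dead) ~ sex * ldose, family = binomial)
  coef(fit)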

The linear predictors are:

  ηi = β1 + β2xi + β3gi + β4gixi

for i = 1, . . . , n, with n = 12,
• where gi = 1 if sex[i] = M, and is zero otherwise
• and xi = ldose[i].

42

The yi are the cell counts (numbers dead or disabled). R will fit for the scaled response yi/mi (where mi = 20 for i = 1, . . . , n here).

Then we have µi = E(Yi/mi) and µi = πi, where πi is the probability of success in the ith binomial trial, and our model is

  Yi ∼ Binomial(mi, πi).

We use the canonical link function, which is the logit:

  log( µi/(1 − µi) ) = ηi.

So we model the proportion µi of "successes" (where a "success" corresponds to a moth death).

43

Odds

The linear predictors are:

  ηi = β1 + β2xi + β3gi + β4gixi.

This means that the (estimated) odds for females and males are:

  Ô(x) = exp(−2.99 + 0.91x)                    (Female)
  Ô(x) = exp(−2.99 + 0.91x + 0.18 + 0.35x)     (Male)

44

Null and Saturated Models

There are two important models, one "above" and one "below" our model.

• The saturated model – this model "unlinks" the θi
  – in our β-model the θi are bound by the constraint g(µi) = xiᵀβ on the means
  – but in the saturated model we remove this constraint and instead we have one parameter θ(s)i for each response yi
  – so the MLE θ̂(s)i in the saturated model is just θ̂(s)i = arg maxθi f(yi; θi).

• The null model – this is the model where β2 = · · · = βp = 0
  – and where we have just an intercept in the linear predictor: g(µi) = β1 for all i
  – so the means are all equal and we have a single common natural parameter θ
  – let θ̂(0) be the MLE for the natural parameter of the null model.

45

• The saturated log-likelihood is the maximised value of the log-likelihood for the saturated model: so it is ℓ(θ̂(s); y), where θ̂(s) = (θ̂(s)1, . . . , θ̂(s)n)ᵀ and y = (y1, . . . , yn)ᵀ.

• The null log-likelihood is the maximised value of the log-likelihood for the null model: so it is ℓ(θ̂(0); y), where θ̂(0) is a scalar and y = (y1, . . . , yn)ᵀ.

As an example we'll calculate the saturated log-likelihood and null log-likelihood for a binomial model.

46

Saturated log-likelihood

Yi ∼ Binomial(mi, πi), i = 1, . . . , n (where mi = 20, n = 12 for the budworm data).

  P(Yi = yi) = (mi choose yi) πi^yi (1 − πi)^(mi−yi)

  ℓ(πi) = yi log πi + (mi − yi) log(1 − πi) + log(mi choose yi)

and solving ℓ′(πi) = 0 gives π̂(s)i = yi/mi.

47

  P(Yi = yi) = exp( yi log( πi/(1 − πi) ) + mi log(1 − πi) + log(mi choose yi) )
             = exp( (yiθi − κ(θi))/φ + c(yi; φ) )

where θi = log( πi/(1 − πi) ), and so

  θ̂(s)i = log( π̂(s)i/(1 − π̂(s)i) ) = log( yi/(mi − yi) )

and κ(θi) = −mi log(1 − πi), and so κ(θ̂(s)i) = −mi log(1 − yi/mi).

  ℓ(θ̂(s); y) = ∑ᵢ₌₁ⁿ { yiθ̂(s)i − κ(θ̂(s)i) } + constant
            = ∑ᵢ₌₁ⁿ { yi log( yi/(mi − yi) ) + mi log( 1 − yi/mi ) } + constant

where constant = ∑ᵢ log(mi choose yi), but this constant is not important because we will be looking at differences of log-likelihoods and the constant will cancel.

48

Null log-likelihood

  ℓ(π) = ∑ᵢ₌₁ⁿ { yi log π + (mi − yi) log(1 − π) } + constant

Solving ℓ′(π) = 0 gives π̂(0) = ȳ/m̄, where ȳ = n⁻¹∑ᵢ yi and m̄ = n⁻¹∑ᵢ mi. Hence θ̂(0) = log( ȳ/(m̄ − ȳ) ) and

  ℓ(θ̂(0); y) = ∑ᵢ₌₁ⁿ { yiθ̂(0) − κ(θ̂(0)) } + constant
             = θ̂(0) ∑ᵢ₌₁ⁿ yi − log( 1 + exp(θ̂(0)) ) ∑ᵢ₌₁ⁿ mi + constant
             = nȳ log( ȳ/(m̄ − ȳ) ) + nm̄ log( 1 − ȳ/m̄ ) + constant.

49

4.5.1 Deviance

The scaled deviance D(y) for our GLM is related to the log-likelihood:

  D(y) = −2ℓ(β̂; y) + 2ℓ(θ̂(s); y).

This is of course the LRT statistic for a test comparing the saturated model with the GLM of interest.

In general D(y) depends on φ, but for e.g. binomial and Poisson models we have φ = 1. From the form of f(y; θ, φ) we can see that in general D(y) is given by some expression divided by φ.

The deviance itself is the scaled deviance evaluated at a scale parameter of φ = 1. So e.g. deviance and scaled deviance are the same for binomial and Poisson models.

(Scaled deviance D(y) = deviance/φ.)

Since the parameter space of θ(s) includes the parameter space of θ(β) as a subspace, we have D(y) ≥ 0.

The null deviance is D(0)(y) = −2ℓ(θ̂(0); y) + 2ℓ(θ̂(s); y).

50

4.5.2 Goodness of fit

Since D is the LRT statistic for a test with null parameter space of dimension p and alternative of dimension n,
• we expect D(Y) ∼ χ²(n − p) approximately,
• under the hypothesis that our GLM includes all the factors generating variation in the response.

So if D(y) is large on the scale of a χ²(n − p) then we question our model.

However, this model check is NOT always applicable – it is NOT suitable for Bernoulli models.

The difference between this case and the standard case of a LRT is that here the dimension of the alternative parameter space depends on n.

51
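
In R this check is one line; a sketch for a binomial-counts model such as the budworm fit above (not for Bernoulli data, as just noted):

  # small p-value: deviance is large relative to chi^2(n - p)
  pchisq(deviance(fit), df = df.residual(fit), lower.tail = FALSE)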

When is D(Y) ≈ χ²(n − p) a good approximation?

We approach the asymptotic distribution for a LRT statistic as our MLEs converge to their limiting values. Since, in the saturated model, we have one parameter for each observation, increasing n doesn't add to the precision of our estimates of these parameters (increasing n does increase the precision of our estimate of β, but that is just half the LRT statistic).

So, when is this distributional assumption good? Suppose we had multiple replicates for each observation: say yij, j = 1, . . . , mi, for given explanatory variables xi and, in the saturated model, just the one parameter θ(s)i for each batch of mi replicates. Now θ̂(s)i → θ(s)i as mi → ∞. This is the limit where D(Y) ≈ χ²(n − p) is a good approximation.

An example of this good situation is Yi ∼ Binomial(mi, πi) with mi → ∞ for every i, because this is equivalent to mi replicates of a Bernoulli(πi).

A situation where the χ²(n − p) approximation for D(Y) can't be used is where we have just one replicate (i.e. mi = 1) at each xi, i.e. when we have data that are genuinely binary/Bernoulli.

52

4.5.3 Model choice

The likelihood ratio test for comparing two nested models,
• model Q with dimension q
• and model P with dimension p, where p < q,

has LRT statistic Λ = D(P)(y) − D(Q)(y), and we can use the approximation

  D(P)(Y) − D(Q)(Y) ∼ χ²(q − p).

This approximation is OK even with e.g. Bernoulli data, as it does not involve the saturated log-likelihood (which was the quantity that could cause problems in the previous section – note that the saturated log-likelihood cancels when we take the difference of two deviances).

Assuming that β1 is an intercept, the test for no relation between the mean response µi and the other explanatory variables in model Q is the test for β2 = · · · = βq = 0 and has test statistic

  D(0)(Y) − D(Q)(Y) ∼ χ²(q − 1).

53
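
In R, anova() on two nested fits computes exactly this difference of deviances; a sketch reusing the budworm fit from earlier (fit0 drops sex, so q − p = 2 here):

  fit0 <- glm(cbind(dead, 20 - dead) ~ ldose, family = binomial)   # model P
  anova(fit0, fit, test = "Chisq")   # Lambda = D_P(y) - D_Q(y) vs chi^2(q - p)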

Some notes

• All remarks about model choice for LMs apply here.
• We want to include all important variables and no redundant variables.
• Tools available are as for LMs:
  – forwards selection
  – backwards elimination
  – global search with AIC.

54


4.6 GLM diagnostics

The diagnostic methods for GLMs are similar to those used for normal LMs.

55

Pearson residuals

The raw residuals for a normal linear model are yi − µ̂i (where µ̂i = ŷi).

We can consider something similar for a GLM – but in a general GLM the variance of the response varies with the mean:

  E(Yi) = µi,  var(Yi) = φV(µi).

We take this into account: the Pearson residuals are defined by

  rPi = (yi − µ̂i)/√V(µ̂i).

Note that ∑ rPi² is a Pearson goodness-of-fit statistic of the form ∑(obs − exp)²/var.

56

Deviance residuals

These are defined so that ∑ rDi² = deviance = ∑ di.

Let

  ℓi(θ; yi) = (yiθ − κ(θ))/φ + c(yi; φ)

and let

  di = −2ℓi(β̂; yi) + 2ℓi(θ̂(s)i; yi)

so that the deviance of the model is

  D(y) = ∑ᵢ₌₁ⁿ di.

A poorly fitting observation will make a large contribution to the deviance. We have di ≥ 0, and larger di correspond to observations in greater conflict with the rest of the data. The deviance residuals are defined to be

  rDi = sign(yi − µ̂i) √di

where sign(yi − µ̂i) is +1 if yi − µ̂i > 0, and −1 if yi − µ̂i < 0.

57
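
Both kinds are available directly from a fitted glm object:

  residuals(fit, type = "pearson")    # r_P
  residuals(fit, type = "deviance")   # r_D (the default for glm fits)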

Example: binomial

Yi ∼ Binomial(mi, πi). The Pearson residuals are

  rPi = (yi − miπ̂i) / √( miπ̂i(1 − π̂i) )

where the π̂i are the estimated probabilities under the model with parameter estimate β̂, i.e. π̂i = exp(xiᵀβ̂)/( 1 + exp(xiᵀβ̂) ).

58

For the deviance residuals we need ℓi(β̂; yi) and ℓi(θ̂(s)i; yi). For the binomial we know the saturated log-likelihood is

  ℓi(θ̂(s)i; yi) = yi log( yi/(mi − yi) ) + mi log( (mi − yi)/mi ) + constant

and similarly

  ℓi(β̂; yi) = yi log( µ̂i/(mi − µ̂i) ) + mi log( (mi − µ̂i)/mi ) + constant

where µ̂i = miπ̂i and π̂i = exp(xiᵀβ̂)/( 1 + exp(xiᵀβ̂) ).

So

  di = −2ℓi(β̂; yi) + 2ℓi(θ̂(s)i; yi)
     = 2{ yi log( yi(mi − µ̂i) / ( µ̂i(mi − yi) ) ) + mi log( (mi − yi)/(mi − µ̂i) ) }

and

  rDi = sign(yi − µ̂i) √di.

59

Leverage

Analogous to LMs, we can define a matrix H with the diagonal entries hii being the leverage components.

For a GLM we have

  H = W^(1/2) X (XᵀWX)⁻¹ Xᵀ W^(1/2).

This is the hat matrix that we reach in the (weighted) linear regression done in the IRLS algorithm.

As usual, a large value of hii indicates that the fit may be sensitive to observation i. But note that a difference for GLMs compared to LMs is that the leverages here are not only a function of X: they also depend on y via the weights W.

We have ∑ᵢ₌₁ⁿ hii = p, so the average leverage value is p/n and, as a guide, we can consider a value of hii as being large if it is greater than 2p/n. (Or look at hii/(p/n) and compare to 2.)

60

Standardised Residuals

Both the Pearson and deviance residuals can be standardised:

  r′Pi = rPi / √( φ(1 − hii) ),   r′Di = rDi / √( φ(1 − hii) )

(with φ replaced by an estimate φ̂ if it is unknown). These residuals should now be approximately standard normal for Poisson and binomial models with large counts.

So most of them should lie between −2 and +2 (i.e. most of them, not all of them).

For Bernoulli and binomial models with small counts we cannot expect the residuals to be approximately normal, though they should still have approximately unit variance.

61

Influence

A point with high leverage need not do much damage if it has low influence. But highly influential points do shift the fitted surface – as for LMs.

For an LM, if φ = σ² were known, our previous expression for Cook's distance would correspond to

  Ci = (ŷ − ŷ₋i)ᵀ(ŷ − ŷ₋i) / (pσ²) = (β̂ − β̂₋i)ᵀ(XᵀX)(β̂ − β̂₋i) / (pφ).

For a GLM, the Cook's distance for the ith data point is

  Ci = (β̂ − β̂₋i)ᵀ(XᵀWX)(β̂ − β̂₋i) / (pφ).

It is defined in the space of the linear predictor. As a guide, points with Cook's distance Ci > 8/(n − 2p) can be thought of as having high influence, potentially being outliers, and may be removed from the analysis.

62
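
R exposes all of these diagnostics for glm fits; a sketch (rstandard() returns the standardised deviance residuals discussed on the previous slides):

  hatvalues(fit)        # leverages h_ii from the final IRLS weighted regression
  rstandard(fit)        # standardised (deviance) residuals
  cooks.distance(fit)   # Cook's distances C_i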

Which residuals?

Standardised Pearson and deviance residuals are on a fixed scale, so they can potentially identify misfitting observations easily.

In many cases Pearson and deviance residuals are very similar. Deviance residuals tend to be preferred as they tend to have less skewed distributions and often appear slightly closer to a standard normal distribution.

63

4.7 Dispersion and the scale parameter

For the binomial and Poisson distributions the scale, or dispersion, parameter is φ = 1.

For other distributions, such as the gamma and normal, we may have φ ≠ 1, and φ may be unknown.

The IRLS algorithm equation doesn't need φ: φ cancels out in each step of the algorithm. So we can fit our GLM at φ = 1 and then look for an estimate of φ based on the fitted value β̂.

Since var(Yi) = φV(µi) we can form the estimate

  φ̂ = ( 1/(n − p) ) ∑ᵢ₌₁ⁿ (yi − µ̂i)²/V(µ̂i)

with µ̂i = g⁻¹(xiᵀβ̂).

64
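
This is a Pearson-based moment estimator; in R it can be computed by hand from any fitted glm object fit, and it is what the quasi-families report as the dispersion. A sketch:

  phi.hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
  summary(fit, dispersion = phi.hat)   # SEs rescaled by sqrt(phi.hat), as on the next slide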

Why bother?

Although φ does not appear in IRLS for the estimation of β̂j, it is needed for the calculation of

  var(β̂j) = (XᵀWX)⁻¹jj = φ ( Xᵀ diag( 1/( V(µ)g′(µ)² ) ) X )⁻¹jj.   (4.13)

If we assume φ = 1 when in fact φ > 1:
• the variances of the β̂j will be too small
• CIs will be too narrow
• p-values will be too optimistic.

From (4.13) we could adjust by multiplying estimated variances by a factor of φ̂ (so multiplying standard errors by φ̂^(1/2)).

65