STK3100/4100 Autumn 2014
• Introduction to Generalized Linear Models (GLM) and mixed models
• Teacher: Magne Aldrin
• professor II at UiO
• main position at Norsk Regnesentral (Norwegian Computing Center)
• Responsible for exercises: Tonje Gulbrandsen Lien
• Slides based on previous presentations of Sven Ove Samuelsen and Geir Storvik
GLM and MM – p. 1
Plan for the day
1. Introduction, literature, computer program
2. Examples
3. Informal definition of GLM
4. Mixed models
5. Plan for the course
Generalized Linear Models (GLM)
• Extension of multiple regression and ANOVA
• An important class of models
• A common framework for regression analysis of
continuous, binary (binomial), categorical (multinomial) or
count response variables
• Includes ordinary least squares regression (OLS), logistic
regression and Poisson regression
Mixed models/random effect models
• Some regression coefficients are random
• Can account for correlations within groups of observations
• Can be combined with GLM
• Active research field, still not fully developed
Goals
• Introduction to GLM
• learn to use these models to analyse practical problems
• know the mathematical background for the analyses
• Knowledge of mixed models
• learn to use these models to analyse simple practical problems
• knowledge of approximations and challenges when using such models
The course will have both a practical and a theoretical perspective, with examples from medicine, biology, social sciences, economics and insurance
Literature
Main textbook: Generalized Linear Models for Insurance Data by Piet de Jong and Gillian Z. Heller.
• available at Akademika
• homepage: www.actuary.mq.edu.au/research/books/GLMsforInsuranceData
Additional textbook: Mixed Effects Models and Extensions in Ecology with R by Alain Zuur et al.
• e-book, can be downloaded from the internet
• only selected chapters
R statistical package
• We will use the R package for computing
• Can be downloaded for free from http://mirrors.sunsite.dk/cran/
• Can be used under the Windows, Mac and Linux operating systems
• We will mainly use routines that are already programmed in R; not much programming of your own
• R homepage: http://www.r-project.org/
Data example 1: Birth weight and gestational age
        Boys                                Girls
Duration (weeks)  Birth weight (g)   Duration (weeks)  Birth weight (g)
       40              2968                 40              3317
       38              2795                 36              2729
       40              3163                 40              2935
       35              2925                 38              2754
       36              2625                 42              3210
       37              2847                 39              2817
       41              3292                 40              3126
       40              3473                 37              2539
       37              2628                 36              2412
       38              3176                 38              2991
       40              3421                 39              2875
       38              2975                 40              3231
Av.    38.33           3024.00              38.75           2911.33
Interested in studying how birth weight depends on gestational
age and gender
Scatter plot for Ex. 1
[Scatter plot: birth weight (fødselsvekt, g) vs gestational age (svangerskapslengde, weeks); + = girls (Jenter), o = boys (Gutter)]
Typical model for Ex. 1: Linear regression
One response variable and two explanatory variables:
Yjk = birth weight for baby no. k of gender no. j
xjk = gestational age for baby no. k of gender no. j
for k = 1, ..., 12 and j = 1, 2 (j = 1 means boy and j = 2 girl)
Assumed model:
Yjk = αj + βxjk + εjk
where εjk ∼ N(0, σ²), i.e. normally distributed with mean 0 and common variance σ², and also independent
β = slope, regression coefficient
αj = intercept for gender j
Least squares estimates
[Scatter plot: birth weight (g) vs gestational age (weeks), with fitted regression lines for boys (Gutter) and girls (Jenter)]
Estimates: α1 = −1610, α2 = −1773, β = 121
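These estimates can be reproduced directly from the data table. Below is a minimal sketch in Python (the course itself uses R); with a common slope, the OLS estimates have the closed form β̂ = pooled within-gender Sxy/Sxx and α̂j = ȳj − β̂ x̄j:

```python
# Least-squares fit of the common-slope model Y_jk = alpha_j + beta * x_jk
# for the birth-weight data: (gestational age in weeks, weight in grams).
boys  = [(40, 2968), (38, 2795), (40, 3163), (35, 2925), (36, 2625), (37, 2847),
         (41, 3292), (40, 3473), (37, 2628), (38, 3176), (40, 3421), (38, 2975)]
girls = [(40, 3317), (36, 2729), (40, 2935), (38, 2754), (42, 3210), (39, 2817),
         (40, 3126), (37, 2539), (36, 2412), (38, 2991), (39, 2875), (40, 3231)]

def means(data):
    xs, ys = zip(*data)
    return sum(xs) / len(xs), sum(ys) / len(ys)

# Pooled within-gender sums of squares and cross-products give the slope.
sxy = sxx = 0.0
for data in (boys, girls):
    xbar, ybar = means(data)
    sxy += sum((x - xbar) * (y - ybar) for x, y in data)
    sxx += sum((x - xbar) ** 2 for x, y in data)

beta = sxy / sxx                               # common slope
a1 = means(boys)[1]  - beta * means(boys)[0]   # intercept, boys
a2 = means(girls)[1] - beta * means(girls)[0]  # intercept, girls
print(round(a1), round(a2), round(beta, 1))    # -1610 -1773 120.9
```

This matches the slide's estimates up to rounding.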
Equivalent model formulation
• Linearity: E[Yjk] = µjk = αj + βxjk
• Constant variance: Var[Yjk] = σ²
• Normality: Yjk ∼ N(µjk, σ²)
• Independent responses: the Yjk are independent
Extensions in STK3100/4100:
• Linearity after transformation of µ by a link function g():
  g(µjk) = αj + βxjk ⇔ E[Yjk] = µjk = g⁻¹(αj + βxjk)
• The variance depends on the expectation
• Other distributions: binomial, Poisson, gamma, ...
• Including random effects (mixed models) to account for dependencies
Data Ex. 2: Deadly dose of poison for beetles
About 60 beetles were exposed to each of 8 different concentrations of CS2, and the number killed at each concentration was recorded.

Dose (log10 CS2 mg/l)   Number of beetles   Number dead
1.6907                  59                  6
1.7242                  60                  13
1.7552                  62                  18
1.7842                  56                  28
1.8113                  63                  52
1.8369                  59                  53
1.8610                  62                  61
1.8839                  60                  60
Want to study how mortality depends on dose
Ex. 2: Proportion of dead beetles vs dose
[Plot: proportion of dead beetles (andel døde biller) vs dose (log_10)]
Reasonable model for Ex. 2
Yi = number of dead beetles at dose xi, assumed binomially distributed
Yi ∼ bin(ni, πi)
where πi = probability that a beetle dies at dose xi and
ni = number of beetles treated with dose xi
A linear model for πi estimated by ordinary least squares (OLS) is problematic because
• 0 ≤ πi ≤ 1, which cannot be guaranteed by a linear expression α + βxi
• Var(Yi) = niπi(1 − πi), non-constant (heteroscedastic) variance
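The first point can be checked numerically: an OLS line fitted to the observed proportions predicts values outside [0, 1] just beyond the observed dose range. A small Python sketch (the evaluation doses 1.65 and 1.90 are chosen here purely for illustration):

```python
# OLS fit of observed death proportions on dose for the beetle data,
# showing that a linear model for a probability leaves [0, 1].
dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
n    = [59, 60, 62, 56, 63, 59, 62, 60]
dead = [6, 13, 18, 28, 52, 53, 61, 60]
prop = [d / m for d, m in zip(dead, n)]

xbar = sum(dose) / len(dose)
ybar = sum(prop) / len(prop)
b = sum((x - xbar) * (p - ybar) for x, p in zip(dose, prop)) / \
    sum((x - xbar) ** 2 for x in dose)
a = ybar - b * xbar

# Predictions slightly outside the observed dose range escape [0, 1]:
print(a + b * 1.65)   # negative "probability"
print(a + b * 1.90)   # "probability" above 1
```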
Usual solution for Ex. 2: Logistic regression
Logistic regression model:
πi = exp(α + βxi) / (1 + exp(α + βxi))
Then 0 ≤ πi ≤ 1
Fit or estimate the model by Maximum Likelihood (ML).
• Takes into account that the responses are binomially distributed
• Estimates are efficient if we have enough data
(better estimation methods may exist when the number of observations is small)
Logistic regression for Ex. 2
MLE: α = −60.72, β = 34.27
Predicted probabilities: π = exp(α + βx) / (1 + exp(α + βx))

[Plot: fitted logistic curve and observed proportions of dead beetles vs dose (log_10)]
Estimating parameters in logistic regression
Storvik: "Numerical optimization of likelihoods: Additional literature for STK2120" gives a Newton-Raphson routine in R to fit logistic regression to these data.
But this is already implemented in R. Use the command
glm(cbind(dead,tot-dead)~Dose,data=beetle,family=binomial)
• glm = Generalised linear model
• family=binomial indicates that we have binary or binomial response data
• cbind(dead,tot-dead) is an n x 2 matrix with no. successes and no. failures in the two columns
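As a sketch of the Newton-Raphson iteration behind glm, the fit can be written out in a few lines of pure Python (here the starting values come from an OLS fit of the empirical logits, a common choice; glm's own initialization differs slightly):

```python
# Newton-Raphson fit of the logistic regression for the beetle data,
# reproducing the ML estimates alpha = -60.72, beta = 34.27.
import math

dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
n    = [59, 60, 62, 56, 63, 59, 62, 60]
dead = [6, 13, 18, 28, 52, 53, 61, 60]

# Starting values: OLS on the empirical logits (with 0.5 correction).
z = [math.log((y + 0.5) / (m - y + 0.5)) for y, m in zip(dead, n)]
xbar = sum(dose) / 8; zbar = sum(z) / 8
b = sum((x - xbar) * (v - zbar) for x, v in zip(dose, z)) / \
    sum((x - xbar) ** 2 for x in dose)
a = zbar - b * xbar

for _ in range(25):                     # Newton-Raphson iterations
    p  = [1 / (1 + math.exp(-(a + b * x))) for x in dose]
    u0 = sum(y - m * pi for y, m, pi in zip(dead, n, p))            # score
    u1 = sum(x * (y - m * pi) for x, y, m, pi in zip(dose, dead, n, p))
    w  = [m * pi * (1 - pi) for m, pi in zip(n, p)]                 # weights
    h00 = sum(w); h01 = sum(wi * x for wi, x in zip(w, dose))
    h11 = sum(wi * x * x for wi, x in zip(w, dose))
    det = h00 * h11 - h01 * h01
    a += ( h11 * u0 - h01 * u1) / det   # solve (X'WX) step = score
    b += (-h01 * u0 + h00 * u1) / det

print(round(a, 2), round(b, 2))         # -60.72 34.27
```

The converged values agree with the glm output shown later.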
Data Ex. 3: Number of children among pregnant
women
de Jong & Heller: Data for no. previous children among 141
pregnant women of various ages.
The number of children tends to increase with age (as expected)
[Plots: number of children (antall barn) vs age (alder), left; average number of children (gjennomsnittlig antall barn) vs age, right]
Data Ex. 3b: Number of car damages
de Jong & Heller: Data for reported car damages for 65535
policies
Explanatory variables:
• Value of car
• Age of car
• Type of car
• Gender of driver
• Age of driver
The response variable is a count variable in both examples, perhaps Poisson distributed
In Ex. 3: Yi = No. previous children for mother no. i
it can be reasonable to assume that Yi is Poisson distributed with expectation µi,
where µi depends on xi = mother's age
Similar to Ex. 2:
• Expectation µi > 0
• Variance of Yi equal to µi, i.e. non-constant variance
Usual solution: Poisson regression
Yi ∼ Po(µi) where µi = exp(α + βxi)
This is also a GLM and can be fitted by the glm function in R
Must specify that the response data are Poisson distributed by family=poisson
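The same Newton-Raphson scheme works here. Since the individual ages are not reproduced on the slide, the sketch below uses small made-up data and instead checks a general property of the fit: with an intercept and log link, the score equations force the fitted expectations to satisfy Σµ̂i = Σyi.

```python
# Poisson regression Y_i ~ Po(mu_i), mu_i = exp(a + b*x_i), fitted by
# Newton-Raphson on hypothetical (age, no. of children) data.
import math

x = [20, 23, 26, 29, 32, 35, 38, 41]   # hypothetical ages
y = [0, 0, 1, 1, 2, 1, 3, 4]           # hypothetical no. of children

a, b = 0.0, 0.0
for _ in range(50):
    mu = [math.exp(a + b * xi) for xi in x]
    u0 = sum(yi - mi for yi, mi in zip(y, mu))                    # score
    u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
    h00 = sum(mu); h01 = sum(mi * xi for mi, xi in zip(mu, x))
    h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
    det = h00 * h11 - h01 * h01
    a += ( h11 * u0 - h01 * u1) / det
    b += (-h01 * u0 + h00 * u1) / det

mu = [math.exp(a + b * xi) for xi in x]
# Slope positive for these data; fitted totals match observed totals.
print(b > 0, abs(sum(mu) - sum(y)) < 1e-6)
```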
Poisson regression for Ex. 3
MLE for (α, β): (α, β) = (−4.0895, 0.1129)
Gives fitted expectations µi = exp(α + βxi)

[Plot: expected number of children (forventet antall barn) vs age (alder); o = observed in 5-year age groups, curve = fitted with glm]
Definition of GLM
Independent responses: Y1, Y2, . . . , Yn conditioned on explanatory variables
Vectors of explanatory variables x1, x2, . . . , xn
where xi = (xi1, xi2, . . . , xip) are p-dimensional
A GLM = Generalized Linear Model is defined by
• Y1, Y2, . . . , Yn come from the same class of distributions in the exponential family
(The exponential family will be defined later. It includes the normal, binomial, Poisson and gamma distributions)
• Linear predictors ηi = β0 + β1xi1 + · · · + βpxip
• Link function g(): µi = E[Yi] is coupled to the linear predictor by g(µi) = ηi
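As a small preview of the exponential-family form (defined properly later in the course; sign conventions vary between texts), the Poisson pmf µ^y e^(−µ)/y! can be rewritten as exp(yθ − b(θ) − log y!) with θ = log µ and b(θ) = e^θ. A quick numerical check:

```python
# Numerical check that the Poisson pmf equals its exponential-family form.
import math

def pois_direct(y, mu):
    # Standard form: mu^y * exp(-mu) / y!
    return mu ** y * math.exp(-mu) / math.factorial(y)

def pois_expfam(y, mu):
    # Exponential-family form: exp(y*theta - b(theta) - log(y!)),
    # with natural parameter theta = log(mu) and b(theta) = exp(theta).
    theta = math.log(mu)
    return math.exp(y * theta - math.exp(theta) - math.lgamma(y + 1))

for y in range(6):
    assert abs(pois_direct(y, 2.5) - pois_expfam(y, 2.5)) < 1e-12
print("ok")
```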
The linear regression model is a GLM
• Responses (Yi-s) are normally distributed
• Linear predictor ηi = β0 + β1xi1 + · · · + βpxip
• E[Yi] = µi = ηi, i.e. the link function g(µi) = µi is the identity function
The R commands lm for linear regression and glm do essentially the same, but with slightly different output
Linear regression is the default specification of glm
Ex 1: Birth weights
> lm(vekt~sex+svlengde)
Call:
lm(formula = vekt ~ sex + svlengde)
Coefficients:
(Intercept)         sex    svlengde
    -1447.2      -163.0       120.9
> glm(vekt~sex+svlengde)
Call: glm(formula = vekt ~ sex + svlengde)
Coefficients:
(Intercept)         sex    svlengde
    -1447.2      -163.0       120.9
Degrees of Freedom: 23 Total (i.e. Null); 21 Residual
Null Deviance: 1830000
Residual Deviance: 658800    AIC: 321.4
The logistic regression model is a GLM
• Responses (Yi-s) are binomially distributed Bin(ni, πi)
• Linear predictor ηi = β0 + β1xi1 + · · · + βpxip
• E[Yi]/ni = πi = exp(ηi) / (1 + exp(ηi)).
  Gives the link function g(πi) = log(πi / (1 − πi)) = ηi
g(π) = log(π / (1 − π)) = logit(π) is called the logit function
Logistic regression in R
> glmfit = glm(cbind(dead,tot-dead)~dose,data=beetle,
               family=binomial)
> print(glmfit)
Call: glm(formula = cbind(dead, tot - dead) ~ dose, family = binomial,
    data = beetle)
Coefficients:
(Intercept) dose
-60.72 34.27
Degrees of Freedom: 7 Total (i.e. Null); 6 Residual
Null Deviance: 284.2
Residual Deviance: 11.23 AIC: 41.43
The Poisson regression model is a GLM
• Responses Yi ∼ Po(µi)
• Linear predictor ηi = β0 + β1xi1 + · · · + βpxip
• E[Yi] = µi = exp(ηi), i.e. the link function g(µi) = log(µi) is the (natural) logarithm
> glm(children~age,family=poisson)
Call: glm(formula = children ~ age, family = poisson)
Coefficients:
(Intercept) age
-4.0895 0.1129
Degrees of Freedom: 140 Total (i.e. Null); 139 Residual
Null Deviance: 194.4
Residual Deviance: 165 AIC: 290
Ex. 4
Weights of 30 rats measured weekly over 5 weeks

[Plot: weight vs days for the 30 rats]
Ordinary linear model
Response Yi,j is the weight of rat i in week j.
Individual differences in level can be handled by one intercept per rat.
Possible model:
Yi,j = αi + β ∗ xj + εi,j,  εi,j ∼ N(0, σ²)
where xj is the number of days. Can estimate α1, ..., α30, β, σ² by ordinary linear regression
Ex. 4 cont.
The 30 rats are a sample from a population and we are interested in the whole population.
We therefore assume a distribution for αi over all rats in the population.
Specifically, we assume αi ∼ N(α, σa²),
where α and σa are parameters
This is an example of a mixed model
This mixed model can alternatively be formulated as
Yi,j = α + ai + β ∗ xj + εi,j,  εi,j ∼ N(0, σ²)
where ai ∼ N(0, σa²).
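A small simulation sketch (parameter values made up for illustration) of why the random intercept creates within-rat correlation: two measurements on the same rat share the same ai, which gives intraclass correlation σa²/(σa² + σ²).

```python
# Simulate the random-intercept model Y_ij = alpha + a_i + beta*x_j + eps_ij
# and compare the empirical within-group correlation to theory.
import random, math

random.seed(1)
alpha, beta = 100.0, 6.0
sigma_a, sigma = 2.0, 1.0          # between-group and within-group sd

pairs = []
for _ in range(20000):             # many "rats", two occasions each
    a_i = random.gauss(0, sigma_a) # shared random intercept
    y1 = alpha + a_i + beta * 1 + random.gauss(0, sigma)
    y2 = alpha + a_i + beta * 2 + random.gauss(0, sigma)
    pairs.append((y1, y2))

# Empirical correlation between the two occasions
m1 = sum(p[0] for p in pairs) / len(pairs)
m2 = sum(p[1] for p in pairs) / len(pairs)
cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / len(pairs)
v1 = sum((p[0] - m1) ** 2 for p in pairs) / len(pairs)
v2 = sum((p[1] - m2) ** 2 for p in pairs) / len(pairs)
r = cov / math.sqrt(v1 * v2)

theory = sigma_a ** 2 / (sigma_a ** 2 + sigma ** 2)   # = 0.8 here
print(round(theory, 2), abs(r - theory) < 0.03)
```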
Ex 4: Estimation in R
lme(y~x,random=~1|id,data=d)
Linear mixed-effects model fit by REML
Data: d
AIC BIC logLik
1145.302 1157.290 -568.6508
Random effects:
Formula: ~1 | id
(Intercept) Residual
StdDev: 14.03351 8.203811
Fixed effects: y ˜ x
Value Std.Error DF t-value p-value
(Intercept) 106.56762 3.0379720 119 35.07854 0
x 6.18571 0.0676639 119 91.41824 0
Correlation:
(Intr)
x -0.49
Some extensions
Other GLM-s:
• Count data with Var(Y) > E(Y) (overdispersion): negative binomial distribution
• Continuous, non-normal response: gamma or inverse Gaussian distributions
Extensions of GLM:
• Multinomial responses (STK3100)
• Mixed models (STK3100, STK4070)
• Dependent responses (STK3100, STK4060/STK4150)
• Survival data (STK4080)
• Generalized Additive Models (GAM) (STK4030)
Overview of the book of de Jong & Heller
• Ch. 1: Introduction, data examples
• Ch. 2: Various distributions (most of these should be
known)
• Ch. 3: Exponential family, ML estimation
• Ch. 4: Linear modelling (mostly known from
STK1110/STK2120)
• Ch. 5: GLM
• Ch. 6: Count data (Poisson regression, overdispersion)
• Ch. 7: Categorical responses (binomial and multinomial)
• Ch. 8: Continuous responses
Ch. 1, 2 and 4 will not be taught in detail. Read them on your own!
Overview of the book of Zuur et al.
• Ch. 5: Linear mixed models
• Ch. 8: Exponential family (supplement to de Jong &
Heller)
• Ch. 13: GLM and mixed models
• Perhaps other chapters
Plan for the course
de Jong & Heller
• Will mainly follow chapters 3, 5, 6, 7 and 8 of the book
Zuur et al.
• will mostly look at models and examples
The lecture slides will be published on the home page of this course, together with an overview of planned lectures