STK3100/4100 Autumn 2014
• Introduction to Generalized Linear Models (GLM) and mixed models
• Teacher: Magne Aldrin
• professor II at UiO
• main position at Norsk Regnesentral (Norwegian Computing Center)
• Responsible for exercises: Tonje Gulbrandsen Lien
• Slides based on previous presentations of Sven Ove Samuelsen and Geir Storvik
GLM and MM – p. 1
Plan for the day
1. Introduction, literature, computer program
2. Examples
3. Informal definition of GLM
4. Mixed models
5. Plan for the course
Generalized Linear Models (GLM)
• Extension of multiple regression and ANOVA
• An important class of models
• A common framework for regression analysis of
continuous, binary (binomial), categorical (multinomial) or
count response variables
• Includes ordinary least squares regression (OLS), logistic
regression and Poisson regression
Mixed models/random effect models
• Some regression coefficients are random
• Can account for correlations within groups of observations
• Can be combined with GLM
• Active research field, still not fully developed
Goals
• Introduction to GLM
• learn to use these models to analyse practical problems
• know the mathematical background for the analyses
• Knowledge of mixed models
• learn to use these models to analyse simple practical problems
• knowledge of approximations and challenges when using such models
The course will have both a practical and a theoretical perspective, with examples from medicine, biology, social sciences, economics and insurance
Literature
Main textbook: Generalized Linear Models for Insurance Data by Piet de Jong and Gillian Z. Heller.
• available at Akademika
• homepage: www.actuary.mq.edu.au/research/books/GLMsforInsuranceData
Additional textbook: Mixed Effects Models and Extensions in Ecology with R by Alain Zuur et al.
• e-book, can be downloaded from the internet
• only selected chapters
R statistical package
• We will use the R package for computing
• Can be downloaded for free from http://mirrors.sunsite.dk/cran/
• Can be used under the Windows, Mac and Linux operating systems
• We will mainly use routines that are already programmed in R; not much programming of your own
• R homepage: http://www.r-project.org/
Data example 1: Birth weight and gestational age
        Boys                                Girls
Duration (weeks)  Birth weight (g)   Duration (weeks)  Birth weight (g)
       40              2968                 40              3317
       38              2795                 36              2729
       40              3163                 40              2935
       35              2925                 38              2754
       36              2625                 42              3210
       37              2847                 39              2817
       41              3292                 40              3126
       40              3473                 37              2539
       37              2628                 36              2412
       38              3176                 38              2991
       40              3421                 39              2875
       38              2975                 40              3231
Av.    38.33           3024.00              38.75           2911.33
Interested in studying how birth weight depends on gestational
age and gender
Scatter plot for Ex. 1
[Scatter plot: birth weight (fødselsvekt, g) vs gestational age (svangerskapslengde, weeks); + = girls (Jenter), o = boys (Gutter)]
Typical model for Ex. 1: Linear regression
One response variable and two explanatory variables:
Yjk = birth weight for baby no. k of gender no. j
xjk = gestational age for baby no. k of gender no. j
for k = 1, ..., 12 and j = 1, 2 (j = 1 means boy and j = 2 girl)
Assumed model:
Yjk = αj + βxjk + εjk
where εjk ∼ N(0, σ²), i.e. normally distributed with mean 0 and common variance σ², and also independent
β = slope, regression coefficient
αj = intercept for gender j
Least squares estimates
[Scatter plot: birth weight (g) vs gestational age (weeks), with fitted regression lines for boys (Gutter) and girls (Jenter)]
Estimates: α1 = −1610, α2 = −1773, β = 121
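These estimates can be reproduced directly from the data table. Below is a minimal sketch in Python (the course itself uses R); with a common slope, the OLS estimates have the closed form β̂ = pooled within-gender Sxy/Sxx and α̂j = ȳj − β̂ x̄j:

```python
# Least-squares fit of the common-slope model Y_jk = alpha_j + beta * x_jk
# for the birth-weight data: (gestational age in weeks, weight in grams).
boys  = [(40, 2968), (38, 2795), (40, 3163), (35, 2925), (36, 2625), (37, 2847),
         (41, 3292), (40, 3473), (37, 2628), (38, 3176), (40, 3421), (38, 2975)]
girls = [(40, 3317), (36, 2729), (40, 2935), (38, 2754), (42, 3210), (39, 2817),
         (40, 3126), (37, 2539), (36, 2412), (38, 2991), (39, 2875), (40, 3231)]

def means(data):
    xs, ys = zip(*data)
    return sum(xs) / len(xs), sum(ys) / len(ys)

# Pooled within-gender sums of squares and cross-products give the slope.
sxy = sxx = 0.0
for data in (boys, girls):
    xbar, ybar = means(data)
    sxy += sum((x - xbar) * (y - ybar) for x, y in data)
    sxx += sum((x - xbar) ** 2 for x, y in data)

beta = sxy / sxx                               # common slope
a1 = means(boys)[1]  - beta * means(boys)[0]   # intercept, boys
a2 = means(girls)[1] - beta * means(girls)[0]  # intercept, girls
print(round(a1), round(a2), round(beta, 1))    # -1610 -1773 120.9
```

This matches the slide's estimates up to rounding.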
Equivalent model formulation
• Linearity: E[Yjk] = µjk = αj + βxjk
• Constant variance: Var[Yjk] = σ²
• Normality: Yjk ∼ N(µjk, σ²)
• Independent responses: the Yjk are independent
Extensions in STK3100/4100:
• Linearity after transformation of µ by a link function g():
  g(µjk) = αj + βxjk ⇔ E[Yjk] = µjk = g⁻¹(αj + βxjk)
• The variance depends on the expectation
• Other distributions: binomial, Poisson, gamma, ...
• Including random effects (mixed models) to account for dependencies
Data Ex. 2: Deadly dose of poison for beetles
About 60 beetles were exposed to each of 8 different concentrations of CS2, and the number killed at each concentration was recorded.

Dose (log10 CS2 mg/l)   Number of beetles   Number dead
1.6907                  59                  6
1.7242                  60                  13
1.7552                  62                  18
1.7842                  56                  28
1.8113                  63                  52
1.8369                  59                  53
1.8610                  62                  61
1.8839                  60                  60
Want to study how mortality depends on dose
Ex. 2: Proportion of dead beetles vs dose
[Plot: proportion of dead beetles (andel døde biller) vs dose (log_10)]
Reasonable model for Ex. 2
Yi = number of dead beetles at dose xi, assumed binomially distributed
Yi ∼ bin(ni, πi)
where πi = probability that a beetle dies at dose xi and
ni = number of beetles treated with dose xi
A linear model for πi estimated by ordinary least squares (OLS) is problematic because
• 0 ≤ πi ≤ 1, which cannot be guaranteed by a linear expression α + βxi
• Var(Yi) = niπi(1 − πi), non-constant (heteroscedastic) variance
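The first point can be checked numerically: an OLS line fitted to the observed proportions predicts values outside [0, 1] just beyond the observed dose range. A small Python sketch (the evaluation doses 1.65 and 1.90 are chosen here purely for illustration):

```python
# OLS fit of observed death proportions on dose for the beetle data,
# showing that a linear model for a probability leaves [0, 1].
dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
n    = [59, 60, 62, 56, 63, 59, 62, 60]
dead = [6, 13, 18, 28, 52, 53, 61, 60]
prop = [d / m for d, m in zip(dead, n)]

xbar = sum(dose) / len(dose)
ybar = sum(prop) / len(prop)
b = sum((x - xbar) * (p - ybar) for x, p in zip(dose, prop)) / \
    sum((x - xbar) ** 2 for x in dose)
a = ybar - b * xbar

# Predictions slightly outside the observed dose range escape [0, 1]:
print(a + b * 1.65)   # negative "probability"
print(a + b * 1.90)   # "probability" above 1
```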
Usual solution for Ex. 2: Logistic regression
Logistic regression model:
πi = exp(α + βxi) / (1 + exp(α + βxi))
Then 0 ≤ πi ≤ 1
Fit or estimate the model by Maximum Likelihood (ML).
• Takes into account that the responses are binomially distributed
• Estimates are efficient if we have enough data
(better estimation methods may exist when the number of observations is small)
Logistic regression for Ex. 2
MLE: α = −60.72, β = 34.27
Predicted probabilities: π = exp(α + βx) / (1 + exp(α + βx))

[Plot: fitted logistic curve and observed proportions of dead beetles vs dose (log_10)]
Estimating parameters in logistic regression
Storvik: "Numerical optimization of likelihoods: Additional literature for STK2120" gives a Newton-Raphson routine in R to fit logistic regression to these data.
But this is already implemented in R. Use the command
glm(cbind(dead,tot-dead)~Dose,data=beetle,family=binomial)
• glm = Generalised linear model
• family=binomial indicates that we have binary or binomial response data
• cbind(dead,tot-dead) is an n x 2 matrix with no. successes and no. failures in the two columns
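As a sketch of the Newton-Raphson iteration behind glm, the fit can be written out in a few lines of pure Python (here the starting values come from an OLS fit of the empirical logits, a common choice; glm's own initialization differs slightly):

```python
# Newton-Raphson fit of the logistic regression for the beetle data,
# reproducing the ML estimates alpha = -60.72, beta = 34.27.
import math

dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
n    = [59, 60, 62, 56, 63, 59, 62, 60]
dead = [6, 13, 18, 28, 52, 53, 61, 60]

# Starting values: OLS on the empirical logits (with 0.5 correction).
z = [math.log((y + 0.5) / (m - y + 0.5)) for y, m in zip(dead, n)]
xbar = sum(dose) / 8; zbar = sum(z) / 8
b = sum((x - xbar) * (v - zbar) for x, v in zip(dose, z)) / \
    sum((x - xbar) ** 2 for x in dose)
a = zbar - b * xbar

for _ in range(25):                     # Newton-Raphson iterations
    p  = [1 / (1 + math.exp(-(a + b * x))) for x in dose]
    u0 = sum(y - m * pi for y, m, pi in zip(dead, n, p))            # score
    u1 = sum(x * (y - m * pi) for x, y, m, pi in zip(dose, dead, n, p))
    w  = [m * pi * (1 - pi) for m, pi in zip(n, p)]                 # weights
    h00 = sum(w); h01 = sum(wi * x for wi, x in zip(w, dose))
    h11 = sum(wi * x * x for wi, x in zip(w, dose))
    det = h00 * h11 - h01 * h01
    a += ( h11 * u0 - h01 * u1) / det   # solve (X'WX) step = score
    b += (-h01 * u0 + h00 * u1) / det

print(round(a, 2), round(b, 2))         # -60.72 34.27
```

The converged values agree with the glm output shown later.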
Data Ex. 3: Number of children among pregnant
women
de Jong & Heller: Data for no. previous children among 141
pregnant women of various ages.
The number of children tends to increase with age (as expected)
[Plots: number of children (antall barn) vs age (alder), left; average number of children (gjennomsnittlig antall barn) vs age, right]
Data Ex. 3b: Number of car damages
de Jong & Heller: Data for reported car damages for 65535
policies
Explanatory variables:
• Value of car
• Age of car
• Type of car
• Gender of driver
• Age of driver
The response variable is a count variable in both examples, perhaps Poisson distributed
In Ex. 3: Yi = No. previous children for mother no. i
it can be reasonable to assume that Yi is Poisson distributed with expectation µi,
where µi depends on xi = mother's age
Similar to Ex. 2:
• Expectation µi > 0
• Variance of Yi equal to µi, i.e. non-constant variance
Usual solution: Poisson regression
Yi ∼ Po(µi) where µi = exp(α + βxi)
This is also a GLM and can be fitted by the glm function in R
Must specify that the response data are Poisson distributed by family=poisson
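The same Newton-Raphson scheme works here. Since the individual ages are not reproduced on the slide, the sketch below uses small made-up data and instead checks a general property of the fit: with an intercept and log link, the score equations force the fitted expectations to satisfy Σµ̂i = Σyi.

```python
# Poisson regression Y_i ~ Po(mu_i), mu_i = exp(a + b*x_i), fitted by
# Newton-Raphson on hypothetical (age, no. of children) data.
import math

x = [20, 23, 26, 29, 32, 35, 38, 41]   # hypothetical ages
y = [0, 0, 1, 1, 2, 1, 3, 4]           # hypothetical no. of children

a, b = 0.0, 0.0
for _ in range(50):
    mu = [math.exp(a + b * xi) for xi in x]
    u0 = sum(yi - mi for yi, mi in zip(y, mu))                    # score
    u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
    h00 = sum(mu); h01 = sum(mi * xi for mi, xi in zip(mu, x))
    h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
    det = h00 * h11 - h01 * h01
    a += ( h11 * u0 - h01 * u1) / det
    b += (-h01 * u0 + h00 * u1) / det

mu = [math.exp(a + b * xi) for xi in x]
# Slope positive for these data; fitted totals match observed totals.
print(b > 0, abs(sum(mu) - sum(y)) < 1e-6)
```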
Poisson regression for Ex. 3
MLE for (α, β): (α, β) = (−4.0895, 0.1129)
Gives fitted expectations µi = exp(α + βxi)

[Plot: expected number of children (forventet antall barn) vs age (alder); o = observed in 5-year age groups, curve = fitted with glm]
Definition of GLM
Independent responses: Y1, Y2, . . . , Yn conditioned on explanatory variables
Vectors of explanatory variables x1, x2, . . . , xn
where xi = (xi1, xi2, . . . , xip) are p-dimensional
A GLM = Generalized Linear Model is defined by
• Y1, Y2, . . . , Yn come from the same class of distributions in the exponential family
(The exponential family will be defined later. It includes the normal, binomial, Poisson and gamma distributions)
• Linear predictors ηi = β0 + β1xi1 + · · · + βpxip
• Link function g(): µi = E[Yi] is coupled to the linear predictor by g(µi) = ηi
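As a small preview of the exponential-family form (defined properly later in the course; sign conventions vary between texts), the Poisson pmf µ^y e^(−µ)/y! can be rewritten as exp(yθ − b(θ) − log y!) with θ = log µ and b(θ) = e^θ. A quick numerical check:

```python
# Numerical check that the Poisson pmf equals its exponential-family form.
import math

def pois_direct(y, mu):
    # Standard form: mu^y * exp(-mu) / y!
    return mu ** y * math.exp(-mu) / math.factorial(y)

def pois_expfam(y, mu):
    # Exponential-family form: exp(y*theta - b(theta) - log(y!)),
    # with natural parameter theta = log(mu) and b(theta) = exp(theta).
    theta = math.log(mu)
    return math.exp(y * theta - math.exp(theta) - math.lgamma(y + 1))

for y in range(6):
    assert abs(pois_direct(y, 2.5) - pois_expfam(y, 2.5)) < 1e-12
print("ok")
```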
The linear regression model is a GLM
• Responses (Yi-s) are normally distributed
• Linear predictor ηi = β0 + β1xi1 + · · · + βpxip
• E[Yi] = µi = ηi, i.e. the link function g(µi) = µi is the identity function
The R commands lm for linear regression and glm do essentially the same, but with slightly different output
Linear regression is the default specification of glm
Ex 1: Birth weights
> lm(vekt~sex+svlengde)
Call:
lm(formula = vekt ~ sex + svlengde)
Coefficients:
(Intercept)         sex    svlengde
    -1447.2      -163.0       120.9
> glm(vekt~sex+svlengde)
Call: glm(formula = vekt ~ sex + svlengde)
Coefficients:
(Intercept)         sex    svlengde
    -1447.2      -163.0       120.9
Degrees of Freedom: 23 Total (i.e. Null); 21 Residual
Null Deviance: 1830000
Residual Deviance: 658800    AIC: 321.4
The logistic regression model is a GLM
• Responses (Yi-s) are binomially distributed Bin(ni, πi)
• Linear predictor ηi = β0 + β1xi1 + · · · + βpxip
• E[Yi]/ni = πi = exp(ηi) / (1 + exp(ηi)).
  Gives the link function g(πi) = log(πi / (1 − πi)) = ηi
g(π) = log(π / (1 − π)) = logit(π) is called the logit function
Logistic regression in R
> glmfit = glm(cbind(dead,tot-dead)~dose,data=beetle,
               family=binomial)
> print(glmfit)
Call: glm(formula = cbind(dead, tot - dead) ~ dose, family = binomial,
    data = beetle)
Coefficients:
(Intercept) dose
-60.72 34.27
Degrees of Freedom: 7 Total (i.e. Null); 6 Residual
Null Deviance: 284.2
Residual Deviance: 11.23 AIC: 41.43
The Poisson regression model is a GLM
• Responses Yi ∼ Po(µi)
• Linear predictor ηi = β0 + β1xi1 + · · · + βpxip
• E[Yi] = µi = exp(ηi), i.e. the link function g(µi) = log(µi) is the (natural) logarithm
> glm(children~age,family=poisson)
Call: glm(formula = children ~ age, family = poisson)
Coefficients:
(Intercept) age
-4.0895 0.1129
Degrees of Freedom: 140 Total (i.e. Null); 139 Residual
Null Deviance: 194.4
Residual Deviance: 165 AIC: 290
Ex. 4
Weights of 30 rats measured weekly over 5 weeks

[Plot: weight vs days for the 30 rats]
Ordinary linear model
Response Yi,j is the weight of rat i in week j.
Individual differences in level can be handled by one intercept per rat.
Possible model:
Yi,j = αi + β ∗ xj + εi,j,  εi,j ∼ N(0, σ²)
where xj is the number of days. Can estimate α1, ..., α30, β, σ² by ordinary linear regression
Ex. 4 cont.
The 30 rats are a sample from a population and we are interested in the whole population.
We therefore assume a distribution for αi over all rats in the population.
Specifically, we assume αi ∼ N(α, σa²),
where α and σa are parameters
This is an example of a mixed model
This mixed model can alternatively be formulated as
Yi,j = α + ai + β ∗ xj + εi,j,  εi,j ∼ N(0, σ²)
where ai ∼ N(0, σa²).
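A small simulation sketch (parameter values made up for illustration) of why the random intercept creates within-rat correlation: two measurements on the same rat share the same ai, which gives intraclass correlation σa²/(σa² + σ²).

```python
# Simulate the random-intercept model Y_ij = alpha + a_i + beta*x_j + eps_ij
# and compare the empirical within-group correlation to theory.
import random, math

random.seed(1)
alpha, beta = 100.0, 6.0
sigma_a, sigma = 2.0, 1.0          # between-group and within-group sd

pairs = []
for _ in range(20000):             # many "rats", two occasions each
    a_i = random.gauss(0, sigma_a) # shared random intercept
    y1 = alpha + a_i + beta * 1 + random.gauss(0, sigma)
    y2 = alpha + a_i + beta * 2 + random.gauss(0, sigma)
    pairs.append((y1, y2))

# Empirical correlation between the two occasions
m1 = sum(p[0] for p in pairs) / len(pairs)
m2 = sum(p[1] for p in pairs) / len(pairs)
cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / len(pairs)
v1 = sum((p[0] - m1) ** 2 for p in pairs) / len(pairs)
v2 = sum((p[1] - m2) ** 2 for p in pairs) / len(pairs)
r = cov / math.sqrt(v1 * v2)

theory = sigma_a ** 2 / (sigma_a ** 2 + sigma ** 2)   # = 0.8 here
print(round(theory, 2), abs(r - theory) < 0.03)
```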
Ex 4: Estimation in R
lme(y~x,random=~1|id,data=d)
Linear mixed-effects model fit by REML
Data: d
AIC BIC logLik
1145.302 1157.290 -568.6508
Random effects:
Formula: ~1 | id
(Intercept) Residual
StdDev: 14.03351 8.203811
Fixed effects: y ˜ x
Value Std.Error DF t-value p-value
(Intercept) 106.56762 3.0379720 119 35.07854 0
x 6.18571 0.0676639 119 91.41824 0
Correlation:
(Intr)
x -0.49
Some extensions
Other GLM-s:
• Count data with Var(Y) > E(Y) (overdispersion): negative binomial distribution
• Continuous, non-normal response: gamma or inverse Gaussian distributions
Extensions of GLM:
• Multinomial responses (STK3100)
• Mixed models (STK3100, STK4070)
• Dependent responses (STK3100, STK4060/STK4150)
• Survival data (STK4080)
• Generalized Additive Models (GAM) (STK4030)
Overview of the book of de Jong & Heller
• Ch. 1: Introduction, data examples
• Ch. 2: Various distributions (most of these should be
known)
• Ch. 3: Exponential family, ML estimation
• Ch. 4: Linear modelling (mostly known from
STK1110/STK2120)
• Ch. 5: GLM
• Ch. 6: Count data (Poisson regression, overdispersion)
• Ch. 7: Categorical responses (binomial and multinomial)
• Ch. 8: Continuous responses
Ch. 1, 2 and 4 will not be taught in detail. Read them on your own!
Overview of the book of Zuur et al.
• Ch. 5: Linear mixed models
• Ch. 8: Exponential family (supplement to de Jong &
Heller)
• Ch. 13: GLM and mixed models
• Perhaps other chapters
Plan for the course
de Jong & Heller
• Will mainly follow chapters 3, 5, 6, 7 and 8 of the book
Zuur et al.
• will mostly look at models and examples
The lecture slides will be published on the home page of this course, together with an overview of planned lectures