Page 1:

Introduction to the Generalized Linear Model: Logistic regression and Poisson regression

Statistical modelling: Theory and practice

Gilles Guillot

[email protected]

November 4, 2013

Page 2:

1 Logistic regression

2 Poisson regression

3 References

Page 3:

Logistic regression

US space shuttle Challenger accident data

[Figure: scatter plot of the proportion of O-ring failures (0.00 to 0.30) against temperature in °C (15 to 25).]

Proportion of O-rings experiencing thermal distress on the US space shuttle Challenger.

Each point corresponds to a flight. The proportion is obtained among the 6 O-rings of each flight.

Page 4:

Logistic regression

US space shuttle Challenger accident data (cont')

[Figure: the same plot as on the previous page, proportion of O-ring failures versus temperature (°C).]

Key question: what is the probability of experiencing a failure if the temperature is close to 0 °C?

For the previous example, we would need a model along the lines of
Number of failures ∼ temperature
or
Proportion of failures ∼ temperature

Page 5:

Logistic regression

Targeted online advertising

The problem

A person makes a search on Google, say on 'cheap flight Yucatan'.

Is it worthwhile to display an ad for, say, 'hotel del Mar Cancún'?

Page 6:

Logistic regression

Targeted online advertising (cont')

Google knows the preferences of this person from earlier activity.

Google knows which ads this person has clicked or not clicked in the past.
In statistical terms: Google has a vector of binary explanatory variables (x1, ..., xp) that summarizes the prospect's preferences.

Google knows the preferences of other persons who have previously been shown the ad for 'hotel del Mar Cancún' and the same ads as the prospect.
Denoting by yi (∈ {0, 1}) their past behaviour towards this ad, Google knows yi and (xi1, ..., xip) for these persons.

Page 7:

Logistic regression

Targeted online advertising (cont')

In statistical terms:

We have a dataset (yi, xi1, ..., xip) with i = 1, ..., n.
We have a prospect known through its vector (x1, ..., xp).
We want to predict the behaviour y (0 or 1) of this person regarding the ad 'hotel del Mar Cancún'.

We need a model along the lines of y ∼ x1, ..., xp.
The novelty (compared to the linear model) is that y is a binary outcome rather than a continuous response.

What follows applies

to any type (quantitative or categorical) of explanatory variables
to any number (p = 1, p > 1) of explanatory variables

Page 8:

Logistic regression

Hints about what is coming below

In a linear regression we write

yi = β0 + β1xi + εi and εi ∼ N (0, σ2)

which we can rephrase as

yi ∼ N (β0 + β1xi, σ2)

Idea of a logistic regression: tweak the equation above to fit data that cannot be considered Normal.

Page 9:

Logistic regression

The logistic function

Definition of the logistic function

The logistic function is a function l : R −→ R defined as

l(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(−x))

[Figure: plot of the logistic function l(x) for x between −4 and 4; l(x) increases from 0 to 1.]

This function l is a bijection from R onto (0, 1), with inverse (the logit function) g(y) = l^(−1)(y) = ln(y/(1 − y)).
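A quick numerical sketch (not part of the original slides) of the logistic function and its inverse in R; base R also provides them as plogis() and qlogis():

# logistic function l(x) = 1/(1 + exp(-x)) and its inverse, the logit
logistic <- function(x) 1 / (1 + exp(-x))
logit <- function(y) log(y / (1 - y))
x <- seq(-4, 4, by = 0.5)
all.equal(logit(logistic(x)), x)      # TRUE: the logit inverts the logistic
all.equal(logistic(x), plogis(x))     # TRUE: plogis() is base R's logistic function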

Page 10:

Logistic regression

Logistic regression for a 0/1 response variable

We consider here the case of a single explanatory variable x and a binary response y.

The data consist of

y = (y1, ..., yn): observations of a binary (0/1) or two-way categorical response variable
x = (x1, ..., xn): observations of a quantitative explanatory variable

Page 11:

Logistic regression

Logistic regression for a 0/1 response variable (cont')

We want a model such that P (Yi = 1) varies with xi

We assume that Yi ∼ b(pi) (a Bernoulli distribution with parameter pi).
We assume that pi = l(β0 + β1xi), where l is the logistic function.

This is an instance of a so-called Generalized Linear Model

Page 12:

Logistic regression

Formal definition of the logistic regression for a binary response

pi = l(β0 + β1xi) for a vector of unknown deterministic parameters β = (β0, β1)

Yi ∼ b(pi) = b(l(β0 + β1xi))
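As an illustration (not from the slides; the parameter values below are arbitrary), data following this model can be simulated in R as follows:

# simulate n binary observations from a logistic regression model
set.seed(1)
n <- 200
beta0 <- -1; beta1 <- 0.8                      # arbitrary illustrative parameters
x <- rnorm(n)                                  # quantitative explanatory variable
p <- plogis(beta0 + beta1 * x)                 # p_i = l(beta0 + beta1 x_i)
y <- rbinom(n, size = 1, prob = p)             # Y_i ~ b(p_i)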

Page 13:

Logistic regression

Note the analogy with the linear regression yi ∼ N (β0 + β1xi, σ2)

Under this model: E[Yi] = pi = 1/(1 + e^(−(β0+β1xi)))

V[Yi] = pi(1 − pi). The variance of Yi around its mean is entirely controlled by pi, hence by β0 and β1.

The model enjoys the structure "Response = signal + noise", but we never write it explicitly, nor do we make any assumption about or even refer to a residual ε. The structure "Response = signal + noise" is implicit.

Straightforward extension to more than one explanatory variable:

pi = l(β0 + β1 xi^(1) + ... + βp xi^(p))

Page 14:

Logistic regression

Logistic regression for finite count data

We consider data y = (y1, ..., yn) obtained as counts over several replicates denoted N = (N1, ..., Nn). [For example, each Ni is equal to 6 in the O-rings data.]

pi = l(β0 + β1xi) for a vector of unknown deterministic parameters β = (β0, β1)

Yi ∼ Binom(Ni, pi)

Under this model:
E[Yi] = Ni pi = Ni/(1 + e^(−(β0+β1xi)))
V[Yi] = Ni pi(1 − pi)
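A simulation sketch in the spirit of the O-ring data (illustrative, assumed values of β0 and β1; Ni = 6 O-rings per flight):

# simulate binomial counts: y_i failures among N_i = 6 O-rings, probability depending on temperature
set.seed(2)
n <- 23                                        # number of flights (the classical Challenger data set has 23)
temp <- runif(n, 12, 28)                       # temperature in degrees C
N <- rep(6, n)                                 # 6 O-rings per flight
p <- plogis(5 - 0.4 * temp)                    # assumed illustrative values of (beta0, beta1)
y <- rbinom(n, size = N, prob = p)             # Y_i ~ Binom(N_i, p_i)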

Page 15:

Logistic regression

Maximum likelihood inference for logistic regression

Data: y = (y1, ..., yn), x = (x1, ..., xn) and N = (N1, ..., Nn).
Parameters: β = (β0, β1). NB: there is no parameter controlling the dispersion of residuals.
Likelihood:

L(y1, ..., yn; β0, β1) = ∏_{i=1}^n C(Ni, yi) pi^yi (1 − pi)^(Ni − yi) ∝ ∏_{i=1}^n pi^yi (1 − pi)^(Ni − yi)   (1)

ln L(y1, ..., yn; β0, β1) = ∑_{i=1}^n [ yi ln pi + (Ni − yi) ln(1 − pi) ] + constant   (2)

where C(Ni, yi) denotes the binomial coefficient.

Page 16:

Logistic regression

ML inference cont'

Maximizing the log-likelihood

In contrast with the linear model, the maximum-likelihood estimate in the logistic regression does not have a closed-form expression: the log-likelihood has to be maximised numerically.
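A minimal sketch of this numerical maximisation with R's general-purpose optimiser optim(), using simulated binomial data as above; glm() (next slide) does the same job via Fisher scoring:

# illustrative data (same construction as the binomial simulation above)
set.seed(2)
n <- 23; temp <- runif(n, 12, 28); N <- rep(6, n)
y <- rbinom(n, size = N, prob = plogis(5 - 0.4 * temp))
# negative log-likelihood of eq. (2), up to an additive constant
negloglik <- function(beta, y, N, x) {
  p <- plogis(beta[1] + beta[2] * x)           # p_i = l(beta0 + beta1 x_i)
  -sum(y * log(p) + (N - y) * log(1 - p))
}
fit <- optim(par = c(0, 0), fn = negloglik, y = y, N = N, x = temp, hessian = TRUE)
fit$par                                        # ML estimates of (beta0, beta1)
sqrt(diag(solve(fit$hessian)))                 # approximate standard errors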

Page 17:

Logistic regression

Model fitting in R

Binary response

With x, y two vectors of equal lengths, y being binary:

glm(formula = (y~x),

family=binomial(logit))

Returns parameter estimates and loads of diagnostic tools (e.g. AIC).
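For example (a usage sketch with simulated data, since no specific data set is attached to this slide):

x <- rnorm(200); y <- rbinom(200, 1, plogis(-1 + 0.8 * x))   # illustrative data
fit <- glm(y ~ x, family = binomial(logit))
summary(fit)                         # coefficient table, deviances, AIC
coef(fit)                            # estimates of beta0 and beta1
predict(fit, type = "response")      # fitted probabilities p_i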

Page 18:

Logistic regression

Example of output of glm function

Call:

glm(formula = (y ~ x), family = binomial(link = logit))

Deviance Residuals:

Min 1Q Median 3Q Max

-2.1886 -1.0344 0.5489 0.9069 2.1294

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.54820 0.07364 7.444 9.75e-14 ***

x 1.08959 0.08822 12.351 < 2e-16 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1342.7 on 999 degrees of freedom

Residual deviance: 1138.1 on 998 degrees of freedom

AIC: 1142.1

Number of Fisher Scoring iterations: 4
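The estimated coefficients translate into fitted probabilities through the logistic function (plogis() in R); for instance, with the output above, the fitted success probability at x = 1 is (a sketch):

plogis(0.54820 + 1.08959 * 1)        # about 0.84
# equivalently, for the fitted model object:
# predict(fit, newdata = data.frame(x = 1), type = "response")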

Page 19:

Logistic regression

General binomial response

With x, N, y three vectors of equal lengths (y counts out of N trials):

glm(formula = cbind(y,N-y)~x,

family=binomial(logit))
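A usage sketch with the simulated O-ring-style data from a few slides back (y failures out of N = 6 O-rings, temp the temperature); the prediction at 0 °C addresses the key question raised earlier:

fit.binom <- glm(cbind(y, N - y) ~ temp, family = binomial(logit))
summary(fit.binom)                                                    # estimates, deviances, AIC
predict(fit.binom, newdata = data.frame(temp = 0), type = "response") # fitted failure probability near 0 degrees C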

Page 20:

Poisson regression

Poisson regression: what for?

In many statistical studies, one tries to relate a count to some environmental or sociological variables.

For example:

Number of cardiovascular accidents among people over 60 in a US state ∼ average income in the state
Number of bicycles in a Danish household ∼ distance to the city centre

When the response is a count which does not have any natural upper bound, the logistic regression is not appropriate.

Even if there is a natural upper bound, the binomial distribution may give a poor fit.

The Poisson regression is a natural alternative to the logistic regression in these cases.

Page 21:

Poisson regression

Poisson process, Poisson equation, Poisson kernel, Poisson distribution, Poisson bracket, Poisson algebra, Poisson regression, Poisson summation formula, Poisson's spot, Poisson's ratio, Poisson zeros, ...

Siméon Poisson, 1781-1840

Page 22:

Poisson regression

The Poisson distribution

Definition: Poisson distribution

The Poisson distribution is a distribution on N with probability mass function

P(X = k) = e^(−λ) λ^k / k!   for k ∈ N,

and some parameter λ ∈ (0, +∞).

Notation: X ∼ P(λ)

Page 23:

Poisson regression

The Poisson distribution (cont)

If X ∼ P(λ):
E[X] = λ
Var[X] = λ

If X ∼ P(λ) and Y ∼ P(µ), with X and Y independent:

X + Y ∼ P(λ + µ)
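These properties are easy to check by simulation in R (arbitrary values of λ and µ):

set.seed(3)
lambda <- 3; mu <- 1.5
x <- rpois(1e5, lambda); y <- rpois(1e5, mu)
mean(x); var(x)                      # both approximately lambda = 3
mean(x + y); var(x + y)              # both approximately lambda + mu = 4.5
dpois(2, lambda)                     # P(X = 2) = exp(-3) * 3^2 / 2!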

Page 24:

Poisson regression

Set up for the Poisson regression

y1, ..., yn: a discrete count variable, e.g. the number of bicycles in n households

x1, ..., xn: a quantitative variable, e.g. distance of the household to the city centre, or household income

Page 25:

Poisson regression

Poisson regression: definition

A Poisson regression is a model relating an explanatory variable x to a count variable y, with the following assumptions:

Definition: Poisson regression

y1, ..., yn are independent realizations of Poisson random variables Y1, ..., Yn with E[Yi] = µi

ln µi = αxi + β

For short, Yi ∼ P(exp(αxi + β)).

αxi + β is called the linear predictor

The function that relates the mean of Yi to the linear predictor (here the logarithm, since ln µi = αxi + β) is called the link function.
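A minimal simulation sketch (arbitrary values of α and β) producing data of exactly this form; the variable names match the glm() call on the next slide:

# simulate Poisson regression data: Y_i ~ P(exp(alpha * x_i + beta))
set.seed(4)
n <- 100
alpha <- 0.6; beta <- -1                       # arbitrary illustrative values
dist <- runif(n, 0, 5)                         # e.g. distance to the city centre
mu <- exp(alpha * dist + beta)                 # log link: ln(mu_i) = alpha * x_i + beta
nb.cycles <- rpois(n, mu)                      # Y_i ~ P(mu_i)
dat <- data.frame(nb.cycles, dist)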

Page 26:

Poisson regression

Poisson regression fitting in R

glm.res = glm(formula = (nb.cycles ~ dist),

family = poisson(link="log"),

data = dat)

Page 27:

Poisson regression

Output of glm() function

summary(glm.res)

Call:

glm(formula = (nb.cycles ~ dist), family = poisson(link = "log"),

data = dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.9971 -0.7196 -0.2033 0.4084 3.3813

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.09426 0.24599 -4.448 8.65e-06 ***

dist 0.63211 0.06557 9.640 < 2e-16 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 198.77 on 99 degrees of freedom

Residual deviance: 101.55 on 98 degrees of freedom

AIC: 356.91

Number of Fisher Scoring iterations: 5
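Because of the log link, coefficients act multiplicatively on the expected count; a short sketch of how the output above can be read:

exp(0.63211)                          # one extra unit of dist multiplies the expected count by about 1.88
exp(-1.09426 + 0.63211 * 2)           # expected count at dist = 2, about 1.19
# equivalently: predict(glm.res, newdata = data.frame(dist = 2), type = "response")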

Page 28:

References

Reading

Sheather (2009). A Modern Approach to Regression with R, Chapter 8: Logistic Regression. Springer.
