week 12: linear probability models, logistic and probit · week 12: linear probability models,...

28
Week 12: Linear Probability Models, Logistic and Probit Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1

Upload: tranminh

Post on 08-Sep-2018

221 views

Category:

Documents


0 download

TRANSCRIPT

Week 12: Linear Probability Models, Logistic andProbit

Marcelo Coca Perraillon

University of ColoradoAnschutz Medical Campus

Health Services Research Methods IHSMP 7607

2017

c©2017 PERRAILLON ARR 1

Outline

Modeling 1/0 outcomes

The “wrong” but super useful model: Linear Probability Model

Logistic regression

Probit regression

c©2017 PERRAILLON ARR 2

Binary outcomes

Binary outcomes are everywhere: whether a person died or not, brokea hip, has hypertension or diabetes, etc

We typically want to understand what is the probability of the binaryoutcome given explanatory variables

It’s exactly the same type of models we have seen during thesemester, the difference is that we have been modeling the conditionalexpectation given covariates: E [Y |X ] = β0 + β1X1 + · · ·+ βpXp

Now, we want to model the probability given covariates:P(X = 1|X ) = f (β0 + β1X1 + · · ·+ βpXp)

Note the function f() in there

c©2017 PERRAILLON ARR 3

Linear Probability Models

We could actually use our linear model to do so

It’s very simple to understand why. If Y is an indicator or dummyvariable, then E [Y |X ] is the proportion of 1s given X , which weinterpret as a probability of Y given X

We can then interpret the parameters as the change in the probabilityof Y when X changes by one unit or for a small change in X

For example, if we model diedi = β0 + β1agei + εi , we could interpretβ1 as the change in the probability of death for an additional year ofage

c©2017 PERRAILLON ARR 4

Linear Probability Models

The problem is that we know that this model is not entirelycorrect. Recall that in the linear model we assumeY ∼ N(β0 + β1X1 + · · ·+ βpXp, σ

2)

That is, Y distributes normal conditional on Xs

Obviously, a 1/0 variable can’t distribute normal, and εi can’t benormally distributed either

The big picture: The model is not correct but it is extremely usefulas a first step for analyzing the data

In particular, it is very useful because we can interpret the parametersin the probability scale, which is the scale of interest

Remember that we only needed the normality assumption forinference

c©2017 PERRAILLON ARR 5

Example

Data from the National Health and Nutrition Examination Survey(NHANES)

We model the probability of hypertension given age

reg htn age

Source | SS df MS Number of obs = 3,529

-------------+---------------------------------- F(1, 3527) = 432.78

Model | 59.3248737 1 59.3248737 Prob > F = 0.0000

Residual | 483.471953 3,527 .137077389 R-squared = 0.1093

-------------+---------------------------------- Adj R-squared = 0.1090

Total | 542.796826 3,528 .153853976 Root MSE = .37024

------------------------------------------------------------------------------

htn | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .0086562 .0004161 20.80 0.000 .0078404 .009472

_cons | -.1914229 .0193583 -9.89 0.000 -.2293775 -.1534682

------------------------------------------------------------------------------

An additional year of life increases the probability of hypertension by0.8 percent

c©2017 PERRAILLON ARR 6

ExamplePlot predicted values against observed values (note the jittering)

reg htn age

predict htnhat_lpm

scatter htn age, jitter(5) msize(tiny) || line htnhat_lpm age, sort ///

legend(off) color(red) saving(hlpm.gph, replace)

graph export hlpm.png, replace

di _b[_cons] + _b[age]*20

c©2017 PERRAILLON ARR 7

Linear Probability Models

You can see the first problem with the LPM

The relationship between age (or any other variable) cannot belinear. Probabilities need to be constrained to be between 0 and 1

In this example, the probability of hypertension for a 20 y/o is-.0182996

In this a big problem in this example? No, because on average theprobability of hypertension is 0.19, which is not close to zero

Lowess is neat for seeing 0/1 variables conditional on a variable

c©2017 PERRAILLON ARR 8

LowessLowess can show you how the relationship between the indicatorvariable and the explanatory variable looks like

scatter htn age, jitter(5) msize(vsmall) || line lhtn age, sort ///

legend(off) color(red) saving(hlowess.gph, replace) ///

title("Lowess")

c©2017 PERRAILLON ARR 9

LPN, more issues

The variance of a 1/0 (binary) depends on the values of X so there isalways heteroskedasticity: var(y |x) = p(x)[1− p(x)]

We know that we need this assumption for correct SEs and F tests.So we can correct SEs in LPMs using the robust option (Huber-WhiteSEs; aka sandwich estimator)

Still, we do know that SEs are not totally correct because theydo not distribute normal either, even is we somehow correct forheteroskedasticity

c©2017 PERRAILLON ARR 10

So why do we use LPMs?

Not long ago, maybe 10 or 15 years ago, you couldn’t use otheralternatives with large datasets (logistic, probit)

It would take too long to run the models or they wouldn’t run;researchers would take a sample and run logit or probit as a sensitivityanalysis

The practice still lingers in HSR and health economics

The main reason to keep using LPM as a first step in modeling, it’sbecause the coefficients are easy to interpret

In my experience, if the average of the outcome is not close to 0 or 1,not much difference between LPM or logit/probit (but SEs canchange)

Not a lot of good reasons to present LPM results in papers anymore.Again, as a first step, LPM is great

c©2017 PERRAILLON ARR 11

Logistic or logit model

There are many ways to arrive to logistic regression

Again: Logistic models can be derived in several ways, which makesthings confusing

One way is to think about imposing a restriction on the functionalform so that probabilities are bounded between 0 and 1 regardless forall values of X

One of those transformations is the logistic response function:

π = ex

1+ex = 11+e−x , where x is any real number

π(x) is then restricted to be between 0 and 1

c©2017 PERRAILLON ARR 12

Logistic response functionIf we constrain the response to be between 0 and 1, it can’t be linearwith respect to X

twoway function y=exp(x) / (1+ exp(x)), range(-10 10) saving(l1.gph, replace)

twoway function y=exp(-x) / (1+ exp(-x)), range(-10 10) saving(l2.gph, replace)

graph combine l1.gph l2.gph, xsize(20) ysize(10)

graph export lboth.png, replace

c©2017 PERRAILLON ARR 13

Logistic or logit model

Notice a couple of things. If we use a model like the logistic responsefunction, then it is not possible that the effect of x on π is linear; theeffect depends on the value of x

But we can make the function linear using the so-called logittransformation

ln( π1−π ) = x

I made you go the other way in one homework. If you solve for π youget to the logistic response function

Of course, instead of just x we could have β0 + β1x and instead of πwe write p:

ln( p1−p ) = β0 + β1X , which transformed is p = eβ0+β1X

1+eβ0+β1X

c©2017 PERRAILLON ARR 14

Logistic model

That in a nutshell is the logistic model and that’s how your textbookintroduces the modelp

1−p are defined as the odds, an log( p1−p ) are the “log odds” or the

just the logit

Sadly, it is bad way of introducing the model (but it’s useful)

Note that we made no statistical assumptions anywhere

A better way to introduce logistict regression (and probit) is bymaking assumptions about the distribution of the outcome 1/0 andthen deriving a model for it

We did so when we covered maximum likelihood

c©2017 PERRAILLON ARR 15

Bernoulli example, redux

Suppose that we know that the following numbers were simulatedusing a Bernoulli distribution: 0 0 0 1 1 1 0 1 1 1

We can denote by y1, y2, ..., y10

Recall that the pdf of a Bernoulli random variable isf (x ; p) = py (1− p)1−y , where y ∈ {0, 1}. The probability of 1 is pwhile the probability of 0 is (1− p)

We want to figure out what is the p̂ we used to generate thosenumbers

The probability of the first number will be given by py1(1− p)1−y1 ,the probability of the second by py2(1− p)1−y2 and so on...

If we assume that the numbers above are independent, the jointprobability of seeing all 10 numbers will be given by themultiplication of the individual probabilities

We now want to make p a function of covariates

c©2017 PERRAILLON ARR 16

Binomial or Bernoulli?

A Bernoulli random variable can take only two values, 0 or 1 and hasonly one parameter, p

A Binomial variable models the number of successes (the 1’s) in nBernoulli trials and has two parameters, n and p

So no surprise that we could use either a Binomial or Bernoulli tomodel how the probability p is a function of covariates:p(x) = f (x1, ..,x p)

We arrive to the same likelihood function

c©2017 PERRAILLON ARR 17

A more general way

As I said, you can arrive to the logistic model in many different ways

We know that the probability of seeing the 1/0 variable y ispy (1− p)(1− y) and it’s variance is var(y) = p(1− p)

We can make p depend on a function of covariatesβ0 + β1X1 + · · ·+ βpXp . So we model:

pi = P(yi = 1|X1, ...,Xp) = F (β0 + β1X1 + · · ·+ βpXp)

F() is usually a cumulative distribution function

The most common are the logistic distribution and the standardnormal cumulative distribution function

Confusion: The logistic response function is also a probability densityfunction

If we use a logistic distribution function we arrive to the logisticregression model; if we use a standard normal cumulative densityfunction, we arrive to the probit model

c©2017 PERRAILLON ARR 18

Yet another way

What about if we assume there is a latent (unobserved) variable y∗

that is continuous. Think about it as a measure of illness

If y∗ crosses a threshold, then the person dies. We only observe if theperson died but we can’t observe the latent variable

We can write this problem as

P(y = 1|X ) = P(β0 + β1X1 + · · ·+ βpXp + u > 0) = P(−u <β0 + β1X1 + · · ·+ βpXp) = F (β0 + β1X1 + · · ·+ βpXp)

F() is the cdf of -u. If we assume logistic, we get logistic regression, ifwe assume cumulative normal, we get a probit model

See Cameron and Trivedi Chapter 14, section 14.3.1

c©2017 PERRAILLON ARR 19

Big picture

Not a big difference (in the probability scale) between probit andlogit; it’s use is “cultural.” If you are an economist you run probitmodels; for the rest of the world, there is the logistic model

IMPORTANT: There is a big difference in terms of interpreting aregression output because the coefficients are estimated in differentscales

In the logistic model the effect of a covariate can be made linear interms of the odds-ratio; you can’t do the same in probit models

In either model, there is NO ADDITIVE ERROR

There is no decomposing variance; now we are in a non-linear world

c©2017 PERRAILLON ARR 20

Simulation

set seed 12345678

set obs 10000

gen x1 = rchi2(1)+2

* Make the probability a function of x1

gen pr1 = exp(-1+0.1*x1) /(1+exp(-1+0.1*x1))

* Generate a one-trial binomial (Bernoulli)

gen y = rbinomial(1, pr1)

list y in 1/10

+---+

| y |

|---|

1. | 0 |

2. | 1 |

3. | 1 |

4. | 0 |

5. | 1 |

|---|

6. | 0 |

7. | 0 |

8. | 0 |

9. | 1 |

10. | 0 |

+---+

c©2017 PERRAILLON ARR 21

Simulation

Note that the parameters match the ones I used in the gen step inprevious slide

logit y x1

Iteration 0: log likelihood = -6351.6567

Iteration 1: log likelihood = -6328.5462

Iteration 2: log likelihood = -6328.4883

Iteration 3: log likelihood = -6328.4883

Logistic regression Number of obs = 10,000

LR chi2(1) = 46.34

Prob > chi2 = 0.0000

Log likelihood = -6328.4883 Pseudo R2 = 0.0036

------------------------------------------------------------------------------

y | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

x1 | .0998315 .0145838 6.85 0.000 .0712478 .1284153

_cons | -1.004065 .0493122 -20.36 0.000 -1.100715 -.9074147

------------------------------------------------------------------------------

c©2017 PERRAILLON ARR 22

Big picture

In the next weeks, we will focus on parameter interpretation

You already have most of the tools to do modeling but we need toadapt them to these types of models

The tricky part about logistic or probit models or other types ofmodels is to move from the work in which relationship are linear tothe world in which they are not

There is no R2 but there are other ways to check the predictive abilityof models

c©2017 PERRAILLON ARR 23

Comparing LPM, logit, and probit models

We will compare models in several ways (more on the next weeks)

* LPM

qui reg htn age

est sto lpm

predict hatlpm

* Logit

qui logit htn age

est sto logit

predict hatlog

* Probit

qui probit htn age

est sto prob

predict hatprob

line hatlpm age, sort color(blue) || line hatlog age, sort || ///

line hatprob age, sort legend(off) saving(probs.gph, replace)

graph export prob.png, replace

c©2017 PERRAILLON ARR 24

Predicted probabilities

Note that probabilities are not linear with probit and logit eventhough we wrote models in the same way

Note that there is practically no difference between the logit andprobit models

c©2017 PERRAILLON ARR 25

Coefficients?

The coefficients are very different because they are measured indifferent scales

. est table lpm logit prob, p

-----------------------------------------------------

Variable | lpm logit prob

-------------+---------------------------------------

_ |

age | .00865616

| 0.0000

_cons | -.19142286

| 0.0000

-------------+---------------------------------------

htn |

age | .0623793 .0352797

| 0.0000 0.0000

_cons | -4.4563995 -2.5542014

| 0.0000 0.0000

-----------------------------------------------------

legend: b/p

c©2017 PERRAILLON ARR 26

Similar in probability scale

We will learn how to make them comparable using the probabilityscale, which is what we really care about. The margins command iscomputing an average effect across values of age

. * LPM

qui reg htn age

margins, dydx(age)

------------------------------------------------------------------------------

| Delta-method

| dy/dx Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .0086562 .0004161 20.80 0.000 .0078404 .009472

------------------------------------------------------------------------------

* Logit

qui logit htn age

margins, dydx(age)

------------+----------------------------------------------------------------

age | .0084869 .0004066 20.87 0.000 .00769 .0092839

------------------------------------------------------------------------------

* Probit

qui probit htn age

margins, dydx(age)

-------------+----------------------------------------------------------------

age | .0084109 .0003917 21.47 0.000 .0076432 .0091787

------------------------------------------------------------------------------

c©2017 PERRAILLON ARR 27

Summary

There are many ways to arrive to the logistic or probit models

The big picture is that we are trying to model a probability, whichmust be bounded between 0 and 1

If it’s bounded, then the effect of any covariate on the probabilitycannot be linear

The tricky part is to learn how to interpret parameters. The shortstory is that we estimate (assuming one predictor)

log( p1−p ) = β0 + β1x but we care about p = eβ0+β1x

1+eβ0+β1x

In the probit model, we interpret parameters as shifts in thecumulative normal, even less intuitive

c©2017 PERRAILLON ARR 28