logistic regression ~ handout #1course1.winona.edu/bdeppa/biostatistics/handouts/ho… · web...


Handout 15 – Introduction to Logistic Regression

This handout covers material found in Section 13.8 of your text. You may also want to review regression techniques in Chapter 11.

These data are taken from the text “Applied Logistic Regression” by Hosmer and Lemeshow. Researchers are interested in the relationship between age and presence or absence of evidence of coronary heart disease (CHD).

The smooth is an estimate of E(CHD|Age) = P(CHD=1|Age). Why?

Expectation of a Bernoulli Random Variable

Let θ(Age_i) denote the probability of having CHD for a given Age_i.

Note that CHD_i|Age_i is a Bernoulli random variable with the following probability distribution:

CHD_i|Age_i    P(CHD_i|Age_i)
0              1 − θ(Age_i)
1              θ(Age_i)

We can find the expected value of the Bernoulli random variable as follows:

How do we develop a parametric model for a dichotomous response like CHD using Age of the person as the covariate? We might try a linear regression model with Age as a predictor and CHD as the response. Before we do this in SAS, consider our linear regression model:

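$$\text{CHD}_i \mid \text{Age}_i = \eta_0 + \eta_1\,\text{Age}_i + e_i,$$

where

$$\text{CHD}_i \mid \text{Age}_i = \begin{cases} 0 & \text{in the absence of CHD,} \\ 1 & \text{in the presence of CHD.} \end{cases}$$

Note that the mean function is given by E(CHD_i|Age_i) = η0 + η1 Age_i. As we saw above, we can find the expected value of the Bernoulli random variable as follows:

$$E(\text{CHD}_i \mid \text{Age}_i) = 0 \times [1 - \theta(\text{Age}_i)] + 1 \times \theta(\text{Age}_i) = \theta(\text{Age}_i).$$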

Why is this important? This shows that… (see previous page)

$$E(\text{CHD}_i \mid \text{Age}_i) = \eta_0 + \eta_1\,\text{Age}_i = \theta(\text{Age}_i) = P(\text{CHD}_i = 1 \mid \text{Age}_i)$$

That is, the regression line gives an estimate of the probability of having CHD for a given Age.

In SAS… (using file CHD.sas)

proc reg;
  model CHD = age;
  plot CHD*age;
run;

Some Problems that Arise Using this Model

1. Non-normality of the error terms. Only two different error terms are possible for each Age_i: −θ(Age_i) if the response is 0, and 1 − θ(Age_i) if the response is 1.

2. Non-constant variance of the error terms. Since CHD_i|Age_i is a Bernoulli random variable, we know that Var(CHD_i|Age_i) = θ(Age_i) × [1 − θ(Age_i)]. This then implies that

$$\text{Var}(\text{CHD}_i \mid \text{Age}_i) = [\eta_0 + \eta_1\,\text{Age}_i] \times [1 - (\eta_0 + \eta_1\,\text{Age}_i)].$$

That is, the variance function varies with Age and is NOT constant.

3. Constraints on the response function. A linear representation permits estimates or predictions outside the range 0 to 1, which is not correct when modeling probabilities. For example, what is our estimate of θ(Age=20) if we use a linear regression model?

Comment: The constraint that the mean function fall between 0 and 1 frequently rules out a linear response function. For our CHD example, the use of a linear response function might require us to assume a probability of 0 for the mean response for all individuals beneath a certain age and a probability of 1 for all individuals over a certain age (see below). Such a model is often considered unreasonable, however.

Ideally, we’d like to find a model where the probabilities 0 and 1 are reached asymptotically. One such model is the logistic regression model.

Recall:


The Simple Logistic Mean Function

We parameterize this model as follows:

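$$E(y_i \mid x_i) = \theta(x_i) = \frac{\exp(\eta_0 + \eta_1 x_i)}{1 + \exp(\eta_0 + \eta_1 x_i)}.$$

Some examples of simple logistic mean functions are shown below:

[Figure: two panels of logistic mean functions, labeled "With η0 = 0" and "With η1 = -1"]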

Comments:

1. The logistic mean function is always between 0 and 1.
2. As η1 increases in absolute value, the function becomes more S-shaped; therefore, the function changes more rapidly in the center.
3. When η1 is positive, the function is monotone increasing; when η1 is negative, the function is monotone decreasing.
4. Changing η0 shifts the function horizontally.
5. The logistic function possesses the property of symmetry. If the response variable is recoded by changing 1s to 0s and 0s to 1s, the signs of all coefficients will be reversed.
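The comments above can be checked numerically. The following is a minimal sketch in Python (not one of the packages used in this handout); the function name `logistic_mean`, the grid of x values, and the coefficient values are illustrative assumptions, not taken from the handout's figures.

```python
import math

def logistic_mean(x, eta0, eta1):
    """Simple logistic mean function: exp(eta0 + eta1*x) / (1 + exp(eta0 + eta1*x))."""
    linear = eta0 + eta1 * x
    return math.exp(linear) / (1.0 + math.exp(linear))

xs = [-4, -2, 0, 2, 4]  # arbitrary grid of x values

# Comment 1: values stay strictly between 0 and 1.
values = [logistic_mean(x, eta0=0.0, eta1=1.0) for x in xs]

# Comment 3: positive eta1 gives an increasing curve, negative eta1 a decreasing one.
increasing = [logistic_mean(x, 0.0, 1.0) for x in xs]
decreasing = [logistic_mean(x, 0.0, -1.0) for x in xs]

# Comment 5: recoding 1s as 0s flips the sign of both coefficients,
# because 1 - theta(x; eta0, eta1) = theta(x; -eta0, -eta1).
eta0, eta1 = 0.5, 1.5  # arbitrary illustrative values
flipped_ok = all(
    abs((1 - logistic_mean(x, eta0, eta1)) - logistic_mean(x, -eta0, -eta1)) < 1e-12
    for x in xs
)
print(values, flipped_ok)
```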


To fit the logistic regression model in SAS, you can use the following programming statements:

ods html;
ods graphics on;

proc logistic descending;
  model CHD = age / link=logit;
  graphics estprob;
run;

ods graphics off;
ods html close;

Questions:

1. Based on the plot, find θ(40) = P(CHD|Age=40).

2. Based on the plot, find θ(60) = P(CHD|Age=60).

This curve is a plot of:

$$\hat\theta(\text{Age}_i) = \hat P(\text{CHD} \mid \text{Age}_i) = \frac{\exp(\hat\eta_0 + \hat\eta_1\,\text{Age}_i)}{1 + \exp(\hat\eta_0 + \hat\eta_1\,\text{Age}_i)}$$

Interpreting the Model Parameters

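Mean function:

$$E(\text{CHD} \mid \text{Age}_i) = \theta(\text{Age}_i) = \frac{\exp(\eta_0 + \eta_1\,\text{Age}_i)}{1 + \exp(\eta_0 + \eta_1\,\text{Age}_i)}.$$

Fitted Model Equation (or Fitted Probabilities):

$$\hat E(\text{CHD} \mid \text{Age}_i) = \hat\theta(\text{Age}_i) = \frac{\exp(\hat\eta_0 + \hat\eta_1\,\text{Age}_i)}{1 + \exp(\hat\eta_0 + \hat\eta_1\,\text{Age}_i)}$$

Note that in the mean function, the probabilities θ(Age_i) are nonlinear functions of η0 and η1. However, a simple transformation results in a linear model. That is, we can show the following:

$$\ln\left[\frac{\theta(\text{Age}_i)}{1 - \theta(\text{Age}_i)}\right] = \eta_0 + \eta_1\,\text{Age}_i$$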

Proof of the previous claim:
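One way to fill in the proof is sketched below. The shorthand θ for θ(Age_i) and u for η0 + η1 Age_i is introduced here only for brevity and is not the handout's notation.

```latex
\begin{align*}
\theta &= \frac{e^{u}}{1 + e^{u}}, \qquad u = \eta_0 + \eta_1\,\text{Age}_i \\
1 - \theta &= \frac{1 + e^{u} - e^{u}}{1 + e^{u}} = \frac{1}{1 + e^{u}} \\
\frac{\theta}{1 - \theta} &= \frac{e^{u}}{1 + e^{u}} \cdot \frac{1 + e^{u}}{1} = e^{u} \\
\ln\left[\frac{\theta}{1 - \theta}\right] &= u = \eta_0 + \eta_1\,\text{Age}_i
\end{align*}
```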


Fitting the Model in JMP

Select Analyze > Fit Y by X and place CHD (y/n) in the Y box and age in the X box. The resulting output is shown below. Because the response is a dichotomous categorical variable, logistic regression is performed.

Example:

P(CHD|Age=40) =

P(CHD|Age=60) =

The curve is a plot of:

$$\hat P(\text{CHD} \mid \text{Age}) = \frac{\exp(\hat\eta_0 + \hat\eta_1\,\text{Age})}{1 + \exp(\hat\eta_0 + \hat\eta_1\,\text{Age})}$$

Interpretation of Model Parameters

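$$P(\text{CHD} = 1 \mid \text{Age}) = \theta(\tilde{x}) = \frac{e^{\eta_0 + \eta_1 \text{Age}}}{1 + e^{\eta_0 + \eta_1 \text{Age}}}$$

Odds for Success

$$\frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})} = e^{\eta_0 + \eta_1 \text{Age}}$$

Thus

$$\ln\left(\frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})}\right) = \eta_0 + \eta_1\,\text{Age}$$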

Suppose we contrast individuals who are Age = x to those who are Age = x + c. What can we say about the increased risk associated with a c year increase in age? The logistic model gives us a means to do this through the odds ratio (OR).

$$\begin{aligned}
\ln(\text{OR associated with a } c \text{ year increase in age})
&= \ln\left(\frac{\theta(\text{Age}=x+c)\,/\,[1-\theta(\text{Age}=x+c)]}{\theta(\text{Age}=x)\,/\,[1-\theta(\text{Age}=x)]}\right) \\
&= \ln\left(\frac{\theta(\text{Age}=x+c)}{1-\theta(\text{Age}=x+c)}\right) - \ln\left(\frac{\theta(\text{Age}=x)}{1-\theta(\text{Age}=x)}\right) \\
&= \eta_0 + \eta_1 (x + c) - (\eta_0 + \eta_1 x) = c\,\eta_1
\end{aligned}$$

Exponentiating both sides gives

$$\text{OR associated with a } c \text{ year increase in age} = e^{c\eta_1}$$

Thus the multiplicative increase (or decrease if η1 < 0) in odds associated with a c year increase in age is e^{cη1}.

Example: Interpreting a c year increase in age.

Question: Is it reasonable to assume that the effect of a c unit increase in a continuous predictor is constant regardless of the starting point? For example, does the risk associated with a 5 year increase in age remain constant throughout one’s life?

Statistical Inference for the Logistic Regression Model

Given estimates for the model parameters and their estimated standard errors, what types of statistical inferences can be made? One approach is to use the normal-theory based methods outlined below.

Hypothesis Testing

For testing:
Ho: ηi = 0
Ha: ηi ≠ 0

Large sample test for significance of “slope” parameter (ηi)

$$z = \frac{\hat\eta_i}{SE(\hat\eta_i)} \approx N(0,1)$$

Confidence Intervals for Parameters and Corresponding OR’s

For dichotomous categorical predictors (i.e. 0/1 predictors)

100(1−α)% CI for ηi

$$\hat\eta_i \pm z_{1-\alpha/2}\, SE(\hat\eta_i)$$

100(1−α)% CI for OR Associated with ηi

$$\exp\left(\hat\eta_i \pm z_{1-\alpha/2}\, SE(\hat\eta_i)\right)$$

If ηi corresponds to a continuous predictor and we wish to examine the OR associated with a c unit increase, the CI for the OR becomes

$$\exp\left(c\,\hat\eta_i \pm z_{1-\alpha/2}\, c\, SE(\hat\eta_i)\right)$$

Oftentimes categorical predictors have more than two levels, and we will see how to handle that case later in the notes.

Example: What is the OR for CHD associated with a 10 year increase in age? Give a 95% confidence interval based on this estimate.
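The calculation in this example can be sketched numerically. The block below is a minimal illustration in Python (not one of the packages used in this handout). It plugs in the Age coefficient and standard error from the "From R" output for chd ~ Age reported in this handout and assumes z ≈ 1.96 for a 95% interval; the function name `odds_ratio_ci` is hypothetical.

```python
import math

# Estimates copied from the R output for chd ~ Age shown in this handout.
eta1_hat = 0.11092   # estimated coefficient for Age
se_eta1 = 0.02406    # its estimated standard error

c = 10          # size of the increase in age (years)
z_crit = 1.96   # approximate z_{1-alpha/2} for a 95% interval (assumed)

def odds_ratio_ci(eta_hat, se, c, z):
    """Return (OR, lower, upper) for a c unit increase using the Wald formula."""
    point = math.exp(c * eta_hat)
    lower = math.exp(c * eta_hat - z * c * se)
    upper = math.exp(c * eta_hat + z * c * se)
    return point, lower, upper

or_hat, or_low, or_high = odds_ratio_ci(eta1_hat, se_eta1, c, z_crit)
print(round(or_hat, 2), round(or_low, 2), round(or_high, 2))
```

With these inputs the estimated odds ratio is about 3.03, with an interval of roughly (1.89, 4.86). Note that the interval is computed on the logit scale and then exponentiated, so it is not symmetric around the point estimate.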


Some Mathematical Details: Estimation of the Model Parameters

In a linear regression analysis, the regression coefficients are estimated based on the least squares method. That is, the estimates are obtained by minimizing the sum of the squared residuals. In a logistic regression analysis, the model parameters are estimated through a process called the maximum likelihood method. The basic principle of maximum likelihood is to choose as estimates those parameter values which, if true, would maximize the probability of observing what we have actually observed. This involves:

1. Finding an expression (i.e., the likelihood function) for the probability of the data as a function of the unknown parameters.

For the logistic model, the binary response variable is assumed to follow a binomial distribution with a single trial (n=1) and probability of “success” equal to θ(x_i). Therefore, for the ith observed pair (x_i, y_i), the contribution to the likelihood is

$$\theta(x_i)^{y_i}\,(1 - \theta(x_i))^{1 - y_i}$$

where

$$\theta(x_i) = \frac{e^{\eta_0 + \eta_1 x_i}}{1 + e^{\eta_0 + \eta_1 x_i}} \quad \text{and} \quad y_i \in \{1, 0\}.$$

Then, since we assume independence across observations, the likelihood function is given by

$$L(\tilde\eta) = L(\eta_0, \eta_1) = \prod_{i=1}^{n} \theta(x_i)^{y_i}\,(1 - \theta(x_i))^{1 - y_i}$$

2. Finding the values of the unknown parameters which make the value of this expression as large as possible.

For computational purposes it is usually easier to maximize the logarithm of the likelihood function rather than the likelihood function itself. This works because the logarithm is a monotonic increasing function; therefore, the maximizing parameters are the same for the likelihood and log-likelihood functions. The log-likelihood function is given by

$$\ln L(\eta_0, \eta_1) = \sum_{i=1}^{n} \left[\, y_i \ln(\theta(x_i)) + (1 - y_i) \ln(1 - \theta(x_i)) \,\right]$$

To find the parameter estimates, we solve simultaneously the equations given by setting the partial derivatives with respect to each parameter equal to 0:

$$\frac{\partial}{\partial \eta_0} \ln L(\eta_0, \eta_1) = 0$$

$$\frac{\partial}{\partial \eta_1} \ln L(\eta_0, \eta_1) = 0$$

Several different nonlinear optimization routines are used to find solutions to such systems. This process gets increasingly computationally intensive as the number of terms in the model increases.
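As a rough illustration of the two steps above, the sketch below maximizes the log-likelihood with Newton-Raphson, which is only one possible routine; the handout does not say which algorithm each package uses. It is written in Python rather than SAS, JMP, or R, and it uses the counts from the table(chd, Over55) output in the R example of this handout. The function names are hypothetical.

```python
import math

# Counts taken from the table(chd, Over55) output in this handout:
# each tuple is (x, number with CHD = 1, number with CHD = 0).
groups = [(0, 22, 51), (1, 21, 6)]

def theta(x, eta0, eta1):
    """Logistic mean function."""
    return 1.0 / (1.0 + math.exp(-(eta0 + eta1 * x)))

def log_likelihood(eta0, eta1):
    """ln L(eta0, eta1), summed over all observations grouped by x."""
    total = 0.0
    for x, ones, zeros in groups:
        p = theta(x, eta0, eta1)
        total += ones * math.log(p) + zeros * math.log(1.0 - p)
    return total

def newton_raphson(eta0=0.0, eta1=0.0, steps=25):
    """Solve the two score equations iteratively (illustrative routine)."""
    for _ in range(steps):
        g0 = g1 = 0.0            # first partial derivatives of ln L
        h00 = h01 = h11 = 0.0    # entries of the negative Hessian
        for x, ones, zeros in groups:
            n = ones + zeros
            p = theta(x, eta0, eta1)
            g0 += ones - n * p
            g1 += x * (ones - n * p)
            w = n * p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * delta = gradient.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        eta0, eta1 = eta0 + d0, eta1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return eta0, eta1

eta0_hat, eta1_hat = newton_raphson()
print(round(eta0_hat, 4), round(eta1_hat, 4))
```

The estimates agree with the coefficients in the R summary for chd ~ Over55 reported in this handout (-0.8408 and 2.0935).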

Example: Estimating Model Parameters with a Single Dichotomous Predictor

CHD and Indicator of Age Over 55

Computed using standard approach

Logistic Model

There are two different ways to code dichotomous variables: (0,1) coding or (-1,+1) (i.e. contrast) coding. JMP uses contrast coding by default, whereas other packages will generally use the (0,1) coding as default. The two coding types are shown below.

$$\text{Age 55+} = \begin{cases} 1 & \text{Age} \ge 55 \\ 0 & \text{Age} < 55 \end{cases} \qquad\qquad \text{Age 55+} = \begin{cases} +1 & \text{Age} \ge 55 \\ -1 & \text{Age} < 55 \end{cases}$$

For the purposes of discussion we will consider the (0,1) coding.

Recalling that

$$\theta(x) = P(\text{CHD} = 1 \mid x) = \frac{e^{\eta_0 + \eta_1 x}}{1 + e^{\eta_0 + \eta_1 x}}$$

where x is the Age 55+ indicator, we have the following.

           Age ≥ 55 (x = 1)                              Age < 55 (x = 0)
CHD = 1    θ(x=1) = exp(η0+η1) / (1 + exp(η0+η1))        θ(x=0) = exp(η0) / (1 + exp(η0))
CHD = 0    1 − θ(x=1) = 1 / (1 + exp(η0+η1))             1 − θ(x=0) = 1 / (1 + exp(η0))

Estimating the model parameters “by hand”

$$\text{OR} = \frac{\theta(x=1)\,/\,(1 - \theta(x=1))}{\theta(x=0)\,/\,(1 - \theta(x=0))} = \frac{e^{\eta_0 + \eta_1}}{e^{\eta_0}} = e^{\eta_1}$$
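With a single 0/1 predictor the model has two parameters and two groups, so the estimates can be read directly off the 2×2 table: the odds at x = 0 equal exp(η0) and the odds ratio equals exp(η1). The sketch below, in Python rather than the handout's packages, applies this to the counts from the table(chd, Over55) output in the R example of this handout; the variable and function names are illustrative.

```python
import math

# Counts from the table(chd, Over55) output shown in this handout.
chd1_over55, chd0_over55 = 21, 6      # x = 1 (Age >= 55)
chd1_under55, chd0_under55 = 22, 51   # x = 0 (Age < 55)

# Observed odds of CHD in each age group.
odds_x1 = chd1_over55 / chd0_over55
odds_x0 = chd1_under55 / chd0_under55

# With (0,1) coding: odds at x = 0 are exp(eta0), and OR = exp(eta1).
eta0_hat = math.log(odds_x0)
eta1_hat = math.log(odds_x1 / odds_x0)
odds_ratio = math.exp(eta1_hat)

def theta(x):
    """Fitted probability of CHD for indicator x under the estimates above."""
    u = eta0_hat + eta1_hat * x
    return math.exp(u) / (1.0 + math.exp(u))

print(round(eta0_hat, 4), round(eta1_hat, 4), round(odds_ratio, 2))
print(round(theta(1), 4), round(theta(0), 4))
```

The fitted probabilities equal the observed proportions in each group (21/27 and 22/73), which is a quick way to check the hand calculation.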


EXAMPLE: Using the data in the file CHD.sas we will create a dummy variable indicating whether the subject is over age 55 or not. Then instead of examining the relationship between the CONTINUOUS variable age and the presence or absence of evidence of coronary heart disease (CHD), we could consider the dichotomous predictor:

$$\text{Over55} = \begin{cases} 0 & \text{if Age} < 55 \\ 1 & \text{if Age} \ge 55 \end{cases}$$

data CHD;
  input ageGrp$ age CHD;
  if (age ge 55) then Over55=1; else Over55=0;
  datalines;
1 20 0
. . .
. . .

proc sort data=CHD;
  by descending CHD descending Over55;
run;

proc freq order=data;
  tables CHD*Over55 / all;
run;

Standard OR output from SAS:

Using PROC LOGISTIC to fit the model:

proc logistic data=CHD descending;
  model CHD = Over55 / link=logit;
  output out=probs predicted=predicted_probabilities;
run;

Questions:

1. Use the model parameters to predict the probability of having CHD for a person who is 55 or over and for a person who is younger than 55.

2. Given only the estimates of the model parameters, could you find the odds ratio for having CHD associated with being 55 or over?

Verify these values from the SAS output:

proc print data=probs;
run;

Analysis in JMP (CHD55.JMP)

To fit a logistic regression model, it is best to use the Analyze > Fit Model option. We place CHD (1 = Yes, 2 = No) in the Y box and Age > 55 (1 = Yes, 2 = No) in the model effects box. The key is to have “Yes” for risk and disease alpha-numerically before “No”, thus the use of 1 for “Yes” and 2 for “No”.

The summary of the fitted logistic model is shown below. Notice that the parameter estimates are not the same as those obtained from SAS. This is because JMP uses contrast coding for the Age > 55 predictor (+1 = Age > 55 and -1 = Age < 55).

OR’s and Fitted Probabilities

Using JMP to Compute OR’s, CI’s, and Fitted Probabilities

Because the disease and the risk factor are alpha-numerically ordered, the OR’s are correct as given.

By selecting Save Probability Formula we can save the fitted probabilities to the spreadsheet.

Example: CHD and Age Over 55 in R

> Over55
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[53] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 0 1

> chd
[1] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1 1
[53] 0 1 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1
Levels: 0 1

> table(chd,Over55)
     Over55
chd   0  1
  0  51  6
  1  22 21

> chd55 = glm(chd~Over55,family="binomial")
> summary(chd55)

Call:
glm(formula = chd ~ Over55, family = "binomial")

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.734  -0.847  -0.847   0.709   1.549

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.8408     0.2551  -3.296  0.00098 ***
Over55        2.0935     0.5285   3.961 7.46e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 136.66 on 99 degrees of freedom
Residual deviance: 117.96 on 98 degrees of freedom
AIC: 121.96

Number of Fisher Scoring iterations: 4

How do we measure discrepancy between observed and fitted values?

In OLS regression with a continuous response we used

$$RSS = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} \left(y_i - \hat\eta^T u_i\right)^2 = \sum_{i=1}^{n} \left(y_i - (\hat\eta_0 + \hat\eta_1 u_{1i} + \cdots + \hat\eta_k u_{ki})\right)^2$$

In logistic regression modeling we can use the deviance (typically denoted D or G²), which is defined as

$$D = 2 \ln\left[\frac{\text{likelihood of saturated model}}{\text{likelihood of fitted model}}\right]$$

$$D = 2 \sum_{i=1}^{n} \left[\, y_i \ln\left(\frac{y_i}{\hat\theta(x_i)}\right) + (1 - y_i) \ln\left(\frac{1 - y_i}{1 - \hat\theta(x_i)}\right) \right]$$

Because the likelihood function of the saturated model is equal to 1 when the response (y_i) is 0 or 1, the deviance reduces to:

$$D = -2 \ln(\text{likelihood of the fitted model}) = -2 \sum_{i=1}^{n} \left[\, y_i \ln(\hat\theta(x_i)) + (1 - y_i) \ln(1 - \hat\theta(x_i)) \,\right]$$

The deviance can be used to compare two potential models where one model is nested within the other by using the “General Chi-Square Test” for comparing rival logistic regression models. We will see more applications of this when we discuss multiple logistic regression and model development; however, we will demonstrate this process below when considering a single predictor x.
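To make the reduced form of the deviance concrete, the sketch below evaluates D = -2 ln(likelihood of the fitted model) for two nested models of the Over55 data. It is written in Python rather than SAS, JMP, or R, uses the counts from the table(chd, Over55) output in this handout, and relies on the fact that with a single 0/1 predictor the fitted probabilities equal the observed group proportions. The function and variable names are illustrative.

```python
import math

# (number with CHD = 1, number with CHD = 0) for each Over55 group,
# copied from the table(chd, Over55) output in this handout.
groups = {"Over55 = 0": (22, 51), "Over55 = 1": (21, 6)}

def deviance(groups, fitted):
    """D = -2 * sum[y ln(theta_hat) + (1 - y) ln(1 - theta_hat)] for 0/1 responses."""
    log_lik = 0.0
    for name, (ones, zeros) in groups.items():
        p = fitted[name]
        log_lik += ones * math.log(p) + zeros * math.log(1.0 - p)
    return -2.0 * log_lik

# Null model: one common probability for everyone.
total_ones = sum(ones for ones, _ in groups.values())
total_n = sum(ones + zeros for ones, zeros in groups.values())
null_fit = {name: total_ones / total_n for name in groups}

# Model with Over55: fitted probability is the observed proportion in each group.
over55_fit = {name: ones / (ones + zeros) for name, (ones, zeros) in groups.items()}

null_deviance = deviance(groups, null_fit)
residual_deviance = deviance(groups, over55_fit)
print(round(null_deviance, 2), round(residual_deviance, 2))
```

These values match the Null deviance (136.66) and Residual deviance (117.96) lines of the R summary for chd ~ Over55 reported in this handout.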

The general nested model concept:

General Chi-Square Test

Consider comparing two rival models, where the alternative hypothesis model contains additional terms:

$$H_0: \log\left(\frac{\theta(x)}{1 - \theta(x)}\right) = \eta_1^T x_1$$

$$H_1: \log\left(\frac{\theta(x)}{1 - \theta(x)}\right) = \eta_1^T x_1 + \eta_2^T x_2$$

General Chi-Square Statistic

$$\chi^2 = (\text{residual deviance of reduced model}) - (\text{residual deviance of full model})$$

$$= D(\text{for model without the terms in } x_2) - D(\text{for model with the terms in } x_2) \sim \chi^2_{\Delta df}$$

If the full model is needed, χ² is BIG and the associated p-value = P(χ²_Δdf > χ²) is small.

Example: CHD and Age ~ a single numeric predictor

Ho:                (reduced model OK)
H1:                (full model needed)

From JMP

From R

> summary(chd.glm)
Call:
glm(formula = chd ~ Age, family = "binomial")

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.9718 -0.8456 -0.4576  0.8253  2.2859

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
Age          0.11092    0.02406   4.610 4.02e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance: 136.66 on 99 degrees of freedom
Residual deviance: 107.35 on 98 degrees of freedom
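The General Chi-Square Test for this example can be carried out from the two deviances in the R output. The following is a minimal sketch in Python (not one of the handout's packages). The helper `chi_square_upper_tail_1df` is a hypothetical name, and it uses the identity P(χ²₁ > x) = P(|Z| > √x), so it only applies when Δdf = 1.

```python
import math

# Deviances copied from the R output for chd ~ Age above.
null_deviance = 136.66       # reduced model (intercept only), 99 df
residual_deviance = 107.35   # full model (intercept + Age), 98 df

chi_square = null_deviance - residual_deviance
delta_df = 99 - 98

def chi_square_upper_tail_1df(x):
    """P(chi-square with 1 df > x); valid for 1 df only, via the normal distribution."""
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi_square_upper_tail_1df(chi_square)
print(round(chi_square, 2), delta_df, p_value)
```

The statistic is 29.31 on 1 degree of freedom, and the p-value is far below any conventional significance level, which points toward the full model being needed.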


Statistical Inference for the Logistic Regression Model (In SAS)

First, consider the following output from PROC LOGISTIC:

proc logistic descending;
  model CHD = age / link=logit;
  output out=get_values predicted=predicted_probabilities;
run;

All of these statistics are testing the same null hypothesis:

Ho: all explanatory variables in the model have coefficients of zero.
Ha: at least one explanatory variable in the model has a coefficient different from zero.

The Likelihood Ratio test compares the log-likelihood for the fitted model with the log-likelihood for a model with NO explanatory variables. PROC LOGISTIC reports -2×log-likelihood for each of these models, and the chi-square test statistic is the difference of these two numbers. Note that the df = 1 corresponds to the one independent variable in the model.

The Score statistic is a function of the first and second derivatives of the log-likelihood function under the null hypothesis. There is some evidence that this test does not perform as well as the likelihood ratio test for small samples.

The Wald statistic is an approximation that is more accurate with larger sample sizes.


Hypothesis Testing For Individual Coefficients

Ho: ηi = 0
Ha: ηi ≠ 0

When the sample size is large, the test for significance of the “slope” parameter (ηi) can be calculated as follows:

$$z = \frac{\hat\eta_i}{SE(\hat\eta_i)} =$$

$$\chi^2 = z^2 =$$

Confidence Intervals for Coefficients and Corresponding Odds Ratios

A 100(1−α)% confidence interval for ηi can be calculated as follows:

$$\hat\eta_i \pm z_{1-\alpha/2}\, SE(\hat\eta_i)$$

A 100(1−α)% confidence interval for the odds ratio associated with ηi is calculated as follows:

$$\exp\left(\hat\eta_i \pm z_{1-\alpha/2}\, SE(\hat\eta_i)\right)$$

These intervals can be calculated in SAS PROC LOGISTIC as follows:

proc logistic descending;
  model CHD = age / link=logit clparm=wald;
run;

If ηi corresponds to a continuous predictor and we wish to examine the odds ratio associated with a c unit increase, the confidence interval for the odds ratio becomes

$$\exp\left(c\,\hat\eta_i \pm z_{1-\alpha/2}\, c\, SE(\hat\eta_i)\right)$$

Example: Find the odds ratio for CHD associated with a 10 year increase in age, and give a 95% confidence interval based on this estimate.

proc logistic descending;
  model CHD = age / link=logit clparm=wald clodds=wald;
  units age=10;
run;

The preceding intervals are all known as Wald intervals (based on normal-theory methods). These may not be appropriate for small samples; therefore, you may want to consider another method called the Profile Likelihood method. This involves an iterative evaluation of the likelihood function and produces intervals that may not be symmetric around the estimate.

proc logistic descending;
  model CHD = age / link=logit clparm=pl;
run;

Questions:

1. How do these compare to the Wald confidence intervals?

2. How would you find the profile likelihood confidence interval for the odds ratio?


Statistics Measuring Predictive Power

Once again, consider the model using the continuous variable age to predict CHD:

proc logistic descending;
  model CHD = age / link=logit;
  output out=get_values predicted=predicted_probabilities;
run;

Recall that the p-values shown above are used to test the usefulness of the logistic regression model. We can also consider a few other statistics to investigate the model’s predictive power:

Generalized R²

This is calculated as follows:

$$1 - \exp\left[-\frac{\text{Likelihood Ratio Chi-Square}}{n}\right] =$$

You can also request this quantity from SAS:

proc logistic descending;
  model CHD = age / link=logit rsq;
run;

Note that the upper bound of the generalized R² is less than 1. Therefore, PROC LOGISTIC also reports a quantity labeled the “Max-rescaled R-Square,” which divides the original generalized R² by its upper bound.
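Both quantities can be reproduced from the deviances in the R output for chd ~ Age reported in this handout. The sketch below is in Python rather than SAS. It assumes that the likelihood ratio chi-square is the difference between the null and residual deviances, and that the upper bound of the generalized R² is 1 − exp(−(null deviance)/n), which holds when the null deviance equals −2 ln L for the intercept-only model, as it does for 0/1 responses.

```python
import math

# Values copied from the R output for chd ~ Age in this handout.
null_deviance = 136.66      # -2 ln L for the intercept-only model
residual_deviance = 107.35  # -2 ln L for the model with Age
n = 100                     # number of observations

lr_chi_square = null_deviance - residual_deviance

# Generalized R-square: 1 - exp(-LR / n).
generalized_r2 = 1.0 - math.exp(-lr_chi_square / n)

# Assumed upper bound of the generalized R-square: 1 - exp(-null deviance / n).
max_r2 = 1.0 - math.exp(-null_deviance / n)

max_rescaled_r2 = generalized_r2 / max_r2
print(round(generalized_r2, 4), round(max_rescaled_r2, 4))
```

With these inputs the generalized R² is about 0.254 and the max-rescaled value is about 0.341.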

Ordinal Measures of Association

SAS PROC LOGISTIC also reports the following statistics:


The idea behind these statistics is as follows. For the 100 observations in the data set, there exist 100×(99)/2 = 4,950 different ways to pair them up (without pairing an observation with itself). Of these pairs, 2,499 have either both 1s or both 0s for an observed response. These are ignored, leaving 2,451 pairs in which one case has a 0 and the other case has a 1. For these pairs, SAS determines whether the observation with a 1 has a higher predicted value (based on the model) than does the observation with a 0. If this is the case, the pair is called concordant. If not, the pair is discordant.

Let
C = the number of concordant pairs =
D = the number of discordant pairs =
T = the number of ties =
N = the total number of pairs (before eliminating any) =

The four measures of association are given as

1. Somers’ D = $\frac{C - D}{C + D + T}$

2. Gamma = $\frac{C - D}{C + D}$

3. Tau-a = $\frac{C - D}{N}$

4. C = .5×(1 + Somers’ D)
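The four formulas in the list above translate directly into code. The sketch below is in Python rather than SAS. The counts passed in are hypothetical, chosen only so that they add up to the 2,451 usable pairs; they are NOT the values SAS reports for the CHD data, and the function name is illustrative.

```python
def association_measures(concordant, discordant, ties, total_pairs):
    """Return Somers' D, Gamma, Tau-a and c using the formulas in the list above."""
    somers_d = (concordant - discordant) / (concordant + discordant + ties)
    gamma = (concordant - discordant) / (concordant + discordant)
    tau_a = (concordant - discordant) / total_pairs
    c_stat = 0.5 * (1.0 + somers_d)
    return {"somers_d": somers_d, "gamma": gamma, "tau_a": tau_a, "c": c_stat}

# Hypothetical counts for illustration only (NOT the SAS results for the CHD data).
# They sum to the 2,451 usable pairs; 4,950 is the total number of pairs.
measures = association_measures(concordant=1900, discordant=500, ties=51, total_pairs=4950)
for name, value in measures.items():
    print(name, round(value, 3))
```

Because ties enter only the denominator of Somers' D, Gamma is never smaller in absolute value than Somers' D, and Tau-a is the smallest of the three since it divides by all pairs.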

For a model with any predictive value, all four measures fall between 0 and 1, with large values corresponding to stronger associations between the predicted and observed values. (In general, Somers’ D, Gamma, and Tau-a can be negative when discordant pairs outnumber concordant pairs.) Finally, note that the measure known as C has another familiar interpretation. Consider the following programming statements.

ods html;
ods graphics on;

proc logistic data=CHD descending;
  model CHD = age / link=logit outroc=roc_data;
run;

ods graphics off;
ods html close;

proc print data=roc_data;
run;

These statements request the following output.


The ROC curve is obtained by changing the classification rule based on the estimated probability. Note that the area under the ROC curve is the same as C.
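The equality between the area under the ROC curve and C can be checked on a small example. The sketch below is in Python rather than SAS; the predicted probabilities and responses are hypothetical, made up only for illustration, and the function names are not from any package. It builds the ROC curve by moving the classification cutpoint, integrates it with the trapezoid rule, and compares the result with C computed from concordant and tied pairs.

```python
# Hypothetical (predicted probability, observed response) pairs, for illustration only.
data = [(0.10, 0), (0.20, 0), (0.35, 1), (0.40, 0), (0.55, 1), (0.55, 0), (0.70, 1), (0.90, 1)]

def roc_points(data):
    """(false positive rate, true positive rate) per cutpoint: classify as 1 when p >= cut."""
    positives = sum(1 for _, y in data if y == 1)
    negatives = len(data) - positives
    points = [(0.0, 0.0)]
    for cut in sorted({p for p, _ in data}, reverse=True):
        tp = sum(1 for p, y in data if p >= cut and y == 1)
        fp = sum(1 for p, y in data if p >= cut and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def c_statistic(data):
    """(concordant + 0.5 * tied) / (number of 0-1 pairs), i.e. .5 * (1 + Somers' D)."""
    concordant = tied = pairs = 0
    for p1, y1 in data:
        for p0, y0 in data:
            if y1 == 1 and y0 == 0:
                pairs += 1
                if p1 > p0:
                    concordant += 1
                elif p1 == p0:
                    tied += 1
    return (concordant + 0.5 * tied) / pairs

auc = area_under_curve(roc_points(data))
c = c_statistic(data)
print(round(auc, 4), round(c, 4))
```

For these made-up values both calculations give the same number, which is the relationship stated above.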

More Analysis in JMP – Logistic Regression with a Single Numeric Predictor

Estimated Odds Ratios

ROC Curve and Table

OPTIONS FOR LOGISTIC REGRESSION

Likelihood Ratio Tests – same as in SAS

Wald Tests – normal-theory based

Confidence Intervals – gives CI’s for population parameters in the model.

Odds Ratios – gives the odds ratio associated with a unit increase in x (i.e. c = 1) and the odds ratio associated with being at the maximum of x vs. the minimum of x.

ROC Curve – if we use θ(x~) = P(CHD|x~) to construct a rule for classifying a patient as having CHD vs. No CHD, this option gives the ROC curve coming from all possible cutpoints based on this estimated probability.

By changing the classification rule based on estimated probability we can obtain an ROC curve.


Analysis in R – Logistic Regression with Single Numeric Predictor

> CHD <- read.table(file.choose(),header=T)
> CHD
    agegrp age chd
1        1  20   0
2        1  23   0
3        1  24   0
4        1  25   0
5        1  25   1
.        .   .   .
.        .   .   .
.        .   .   .
96       8  63   1
97       8  64   0
98       8  64   1
99       8  65   1
100      8  69   1

> names(CHD)
[1] "agegrp" "age" "chd"
> attach(CHD)

Make sure that you specify family="binomial", or R will perform ordinary least squares.

> chd <- factor(chd)
> chd.glm <- glm(chd~age,family="binomial")
> summary(chd.glm)

Call:
glm(formula = chd ~ age, family = "binomial")

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.9718 -0.8456 -0.4576  0.8253  2.2859

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.30945    1.13263  -4.688 2.76e-06 ***
age          0.11092    0.02404   4.614 3.95e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 136.66 on 99 degrees of freedom
Residual deviance: 107.35 on 98 degrees of freedom
AIC: 111.35

Number of Fisher Scoring iterations: 3

> probCHD <- exp(-5.30945 + .11092*age)/(1+exp(-5.30945 + .11092*age))

> plot(age,probCHD,type="b",ylab="P(CHD|Age)",xlab="Age")

An easier way to obtain the estimated probabilities is to extract them from the model object.

> probCHD <- fitted(chd.glm)
> plot(age,probCHD,type="b",ylab="P(CHD|Age)") # This produces plot above

We can obtain the estimated logit (L_i = η0 + η1 Age, evaluated at the estimates) by using the predict command.

> chd.logit = predict(chd.glm)
> plot(age,chd.logit,type="b",ylab="L = no + n1*Age")
> title(main="Plot of Estimated Logit vs. Age")

$$\hat P(\text{CHD} \mid \text{Age}) = \frac{e^{\hat\eta_0 + \hat\eta_1 \text{Age}}}{1 + e^{\hat\eta_0 + \hat\eta_1 \text{Age}}}, \qquad \hat\eta_0 = -5.310, \quad \hat\eta_1 = .11092$$
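The relationship between the estimated logit and the estimated probability can be checked for individual ages. The sketch below is in Python rather than R; it plugs in the coefficient estimates from the R output for chd ~ age reported in this handout, and the function names are illustrative.

```python
import math

# Estimates copied from the R output for chd ~ age in this handout.
eta0_hat = -5.30945
eta1_hat = 0.11092

def estimated_logit(age):
    """Estimated logit L = eta0_hat + eta1_hat * Age."""
    return eta0_hat + eta1_hat * age

def estimated_probability(age):
    """Estimated P(CHD | Age) from the fitted logistic mean function."""
    logit = estimated_logit(age)
    return math.exp(logit) / (1.0 + math.exp(logit))

for age in (40, 60):
    print(age, round(estimated_logit(age), 3), round(estimated_probability(age), 3))
```

The logit is linear in Age, while the probability follows the S-shaped curve: about 0.29 at age 40 and about 0.79 at age 60 under these estimates.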


Multiple Logistic Regression

The multiple logistic mean function has the basic form,

$$\ln\left(\frac{\theta(\tilde{x})}{1 - \theta(\tilde{x})}\right) = \eta_0 + \eta_1 u_1 + \eta_2 u_2 + \dots + \eta_{k-1} u_{k-1}$$

where the u_i are terms based on the x_j’s.

What are terms?


Terms (cont’d)


EXAMPLE 1: The data in the file OC_Use.sas are from a case-control study comparing the use of oral contraceptives and the occurrence of myocardial infarctions. Subjects were also classified into one of five age groups.

data OCUse;
  input AgeGrp$ Status$ OCuse$ count;
  datalines;
1 Case Yes 4
1 Case No 2
1 Control Yes 62
1 Control No 224
2 Case Yes 9
2 Case No 12
2 Control Yes 33
2 Control No 390
3 Case Yes 4
3 Case No 33
3 Control Yes 26
3 Control No 330
4 Case Yes 6
4 Case No 65
4 Control Yes 9
4 Control No 362
5 Case Yes 6
5 Case No 93
5 Control Yes 5
5 Control No 301
;

In Handout 14, we discussed using the Cochran-Mantel-Haenszel test for controlling for a single categorical covariate while assessing the association between two other variables. We could apply this test to these data in order to “adjust” for age group when examining the relationship between oral contraceptive use and disease status:

proc freq order=data;
tables AgeGrp*status*OCuse / cmh all;
weight count;
run;

Question: What do you conclude from this test?


The “/ all” option gives us the odds ratio for having myocardial infarction associated with oral contraceptive use for each age group:

Age Group    Odds Ratio
1
2
3
4
5

Recall that we can test for a difference in these odds ratios:

Finally, the CMH methods also provide us with a common odds ratio:


You can think of this as an estimate of the odds ratio for having myocardial infarction associated with oral contraceptive use after controlling for age group.
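For readers who want to see where the common odds ratio comes from, here is a minimal R sketch that computes the stratum-specific odds ratios and the Mantel-Haenszel common odds ratio directly from the counts in the OCUse data step. The variable names are my own, and the hand formula is used instead of a packaged routine so each step is visible; the R section later in this handout recalls that the CMH procedure gave 3.97.

```r
# Case-control counts from the OCUse data step, one entry per age group 1..5.
case_oc    <- c(4, 9, 4, 6, 6)            # cases who used oral contraceptives
case_no    <- c(2, 12, 33, 65, 93)        # cases who did not
control_oc <- c(62, 33, 26, 9, 5)         # controls who used oral contraceptives
control_no <- c(224, 390, 330, 362, 301)  # controls who did not

n <- case_oc + case_no + control_oc + control_no   # stratum sizes

# Odds ratio for MI associated with OC use within each age group.
or_by_age <- (case_oc * control_no) / (case_no * control_oc)

# Mantel-Haenszel common odds ratio across the five age groups.
or_mh <- sum(case_oc * control_no / n) / sum(case_no * control_oc / n)

print(round(or_by_age, 2))
print(round(or_mh, 2))
```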

Fitting Multiple Logistic Regression Models in SAS

Logistic regression methods also provide us with a method for controlling for confounding variables. Note that we can use a multiple logistic regression model to predict the probability of myocardial infarction based on oral contraceptive use. Moreover, we can add age group to the model in order to adjust for age.

First, we must make our binary response variable numeric in order to use PROC LOGISTIC:

data OCuse2;
set OCuse;
if Status="Case" then MI = 1; else MI = 0;
run;

We can leave the two predictor variables (OC use and Age group) in categorical format; however, we must place these variable names in the ‘class’ statement in PROC LOGISTIC:

proc logistic descending;
class OCUse agegrp;
model MI = OCUse agegrp;
weight count;
run;


Questions:

1. Is the logistic regression model useful? Explain.

2. Why does SAS report four coefficients for Age Group?

The Multiple Logistic Mean Function

The multiple logistic regression model for this example is parameterized as follows:

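$$E(MI \mid \tilde{x} = (\mathrm{OCuse}, \mathrm{AgeGrp})) = \theta(\tilde{x}) = \frac{\exp(\eta_0 + \eta_1 \mathrm{OCuse} + \eta_2 \mathrm{Age}_1 + \eta_3 \mathrm{Age}_2 + \eta_4 \mathrm{Age}_3 + \eta_5 \mathrm{Age}_4)}{1 + \exp(\eta_0 + \eta_1 \mathrm{OCuse} + \eta_2 \mathrm{Age}_1 + \eta_3 \mathrm{Age}_2 + \eta_4 \mathrm{Age}_3 + \eta_5 \mathrm{Age}_4)}$$

Also, recall that

$$\ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_1 \mathrm{OCuse} + \eta_2 \mathrm{Age}_1 + \eta_3 \mathrm{Age}_2 + \eta_4 \mathrm{Age}_3 + \eta_5 \mathrm{Age}_4$$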

Note that since Age Group has five levels, its definition requires four dummy (or indicator) variables. SAS lists the value of these dummy variables in the PROC LOGISTIC output:


This method is known as “effects coding” (the reference group is identified by -1):

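$$\mathrm{Age}_1 = \begin{cases} 1 & \text{if Age group 1} \\ -1 & \text{if Age group 5} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{Age}_2 = \begin{cases} 1 & \text{if Age group 2} \\ -1 & \text{if Age group 5} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{Age}_3 = \begin{cases} 1 & \text{if Age group 3} \\ -1 & \text{if Age group 5} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{Age}_4 = \begin{cases} 1 & \text{if Age group 4} \\ -1 & \text{if Age group 5} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{OCuse} = \begin{cases} 1 & \text{if non-user} \\ -1 & \text{if user} \end{cases}$$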

PROC LOGISTIC reports the following parameter estimates and odds ratios:

Questions:

1. How does SAS calculate the odds ratio for MI associated with not using oral contraceptives for those in Age group 1?

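$$\ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_1 \mathrm{OCuse} + \eta_2 \mathrm{Age}_1 + \eta_3 \mathrm{Age}_2 + \eta_4 \mathrm{Age}_3 + \eta_5 \mathrm{Age}_4$$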


2. How does SAS calculate the odds ratio for MI associated with not using oral contraceptives for those in Age group 2?

3. How does SAS calculate the odds ratio for MI associated with being in Age group 1 versus Age group 5, adjusted for oral contraceptive use?
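One way to approach these questions is to subtract the logits of the two groups being compared, using the effects coding defined earlier (1 for non-users and -1 for users; -1 on every age indicator for Age group 5). The worked algebra below is a sketch under that coding. Because the model has no OC-by-age interaction, the first result does not depend on the age group, which is why questions 1 and 2 lead to the same quantity.

```latex
\begin{align*}
\ln(\text{odds} \mid \text{non-user}) - \ln(\text{odds} \mid \text{user})
  &= \eta_1(1) - \eta_1(-1) = 2\eta_1
  &&\Rightarrow\quad OR_{\text{non-user vs. user}} = e^{2\eta_1} \\
\ln(\text{odds} \mid \text{Age group 1}) - \ln(\text{odds} \mid \text{Age group 5})
  &= \eta_2 - (-\eta_2 - \eta_3 - \eta_4 - \eta_5)
   = 2\eta_2 + \eta_3 + \eta_4 + \eta_5
  &&\Rightarrow\quad OR_{\text{1 vs. 5}} = e^{2\eta_2 + \eta_3 + \eta_4 + \eta_5}
\end{align*}
```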

Reordering the Factors

To examine the effects of OC use and Age Group, we may want to “reorder” the levels of both variables. That is, we may want to use the non-OC users as the reference group. Also, we may want to use the youngest age group as our baseline.

PROC LOGISTIC allows you to specify a reference group in the class statement:

proc logistic descending;
class OCUse(param=ref ref='No') agegrp(param=ref ref='1');
model MI = OCUse agegrp;
weight count;
run;


Note that the indicator variables are now defined using a method known as “dummy coding”:

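$$\mathrm{Age}_2 = \begin{cases} 1 & \text{if Age group 2} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{Age}_3 = \begin{cases} 1 & \text{if Age group 3} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{Age}_4 = \begin{cases} 1 & \text{if Age group 4} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{Age}_5 = \begin{cases} 1 & \text{if Age group 5} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{OCuse} = \begin{cases} 0 & \text{if non-user} \\ 1 & \text{if user} \end{cases}$$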
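The two coding schemes can be compared side by side in R, whose built-in contrast functions `contr.sum()` and `contr.treatment()` produce the same patterns of 1, 0 and -1 for a five-level factor. Treating these as the counterparts of the effects and reference parameterizations in PROC LOGISTIC is my reading, not something the handout states, and the row labels are added only for readability.

```r
age_levels <- paste("Age group", 1:5)

# Effects coding: the last level (Age group 5) is coded -1 in every column.
effects_coding <- contr.sum(5)
rownames(effects_coding) <- age_levels

# Dummy (reference) coding: Age group 1 is the reference row of zeros.
dummy_coding <- contr.treatment(5, base = 1)
rownames(dummy_coding) <- age_levels

print(effects_coding)
print(dummy_coding)
```

Each matrix has four columns, one per indicator variable, which is why four age coefficients are reported under either scheme.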

Questions:

1. Suppose we want to find the odds ratio associated with being in Age group 5 when compared to being in Age group 1 after adjusting for oral contraceptive use.

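$$\ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_1 \mathrm{OCuse} + \eta_2 \mathrm{Age}_2 + \eta_3 \mathrm{Age}_3 + \eta_4 \mathrm{Age}_4 + \eta_5 \mathrm{Age}_5$$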

2. Find the odds ratio associated with oral contraceptive use ADJUSTED for age. How does this compare to the CMH estimate?

3. Find a 95% confidence interval for this odds ratio.


Note that PROC LOGISTIC returns these odds ratios and their confidence intervals:

4. Interpret the age effect in terms of odds ratios after adjusting for OC use.

5. How would you compare OC users in Age group 5 to non-OC users in Age group 1?

6. How would you compare OC users in Age group 4 to non-OC users in Age group 3?


Example 1 in JMP

To fit a logistic regression model using OC use and Age group as covariates in JMP select Analyze > Fit Model and place both Age and OC use in the Construct Model Effects box and Case-Control status as the response as shown below:


Then click Run to fit the model for these data. The resulting output is shown below.

The output shows that the odds are given for MI, which is the response category of interest here. Had it read No MI/MI, we would want to recode the response so that MI was the response value of interest.

OC Use = No is the category of interest and OC Use = Yes is being used as the reference group, i.e., the denominator odds in the odds ratio. This is not what we want. We want to find the OR for having an MI associated with OC Use = Yes, i.e., the risk associated with oral contraceptive use. We can achieve this by recoding OC Use so that OC Use = Yes is the category of interest, using the Value Ordering option in Column Info. This process is shown below.


Repeating the model fit above we obtain the results shown below. I have selected the Wald Tests and the Odds Ratio options from the Nominal Logistic Fit pull-down menu.

Interpretation of the results from JMP:

Highlight Yes and click Move Up so Yes is at the top of the list. This will make OC Use = Yes the response value of interest in terms of computing the odds ratio for MI associated with OC Use.


EXAMPLE 2: Consider the data found in the file Lowbirth.JMP. These data are from a study to identify potential risk factors for low birth weight.  A random sample of new mothers was taken and the following variables were recorded:

Low = birth weight less than 2500 grams (Y or N)
Prev = previous history of premature labor (History or None)
Hyper = hypertension during pregnancy (HT or Normal)
Smoke = smoked during pregnancy (Cig or No Cig)
Uterine = uterine irritability during pregnancy (Irritation or None)
Minority = minority status of mother (White or Nonwhite)
Age = mother’s age in years (yrs.)
Lwt = mother’s weight at last menstrual cycle (lbs.)

Let’s begin by fitting a model with all predictors in JMP using Analyze > Fit Model.

Questions:

1. Is the overall model useful? Explain.

2. Are all predictors significant in the model? Explain.


Comparing Models with the Likelihood Ratio Test

We can fit the reduced model eliminating those terms that are not significant and then test whether the reduced model is adequate.

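$$H_0: \ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_2 \mathrm{Lwt} + \eta_3 \mathrm{Minority} + \eta_4 \mathrm{Smoke} + \eta_5 \mathrm{Prev} + \eta_6 \mathrm{Hyper}$$

$$H_a: \ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_1 \mathrm{Age} + \eta_2 \mathrm{Lwt} + \eta_3 \mathrm{Minority} + \eta_4 \mathrm{Smoke} + \eta_5 \mathrm{Prev} + \eta_6 \mathrm{Hyper} + \eta_7 \mathrm{Uterine}$$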

The test statistic is given by

χ2 = (Residual Deviance of reduced model) - (Residual Deviance of full model)

Note: Residual Deviance = -2×log-likelihood

Under the null hypothesis, this test statistic follows the chi-square distribution with degrees of freedom equal to the change in degrees of freedom between the two competing models.

Fitting the Null Hypothesis Model: (Age and Uterine dropped from the model)

Fitting the Alternative Hypothesis Model:

Carrying out the test:


Residual Deviance for Null Hypothesis Model:

Residual Deviance for Alternative Hypothesis Model:

Test Statistic, χ2 =

df =

To find the p-value use R:

> 1 - pchisq(3.60666,df=2)
[1] 0.1647543

Conclusion:

Interpretation of Model Parameters for Reduced Model

Questions:

1. Find and interpret the odds ratio for low birth weight associated with being a minority.


2. Find and interpret the odds ratio for low birth weight associated with being a smoker.

3. Find and interpret the odds ratio for low birth weight associated with having hypertension.

4. Find and interpret the odds ratio for low birth weight associated with having a history of preterm labor.

5. Find and interpret the odds ratio for low birth weight associated with a 10 pound increase in pre-pregnancy weight.


Logistic Regression Diagnostics: Residuals and Influence Statistics

As in the case of ordinary least squares (OLS) regression, we need to be wary of cases that are poorly fit and those that may have excessive influence on our results.

Residuals

Pearson and Deviance residuals are useful in identifying observations that are not explained well by the model. Pearson residuals are components of the Pearson chi-square statistic and deviance residuals are components of the deviance.

Pearson Residual : The Pearson residual for the ith observation is defined by

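$$e_{\chi_i} = \frac{y_i - \hat{y}_i}{\sqrt{n_i \hat{\theta}(\tilde{x}_i)\left(1 - \hat{\theta}(\tilde{x}_i)\right)}}$$

where $\hat{y}_i = n_i \hat{\theta}(\tilde{x}_i)$ is the fitted count for the ith observation.

Note that Pearson’s chi-square statistic is the sum of the squared Pearson residuals.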

Deviance Residual : The deviance residual for the ith observation is defined by

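$$D_i = \operatorname{sgn}\left(y_i - n_i\hat{\theta}(\tilde{x}_i)\right) \cdot \left[ 2 y_i \ln\left(\frac{y_i}{n_i \hat{\theta}(\tilde{x}_i)}\right) + 2 (n_i - y_i) \ln\left(\frac{n_i - y_i}{n_i \left(1 - \hat{\theta}(\tilde{x}_i)\right)}\right) \right]^{1/2}$$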

Note that the deviance is the sum of squares of the deviance residuals.
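To make the two definitions concrete, the following sketch computes both kinds of residuals by hand for the grouped oral contraceptive data from Example 1 and compares them with what R's `residuals()` returns. The data frame `oc` and the object names are my own construction from the counts in the handout, so treat this as an illustration of the formulas only.

```r
# Grouped OC/MI data from Example 1 (rows alternate OC users and non-users).
oc <- data.frame(
  MI    = c(4, 2, 9, 12, 4, 33, 6, 65, 6, 93),
  NoMI  = c(62, 224, 33, 390, 26, 330, 9, 362, 5, 301),
  Age   = factor(rep(1:5, each = 2)),
  OCuse = factor(rep(c("Yes", "No"), times = 5), levels = c("No", "Yes"))
)
fit <- glm(cbind(MI, NoMI) ~ Age + OCuse, family = binomial, data = oc)

y     <- oc$MI                 # observed number of cases
n     <- oc$MI + oc$NoMI       # group size
theta <- fitted(fit)           # estimated P(MI) for each covariate pattern
y_hat <- n * theta             # fitted number of cases

# Pearson residual: (observed - fitted) / binomial standard deviation.
pearson_res <- (y - y_hat) / sqrt(n * theta * (1 - theta))

# Deviance residual. The log terms assume 0 < y < n in every row,
# which holds for these data.
deviance_res <- sign(y - y_hat) *
  sqrt(2 * y * log(y / y_hat) + 2 * (n - y) * log((n - y) / (n - y_hat)))

print(cbind(pearson_res, deviance_res))
print(c(pearson_chisq = sum(pearson_res^2), deviance = sum(deviance_res^2)))
```

The two sums printed at the end are the Pearson chi-square statistic and the residual deviance, respectively.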

Influence Statistics

These measures can be used to identify cases that are highly influential on the logistic regression estimates.

DFBETAS : For each parameter estimate, a DFBETAS diagnostic is calculated for each observation. This is the standardized difference in the parameter estimate due to deleting the observation, and it can be used to assess the effect of an individual observation on each estimated parameter of the fitted model. These measures are useful for detecting observations that are causing instability in the selected coefficients.

C and CBAR : These diagnostics provide scalar measures of the influence of individual observations on the regression estimates. They are based on the same idea as the Cook distance in linear regression theory.

DIFDEV and DIFCHISQ : These are diagnostics for detecting ill-fitted observations; in other words, observations that contribute heavily to the disagreement between the data and the predicted values of the fitted model. DIFDEV is the change in the deviance due to deleting an individual observation while DIFCHISQ is the change in the Pearson chi-square statistic for the same deletion.

In cases of both poor fit and high influence, it is good to look at the covariate values for these individuals to address the role they play in the analysis. In many cases there will be several individuals with the same covariate pattern, especially if most or all of the predictors are categorical in nature.
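A rough R counterpart to the DFBETAS diagnostic is the base function `dfbetas()`, which also accepts a fitted `glm` object. The sketch below applies it to the grouped oral contraceptive data from Example 1; the data frame is rebuilt from the counts in the handout, the object names are mine, and the values need not match SAS output exactly because the two programs may use different approximations.

```r
# Grouped OC/MI data from Example 1 (rows alternate OC users and non-users).
oc <- data.frame(
  MI    = c(4, 2, 9, 12, 4, 33, 6, 65, 6, 93),
  NoMI  = c(62, 224, 33, 390, 26, 330, 9, 362, 5, 301),
  Age   = factor(rep(1:5, each = 2)),
  OCuse = factor(rep(c("Yes", "No"), times = 5), levels = c("No", "Yes"))
)
fit <- glm(cbind(MI, NoMI) ~ Age + OCuse, family = binomial, data = oc)

# One row per covariate pattern, one column per coefficient.
db <- dfbetas(fit)
print(round(db, 3))

# Covariate pattern whose deletion would move the OC use coefficient the most.
most_influential <- which.max(abs(db[, "OCuseYes"]))
print(oc[most_influential, ])
```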

To obtain these measures from SAS PROC LOGISTIC, use the following code:

ods html;
ods graphics on;

*Reduced model;
proc logistic data=LowBirthWeight descending;
class MINORITY(param=ref ref='0') SMOKE(param=ref ref='0')
      PTL(param=ref ref='0') HTN(param=ref ref='0');
model LOW = WEIGHT MINORITY SMOKE PTL HTN / link=logit influence;
run;

ods graphics off;
ods html close;

SAS returns the following plots:


Logistic Regression in R

In this section of the notes we examine logistic regression in R. There are several functions that I wrote for plotting diagnostics similar to what SAS does, although the inspiration for them came from work Prof. Malone and I did for OLS as part of his senior project.

Example 1: Oral Contraceptive Use and Myocardial Infarctions

Set up a text file with the data in columns with variable names at the top. The case and control counts are in separate columns. The risk factor OC use and stratification variable Age follow.

> OCMI.data = read.table(file.choose(),header=T) # read in text file

> OCMI.data
   MI NoMI Age OCuse
1   4   62   1   Yes
2   2  224   1    No
3   9   33   2   Yes
4  12  390   2    No
5   4   26   3   Yes
6  33  330   3    No
7   6    9   4   Yes
8  65  362   4    No
9   6    5   5   Yes
10 93  301   5    No

> attach(OCMI.data)
> Age = factor(Age) # treat age group as categorical so that Age2, ..., Age5 are fit

> OC.glm <- glm(cbind(MI,NoMI)~Age+OCuse,family=binomial) # fit model

> summary(OC.glm)

Call:
glm(formula = cbind(MI, NoMI) ~ Age + OCuse, family = binomial)

Deviance Residuals:
[1]  0.456248 -0.520517  1.377693 -0.886710 -1.685521  0.714695 -0.130922  0.033643
[9] -0.045061  0.008822

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -4.3698     0.4347 -10.054  < 2e-16 ***
Age2          1.1384     0.4768   2.388   0.0170 *
Age3          1.9344     0.4582   4.221 2.43e-05 ***
Age4          2.6481     0.4496   5.889 3.88e-09 ***
Age5          3.1943     0.4474   7.140 9.36e-13 ***
OCuseYes      1.3852     0.2505   5.530 3.19e-08 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 158.0085  on 9  degrees of freedom
Residual deviance:   6.5355  on 4  degrees of freedom
AIC: 58.825


Number of Fisher Scoring iterations: 3

Find OR associated with oral contraceptive use ADJUSTED for age. Recall: CMH procedure gave 3.97.

> exp(1.3852)
[1] 3.995625

Find a 95% CI for OR associated with OC use.
> exp(1.3852-1.96*.2505)
[1] 2.445428
> exp(1.3852+1.96*.2505)
[1] 6.528518
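Since the same exponentiate-the-endpoints calculation recurs throughout these notes, it can be wrapped in a small helper. The function name `wald_or_ci` is hypothetical (it is not part of R or of this handout), and the multiplier 1.96 is kept so the results match the hand calculations above.

```r
# Hypothetical helper: Wald interval for an odds ratio, obtained by
# exponentiating estimate +/- z * SE computed on the log-odds scale.
wald_or_ci <- function(estimate, se, z = 1.96) {
  c(OR    = exp(estimate),
    lower = exp(estimate - z * se),
    upper = exp(estimate + z * se))
}

# Estimate and standard error from the OCuseYes row of summary(OC.glm).
oc_ci <- wald_or_ci(1.3852, 0.2505)
print(round(oc_ci, 3))

# With the fitted model object available, exp(confint.default(OC.glm))
# should give Wald intervals for every coefficient at once.
```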

Interpreting the age effect in terms of OR’s ADJUSTING for OC use. Note: The reference group is Age = 1 which was women 25 – 29 years of age.

> OC.glm$coefficients
(Intercept)        Age2        Age3        Age4        Age5    OCuseYes
  -4.369850    1.138363    1.934401    2.648059    3.194292    1.385176

> Age.coefs <- OC.glm$coefficients[2:5]
> exp(Age.coefs)
     Age2      Age3      Age4      Age5
 3.121653  6.919896 14.126585 24.392906

Find 95% CI for age = 5 group.

> exp(3.1943-1.96*.4474)
[1] 10.14921
> exp(3.1943+1.96*.4474)
[1] 58.62751

Example 2: Coffee Drinking and Myocardial Infarctions

> CoffeeMI.data = read.table(file.choose(),header=T)
> CoffeeMI.data
      Smoking Coffee MI NoMI
1       Never    > 5  7   31
2       Never    < 5 55  269
3      Former    > 5  7   18
4      Former    < 5 20  112
5   1-14 Cigs    > 5  7   24
6   1-14 Cigs    < 5 33  114
7  15-25 Cigs    > 5 40   45
8  15-25 Cigs    < 5 88  172
9  25-34 Cigs    > 5 34   24
10 25-34 Cigs    < 5 50   55
11 35-44 Cigs    > 5 27   24
12 35-44 Cigs    < 5 55   58
13   45+ Cigs    > 5 30   17
14   45+ Cigs    < 5 34   17

> attach(CoffeeMI.data)
> Coffee.glm = glm(cbind(MI,NoMI)~Smoking+Coffee,family=binomial)


> summary(Coffee.glm)

Call:
glm(formula = cbind(MI, NoMI) ~ Smoking + Coffee, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7650  -0.4510  -0.0232   0.2999   0.7917

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)        -1.2981     0.1819  -7.136 9.60e-13 ***
Smoking15-25 Cigs   0.6892     0.2119   3.253  0.00114 **
Smoking25-34 Cigs   1.2462     0.2398   5.197 2.02e-07 ***
Smoking35-44 Cigs   1.1988     0.2389   5.017 5.24e-07 ***
Smoking45+ Cigs     1.7811     0.2808   6.342 2.27e-10 ***
SmokingFormer      -0.3291     0.2778  -1.185  0.23616
SmokingNever       -0.3153     0.2279  -1.384  0.16646
Coffee> 5           0.3200     0.1377   2.324  0.02012 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 173.7899  on 13  degrees of freedom
Residual deviance:   3.7622  on  6  degrees of freedom
AIC: 84.311

Number of Fisher Scoring iterations: 3

OR for drinking 5 or more cups of coffee per day.
Note: CMH procedure gave OR = 1.375

> exp(.3200)
[1] 1.377128

95% CI for OR associated with heavy coffee drinking

> exp(.3200 - 1.96*.1377)
[1] 1.051385
> exp(.3200 + 1.96*.1377)
[1] 1.803794

Reordering a Factor

To examine the effect of smoking we might want to “reorder” the levels of smoking status so that individuals who have never smoked are used as the reference group. To do this in R you must do the following:

Smoking = factor(Smoking,levels=c("Never","Former","1-14 Cigs","15-25 Cigs","25-34 Cigs","35-44 Cigs","45+ Cigs"))

The first level specified in the levels argument will be used as the reference group, “Never” in this case. Refitting the model with the reordered smoking status factor gives the following:
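If only the reference group needs to change and the order of the remaining levels does not matter, base R's `relevel()` is a shorter alternative to listing every level. The toy vector below is made up for illustration; it is not the study data.

```r
# Toy smoking-status values (illustrative only).
smoking <- c("Never", "Former", "1-14 Cigs", "15-25 Cigs", "Never")

# By default factor() sorts the levels, so "Never" is not the first level
# and would not be used as the reference group.
f_default <- factor(smoking)
print(levels(f_default))

# relevel() moves the chosen level to the front and keeps the others in order.
f_releveled <- relevel(f_default, ref = "Never")
print(levels(f_releveled))
```

The default sorted ordering is why the first fit above used “1-14 Cigs” as the reference group.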


> Coffee.glm2 <- glm(cbind(MI,NoMI)~Smoking+Coffee,family=binomial)
> summary(Coffee.glm2)

Call:
glm(formula = cbind(MI, NoMI) ~ Smoking + Coffee, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7650  -0.4510  -0.0232   0.2999   0.7917

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.61344    0.14068 -11.469  < 2e-16 ***
SmokingFormer     -0.01376    0.25376  -0.054   0.9568
Smoking1-14 Cigs   0.31533    0.22789   1.384   0.1665
Smoking15-25 Cigs  1.00451    0.17976   5.588 2.30e-08 ***
Smoking25-34 Cigs  1.56150    0.21254   7.347 2.03e-13 ***
Smoking35-44 Cigs  1.51417    0.21132   7.165 7.77e-13 ***
Smoking45+ Cigs    2.09646    0.25855   8.108 5.13e-16 ***
Coffee> 5          0.31995    0.13766   2.324   0.0201 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 173.7899  on 13  degrees of freedom
Residual deviance:   3.7622  on  6  degrees of freedom
AIC: 84.311

Number of Fisher Scoring iterations: 3

Notice that “SmokingNever” is now absent from the output so we know it is being used as the reference group. The OR’s associated with the various levels of smoking are computed below.

> Smoke.coefs = Coffee.glm2$coefficients[2:7]
> exp(Smoke.coefs)
    SmokingFormer  Smoking1-14 Cigs Smoking15-25 Cigs Smoking25-34 Cigs
         0.986338          1.370715          2.730561          4.765984
Smoking35-44 Cigs   Smoking45+ Cigs
         4.545632          8.137279

Confidence intervals for each could be computed in the standard way.

Some Details for Categorical Predictors with More Than Two Levels


Consider the coffee drinking/MI study above. The stratification variable smoking has seven levels. Thus it requires six dummy variables to define it. The level that is not defined using a dichotomous dummy variable serves as the reference group. The table below shows the values of the dummy variables:

Level                     D2  D3  D4  D5  D6  D7
Never (Reference Group)    0   0   0   0   0   0
Former                     1   0   0   0   0   0
1 – 14 Cigs                0   1   0   0   0   0
15 – 24 Cigs               0   0   1   0   0   0
25 – 34 Cigs               0   0   0   1   0   0
35 – 44 Cigs               0   0   0   0   1   0
45+ Cigs                   0   0   0   0   0   1

Example: Coffee Drinking and Myocardial Infarctions

> CoffeeMI.data = read.table(file.choose(),header=T)
> CoffeeMI.data
      Smoking Coffee MI NoMI
1       Never    > 5  7   31
2       Never    < 5 55  269
3      Former    > 5  7   18
4      Former    < 5 20  112
5   1-14 Cigs    > 5  7   24
6   1-14 Cigs    < 5 33  114
7  15-25 Cigs    > 5 40   45
8  15-25 Cigs    < 5 88  172
9  25-34 Cigs    > 5 34   24
10 25-34 Cigs    < 5 50   55
11 35-44 Cigs    > 5 27   24
12 35-44 Cigs    < 5 55   58
13   45+ Cigs    > 5 30   17
14   45+ Cigs    < 5 34   17

The Logistic Model

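$$\ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_1 \mathrm{Coffee} + \eta_2 D_2 + \eta_3 D_3 + \eta_4 D_4 + \eta_5 D_5 + \eta_6 D_6 + \eta_7 D_7$$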

where Coffee is a dichotomous predictor equal to 1 if they drink 5 or more cups of coffee per day.

Comparing the log-odds of a heavy coffee drinker who smokes 15-25 cigarettes a day to those of a heavy coffee drinker who has never smoked, we have:


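$$\ln\left(\frac{\theta_1(\tilde{x})}{1-\theta_1(\tilde{x})}\right) = \eta_0 + \eta_1 + \eta_4$$

$$\ln\left(\frac{\theta_2(\tilde{x})}{1-\theta_2(\tilde{x})}\right) = \eta_0 + \eta_1$$

Taking the difference gives,

$$\ln\left(\frac{\theta_1(\tilde{x}) \big/ \left(1-\theta_1(\tilde{x})\right)}{\theta_2(\tilde{x}) \big/ \left(1-\theta_2(\tilde{x})\right)}\right) = \eta_4$$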

thus $e^{\eta_4}$ = the odds ratio associated with smoking 15-24 cigarettes per day when compared to individuals who have never smoked amongst heavy coffee drinkers. Because $\eta_1$ is not involved in the odds ratio, the result is the same for non-heavy coffee drinkers as well!

You can also consider combinations of factors, e.g. if we compared heavy coffee drinkers who smoked 15-24 cigarettes to non-heavy coffee drinkers who have never smoked, the associated OR would be given by $e^{\eta_1+\eta_4}$.

Using our fitted model, the odds ratios discussed above are computed as follows.

> summary(Coffee.glm2)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.61344    0.14068 -11.469  < 2e-16 ***
SmokingFormer     -0.01376    0.25376  -0.054   0.9568
Smoking1-14 Cigs   0.31533    0.22789   1.384   0.1665
Smoking15-25 Cigs  1.00451    0.17976   5.588 2.30e-08 ***
Smoking25-34 Cigs  1.56150    0.21254   7.347 2.03e-13 ***
Smoking35-44 Cigs  1.51417    0.21132   7.165 7.77e-13 ***
Smoking45+ Cigs    2.09646    0.25855   8.108 5.13e-16 ***
Coffee> 5          0.31995    0.13766   2.324   0.0201 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

OR for 15-24 cigarette smokers vs. never smokers (regardless of coffee drinking status)
> exp(1.00451)
[1] 2.730569


OR for 15-24 cigarette smokers who are also heavy coffee drinkers vs. non-smokers who are not heavy coffee drinkers
> exp(.31995 + 1.00451)
[1] 3.760154

Similar calculations could be done for other combinations of coffee and cigarette use.

Example 3: Risk Factors for Low Birth Weight

Response
Low = low birth weight, i.e. birth weight < 2500 grams (1 = yes, 0 = no)

Set of potential predictors
Prev = previous history of premature labor (1 = yes, 0 = no)
Hyper = hypertension during pregnancy (1 = yes, 0 = no)
Smoke = smoker (1 = yes, 0 = no)
Uterine = uterine irritability (1 = yes, 0 = no)
Minority = minority (1 = yes, 0 = no)
Age = mother’s age in years
Lwt = mother’s weight at last menstrual cycle

Analysis in R
> Lowbirth = read.table(file.choose(),header=T)
> Lowbirth[1:5,] # print first 5 rows of the data set
  Low Prev Hyper Smoke Uterine Minority Age Lwt race  bwt
1   0    0     0     0       1        1  19 182    2 2523
2   0    0     0     0       0        1  33 155    3 2551
3   0    0     0     1       0        0  20 105    1 2557
4   0    0     0     1       1        0  21 108    1 2594
5   0    0     0     1       1        0  18 107    1 2600

Make sure categorical variables are interpreted as factors by using the factor command
> attach(Lowbirth) # make the columns available by name, as in the earlier examples
> Low = factor(Low)
> Prev = factor(Prev)
> Hyper = factor(Hyper)
> Smoke = factor(Smoke)
> Uterine = factor(Uterine)
> Minority = factor(Minority)

Note: This is not really necessary for dichotomous variables that are coded (0,1).

Fit a preliminary model using all available covariates
> low.glm = glm(Low~Prev+Hyper+Smoke+Uterine+Minority+Age+Lwt,family=binomial)
> summary(low.glm)

Call:
glm(formula = Low ~ Prev + Hyper + Smoke + Uterine + Minority + Age + Lwt, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6010  -0.8149  -0.5128   1.0188   2.1977

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.378479   1.170627   0.323  0.74646
Prev1        1.196011   0.461534   2.591  0.00956 **
Hyper1       1.452236   0.652085   2.227  0.02594 *
Smoke1       0.959406   0.405302   2.367  0.01793 *
Uterine1     0.647498   0.466468   1.388  0.16511
Minority1    0.990929   0.404969   2.447  0.01441 *
Age         -0.043221   0.037493  -1.153  0.24900
Lwt         -0.012047   0.006422  -1.876  0.06066 .
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    Null deviance: 232.40  on 185  degrees of freedom
Residual deviance: 196.71  on 178  degrees of freedom
AIC: 212.71

Number of Fisher Scoring iterations: 3

It appears that both uterine irritability and mother’s age are not significant. We can fit the reduced model eliminating both terms and test whether the model is significantly degraded by using the general chi-square test (see the JMP example above).

> low.reduced = glm(Low~Prev+Hyper+Smoke+Minority+Lwt,family=binomial)
> summary(low.reduced)

Call:
glm(formula = Low ~ Prev + Hyper + Smoke + Minority + Lwt, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7277  -0.8219  -0.5368   0.9867   2.1517

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.261274   0.885803  -0.295  0.76803
Prev1        1.181940   0.444254   2.661  0.00780 **
Hyper1       1.397219   0.656271   2.129  0.03325 *
Smoke1       0.981849   0.398300   2.465  0.01370 *
Minority1    1.044804   0.394956   2.645  0.00816 **
Lwt         -0.014127   0.006387  -2.212  0.02697 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 232.40  on 185  degrees of freedom
Residual deviance: 200.32  on 180  degrees of freedom
AIC: 212.32

Number of Fisher Scoring iterations: 3


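Reduced Model:

$$H_0: \ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_2 \mathrm{Lwt} + \eta_3 \mathrm{Minority} + \eta_4 \mathrm{Smoke} + \eta_5 \mathrm{Prev} + \eta_6 \mathrm{Hyper}$$

Full Model:

$$\ln\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_1 \mathrm{Age} + \eta_2 \mathrm{Lwt} + \eta_3 \mathrm{Minority} + \eta_4 \mathrm{Smoke} + \eta_5 \mathrm{Prev} + \eta_6 \mathrm{Hyper} + \eta_7 \mathrm{Uterine}$$

* Recall: $\theta(\tilde{x}) = P(\mathrm{Low} = 1 \mid \tilde{X})$

Residual Deviance Null Hypothesis Model: $D_{H_0} = 200.32$, df = 180

Residual Deviance Alternative Hypothesis Model: $D_{H_1} = 196.71$, df = 178

General Chi-Square Test: $\chi^2 = D_{H_0} - D_{H_1} = 200.32 - 196.71 = 3.607$

$p\text{-value} = P(\chi^2_2 > 3.607) = 0.1647$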

We fail to reject the null hypothesis; the reduced model is adequate.
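The same test can be scripted from the two residual deviances and their degrees of freedom reported in the summaries. The sketch below uses the rounded deviances (200.32 and 196.71), so its p-value agrees with the handout's only to about three decimal places; the object names are my own.

```r
# Residual deviances and degrees of freedom from the two model summaries.
dev_reduced <- 200.32; df_reduced <- 180   # null hypothesis (reduced) model
dev_full    <- 196.71; df_full    <- 178   # alternative hypothesis (full) model

lr_stat <- dev_reduced - dev_full          # chi-square test statistic
lr_df   <- df_reduced - df_full            # number of terms dropped

# Upper-tail chi-square probability.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p.value = p_value))

# With both fitted objects in the workspace,
# anova(low.reduced, low.glm, test = "Chisq") should report the same comparison.
```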

Interpretation of Model Parameters

OR’s Associated with Categorical Predictors
> low.reduced

Call:  glm(formula = Low ~ Prev + Hyper + Smoke + Minority + Lwt, family = binomial)

Coefficients:
(Intercept)        Prev1       Hyper1       Smoke1    Minority1          Lwt
   -0.26127      1.18194      1.39722      0.98185      1.04480     -0.01413

Degrees of Freedom: 185 Total (i.e. Null);  180 Residual
Null Deviance:      232.4
Residual Deviance:  200.3       AIC: 212.3

Estimated OR’s
> exp(low.reduced$coefficients[2:5])
    Prev1    Hyper1    Smoke1 Minority1
 3.260693  4.043938  2.669388  2.842841

95% CI for OR Associated with History of Premature Labor (Wald Intervals)
> exp(1.182 - 1.96*.444)
[1] 1.365827
> exp(1.182 + 1.96*.444)
[1] 7.78532

Holding everything else constant we estimate that the odds of having an infant with low birth weight are between 1.366 and 7.785 times larger for mothers with a history of premature labor.

95% CI for OR Associated with Hypertension
> exp(1.397 - 1.96*.6563)
[1] 1.117006
> exp(1.397 + 1.96*.6563)
[1] 14.63401

Holding everything else constant we estimate that the odds of having an infant with low birth weight are between 1.117 and 14.63 times larger for mothers with hypertension during pregnancy.

95% CI for OR Associated with Smoking
> exp(.981849 - 1.96*.3983)
[1] 1.222846
> exp(.981849 + 1.96*.3983)
[1] 5.827086

Holding everything else constant we estimate that the odds of having an infant with low birth weight are between 1.223 and 5.827 times larger for mothers who smoked during pregnancy.

95% CI for OR Associated with Minority Status
> exp(1.0448 - 1.96*.3950)
[1] 1.310751
> exp(1.0448 + 1.96*.3950)
[1] 6.16569

Holding everything else constant we estimate that the odds of having an infant with low birth weight are between 1.311 and 6.166 times larger for non-white mothers.

OR Associated with Mother’s Weight at Last Menstrual Cycle

Because this is a continuous predictor with values over 100 we should use an increment larger than one when considering the effect of mother’s weight on birth weight. Here we will use an increment of c = 10 lbs., although certainly there are other possibilities.

> exp(-10*.014127) [1] 0.8682549

i.e. the estimated odds of low birth weight decrease by 13.2% for each additional 10 lbs. in premenstrual weight.

A 95% CI for this OR is:
> exp(10*(-.014127) - 1.96*10*.006387)
[1] 0.7660903
> exp(10*(-.014127) + 1.96*10*.006387)
[1] 0.9840439
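The same calculation can be written for any increment. This is a minimal sketch, assuming the coefficient and standard error for Lwt used above. The helper name or_increment and its argument incr are hypothetical.

```r
# Hypothetical helper: OR and Wald interval for an increase of `incr` units
# in a continuous predictor with coefficient b and standard error se.
or_increment <- function(b, se, incr = 1, z = 1.96) {
  c(OR    = exp(incr * b),
    lower = exp(incr * b - z * incr * se),
    upper = exp(incr * b + z * incr * se))
}

# Mother's weight: coefficient and SE as used in the commands above, 10 lb. steps
res10 <- or_increment(-0.014127, 0.006387, incr = 10)
print(round(res10, 4))

# Percent change in the odds per 10 lbs. (negative = decrease)
pct_change <- 100 * (res10[["OR"]] - 1)
print(round(pct_change, 1))
```

Changing incr (say, to 5 or 20 lbs.) rescales both the estimate and the interval on the log-odds scale before exponentiating.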

Create a sequence of weights from smallest observed weight to the largest observed weight by ½ pound increments.

> x = seq(min(Lwt),max(Lwt),.5)

Here I have set the other covariates as follows: previous history (1 = yes), hypertension (0 = no), smoking status (1 = yes), and minority (0 = no).


> fit = predict(low.reduced,
+   data.frame(Prev=factor(rep(1,length(x))),
+              Hyper=factor(rep(0,length(x))),
+              Smoke=factor(rep(1,length(x))),
+              Minority=factor(rep(0,length(x))),
+              Lwt=x),
+   type="response")

> plot(x,fit,xlab="Mother's Weight",ylab="P(Low|Prev=1,Smoke=1,Lwt)")

This is a plot of the effect of premenstrual weight for smoking mothers with a history of premature labor. Using the predict command above similar plots could be constructed by examining other combinations of the categorical predictors.


Case Diagnostics (Delta Deviance and Cook's Distance)

As in the case of ordinary least squares (OLS) regression we need to be wary of cases that may have unduly high influence on our results and those that are poorly fit. The most common influence measure is Cook's Distance and a good measure of poorly fit cases is the Delta Deviance.

Essentially, Cook's Distance ($\Delta\beta_{(-i)}$ or $D_i$) measures the change in the estimated parameters when the ith observation is deleted. This change is measured for each of the observations and can be plotted versus $\theta(\tilde{x})$ or observation number to aid in the identification of high influence cases. Several cut-offs have been proposed for Cook's Distance, the most common being to classify an observation as having large influence if $\Delta\beta_{(-i)} > 1$ or, in the case of large sample size n, $\Delta\beta_{(-i)} > 4/n$.

Cook’s Distance

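$$\Delta\beta_{(-i)} = \frac{1}{k}\left(\frac{e_{\chi_i}^{2}}{1-h_i}\right)\frac{h_i}{1-h_i}$$

where $e_{\chi_i} = \dfrac{y_i - \hat{y}_i}{\sqrt{n_i\,\theta(\tilde{x}_i)\,(1-\theta(\tilde{x}_i))}}$ is the Pearson's residual defined above.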

Delta deviance measures the change in the deviance (D) when the ith case is deleted. Values around 4 or larger are considered to indicate cases that are poorly fit.

These cases correspond to individuals where $y_i = 1$ but $\theta(\tilde{x})$ is small, or cases where $y_i = 0$ but $\theta(\tilde{x})$ is large.
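Both diagnostics can be computed directly from the residuals and leverages of a fitted model. The sketch below uses R's built-in mtcars data purely as a stand-in, since the low birth weight data are not reproduced here, and uses the delta deviance approximation d²/(1 − h) that appears in the handout's Diagplot.log function. For a binomial fit, the Cook's Distance computed this way should agree with R's own cooks.distance().

```r
# Illustration on the built-in mtcars data (NOT the low birth weight data)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

k  <- length(coef(fit))               # number of estimated coefficients
h  <- lm.influence(fit)$hat           # leverages h_i
pr <- resid(fit, type = "pearson")    # Pearson residuals
dr <- resid(fit, type = "deviance")   # deviance residuals

cooks <- (1 / k) * (pr^2 / (1 - h)) * (h / (1 - h))  # Cook's Distance
ddev  <- dr^2 / (1 - h)                              # Delta Deviance

# Flag cases using the cut-offs quoted above: 4/n for influence, 4 for poor fit
n <- length(h)
diag_table <- data.frame(cooks = cooks, ddev = ddev)
flagged <- diag_table[cooks > 4 / n | ddev > 4, ]
print(round(flagged, 3))
```

Looking up the covariate values of the flagged rows is then the natural next step, as described in the text.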

In cases of both high influence and poor fit it is good to look at the covariate values for these individuals and we can begin to address the role they play in the analysis. In many cases there will be several individuals with the same covariate pattern, especially if most or all of the predictors are categorical in nature.

> Diagplot.glm(low.reduced)


> Diagplot.log(low.reduced)

Cases 11 and 13 have the highest Cook's distances, although they are not that large. It should be noted that they are also somewhat poorly fit. Cases 129, 144, 152, and 180 appear to be poorly fit. The information on all of these cases is shown below.

> Lowbirth[c(11,13,129,144,152,180),]
    Low Prev Hyper Smoke Uterine Minority Age Lwt race  bwt
11    0    0     1     0       0        1  19  95    3 2722
13    0    0     1     0       0        1  22  95    3 2750
129   1    0     0     0       1        0  29 130    1 1021
144   1    0     0     0       1        1  21 200    2 1928
152   1    0     0     0       0        0  24 138    1 2100
180   1    0     0     1       0        0  26 190    1 2466


Case 152 had a low birth weight infant even in the absence of the identified potential risk factors. The fitted values for all four of the poorly fit cases are quite small.

> fitted(low.reduced)[c(11,13,129,144,152,180)]
        11         13        129        144        152        180
0.69818500 0.69818500 0.10930602 0.11486743 0.09877858 0.12307383

Cases 11 and 13 have high predicted probabilities despite the fact that they had babies with normal birth weight. Their relatively high leverage might come from the fact that there were very few hypertensive minority women in the study. These two facts combined lead to the relatively large Cook’s Distances for these two cases.

Plotting Estimated Conditional Probabilities $\hat{P}(\mathrm{Low}=1 \mid \tilde{x})$

A summary of the reduced model is given below:
> low.reduced

Call:  glm(formula = Low ~ Prev + Hyper + Smoke + Minority + Lwt, family = binomial)

Coefficients:
(Intercept)        Prev1       Hyper1       Smoke1    Minority1          Lwt
   -0.26127      1.18194      1.39722      0.98185      1.04480     -0.01413

Degrees of Freedom: 185 Total (i.e. Null);  180 Residual
Null Deviance:      232.4
Residual Deviance: 200.3        AIC: 212.3

To easily plot probabilities in R we can write a function that takes covariate values and computes the desired conditional probability.

> x <- seq(min(Lwt),max(Lwt),.5)

> PrLwt <- function(x,Prev,Hyper,Smoke,Minority) {
+   L <- -.26127 + 1.18194*Prev + 1.39722*Hyper + .98185*Smoke +
+     1.0448*Minority - .01413*x
+   exp(L)/(1 + exp(L))
+ }
> plot(x,PrLwt(x,1,1,1,1),xlab="Mother's Weight",ylab="P(Low=1|x)",
+      ylim=c(0,1),type="l")
> title(main="Plot of P(Low=1|X) vs. Mother's Weight")
> lines(x,PrLwt(x,0,0,0,0),lty=2,col="red")
> lines(x,PrLwt(x,1,1,0,0),lty=3,col="blue")
> lines(x,PrLwt(x,0,0,1,1),lty=4,col="green")
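As a quick sanity check, a function of this kind should reproduce the fitted values reported earlier for individual cases, up to rounding of the coefficients. The sketch below is an equivalent version that uses R's plogis() for the inverse logit. The name PrLwt2 is hypothetical, and the covariate values for cases 11 and 152 are read from the case listing shown earlier.

```r
# Same linear predictor as PrLwt, with plogis() as the inverse logit
PrLwt2 <- function(x, Prev, Hyper, Smoke, Minority) {
  L <- -.26127 + 1.18194*Prev + 1.39722*Hyper + .98185*Smoke +
    1.0448*Minority - .01413*x
  plogis(L)  # identical to exp(L)/(1 + exp(L))
}

# Case 11: hypertensive minority mother, Lwt = 95 (fitted value reported as 0.698)
p11 <- PrLwt2(95, Prev = 0, Hyper = 1, Smoke = 0, Minority = 1)

# Case 152: none of the risk factors present, Lwt = 138 (reported as 0.099)
p152 <- PrLwt2(138, Prev = 0, Hyper = 0, Smoke = 0, Minority = 0)

print(round(c(case11 = p11, case152 = p152), 3))
```

Small differences from fitted(low.reduced) are expected because the coefficients in the function are rounded to five decimal places.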


R Function – Diagplot.log
Plot Cook's Distance and Delta Deviance for Logistic Regression Models

Diagplot.log = function(glm1)
{
  k <- length(glm1$coef)
  h <- lm.influence(glm1)$hat
  fv <- fitted(glm1)
  pr <- resid(glm1, type = "pearson")
  dr <- resid(glm1, type = "deviance")
  par(mfrow = c(2, 1))
  n <- length(fv)
  index <- seq(1, n, 1)
  Ck <- (1/k)*((pr^2) * h)/((1 - h)^2)
  Cd <- dr^2/(1 - h)
  plot(index, Ck, type = "n", xlab = "Index", ylab = "Cook's Distance",
       cex = 0.7, main = "Plot of Cook's Distance vs. Index", col = 1)
  points(index, Ck, col = 2)
  identify(index, Ck)
  plot(index, Cd, type = "n", xlab = "Index", ylab = "Delta Deviance",
       cex = 0.7, main = "Plot of Delta Deviance vs. Index")
  points(index, Cd, col = 2)
  identify(index, Cd)
  par(mfrow = c(1, 1))
  invisible()
}


Diagplot.glm - displays case diagnostic plots for a logistic regression

Diagplot.glm = function (lm1, lms = summary(lm1), lmi = lm.influence(lm1))
{
  par(mfrow = c(2, 2))
  h <- lmi$hat                                # leverages
  pr <- residuals(lm1, type = "pearson")
  dr <- residuals(lm1, type = "deviance")
  dB <- ((pr^2) * h)/((1 - h)^2)              # influence measure (not divided by k)
  dD <- dr^2/(1 - h)                          # delta deviance
  fv <- lm1$fitted.values
  # Diagnostics vs. fitted values; cases above the cut-offs are drawn in blue
  plot(fv, dB, main = "Plot of dB vs. Fitted Values",
       xlab = "Fitted Values", ylab = "dB")
  points(fv[dB > 1], dB[dB > 1], col = "blue")
  plot(fv, dD, main = "Plot of dD vs. Fitted Values",
       xlab = "Fitted Values", ylab = "dD")
  points(fv[dD > 4], dD[dD > 4], col = "blue")
  # Diagnostics vs. case index; click on points to label them
  index <- seq_along(fv)
  plot(dB, main = "Plot of dB vs. Index Number", xlab = "Index Number")
  points(index[dB > 1], dB[dB > 1], col = "blue")
  identify(index, dB, cex = 0.75)
  plot(dD, main = "Plot of dD vs. Index Number", xlab = "Index Number")
  points(index[dD > 4], dD[dD > 4], col = "blue")
  identify(index, dD, cex = 0.75)
  par(mfrow = c(1, 1))
  invisible()
}


Interactions and Higher Order Terms (Note ~ uses data frame: Lowbwt)

Here we work with a slightly different version of the low birth weight data, which includes an additional predictor, ftv, a factor that indicates the number of first trimester doctor visits the woman had (coded as: 0, 1, or 2+). We will examine how the model below was developed in the next section, where we discuss model development.

In the model below we have added an interaction between age and the number of first trimester visits. The logistic model is:

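$$\log\left(\frac{\theta(\tilde{x})}{1-\theta(\tilde{x})}\right) = \eta_0 + \eta_1\,Age + \eta_2\,Lwt + \eta_3\,Smoke + \eta_4\,Prev + \eta_5\,HT + \eta_6\,UI + \eta_7\,FTV1 + \eta_8\,FTV2 + \eta_9\,Age{*}FTV1 + \eta_{10}\,Age{*}FTV2 + \eta_{11}\,Smoke{*}UI$$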

> summary(bigmodel)

Call:
glm(formula = low ~ age + lwt + smoke + ptd + ht + ui + ftv + age:ftv + smoke:ui, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8945  -0.7128  -0.4817   0.7841   2.3418

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.582389   1.420834  -0.410 0.681885
age          0.075538   0.053945   1.400 0.161428
lwt         -0.020372   0.007488  -2.721 0.006513 **
smoke1       0.780047   0.420043   1.857 0.063302 .
ptd1         1.560304   0.496626   3.142 0.001679 **
ht1          2.065680   0.748330   2.760 0.005773 **
ui1          1.818496   0.666670   2.728 0.006377 **
ftv1         2.921068   2.284093   1.279 0.200941
ftv2+        9.244460   2.650495   3.488 0.000487 ***
age:ftv1    -0.161823   0.096736  -1.673 0.094360 .
age:ftv2+   -0.411011   0.118553  -3.467 0.000527 ***
smoke1:ui1  -1.916644   0.972366  -1.971 0.048711 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 183.07  on 177  degrees of freedom
AIC: 207.07

Number of Fisher Scoring iterations: 4

> bigmodel$coefficients
(Intercept)         age         lwt      smoke1       prev1         ht1
-0.58238913  0.07553844 -0.02037234  0.78004747  1.56030401  2.06567991
        ui1        ftv1       ftv2+    age:ftv1   age:ftv2+  smoke1:ui1
 1.81849631  2.92106773  9.24445985 -0.16182328 -0.41101103 -1.91664380


Calculate P(Low|Age,FTV) for women of average pre-pregnancy weight with all other risk factors absent. Similar calculations could be done if we wanted to add in other factors as well.

First we calculate the logits as a function of age for the three levels of FTV (0, 1, and 2+, respectively).
> L <- -.5824 + .0755*agex - .02037*mean(lwt)
> L1 <- -.5824 + .0755*agex - .02037*mean(lwt) + 2.9211 - .16182*agex
> L2 <- -.5824 + .0755*agex - .02037*mean(lwt) + 9.2445 - .4110*agex

Next we calculate the associated conditional probabilities.
> P <- exp(L)/(1+exp(L))
> P1 <- exp(L1)/(1+exp(L1))
> P2 <- exp(L2)/(1+exp(L2))

Finally we plot the probability curves as a function of age and FTV.
> plot(agex,P,type="l",xlab="Age",ylab="P(Low|Age,FTV)",ylim=c(0,1))
> lines(agex,P1,lty=2,col="blue")
> lines(agex,P2,lty=3,col="red")
> title(main="Interaction Between Age and First Trimester Visits",cex=.6)

We also have an interaction between smoking and uterine irritability added to the model. This will affect how we interpret the two in terms of odds ratios. We need to consider the OR associated with smoking for women without uterine irritability, the OR associated with uterine irritability for nonsmokers, and finally the OR associated with smoking and having uterine irritability during pregnancy.

The interaction between age and FTV produces differences in the direction and magnitude of the age effect. For women with no first trimester doctor visits, the probability of low birth weight increases with age. However, for women with at least one first trimester visit the probability of low birth weight decreases with age. The magnitude of that drop is largest for women with 2 or more first trimester visits.
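The change in direction can be read straight off the coefficients: on the logit scale the age slope for each FTV group is the age coefficient plus that group's age:ftv interaction coefficient. A minimal sketch using the estimates from the summary output above (the variable names are illustrative only):

```r
# Estimates taken from the summary(bigmodel) output
b_age      <-  0.075538
b_age_ftv1 <- -0.161823
b_age_ftv2 <- -0.411011

# Age slope on the log-odds scale within each FTV group
age_slope <- c("FTV 0"  = b_age,
               "FTV 1"  = b_age + b_age_ftv1,
               "FTV 2+" = b_age + b_age_ftv2)

# Corresponding odds ratio per additional year of age
or_per_year <- exp(age_slope)
print(round(cbind(slope = age_slope, OR_per_year = or_per_year), 4))
```

A positive slope (odds ratio above 1) for FTV 0 and negative slopes (odds ratios below 1) for FTV 1 and FTV 2+ correspond to the pattern described above. These are point estimates only and say nothing about the uncertainty in each slope.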


These estimated odds ratios are given below:

OR for Smoking with No Uterine Irritability
> exp(.7800)
[1] 2.181472

OR for Uterine Irritability with No Smoking
> exp(1.8185)
[1] 6.162608

OR for Smoking and Uterine Irritability
> exp(.7800+1.8185-1.91664)
[1] 1.977553

This result is hard to explain physiologically and so this interaction term might be removed from the model.
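One way to see why the result is awkward is to compute the odds ratio for each factor within the levels of the other. The sketch below uses only the three rounded coefficients from the commands above. The variable names are illustrative, and these are point estimates with no allowance for sampling variability.

```r
# Rounded coefficients as used in the commands above
b_smoke <- 0.7800
b_ui    <- 1.8185
b_int   <- -1.91664   # smoke1:ui1 interaction

or_smoke_no_ui  <- exp(b_smoke)                 # smoking, no uterine irritability
or_smoke_ui     <- exp(b_smoke + b_int)         # smoking, among women with UI
or_ui_nonsmoker <- exp(b_ui)                    # UI, among nonsmokers
or_ui_smoker    <- exp(b_ui + b_int)            # UI, among smokers
or_both         <- exp(b_smoke + b_ui + b_int)  # both vs. neither

print(round(c(smoke_no_ui = or_smoke_no_ui, smoke_ui = or_smoke_ui,
              ui_nonsmoker = or_ui_nonsmoker, ui_smoker = or_ui_smoker,
              both = or_both), 3))
```

Under this fitted model the estimated odds ratio for smoking among women with uterine irritability falls below 1, which is the physiologically implausible feature referred to above.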

Model Selection Methods

Stepwise methods used in logistic regression are the same as those used in ordinary least squares regression; however, the criterion is the AIC (Akaike Information Criterion) as opposed to Mallows' Ck statistic. Like Mallows' statistic, AIC balances residual deviance and the number of parameters in the model.

AIC = D + 2kφ

where D = residual deviance, k = total number of estimated parameters, and φ is an estimate of the dispersion parameter, which is taken to be 1 in models where overdispersion is not present. Overdispersion occurs when the data consist of the number of successes out of mi > 1 trials and the trials are not independent (e.g. male birth data from your last homework).
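The formula can be checked against the output shown in this handout. A small sketch follows, with a hypothetical helper name aic_glm and φ taken to be 1. The deviances and parameter counts are read from the summaries of the base model (11 coefficients) and the final stepwise model (12 coefficients).

```r
# Hypothetical helper implementing AIC = D + 2*k*phi
aic_glm <- function(D, k, phi = 1) D + 2 * k * phi

# Base model: residual deviance 195.48 with 11 estimated coefficients
aic_base <- aic_glm(195.48, 11)

# Final stepwise model: residual deviance 183.07 with 12 estimated coefficients
aic_final <- aic_glm(183.07, 12)

print(c(base = aic_base, final = aic_final))
```

These values should match the AIC lines reported by summary() for the two models (217.48 and 207.07).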

Forward, backward, both forward and backward simultaneously, and all possible subsets regression methods can be employed to find models with small AIC values. In R's step function the default search direction is both forward and backward simultaneously when a scope is supplied; if no scope is given, only backward elimination from the current model is performed. The command to do this in R has the basic form:

> step(current model name)

To have it select from models containing all potential two-way interactions use:

> step(current model name, scope=~.^2)

This will sometimes have problems with convergence due to overfitting (i.e. the estimated probabilities approach 0 and 1 as in the saturated model). If this occurs you can have R consider adding each of the potential interaction terms and then you can scan the list and decide which you might want to add to your existing model. You can then continue adding terms until the AIC criterion suggests that additional terms do not improve the current model.


These commands are illustrated for the low birth weight data with first trimester visits included in the output shown below.

Base Model
> low.glm <- glm(low~age+lwt+race+smoke+ht+ui+ptd+ftv,family=binomial)
> summary(low.glm)

Call:
glm(formula = low ~ age + lwt + race + smoke + ht + ui + ptd + ftv, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7038  -0.8068  -0.5009   0.8836   2.2151

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.822706   1.240174   0.663  0.50709
age         -0.037220   0.038530  -0.966  0.33404
lwt         -0.015651   0.007048  -2.221  0.02637 *
race2        1.192231   0.534428   2.231  0.02569 *
race3        0.740513   0.459769   1.611  0.10726
smoke1       0.755374   0.423246   1.785  0.07431 .
ht1          1.912974   0.718586   2.662  0.00776 **
ui1          0.680162   0.463464   1.468  0.14222
ptd1         1.343654   0.479409   2.803  0.00507 **
ftv1        -0.436331   0.477792  -0.913  0.36112
ftv2+        0.178939   0.455227   0.393  0.69426
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 195.48  on 178  degrees of freedom
AIC: 217.48

Find “best” model that includes all potential two-way interactions.

> low.step <- step(low.glm,scope=~.^2)
Start:  AIC= 217.48
 low ~ age + lwt + race + smoke + ht + ui + ptd + ftv

             Df Deviance    AIC
+ age:ftv     2   183.00 209.00
- ftv         2   196.83 214.83
- age         1   196.42 216.42
<none>            195.48 217.48
- ui          1   197.59 217.59
+ smoke:ui    1   193.76 217.76
+ lwt:smoke   1   194.04 218.04
+ ui:ptd      1   194.24 218.24
+ lwt:ui      1   194.28 218.28
+ ptd:ftv     2   192.38 218.38
+ ht:ptd      1   194.55 218.55
+ age:ptd     1   194.58 218.58
+ age:ht      1   194.59 218.59
+ age:smoke   1   194.61 218.61
+ race:ui     2   192.63 218.63
- smoke       1   198.67 218.67
+ smoke:ht    1   195.03 219.03
+ smoke:ptd   1   195.16 219.16
- race        2   201.23 219.23
+ race:smoke  2   193.24 219.24
+ lwt:ptd     1   195.35 219.35
+ lwt:ht      1   195.44 219.44
+ age:lwt     1   195.46 219.46
+ age:ui      1   195.47 219.47
+ ht:ftv      2   194.00 220.00
+ lwt:ftv     2   194.19 220.19
+ smoke:ftv   2   194.47 220.47
+ age:race    2   194.58 220.58
+ lwt:race    2   194.63 220.63
+ race:ptd    2   194.83 220.83
- lwt         1   200.95 220.95
+ race:ht     2   195.19 221.19
+ ui:ftv      2   195.32 221.32
- ht          1   202.93 222.93
- ptd         1   203.58 223.58
+ race:ftv    4   193.81 223.81

Step:  AIC= 209
 low ~ age + lwt + race + smoke + ht + ui + ptd + ftv + age:ftv

             Df Deviance    AIC
+ smoke:ui    1   179.94 207.94
+ lwt:smoke   1   180.89 208.89
- race        2   186.99 208.99
<none>            183.00 209.00
+ ui:ptd      1   181.42 209.42
+ lwt:ui      1   181.90 209.90
+ ht:ptd      1   182.06 210.06
- smoke       1   186.11 210.11
+ age:smoke   1   182.16 210.16
+ race:ui     2   180.32 210.32
+ age:ptd     1   182.50 210.50
- ui          1   186.61 210.61
+ smoke:ht    1   182.71 210.71
+ lwt:ptd     1   182.75 210.75
+ smoke:ptd   1   182.82 210.82
+ age:ht      1   182.90 210.90
+ age:ui      1   182.96 210.96
+ age:lwt     1   183.00 211.00
+ lwt:ht      1   183.00 211.00
+ race:smoke  2   181.23 211.23
+ lwt:ftv     2   181.44 211.44
+ ptd:ftv     2   181.57 211.57
+ age:race    2   181.62 211.62
+ smoke:ftv   2   181.65 211.65
+ ht:ftv      2   181.82 211.82
+ lwt:race    2   182.55 212.55
+ race:ht     2   182.78 212.78
+ race:ptd    2   182.85 212.85
- lwt         1   188.88 212.88
+ ui:ftv      2   182.94 212.94
- ht          1   190.13 214.13
- ptd         1   191.05 215.05
+ race:ftv    4   181.69 215.69
- age:ftv     2   195.48 217.48

Step:  AIC= 207.94
 low ~ age + lwt + race + smoke + ht + ui + ptd + ftv + age:ftv + smoke:ui

             Df Deviance    AIC
- race        2   183.07 207.07
<none>            179.94 207.94
+ lwt:smoke   1   178.34 208.34
+ ht:ptd      1   178.89 208.89
- smoke:ui    1   183.00 209.00
+ ui:ptd      1   179.07 209.07
+ age:ptd     1   179.35 209.35
+ age:smoke   1   179.37 209.37
+ smoke:ptd   1   179.58 209.58
+ lwt:ptd     1   179.61 209.61
+ lwt:ui      1   179.76 209.76
+ age:ht      1   179.78 209.78
+ smoke:ht    1   179.82 209.82
+ age:lwt     1   179.84 209.84
+ age:ui      1   179.86 209.86
+ lwt:ht      1   179.94 209.94
+ lwt:ftv     2   178.25 210.25
+ ptd:ftv     2   178.53 210.53
+ smoke:ftv   2   178.64 210.64
+ race:smoke  2   178.73 210.73
+ age:race    2   178.84 210.84
+ ht:ftv      2   178.89 210.89
+ race:ui     2   179.13 211.13
+ ui:ftv      2   179.50 211.50
+ race:ht     2   179.52 211.52
+ lwt:race    2   179.68 211.68
+ race:ptd    2   179.86 211.86
- lwt         1   187.15 213.15
- ht          1   187.66 213.66
+ race:ftv    4   178.51 214.51
- ptd         1   188.83 214.83
- age:ftv     2   193.76 217.76

Step:  AIC= 207.07
 low ~ age + lwt + smoke + ht + ui + ptd + ftv + age:ftv + smoke:ui

             Df Deviance    AIC
<none>            183.07 207.07
+ lwt:smoke   1   181.40 207.40
+ ui:ptd      1   181.88 207.88
+ ht:ptd      1   181.93 207.93
+ race        2   179.94 207.94
+ age:smoke   1   181.97 207.97
+ age:ht      1   182.64 208.64
+ age:ptd     1   182.69 208.69
+ lwt:ptd     1   182.73 208.73
+ lwt:ui      1   182.76 208.76
+ smoke:ptd   1   182.85 208.85
+ age:lwt     1   182.92 208.92
- smoke:ui    1   186.99 208.99
+ age:ui      1   182.99 208.99
+ smoke:ht    1   183.02 209.02
+ lwt:ht      1   183.06 209.06
+ smoke:ftv   2   181.48 209.48
+ lwt:ftv     2   181.69 209.69
+ ptd:ftv     2   181.85 209.85
+ ui:ftv      2   182.28 210.28
+ ht:ftv      2   182.41 210.41
- ht          1   191.21 213.21
- lwt         1   191.56 213.56
- ptd         1   193.59 215.59
- age:ftv     2   199.00 219.00

Summarize the model returned from the stepwise search
> summary(low.step)

Call:
glm(formula = low ~ age + lwt + smoke + ht + ui + ptd + ftv + age:ftv + smoke:ui, family = binomial)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.582389   1.420834  -0.410 0.681885
age          0.075538   0.053945   1.400 0.161428
lwt         -0.020372   0.007488  -2.721 0.006513 **
smoke1       0.780047   0.420043   1.857 0.063302 .
ht1          2.065680   0.748330   2.760 0.005773 **
ui1          1.818496   0.666670   2.728 0.006377 **
ptd1         1.560304   0.496626   3.142 0.001679 **
ftv1         2.921068   2.284093   1.279 0.200941
ftv2+        9.244460   2.650495   3.488 0.000487 ***
age:ftv1    -0.161823   0.096736  -1.673 0.094360 .
age:ftv2+   -0.411011   0.118553  -3.467 0.000527 ***
smoke1:ui1  -1.916644   0.972366  -1.971 0.048711 *
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 183.07  on 177  degrees of freedom
AIC: 207.07
Number of Fisher Scoring iterations: 4


This is the model used to demonstrate model interpretation in the presence of interactions.

An alternative to the full-blown search above is to consider adding a single interaction term to the “Base Model” from the set of all possible terms.

> add1(low.glm,scope=~.^2)
Single term additions

Model:
low ~ age + lwt + race + smoke + ht + ui + ptd + ftv
           Df Deviance    AIC
<none>          195.48 217.48
age:lwt     1   195.46 219.46
age:race    2   194.58 220.58
age:smoke   1   194.61 218.61
age:ht      1   194.59 218.59
age:ui      1   195.47 219.47
age:ptd     1   194.58 218.58
age:ftv     2   183.00 209.00 *
lwt:race    2   194.63 220.63
lwt:smoke   1   194.04 218.04
lwt:ht      1   195.44 219.44
lwt:ui      1   194.28 218.28
lwt:ptd     1   195.35 219.35
lwt:ftv     2   194.19 220.19
race:smoke  2   193.24 219.24
race:ht     2   195.19 221.19
race:ui     2   192.63 218.63
race:ptd    2   194.83 220.83
race:ftv    4   193.81 223.81
smoke:ht    1   195.03 219.03
smoke:ui    1   193.76 217.76
smoke:ptd   1   195.16 219.16
smoke:ftv   2   194.47 220.47
ht:ui       0   195.48 217.48
ht:ptd      1   194.55 218.55
ht:ftv      2   194.00 220.00
ui:ptd      1   194.24 218.24
ui:ftv      2   195.32 221.32
ptd:ftv     2   192.38 218.38
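The AIC column in the add1 table follows from the Deviance and Df columns: each candidate model has the base model's 11 coefficients plus Df new ones, so its AIC is Deviance + 2(11 + Df). A short sketch that checks a few rows of the table (the helper name add1_aic is hypothetical):

```r
# The base model has 11 estimated coefficients (see summary(low.glm))
base_k <- 11

# Hypothetical helper: AIC of the base model plus one candidate term
add1_aic <- function(deviance, df, k = base_k) deviance + 2 * (k + df)

aic_age_ftv  <- add1_aic(183.00, 2)   # age:ftv row
aic_race_ftv <- add1_aic(193.81, 4)   # race:ftv row
aic_smoke_ui <- add1_aic(193.76, 1)   # smoke:ui row

print(c(age_ftv = aic_age_ftv, race_ftv = aic_race_ftv, smoke_ui = aic_smoke_ui))
```

A term only lowers the AIC if it reduces the deviance by more than twice the number of parameters it adds, which is why age:ftv (a drop of 12.48 for 2 parameters) is the one flagged.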

We can then “manually” add this term to our base model by using the update command in R.

> low.glm2 <- update(low.glm,.~.+age:ftv)
> summary(low.glm2)

Call:
glm(formula = low ~ age + lwt + race + smoke + ht + ui + ptd + ftv + age:ftv, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0338  -0.7690  -0.4510   0.8354   2.3383

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.636485   1.558677  -1.050  0.29376
age          0.085461   0.055734   1.533  0.12519
lwt         -0.017599   0.007653  -2.300  0.02147 *
race2        0.994134   0.550962   1.804  0.07118 .
race3        0.700669   0.491400   1.426  0.15391
smoke1       0.792972   0.452303   1.753  0.07957 .
ht1          1.936204   0.747576   2.590  0.00960 **
ui1          0.938620   0.492240   1.907  0.05654 .
ptd1         1.373390   0.495738   2.770  0.00560 **
ftv1         2.877889   2.253710   1.277  0.20162
ftv2+        8.264965   2.594444   3.186  0.00144 **
age:ftv1    -0.149619   0.096342  -1.553  0.12043
age:ftv2+   -0.359454   0.115429  -3.114  0.00185 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 183.00  on 176  degrees of freedom
AIC: 209
Number of Fisher Scoring iterations: 4
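The `.~.+age:ftv` shorthand in update means "keep the current response and all current terms, then add age:ftv". Its effect can be seen without refitting anything by applying update to the model formula alone, as in this small sketch (the object names are illustrative):

```r
# Formula of the base model, as in the glm() call above
base_formula <- low ~ age + lwt + race + smoke + ht + ui + ptd + ftv

# "." on each side stands for "whatever was there before"
new_formula <- update(base_formula, . ~ . + age:ftv)
print(new_formula)

# Term labels of the updated formula; the last one is the new interaction
new_terms <- attr(terms(new_formula), "term.labels")
print(new_terms)
```

Terms can be removed in the same way, e.g. `. ~ . - race`.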

Next we could use add1 to consider the remaining interaction terms for addition to this model.

> add1(low.glm2,scope=~.^2)
Single term additions

Model:
low ~ age + lwt + race + smoke + ht + ui + ptd + ftv + age:ftv
           Df Deviance    AIC
<none>          183.00 209.00
age:lwt     1   183.00 211.00
age:race    2   181.62 211.62
age:smoke   1   182.16 210.16
age:ht      1   182.90 210.90
age:ui      1   182.96 210.96
age:ptd     1   182.50 210.50
lwt:race    2   182.55 212.55
lwt:smoke   1   180.89 208.89 *
lwt:ht      1   183.00 211.00
lwt:ui      1   181.90 209.90
lwt:ptd     1   182.75 210.75
lwt:ftv     2   181.44 211.44
race:smoke  2   181.23 211.23
race:ht     2   182.78 212.78
race:ui     2   180.32 210.32
race:ptd    2   182.85 212.85
race:ftv    4   181.69 215.69
smoke:ht    1   182.71 210.71
smoke:ui    1   179.94 207.94 **
smoke:ptd   1   182.82 210.82
smoke:ftv   2   181.65 211.65
ht:ui       0   183.00 209.00
ht:ptd      1   182.06 210.06
ht:ftv      2   181.82 211.82
ui:ptd      1   181.42 209.42
ui:ftv      2   182.94 212.94
ptd:ftv     2   181.57 211.57
