
Page 1

Intro to Logistic and Multinomial Regression

By Steven Drury

Page 2

Outline of Presentation

● Binary Logistic Regression Model
● Cumulative logit model for ordinal outcome
● Generalized logit model for nominal outcome

Page 3

Binary Logistic Regression Model

Suppose a pair of variables (x,y) is observed on some individuals, where x is a continuous variable, whereas y is a binary variable, that is, y assumes only two values.

Examples (Binary by nature).

relief v. no relief from a certain medical condition;

voted ‘yes’ v. ‘no’ on proposition XX;

HIV infection v. no infection;

won v. lost;

dead v. alive;

Page 4

Continuous Variables Can be Made Binary

Examples:

(i) excess body weight loss <20% v. 20% or more;

(ii) PTSD symptom score ranges 17 to 85 with a cutoff at 50: diagnosed with PTSD if score>=50 v. no PTSD if score <50;

(iii) spends above $X on entertainment weekly v. spends less than $X;

(iv) runs marathon under 3 hours v. runs longer than 3 hours.
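In practice this dichotomization is a one-liner. A minimal R sketch for the PTSD cutoff in example (ii); the score vector is made up for illustration:

# hypothetical symptom scores; the cutoff of 50 is from example (ii)
score <- c(23, 48, 50, 61, 85, 17)
ptsd <- as.integer(score >= 50)  # 1 = diagnosed with PTSD, 0 = no PTSD
ptsd                             # 0 0 1 1 1 0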

Page 5

Scatterplot

If we plot y (with values coded 0 and 1) against x, the scatterplot may look something like this:

Page 6

Problem

If we fit a linear regression model to these data, the residuals cannot be normally distributed (the response takes only the values 0 and 1), and the fitted values are not constrained to lie between 0 and 1, so the usual regression assumptions are violated.

Page 7

Introducing the Binary Logistic Regression Model

A binary (dichotomous) logistic regression is used to model $P(Y=1)$. The model with predictors $x_1,\dots,x_k$ has the form

$$P(Y=1) = \frac{\operatorname{Exp}(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}{1 + \operatorname{Exp}(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}.$$

Define the odds in favor of $Y=1$ as the ratio $P(Y=1)/P(Y=0)$. We can rewrite the logistic regression above in terms of the odds:

$$\frac{P(Y=1)}{P(Y=0)} = \operatorname{Exp}(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k).$$
Page 8

Goodness of Fit

There are three ways to check how well the model fits the data:

Pseudo R-square – Unlike R-square in linear regression, it does not represent the proportion of variation in Y explained by the model. We are looking for large values to indicate a good fit.

Max-rescaled R-square – Defined as the pseudo R-square divided by its maximum attainable value. We are also looking for large values to indicate a good fit.

Hosmer-Lemeshow goodness-of-fit test – Tests the null hypothesis that the model fits well. A P-value in excess of 0.05 is desirable.
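For readers replicating this in R: both R-squares can be computed from a fitted glm's deviances, and the Hosmer-Lemeshow test is available in the ResourceSelection package. A sketch, assuming a fitted binomial glm called fit:

# Cox & Snell (pseudo) R-square and its max-rescaled version
n     <- length(fit$y)
r2    <- 1 - exp((fit$deviance - fit$null.deviance) / n)
r2max <- r2 / (1 - exp(-fit$null.deviance / n))
# Hosmer-Lemeshow test (null hypothesis: the model fits well)
library(ResourceSelection)
hoslem.test(fit$y, fitted(fit), g = 10)

Applied to the psoriasis example later in the talk (null deviance 61.827, residual deviance 43.779, n = 45), these formulas reproduce the reported 0.3304 and 0.4424.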

Page 9

Interpretation of Regression Coefficients

When $x_1$ is continuous, the quantity $(\operatorname{Exp}(\hat\beta_1) - 1)\times 100\%$ represents the estimated percent change in odds in favor of $Y=1$ when $x_1$ is increased by one unit, and the other variables are held fixed. To see this, write

$$\frac{\text{odds}_{new}}{\text{odds}_{old}} = \frac{\operatorname{Exp}(\hat\beta_0 + \hat\beta_1 (x_1 + 1) + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k)}{\operatorname{Exp}(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k)} = \operatorname{Exp}(\hat\beta_1).$$
Page 10

Interpretation of Regression Coefficients

If $x_1$ is a categorical variable with two levels, then the quantity $\operatorname{Exp}(\hat\beta_1)\times 100\%$ represents the estimated percent ratio of the odds for the upper level of $x_1$ (when $x_1 = 1$) and those for the lower level (when $x_1 = 0$), provided the other variables are held fixed. To see that, write

$$\frac{\text{odds}_{x_1=1}}{\text{odds}_{x_1=0}} = \frac{\operatorname{Exp}(\hat\beta_0 + \hat\beta_1 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k)}{\operatorname{Exp}(\hat\beta_0 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k)} = \operatorname{Exp}(\hat\beta_1).$$

Page 11

Interpretation of Regression Coefficients

If $x_1$ is a categorical variable with $m$ levels, then $m-1$ dummy variables $x_{11},\dots,x_{1,m-1}$ are included into the model, with the $m$-th level being the reference level. The quantity $\operatorname{Exp}(\hat\beta_{11})\times 100\%$ represents the estimated percent ratio of the odds for the level $x_{11} = 1$ and those for the reference level, provided the other variables are held fixed. This follows from the fact that

$$\frac{\text{odds}_{x_{11}=1}}{\text{odds}_{x_1=m}} = \frac{\operatorname{Exp}(\hat\beta_0 + \hat\beta_{11} + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k)}{\operatorname{Exp}(\hat\beta_0 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k)} = \operatorname{Exp}(\hat\beta_{11}).$$

Page 12

Example

Dermatologists at a large hospital study patients with acute psoriasis, a skin disease. They randomly assign patients to three groups: taking drug A, drug B, or placebo. There are 45 patients in the study, 15 per group. The outcome is whether the patient felt relief from psoriasis symptoms (1=relief, 0=no relief). Data are collected on gender, age, and group. The following SAS code fits the logistic regression model to the data.

Page 13

SAS Application: Code

data psoriasis;
input gender$ age drug$ relief$ @@;
datalines;
M 25 A Yes  M 25 A Yes  M 41 A Yes  M 42 A Yes
M 43 A Yes  M 51 A Yes  M 59 A Yes  M 59 A Yes
F 29 A Yes  F 35 A Yes  F 42 A Yes  F 56 A Yes
F 65 A Yes  F 40 A No   F 61 A No   M 29 B Yes
M 33 B Yes  M 39 B Yes  M 42 B Yes  M 46 B Yes
M 42 B No   M 48 B No   M 62 B No   F 36 B Yes
F 47 B Yes  F 28 B No   F 38 B No   F 39 B No
F 50 B No   F 60 B No   M 42 P Yes  M 46 P Yes
M 24 P No   M 25 P No   M 60 P No   M 67 P No
F 28 P Yes  F 32 P Yes  F 35 P Yes  F 42 P No
F 48 P No   F 53 P No   F 57 P No   F 58 P No
F 65 P No
;
proc logistic data=psoriasis;
class gender (ref='F') drug (ref='P') / param=ref;
model relief (event='Yes') = gender age drug / rsq lackfit;
run;

Page 14

The Important Features of the SAS Code

Options (ref='F') and (ref='P') define reference categories for gender and drug.

Option param=ref creates proper dummy variables for gender and drug.

Option rsq computes the pseudo R-square and max-rescaled R-square.

Option lackfit performs the Hosmer-Lemeshow goodness-of-fit test.
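For comparison, the R code a few slides ahead builds the dummy variables by hand. An equivalent and arguably more idiomatic R approach is to set the same reference levels with relevel; a sketch, assuming a data frame psoriasis with the same columns and relief coded 0/1 or as a factor:

psoriasis$gender <- relevel(factor(psoriasis$gender), ref = "F")
psoriasis$drug   <- relevel(factor(psoriasis$drug),   ref = "P")
fit <- glm(relief ~ gender + age + drug, family = binomial, data = psoriasis)
summary(fit)  # same estimates as with hand-coded dummies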

Page 15

Relevant SAS Output

R-Square 0.3304   Max-rescaled R-Square 0.4424

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq

Intercept    1     1.9698           1.5911            1.5327       0.2157
gender M     1     1.1080           0.7592            2.1297       0.1445
age          1    -0.0722           0.0344            4.4035       0.0359
drug A       1     2.9828           1.0969            7.3945       0.0065
drug B       1     0.3443           0.8445            0.1662       0.6835

Odds Ratio Estimates

Effect Point Estimate 95% Wald Confidence Limits

gender M vs F 3.028 0.684 13.410

age 0.930 0.870 0.995

drug A vs P 19.744 2.300 169.484

drug B vs P 1.411 0.270 7.386

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square DF Pr > ChiSq

5.1720 7 0.6390

Page 16

Results

Age and drug A are significant predictors of relief from psoriasis (age at the 5% level, drug A at the 1% level).

This model has a good fit: the P-value of the Hosmer-Lemeshow test is 0.6390 > 0.05, and the pseudo R-square (0.3304) and max-rescaled R-square (0.4424) are not very small.

The fitted model is

$$\frac{\hat P(\text{relief})}{\hat P(\text{no relief})} = \operatorname{Exp}(1.9698 + 1.1080\,\text{Male} - 0.0722\,\text{age} + 2.9828\,\text{drugA} + 0.3443\,\text{drugB}).$$

Page 17

Interpretation of Beta Coefficients

The odds in favor of psoriasis relief for males are 3.028 times those for females (302.8%).

As age increases by one year, the odds in favor of psoriasis relief decrease by 7%=(0.93-1)100%.

The odds in favor of psoriasis relief for drug A patients are 19.744 times those of placebo patients (or 1,974.4%).

The odds in favor of psoriasis relief for drug B patients are 1.411 times those of placebo patients (or 141.1%).
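These odds ratios are just the exponentiated regression coefficients; a quick check in R:

exp(c(gender = 1.1080, age = -0.0722, drugA = 2.9828, drugB = 0.3443))
# gender    age   drugA  drugB
#  3.028  0.930  19.744  1.411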

Page 18

R Application Code

Gender<-c(1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0)

Age<-c(25,25,41,42,43,51,59,59,29,35,42,56,65,40,61,29,33,39,42,46,42,48,62,36,47,28,38,39,50,60,42,46,24,25,60,67,28,32,35,42,48,53,57,58,65)

DrugA<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)

DrugB<-c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)

Relief<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,1,1,1,0,0,0,0,0,0)

logr.drugs <- glm(Relief ~ Gender + Age + DrugA + DrugB, family=binomial)

summary(logr.drugs)

confint(logr.drugs)

exp(coef(logr.drugs))

exp(cbind(OR = coef(logr.drugs), confint(logr.drugs)))

Page 19

Relevant R Output

Deviance Residuals:

Min 1Q Median 3Q Max

-2.0900 -0.7084 0.2968 0.8228 1.6565

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.96989 1.59115 1.238 0.21571

Gender 1.10799 0.75922 1.459 0.14446

Age -0.07221 0.03441 -2.098 0.03586 *

DrugA 2.98290 1.09693 2.719 0.00654 **

DrugB 0.34432 0.84455 0.408 0.68350

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 61.827 on 44 degrees of freedom

Residual deviance: 43.779 on 40 degrees of freedom

AIC: 53.779

Number of Fisher Scoring iterations: 5

2.5 % 97.5 %

(Intercept) -1.0468032 5.34676953

Gender -0.3436048 2.68635060

Age -0.1484526 -0.01007708

DrugA 1.0484500 5.47015945

DrugB -1.3282170 2.04585735

(Intercept) Gender Age DrugA DrugB

7.1699028 3.0282637 0.9303383 19.7450880 1.4110286

OR 2.5 % 97.5 %

(Intercept) 7.1699028 0.3510582 209.9290309

Gender 3.0282637 0.7092092 14.6780122

Age 0.9303383 0.8620409 0.9899735

DrugA 19.7450880 2.8532251 237.4980585

DrugB 1.4110286 0.2649492 7.7357880

Page 20

Minitab: I’m like 95% Confident I Could Train A Chimp To Do This

Page 21

SPSS Application: Syntax

LOGISTIC REGRESSION VARIABLES relief
  /METHOD=ENTER gender age drugA drugB
  /PRINT=GOODFIT CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

Page 22

Relevant SPSS Output

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square

1                43.779a                   .330                  .442

Hosmer and Lemeshow Test

Step Chi-square df Sig.

1 5.172 7 .639

Variables in the Equation

B   S.E.   Wald   df   Sig.   Exp(B)   95% C.I. for EXP(B) (Lower, Upper)

Step 1a

gender     1.108    .759   2.130   1   .144    3.028    .684    13.410
age        -.072    .034   4.404   1   .036     .930    .870      .995
drugA      2.983   1.097   7.395   1   .007   19.745   2.300   169.501
drugB       .344    .845    .166   1   .683    1.411    .270     7.386
Constant   1.970   1.591   1.533   1   .216    7.170

Page 23

Multinomial Logistic Regression

A natural extension of the binary logistic regression arises when the outcome variable is categorical with more than two values, e.g., 0, 1, or 2. This model is called a multinomial logistic regression model.

Two models are distinguished: one for an ordinal outcome (ordered categories, such as size) and one for a nominal outcome (unordered categories, such as race).

Page 24

Cumulative Logit Model for Ordinal Outcome

For example, if $m = 4$, the cumulative probabilities $P(y \le j)$ are

$$P(y \le 1) = P(y=1),$$
$$P(y \le 2) = P(y=1) + P(y=2),$$
$$P(y \le 3) = P(y=1) + P(y=2) + P(y=3),$$

and

$$P(y \le 4) = P(y=1) + P(y=2) + P(y=3) + P(y=4) = 1.$$

Page 25

Cumulative Logit Model for Ordinal Outcome

Define the odds of outcome in category $j$ or below as the ratio

$$\frac{P(y \le j)}{P(y > j)}.$$

These are termed cumulative odds.

Define the logits of the cumulative probabilities (called cumulative logits) by

$$\operatorname{logit} P(y \le j) = \ln \frac{P(y \le j)}{P(y > j)}.$$
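A small numeric illustration in R, with made-up category probabilities for m = 4:

p   <- c(0.1, 0.2, 0.4, 0.3)  # hypothetical P(y=1), ..., P(y=4)
cum <- cumsum(p)[1:3]         # P(y<=j) for j = 1, 2, 3
odds  <- cum / (1 - cum)      # cumulative odds
logit <- log(odds)            # cumulative logits
rbind(cum, odds, logit)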

Page 26

Cumulative Logit Model for Ordinal Outcome

For instance, if $m = 4$, the cumulative logits are:

$$\operatorname{logit} P(y \le 1) = \ln \frac{P(y \le 1)}{P(y > 1)} = \ln \frac{P(y=1)}{P(y=2) + P(y=3) + P(y=4)},$$

$$\operatorname{logit} P(y \le 2) = \ln \frac{P(y \le 2)}{P(y > 2)} = \ln \frac{P(y=1) + P(y=2)}{P(y=3) + P(y=4)},$$

and

$$\operatorname{logit} P(y \le 3) = \ln \frac{P(y \le 3)}{P(y > 3)} = \ln \frac{P(y=1) + P(y=2) + P(y=3)}{P(y=4)}.$$

Since $P(y \le 4) = 1$, the logit of $P(y \le 4)$ is not defined.

Page 27

Cumulative Logit Model for Ordinal Outcome

The cumulative logit model for an ordinal outcome $y$ and predictors $x_1,\dots,x_k$ has the form

$$\operatorname{logit} P(y \le j) = \alpha_j + \beta_1 x_1 + \dots + \beta_k x_k, \quad j = 1,\dots,m-1.$$

Note that this model requires a separate intercept parameter $\alpha_j$ for each cumulative probability. SAS uses this model.

Note that some software packages (in particular, SPSS) use the model

$$\operatorname{logit} P(y \le j) = \alpha_j - (\beta_1 x_1 + \dots + \beta_k x_k), \quad j = 1,\dots,m-1.$$
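R's MASS::polr, used later in this presentation, follows the second (SPSS-style) parameterization, so its slope estimates carry the opposite sign of SAS's. A self-contained sketch with simulated data (the data are made up; only the sign convention is the point):

library(MASS)
# polr fits: logit P(y <= j) = zeta_j - (beta1*x1 + ... + betak*xk),
# so a positive coefficient pushes y toward HIGHER categories
set.seed(1)
x <- rnorm(100)
y <- cut(x + rnorm(100), breaks = 3,
         labels = c("low", "mid", "high"), ordered_result = TRUE)
fit <- polr(y ~ x, Hess = TRUE)
summary(fit)  # the coefficient on x comes out positive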

Page 28

Goodness of Model Fit

There are only two quantities that may be used to check the model fit. They are: Pseudo R-square and Max-rescaled R-square.

We cannot perform the Hosmer-Lemeshow goodness-of-fit test in multinomial logistic regression.
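The deviance-based computation shown for the binary model carries over. A sketch for a cumulative logit model fit with MASS::polr, assuming the fitted model is called fit:

null  <- update(fit, . ~ 1)   # intercept-only model
n     <- nrow(fitted(fit))    # number of observations (fitted() returns an n x m probability matrix)
r2    <- 1 - exp((deviance(fit) - deviance(null)) / n)  # pseudo R-square
r2max <- r2 / (1 - exp(-deviance(null) / n))            # max-rescaled
c(pseudo = r2, max_rescaled = r2max)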

Page 29

Interpretation of Beta Coefficients

When $x_1$ is continuous, the quantity $(\operatorname{Exp}(\hat\beta_1) - 1)\times 100\%$ represents the estimated percent change in cumulative odds when $x_1$ is increased by one unit, and the other predictors are held fixed.

If $x_1$ is a categorical variable with several levels, then $\operatorname{Exp}(\hat\beta_1)\times 100\%$ represents the estimated percent ratio of cumulative odds for the level $x_{11} = 1$ and that for the reference level, controlling for the other predictors.

Page 30

Examples of Ordinal Outcomes

Example 1. A marketing research firm wants to investigate what factors influence the size of soda (small, medium, large or extra large) that people order at a fast-food chain.

Example 2. A researcher is interested in what factors influence medaling in Olympic swimming (gold, silver, bronze).

Example 3. A study looks at factors that influence the decision of whether to apply to graduate school. College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school.

Page 31

Numeric Example

Among the variables collected by the California Health Interview Survey (CHIS) were the demographic variables gender (M/F), age (in years), marital status (Married/Not Married), and highest educational degree obtained (<HS/HSgrad/HS+), along with self-reported health condition (Poor/Fair/Good/Excellent).

The following SAS code runs a cumulative logit model for the ordinal outcome variable health for the data on 32 respondents.

Page 32

SAS Application: Code

data CHIS;
input gender$ age marital$ educ$ health$ @@;
datalines;
M 46 yes 1 3  M 62 yes 1 1  M 52 yes 2 4  M 50 no 1 2   F 44 no 3 1
F 68 no 2 2   F 50 no 3 2   F 93 no 1 1   M 60 yes 2 4  M 88 no 3 3
M 58 yes 2 4  M 62 yes 2 3  F 64 yes 3 3  F 49 yes 2 3  F 71 yes 3 4
M 32 no 3 3   F 88 no 2 1   F 36 yes 3 4  M 85 no 3 3   F 38 no 3 2
M 49 yes 3 4  F 43 no 1 3   M 61 yes 2 3  M 47 yes 3 4  F 36 yes 1 3
M 44 yes 1 4  M 41 no 2 3   M 55 yes 1 3  M 37 no 3 2   M 58 yes 2 4
F 40 yes 2 3  F 97 no 2 1
;
proc format;
value $maritalfmt 'yes'='married' 'no'='not married';
value $educfmt '1'='<HS' '2'='HSgrad' '3'='HS+';
value $healthfmt '1'='poor' '2'='fair' '3'='good' '4'='excellent';
run;

proc logistic;
class gender (ref='M') marital (ref='yes') educ (ref='3') / param=ref;
model health = gender age marital educ / link=clogit rsq;
run;

Page 33

The Important Features of the SAS Code

Ordinal variables should be entered into SAS as numbers 1, 2, etc. Otherwise SAS orders them alphabetically.

Option link=clogit specifies the cumulative logit link function. Note that by default, link=logit.

Option lackfit cannot be specified because the Hosmer-Lemeshow goodness-of-fit test cannot be performed in the case of multinomial logistic regression.

Page 34

Relevant SAS Output

R-Square 0.5988   Max-rescaled R-Square 0.6466

Analysis of Maximum Likelihood Estimates

Parameter      DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq

Intercept 1     1    -8.3785           2.2137           14.3258       0.0002
Intercept 2     1    -6.6506           1.9151           12.0597       0.0005
Intercept 3     1    -2.9271           1.5069            3.7731       0.0521
gender F        1     1.8504           0.8187            5.1082       0.0238
age             1     0.0251           0.0234            1.1540       0.2827
marital no      1     4.1511           1.2304           11.3819       0.0007
educ 1          1     2.2937           1.0633            4.6532       0.0310
educ 2          1     0.9264           0.9206            1.0125       0.3143

Odds Ratio Estimates

Effect Point Estimate 95% Wald Confidence Limits

gender F vs M 6.363 1.279 31.662

age 1.025 0.980 1.074

marital no vs yes 63.501 5.694 708.125

educ 1 vs 3 9.912 1.233 79.660

educ 2 vs 3 2.525 0.416 15.343

Page 35

Results

Gender, marital status and education are associated with health status. Age is not.

This model has a reasonably good fit because the pseudo R-square and max-rescaled R-square are pretty large.

Page 36

Results

The fitted model is:

$$\operatorname{logit} \hat P(\text{poor health}) = \ln \frac{\hat P(\text{poor health})}{\hat P(\text{fair, good, or excellent health})} = -8.3785 + 1.8504\,\text{female} + 0.0251\,\text{age} + 4.1511\,\text{not married} + 2.2937\,\text{'<HS'} + 0.9264\,\text{'HSgrad'},$$

$$\operatorname{logit} \hat P(\text{poor or fair health}) = \ln \frac{\hat P(\text{poor or fair health})}{\hat P(\text{good or excellent health})} = -6.6506 + 1.8504\,\text{female} + 0.0251\,\text{age} + 4.1511\,\text{not married} + 2.2937\,\text{'<HS'} + 0.9264\,\text{'HSgrad'},$$

and

$$\operatorname{logit} \hat P(\text{poor, fair, or good health}) = \ln \frac{\hat P(\text{poor, fair, or good health})}{\hat P(\text{excellent health})} = -2.9271 + 1.8504\,\text{female} + 0.0251\,\text{age} + 4.1511\,\text{not married} + 2.2937\,\text{'<HS'} + 0.9264\,\text{'HSgrad'}.$$

Page 37

Interpretation of Beta Coefficients

The estimated odds of worse health for females are 6.363 times those for males (or 636.3%).

As age increases by one year, the estimated odds of worse health increase by 2.5%=(1.025-1)100% (not significant).

The estimated odds of worse health for not married people are 63.501 times those for married (or 6,350.1%).

The estimated odds of worse health for <HS are 9.912 times those for HS+ (or 991.2%).

The estimated odds of worse health for HSgrad are 2.525 times those for HS+ (or 252.5%) (not significant).

These ratios apply to all of the three cumulative probabilities P(poor health), P(poor or fair health) and P(poor, fair, or good health).

Page 38

R Application: Code

Gender<-c(1,0,0,0,0,1,0,0,0,1,1,1,1,0,0,1,1,0,1,0,1,1,0,1,1,1,0,0,1,1,1,1)

Age<-c(62,44,93,88,97,50,68,50,38,37,46,88,62,64,49,32,85,43,61,36,41,55,40,52,60,58,71,36,49,47,44,58)

Marital<-c(1,0,0,0,0,0,0,0,0,0,1,0,1,1,1,0,0,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1)

Educ<-c(1,3,1,2,2,1,2,3,3,3,1,3,2,3,2,3,3,1,2,1,2,1,2,2,2,2,3,3,3,3,1,2)

Health<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4)

require(foreign)

require(ggplot2)

require(MASS)

require(Hmisc)

require(reshape2)

Health<-factor(Health)

logr.healthlevel <- polr(Health ~ Gender + Age + Marital + Educ, Hess=TRUE)

summary(logr.healthlevel)

ctable<-coef(summary(logr.healthlevel))

p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2

(ctable <- cbind(ctable, "p value" = p))

ci <- confint(logr.healthlevel)

exp(cbind(OR = coef(logr.healthlevel), ci))

Page 39

Relevant R Output

Coefficients:

Value Std. Error t value

Gender 1.84490 0.81397 2.267

Age -0.02338 0.02234 -1.046

Marital 4.19114 1.24772 3.359

Educ 1.12441 0.53246 2.112

Intercepts:

Value Std. Error t value

1|2 1.0409 1.9805 0.5256

2|3 2.7789 2.0141 1.3797

3|4 6.4879 2.4298 2.6702

Residual Deviance: 54.22504

AIC: 68.22504

Value Std. Error t value p value

Gender 1.84489537 0.81396860 2.2665437 0.0234181163

Age -0.02337853 0.02233984 -1.0464952 0.2953324826

Marital 4.19113850 1.24772220 3.3590318 0.0007821608

Educ 1.12440869 0.53245920 2.1117274 0.0347098361

1|2 1.04088922 1.98053493 0.5255596 0.5991942069

2|3 2.77894049 2.01410976 1.3797364 0.1676678305

3|4 6.48793684 2.42978798 2.6701658 0.0075813794

OR 2.5 % 97.5 %

Gender 6.3274377 1.3836035 35.292605

Age 0.9768926 0.9323719 1.019613

Marital 66.0980005 8.1282254 1546.311706

Educ 3.0783960 1.1478311 9.538690

Page 40

Minitab: I’m like 95% Confident I Could Train A Chimp To Do This

Page 41

SPSS Application: Syntax

PLUM health BY gender marital educ WITH age
  /LINK=LOGIT.

Page 42

Relevant SPSS Output

Pseudo R-Square

Cox and Snell .599

Nagelkerke .647

McFadden .351

Parameter Estimates

Estimate   Std. Error   Wald   df   Sig.   95% Confidence Interval (Lower Bound, Upper Bound)

Threshold

[health = 1.00] -8.379 2.214 14.326 1 .000 -12.717 -4.040

[health = 2.00] -6.651 1.915 12.060 1 .001 -10.404 -2.897

[health = 3.00] -2.927 1.507 3.773 1 .052 -5.881 .026

Location

age -.025 .023 1.154 1 .283 -.071 .021

[gender=.00] -1.850 .819 5.108 1 .024 -3.455 -.246

[gender=1.00] 0a . . 0 . . .

[marital=.00] -4.151 1.230 11.382 1 .001 -6.563 -1.740

[marital=1.00] 0a . . 0 . . .

[educ=1.00] -2.294 1.063 4.653 1 .031 -4.378 -.210

[educ=2.00] -.926 .921 1.013 1 .314 -2.731 .878

[educ=3.00] 0a . . 0 . . .

Page 43

Generalized Logit Model for Nominal Outcome

• Suppose $y$ is a nominal outcome with $m$ levels, and assume that the $m$-th level is the reference.

• Define the generalized logit function as

$$\operatorname{logit} P(y = j) = \ln \frac{P(y = j)}{P(y = m)}, \quad \text{where } j = 1,\dots,m-1.$$

For example, if $m = 4$,

$$\operatorname{logit} P(y=1) = \ln \frac{P(y=1)}{P(y=4)}, \quad \operatorname{logit} P(y=2) = \ln \frac{P(y=2)}{P(y=4)}, \quad \text{and} \quad \operatorname{logit} P(y=3) = \ln \frac{P(y=3)}{P(y=4)}.$$
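In R this is a direct computation; with made-up probabilities for m = 4 levels:

p <- c(0.25, 0.15, 0.40, 0.20)  # hypothetical P(y=1), ..., P(y=4); level 4 is the reference
log(p[1:3] / p[4])              # generalized logits for j = 1, 2, 3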

Page 44

Generalized Logit Model for Nominal Outcome

The generalized logit model for a nominal outcome $y$ with $m$ levels and predictors $x_1,\dots,x_k$ has the form

$$\operatorname{logit} P(y = j) = \beta_{0j} + \beta_{1j} x_1 + \dots + \beta_{kj} x_k, \quad j = 1,\dots,m-1.$$

Note that ALL the regression coefficients differ for different $j$'s.

Page 45

Interpretation of Beta Coefficients

When $x_1$ is continuous, the quantity $(\operatorname{Exp}(\hat\beta_{1j}) - 1)\times 100\%$ represents the estimated percent change in odds in favor of $y = j$ as opposed to $y = m$ when $x_1$ is increased by one unit, and the other predictors are held fixed.

If $x_1$ is a categorical variable with several levels, then $\operatorname{Exp}(\hat\beta_{1j})\times 100\%$ represents the estimated percent ratio of odds in favor of $y = j$ as opposed to $y = m$ for the level $x_{11} = 1$ and that for the reference level, controlling for the other predictors.

Page 46

Examples of Nominal Outcomes

Example 1. People's occupational choices might be influenced by their parents' occupations and their own education level. We can study the relationship of one's occupation choice with education level and father's occupation.

Example 2. Entering high school students make program choices among general program, vocational program and academic program. Their choice might be modeled using their writing score and their socioeconomic status.

Page 47

Numeric Example

Over the course of a school year, third-graders from three different schools are exposed to three different styles of mathematics instruction: a self-paced computer-learning style, a team approach, and a traditional class approach. The students are asked which style they prefer, and their responses, classified by the type of program they are in (a regular school day versus a regular school day supplemented with an afternoon school program), are recorded.

The following SAS code runs a generalized logit model for the nominal outcome variable style (self/team/class).

Page 48

SAS Application: Code

data school;
length program$ 9;
input school program$ style$ count @@;
datalines;
1 regular self 10    1 regular team 17    1 regular class 26
1 afternoon self 5   1 afternoon team 12  1 afternoon class 50
2 regular self 21    2 regular team 17    2 regular class 26
2 afternoon self 16  2 afternoon team 12  2 afternoon class 36
3 regular self 15    3 regular team 15    3 regular class 16
3 afternoon self 12  3 afternoon team 12  3 afternoon class 20
;
proc logistic;
freq count;
class school (ref='1') program (ref='afternoon') / param=ref;
model style (order=data) = school program / link=glogit rsq;
run;

Page 49

The Important Features of the SAS Code

The data set contains frequencies of identical observations, so the freq statement has to be used in proc logistic.

The option order=data tells SAS to order the levels of the outcome variable as they appear in the data set, which makes the last value to appear (class) the reference level.

Option link=glogit specifies the generalized logit link function.

Option lackfit cannot be specified because the Hosmer-Lemeshow goodness-of-fit test cannot be performed in the case of multinomial logistic regression.

Page 50

Relevant SAS Output

R-Square 0.0808 Max-rescaled R-Square 0.0926

Analysis of Maximum Likelihood Estimates

Parameter         style   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq

Intercept         self     1    -1.9707           0.3204           37.8418       <.0001
Intercept         team     1    -1.3088           0.2596           25.4174       <.0001
school 2          self     1     1.0828           0.3539            9.3598       0.0022
school 2          team     1     0.1801           0.3172            0.3224       0.5702
school 3          self     1     1.3147           0.3839           11.7262       0.0006
school 3          team     1     0.6556           0.3395            3.7296       0.0535
program regular   self     1     0.7474           0.2820            7.0272       0.0080
program regular   team     1     0.7426           0.2706            7.5332       0.0061

Page 51

Relevant SAS Output

Odds Ratio Estimates

Effect style Point Estimate 95% Wald Confidence Limits

school 2 vs 1 self 2.953 1.476 5.909

school 2 vs 1 team 1.197 0.643 2.230

school 3 vs 1 self 3.724 1.755 7.902

school 3 vs 1 team 1.926 0.990 3.747

program regular vs afternoon self 2.112 1.215 3.670

program regular vs afternoon team 2.101 1.237 3.571

Page 52

Results

This model doesn't have a very good fit, because both the R-square and max-rescaled R-square are pretty small.

The fitted model is:

$$\operatorname{logit} \hat P(\text{self}) = -1.9707 + 1.0828\,\text{school2} + 1.3147\,\text{school3} + 0.7474\,\text{regular},$$

and

$$\operatorname{logit} \hat P(\text{team}) = -1.3088 + 0.1801\,\text{school2} + 0.6556\,\text{school3} + 0.7426\,\text{regular}.$$

Page 53

Interpretation of Beta Coefficients

The estimated odds of preferring a self-paced computer-learning style as opposed to a traditional class approach in school 2 are 2.953 times those in school 1 (or 295.3%).

The estimated odds of preferring a self-paced computer-learning style as opposed to a traditional class approach in school 3 are 3.724 times those in school 1 (or 372.4%).

The estimated odds of preferring a self-paced computer-learning style as opposed to a traditional class approach in the regular program are 2.112 times those in the afternoon program (or 211.2%).

Page 54

Interpretation of Beta Coefficients

The estimated odds of preferring a team learning approach as opposed to a traditional class approach in school 2 are 1.197 times those in school 1 (or 119.7%).

The estimated odds of preferring a team learning approach as opposed to a traditional class approach in school 3 are 1.926 times those in school 1 (or 192.6%).

The estimated odds of preferring a team learning approach as opposed to a traditional class approach in the regular program are 2.101 times those in the afternoon program (or 210.1%).

Page 55

Minitab: I’m like 95% Confident I Could Train A Chimp To Do This

Page 56

R Application: Code
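The R code on this slide and the next did not survive the transcript. A minimal reconstruction sketch with nnet::multinom, which fits the generalized logit model (an assumption: the original slides may have used a different package); the cell counts and reference levels mirror the SAS step:

library(nnet)

# rebuild the 18 cell counts from the SAS data step
cells <- expand.grid(style   = c("self", "team", "class"),
                     program = c("regular", "afternoon"),
                     school  = c("1", "2", "3"))
cells$count <- c(10, 17, 26,  5, 12, 50,
                 21, 17, 26, 16, 12, 36,
                 15, 15, 16, 12, 12, 20)

cells$style   <- relevel(cells$style,   ref = "class")      # class approach = reference outcome
cells$program <- relevel(cells$program, ref = "afternoon")  # afternoon program = reference level

fit <- multinom(style ~ school + program, weights = count, data = cells)
summary(fit)
exp(coef(fit))  # odds ratios vs. the traditional class style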

Page 57

Page 58

SPSS Application: Syntax

Only numeric values are allowed as SPSS data.

Schools were renumbered (1->3, 2->2, 3->1) to make school 1 the reference. DATASET NAME DataSet1 WINDOW=FRONT.

NOMREG style (BASE=LAST ORDER=ASCENDING) BY school program

/CRITERIA CIN(95) DELTA(0) MXITER(100) MXSTEP(5) CHKSEP(20) LCONVERGE(0) PCONVERGE(0.000001) SINGULAR(0.00000001)

/MODEL

/STEPWISE=PIN(.05) POUT(0.1) MINEFFECT(0) RULE(SINGLE) ENTRYMETHOD(LR) REMOVALMETHOD(LR)

/INTERCEPT=INCLUDE

/PRINT=PARAMETER SUMMARY LRT CPS STEP MFI.

Page 59

Relevant SPSS Output

Pseudo R-Square

Cox and Snell .081

Nagelkerke .093

McFadden .041

Parameter Estimates

stylea   B   Std. Error   Wald   df   Sig.   Exp(B)   95% Confidence Interval for Exp(B) (Lower Bound, Upper Bound)

self

Intercept -1.971 .320 37.842 1 .000

[school=1.00] 1.315 .384 11.727 1 .001 3.724 1.755 7.903

[school=2.00] 1.083 .354 9.360 1 .002 2.953 1.476 5.909

[school=3.00] 0b . . 0 . . . .

[program=1.00] .747 .282 7.027 1 .008 2.112 1.215 3.670

[program=2.00] 0b . . 0 . . . .

team

Intercept -1.309 .260 25.418 1 .000

[school=1.00] .656 .339 3.730 1 .053 1.926 .990 3.747

[school=2.00] .180 .317 .322 1 .570 1.197 .643 2.230

[school=3.00] 0b . . 0 . . . .

[program=1.00] .743 .271 7.533 1 .006 2.101 1.237 3.571

[program=2.00] 0b . . 0 . . . .

Page 60