
Session 10

Applied Regression -- Prof. Juran

Outline

Binary Logistic Regression

• Why?
  – Theoretical and practical difficulties in using regular (continuous) dependent variables
• How?
  – Minitab procedure
  – Interpreting results
  – Some diagnostics
  – Making predictions
  – Comparison with regular regression model


Logistic Regression

In our previous discussions of regression analysis, we have implicitly assumed that the dependent variable is continuous.

We have learned some methods for operationalizing binary independent variables (using dummy variables), but have not discussed any method for dealing with categorical or binary dependent variables with regression analysis. (One non-regression method is discriminant analysis.)

There are a number of tools available, but we will focus here on logistic regression.


The basic idea: instead of predicting the exact value of the (binary) dependent variable, we will try to model the probability that the dependent variable takes on the value of 1.

$$\pi = P(Y = 1 \mid X = x)$$

In English, $\pi$ is the probability that the dependent variable is 1, given a particular vector of values for the independent variables.


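Example: Rick Beck Consumer Credit

| Subject | Single | Married | Divorced | Widowed | Credit A | Credit B | Credit C | Credit D | Credit E | Children? | Age | Income | Debt | Female | July Default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 29 | $65,311 | $185,246 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 44 | $25,803 | $24,699 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 28 | $33,286 | $59,406 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 39 | $53,188 | $170,868 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 49 | $75,419 | $101,881 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 52 | $77,962 | $61,582 | 1 | 1 |
| 7 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 35 | $37,222 | $28,267 | 0 | 0 |
| 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 54 | $52,914 | $44,654 | 0 | 1 |
| 9 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 34 | $67,021 | $92,176 | 0 | 0 |
| 10 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 42 | $74,753 | $191,216 | 0 | 0 |
| 11 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 40 | $59,282 | $52,319 | 0 | 0 |
| 12 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 36 | $46,501 | $71,008 | 1 | 0 |
| 13 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 33 | $40,820 | $159,388 | 0 | 0 |
| 14 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 38 | $36,557 | $64,047 | 0 | 0 |
| 15 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 27 | $62,586 | $56,442 | 1 | 0 |
| 16 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 53 | $69,656 | $94,161 | 0 | 0 |
| 17 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 32 | $74,703 | $66,860 | 1 | 0 |
| 18 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 31 | $59,561 | $54,065 | 1 | 0 |
| 19 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 42 | $50,329 | $41,829 | 0 | 0 |
| 20 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 50 | $67,447 | $89,373 | 1 | 0 |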


Why not a normal multiple regression model?

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.5539 |
| R Square | 0.3068 |
| Adjusted R Square | 0.3033 |
| Standard Error | 0.3006 |
| Observations | 1000 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 5 | 39.7570 | 7.9514 | 87.9809 | 0.0000 |
| Residual | 994 | 89.8340 | 0.0904 | | |
| Total | 999 | 129.5910 | | | |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 0.1776 | 0.0283 | 6.2781 | 0.0000 |
| Single | 0.1041 | 0.0253 | 4.1154 | 0.0000 |
| Credit D | 0.3377 | 0.0317 | 10.6649 | 0.0000 |
| Credit E | 0.5498 | 0.0416 | 13.2305 | 0.0000 |
| Children? | -0.0723 | 0.0232 | -3.1116 | 0.0019 |
| Debt (x1000) | -0.0010 | 0.0002 | -5.1030 | 0.0000 |


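Here we have

$$\hat{\pi} = P(Y = 1 \mid X = x) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5$$

$$\hat{\pi} = 0.1776 + 0.1041 X_1 + 0.3377 X_2 + 0.5498 X_3 - 0.0723 X_4 - 0.0010 X_5$$

where $X_1$ = Single, $X_2$ = Credit D, $X_3$ = Credit E, $X_4$ = Children?, and $X_5$ = Debt (in $1000s).

Since $\hat{\pi}$ is an estimated probability, it shouldn’t go outside of the range from zero to one.

But our regression equation is unbounded, and in this data set $\hat{\pi}$ sometimes takes on illogical estimated values.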
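To make the problem concrete, the sketch below plugs one applicant into the linear model using the coefficients from the regression output. It assumes that "Debt (x1000)" means debt measured in thousands of dollars; the function name `linear_fit` is illustrative only. The applicant is subject 10 from the sample data (not single, credit rating A, has children, $191,216 of debt).

```python
# Coefficients from the ordinary regression output.
# Assumption: the Debt coefficient applies to debt in thousands of dollars.
INTERCEPT = 0.1776
B_SINGLE = 0.1041
B_CREDIT_D = 0.3377
B_CREDIT_E = 0.5498
B_CHILDREN = -0.0723
B_DEBT_PER_1000 = -0.0010


def linear_fit(single, credit_d, credit_e, children, debt_dollars):
    """Estimated 'probability' of default from the linear regression model."""
    return (
        INTERCEPT
        + B_SINGLE * single
        + B_CREDIT_D * credit_d
        + B_CREDIT_E * credit_e
        + B_CHILDREN * children
        + B_DEBT_PER_1000 * debt_dollars / 1000.0
    )


# Subject 10: Single = 0, Credit D = 0, Credit E = 0, Children? = 1, Debt = $191,216
estimate = linear_fit(0, 0, 0, 1, 191216)
print(f"Linear-model estimate: {estimate:.4f}")  # negative, so not a valid probability
```

Under these assumptions the estimate is below zero, which cannot be interpreted as a probability.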


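We address this problem with a logistic response function:

$$\hat{\pi} = P(Y = 1 \mid X = x) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}$$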
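A minimal sketch of the logistic response function, to show that its output stays strictly between zero and one however large or small the linear predictor becomes. The function name `logistic` and the sample inputs are illustrative, not taken from the slides.

```python
import math


def logistic(z):
    """Logistic response e^z / (1 + e^z), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# z stands for the linear predictor b0 + b1*X1 + ... + bp*Xp
for z in (-10, -2, 0, 2, 10):
    print(f"z = {z:>3}: pi = {logistic(z):.4f}")
```

Very negative values of the linear predictor map close to 0, very positive values map close to 1, and a value of 0 maps to exactly 0.5.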


[Figure: Pi vs. X. Horizontal axis: X, from 0 to 1. Vertical axis: Pi = probability that Y = 1, from 0 to 1.]


This sort of relationship will meet our criteria of keeping $\pi$ in the proper range. (Note: the cumulative normal distribution has a similar shape, and is the basis for the probit model.)

What we need is a transformation of either X or $\pi$ such that the relationship is linear. This would enable us to use linear regression to create a model.


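We will use a two-step procedure:

First, consider the ratio of the probability that Y = 1 to the probability that Y = 0, which we will call the odds ratio (strictly speaking, this quantity is the odds; an odds ratio compares the odds of two groups):

$$\frac{P(Y=1)}{P(Y=0)} = \frac{P(Y=1)}{1 - P(Y=1)} = \frac{\dfrac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}}{1 - \dfrac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}} = e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}$$

Now, taking logarithms of both sides,

$$\ln\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = \ln\left(e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p$$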
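The two steps can be checked numerically. The sketch below uses a single predictor with hypothetical coefficients (`b0 = -4`, `b1 = 8`, chosen only for illustration) and shows that the log of P(Y=1)/P(Y=0) recovers the linear predictor exactly, so it is a straight line in X even though the probability itself is S-shaped.

```python
import math


def inv_logit(z):
    """Probability implied by a linear predictor z."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Step 1: P(Y=1) / P(Y=0)."""
    return p / (1.0 - p)


def logit(p):
    """Step 2: natural log of the odds."""
    return math.log(odds(p))


# Hypothetical coefficients for a one-predictor model (not from the slides)
b0, b1 = -4.0, 8.0

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    p = inv_logit(b0 + b1 * x)
    print(f"X = {x:.2f}  pi = {p:.4f}  odds = {odds(p):8.4f}  ln(odds) = {logit(p):+.4f}")
```

Equal steps in X produce equal steps in ln(odds), which is what allows linear-model machinery to be applied on the transformed scale.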


[Figure: ln(Odds Ratio) vs. X. Horizontal axis: X, from 0 to 1. Vertical axis: ln(Odds Ratio), from -8 to 8.]


Minitab Results: Response Information

Here we get the number of observations that fall into each of the two response categories. The response value that has been designated as the “reference event” is the first entry under Value and labeled as the event. In this case, the reference event is “being in default”.

Response Information

| Variable | Value | Count | |
|---|---|---|---|
| Default | 1 | 153 | (Event) |
| | 0 | 847 | |
| | Total | 1000 | |


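Deviance Table

| Source | DF | Adj Dev | Adj Mean | Chi-Square | P-Value |
|---|---|---|---|---|---|
| Regression | 5 | 283.811 | 56.7621 | 283.81 | 0.000 |
| Single | 1 | 13.113 | 13.1125 | 13.11 | 0.000 |
| Credit D | 1 | 60.523 | 60.5230 | 60.52 | 0.000 |
| Credit E | 1 | 84.985 | 84.9850 | 84.98 | 0.000 |
| Children | 1 | 9.932 | 9.9316 | 9.93 | 0.002 |
| Debt | 1 | 39.674 | 39.6744 | 39.67 | 0.000 |
| Error | 994 | 571.945 | 0.5754 | | |
| Total | 999 | 855.756 | | | |

The rows for the individual predictors are similar to t tests for individual slopes.

The Regression row is similar to the F test for all slopes.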


| Deviance R-Sq | Deviance R-Sq(adj) | AIC |
|---|---|---|
| 33.16% | 32.58% | 583.95 |

Smaller values of the Akaike Information Criterion (AIC) indicate a better fit.
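As a consistency check, the summary figures can be reproduced from the Error and Total rows of the deviance table. The sketch assumes the usual definitions: deviance R-Sq as one minus the ratio of error deviance to total deviance, and AIC as error deviance plus twice the number of estimated parameters (this form of AIC holds for ungrouped binary data, where the deviance equals minus twice the log-likelihood). Variable names are illustrative.

```python
# Values read from the deviance table
ERROR_DEVIANCE = 571.945
TOTAL_DEVIANCE = 855.756
N_PARAMETERS = 6  # constant plus five slopes

# Deviance explained by the regression (the Regression row of the table)
regression_deviance = TOTAL_DEVIANCE - ERROR_DEVIANCE

# Share of total deviance explained by the model
deviance_r_sq = 1.0 - ERROR_DEVIANCE / TOTAL_DEVIANCE

# AIC under the stated assumption for ungrouped binary responses
aic = ERROR_DEVIANCE + 2 * N_PARAMETERS

print(f"Regression deviance: {regression_deviance:.3f}")
print(f"Deviance R-Sq:       {deviance_r_sq:.2%}")
print(f"AIC:                 {aic:.3f}")
```

Under these assumptions the results agree with the reported 283.811, 33.16%, and 583.95 up to rounding.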

Coefficients (the regression model)

| Term | Coef | SE Coef | VIF |
|---|---|---|---|
| Constant | -1.139 | 0.337 | |
| Single | 0.970 | 0.272 | 1.56 |
| Credit D | 2.023 | 0.263 | 1.18 |
| Credit E | 3.038 | 0.348 | 1.24 |
| Children | -0.849 | 0.271 | 1.57 |
| Debt | -0.000019 | 0.000004 | 1.07 |


The coefficient of 0.970 for Single represents the estimated change in the log of P(default)/P(not default) when the subject is single compared to when he/she is not single, with the other independent variables held constant.

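The coefficient of –0.000019 for Debt is the estimated change in the log of P(default)/P(not default) per additional dollar of Debt; equivalently, a $1000 increase in Debt changes the log of P(default)/P(not default) by an estimated –0.019, with the other independent variables held constant.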
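Because the coefficients are changes in the log of P(default)/P(not default), exponentiating a coefficient gives the multiplicative change in P(default)/P(not default) itself. The sketch below applies this to the rounded coefficients above; the dictionary and variable names are illustrative.

```python
import math

# Rounded coefficients from the Minitab output
coefficients = {
    "Single": 0.970,
    "Credit D": 2.023,
    "Credit E": 3.038,
    "Children": -0.849,
}
DEBT_PER_DOLLAR = -0.000019

# exp(coefficient) = factor by which P(default)/P(not default) is multiplied
multipliers = {name: math.exp(b) for name, b in coefficients.items()}
for name, factor in multipliers.items():
    print(f"{name:9s} multiplies the odds of default by {factor:.2f}")

# Effect of a $1000 increase in debt
debt_factor_per_1000 = math.exp(DEBT_PER_DOLLAR * 1000)
print(f"$1000 more debt multiplies the odds of default by {debt_factor_per_1000:.2f}")
```

For example, being single multiplies the estimated odds of default by about 2.64, holding the other variables constant, while each additional $1000 of debt multiplies them by about 0.98.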


Regression Equation

P(1) = exp(Y')/(1 + exp(Y'))

Y' = -1.139 + 0.970 Single + 2.023 Credit D + 3.038 Credit E - 0.849 Children - 0.000019 Debt

Goodness-of-Fit Tests

| Test | DF | Chi-Square | P-Value |
|---|---|---|---|
| Deviance | 994 | 571.95 | 1.000 |
| Pearson | 994 | 642.32 | 1.000 |
| Hosmer-Lemeshow | 8 | 29.76 | 0.000 |


Fits and Diagnostics for Unusual Observations

| Obs | Observed Probability | Fit | Resid | Std Resid | |
|---|---|---|---|---|---|
| 6 | 1.0000 | 0.4641 | 1.2391 | 1.25 | X |
| 39 | 1.0000 | 0.4372 | 1.2864 | 1.30 | X |
| 58 | 1.0000 | 0.4671 | 1.2338 | 1.25 | X |
| 62 | 1.0000 | 0.0872 | 2.2087 | 2.21 | R |
| 66 | 1.0000 | 0.6670 | 0.9000 | 0.91 | X |
| 85 | 1.0000 | 0.4510 | 1.2619 | 1.28 | X |
| 90 | 0.0000 | 0.6372 | -1.4240 | -1.44 | X |
| 115 | 0.0000 | 0.5637 | -1.2879 | -1.30 | X |
| 123 | 1.0000 | 0.6899 | 0.8616 | 0.88 | X |
| 136 | 1.0000 | 0.1037 | 2.1288 | 2.14 | R |


Making Predictions

| Subject | Marital Status | B&H Rating | Children | Age | Income | Debt | Gender |
|---|---|---|---|---|---|---|---|
| Lee Swedowsky | Married | A | 6 | 24 | $50,049 | $92,876 | Male |
| Renato Ferreira | Single | B | 1 | 34 | $21,334 | $139,639 | Male |
| Matt Aboud | Divorced | E | 1 | 40 | $49,638 | $33,509 | Male |
| Marjorie Coismain | Single | C | 0 | 27 | $35,541 | $25,589 | Female |
| Deb Arnold | Married | A | 2 | 35 | $53,269 | $93,890 | Female |
| Shilpi Chandra | Widowed | D | 0 | 69 | $44,070 | $41,143 | Female |
| Manya Klempner | Divorced | E | 1 | 36 | $43,243 | $29,775 | Female |
| Sanjit Bakshi | Married | C | 1 | 32 | $19,223 | $18,006 | Male |
| Paul Blake | Married | D | 3 | 34 | $33,754 | $55,331 | Male |
| Scott Sandler | Married | B | 2 | 29 | $56,893 | $44,657 | Male |


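Given the odds ratio, we can determine our estimated $\hat{\pi}$ for any person:

$$\begin{aligned}
\text{Odds Ratio} &= OR \\
\frac{\hat{\pi}}{1 - \hat{\pi}} &= OR \\
\hat{\pi} &= OR\,(1 - \hat{\pi}) = OR - OR\,\hat{\pi} \\
\hat{\pi} + OR\,\hat{\pi} &= OR \\
\hat{\pi}\,(1 + OR) &= OR \\
\hat{\pi} &= \frac{OR}{1 + OR}
\end{aligned}$$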


From Minitab

| Predictor | Coef | SE Coef | Z | P | Odds Ratio | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| Constant | -1.1393 | 0.3374 | -3.38 | 0.001 | | | |
| Single | 0.9699 | 0.2718 | 3.57 | 0.000 | 2.64 | 1.55 | 4.49 |
| Credit D | 2.0234 | 0.2629 | 7.70 | 0.000 | 7.56 | 4.52 | 12.66 |
| Credit E | 3.0384 | 0.3481 | 8.73 | 0.000 | 20.87 | 10.55 | 41.29 |
| Children | -0.849 | 0.2708 | -3.14 | 0.002 | 0.43 | 0.25 | 0.73 |
| Debt(x1000) | -0.019388 | 0.003607 | -5.38 | 0.000 | 0.98 | 0.97 | 0.99 |

Logit Model

| Subject | Single | D | E | Children | Debt (x1000) | logit | odds ratio | P(default) |
|---|---|---|---|---|---|---|---|---|
| Manya Klempner | 0 | 0 | 1 | 1 | 29.775 | 0.473 | 1.605 | 0.616 |
| Matt Aboud | 0 | 0 | 1 | 1 | 33.509 | 0.400 | 1.492 | 0.599 |
| Shilpi Chandra | 0 | 1 | 0 | 0 | 41.143 | 0.086 | 1.090 | 0.522 |
| Marjorie Coismain | 1 | 0 | 0 | 0 | 25.589 | -0.666 | 0.514 | 0.340 |
| Paul Blake | 0 | 1 | 0 | 1 | 55.331 | -1.038 | 0.354 | 0.262 |
| Sanjit Bakshi | 0 | 0 | 0 | 1 | 18.006 | -2.337 | 0.097 | 0.088 |
| Scott Sandler | 0 | 0 | 0 | 1 | 44.657 | -2.854 | 0.058 | 0.054 |
| Renato Ferreira | 1 | 0 | 0 | 1 | 139.639 | -3.726 | 0.024 | 0.024 |
| Lee Swedowsky | 0 | 0 | 0 | 1 | 92.876 | -3.789 | 0.023 | 0.022 |
| Deb Arnold | 0 | 0 | 0 | 1 | 93.89 | -3.809 | 0.022 | 0.022 |

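Spreadsheet formulas for the logit, odds ratio, and P(default) columns (cell references are to the original worksheet):

logit: =$B$17+SUMPRODUCT(TRANSPOSE($B$18:$B$22),B29:F29)

odds ratio: =EXP(H31)

P(default): =I33/(1+I33)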
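The same three-step calculation (logit, then odds ratio, then probability) can be sketched outside the spreadsheet. The coefficients are the Minitab values with Debt in thousands of dollars; the function name `predict_default` and the choice of three applicants are illustrative.

```python
import math

# Minitab coefficients (Debt measured in thousands of dollars)
COEF = {
    "Constant": -1.1393,
    "Single": 0.9699,
    "Credit D": 2.0234,
    "Credit E": 3.0384,
    "Children": -0.849,
    "Debt(x1000)": -0.019388,
}


def predict_default(single, credit_d, credit_e, children, debt_thousands):
    """Return (logit, odds ratio, P(default)) for one applicant."""
    logit = (
        COEF["Constant"]
        + COEF["Single"] * single
        + COEF["Credit D"] * credit_d
        + COEF["Credit E"] * credit_e
        + COEF["Children"] * children
        + COEF["Debt(x1000)"] * debt_thousands
    )
    odds = math.exp(logit)           # same as =EXP(logit)
    probability = odds / (1 + odds)  # same as =OR/(1+OR)
    return logit, odds, probability


# (name, Single, Credit D, Credit E, Children, Debt in $1000s)
applicants = [
    ("Manya Klempner", 0, 0, 1, 1, 29.775),
    ("Shilpi Chandra", 0, 1, 0, 0, 41.143),
    ("Deb Arnold", 0, 0, 0, 1, 93.89),
]

for name, *features in applicants:
    logit, odds, p = predict_default(*features)
    print(f"{name:15s} logit = {logit:+.3f}  odds ratio = {odds:.3f}  P(default) = {p:.3f}")
```

Note that Children enters as a 0/1 indicator, as in the worksheet, so an applicant with several children is coded 1.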


Comparison of Logit vs. Regular Models

[Figure: estimated P(Default), from 0% to 70%, for each applicant (Manya Klempner, Matt Aboud, Shilpi Chandra, Marjorie Coismain, Paul Blake, Sanjit Bakshi, Scott Sandler, Renato Ferreira, Lee Swedowsky, Deb Arnold). Series: Logit and Regular.]


| Variable | Value | Count |
|---|---|---|
| Distress | Success | 11 |
| | Failure | 127 |
| Joints | Total | 138 |

Logistic Regression Table

| Predictor | Coef | StDev | Z | P | Odds Ratio | 95% CI Lwr | 95% CI Upr |
|---|---|---|---|---|---|---|---|
| Constant | 8.294 | 2.964 | 2.80 | 0.005 | | | |
| Temp | -0.16220 | 0.04664 | -3.48 | 0.001 | 0.85 | 0.78 | 0.93 |

Log-Likelihood = -31.517

Test that all slopes are zero: G = 13.712, DF = 1, P-Value = 0.000


O-Ring Distress versus Launch Temperature: Simple Logistic Model

[Figure: probability of distress (vertical axis, 0.0 to 1.2) versus launch temperature in degrees F (horizontal axis, 25 to 85).]

$$\hat{\pi}(Temp) = \frac{1}{1 + e^{-(8.294 - 0.1622\,Temp)}}$$

$$\hat{\pi}(30) = \frac{1}{1 + e^{-(8.294 - 0.1622 \times 30)}} = 0.969$$
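The fitted curve can be evaluated at any launch temperature. A minimal sketch, using the constant and Temp coefficient from the logistic regression table; the function name `p_distress` and the temperatures other than 30 are illustrative choices.

```python
import math

B_CONSTANT = 8.294
B_TEMP = -0.16220


def p_distress(temp_f):
    """Estimated probability of O-ring distress at a launch temperature (degrees F)."""
    linear_predictor = B_CONSTANT + B_TEMP * temp_f
    return 1.0 / (1.0 + math.exp(-linear_predictor))


for temp in (30, 50, 70, 80):
    print(f"{temp} F: estimated P(distress) = {p_distress(temp):.3f}")
```

At 30 degrees F this gives approximately 0.969, matching the value computed above, and the estimated probability falls as temperature rises because the Temp coefficient is negative.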


Summary

Binary Logistic Regression

• Why?
  – Theoretical and practical difficulties in using regular (continuous) dependent variables
• How?
  – Minitab procedure
  – Interpreting results
  – Some diagnostics
  – Making predictions
  – Comparison with regular regression model


For Sessions 11 and 12

• Student presentations
