Session 10
Applied Regression -- Prof. Juran 2
Outline
Binary Logistic Regression
• Why?
– Theoretical and practical difficulties in using regular regression (designed for continuous dependent variables) with a binary dependent variable
• How?
– Minitab procedure
– Interpreting results
– Some diagnostics
– Making predictions
– Comparison with regular regression model
Logistic Regression
In our previous discussions of regression analysis, we have implicitly assumed that the dependent variable is continuous.
We have learned some methods for operationalizing binary independent variables (using dummy variables), but have not discussed any method for dealing with categorical or binary dependent variables with regression analysis. (One non-regression method is discriminant analysis.)
There are a number of tools available, but we will focus here on logistic regression.
The basic idea: instead of predicting the exact value of the (binary) dependent variable, we will try to model the probability that the dependent variable takes on the value of 1.
$$\pi = P(Y = 1 \mid X = x)$$

In English, $\pi$ is the probability that the dependent variable is 1, given a particular vector of values for the independent variables.
Example: Rick Beck Consumer Credit

Subject  Single  Married  Divorced  Widowed  Credit A  Credit B  Credit C  Credit D  Credit E  Children?  Age  Income   Debt      Female  July Default
1        1       0        0         0        1         0         0         0         0         0          29   $65,311  $185,246  1       0
2        0       1        0         0        0         0         1         0         0         1          44   $25,803  $24,699   0       0
3        0       1        0         0        0         1         0         0         0         1          28   $33,286  $59,406   0       0
4        0       0        1         0        1         0         0         0         0         0          39   $53,188  $170,868  0       0
5        0       1        0         0        0         1         0         0         0         1          49   $75,419  $101,881  0       0
6        0       1        0         0        0         0         0         0         1         1          52   $77,962  $61,582   1       1
7        0       1        0         0        0         0         1         0         0         1          35   $37,222  $28,267   0       0
8        0       1        0         0        0         0         0         1         0         1          54   $52,914  $44,654   0       1
9        0       1        0         0        0         1         0         0         0         1          34   $67,021  $92,176   0       0
10       0       0        1         0        1         0         0         0         0         1          42   $74,753  $191,216  0       0
11       0       1        0         0        0         0         1         0         0         1          40   $59,282  $52,319   0       0
12       1       0        0         0        0         1         0         0         0         0          36   $46,501  $71,008   1       0
13       0       1        0         0        1         0         0         0         0         1          33   $40,820  $159,388  0       0
14       1       0        0         0        0         1         0         0         0         0          38   $36,557  $64,047   0       0
15       0       1        0         0        0         0         1         0         0         1          27   $62,586  $56,442   1       0
16       1       0        0         0        0         1         0         0         0         0          53   $69,656  $94,161   0       0
17       0       1        0         0        0         0         1         0         0         1          32   $74,703  $66,860   1       0
18       0       1        0         0        0         0         1         0         0         1          31   $59,561  $54,065   1       0
19       0       1        0         0        0         0         1         0         0         1          42   $50,329  $41,829   0       0
20       0       0        1         0        0         1         0         0         0         1          50   $67,447  $89,373   1       0
Why not a normal multiple regression model?

Regression Statistics
Multiple R         0.5539
R Square           0.3068
Adjusted R Square  0.3033
Standard Error     0.3006
Observations       1000

ANOVA
            df   SS        MS      F        Significance F
Regression  5    39.7570   7.9514  87.9809  0.0000
Residual    994  89.8340   0.0904
Total       999  129.5910

              Coefficients  Standard Error  t Stat   P-value
Intercept     0.1776        0.0283          6.2781   0.0000
Single        0.1041        0.0253          4.1154   0.0000
Credit D      0.3377        0.0317          10.6649  0.0000
Credit E      0.5498        0.0416          13.2305  0.0000
Children?     -0.0723       0.0232          -3.1116  0.0019
Debt (x1000)  -0.0010       0.0002          -5.1030  0.0000
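Here we have

$$\hat{\pi} = P(Y = 1 \mid X = x) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5$$

$$\hat{\pi} = 0.1776 + 0.5498 X_1 + 0.3377 X_2 - 0.0000 X_3 + 0.1041 X_4 - 0.0723 X_5$$

Matching these values against the regression output, $X_1$ = Credit E, $X_2$ = Credit D, $X_3$ = Debt (in dollars, so its small negative coefficient rounds to 0.0000), $X_4$ = Single, and $X_5$ = Children.

Since $\hat{\pi}$ is an estimated probability, it shouldn't go outside of the range from zero to one.
But our regression equation is unbounded, and in this data set it sometimes takes on illogical estimated values.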
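The out-of-range problem can be checked by hand. The sketch below is a minimal Python illustration, not part of the original Minitab or Excel workflow: it plugs Subject 10 from the sample data (not single, Credit A, has children, debt of $191,216) into the linear equation using the rounded coefficients from the regression output. The function and variable names are made up for illustration, and because the Debt coefficient is rounded to four decimals the result is approximate.

```python
# Rounded coefficients copied from the multiple regression output.
LINEAR_COEFS = {
    "Intercept": 0.1776,
    "Single": 0.1041,
    "Credit D": 0.3377,
    "Credit E": 0.5498,
    "Children": -0.0723,
    "Debt (x1000)": -0.0010,
}


def linear_default_estimate(single, credit_d, credit_e, children, debt_thousands):
    """Estimated P(default) from the ordinary (linear) regression equation."""
    return (
        LINEAR_COEFS["Intercept"]
        + LINEAR_COEFS["Single"] * single
        + LINEAR_COEFS["Credit D"] * credit_d
        + LINEAR_COEFS["Credit E"] * credit_e
        + LINEAR_COEFS["Children"] * children
        + LINEAR_COEFS["Debt (x1000)"] * debt_thousands
    )


# Subject 10: Single = 0, Credit D = 0, Credit E = 0, Children = 1, Debt = $191,216.
subject_10 = linear_default_estimate(0, 0, 0, 1, 191.216)
print(f"Linear estimate for Subject 10: {subject_10:.4f}")  # a negative "probability"
```

A negative estimate cannot be read as a probability, which is the practical difficulty this slide describes.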
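We address this problem with a logistic response function:

$$\hat{\pi} = P(Y = 1 \mid X = x) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}$$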
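As a small illustration of why this function solves the range problem, the following Python sketch evaluates the logistic response at a few values of the linear predictor. The function name `logistic_response` and the sample inputs are assumptions chosen for the example; nothing here comes from the course data.

```python
import math


def logistic_response(linear_predictor):
    """Map any real-valued linear predictor to a value strictly between 0 and 1."""
    exp_term = math.exp(linear_predictor)
    return exp_term / (1.0 + exp_term)


# However negative or positive the linear predictor is, the output stays in (0, 1).
for eta in (-10, -2, 0, 2, 10):
    print(f"linear predictor {eta:>4}: probability {logistic_response(eta):.4f}")
```

A linear predictor of zero corresponds to a probability of exactly one half, and the output approaches 0 or 1 only asymptotically.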
[Figure: Pi vs. X. Vertical axis: Pi = probability that Y = 1, from 0 to 1; horizontal axis: X, from 0 to 1.]
This sort of relationship will meet our criterion of keeping $\hat{\pi}$ in the proper range. (Note: the cumulative normal distribution has a similar shape, and is the basis for the probit model.)
What we need is a transformation of either X or $\pi$ such that the relationship is linear. This would enable us to use linear regression to create a model.
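We will use a two-step procedure:

First, consider the ratio of the probability that Y = 1 to the probability that Y = 0, which we will call the odds ratio:

$$\frac{P(Y=1)}{P(Y=0)} = \frac{P(Y=1)}{1 - P(Y=1)} = \frac{\dfrac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}}{1 - \dfrac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}} = e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}$$

Now, taking logarithms of both sides,

$$\ln\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = \ln\left(e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p$$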
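The two steps can be traced numerically. This Python sketch is illustrative only (the helper names `odds`, `logit`, and `inverse_logit` are assumptions, and the probabilities are arbitrary): it converts a probability to the odds, takes the log, and then applies the logistic response function to confirm that the original probability comes back.

```python
import math


def odds(probability):
    """Step 1: ratio of P(Y = 1) to P(Y = 0)."""
    return probability / (1.0 - probability)


def logit(probability):
    """Step 2: natural log of the odds, which is linear in the X variables."""
    return math.log(odds(probability))


def inverse_logit(linear_predictor):
    """Logistic response function: undoes the logit transformation."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


for p in (0.1, 0.5, 0.9):
    eta = logit(p)
    print(f"p = {p:.1f}  odds = {odds(p):.3f}  ln(odds) = {eta:+.3f}  "
          f"back to p = {inverse_logit(eta):.3f}")
```

The log of the odds is unbounded in both directions, which is what makes it a suitable left-hand side for a linear model.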
[Figure: ln(Odds Ratio) vs. X. Vertical axis: ln(Odds Ratio), from -8 to 8; horizontal axis: X, from 0 to 1.]
Minitab Results: Response Information
Here we get the number of observations that fall into each of the two response categories. The response value that has been designated as the “reference event” is the first entry under Value and labeled as the event. In this case, the reference event is “being in default”.

Response Information

Variable  Value  Count
Default   1      153    (Event)
          0      847
          Total  1000
Deviance Table

Source      DF   Adj Dev  Adj Mean  Chi-Square  P-Value
Regression  5    283.811  56.7621   283.81      0.000
  Single    1    13.113   13.1125   13.11       0.000
  Credit D  1    60.523   60.5230   60.52       0.000
  Credit E  1    84.985   84.9850   84.98       0.000
  Children  1    9.932    9.9316    9.93        0.002
  Debt      1    39.674   39.6744   39.67       0.000
Error       994  571.945  0.5754
Total       999  855.756

The Regression row is similar to the F test for all slopes.
The rows for the individual predictors are similar to t tests for individual slopes.
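To see where the P-Value column comes from, the sketch below recomputes the p-values for the single-predictor rows of the Deviance Table. It assumes each adjusted deviance is referred to a chi-square distribution with 1 degree of freedom, and it uses the identity P(chi-square > x) = erfc(sqrt(x/2)) for 1 df so that only the Python standard library is needed. The function name is hypothetical.

```python
import math


def chi_square_pvalue_1df(statistic):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    With 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# Adjusted deviances for the individual predictors, copied from the Deviance Table.
adjusted_deviance = {
    "Single": 13.113,
    "Credit D": 60.523,
    "Credit E": 84.985,
    "Children": 9.932,
    "Debt": 39.674,
}

for term, deviance in adjusted_deviance.items():
    p_value = chi_square_pvalue_1df(deviance)
    print(f"{term:<9} chi-square = {deviance:7.3f}  p-value = {p_value:.3f}")
```

Rounded to three decimals, the Children row gives 0.002 and the other rows give 0.000, in line with the table.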
Smaller values of Akaike Information Criterion (AIC) indicate a better fit.

Deviance R-Sq  Deviance R-Sq(adj)  AIC
33.16%         32.58%              583.95

Coefficients (the regression model)

Term      Coef       SE Coef   VIF
Constant  -1.139     0.337
Single    0.970      0.272     1.56
Credit D  2.023      0.263     1.18
Credit E  3.038      0.348     1.24
Children  -0.849     0.271     1.57
Debt      -0.000019  0.000004  1.07
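The summary statistics can be tied back to the Deviance Table with a little arithmetic. The sketch below is a hedged reconstruction rather than Minitab's documented procedure: it assumes the deviance R-squared is one minus the ratio of error deviance to total deviance, and that for ungrouped 0/1 responses the error deviance equals -2 times the log-likelihood, so that AIC is the error deviance plus twice the number of estimated parameters. Under those assumptions the reported figures are reproduced to rounding.

```python
# Values copied from the Deviance Table.
error_deviance = 571.945
total_deviance = 855.756

# Assumption: six estimated parameters (the constant plus five slopes).
n_parameters = 6

# Share of the total deviance explained by the model.
deviance_r_sq = 1.0 - error_deviance / total_deviance

# Assumption: error deviance = -2 * log-likelihood for ungrouped binary data.
aic = error_deviance + 2 * n_parameters

print(f"Deviance R-Sq: {deviance_r_sq:.4f}")
print(f"AIC:           {aic:.3f}")
```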
The coefficient of 0.970 for Single represents the estimated change in the log of P(default)/P(not default) when the subject is single compared to when he/she is not single, with the other independent variables held constant.
The coefficient of -0.000019 for Debt is expressed per dollar. Equivalently, -0.019 is the estimated change in the log of P(default)/P(not default) with a $1,000 increase in Debt, with the other independent variables held constant.
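Because each coefficient is a change in the log of the odds, exponentiating it gives the multiplicative change in the odds themselves. The Python sketch below illustrates this with the four-decimal coefficients from the Minitab predictor table shown later in this session (Debt measured in thousands of dollars); the dictionary and variable names are illustrative.

```python
import math

# Coefficients from the Minitab predictor table (Debt in thousands of dollars).
coefficients = {
    "Single": 0.9699,
    "Credit D": 2.0234,
    "Credit E": 3.0384,
    "Children": -0.8490,
    "Debt (x1000)": -0.019388,
}

# exp(coefficient) = factor by which the odds of default are multiplied
# when that variable increases by one unit, other variables held constant.
odds_ratios = {term: math.exp(coef) for term, coef in coefficients.items()}

for term, ratio in odds_ratios.items():
    print(f"{term:<13} odds multiplier = {ratio:.2f}")
```

For example, the multiplier for Single is about 2.64: being single multiplies the estimated odds of default by roughly that factor, other things equal. A multiplier below 1, as for Children and Debt, means the odds go down.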
Regression Equation
P(1) = exp(Y')/(1 + exp(Y'))
Y' = -1.139 + 0.970 Single + 2.023 Credit D + 3.038 Credit E - 0.849 Children - 0.000019 Debt
Goodness-of-Fit Tests

Test             DF   Chi-Square  P-Value
Deviance         994  571.95      1.000
Pearson          994  642.32      1.000
Hosmer-Lemeshow  8    29.76       0.000
Fits and Diagnostics for Unusual Observations

Obs  Observed Probability  Fit     Resid    Std Resid
6    1.0000                0.4641  1.2391   1.25       X
39   1.0000                0.4372  1.2864   1.30       X
58   1.0000                0.4671  1.2338   1.25       X
62   1.0000                0.0872  2.2087   2.21       R
66   1.0000                0.6670  0.9000   0.91       X
85   1.0000                0.4510  1.2619   1.28       X
90   0.0000                0.6372  -1.4240  -1.44      X
115  0.0000                0.5637  -1.2879  -1.30      X
123  1.0000                0.6899  0.8616   0.88       X
136  1.0000                0.1037  2.1288   2.14       R

R = large residual; X = unusual X
Making Predictions

Subject            Marital Status  B&H Rating  Children  Age  Income   Debt      Gender
Lee Swedowsky      Married         A           6         24   $50,049  $92,876   Male
Renato Ferreira    Single          B           1         34   $21,334  $139,639  Male
Matt Aboud         Divorced        E           1         40   $49,638  $33,509   Male
Marjorie Coismain  Single          C           0         27   $35,541  $25,589   Female
Deb Arnold         Married         A           2         35   $53,269  $93,890   Female
Shilpi Chandra     Widowed         D           0         69   $44,070  $41,143   Female
Manya Klempner     Divorced        E           1         36   $43,243  $29,775   Female
Sanjit Bakshi      Married         C           1         32   $19,223  $18,006   Male
Paul Blake         Married         D           3         34   $33,754  $55,331   Male
Scott Sandler      Married         B           2         29   $56,893  $44,657   Male
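Given the odds ratio, we can determine our estimated $\hat{\pi}$ for any person. Writing Odds Ratio = OR:

$$\frac{\hat{\pi}}{1 - \hat{\pi}} = OR$$

$$\hat{\pi} = OR\,(1 - \hat{\pi}) = OR - OR\,\hat{\pi}$$

$$\hat{\pi} + OR\,\hat{\pi} = OR$$

$$\hat{\pi}\,(1 + OR) = OR$$

$$\hat{\pi} = \frac{OR}{1 + OR}$$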
From Minitab

Predictor     Coef       SE Coef   Z      P      Odds Ratio  95% CI Lower  95% CI Upper
Constant      -1.1393    0.3374    -3.38  0.001
Single        0.9699     0.2718    3.57   0.000  2.64        1.55          4.49
Credit D      2.0234     0.2629    7.70   0.000  7.56        4.52          12.66
Credit E      3.0384     0.3481    8.73   0.000  20.87       10.55         41.29
Children      -0.849     0.2708    -3.14  0.002  0.43        0.25          0.73
Debt (x1000)  -0.019388  0.003607  -5.38  0.000  0.98        0.97          0.99

Logit Model

Subject            Single  D  E  Children  Debt (x1000)  logit   odds ratio  P(default)
Manya Klempner     0       0  1  1         29.775        0.473   1.605       0.616
Matt Aboud         0       0  1  1         33.509        0.400   1.492       0.599
Shilpi Chandra     0       1  0  0         41.143        0.086   1.090       0.522
Marjorie Coismain  1       0  0  0         25.589        -0.666  0.514       0.340
Paul Blake         0       1  0  1         55.331        -1.038  0.354       0.262
Sanjit Bakshi      0       0  0  1         18.006        -2.337  0.097       0.088
Scott Sandler      0       0  0  1         44.657        -2.854  0.058       0.054
Renato Ferreira    1       0  0  1         139.639       -3.726  0.024       0.024
Lee Swedowsky      0       0  0  1         92.876        -3.789  0.023       0.022
Deb Arnold         0       0  0  1         93.89         -3.809  0.022       0.022

Spreadsheet formulas (each shown for one example row):
logit:       =$B$17+SUMPRODUCT(TRANSPOSE($B$18:$B$22),B29:F29)
odds ratio:  =EXP(H31)
P(default):  =I33/(1+I33)
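The same three spreadsheet steps (logit, odds ratio, probability) can be sketched in Python. This is an illustrative translation, not the course's own code: the coefficients and applicant inputs are copied from the tables in this section, Children is coded 0/1 as in the logit table, Debt is in thousands of dollars, and the function and variable names are assumptions.

```python
import math

# Coefficients from the Minitab output.
INTERCEPT = -1.1393
# Order: Single, Credit D, Credit E, Children (0/1), Debt (x1000).
COEFS = (0.9699, 2.0234, 3.0384, -0.8490, -0.019388)

# Applicant inputs in the same order as COEFS.
applicants = {
    "Manya Klempner": (0, 0, 1, 1, 29.775),
    "Matt Aboud": (0, 0, 1, 1, 33.509),
    "Shilpi Chandra": (0, 1, 0, 0, 41.143),
    "Marjorie Coismain": (1, 0, 0, 0, 25.589),
    "Paul Blake": (0, 1, 0, 1, 55.331),
    "Sanjit Bakshi": (0, 0, 0, 1, 18.006),
    "Scott Sandler": (0, 0, 0, 1, 44.657),
    "Renato Ferreira": (1, 0, 0, 1, 139.639),
    "Lee Swedowsky": (0, 0, 0, 1, 92.876),
    "Deb Arnold": (0, 0, 0, 1, 93.89),
}


def predict_default(inputs):
    """Return (logit, odds ratio, P(default)) for one applicant."""
    logit = INTERCEPT + sum(b * x for b, x in zip(COEFS, inputs))
    odds_ratio = math.exp(logit)                  # spreadsheet: =EXP(logit)
    probability = odds_ratio / (1.0 + odds_ratio)  # spreadsheet: =OR/(1+OR)
    return logit, odds_ratio, probability


predictions = {name: predict_default(x) for name, x in applicants.items()}

for name, (logit, odds_ratio, probability) in predictions.items():
    print(f"{name:<18} {logit:7.3f} {odds_ratio:7.3f} {probability:7.3f}")
```

The printed values should agree with the Logit Model table up to rounding, since the inputs are the same rounded coefficients.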
[Figure: Comparison of Logit vs. Regular Models. Vertical axis: P(Default), from 0% to 70%; horizontal axis: Applicant (Manya Klempner, Matt Aboud, Shilpi Chandra, Marjorie Coismain, Paul Blake, Sanjit Bakshi, Scott Sandler, Renato Ferreira, Lee Swedowsky, Deb Arnold); series: Logit and Regular.]
Variable  Distress  Count
Distress  Success   11
          Failure   127
Joints    Total     138

Logistic Regression Table

Predictor  Coef      StDev    Z      P      Odds Ratio  95% CI Lwr  95% CI Upr
Constant   8.294     2.964    2.80   0.005
Temp       -0.16220  0.04664  -3.48  0.001  0.85        0.78        0.93

Log-Likelihood = -31.517
Test that all slopes are zero: G = 13.712, DF = 1, P-Value = 0.000
O-Ring Distress versus Launch Temperature
Simple Logistic Model

[Figure: Vertical axis: Probability of Distress, from 0.0 to 1.2; horizontal axis: Launch Temperature (Degrees F), from 25 to 85.]

$$\hat{\pi}(Temp) = \frac{1}{1 + e^{-(8.294 - 0.1622\,Temp)}}$$

$$\hat{\pi}(30) = \frac{1}{1 + e^{-(8.294 - 0.1622 \times 30)}} = 0.969$$
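The fitted O-ring model is easy to evaluate at other temperatures. The Python sketch below uses the constant and Temp coefficient from the Logistic Regression Table; the function name and the list of temperatures are assumptions chosen for illustration.

```python
import math

# Coefficients from the O-ring logistic regression table.
CONSTANT = 8.294
TEMP_COEF = -0.1622


def distress_probability(temp_f):
    """Estimated probability of O-ring distress at a launch temperature in degrees F."""
    linear_predictor = CONSTANT + TEMP_COEF * temp_f
    return 1.0 / (1.0 + math.exp(-linear_predictor))


for temp in (30, 50, 70, 80):
    print(f"{temp} F: estimated P(distress) = {distress_probability(temp):.3f}")
```

At 30 degrees F the estimate rounds to 0.969, matching the slide, and the estimated probability falls as the launch temperature rises because the Temp coefficient is negative.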
Summary
Binary Logistic Regression
• Why?
– Theoretical and practical difficulties in using regular regression (designed for continuous dependent variables) with a binary dependent variable
• How?
– Minitab procedure
– Interpreting results
– Some diagnostics
– Making predictions
– Comparison with regular regression model
For Sessions 11 and 12
• Student presentations