Session 10: Binary Logistic Regression


Upload: amos-warren

Post on 02-Jan-2016



Applied Regression -- Prof. Juran

Outline: Binary Logistic Regression

• Why?
  – Theoretical and practical difficulties in applying regular regression (built for continuous dependent variables) to a binary dependent variable
• How?
  – Minitab procedure
  – Interpreting results
  – Some diagnostics
  – Making predictions
  – Comparison with regular regression model


Logistic Regression

In our previous discussions of regression analysis, we have implicitly assumed that the dependent variable is continuous.

We have learned some methods for operationalizing binary independent variables (using dummy variables), but have not discussed any method for dealing with categorical or binary dependent variables with regression analysis. (One non-regression method is discriminant analysis.)

There are a number of tools available, but we will focus here on logistic regression.


The basic idea: instead of predicting the exact value of the (binary) dependent variable, we will try to model the probability that the dependent variable takes on the value of 1.

$$\pi = P(Y = 1 \mid X = x)$$

In English, $\pi$ is the probability that the dependent variable is 1, given a particular vector of values for the independent variables.


Example: Rick Beck Consumer Credit

Subject  Single  Married  Divorced  Widowed  Credit A  Credit B  Credit C  Credit D  Credit E  Children?  Age  Income   Debt      Female  July Default
1        1       0        0         0        1         0         0         0         0         0          29   $65,311  $185,246  1       0
2        0       1        0         0        0         0         1         0         0         1          44   $25,803  $24,699   0       0
3        0       1        0         0        0         1         0         0         0         1          28   $33,286  $59,406   0       0
4        0       0        1         0        1         0         0         0         0         0          39   $53,188  $170,868  0       0
5        0       1        0         0        0         1         0         0         0         1          49   $75,419  $101,881  0       0
6        0       1        0         0        0         0         0         0         1         1          52   $77,962  $61,582   1       1
7        0       1        0         0        0         0         1         0         0         1          35   $37,222  $28,267   0       0
8        0       1        0         0        0         0         0         1         0         1          54   $52,914  $44,654   0       1
9        0       1        0         0        0         1         0         0         0         1          34   $67,021  $92,176   0       0
10       0       0        1         0        1         0         0         0         0         1          42   $74,753  $191,216  0       0
11       0       1        0         0        0         0         1         0         0         1          40   $59,282  $52,319   0       0
12       1       0        0         0        0         1         0         0         0         0          36   $46,501  $71,008   1       0
13       0       1        0         0        1         0         0         0         0         1          33   $40,820  $159,388  0       0
14       1       0        0         0        0         1         0         0         0         0          38   $36,557  $64,047   0       0
15       0       1        0         0        0         0         1         0         0         1          27   $62,586  $56,442   1       0
16       1       0        0         0        0         1         0         0         0         0          53   $69,656  $94,161   0       0
17       0       1        0         0        0         0         1         0         0         1          32   $74,703  $66,860   1       0
18       0       1        0         0        0         0         1         0         0         1          31   $59,561  $54,065   1       0
19       0       1        0         0        0         0         1         0         0         1          42   $50,329  $41,829   0       0
20       0       0        1         0        0         1         0         0         0         1          50   $67,447  $89,373   1       0


Why not a normal multiple regression model?

Regression Statistics
Multiple R          0.5539
R Square            0.3068
Adjusted R Square   0.3033
Standard Error      0.3006
Observations        1000

ANOVA
             df    SS         MS       F         Significance F
Regression   5     39.7570    7.9514   87.9809   0.0000
Residual     994   89.8340    0.0904
Total        999   129.5910

               Coefficients   Standard Error   t Stat     P-value
Intercept      0.1776         0.0283           6.2781     0.0000
Single         0.1041         0.0253           4.1154     0.0000
Credit D       0.3377         0.0317           10.6649    0.0000
Credit E       0.5498         0.0416           13.2305    0.0000
Children?      -0.0723        0.0232           -3.1116    0.0019
Debt (x1000)   -0.0010        0.0002           -5.1030    0.0000


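Here we have

$$\hat\pi = P(Y = 1 \mid X = x) = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \hat\beta_3 X_3 + \hat\beta_4 X_4 + \hat\beta_5 X_5$$

$$\hat\pi = 0.1776 + 0.5498 X_1 + 0.3377 X_2 - 0.0000 X_3 + 0.1041 X_4 - 0.0723 X_5$$

Matching these values to the regression output, $X_1$ = Credit E, $X_2$ = Credit D, $X_3$ = Debt (presumably in dollars, so its slope rounds to zero at four decimals), $X_4$ = Single, and $X_5$ = Children.

Since $\hat\pi$ is an estimated probability, it shouldn’t go outside of the range from zero to one.

But our regression equation is unbounded, and in this data set $\hat\pi$ sometimes takes on illogical estimated values.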
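To make the out-of-range problem concrete, here is a minimal Python sketch that plugs one row of the credit data into the regular regression equation. The coefficients are the rounded values from the regression output (Debt in thousands of dollars); the function name `linear_fit` and the dictionary layout are illustrative assumptions, not part of the original slides.

```python
# Rounded coefficients from the regular regression output (Debt in $1000s).
LINEAR_COEF = {
    "Intercept": 0.1776,
    "Single": 0.1041,
    "Credit D": 0.3377,
    "Credit E": 0.5498,
    "Children": -0.0723,
    "Debt_k": -0.0010,
}


def linear_fit(single, credit_d, credit_e, children, debt_k):
    """Fitted 'probability' of default from the ordinary linear regression."""
    c = LINEAR_COEF
    return (c["Intercept"]
            + c["Single"] * single
            + c["Credit D"] * credit_d
            + c["Credit E"] * credit_e
            + c["Children"] * children
            + c["Debt_k"] * debt_k)


# Subject 10 in the data: not single, Credit A, has children, $191,216 of debt.
fit_subject_10 = linear_fit(single=0, credit_d=0, credit_e=0,
                            children=1, debt_k=191.216)
print(round(fit_subject_10, 4))  # a negative value, which cannot be a probability
```

Because the coefficients are rounded, the exact figure is approximate, but the sign is what matters: the linear model assigns this subject a negative "probability" of default.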


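We address this problem with a logistic response function:

$$\hat\pi = P(Y = 1 \mid X = x) = \frac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}$$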
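The logistic response function can be sketched in a few lines of Python to show that its output stays strictly between zero and one however large or small the linear predictor gets. The name `logistic` and the sample values of the linear predictor are assumptions chosen for illustration.

```python
import math


def logistic(eta):
    """Logistic response e^eta / (1 + e^eta), arranged to avoid overflow."""
    if eta >= 0:
        # Algebraically identical form: 1 / (1 + e^(-eta)).
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


# The linear predictor can be any real number; the response never leaves (0, 1).
for eta in (-20.0, -2.0, 0.0, 2.0, 20.0):
    print(eta, logistic(eta))
```

The two branches compute the same quantity; splitting on the sign of `eta` only keeps `math.exp` from being called with a large positive argument.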


[Figure: π vs. X. Horizontal axis: X (0 to 1); vertical axis: π = probability that Y = 1 (0 to 1).]


This sort of relationship will meet our criterion of keeping $\hat\pi$ in the proper range. (Note: the cumulative normal distribution has a similar shape, and is the basis for the probit model.)

What we need is a transformation of either X or $\hat\pi$ such that the relationship is linear. This would enable us to use linear regression to create a model.


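We will use a two-step procedure:

First, consider the ratio of the probability that Y = 1 to the probability that Y = 0, which we will call the odds ratio (strictly speaking this quantity is the odds, but these slides call it the odds ratio throughout):

$$\frac{P(Y=1)}{P(Y=0)} = \frac{P(Y=1)}{1 - P(Y=1)} = \frac{\dfrac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}}{1 - \dfrac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}} = e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}$$

Now, taking logarithms of both sides,

$$\ln\left(\frac{\hat\pi}{1 - \hat\pi}\right) = \ln\left(e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}\right) = \hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p$$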
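The two-step procedure can be checked numerically. This Python sketch starts from an arbitrary value of the linear predictor, turns it into a probability with the logistic response function, and confirms that the odds equal the exponential of the predictor and that the log of the odds recovers it. The value 0.75 and the helper names `odds` and `log_odds` are assumptions for illustration only.

```python
import math


def odds(p):
    """Step 1: ratio of P(Y = 1) to P(Y = 0)."""
    return p / (1.0 - p)


def log_odds(p):
    """Step 2: natural log of the odds (the logit)."""
    return math.log(odds(p))


eta = 0.75  # arbitrary value of beta0 + beta1*X1 + ... + betap*Xp
p = math.exp(eta) / (1.0 + math.exp(eta))  # logistic response function

print(odds(p), math.exp(eta))  # the two numbers agree up to rounding error
print(log_odds(p), eta)        # the logit recovers the linear predictor
```

Since the logit is linear in the X variables, the coefficients can be read as additive effects on the log-odds scale.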


[Figure: ln(Odds Ratio) vs. X. Horizontal axis: X (0 to 1); vertical axis: ln(Odds Ratio) (-8 to 8).]


Minitab Results: Response Information

Here we get the number of observations that fall into each of the two response categories. The response value that has been designated as the “reference event” is the first entry under Value and labeled as the event. In this case, the reference event is “being in default”.

Response Information

Variable   Value   Count
Default    1       153    (Event)
           0       847
           Total   1000


Deviance Table

Source        DF    Adj Dev   Adj Mean   Chi-Square   P-Value
Regression    5     283.811   56.7621    283.81       0.000
  Single      1     13.113    13.1125    13.11        0.000
  Credit D    1     60.523    60.5230    60.52        0.000
  Credit E    1     84.985    84.9850    84.98        0.000
  Children    1     9.932     9.9316     9.93         0.002
  Debt        1     39.674    39.6744    39.67        0.000
Error         994   571.945   0.5754
Total         999   855.756

The Regression row is similar to the F test for all slopes. The rows for the individual predictors are similar to t tests for individual slopes.


Smaller values of the Akaike Information Criterion (AIC) indicate a better fit.

Deviance R-Sq   Deviance R-Sq(adj)   AIC
33.16%          32.58%               583.95

Coefficients (the regression model)

Term       Coef        SE Coef    VIF
Constant   -1.139      0.337
Single     0.970       0.272      1.56
Credit D   2.023       0.263      1.18
Credit E   3.038       0.348      1.24
Children   -0.849      0.271      1.57
Debt       -0.000019   0.000004   1.07
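The summary statistics can be related back to the deviance table with some simple arithmetic, sketched here in Python. The formulas are inferred from the reported numbers rather than taken from Minitab documentation, so treat them as assumptions: the adjusted R-Sq is taken to penalize the error deviance by the number of slopes, and AIC is taken as error deviance plus twice the number of estimated parameters (which matches the usual definition when each row is a single 0/1 outcome).

```python
# Values copied from the deviance table.
total_deviance = 855.756
error_deviance = 571.945
n_slopes = 5  # Single, Credit D, Credit E, Children, Debt

# Deviance explained by the model (the Regression row of the table).
regression_deviance = total_deviance - error_deviance

# Share of the total deviance explained by the model.
deviance_r_sq = 1 - error_deviance / total_deviance

# Assumed adjustment: add one unit of deviance per slope before comparing.
deviance_r_sq_adj = 1 - (error_deviance + n_slopes) / total_deviance

# Assumed AIC: error deviance plus 2 x (slopes + constant).
aic = error_deviance + 2 * (n_slopes + 1)

print(f"Regression deviance: {regression_deviance:.3f}")
print(f"Deviance R-Sq:       {deviance_r_sq:.2%}")
print(f"Deviance R-Sq(adj):  {deviance_r_sq_adj:.2%}")
print(f"AIC:                 {aic:.3f}")
```

Under these assumptions the results agree with the reported 283.811, 33.16%, 32.58%, and 583.95 to the precision shown.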


The coefficient of 0.970 for Single represents the estimated change in the log of P(default)/P(not default) when the subject is single compared to when he/she is not single, with the other independent variables held constant.

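The coefficient of –0.019 for Debt (with Debt measured in thousands of dollars; this is the same slope as the –0.000019 per dollar in the coefficients table) is the estimated change in the log of P(default)/P(not default) with a $1000 increase in Debt, with the other independent variables held constant.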
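Because each coefficient is a change in the log of P(default)/P(not default), exponentiating it gives the factor by which the odds are multiplied. A short Python sketch, using the more precise coefficients from the Minitab table that appears with the predictions later in these slides (the dictionary name `coef` is an illustrative assumption):

```python
import math

# Coefficients on the log-odds scale (Debt in $1000s).
coef = {
    "Single": 0.9699,
    "Credit D": 2.0234,
    "Credit E": 3.0384,
    "Children": -0.849,
    "Debt (x1000)": -0.019388,
}

# exp(coefficient) = multiplicative change in the odds of default.
odds_multiplier = {name: math.exp(b) for name, b in coef.items()}

for name, factor in odds_multiplier.items():
    print(f"{name:<13} {factor:.2f}")
```

These factors reproduce the Odds Ratio column of that table (2.64, 7.56, 20.87, 0.43, 0.98). For example, being single multiplies the estimated odds of default by about 2.64, other variables held constant.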


Regression Equation

P(1) = exp(Y')/(1 + exp(Y'))

Y' = -1.139 + 0.970 Single + 2.023 Credit D + 3.038 Credit E - 0.849 Children - 0.000019 Debt

Goodness-of-Fit Tests

Test              DF    Chi-Square   P-Value
Deviance          994   571.95       1.000
Pearson           994   642.32       1.000
Hosmer-Lemeshow   8     29.76        0.000


Fits and Diagnostics for Unusual Observations

Obs   Observed Probability   Fit      Resid     Std Resid
6     1.0000                 0.4641   1.2391    1.25        X
39    1.0000                 0.4372   1.2864    1.30        X
58    1.0000                 0.4671   1.2338    1.25        X
62    1.0000                 0.0872   2.2087    2.21        R
66    1.0000                 0.6670   0.9000    0.91        X
85    1.0000                 0.4510   1.2619    1.28        X
90    0.0000                 0.6372   -1.4240   -1.44       X
115   0.0000                 0.5637   -1.2879   -1.30       X
123   1.0000                 0.6899   0.8616    0.88        X
136   1.0000                 0.1037   2.1288    2.14        R

R marks an observation with a large residual; X marks an observation with unusual X values.


Making Predictions

Subject             Marital Status   B&H Rating   Children   Age   Income     Debt       Gender
Lee Swedowsky       Married          A            6          24    $50,049    $92,876    Male
Renato Ferreira     Single           B            1          34    $21,334    $139,639   Male
Matt Aboud          Divorced         E            1          40    $49,638    $33,509    Male
Marjorie Coismain   Single           C            0          27    $35,541    $25,589    Female
Deb Arnold          Married          A            2          35    $53,269    $93,890    Female
Shilpi Chandra      Widowed          D            0          69    $44,070    $41,143    Female
Manya Klempner      Divorced         E            1          36    $43,243    $29,775    Female
Sanjit Bakshi       Married          C            1          32    $19,223    $18,006    Male
Paul Blake          Married          D            3          34    $33,754    $55,331    Male
Scott Sandler       Married          B            2          29    $56,893    $44,657    Male


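Given the odds ratio, we can determine our estimated $\hat\pi$ for any person. Writing OR for the odds ratio:

$$
\begin{aligned}
\frac{\hat\pi}{1 - \hat\pi} &= OR \\
\hat\pi &= OR\,(1 - \hat\pi) = OR - OR\,\hat\pi \\
\hat\pi + OR\,\hat\pi &= OR \\
\hat\pi\,(1 + OR) &= OR \\
\hat\pi &= \frac{OR}{1 + OR}
\end{aligned}
$$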


From Minitab

Predictor     Coef        SE Coef    Z       P       Odds Ratio   95% CI Lower   95% CI Upper
Constant      -1.1393     0.3374     -3.38   0.001
Single        0.9699      0.2718     3.57    0.000   2.64         1.55           4.49
Credit D      2.0234      0.2629     7.70    0.000   7.56         4.52           12.66
Credit E      3.0384      0.3481     8.73    0.000   20.87        10.55          41.29
Children      -0.849      0.2708     -3.14   0.002   0.43         0.25           0.73
Debt(x1000)   -0.019388   0.003607   -5.38   0.000   0.98         0.97           0.99

Logit Model

Subject             Single   D   E   Children   Debt      logit    odds ratio   P(default)
Manya Klempner      0        0   1   1          29.775    0.473    1.605        0.616
Matt Aboud          0        0   1   1          33.509    0.400    1.492        0.599
Shilpi Chandra      0        1   0   0          41.143    0.086    1.090        0.522
Marjorie Coismain   1        0   0   0          25.589    -0.666   0.514        0.340
Paul Blake          0        1   0   1          55.331    -1.038   0.354        0.262
Sanjit Bakshi       0        0   0   1          18.006    -2.337   0.097        0.088
Scott Sandler       0        0   0   1          44.657    -2.854   0.058        0.054
Renato Ferreira     1        0   0   1          139.639   -3.726   0.024        0.024
Lee Swedowsky       0        0   0   1          92.876    -3.789   0.023        0.022
Deb Arnold          0        0   0   1          93.89     -3.809   0.022        0.022

Spreadsheet formulas behind the last three columns:

logit:        =$B$17+SUMPRODUCT(TRANSPOSE($B$18:$B$22),B29:F29)
odds ratio:   =EXP(H31)
P(default):   =I33/(1+I33)
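The three spreadsheet formulas translate directly into code. The following Python sketch recomputes the logit, odds ratio, and P(default) for three of the applicants from the Minitab coefficients. The function name `predict`, the dictionary keys, and the choice of applicants are assumptions for illustration.

```python
import math

# Coefficients from the Minitab table (Debt in $1000s).
COEF = {
    "Constant": -1.1393,
    "Single": 0.9699,
    "Credit D": 2.0234,
    "Credit E": 3.0384,
    "Children": -0.849,
    "Debt_k": -0.019388,
}


def predict(single, credit_d, credit_e, children, debt_k):
    """Return (logit, odds, probability of default) for one applicant."""
    logit = (COEF["Constant"]
             + COEF["Single"] * single
             + COEF["Credit D"] * credit_d
             + COEF["Credit E"] * credit_e
             + COEF["Children"] * children
             + COEF["Debt_k"] * debt_k)
    odds = math.exp(logit)        # same role as =EXP(...)
    prob = odds / (1.0 + odds)    # same role as =OR/(1+OR)
    return logit, odds, prob


# (Single, Credit D, Credit E, Children, Debt in $1000s)
applicants = {
    "Manya Klempner": (0, 0, 1, 1, 29.775),
    "Shilpi Chandra": (0, 1, 0, 0, 41.143),
    "Deb Arnold": (0, 0, 0, 1, 93.89),
}

for name, x in applicants.items():
    logit, odds, prob = predict(*x)
    print(f"{name:<16} logit={logit:7.3f}  odds={odds:6.3f}  P(default)={prob:.3f}")
```

The printed values should match the corresponding rows of the Logit Model table to three decimals.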


[Figure: Comparison of Logit vs. Regular Models. Horizontal axis: Applicant (the ten applicants listed above); vertical axis: P(Default), 0% to 70%; series: Logit and Regular.]


Variable   Distress   Count
Distress   Success    11
           Failure    127
Joints     Total      138

Logistic Regression Table

Predictor   Coef       StDev     Z       P       Odds Ratio   95% CI Lwr   95% CI Upr
Constant    8.294      2.964     2.80    0.005
Temp        -0.16220   0.04664   -3.48   0.001   0.85         0.78         0.93

Log-Likelihood = -31.517
Test that all slopes are zero: G = 13.712, DF = 1, P-Value = 0.000


[Figure: O-Ring Distress versus Launch Temperature, Simple Logistic Model. Horizontal axis: Launch Temperature (Degrees F), 25 to 85; vertical axis: Probability of Distress, 0.0 to 1.2.]

$$\hat\pi(\text{Temp}) = \frac{1}{1 + e^{-(8.294 - 0.1622\,\text{Temp})}}$$

$$\hat\pi(30) = \frac{1}{1 + e^{-(8.294 - 0.1622 \times 30)}} = 0.969$$
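The fitted O-ring model is simple enough to evaluate directly. This Python sketch reproduces the 30-degree calculation from the coefficients in the logistic regression table; the function name `p_distress` and the extra temperatures in the loop are assumptions for illustration.

```python
import math


def p_distress(temp_f, b0=8.294, b1=-0.1622):
    """Estimated probability of O-ring distress at a launch temperature (deg F)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * temp_f)))


print(round(p_distress(30), 3))  # 0.969, matching the worked value

# The estimated probability falls steadily as launch temperature rises.
for temp in (30, 50, 70, 80):
    print(temp, round(p_distress(temp), 3))
```

Note that 30 degrees lies at the cold end of the plotted temperature range, where the estimated probability of distress is highest.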


Summary: Binary Logistic Regression

• Why?
  – Theoretical and practical difficulties in applying regular regression (built for continuous dependent variables) to a binary dependent variable
• How?
  – Minitab procedure
  – Interpreting results
  – Some diagnostics
  – Making predictions
  – Comparison with regular regression model


For Sessions 11 and 12

• Student presentations