Lecture 15: Logistic Regression: Inference and link functions BMTRY 701 Biostatistical Methods II



More on our example

> pros5.reg <- glm(cap.inv ~ log(psa) + gleason, family=binomial)
> summary(pros5.reg)

Call:
glm(formula = cap.inv ~ log(psa) + gleason, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -8.1061     0.9916  -8.174 2.97e-16 ***
log(psa)      0.4812     0.1448   3.323 0.000892 ***
gleason       1.0229     0.1595   6.412 1.43e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 512.29  on 379  degrees of freedom
Residual deviance: 403.90  on 377  degrees of freedom
AIC: 409.9
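A note on interpretation: exponentiating a fitted coefficient gives an odds ratio, and exponentiating the Wald limits beta ± 1.96·SE gives an approximate 95% CI. A quick check of the printed estimates (a sketch in Python; the numbers are copied from the summary above):

```python
import math

# Estimates and standard errors copied from summary(pros5.reg)
estimates = {"log(psa)": (0.4812, 0.1448), "gleason": (1.0229, 0.1595)}

for name, (beta, se) in estimates.items():
    or_hat = math.exp(beta)                      # odds ratio
    lo = math.exp(beta - 1.96 * se)              # lower Wald 95% limit
    hi = math.exp(beta + 1.96 * se)              # upper Wald 95% limit
    print(f"{name}: OR = {or_hat:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
    # e.g. log(psa): OR = 1.618; gleason: OR = 2.781
```

So a one-unit increase in log(psa) multiplies the odds of capsular involvement by about 1.6, holding gleason fixed.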

Other covariates: Simple logistic models

Covariate                  Beta   exp(Beta)      Z
Age                      -0.0082     0.99      -0.51
Race                     -0.054      0.95      -0.15
Vol                      -0.014      0.99      -2.26
Dig Exam (vs. no nodule)
  Unilobar left           0.88       2.41       2.81
  Unilobar right          1.56       4.76       4.78
  Bilobar                 2.10       8.17       5.44
Detection in RE           1.71       5.53       4.48
LogPSA                    0.87       2.39       6.62
Gleason                   1.24       3.46       8.12

What is a good multiple regression model?

Principles of model building are analogous to linear regression

We use the same approach:
• Look for significant covariates in simple models
• Consider multicollinearity
• Look for confounding (i.e., change in betas when a covariate is removed)

Multiple regression model proposal

Gleason, logPSA, Volume, Digital Exam result, detection in RE

But, what about collinearity? With 5 covariates there are 10 pairs (5 choose 2) to check.

         gleason log.psa.   vol
gleason     1.00     0.46 -0.06
log.psa.    0.46     1.00  0.05
vol        -0.06     0.05  1.00

[Scatterplot matrix (pairs plot) of gleason, log.psa., and vol]

Categorical pairs

> dpros.dcaps <- epitab(dpros, dcaps)
> dpros.dcaps$tab
         Outcome
Predictor   1        p0  2         p1 oddsratio     lower     upper      p.value
        1  95 0.2802360  4 0.09756098  1.000000        NA        NA           NA
        2 123 0.3628319  9 0.21951220  1.737805 0.5193327  5.815089 4.050642e-01
        3  84 0.2477876 12 0.29268293  3.392857 1.0540422 10.921270 3.777900e-02
        4  37 0.1091445 16 0.39024390 10.270270 3.2208157 32.748987 1.271225e-05
> fisher.test(table(dpros, dcaps))

	Fisher's Exact Test for Count Data

data:  table(dpros, dcaps)
p-value = 2.520e-05
alternative hypothesis: two.sided
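Each odds ratio that epitab reports is the cross-product ratio of a row against the reference row (dpros = 1). Recomputing by hand from the counts in the table above (a sketch in Python):

```python
# Counts of (dcaps level 1, dcaps level 2) for each dpros level,
# copied from the epitab output above
counts = {1: (95, 4), 2: (123, 9), 3: (84, 12), 4: (37, 16)}

ref_a, ref_b = counts[1]                       # reference row: dpros = 1
for level in (2, 3, 4):
    a, b = counts[level]
    odds_ratio = (b * ref_a) / (a * ref_b)     # cross-product ratio vs. reference
    print(f"dpros {level} vs 1: OR = {odds_ratio:.6f}")
    # matches epitab: 1.737805, 3.392857, 10.270270
```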

Categorical vs. continuous

t-tests and anova: means by category

> summary(lm(log(psa) ~ dcaps))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.2506     0.1877   6.662 9.55e-11 ***
dcaps         0.8647     0.1632   5.300 1.97e-07 ***
---
Residual standard error: 0.9868 on 378 degrees of freedom
Multiple R-squared: 0.06917, Adjusted R-squared: 0.06671 
F-statistic: 28.09 on 1 and 378 DF,  p-value: 1.974e-07

> summary(lm(log(psa) ~ factor(dpros)))
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.1418087  0.0992064  21.589  < 2e-16 ***
factor(dpros)2 -0.1060634  0.1312377  -0.808    0.419    
factor(dpros)3  0.0001465  0.1413909   0.001    0.999    
factor(dpros)4  0.7431101  0.1680055   4.423 1.28e-05 ***
---
Residual standard error: 0.9871 on 376 degrees of freedom
Multiple R-squared: 0.07348, Adjusted R-squared: 0.06609 
F-statistic: 9.94 on 3 and 376 DF,  p-value: 2.547e-06

Categorical vs. continuous

> summary(lm(vol ~ dcaps))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   22.905      3.477   6.587 1.51e-10 ***
dcaps         -6.362      3.022  -2.106   0.0359 *  
---
Residual standard error: 18.27 on 377 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.01162, Adjusted R-squared: 0.009003 
F-statistic: 4.434 on 1 and 377 DF,  p-value: 0.03589

> summary(lm(vol ~ factor(dpros)))
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      17.417      1.858   9.374   <2e-16 ***
factor(dpros)2   -1.638      2.453  -0.668    0.505    
factor(dpros)3   -1.976      2.641  -0.748    0.455    
factor(dpros)4   -3.513      3.136  -1.120    0.263    
---
Residual standard error: 18.39 on 375 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.003598, Adjusted R-squared: -0.004373 
F-statistic: 0.4514 on 3 and 375 DF,  p-value: 0.7164

Categorical vs. continuous

> summary(lm(gleason ~ dcaps))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.2560     0.1991  26.401  < 2e-16 ***
dcaps         1.0183     0.1730   5.885 8.78e-09 ***
---
Residual standard error: 1.047 on 378 degrees of freedom
Multiple R-squared: 0.08394, Adjusted R-squared: 0.08151 
F-statistic: 34.63 on 1 and 378 DF,  p-value: 8.776e-09

> summary(lm(gleason~factor(dpros)))

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.9798     0.1060  56.402  < 2e-16 ***
factor(dpros)2   0.4217     0.1403   3.007  0.00282 ** 
factor(dpros)3   0.4890     0.1511   3.236  0.00132 ** 
factor(dpros)4   0.9636     0.1795   5.367 1.40e-07 ***
---
Residual standard error: 1.055 on 376 degrees of freedom
Multiple R-squared: 0.07411, Adjusted R-squared: 0.06672 
F-statistic: 10.03 on 3 and 376 DF,  p-value: 2.251e-06

Lots of “correlation” between covariates

We should expect some coefficients to lose significance, and some confounding, in the multiple regression model.

Still, try the ‘full model’ and see what happens

Full model results

> mreg <- glm(cap.inv ~ gleason + log(psa) + vol + dcaps + factor(dpros), family=binomial)

> summary(mreg)
Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.617036   1.102909  -7.813 5.58e-15 ***
gleason         0.908424   0.166317   5.462 4.71e-08 ***
log(psa)        0.514200   0.156739   3.281  0.00104 ** 
vol            -0.014171   0.007712  -1.838  0.06612 .  
dcaps           0.464952   0.456868   1.018  0.30882    
factor(dpros)2  0.753759   0.355762   2.119  0.03411 *  
factor(dpros)3  1.517838   0.372366   4.076 4.58e-05 ***
factor(dpros)4  1.384887   0.453127   3.056  0.00224 ** 
---

    Null deviance: 511.26  on 378  degrees of freedom
Residual deviance: 376.00  on 371  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 392

What next?

Drop or retain? How to interpret?

Likelihood Ratio Test

Recall testing multiple coefficients in linear regression

Approach in linear regression: ANOVA
We don’t have ANOVA for logistic regression
More general approach: Likelihood Ratio Test
Based on the likelihood (or log-likelihood) for “competing” nested models

Likelihood Ratio Test

H0: small model
H1: large model

Example:

logit(π) = β0 + β1·GS + β2·log(PSA) + β3·vol + β4·I(D=2) + β5·I(D=3) + β6·I(D=4)

H0: β4 = β5 = β6 = 0
H1: β4 ≠ 0, β5 ≠ 0, and/or β6 ≠ 0
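In R, factor(dpros) expands the four-level digital exam variable into three indicator (dummy) columns, with level 1 as the reference category. A minimal sketch of that coding (Python; the function name dummies is hypothetical, for illustration only):

```python
# Dummy coding for a 4-level categorical covariate, level 1 = reference
def dummies(dpros_level):
    """Return (I(D=2), I(D=3), I(D=4)) for one observation."""
    return tuple(int(dpros_level == k) for k in (2, 3, 4))

# Each level maps to a distinct 0/1 pattern; the reference level is all zeros
for d in (1, 2, 3, 4):
    print(d, dummies(d))
    # 1 (0, 0, 0) / 2 (1, 0, 0) / 3 (0, 1, 0) / 4 (0, 0, 1)
```

Testing "does dpros matter" therefore means testing the three dummy coefficients jointly, not one at a time.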

Recall the likelihood function

L(β0, β1; y, x) = ∏_{i=1}^{n} exp{yi·(β0 + β1·xi)} / [1 + exp(β0 + β1·xi)]

log L(β0, β1; y, x) = Σ_{i=1}^{n} yi·(β0 + β1·xi) − Σ_{i=1}^{n} log[1 + exp(β0 + β1·xi)]

Estimating the log-likelihood

Recall that we use the log-likelihood because it is simpler (back to linear regression)

MLEs:
• Betas are selected to maximize the likelihood
• Betas also maximize the log-likelihood
• If we plug in the estimated betas, we get our ‘maximized’ log-likelihood for that model

We compare the log-likelihoods from competing (nested) models
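As a concrete illustration, the Bernoulli log-likelihood can be evaluated directly for any candidate betas; the MLE is whatever pair maximizes it. A sketch in Python with toy data (not the prostate dataset):

```python
import math

def logistic_loglik(beta0, beta1, x, y):
    """Bernoulli log-likelihood: sum_i [ y_i*eta_i - log(1 + exp(eta_i)) ]."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = beta0 + beta1 * xi                 # linear predictor
        total += yi * eta - math.log1p(math.exp(eta))
    return total

# Toy data; glm() would search over (beta0, beta1) to maximize this function
x = [0.5, 1.2, 2.0, 3.1]
y = [0, 0, 1, 1]
print(logistic_loglik(-2.0, 1.0, x, y))          # ≈ -1.553
```

The maximized values of this function for two nested models are exactly what the likelihood ratio test compares.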

Likelihood Ratio Test

LR statistic: G² = −2·(log L(H0) − log L(H1))

Under the null: G² ~ χ²(p−q), where p and q are the numbers of parameters in the large and small models

If G² < χ²(p−q),1−α, conclude H0

If G² > χ²(p−q),1−α, conclude H1

LRT in R

−2·log L = Residual Deviance
So, G² = Dev(0) − Dev(1)
Fit two models:

H0: β4 = β5 = β6 = 0
H1: β4 ≠ 0, β5 ≠ 0, and/or β6 ≠ 0

> mreg1 <- glm(cap.inv ~ gleason + log(psa) + vol + factor(dpros),
+              family=binomial)
> mreg0 <- glm(cap.inv ~ gleason + log(psa) + vol, family=binomial)
> mreg1
Coefficients:
   (Intercept)         gleason        log(psa)             vol  
      -8.31383         0.93147         0.53422        -0.01507  
factor(dpros)2  factor(dpros)3  factor(dpros)4  
       0.76840         1.55109         1.44743  

Degrees of Freedom: 378 Total (i.e. Null);  372 Residual
  (1 observation deleted due to missingness)
Null Deviance: 511.3 
Residual Deviance: 377.1 	AIC: 391.1

> mreg0
Coefficients:
(Intercept)      gleason     log(psa)          vol  
   -7.76759      0.99931      0.50406     -0.01583  

Degrees of Freedom: 378 Total (i.e. Null);  375 Residual
  (1 observation deleted due to missingness)
Null Deviance: 511.3 
Residual Deviance: 399 	AIC: 407
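The printed AIC values are consistent with AIC = residual deviance + 2k, where k is the number of estimated coefficients (for ungrouped binary data, −2·log L equals the residual deviance). A quick arithmetic check:

```python
# AIC = residual deviance + 2 * (number of estimated coefficients)
aic_mreg1 = 377.1 + 2 * 7   # 7 params: intercept, gleason, log(psa), vol, 3 dpros dummies
aic_mreg0 = 399.0 + 2 * 4   # 4 params: intercept, gleason, log(psa), vol
print(round(aic_mreg1, 1), round(aic_mreg0, 1))   # 391.1 407.0
```

This is why AIC, unlike the deviance alone, penalizes the larger model for its extra parameters.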

Testing DPROS

Dev(0) – Dev(1) =

p – q =

χ2(p-q),1-α, =

Conclusion?

p-value?

More in R

> qchisq(0.95, 3)
> -2*(logLik(mreg0) - logLik(mreg1))
> 1 - pchisq(21.96, 3)

> anova(mreg0, mreg1)
Analysis of Deviance Table

Model 1: cap.inv ~ gleason + log(psa) + vol
Model 2: cap.inv ~ gleason + log(psa) + vol + factor(dpros)
  Resid. Df Resid. Dev Df Deviance
1       375     399.02            
2       372     377.06  3    21.96
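Filling in the numbers from this table: G² = 399.02 − 377.06 = 21.96 on 3 df, and χ²(3),0.95 = 7.81, so we reject H0. For odd degrees of freedom the χ² tail probability has a closed form in terms of the error function, so the p-value can be checked outside R (a sketch in Python; chi2_sf_df3 is a helper written just for this illustration):

```python
import math

def chi2_sf_df3(x):
    """Survival function P(X > x) for a chi-square with 3 df (closed form)."""
    return (1.0 - math.erf(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

g2 = 399.02 - 377.06          # deviance difference from the anova table
print(round(g2, 2))           # 21.96
print(chi2_sf_df3(g2))        # ≈ 6.6e-05, matching 1 - pchisq(21.96, 3)
```

So the dpros dummies are jointly highly significant, even though not every individual dummy was.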

Notes on LRT

Again, models have to be NESTED
For comparing models that are not nested, you need to use other approaches
Examples:
• AIC
• BIC
• DIC

Next time….

For next time, read the following article

Low Diagnostic Yield of Elective Coronary Angiography
Patel, Peterson, Dai, et al.
NEJM, 362(10), pp. 886-95
March 11, 2010

http://content.nejm.org/cgi/content/short/362/10/886?ssource=mfv