Department of Epidemiology and Public Health
Unit of Biostatistics and Computational Sciences

Regression models for binary and survival data

PD Dr. C. Schindler
Swiss Tropical and Public Health Institute
University of Basel
[email protected]

Annual meeting of the Swiss Societies of Clinical Neurophysiology and of Neurology, Lugano, May 3rd 2012


Binary outcome data

Binary endpoints Y

Y = 1 if the event occurred, Y = 0 otherwise

Examples:

Y = „death within 1 year“

Y = „disease progression within 2 years“

Y = „remission within three months“

Preliminaries:

1. Meaning of E(Y) for a binary variable Y

E(Y) = mean of Y at the population level = P(Y = 0) · 0 + P(Y = 1) · 1 = P(Y = 1)

Thus,

mean of Y = probability of the event represented by Y.

2. Notion of odds

Odds(Y) = P(Y = 1) / P(Y = 0) = P(Y = 1) / [1 – P(Y = 1)]

P(Y = 1) = Odds(Y) / [1 + Odds(Y)] (*)

Example:

Y = „disease progression“

P(Y = 1) = 0.3

Odds(Y) = 0.3 / [1 – 0.3] = 0.3 / 0.7 = 0.429

P(Y = 1) = 0.429 / [1 + 0.429] = 0.3
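The two conversions between probability and odds can be sketched in a few lines of Python. The function names `odds_from_prob` and `prob_from_odds` are illustrative choices and do not come from the slides.

```python
def odds_from_prob(p: float) -> float:
    """Odds = P(Y = 1) / [1 - P(Y = 1)]; requires 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return p / (1.0 - p)


def prob_from_odds(odds: float) -> float:
    """Inverse relation (*): P(Y = 1) = Odds / (1 + Odds)."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


odds = odds_from_prob(0.3)                # disease-progression example
print(round(odds, 3))                     # 0.429
print(round(prob_from_odds(odds), 3))     # 0.3
```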


3. Notion of odds ratio (OR)

X = „high risk“ (0 -> normal risk, 1 -> increased risk)

P(Y = 1 | X = 0) = P(Y = 1) in subjects with X = 0 = 0.2

Odds(Y | X = 0) = 0.2 / 0.8 = 0.25

P(Y = 1 | X = 1) = P(Y = 1) in subjects with X = 1 = 0.4

Odds(Y | X = 1) = 0.4 / 0.6 = 0.667

OR(Y | X) = Odds(Y | X = 1) / Odds(Y | X = 0) = 0.667 / 0.25 = 2.67

                           with outcome (Y = 1)   w/o outcome (Y = 0)   total
with risk factor (X = 1)            40                    60             100
w/o risk factor (X = 0)             20                    80             100
total                               60                   140             200

Symmetry of OR:

Prospective (cohort study): OR of Y between X = 1 and X = 0 = (40/60) : (20/80) = 2.67

Retrospective (case-control study): OR of X between Y = 1 and Y = 0 = (40/20) : (60/80) = 2.67
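As a small check of this symmetry, the following Python sketch computes the odds ratio both ways from the four cell counts of the 2x2 table. The variable names `a`, `b`, `c`, `d` are the usual textbook labels and are an assumption here, not notation from the slides.

```python
# Cell counts of the 2x2 table: rows = risk factor, columns = outcome
a, b = 40, 60   # X = 1: with outcome, without outcome
c, d = 20, 80   # X = 0: with outcome, without outcome

# Cohort view: odds of Y within each exposure group
or_prospective = (a / b) / (c / d)
# Case-control view: odds of X within each outcome group
or_retrospective = (a / c) / (b / d)
# Both expressions reduce to the cross-product ratio a*d / (b*c)
cross_product = (a * d) / (b * c)

print(round(or_prospective, 2), round(or_retrospective, 2), round(cross_product, 2))
```

All three quantities agree because each one simplifies algebraically to a·d / (b·c).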

4. Calculus of odds and odds ratios

Example:
- Risk of disease progression without risk factors A and B = 20%.
- OR of disease progression associated with risk factor A = 2.0.
- OR of disease progression associated with risk factor B = 3.0.

Odds without risk factors = 0.2 / (1 – 0.2) = 0.25.

a) Odds with risk factor A only = 0.25 · 2.0 = 0.5
b) Odds with risk factor B only = 0.25 · 3.0 = 0.75
c) Odds with both risk factors = 0.25 · 2.0 · 3.0 = 1.5 (if factors do not interact)

Corresponding risks: a) 0.5/1.5 = 0.33, b) 0.75/1.75 = 0.43, c) 1.5/2.5 = 0.6
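The calculation in this example (baseline risk to odds, multiplication by the odds ratios, back to risk) might be written as the following Python sketch. The helper name `risk_with_factors` is hypothetical, and the function assumes, as the slide does, that the factors do not interact.

```python
from typing import Sequence


def risk_with_factors(baseline_risk: float, odds_ratios: Sequence[float]) -> float:
    """Multiply the baseline odds by each OR (no interaction assumed),
    then convert the resulting odds back to a risk."""
    odds = baseline_risk / (1.0 - baseline_risk)
    for or_value in odds_ratios:
        odds *= or_value
    return odds / (1.0 + odds)


print(round(risk_with_factors(0.2, [2.0]), 2))        # a) 0.33
print(round(risk_with_factors(0.2, [3.0]), 2))        # b) 0.43
print(round(risk_with_factors(0.2, [2.0, 3.0]), 2))   # c) 0.6
```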

Regression models for probabilities / odds

Idea: As in classical regression, consider

$$E(Y) = \beta_0 + \beta_1 \cdot \text{risk score}$$

Equivalent formulation of the model:

$$P(Y = 1) = \beta_0 + \beta_1 \cdot \text{risk score}$$

Problem:

The left-hand side, P(Y = 1), lies in (0, 1), whereas the right-hand side, $\beta_0 + \beta_1 \cdot \text{risk score}$, can take values outside (0, 1).

Solution:

$$P(Y = 1) = F(\beta_0 + \beta_1 \cdot \text{risk score})$$

where F(z) is a function whose values are always in (0, 1)

z = $\beta_0 + \beta_1 \cdot \text{risk score}$ = linear predictor (linear prediction score)

Logistic function (standard choice)

$$F(z) = \frac{e^z}{1 + e^z}$$

[Figure: graph of F(z) against z for z from -4 to 4, with F(z) ranging from 0.0 to 1.0.]

a) $e^z < 1 + e^z \Rightarrow F(z) < 1$
b) $e^z > 0 \Rightarrow F(z) > 0$
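A minimal Python sketch of the logistic function, evaluated at the tick values of the plotted z-axis. The two-branch form is an implementation choice (not from the slides) that avoids overflow of `math.exp` for large positive z; mathematically both branches equal e^z / (1 + e^z).

```python
import math


def logistic(z: float) -> float:
    """F(z) = e^z / (1 + e^z), evaluated in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))   # same value, no overflow for large z
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in (-4, -2, 0, 2, 4):
    print(z, round(logistic(z), 3))
```

The printed values increase with z and stay strictly between 0 and 1, in line with properties a) and b).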

Y = outcome (1 = present, 0 = absent)

X = risk factor (1 = present, 0 = absent)

Linear predictor (logit): $z = \beta_0 + \beta_1 \cdot x$

$$P(Y = 1 \mid X) = \frac{e^z}{1 + e^z} = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

Recalling that P(Y = 1|X) = Odds(Y|X) / [1 + Odds(Y|X)] shows that

$$\text{Odds}(Y \mid X) = e^z = e^{\beta_0 + \beta_1 x}$$

x = 0: $\text{Odds}(Y) = e^{\beta_0 + \beta_1 \cdot 0} = e^{\beta_0}$

x = 1: $\text{Odds}(Y) = e^{\beta_0 + \beta_1 \cdot 1} = e^{\beta_0 + \beta_1}$

Odds ratio of Y between x = 1 and x = 0:

$$e^{\beta_0 + \beta_1} / e^{\beta_0} = e^{\beta_1}$$

=> $\beta_0 = \ln(\text{Odds}(Y \mid X = 0))$ and $\beta_1 = \ln(\text{OR}(Y \mid X))$

Logistic regression

$$P(Y = 1) = \frac{e^z}{1 + e^z} = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

Probit regression

$$P(Y = 1) = \Phi(z) = \Phi(\beta_0 + \beta_1 x)$$

$\Phi$ = cumulative distribution function of the standard normal distribution (another sigmoid-shaped function ranging from 0 to 1)

Logistic regression                             Number of obs   =        200
                                                LR chi2(1)      =       9.66
                                                Prob > chi2     =     0.0019
Log likelihood = -117.34141                     Pseudo R2       =     0.0395

------------------------------------------------------------------------------
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 risk_factor |   .9808289   .3227486     3.04   0.002     .3482533    1.613404
       _cons |  -1.386294        .25    -5.55   0.000    -1.876285    -.896303
------------------------------------------------------------------------------

ln(Odds Ratio) = 0.9808 => Odds Ratio = $e^{0.9808}$ = 2.67

ln(Odds(Y | X = 0)) = -1.3863 => Odds(Y | X = 0) = $e^{-1.3863}$ = 0.25

P(Y = 1 | X = 0) = 0.25 / (1 + 0.25) = 0.2

P(Y = 1 | X = 1) = 0.25 · 2.67 / (1 + 0.25 · 2.67) = 0.4


Summary

exp(coefficient of risk factor) = OR of outcome between those with and those without risk factor (cohort study)

= OR of risk factor between those with and those without outcome (case-control study)

exp(intercept term) = odds of outcome among unexposed subjects

Direct output of odds ratio and odds of unexposed:

Logistic regression                             Number of obs   =        200
                                                LR chi2(1)      =       9.66
                                                Prob > chi2     =     0.0019
Log likelihood = -117.34141                     Pseudo R2       =     0.0395

------------------------------------------------------------------------------
     outcome | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 risk_factor |   2.666666   .8606626     3.04   0.002     1.416591    5.019872
       _cons |   .2500001      .0625    -5.55   0.000      .153158    .4080755
------------------------------------------------------------------------------

95%-confidence interval of odds ratio: (1.42, 5.02)

[Figure: confidence interval of the odds ratio shown on an axis from 0 to 5.]

Note:
1. Confidence intervals of odds ratios are asymmetrical!
2. If they do not include 1, then the respective association is statistically significant at the 5% level.
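The asymmetry arises because the interval is symmetric on the log-odds scale and is then exponentiated. A short Python sketch, using the coefficient and standard error of `risk_factor` from the coefficient output and assuming the usual Wald interval (coefficient ± 1.96 standard errors):

```python
import math

coef, se = 0.9808289, 0.3227486   # risk_factor row of the coefficient table
z975 = 1.959964                   # 97.5% quantile of the standard normal

odds_ratio = math.exp(coef)
ci_low = math.exp(coef - z975 * se)    # symmetric on the log scale ...
ci_high = math.exp(coef + z975 * se)   # ... asymmetric after exponentiation

print(round(odds_ratio, 2), round(ci_low, 2), round(ci_high, 2))   # 2.67 1.42 5.02
```

The upper limit lies further from 2.67 than the lower limit does, which is the asymmetry referred to in note 1.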

Example from the literature

Ritchie K. et al., The neuroprotective effects of caffeine – a prospective population study (The Three City Study), Neurology 2007; 69: 536-45

outcome = cognitive decline, measured as either
a) decline by at least 6 units in Isaacs set test or
b) decline by at least 2 units in Benton visual retention test
over four years

Isaacs set test = test on verbal recall and fluency

1. On average, odds of cognitive decline (CD) increased by 6% per additional year of age at baseline.
2. On average, odds of CD increased by 8% per additional unit in baseline cognitive test.
3. Compared to subjects with 5 years of education, the odds of CD among subjects with ≥12 years of education were reduced by more than 40%.

Main result of paper:

Significantly reduced risk of a decline by at least 6 units in the Isaacs set test („Isaacs -6“) among women who drank more than 3 units of caffeine per day at baseline, compared to women who drank only 0 – 1 units.

CAVE: Odds ratio (OR) ≠ relative risk (RR), and odds ≠ risk (relative frequency)

Interpretation of OR as relative risk and of odds as risk (relative frequency) is only appropriate if risks are small (i.e., < 10%)

NOTE: Odds > risk (relative frequency)

Possible situations for OR and RR:
a) OR < RR < 1
b) OR = RR = 1
c) 1 < RR < OR
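To illustrate how far OR and RR can drift apart, the following Python sketch compares them for a pair of small risks (1% vs. 2%, values assumed purely for illustration) and for the larger risks of the earlier 2x2 example (20% vs. 40%). The function names are illustrative.

```python
def relative_risk(p1: float, p0: float) -> float:
    """RR = risk in exposed / risk in unexposed."""
    return p1 / p0


def odds_ratio(p1: float, p0: float) -> float:
    """OR = odds in exposed / odds in unexposed."""
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


# Small risks: OR and RR nearly coincide
print(round(relative_risk(0.02, 0.01), 2), round(odds_ratio(0.02, 0.01), 2))
# Larger risks: OR is clearly further from 1 than RR
print(round(relative_risk(0.4, 0.2), 2), round(odds_ratio(0.4, 0.2), 2))
```

In both cases the OR is further from 1 than the RR, matching situation c); swapping the two groups gives situation a).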

Model comparison

- Likelihood ratio test
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- Pseudo-R2

Likelihood L of a model

Probability of observing exactly the same outcome data again if exactly the same predictor data are given, provided that the model describes reality in the best possible way with the given variables.

log-Likelihood ln(L) = natural logarithm of the likelihood

ln(L) is always ≤ 0, since probabilities are ≤ 1.

The perfect model would have L = 1 and ln(L) = 0. The better the model, the closer ln(L) is to 0.


Comparison of nested models

A model M1 is said to be nested in another model M2 , if M1 is a special case of M2, e.g., if all the terms of M1 are also in M2 but not vice versa.

Under the hypothesis that the additional terms of M2 are of no predictive value, the difference

D = 2 [ln(L2) - ln(L1)] = 2 ln(L2/L1)     (L2/L1 = likelihood ratio)

has an approximate Chi2-distribution with df2 - df1 degrees of freedom (where dfi = number of parameters of model i).

CAVE: Both models must be based on exactly the same data. In particular, their n‘s must be identical.
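As a worked check, the likelihood ratio statistic of the logistic regression example can be reproduced in Python. The sketch assumes that the null model (M1, intercept only) predicts the overall event proportion 60/200 for every subject, where 60 = 40 + 20 events in the 2x2 table; the log likelihood of M2 is taken from the regression output. For one degree of freedom the Chi2 tail probability can be written with the complementary error function, which avoids any external library.

```python
import math

# M1: intercept-only model, every subject gets the overall event proportion
ll_null = 60 * math.log(60 / 200) + 140 * math.log(140 / 200)
# M2: model with risk_factor, log likelihood as reported in the output
ll_model = -117.34141

D = 2.0 * (ll_model - ll_null)              # likelihood ratio statistic

# Chi-square upper tail for df2 - df1 = 1: P(Chi2 > D) = erfc(sqrt(D / 2))
p_value = math.erfc(math.sqrt(D / 2.0))

print(round(D, 2), round(p_value, 4))       # 9.66 0.0019
```

These values agree with the `LR chi2(1)` and `Prob > chi2` entries of the output.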

Akaike information criterion („smaller is better“)

AIC = - 2 ln(L) + 2 p

(p = number of parameters of the model in addition to the intercept parameter)

2 p = penalty for complexity of the model

The two models compared must be based on exactly the same data (same n!), but they need not be nested and can contain different variables.

Bayesian information criterion (Schwarz criterion) („smaller is better“)

BIC = - 2 ln(L) + p ln(n)

(p = number of parameters of the model in addition to the intercept parameter, n = sample size)

p ln(n) = penalty for complexity of the model

As with the AIC, the two models compared must be based on the same data (same n), but they do not have to be nested.
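A minimal Python sketch of both criteria, applied to the logistic regression example with both models fitted to the same 200 subjects. It counts p as on the slides (parameters in addition to the intercept); the null log likelihood is an assumption derived from the overall event proportion 60/200, and the function names `aic` and `bic` are illustrative.

```python
import math


def aic(log_lik: float, p: int) -> float:
    """AIC = -2 ln(L) + 2 p."""
    return -2.0 * log_lik + 2.0 * p


def bic(log_lik: float, p: int, n: int) -> float:
    """BIC = -2 ln(L) + p ln(n)."""
    return -2.0 * log_lik + p * math.log(n)


n = 200
ll_null = 60 * math.log(0.3) + 140 * math.log(0.7)   # intercept only, p = 0
ll_model = -117.34141                                # with risk_factor, p = 1

for name, ll, p in (("null", ll_null, 0), ("risk_factor", ll_model, 1)):
    print(name, round(aic(ll, p), 1), round(bic(ll, p, n), 1))
```

Both criteria are smaller for the model with the risk factor, so both would prefer it over the null model. The BIC penalty ln(200) ≈ 5.3 per parameter is larger than the AIC penalty of 2.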

Pseudo-R2

There exists an analog of R2 for logistic and other generalized linear regression models.

[Figure: number line showing ln(Lnull model) ≤ ln(Lmodel) ≤ 0.]

Pseudo-R2 = [ ln(Lmodel) - ln(Lnull model) ] / [ 0 - ln(Lnull model) ]

Numerator = „variance explained“, denominator = „total variance“

Null model = model with an intercept term only.
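The Pseudo-R2 of the logistic regression example can be recomputed from this formula. In the Python sketch below the null log likelihood is an assumption derived from the overall event proportion (60 events among 200 subjects), while the model log likelihood is the value reported in the output.

```python
import math

ll_null = 60 * math.log(0.3) + 140 * math.log(0.7)   # intercept-only model
ll_model = -117.34141                                # reported log likelihood

pseudo_r2 = (ll_model - ll_null) / (0.0 - ll_null)
print(round(pseudo_r2, 4))                           # 0.0395
```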

Goodness of fit of a logistic regression model

Can be assessed using the Hosmer-Lemeshow test.

Mechanics of the test:

1. For each subject, the logistic regression model predicts its individual probability of having Y = 1.
2. Subjects are then categorized into a certain number of classes based on the size of their predicted probabilities.
3. In each of the classes, the proportion of subjects with Y = 1 is determined and compared with the mean value of the predicted probabilities (ideally the two values should coincide in each class).
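The three steps above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the routine of any particular statistics package: classes are formed as equally sized groups of subjects ordered by predicted probability, the statistic sums (observed − expected)² / (n · p̄ · (1 − p̄)) over the classes, and it is conventionally compared with a Chi2 distribution with (number of classes − 2) degrees of freedom. The function name `hosmer_lemeshow` is hypothetical, and the sketch assumes that no class has a mean predicted probability of exactly 0 or 1.

```python
def hosmer_lemeshow(y, p_hat, n_groups=10):
    """Return (chi-square statistic, degrees of freedom).

    y      : observed outcomes (0 or 1), one per subject
    p_hat  : predicted probabilities from a fitted logistic model (step 1)
    """
    # Step 2: order subjects by predicted probability and cut into classes
    pairs = sorted(zip(p_hat, y), key=lambda pair: pair[0])
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        size = len(chunk)
        # Step 3: observed vs. expected number of events in the class
        observed = sum(yi for _, yi in chunk)
        expected = sum(pi for pi, _ in chunk)
        mean_p = expected / size
        stat += (observed - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return stat, n_groups - 2
```

A small statistic (large p-value) indicates that observed and predicted proportions agree well across the classes.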


Analysis of survival data

Censored and uncensored survival times

[Figure: individual follow-up lines over the observation period from t0 to t1 (= individual time scale!), with end points marked x or o.]

x = true event -> uncensored survival time

o = loss to follow-up -> censored survival time

Event-free survival until t1 -> censored survival time

In survival analyses, two outcome variables are needed:

1. Event variable
   1 = event was observed -> uncensored observation
   0 = event was not observed -> censored observation

2. Variable for event-free time observed
   = time until event (uncensored observation)
   = observation time (censored observation)

Simple group comparisons of survival data

1. Construction of survival curves using the method of Kaplan-Meier

2. Comparison of survival curves using the log rank or the Wilcoxon test.
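The Kaplan-Meier construction takes the two outcome variables (event-free time and event indicator) as input. The following Python sketch shows the product-limit calculation for a single group; the function name `kaplan_meier` and the toy data are made up for illustration and are not taken from the slides.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : event-free time observed for each subject
    events : 1 = event observed (uncensored), 0 = censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:   # all subjects with time t
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk         # product-limit step
            curve.append((t, surv))
        at_risk -= removed                         # events and censored leave the risk set
    return curve


# Made-up toy data (weeks), for illustration only
print(kaplan_meier([5, 8, 8, 12, 15], [1, 0, 1, 1, 0]))
```

Censored subjects never lower the curve themselves; they only reduce the number at risk for later event times.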

[Figure: Kaplan-Meier survival estimates for group = 1, group = 2 and group = 3; survival proportion (0.00 to 1.00) against analysis time (0 to 100 weeks).]

S(t) = proportion of patients without event until time t

h(t) = instantaneous event risk (hazard at time t)

[Figure: survival curve S(t) starting at S(0), with the values S(t) and S(t+Δt) marked at times t and t+Δt.]

$$S(t + \Delta t) \approx S(t) - S(t)\, h(t)\, \Delta t$$

$$S(t + \Delta t) - S(t) \approx -S(t)\, h(t)\, \Delta t$$

$$\frac{S(t + \Delta t) - S(t)}{\Delta t} \approx -S(t)\, h(t)$$

$$\frac{S(t + \Delta t) - S(t)}{\Delta t \, S(t)} \approx -h(t)$$

For Δt -> 0:

$$\frac{\dot{S}(t)}{S(t)} = -h(t)$$

$$\frac{d}{dt}\left[\ln(S(t))\right] = -h(t)$$

$$h(t) = -\frac{d}{dt}\left[\ln(S(t))\right]$$
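The final relation can be checked numerically. The Python sketch below approximates the derivative of ln(S(t)) by a central difference for two assumed survival functions (an exponential one with constant hazard and a Weibull-type one with increasing hazard). Both functions and their parameters are illustrative assumptions, not data from the slides.

```python
import math


def hazard(S, t: float, dt: float = 1e-5) -> float:
    """Approximate h(t) = -d/dt ln(S(t)) by a central difference."""
    return -(math.log(S(t + dt)) - math.log(S(t - dt))) / (2.0 * dt)


# Assumed example 1: S(t) = exp(-0.02 t), constant hazard 0.02 per week
S_exp = lambda t: math.exp(-0.02 * t)
# Assumed example 2: S(t) = exp(-(t/50)^2), hazard 2t / 2500 (increasing)
S_wei = lambda t: math.exp(-((t / 50.0) ** 2))

print(round(hazard(S_exp, 10.0), 4), round(hazard(S_exp, 40.0), 4))
print(round(hazard(S_wei, 10.0), 4), round(hazard(S_wei, 40.0), 4))
```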

Assumption of proportional hazards (PH)

h1(t) = hazard function in group 1

h2(t) = hazard function in group 2

h2(t) / h1(t) = HR (hazard ratio)

h2(t) = HR · h1(t)

PH: ratio of hazards is independent of time t

Hazard functions of group 1 and 2 are proportional. Hazard function of group 3 violates the PH assumption.

[Figure, two panels for group = 1, group = 2 and group = 3 against analysis time (0 to 100 weeks): smoothed hazard estimates; Kaplan-Meier survival curves.]

Logarithmized hazards of group 1 and 2 run parallel (PH assumption).

[Figure, two panels for group = 1, group = 2 and group = 3 against analysis time (0 to 100 weeks): smoothed hazard estimates; logarithmized hazard estimates.]


Modelling of the hazard ratio

Sir David Roxbee Cox (born 1924)

1950-1956 Cambridge University

1956-1966 Birkbeck College, London

1966-1988 Imperial College, London

1988-1994 Oxford University

$$HR = e^{\beta x}$$

x dichotomous:

x = 1 with risk factor

x = 0 without risk factor

$HR = e^{\beta \cdot 1} = e^{\beta}$ = hazard ratio associated with the risk factor

$$HR = e^{\beta x}$$

x continuous, e.g. x = age at baseline

$HR = e^{\beta \cdot 1} = e^{\beta}$ = hazard ratio associated with a unit increase in x (cross-sectional comparison)

Multiple proportional hazard regression model („Cox-regression model“)

$$HR = e^{\beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k}$$

Reference category: subjects with x1 = x2 = ... = xk = 0

No. of subjects =         1500                  Number of obs   =       1500
No. of failures =          671
Time at risk    =        97331
                                                LR chi2(2)      =      39.93
Log likelihood  =   -4697.2077                  Prob > chi2     =     0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     group_2 |   1.393155   .1395814     3.31   0.001     1.144766    1.695439
     group_3 |   1.832595   .1779825     6.24   0.000     1.514947    2.216846
------------------------------------------------------------------------------

In our example:

On average, the hazards in group 3 and 2 were higher by 83% (95%-CI: 51 to 122%) and 39% (95%-CI: 14 to 70%), respectively, than in group 1 (= reference group).
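The percentages in this interpretation follow from the hazard ratios as (HR − 1) · 100. A short Python sketch using the values of the Cox regression output; the helper name `percent_change` is illustrative.

```python
def percent_change(ratio: float) -> float:
    """Convert a hazard ratio into a percentage difference vs. the reference group."""
    return (ratio - 1.0) * 100.0


# (hazard ratio, lower CI limit, upper CI limit) from the output
rows = {
    "group_2": (1.393155, 1.144766, 1.695439),
    "group_3": (1.832595, 1.514947, 2.216846),
}

for name, values in rows.items():
    hr, low, high = (percent_change(v) for v in values)
    print(f"{name}: {hr:.1f}% (95% CI {low:.1f}% to {high:.1f}%)")
```

A hazard ratio below 1 yields a negative value, i.e., a hazard that is lower than in the reference group.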

Analysis restricted to first year:

No. of subjects =         1500                  Number of obs   =       1500
No. of failures =          579
Time at risk    =        55997
                                                LR chi2(2)      =      54.84
Log likelihood  =   -4073.1439                  Prob > chi2     =     0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     group_2 |   1.387788   .1550301     2.93   0.003     1.114898    1.727472
     group_3 |   2.122101    .222757     7.17   0.000     1.727489    2.606854
------------------------------------------------------------------------------

Analysis restricted to survivors of first year:

No. of subjects =          872                  Number of obs   =        872
No. of failures =           92
Time at risk    =        41334
                                                LR chi2(2)      =      11.92
Log likelihood  =    -610.6509                  Prob > chi2     =     0.0026

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     group_2 |   1.442213   .3266646     1.62   0.106     .9251881    2.248167
     group_3 |   .5261936   .1709134    -1.98   0.048     .2783979    .9945465
------------------------------------------------------------------------------

Conclusion:

1. Hazard ratio between group 2 and group 1 was very similar in both years, i.e., around 1.4 (confirming the proportionality of the two hazard functions).

2. Hazard ratio between group 3 and group 1 was higher than 2 in the first year but smaller than 1 in the second year (confirming the sharp decrease of the hazard function of group 3, which falls below that of group 1 after 1 year).

Example from the literature

Koch M. et al., The natural history of secondary progressive multiple sclerosis, J Neurol Neurosurg Psychiatry 2010; 81: 1039-43

Some study characteristics:
- 5207 patients from British Columbia with a remitting disease at baseline.
- Onset of immunomodulatory treatment was considered as censoring event.


Kaplan-Meier curves:

PH-assumption was probably not satisfied by the factor gender!

1. On average, the covariate-adjusted hazard of secondary progression was higher in men than in women by 43%, and the median time to secondary progression was lower by 25% (i.e., 17.1 vs. 22.7 years).

2. On average, the covariate-adjusted hazard of secondary progression increased by 5% with each additional year of age at baseline.

CAVE:

1. In general, the ratio of median or mean survival times between two groups is not inversely proportional to the hazard ratio. But in the special case of constant hazards, this relation holds for the mean survival times.

2. The hazard ratio may be interpreted as relative risk only if the event rates are small in the respective groups during the time period of interest.
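The special case in point 1 can be made concrete. With a constant hazard h, survival is exponential, S(t) = exp(−h·t), and the mean survival time is 1/h, so the ratio of mean survival times equals the inverse of the hazard ratio. The Python sketch below uses assumed illustrative hazards (not values from the slides); it also shows that, in this exponential case specifically, the median ln(2)/h behaves the same way.

```python
import math


def mean_survival(h: float) -> float:
    """Mean survival time under a constant hazard h (exponential model): 1 / h."""
    return 1.0 / h


def median_survival(h: float) -> float:
    """Median survival time under a constant hazard h: ln(2) / h."""
    return math.log(2.0) / h


h1, hr = 0.05, 1.5      # assumed illustrative values
h2 = hr * h1            # proportional (here: constant) hazards

print(round(mean_survival(h1) / mean_survival(h2), 3))       # 1.5
print(round(median_survival(h1) / median_survival(h2), 3))   # 1.5
```

With time-varying hazards these identities no longer hold, which is why the hazard ratio of 1.43 and the 25% shorter median in the multiple sclerosis example are not simple inverses of each other.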

Thank you for your attention again!