

Principles of Survival Analysis

(manuscript in progress – version of 20/09/2012)

A.C.C. Coolen, L. Holmberg and J.E. Barrett
King's College London

CLARENDON PRESS . OXFORD

2016


PREFACE

Survival analysis is the field of medical statistics concerned with extracting quantitative regularities from patient survival data. In their simplest form these data are times recorded from a baseline until the occurrence of a specified irreversible medical event such as death, or the onset or first recurrence of a specific disease. The extracted regularities may then be used for regression, i.e. for predicting survival probabilities as a function of time for new individuals, or to quantify differences in survival statistics of distinct cohorts. The patients in the cohort are usually characterised by one or more measurable features (the covariates), and the task at hand is to extract from the available survival data the relation (if any exists) between an individual's covariates and his/her survival probability as a function of time.

The mathematically oriented novice who tries to learn about survival analysis soon finds that there appears to be an unwelcome gap in the literature. There are many textbooks and review papers that give the conventional formulas of survival analysis, explain how one should use standard statistical software packages, and give examples of the result of applying standard software to real data. These books would be perfectly adequate for the end user of survival analysis methods, who works at the application end of the spectrum (usually in a medical environment). Then there are statistics and probability theory papers, which tend to focus on very mathematical/technical questions in survival analysis, and are often written in the language of measure theory. These serve the theorist, whose main interest is in mathematics and statistics, and for whom survival analysis is but one of many application areas that generate the difficult mathematical questions he/she wants to work on. It is difficult, however, to find good textbooks that sit in the middle, and explain in detail the conceptual and mathematical basis of the formulas of survival analysis, for those who want to innovate and expand the mathematical methods of survival analysis. That readership wants to know first and foremost where the formulas of conventional survival analysis come from, what assumptions were made in their derivation, and how exactly the key quantities are defined.

The traditional survival analysis methods, such as proportional hazards regression and Kaplan-Meier risk estimators, represented extremely important breakthroughs at the time of their conception (the 1970s), and have served the medical community perfectly for many decades. However, they were developed at a time when each university had just one computer (that filled several rooms, and was probably slower than today's average laptop), so they had to be simple in order to be applied to real data. Nowadays the use of this traditional methodology is increasingly inappropriate, given the vastly increased complexity and quantity of medical data that are now routinely collected. In modern biomedicine we have new problems and new ambitions: we want to use the wealth of these new data for personalised medicine, but in doing so we face complex heterogeneous diseases and cohorts, and a vast dimensional mismatch between the number of (e.g. genetic) covariates and the number of patients on which we have data.

At this moment it would appear to us that the most pressing demand in survival analysis is for new mathematical tools that can handle complications like disease and cohort heterogeneity and dimensional mismatch, which are both serious obstacles en route to individualised prediction and personalised medicine. We want these new tools not to deviate unnecessarily from the traditional ones, and preferably to include these in special simplifying limits. To develop such tools we inevitably need to return to the drawing board: we need to understand and review in full detail the principles and mathematical derivations of traditional survival analysis, and then rebuild the edifice in the most transparent manner to accommodate the new questions that we want survival analysis to answer.

We wrote this book as an attempt to fill the above gap. We try to map out and explain the definitions, assumptions and derivations of the main methods in survival analysis, illustrating key points with explicit worked-out examples. The style is that of the applied mathematician or theoretical physicist, who appreciates that there will be time for investigating mathematical subtleties, but who first wants to erect the new building in terms of structure. We do not give many references to original journal papers; it is not our aim to create an encyclopaedic text but one that explains principles in a self-contained manner. For those interested in tracing research papers, there are already excellent and comprehensive vehicles available, such as the textbooks by Hougaard [1], Ibrahim et al [2], Klein and Moeschberger [3], or Crowder [4], which contain a wealth of references to research papers and other texts. We also write with the benefit of hindsight. In the 1970s, maximum-likelihood estimation was the norm, and maximum likelihood is often the language in which original derivations of methods and results are given in the original papers. Nowadays, in contrast, we prefer the Bayesian route, within which maximum likelihood is but a special limit, and which in our view makes derivations and subtleties significantly more transparent.

We hope that this text may lower the threshold for theorists to move into this fascinating research area and contribute to the development of new statistical methods that can help to make personalised medical prediction and personalised medicine a reality. We also hope that it may simultaneously serve the epidemiologist who wants to acquire a deeper understanding of the potential and limitations of the methods that he or she is using on a day to day basis.

London, September 2012 Ton Coolen and Lars Holmberg


ACKNOWLEDGEMENTS

It is our great pleasure to thank the many colleagues and students with whom over the years we enjoyed discussing questions relating to survival analysis and medical statistics. In particular we would like to mention (in alphabetical order): Shola Agbaje, James Barrett, Eric Blanc, Maria De Iorio, Hans Garmo, Niels Keiding, Katherine Lawler, Cathryn Lewis, Janet Peacock, Akram Shalabi, Hans Van Baardewijk, Mieke van Hemelrijck, and Mike Weale.


CONTENTS

1 Why probability and statistics are tricky

2 Definitions and basic properties in survival analysis
  2.1 Notation, data and objective
  2.2 Survival probability and cause-specific hazard rates
  2.3 Examples

3 Event time correlations and identifiability
  3.1 Independently distributed event times
  3.2 The identifiability problem
  3.3 Examples

4 Incorporating cure as a possible outcome
  4.1 The clean way to include cure
  4.2 The quick and dirty way to include cure
  4.3 Examples

5 Individual versus cohort level survival statistics
  5.1 Population level survival functions
  5.2 Population hazard rates and data likelihood
  5.3 Examples

6 Survival prediction
  6.1 Cause-specific survival functions
  6.2 Estimation of cause-specific hazard rates
  6.3 Derivation of the Kaplan-Meier estimator
  6.4 Examples

7 Including covariates
  7.1 Definition via covariate sub-cohorts
  7.2 Conditioning of individual hazard rates on covariates
  7.3 Connecting the conditioning and sub-cohort pictures
  7.4 Conditionally homogeneous cohorts
  7.5 Nonparametrised covariates-to-risk connection
  7.6 Examples

8 Proportional hazards (Cox) regression
  8.1 Definitions, assumptions and regression equations
  8.2 Uniqueness and p-values for regression parameters
  8.3 Properties and limitations of Cox regression
  8.4 Examples

A The δ-distribution

B Steepest descent integration

C Maximum likelihood versus Bayesian estimation

D Maximum prediction accuracy with Cox regression

E Computational details

References


1

WHY PROBABILITY AND STATISTICS ARE TRICKY

Few would disagree with the proposition that probability and statistics are the most tricky and most abused areas of mathematics. In order to get some intuition for why this is so, let us start with some simple examples of statistical/probabilistic questions that as yet have nothing to do with survival analysis.

Example 1: The Monty Hall problem. This problem is based loosely on the scenario played out at the end of many typical television game shows of the 1970s. It appears to have been first formulated in a letter to the journal American Statistician, by Steve Selvin (in 1975), as a cute illustration of the subtleties of statistics. The archetypical show in the USA after which the problem was named was called 'Let's Make a Deal' (USA, 1963-1977) and was hosted by an individual called Monty Hall.

At the end of the game show, after having beaten the other contestants there would be a winner, but this winner would face a final challenge. He/she is shown three closed doors. Behind one of these would be a big prize (a large amount of money, or a car, etc), but behind the other two is something silly (e.g. a goat or a llama). What happens next is:

• The winner is asked to choose one of the three closed doors (this will be a random choice, as he/she has no clue).

• Monty subsequently opens one of the remaining two doors, behind which there is a goat/llama (this is always possible, irrespective of the winner's initial choice, since only one of the three doors leads to the true prize). We are then still left with two closed doors, one of which was chosen by the winner. We still don't know which of these doors leads to the prize.

• Monty then offers the winner the option to change his/her mind at the last minute, and switch from the initial choice to the other remaining closed door.


The statistical question then is: will it make a difference to the likelihood of winning the prize if he/she were to switch at the last minute? Intuitively most of us would be tempted to say no. It would seem that each door simply has a 50% chance of leading to the prize, and switching would make no difference. In fact the correct answer is yes: careful analysis of all possible events and their probabilities shows that switching at the last minute doubles one's likelihood of winning the prize (one can check this by writing out explicitly all possible paths of decisions and events, with the correct probabilities for each).
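Apart from enumerating all decision paths, one can also check the claim by simulation. The following is a minimal sketch in Python (the function name, trial count and seed are our own illustrative choices, not part of the text); it estimates the winning probability with and without switching.

```python
import random

def monty_hall(switch, trials=100_000, seed=0):
    """Estimate the probability of winning the prize, with or
    without switching doors at the last minute."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)        # door hiding the prize
        choice = rng.randrange(3)       # winner's initial (random) pick
        # Monty opens a door that is neither the winner's pick nor the prize
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            # switch to the one remaining closed door
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials
```

Running this gives approximately 1/3 without switching and 2/3 with switching: switching wins exactly when the initial pick was wrong, which happens with probability 2/3.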

Example 2: share price statistics. Imagine one is asked to produce a statistical report on the typical behaviour of share prices over a twenty year period, of corporations that are listed on the London Stock Exchange (LSE). For the sake of the argument, let us pretend that the most recent financial crisis hadn't happened. How would we go about this task? It would seem natural to proceed as follows:

• We compile a list of all companies that have been on the LSE since 1992.

• We find or buy the data that give the daily share values of all companies on our list, covering the last 20 years.

• We carry out a careful statistical analysis of these data.


In following this apparently sensible work plan we would make fundamental mistakes. We are already in trouble from the first step. By putting only those companies on our list that have been on the LSE for the last twenty years, we are biasing our sample to those companies that are sufficiently healthy to remain in business (and hence listed on the LSE) for at least twenty years. Irrespective of the statistical analysis methods used, this will inevitably lead to a picture of share statistics that is too rosy.
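A toy simulation makes the 'too rosy' bias concrete. The sketch below is our own illustration, with an assumed zero-drift random walk for log prices and an arbitrary delisting threshold; it compares the average twenty-year log return over all companies with the average over the survivors only.

```python
import random

def survivorship_demo(n_firms=2000, years=20, seed=42):
    """Compare the mean final log price of all firms with that of the
    firms surviving the full period (toy model: zero-drift log price
    random walk, delisting when the log price falls below a threshold)."""
    rng = random.Random(seed)
    all_final, survivor_final = [], []
    for _ in range(n_firms):
        log_price, alive = 0.0, True
        for _ in range(years):
            log_price += rng.gauss(0.0, 0.4)   # one year of price noise
            if log_price < -1.5:               # delisted: drops off the list
                alive = False
                break
        all_final.append(log_price)
        if alive:
            survivor_final.append(log_price)
    mean = lambda v: sum(v) / len(v)
    return mean(all_final), mean(survivor_final)
```

The survivors' average comes out systematically higher than the cohort-wide average, even though the underlying price model has zero drift: conditioning on twenty years of survival alone produces the bias.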

The main pitfalls and dangers in probability and statistics. Probability and statistics are in principle as sound and unambiguous as any other area of mathematics. The pitfalls that so often get us into trouble with statistics do not relate to the precision and consistency of the formal theory or the mathematical manipulations, but they tend to emerge when we apply statistics and probability to practical scenarios and real-world problems. The main ones are related to:

• The meaning of uncertainty

Probabilities quantify uncertainty, but of this there are two types. Probabilities can express (a) our ignorance of something that cannot be known (because it is still to happen and can go either way, e.g. the probability of finding a six for a dice that is still to be rolled), or (b) our ignorance of something that is known, but not by us (e.g. the probability of finding a six for a dice that has been rolled inside a black box). For instance, if in biology or medicine we write

Prob(phenotype) = ∑_genotypes Prob(genotype) × Prob(phenotype|genotype)

(genotype refers to the genetic code written in an organism's DNA, and phenotype refers to how this code has been translated into the traits of this organism) then the uncertainty expressed by Prob(genotype), an individual's genotype, would be of type (b), i.e. in principle written in stone, but we don't generally have the full required genetic information. In contrast, given the genotype we would still expect variability in the phenotype, as expressed by Prob(phenotype|genotype), which results at least partly from non-predictable events (e.g. cell signalling, mutations); i.e. the latter conditional probabilities represent type (a) uncertainty. In medicine the difference between the two can be quite relevant.

• Accidental conditioning

This is what happens when we inadvertently collect our information from a non-representative subset of the events or individuals on which we seek to make statistical statements. It occurs in the Monty Hall problem, where Monty's decision of which door to open is constrained by the initial selection of the winner, and this brings in subtle extra information that is exploited when the winner switches at the last minute. It also occurs in the example of share price analysis, where we condition our sample of companies on at least 20 years' survival. Extra information B generally modifies


the probability to observe an event A: we may wish to sample according to a prior P(A), but end up sampling according to a conditioned posterior measure P(A|B), described by the Bayesian relation

P(A|B) = P(A,B)/P(B) = P(A) × P(B|A)/P(B)

in which P(A|B) plays the role of the posterior and P(A) that of the prior.

• Limitations of our intuition

Possibly because of the evolutionary advantages of pattern detection, and many millennia of selection pressure, humans are obsessed with patterns, and consequently very poor at judging likelihoods objectively. We struggle to accept intuitively that even after we have thrown ten successive sixes with a fair dice (an unlikely sequence of events, for which the a priori probability is around 1.7 × 10^-8), for the next roll to give yet another six still carries a probability of just 1/6 (in spite of the fact that this would lead to an even more remarkable sequence of eleven sixes in a row). By the same token most humans would struggle to generate sequences of random numbers 01001010001011010010...; it is trivial to write a simple computer program that can predict the next digit in such human-generated sequences correctly with a probability of some 60%. This failing statistical intuition explains why most would get the answer to the Monty Hall question wrong, and partly explains the profitability of the gambling industry ...

• Assumptions behind methods

All statistical methods involve explicit or implicit assumptions, and many involve further approximations. If nothing is assumed, nothing can be calculated. For instance, in least-squares data fitting and principal component analysis one assumes that the noise in the data is Gaussian; to use the central limit theorem it is not sufficient to have a large sum of independent random variables, but we also have to satisfy specific criteria on the distributions of the random variables (e.g. Lindeberg's condition), etc. Obviously, the correctness of any outcome of such methods depends on the extent to which these assumptions and approximations are reasonable in the context of the problem at hand. It is therefore important that one has a basic understanding of what these assumptions and approximations are, so that one can convince oneself that the chosen method can be used.

• Imprecise definitions

Finally, in statistics especially it is vital that we are very precise in defining our key quantities. When we speak about the probability of getting a specific disease within a given time, do we mean the probability for one individual? Or perhaps the probability for a randomly drawn individual from a population? Do we include our ignorance of the previous two? (i.e. the probability that our estimate of the probability is correct ...)
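The earlier claim that a short program can beat chance on human-generated 'random' digits can be illustrated with a sketch like the following (our own toy construction, not a program from the text): it predicts each digit from the observed frequencies of what followed the same two-digit context earlier in the sequence, learned on the fly.

```python
from collections import defaultdict

def online_prediction_accuracy(bits):
    """Predict each bit of a 0/1 string from counts of what followed the
    same two-bit context earlier in the string; return the fraction of
    correct predictions."""
    counts = defaultdict(lambda: [0, 0])   # context -> [#zeros, #ones] seen next
    correct = 0
    for i in range(2, len(bits)):
        c0, c1 = counts[bits[i-2:i]]
        guess = '0' if c0 > c1 else '1'    # ties default to '1'
        correct += (guess == bits[i])
        counts[bits[i-2:i]][int(bits[i])] += 1
    return correct / (len(bits) - 2)
```

Humans tend to alternate and to avoid long runs, which creates exactly the context regularities such a predictor exploits; on genuinely random bits it scores about 50%, on over-alternating sequences far more.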


2

DEFINITIONS AND BASIC PROPERTIES IN SURVIVAL ANALYSIS

We start with an introduction to the field of survival analysis. We discuss the nature of the medical data to which it is applied, we establish some basic notation conventions, and we formulate what the statistical methods of survival analysis set out to achieve.

2.1 Notation, data and objective

Notation and data. Imagine we have data on a cohort of N patients, labelled i = 1 . . . N. They are subject to R distinct 'hazards' or 'risks', labelled r = 1 . . . R, which trigger irreversible events such as onset of a given disease of interest, death due to causes other than the disease of interest, etc. We also measure p characteristics of our patients (e.g. gender, blood serum counts, BMI, socio-economic factors, genetic variables, etc), resulting for each patient i in a list of p numbers (Z_{i1}, . . . , Z_{ip}), the so-called 'covariates'. Covariates can be discrete (e.g. gender) or real-valued (e.g. BMI). We monitor our cohort during a trial of finite duration, and record for each patient when the first event happened to them, and which event this was; the start of the trial is taken as time zero. To label those patients that did not record any event during the trial (they could be lost to follow-up along the way, or may have reached the end of the trial without experiencing any of the R events), we introduce a further 'risk' r = 0 (which we will refer to simply as end-of-trial). Our data thus take the following form. For each patient i we have

Z_i = (Z_{i1}, . . . , Z_{ip}) : the values of p covariates
X_i ≥ 0 : the time at which the first event occurred
∆_i ∈ {0, . . . , R} : a label indicating which event occurred at time X_i

A typical example of such data is shown in figure 2.1. The covariates (or ‘ex-planatory factors’) can be divided into three qualitatively distinct groups:

uncontrolled covariates: e.g. gender, genetic make-up, etc
controlled covariates: e.g. medical treatment
modifiable covariates: e.g. smoking, drinking, nutrition, etc

Objectives of survival analysis. Survival analysis is the statistical discipline that deals with data of the above type, and tries to extract patterns from these data to quantify the relations (if any) between the covariates and the risks. Usually we


are interested mainly in one particular risk, traditionally chosen as r = 1, so the other risks r ∈ {0, 2, 3, . . . , R} are unfortunate complications. More specifically we would like to (i) evaluate quantitatively the effects of covariates, (ii) predict event times in the cohort from knowledge of the covariates, and (iii) compare and validate different models with which to explain our survival data.

Censoring. 'Censoring' means that the value of a measurement is only partially known, as is the case here. An end-of-trial outcome ∆_i = 0 means that all we know is that patient i is either 'lost' along the way, or will experience his/her first actual event from the risk set {1, . . . , R} at some time t_i ≥ C, where C is the duration of our trial. This latter option is called 'right censoring'. Alternative types of censoring (which we will not deal with here) are 'left censoring', i.e. t_i ≤ C for some C, or 'interval censoring', i.e. t_i ∈ [C_1, C_2] for some C_1, C_2. Here we label all the patients i that are censored as ∆_i = 0, and use X_i to denote the time at which they left our trial.
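The data format (X_i, ∆_i), with the extra end-of-trial label ∆_i = 0, is easy to illustrate by simulation. The sketch below is our own illustration; it assumes constant cause-specific hazard rates, i.e. exponentially distributed latent event times, which is not required by the general framework.

```python
import random

def simulate_cohort(n, rates, trial_duration, seed=0):
    """Generate survival data (X_i, Delta_i) for n patients subject to
    R = len(rates) competing risks with constant cause-specific hazard
    rates; patients with no event before the end of the trial are
    right-censored and labelled Delta_i = 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # latent event times, one per risk r = 1..R
        times = [rng.expovariate(rate) for rate in rates]
        first = min(times)
        if first < trial_duration:
            data.append((first, 1 + times.index(first)))   # observed event
        else:
            data.append((trial_duration, 0))               # end-of-trial
    return data
```

For example, simulate_cohort(4, [0.2, 0.1], trial_duration=5.0) returns four (X_i, ∆_i) pairs with ∆_i ∈ {0, 1, 2}, exactly the shape of the records in figure 2.1 (minus covariates).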

Complications. The main complications in survival analysis are caused by (i) the statistical 'noise' caused by censoring, (ii) the fact that different risks prevent each other from happening (or from being observed), e.g. if a patient dies we will never know whether and when he/she would have got the disease of interest, (iii) possible correlations between the different risks, (iv) heterogeneity in cohorts (in terms of covariates, and in terms of what covariates imply in terms of risks), and (v) the fact that most studies are 'underpowered', i.e. we want to extract complicated statistical patterns from data on relatively small patient cohorts, which brings the danger of overfitting and non-reproducibility of results.

2.2 Survival probability and cause-specific hazard rates

Joint event times and survival function. Imagine the hypothetical situation where for each individual i all events r = 0 . . . R could in principle be observed (irrespective of their nature and their order in time), and let t_r denote the time at which event r occurs. If we assume also that all events will ultimately always happen (some earlier and some later, and some perhaps only at times so late as to be practically irrelevant), we write the joint distribution of the event times for individual i as

P_i(t_0, . . . , t_R)    (2.1)

Since we assume that each event will ultimately happen, this distribution must be normalised, so

∫_0^∞ ⋯ ∫_0^∞ dt_0 ⋯ dt_R P_i(t_0, . . . , t_R) = 1    (2.2)

We can next define the integrated event time distribution for individual i:

S_i(t_0, . . . , t_R) = ∫_{t_0}^∞ ⋯ ∫_{t_R}^∞ ds_0 ⋯ ds_R P_i(s_0, . . . , s_R)
                     = ∫_0^∞ ⋯ ∫_0^∞ ds_0 ⋯ ds_R P_i(s_0, . . . , s_R) ∏_{r=0}^R θ(s_r − t_r)    (2.3)

with the step function, defined as θ(z > 0) = 1 and θ(z < 0) = 0. S_i(t_0, . . . , t_R) gives the probability that for individual i event 0 occurs later than t_0, and event 1 occurs later than t_1, and . . . etc. Note that

S_i(0, . . . , 0) = ∫_0^∞ ⋯ ∫_0^∞ ds_0 ⋯ ds_R P_i(s_0, . . . , s_R) = 1    (2.4)

We can now define the survival function S_i(t) as the probability that for individual i all events r = 0 . . . R will happen later than time t:

S_i(t) = S_i(t_0, . . . , t_R)|_{t_r = t for all r} = S_i(t, t, . . . , t)    (2.5)

Cause-specific hazard rates. We next want to characterise for each individual risk how likely it is to trigger an event as a function of time, and how it impacts on the overall survival probability S_i(t) defined above. This is done via the so-called cause-specific hazard rates, defined as

π^i_µ(t) = −[∂/∂t_µ log S_i(t_0, . . . , t_R)]_{t_r = t for all r}    (2.6)

Whenever we write log(.) we will mean the natural logarithm. Inserting the definition of S_i(t) above, and using (d/dz)θ(z) = δ(z) (see Appendix A on the δ-distribution) allows us to work this out:

π^i_µ(t) = [ ∫_0^∞ ⋯ ∫_0^∞ ds_0 ⋯ ds_R P_i(s_0, . . . , s_R) δ(s_µ − t_µ) ∏_{r≠µ} θ(s_r − t_r) / S_i(t_0, . . . , t_R) ]_{t_r = t ∀r}
         = ∫_t^∞ ⋯ ∫_t^∞ (∏_{r≠µ} ds_r) P_i(s_0, . . . , s_{µ−1}, t, s_{µ+1}, . . . , s_R) / S_i(t)    (2.7)

So π^i_µ(t)dt gives the probability for individual i that event µ happens in the time interval [t, t+dt), given that no event has happened to i yet prior to time t:

π^i_µ(t)dt = Prob(t^i_µ ∈ [t, t+dt) | i had no events yet at time t)    (dt ↓ 0)    (2.8)

Since π^i_µ(t) gives a probability of hazardous events per unit time, it is called a 'hazard rate'. It depends on which risk µ we are discussing, hence it is 'cause specific'. The subtlety is in the conditioning: it is defined conditional on the individual still being event-free at the relevant time.
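The conditional definition (2.8) suggests a direct empirical check: among individuals still event-free at time t, count which fraction experience an event of type µ within the next dt. The sketch below is our own illustration, using independent exponential event times with constant rates; for small dt the estimate should be close to π_µ(t)dt = rate_µ × dt.

```python
import random

def empirical_hazard_dt(rates, t, dt, n=200_000, seed=7):
    """Among simulated individuals whose first event falls at or after t,
    estimate the probability that an event of each type occurs in
    [t, t+dt); this is the Monte Carlo counterpart of pi_mu(t)*dt."""
    rng = random.Random(seed)
    at_risk = 0
    hits = [0] * len(rates)
    for _ in range(n):
        times = [rng.expovariate(r) for r in rates]
        first = min(times)
        if first >= t:                 # still event-free at time t
            at_risk += 1
            if first < t + dt:
                hits[times.index(first)] += 1
    return [h / at_risk for h in hits]
```

For constant rates, the memorylessness of the exponential distribution means these estimates do not depend on t, which is a feature of this toy model rather than of hazard rates in general.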

Survival function in terms of cause-specific hazard rates. It turns out that the overall survival probability S_i(t) can be written in terms of the cause-specific hazard rates, in a simple way. To see this we calculate

(d/dt) log S_i(t) = (d/dt) log S_i(t, t, . . . , t) = ∑_{r=0}^R [∂/∂t_r log S_i(t_0, . . . , t_R)]_{t_r = t ∀r} = −∑_{r=0}^R π^i_r(t)    (2.9)

Hence, using S_i(0) = 1,

log S_i(t) = log S_i(0) − ∑_{r=0}^R ∫_0^t ds π^i_r(s) = −∑_{r=0}^R ∫_0^t ds π^i_r(s)    (2.10)

so

S_i(t) = e^{−∑_{r=0}^R ∫_0^t ds π^i_r(s)}    (2.11)
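Relation (2.11) is straightforward to evaluate numerically for any given set of cause-specific hazard rates. A minimal sketch (our own illustration; trapezoidal integration, with the hazard rates passed in as Python functions):

```python
import math

def survival_from_hazards(hazard_rates, t, steps=1000):
    """Evaluate S(t) = exp(-sum_r int_0^t pi_r(s) ds), as in (2.11),
    by trapezoidal integration of each cause-specific hazard rate."""
    if t == 0.0:
        return 1.0
    h = t / steps
    cumulative = 0.0
    for pi in hazard_rates:
        values = [pi(k * h) for k in range(steps + 1)]
        # trapezoidal rule for int_0^t pi(s) ds
        cumulative += h * (sum(values) - 0.5 * (values[0] + values[-1]))
    return math.exp(-cumulative)
```

With constant rates π_1 = 0.3 and π_2 = 0.2 this returns exp(−0.5t), the familiar exponential survival curve; a Weibull-type rate π(s) = 2s gives exp(−t²).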

Data likelihood in terms of cause-specific hazard rates. Similarly we can express also the likelihood P_i(X, ∆)dX to observe in our trial patient i reporting a first event of type ∆ at a time in the interval [X, X+dX) (with dX ↓ 0) in terms of the cause-specific hazard rates. To observe the above, the following three statements must be true:

the time of the event is in [X, X + dX),
the type of the event is ∆, and
no events occurred prior to X.

These three demands can all be written in terms of properties of the joint event times (t_0, . . . , t_R) of individual i:

θ(t_∆ − X) θ(X + dX − t_∆) ∏_{r≠∆} θ(t_r − X) = 1    (2.12)

and the likelihood P_i(X, ∆) can therefore be written as¹

P_i(X, ∆) = lim_{dX↓0} (1/dX) Prob_i( θ(t_∆ − X) θ(X + dX − t_∆) ∏_{r≠∆} θ(t_r − X) = 1 )
          = lim_{dX↓0} (1/dX) ∫_0^∞ ⋯ ∫_0^∞ dt_0 ⋯ dt_R P_i(t_0, . . . , t_R) θ(t_∆ − X) θ(X + dX − t_∆) ∏_{r≠∆} θ(t_r − X)
          = ∫_0^∞ ⋯ ∫_0^∞ dt_0 ⋯ dt_R P_i(t_0, . . . , t_R) lim_{ε↓0} h_ε(t_∆ − X) ∏_{r≠∆} θ(t_r − X)    (2.13)

¹ Note that we implicitly assume that the joint event time distribution P_i(t_0, . . . , t_R) is continuous and smooth, so that the probability of seeing ties in the timing of events, i.e. t_µ = t_ν for µ ≠ ν, is negligible.


pat  BMI          SELENIUM  PHYS_ACT_LEIS  PHYS_ACT_WORK  Smoking  Time           PCcens
1    22.63671875  105       2              1              0        33.6892539357  0
2    34.21875     65        2              2              0        24.810403833   1
3    20.06640625  72        2              3              2        23.1047227926  2
4    28.3984375   81        2              0              2        33.6783025325  0
5    22.94921875  73        0              3              2        32.8843258042  0
...  (56 further rows omitted)

Fig. 2.1. Sample survival data from the ULSAM prostate cancer study. Column one: patient label i. Columns two to six: values of five covariates (Z_{i1}, . . . , Z_{i5}). Four are modifiable (BMI, leisure time physical activity, physical activity at work, smoking) and one is uncontrolled (selenium level in the blood). Last two columns: event time and label (X_i, ∆_i). Entries '-10' imply missing data.

Page 18: Principles of Survival Analysis · review in full detail the principles and mathematical derivations of traditional survival analysis, and then rebuild the edi ce in the most transparent

10 DEFINITIONS AND BASIC PROPERTIES IN SURVIVAL ANALYSIS

with

h_ε(z) = ε^{-1} θ(z) θ(ε−z) = { ε^{-1}  for z ∈ [0, ε]
                              { 0       elsewhere                                  (2.14)

We note that the function lim_{ε↓0} h_ε(z) has all the properties that define the δ-function (see Appendix A): h_ε(z) ≥ 0 for all ε > 0, ∫dz h_ε(z) = 1 for all ε > 0, lim_{ε↓0} h_ε(z) = 0 for all z ≠ 0, and lim_{ε↓0} h_ε(0) = ∞. So lim_{ε↓0} h_ε(z) = δ(z), and we get

P_i(X,Δ) = ∫_0^∞ ··· ∫_0^∞ dt_0 ··· dt_R  P_i(t_0,...,t_R) δ(t_Δ−X) ∏_{r≠Δ} θ(t_r−X)
         = S_i(X) π^i_Δ(X)
         = π^i_Δ(X) e^{−∑_{r=0}^R ∫_0^X ds π^i_r(s)}                               (2.15)

where we used (2.7) in the first step, and (2.11) in the second. So the survival probabilities S_i(t) and the data likelihoods P_i(X,Δ) can both be written strictly in terms of the cause-specific hazard rates. The final picture is as in the diagram below. We can therefore anticipate that in all our statistical analyses of the data the cause-specific hazard rates will play a central role.

P_1(t_0,...,t_R)   ......   P_N(t_0,...,t_R)        (individual event time statistics)
                     ⇓
(π^1_0(t),...,π^1_R(t))   ......   (π^N_0(t),...,π^N_R(t))        (individual hazard rates)
                     ⇓
(X_1,Δ_1)   ......   (X_N,Δ_N)        (observed survival data)

Starting our description from the distributions P_i(t_0,...,t_R) was useful in terms of understanding how cause-specific hazard rates π^i_r(t) emerge, but working directly at the level of these rates has advantages. First, it saves us from having to think in terms of the event times (t_0,...,t_R) and their distribution (which refers to a hypothetical situation where all event times could be observed, including e.g. the onset of a disease after death). Secondly, we will see that upon using the hazard rates as a starting point we can also deal in a transparent way with events that have a nonzero probability of never happening; these we have so far ruled out as a result of starting with a normalised P_i(t_0,...,t_R).

Cause-specific hazard rates in terms of data probabilities. We have seen that the data probabilities P_i(X,Δ) can be written fully in terms of the cause-specific hazard rates π^i_r(t). It turns out that the converse is also true, i.e. the cause-specific hazard rates can be written fully and explicitly in terms of the data probabilities P_i(X,Δ). To see this, let us first sum over Δ in (2.15):

∑_{Δ=0}^R P_i(X,Δ) = ( ∑_{Δ=0}^R π^i_Δ(X) ) e^{−∑_{r=0}^R ∫_0^X ds π^i_r(s)}
                   = −(d/dX) e^{−∑_{r=0}^R ∫_0^X ds π^i_r(s)}                      (2.16)

Hence

e^{−∑_{r=0}^R ∫_0^X ds π^i_r(s)} = 1 − ∫_0^X dt ∑_{r=0}^R P_i(t,r)
                                 = ∑_{r=0}^R ∫_X^∞ dt P_i(t,r)                     (2.17)

Substituting this into the right-hand side of (2.15), followed by rearranging to make the hazard rate the subject of the equation, immediately gives us

π^i_Δ(X) = P_i(X,Δ) / ( ∑_{r=0}^R ∫_X^∞ dt P_i(t,r) )                              (2.18)

So, if we wanted, we could build our theory entirely in the language of data probabilities P_i(X,Δ), as opposed to the language of the cause-specific hazard rates. Summation over Δ on both sides of (2.18) gives another transparent and useful identity, relating the cumulative hazard rate ∑_{r=0}^R π^i_r(X) (i.e. the rate of events, irrespective of type, conditional on there not having been any events prior to time X) to the distribution P_i(X) = ∑_{r=0}^R P_i(X,r) of reported event times (of any type):

∑_{r=0}^R π^i_r(X) = ∑_{r=0}^R P_i(X,r) / ( ∑_{r=0}^R ∫_X^∞ dt P_i(t,r) ) = P_i(X) / ∫_X^∞ dt P_i(t)        (2.19)

Possible pitfalls and misconceptions. The trickiest aspect of survival analysis is its formulation in terms of cause-specific hazard rates, which involve nontrivial conditioning, at any time t, on there not having been any event prior to time t. This causes interpretation issues. For instance:

• Expression (2.11) can be written in a form that factorises over the different risks, as S_i(t) = ∏_r exp[−∫_0^t ds π^i_r(s)]. Does this imply that the risks are uncorrelated? No, it does not. All risks r ≠ μ will generally contribute to each π^i_μ(t), since the alternative risks modify the conditioning, i.e. the likelihood that nothing has happened yet prior to t. The risks may well interact strongly with each other, but we can no longer see this after we have calculated the rates π^i_μ(t) and forgotten about the times (t_0,...,t_R).

• Starting from the survival function (2.11), do we get the survival function for the hypothetical situation where risk μ is disabled by setting π^i_μ(t) to zero, i.e. S_i(t) → exp[−∑_{r≠μ} ∫_0^t ds π^i_r(s)]? No, we do not. We would indeed have π^i_μ(t) = 0 for all t, but that is not all. If we disable risk μ, its removal can in principle also change all hazard rates π^i_r(t) with r ≠ μ, due to correlations connecting the different risks.

2.3 Examples

Example 1: time-independent hazard rates

Here we have π^i_r(t) = π^i_r, independent of t, for all (r,i). Thus ∫_0^X ds π^i_μ(s) = π^i_μ X.

This gives the following simple formulae for the survival function and the data likelihood:

S_i(t) = e^{−t ∑_{r=0}^R π^i_r}        P_i(X,Δ) = π^i_Δ e^{−X ∑_{r=0}^R π^i_r}        (2.20)

Example 2: a single risk r = 1

Suppose we have only one risk, r = 1, and for each patient i a single hazard rate π_i(t):

S_i(t) = e^{−∫_0^t ds π_i(s)}        P_i(X,Δ) = π_i(X) e^{−∫_0^X ds π_i(s)} δ_{Δ,1}        (2.21)

In this case we can write the event time distribution in terms of the hazard rate via (2.3):

P_i(t) = −(d/dt) S_i(t) = π_i(t) e^{−∫_0^t ds π_i(s)}        (2.22)

which is of course no surprise in view of (2.15). Now it makes perfect sense to think in terms of P_i(t): there is only one risk, so there is nothing hypothetical about the event time t (as no other events can prevent it from being observed).

If we have one risk only, and moreover a time-independent hazard rate (i.e. a combination of the two examples discussed above), we obtain

S_i(t) = e^{−t π_i}        P_i(X,Δ) = P_i(X) δ_{Δ,1}        P_i(t) = π_i e^{−t π_i}        (2.23)


Example 3: the most probable event time distribution for R = 1, given the value of the average

Finally, let us show how the exponential distribution of event times in (2.21) can be seen as the simplest natural choice for the case R = 1, in an information-theoretic sense. Suppose the only knowledge we have of P_i(t) is the value of the average event time ⟨t⟩_i. The most probable distribution P_i(t) with this average is found by maximizing the Shannon entropy H_i = −∫_0^∞ dt P_i(t) log P_i(t), subject to the two constraints ∫_0^∞ dt P_i(t) = 1 and ∫_0^∞ dt P_i(t) t = ⟨t⟩_i.² The maximum is found via the Lagrange method:

δ/δP_i(x) ∫_0^∞ ds P_i(s) log P_i(s) = δ/δP_i(x) [ λ_0 ∫_0^∞ ds P_i(s) + λ_1 ∫_0^∞ ds P_i(s) s ]

1 + log P_i(x) = λ_0 + λ_1 x,    so    P_i(t) = e^{λ_0 − 1 + λ_1 t}

We note that λ_1 < 0 is required for P_i(t) to be normalisable. Normalisation gives

1 = e^{λ_0−1} ∫_0^∞ dt e^{λ_1 t} = −λ_1^{−1} e^{λ_0−1}

So e^{λ_0−1} = −λ_1, giving P_i(t) = |λ_1| e^{−|λ_1| t}. Finally we demand that the average time is ⟨t⟩_i:

⟨t⟩_i = ∫_0^∞ dt t |λ_1| e^{−|λ_1| t} = (1/|λ_1|) ∫_0^∞ ds s e^{−s}
      = (1/|λ_1|) ( [−s e^{−s}]_0^∞ + ∫_0^∞ ds e^{−s} )
      = (1/|λ_1|) ( 0 − [e^{−s}]_0^∞ ) = 1/|λ_1|

Hence

P_i(t) = π_i e^{−t π_i}    with    π_i = 1/⟨t⟩_i        (2.24)

² Strictly speaking we must also demand that P_i(t) ≥ 0 for all t ≥ 0, but it turns out that this latter demand is satisfied automatically.
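The maximum-entropy property can also be probed numerically: among densities on [0,∞) with the same mean, the exponential should have the largest Shannon entropy. A minimal sketch, comparing it against one hand-picked competitor with the same mean (a gamma density; the choice of competitor and of the mean value is ours):

```python
import numpy as np

mean = 2.0                                  # the fixed average event time <t>
t = np.linspace(1e-8, 120.0, 1_200_001)
dt = t[1] - t[0]

def entropy(p):
    # Shannon entropy -int dt p(t) log p(t), by rectangle-rule quadrature
    p = np.clip(p, 1e-300, None)
    return -np.sum(p * np.log(p)) * dt

# Exponential density with the required mean: the candidate singled out by (2.24)
p_exp = np.exp(-t / mean) / mean

# Competitor with the same mean: a gamma density with shape 2 and scale mean/2
p_gam = t * np.exp(-t / (mean / 2)) / (mean / 2) ** 2

print(entropy(p_exp), entropy(p_gam))       # the exponential entropy is the larger
```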


3

EVENT TIME CORRELATIONS AND IDENTIFIABILITY

We have seen that knowing the joint event time statistics P_i(t_0,...,t_R) of an individual i allows us to calculate the cause-specific hazard rates π^i_0(t),...,π^i_R(t). We now ask the following: if we know the cause-specific hazard rates, which we expect can be estimated from the data via (2.15), can we deduce from this the distribution P_i(t_0,...,t_R)? In particular, can we deduce from the hazard rates whether or not the event times of different risks are statistically independent? This will become important when we turn to competing risks later.

3.1 Independently distributed event times

If the event times are all uncorrelated, i.e. if knowing one such time conveys no information on the others, the joint distribution factorises by definition into the simple form

P_i(t_0,...,t_R) = ∏_{r=0}^R P_{ir}(t_r)        (3.1)

Via (2.3) we then get

S_i(t_0,...,t_R) = ∏_{r=0}^R ∫_0^∞ ds_r P_{ir}(s_r) θ(s_r − t_r) = ∏_{r=0}^R S_{ir}(t_r)        (3.2)

S_{ir}(t) = ∫_t^∞ ds P_{ir}(s)        (3.3)

So the probability to observe that each event r happens at a time later than t_r is just the product of the individual survival probabilities S_{ir}(t_r) for the risks. The cause-specific hazard rates (2.6) become

π^i_μ(t) = −[ ∂/∂t_μ ∑_{r=0}^R log S_{ir}(t_r) ]_{t_r=t ∀r}
         = −[ ∂/∂t_μ log S_{iμ}(t_μ) ]_{t_μ=t} = −(d/dt) log S_{iμ}(t)        (3.4)

Hence, if we integrate both sides, and use S_{ir}(0) = 1 (no events have occurred yet at time t = 0):

log S_{iμ}(t) = log S_{iμ}(0) − ∫_0^t ds π^i_μ(s) = −∫_0^t ds π^i_μ(s)        (3.5)


giving, as expected,

S_{iμ}(t) = e^{−∫_0^t ds π^i_μ(s)}        (3.6)

If we now differentiate (3.3) and use our formula for S_{ir}(t), we find that we can express the event time probabilities for each risk in terms of the associated hazard rates. This results in the following generalisation to multiple independent risks of formula (2.22):

P_{ir}(t) = −(d/dt) S_{ir}(t) = −(d/dt) e^{−∫_0^t ds π^i_r(s)} = π^i_r(t) e^{−∫_0^t ds π^i_r(s)}        (3.7)

3.2 The identifiability problem

We have seen above that for the special case of statistically independent event times one can indeed calculate the event time probabilities uniquely from the cause-specific hazard rates. However, we can also deduce something else from the above derivation:

For any set of cause-specific hazard rates π^i_0(t),...,π^i_R(t), including those that correspond to statistically dependent event times, there always exists a distribution for independent event times that will give exactly the same cause-specific hazard rates, namely

P_i(t_0,...,t_R) = ∏_{r=0}^R [ π^i_r(t_r) e^{−∫_0^{t_r} ds π^i_r(s)} ]        (3.8)

It follows that knowledge of the cause-specific hazard rates alone (which is all we may ever hope to extract from survival data) does not generally permit us to identify the underlying joint distribution of event times; in particular, we cannot find out from survival data alone whether or not the event times of the different risks are statistically independent. This is Tsiatis' identifiability problem.

Tsiatis' result appears to have created some pessimism in the past as to what can be achieved with statistical analyses, especially in the context of so-called 'competing risks'. We will turn to these in more detail later; for now let us just say that 'competing risks' describes the situation where, at the level of populations or trial cohorts, the event times of different risks appear to be correlated. The identifiability problem suggested to some that in the case of competing risks there is not much that survival analysis can do. Let us counteract this with a few observations:

• If the cohort under study is homogeneous, then P_i(t_0,...,t_R) will be identical for all i and thus also describe the statistics of the cohort; here we do indeed have a problem. However, if competing risks are due to population-level correlations of hazard rates in an inhomogeneous population, then the identifiability problem doesn't arise. One could imagine all individuals having independent event times, i.e. P_i(t_0,...,t_R) = ∏_r P_{ir}(t_r) for each i, but correlated hazard rates: those individuals with a higher hazard rate for diabetes might for instance also have a higher hazard rate for pancreatic cancer. Here we would at population level have P(t_0,...,t_R) ≠ ∏_r P_r(t_r) and S(t) ≠ ∏_r S_r(t). But since the correlations are now generated at the level of hazard rates (which are in principle accessible via data), this would represent a competing risk problem that can be solved.

• Even if we do have correlated event times at the level of individuals, all is not lost. If we only extract hazard rates from our data then we indeed cannot extract from these the distribution P_i(t_0,...,t_R). But in Bayesian regression we do not require uniqueness of explanations anyway. We would calculate the likelihood of each possible explanation P_i(t_0,...,t_R) for the observed hazard rates, and find the most plausible one.

• Finally, it might well be that the 'statistically independent' event time explanation above, which we can always construct for any set of observed hazard rates, has unwanted or unlikely mathematical or interpretational features. For instance, to be mathematically acceptable the distributions P_{ir}(t) need to be normalised, i.e. we must demand

1 = ∫_0^∞ dt P_{ir}(t) = ∫_0^∞ dt π^i_r(t) e^{−∫_0^t ds π^i_r(s)}
  = ∫_0^∞ dt [ −(d/dt) e^{−∫_0^t ds π^i_r(s)} ] = 1 − e^{−∫_0^∞ ds π^i_r(s)}        (3.9)

Hence the independent-times explanation for the hazard rates requires that

lim_{t→∞} ∫_0^t ds π^i_r(s) = ∞        (3.10)

This is another way of saying that the probability of event r never occurring must be zero. We will see in an example below that this is not always satisfied.

3.3 Examples

Example 1: true versus independent-times explanation for observed hazard rates

Let us inspect the following event time distribution for the times t_1, t_2 ≥ 0, with parameters a, b, τ > 0 and ε ∈ [0,1], for a single individual:

P(t_1,t_2) = a e^{−a t_2} [ ε δ(t_1 − t_2 − τ) + (1−ε) b e^{−b t_1} ]        (3.11)


It has the form P(t_1,t_2) = P(t_1|t_2) P(t_2), with

P(t_2) = a e^{−a t_2},    P(t_1|t_2) = ε δ(t_1−t_2−τ) + (1−ε) b e^{−b t_1}        (3.12)

P(t_1,t_2) is clearly nonnegative and normalised, so it is a bona fide joint distribution. With probability 1−ε the two times are statistically independent, and with probability ε event 1 happens precisely a duration τ later than event 2. The integrated distribution S(t_1,t_2) is

S(t_1,t_2) = ∫_{t_1}^∞ ds_1 ∫_{t_2}^∞ ds_2  a e^{−a s_2} [ ε δ(s_1 − s_2 − τ) + (1−ε) b e^{−b s_1} ]
           = ε ∫_{t_2}^∞ ds_2 a e^{−a s_2} ∫_{t_1}^∞ ds_1 δ(s_1−s_2−τ) + (1−ε) ∫_{t_2}^∞ ds_2 a e^{−a s_2} ∫_{t_1}^∞ ds_1 b e^{−b s_1}
           = ε ∫_{t_2}^∞ ds_2 a e^{−a s_2} θ(s_2+τ−t_1) + (1−ε) e^{−b t_1} ∫_{t_2}^∞ ds_2 a e^{−a s_2}
           = ε ∫_{max(t_2, t_1−τ)}^∞ ds_2 a e^{−a s_2} + (1−ε) e^{−b t_1 − a t_2}
           = ε e^{−a max(t_2, t_1−τ)} + (1−ε) e^{−b t_1 − a t_2}
           = { ε e^{−a t_2} + (1−ε) e^{−b t_1 − a t_2}         if t_2 > t_1 − τ
             { ε e^{−a(t_1−τ)} + (1−ε) e^{−b t_1 − a t_2}      if t_2 < t_1 − τ        (3.13)

This gives for the survival function S(t) = S(t,t):

S(t) = e^{−a t} ( ε + (1−ε) e^{−b t} )        (3.14)

Next we calculate the cause-specific hazard rates for this example, via (2.6). Since after the partial differentiations we must set t_1, t_2 → t, we need only use the formula that applies for t_2 > t_1 − τ:

π_1(t) = −(∂/∂t_1) log S(t_1,t_2) |_{t_1=t_2=t}
       = −(∂/∂t_1) log[ ε e^{−a t_2} + (1−ε) e^{−b t_1 − a t_2} ] |_{t_1=t_2=t}
       = −(∂/∂t_1)[ (1−ε) e^{−b t_1 − a t_2} ] |_{t_1=t_2=t} / ( ε e^{−a t} + (1−ε) e^{−(a+b)t} )
       = b(1−ε) e^{−(a+b)t} / ( ε e^{−a t} + (1−ε) e^{−(a+b)t} )
       = b(1−ε) e^{−b t} / ( ε + (1−ε) e^{−b t} ) = b ( 1 + (ε/(1−ε)) e^{b t} )^{−1}        (3.15)

π_2(t) = −(∂/∂t_2) log S(t_1,t_2) |_{t_1=t_2=t}
       = −(∂/∂t_2) log[ ε e^{−a t_2} + (1−ε) e^{−b t_1 − a t_2} ] |_{t_1=t_2=t}
       = −(∂/∂t_2)[ e^{−a t_2} ( ε + (1−ε) e^{−b t_1} ) ] |_{t_1=t_2=t} / ( ε e^{−a t} + (1−ε) e^{−(a+b)t} )
       = a        (3.16)


The hazard rate for cause 1 decays monotonically from the initial value π_1(0) = b(1−ε) down to zero as t → ∞. The hazard rate for cause 2 is independent of time.
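A simulation sketch of this example (all parameter values are ours): sample (t_1,t_2) from the correlated distribution (3.11), and estimate the cause-specific hazards by counting first events among the samples still event-free. The estimates should reproduce (3.15) and (3.16):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, tau, eps = 1.0, 2.0, 0.5, 0.3       # illustrative parameter values
n = 1_000_000

# Sample (t1, t2) from the correlated joint distribution (3.11):
# t2 ~ Exp(a); with probability eps, t1 = t2 + tau, otherwise t1 ~ Exp(b)
t2 = rng.exponential(1.0 / a, n)
coupled = rng.random(n) < eps
t1 = np.where(coupled, t2 + tau, rng.exponential(1.0 / b, n))

# Cause-specific hazards at time t0: among samples still event-free at t0,
# count first events of each type landing in [t0, t0 + dt)
t0, dt = 0.4, 0.01
risk = (t1 > t0) & (t2 > t0)
haz1 = np.mean((t1[risk] < t0 + dt) & (t2[risk] > t1[risk])) / dt
haz2 = np.mean((t2[risk] < t0 + dt) & (t1[risk] > t2[risk])) / dt

# Closed forms (3.15) and (3.16)
pi1 = b * (1 - eps) * np.exp(-b * t0) / (eps + (1 - eps) * np.exp(-b * t0))
print(haz1, pi1)                          # cause 1: simulation vs formula
print(haz2, a)                            # cause 2: simulation vs formula
```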

We can now calculate the alternative explanation P_indep(t_1,t_2) = P_1(t_1) P_2(t_2) for the above cause-specific hazard rates, based on the assumption of independent times, as given in (3.8). For this we first require the time integrals of the hazard rates:

∫_0^t ds π_1(s) = ∫_0^t ds  b(1−ε) e^{−b s} / ( ε + (1−ε) e^{−b s} )
                = −[ log( ε + (1−ε) e^{−b s} ) ]_0^t = −log( ε + (1−ε) e^{−b t} )        (3.17)

∫_0^t ds π_2(s) = a t        (3.18)

with which we obtain

P_1(t_1) = π_1(t_1) e^{−∫_0^{t_1} ds π_1(s)} = π_1(t_1) ( ε + (1−ε) e^{−b t_1} ) = b(1−ε) e^{−b t_1}        (3.19)

P_2(t_2) = a e^{−a t_2}        (3.20)

However, P_1(t_1) is not normalised to 1 as soon as ε > 0, which follows from the fact that here condition (3.10) is violated:

lim_{t→∞} ∫_0^t ds π_1(s) = −lim_{t→∞} log( ε + (1−ε) e^{−b t} ) = log(1/ε)        (3.21)

We also see this by explicit integration:

∫_0^∞ dt P_1(t) = ∫_0^∞ dt π_1(t) e^{−∫_0^t ds π_1(s)}
                = ∫_0^∞ dt (d/dt)[ −e^{−∫_0^t ds π_1(s)} ] = 1 − e^{−∫_0^∞ ds π_1(s)}
                = 1 − e^{−log(1/ε)} = 1 − ε        (3.22)

Hence, in the independent-times explanation for the cause-specific hazard rates we have a probability ε that event 1 will never happen. If e.g. event 1 represents death and event 2 the onset of some disease, then in the original correlated time distribution death will inevitably occur, but in the independent-times explanation there is a probability ε of our individual being immortal.
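The finite limit (3.21) and the normalisation deficit (3.22) can be confirmed with a few lines of numerics (parameter values are illustrative):

```python
import numpy as np

b, eps = 2.0, 0.3                         # illustrative values
t = np.linspace(0.0, 40.0, 400_001)
dt = t[1] - t[0]

pi1 = b * (1 - eps) * np.exp(-b * t) / (eps + (1 - eps) * np.exp(-b * t))

cum = np.cumsum(pi1) * dt                 # running integral of the hazard rate
print(cum[-1], np.log(1 / eps))           # finite limit log(1/eps), as in (3.21)

P1 = pi1 * np.exp(-cum)                   # independent-times density, as in (3.7)
print(P1.sum() * dt, 1 - eps)             # total mass 1 - eps, as in (3.22)
```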

Example 2: true versus independent-times explanation for observed hazard rates

Let us not think that the above always happens. Inspect the following event time distribution for the times t_1, t_2 ≥ 0, with a parameter a > 0 and a normalisation constant Z(a), again referring to a single individual:

P(t_1,t_2) = (1/Z(a)) e^{−a(t_1+t_2) − a² t_1 t_2}        (3.23)

We will need the following function:

F(x) = ∫_x^∞ ds e^{−s}/s        (3.24)

It decreases monotonically, i.e. F′(x) < 0, from F(0) = ∞ down to F(∞) = 0. Note that F′(x) = −e^{−x}/x. Let us calculate for this example the joint survival probability S(t_1,t_2):

S(t_1,t_2) = ∫_{t_1}^∞ ds_1 ∫_{t_2}^∞ ds_2 P(s_1,s_2) = (1/Z(a)) ∫_{t_1}^∞ ds_1 ∫_{t_2}^∞ ds_2 e^{−a(s_1+s_2) − a² s_1 s_2}
           = (1/(a² Z(a))) ∫_{a t_1}^∞ ds_1 ∫_{a t_2}^∞ ds_2 e^{−s_1 − s_2 − s_1 s_2}
           = −(1/(a² Z(a))) ∫_{a t_1}^∞ ds_1 (e^{−s_1}/(1+s_1)) [ e^{−s_2(1+s_1)} ]_{a t_2}^∞
           = (1/(a² Z(a))) ∫_{a t_1}^∞ ds_1 e^{−s_1 − a t_2(1+s_1)}/(1+s_1)
           = (1/(a² Z(a))) ∫_{1+a t_1}^∞ du e^{−(u−1) − a t_2 u}/u
           = (e/(a² Z(a))) ∫_{1+a t_1}^∞ du e^{−u(1+a t_2)}/u
           = (e/(a² Z(a))) ∫_{(1+a t_1)(1+a t_2)}^∞ dx e^{−x}/x
           = (e/(a² Z(a))) F( (1+a t_1)(1+a t_2) )        (3.25)

The normalisation factor Z(a) follows from using S(0,0) = 1 (no events yet at time zero):

1 = (e/(a² Z(a))) F(1),    so    Z(a) = e F(1)/a²        (3.26)

Hence

S(t_1,t_2) = F( (1+a t_1)(1+a t_2) ) / F(1)        (3.27)

Next we calculate the cause-specific hazard rates, using F′(x) = −e^{−x}/x:

π_1(t) = −[ (∂/∂t_1) log S(t_1,t_2) ]_{t_1=t_2=t} = −[ (∂/∂t_1) log F( (1+a t_1)(1+a t_2) ) ]_{t_1=t_2=t}
       = −a(1+a t) F′( (1+a t)² ) / F( (1+a t)² ) = (a/(1+a t)) e^{−(1+a t)²} / F( (1+a t)² )        (3.28)

and since S(t_1,t_2) is a symmetric function of (t_1,t_2), we get the same for risk 2, i.e. π_2(t) = π_1(t). We can rewrite both hazard rates as


π_r(t) = −(1/2) (d/dt) log F( (1+a t)² )        (3.29)

and hence, using log F(∞) = log 0 = −∞,

∫_0^∞ dt π_r(t) = −(1/2) ∫_0^∞ dt (d/dt) log F( (1+a t)² ) = −(1/2) [ log F( (1+a t)² ) ]_0^∞
                = −(1/2) ( log F(∞) − log F(1) ) = ∞        (3.30)

We conclude that condition (3.10) is satisfied, and we can have an independent-times explanation for our cause-specific hazard rates with fully normalised event time distributions (i.e. all events happen at finite times).
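As a numerical cross-check of (3.27) (with an illustrative value of a, and with F computed by brute-force quadrature rather than any library special function), one can integrate the joint density (3.23) directly over s_1 > t_1, s_2 > t_2 and compare with the closed form:

```python
import numpy as np

a = 1.0                                    # illustrative parameter value

_grid = np.linspace(0.0, 60.0, 600_001)
_dg = _grid[1] - _grid[0]

def F(x):
    # F(x) = int_x^inf ds exp(-s)/s, by rectangle-rule quadrature on a shifted grid
    s = _grid + x
    return np.sum(np.exp(-s) / s) * _dg

# Closed-form survival function (3.27) at a chosen point
t1, t2 = 0.7, 0.3
S_formula = F((1 + a * t1) * (1 + a * t2)) / F(1.0)

# Direct 2D integration of the joint density (3.23) over s1 > t1, s2 > t2,
# with the normalisation Z(a) = e F(1)/a^2 from (3.26)
s = np.linspace(0.0, 25.0, 1251)
ds = s[1] - s[0]
S1, S2 = np.meshgrid(s + t1, s + t2, indexing="ij")
Z = np.e * F(1.0) / a**2
S_direct = np.sum(np.exp(-a * (S1 + S2) - a**2 * S1 * S2) / Z) * ds * ds

print(S_formula, S_direct)                 # the two numbers should agree
```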


4

INCORPORATING CURE AS A POSSIBLE OUTCOME

So far we have worked with normalised joint event time distributions P_i(t_0,...,t_R), in which all events will ultimately occur. We also saw that it is quite possible to define cause-specific hazard rates for which the corresponding event has a finite probability of not happening at all. The question here is how this can be incorporated into our formalism in a precise and transparent way.

4.1 The clean way to include cure

The natural solution is to assign to each risk two random variables, i.e. replace t_r → (τ_r, t_r), in which t_r is an event time and τ_r ∈ {0,1} tells us whether (τ_r = 1) or not (τ_r = 0) risk r will actually trigger an event at time t_r. The more general starting point for any individual i would then have to be the distribution

P_i(t_0,...,t_R; τ_0,...,τ_R)        (4.1)

which is now normalised according to

∑_{τ_0=0}^1 ··· ∑_{τ_R=0}^1 ∫_0^∞ ··· ∫_0^∞ dt_0 ··· dt_R  P_i(t_0,...,t_R; τ_0,...,τ_R) = 1        (4.2)

The probability that for individual i all events will actually happen, for instance, would be

P_i(τ_0 = 1,...,τ_R = 1)
  = ∑_{τ_0=0}^1 ··· ∑_{τ_R=0}^1 ∫_0^∞ ··· ∫_0^∞ dt_0 ··· dt_R  P_i(t_0,...,t_R; τ_0,...,τ_R) ∏_r δ_{τ_r,1}
  = ∫_0^∞ ··· ∫_0^∞ dt_0 ··· dt_R  P_i(t_0,...,t_R; 1,1,...,1)        (4.3)

We next define the integrated event time distribution for individual i, i.e. the probability that event 0 has not happened yet at time t_0, event 1 hasn't happened yet at time t_1, etc. The conditions for this are now somewhat more involved: for each r we demand that either τ_r = 0 or the event time s_r is later than t_r, i.e.

∏_{r=0}^R [ τ_r θ(s_r − t_r) + (1 − τ_r) ] = 1        (4.4)


Hence

S_i(t_0,...,t_R) = ∑_{τ_0=0}^1 ··· ∑_{τ_R=0}^1 ∫_0^∞ ··· ∫_0^∞ ds_0 ··· ds_R  P_i(s_0,...,s_R; τ_0,...,τ_R)
                   × ∏_{r=0}^R [ τ_r θ(s_r − t_r) + (1 − τ_r) ]        (4.5)

As before, nothing is assumed to have happened yet at time zero, so

S_i(0,...,0) = ∑_{τ_0=0}^1 ··· ∑_{τ_R=0}^1 ∫_0^∞ ··· ∫_0^∞ ds_0 ··· ds_R  P_i(s_0,...,s_R; τ_0,...,τ_R) ∏_{r=0}^R [ τ_r θ(s_r) + (1−τ_r) ]
             = ∑_{τ_0=0}^1 ··· ∑_{τ_R=0}^1 ∫_0^∞ ··· ∫_0^∞ ds_0 ··· ds_R  P_i(s_0,...,s_R; τ_0,...,τ_R) = 1        (4.6)

The survival function S_i(t), i.e. the probability that for individual i all events r = 0...R will happen later than time t, now becomes

S_i(t) = S_i(t,t,...,t)
       = ∑_{τ_0=0}^1 ··· ∑_{τ_R=0}^1 ∫_0^∞ ··· ∫_0^∞ ds_0 ··· ds_R  P_i(s_0,...,s_R; τ_0,...,τ_R) ∏_{r=0}^R [ τ_r θ(s_r − t) + (1 − τ_r) ]        (4.7)

In contrast to our earlier formulation, we now need no longer find S_i(∞) = 0. Here we get

S_i(∞) = ∑_{τ_0=0}^1 ··· ∑_{τ_R=0}^1 ∫_0^∞ ··· ∫_0^∞ ds_0 ··· ds_R  P_i(s_0,...,s_R; τ_0,...,τ_R) ∏_{r=0}^R (1 − τ_r)
       = ∑_{τ_0=0}^1 ··· ∑_{τ_R=0}^1 P_i(τ_0,...,τ_R) ∏_{r=0}^R (1 − τ_r)        (4.8)

This is the probability that all variables τ_r are zero, i.e. that none of the events occur. In practice we normally include the end-of-trial risk r = 0 for the specific purpose of assigning an event to each individual, so we would choose to have P_i(s_0,...,s_R; τ_0,...,τ_R) = 0 if τ_0 ≠ 1; this ensures that S_i(∞) = 0 (even if none of the medical events happen, at least end-of-trial will always kick in).

Cause-specific hazard rates. At this stage matters seem to get a bit more tricky, but in fact everything proceeds as before, with slightly more complicated formulae. To keep our notation compact we will henceforth use the short-hands ∑_{τ_0...τ_R} ... = ∑_{τ_0=0}^1 ··· ∑_{τ_R=0}^1 ... and ∫ds_0 ··· ds_R ... = ∫_0^∞ ··· ∫_0^∞ ds_0 ··· ds_R ....

Our expressions also compactify if we use the identity

τ_r θ(s_r − t_r) + (1 − τ_r) = 1 − τ_r θ(t_r − s_r)        (4.9)

We now define the usual cause-specific hazard rates

π^i_μ(t) = −[ (∂/∂t_μ) log S_i(t_0,...,t_R) ]_{t_r=t ∀r}        (4.10)

Working this out for our present function S_i(t_0,...,t_R) gives

π^i_μ(t) = [ (1/S_i(t_0,...,t_R)) ∑_{τ_0...τ_R} τ_μ ∫ds_0 ··· ds_R  P_i(s_0,...,s_R; τ_0,...,τ_R) δ(s_μ−t_μ) ∏_{r≠μ} [ 1 − τ_r θ(t_r − s_r) ] ]_{t_r=t ∀r}
         = (1/S_i(t)) ∑_{τ_0...τ_R} τ_μ ∫ds_0 ··· ds_R  P_i(s_0,...,s_R; τ_0,...,τ_R) δ(s_μ−t) ∏_{r≠μ} [ 1 − τ_r θ(t − s_r) ]        (4.11)

Hence π^i_μ(t)dt still gives the probability for individual i that event μ happens in the time interval [t, t+dt), given that no event has happened to i yet prior to time t. What has changed is that finding a nonzero value now requires having τ_μ = 1 (hence the new factor in the numerator), and that the conditioning on the events other than μ has become somewhat more involved.

Survival function and data likelihood. Let us find out which of the properties involving the survival function survive our generalisation to include cure as an outcome. Since definition (4.10) is still valid, and since we still have S_i(0) = 1 (no events yet at time zero), our earlier simple expression of the survival function in terms of hazard rates still holds:

(d/dt) log S_i(t) = (d/dt) log S_i(t,t,...,t) = ∑_{r=0}^R [ (∂/∂t_r) log S_i(t_0,...,t_R) ]_{t_r=t ∀r} = −∑_{r=0}^R π^i_r(t)        (4.12)


and thus we continue to have

S_i(t) = e^{−∑_{r=0}^R ∫_0^t ds π^i_r(s)}        (4.13)

To see event Δ first, in the time interval [X, X+dX), the following conditions need to be met:

  τ_Δ = 1;
  the time of event Δ is in [X, X+dX);
  no events occurred prior to X.

The combination of these conditions can be written compactly in terms of (t_0,...,t_R) and (τ_0,...,τ_R) as

τ_Δ θ(t_Δ−X) θ(X+dX−t_Δ) ∏_{r≠Δ} [ 1 − τ_r θ(X−t_r) ] = 1        (4.14)

So the probability P_i(X,Δ) per unit time of this happening, for infinitesimally small time intervals dX, becomes

P_i(X,Δ) = lim_{dX↓0} (1/dX) ∑_{τ_0...τ_R} ∫dt_0 ··· dt_R  P_i(t_0,...,t_R; τ_0,...,τ_R) τ_Δ θ(t_Δ−X) θ(X+dX−t_Δ) ∏_{r≠Δ} [ 1 − τ_r θ(X−t_r) ]
         = ∑_{τ_0...τ_R} τ_Δ ∫dt_0 ··· dt_R  P_i(t_0,...,t_R; τ_0,...,τ_R) δ(t_Δ−X) ∏_{r≠Δ} [ 1 − τ_r θ(X−t_r) ]
         = π^i_Δ(X) S_i(X) = π^i_Δ(X) e^{−∑_{r=0}^R ∫_0^X ds π^i_r(s)}        (4.15)

So this relation also continues to hold. This is nice, since it implies that at the level of survival functions and hazard rates we don't need to change anything; we now know that if there are risks with a nonzero probability of the associated event never happening, then we can describe this also at the level of event times if we want to.

4.2 The quick and dirty way to include cure

An alternative way to include cure, often (regretfully) found in textbooks and papers, is to extend the time set [0,∞) to include t_μ = ∞ in the event time distribution; events that don't happen are said to happen at t = ∞. For instance, in the case where there is just one risk we would write the symbolic expression

P̄_i(t) = ε_i P_i(t) + (1−ε_i) δ(t−∞)        (4.16)


with P_i(t) an ordinary normalised distribution, describing the event time statistics for the case where the event does happen, which is the case with probability ε_i = Prob_i(τ = 1) (in terms of our previous set-up). We would then define ∫_0^∞ dt δ(t−∞) = 1, and find

∫_0^∞ dt P̄_i(t) = ε_i + (1−ε_i) ∫_0^∞ dt δ(t−∞) = 1        (4.17)

but

lim_{X→∞} ∫_0^X dt P̄_i(t) = ε_i lim_{X→∞} ∫_0^X dt P_i(t) = ε_i        (4.18)

The survival function and the hazard rate would at finite times become

S_i(t) = ∫_t^∞ ds P̄_i(s) = 1 − ∫_0^t ds P̄_i(s) = 1 − ε_i ∫_0^t ds P_i(s)        (4.19)

π_i(t) = −(d/dt) log S_i(t) = P̄_i(t)/S_i(t) = ε_i P_i(t)/S_i(t)        (4.20)

And so we find

S_i(t) = e^{−∫_0^t ds π_i(s)},        π_i(t) e^{−∫_0^t ds π_i(s)} = ε_i P_i(t)        (4.21)

We can now express ε_i (the probability of the event happening at all) by integrating both sides of the second identity over time, since P_i(t) is normalised:

ε_i = ∫_0^∞ dt π_i(t) e^{−∫_0^t ds π_i(s)} = −∫_0^∞ dt (d/dt)[ e^{−∫_0^t ds π_i(s)} ] = 1 − e^{−∫_0^∞ ds π_i(s)}        (4.22)

This makes sense, since we know that ∫_0^∞ ds π_i(s) < ∞ is indeed the condition for finding a finite 'no event' probability. We see that we can now also write the extended time distribution P̄_i(t) as

P̄_i(t) = π_i(t) e^{−∫_0^t ds π_i(s)} + e^{−∫_0^∞ ds π_i(s)} δ(t−∞)        (4.23)

It is in principle possible to analyse the situation this way, but it is mathematically somewhat messy. For instance, we have had to give up the standard convention of calculus that ∫_0^∞ ds G(s) = lim_{z→∞} ∫_0^z ds G(s), so we would always have to indicate whether we mean one or the other. It is therefore more prone to mistakes. And as soon as we ask about the joint distribution P_i(t_0,...,t_R) it all becomes even worse ...


4.3 Examples

Let us return to an earlier example, where an independent-times explanation of cause-specific hazard rates led to a risk with a nonzero probability of not generating events. We start with the following cause-specific hazard rates, for an individual subject to two risks:

π_1(t) = b(1−ε) e^{−b t} / ( ε + (1−ε) e^{−b t} ),        π_2(t) = a        (4.24)

We now try to construct the independent-times explanation P(t_1,t_2; τ_1,τ_2) = P_1(t_1,τ_1) P_2(t_2,τ_2) for these hazard rates, so S(t_1,t_2) = S_1(t_1) S_2(t_2) with

S_1(t) = ∑_τ ∫_0^∞ ds P_1(s,τ) [ τ θ(s−t) + (1−τ) ]        (4.25)

S_2(t) = ∑_τ ∫_0^∞ ds P_2(s,τ) [ τ θ(s−t) + (1−τ) ]        (4.26)

We first calculate the two risk-specific survival probabilities:

S_1(t) = e^{−∫_0^t ds π_1(s)} = exp[ −∫_0^t ds b(1−ε) e^{−b s} / ( ε + (1−ε) e^{−b s} ) ]
       = exp[ ∫_0^t ds (d/ds) log( ε + (1−ε) e^{−b s} ) ] = exp[ log( ε + (1−ε) e^{−b t} ) ]
       = ε + (1−ε) e^{−b t}        (4.27)

S_2(t) = e^{−∫_0^t ds π_2(s)} = e^{−a t}        (4.28)

Thus our equations (4.25, 4.26), from which to calculate P_1(t,τ) and P_2(t,τ), become, after working out the summations over τ:

ε + (1−ε) e^{−b t} = ∫_0^∞ ds P_1(s,0) + ∫_0^∞ ds P_1(s,1) θ(s−t)        (4.29)

e^{−a t} = ∫_0^∞ ds P_2(s,0) + ∫_0^∞ ds P_2(s,1) θ(s−t)        (4.30)

The functions P_{1,2}(s,0) are obsolete, since if τ = 0 (i.e. if the event doesn't happen) the associated event time is not used. We use normalisation and write ∫_0^∞ ds P_{1,2}(s,0) = 1 − ∫_0^∞ ds P_{1,2}(s,1), giving

ε + (1−ε) e^{−b t} = 1 − ∫_0^∞ ds P_1(s,1) [1 − θ(s−t)] = 1 − ∫_0^∞ ds P_1(s,1) θ(t−s)
                   = 1 − ∫_0^t ds P_1(s,1)        (4.31)


e^{−a t} = 1 − ∫_0^∞ ds P_2(s,1) [1 − θ(s−t)] = 1 − ∫_0^∞ ds P_2(s,1) θ(t−s)
         = 1 − ∫_0^t ds P_2(s,1)        (4.32)

Finally we differentiate both sides of both equations. This gives

P_1(t,1) = b(1−ε) e^{−b t},        P_2(t,1) = a e^{−a t}        (4.33)

This, in turn, means that

∫_0^∞ dt P_1(t,0) = 1 − ∫_0^∞ dt P_1(t,1) = 1 − ∫_0^∞ dt b(1−ε) e^{−b t} = 1 − (1−ε) = ε        (4.34)

∫_0^∞ dt P_2(t,0) = 1 − ∫_0^∞ dt P_2(t,1) = 1 − ∫_0^∞ dt a e^{−a t} = 1 − 1 = 0        (4.35)

We conclude that

P_1(t,\tau) = \epsilon\,\delta_{\tau,0}P(t) + (1-\epsilon)\,\delta_{\tau,1}be^{-bt}    (4.36)

P_2(t,\tau) = \delta_{\tau,1}ae^{-at}    (4.37)

with some irrelevant normalised distribution P(t) (since for \tau = 0 the event to which t refers will by definition not materialise). This gives in combination:

P(t_1,t_2;\tau_1,\tau_2) = \delta_{\tau_2,1}ae^{-at_2}\big[\epsilon\,\delta_{\tau_1,0}P(t_1) + (1-\epsilon)\,\delta_{\tau_1,1}be^{-bt_1}\big]    (4.38)

This distribution, with a nonzero probability of event 1 never happening, would in terms of observed survival data be indistinguishable from the original one, where both events will always happen:

P(t_1,t_2) = ae^{-at_2}\big[\epsilon\,\delta(t_1-t_2-\tau) + (1-\epsilon)be^{-bt_1}\big]    (4.39)

Both (4.38) and (4.39) give exactly the same cause-specific hazard rates (4.24). In (4.39) a fraction \epsilon of cases has event 1 occurring at a fixed time \tau after event 2, so event 1 would never be reported in a trial; the independent-event-times explanation (4.38) accounts for these cases by saying that in the fraction \epsilon of cases event 1 simply does not happen (irrespective of risk 2).
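As a quick numerical check (a sketch, not part of the formal development; the parameter values \epsilon = 0.3 and b = 1.2 are arbitrary choices), one can sample event-1 times from the marginal of (4.38) and compare the empirical survival fraction for risk 1 with the closed form (4.27):

```python
import math
import random

random.seed(1)
eps, b, N, t = 0.3, 1.2, 200_000, 0.8

# Draw event-1 outcomes from the marginal of (4.38): with probability eps the
# event never happens (tau_1 = 0); otherwise its time is exponential with rate b.
survivors = 0
for _ in range(N):
    if random.random() < eps:
        survivors += 1                      # event 1 never occurs
    elif random.expovariate(b) > t:
        survivors += 1                      # event 1 occurs, but after time t

empirical = survivors / N
exact = eps + (1 - eps) * math.exp(-b * t)  # S_1(t) from (4.27)
print(f"empirical S_1({t}) = {empirical:.4f}, exact = {exact:.4f}")
```

The agreement illustrates the indistinguishability claimed above: observed survival data for risk 1 cannot separate "event never happens" from "event happens at a fixed lag after event 2".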


5

INDIVIDUAL VERSUS COHORT LEVEL SURVIVAL STATISTICS

Much confusion in survival analysis is caused by a failure to be specific and precise in distinguishing between event statistics for a given individual (which we have discussed so far) and event statistics calculated over cohorts (where there is the additional variability of having multiple distinct individuals, who need not be clones). Either can be described by cause-specific hazard rates, but these rates will generally differ.

5.1 Population level survival functions

The quantities defined so far describe statistical features at the level of individuals. If we want to characterize the cohort as a whole (or if perhaps we have no information at the level of individuals) we would work instead with the cohort averages of these functions, i.e.

S(t) = \frac{1}{N}\sum_{i=1}^N S_i(t), \qquad P(t_0,\ldots,t_R) = \frac{1}{N}\sum_{i=1}^N P_i(t_0,\ldots,t_R)    (5.1)

S(t) gives the probability that a randomly picked individual in the cohort will not have experienced any event prior to time t, and P(t_0,\ldots,t_R) gives the probability density for a randomly picked individual to have joint event times (t_0,\ldots,t_R). If we inspect the derivation of S_i(t) from P_i(t_0,\ldots,t_R) we note that we can simply insert \frac{1}{N}\sum_{i=1}^N everywhere and get also

S(t) = S(t,t,\ldots,t)    (5.2)

S(t_0,\ldots,t_R) = \int_{t_0}^\infty\!\! ds_0 \cdots \int_{t_R}^\infty\!\! ds_R\; P(s_0,\ldots,s_R)    (5.3)

In fact, we could have started developing our previous theory fully at population level. This is what most textbooks do. It would effectively have meant dropping the indices i from all identities in the previous sections. We would have defined population-level cause-specific hazard rates \pi_r(t), such that the above population survival function would be written as

S(t) = e^{-\sum_{r=0}^R \int_0^t ds\, \pi_r(s)}    (5.4)

However, one would not have \pi_r(t) = N^{-1}\sum_i \pi_r^i(t), since \log \frac{1}{N}\sum_i(\ldots) \neq \frac{1}{N}\sum_i \log(\ldots). We must therefore always clarify whether we talk about individual or population functions. Failure to make this distinction leads to confusion and mistakes. With the drive towards personalised medicine the differences between cohort and individual survival statistics will become even more important.

Event time uncertainty versus hazard rate uncertainty. The description of survival statistics at the cohort level, via S(t), involves two sources of uncertainty which cannot easily be disentangled: the uncertainty of event times, as described by the individual functions S_i(t), and the uncertainty of which individual we pick from the cohort, represented by the averaging N^{-1}\sum_i. To illustrate this, imagine we have just one risk, and we observe at population level what seems to be a simple exponentially decaying survival function

S(t) = e^{-\pi t}    (5.5)

This can arise in many ways. For instance, all individuals could be identical, and the uncertainty fully due to event time uncertainty at the individual level: the choice S_i(t) = e^{-\pi t} for all i would trivially give the above S(t). The opposite extreme would be the case where the individuals have no event time uncertainty at all, i.e. S_i(t) = \theta[t^\star_i - t], so each i dies fully deterministically at some time t^\star_i, but the pre-ordained times t^\star_i vary from one individual to another. This would mean

\pi_i(t) = -\frac{d}{dt}\log S_i(t) = -\frac{d}{dt}\log\theta[t^\star_i - t]    (5.6)

Here we would find, with W(t^\star) = N^{-1}\sum_i \delta[t^\star - t^\star_i] (the distribution of death times over the population):

S(t) = \frac{1}{N}\sum_i \theta[t^\star_i - t] = \int_0^\infty dt^\star\; W(t^\star)\theta(t^\star - t) = \int_t^\infty dt^\star\; W(t^\star)    (5.7)

It is easy to see that also here we can recover the above exponential form for S(t), if the predestined times of death are distributed over the population exponentially, according to W(t^\star) = \pi e^{-\pi t^\star}:

S(t) = \int_t^\infty dt^\star\; \pi e^{-\pi t^\star} = e^{-\pi t}    (5.8)

So in both cases we find the population survival function (5.5), but for very different reasons. In real patient data one would typically expect to have a combination of both types of uncertainty.
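A brief simulation makes this indistinguishability concrete (a sketch; the rate \pi = 1 and the sample size are arbitrary choices). In the first mechanism every individual has the same exponential event-time distribution; in the second each individual dies deterministically at a pre-ordained time t^\star_i drawn from W(t^\star) = \pi e^{-\pi t^\star}. The observed cohort survival fractions agree:

```python
import math
import random

random.seed(0)
pi_rate, N, t = 1.0, 100_000, 0.7

# Mechanism 1: identical individuals, each with a stochastic event time of
# hazard rate pi, so S_i(t) = exp(-pi*t) for every i.
frac1 = sum(random.expovariate(pi_rate) > t for _ in range(N)) / N

# Mechanism 2: deterministic individuals, S_i(t) = theta[t*_i - t], with
# pre-ordained death times t*_i drawn from W(t*) = pi*exp(-pi*t*).
death_times = [random.expovariate(pi_rate) for _ in range(N)]
frac2 = sum(ts > t for ts in death_times) / N

exact = math.exp(-pi_rate * t)
print(f"mechanism 1: {frac1:.3f}, mechanism 2: {frac2:.3f}, exact: {exact:.3f}")
```

Both empirical fractions approximate e^{-\pi t}: survival data alone cannot tell the two mechanisms apart.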

5.2 Population hazard rates and data likelihood

Relation between population hazard rates and individual hazard rates. It will be instructive to express the population-level cause-specific hazard rates \pi_r(t) in (5.10) in terms of the individual cause-specific hazard rates \pi_r^i(t), by application of (2.7) to population-level functions:

\pi_\mu(t)S(t) = \int_t^\infty\!\!\cdots\int_t^\infty \Big(\prod_{r\neq\mu} ds_r\Big)\, P(s_0,\ldots,s_{\mu-1},t,s_{\mu+1},\ldots,s_R)
              = \frac{1}{N}\sum_i \int_t^\infty\!\!\cdots\int_t^\infty \Big(\prod_{r\neq\mu} ds_r\Big)\, P_i(s_0,\ldots,s_{\mu-1},t,s_{\mu+1},\ldots,s_R)
              = \frac{1}{N}\sum_i \pi_\mu^i(t)S_i(t)    (5.9)

So we find

\pi_\mu(t) = \frac{\sum_i \pi_\mu^i(t)S_i(t)}{\sum_i S_i(t)} = \frac{\sum_i \pi_\mu^i(t)\, e^{-\sum_r \int_0^t ds\, \pi_r^i(s)}}{\sum_i e^{-\sum_r \int_0^t ds\, \pi_r^i(s)}}    (5.10)

It is clear that \pi_r(t) \neq N^{-1}\sum_i \pi_r^i(t) as soon as the cohort is not strictly homogeneous, i.e. as soon as the hazard rates of different individuals i are not all identical.

In fact, we see that heterogeneity will give us a time-dependent population-level hazard rate even if all individuals in the population have time-independent hazard rates. Suppose \pi_\mu^i(t) = \pi_\mu^i for all (i,\mu) and all t. We would then obtain

\pi_\mu(t) = \frac{\sum_i \pi_\mu^i(t)S_i(t)}{\sum_i S_i(t)} = \frac{\sum_i \pi_\mu^i\, e^{-t\sum_r \pi_r^i}}{\sum_i e^{-t\sum_r \pi_r^i}}    (5.11)

For large times, the individuals with the lowest hazard rates contribute most to the average in (5.11):

\pi_\mu(0) = \frac{1}{N}\sum_i \pi_\mu^i, \qquad \lim_{t\to\infty}\pi_\mu(t) = \pi_\mu^{i^\star}, \qquad i^\star = \mathrm{argmin}_i\Big(\sum_r \pi_r^i\Big)    (5.12)

In Cox regression (see a later section) one indeed often observes population hazard ratios that appear to decay over time; it is now clear that this need not be interpreted as a time dependence at the level of individuals, as it could be due simply to cohort heterogeneity.

Data likelihood. In the same way we find that if we work at population level, with population hazard rates and population survival functions, we can no longer use (2.15) to quantify the likelihood of finding an individual reporting a first event of type \Delta at time X. Instead we would now use

P(X,\Delta) = \pi_\Delta(X)\, e^{-\sum_{r=0}^R \int_0^X ds\, \pi_r(s)}    (5.13)

It now follows from (5.9), upon inserting our formulae for S(t) and S_i(t), that in fact P(X,\Delta) = N^{-1}\sum_i P_i(X,\Delta). The final picture is therefore as in the diagram below:

INDIVIDUAL LEVEL:
  individual event time statistics:  P_1(t_0,\ldots,t_R),\ \ldots,\ P_N(t_0,\ldots,t_R)
  individual hazard rates:  \pi_0^1(X),\ldots,\pi_R^1(X);\ \ldots;\ \pi_0^N(X),\ldots,\pi_R^N(X)
  individual data likelihoods:  P_1(X,0),\ldots,P_1(X,R);\ \ldots;\ P_N(X,0),\ldots,P_N(X,R)
  observed survival data:  (X_1,\Delta_1),\ \ldots,\ (X_N,\Delta_N)

POPULATION LEVEL:
  P(t_0,\ldots,t_R) = \frac{1}{N}\sum_i P_i(t_0,\ldots,t_R)
  \pi_\Delta(X) = \frac{P(X,\Delta)}{\sum_r \int_X^\infty dt\, P(t,r)}
  P(X,\Delta) = \frac{1}{N}\sum_i P_i(X,\Delta) = \pi_\Delta(X)\, e^{-\sum_r \int_0^X ds\, \pi_r(s)}

5.3 Examples

Example 1: impact of heterogeneity on population hazard rates

Imagine a population of two distinct groups of individuals, A and B: \{1,\ldots,N\} = A \cup B. Let there be N_A = fN patients in group A and N_B = (1-f)N patients in group B. They are all subject to just one risk, and have time-independent individual hazard rates: \pi_i(t) = 1 if i \in A and \pi_i(t) = 3 if i \in B. At population level this would give the time-dependent hazard rate

\pi(t) = \frac{\sum_i \pi_i e^{-t\pi_i}}{\sum_i e^{-t\pi_i}} = \frac{\sum_{i\in A} e^{-t} + \sum_{i\in B} 3e^{-3t}}{\sum_{i\in A} e^{-t} + \sum_{i\in B} e^{-3t}} = \frac{f + 3(1-f)e^{-2t}}{f + (1-f)e^{-2t}}    (5.14)

The result is shown in figure 5.1 for f \in \{0, \frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1\}.
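Formula (5.14) is easily evaluated numerically (a sketch; the grid of t values is an arbitrary choice). At t = 0 one finds \pi(0) = 3-2f, and for any 0 < f \leq 1 the rate decays towards 1, the rate of the longer-lived group, exactly as figure 5.1 shows:

```python
import math

def pop_hazard(t, f):
    """Population-level hazard rate (5.14): a fraction f of the cohort has
    individual rate 1, a fraction 1-f has individual rate 3."""
    return (f + 3*(1 - f)*math.exp(-2*t)) / (f + (1 - f)*math.exp(-2*t))

for f in (0.0, 0.25, 0.5, 0.75, 1.0):
    values = [round(pop_hazard(t, f), 3) for t in (0.0, 1.0, 3.0)]
    print(f"f={f}: pi(t) at t=0,1,3 ->", values)
```

For the homogeneous cohorts f = 0 and f = 1 the population hazard rate stays constant (3 and 1 respectively); any intermediate f produces a decaying curve.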


[Figure 5.1: curves of \pi(t) versus t for f = 0, 0.25, 0.5, 0.75, 1]

Fig. 5.1. The population-level hazard rate \pi(t) as given by (5.14), for different values of the relative sizes of the sub-classes in the cohort. The hazard rate decays over time as soon as there is heterogeneity in the cohort (i.e. for 0 < f < 1), in spite of all individuals of the cohort having strictly time-independent hazard rates.

Example 2: correlated population risks without correlated individual risks

Imagine again a population of two distinct groups of individuals, A and B: \{1,\ldots,N\} = A \cup B. Let there be N_A = fN patients in group A and N_B = (1-f)N patients in group B. They are all subject to two risks, r = 1 and r = 2. Assume that all have independent event times and hence factorising survival functions as in (3.2), and constant hazard rates:

i \in A: \quad P_i(t_1,t_2) = \big(\pi_1^A e^{-t_1\pi_1^A}\big)\big(\pi_2^A e^{-t_2\pi_2^A}\big)    (5.15)

i \in B: \quad P_i(t_1,t_2) = \big(\pi_1^B e^{-t_1\pi_1^B}\big)\big(\pi_2^B e^{-t_2\pi_2^B}\big)    (5.16)

So

i \in A: \quad S_i(t_1,t_2) = S_{A1}(t_1)S_{A2}(t_2), \qquad S_{A1}(t) = e^{-t\pi_1^A}, \quad S_{A2}(t) = e^{-t\pi_2^A}    (5.17)

i \in B: \quad S_i(t_1,t_2) = S_{B1}(t_1)S_{B2}(t_2), \qquad S_{B1}(t) = e^{-t\pi_1^B}, \quad S_{B2}(t) = e^{-t\pi_2^B}    (5.18)

Within each group the two risks are clearly independent. At population level we find the overall survival function

S(t) = \frac{1}{N}\sum_{i=1}^N S_i(t) = \frac{1}{N}\sum_{i\in A} S_{A1}(t)S_{A2}(t) + \frac{1}{N}\sum_{i\in B} S_{B1}(t)S_{B2}(t)
     = \frac{N_A}{N}S_{A1}(t)S_{A2}(t) + \frac{N_B}{N}S_{B1}(t)S_{B2}(t)
     = f e^{-t(\pi_1^A+\pi_2^A)} + (1-f)e^{-t(\pi_1^B+\pi_2^B)}    (5.19)

One would naively expect the population survival functions for the individual risks to be the cohort averages over the corresponding individual survival functions:

S_r(t) = \frac{1}{N}\sum_{i=1}^N S_r^i(t) = \frac{N_A}{N}S_{Ar}(t) + \frac{N_B}{N}S_{Br}(t) = f e^{-t\pi_r^A} + (1-f)e^{-t\pi_r^B}    (5.20)

This gives for the product S_1(t)S_2(t):

S_1(t)S_2(t) = \big(f e^{-t\pi_1^A} + (1-f)e^{-t\pi_1^B}\big)\big(f e^{-t\pi_2^A} + (1-f)e^{-t\pi_2^B}\big)
             = f^2 e^{-t(\pi_1^A+\pi_2^A)} + (1-f)^2 e^{-t(\pi_1^B+\pi_2^B)} + f(1-f)\big[e^{-t(\pi_1^A+\pi_2^B)} + e^{-t(\pi_2^A+\pi_1^B)}\big]

If we had population-level independence of risks (as is true at the level of the individual groups) we would have expected to find S(t) = S_1(t)S_2(t). Instead here we get

S(t) - S_1(t)S_2(t) = f(1-f)\big[e^{-t(\pi_1^A+\pi_2^A)} + e^{-t(\pi_1^B+\pi_2^B)} - e^{-t(\pi_1^A+\pi_2^B)} - e^{-t(\pi_2^A+\pi_1^B)}\big]
                    = f(1-f)\big[e^{-t\pi_1^A} - e^{-t\pi_1^B}\big]\big[e^{-t\pi_2^A} - e^{-t\pi_2^B}\big]    (5.21)

We see that generally our two risks will be correlated at population level, i.e. S(t) \neq \prod_r S_r(t), except for f \in \{0,1\} or when either \pi_1^A = \pi_1^B or \pi_2^A = \pi_2^B.

These are precisely the cases where the correlation C_{12} over the population of the hazard rates of the two risks vanishes:

C_{12} = \langle\pi_1\pi_2\rangle - \langle\pi_1\rangle\langle\pi_2\rangle
       = \frac{1}{N}\sum_i \pi_1^i\pi_2^i - \Big(\frac{1}{N}\sum_i \pi_1^i\Big)\Big(\frac{1}{N}\sum_i \pi_2^i\Big)
       = f\pi_1^A\pi_2^A + (1-f)\pi_1^B\pi_2^B - \big(f\pi_1^A + (1-f)\pi_1^B\big)\big(f\pi_2^A + (1-f)\pi_2^B\big)
       = f(1-f)\big[\pi_1^A\pi_2^A + \pi_1^B\pi_2^B - \pi_1^A\pi_2^B - \pi_1^B\pi_2^A\big]
       = f(1-f)\big[\pi_1^A - \pi_1^B\big]\big[\pi_2^A - \pi_2^B\big]    (5.22)

This illustrates how risk correlations at population level can emerge in a natural way from correlations between the cause-specific hazard rates of the individuals in a heterogeneous cohort, in spite of each individual having independent event times.
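The two factorisations (5.21) and (5.22) can be checked side by side (a sketch; the group sizes and the four hazard rates below are arbitrary choices). Both the survival-function gap and the hazard-rate correlation carry the same factor f(1-f)(\pi_1^A-\pi_1^B)(\ldots), so they vanish, and change sign, together:

```python
import math

# Hypothetical parameter choices for the two-group example; any values work.
f = 0.3
piA1, piA2 = 1.0, 2.0   # group A hazard rates for risks 1 and 2
piB1, piB2 = 3.0, 0.5   # group B hazard rates for risks 1 and 2

def S(t):
    """Overall population survival function (5.19)."""
    return f*math.exp(-t*(piA1 + piA2)) + (1 - f)*math.exp(-t*(piB1 + piB2))

def Sr(t, pa, pb):
    """Risk-specific population survival function (5.20)."""
    return f*math.exp(-t*pa) + (1 - f)*math.exp(-t*pb)

def gap(t):
    """S(t) - S_1(t)S_2(t), computed directly."""
    return S(t) - Sr(t, piA1, piB1)*Sr(t, piA2, piB2)

def gap_factorised(t):
    """The same quantity via the factorised form (5.21)."""
    return f*(1 - f)*(math.exp(-t*piA1) - math.exp(-t*piB1)) \
                    *(math.exp(-t*piA2) - math.exp(-t*piB2))

C12 = f*(1 - f)*(piA1 - piB1)*(piA2 - piB2)   # hazard rate correlation (5.22)
print(round(gap(0.8), 6), round(gap_factorised(0.8), 6), "C12 =", C12)
```

With these parameters risk 1 is higher in group B while risk 2 is higher in group A, so both the gap and C_{12} come out negative.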


6

SURVIVAL PREDICTION

To predict survival we need to know (or estimate) the cause-specific hazard rates. The natural object to use would be the survival function S_i(t), which gives the probability of individual i not experiencing any of the risk events prior to time t. However, sometimes we cannot use this (for instance if we only have information on the cause-specific hazard rates of the cohort as a whole, rather than those of the individual i), or we may wish to calculate different probabilities.

6.1 Cause-specific survival functions

Non-hypothetical survival probabilities. Instead of predicting overall survival, via S_i(t) or S(t), we will often be interested in other predictions. We have already seen P_i(X,\mu), the probability density for seeing event \mu reported first, in a time interval located at time X (see (2.15)):

P_i(X,\mu) = \pi_\mu^i(X)\, e^{-\sum_{r=0}^R \int_0^X ds\, \pi_r^i(s)}    (6.1)

From this follows e.g. the so-called cumulative incidence function F_{i\mu}(t), which is the probability that individual i 'fails' from cause \mu at any time prior to t:

F_{i\mu}(t) = \int_0^t dX\; P_i(X,\mu) = \int_0^t dX\; \pi_\mu^i(X)\, e^{-\sum_{r=0}^R \int_0^X ds\, \pi_r^i(s)}    (6.2)

or the population average F_\mu(t) of this function, which gives the probability that a randomly drawn individual from our cohort 'fails' from cause \mu at any time prior to t:

F_\mu(t) = \frac{1}{N}\sum_i \int_0^t dX\; \pi_\mu^i(X)\, e^{-\sum_{r=0}^R \int_0^X ds\, \pi_r^i(s)}    (6.3)

Equivalently, in terms of the global cause-specific hazard rates:

F_\mu(t) = \int_0^t dX\; \pi_\mu(X)\, e^{-\sum_{r=0}^R \int_0^X ds\, \pi_r(s)}    (6.4)

Again it is important to specify which of the above cumulative incidence functions one is referring to (unless the cohort consists of clones, in which case the difference between the two vanishes). An equivalent quantity is the cause-specific survival probability G_{i\mu}(t), defined as the likelihood that at time t individual i has not yet failed from cause \mu, either because he/she experienced another event prior to t, or because nothing has yet happened at time t:

G_{i\mu}(t) = 1 - F_{i\mu}(t) = 1 - \int_0^t dX\; \pi_\mu^i(X)\, e^{-\sum_{r=0}^R \int_0^X ds\, \pi_r^i(s)}    (6.5)

The likelihood that individual i will never report event µ would then be

G_{i\mu}(\infty) = 1 - \int_0^\infty dX\; \pi_\mu^i(X)\, e^{-\sum_{r=0}^R \int_0^X ds\, \pi_r^i(s)}    (6.6)

We see that F_{i\mu}(t), F_\mu(t), and G_{i\mu}(t) all depend on all cause-specific hazard rates, not just that of risk \mu, since the other risks influence how likely it is for risk \mu to trigger an event first. Note that even in the case of statistically independent event times, where S_i(t) = \prod_r S_{ir}(t), the function G_{i\mu}(t) is not the same as S_{i\mu}(t): both describe how likely it is for event \mu not to have taken place yet at time t, but G_{i\mu}(t) takes into account the likelihood that we haven't seen event \mu because other events happened earlier, whereas S_{i\mu}(t) does not.
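For constant cause-specific hazard rates \pi_r^i(t) = \lambda_r the integral (6.2) has the closed form F_{i\mu}(t) = (\lambda_\mu/\Lambda)(1-e^{-\Lambda t}) with \Lambda = \sum_r \lambda_r, so that G_{i\mu}(\infty) = 1-\lambda_\mu/\Lambda > 0: each risk has a finite chance of never being reported first. A minimal sketch (the three rate values are arbitrary choices) comparing this closed form with direct trapezoidal quadrature of (6.2):

```python
import math

# Hypothetical constant cause-specific hazard rates for one individual.
rates = [0.5, 1.0, 1.5]          # lambda_r for risks r = 0, 1, 2
Lam = sum(rates)

def F(mu, t, n=20_000):
    """Cumulative incidence (6.2) by trapezoidal quadrature of
    lambda_mu * exp(-Lam*X) over [0, t]."""
    h = t / n
    g = lambda X: rates[mu] * math.exp(-Lam * X)
    return h * (0.5*g(0.0) + sum(g(k*h) for k in range(1, n)) + 0.5*g(t))

def F_exact(mu, t):
    """Closed form of (6.2) for constant rates."""
    return (rates[mu] / Lam) * (1.0 - math.exp(-Lam * t))

t = 1.2
for mu in range(3):
    print(mu, round(F(mu, t), 6), round(F_exact(mu, t), 6),
          "G_mu(inf) =", round(1.0 - rates[mu]/Lam, 4))
```

Note also that \sum_\mu F_{i\mu}(\infty) = 1: eventually exactly one of the competing risks triggers the first event.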

The effects of disabling risks on cause-specific hazard rates. We might also be interested in hypothetical quantities, such as what the survival probabilities would be if one or more of the risks could be eliminated. Often we wish to study one specific 'primary' risk, and would want to eliminate the obscuring effects of the others. If we denote the 'active' set of risks as A \subseteq \{0,1,2,\ldots,R\}, then we have to disable all risks r \notin A. We have already noted earlier that this does not simply mean setting \pi_r^i(t) or \pi_r(t) to zero for all r \notin A, due to the conditioning in the definition of the cause-specific hazard rates. If we start from the general distribution P_i(t_0,\ldots,t_R;\tau_0,\ldots,\tau_R) and disable all risks other than those in the set A, we effectively change this distribution into

P'_i(t_0,\ldots,t_R;\tau_0,\ldots,\tau_R) = \frac{P_i(t_0,\ldots,t_R;\tau_0,\ldots,\tau_R)\prod_{r\notin A}\delta_{\tau_r,0}}{\sum_{\tau'_0\ldots\tau'_R}\int ds'_0\ldots ds'_R\; P_i(s'_0,\ldots,s'_R;\tau'_0,\ldots,\tau'_R)\prod_{r\notin A}\delta_{\tau'_r,0}}    (6.7)

and the new cause-specific hazard rates become

\pi_\mu^{i\prime}(t) = \frac{\sum_{\tau_0\ldots\tau_R}\tau_\mu \int ds_0\ldots ds_R\; P'_i(s_0,\ldots,s_R;\tau_0,\ldots,\tau_R)\,\delta(s_\mu-t)\prod_{r\neq\mu}\big[1-\tau_r\theta(t-s_r)\big]}{\sum_{\tau_0\ldots\tau_R}\int ds_0\ldots ds_R\; P'_i(s_0,\ldots,s_R;\tau_0,\ldots,\tau_R)\prod_r\big[1-\tau_r\theta(t-s_r)\big]}

                     = \frac{\sum_{\tau_0\ldots\tau_R}\tau_\mu \int ds_0\ldots ds_R\; P_i(s_0,\ldots,s_R;\tau_0,\ldots,\tau_R)\,\delta(s_\mu-t)\prod_{r\notin A}\delta_{\tau_r,0}\prod_{r\neq\mu}\big[1-\tau_r\theta(t-s_r)\big]}{\sum_{\tau_0\ldots\tau_R}\int ds_0\ldots ds_R\; P_i(s_0,\ldots,s_R;\tau_0,\ldots,\tau_R)\prod_{r\notin A}\delta_{\tau_r,0}\prod_r\big[1-\tau_r\theta(t-s_r)\big]}    (6.8)

If we started from events that always happen, i.e. a distribution of the form P_i(t_0,\ldots,t_R), then we would have upon disabling the risks r \notin A:

P'_i(t_0,\ldots,t_R;\tau_0,\ldots,\tau_R) = P_i(t_0,\ldots,t_R)\Big(\prod_{r\in A}\delta_{\tau_r,1}\Big)\Big(\prod_{r\notin A}\delta_{\tau_r,0}\Big)    (6.9)

and find

\pi_\mu^{i\prime}(t) = \frac{\sum_{\tau_0\ldots\tau_R}\tau_\mu \int ds_0\ldots ds_R\; P'_i(s_0,\ldots,s_R;\tau_0,\ldots,\tau_R)\,\delta(s_\mu-t)\prod_{r\neq\mu}\big[1-\tau_r\theta(t-s_r)\big]}{\sum_{\tau_0\ldots\tau_R}\int ds_0\ldots ds_R\; P'_i(s_0,\ldots,s_R;\tau_0,\ldots,\tau_R)\prod_r\big[1-\tau_r\theta(t-s_r)\big]}

                     = \frac{\int ds_0\ldots ds_R\; P_i(s_0,\ldots,s_R)\,\delta(s_\mu-t)\sum_{\tau_0\ldots\tau_R}\tau_\mu\prod_{r\in A}\delta_{\tau_r,1}\prod_{r\notin A}\delta_{\tau_r,0}\prod_{r\neq\mu}\big[1-\tau_r\theta(t-s_r)\big]}{\int ds_0\ldots ds_R\; P_i(s_0,\ldots,s_R)\sum_{\tau_0\ldots\tau_R}\prod_{r\in A}\delta_{\tau_r,1}\prod_{r\notin A}\delta_{\tau_r,0}\prod_{r\in A}\big[1-\theta(t-s_r)\big]}    (6.10)

As expected, if \mu \notin A (so risk \mu is disabled) we always get \pi_\mu^{i\prime}(t) = 0 for all t. If \mu \in A (so risk \mu is not disabled), then it is clear from the above that all cause-specific hazard rates of risks in the active set A are affected by our disabling of risks, and that we need to know the distribution P_i(s_0,\ldots,s_R;\tau_0,\ldots,\tau_R) to calculate the new rates. In the case of (6.10) and \mu \in A we can simplify our formula for \pi_\mu^{i\prime}(t) further to

\pi_\mu^{i\prime}(t) = \frac{\int ds_0\ldots ds_R\; P_i(s_0,\ldots,s_R)\,\delta(s_\mu-t)\prod_{r\in A\setminus\{\mu\}}\big[1-\theta(t-s_r)\big]}{\int ds_0\ldots ds_R\; P_i(s_0,\ldots,s_R)\prod_{r\in A}\big[1-\theta(t-s_r)\big]}    (6.11)

which in effect involves only the marginal of P_i(s_0,\ldots,s_R), obtained by integrating out all times t_r with r \notin A. We now appreciate why the question of whether P_i(s_0,\ldots,s_R;\tau_0,\ldots,\tau_R) can be calculated was a relevant one. In particular, in view of the Tsiatis identifiability problem we cannot expect to write the new hazard rates in terms of the old ones, since we cannot generally express P_i(s_0,\ldots,s_R;\tau_0,\ldots,\tau_R) in terms of the old hazard rates.

Hypothetical survival probabilities for uncorrelated event times. Only if the event times are known to be uncorrelated can we proceed to calculate the new hazard rates. In this case we may write P_i(s_0,\ldots,s_R;\tau_0,\ldots,\tau_R) = \prod_{r=0}^R P_{ir}(s_r,\tau_r), and simplify the above formula for the new cause-specific hazard rates for \mu \in A to

\pi_\mu^{i\prime}(t) = \frac{\Big[\sum_{\tau_\mu}\tau_\mu\int ds_\mu\, P_{i\mu}(s_\mu,\tau_\mu)\delta(s_\mu-t)\Big]\prod_{r\in A\setminus\{\mu\}}\Big[\sum_{\tau_r}\int ds_r\, P_{ir}(s_r,\tau_r)\big[1-\tau_r\theta(t-s_r)\big]\Big]}{\prod_{r\in A}\Big[\sum_{\tau_r}\int ds_r\, P_{ir}(s_r,\tau_r)\big[1-\tau_r\theta(t-s_r)\big]\Big]}

                     = \frac{\sum_{\tau_\mu}\tau_\mu\int ds_\mu\, P_{i\mu}(s_\mu,\tau_\mu)\delta(s_\mu-t)}{\sum_{\tau_\mu}\int ds_\mu\, P_{i\mu}(s_\mu,\tau_\mu)\big[1-\tau_\mu\theta(t-s_\mu)\big]} = \pi_\mu^i(t)    (6.12)

The reason that the last formula is simply the old rate \pi_\mu^i(t) is that it no longer involves the active risk set A, so it must also be true for the choice A = \{0,1,\ldots,R\} (i.e. none of the risks is disabled), for which we must recover the old hazard rate. In conclusion: if we disable risks, all cause-specific hazard rates will generally be affected in a complicated way, and to calculate their new values we need to know the joint event time distribution. Only if the event times are uncorrelated does disabling risks simply mean setting all cause-specific hazard rates of the disabled risks to zero.

For instance, if all risks except risk \mu were eliminated (i.e. A = \{\mu\}) and the event times are uncorrelated, then one would have \pi_r^{i\prime}(t) = 0 for all r \neq \mu, and find the following hypothetical cause-specific survival probability S'_{i\mu}(t), describing a world where only risk \mu is active:

S'_{i\mu}(t) = 1 - \int_0^t dX\; \pi_\mu^i(X)\, e^{-\int_0^X ds\, \pi_\mu^i(s)}
             = 1 + \int_0^t dX\; \frac{d}{dX}\, e^{-\int_0^X ds\, \pi_\mu^i(s)}
             = 1 + \Big[e^{-\int_0^X ds\, \pi_\mu^i(s)}\Big]_0^t
             = e^{-\int_0^t ds\, \pi_\mu^i(s)}    (6.13)

which is identical to S_{i\mu}(t) (due to the assumed event time independence), i.e. to the risk-specific survival probability in S_i(t) = \prod_r S_{ir}(t).

6.2 Estimation of cause-specific hazard rates

Strategies for extracting hazard rate information from the data. The previous subsection deals with how we can make predictions once we know the cause-specific hazard rates. In reality we must find these rates first. Suppose we have survival data D = \{(X_1,\Delta_1),\ldots,(X_N,\Delta_N)\}, referring to N individuals who can be regarded as independently drawn random samples from a population characterised by an as yet unknown set of population-level cause-specific hazard rates \{\pi_0,\ldots,\pi_R\}. There are two (connected) approaches to the question of how to get the rates from D.³

The first (traditional) approach is to construct formulas for so-called estimators, which are expressions \hat\pi_0,\ldots,\hat\pi_R written in terms of the data D, for which we can prove that in the limit N \to \infty they converge to the true values \pi_0,\ldots,\pi_R. Once these estimators are chosen and their properties verified, one then uses in prediction the estimators instead of the real (unknown) cause-specific hazard rates. There are two downsides to this approach. The first is the difficulty of constructing good (i.e. unbiased and fast-converging) candidates for these formulas, which is simple for trivial quantities but is not trivial for the present case of hazard rates. The second is that, especially when N is not very large, we know that estimators are not exact, and we have no way of accounting for this imprecision in our predictions. We would just hope for the best. The second approach is to use Bayesian arguments, which not only take into account our residual uncertainty regarding the true hazard rates after having observed the data, but also lead us to systematic formulae for estimators. In ?? we work out and compare the different routes available for estimating model parameters from data for a simple example.

³The procedures described here apply more generally to the extraction of parameters from data, not just to the extraction of cause-specific hazard rates \{\pi_0,\ldots,\pi_R\} from our survival data D.

The maximum likelihood estimator. Let us try to determine the most probable values for our hazard rates, given our observation of the data D, in the Bayesian way. This means finding the maximum over \{\pi_0,\ldots,\pi_R\} of the distribution \mathcal{P}(\pi_0,\ldots,\pi_R|D).⁴ The standard Bayesian identity p(a|b)p(b) = p(b|a)p(a) allows us to express this distribution in terms of its counterpart, the data likelihood \mathcal{P}(D|\pi_0,\ldots,\pi_R) given the hazard rates:

\mathcal{P}(\pi_0,\ldots,\pi_R|D) = \frac{\mathcal{P}(D|\pi_0,\ldots,\pi_R)\,\mathcal{P}(\pi_0,\ldots,\pi_R)}{\mathcal{P}(D)}
  = \frac{\mathcal{P}(D|\pi_0,\ldots,\pi_R)\,\mathcal{P}(\pi_0,\ldots,\pi_R)}{\int d\pi'_0\ldots d\pi'_R\; \mathcal{P}(\pi'_0,\ldots,\pi'_R,D)}
  = \frac{\mathcal{P}(D|\pi_0,\ldots,\pi_R)\,\mathcal{P}(\pi_0,\ldots,\pi_R)}{\int d\pi'_0\ldots d\pi'_R\; \mathcal{P}(D|\pi'_0,\ldots,\pi'_R)\,\mathcal{P}(\pi'_0,\ldots,\pi'_R)}    (6.14)

Within the Bayesian framework one would choose for the prior the maximum-entropy distribution, subject to applicable constraints (such as \pi_r(t) \geq 0 for all t). If, on the other hand, we choose a so-called flat prior, i.e. we take \mathcal{P}(\pi_0,\ldots,\pi_R) to be independent of \{\pi_0,\ldots,\pi_R\} (so we have no prior preference either way, beyond the constraints), then the most probable set \{\pi_0,\ldots,\pi_R\} is the one that maximises \mathcal{P}(D|\pi_0,\ldots,\pi_R). This is called a maximum-likelihood estimator. Equivalently, we can maximize the logarithm of the data likelihood, viz. L(D|\pi_0,\ldots,\pi_R) = \log\mathcal{P}(D|\pi_0,\ldots,\pi_R), which will give slightly more compact equations. To proceed we need a formula for L(D|\pi_0,\ldots,\pi_R), which, given our independence assumption and in view of (5.13), is

L(D|\pi_0,\ldots,\pi_R) = \log\prod_i P(X_i,\Delta_i) = \sum_i \log P(X_i,\Delta_i)
  = \sum_i \log\pi_{\Delta_i}(X_i) - \sum_i\sum_{r=0}^R \int_0^{X_i} ds\, \pi_r(s)
  = \sum_{r=0}^R \int_0^\infty ds \sum_i \Big\{\log\pi_r(s)\,\delta_{r,\Delta_i}\delta(s-X_i) - \pi_r(s)\theta(X_i-s)\Big\}    (6.15)

⁴We will use ordinary Roman capitals (e.g. P(..), W(..)) for distributions describing intrinsic survival statistics of individuals and groups, and calligraphic Roman capitals (e.g. \mathcal{P}(..)) for Bayesian probabilities, which quantify our confidence in having extracted information correctly from survival data.

Now we maximize this latter expression by variation of each of the functions \pi_r(t), giving us an estimator for each of the population-level cause-specific hazard rates. It is standard convention to write estimators with a 'hat' symbol on top, so after differentiation of (6.15) we get

(\forall r)(\forall t): \quad \frac{1}{\hat\pi_r(t)}\sum_i \delta_{r,\Delta_i}\delta(t-X_i) = \sum_i \theta(X_i-t)    (6.16)

(\forall r)(\forall t): \quad \hat\pi_r(t) = \frac{\sum_i \delta_{r,\Delta_i}\delta(t-X_i)}{\sum_i \theta(X_i-t)}    (6.17)

This latter result seems very sensible. At any stage t, the rate for event r is estimated as the total number of observed failures of type r per unit time, divided by the number of patients observed to be still 'at risk' at time t.
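Since (6.17) is a sum of delta functions, in practice one works with its integral, the estimated cumulative hazard \hat\Pi_r(t) = \int_0^t ds\, \hat\pi_r(s) = \sum_{i:\, X_i\leq t} \delta_{r,\Delta_i}/R(X_i), which jumps at the observed event times. A minimal sketch (simulated data with two constant competing hazard rates; the values 0.5 and 1.0 are arbitrary choices), comparing the estimate with the true cumulative hazard \lambda_r t:

```python
import random

random.seed(2)
lam = [0.5, 1.0]          # true constant cause-specific rates
N = 50_000

# Each individual reports the first of two competing exponential event times.
data = []
for _ in range(N):
    t0, t1 = random.expovariate(lam[0]), random.expovariate(lam[1])
    data.append((t0, 0) if t0 < t1 else (t1, 1))

def cum_hazard(r, t):
    """Integrated form of (6.17): sum of 1/R(X_i) over events of type r
    observed no later than t, with R(X) the number still at risk at X."""
    total, at_risk = 0.0, N
    for X, Delta in sorted(data):
        if X > t:
            break
        if Delta == r:
            total += 1.0 / at_risk
        at_risk -= 1
    return total

for r in (0, 1):
    print(f"risk {r}: estimated {cum_hazard(r, 1.0):.3f}, true {lam[r]*1.0:.3f}")
```

Both estimates reproduce \lambda_r t to within sampling noise, without requiring any independence assumption between the two risks' reporting.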

6.3 Derivation of the Kaplan-Meier estimator

Estimator for the survival function in the presence of just one risk. From the estimator (6.17) we can construct estimators for the various survival functions defined earlier. In particular, provided we know (or may assume) that the event times are statistically independent, we can estimate the hypothetical survival function that would describe a situation where all risks except \mu (the risk of interest, or the 'primary' risk) would be eliminated. The latter is estimated by \hat S_\mu(t), where

\log\hat S_\mu(t) = -\int_0^t ds\, \hat\pi_\mu(s) = -\int_0^t ds\, \frac{\sum_i \delta_{\mu,\Delta_i}\delta(s-X_i)}{\sum_i \theta(X_i-s)}
  = -\sum_i \int_0^t ds\, \frac{\delta_{\mu,\Delta_i}\delta(s-X_i)}{\sum_j \theta(X_j-s)} = -\sum_i \frac{\delta_{\mu,\Delta_i}\theta(t-X_i)}{\sum_j \theta(X_j-X_i)}    (6.18)

If we denote with \Omega_\mu \subset \{1,\ldots,N\} the set of all patients that report event \mu, this becomes

\log\hat S_\mu(t) = -\sum_{i\in\Omega_\mu} \frac{\theta(t-X_i)}{\sum_j \theta(X_j-X_i)}    (6.19)

Note that R(X_i) = \sum_j \theta(X_j-X_i) is the number of individuals still 'at risk' at time X_i. We may now write

\hat S_\mu(t) = \prod_{i\in\Omega_\mu} e^{-\theta(t-X_i)/R(X_i)}    (6.20)


Estimator for the overall survival function. The estimate for the overall survival function S(t) = \exp[-\sum_\mu \int_0^t ds\, \pi_\mu(s)] can be obtained by summing over all risks in the derivation above. We get

\log\hat S(t) = -\int_0^t ds \sum_\mu \hat\pi_\mu(s) = -\int_0^t ds\, \frac{\sum_i \delta(s-X_i)}{\sum_i \theta(X_i-s)} = -\sum_i \frac{\theta(t-X_i)}{\sum_j \theta(X_j-X_i)}    (6.21)

Hence

\hat S(t) = \exp\Big[-\sum_i \frac{\theta(t-X_i)}{\sum_j \theta(X_j-X_i)}\Big] = \prod_i e^{-\theta(t-X_i)/R(X_i)}    (6.22)

Here there is no issue relating to independence of event times; this estimator is always valid. Even without the arguments leading to (6.22), one can easily convince oneself that for N \to \infty the expression (6.22) will indeed converge to the true survival function. Upon defining the empirical distribution P_{\rm em}(X) = N^{-1}\sum_i \delta(X-X_i) we may write:

∑i δ(X−Xi) we may write:

\lim_{N\to\infty}\hat S(t) = \lim_{N\to\infty}\exp\Big[-\sum_i \frac{\theta(t-X_i)}{\sum_j \theta(X_j-X_i)}\Big]
  = \lim_{N\to\infty}\exp\Big[-\int_0^\infty dX\, \frac{P_{\rm em}(X)\theta(t-X)}{\int_0^\infty dX'\, P_{\rm em}(X')\theta(X'-X)}\Big]
  = \lim_{N\to\infty}\exp\Big[-\int_0^t dX\, \frac{P_{\rm em}(X)}{\int_X^\infty dX'\, P_{\rm em}(X')}\Big]    (6.23)

For N \to \infty we will have P_{\rm em}(X) \to P(X) (the true distribution of first event times corresponding to our cohort), and the fraction inside the above integral becomes identical to the right-hand side of the population-level version of (2.19), viz.

\sum_{r=0}^R \pi_r(X) = \frac{P(X)}{\int_X^\infty dt\, P(t)}    (6.24)

Thus we find, provided limits and integrations commute (i.e. for non-pathological P(X)), that

\lim_{N\to\infty}\hat S(t) = e^{-\sum_{r=0}^R \int_0^t dX\, \pi_r(X)} = S(t)    (6.25)

The Kaplan-Meier estimators. From equations (6.20) and (6.22) it is only a small step to the so-called Kaplan-Meier curves. We collect and order all distinct times at which one or more events are reported in our cohort, giving an ordered set of time points t_1 < t_2 < t_3 < \ldots, to be labelled by t_\ell. This allows us to write (6.20) as

\hat S_\mu(t) = e^{-\sum_{i\in\Omega_\mu}\theta(t-X_i)/R(X_i)}
  = e^{-\sum_\ell \sum_{i\in\Omega_\mu,\, X_i=t_\ell}\theta(t-t_\ell)/R(t_\ell)}
  = e^{-\sum_\ell \theta(t-t_\ell)R^{-1}(t_\ell)\sum_{i\in\Omega_\mu,\, X_i=t_\ell} 1}    (6.26)

We recognize that D_\mu(t_\ell) = \sum_{i\in\Omega_\mu,\, X_i=t_\ell} 1 is the number of individuals that reported event \mu at time t_\ell, so

\hat S_\mu(t) = \prod_{\ell,\, t_\ell\leq t} e^{-D_\mu(t_\ell)/R(t_\ell)} = \prod_{\ell,\, t_\ell\leq t}\Big[1 - \frac{D_\mu(t_\ell)}{R(t_\ell)} + \mathcal{O}\Big(\frac{D_\mu(t_\ell)}{R(t_\ell)}\Big)^2\Big]    (6.27)

If finally we truncate the expansion of the exponentials after the first two terms (which is valid until times become so large that the number R(t) of individuals at risk becomes of order one) we obtain what is known as the Kaplan-Meier estimator for the risk-specific survival function for the case of uncorrelated risks:

\hat S^{\rm KM}_\mu(t) = \prod_{\ell,\, t_\ell\leq t}\Big[1 - \frac{D_\mu(t_\ell)}{R(t_\ell)}\Big], \qquad
  R(t): \text{nr at risk at time } t, \quad D_\mu(t): \text{nr with event } \mu \text{ at time } t    (6.28)

Similarly, since the only difference between (6.20) and (6.22) is whether or not we limit the contributing individuals i to those that report event \mu, we can retrace the above argument with only minor adjustments, and obtain the Kaplan-Meier estimator of the overall survival function:

\hat S^{\rm KM}(t) = \prod_{\ell,\, t_\ell\leq t}\Big[1 - \frac{D(t_\ell)}{R(t_\ell)}\Big], \qquad
  R(t): \text{nr at risk at time } t, \quad D(t): \text{nr with event at time } t    (6.29)

Examples of cause-specific and overall Kaplan-Meier survival curves are shown in figure 6.1.
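A direct implementation of (6.29) from data \{(X_i,\Delta_i)\} is straightforward (a sketch, with simulated uncensored data for a single exponential risk of rate 1; these choices are arbitrary). Without censoring the KM curve reduces to the empirical survival fraction, and here it approximates e^{-t}:

```python
import math
import random

random.seed(3)
N = 10_000
# Simulated data: one risk only (Delta = 1 throughout), exponential event times.
data = [(random.expovariate(1.0), 1) for _ in range(N)]

def km_overall(t, data):
    """Kaplan-Meier estimator (6.29) for the overall survival function:
    at each distinct event time t_l multiply by [1 - D(t_l)/R(t_l)]."""
    times = sorted(X for X, _ in data)
    at_risk, S, k = len(times), 1.0, 0
    while k < len(times) and times[k] <= t:
        j = k
        while j < len(times) and times[j] == times[k]:
            j += 1                      # group tied event times t_l
        D = j - k                       # D(t_l): events at this time
        S *= 1.0 - D / at_risk          # R(t_l) = at_risk
        at_risk -= D
        k = j
    return S

t = 1.0
print(f"KM estimate S({t}) = {km_overall(t, data):.3f}, exact e^-t = {math.exp(-t):.3f}")
```

With real (censored) data, censoring events would remove individuals from the risk set without contributing a factor, which is exactly what treating end-of-trial as one of the competing risks r accomplishes.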

Some properties of KM curves. The formulae for Kaplan-Meier curves are simple and compact, but they have limitations. The main one is that \hat S^{\rm KM}_\mu(t) only estimates the survival probability for risk \mu for uncorrelated risks. For correlated risks we can still use \hat S^{\rm KM}(t), but \hat S^{\rm KM}_\mu(t) may bear no relation at all to the event statistics of the individual risks that would be found if the alternative risks were disabled. Also, the expansion used means in both cases that for small values of R(t_\ell), i.e. for large times when only few patients are still event-free, they are no longer reliable.

Secondly, by definition KM curves have the shape of descending staircases, with steps at the times where events occurred in the cohort. The smaller the cohort size N, the smaller the number of steps and the larger the jumps involved. This jagged nature of the curves is an artifact of the maximum-likelihood procedure that was followed; since common sense dictates that the true survival curves are smooth, a better procedure would be to add a non-flat prior


[Figure 6.1: four columns of Kaplan-Meier curves, \hat S^{\rm KM}(t), \hat S^{\rm KM}_1(t), \hat S^{\rm KM}_2(t) and \hat S^{\rm KM}_0(t) (EOT); top row: prostate cancer cohort (PC, N = 2047, time in years), bottom row: breast cancer cohort (BC, N = 70, time in days)]

Fig. 6.1. Kaplan-Meier curves for a large prostate cancer data set (top row, N = 2047 patients, primary risk: onset of prostate cancer) and for a smaller breast cancer data set (bottom row, N = 70 patients). Left curves: the KM estimator for the overall survival probability (including end-of-trial censoring events). Middle two columns: KM estimators of cause-specific survival probabilities (assuming independence of risks), for the primary risk 1 (cancer onset for PC for patients monitored from age 50 onwards, cancer recurrence for BC following a primary tumour at time t = 0) and for risk 2 (other deaths). Right column: KM estimator of the survival probability of the end-of-trial risk, which tells us about the distribution of end-of-trial censoring times. We see that prostate cancer risk increases with time (the slope of \hat S^{\rm KM}_1(t) becomes more negative with age), and that the recurrence risk for breast cancer patients decreases with time (the slope of \hat S^{\rm KM}_1(t) gets less negative over time).

$P(\pi_0,\ldots,\pi_R)$ to our derivation, which punishes non-smooth dependencies of cause-specific hazard rates on time. The only reason this is usually not done is that we would get a more complicated equation than (7.40), from which $\pi_r(t)$ can no longer be solved in explicit form.

In view of the interpretation and the underlying assumptions of the KM curves, we should expect that $S^{\rm KM}(t)=\prod_\mu S^{\rm KM}_\mu(t)$. This is indeed true (within the orders of accuracy considered in the derivation of the formulae):

\[
\prod_\mu S^{\rm KM}_\mu(t) = \prod_\mu \prod_{\ell,\, t_\ell\leq t}\left[1-\frac{D_\mu(t_\ell)}{R(t_\ell)}\right]
= \prod_{\ell,\, t_\ell\leq t}\prod_\mu \left[1-\frac{D_\mu(t_\ell)}{R(t_\ell)}\right]
\tag{6.30}
\]

If there are no ties in timing, i.e. all events happen at distinct times, then $D_\mu(t_\ell), D(t_\ell)\in\{0,1\}$ and $D_\mu(t_\ell)=D(t_\ell)\,\delta_{\mu,\mu(t_\ell)}$, with $\mu(t_\ell)$ denoting the type of event observed at time $t_\ell$. We then immediately get

\[
\prod_\mu S^{\rm KM}_\mu(t) = \prod_{\ell,\, t_\ell\leq t}\left[1-\frac{D(t_\ell)}{R(t_\ell)}\right] = S^{\rm KM}(t)
\tag{6.31}
\]

If there are ties, the identity is true in the relevant orders, since

\[
\prod_\mu S^{\rm KM}_\mu(t) = \prod_{\ell,\, t_\ell\leq t}\prod_\mu e^{-D_\mu(t_\ell)/R(t_\ell)+O(D^2_\mu(t_\ell)/R^2(t_\ell))}
= \prod_{\ell,\, t_\ell\leq t} e^{-D(t_\ell)/R(t_\ell)+O(\sum_\mu [D_\mu(t_\ell)/R(t_\ell)]^2)}
\]
\[
= \prod_{\ell,\, t_\ell\leq t}\left[1-\frac{D(t_\ell)}{R(t_\ell)}+O\Big(\frac{D^2(t_\ell)}{R^2(t_\ell)}\Big)\right]
\approx S^{\rm KM}(t)
\tag{6.32}
\]

How to measure cause-specific risk in the presence of risk correlations. We have already seen that the cause-specific survival function $S_\mu(t)$ only estimates the true survival with respect to risk $\mu$, in a world where the other risks are disabled, if all risks are independent. The same is true for the cause-specific Kaplan-Meier curves, which are just approximations of the $S_\mu(t)$. So how do we then measure the cause-specific survival prospects of a cohort when there are correlated risks? Unless we know the joint event time distribution, we have no choice but to return to non-hypothetical survival probabilities, such as the cause-specific incidence functions

\[
F_{i\mu}(t) = \int_0^t\! {\rm d}X~\pi_{i\mu}(X)\, e^{-\sum_{r=0}^R \int_0^X {\rm d}s~\pi_{ir}(s)}
\tag{6.33}
\]
\[
F_\mu(t) = \frac{1}{N}\sum_i \int_0^t\! {\rm d}X~\pi_{i\mu}(X)\, e^{-\sum_{r=0}^R \int_0^X {\rm d}s~\pi_{ir}(s)}
\tag{6.34}
\]

The only limitation is that if we then compare two groups in terms of their risk $\mu$ incidence, we can never be sure whether any differences are due to changes in the event time statistics of the risk $\mu$ itself, or due to differences in the other risks $r\neq\mu$, which can influence $F_{i\mu}(t)$ or $F_\mu(t)$ via censoring (i.e. by changing the probability that the type $\mu$ events are the first to occur). This limitation is fundamental, as it involves the joint statistics of timings, which we know cannot be inferred from hazard rates, and therefore cannot be inferred from survival data.
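For constant cause-specific hazard rates the integral in (6.33) can be done in closed form, which gives a convenient check on any numerical evaluation: $F_\mu(t)=\frac{\pi_\mu}{\pi_1+\pi_2}\big(1-e^{-(\pi_1+\pi_2)t}\big)$. The sketch below is an illustrative special case (not from the manuscript; the rate values are arbitrary), evaluating $F_\mu(t)$ by trapezoidal quadrature and comparing with the exact answer.

```python
import numpy as np

def incidence_const(pi_mu, pi_tot, t, n=200_000):
    """Numerical evaluation of (6.33) for constant hazard rates:
    F_mu(t) = int_0^t dX pi_mu * exp(-pi_tot * X),
    via trapezoidal quadrature on n grid points."""
    X = np.linspace(0.0, t, n)
    f = pi_mu * np.exp(-pi_tot * X)
    return float(np.sum((f[1:] + f[:-1]) * 0.5) * (X[1] - X[0]))

pi1, pi2, t = 0.3, 0.1, 5.0   # illustrative constant rates and time horizon
F1 = incidence_const(pi1, pi1 + pi2, t)
exact = pi1 / (pi1 + pi2) * (1.0 - np.exp(-(pi1 + pi2) * t))
print(F1, exact)              # the two values agree closely
```

Note that $F_1(\infty)=\pi_1/(\pi_1+\pi_2)<1$: the incidence function gives the probability that a type-1 event occurs first, not the hypothetical survival under risk 1 alone.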


6.4 Examples

Example 1: risk elimination

To get a better feel for the perhaps counter-intuitive statement that removal of one risk affects the hazard rates of the remaining risks, let us work out a simple example where two risks 1 and 2 are both consequences of a third event 3, with an exponentially distributed event time $t_3$ (which we do not observe, and which in itself has no negative direct consequences). We assume that always $t_1=t_3+10$ and $t_2=t_3+20$, so

\[
P(t_1,t_2,t_3) = \tau^{-1} e^{-t_3/\tau}\,\delta(t_1-t_3-10)\,\delta(t_2-t_3-20)
\tag{6.35}
\]

Integration over the unknown $t_3$ gives the event time distribution for the observable events:

\[
P(t_1,t_2) = \tau^{-1}\int_0^\infty\! {\rm d}t_3~ e^{-t_3/\tau}\,\delta(t_1-t_3-10)\,\delta(t_2-t_3-20)
= \tau^{-1} e^{-(t_1-10)/\tau}\,\delta(t_2-t_1-10)\,\theta(t_1-10)
\tag{6.36}
\]

From this we obtain

\[
S(t_1,t_2) = \int_{t_1}^\infty\!\!\int_{t_2}^\infty\! {\rm d}s_1{\rm d}s_2~ \tau^{-1} e^{-(s_1-10)/\tau}\,\delta(s_2-s_1-10)\,\theta(s_1-10)
\]
\[
= \tau^{-1}\int_{\max(t_1,10)}^\infty\! {\rm d}s_1~ e^{-(s_1-10)/\tau}\,\theta(s_1+10-t_2)
\]
\[
= \tau^{-1}\int_{\max(t_1-10,\,0,\,t_2-20)}^\infty\! {\rm d}s~ e^{-s/\tau} = e^{-\max(t_1-10,\,0,\,t_2-20)/\tau}
\]
\[
= \left\{\begin{array}{ll}
1 & {\rm if~} t_1<10 {\rm ~and~} t_2<20\\
e^{-(t_1-10)/\tau} & {\rm if~} t_1>10 {\rm ~and~} t_1>t_2-10\\
e^{-(t_2-20)/\tau} & {\rm if~} t_2>20 {\rm ~and~} t_1<t_2-10
\end{array}\right.
\tag{6.37}
\]

To next calculate the hazard rates via (2.7) we need $S(t_1,t_2)$ for $|t_1-t_2|$ small, i.e. we need only the first two options in the result above:

\[
t<10:\quad \pi_1(t)=\pi_2(t)=0
\tag{6.38}
\]
\[
t>10:\quad \pi_1(t)=-\frac{\partial}{\partial t}\log e^{-(t-10)/\tau}=1/\tau,\qquad \pi_2(t)=0
\tag{6.39}
\]

We can understand this: event 2 always happens after event 1, so it will never be observed. Event 1 happens with a constant rate (that of the cause, i.e. of event 3) as soon as $t>10$.

Next we disable risk 1. Evidently this means that $\pi'_1(t)=0$. However, the event 2 still happens exactly at $t_2=t_3+20$, but it is no longer preceded by the 'masking event' 1. So we must get

\[
t<20:\quad \pi'_1(t)=\pi'_2(t)=0
\tag{6.40}
\]
\[
t>20:\quad \pi'_1(t)=0,\qquad \pi'_2(t)=1/\tau
\tag{6.41}
\]

Let us also calculate $\pi'_2(t)$ via the formal route, i.e. formula (6.11):

\[
\pi'_2(t) = \frac{\int {\rm d}s_1{\rm d}s_2~ P(s_1,s_2)\,\delta(s_2-t)}{\int {\rm d}s_1{\rm d}s_2~ P(s_1,s_2)\,\big[1-\theta(t-s_2)\big]} = \frac{P(t)}{\int_t^\infty {\rm d}s~ P(s)}
\tag{6.42}
\]

in which

\[
P(t) = \int_0^\infty\! {\rm d}t_1~ P(t_1,t) = \int_{10}^\infty\! {\rm d}t_1~ \tau^{-1} e^{-(t_1-10)/\tau}\,\delta(t-t_1-10)
= \tau^{-1} e^{-(t-20)/\tau}\,\theta(t-20)
\tag{6.43}
\]

and so insertion into our formula for $\pi'_2(t)$ gives indeed

\[
\pi'_2(t) = \frac{\theta(t-20)\, e^{-(t-20)/\tau}}{\int_t^\infty {\rm d}s~ e^{-(s-20)/\tau}\,\theta(s-20)}
= \frac{\theta(t-20)\, e^{-(t-20)/\tau}}{\tau\, e^{-(t-20)/\tau}} = \theta(t-20)\,\tau^{-1}
\tag{6.44}
\]

Example 2: correlated risks and false protectivity

Let us inspect a previously used distribution for two event times $t_1,t_2\geq 0$, with parameters $a,b,\tau>0$ and $\varepsilon\in[0,1]$, assumed to apply at population level:

\[
P(t_1,t_2) = a\, e^{-a t_2}\Big[\varepsilon\,\delta(t_1-t_2-\tau) + (1-\varepsilon)\, b\, e^{-b t_1}\Big]
\tag{6.45}
\]

We note that the first marginal of this distribution is

\[
P(t_1) = \int_0^\infty\! {\rm d}t_2~ P(t_1,t_2)
= \varepsilon \int_0^\infty\! {\rm d}t_2~ a e^{-a t_2}\,\delta(t_1-t_2-\tau) + (1-\varepsilon)\, b e^{-b t_1}\int_0^\infty\! {\rm d}t_2~ a e^{-a t_2}
\]
\[
= \varepsilon\,\theta(t_1-\tau)\, a e^{-a(t_1-\tau)} + (1-\varepsilon)\, b e^{-b t_1}
\tag{6.46}
\]

Also we note that in the above distribution the two event times are generally correlated:

\[
\langle t_1 t_2\rangle - \langle t_1\rangle\langle t_2\rangle
= \int {\rm d}t_2~ P(t_2)\, t_2 \int {\rm d}t_1~ P(t_1|t_2)\, t_1
- \Big(\int {\rm d}t_2~ P(t_2) \int {\rm d}t_1~ P(t_1|t_2)\, t_1\Big)\Big(\int {\rm d}t_2~ P(t_2)\, t_2\Big)
\]
\[
= \int {\rm d}t_2~ P(t_2)\, t_2\Big(\varepsilon(t_2+\tau)+(1-\varepsilon)\frac{1}{b}\Big)
- \frac{1}{a}\int {\rm d}t_2~ P(t_2)\Big(\varepsilon(t_2+\tau)+(1-\varepsilon)\frac{1}{b}\Big)
\]
\[
= \varepsilon\Big(\langle t_2^2\rangle + \frac{\tau}{a}\Big) + \frac{1-\varepsilon}{ab}
- \varepsilon\Big(\frac{1}{a^2}+\frac{\tau}{a}\Big) - \frac{1-\varepsilon}{ab}
= \varepsilon\Big(\langle t_2^2\rangle - \frac{1}{a^2}\Big)
\tag{6.47}
\]

The remaining average is

\[
\langle t_2^2\rangle = \int_0^\infty\! {\rm d}t~ t^2\, a e^{-a t} = a\,\frac{{\rm d}^2}{{\rm d}a^2}\int_0^\infty\! {\rm d}t~ e^{-a t} = a\,\frac{{\rm d}^2}{{\rm d}a^2}\,\frac{1}{a} = 2/a^2
\tag{6.48}
\]

Hence

\[
\langle t_1 t_2\rangle - \langle t_1\rangle\langle t_2\rangle = \varepsilon/a^2
\tag{6.49}
\]

We conclude that in the above distribution the two times are positively correlated as soon as $\varepsilon>0$.
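The covariance (6.49) is easy to verify by Monte Carlo. Sampling from (6.45) can be done hierarchically: draw $t_2$ from the exponential marginal with rate a, then with probability ε set $t_1=t_2+\tau$, otherwise draw $t_1$ independently with rate b. An illustrative sketch (the parameter values are arbitrary assumptions):

```python
import numpy as np

# Monte Carlo check of eq. (6.49): cov(t1,t2) = eps/a^2 for the density (6.45).
rng = np.random.default_rng(0)
a, b, tau, eps, N = 1.0, 2.0, 1.5, 0.5, 400_000
t2 = rng.exponential(1.0 / a, N)
branch = rng.random(N) < eps          # correlated branch, probability eps
t1 = np.where(branch, t2 + tau, rng.exponential(1.0 / b, N))
cov = float(np.cov(t1, t2)[0, 1])
print(cov)                            # close to eps/a^2 = 0.5
```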

Let us calculate the actual cause-specific hazard rate $\pi'_1(t)$ that would be found if risk 2 were disabled. This rate is given in (6.11), which here simplifies to

\[
\pi'_1(t) = \frac{\int {\rm d}t_1{\rm d}t_2~ P(t_1,t_2)\,\delta(t_1-t)}{\int {\rm d}t_1{\rm d}t_2~ P(t_1,t_2)\,\theta(t_1-t)}
= \frac{P(t)}{\int_t^\infty {\rm d}t_1~ P(t_1)}
\]
\[
= \frac{\varepsilon\,\theta(t-\tau)\, a e^{-a(t-\tau)} + (1-\varepsilon)\, b e^{-b t}}
{\varepsilon\int_{\max\{t,\tau\}}^\infty {\rm d}t_1~ a e^{-a(t_1-\tau)} + (1-\varepsilon)\int_t^\infty {\rm d}t_1~ b e^{-b t_1}}
= \frac{\varepsilon\,\theta(t-\tau)\, a e^{-a(t-\tau)} + (1-\varepsilon)\, b e^{-b t}}
{\varepsilon\, e^{-a\max\{t-\tau,0\}} + (1-\varepsilon)\, e^{-b t}}
\tag{6.50}
\]

Hence

\[
t<\tau:\quad \pi'_1(t) = \frac{(1-\varepsilon)\, b e^{-b t}}{\varepsilon + (1-\varepsilon)\, e^{-b t}}
= -\frac{{\rm d}}{{\rm d}t}\log\big[\varepsilon + (1-\varepsilon)\, e^{-b t}\big]
\tag{6.51}
\]
\[
t>\tau:\quad \pi'_1(t) = \frac{\varepsilon\, a e^{-a(t-\tau)} + (1-\varepsilon)\, b e^{-b t}}{\varepsilon\, e^{-a(t-\tau)} + (1-\varepsilon)\, e^{-b t}}
= -\frac{{\rm d}}{{\rm d}t}\log\big[\varepsilon\, e^{-a(t-\tau)} + (1-\varepsilon)\, e^{-b t}\big]
\tag{6.52}
\]

We can now immediately read off the true survival function for risk 1, viz. $S_1(t)=\exp[-\int_0^t {\rm d}s~\pi'_1(s)]$, that would correspond to a world where risk 2 was disabled:

\[
t<\tau:\quad S_1(t) = \varepsilon + (1-\varepsilon)\, e^{-b t}
\tag{6.53}
\]
\[
t>\tau:\quad S_1(t) = \varepsilon\, e^{-a(t-\tau)} + (1-\varepsilon)\, e^{-b t}
\tag{6.54}
\]

Next we want to compare this result to the estimator $\hat S_1(t)$ in (6.20), of which the risk-1 KM curve $S^{\rm KM}_1(t)$ is an approximation, which aims to describe the survival statistics for risk 1 alone, but whose derivation relied on assuming independence of the event times. To do this we generate N time pairs $(t^i_1,t^i_2)$ from the distribution $P(t_1,t_2)$, and define the corresponding survival data $D=\{(X_1,\Delta_1),\ldots,(X_N,\Delta_N)\}$, where

\[
(\forall i=1\ldots N):\quad X_i=\min\{t^i_1,t^i_2\},\qquad
\Delta_i=\left\{\begin{array}{ll} 1 & {\rm if~} t^i_1<t^i_2\\ 2 & {\rm if~} t^i_2<t^i_1\end{array}\right.
\tag{6.55}
\]

It is convenient to rewrite the estimator $\hat S_1(t)$ first as

\[
\hat S_1(t) = \exp\Big\{-\sum_i \frac{\delta_{\Delta_i,1}\,\theta(t-X_i)}{\sum_j \theta(X_j-X_i)}\Big\}
= \exp\Big\{-\frac{1}{N}\sum_i \frac{\delta_{\Delta_i,1}\,\theta(t-X_i)}{\frac{1}{N}\sum_j \theta(X_j-X_i)}\Big\}
\]
\[
= \exp\Big\{-\sum_{\Delta=1}^2 \int_0^\infty\! {\rm d}X~ \hat P(X,\Delta)\,\frac{\delta_{\Delta,1}\,\theta(t-X)}{\sum_{\Delta'=1}^2 \int_0^\infty {\rm d}X'~ \hat P(X',\Delta')\,\theta(X'-X)}\Big\}
\]
\[
= \exp\Big\{-\int_0^t\! {\rm d}X~ \frac{\hat P(X,1)}{\int_X^\infty {\rm d}X'~ \hat P(X',1) + \int_X^\infty {\rm d}X'~ \hat P(X',2)}\Big\}
\tag{6.56}
\]

Here $\hat P(X,\Delta)$ is the empirical joint distribution of reported event times and event types, i.e.

\[
\hat P(X,\Delta) = \frac{1}{N}\sum_i \delta_{\Delta,\Delta_i}\,\delta(X-X_i)
\tag{6.57}
\]

For sufficiently large cohorts, i.e. for $N\to\infty$, and given that our 'patients' were generated independently, the law of large numbers guarantees that $\hat P(X,\Delta)$ will converge to the true distribution $P(X,\Delta)$ defined in (2.15):

\[
P(X,\Delta) = \pi_\Delta(X)\, e^{-\int_0^X {\rm d}s~ \pi_1(s) - \int_0^X {\rm d}s~ \pi_2(s)}
\tag{6.58}
\]

We have already calculated the cause-specific hazard rates $\pi_{1,2}(t)$ and their time integrals for our present example, see (3.15,3.16), which resulted in

\[
\pi_1(t) = \frac{b(1-\varepsilon)\, e^{-b t}}{\varepsilon + (1-\varepsilon)\, e^{-b t}},\qquad
\int_0^t {\rm d}s~ \pi_1(s) = -\log\big(\varepsilon + (1-\varepsilon)\, e^{-b t}\big)
\tag{6.59}
\]
\[
\pi_2(t) = a,\qquad \int_0^t {\rm d}s~ \pi_2(s) = a t
\tag{6.60}
\]

Hence we find

\[
P(X,1) = \frac{b(1-\varepsilon)\, e^{-b X}}{\varepsilon + (1-\varepsilon)\, e^{-b X}}\,\big(\varepsilon + (1-\varepsilon)\, e^{-b X}\big)\, e^{-a X} = b(1-\varepsilon)\, e^{-(a+b)X}
\tag{6.61}
\]
\[
P(X,2) = a\big(\varepsilon + (1-\varepsilon)\, e^{-b X}\big)\, e^{-a X} = a\varepsilon\, e^{-a X} + a(1-\varepsilon)\, e^{-(a+b)X}
\tag{6.62}
\]

Hence for $N\to\infty$ our estimator $\hat S_1(t)$ will report

\[
\hat S_1(t) = \exp\Big\{-\int_0^t\! {\rm d}X~ \frac{P(X,1)}{\int_X^\infty {\rm d}X'~ P(X',1) + \int_X^\infty {\rm d}X'~ P(X',2)}\Big\}
\]
\[
= \exp\Big\{-\int_0^t\! {\rm d}X~ \frac{b(1-\varepsilon)\, e^{-(a+b)X}}{(a{+}b)(1-\varepsilon)\int_X^\infty {\rm d}s~ e^{-(a+b)s} + a\varepsilon\int_X^\infty {\rm d}s~ e^{-a s}}\Big\}
\]
\[
= \exp\Big\{-\int_0^t\! {\rm d}X~ \frac{b(1-\varepsilon)\, e^{-(a+b)X}}{(1-\varepsilon)\, e^{-(a+b)X} + \varepsilon\, e^{-a X}}\Big\}
= \exp\Big\{-\int_0^t\! {\rm d}X~ \frac{b(1-\varepsilon)\, e^{-b X}}{(1-\varepsilon)\, e^{-b X} + \varepsilon}\Big\}
\]
\[
= e^{-\int_0^t {\rm d}X~ \pi_1(X)} = \varepsilon + (1-\varepsilon)\, e^{-b t}
\tag{6.63}
\]

[Figure 6.2: two panels, $\hat S_1(t)$ vs $S_1(t)$ (left) and $S^{\rm KM}_1(t)$ vs $S_1(t)$ (right), each for $\varepsilon=0.25,\,0.50,\,0.75$.]

Fig. 6.2. We compare the estimators $\hat S_1(t)$ (dashed, left) and $S^{\rm KM}_1(t)$ (dashed, right) for the survival function of risk 1 to the true survival function $S_1(t)$ for risk 1 (solid) that would be found if risk 2 were disabled. The joint event times are assumed to be distributed according to the example (6.45), with parameter values $a=b=\tau=1$. The curves correspond to formulae (6.53,6.54) and (6.63,6.28). The three Kaplan-Meier curves on the right were calculated from $N=1000$ synthetic patient data $(X_i,\Delta_i)$, generated according to (6.55). Since the two risks in this example are positively correlated, the KM estimator $S^{\rm KM}_1(t)$ and its precursor $\hat S_1(t)$ (both of which assume there are no correlations between the two risks) underestimate the severity of risk 1. This effect is called 'false protectivity due to competing risks'.

Comparison with the true survival function (6.53,6.54) for risk 1, describing correctly the world where risk 2 is disabled, shows that our estimator is only correct for short times. See figure 6.2 for example curves, corresponding to $\varepsilon\in\{\frac{1}{4},\frac{1}{2},\frac{3}{4}\}$. As soon as $t>\tau$ (and provided $\varepsilon>0$, so there are indeed event time correlations) the estimator $\hat S_1(t)$ and the Kaplan-Meier curve $S^{\rm KM}_1(t)$ both grossly over-estimate the survival probability of risk 1. This effect, called 'false protectivity', is entirely the consequence of the fact that KM-type estimators


[Figure 6.3: curves of the incidence estimator $1-S^{\rm KM}_1(t)$ for three subgroups of males.]

Fig. 6.3. The incidence estimator $1-S^{\rm KM}_1(t)$ for prostate cancer, calculated for three subgroups of a population of males. This illustrates the false protectivity effect. Here smoking seems to have a preventative effect on prostate cancer, but this is in fact caused by correlations between the risk of prostate cancer and the risk of lung cancer; in the presence of such correlations the Kaplan-Meier estimators can no longer be trusted.

neglect risk correlations. In our present example the two times are positively correlated, so we know that high-risk individuals with respect to event type 2 tend also to be high-risk with respect to event type 1. The early events of type 2 are therefore more likely to 'filter out' those individuals that would also have given early type 1 events. Event 2 censoring thereby changes the composition of the population over time, increasing the fraction of individuals with lower type 1 risk.

An example of real medical data affected by competing risks are those of figure 2.1. The corresponding Kaplan-Meier curves are shown in figure 6.3, or rather the incidence estimator $1-S^{\rm KM}_1(t)$, which estimates the probability of having experienced event type 1 (here: prostate cancer) as a function of time. The curves are shown for different groups: smokers, ex-smokers, and non-smokers. The result suggests a preventative effect of smoking with respect to prostate cancer. In fact, a more careful analysis reveals that this is not real, but caused by the false protectivity effect of lung cancer; those of the smokers who did not get lung cancer by the time they get to the age of 75 are inherently more robust, and therefore also less likely to get prostate cancer.


Note that the deviation between the true $S_1(t)$ and the estimator $\hat S_1(t)$ could also work in the opposite direction. If our two risks had been negatively correlated, then risk 2 events would have been more likely to filter out individuals with low type 1 risk; we would then have found our estimators $\hat S_1(t)$ and $S^{\rm KM}_1(t)$ under-estimating the risk 1 survival probability. Risk-specific Kaplan-Meier curves were not designed for, and should therefore not be used in, a context where different risks may be correlated.


7

INCLUDING COVARIATES

All survival probabilities and data likelihoods considered so far were dependent only upon the cause-specific hazard rates $\pi=\{\pi_0,\ldots,\pi_R\}$. Let us emphasise this in our notation, and write

\[
S(t|\pi) = e^{-\sum_r \int_0^t {\rm d}s~ \pi_r(s)}
\tag{7.1}
\]
\[
P(X,\Delta|\pi) = \pi_\Delta(X)\, e^{-\sum_r \int_0^X {\rm d}s~ \pi_r(s)}
\tag{7.2}
\]

Note that with these conventions we can also write survival functions and data probabilities at the level of individuals simply as $S_i(t)=S(t|\pi^i)$ and $P_i(X,\Delta)=P(X,\Delta|\pi^i)$. If we want to predict survival for individuals on which we have further information in the form of the values of covariates $Z=(Z_1,\ldots,Z_p)$, then we would want to use this information. There are two distinct ways to do this; both are internally consistent and correct, but they differ in strategy.

7.1 Definition via covariate sub-cohorts

How to relate covariates to prediction. Let us assume for simplicity that our covariates are discrete. We can then define a sub-cohort $\Omega_Z\subseteq\{1,\ldots,N\}$, consisting of those individuals i that have covariates $Z^i=Z$, and apply the analysis in section 5 of the link between individual-level descriptions and population-level descriptions to this sub-cohort. $\Omega_Z$ will be characterized by some cohort-level cause-specific hazard rates $\pi(Z)$, which are related to the individual cause-specific hazard rates via (5.10), which takes the form

\[
\pi_\mu(t|Z) = \frac{\sum_{i\in\Omega_Z} \pi^i_\mu(t)\, e^{-\sum_r \int_0^t {\rm d}s~ \pi^i_r(s)}}{\sum_{i\in\Omega_Z} e^{-\sum_r \int_0^t {\rm d}s~ \pi^i_r(s)}}
\tag{7.3}
\]

All our previous analysis linking cohort-level to individual-level quantities applies to $\Omega_Z$, so we can immediately write down the formulas for the probability $S(t|\pi(Z))$ that a randomly drawn individual from $\Omega_Z$ will be alive at time t, and for the likelihood per unit time $P(X,\Delta|\pi(Z))$ that a randomly drawn individual from $\Omega_Z$ will report an event of type $\Delta$ at time X:

\[
S(t|\pi(Z)) = e^{-\sum_r \int_0^t {\rm d}s~ \pi_r(s|Z)}
\tag{7.4}
\]
\[
P(X,\Delta|\pi(Z)) = \pi_\Delta(X|Z)\, e^{-\sum_r \int_0^X {\rm d}s~ \pi_r(s|Z)}
\tag{7.5}
\]

with the generic definitions (7.1) and (7.2). Similarly, we find identity (5.9) translated into

\[
\pi_\mu(t|Z)\, S(t|\pi(Z)) = \frac{1}{|\Omega_Z|}\sum_{i\in\Omega_Z} \pi^i_\mu(t)\, S(t|\pi^i)
\tag{7.6}
\]

with $|\Omega_Z|=\sum_{i\in\Omega_Z}1$. We think strictly in terms of cohort-level cause-specific hazard rates $\pi(Z)$, which by definition depend solely and uniquely on Z, as opposed to individual-level cause-specific hazard rates $\pi$ (which can and generally will vary within the sub-cohort $\Omega_Z$).

Estimation of $\pi(Z)$ from the data, and Bayesian prediction. Within the sub-cohort picture, the Bayesian estimation of $\pi(Z)$ is straightforward. To distinguish between the $\pi$ (cause-specific hazard rates, being functions of time but without limiting oneself to individuals with specific covariates) and the time- and covariate-dependent rates $\pi(Z)$, let us write the latter, when used as arguments in probability distributions, as $\pi^\star$. In Bayesian estimation we would simply write

\[
P(\pi^\star|D) = \frac{P(D|\pi^\star)\, P(\pi^\star)}{\int {\rm d}\pi^{\star\prime}~ P(D|\pi^{\star\prime})\, P(\pi^{\star\prime})}
\tag{7.7}
\]
\[
P(D|\pi^\star) = \prod_{i=1}^N P(X_i,\Delta_i|\pi^\star(Z^i))
= \prod_{i=1}^N \pi^\star_{\Delta_i}(X_i|Z^i)\, e^{-\sum_r \int_0^{X_i} {\rm d}t~ \pi^\star_r(t|Z^i)}
\tag{7.8}
\]

Here $P(\pi^\star)$ is a distribution that codes for any prior knowledge we have on the relation $\pi^\star(Z)$ (including applicable constraints). Fully Bayesian prediction, taking into account our limited certainty on whether we have extracted the correct $\pi^\star(Z)$ from the data D, would become

\[
S(t|Z,D) = \int {\rm d}\pi^\star~ P(\pi^\star|D)\, S(t|\pi^\star(Z))
\tag{7.9}
\]
\[
P(X,\Delta|Z,D) = \int {\rm d}\pi^\star~ P(\pi^\star|D)\, P(X,\Delta|\pi^\star(Z))
\tag{7.10}
\]

Most probable covariate-to-rates relation. The most probable function $\pi^\star(Z)$ is the one that maximises $P(\pi^\star|D)$ in (7.7), i.e. (up to a $\pi^\star$-independent constant) the one that maximises

\[
\log P(\pi^\star|D) = \log P(D|\pi^\star) + \log P(\pi^\star)
\]
\[
= \sum_{i=1}^N \log\Big[\pi^\star_{\Delta_i}(X_i|Z^i)\, e^{-\sum_r \int_0^{X_i} {\rm d}t~ \pi^\star_r(t|Z^i)}\Big] + \log P(\pi^\star)
\]
\[
= \sum_{i=1}^N \log \pi^\star_{\Delta_i}(X_i|Z^i) - \sum_{i=1}^N \sum_r \int_0^{X_i}\! {\rm d}t~ \pi^\star_r(t|Z^i) + \log P(\pi^\star)
\]
\[
= \sum_r \Big\{\sum_{i=1}^N \delta_{r,\Delta_i}\log \pi^\star_r(X_i|Z^i) - \sum_{i=1}^N \int_0^{X_i}\! {\rm d}t~ \pi^\star_r(t|Z^i)\Big\} + \log P(\pi^\star)
\tag{7.11}
\]

Unless we have prior evidence that suggests we should couple risks, we should use the maximum entropy prior, which is of the form $P(\pi^\star)=\prod_r P(\pi^\star_r)$. In that case the posterior $P(\pi^\star|D)$ factorises fully over risks, and hence

\[
\log P(\pi^\star|D) = \sum_r \log P(\pi^\star_r|D)
\tag{7.12}
\]
\[
\log P(\pi^\star_r|D) = \sum_{i=1}^N \delta_{r,\Delta_i}\log \pi^\star_r(X_i|Z^i) - \sum_{i=1}^N \int_0^{X_i}\! {\rm d}t~ \pi^\star_r(t|Z^i) + \log P(\pi^\star_r)
\tag{7.13}
\]

We see that the functions $\pi^\star_r(t|Z)$ for different risks r are calculated from disconnected maximisation problems. However, this does not mean that we can simply forget about the other risks $r\neq 1$. The risks could still be correlated, so eliminating competing risks can still impact upon the primary hazard rate $\pi^\star_1(t|Z)$. Only with the further assumption of noncorrelating risks can we take $\pi^\star_1(t|Z)$ as a correct measure of risk in a world where only risk 1 can materialise.
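For discrete covariates and a flat prior, the disconnected problems (7.13) acquire a simple closed-form solution once one parametrises $\pi^\star_r(t|Z)$ as piecewise constant on time bins (an assumption made here purely for illustration, not a choice made by the manuscript): setting the derivative of (7.13) with respect to a bin value to zero gives the classic occurrence/exposure rate, the number of type-r events in the bin divided by the total time at risk in the bin, within the sub-cohort $\Omega_Z$. A hedged sketch (the function name and interface are ours):

```python
import numpy as np

def ml_hazard(X, delta, Z, risk, z, bins):
    """Events-over-exposure estimate of a piecewise-constant hazard
    pi_r(t|Z) for risk `risk`, within the sub-cohort with covariate
    value `z`, on the time bins given by consecutive entries of `bins`."""
    X, delta, Z = map(np.asarray, (X, delta, Z))
    m = Z == z                                 # membership of Omega_Z
    rates = []
    for ta, tb in zip(bins[:-1], bins[1:]):
        # total time at risk spent inside [ta, tb) by sub-cohort members
        exposure = np.sum(np.clip(X[m], ta, tb) - ta)
        # type-`risk` events reported inside [ta, tb)
        events = np.sum(m & (delta == risk) & (X >= ta) & (X < tb))
        rates.append(events / exposure if exposure > 0 else np.nan)
    return rates
```

For instance, three sub-cohort members with events at times 1, 3 and 5, on the single bin [0, 6), give 3 events over 9 units of exposure, i.e. an estimated rate of 1/3.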

Information-theoretic interpretation. There is a nice interpretation of what the above formulae are effectively doing. To show this we first need to define the empirical covariate distribution and the empirical conditioned data distribution:

\[
\hat P(Z) = \frac{1}{N}\sum_i \delta(Z-Z^i)
\tag{7.14}
\]
\[
\hat P(t,r|Z) = \frac{\sum_i \delta(t-X_i)\,\delta_{r,\Delta_i}\,\delta(Z-Z^i)}{\sum_i \delta(Z-Z^i)}
\tag{7.15}
\]

From (7.7,7.8) we obtain, using the definitions (7.14,7.15):

\[
\frac{1}{N}\log P(\pi^\star|D)
= \frac{1}{N}\sum_{i=1}^N \log P(X_i,\Delta_i|\pi^\star(Z^i)) + \frac{1}{N}\log P(\pi^\star) + {\rm constant}
\]
\[
= \int {\rm d}Z \sum_r \frac{1}{N}\sum_{i=1}^N \delta(Z-Z^i)\,\delta_{r,\Delta_i}\int_0^\infty\! {\rm d}t~ \delta(t-X_i)\,\log P(t,r|\pi^\star(Z)) + \frac{1}{N}\log P(\pi^\star) + {\rm constant}
\]
\[
= \int {\rm d}Z~ \hat P(Z)\sum_r \int_0^\infty\! {\rm d}t~ \hat P(t,r|Z)\,\log P(t,r|\pi^\star(Z)) + \frac{1}{N}\log P(\pi^\star) + {\rm constant}
\]
\[
= -\int {\rm d}Z~ \hat P(Z)\sum_r \int_0^\infty\! {\rm d}t~ \hat P(t,r|Z)\,\log\Big[\frac{\hat P(t,r|Z)}{P(t,r|\pi^\star(Z))}\Big] + \frac{1}{N}\log P(\pi^\star) + {\rm Constant}
\tag{7.16}
\]

Apart from the regularising influence of the prior $P(\pi^\star)$, the most probable function $\pi^\star$ is apparently the one that minimises the Z-averaged Kullback-Leibler distance between the empirical covariate-conditioned distribution $\hat P(t,r|Z)$ and its theoretical expectation $P(t,r|\pi^\star(Z))$.

7.2 Conditioning of individual hazard rates on covariates

How to relate covariates to prediction. The second approach to bringing in covariate information is formulated in terms of the individual cause-specific hazard rates. We regard individual covariates as predictors of individual cause-specific hazard rates, which in turn predict individual survival:

\[
Z^i ~\to~ {\rm predicts} ~\to~ \pi^i ~\to~ {\rm predicts} ~\to~ (X_i,\Delta_i)
\]

The question then is how to formalise this. If we are given the values Z of the covariates of an individual, then the survival probability and data likelihood for that individual, conditional on knowing their covariates to be Z, can be written as

\[
S(t|Z,W) = \int {\rm d}\pi~ W(\pi|Z)\, S(t|\pi)
\tag{7.17}
\]
\[
P(X,\Delta|Z,W) = \int {\rm d}\pi~ W(\pi|Z)\, P(X,\Delta|\pi)
\tag{7.18}
\]

Here $\int {\rm d}\pi$ represents functional integration over the values of all hazard rates at all times, subject to $\pi_r(t)\geq 0$ for all (t,r), and $W(\pi|Z)$ gives the probability that a randomly drawn individual with covariates Z will have individual cause-specific hazard rates $\pi$.

The distribution $W(\pi|Z)$ depends strictly on the degree to which Z is informative of $\pi$, i.e. on biochemistry. If Z is very informative, then $W(\pi|Z)$ will be very narrow and point us to a very small set of cause-specific hazard rates compatible with observing covariates Z. One cannot conclude that any patterns linking Z to $\pi$, embodied in $W(\pi|Z)$, are causal. For instance, $\pi$ and (components of) Z could both be (partially) effects of a common cause Y. $W(\pi|Z)$ only answers the question: if one knows Z for an individual, what does this tell us about his/her $\pi$?

Estimation of the covariates-to-risk connection. To proceed we need to estimate the unknown distribution $W(\pi|Z)$ from analysis of the complete data $D=\{(X_1,\Delta_1;Z^1),\ldots,(X_N,\Delta_N;Z^N)\}$ (i.e. the survival data plus the covariates of all patients). For infinitely large cohorts we would expect to find $W(\pi|Z)$ becoming identical to the empirical frequency $\hat W(\pi|Z)$ with which the hazard rates $\pi$ are observed among the individuals with covariates Z:

\[
W(\pi|Z) = \lim_{N\to\infty} \hat W(\pi|Z)
\tag{7.19}
\]
\[
\hat W(\pi|Z) = \frac{\sum_i \delta(\pi-\pi^i)\,\delta(Z-Z^i)}{\sum_i \delta(Z-Z^i)}
\tag{7.20}
\]

For finite data sets we will only be able to say how likely each possible function $W(\pi|Z)$ is, in the light of the data D: we will calculate $P(W|D)$, where W is the conditional distribution $W(\pi|Z)$. If we assume that all patients in D are independently drawn from a given population, the standard Bayesian formula $P(A|B)=P(B|A)P(A)/P(B)$ tells us that

\[
P(W|D) = \frac{P(D|W)\, P(W)}{\int {\rm d}W'~ P(D|W')\, P(W')}
\tag{7.21}
\]

with

\[
P(D|W) = \prod_{i=1}^N P(X_i,\Delta_i|Z^i,W) = \prod_{i=1}^N \int {\rm d}\pi~ W(\pi|Z^i)\, P(X_i,\Delta_i|\pi)
\tag{7.22}
\]

Here $\int {\rm d}W$ denotes functional integration over all distributions $W(\pi|Z)$, subject to the constraint $\int {\rm d}\pi~ W(\pi|Z)=1$ for all Z. The survival prediction formulae for an individual with covariates Z will then be:

\[
S(t|Z,D) = \int {\rm d}W~ \overbrace{P(W|D)}^{\textrm{likelihood that }W\textrm{ is right}} \times \overbrace{S(t|Z,W)}^{\textrm{survival prediction given }W}
= \int {\rm d}W~ P(W|D)\int {\rm d}\pi~ W(\pi|Z)\, S(t|\pi)
\tag{7.23}
\]

and

\[
P(X,\Delta|Z,D) = \int {\rm d}W~ P(W|D)\, P(X,\Delta|Z,W)
= \int {\rm d}W~ P(W|D)\int {\rm d}\pi~ W(\pi|Z)\, P(X,\Delta|\pi)
\tag{7.24}
\]

Equivalently:

\[
S(t|Z,D) = \int {\rm d}\pi~ W(\pi|Z,D)\, S(t|\pi)
\tag{7.25}
\]
\[
P(X,\Delta|Z,D) = \int {\rm d}\pi~ W(\pi|Z,D)\, P(X,\Delta|\pi)
\tag{7.26}
\]

with

\[
W(\pi|Z,D) = \int {\rm d}W~ P(W|D)\, W(\pi|Z)
\tag{7.27}
\]

The distribution $W(\pi|Z,D)$ combines two sources of uncertainty: (i) uncertainty in the individual hazard rates $\pi$ given an individual's covariates Z (coded in $W(\pi|Z)$), and (ii) our ignorance about which is the true relation $W(\pi|Z)$, given the data D (described by $P(W|D)$). The first uncertainty can be reduced by using more informative covariates, the second by acquiring more data.

Information-theoretic interpretation. Again there exists a nice information-theoretic interpretation of our Bayesian formulae, since

\[
\frac{1}{N}\log P(W|D)
= \frac{1}{N}\sum_{i=1}^N \log\int {\rm d}\pi~ W(\pi|Z^i)\, P(X_i,\Delta_i|\pi) + \frac{1}{N}\log P(W) + {\rm constant}
\]
\[
= \sum_r \frac{1}{N}\sum_{i=1}^N \int {\rm d}Z~ \delta(Z-Z^i)\int_0^\infty\! {\rm d}t~ \delta(t-X_i)\,\delta_{r,\Delta_i}\,\log\int {\rm d}\pi~ W(\pi|Z)\, P(t,r|\pi) + \frac{1}{N}\log P(W) + {\rm constant}
\]
\[
= \int {\rm d}Z~ \hat P(Z)\sum_r \int_0^\infty\! {\rm d}t~ \hat P(t,r|Z)\,\log\int {\rm d}\pi~ W(\pi|Z)\, P(t,r|\pi) + \frac{1}{N}\log P(W) + {\rm constant}
\tag{7.28}
\]

We can rewrite (7.28) in terms of a Kullback-Leibler distance:

\[
\frac{1}{N}\log P(W|D)
= -\int {\rm d}Z~ \hat P(Z)\sum_r \int_0^\infty\! {\rm d}t~ \hat P(t,r|Z)\,\log\Big[\frac{\hat P(t,r|Z)}{\int {\rm d}\pi~ W(\pi|Z)\, P(t,r|\pi)}\Big] + \frac{1}{N}\log P(W) + {\rm Constant}
\tag{7.29}
\]

(in which Constant is a new constant that differs from the previous one by a further W-independent term, being the Z-averaged Shannon entropy of $\hat P(t,r|Z)$).


So we see that, apart from the regularising influence of the prior $P(W)$, the most probable function $W(\pi|Z)$ is the one that minimises the Z-averaged Kullback-Leibler distance between the empirical distribution $\hat P(t,r|Z)$ and its theoretical expectation $P(t,r|Z)=\int {\rm d}\pi~ W(\pi|Z)\, P(t,r|\pi)$.

7.3 Connecting the conditioning and sub-cohort pictures

Conditioned cohort-level hazard rates in terms of W. It is instructive to inspect the relation between the two routes for incorporating covariates into survival prediction in more detail. We note that (7.3) can be written in terms of the empirical estimator $\hat W(\pi|Z)$ given in (7.20):

\[
\pi_\mu(t|Z) = \frac{\int {\rm d}\pi \sum_i \delta_{Z,Z^i}\,\delta(\pi-\pi^i)\,\pi_\mu(t)\, e^{-\sum_r \int_0^t {\rm d}s~ \pi_r(s)}}{\int {\rm d}\pi \sum_i \delta_{Z,Z^i}\,\delta(\pi-\pi^i)\, e^{-\sum_r \int_0^t {\rm d}s~ \pi_r(s)}}
= \frac{\int {\rm d}\pi~ \hat W(\pi|Z)\,\pi_\mu(t)\, e^{-\sum_r \int_0^t {\rm d}s~ \pi_r(s)}}{\int {\rm d}\pi~ \hat W(\pi|Z)\, e^{-\sum_r \int_0^t {\rm d}s~ \pi_r(s)}}
\tag{7.30}
\]

Relation in terms of prediction. We note the difference between the definitions of $S(t|\pi(Z))$ and $S(t|Z,W)$. The first gives the survival probability for a randomly drawn individual from the data set with covariates Z; the second gives the survival probability for a randomly drawn individual with covariates Z (not necessarily from the data set). Any finite-size imperfections of the data set will affect $S(t|\pi(Z))$. Within the sub-cohort picture we find, using (7.6) and (7.3):

\[
S(t|\pi(Z)) = \frac{\frac{1}{|\Omega_Z|}\sum_{i\in\Omega_Z} \pi^i_\mu(t)\, S(t|\pi^i)}{\pi_\mu(t|Z)}
= \frac{\frac{1}{|\Omega_Z|}\sum_{i\in\Omega_Z} \pi^i_\mu(t)\, e^{-\sum_r \int_0^t {\rm d}s~ \pi^i_r(s)}}{\pi_\mu(t|Z)}
\]
\[
= \Big(\frac{1}{|\Omega_Z|}\sum_{i\in\Omega_Z} \pi^i_\mu(t)\, e^{-\sum_r \int_0^t {\rm d}s~ \pi^i_r(s)}\Big)\,
\frac{\sum_{i\in\Omega_Z} e^{-\sum_r \int_0^t {\rm d}s~ \pi^i_r(s)}}{\sum_{i\in\Omega_Z} \pi^i_\mu(t)\, e^{-\sum_r \int_0^t {\rm d}s~ \pi^i_r(s)}}
= \frac{1}{|\Omega_Z|}\sum_{i\in\Omega_Z} e^{-\sum_r \int_0^t {\rm d}s~ \pi^i_r(s)}
\]
\[
= \int {\rm d}\pi \Big(\frac{\sum_{i\in\Omega_Z}\delta(\pi-\pi^i)}{\sum_{i\in\Omega_Z} 1}\Big)\, e^{-\sum_r \int_0^t {\rm d}s~ \pi_r(s)}
= \int {\rm d}\pi~ \hat W(\pi|Z)\, S(t|\pi) = S(t|Z,\hat W)
\tag{7.31}
\]

Similarly we can connect the expressions $P(X,\Delta|\pi(Z))$ and $P(X,\Delta|Z,W)$. Starting from the sub-cohort picture we get

\[
P(X,\Delta|\pi(Z)) = \pi_\Delta(X|Z)\, S(X|\pi(Z))
= \frac{1}{|\Omega_Z|}\sum_{i\in\Omega_Z} \pi^i_\Delta(X)\, S(X|\pi^i)
\]
\[
= \int {\rm d}\pi \Big(\frac{1}{|\Omega_Z|}\sum_{i\in\Omega_Z}\delta(\pi-\pi^i)\Big)\,\pi_\Delta(X)\, S(X|\pi)
= \int {\rm d}\pi~ \hat W(\pi|Z)\,\pi_\Delta(X)\, S(X|\pi)
= P(X,\Delta|Z,\hat W)
\tag{7.32}
\]

There is no contradiction between the two approaches; they just focus on different quantities. In the conditioning picture we capture the variability in the connection $Z\to\pi$ in a distribution $W(\pi|Z)$, where $\pi$ refers to the cause-specific hazard rates of individuals. In the sub-cohort picture we describe the variability in the connection $Z\to\pi$ via sub-cohort level cause-specific hazard rates $\pi(Z)$. In both cases we still need to estimate this variability from the data.

7.4 Conditionally homogeneous cohorts

Trivial versus nontrivial heterogeneity. There are two types of cohort heterogeneity. The trivial one is heterogeneity in covariates, meaning that the $Z^i$ are not identical for all individuals i. We always allow for this by default; it would be silly to include covariates and then assume they take identical values for all individuals (as they would then give no information). The nontrivial type of heterogeneity refers to the link between covariates and risks. A conditionally homogeneous cohort is one in which the cause-specific hazard rates are identical for all individuals i with identical covariates $Z^i$.

If we work within the sub-cohort picture, formulated in terms of the sub-cohort level cause-specific hazard rates $\pi(Z)$ of individuals with covariates Z, we need not make any statements on the presence or absence of covariate-to-risk heterogeneity. In either case we simply calculate survival statistics and data likelihood for any individual with covariates Z via

\[
S(t|\pi(Z)) = e^{-\sum_r \int_0^t {\rm d}s~ \pi_r(s|Z)}
\tag{7.33}
\]
\[
P(X,\Delta|\pi(Z)) = \pi_\Delta(X|Z)\, e^{-\sum_r \int_0^X {\rm d}s~ \pi_r(s|Z)}
\tag{7.34}
\]

We can get these equations also starting from the conditioning picture, but there we need conditional cohort homogeneity, i.e. $W(\pi|Z)=\delta[\pi-\pi(Z)]$, with $\pi(Z)$ now representing the individual cause-specific hazard rates of individuals with covariates Z. The risk statistics of each individual with covariates Z are now described by the cause-specific hazard rates $\pi(Z)$. Due to the conditional homogeneity these are trivially identical to the cohort-level hazard rates, and hence (7.33,7.34) again hold. Also the Bayesian estimation of $\pi(Z)$, which involves evaluation of the quantity $P(X,\Delta|\pi(Z))$ for the available data points $(X_i,\Delta_i)$,


[Figure 7.1 diagram: the set of all models, with covariate-to-risk connection $W(\pi|Z)$ estimated from data via $P(W|D)$, contains the subset of homogeneous cohorts, with $W(\pi|Z)=\delta[\pi-\pi^\star(Z)]$ and covariate-to-risk connection $\pi^\star(Z)$ estimated from data via $P(\pi^\star|D)$.]

Fig. 7.1. Description of general and of conditionally homogeneous cohorts, within the framework where covariate information is used to condition the probability for individuals to have individual cause-specific hazard rates $\pi$. In conditionally homogeneous cohorts all cause-specific hazard rates are fully determined by the covariates; there are no 'hidden' covariates that impact upon risk. The remaining uncertainty is only in our limited ability to infer the function $\pi^\star(Z)$ from the data (this uncertainty is described by $P(\pi^\star|D)$).

would proceed identically in both approaches, but there would be different interpretations of why we would write all this and of the meaning of π(Z):

• conditioning picture:

the cohort is taken to be homogeneous in the covariate-to-risk patterns, and we assume that all individuals with covariates Z have individual hazard rates π(Z) (capturing covariate-to-risk heterogeneity is not possible because we assumed there isn't any)

• sub-cohort picture:

we make no assumptions regarding homogeneity/heterogeneity, but we assume that π(Z) represents sub-cohort level hazard rates (capturing covariate-to-risk heterogeneity is not possible because we lack the information)

The difference between the two interpretations will become relevant when we start thinking about how to capture covariate-to-risk heterogeneity. Then the most suitable starting point will be the conditioning picture, since W(π|Z) is defined in terms of individual cause-specific hazard rates.

Within the conditioning picture, the assumption of cohort homogeneity must be brought in via the prior P(W) in formula (7.21), by choosing



P(W) = ∫ dπ* δ[ W − W(π*) ] P(π*)     (7.35)

Here W(π*) is the covariates-to-rates distribution of a conditionally homogeneous cohort with W(π|Z) = δ[π − π*(Z)], and formula (7.21) becomes

P(W|D) = P(D|W) ∫ dπ* δ[ W − W(π*) ] P(π*) / ∫ dW′ P(D|W′) ∫ dπ* δ[ W′ − W(π*) ] P(π*)

       = ∫ dπ* δ[ W − W(π*) ] P(D|W(π*)) P(π*) / ∫ dπ* P(D|W(π*)) P(π*)

       = ∫ dπ* δ[ W − W(π*) ] P(π*|D)     (7.36)

With (7.36) our earlier prediction formula for the conditioning picture simplifies to

W(π|Z,D) = ∫ dπ* P(π*|D) δ[ π − π*(Z) ]     (7.37)

which, in turn, leads us directly to (7.9,7.10).

7.5 Nonparametrised covariates-to-risk connection

If we do not wish to take into account our uncertainty regarding the covariates-to-risk relations π*(t|Z), we could turn to the simple recipe of the maximum likelihood estimator. We saw earlier that this means finding the maximum of log P(π*|D), but with a flat prior P(π*); equivalently, maximising log P(D|π*). A flat prior also factorises trivially over risks, so we find the disconnected maximisation problems (7.13), with constant functions P(π*_r). Hence the maximum likelihood estimator π̂_r(t|Z) is found by maximising

L(D|π*_r) = ∑_{i=1}^N δ_{r,∆_i} log π*_r(X_i|Z_i) − ∑_{i=1}^N ∫_0^{X_i} dt π*_r(t|Z_i)

  = ∑_{i=1}^N ∫_0^∞ dt [ δ(t−X_i) δ_{r,∆_i} log π*_r(t|Z_i) − θ(X_i−t) π*_r(t|Z_i) ]

  = ∫_0^∞ dt ∑_{i=1}^N ∫ dZ δ(Z−Z_i) [ δ(t−X_i) δ_{r,∆_i} log π*_r(t|Z) − θ(X_i−t) π*_r(t|Z) ]

  = ∫_0^∞ dt ∫ dZ [ log π*_r(t|Z) ∑_{i=1}^N δ(Z−Z_i) δ(t−X_i) δ_{r,∆_i} − π*_r(t|Z) ∑_{i=1}^N δ(Z−Z_i) θ(X_i−t) ]     (7.38)

Straightforward functional differentiation of this latter expression gives:

(∀Z)(∀t ≥ 0):   (1/π*_r(t|Z)) ∑_{i=1}^N δ(Z−Z_i) δ(t−X_i) δ_{r,∆_i} = ∑_{i=1}^N δ(Z−Z_i) θ(X_i−t)     (7.39)

which gives us the maximum likelihood estimator

(∀r)(∀t):   π̂_r(t|Z) = ∑_i δ(Z−Z_i) δ_{r,∆_i} δ(t−X_i) / ∑_i δ(Z−Z_i) θ(X_i−t)     (7.40)

This is very similar to our earlier cohort-level maximum likelihood estimator, but now with the sums over individuals restricted to those with covariates Z. Although formally correct, expressions such as (7.40) are in practice rather useless. The problem is that we are here estimating functions of p + 1 arguments. Even if we reduce our ambition and ask for just five or so points per dimension (a rather small number), and we have e.g. p = 5 covariates (a modest number), we would still already need in excess of 5^{p+1} = 15,625 data points to start covering the space of all (Z, t) combinations. If we want in addition to estimate values of π*_r(t|Z) with, say, 10% accuracy, we need to multiply the number of data points needed further by a factor 100.
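The counting argument above is easy to reproduce. A throwaway check (the five points per axis and the factor 100 are the chapter's illustrative numbers, not fixed recommendations):

```python
def points_needed(p, bins_per_axis=5, accuracy_factor=1):
    """Rough minimum number of data points needed to place `bins_per_axis`
    grid points along each of the p covariate axes plus the time axis."""
    return (bins_per_axis ** (p + 1)) * accuracy_factor

print(points_needed(5))                       # 5^6 = 15625
print(points_needed(5, accuracy_factor=100))  # with ~10% accuracy: 1562500
```

The exponential growth with p is the point: already for a handful of covariates, nonparametric estimation on a grid is hopeless.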

We conclude that, even for conditionally homogeneous cohorts, we have no choice but to find suitable parametrisations of the functions π*_r(t|Z), i.e. we will propose a specific sensible formula for π*_r(t|Z) with a modest number of free parameters, and use the data to estimate these parameters. This is the idea behind Cox regression.

7.6 Examples

Let us get intuition for the effect of Bayesian priors in regression. We saw that nonparametrised maximisation of (7.13) with a flat prior P(π*_r) gives as the most probable hazard rate a 'spiky' estimator (7.40), with δ-functions at the times where the events in the data set occurred. One does not expect the real hazard rate to have spikes; this knowledge can be coded into a prior of the form

P(π*_r) = C(α)^{−1} exp( −α ∫ dZ W(Z) ∫_0^∞ dt ( dπ*_r(t|Z)/dt )² )     (7.41)

with some normalisation constant C(α). This prior 'punishes' explanations with discontinuous behaviour, while reducing to the flat prior for α → 0. Let us choose the simplest example, with just one binary covariate Z_i ∈ {0,1} and just one risk (i.e. ∆_i = 1 for all i, so we can drop the index r). We make the most natural choice W(Z) = (1/2)δ_{Z,0} + (1/2)δ_{Z,1} in (7.41). Expression (7.13) then becomes, apart from an irrelevant normalisation constant,

log P(π*|D) = ∑_{i=1}^N log π*(X_i|Z_i) − ∑_{i=1}^N ∫_0^{X_i} dt π*(t|Z_i)
              − (α/2) ∫_0^∞ dt [ ( dπ*(t|0)/dt )² + ( dπ*(t|1)/dt )² ]

  = ∑_{i=1}^N δ_{Z_i,0} log π*(X_i|0) − ∑_{i=1}^N δ_{Z_i,0} ∫_0^{X_i} dt π*(t|0) − (α/2) ∫_0^∞ dt ( dπ*(t|0)/dt )²
    + ∑_{i=1}^N δ_{Z_i,1} log π*(X_i|1) − ∑_{i=1}^N δ_{Z_i,1} ∫_0^{X_i} dt π*(t|1) − (α/2) ∫_0^∞ dt ( dπ*(t|1)/dt )²     (7.42)

The quantity to be maximised has separated into independent expressions, one for π*(t|0) and one for π*(t|1). For each Z ∈ {0,1} we have to maximise an expression of the form

L_Z(π) = ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ∫_0^{X_i} dt π(t|Z) ] − (α/2) ∫_0^∞ dt ( dπ(t|Z)/dt )²     (7.43)

To differentiate L_Z(π) with respect to π(t|Z) we will use the following identity:

δ/δf(t) ∫_0^∞ ds ( f′(s) )² = 2 ∫_0^∞ ds f′(s) (δ/δf(t)) f′(s)

  = 2 lim_{ε→0} (1/ε) ∫_0^∞ ds f′(s) (δ/δf(t)) ( f(s+ε) − f(s) )

  = 2 lim_{ε→0} (1/ε) ( f′(t−ε) − f′(t) ) = −2 f″(t)     (7.44)

Application to L_Z(π) tells us that the most probable function π(t|Z) is to be solved from

π(t|Z) = 0   or   (1/π(t|Z)) ∑_{i=1}^N δ_{Z_i,Z} δ(t−X_i) − ∑_{i=1}^N δ_{Z_i,Z} θ(X_i−t) + α (d²/dt²) π(t|Z) = 0     (7.45)

This can be rewritten in terms of the maximum likelihood estimator

π̂(t|Z) = ∑_i δ_{Z_i,Z} δ(t−X_i) / ∑_i δ_{Z_i,Z} θ(X_i−t)     (7.46)

as

π(t|Z) = 0   or   (d²/dt²) π(t|Z) = (1/α) ( 1 − π̂(t|Z)/π(t|Z) ) ∑_{i=1}^N δ_{Z_i,Z} θ(X_i−t)     (7.47)

For α → 0 we recover the maximum-likelihood solutions. For α > 0 we will have jumps in the first derivative of π(t|Z), but continuous (i.e. non-spiky) rates π(t|Z), as a consequence of the prior. To calculate the corresponding value of L_Z we rewrite

L_Z(π) = ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ∫_0^{X_i} dt π(t|Z) ] − (α/2) [ π(t|Z) (d/dt)π(t|Z) ]_0^∞
         + (α/2) ∫_0^∞ dt π(t|Z) (d²/dt²) π(t|Z)

  = ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − ∫_0^{X_i} dt π(t|Z) ]
    + (α/2) ( π(0|Z)π′(0|Z) − π(∞|Z)π′(∞|Z) ) + (1/2) ∫_0^∞ dt ( π(t|Z) − π̂(t|Z) ) ∑_{i=1}^N δ_{Z_i,Z} θ(X_i−t)

  = ∑_{i=1}^N δ_{Z_i,Z} [ log π(X_i|Z) − (1/2) ∫_0^{X_i} dt π(t|Z) ] + (α/2) π(0|Z)π′(0|Z)
    − (1/2) ∑_{i=1}^N δ_{Z_i,Z}     (7.48)

Note that a finite nonzero derivative of π(t|Z) as t → ∞ is ruled out, as it would give either negative or diverging hazard rates. It is also clear that the maximum must have π(X_i|Z) > 0 for all i, in view of the term with log π(X_i|Z). Zero rates can only occur in between the data times {X_1, …, X_N}.

Let us inspect the shape of π(t|Z) when it is nonzero, and assume that there are no ties, i.e. X_i ≠ X_j if i ≠ j. We can then order our individuals i such that X_0 < X_1 < X_2 < … < X_{N−1} < X_N (with the definition X_0 ≡ 0). At any time t ∉ {X_1, …, X_N} equation (7.47) simplifies considerably:

t < X_1:             (d²/dt²) π(t|Z) = γ_1(Z) = (1/α) ∑_{i=1}^N δ_{Z_i,Z}     (7.49)

t ∈ (X_ℓ, X_{ℓ+1}):  (d²/dt²) π(t|Z) = γ_{ℓ+1}(Z) = (1/α) ∑_{i=ℓ+1}^N δ_{Z_i,Z}     (7.50)

t ∈ (X_{N−1}, X_N):  (d²/dt²) π(t|Z) = γ_N(Z) = (1/α) δ_{Z_N,Z}     (7.51)

t > X_N:             (d²/dt²) π(t|Z) = 0     (7.52)

In each interval I_ℓ = (X_{ℓ−1}, X_ℓ) we apparently have a hazard rate in the shape of a local parabola:


Fig. 7.2. Left: the maximum likelihood estimator π̂(t|Z) in (7.46) for Z = 0, calculated from a data set with N = 71 patients and a binary covariate Z ∈ {0,1}, in which there are many early and many late events (but few at intermediate times). By definition this estimator always consists of weighted δ-peaks ('spikes') at the observed event times. Right: the most probable solution π*(t|Z) for Z = 0 within the Bayesian formalism, which differs from the previous one in the addition of a 'smoothness' prior P(π). Here α = 50.

t ∈ (X_{ℓ−1}, X_ℓ):   π(t|Z) = (1/2) γ_ℓ (t − t_ℓ)² + δ_ℓ     (7.53)

t > X_N:              π(t|Z) = π(∞|Z)     (7.54)

We only need to determine the constants (t_ℓ, δ_ℓ) for each interval. The solutions in adjacent time intervals are related by the continuity condition, i.e. lim_{ε↓0} π(X_ℓ+ε) = lim_{ε↓0} π(X_ℓ−ε), giving

ℓ < N:   (1/2) γ_ℓ (X_ℓ − t_ℓ)² + δ_ℓ = (1/2) γ_{ℓ+1} (X_ℓ − t_{ℓ+1})² + δ_{ℓ+1}     (7.55)

ℓ = N:   (1/2) γ_N (X_N − t_N)² + δ_N = π(∞|Z)     (7.56)

The second identity which we can use relates to the first derivative of π(t|Z) near each X_ℓ. Integration over both sides of (7.47), with ε > 0, gives:

π′(X_ℓ+ε|Z) − π′(X_ℓ−ε|Z) = (1/α) ∑_{i=1}^N δ_{Z_i,Z} ∫_{X_ℓ−ε}^{X_ℓ+ε} dt ( θ(X_i−t) − δ(t−X_i)/π(X_i|Z) )

  = (1/α) ∑_{i=1}^N δ_{Z_i,Z} [ (t−X_i) θ(X_i−t) − θ(t−X_i)/π(X_i|Z) ]_{X_ℓ−ε}^{X_ℓ+ε}

  = −(1/α) ∑_{i=1}^N ( δ_{Z_i,Z}/π(X_i|Z) ) ( θ(X_ℓ+ε−X_i) − θ(X_ℓ−ε−X_i) )

  = −δ_{Z_ℓ,Z} / ( α π(X_ℓ|Z) )     (7.57)

Thus we find

ℓ < N:   γ_ℓ (X_ℓ − t_ℓ) = γ_{ℓ+1} (X_ℓ − t_{ℓ+1}) + (1/α) δ_{Z_ℓ,Z} / [ (1/2) γ_{ℓ+1} (X_ℓ − t_{ℓ+1})² + δ_{ℓ+1} ]     (7.58)

ℓ = N:   γ_N = 0   or   X_N − t_N = 1/π(∞|Z)     (7.59)

In combination, we end up with the following iteration for the unknown constants (t_ℓ, δ_ℓ), where we note (and use) the fact that t_ℓ is irrelevant if γ_ℓ = 0:

t_ℓ = X_ℓ − (γ_{ℓ+1}/γ_ℓ)(X_ℓ − t_{ℓ+1}) − (1/(α γ_ℓ)) δ_{Z_ℓ,Z} / [ (1/2) γ_{ℓ+1} (X_ℓ − t_{ℓ+1})² + δ_{ℓ+1} ]     (7.60)

δ_ℓ = δ_{ℓ+1} + (1/2) γ_{ℓ+1} (X_ℓ − t_{ℓ+1})² − (1/2) γ_ℓ (X_ℓ − t_ℓ)²     (7.61)

to be iterated downwards, starting with

t_N = X_N − 1/π(∞|Z),     δ_N = π(∞|Z) − γ_N / ( 2 π²(∞|Z) )     (7.62)

The only remaining freedom in our solution is the value chosen for π(∞|Z). This value is determined by the requirement that our solution must maximise expression (7.48). For the present solution π(t|Z) this expression reduces to

L_Z(π) = ∑_{ℓ=1}^N δ_{Z_ℓ,Z} log[ (1/2) γ_ℓ (X_ℓ−t_ℓ)² + δ_ℓ ] − (1/2) ( ∑_{i=1}^N δ_{Z_i,Z} ) [ 1 + t_1 ( (1/2) γ_1 t_1² + δ_1 ) ]
         − (1/2) ∑_{i=1}^N δ_{Z_i,Z} ∑_{ℓ=1}^i ∫_{X_{ℓ−1}}^{X_ℓ} dt ( (1/2) γ_ℓ (t−t_ℓ)² + δ_ℓ )

  = ∑_{i=1}^N δ_{Z_i,Z} { log[ (1/2) γ_i (X_i−t_i)² + δ_i ] − (1/2) [ 1 + t_1 ( (1/2) γ_1 t_1² + δ_1 )
    + ∑_{ℓ=1}^i ( (1/6) γ_ℓ (X_ℓ−t_ℓ)³ − (1/6) γ_ℓ (X_{ℓ−1}−t_ℓ)³ + δ_ℓ (X_ℓ−X_{ℓ−1}) ) ] }     (7.63)

The resulting solution π*(t|Z) is shown for an example data set in Figure 7.2, for α = 50, together with the 'spiky' maximum likelihood estimator π̂(t|Z). The new estimator combines evidence from the data (the event times) with our prior belief that the true hazard rate should be smooth.
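The downward iteration (7.60–7.62) is simple to implement. The sketch below assumes a single covariate class (all δ_{Z_ℓ,Z} = 1, no ties) and treats π(∞|Z) as a free input, which in a full implementation would be tuned to maximise (7.63); the function and variable names are ours, not the book's:

```python
def smooth_hazard(X, alpha, p_inf):
    """Interval constants (t_l, d_l) of the piecewise-parabolic smoothed
    hazard, via the downward iteration (7.60)-(7.62).  X: sorted, tie-free
    event times; all individuals are assumed to share the covariate value Z
    (so every delta_{Z_l,Z} = 1); p_inf is the chosen value of pi(inf|Z)."""
    N = len(X)
    # gamma_l = (1/alpha) * (number of events at or after X_l), eqs (7.49)-(7.51)
    g = [None] + [(N - l + 1) / alpha for l in range(1, N + 1)]
    t = [None] * (N + 1)
    d = [None] * (N + 1)
    t[N] = X[N - 1] - 1.0 / p_inf                       # eq. (7.62)
    d[N] = p_inf - g[N] / (2.0 * p_inf ** 2)
    for l in range(N - 1, 0, -1):
        x = X[l - 1]                                    # this is X_l
        p_right = 0.5 * g[l + 1] * (x - t[l + 1]) ** 2 + d[l + 1]   # pi(X_l|Z)
        t[l] = (x - (g[l + 1] / g[l]) * (x - t[l + 1])
                  - 1.0 / (alpha * g[l] * p_right))     # eq. (7.60)
        d[l] = (d[l + 1] + 0.5 * g[l + 1] * (x - t[l + 1]) ** 2
                         - 0.5 * g[l] * (x - t[l]) ** 2)  # eq. (7.61)
    return g, t, d

def pi_value(g, t, d, l, s):
    """pi(s|Z) inside interval I_l = (X_{l-1}, X_l), eq. (7.53)."""
    return 0.5 * g[l] * (s - t[l]) ** 2 + d[l]
```

By construction the resulting rate is continuous at each X_ℓ, its derivative jumps there by −1/(α π(X_ℓ|Z)) as required by (7.57), and it matches the chosen π(∞|Z) at X_N.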


8

PROPORTIONAL HAZARDS (COX) REGRESSION

Most existing survival analysis protocols that aim to quantify the impact of covariate values on risk, and/or predict survival outcome from an individual's covariates, can be obtained from the general Bayesian description in the previous section, upon implementing specific further complexity reductions. These reductions are always of the following types (often in combination):

• Within the sub-cohort picture: assumptions on the form of π(Z)

These are designed to reduce the complexity of the mathematical formulas, and are formulated via simple low-dimensional parametrisations.

• Within the conditioning picture: assumptions on the form of W(π|Z)

These aim again to reduce the complexity of the mathematical formulas, and are implemented via the prior P(W) (which is set to zero for all W(π|Z) that are not of the assumed form).

• Assumptions on correlations between risks in the cohort

These relate to the interpretation of results. For instance, the assumption of independence is needed if we want to interpret the cause-specific hazard rates of the primary risk as indicative of risk in a world where the other risks are eliminated.

• Mathematical approximations

These are short-cuts which in principle always induce some error. An example is limiting oneself to the most probable value of a parameter, even if its distribution has finite width (e.g. maximum likelihood estimation versus Bayesian regression).

8.1 Definitions, assumptions and regression equations

Definition of Cox regression. 'Proportional hazards regression' or 'Cox regression' (dating from 1972) is a formalism that is indeed obtained from the general Bayesian picture via several simplifications of the type listed above. To appreciate its definition, let us inspect which formulas we could in principle write for the hazard rate of the primary risk. We must demand π_1(t|Z) ≥ 0 for all t ≥ 0 and all Z, so we can always write it in exponential form. If we then also expand the exponent in powers of Z, we see that any acceptable cause-specific hazard rate can be written as


π_1(t|Z) = π_1(t|0) exp( ∑_{μ=1}^p β_μ(t) Z_μ + (1/2) ∑_{μ,ν=1}^p β_{μν}(t) Z_μ Z_ν + O(Z³) )     (8.1)

Cox regression boils down to an inspired simplification of this general expression, crucial at the time of the method's conception, when computation resources were very limited (remember that in 1972 the average university would have just one big but slow computer):

We assume that (conditional on the covariates Z) all risks are statistically independent, and that the cause-specific hazard rate of the primary risk for individuals with covariates Z is a function of the following parametrised form:

π_1(t|Z) = λ_0(t) e^{β·Z}     (8.2)

Here β·Z = ∑_{μ=1}^p β_μ Z_μ, with time-independent parameters β = (β_1, …, β_p).

We then focus on calculating the most probable β and the most probable function λ_0(t).

The function λ_0(t) ≥ 0 is called the 'base hazard rate'. It is the primary risk hazard rate one would find for the trivial covariates Z = (0, 0, …, 0). The name 'proportional hazards' refers to the fact that, due to the exponential form of (8.2), the effect of each covariate is multiplicative:

π_1(t|Z) = λ_0(t) × e^{β_1 Z_1} × ⋯ × e^{β_p Z_p}

with the first factor the base hazard rate and the exponential factors the 'proportional hazards'.

The main implications of (8.2) are that the effects of the covariates are taken to be mutually independent and independent of time. One effectively assumes that there exists a time-independent hyper-plane in covariate space that separates high risk individuals from low risk individuals:

'high risk covariates':   β_1 Z_1 + … + β_p Z_p  large

'low risk covariates':    β_1 Z_1 + … + β_p Z_p  small

In addition we can now quantify the risk impact of each individual covariate μ in a single time-independent number, the so-called 'hazard ratio'

HR_μ = π_1(t|Z)|_{Z_μ=1} / π_1(t|Z)|_{Z_μ=0} = λ_0(t) e^{β_μ·1 + ∑_{ν≠μ} β_ν Z_ν} / λ_0(t) e^{β_μ·0 + ∑_{ν≠μ} β_ν Z_ν} = e^{β_μ}     (8.3)

Covariates with no impact on risk, i.e. with β_μ = 0, would thus give HR_μ = 1. Note that in the more general case (8.1) the ratio π_1(t|Z)|_{Z_μ=1} / π_1(t|Z)|_{Z_μ=0} would still have depended on the remaining covariates Z_ν with ν ≠ μ. The main virtue of the choice (8.2) is that it is the simplest nontrivial definition to meet the main criteria that we need to build into any parametrisation of cause-specific hazard rates (nonnegativity, possible dependence on time and on covariates) in which we can effectively decouple the time variable from the variables relating to covariates.

Derivation of equations for regression parameters. Within Cox regression we seek to find the most probable parameters β = (β_1, …, β_p) and the most probable function λ_0(t) in (8.2). In Cox's original paper he did not in fact calculate λ_0(t) explicitly, but instead focused on calculating β using an argument ('partial likelihood') that avoids having to know the base hazard rate. Here we use the benefit of hindsight and the fact that we have already done much of the preparatory work, and calculate the most probable parameters directly from (7.13) (with a flat prior P(π*_1), where the most probable Bayesian solution reduces to maximum likelihood estimation):

log P(β,λ_0|D) = ∑_{i=1}^N [ δ_{1,∆_i} log π_1(X_i|Z_i) − ∫_0^{X_i} dt π_1(t|Z_i) ] + constant

  = ∑_{i=1}^N [ ∫_0^∞ dt log λ_0(t) δ_{1,∆_i} δ(t−X_i) + δ_{1,∆_i} β·Z_i − e^{β·Z_i} ∫_0^∞ dt θ(X_i−t) λ_0(t) ] + constant     (8.4)

Maximisation of this expression is done as always via the Lagrange formalism. Let us first maximise over λ_0(t), and define L(β|D) = max_{λ_0} log P(β,λ_0|D). It will again turn out that the constraint λ_0(t) ≥ 0 will be met automatically, so the Lagrange equations from which to solve λ_0(t) become

0 = (δ/δλ_0(t)) log P(β,λ_0|D) = (1/λ_0(t)) ∑_{i=1}^N δ_{1,∆_i} δ(t−X_i) − ∑_{i=1}^N e^{β·Z_i} θ(X_i−t)     (8.5)

It follows that the maximising function λ0(t), given β, is

λ_0(t|β) = ∑_{i=1}^N δ_{1,∆_i} δ(t−X_i) / ∑_{i=1}^N e^{β·Z_i} θ(X_i−t)     (8.6)

For β = 0 this expression reduces to the simple estimator π̂_1(t) in (7.40) for the cause-specific hazard rate, as one would expect. Having calculated the most probable base hazard rate (8.6) in terms of the regression parameters β, we are then left with the following function to be maximised over β:

L(β|D) = max_{λ_0} log P(β,λ_0|D)

  = ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i} δ(t−X_i) ) log( ∑_{i=1}^N δ_{1,∆_i} δ(t−X_i) ) + constant
    − ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i} δ(t−X_i) ) log( ∑_{i=1}^N e^{β·Z_i} θ(X_i−t) )
    + ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N e^{β·Z_i} ∫_0^∞ dt θ(X_i−t) [ ∑_{j=1}^N δ_{1,∆_j} δ(t−X_j) / ∑_{j=1}^N e^{β·Z_j} θ(X_j−t) ]

  = L(0|D) − ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i} δ(t−X_i) ) log( ∑_{i=1}^N e^{β·Z_i} θ(X_i−t) )
    + ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N e^{β·Z_i} ∫_0^∞ dt θ(X_i−t) ( ∑_{j=1}^N δ_{1,∆_j} δ(t−X_j) / ∑_{j=1}^N e^{β·Z_j} θ(X_j−t) )
    + ∫_0^∞ dt ( ∑_{i=1}^N δ_{1,∆_i} δ(t−X_i) ) log( ∑_{i=1}^N θ(X_i−t) )
    + ∑_{i=1}^N ∫_0^∞ dt θ(X_i−t) ( ∑_{j=1}^N δ_{1,∆_j} δ(t−X_j) / ∑_{j=1}^N θ(X_j−t) )

  = L(0|D) + ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N δ_{1,∆_i} log( ∑_{j=1}^N e^{β·Z_j} θ(X_j−X_i) )
    + ∑_{i=1}^N δ_{1,∆_i} log( ∑_{j=1}^N θ(X_j−X_i) )

  = L(0|D) + ∑_{i=1}^N δ_{1,∆_i} β·Z_i − ∑_{i=1}^N δ_{1,∆_i} log [ ∑_{j=1}^N e^{β·Z_j} θ(X_j−X_i) / ∑_{j=1}^N θ(X_j−X_i) ]     (8.7)

From this result we can immediately derive by differentiation the equation 0 = ∂L(β|D)/∂β_μ from which to solve for the most probable β, giving

for all μ:   ∑_{i=1}^N δ_{1,∆_i} [ Z^i_μ − ∑_{j=1}^N Z^j_μ e^{β·Z_j} θ(X_j−X_i) / ∑_{j=1}^N e^{β·Z_j} θ(X_j−X_i) ] = 0     (8.8)

This is a relatively simple set of coupled nonlinear equations for just p parameters (β_1, …, β_p), which could indeed be analysed with the computing power of the 1970s. Once the parameters β have been determined, the corresponding hazard ratios follow via (8.3), and the most probable base hazard rate λ_0(t) follows from (8.6).

Finally, once the most probable base hazard rate and the most probable regression parameters β̂ are known, the Cox formalism allows us to predict the survival time for any individual with covariates Z via the primary risk-specific version of (7.4), which now reduces to


S_Cox(t|Z) = exp( −∫_0^t ds π_1(s|Z) ) = exp( −e^{β̂·Z} Λ_0(t|β̂) )     (8.9)

with, upon integrating (8.6) over time,

Λ_0(t|β̂) = ∫_0^t ds λ_0(s|β̂) = ∑_{i=1}^N δ_{1,∆_i} θ(t−X_i) / ∑_{j=1}^N e^{β̂·Z_j} θ(X_j−X_i)     (8.10)

(which is Breslow's estimator, first given in the comments at the end of Cox's 1972 paper). In combination this gives

S_Cox(t|Z) = exp( −∑_{i=1}^N δ_{1,∆_i} e^{β̂·Z} θ(t−X_i) / ∑_{j=1}^N e^{β̂·Z_j} θ(X_j−X_i) )     (8.11)
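For a single covariate, the score equation (8.8) can be solved with a few Newton steps, using the gradient (8.12) and curvature (8.14) derived below; the Breslow estimator (8.10) then follows by a direct sum. The sketch below is our own minimal illustration (θ(0) taken as 1, no safeguards against non-convergence or degenerate data), not production code:

```python
import math

def cox_fit_1d(X, Delta, Z, iters=40):
    """Most probable regression parameter beta for a single covariate,
    solving the score equation (8.8) by Newton iteration."""
    N = len(X)
    beta = 0.0
    for _ in range(iters):
        grad, curv = 0.0, 0.0
        for i in range(N):
            if Delta[i] != 1:
                continue
            # weights e^{beta Z_j} theta(X_j - X_i) over the risk set of i
            w = [math.exp(beta * Z[j]) if X[j] >= X[i] else 0.0
                 for j in range(N)]
            tot = sum(w)
            mean = sum(w[j] * Z[j] for j in range(N)) / tot          # <Z_j>_i
            var = sum(w[j] * (Z[j] - mean) ** 2 for j in range(N)) / tot
            grad += Z[i] - mean        # gradient, cf. (8.12)
            curv -= var                # curvature, cf. (8.14)
        beta -= grad / curv            # Newton step
    return beta

def breslow(X, Delta, Z, beta, t):
    """Cumulative base hazard Lambda_0(t|beta), eq. (8.10)."""
    return sum(1.0 / sum(math.exp(beta * Z[j])
                         for j in range(len(X)) if X[j] >= X[i])
               for i in range(len(X)) if Delta[i] == 1 and X[i] <= t)

# tiny illustrative data set: events alternate between the two groups
X, Delta, Z = [1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1], [1.0, 0.0, 1.0, 0.0]
beta_hat = cox_fit_1d(X, Delta, Z)
print(beta_hat, breslow(X, Delta, Z, beta_hat, 4.0))
```

Since L(β|D) will be shown below to be strictly concave, the Newton iteration has a unique fixed point, which is the maximum.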

8.2 Uniqueness and p-values for regression parameters

Curvature of L(β|D) and uniqueness. To find out whether there could be multiple solutions of equation (8.8) it is helpful to inspect the second derivative (or curvature) of (8.7). We note that (8.8) was derived from the first derivative of L(β|D):

∂L(β|D)/∂β_μ = ∑_i δ_{1,∆_i} [ Z^i_μ − ∑_j Z^j_μ e^{β·Z_j} θ(X_j−X_i) / ∑_j e^{β·Z_j} θ(X_j−X_i) ]     (8.12)

Hence, upon introducing the short-hand

⟨u_j⟩_i = ∑_j p_j(i) u_j,     p_j(i) = e^{β·Z_j} θ(X_j−X_i) / ∑_k e^{β·Z_k} θ(X_k−X_i)     (8.13)

we obtain by further differentiation

∂²L(β|D)/∂β_μ∂β_ν = −∑_i δ_{1,∆_i} [ ⟨Z^j_μ Z^j_ν⟩_i − ⟨Z^j_μ⟩_i ⟨Z^j_ν⟩_i ]

  = −∑_i δ_{1,∆_i} ⟨ (Z^j_μ − ⟨Z^j_μ⟩_i)(Z^j_ν − ⟨Z^j_ν⟩_i) ⟩_i     (8.14)

Unless all event times are equal, the matrix of second derivatives is seen to be negative definite everywhere, since for any vector y ∈ ℝ^p one has

∑_{μ,ν=1}^p y_μ ( ∂²L(β|D)/∂β_μ∂β_ν ) y_ν = −∑_{i=1}^N δ_{1,∆_i} ⟨ ( ∑_{μ=1}^p y_μ (Z^j_μ − ⟨Z^j_μ⟩_i) )² ⟩_i < 0     (8.15)


Hence there will be only one extremal point of L(β|D) and therefore only one solution of (8.8), and we know it will indeed be a maximum. From now on we will write the relevant point as β̂.

Shape of P(β,λ_0|D) near the most probable point. We next write the entries of the curvature matrix of L(β|D) at the most probable point β̂ as minus A_μν (where A is a positive definite matrix):

A_μν(β̂) = −∂²L(β|D)/∂β_μ∂β_ν |_{β̂} = ∑_i δ_{1,∆_i} ⟨ (Z^j_μ − ⟨Z^j_μ⟩_i)(Z^j_ν − ⟨Z^j_ν⟩_i) ⟩_i |_{β̂}     (8.16)

We can now expand L(β|D) close to the maximum point:

L(β|D) = L(β̂|D) − (1/2) ∑_{μν} A_μν(β̂)(β_μ−β̂_μ)(β_ν−β̂_ν) + O(|β−β̂|³)     (8.17)

Given the definition of L(β|D), we can now also write

max_{λ_0} P(β,λ_0|D) = exp( L(β̂|D) − (1/2) ∑_{μν} A_μν(β̂)(β_μ−β̂_μ)(β_ν−β̂_ν) + O(|β−β̂|³) )     (8.18)

It follows that, provided max_{λ_0} P(β,λ_0|D) is a narrow distribution around the most probable point β̂, and provided we can disregard the uncertainty in the base hazard rate⁵, we can approximate P(β|D) = max_{λ_0} P(β,λ_0|D) by a multivariate Gaussian distribution, from which we can also obtain the Gaussian marginals for individual regression parameters:

P(β_μ|D) ≈ ( σ_μ √(2π) )^{−1} e^{ −(β_μ−β̂_μ)² / 2σ_μ² },     σ_μ² = (A^{−1})_{μμ}(β̂)     (8.19)

p-values for Cox regression parameters and hazard ratios. The latter result (8.19) allows us to define approximate p-values for Cox regression. We do this in the usual way: given an observed value of β̂_μ in regression, we define the p-value as the probability to observe |β_μ| ≥ |β̂_μ| in a 'null model'. The null model chosen here is the distribution (8.19) that corresponds to the trivial value β̂ = 0. However, this is a further approximation, since one could also set only β̂_μ = 0 in the null model, leaving the other regression parameters nonzero (in which case the variance in

⁵Note: this is an assumption for which we have no justification yet, but which is essential if in the Cox approach we want to quantify regression uncertainty, since the whole point in Cox regression is to eliminate the base hazard rate from the problem and formulate everything strictly in terms of β alone.


(8.19) could depend, via the matrix A(β), on all other β_ν with ν ≠ μ). If we choose the null model β̂ = 0 we get

P_0(β_μ) = ( σ_μ √(2π) )^{−1} e^{ −β_μ² / 2σ_μ² },     σ_μ² = (A^{−1})_{μμ}(0)     (8.20)

and our p-value approximation will be

p-value = Prob( |β_μ| ≥ |β̂_μ| ) = 1 − ( 2 / (σ_μ √(2π)) ) ∫_0^{|β̂_μ|} dβ e^{−β²/2σ_μ²}

  = 1 − (2/√π) ∫_0^{|β̂_μ|/(σ_μ√2)} dx e^{−x²} = 1 − Erf( |β̂_μ| / (σ_μ√2) )     (8.21)

The ratio |β̂_μ|/σ_μ is called the z-score. Note that the approximations underlying this final simple result (8.21) are quite drastic: (i) forget about uncertainty in the base hazard rate λ_0(t), (ii) approximate the posterior distribution for β by a Gaussian, (iii) assume a null model in which all regression parameters are zero, and (iv) ignore all correlations between regression parameters of different covariates. Note also that the p-values do not measure the possible error introduced by overfitting. We will come back to overfitting later.
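The final formula (8.21) is a one-liner in any language with an error function; a minimal sketch (function name is ours):

```python
import math

def cox_p_value(beta_hat, sigma):
    """Approximate two-sided p-value of eq. (8.21):
    1 - Erf(|beta_hat| / (sigma * sqrt(2)))."""
    return 1.0 - math.erf(abs(beta_hat) / (sigma * math.sqrt(2.0)))

print(cox_p_value(0.0, 1.0))    # 1.0: a zero estimate is fully compatible with the null
print(cox_p_value(1.96, 1.0))   # ≈ 0.05, the familiar z-score threshold
```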

8.3 Properties and limitations of Cox regression

Normalisation of covariates. The optimal value one will find for the regression parameter vector β will obviously depend on the units chosen for the covariates, since only the sums ∑_{μ=1}^p β_μ Z^i_μ appear in the parameter likelihood. For instance, a renormalisation Z^i_μ → ϱ_μ Z^i_μ for all i would simply rescale the most probable regression parameters via β̂_μ → β̂_μ/ϱ_μ. This implies that, unless we prescribe a normalisation convention for the covariates, we cannot use the value of β̂_μ directly as a quantitative measure of the impact of covariate μ on survival. In addition, the definition of the hazard ratio given in (8.3) as yet makes sense only for binary covariates Z_μ ∈ {0,1}. To resolve these problems we need a unified normalisation of the covariates. One natural convention is to choose units for all covariates such that

covariate normalisation:   (1/N) ∑_i Z^i_μ = 0,     (1/N) ∑_i (Z^i_μ)² = 1     (8.22)

Unless a covariate takes the same value for all individuals, this can always be achieved by linear rescaling. Upon adopting (8.22), different components β̂_μ can be compared meaningfully, with those further away from zero implying a more prominent impact of covariates on survival.
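The rescaling to the convention (8.22) is elementary; a sketch (using the population rather than the sample variance, matching the 1/N in (8.22)):

```python
def normalise_covariates(Z):
    """Linearly map each covariate so that (8.22) holds: zero empirical
    mean and unit empirical mean square over the cohort.  Z[i] is the
    covariate vector of individual i."""
    N, p = len(Z), len(Z[0])
    out = [[0.0] * p for _ in range(N)]
    for mu in range(p):
        col = [Z[i][mu] for i in range(N)]
        m = sum(col) / N
        s = (sum((z - m) ** 2 for z in col) / N) ** 0.5
        for i in range(N):
            out[i][mu] = (Z[i][mu] - m) / s
    return out

# a balanced binary covariate is mapped onto {-1, +1}
print(normalise_covariates([[0.0], [1.0], [0.0], [1.0]]))
# → [[-1.0], [1.0], [-1.0], [1.0]]
```

Note that for a balanced binary covariate this reproduces exactly the ±1 coding used below for hazard ratios of normalised covariates.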

Hazard ratios for normalised covariates. With the normalisation (8.22) we can also generalise our previous definition (8.3) of hazard ratios (which as yet applied to binary covariates only) to include arbitrary (e.g. real-valued) covariates. In a balanced cohort and for binary covariates the normalisation (8.22) would imply Z_μ ∈ {−1,1}, and consistency with our earlier definition (8.3) of hazard ratios would then demand that we define

HR_μ = π_1(t|Z)|_{Z_μ=1} / π_1(t|Z)|_{Z_μ=−1} = λ_0(t) e^{∑_{ν≠μ} β_ν Z_ν + β_μ} / λ_0(t) e^{∑_{ν≠μ} β_ν Z_ν − β_μ} = e^{2β_μ}     (8.23)

This is the appropriate hazard ratio definition corresponding to the convention (8.22). The Gaussian approximation (8.19) allows us to calculate for each covariate μ the so-called 95% confidence intervals for hazard ratios:

[HR⁻_μ, HR⁺_μ],     HR^±_μ = e^{2(β̂_μ ± d_μ)}     (d_μ > 0)     (8.24)

such that Prob(β̂_μ−d_μ < β_μ < β̂_μ+d_μ) = 0.95. From (8.19) we can calculate the quantity d_μ:

0.95 = ∫_{β̂_μ−d_μ}^{β̂_μ+d_μ} ( dβ_μ / (σ_μ √(2π)) ) e^{ −(β_μ−β̂_μ)² / 2σ_μ² } = Erf( d_μ / (σ_μ√2) )     (8.25)

Hence we can express dµ via the inverse error function as

d_μ = √2 Erf^{−1}(0.95) σ_μ ≈ 1.96 σ_μ     (8.26)

To be on the safe side, since the above is still an approximation in view of the Gaussian assumption for the β_μ-distribution, many authors in fact use d_μ = 2σ_μ. This gives the convention

95% confidence interval:   HR_μ ∈ [ e^{2(β̂_μ−2σ_μ)}, e^{2(β̂_μ+2σ_μ)} ]     (8.27)
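A fitted β̂_μ and its width σ_μ thus translate directly into a hazard ratio and its interval; a sketch under the convention (8.27), i.e. with ±1 covariate coding and the conservative factor 2:

```python
import math

def hazard_ratio_ci(beta_hat, sigma, width=2.0):
    """Hazard ratio e^{2 beta} with its approximate 95% confidence
    interval under convention (8.27); width=2 is the conservative
    rounding of 1.96 discussed in the text."""
    hr = math.exp(2.0 * beta_hat)
    lo = math.exp(2.0 * (beta_hat - width * sigma))
    hi = math.exp(2.0 * (beta_hat + width * sigma))
    return hr, lo, hi

print(hazard_ratio_ci(0.0, 0.1))   # HR = 1.0 for a covariate with no effect
```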

Univariate versus multivariate regression and correlated covariates. The proportional hazards assumption in Cox regression, i.e. that different covariates each contribute an independent multiplicative factor to the primary risk hazard rate, will be violated as soon as covariates are correlated. We should expect strictly uncorrelated covariates to be the exception rather than the norm. This has implications. It can cause degeneracies in the parameter likelihood, such that there is no longer a unique optimal regression vector. To see this, just consider the extreme case where we have just two covariates Z_{1,2} ∈ {−1,1}, and these two covariates contain exactly the same information:

Z_2 = Z_1:   π_1(t|Z) = λ_0(t) e^{β_1Z_1+β_2Z_2} = λ_0(t) e^{(β_1+β_2)Z_1}

In such a case one can at best find an optimal linear combination of covariates, but no unique values; any risk evidence in the covariates can be shared in arbitrary ratios among the two covariates.


A further consequence of covariate correlations is that there will now be a difference between the regression parameters β_μ that one would find in univariate regression, i.e. in Cox regression using π_1(t|Z_μ) = λ_0(t) e^{β_μZ_μ} (with only one covariate μ included), and multivariate regression, where π_1(t|Z) = λ_0(t) e^{∑_{ν=1}^p β_νZ_ν} with μ ∈ {1,…,p}. If covariate μ correlates with one or more other covariates, then in multivariate regression the predictive evidence will generally be 'shared' among the covariates, whereas in univariate regression it will not. Again, to appreciate this just consider the previous example, with Z_1 = Z_2 = Z:

univariate regression:   π_1(t|Z_1) = λ_0(t) e^{β_1Z},   π_1(t|Z_2) = λ_0(t) e^{β_2Z}     (8.28)

bivariate regression:    π_1(t|Z_1,Z_2) = λ_0(t) e^{(β′_1+β′_2)Z}     (8.29)

Here (β_1, β_2) are the regression parameters found upon studying the impact of the two covariates via univariate regression, and (β′_1, β′_2) are the regression parameters found via bivariate regression. Since the data likelihood only depends on the cause-specific hazard rates, we will always find β_1 = β_2 = β′_1 + β′_2. Hence, unless β_1 = 0 or β_2 = 0, one will inevitably have β_1 ≠ β′_1 or β_2 ≠ β′_2 (or both). In fact, even for uncorrelated covariates and infinitely large data sets (where there are no issues with finite size corrections and uncertainties) one will still generally find different regression parameters when comparing univariate to multivariate regression – see Example 1 at the end of this chapter. The regression parameters (and hence also the hazard ratios) depend on the modelling context, i.e. on exactly which other covariates were included. They are not objective quantitative measures of the impact of individual covariates on risk.

Issues related to inclusion of treatment parameters as covariates. Often one includes treatment parameters as covariates, with the objective of quantifying treatment effect on survival. As soon as such treatment decisions involve human judgement based on observing covariates (as they usually do), this may affect the outcome of regression for the initial covariates:

• Adding the treatment decision as a new covariate by definition turns the extended covariate set into a correlated one, even if the initial set was not. For instance, assume a covariate Z_1 ∈ {1,2,3} indicates the grade of a tumour, and the clinical protocol is to give a patient with grade 2 or 3 chemotherapy. Then we could indicate this decision with a variable Z_2 = θ(Z_1 − 3/2) (with the step function θ), and the pair (Z_1, Z_2) will be strongly correlated.

• Provided they are medically effective, treatments will reduce or undo any patterns that connect the other covariates to risk: if high-risk patients are correctly identified from covariates (implying a significant predictive signal in the covariates), and selected for medical treatment, then these individuals are thereby converted by this treatment to low-risk patients. This removes the prior link between covariates and medical outcome.



Issues related to interpretation of regression parameters. Once the regression parameters β have been calculated, and assuming there are no problems caused by risk correlations, there are still pitfalls in the interpretation of these parameters. To name two:

• In heterogeneous cohorts one will typically find dependence of the regression parameters on the duration of the trial, with these parameters moving closer to zero with longer trial durations. This need not be due to a true time dependence, but may well reflect ‘cohort filtering’, similar to the mechanism underlying figure 5.1.

• One should not interpret nonzero regression parameters as evidence for causal effects of the associated covariates on risk. Finding βµ > 0 simply tells us that individuals with larger Zµ are more likely to experience the primary hazard. Hazard and covariate could both be consequences of a common cause, or perhaps the impending hazard could even cause the elevated covariate. Imagine what we would find upon including the frequency of hospital visits as a covariate; we would undoubtedly find a significantly large associated regression parameter, as individuals who visit hospitals more often are more likely to be ill. Naive interpretation of the outcome of regression would then lead us to recommend that hospital visits should generally be avoided.

8.4 Examples

Example 1: multivariate versus univariate analysis

We explore further the differences between univariate Cox regression (one covariate included at a time) and multivariate regression (with multiple covariates simultaneously), and the effects of covariate correlations. Assume for simplicity that we have just one risk (so ∆i = 1 for all i) and only two covariates, and start from expression (8.7), with β = (β1, β2) and Z_i = (Z_{i1}, Z_{i2}):

  L(β) = Σ_{i=1}^N β·Z_i − Σ_{i=1}^N log( (1/N) Σ_{j=1}^N e^{β·Z_j} θ(X_j−X_i) ) + constant   (8.30)

Let us assume our data are of the form X_i = f(Z_{i1}+Z_{i2}), where f(x) is some monotonically decreasing function (so those with larger values of Z_{i1}+Z_{i2} experience primary events earlier). This implies that θ(X_j−X_i) = θ[Z_{i1}+Z_{i2}−Z_{j1}−Z_{j2}]. In addition we define L̄(β) = L(β)/N, and assume that all Z_i are drawn randomly from a zero-average distribution P(Z). For very large populations we will then find, using the law of large numbers:

  lim_{N→∞} L̄(β) = lim_{N→∞} (1/N) Σ_{i=1}^N β·Z_i
      − lim_{N→∞} (1/N) Σ_{i=1}^N log( (1/N) Σ_{j=1}^N e^{β·Z_j} θ[Z_{i1}+Z_{i2}−Z_{j1}−Z_{j2}] ) + constant

  = −⟨ log⟨ e^{β·Z′} θ[Z_1+Z_2−Z′_1−Z′_2] ⟩_{Z′} ⟩_Z + constant   (8.31)

in which ⟨...⟩_Z = ∫dZ P(Z)(...). Our different Cox regression versions are now:

  multivariate : find min_{β1,β2} ⟨ log⟨ e^{β1Z′_1+β2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ⟩_{Z′} ⟩_Z   (8.32)

  univariate :   find min_{β1} ⟨ log⟨ e^{β1Z′_1} θ[Z_1+Z_2−Z′_1−Z′_2] ⟩_{Z′} ⟩_Z   (8.33)

                 find min_{β2} ⟨ log⟨ e^{β2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ⟩_{Z′} ⟩_Z   (8.34)

Upon doing the required differentiations with respect to the regression parameters, we then find the following equations from which to solve β1 and β2:

  multivariate :
  ∫dZ P(Z) [ ∫dZ′ P(Z′) Z′_1 e^{β1Z′_1+β2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′ P(Z′) e^{β1Z′_1+β2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0   (8.35)

  ∫dZ P(Z) [ ∫dZ′ P(Z′) Z′_2 e^{β1Z′_1+β2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′ P(Z′) e^{β1Z′_1+β2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0   (8.36)

  univariate :
  ∫dZ P(Z) [ ∫dZ′ P(Z′) Z′_1 e^{β1Z′_1} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′ P(Z′) e^{β1Z′_1} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0   (8.37)

  ∫dZ P(Z) [ ∫dZ′ P(Z′) Z′_2 e^{β2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′ P(Z′) e^{β2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] ] = 0   (8.38)

We next work out two choices for the covariate statistics P(Z), which we take to be zero-average Gaussian. In the first choice the covariates are identical, P(Z_1, Z_2) = δ(Z_2−Z_1) e^{−Z_1²/2}/√(2π). In the second choice they are independent: P(Z_1, Z_2) = (e^{−Z_1²/2}/√(2π))(e^{−Z_2²/2}/√(2π)).

• Correlated (identical) covariates:

Here our previous equations from which to solve (β1, β2) can all be written in terms of the following function:

  F(u) = ∫dZ (e^{−Z²/2}/√(2π)) [ ∫dY Y e^{−Y²/2+uY} θ[Z−Y] / ∫dY e^{−Y²/2+uY} θ[Z−Y] ]   (8.39)

To be specific, one finds

  multivariate Cox regression : F(β1+β2) = 0   (8.40)

  univariate Cox regression : F(β1) = F(β2) = 0   (8.41)

One can easily prove that the function F(u) is convex, i.e. F″(u) > 0 for all u, so the equation F(u) = 0 has exactly one solution u⋆. Thus we find:

  multivariate Cox regression : β1 + β2 = u⋆   (8.42)

  univariate Cox regression : β1 = β2 = u⋆   (8.43)

As expected, multivariate regression and univariate regression do not lead to the same regression parameters. In univariate regression each individual covariate provides the same amount of evidence for survival outcome, quantified by the value u⋆ for each of β1 and β2. In multivariate regression, in contrast, the evidence is shared between the two covariates.

• Uncorrelated covariates:

Here, with P(Z_1, Z_2) = e^{−(Z_1²+Z_2²)/2}/2π, the calculations are slightly more involved. It will be advantageous to first transform the variables Z and Z′ according to

  X_1 = (Z_1+Z_2)/√2,   X_2 = (Z_1−Z_2)/√2   (8.44)

  P(X_1, X_2) = (1/2π) e^{−(X_1²+X_2²)/2}   (8.45)

This will give θ[Z_1+Z_2−Z′_1−Z′_2] = θ[X_1−X′_1]. Let us start with the ratio of integrals appearing in our expression for multivariate analysis:

  ∫dZ′ P(Z′) Z′_1 e^{β1Z′_1+β2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2] / ∫dZ′ P(Z′) e^{β1Z′_1+β2Z′_2} θ[Z_1+Z_2−Z′_1−Z′_2]

  = (1/√2) ∫dX′ e^{−(X′_1²+X′_2²)/2} (X′_1+X′_2) e^{(1/√2)[X′_1(β1+β2)+X′_2(β1−β2)]} θ[X_1−X′_1]
    / ∫dX′ e^{−(X′_1²+X′_2²)/2} e^{(1/√2)[X′_1(β1+β2)+X′_2(β1−β2)]} θ[X_1−X′_1]

  = (1/√2) ∫dX′_1 X′_1 e^{−X′_1²/2+(1/√2)X′_1(β1+β2)} θ[X_1−X′_1] / ∫dX′_1 e^{−X′_1²/2+(1/√2)X′_1(β1+β2)} θ[X_1−X′_1]
    + (1/√2) ∫dX′_2 X′_2 e^{−X′_2²/2+(1/√2)X′_2(β1−β2)} / ∫dX′_2 e^{−X′_2²/2+(1/√2)X′_2(β1−β2)}

  = (1/√2) ∫_{−∞}^{X_1} dx x e^{−x²/2+(1/√2)x(β1+β2)} / ∫_{−∞}^{X_1} dx e^{−x²/2+(1/√2)x(β1+β2)} + (1/2)(β1−β2)

  = (1/2)(β1+β2) − (1/√2) ∫_{−∞}^{X_1} dx (d/dx) e^{−(1/2)[x−(β1+β2)/√2]²} / ∫_{−∞}^{X_1} dx e^{−(1/2)[x−(β1+β2)/√2]²} + (1/2)(β1−β2)

  = β1 − (1/√2) e^{−(1/2)[X_1−(β1+β2)/√2]²} / ∫_{−∞}^{X_1−(β1+β2)/√2} dx e^{−x²/2}

  = β1 − (1/√π) e^{−(1/2)[X_1−(β1+β2)/√2]²} / [ 1 + Erf( (X_1−(β1+β2)/√2)/√2 ) ]   (8.46)

After further averaging over X we then find that all equations for regression parameters can now be written in terms of the function

  G(u) = ∫dx (e^{−x²/2}/√(2π)) (1/√π) e^{−(1/2)[x−u/√2]²} / [ 1 + Erf((x−u/√2)/√2) ]

       = ∫ (dx/π√2) e^{−x²/2−(1/2)[x+u/√2]²} / [ 1 + Erf(x/√2) ]   (8.47)

To be specific, we find

  multivariate Cox regression : β1 = G(β1+β2),  β2 = G(β1+β2)   (8.48)

  univariate Cox regression : β1 = G(β1),  β2 = G(β2)   (8.49)

So for either version of Cox analysis we have β1 = β2 = β, but the equations from which to solve β, and therefore the values found for β, are not identical in the two cases. Even when covariates are not correlated, one can apparently still find different regression parameters and hazard ratios when comparing univariate to multivariate regression:

  multivariate Cox regression : β = G(2β)   (8.50)

  univariate Cox regression : β = G(β)   (8.51)
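The dependence of fitted regression parameters on the modelling context is easy to reproduce numerically. The sketch below is our own illustration (not part of the manuscript; function names, sample size and parameter values are our choices): it maximises the Cox partial log-likelihood of (8.30) by Newton–Raphson on synthetic data with correlated covariates and true parameters β1 = β2 = 1, and compares the multivariate fit with a univariate fit on Z1 alone.

```python
import numpy as np

def cox_partial_fit(Z, T, iters=25):
    """Newton-Raphson maximisation of the Cox partial log-likelihood
    (one risk, no censoring, no ties), as in (8.30)."""
    N, p = Z.shape
    beta = np.zeros(p)
    at_risk = (T[None, :] >= T[:, None]).astype(float)   # theta(X_j - X_i)
    for _ in range(iters):
        w = np.exp(Z @ beta)                 # e^{beta . Z_j}
        W = at_risk * w[None, :]             # risk-set weights for each event i
        S0 = W.sum(axis=1)
        mu = (W @ Z) / S0[:, None]           # weighted covariate mean per risk set
        grad = (Z - mu).sum(axis=0)
        H = np.zeros((p, p))                 # Hessian (negative definite)
        for i in range(N):
            S2 = (W[i, :, None] * Z).T @ Z / S0[i]
            H -= S2 - np.outer(mu[i], mu[i])
        beta = beta - np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(0)
N, rho = 1500, 0.7
Z1 = rng.normal(size=N)
Z2 = rho * Z1 + np.sqrt(1 - rho**2) * rng.normal(size=N)   # correlated pair
Z = np.column_stack([Z1, Z2])
# event times with true Cox hazards exp(Z1 + Z2) and constant base hazard
T = rng.exponential(scale=np.exp(-(Z1 + Z2)))

beta_multi = cox_partial_fit(Z, T)          # both covariates at once
beta_uni1 = cox_partial_fit(Z[:, :1], T)    # Z1 on its own
```

With these (arbitrary) settings the multivariate fit recovers values close to the true (1, 1), while the univariate fit for Z1 comes out substantially larger, because Z1 partly absorbs the effect of the omitted correlated covariate Z2.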

Example 2: effect of duration of trial on regression parameters

Imagine we have data D on a cohort of size N, with one primary risk and the end-of-trial risk. The trial is terminated at time τ > 0. All individuals i with a primary event prior to time τ will have ∆i = 1, and all others will have ∆i = 0. This implies that if t_i is the time at which the primary event would occur for individual i, then the actually reported data will be

  (X_i, ∆i) = (t_i, 1) if t_i < τ,   (τ, 0) if t_i ≥ τ   (8.52)

We measure one binary covariate Z_i ∈ {0, 1}, and we assume that our cohort is heterogeneous, involving two distinct event time distributions. We write the probability to find t_i = t as p_i(t):

  i ≤ N/2 : p_i(t) = a e^{−at},   i > N/2 : p_i(t) = a(1+Z_i) e^{−a(1+Z_i)t}   (8.53)

with a > 0. Let us first determine the true cause-specific hazard rates for the above example. The individual data distributions are

  P_i(X, ∆) = δ_{∆,1} θ(τ−X) p_i(X) + δ_{∆,0} δ(X−τ) ∫_τ^∞ dt p_i(t)   (8.54)

We note that

  ∫_X^∞ dt [P_i(t,0) + P_i(t,1)] = ∫_X^∞ dt [ θ(τ−t) p_i(t) + δ(t−τ) ∫_τ^∞ ds p_i(s) ] = ∫_X^∞ dt p_i(t)   (8.55)

Via (2.18) we can now calculate the individual cause-specific hazard rates:

  π_{i0}(X) = δ(X−τ) ∫_τ^∞ dt p_i(t) / ∫_X^∞ dt [P_i(t,0)+P_i(t,1)] = δ(X−τ) ∫_τ^∞ dt p_i(t) / ∫_X^∞ dt p_i(t) = δ(X−τ)   (8.56)

  π_{i1}(X) = θ(τ−X) p_i(X) / ∫_X^∞ dt [P_i(t,0)+P_i(t,1)] = θ(τ−X) p_i(X) / ∫_X^∞ dt p_i(t)   (8.57)

We note that all individuals in the cohort have exponential event time distributions for the primary risk, so prior to the trial termination time τ all should have time-independent cause-specific hazard rates for the primary risk. This indeed follows from the above formula:

  X > τ, all i : π_{i1}(X) = 0   (8.58)

  X < τ, i ≤ N/2 : π_{i1}(X) = a e^{−aX} / ∫_X^∞ dt a e^{−at} = a   (8.59)

  X < τ, i > N/2 : π_{i1}(X) = a(1+Z_i) e^{−a(1+Z_i)X} / ∫_X^∞ dt a(1+Z_i) e^{−a(1+Z_i)t} = a(1+Z_i)   (8.60)

The covariate is positively associated with the risk, since a value Z_i = 1 shortens the average time to the primary event by a factor two for half of our cohort. We draw all Z_i randomly and independently from P(Z) = ½δ_{Z,1} + ½δ_{Z,0}. Note that we can write all individual hazard rates above in the Cox form, with the same base hazard rate but with different regression parameters for the two sub-groups:

  i ≤ N/2 : π_{i1}(t) = λ_0(t) e^{βZ_i},   λ_0(t) = a θ(τ−t),   β = 0   (8.61)

  i > N/2 : π_{i1}(t) = λ_0(t) e^{βZ_i},   λ_0(t) = a θ(τ−t),   β = ln(2) ≈ 0.693   (8.62)

From this, in turn, we can calculate the true sub-cohort primary risk hazard rate, via (7.3) (in which we abbreviate β_i = 0 if i ≤ N/2 and β_i = ln(2) for i > N/2):

  π_1(t|Z) = Σ_{i∈Ω_Z} π_{i1}(t) e^{−∫_0^t ds [δ(τ−s)+aθ(τ−s)e^{β_iZ_i}]} / Σ_{i∈Ω_Z} e^{−∫_0^t ds [δ(τ−s)+aθ(τ−s)e^{β_iZ_i}]}

           = Σ_{i∈Ω_Z} π_{i1}(t) e^{−a e^{β_iZ_i} ∫_0^t ds θ(τ−s)} / Σ_{i∈Ω_Z} e^{−a e^{β_iZ_i} ∫_0^t ds θ(τ−s)}   (8.63)

We clearly always have π_1(t|Z) = 0 for t > τ. For t < τ we find

  π_1(t<τ|0) = Σ_i (1−Z_i) π_{i1}(t) e^{−at} / Σ_i (1−Z_i) e^{−at} = a Σ_i (1−Z_i) / Σ_i (1−Z_i) = a   (8.64)

  π_1(t<τ|1) = [ Σ_{i≤N/2} Z_i π_{i1}(t) e^{−at} + Σ_{i>N/2} Z_i π_{i1}(t) e^{−2at} ] / [ Σ_{i≤N/2} Z_i e^{−at} + Σ_{i>N/2} Z_i e^{−2at} ]

             = a [ Σ_{i≤N/2} Z_i + 2 Σ_{i>N/2} Z_i e^{−at} ] / [ Σ_{i≤N/2} Z_i + Σ_{i>N/2} Z_i e^{−at} ]   (8.65)

For N → ∞ this would become

  π_1(t<τ|0) = a,   π_1(t<τ|1) = a (1+2e^{−at}) / (1+e^{−at})   (8.66)

So in combination we can write

  π_1(t|Z) = a θ(τ−t) e^{β(t)Z},   β(t) = ln( (1+2e^{−at}) / (1+e^{−at}) )   (8.67)

This shows that the effect identified earlier, of heterogeneous cohorts giving decaying hazard rates even if all individual hazard rates are strictly time independent, also impacts on regression parameters. Here we find that the sub-cohort primary hazard rate is nearly of the Cox form, but with a time dependent regression parameter, which is not allowed in Cox regression.

According to (8.7) we want to maximise in Cox regression the following quantity over β (apart from an irrelevant constant):

  L(β|D) = (β/N) Σ_{i=1}^N δ_{1,∆i} Z_i − (1/N) Σ_{i=1}^N δ_{1,∆i} log[ (1/N) Σ_{j=1}^N e^{βZ_j} θ(X_j−X_i) ]

         = (β/N) Σ_{i=1}^N θ(τ−t_i) Z_i − (1/N) Σ_{i=1}^N θ(τ−t_i) log[ (1/N) Σ_{j=1}^N e^{βZ_j} ( θ(τ−t_j)θ(t_j−t_i) + θ(t_j−τ)θ(τ−t_i) ) ]   (8.68)

We now inspect the case where our cohort is very large, so that we may send N → ∞. By the law of large numbers we then obtain

  lim_{N→∞} L(β|D)

  = β ⟨ ½Z ∫_0^τ dt [a e^{−at} + a(1+Z) e^{−a(1+Z)t}] ⟩_Z
    − ⟨ ½ ∫_0^τ dt [a e^{−at} + a(1+Z) e^{−a(1+Z)t}] log⟨ ½ e^{βZ′} ∫_t^∞ dt′ ( a e^{−at′} + a(1+Z′) e^{−a(1+Z′)t′} ) ⟩_{Z′} ⟩_Z

  = β ⟨ ½Z ( 2 − e^{−aτ} − e^{−a(1+Z)τ} ) ⟩_Z
    − ⟨ ½ ∫_0^τ dt [a e^{−at} + a(1+Z) e^{−a(1+Z)t}] log⟨ ½ e^{βZ′} ( e^{−at} + e^{−a(1+Z′)t} ) ⟩_{Z′} ⟩_Z

  = ¼β ( 2 − e^{−aτ} − e^{−2aτ} )
    − ⟨ ½ ∫_0^τ dt [a e^{−at} + a(1+Z) e^{−a(1+Z)t}] log[ ½ e^{−at} ] ⟩_Z
    − ⟨ ½ ∫_0^τ dt [a e^{−at} + a(1+Z) e^{−a(1+Z)t}] log[ 1 + ½ e^{β}(1+e^{−at}) ] ⟩_Z   (8.69)

We need to calculate the maximum with respect to β of this expression. Differentiation with respect to β gives

  4 (d/dβ) lim_{N→∞} L(β|D)

  = 2 − e^{−aτ} − e^{−2aτ} − ∫_0^τ dt ( 3a e^{−at} + 2a e^{−2at} ) e^{β}(1+e^{−at}) / [ 2 + e^{β}(1+e^{−at}) ]

  = 2 − e^{−aτ} − e^{−2aτ} − ∫_0^τ dt ( 3a e^{−at} + 2a e^{−2at} ) + 2a ∫_0^τ dt ( 3 e^{−at} + 2 e^{−2at} ) / [ 2 + e^{β}(1+e^{−at}) ]

  = 2 ∫_0^{aτ} ds ( 3 e^{−s} + 2 e^{−2s} ) / [ 2 + e^{β}(1+e^{−s}) ] − 2(1 − e^{−aτ})   (8.70)

So β is the unique solution of

  ∫_0^{aτ} ds ( 3 e^{−s} + 2 e^{−2s} ) / [ 2 + e^{β}(1+e^{−s}) ] = 1 − e^{−aτ}   (8.71)

which via the transformation x = e^{−s} can be rewritten as

  1 − e^{−aτ} = ∫_{e^{−aτ}}^1 dx (3+2x) / [ 2 + e^{β}(1+x) ] = 2 e^{−β} ∫_{e^{−aτ}}^1 dx ( 3/2 + x ) / ( 2e^{−β} + 1 + x )

  = 2 e^{−β}(1 − e^{−aτ}) + 2 e^{−β}( 1/2 − 2e^{−β} ) ∫_{e^{−aτ}}^1 dx / ( 2e^{−β} + 1 + x )

  = 2 e^{−β}(1 − e^{−aτ}) + e^{−β}( 1 − 4e^{−β} ) log( (2e^{−β}+2) / (2e^{−β}+1+e^{−aτ}) )   (8.72)

Hence we get

  (1 − e^{−aτ})(1 − 2e^{−β}) = e^{−β}( 1 − 4e^{−β} ) log( (2e^{−β}+2) / (2e^{−β}+1+e^{−aτ}) )   (8.73)

Fig. 8.1. The most probable parameter value β in Cox regression, for the example data (8.52, 8.53), in the limit N → ∞. All individuals have strictly time-independent hazard rates, but due to cohort filtering the cohort-level primary hazard rate becomes time dependent. In Cox regression this is not allowed, and as a consequence one finds that the most probable parameter β becomes dependent on (and decays with) the duration of the trial.

For small τ we find

  aτ(1 − 2e^{−β}) + O((aτ)²) = −e^{−β}(1 − 4e^{−β}) log( 1 − aτ/(2e^{−β}+2) )

  (1 − 2e^{−β}) + O(aτ) = e^{−β}(1 − 4e^{−β}) / (2e^{−β}+2) + O(aτ)

  2(1 − 2e^{−β})(e^{−β}+1) = e^{−β}(1 − 4e^{−β}) + O(aτ)

  2 = 3e^{−β} + O(aτ),   so β = ln(3/2) + O(aτ) ≈ 0.405 + O(aτ)   (8.74)

Numerical solution of β from equation (8.73) for different values of the trial cut-off time τ results in the curve of figure 8.1. Here the cohort ‘filtering’ results by definition in a time-independent regression parameter β (since this is what Cox regression allows for), but the value found for β decays with increasing trial durations, in spite of the fact that at the level of individuals there is not a single time-dependent risk parameter.
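Equation (8.73) is straightforward to solve numerically by bisection. The sketch below is our own illustration (function names and the bracketing interval are our choices, valid for the values of aτ used here); it reproduces the small-τ limit β = ln(3/2) of (8.74) and the decay of β with trial duration seen in figure 8.1.

```python
import numpy as np

def h(beta, atau):
    """Left minus right hand side of (8.73); its root is the Cox parameter beta."""
    e = np.exp(-beta)
    lhs = (1.0 - np.exp(-atau)) * (1.0 - 2.0 * e)
    rhs = e * (1.0 - 4.0 * e) * np.log((2 * e + 2) / (2 * e + 1 + np.exp(-atau)))
    return lhs - rhs

def solve_beta(atau, lo=0.01, hi=2.0):
    """Bisection for the root of (8.73); h is negative at lo and positive at hi
    for the values of atau considered here."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if h(lo, atau) * h(mid, atau) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# fitted beta for increasing (dimensionless) trial durations a*tau
betas = {atau: solve_beta(atau) for atau in (0.01, 0.5, 2.0, 5.0)}
```

For aτ → 0 the solution approaches ln(3/2) ≈ 0.405, and the fitted β decreases monotonically as the trial duration grows, exactly the cohort-filtering effect described in the text.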

Example 3: parametrisation of the base hazard rate

The derivation of Cox’s regression equations for the parameters β was based on first maximising (8.4) over the base hazard rates, giving (8.6), which was then substituted into (8.4) and subsequently maximised over β. However, it is clear that the sum over δ-peaks (8.6) is a maximum-likelihood estimator which is only realistic for infinitely large cohorts; we expect the true base hazard rate to be smooth in time. We have already seen earlier that in the Bayesian formalism one could deal with this via a suitable smoothness prior (although this leads to complicated equations). An alternative route for implementing smoothness of the base hazard rate within the Cox formalism is to insert into (8.4) a simple parametrised form λ_0(t|θ):

  log P(β,θ|D) = Σ_{i=1}^N δ_{1,∆i} log λ_0(X_i|θ) + β · Σ_{i=1}^N δ_{1,∆i} Z_i − Σ_{i=1}^N e^{β·Z_i} ∫_0^{X_i} dt λ_0(t|θ)   (8.75)

We now maximise this expression over (β,θ) instead of (β,λ_0). One tends to look for parametrisations for which the integral in the last term can be done analytically. For instance, a popular parametrisation is

  λ_0(t|y,τ) = (y/τ)(t/τ)^{y−1},   τ > 0,  y > 0   (8.76)

which gives us the following quantity, to be maximised over (τ, y, β):

  L(β,y,τ|D) = log(y/τ^y) Σ_i δ_{1,∆i} + (y−1) Σ_i δ_{1,∆i} log X_i + β · Σ_i δ_{1,∆i} Z_i − Σ_i e^{β·Z_i} (X_i/τ)^y   (8.77)

• We first maximise (8.77) over τ via

  ∂_τ L(β,y,τ|D) = (y/τ) [ Σ_{i=1}^N e^{β·Z_i} (X_i/τ)^y − Σ_{i=1}^N δ_{1,∆i} ]   (8.78)

giving

  τ^y = Σ_{i=1}^N e^{β·Z_i} (X_i)^y / Σ_{i=1}^N δ_{1,∆i}   (8.79)


Upon substituting this optimal value for the time-scale parameter τ into (8.77), we are left with the following function to be maximised over (y,β):

  L(β,y|D) = log( y Σ_i δ_{1,∆i} / Σ_i e^{β·Z_i} (X_i)^y ) ( Σ_i δ_{1,∆i} ) + (y−1) Σ_i δ_{1,∆i} log X_i + β · Σ_i δ_{1,∆i} Z_i − Σ_i δ_{1,∆i}   (8.80)

• We next maximise the expression L(β,y|D) over (y,β), via

  ∂_y L(β,y|D) = ( Σ_i δ_{1,∆i} ) (∂/∂y) log( y Σ_i δ_{1,∆i} / Σ_i e^{β·Z_i} (X_i)^y ) + Σ_i δ_{1,∆i} log X_i

  = ( Σ_i δ_{1,∆i} ) [ 1/y − Σ_i e^{β·Z_i} log(X_i) (X_i)^y / Σ_i e^{β·Z_i} (X_i)^y ] + Σ_i δ_{1,∆i} log X_i   (8.81)

and

  ∂_{βµ} L(β,y|D) = Σ_i δ_{1,∆i} Z_{iµ} − ( Σ_i δ_{1,∆i} ) (∂/∂βµ) log( Σ_i e^{β·Z_i} (X_i)^y )

  = Σ_i δ_{1,∆i} Z_{iµ} − ( Σ_i δ_{1,∆i} ) Σ_i e^{β·Z_i} Z_{iµ} (X_i)^y / Σ_i e^{β·Z_i} (X_i)^y   (8.82)

Thus we find that the optimal y and β are to be solved simultaneously from the following two equations:

  1/y = Σ_i e^{β·Z_i} log(X_i) (X_i)^y / Σ_i e^{β·Z_i} (X_i)^y − Σ_i δ_{1,∆i} log X_i / Σ_i δ_{1,∆i}   (8.83)

  Σ_i δ_{1,∆i} Z_{iµ} / Σ_i δ_{1,∆i} = Σ_i e^{β·Z_i} Z_{iµ} (X_i)^y / Σ_i e^{β·Z_i} (X_i)^y   (8.84)

In contrast, the standard Cox equations for β are (8.8), which we can also write as

  Σ_i δ_{1,∆i} Z_{iµ} / Σ_i δ_{1,∆i} = Σ_i e^{β·Z_i} Z_{iµ} ∫_0^{X_i} dt λ_0(t|β) / Σ_i δ_{1,∆i}   (8.85)

with the base hazard rate (8.6). It will be clear that the most probable values for β corresponding to the choice of a parametrised base hazard rate will generally be different from the standard Cox values that follow from (8.8). It follows from the above that they will only be identical if

  ( Σ_i e^{β·Z_i} (X_i)^y ) ( Σ_i e^{β·Z_i} Z_{iµ} ∫_0^{X_i} dt λ_0(t|β) ) = ( Σ_i δ_{1,∆i} ) ( Σ_i e^{β·Z_i} Z_{iµ} (X_i)^y )   (8.86)

In the simplest case of just one risk (i.e. ∆i = 1 for all i), for instance, this condition and the equation for y reduce after some simple rewriting to

  (1/N) Σ_i e^{β·Z_i} Z_{iµ} [ (1/N) Σ_k θ(X_i−X_k) / Σ_j e^{β·Z_j} θ(X_j−X_k) − (X_i)^y / Σ_j e^{β·Z_j} (X_j)^y ] = 0   (8.87)

  1/y = Σ_i e^{β·Z_i} log(X_i) (X_i)^y / Σ_i e^{β·Z_i} (X_i)^y − (1/N) Σ_i log X_i   (8.88)

This set of equations simplifies when written in terms of new variables Y_i = X_i^y:

  (1/N) Σ_i Z_{iµ} [ (1/N) Σ_k e^{β·Z_i} θ(Y_i−Y_k) / Σ_j e^{β·Z_j} θ(Y_j−Y_k) − e^{β·Z_i} Y_i / Σ_j e^{β·Z_j} Y_j ] = 0   (8.89)

  1 = Σ_i e^{β·Z_i} log(Y_i) Y_i / Σ_i e^{β·Z_i} Y_i − (1/N) Σ_i log Y_i   (8.90)

This will only be satisfied in very special cases. Generally, therefore, one should not use parametrised base hazard rates in conjunction with the conventional formulae for the Cox regression parameters.
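As an illustration of the parametrised route, one can maximise (8.80) directly on synthetic data. The sketch below is our own (true parameter values, sample size and the crude grid search replacing an exact solution of (8.83, 8.84) are all our choices); it generates event times from the base hazard (8.76) with τ = 1 and one covariate, and recovers the shape parameter y and regression parameter β.

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta_true, y_true = 4000, 0.8, 1.5
Zc = rng.normal(size=N)
# event times with base hazard (8.76), tau = 1: survival exp(-e^{beta Z} t^y),
# so T = (E e^{-beta Z})^{1/y} with E a standard exponential variable
T = (rng.exponential(size=N) * np.exp(-beta_true * Zc)) ** (1.0 / y_true)

def profile_loglik(b, y):
    """Expression (8.80) with Delta_i = 1 for all i (tau already eliminated)."""
    return (N * np.log(y * N / np.sum(np.exp(b * Zc) * T ** y))
            + (y - 1.0) * np.sum(np.log(T)) + b * np.sum(Zc) - N)

# crude grid maximisation over (beta, y), step 0.02 in each direction
bs = np.linspace(0.0, 1.6, 81)
ys = np.linspace(0.8, 2.5, 86)
vals = np.array([[profile_loglik(b, y) for y in ys] for b in bs])
ib, iy = np.unravel_index(np.argmax(vals), vals.shape)
beta_hat, y_hat = bs[ib], ys[iy]
```

With these settings the grid maximum lands close to the true values (β, y) = (0.8, 1.5); for data that are genuinely of the parametrised form, maximising (8.80) is consistent, and the discrepancy with standard Cox estimates discussed above only arises when the two procedures are mixed.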


APPENDIX A

THE δ-DISTRIBUTION

Definition. We define the δ-distribution as the probability distribution δ(x) corresponding to a zero-average random variable x in the limit where the randomness in the variable vanishes. So

  ∫dx f(x) δ(x) = f(0)   for any function f

By the same token, the expression δ(x−a) will then represent the distribution for a random variable x with average a, in the limit where the randomness vanishes, since

  ∫dx f(x) δ(x−a) = ∫dx f(x+a) δ(x) = f(a)   for any function f

Formulas for the δ-distribution. A problem arises when we want to write down a formula for δ(x). Intuitively one could propose to take a zero-average normal distribution and send its width to zero,

  δ(x) = lim_{σ→0} p_σ(x),   p_σ(x) = (1/σ√(2π)) e^{−x²/2σ²}   (A.1)

This is not a true function in a mathematical sense: δ(x) is zero for x ≠ 0 and δ(0) = ∞. However, we realize that δ(x) only serves to calculate averages; it only has a meaning inside an integration. If we adopt the convention that one should set σ → 0 in (A.1) only after performing the integration, we can use (A.1) to derive the following properties (for sufficiently well-behaved functions f):

  ∫dx δ(x) f(x) = lim_{σ→0} ∫dx p_σ(x) f(x) = lim_{σ→0} ∫ (dx/√(2π)) e^{−x²/2} f(σx) = f(0)

  ∫dx δ′(x) f(x) = lim_{σ→0} ∫dx { (d/dx)[p_σ(x) f(x)] − p_σ(x) f′(x) } = lim_{σ→0} [p_σ(x) f(x)]_{−∞}^{∞} − f′(0) = −f′(0)

The following relation links the δ-distribution to the step function:

  δ(x) = (d/dx) θ(x),   θ(x) = 1 if x > 0,  θ(x) = 0 if x < 0   (A.2)



One proves this by showing that both sides of the equation have the same effect inside an integration:

  ∫dx [ δ(x) − (d/dx)θ(x) ] f(x) = f(0) − lim_{ε→0} ∫_{−ε}^{ε} dx { (d/dx)[θ(x) f(x)] − f′(x) θ(x) }

  = f(0) − lim_{ε→0} [f(ε) − 0] + lim_{ε→0} ∫_0^{ε} dx f′(x) = 0

Finally one can use the definitions of Fourier transforms and inverse Fourier transforms to obtain the following integral representation of the δ-distribution:

  δ(x) = ∫_{−∞}^{∞} (dk/2π) e^{ikx}   (A.3)
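The convention behind (A.1) — integrate first, then send σ → 0 — can be checked numerically. The following sketch is our own illustration (the grid, test function and function name are arbitrary choices): it replaces ∫dx p_σ(x)f(x) by a Riemann sum and shows that the average approaches f(0) as the width shrinks.

```python
import numpy as np

def delta_average(f, sigma):
    """Riemann-sum version of  int dx p_sigma(x) f(x)  with p_sigma from (A.1)."""
    x = np.linspace(-10.0, 10.0, 200001)
    dx = x[1] - x[0]
    p = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(p * f(x)) * dx

# as sigma -> 0 the Gaussian average of cos approaches cos(0) = 1;
# for a Gaussian the exact value is exp(-sigma^2/2)
vals = [delta_average(np.cos, s) for s in (1.0, 0.3, 0.05)]
```

Each decrease of σ brings the average closer to f(0) = 1, in agreement with the exact Gaussian result e^{−σ²/2}.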


APPENDIX B

STEEPEST DESCENT INTEGRATION

Steepest descent (or ‘saddle-point’) integration is a method for dealing with integrals of the following type, with x ∈ ℝ^p, continuous functions f(x) and g(x) of which f is bounded from below, and with N ∈ ℝ positive and large:

  I_N[f,g] = ∫_{ℝ^p} dx g(x) e^{−Nf(x)}   (B.1)

We first take f(x) to be real-valued; this is the simplest case, for which finding the asymptotic behaviour of (B.1) as N → ∞ goes back to Laplace. We assume that f(x) can be expanded in a Taylor series around its minimum f(x⋆), which we assume to be unique, i.e.

  f(x) = f(x⋆) + ½ Σ_{ij=1}^p A_{ij}(x_i−x⋆_i)(x_j−x⋆_j) + O(|x−x⋆|³),   A_{ij} = ∂²f/∂x_i∂x_j |_{x⋆}   (B.2)

If the integral (B.1) exists, inserting (B.2) into (B.1) followed by transforming x = x⋆ + y/√N gives

  I_N[f,g] = e^{−Nf(x⋆)} ∫_{ℝ^p} dx g(x) e^{−½N Σ_{ij}(x_i−x⋆_i)A_{ij}(x_j−x⋆_j) + O(N|x−x⋆|³)}

           = N^{−p/2} e^{−Nf(x⋆)} ∫_{ℝ^p} dy g(x⋆ + y/√N) e^{−½ Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)}   (B.3)

From this latter expansion, and given the assumptions made, we can obtain two important identities:

  −lim_{N→∞} (1/N) log ∫_{ℝ^p} dx e^{−Nf(x)} = −lim_{N→∞} (1/N) log I_N[f,1]

  = f(x⋆) + lim_{N→∞} { (p log N)/(2N) − (1/N) log ∫_{ℝ^p} dy e^{−½ Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)} }

  = f(x⋆) = min_{x∈ℝ^p} f(x)   (B.4)

and

  lim_{N→∞} ∫dx g(x) e^{−Nf(x)} / ∫dx e^{−Nf(x)} = lim_{N→∞} I_N[f,g] / I_N[f,1]

  = lim_{N→∞} ∫_{ℝ^p} dy g(x⋆ + y/√N) e^{−½ Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)} / ∫_{ℝ^p} dy e^{−½ Σ_{ij} y_i A_{ij} y_j + O(N^{−1/2}|y|³)}

  = g(x⋆) (2π)^{p/2}/√(Det A) / ( (2π)^{p/2}/√(Det A) ) = g(x⋆) = g( argmin_{x∈ℝ^p} f(x) )   (B.5)

If f(x) is complex, the correct procedure to be followed is to deform the integration paths in the complex plane (using Cauchy’s theorem) such that along the deformed path the imaginary part of the function f(x) is constant, and preferably (if possible) zero. One then proceeds using Laplace’s argument and finds the leading order in N of our integral in the usual manner by extremization of the real part of f(x). In combination, our integrals will thus again be dominated by an extremum of the (complex) function f(x), but since f is complex this extremum need not be a minimum:

  −lim_{N→∞} (1/N) log ∫_{ℝ^p} dx e^{−Nf(x)} = extr_{x∈ℝ^p} f(x)   (B.6)

  lim_{N→∞} ∫dx g(x) e^{−Nf(x)} / ∫dx e^{−Nf(x)} = g( arg extr_{x∈ℝ^p} f(x) )   (B.7)
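Identity (B.5) is easy to verify numerically in one dimension. The sketch below is our own illustration (the choices f(x) = ½(x − ½)² and g(x) = x², the grid and the values of N are arbitrary): it evaluates both integrals on a grid and shows that their ratio approaches g(x⋆) = ¼ as N grows.

```python
import numpy as np

x = np.linspace(-2.0, 3.0, 500001)
f = 0.5 * (x - 0.5) ** 2          # real f with unique minimum at x* = 0.5
g = x ** 2                        # so g(x*) = 0.25

def laplace_ratio(N):
    """Ratio of the two integrals in (B.5), computed on the grid;
    the common grid spacing and the factor e^{-N min(f)} cancel."""
    w = np.exp(-N * (f - f.min()))
    return np.sum(g * w) / np.sum(w)

r_small, r_big = laplace_ratio(50), laplace_ratio(2000)
```

The ratio equals 0.25 + O(1/N) here (for this quadratic f it is exactly the second moment of a Gaussian of variance 1/N centred at x⋆), so the error shrinks visibly as N increases.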


APPENDIX C

MAXIMUM LIKELIHOOD VERSUS BAYESIAN ESTIMATION

To illustrate the procedures of maximum likelihood and Bayesian estimation of parameters from data we consider the following problem. We are given a dice and want to know the true (but as yet unknown) probabilities (π_1, ..., π_6) of each possible throw. A fair dice would have π_r = 1/6 for all r. Note that Σ_{r=1}^6 π_r = 1. Our data from which to extract the information consists of the results of N independent throws of the dice:

  D = {X_1, X_2, ..., X_N},   X_i ∈ {1, 2, ..., 6} for each i   (C.1)

• Ad hoc estimators:

Our problem is sufficiently transparent for us to simply guess suitable estimators. It would be natural to choose for π_k the empirical frequency with which the throw X_i = k is observed:

  (∀k = 1...6) :   π̂_k = (1/N) Σ_i δ_{X_i,k}   (C.2)

This choice satisfies the constraint Σ_{k=1}^6 π̂_k = 1, and for N → ∞ the law of large numbers indeed gives lim_{N→∞} π̂_k = Σ_{r=1}^6 π_r δ_{rk} = π_k. So our π̂_k are proper estimators. The results of simulating this estimation process numerically for a loaded dice are shown in figure C.1. One clearly needs data sets of size N ∼ 2000 or more for (C.2) to approach the true values.

Fig. C.1. The six empirical frequencies (or estimators) π̂_k = N^{−1} Σ_i δ_{X_i,k}, for each possible dice throw k = 1...6, versus the size N of the data set. In this example of a loaded dice the actual probabilities are (π_1, ..., π_6) = (0.16, 0.16, 0.16, 0.16, 0.16, 0.20).

• Maximum likelihood estimators:

The maximum likelihood estimators are determined by maximizing over (π_1, ..., π_6) the likelihood of the data D, given the values of (π_1, ..., π_6). Here we have

  log P(D|π_1, ..., π_6) = log Π_{i=1}^N P(X_i|π_1, ..., π_6) = Σ_{i=1}^N log π_{X_i}   (C.3)

Let us maximize this quantity over (π_1, ..., π_6), subject to the constraint Σ_{r=1}^6 π_r = 1, using the Lagrange formalism:

  (∂/∂π_k) Σ_{i=1}^N log π_{X_i} = λ (∂/∂π_k) Σ_{r=1}^6 π_r

  Σ_{i=1}^N (1/π_k) δ_{X_i,k} = λ,   hence   π_k = (1/λ) Σ_{i=1}^N δ_{X_i,k}   (C.4)

Summation over k on both sides gives Σ_k π_k = N/λ, so our normalisation constraint tells us that λ = N. Hence the maximum likelihood estimator is identical to our estimator (C.2).

• Bayesian estimation:

Finally, when following the Bayesian route we calculate P(π_1, ..., π_6|D), defined as

  P(π_1, ..., π_6|D) = P(D|π_1, ..., π_6) P(π_1, ..., π_6) / P(D)

  = P(D|π_1, ..., π_6) P(π_1, ..., π_6) / ∫_Ω dπ′_1 ... dπ′_6 P(D|π′_1, ..., π′_6) P(π′_1, ..., π′_6)   (C.5)

Here Ω is the set of all parameters (π_1, ..., π_6) that satisfy the relevant constraints, i.e. Ω = {(π_1, ..., π_6) ∈ ℝ⁶ | π_r ≥ 0 ∀r, Σ_{r≤6} π_r = 1}. Alternatively (and equivalently) we can integrate over ℝ⁶ and implement the constraints via the prior, i.e. by defining P(π_1, ..., π_6) = 0 if (π_1, ..., π_6) ∉ Ω. Next we need to determine the values of the prior P(π_1, ..., π_6) for (π_1, ..., π_6) ∈ Ω. Information theory tells us that if the only prior information available is our knowledge of the constraints, we should choose the prior that maximizes the Shannon entropy subject to these constraints. This is again done via the Lagrange method, where we now vary the entries of the prior P(π_1, ..., π_6) (it turns out that non-negativity will be satisfied automatically, so we only impose the normalisation constraint):

  (δ/δP(π_1, ..., π_6)) ∫_Ω dπ′_1 ... dπ′_6 P(π′_1, ..., π′_6) log P(π′_1, ..., π′_6)
     = Λ (δ/δP(π_1, ..., π_6)) ∫_Ω dπ′_1 ... dπ′_6 P(π′_1, ..., π′_6)

  ∀(π_1, ..., π_6) ∈ Ω :   1 + log P(π_1, ..., π_6) = Λ   (C.6)

We see that the maximum entropy prior is flat over Ω, so P(π_1, ..., π_6) = 1/|Ω|. Hence, upon insertion into (C.5), we get for (π_1, ..., π_6) ∈ Ω an expression that can again be written in terms of the estimator (C.2):

  P(π_1, ..., π_6|D) = P(D|π_1, ..., π_6) / ∫_Ω dπ′_1 ... dπ′_6 P(D|π′_1, ..., π′_6)

  = Π_i π_{X_i} / ∫_Ω dπ′_1 ... dπ′_6 Π_i π′_{X_i}

  = e^{Σ_{k=1}^6 log π_k Σ_i δ_{k,X_i}} / ∫_Ω dπ′_1 ... dπ′_6 e^{Σ_{k=1}^6 log π′_k Σ_i δ_{k,X_i}}

  = e^{N Σ_{k=1}^6 π̂_k log π_k} / ∫_Ω dπ′_1 ... dπ′_6 e^{N Σ_{k=1}^6 π̂_k log π′_k}   (C.7)

Let us work out the denominator, using the standard integral representation δ(z) = (2π)^{−1} ∫_{−∞}^{∞} dx e^{ixz} for the delta-function:

  (1/N) log Den = (1/N) log ∫_Ω dπ′_1 ... dπ′_6 e^{N Σ_{k=1}^6 π̂_k log π′_k}

  = (1/N) log ∫_0^∞ dπ_1 ... dπ_6 δ[ 1 − Σ_{k=1}^6 π_k ] e^{N Σ_{k=1}^6 π̂_k log π_k}

  = (1/N) log ∫_{−∞}^{∞} (dx/2π) e^{ix} Π_{k=1}^6 ∫_0^∞ dy e^{N π̂_k log y − ixy}

  = (1/N) log ∫_{−∞}^{∞} (dx/(2π/N)) e^{iNx} Π_{k=1}^6 ∫_0^∞ dy e^{N[π̂_k log y − ixy]}   (C.8)

Focusing on the y integral, we note that for large N the dominant contribution comes from the saddle-point, i.e. after shifting the contour in the complex plane, from the solution of (d/dy)[π̂_k log y − ixy] = 0, giving y = −iπ̂_k/x. So steepest descent integration gives us (see Appendix B for an introduction to steepest descent integration), using log(−i) = iArg(−i) = −iπ/2:

  ∫_0^∞ dy e^{N[π̂_k log y − ixy]} = e^{N[π̂_k log(−iπ̂_k/x) − π̂_k] + O(N⁰)}

  = e^{N[π̂_k log π̂_k − ½iπ π̂_k − π̂_k log x − π̂_k] + O(N⁰)}   (C.9)

Hence

  (1/N) log Den = (1/N) log ∫_{−∞}^{∞} dx e^{N[ ix + Σ_{k=1}^6 ( π̂_k log π̂_k − ½iπ π̂_k − π̂_k log x − π̂_k ) ]} + (log N)/N + O(1/N)

  = (1/N) log ∫_{−∞}^{∞} dx e^{N[ ix + Σ_{k=1}^6 π̂_k log π̂_k − ½iπ − log x − 1 ]} + (log N)/N + O(1/N)

(using Σ_{k=1}^6 π̂_k = 1)

  = Σ_{k=1}^6 π̂_k log π̂_k − ½iπ − 1 + (1/N) log ∫_{−∞}^{∞} dx e^{N(ix − log x)} + (log N)/N + O(1/N)   (C.10)

Steepest descent integration over x gives (1/N) log ∫dx e^{N(ix−log x)} = 1 + ½iπ + O(N^{−1}). Thus we get

  (1/N) log Den = Σ_{k=1}^6 π̂_k log π̂_k + (log N)/N + O(N^{−1})

  Den = N e^{N Σ_{k=1}^6 π̂_k log π̂_k + O(N⁰)}   (C.11)

The end result is the following appealing large-N form of our formula (C.7):

  P(π_1, ..., π_6|D) = (1/N) e^{N Σ_{k=1}^6 π̂_k log π_k − N Σ_{k=1}^6 π̂_k log π̂_k + O(N⁰)}

  = (1/N) e^{−N Σ_{k=1}^6 π̂_k log(π̂_k/π_k) + O(N⁰)}   (C.12)

The leading order in the exponent, apart from the factor N, is the Kullback-Leibler distance between the true and estimated probability distributions {π_k} and {π̂_k}. The most probable values of the probabilities are therefore again seen to be the estimators (C.2), but now we know more: we have also quantified our uncertainty for large but finite N.
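The estimation procedures above can be simulated directly. The sketch below is our own illustration, using the loaded-dice probabilities of figure C.1: it computes the empirical frequencies (C.2) and evaluates the leading-order log-posterior of (C.12), which indeed peaks when the candidate probabilities equal the estimator.

```python
import numpy as np

rng = np.random.default_rng(7)
p_true = np.array([0.16, 0.16, 0.16, 0.16, 0.16, 0.20])   # loaded dice of fig. C.1
N = 10000
throws = rng.choice(6, size=N, p=p_true)                  # N independent throws
p_hat = np.bincount(throws, minlength=6) / N              # the estimator (C.2)

def log_posterior(q):
    """Leading order of (C.12): -N * sum_k p_hat_k log(p_hat_k / q_k)."""
    return -N * np.sum(p_hat * np.log(p_hat / q))

p_fair = np.full(6, 1.0 / 6.0)
# log_posterior is maximal (zero) at q = p_hat, and strictly smaller at the
# fair-dice probabilities, since the Kullback-Leibler distance is positive there
```

At this sample size the empirical frequencies are already within a couple of percent of the true values, consistent with the N ∼ 2000 rule of thumb quoted below figure C.1.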


APPENDIX D

MAXIMUM PREDICTION ACCURACY WITH COX REGRESSION

Assume we know our regression parameters exactly, the hazard rate is indeed of the Cox form, and there is no censoring (the ideal scenario). We take all covariates to be independent zero-average and unit-variance Gaussian variables. The survival probability for risk 1 is then

  S(t|Z) = exp( − Σ_{i=1}^N e^{β̂·Z} θ(t−X_i) / Σ_{j=1}^N e^{β̂·Z_j} θ(X_j−X_i) )   (D.1)

and we would classify a patient with covariates Z at time t according to the most probable outcome:

  σ(Z) = θ[ S(t|Z) − ½ ] = θ[ log(2) − (1/N) Σ_{i=1}^N e^{β̂·Z} θ(t−X_i) / ( (1/N) Σ_{j=1}^N e^{β̂·Z_j} θ(X_j−X_i) ) ]   (D.2)

For N → ∞ there will be no difference between training and validation sets in terms of prediction accuracy, and the fraction predicted correctly will simply be

  Q_t = ⟨ ½ + ½ sgn(X−t) sgn[ log(2) − (1/N) Σ_{i=1}^N e^{β̂·Z} θ(t−X_i) / ( (1/N) Σ_{j=1}^N e^{β̂·Z_j} θ(X_j−X_i) ) ] ⟩_{Z,X}

  = ½ + ½ ⟨ sgn(X−t) sgn[ log(2) − e^{β̂·Z} ⟨ θ(t−X′) / ⟨ e^{β̂·Z″} θ(X″−X′) ⟩_{Z″,X″} ⟩_{X′} ] ⟩_{Z,X}   (D.3)

We first do the average in the denominator:

  ⟨ e^{β̂·Z″} θ(X″−X′) ⟩_{Z″,X″} = ∫DZ e^{β̂·Z} ∫_{X′}^∞ ds P(s|Z)

  = ∫DZ e^{β̂·Z} ∫_{X′}^∞ ds π_1(s|Z) e^{−∫_0^s ds′ π_1(s′|Z)}

  = ∫DZ e^{β̂·Z} [ −e^{−∫_0^s ds′ π_1(s′|Z)} ]_{X′}^{∞}

  = ∫DZ e^{β̂·Z} ( e^{−∫_0^{X′} ds π_1(s|Z)} − e^{−∫_0^∞ ds π_1(s|Z)} )

  = ∫DZ e^{β̂·Z} ( S(X′|Z) − S(∞|Z) )   (D.4)


APPENDIX E

COMPUTATIONAL DETAILS

Generation of synthetic data.

Monte-Carlo sampling in Bayesian regression.

