POLITECNICO DI MILANO
School of Industrial and Information Engineering
Department of Mathematics
Master of Science in Mathematical Engineering

Count processes approach to recurrent event data: a Bayesian model for blood donations

Enrico Spinelli, matricola 875462
Supervisor: Prof. Alessandra Guglielmi
Coadvisor: Prof. Ettore Lanzarone
A.Y. 2018-2019




Abstract

This work addresses an important practical problem: predicting the number of donations at a specific blood centre, in order to plan efficiently the collection phase of the blood supply chain.

First, statistical models for estimating the rate of blood donations are considered. Models of this kind make it possible to predict the time until an individual returns to donate. The real data analyzed come from the databases of the Milan section of the Associazione Volontari Italiani Sangue (AVIS). The models and methods used are those of Bayesian statistics, and blood donations have been modelled as recurrent events. Specifically, the focus has been on the rate function, which is the instantaneous probability of the event occurrence. The object of inference in this approach is the counting process $\{N_i(t) : t \ge 0\}$, for each donor $i$, where $N_i(t)$ represents the number of donations made up to time $t$ by the $i$-th donor.

Usually the waiting times between donations are considered; modelling the counts, on the other hand, allows the process to retain memory and to evolve with an occurrence rate that depends on the time of the event.

The analysis highlights a decreasing trend of the rate function and identifies some significant covariates. Moreover, with the use of random effects in the model, heterogeneity among individuals is captured, and for each donor the posterior density of one parameter (called frailty) summarises his/her personal propensity to donate.

The behaviour of existing donors has been modelled within the context of recurrent events. Since part of the blood supply also comes from occasional or new donors, a Bayesian time series model has been proposed to make predictions in this context.


Italian abstract

This work seeks a solution to a very important and practical problem: forecasting the number of donations at a specific blood collection centre, in order to plan efficiently the collection phase of the blood supply chain.

First, statistical models for estimating the rate of blood donations were considered. Models of this kind make it possible to predict the time until an individual returns to donate. The real data analyzed come from the databases of the Milan section of the Associazione Volontari Italiani Sangue (AVIS). The models and methods used are those of Bayesian statistics, and blood donations were modelled as recurrent events. Specifically, attention focused on the rate function, which is the instantaneous probability that the event occurs. The object of inference in this approach is the counting process $\{N_i(t) : t \ge 0\}$, for each donor $i$, where $N_i(t)$ represents the number of donations made up to time $t$ by the $i$-th donor.

Usually the waiting times between donations are considered, but modelling the counts allows the process to retain memory and to evolve with an occurrence rate that changes as time passes.

The analysis highlights a decreasing trend in the rate function and identifies some covariates as significant. Moreover, by including random effects in the model, the heterogeneity among individuals is explained, and for each donor the posterior distribution of one parameter (called frailty) summarises his/her personal propensity to donate.

The behaviour of existing donors was modelled in the context of recurrent events. Since part of the blood supply also comes from occasional or new donors, a Bayesian time series model was proposed to forecast this phenomenon.


Table of Contents

Abstract
Italian abstract
Table of Contents
List of Figures
List of Tables
Introduction

1 Theoretical background on modelling recurrent events
1.1 Framework and notation
1.2 Recurrent events as gap times
1.3 Recurrent events as event counts
1.4 Heterogeneity between individuals
1.4.1 Covariates in the Poisson process
1.4.2 Random effects
1.5 Extensions to renewal and Poisson processes
1.5.1 "At risk" indicator function
1.5.2 General intensity-based model
1.5.3 Multi-state Markov models
1.5.4 Modelling the baseline intensity function
1.6 The Bayesian approach
1.6.1 Bayesian Statistics
1.6.2 Monte Carlo Markov Chains
1.6.3 Discretization of the Gamma process prior
1.6.4 Autocorrelated prior for the baseline intensity function
1.7 Model evaluation in terms of predictive performances
1.7.1 Log posterior predictive density
1.7.2 Computation of WAIC
1.7.3 Evaluating predictive accuracy in the case of recurrent events

2 Data source
2.1 The AVIS association
2.1.1 Brief history of AVIS
2.1.2 Italian donation rules
2.2 Data sources
2.2.1 The EMONET database
2.2.2 The AVIS database
2.2.3 Data selection
2.2.4 Suspensions
2.3 Features selection and data transformation
2.4 Descriptive analysis
2.4.1 Rate of donations and gap times
2.4.2 Covariates

3 Modelling blood donations as recurrent events
3.1 Recurrent event models for blood donations
3.2 Modelling choices
3.2.1 Baseline intensity function
3.2.2 Frailty parameters
3.2.3 Covariates
3.2.4 At risk indicator function, censoring and suspensions
3.3 The Bayesian model for recurrent data of M donors
3.3.1 The likelihood
3.3.2 Prior elicitation
3.4 The predictive distribution of the counting process of a new incoming donor

4 Posterior inference on AVIS data
4.1 Posterior inference
4.2 Inference on parameters
4.2.1 Baseline intensity function
4.2.2 Covariates coefficients
4.2.3 Random effects
4.2.4 Predictive density for the count process of a new incoming donor
4.3 Point predictions

5 Forecasting new donors
5.1 State Space Models
5.2 Descriptive analysis
5.3 A Bayesian model for the new donors
5.4 Posterior inference

Conclusions and further developments

Bibliography


List of Figures

2.1 Histogram of the empirical rates of donation (number of donations divided by the years of observation)
2.2 Boxplot of the number of days elapsed from the last observed donation of each donor to their censoring time
2.3 Trend of the gap times with the number of donations
2.4 Trend of the gap times with the years passed since entrance
2.5 Histogram of the logarithm of the gap times
2.6 Boxplots of the BMI according to the values of the categorical covariates
2.7 Boxplots of the first donation age according to the values of the categorical covariates
2.8 Boxplots of the donation rate grouped by the categorical variables
2.9 Scatterplot of the donation rates against the continuous variables (AGE and BMI)
3.1 Histogram of gap times of female donors; the red line corresponds to 180 days
3.2 Histogram of gap times of female donors; the red line corresponds to 90 days
3.3 Percentage of earlier-than-allowed donations as a function of the threshold age for menopause
4.1 95% credibility intervals for the baseline intensity function
4.2 Estimated log posterior predictive density
4.3 95% credibility intervals for the $\beta_i$ parameters
4.4 Summaries of $w_i$
4.5 Predictive densities of $T_{i,n_i+1}$ given $T_{i,n_i}$ for some donors
4.6 95% posterior predictive credibility intervals of $w_j^{new}$, $j = 1, \dots, J$, the frailty of a new donor from zone $j$. In grey, the estimate obtained with the model with no areal dependence
4.7 Pointwise predictive 95% credibility intervals for $N_{new}(t) \mid x_{new}$, where $x_{new}$ is set to the mean (or to the mode) of the features used as covariates
4.8 Mean functions for $N_{new}(t) \mid x_{new}, \text{data}$. Unless stated otherwise, the covariates are set to the mean (or to the mode)
5.1 Weekly arrivals of new donors grouped by years
5.2 Weekly arrivals of new donors grouped by months
5.3 Time series of the weekly arrivals
5.4 Traceplots of the variance parameters of Model 1
5.5 Model 1: decomposition of the time series
5.6 Model 2: decomposition of the time series
5.7 Prediction of new weekly arrivals: 95% credibility intervals
5.8 Predictive mean of the seasonal component


List of Tables

2.1 Variables from table PRESENTAZIONI in the EMONET database that are included in our dataset
2.2 Variables from table DONAZIONI in the EMONET database that are included in our dataset
2.3 Variables from table ANAGRAFICHE in the EMONET database that are included in our dataset
2.4 Variables from table EMC_DONABILI in the EMONET database that are included in our dataset
2.5 Variables from table TIPIZZAZIONE in the EMONET database that are included in our dataset
2.6 Variables from table STILIVITA in the AVIS database that are included in our dataset
2.7 Variables from table SOSPENSIONI in the AVIS database that are included in our dataset
2.8 Frequency table relating the type of suspension to whether the suspension was respected
2.9 Description of the features
2.10 Number of donors who made exactly n total donations (after the first one)
2.11 Sample frequencies of the categorical variables
2.12 Mean and standard deviation of the continuous variables
4.1 Bayesian p-values and hazard ratios
4.2 Predictive performance evaluation of models with different sets of covariates using 10-fold cross-validation
4.3 Point prediction errors
5.1 Summaries of the empirical distribution of the time series of the weekly arrivals
5.2 Prediction of future weekly arrivals


Introduction

Human blood is a natural product that cannot be reproduced artificially, so the only way to guarantee its availability for health purposes is through donations from living individuals. Blood is needed to save lives, to improve their quality and to extend their length. It is essential in first aid, emergency services, surgery, organ and bone marrow transplants, and the treatment of oncological and haematological diseases. Blood is essential not only in exceptional cases such as natural disasters, accidents or serious pathological conditions, but also as a unique source of survival in chronic conditions such as anaemia, liver dysfunction, lack of coagulation factors and disorders of the immune system.

The blood donation supply chain can be divided into four phases: collection, transportation, storage and utilization. In the collection phase the donor's eligibility to donate is checked; then, if the donation takes place, the blood is screened in a laboratory to prevent the transmission of infectious diseases and is possibly fractionated into subcomponents. Afterwards it is transported to hospitals or transfusion centres, where it is stored, and finally it is used for a transfusion.

Bas Güre et al. (2018) discuss how the management of blood collection from donors has not been adequately considered so far. Indeed, most of the efforts in the scientific literature are aimed at demand prediction or at the efficient management of storage and distribution. Despite this lack of attention, collection is one of the most important phases of the blood donation supply chain. Blood has a limited shelf life, so the demand of hospitals and transfusion centres has to be met as precisely as possible to avoid wasting this resource. When neither the demand nor an estimate of it is available, storage should be planned so as to keep the number of blood units of each type constant across days in every centre. Moreover, knowing the number of incoming donors in advance can lead to an optimal planning of the appointment scheduling system, with the purpose of reconciling the production balancing requirements with the service planning requirements. The quality of the service from the donors' perspective would also benefit.

In Italy, as in many other western countries, the acquisition of blood products relies on voluntary donations. The major organization that brings together volunteer blood donors is the Associazione Volontari Italiani Sangue (AVIS). A precise forecast of arrivals is clearly necessary for an efficient management of blood collection, and modelling and understanding the behaviour of donors is one way to obtain it. Some statistical models have been proposed in the scientific literature. Previous works rely on logistic regression or on modelling the gap times between donations in the framework of recurrent events. Apart from Gianoli (2016), all these publications use frequentist methods, while the Bayesian approach is largely unexplored.

The methods used in this thesis belong to Bayesian statistics and a recurrent event approach is adopted but, unlike in Gianoli (2016), event counts over time are modelled rather than the waiting times between two successive blood donations.

In recent decades, thanks to improvements in the performance of computing systems and to the diffusion of MCMC methods, the Bayesian approach has been spreading in the scientific world, since it can give a richer inference than classical statistics. Indeed, probabilistic estimates are exact, in the sense that they do not rely on large-sample theory, and instruments such as interval estimates have a clearer meaning. Moreover, with predictive distributions, the Bayesian paradigm offers a natural way to do forecasting.

This thesis deals with the analysis of a dataset built from real data provided by the Milan section of AVIS.

Suitable data were downloaded, using SQL queries, from two databases on the AVIS server. A pre-processing stage followed to make the raw data usable for statistical analysis. As a result, a dataset of M individuals was built. Times of donations, personal features and the total time of observation (namely, the censoring time) were available for each individual.

Subsequently, a proper model for treating blood donations as recurrent events was formulated. First, the statistical modelling of recurrent event processes was studied, and a brief survey of the state of the art of statistical methods for recurrent blood donations was carried out, covering both Bayesian and frequentist statistics. Then, a suitable class of models was identified. Some modifications were nevertheless needed to adapt this class of models to the real phenomenon. For instance, the model can handle some typical features of the blood donation cycle, such as the mandatory deferral time after each donation or suspensions from donor activity.

Posterior inference was computed via Stan (see Stan Development Team and others, 2016), an open-source software written in C++ that performs MCMC sampling.

Finally, posterior inference in the form of MCMC output was analyzed and interpreted; moreover, a way to sample from a recurrent event process was proposed. Appropriate instruments for assessing goodness of fit and predictive accuracy were discussed and used to compare different models (for instance, different parametrizations or different subsets of covariates).

The result of the work summarized above is a mathematical model that can explain the behaviour of a blood donor starting from the moment of his/her first donation. Individual features enter the model as covariates. Some of them have been identified as statistically significant and associated with a higher or lower number of donations per unit of time. The model can also be used to make individual-specific predictions of new donations.

Finally, to obtain a complete model of the number of donations in a specific blood collection centre, the time series of the weekly number of new donors was briefly analyzed as a State Space Model. Summing up, the original contributions of this work are:

• composition of the dataset;
• the study of models for the rate function of recurrent events, particularly using the Bayesian approach;
• application of the class of models to the dataset;
• predictive accuracy comparison of different models;
• a State Space model to predict the number of new donors;
• Stan implementation of the models.

The thesis is organized as follows.

The first chapter gives an overview of recurrent event processes and of the various modelling techniques, in both the frequentist and the Bayesian framework.

The second chapter describes the data sources.

The third chapter explains in detail the modelling choices made for the dataset under examination.

The fourth chapter presents the results of the analysis: the posterior inference on the parameters of the model is shown and discussed.

The last chapter is devoted to the time series modelling of the weekly number of new donors.


1 Theoretical background on modelling recurrent events

In this chapter, a brief review of the statistical models used in the analysis of recurrent events is presented. Recurrent event processes are those in which events are generated repeatedly over time.

A short recall of Bayesian statistics and MCMC methods follows. The chapter concludes with model evaluation in terms of predictive performance. Almost all the material in this chapter is taken from Cook and Lawless (2007).

1.1 Framework and notation

A single recurrent event process starting at time $t = 0$ is characterized by an increasing sequence of event times $\{T_k, k \in \mathbb{N}\}$, where each element of the sequence denotes the time of the corresponding event. This sequence is associated with the counting process $\{N(t), t \ge 0\}$, defined as:

$$N(t) = \sum_{k=0}^{\infty} I(T_k \le t), \tag{1.1}$$

where $I(T_k \le t)$ is the indicator function, equal to 1 when $T_k \le t$ and to 0 otherwise. The counting process evaluated at time $t$ records the cumulative number of events that occurred in the interval $[0, t]$. Moreover, the number of events that occurred in the interval $(s, t]$ can be expressed as $N(t) - N(s)$.
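As an illustration of definition (1.1), the following minimal Python sketch evaluates the counting process from a sorted list of event times. The function names and the donation times are hypothetical and serve only as an example; they do not come from the AVIS data.

```python
from bisect import bisect_right


def count_events(event_times, t):
    """Return N(t): the number of event times T_k with T_k <= t.

    event_times must be sorted in increasing order.
    """
    return bisect_right(event_times, t)


def count_between(event_times, s, t):
    """Return N(t) - N(s): the number of events in the interval (s, t]."""
    return count_events(event_times, t) - count_events(event_times, s)


# Hypothetical donation times, in days since the start of observation
donation_times = [95.0, 200.0, 310.0, 480.0]
n_at_300 = count_events(donation_times, 300.0)             # events in [0, 300]
n_in_window = count_between(donation_times, 200.0, 480.0)  # events in (200, 480]
```

Binary search is used only because the event times are increasing by construction; a plain sum of indicators would give the same result.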

Let $H(t) = \{N(s), 0 < s \le t\}$ be the history of the process. A recurrent event process can be defined by specifying the instantaneous probability that an event occurs given the previous history, under the hypothesis that two events cannot occur simultaneously. Considering the probability that an event occurs in the interval $(t, t + \Delta t]$, one can define the intensity function:

$$\lambda(t \mid H(t)) = \lim_{\Delta t \to 0} \frac{P(N(t + \Delta t) - N(t) = 1 \mid H(t))}{\Delta t}. \tag{1.2}$$

Once the intensity function is known, it is possible to write the probability of a specified event history and the conditional probabilities of inter-event times through the following results (see Cook and Lawless, 2007).

Conditionally on $H(\tau_0)$, the probability density of the outcome "$n$ events, at times $t_1 < \dots < t_n$", where $n > 0$, for a process with an integrable intensity $\lambda(t \mid H(t))$, over the specified interval $[\tau_0, \tau]$, is:

$$\exp\left(-\int_{\tau_0}^{\tau} \lambda(u \mid H(u))\,du\right) \prod_{j=1}^{n} \lambda(t_j \mid H(t_j)). \tag{1.3}$$

For an event process with integrable intensity $\lambda(t \mid H(t))$:

$$P\big(N(t) - N(s) = 0 \mid H(s)\big) = \exp\left(-\int_{s}^{t} \lambda(u \mid H(u))\,du\right). \tag{1.4}$$

Let $W_j = T_j - T_{j-1}$ be the waiting time between the events $(j-1)$ and $j$; then:

$$P\big(W_j > w \mid T_{j-1} = t_{j-1}, H(t_{j-1})\big) = \exp\left(-\int_{t_{j-1}}^{t_{j-1}+w} \lambda(u \mid H(u))\,du\right). \tag{1.5}$$

The formulas above show that the intensity function determines the probability of any event history, which is why it plays a crucial role in modelling a recurrent event process.
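The following Python sketch shows how expressions (1.3) and (1.5) could be evaluated numerically. It is only a sketch under a simplifying assumption: the intensity is taken to depend on $t$ alone and not on the history $H(t)$. The helper names, the trapezoidal integration rule and the constant example intensity are illustrative choices, not part of the original analysis.

```python
import math


def integrate(f, a, b, n=1000):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def log_density(intensity, event_times, tau0, tau):
    """Logarithm of expression (1.3) for an intensity that depends on t only."""
    log_prod = sum(math.log(intensity(t)) for t in event_times)
    return log_prod - integrate(intensity, tau0, tau)


def gap_survival(intensity, t_prev, w):
    """P(W_j > w | T_{j-1} = t_prev) as in (1.5), for a history-free intensity."""
    return math.exp(-integrate(intensity, t_prev, t_prev + w))


# Assumed example: constant intensity of 2 events per unit of time
def rate(t):
    return 2.0


ll = log_density(rate, [0.4, 1.1, 1.7], 0.0, 2.0)
surv = gap_survival(rate, 1.7, 0.5)
```

With a constant intensity the integral is available in closed form, so the example can be checked by hand: the log-density equals $3 \log 2 - 4$ and the survival probability equals $e^{-1}$.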

Depending on the goal of the analysis, event occurrences can be modelled in two main ways: through event counts or through gap times. In the first case the focus is on the counting process $N(t)$, while in the second the waiting times between two consecutive events are modelled. The next sections give a brief summary of the two approaches.

1.2 Recurrent events as gap times

The analysis of recurrent events as gap times is common when the events are relatively infrequent or when the system returns to its initial state after every occurrence. In this case the process is called a renewal process, and it is a useful framework for system failures or cyclical phenomena. In a renewal process the gap times $W_j = T_j - T_{j-1}$ between the events $(j-1)$ and $j$ are independent and identically distributed conditionally on the parameters. This condition is equivalent to:

$$\lambda(t \mid H(t)) = h(t - T_{N(t^-)}), \tag{1.6}$$

$$N(t^-) := \lim_{\Delta t \to 0} N(t - \Delta t), \tag{1.7}$$


where $h(t)$ is the hazard function, defined as follows:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(W < t + \Delta t \mid W \ge t)}{\Delta t} = \frac{f(t)}{S(t)}, \tag{1.8}$$

where $f(t)$ is the density function of the waiting times and $S(t) = P(W \ge t)$ is their survival function. The hazard function is the main focus of a branch of statistics called survival analysis, in which times to events such as failures or deaths are analyzed. Processes of this kind are often called time-to-event processes. The similarities between the hazard function and the intensity function are evident: both represent the instantaneous probability, per unit of time, that an event occurs at time $t$. Hence, the same modelling approach can be followed for both functions.

A renewal process is equivalent to a succession of time-to-event processes that occur one after the other since, as can be seen in equation (1.6), the intensity function takes the same values after every event, losing memory of the past. A renewal process can be generalized by inducing dependence between gap times through linear models, which makes it possible to have a trend in the waiting times.
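A renewal process can be simulated by accumulating independent gap times. The sketch below assumes Weibull-distributed gaps purely for illustration; the text does not prescribe this distribution, and the shape, scale and horizon values are hypothetical. For the Weibull case the hazard in (1.8) has a closed form, which is also shown.

```python
import math
import random


def weibull_hazard(w, shape, scale):
    """Hazard h(w) = f(w) / S(w) of a Weibull(shape, scale) gap time."""
    return (shape / scale) * (w / scale) ** (shape - 1)


def simulate_renewal(shape, scale, horizon, rng):
    """Event times in (0, horizon] of a renewal process with i.i.d. Weibull gaps."""
    times, t = [], 0.0
    while True:
        # random.weibullvariate takes (scale, shape), in that order
        t += rng.weibullvariate(scale, shape)
        if t > horizon:
            return times
        times.append(t)


rng = random.Random(0)  # fixed seed so that the run is reproducible
events = simulate_renewal(shape=1.5, scale=120.0, horizon=1000.0, rng=rng)
```

Because every gap is drawn from the same distribution, the simulated process restarts after each event, which is the loss of memory described above. With `shape=1.0` the hazard is constant and the gaps are Exponential.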

1.3 Recurrent events as event counts

The main way of representing a recurrent event process through event counts is to model it as a Poisson process. In this framework the events occur randomly in such a way that their numbers in disjoint time intervals are statistically independent. Equivalently, the intensity function at time $t$ does not depend on the history $H(t)$ of the process, and it can be expressed in the following form:

$$\lambda(t \mid H(t)) = \rho(t), \quad t > 0, \tag{1.9}$$

where $\rho(t)$ is a non-negative integrable function called the rate function. If $\rho(t) = \rho$ for each $t$, which means that the intensity is constant over time, the Poisson process is called homogeneous; otherwise it is called non-homogeneous. Since $\rho(t)\,dt$ approximates the probability that an event occurs in the interval $[t, t+dt)$, it is also the mean number of events in an infinitesimal time interval. Hence

$$\mu(t) = \int_0^t \rho(u)\,du \tag{1.10}$$

is the mean number of events in the interval $[0, t]$, and it is called the cumulative rate function.

The definition of Poisson processes implies the following properties:


• if $t \ge s \ge 0$, then $N(s,t) = N(t) - N(s)$ has a Poisson distribution with mean $\mu(t) - \mu(s)$;

• if $(s_1, t_1]$ and $(s_2, t_2]$ are non-overlapping intervals, then $N(s_1, t_1)$ and $N(s_2, t_2)$ are independent random variables;

• in the case of a homogeneous Poisson process with intensity $\rho$, the gap times $W_j = T_j - T_{j-1}$ are independent and identically distributed Exponential random variables with survivor function

$$P(W_j > w) = \exp(-\rho w), \quad w \ge 0; \tag{1.11}$$

• if the Poisson process is non-homogeneous with mean function $\mu(t)$, the process defined on the new time scale $s = \mu(t)$ as

$$M(s) = N(\mu^{-1}(s)), \quad s > 0, \tag{1.12}$$

is a homogeneous Poisson process with unit intensity.

Hence the intensity function ρ(t) can be used to model a time trend in the events.
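The time-change property (1.12) also suggests a simple way to simulate a non-homogeneous Poisson process: generate a unit-rate homogeneous process on the scale $s = \mu(t)$ and map its event times back through $\mu^{-1}$. The following is a minimal Python sketch under stated assumptions: the power-law mean function $\mu(t) = a t^b$, the values of `a` and `b`, and the function name `simulate_nhpp` are illustrative choices, not part of the model used in this thesis.

```python
import random

def simulate_nhpp(mu, mu_inv, t_max, rng):
    """Event times in (0, t_max] of a Poisson process with mean function mu.

    Unit-rate exponential gaps are accumulated on the scale s = mu(t)
    and mapped back to the original time scale with mu_inv.
    """
    s_max = mu(t_max)
    times, s = [], 0.0
    while True:
        s += rng.expovariate(1.0)   # next event of the unit-rate process M(s)
        if s > s_max:
            return times
        times.append(mu_inv(s))     # t = mu^{-1}(s)

# Assumed power-law mean function mu(t) = a * t**b (illustration only).
a, b = 2.0, 1.5

def mu(t):
    return a * t ** b

def mu_inv(s):
    return (s / a) ** (1.0 / b)

rng = random.Random(1)
counts = [len(simulate_nhpp(mu, mu_inv, 4.0, rng)) for _ in range(2000)]
mean_count = sum(counts) / len(counts)   # to be compared with mu(4) = 16
```

By the first property in the list, the simulated counts on $[0, 4]$ should be Poisson with mean $\mu(4) = 16$, so `mean_count` gives a quick sanity check of the sampler.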

1.4 Heterogeneity between individuals

In some contexts the event-generating process may differ among individuals; such heterogeneity can be modeled by including covariates and random effects in the model.

1.4.1 Covariates in the Poisson process

The most common way of including a vector of time-varying covariates $x(t)$ in an intensity-based recurrent event process is to start from a baseline intensity function $\lambda_0(t)$, which corresponds to the intensity function of a reference individual (for example an individual who has $x(t) = 0$).

The next step is to consider intensities of the form:

$$\lambda(t \mid x(t)) = \lambda_0(t)\, g(x(t); \beta), \tag{1.13}$$

where $g(x(t); \beta)$ is a non-negative integrable function and $\beta$ is a vector of regression parameters. Typically $g(x(t); \beta) = \exp(x(t)'\beta)$. This is called the multiplicative model or log-linear model.

When the covariates are time-invariant, their effect on a Poisson process has a simple

interpretation. Indeed, conditionally on the covariates, the corresponding Poisson process


would be characterized by intensity $\lambda_0(t)\,g(x;\beta)$ and mean function $g(x;\beta)\int_0^t \lambda_0(u)\,du$. As a consequence, the mean and the rate functions for two individuals with covariates $x_1$ and $x_2$ are proportional, and $g(x_1;\beta)/g(x_2;\beta)$ is the constant of proportionality (in the multiplicative model the constant is $\exp((x_1 - x_2)'\beta)$). This property does not hold in general when the covariates are time-dependent.

Moreover, some generalizations of the multiplicative model (1.13) can be considered. A possible extension is to include, as covariates, components based on the prior event history (e.g. the number of events experienced before $t$ or the time since the last event). Because of this history dependence, the process is no longer a Poisson process, and it is called a modulated Poisson process.

Another possible extension is to consider intensity functions of the form

λ(t|x(t))=λ0(t)+ g(x(t);β) (1.14)

where g(x(t);β) has to be chosen such that λ(t|x(t))≥ 0. This model is called additive.

The last possible extension presented here is the time transform model, analogous to the accelerated failure time model in survival analysis:

$$\lambda(t \mid x(t)) = \lambda_0\!\left(\int_0^t \exp(x(u)'\beta)\,du\right) \exp(x(t)'\beta). \tag{1.15}$$

In this case $s = \int_0^t \exp(x(u)'\beta)\,du$ can be considered as a transformed time scale.

1.4.2 Random effects

In some situations unobservable factors may create heterogeneity across different individuals that experience the same recurrent event process. In this case it is useful to introduce random effects in order to capture this feature in the model. Thus, the subject-specific intensity function for the $i$-th individual can be written as:

$$\lambda_i(t \mid H(t), u_i, x, \beta) = u_i \lambda_0(t), \tag{1.16}$$

where $u_i$ is called frailty and represents the unobservable individual-specific random effect.

Typically, for inference purposes, the random effects $u_i$ are modeled as independent and identically distributed random variables with Gamma density with mean equal to 1 and variance equal to $\phi$, with $\phi \ge 0$. This is equivalent to stating that, conditionally on $u_i$, the stochastic process $\{N_i(t) : t \ge 0\}$, which counts the events experienced by individual $i$, is a Poisson process with intensity equal to $u_i\lambda_0(t)$.


However, after marginalizing over the random effects, the process is no longer Poisson. Indeed:

$$E[N_i(t) \mid \lambda_0(t)] = \mu_0(t); \tag{1.17}$$

$$\mathrm{var}[N_i(t) \mid \lambda_0(t)] = \mu_0(t) + \mu_0(t)^2\phi; \tag{1.18}$$

$$\mathrm{cov}[N_i(s_1,t_1), N_i(s_2,t_2) \mid \lambda_0(t)] = \phi\,\mu_0(s_1,t_1)\,\mu_0(s_2,t_2); \tag{1.19}$$

where $\mu_0(s,t) = \int_s^t \lambda_0(u)\,du$, $\mu_0(t) = \mu_0(0,t)$ and $s_1 < t_1 < s_2 < t_2$.

Of course, some properties of the Poisson process are violated, for instance the mean and

the variance functions are not equal. Moreover the counts in disjoint intervals are not

statistically independent since their covariance is different from zero. From equations

(1.18) and (1.19) it is clear that the variance of the random effects φ quantifies both the

heterogeneity across individuals (since the variance is an increasing function of it) and

the dependence between counts in disjoint intervals.
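The overdispersion in (1.17) and (1.18) can be checked numerically. The sketch below is only illustrative: the values of `mu0_t` and `phi`, the helper names and the Poisson sampler built from exponential gaps are assumptions made for this example. Frailties are drawn from a Gamma distribution with mean 1 and variance $\phi$, and the count is Poisson conditionally on the frailty.

```python
import random

def poisson_draw(mean, rng):
    """Poisson(mean) draw obtained by counting unit-rate arrivals in (0, mean]."""
    n, s = 0, rng.expovariate(1.0)
    while s <= mean:
        n += 1
        s += rng.expovariate(1.0)
    return n

def frailty_counts(mu0_t, phi, n_subjects, rng):
    """Counts N_i(t) with u_i ~ Gamma(mean 1, variance phi), N_i(t) | u_i ~ Poisson(u_i * mu0_t)."""
    counts = []
    for _ in range(n_subjects):
        u = rng.gammavariate(1.0 / phi, phi)   # shape 1/phi, scale phi
        counts.append(poisson_draw(u * mu0_t, rng))
    return counts

mu0_t, phi = 3.0, 0.5                          # assumed illustrative values
counts = frailty_counts(mu0_t, phi, 20000, random.Random(0))
m = sum(counts) / len(counts)
v = sum((c - m) ** 2 for c in counts) / (len(counts) - 1)
# Theory: mean mu0_t = 3.0 and variance mu0_t + phi * mu0_t**2 = 7.5.
```

With $\phi = 0.5$ the sample variance should be well above the sample mean, in contrast with a Poisson process, for which the two coincide.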

Marginalizing equation (1.16) over the random effect $u_i$ leads to:

$$\lambda_i(t \mid H(t)) = \lambda_0(t)\,\frac{1 + \phi N_i(t^-)}{1 + \phi\mu_0(t)}, \tag{1.20}$$

where $N_i(t^-) = \lim_{s \to t^-} N_i(s)$.

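This can be done by writing

$$P(N_i(t+\Delta t) - N_i(t) = 1 \mid H_i(t)) = \int_0^\infty P(N_i(t+\Delta t) - N_i(t) = 1 \mid H_i(t), u_i)\, \frac{P(H_i(t) \mid u_i)\, g(u_i \mid \phi)}{\int_0^\infty P(H_i(t) \mid u_i)\, g(u_i \mid \phi)\,du_i}\, du_i, \tag{1.21}$$

then, for small $\Delta t$,

$$P(N_i(t+\Delta t) - N_i(t) = 1 \mid H_i(t), u_i) \approx \lambda_0(t)\,u_i\,\Delta t, \tag{1.22}$$

remembering that the density $g(u_i \mid \phi)$ is a Gamma with shape and rate parameters both equal to $\phi^{-1}$ and that

$$P(H_i(t) \mid u_i) = \left\{\prod_{j=1}^{N_i(t^-)} u_i \lambda_0(t_{ij})\right\} \exp\left(-\int_0^t u_i \lambda_0(x)\,dx\right), \tag{1.23}$$

since it is the expression of the likelihood of the process (1.3).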


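Substituting (1.22) and (1.23) in (1.21) and simplifying,

$$\frac{P(N_i(t+\Delta t) - N_i(t) = 1 \mid H_i(t))}{\Delta t} = \lambda_0(t)\, \frac{\int_0^\infty u_i^{N_i(t^-)+\phi^{-1}} \exp\left(-u_i\left(\int_0^t \lambda_0(x)\,dx + \phi^{-1}\right)\right) du_i}{\int_0^\infty u_i^{N_i(t^-)+\phi^{-1}-1} \exp\left(-u_i\left(\int_0^t \lambda_0(x)\,dx + \phi^{-1}\right)\right) du_i} \tag{1.24}$$

$$= \lambda_0(t)\, \frac{\Gamma(N_i(t^-)+\phi^{-1}+1)}{\Gamma(N_i(t^-)+\phi^{-1})}\, \frac{\left(\phi^{-1}+\int_0^t \lambda_0(x)\,dx\right)^{N_i(t^-)+\phi^{-1}}}{\left(\phi^{-1}+\int_0^t \lambda_0(x)\,dx\right)^{N_i(t^-)+\phi^{-1}+1}} = \lambda_0(t)\,\frac{1+\phi N_i(t^-)}{1+\phi\mu_0(t)}, \tag{1.25}$$

which is (1.20).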

Hence, if random effects are present in the model, the intensity depends on the

number of events experienced by the individual.

The random effects approach and the multiplicative model including covariates can be

combined.
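To make the dependence of the marginal intensity (1.20) on the past count concrete, the following small Python sketch evaluates it for a few values of $N_i(t^-)$. The constant baseline of 2 events per year, the evaluation time and the value of $\phi$ are assumptions chosen only for illustration, as is the function name `marginal_intensity`.

```python
def marginal_intensity(lambda0_t, mu0_t, n_prev, phi):
    """Intensity (1.20): lambda0(t) * (1 + phi * N(t-)) / (1 + phi * mu0(t))."""
    return lambda0_t * (1.0 + phi * n_prev) / (1.0 + phi * mu0_t)

# Assumed constant baseline lambda0 = 2 per year, evaluated at t = 1.5 years.
lambda0, t, phi = 2.0, 1.5, 0.5
mu0 = lambda0 * t                        # cumulative baseline intensity, here 3.0
values = {n_prev: marginal_intensity(lambda0, mu0, n_prev, phi)
          for n_prev in (0, 3, 6)}
```

In this example an individual whose past count equals the expected count $\mu_0(t) = 3$ keeps the baseline intensity, while fewer past events lower the intensity and more past events raise it. Setting `phi = 0` recovers the baseline intensity regardless of the history.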

1.5 Extensions to renewal and Poisson processes

1.5.1 "At risk" indicator function

Another feature that can be included in the model is the heterogeneity of the observation time of each individual. In order to do so we introduce the risk indicator function $Y_i(t)$, which is equal to 1 when the $i$-th individual is observed (and is therefore "at risk" of experiencing the event), and equal to 0 otherwise. For example, if an individual is observed in the interval $[\tau_{0i}, \tau_i]$, then $Y_i(t) = I(\tau_{0i} \le t \le \tau_i)$. The notation can also accommodate settings where individuals are observed over disjoint time intervals, for example if an individual is lost to follow-up for a certain period of time.

The right end of the observation window $\tau_i$ is typically called the censoring time, and it represents the termination of the study for the $i$-th individual.

It is now possible to define respectively the observed part of the counting process, the history and the intensity of the observable process:

$$\bar N_i(t) := \int_0^t Y_i(u)\,dN_i(u); \tag{1.26}$$

$$\bar H_i(t) := \{\bar N_i(s), Y_i(s), 0 \le s < t\}; \tag{1.27}$$

$$\bar\lambda_i(t \mid \bar H_i(t)) := \lim_{\Delta t \to 0} \frac{P(\bar N_i(t+\Delta t) - \bar N_i(t) = 1 \mid \bar H_i(t))}{\Delta t}. \tag{1.28}$$

In some cases information from the history of the process is incorporated into the intensity function. If $\Delta N_i(t) := \lim_{\Delta t \to 0} N_i(t+\Delta t) - N_i(t)$ and $Y_i(t)$ are conditionally independent given the history, the intensity of the observable process is such that

$$\bar\lambda_i(t \mid \bar H_i(t)) = \lambda_i(t \mid H_i(t))\,Y_i(t). \tag{1.29}$$

Basically, the observable process has intensity 0 outside of the observation scheme.

The likelihood (1.3) can now be expressed in terms of the observable process as:

$$\exp\left(-\int_0^\infty \lambda(u \mid H(u))\,Y(u)\,du\right) \prod_{j=1}^n \lambda(t_j \mid H(t_j)), \tag{1.30}$$

and it can be used to estimate $\lambda(t \mid H(t))$.
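As a small illustration of the likelihood (1.30), the sketch below evaluates its logarithm for the simplest possible case, a constant intensity $\rho$, with an at-risk indicator built from disjoint observation windows. The constant intensity, the windows, the event times and the function names are all hypothetical and serve only to show how $Y(u)$ enters the computation.

```python
import math

def at_risk(t, windows):
    """Y(t): 1 if t falls in one of the observation windows [start, stop], else 0."""
    return int(any(start <= t <= stop for start, stop in windows))

def loglik_constant_rate(rho, event_times, windows):
    """Log of (1.30) for a constant intensity rho and disjoint observation windows."""
    n_obs = sum(at_risk(t, windows) for t in event_times)      # observed events only
    exposure = sum(stop - start for start, stop in windows)    # integral of Y(u) du
    return n_obs * math.log(rho) - rho * exposure

windows = [(0.0, 2.0), (3.0, 5.0)]   # hypothetical individual lost to follow-up in (2, 3)
events = [0.5, 1.7, 2.4, 4.1]        # the event at 2.4 happens while not at risk
n_obs = sum(at_risk(t, windows) for t in events)
rho_hat = n_obs / sum(stop - start for start, stop in windows)  # maximiser of the log-likelihood
```

In this constant-rate case the maximiser is the number of observed events divided by the total time at risk, which is why the event falling outside the observation windows does not contribute.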

1.5.2 General intensity-based model

In the previous sections the intensity functions of renewal processes and of counting processes have been analyzed. In the case of renewal processes the intensity is a function of the time since the last event; this function is called the hazard function, in analogy with survival analysis. In the case of counting processes the intensity is called the rate function. Both models can be extended with covariates and random effects, by multiplying the baseline intensity function by a function of a linear combination of the covariates and/or by a parameter called frailty, which represents the variability between individuals that is not captured by the observed features.

The two models can be combined in order to have dependence both on the recurrent event counts and on the gap times. In this case the intensity can be written as:

$$\lambda(t \mid H(t)) = \exp\big(\alpha + \beta g_1(t) + \gamma I(N(t^-) > 0)\, g_2(t - T_{N(t^-)})\big). \tag{1.31}$$

The functions $g_1(t)$ and $g_2(t)$ express the dependence on calendar time and on the time since the last event, respectively. When the parameter $\gamma$ is equal to 0 the recurrent event process is a Poisson process, and when the parameter $\beta$ is equal to 0 the recurrent event process is a renewal process, since the intensity depends only on the waiting times. The intensity depends on the process itself, and so it is not always possible to have a well-defined analytical framework as in the Poisson process model. However, thanks to (1.5), it is possible to simulate the gap times and hence to obtain a Monte Carlo estimate of the law of $N(t)$, for any $t$.
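A Monte Carlo estimate of this kind can be sketched in a few lines of Python. Note that the sketch does not invert the gap-time distribution as in (1.5); it uses thinning (acceptance and rejection of candidate points from a dominating homogeneous process) instead, which requires a known upper bound `lam_max` on the intensity. The functions $g_1$, $g_2$, the parameter values and all names below are assumptions made for illustration.

```python
import math
import random

def make_intensity(alpha, beta, gamma, g1, g2):
    """Intensity (1.31) as a function of t and of the list of past event times."""
    def intensity(t, history):
        eta = alpha + beta * g1(t)
        if history:                               # I(N(t-) > 0)
            eta += gamma * g2(t - history[-1])    # time since the last event
        return math.exp(eta)
    return intensity

def simulate_by_thinning(intensity, lam_max, t_max, rng):
    """Event times on (0, t_max]; requires intensity(t, history) <= lam_max."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(lam_max)             # candidate from a rate-lam_max process
        if t > t_max:
            return times
        if rng.random() * lam_max <= intensity(t, times):
            times.append(t)                       # accepted with prob. intensity / lam_max

# Assumed example: each event temporarily lowers the intensity (gamma < 0),
# so the intensity never exceeds exp(alpha) = 1.
lam = make_intensity(0.0, 0.0, -1.0, lambda t: t, lambda w: math.exp(-w))
rng = random.Random(3)
sims = [len(simulate_by_thinning(lam, 1.0, 10.0, rng)) for _ in range(1000)]
mc_mean = sum(sims) / len(sims)                   # Monte Carlo estimate of E[N(10)]
```

The same list `sims` can be used to approximate the whole law of $N(10)$, for instance through the empirical frequencies of each count. With $\beta = \gamma = 0$ the sampler reduces to a homogeneous Poisson process with rate $e^{\alpha}$, which offers a simple check.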


1.5.3 Multi-state Markov models

There are at least two possible approaches to introduce the dependence of the recurrent event process on the number of events experienced until time $t$. The first is to introduce a function of $N(t^-)$ in the covariates, while the second is to model the process as a multi-state Markov model. In this framework every individual at every time is in a particular state, which corresponds to the cumulative number of events experienced until that moment. The only possible transitions are from state $k$ to state $k+1$, and every transition is associated with an intensity $\alpha_k(t)$, where

$$\alpha_k(t) = \lim_{\Delta t \to 0} \frac{P(N(t) - N(t - \Delta t) = 1 \mid N(t - \Delta t) = k, H(t))}{\Delta t}. \tag{1.32}$$

Hence, the intensity of the process can be written as:

$$\lambda(t \mid H(t)) = \sum_{k=0}^{\infty} \alpha_k(t)\, I(N(t^-) = k). \tag{1.33}$$

In the case $\alpha_k(t) = \alpha(t)$ for every $k$ the model is the canonical Poisson process with $\alpha(t)$ as rate function.

1.5.4 Modelling the baseline intensity function

Once covariates and frailties have been introduced in the model, an important issue is the choice of the baseline intensity function. This choice can be either parametric or non-parametric. The simplest parametric choice for the baseline intensity function is the constant intensity. This choice implies a homogeneous Poisson process, where gap times are distributed as Exponential random variables and the mean function is linear in time.

In some contexts the intensity function cannot be constant over time. This is the case, for example, of diseases with significant infant mortality (decreasing intensity function) or of aging processes in which the events are more likely to happen once some time has passed (increasing intensity function). A possible extension is then the Weibull model, with rate function $\rho(t) = \lambda\alpha t^{\alpha-1}$, in which the time to the first event has density:

$$f(x \mid \lambda, \alpha) = \lambda\alpha x^{\alpha-1} e^{-\lambda x^{\alpha}}\, I_{\{x \ge 0\}}(x). \tag{1.34}$$

If $\alpha > 1$ the intensity is increasing, if $\alpha = 1$ the intensity is constant, otherwise it is decreasing. Under this assumption:

$$N(t) \sim \mathrm{Poisson}(\lambda t^{\alpha}), \quad \forall t \ge 0, \tag{1.35}$$

and $\{N(t) : t \ge 0\}$ is a Poisson process.

The baseline intensity function can assume a non-parametric form in the following way. Let us divide the observation time into $K$ disjoint intervals, taking $a_0 = 0 < a_1 < \dots < a_K$ as cut-points. For each of the resulting sub-intervals $(a_{k-1}, a_k]$ let us assume that the intensity is constant and equal to $\lambda_k > 0$. Now the baseline intensity function is characterized by the vector of parameters $(\lambda_1, \dots, \lambda_K)$. This kind of model can approximate the shape of any intensity function, and the approximation improves as $K$ grows. However, a larger $K$ implies more parameters to estimate and thus a greater computational effort. In this case, including time-varying covariates $\{x_i(t) : t \ge 0\}$, the likelihood of the model can be expressed as the product of the contributions of every individual on each interval:

$$\prod_{k=1}^{K} \left\{ \lambda_k^{n_{\cdot k}} \prod_{i=1}^{M} \exp\left( \sum_{j=1}^{n_i} x_i(t_{ij})'\beta\, I_{(a_{k-1},a_k]}(t_{ij}) - \lambda_k \int_{a_{k-1}}^{a_k} Y_i(s) \exp(x_i(s)'\beta)\,ds \right) \right\}, \tag{1.36}$$

where $t_{ij}$ is the time of the $j$-th event experienced by the $i$-th individual, $n_i$ is the number of events experienced by the $i$-th individual, $M$ is the total number of individuals, and $n_{\cdot k} = \sum_{i=1}^{M} n_{ik}$, where $n_{ik}$ is the total number of events between $a_{k-1}$ and $a_k$ experienced by the $i$-th individual.
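The logarithm of the likelihood (1.36) is straightforward to evaluate when the covariates are constant in time, because the integral of $Y_i(s)\exp(x_i'\beta)$ over an interval reduces to the time at risk in that interval multiplied by $\exp(x_i'\beta)$. The sketch below is written under this simplifying assumption and with a single observation window per individual; the data layout (a list of dictionaries with keys `x`, `events`, `start`, `stop`) and the function name are hypothetical.

```python
import math
from bisect import bisect_left

def piecewise_loglik(cuts, lam, beta, subjects):
    """Log of (1.36) with time-invariant covariates.

    cuts = [a_0, ..., a_K], lam = [lambda_1, ..., lambda_K];
    each subject is observed on [start, stop] and has event times in 'events'.
    """
    ll = 0.0
    for sub in subjects:
        eta = sum(xj * bj for xj, bj in zip(sub["x"], beta))   # x_i' beta
        for t in sub["events"]:
            k = bisect_left(cuts, t) - 1        # index of the interval (a_k, a_{k+1}] containing t
            ll += math.log(lam[k]) + eta
        for k in range(len(lam)):
            # time at risk of this subject inside the k-th interval
            exposure = max(0.0, min(sub["stop"], cuts[k + 1]) - max(sub["start"], cuts[k]))
            ll -= lam[k] * exposure * math.exp(eta)
    return ll

cuts = [0.0, 1.0, 2.0]
lam = [0.5, 2.0]
subjects = [{"x": [], "events": [0.5, 1.5, 1.8], "start": 0.0, "stop": 2.0}]
value = piecewise_loglik(cuts, lam, [], subjects)
```

For the single individual in the example the result is $\log 0.5 + 2\log 2 - (0.5 + 2)$, since one event falls in the first interval, two in the second, and the individual is at risk for one time unit in each.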

The cut-points can be chosen in different ways. In order to guarantee an estimate of every $\lambda_k$, at least one observed event must fall into the corresponding interval; for this reason one possible choice is to set $a_k$ as the $k/K$ empirical quantile of the distribution of the event times. Another possible choice, simpler and independent of the observations, is to divide the observation time into $K$ equispaced intervals. Of course this modelling choice should be subjected to a sensitivity analysis, both on the number of cut-points $K$ and on their position on the time domain. In the survival analysis literature Gustafson et al. (2003) suggest the use of the quantiles, while Yin et al. (2006) and Sahu et al. (1997) propose the use of equispaced grids. In the field of recurrent event processes in a Bayesian setting, Pennell and Dunson (2006) use a tightly spaced grid and an autocorrelated prior in order to borrow strength between intervals.
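The two simplest grid choices discussed above, quantile-based and equispaced, can be sketched as follows. This is only an illustration: the event times are made up, the function names are hypothetical, and the quantile rule used (`statistics.quantiles` with the inclusive method) is one of several possible definitions of the empirical quantile.

```python
import statistics

def quantile_cuts(event_times, K):
    """a_0 = 0, a_k = empirical k/K quantile of the event times, a_K = largest event time."""
    inner = statistics.quantiles(event_times, n=K, method="inclusive")  # K - 1 cut-points
    return [0.0] + inner + [max(event_times)]

def equispaced_cuts(t_max, K):
    """K intervals of equal length on [0, t_max]."""
    return [t_max * k / K for k in range(K + 1)]

event_times = [0.3, 0.9, 1.4, 2.2, 2.5, 3.1, 3.8, 4.0]   # made-up event times
q_cuts = quantile_cuts(event_times, 4)
e_cuts = equispaced_cuts(4.0, 4)
```

With heavily tied event times the quantile-based cut-points may coincide, in which case duplicated values should be removed before fitting.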

If one imposes that $\lambda_1, \dots, \lambda_K$ are independent random variables with proper Gamma distributions, the resulting cumulative intensity function $\mu(t) = \int_0^t \lambda(u)\,du$ is a realization of a Gamma process (see Kalbfleisch, 1978), which is a stochastic process built such that the increments are independent Gamma-distributed random variables, namely, for $t > s$:

$$\mu(t) - \mu(s) \sim \Gamma\big(\phi(t) - \phi(s),\, c\big), \tag{1.37}$$

where $\phi(t)$ is an increasing function, and $c$ is a positive-valued parameter.

1.6 The Bayesian approach

1.6.1 Bayesian Statistics

In a statistical model, once a dataset $y = (y_1, \dots, y_n)$ is observed, it is possible to associate a measure of belief with it through $p(y \mid \theta)$, which depends on a vector of parameters $\theta$. $p(y \mid \theta)$ is called the likelihood, and it is a probability measure. The vector $\theta$ typically summarises the characteristics of the population from which the dataset $y$ is sampled. While in the frequentist framework $\theta$ is a fixed quantity, in Bayesian statistics it is a random variable, and a probability measure $\pi(\theta)$ is associated with its possible values. $\pi(\theta)$ is called the prior distribution. Hence the likelihood function $p(y \mid \theta)$ has to be interpreted as the probability associated with $y$ when $\theta$ is the true parameter vector. Summarising:

• $\pi(\theta)$ is a measure of belief that $\theta$ represents the true characteristics of the population;

• $p(y \mid \theta)$ is a measure of belief that $y$ would be sampled from the population if $\theta$ is the true parameter.

The Bayesian approach offers a way to update the prior beliefs about $\theta$ through the computation of the posterior distribution $\pi(\theta \mid y)$, which summarises the beliefs about $\theta$ once $y$ is observed. This is done by using Bayes' theorem:

$$\pi(\theta \mid y) = \frac{p(y \mid \theta)\,\pi(\theta)}{\int p(y \mid \theta)\,\pi(\theta)\,d\theta}, \tag{1.38}$$

where the integral is over the whole support of $\theta$.

Once this function is known it is possible to compute all the summaries of the posterior distribution, like the posterior mean $E[\theta \mid y]$ and the posterior variance $\mathrm{Var}[\theta \mid y]$, or to make an interval estimate $C$ such that $P(\theta \in C \mid y) = 1 - \alpha$.
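A case in which (1.38) is available in closed form may help to fix ideas. For a homogeneous Poisson process with rate $\rho$ observed for a total time $T$ with $n$ events, a Gamma prior with shape $a$ and rate $b$ on $\rho$ gives a Gamma posterior with shape $a + n$ and rate $b + T$. The Python sketch below uses this standard conjugate result; the prior parameters and the data are made up, and the credible interval is approximated from simulated draws rather than computed exactly.

```python
import random

def gamma_poisson_posterior(a, b, n_events, exposure):
    """Posterior (shape, rate) of a constant rate under a Gamma(a, b) prior."""
    return a + n_events, b + exposure

shape, rate = gamma_poisson_posterior(a=2.0, b=1.0, n_events=12, exposure=5.0)
post_mean = shape / rate                 # E[rho | y]
post_var = shape / rate ** 2             # Var[rho | y]

# Equal-tailed 95% interval approximated from posterior draws.
rng = random.Random(0)
draws = sorted(rng.gammavariate(shape, 1.0 / rate) for _ in range(20000))
interval = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))])
```

Here `interval` plays the role of the set $C$ with $\alpha = 0.05$.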

The Bayesian method offers a typical scientific approach where some hypotheses on a phenomenon (summarised in $\pi(\theta)$) are validated by the collection of data $y$, leading to a new point of view, namely the posterior distribution $\pi(\theta \mid y)$.

In this thesis the Bayesian approach will be followed. In Chapter 3 the statistical model is set up with the likelihood of the data and the prior elicitation, while in Chapter 4 the posterior inference is shown and commented.


1.6.2 Markov chain Monte Carlo

Equation (1.38) is usually an intractable expression, hence inference is carried out by simulating a sample from the posterior distribution. Markov chain Monte Carlo (MCMC) methods offer a way to do so.

MCMC is a class of algorithms in which a Markov chain whose stationary distribution is the posterior distribution is simulated. This means that, if the simulation is run for long enough, every step of the Markov chain can be considered approximately as a draw from the posterior distribution. An MCMC algorithm generates a Markov chain $\theta^{(1)}, \dots, \theta^{(T)}$, where $\theta^{(t)}$ is independent of $\theta^{(1)}, \dots, \theta^{(t-2)}$ conditionally on $\theta^{(t-1)}$. Then, under general conditions, if $T \to \infty$ and if $h(\theta)$ is a measurable function:

$$\frac{1}{T} \sum_{t=1}^{T} h(\theta^{(t)}) \to \int h(\theta)\,\pi(\theta \mid y)\,d\theta = E[h(\theta) \mid y]. \tag{1.39}$$

Hence all the summaries of the posterior distribution can be approximated by averaging over the MCMC sample.
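The ergodic average (1.39) can be illustrated with the simplest MCMC scheme, a random-walk Metropolis sampler. This is not the Hamiltonian Monte Carlo algorithm used in this thesis; it is a minimal Python sketch in which the target (an unnormalised Gamma density with shape 14 and rate 6), the step size, the chain length and the warm-up length are all assumptions chosen for illustration.

```python
import math
import random

def log_post(rho, shape=14.0, rate=6.0):
    """Unnormalised log posterior; a Gamma(shape, rate) target assumed for illustration."""
    if rho <= 0.0:
        return -math.inf
    return (shape - 1.0) * math.log(rho) - rate * rho

def random_walk_metropolis(log_target, start, step, n_iter, rng):
    """Markov chain whose stationary density is proportional to exp(log_target)."""
    chain, current, lp = [], start, log_target(start)
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        lp_prop = log_target(proposal)
        # accept with probability min(1, target(proposal) / target(current))
        if math.log(1.0 - rng.random()) < lp_prop - lp:
            current, lp = proposal, lp_prop
        chain.append(current)
    return chain

chain = random_walk_metropolis(log_post, 1.0, 0.8, 30000, random.Random(42))
kept = chain[5000:]                        # discard warm-up draws
post_mean_hat = sum(kept) / len(kept)      # ergodic average (1.39) with h(theta) = theta
```

Since the target is a known Gamma density, the estimate can be compared with the exact mean $14/6$; other summaries are obtained by changing the function $h$ applied to the retained draws.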

The MCMC algorithm used in this thesis is Hamiltonian Monte Carlo (HMC), which is efficiently implemented in the software Stan (see Stan Development Team and others, 2016). Stan is open source software written in C++ that can be used from R through the package rstan.

1.6.3 Discretization of the Gamma process prior

A possible implementation of a non-parametric intensity model is the Gamma process. In Johnson et al. (2010), Section 13.2.5, a discretization of the Gamma process prior in the survival setting is given, but this can be extended to the framework of recurrent events. The model is the following.

First of all, a partition of the time domain must be given. Let us call it $a_0 := 0 < a_1 < \dots < a_K$. Then the idea is to center the intensity function on a certain value $\lambda^*$, which corresponds to the intensity function of an Exponential random variable of parameter $\lambda^*$. As a consequence, all the pieces of the intensity function must satisfy the equation:

$$E[\lambda_k \mid \lambda^*] = \lambda^*. \tag{1.40}$$

A further requirement is that the prior variance of each $\lambda_k$ is inversely proportional to the length of the corresponding interval $a_k - a_{k-1}$ and to another parameter called $w$, which is common to all the steps of the intensity function. Once mean and variance are defined, the last condition to impose is that the increments are Gamma-distributed, and this is equivalent to:

$$\lambda_k \mid w, \lambda^* \overset{\text{ind}}{\sim} \mathrm{Gamma}\big(\lambda^* w (a_k - a_{k-1}),\; w (a_k - a_{k-1})\big), \quad k = 1, \dots, K. \tag{1.41}$$

The parameters $\lambda^*$ and $w$ can be fixed or they can be modelled with a prior distribution.
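The prior (1.41) is easy to sample from, which gives a quick check of the two requirements on the mean and the variance. In the Python sketch below the Gamma distribution is parameterised by shape and rate, the interval length is taken as $a_k - a_{k-1}$ as in the text, and the grid, the values of $\lambda^*$ and $w$ and the function name are illustrative assumptions.

```python
import random
import statistics

def sample_baseline(cuts, lam_star, w, rng):
    """Draw (lambda_1, ..., lambda_K) with shape lam_star * w * d_k and rate w * d_k."""
    lams = []
    for k in range(1, len(cuts)):
        d_k = cuts[k] - cuts[k - 1]                       # interval length a_k - a_{k-1}
        # random.gammavariate takes shape and scale, so scale = 1 / rate
        lams.append(rng.gammavariate(lam_star * w * d_k, 1.0 / (w * d_k)))
    return lams

cuts, lam_star, w = [0.0, 0.5, 2.0], 2.0, 4.0             # assumed illustrative values
rng = random.Random(2)
draws = [sample_baseline(cuts, lam_star, w, rng) for _ in range(20000)]
mean_1 = statistics.fmean(d[0] for d in draws)            # theory: lam_star = 2
var_1 = statistics.variance([d[0] for d in draws])        # theory: lam_star / (w * 0.5) = 1
var_2 = statistics.variance([d[1] for d in draws])        # theory: lam_star / (w * 1.5) = 1/3
```

Both steps are centred on $\lambda^*$, while the step defined on the longer interval has the smaller prior variance, as required.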

Another important reference for Bayesian modelling of recurrent events is Ouyang et al. (2013), which also discusses the case where the termination of the observation of the recurrent event process depends on the process itself. Ouyang et al. (2013) propose to model the steps of the intensity function as a priori independent and identically distributed, which is the approach used in this thesis.

1.6.4 Autocorrelated prior for the baseline intensity function

In Pennell and Dunson (2006) the prior structure of $\lambda_1, \dots, \lambda_K$ is built to have correlations among the parameters. Every step of the intensity function is written as

$$\lambda_k = \tilde\lambda_k \Delta_k, \tag{1.42}$$

where $\tilde\lambda_k$ is the initial guess on the baseline intensity in that interval and $\Delta_k$ is a multiplicative effect. The multiplicative effects are modelled in the following way:

$$\Delta_k = \nu_0 \prod_{h=1}^{k} \nu_h, \quad k = 1, \dots, K, \tag{1.43}$$

$$\nu_0 \sim \mathrm{Gamma}(\phi, \phi), \tag{1.44}$$

$$\nu_k \overset{\text{i.i.d.}}{\sim} \mathrm{Gamma}(\psi, \psi), \quad k = 1, \dots, K. \tag{1.45}$$

It can be noticed that $\Delta_k = \Delta_{k-1}\nu_k$, and so a covariance structure is induced in the multiplicative effects. Moreover $\phi$ controls the degree of shrinkage of the posterior towards the initial guess on the baseline, and $\psi$ regulates the smoothness of the deviations from the prior estimate.

Another autocorrelated prior is proposed in Arjas and Gasbarra (1994). In this case

$$\lambda_1 \sim \mathrm{Gamma}(\alpha_1, \beta_1), \tag{1.46}$$

$$\lambda_k \mid \lambda_{k-1}, \dots, \lambda_1 \sim \mathrm{Gamma}\left(\alpha, \frac{\alpha}{\lambda_{k-1}}\right), \quad k = 2, \dots, K, \tag{1.47}$$

with $\alpha_1$ and $\beta_1$ that have to be chosen in order to model the value at time $t = 0$ of the baseline intensity function. The prior variance of the parameters $\lambda_k$ is inversely proportional to the parameter $\alpha$. In fact from the following equations

$$E[\lambda_k \mid \lambda_{k-1}, \dots, \lambda_1] = \lambda_{k-1}, \tag{1.48}$$

$$\sqrt{\mathrm{Var}[\lambda_k \mid \lambda_{k-1}, \dots, \lambda_1]} = \frac{\lambda_{k-1}}{\sqrt{\alpha}}, \tag{1.49}$$

it can be noticed that, if $\alpha$ is very small, large deviations from the mean are allowed. In the limiting case of $\alpha \to \infty$ the baseline intensity function is a priori constant. Equation (1.48) is equivalent to assuming that the baseline intensity function has a martingale structure with respect to the prior distribution and the internal filtration.
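Both autocorrelated priors of this section can be simulated directly, which is a convenient way to see how the hyperparameters control the prior paths. The Python sketch below treats every Gamma distribution as parameterised by shape and rate; the function names, the number of steps and the hyperparameter values are assumptions made for illustration.

```python
import random

def pennell_dunson_effects(K, phi, psi, rng):
    """Multiplicative effects Delta_1, ..., Delta_K built from nu_0 and nu_1, ..., nu_K."""
    delta = rng.gammavariate(phi, 1.0 / phi)           # nu_0 ~ Gamma(phi, phi), mean 1
    effects = []
    for _ in range(K):
        delta *= rng.gammavariate(psi, 1.0 / psi)      # multiply by the next nu_k
        effects.append(delta)
    return effects

def arjas_gasbarra_path(K, alpha1, beta1, alpha, rng):
    """lambda_1, ..., lambda_K; each step is centred on the previous one."""
    lam = rng.gammavariate(alpha1, 1.0 / beta1)
    path = [lam]
    for _ in range(K - 1):
        lam = rng.gammavariate(alpha, lam / alpha)     # shape alpha, rate alpha / lam
        path.append(lam)
    return path

rng = random.Random(5)
pd_last = [pennell_dunson_effects(5, 10.0, 10.0, rng)[-1] for _ in range(5000)]
ag_last = [arjas_gasbarra_path(5, 4.0, 2.0, 50.0, rng)[-1] for _ in range(5000)]
pd_mean = sum(pd_last) / len(pd_last)   # every nu has mean 1, so E[Delta_K] = 1
ag_mean = sum(ag_last) / len(ag_last)   # martingale property: E[lambda_K] = alpha1 / beta1 = 2
```

Lowering `psi` or `alpha` in the two samplers produces rougher prior paths, while large values keep the path close to the initial guess and to $\lambda_1$, respectively.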

1.7 Model evaluation in terms of predictive performances

The fitting of a statistical model is often followed by its evaluation in terms of predictive

accuracy. The idea is to obtain an unbiased and accurate measure of the out-of-sample

predictive error. This issue has been tackled also in Bayesian statistics, for example in

Gelman et al. (2014) and Vehtari et al. (2017).

The most natural way to estimate the predictive error is through cross-validation; however, it requires multiple fits of the model and, especially in the Bayesian setting, this could be a problem because of the computational burden of MCMC methods. Alternative methods aim to estimate the out-of-sample predictive error with the data used to fit the model, applying a correction for the bias that arises from evaluating the model's predictions on those same data. Some of these measures are the Akaike Information Criterion (AIC), the Deviance Information Criterion (DIC), and the Watanabe–Akaike Information Criterion (WAIC), which is a fully Bayesian method.

1.7.1 Log posterior predictive density

Consider data $y_1, \dots, y_n$ modeled as observations of independent random variables given the parameter $\theta$. The contribution of the single data point $y_i$ to the likelihood of the model is $p(y_i \mid \theta)$, while the total likelihood is $p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta)$. The notation can be generalized to the case with covariates by substituting $p(y_i \mid \theta)$ with $p(y_i \mid \theta, x_i)$. If a new data point $\tilde y$ is produced by the true data generating process, the out-of-sample predictive fit for this datum can be computed as:

$$\log p(\tilde y \mid y_1, \dots, y_n) = \log E\big[p(\tilde y \mid \theta) \mid y_1, \dots, y_n\big] = \log \int p(\tilde y \mid \theta)\, p(\theta \mid y_1, \dots, y_n)\,d\theta. \tag{1.50}$$


This quantity can be estimated by:

$$\mathrm{lppd} = \log \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_n) = \sum_{i=1}^{n} \log \int p(y_i \mid \theta)\, p(\theta \mid y_1, \dots, y_n)\,d\theta, \tag{1.51}$$

where lppd stands for log pointwise predictive density. Equation (1.51) is a biased estimate of (1.50), since the predictive fit is evaluated at the data points used to fit the model: the observation $y_i$ appears both in the likelihood $p(y_i \mid \theta)$ and in $p(\theta \mid y_1, \dots, y_n)$, which is the posterior distribution of $\theta$.

To compute (1.51), it is possible to evaluate the expectation using draws from the posterior distribution of the parameters $p(\theta \mid y_1, \dots, y_n)$, which are indicated as $\theta^{(s)}$, $s = 1, \dots, S$:

$$\text{computed lppd} = \sum_{i=1}^{n} \log\left(\frac{1}{S} \sum_{s=1}^{S} p(y_i \mid \theta^{(s)})\right). \tag{1.52}$$
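In practice (1.52) is computed from a matrix of pointwise log-likelihood values, one row per posterior draw and one column per data point, and the average inside the logarithm is best evaluated on the log scale to avoid numerical underflow. The following Python sketch shows one way to do it; the input layout `log_lik[s][i]`, the function names and the toy values are assumptions for illustration.

```python
import math

def log_mean_exp(values):
    """Logarithm of the average of exp(values), computed stably."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def lppd(log_lik):
    """Computed lppd (1.52) from log_lik[s][i] = log p(y_i | theta^(s))."""
    n = len(log_lik[0])
    return sum(log_mean_exp([row[i] for row in log_lik]) for i in range(n))

# Toy input with S = 2 posterior draws and n = 2 data points.
toy = [[math.log(0.2), math.log(0.5)],
       [math.log(0.4), math.log(0.1)]]
toy_lppd = lppd(toy)
```

For the toy input both data points have an average likelihood of 0.3 over the two draws, so the result equals $2\log 0.3$.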

1.7.2 Computation of WAIC

WAIC (introduced by Watanabe in 2010) estimates the out-of-sample predictive measure by computing expression (1.52) and then applying a bias correction. The expected log pointwise predictive density is then computed as:

$$\mathrm{elppd}_{WAIC} = \mathrm{lppd} - p_{WAIC}, \tag{1.53}$$

where $p_{WAIC}$ is the adjustment, which can be computed in two ways:

• $p_{WAIC1} = 2\sum_{i=1}^{n} \big(\log E[p(y_i \mid \theta) \mid y_1, \dots, y_n] - E[\log p(y_i \mid \theta) \mid y_1, \dots, y_n]\big)$;

• $p_{WAIC2} = \sum_{i=1}^{n} \mathrm{Var}[\log p(y_i \mid \theta) \mid y_1, \dots, y_n]$.

Both measures can be approximated once an MCMC sample is available. Gelman et al. (2014) recommend $p_{WAIC2}$ because, in its series expansion, equation (1.53) resembles leave-one-out cross-validation.
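Given the pointwise log-likelihood values of an MCMC sample, (1.53) with the adjustment $p_{WAIC2}$ takes only a few lines. The Python sketch below approximates the posterior variance of each $\log p(y_i \mid \theta)$ with the sample variance over the draws; the input layout `log_lik[s][i]`, the function name and the toy values are assumptions for illustration.

```python
import math
import statistics

def waic(log_lik):
    """Return (elppd_waic, p_waic2) from log_lik[s][i] = log p(y_i | theta^(s))."""
    n = len(log_lik[0])
    lppd, p_waic2 = 0.0, 0.0
    for i in range(n):
        col = [row[i] for row in log_lik]                 # values for data point i
        m = max(col)
        lppd += m + math.log(sum(math.exp(v - m) for v in col) / len(col))
        p_waic2 += statistics.variance(col)               # sample variance over the S draws
    return lppd - p_waic2, p_waic2

# Toy input with S = 2 posterior draws and n = 2 data points.
toy = [[math.log(0.2), math.log(0.5)],
       [math.log(0.4), math.log(0.1)]]
elppd_waic, p_waic2 = waic(toy)
```

If the log-likelihood of every data point were identical across draws, the adjustment would be zero and the criterion would reduce to the lppd. When several models are compared, the model with the largest `elppd_waic` is preferred.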

1.7.3 Evaluating predictive accuracy in the case of recurrent events

All the formulas in the previous section rely on a partition of the data into units (the $y_i$'s, for which it is possible to compute the probability $p(y_i \mid \theta)$). In the case of a recurrent event process one possibility is to consider the whole process of events for every individual $i$ in the study. Hence $p(y_i \mid \theta)$ is:

$$\exp\left(-\int_0^\infty \lambda_i(u \mid H_i(u), \theta)\,Y_i(u)\,du\right) \prod_{j=1}^{n_i} \lambda_i(t_{ij} \mid H_i(t_{ij}), \theta). \tag{1.54}$$

In the case of a multiplicative model with random effects and covariates,

$$\lambda_i(t \mid H_i(t), \theta) = w_{new}\,\lambda_0(t \mid H(t), \theta)\,\exp(x_i'\beta). \tag{1.55}$$

Since the main interest lies in predictive accuracy, $w_{new}$ is not the random effect of individual $i$ (which is estimated in the model), but the frailty of a new incoming individual given the observations.


Chapter 2. Data source

In this chapter details on the dataset that has been analyzed are given. The first section presents the AVIS association, from its history to the rules that regulate blood donations. All the information given is taken from the websites of AVIS and AVIS Milan. A thorough description of the AVIS and EMONET databases (the data sources) then follows.

2.1 The AVIS association

2.1.1 Brief history of AVIS

The Associazione Volontari Italiani Sangue (AVIS) was founded in Milan in 1927 thanks to the physician Vittorio Formentano, who made an appeal in a daily newspaper of the time to form a group of volunteer donors. Seventeen people answered the call and formed the first AVIS group.

However, the official formation of the association dates to 1929; transfusion therapies started to be accessible to everybody, and not only to wealthy people. At the same time the memorandum of the association was approved. A passage of the memorandum can be translated as follows: "The purpose of the Association is to promote, especially in the working class, the humanitarian, social and patriotic concept of the voluntary offering of their own blood." In this period blood donor associations were formed in other cities such as Ancona, Bergamo, Brescia, Torino, Napoli, Cagliari and Cremona.

With the purpose of coordinating the local groups spread across Italy, in 1946 the Association assumed a national form, with its headquarters in Milan.

In 1950 the Republic of Italy gave legal recognition to AVIS with Law n. 49; in 1967 Law n. 592 recognized the civic and social role of AVIS in the organization and promotion of transfusion activities, while in 1990 another law established the principle that blood is donated free of charge. Furthermore, it states that voluntary blood donor associations and the related federations contribute to the institutional aims of the National Health Service concerning the promotion and development of blood donations and the protection of donors.

The activity of the association became more and more popular: in 2005 AVIS reached the goal of one million donors, and in 2009, for the first time since its foundation, more than two million donations took place in Italy.

In 2017 AVIS celebrated its 90th anniversary; through its long life it has become one of the most important voluntary associations in Italy.

2.1.2 Italian donation rules

Because of the importance of blood in healthcare, blood donation is regulated by a set of rules. These rules are meant to protect both the health of the patient who will receive the blood and the health of the donor. A legislative act called "Disposizioni relative ai requisiti di qualità e sicurezza del sangue e degli emocomponenti" (see Ministero Della Salute, 2015) collects all these rules.

Any candidate donor must be between 18 and 60 years old. However, the responsible physician can allow a candidate older than 60 to donate for the first time. The age limit is raised to 65 for periodic donors; in this case too, the physician can allow a person to donate until the age of 70 after a clinical evaluation of the age-related risks. Every donor must weigh more than 50 kg, and blood pressure, heart rate and hemoglobin level must lie within certain ranges. The yearly maximum number of donations is 4 for men and for women in menopause, and 2 for other women. By law, the minimum gap time between two consecutive donations is 90 days. To respect the yearly maximum for women, their minimum gap time is set to 180 days, but this is an internal rule of the association, not a legal limit. However, the responsible physician can bring the donation forward if he or she believes that the health and well-being of the donor are not in danger. A donor can be suspended, temporarily or permanently, if donating could in some way compromise his/her health or the quality of the donated component. Suspensions are not exceptional events; for example, travel to exotic countries, dental care, a change of partner or a recent flu are some causes of temporary suspension. Of course, the length of the suspension is related to the severity of the cause.
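The gap-time rules above can be illustrated with a minimal sketch. The function name, the sex coding ("M"/"F") and the `in_menopause` flag are hypothetical; the sketch also assumes that women in menopause follow the 90-day legal limit, which the text suggests but does not state explicitly, and it ignores the physician's discretion to bring a donation forward.

```python
from datetime import date, timedelta

MIN_GAP_DAYS_LAW = 90           # legal minimum gap between two donations
MIN_GAP_DAYS_WOMEN_AVIS = 180   # internal AVIS rule for women (assumed: not in menopause)


def earliest_next_donation(last_donation, sex, in_menopause=False):
    """Return the first date on which a new whole blood donation is allowed.

    `sex` is "M" or "F" (hypothetical coding, not the database one).
    """
    if sex == "F" and not in_menopause:
        gap_days = MIN_GAP_DAYS_WOMEN_AVIS
    else:
        gap_days = MIN_GAP_DAYS_LAW
    return last_donation + timedelta(days=gap_days)


next_for_man = earliest_next_donation(date(2018, 1, 1), "M")
next_for_woman = earliest_next_donation(date(2018, 1, 1), "F")
```

With a last donation on 1 January 2018, the sketch gives 1 April 2018 for a man and 30 June 2018 for a woman who is not in menopause.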


2.2 Data sources

The data of the Milan section of AVIS come from two databases: the EMONET database and the AVIS database. The data used in this work have been collected from multiple tables of the two databases. The EMONET database contains tables concerning donations and the personal data of the donors; the AVIS database contains information about suspensions and donors' habits. All the data have been extracted with SQL queries on the AVIS servers and joined on the unique ID of the donor and/or the unique ID of the blood donation. The dataset has been built using only the variables listed in Tables 2.1 to 2.7, which are presented in the next subsections to describe the two databases.
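As a rough illustration of this kind of join, the sketch below builds a toy in-memory SQLite database and selects valid whole blood donations together with the donor's sex. The table and column names are those listed in Tables 2.1 to 2.3; the rows are invented, and SQLite is only a stand-in, since the actual database system and queries used on the AVIS servers are not described here.

```python
import sqlite3

# Toy in-memory database: table and column names follow Tables 2.1-2.3,
# every row below is invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE PRESENTAZIONI (CAI INTEGER, DTPRES TEXT, IDPRES INTEGER,
                                TIPO_ATTIVITA TEXT, ID_PUNTPREL INTEGER);
    CREATE TABLE DONAZIONI (CAI INTEGER, DTPRES TEXT, IDPRES INTEGER,
                            ID_EMCDON INTEGER, VALIDITA TEXT);
    CREATE TABLE ANAGRAFICHE (CAI INTEGER, SESSO INTEGER, DATANASCITA TEXT);
    """
)
conn.executemany("INSERT INTO PRESENTAZIONI VALUES (?, ?, ?, ?, ?)", [
    (1, "2015-03-02 09:00", 10, "D", 1),   # donation visit
    (1, "2015-09-10 09:30", 11, "C", 1),   # control visit, no donation
    (2, "2016-01-15 08:45", 12, "D", 1),   # donation visit (plasma)
])
conn.executemany("INSERT INTO DONAZIONI VALUES (?, ?, ?, ?, ?)", [
    (1, "2015-03-02 09:00", 10, 1, "V"),   # whole blood, valid
    (2, "2016-01-15 08:45", 12, 2, "V"),   # plasma, valid
])
conn.executemany("INSERT INTO ANAGRAFICHE VALUES (?, ?, ?)", [
    (1, 1, "1985-06-01"),
    (2, 2, "1990-11-23"),
])

# Join on the donation id (IDPRES) and on the donor id (CAI),
# keeping only valid whole blood donations (ID_EMCDON = 1).
QUERY = """
    SELECT p.CAI, p.IDPRES, p.DTPRES, a.SESSO
    FROM PRESENTAZIONI AS p
    JOIN DONAZIONI AS d ON d.IDPRES = p.IDPRES AND d.CAI = p.CAI
    JOIN ANAGRAFICHE AS a ON a.CAI = p.CAI
    WHERE p.TIPO_ATTIVITA = 'D' AND d.VALIDITA = 'V' AND d.ID_EMCDON = 1
    ORDER BY p.CAI, p.DTPRES
"""
rows = conn.execute(QUERY).fetchall()
```

In this toy example only the first visit of donor 1 survives the filter, because the second visit is a control and donor 2 donated plasma.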

2.2.1 The EMONET database

We have considered five tables in the EMONET database:

• tables PRESENTAZIONI and DONAZIONI contain information about the donations (see Tables 2.1 and 2.2);

• tables TIPIZZAZIONE and ANAGRAFICHE contain information about the donors (see Tables 2.5 and 2.3);

• table EMC_DONABILI records the blood components that can be donated (see Table 2.4).

| Column | Type | Description |
|---|---|---|
| CAI | numerical | donor unique id |
| DTPRES | date-time | date and time |
| IDPRES | numerical | donation unique id |
| TIPO_ATTIVITA | categorical | D for donation, C for control |
| ID_PUNTPREL | numerical | AVIS location unique id |

Table 2.1: Variables from table PRESENTAZIONI in EMONET database that are included in our dataset


| Column | Type | Description |
|---|---|---|
| CAI | numerical | donor unique id |
| DTPRES | date-time | date and time |
| IDPRES | numerical | donation unique id |
| ID_EMCDON | categorical | blood component unique id: 1 for whole blood, 2 for plasma, ... |
| VALIDITA | categorical | V if the donation was effective, N otherwise |

Table 2.2: Variables from table DONAZIONI in EMONET database that are included in our dataset

| Column | Type | Description |
|---|---|---|
| CAI | numerical | donor unique id |
| SESSO | numerical | donor gender (1 for man, 2 for woman) |
| DATANASCITA | date | donor's date of birth |
| CAP_DOMIC | categorical | donor's domicile postal code |
| CAP_RESID | categorical | donor's residence postal code |

Table 2.3: Variables from table ANAGRAFICHE in EMONET database that are included in our dataset

| Column | Type | Description |
|---|---|---|
| ID_EMCDON | categorical | blood component unique id: 1 for whole blood, 2 for plasma, ... |
| INTERVALLO | numerical | minimum gap time between two donations of the component |
| DESCR | character | description of the blood component |
| NDONMAXMAS | numerical | maximum number of donations in a year for men |
| NDONMAXFEM | numerical | maximum number of donations in a year for women |

Table 2.4: Variables from table EMC_DONABILI in EMONET database that are included in our dataset

| Column | Type | Description |
|---|---|---|
| CAI | numerical | donor unique id |
| AB0 | numerical | blood type (A, A1, A2, A3, B, AB, A1B, A2B, 0) |
| TIPO_RH | categorical | Rhesus factor (POS or NEG) |

Table 2.5: Variables from table TIPIZZAZIONE in EMONET database that are included in our dataset

2.2.2 The AVIS database

In the AVIS database two tables have been considered. Table STILIVITA records some information about the lifestyle of the donors (Table 2.6), while all the suspensions have


been recorded in table SOSPENSIONI (Table 2.7).

| Column | Type | Description |
|---|---|---|
| CAI | numerical | donor unique id |
| FUMO | categorical | smoking habits |
| ALCOOL | categorical | drinking habits |
| THE | categorical | tea consumption |
| CAFFE | categorical | coffee consumption |
| DIETA | categorical | diet type |
| STRESS | categorical | stress level |
| ATTIVITAFISICA | categorical | physical activity habits |
| CIRCONFERENZAVITA | numerical | abdominal circumference |
| ALTEZZA | numerical | height |
| PESO | numerical | weight |
| BMI | numerical | Body Mass Index |

Table 2.6: Variables from table STILIVITA in AVIS database that are included in our dataset

| Column | Type | Description |
|---|---|---|
| CAI | numerical | donor unique id |
| TIPO_SOSP | categorical | T for temporary, D for definitive |
| DATAINSERIMENTO | date-time | suspension starting date |
| DATARIAMMISSIONE | date-time | suspension ending date |

Table 2.7: Variables from table SOSPENSIONI in AVIS database that are included in our dataset

2.2.3 Data selection

For this work, the whole period from 1 January 2010 to 30 June 2018 has been considered as observation time. The analysis focuses on donations of whole blood performed in the main building of AVIS Milano, located in the Lambrate district. We have considered only "new" donors, namely people who became donors in this period, discarding all the others. For every donor, the observation interval starts at his/her first whole blood donation and ends on 30 June 2018, which is treated as a censoring time. According to these selection criteria, the dataset is composed of 9175 donors; the length of the observation interval and the number of donations generally differ from donor to donor.
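The selection rule can be sketched as follows. The function name and return format are hypothetical, and the sketch assumes that the list passed in contains the donor's complete history of whole blood donations at the Lambrate site, so that the earliest date really is the first donation.

```python
from datetime import date

STUDY_START = date(2010, 1, 1)
CENSORING_DATE = date(2018, 6, 30)


def observation_window(donation_dates):
    """Return (origin, length_in_days, n_later_donations) for a "new" donor.

    Returns None when the first donation falls outside the study period,
    i.e. when the donor is not "new" in the sense used in the text.
    """
    first = min(donation_dates)
    if not (STUDY_START <= first <= CENSORING_DATE):
        return None
    # Donations after the first one and up to the censoring time.
    later = [d for d in donation_dates if first < d <= CENSORING_DATE]
    return first, (CENSORING_DATE - first).days, len(later)


new_donor = observation_window([date(2017, 6, 30), date(2017, 10, 1)])
old_donor = observation_window([date(2009, 5, 1), date(2011, 1, 1)])
```

Here the first donor enters on 30 June 2017 and is observed for 365 days with one further donation, while the second donor is discarded because the first donation precedes 2010.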


2.2.4 Suspensions

A donor can be suspended from donating for a certain period of time if his/her well-being or the quality of the blood component is at risk. Suspensions are recorded in the databases of the Association (see Table 2.7). In the observation period, 805 suspensions related to 618 donors are registered. However, many of these suspensions overlap; this may happen when, after a further check, the suspension is extended because the reasons that prevent the person from donating persist. For each suspension the database provides fields for the start and end times, and a categorical variable named TIPO_SOSP indicates whether it is a permanent or a temporary suspension.

Among these, there are 421 temporary suspensions, for 348 donors, without an end date; hence it is difficult to assess the effect of these suspensions on the individuals' donations. The remaining ones are not respected in 92 cases, about 25% of the time, meaning that a blood donation is performed during the suspension (see Table 2.8).

| | Definitive | Temporary |
|---|---|---|
| NOT RESPECTED | 5 | 87 |
| RESPECTED | 42 | 250 |

Table 2.8: Frequency table relating the type of suspension to whether the suspension was respected

The fact that not all suspensions are respected does not imply a lack of control by the Association: the responsible physician can decide to bring forward the end of the suspension, and this is probably what happened in these cases. A possible solution is to take as the actual end of the suspension the minimum between the time of the next donation and the registered end of the suspension. However, temporary suspensions without an end time remain an issue, because with the above solution an individual who does not return to donate for a long time by his/her own choice cannot be distinguished from an individual who is prevented from donating.
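The proposed correction can be written as a short sketch. The function name is hypothetical, a missing end date is represented by `None`, and "next donation" is assumed to mean the first donation strictly after the start of the suspension.

```python
from datetime import date


def effective_suspension_end(start, registered_end, donation_dates):
    """Minimum between the registered end and the next donation after `start`.

    `registered_end` may be None (missing end date). Returns None when
    neither a registered end nor a later donation is available.
    """
    later = [d for d in donation_dates if d > start]
    next_donation = min(later) if later else None
    candidates = [d for d in (registered_end, next_donation) if d is not None]
    return min(candidates) if candidates else None


donations = [date(2014, 12, 1), date(2015, 3, 1), date(2015, 8, 1)]
not_respected = effective_suspension_end(date(2015, 1, 10), date(2015, 4, 10), donations)
missing_end = effective_suspension_end(date(2015, 1, 10), None, donations)
```

When the end date is missing, the sketch falls back on the next donation alone, which is exactly the situation the text flags as ambiguous: a long suspension and a voluntary pause look the same.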

Other available data concern donations of other blood components. The rest period after each such donation can be treated as a suspension from whole blood donation, and this information can be included in the analysis. From Tables 2.1, 2.2 and 2.4 it is possible to obtain the start and end dates of the period of inactivity due to donations of blood components other than whole blood. Hence the data about suspensions are completed with 727 observations related to 267 donors. Only 5 of these are not respected.


2.3 Features selection and data transformation

| Feature | Levels | Description | Missing |
|---|---|---|---|
| SESSO | 2 | Gender of the donor (1 Male, 0 Female) | 8 |
| FUMO | 15 | Daily number of cigarettes | 315 |
| ALCOOL | 6 | Daily weight of alcohol consumed | 315 |
| THE | 7 | Daily number of cups | 2843 |
| CAFFE | 7 | Daily number of cups | 2232 |
| DIETA | 7 | Kind of diet | 315 |
| ATTIVITAFISICA | 15 | Sport level | 315 |
| PESO | - | Weight in kg | 315 |
| ALTEZZA | - | Height in cm | 315 |
| STRESS | 5 | From absent to stressed | 315 |
| AB0 | 9 | Blood type | 0 |
| RH | 2 | Positive or negative | 0 |
| CAP_DOMIC | 1482 | Postal code | 51 |

Table 2.9: Description of the features

There are many features in the databases that can be used as covariates in a statistical model (see Sections 2.2.1 and 2.2.2). Let us focus on the ones described in Table 2.9. There are missing values for some donors. When a feature had a large number of missing values (as for THE and CAFFE, namely the daily number of cups of tea and coffee, see Table 2.9), the whole feature has been discarded (column-wise deletion), while for all the other features only the individuals with missing values have been discarded (row-wise deletion). Most of the features are categorical variables with many levels. To make them suitable for a statistical model, they have been transformed into binary dummy variables.

• the variable FUMO takes value 1 if the donor is a smoker, 0 otherwise;

• the variable ALCOOL takes value 0 if the donor declares not to consume alcoholic beverages, 1 otherwise;

• the variable ATTIVITAFISICA takes value 0 if the donor declares a sedentary lifestyle, or considers his/her level of physical activity low or irregular, 1 otherwise;

• the blood type is transformed into a four-level dummy variable (A, B, 0, AB). For instance, (1,0,0,0) is blood type A, (0,1,0,0) is B, and so on.
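The recoding listed above might look like the following sketch. The mapping of the subgroups A1, A2, A3 to A and of A1B, A2B to AB is an assumption (the text only gives the four final levels), and the raw level labels passed to `binary_indicator` are hypothetical, since the actual codings of FUMO, ALCOOL and ATTIVITAFISICA are not reported.

```python
# Assumed collapsing of the nine AB0 levels of Table 2.5 into four groups.
AB0_GROUPS = {
    "A": "A", "A1": "A", "A2": "A", "A3": "A",
    "B": "B",
    "0": "0",
    "AB": "AB", "A1B": "AB", "A2B": "AB",
}
DUMMY_ORDER = ("A", "B", "0", "AB")  # order used in the text: (A, B, 0, AB)


def blood_type_dummies(ab0):
    """Return the four-level dummy vector for a raw AB0 value."""
    group = AB0_GROUPS[ab0]
    return tuple(int(group == level) for level in DUMMY_ORDER)


def binary_indicator(raw_level, reference_levels):
    """Return 0 if the raw level belongs to the reference set, 1 otherwise."""
    return 0 if raw_level in reference_levels else 1


dummy_a1 = blood_type_dummies("A1")
# Hypothetical raw labels for the physical activity variable.
active = binary_indicator("regular", {"sedentary", "low", "irregular"})
```

A design choice worth noting: keeping all four blood-type indicators is convenient for description, but a regression model with an intercept would typically drop one of them as the reference level.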


The variables DIETA and STRESS do not seem to be useful for the analysis, since almost all the donors declare a balanced diet and an absent level of stress.

Numerical features are also present:

• AGE is the age of the donor at his/her first donation;

• the Body Mass Index (BMI) has been computed from the variables PESO (weight) and ALTEZZA (height) as

$$\mathrm{BMI} = \frac{\mathrm{Weight}}{\mathrm{Height}^2},$$

where the weight is expressed in kg and the height in meters.

2.4 Descriptive analysis

2.4.1 Rate of donations and gap times

At the end of the data selection procedure, 9175 donors were included in the dataset. Together, these donors made 34864 donations of whole blood in the period from 1 January 2010 to 30 June 2018.

As can be seen in Table 2.10, about 35% of them only entered the study, without any further donation. Since the goal of the proposed models is to describe donations as recurrent events, these individuals are excluded from the analysis. The others will be called "recurrent donors".

The total number of donations of a donor does not tell how frequently that person donates. This number must be related to the time during which the individual is observed, for example by dividing it by the years of observation. The empirical rates of donation have been computed (only for recurrent donors) and are shown in Figure 2.1. Notice that the empirical distribution of the yearly rate of donation is right-skewed, with its mass concentrated on low values: most of the donors made fewer than two donations per year.
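A minimal sketch of the empirical yearly rate follows. The function name is hypothetical; the count is assumed to exclude the first donation (as in Table 2.10), and a year is taken to be 365.25 days, which is a convention chosen here rather than one stated in the text.

```python
from datetime import date

CENSORING_DATE = date(2018, 6, 30)
DAYS_PER_YEAR = 365.25  # assumed convention for converting days to years


def empirical_yearly_rate(first_donation, n_later_donations, censoring=CENSORING_DATE):
    """Number of donations after the first one divided by the years of observation."""
    observed_days = (censoring - first_donation).days
    if observed_days <= 0:
        raise ValueError("the observation interval must have positive length")
    return n_later_donations / (observed_days / DAYS_PER_YEAR)


# A donor who entered on 30 June 2016 and donated 4 more times before censoring.
rate = empirical_yearly_rate(date(2016, 6, 30), 4)
```

For this toy donor the observation interval is 730 days, so the rate is very close to two donations per year.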


| Total donations (n) | Donors | Sample frequency |
|---|---|---|
| 0 | 3238 | 0.3529 |
| 1 | 1608 | 0.1753 |
| 2 | 1101 | 0.1200 |
| 3 | 723 | 0.0788 |
| 4 | 555 | 0.0605 |
| 5 | 417 | 0.0454 |
| 6 | 292 | 0.0318 |
| 7 | 262 | 0.0286 |
| 8 | 178 | 0.0194 |
| 9 | 160 | 0.0174 |
| 10 | 119 | 0.0130 |
| 11 | 101 | 0.0110 |
| 12 | 75 | 0.0082 |
| 13 | 64 | 0.0070 |
| >13 | 282 | 0.0307 |

Table 2.10: Number of donors that did exactly n total donations (after the first one)

[Figure 2.1: Histogram of the empirical rates of donation (number of donations divided by the years of observation). Title: Empirical yearly rate of donation; horizontal axis: rate (0 to 4); vertical axis: density (0.0 to 0.5).]


There may be a problem of loss to follow-up. This can be seen by computing, for each donor, the number of days elapsed from the last donation to the censoring time (namely, the last day of observation). See Figure 2.2 for the boxplot of this quantity.

Loss to follow-up happens when an individual voluntarily abandons the study, and so does not show up for a long period of time. However, blood donations are made on a voluntary basis, hence we do not know whether a donor has actually decided to stop donating or is only postponing the next donation.

If one believes that the history of the process influences the fact that some individuals do not show up for a while, then some choices about the censoring time $C_i$ have to be made, and the dependence between the process and $C_i$ must be modelled (see Ouyang et al. (2013) for event-dependent censoring times and Chapter 7 of Cook and Lawless (2007) for more details about loss to follow-up).

[Figure 2.2: Boxplots of the number of days elapsed from the last observed donation of each donor to the censoring time, for non-recurrent and recurrent donors. Vertical axis: days (0 to 3000).]


[Figure 2.3: Trend of the gap times with the number of donations. (a) Boxplots of the waiting times of all individuals, $W_{i,j} = T_{i,j+1} - T_{i,j}$, grouped by the $j$-th repetition; horizontal axis: repetition, vertical axis: log(gap times). (b) Trend of the mean and of the median of $W_{i,j}$ with respect to $j$; horizontal axis: repetition, vertical axis: days.]

[Figure 2.4: Trend of the gap times with the years passed since entrance. (a) Boxplots of the waiting times of all individuals, $W_{i,j} = T_{i,j+1} - T_{i,j}$, grouped by the year in which the events occurred; horizontal axis: year, vertical axis: log(gap times). (b) Trend of the mean (red) and of the median (blue) of $W_{i,j}$ with respect to the year in which the events occurred; horizontal axis: year, vertical axis: days.]

In Gianoli (2016) it has been observed how the waiting times between two events


seem to have a decreasing trend as the number of donations increases (see Figure 2.3). Figure 2.3 gives some information about the rate of donation at time $t$ conditionally on the number of events that occurred up to that time. However, in this thesis we are interested in how the rate of donation changes as time passes, without taking into account the number of events experienced. In this sense, it is more meaningful to investigate the relationship between the gap times and the year in which the corresponding events occurred.

Looking at Figures 2.4a and 2.4b, no trend between the two is evident; in fact the medians seem constant over the years (recall that the years are counted from the first donation of each individual). The tail of the empirical distribution becomes longer over the years, as can be seen from the growth of the mean and from the boxplots.

[Figure 2.5: Histogram of the logarithm of the gap times. Horizontal axis: log(gap times), from 4.5 to 7.5; vertical axis: frequency.]

An interesting fact is the bimodality of the distribution of the gap times, which reflects the difference in the donation rules between the two genders: men are allowed to donate again sooner than women. Figure 2.5 shows the histogram of the logarithm of the gap times; the red lines correspond to the logarithms of 90 and 180, namely the minimum waiting times in days for men and women.
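For concreteness, the gap times $W_{i,j} = T_{i,j+1} - T_{i,j}$ of a single donor and the two reference values on the log scale can be computed as in the sketch below. The function name and the example dates are invented for illustration.

```python
import math
from datetime import date


def gap_times(donation_dates):
    """Waiting times in days between consecutive donations of one donor."""
    ordered = sorted(donation_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


# Invented donation history: the gaps match the two minimum waiting times.
history = [date(2015, 1, 1), date(2015, 4, 1), date(2015, 9, 28)]
gaps = gap_times(history)
log_gaps = [math.log(w) for w in gaps]

# Positions of the two reference lines on the log scale.
log_min_gap_men = math.log(90)
log_min_gap_women = math.log(180)
```

The two reference values are about 4.50 and 5.19, which fall inside the range of the horizontal axis of Figure 2.5 (4.5 to 7.5).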


2.4.2 Covariates

As mentioned before, some features are used as covariates in the statistical model that we propose. Time-dependent covariates are not taken into account in this work; all the covariates included in the models were recorded at the time of entry into the study, specifically when the person signed up with AVIS.

Table 2.11 summarises all the categorical covariates with their sample frequencies. Some of them are objective (like sex, blood type or Rhesus factor), while the others are self-reported (smoking and drinking habits and level of physical activity).

| Variable | Value | Sample frequency |
|---|---|---|
| Sex | F | 0.372 |
| | M | 0.628 |
| Smoke | Non-smoker | 0.656 |
| | Smoker | 0.344 |
| Alcohol | Not consumer | 0.697 |
| | Consumer | 0.303 |
| Physical Activity | Sedentary life | 0.327 |
| | Active life | 0.673 |
| AB0 | A | 0.432 |
| | B | 0.123 |
| | AB | 0.012 |
| | 0 | 0.462 |
| RH | POS | 0.865 |
| | NEG | 0.135 |
| DIETA | Balanced | 0.938 |
| | Highly caloric | 0.011 |
| | Lowly caloric | 0.004 |
| | Vegan/Vegetarian | 0.016 |
| STRESS | Absent | 0.0658 |
| | Negative 1 | 0.824 |
| | Negative 2 | 0.060 |
| | Negative 3 | 0.011 |
| | Positive | 0.004 |

Table 2.11: Sample frequencies of the categorical variables

There are more men than women among the donors in the dataset. The most common blood group is 0, and the positive Rhesus factor is more frequent than the negative one. As far as lifestyle variables are concerned, donors seem to lead a healthy life.


In fact, there are more non-smokers than smokers, and most individuals declare an active life. Moreover, consumers of alcoholic beverages are outnumbered by non-consumers.

| Variable | Sample mean | Standard deviation |
|---|---|---|
| AGE | 31.64 | 9.83 |
| BMI | 23.67 | 3.88 |

Table 2.12: Mean and standard deviation of the continuous variables

As for the continuous covariates, Table 2.12 reports their sample means and standard deviations. These values are used to standardize these features.
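A small sketch of how the BMI and the standardized covariates could be computed is given below. The function names are hypothetical; the means and standard deviations are those of Table 2.12, and the height is assumed to be stored in centimetres, as reported in Table 2.9.

```python
# Sample means and standard deviations taken from Table 2.12.
AGE_MEAN, AGE_SD = 31.64, 9.83
BMI_MEAN, BMI_SD = 23.67, 3.88


def bmi(weight_kg, height_cm):
    """Body Mass Index: weight in kg divided by the squared height in meters."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def standardize(value, mean, sd):
    """Center on the sample mean and scale by the sample standard deviation."""
    return (value - mean) / sd


donor_bmi = bmi(70.0, 175.0)
z_bmi = standardize(donor_bmi, BMI_MEAN, BMI_SD)
z_age = standardize(41.47, AGE_MEAN, AGE_SD)
```

A donor aged 41.47 at the first donation lies one standard deviation above the sample mean, so the standardized age is approximately 1.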

From this table it can be seen that, on average, a person becomes a donor at about 32 years of age. Notice from the boxplots in Figure 2.7 that the first and third quartiles are at about 25 and 40 years.

The Body Mass Index is a continuous measure that classifies the weight status of a person on a scale from underweight to severe obesity.

[Figure 2.6: Boxplots of the BMI according to the values of the categorical covariates. Panels: sex, active life, smoke, alcohol, Rhesus factor, blood type; vertical axis: BMI.]

The normal weight range goes from $18\,\mathrm{kg/m^2}$ to $25\,\mathrm{kg/m^2}$. From Table 2.12 and Figure 2.6, one can say that the donors are, on the whole, in good physical condition.


Figures 2.6 and 2.7 are useful to look for association patterns between the continuous and the categorical variables. However, it is not clear from the boxplots whether a significant association exists.

The goal of the model proposed in this thesis is to estimate the donation rate, namely the number of donations per unit of time. It is evident from Figure 2.8 that the distribution of the rates reaches higher values for males than for females. This was expected since, according to the law, men can donate twice as often as women. No other associations are evident in the figure.

[Figure 2.7: Boxplots of the first donation age according to the values of the categorical covariates. Panels: sex, active life, smoke, alcohol, Rhesus factor, blood type; vertical axis: first donation age.]

In Figure 2.9 the logarithm of the empirical rate is plotted against the corresponding values of the BMI and of the first donation age. The red line is the ordinary least squares (OLS) fit. The estimated slope is positive for both covariates, but further investigation is needed to establish the significance of this relationship.
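The OLS line of a simple regression has a closed form, sketched below in plain Python. The function name and the toy data are invented for illustration and do not come from the AVIS dataset.

```python
def ols_line(x, y):
    """Intercept and slope of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x and sum of cross-products.
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data lying exactly on a line: log(rate) = -0.9 + 0.05 * BMI.
bmi_values = [20.0, 22.0, 24.0, 26.0]
log_rates = [0.1, 0.2, 0.3, 0.4]
intercept, slope = ols_line(bmi_values, log_rates)
```

A positive slope alone says nothing about significance; that would require at least a standard error for the slope, which this sketch does not compute.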


[Figure 2.8: Boxplots of the donation rate grouped by the categorical variables. Panels: sex, active life, smokers, alcohol consumers, blood type, Rhesus factor; vertical axis: donation rate (N/years).]

[Figure 2.9: Scatterplots of the donation rates against the continuous variables (AGE and BMI). Horizontal axes: BMI and first donation age; vertical axis: log(rate).]

36

Page 51: Count processes approach to recurrent event data: a ...di modelli e metodi utilizzati sono quelli della statistica Bayesiana e le donazioni di sangue sono state modellizzate come eventi

3 Modelling blood donations as recurrent events

In this chapter, the model used to analyze the blood donation data is explained in detail. First, the state of the art of predictive models for blood donation is discussed. Afterwards, a subsection is devoted to each class of parameters in the model, and the complete model is summarized in the third section of the chapter. Finally, it is explained how to obtain an MCMC sample of recurrent event processes.

3.1 Recurrent event models for blood donations

Bosnes et al. (2005) set up a logistic regression model to predict whether donors actually show up at a scheduled donation session. However, this kind of approach focuses on a single donation on a specific date and, as a consequence, gives limited insight into the long-term behaviour of blood donors. Logistic regression has also been applied in Flegel et al. (2000), where the probability that a person returns to donate within a preselected time interval is modelled. James and Matthews (1996) follow a time-to-event approach, using non-parametric methods of survival analysis: Kaplan-Meier estimates are built for the first 5 donation cycles, and a proportional hazards model is then used to establish the effect of the covariates. Ownby et al. (1999) approach gap-time recurrent event modelling with a proportional hazards model to describe the first 10 return times to donation of an individual. Then, the first 5 return times are combined using a homogeneous Poisson process with proportional hazards. All the previous publications rely on frequentist methods. As regards the Bayesian setting, in Gianoli (2016) blood donations are treated as recurrent events in the framework of gap times between events. In particular, a class of autoregressive Bayesian semiparametric models for gap times is considered. Fixed and time-dependent covariates are included, and an individual-time-specific random effect is modelled through a Dirichlet process (DP) mixture prior, inducing clustering among donors.

3.2 Modelling choices

In the framework of recurrent event processes the goal is to estimate the intensity function:

$$\lambda(t \mid H(t)) = \lim_{\Delta t \to 0} \frac{P\big(N(t) - N(t - \Delta t) = 1 \mid H(t)\big)}{\Delta t}. \tag{3.1}$$

Before starting the discussion, it is important to clarify the meaning of the time variable $t$. For every individual, the origin of the time axis is the time of his/her first whole blood donation, so $t$ is the number of days elapsed since that moment. Hence, for individual $i$ the set of observations is:

$$T_{i,1}, \dots, T_{i,n_i},$$

where $n_i$ is the number of donations made by the donor after the first one, and $T_{i,j}$ is the number of days from time $t = 0$ to the $j$-th recurrent blood donation. For every individual a blood donation occurred at time $t = 0$; this donation is not considered an event time but part of the initial conditions, hence all the analyses are conditional on this event.
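The change of time scale described above can be sketched in a few lines of Python. This is only an illustration: the function name `event_times_in_days` and the example dates are hypothetical and do not come from the AVIS data.

```python
from datetime import date

def event_times_in_days(donation_dates):
    """Return T_{i,1}, ..., T_{i,n_i}: days from the first donation to each later one."""
    ordered = sorted(donation_dates)
    origin = ordered[0]                      # the first donation defines t = 0
    return [(d - origin).days for d in ordered[1:]]

# Hypothetical donor with four whole blood donations
dates = [date(2010, 3, 1), date(2010, 6, 7), date(2010, 9, 20), date(2011, 1, 10)]
times = event_times_in_days(dates)
print(times)
```

The first donation does not appear among the returned times, in line with the choice of treating it as an initial condition rather than as an event.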

Once the intensity function and the time scale are defined, it is possible to compute the likelihood of the realization of the recurrent event process of a single donor using the formula in Section 2.1 of Cook and Lawless (2007):

$$P\big(n \text{ events at times } t_1 < t_2 < \dots < t_n \mid H(\tau_0)\big) = \exp\left(-\int_{\tau_0}^{\tau} \lambda(u \mid H(u))\, Y(u)\, du\right) \prod_{j=1}^{n} \lambda(t_j \mid H(t_j)), \qquad 0 = \tau_0 < t_1 < \dots < t_n < \tau,$$

where $Y(t)$ is the "at risk" indicator function, a binary function that indicates whether an individual is at risk of experiencing an event or not (see Section 1.5.1).

Let us denote by $\lambda_i(t \mid H(t))$ the intensity function and by $Y_i(t)$ the "at risk" indicator function of the $i$-th individual. The intensity is modelled in the framework of the multiplicative model explained in Section 1.4. Hence:

$$\lambda_i(t \mid H(t)) = Y_i(t) \times w_i \times \lambda_0(t \mid H(t)) \times \exp(x_i'\beta), \tag{3.2}$$

where:

• $\lambda_0(t \mid H(t))$ is the baseline intensity function;

• $w_i$ is an individual-specific random effect (also called frailty);

• $x_i$ is the vector of covariates of the $i$-th individual and $\beta$ is a vector of coefficients;

• $Y_i(t)$ is the "at risk" indicator function, which is treated as a datum for each individual.

3.2.1 Baseline intensity function

The baseline intensity function is modelled as the product of two components:

$$\lambda_0(t \mid H(t)) = \left(\sum_{k=1}^{K} \lambda_k\, I_{(a_{k-1}, a_k]}(t)\right) I_{(t - T_{N(t^-)} > \phi_G)}(t). \tag{3.3}$$

The first component is independent of $H(t)$ (the history of the process up to time $t$) and is expressed as a piecewise constant function (see Section 1.5.4). This model is very flexible but requires partitioning the time domain into a fixed number of intervals. Let $K$ be the number of intervals, delimited by the cut-points $a_0 = 0 < a_1 < \dots < a_K$. Each interval has a parameter $\lambda_k$ that can be interpreted, in analogy with the homogeneous Poisson process, as the occurrence rate of the events in that interval. The choice of the knots is the object of a predictive performance analysis, both on the value of $K$ (5, 10 or 20) and on the type of division of the time domain. For the latter, two kinds of cut-points have been considered:

• the quantiles of the donation times, over all individuals, of the donations made from the 1st of July 2001 to the 31st of December 2009. Recall that the time of a donation of a person is the number of days elapsed since the first donation of that individual;

• an equispaced grid from time 0 to the maximum observed time.
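The two kinds of grid and the piecewise constant component of the baseline can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and placing the last cut-point at the end of the time domain in the quantile-based grid is an assumption, not a detail taken from the text.

```python
from bisect import bisect_left
from statistics import quantiles

def equispaced_cut_points(n_intervals, t_max):
    """a_0 = 0 < a_1 < ... < a_K = t_max on a regular grid."""
    return [t_max * k / n_intervals for k in range(n_intervals + 1)]

def quantile_cut_points(past_event_times, n_intervals, t_max):
    """Inner cut-points at the empirical quantiles of past event times."""
    inner = quantiles(past_event_times, n=n_intervals, method="inclusive")
    return [0.0] + inner + [float(t_max)]

def baseline_step(t, cuts, rates):
    """Evaluate sum_k rates[k] * I(a_{k-1} < t <= a_k)."""
    if t <= cuts[0] or t > cuts[-1]:
        return 0.0
    return rates[bisect_left(cuts, t) - 1]

cuts = equispaced_cut_points(5, 3100)
values = [baseline_step(t, cuts, [1, 2, 3, 4, 5]) for t in (0, 1, 620, 621, 3100, 3101)]
print(cuts, values)
```

Each interval is closed on the right, so a time equal to a cut-point belongs to the interval that ends there, as in the indicator $I_{(a_{k-1}, a_k]}(t)$.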

As already mentioned in Chapter 1, it is common to choose the quantiles of the event times as cut-points of the time domain. However, this is a data-driven choice and, by definition, it is not independent of the data that the model aims to fit. To keep the number of events balanced across intervals while making a data-independent choice of the grid, we selected the quantiles of the event times observed in a past time window of the same length as the one used to extract the data for this thesis.

A prior distribution is assigned to the rates $\lambda_1, \dots, \lambda_K$:

$$\lambda_k \overset{\text{iid}}{\sim} \Gamma(\alpha_\lambda, \beta_\lambda), \quad k = 1, \dots, K, \qquad \alpha_\lambda, \beta_\lambda \text{ fixed}. \tag{3.4}$$


While the first part of the baseline intensity function does not depend on the past, the second component of (3.3) depends on the history of the process. Moreover, it restarts identically after each event, like the intensity function of a renewal process. Since in this model the features of a Poisson process and of a renewal process coexist, we are in the case of the general intensity-based model (see Section 1.5.2). The indicator function in the second part of (3.3) models the fact that a person cannot donate for a certain period of time $\phi_G$, which depends on his/her gender. Indeed, the intensity, and hence the probability of donating, is set equal to 0 for $\phi_G$ days after every event. According to AVIS rules, the post-donation rest time $\phi_G$ should be equal to $\phi_M = 90$ days for men and $\phi_F = 180$ days for women. However, some donations happen earlier (see Figures 3.1 and 3.2), since a physician is allowed to bring donations forward. Hence we set $\phi_M$ to 85 days and $\phi_F$ to 150 days, discarding from the analysis all the donors who at least once did not respect these relaxed thresholds. The thresholds have been fixed heuristically, with the goal of discarding as few individuals as possible from the study while allowing reasonably early donations. With this choice only 5 men and 82 women have been discarded. If information about the fertility status of female donors were available, it would be possible to apply the men's threshold to women in menopause, as the association (in principle) does. However, since no particular trend has been noticed between the gap times of women and their age at the time of donation, it has been decided not to investigate the Association databases any further, and to treat the female population as a single group whose post-donation rest time is lowered more than that of males.
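The exclusion rule can be sketched as a simple filter on the gap times. The donor records below are hypothetical, and treating a gap exactly equal to the threshold as allowed is an assumption about the boundary convention (the indicator in (3.3) is strict).

```python
REST_DAYS = {"M": 85, "F": 150}   # relaxed thresholds phi_M and phi_F from the text

def respects_rest_time(event_times, gender):
    """True if every gap between consecutive donations reaches the rest time.

    event_times are days since the first donation, which happened at t = 0.
    """
    previous = [0] + list(event_times[:-1])
    return all(t - p >= REST_DAYS[gender] for p, t in zip(previous, event_times))

# Hypothetical donors: id -> (gender, event times in days)
donors = {
    "d1": ("M", [95, 190, 400]),
    "d2": ("M", [80, 300]),        # first gap of 80 days < 85: discarded
    "d3": ("F", [160, 330]),
    "d4": ("F", [200, 340]),       # second gap of 140 days < 150: discarded
}
kept = {k: v for k, v in donors.items() if respects_rest_time(v[1], v[0])}
print(sorted(kept))
```

The gap from the donation at $t = 0$ to the first recurrent donation is checked as well, since the rest period also applies after the initial donation.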


[Figure 3.1 plot: histogram of gap times for females; horizontal axis: gap times (0 to 1000 days); vertical axis: frequency]

Figure 3.1: Histogram of gap times of female donors; the red line corresponds to 180 days

[Figure 3.2 plot: histogram of gap times for males; horizontal axis: gap times (0 to 1000 days); vertical axis: frequency]

Figure 3.2: Histogram of gap times of male donors; the red line corresponds to 90 days


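[Figure 3.3 plot: percentage of early donations by women (vertical axis, 0 to 12) against the threshold age for menopause (horizontal axis, 20 to 60 years), with one curve for each post-donation rest time: 150, 160, 170, 175 and 180 days]

Figure 3.3: Percentage of earlier-than-allowed donations as a function of the threshold age for menopause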

In fact, as can be seen in Figure 3.3, for each possible choice of the post-donation rest time for women $\phi_F$ and for each reasonable choice of a threshold age for menopause, there remains a significant percentage of "earlier than allowed" donations, namely early donations by young women. With a post-donation rest time of 150 days and a menopausal age of 50 years, only 1 % of early donations are observed.

3.2.2 Frailty parameters

The random effects, or frailties, are denoted by $w_i$, where the subscript $i$ is the index of the individual. These parameters are meant to capture the heterogeneity between individuals and have a multiplicative effect on the intensity function: a value greater (smaller) than 1 can be interpreted as a higher (lower) propensity to experience an event. Usually these parameters are modelled as Gamma random variables with mean equal to 1 and variance equal to $\eta$. In this work these conditions hold conditionally on the variance parameter $\eta$, which has its own marginal prior distribution, inducing correlation among the random effects through exchangeability. Summing up:

$$w_i \mid \eta \overset{\text{iid}}{\sim} \Gamma(\eta^{-1}, \eta^{-1}), \quad i = 1, \dots, M, \tag{3.5}$$

$$\eta \sim \Gamma(2, 2), \tag{3.6}$$

where the two parameters of the prior of $\eta$ have been chosen after a sensitivity analysis.
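As a quick check of the stated moments, assume that $\Gamma(a, b)$ denotes a Gamma distribution with shape $a$ and rate $b$ (this convention is an assumption; it is the one under which a mean of 1 and a variance of $\eta$ are obtained). With $a = b = \eta^{-1}$:

```latex
\begin{align*}
E[w_i \mid \eta] &= \frac{a}{b} = \frac{\eta^{-1}}{\eta^{-1}} = 1, \\
\operatorname{Var}[w_i \mid \eta] &= \frac{a}{b^{2}} = \frac{\eta^{-1}}{\eta^{-2}} = \eta.
\end{align*}
```

Marginally, the law of total variance gives $\operatorname{Var}[w_i] = E[\operatorname{Var}[w_i \mid \eta]] + \operatorname{Var}[E[w_i \mid \eta]] = E[\eta]$, which equals 1 under $\eta \sim \Gamma(2, 2)$ with the same convention.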

Another option is to divide the individuals into groups according to their postal code. In this case the frailties are area-dependent, and one variance parameter $\eta_j$ is estimated for the $j$-th zone. The areal dependence of random effects has been addressed in the literature by many authors; see for example Banerjee et al. (2003), Henderson et al. (2002), Li and Ryan (2002). In those works the prior structure of the random effects' parameters is mainly based on the distance matrix of the areas. However, to keep the model simple, a different prior has been chosen in this work. The parameters $\eta_j$, for $j = 1, \dots, J$ ($J$ is the number of areas), are a priori exchangeable, hence correlation is induced among them once the hyperparameters are marginalized. In this case the model is:

$$w_i \mid \eta_j \overset{\text{ind}}{\sim} \Gamma(\eta_j^{-1}, \eta_j^{-1}), \quad i = 1, \dots, M, \text{ where } j \text{ is the zone of the } i\text{-th individual}, \tag{3.7}$$

$$\eta_j \mid \alpha_\eta, \beta_\eta \overset{\text{iid}}{\sim} \Gamma(\alpha_\eta, \beta_\eta), \quad j = 1, \dots, J, \tag{3.8}$$

$$\alpha_\eta, \beta_\eta \overset{\text{iid}}{\sim} \Gamma(a, b), \quad a, b \text{ fixed}, \tag{3.9}$$

$$a = 3, \quad b = 2. \tag{3.10}$$

Once the posterior distribution of the variance parameter $\eta$ is known, it is possible to compute the predictive density of a new donor's random effect, which we call $w^{new}$. Let $\mathcal{L}(\eta \mid \text{data})$ denote the posterior law of $\eta$. Then:

$$\mathcal{L}(w^{new} \mid \text{data}) = \int \mathcal{L}(w^{new} \mid \eta, \text{data})\, \mathcal{L}(\eta \mid \text{data})\, d\eta = \int \mathcal{L}(w^{new} \mid \eta)\, \mathcal{L}(\eta \mid \text{data})\, d\eta. \tag{3.11}$$

Then, if for every $\eta^{(s)}$ in an MCMC sample of size $S$ from $\mathcal{L}(\eta \mid \text{data})$ a value $w^{new,(s)}$ is sampled independently from a Gamma distribution with both parameters equal to $1/\eta^{(s)}$, as in (3.5), the result $\{w^{new,(1)}, \dots, w^{new,(S)}\}$ is an MCMC sample from $\mathcal{L}(w^{new} \mid \text{data})$, namely the predictive density of the frailty of a new incoming donor.
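The composition step above can be sketched as follows. The posterior draws of $\eta$ are replaced here by simulated stand-ins, and the Gamma law is assumed to be parametrized by shape and rate, so that a frailty with mean 1 and variance $\eta$ has shape $1/\eta$ and scale $\eta$ in Python's `random.gammavariate(alpha, beta)`.

```python
import random

def sample_new_frailty(eta_draws, rng):
    """One draw of w_new for each posterior draw of eta (mean 1, variance eta)."""
    # gammavariate(alpha, beta) uses shape alpha and scale beta; scale = 1 / rate = eta
    return [rng.gammavariate(1.0 / eta, eta) for eta in eta_draws]

rng = random.Random(0)
# Stand-in for the MCMC output: these are NOT posterior draws from the thesis
eta_draws = [rng.gammavariate(2.0, 0.5) for _ in range(2000)]
w_new = sample_new_frailty(eta_draws, rng)
print(sum(w_new) / len(w_new))
```

Because each $w^{new,(s)}$ is paired with its own $\eta^{(s)}$, the uncertainty on the variance parameter is propagated to the predictive sample.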

3.2.3 Covariates

As mentioned in the previous chapter, donor-specific covariates that are fixed in time are considered in the analysis. The maximum number of covariates included in the model is 9, but models with fewer covariates will be compared through goodness-of-fit indicators. The whole set of covariates considered is as follows:

• age at the time of the first donation (standardized);

• binary variable for gender (1 male, 0 female);

• Body Mass Index (standardized);

• binary variable for smoker (1 smoker, 0 otherwise);

• binary variable for alcohol consumption (1 consumer, 0 otherwise);

• binary variable for an active life (1 if yes, 0 if not);

• dummy variable for blood type 0 (equal to 1 if the donor's blood type is 0, otherwise 0);

• dummy variable for blood type A (equal to 1 if the donor's blood type is A, otherwise 0);

• binary variable for Rhesus factor (1 if positive, 0 if negative).

A dummy variable for blood type AB has not been considered since very few donors in the dataset are AB-typed.

To have a non-informative prior distribution, the parameters $\beta_1, \dots, \beta_p$ are a priori independent and identically distributed normal random variables with mean 0 and variance equal to 100.

3.2.4 At risk indicator function, censoring and suspensions

The interval of observation is not the same for all individuals. The time-axis origin is the time of the first donation of a donor, while the right endpoint of the observation interval is the number of days between the time-axis origin and the 30th of June 2018. This is a censoring phenomenon.

Moreover, some donors cannot be observed in certain periods, since a donor can be suspended from donating because of health issues. The suspensions of each donor are available in the AVIS database and are treated in the model as data. These two phenomena are modelled with a function $Y_i(t)$, which is equal to 1 if donor $i$ is neither censored nor suspended at time $t$, and 0 otherwise. As explained in Cook and Lawless (2007), if the value of $Y_i(t)$ is independent of the recurrent event process, the intensity function can be rewritten as:

$$Y_i(t)\, \lambda(t \mid H(t)), \tag{3.12}$$


and the likelihood contribution of donor $i$ becomes:

$$\exp\left(-\int_0^{\tau} \lambda_i(u)\, Y_i(u)\, du\right) \prod_{j=1}^{n_i} \lambda_i(t_{ij}). \tag{3.13}$$

However, since the suspensions concern few individuals and their records are very noisy (see Section 2.2.4), they are not included in the function $Y_i(t)$ for the analysis.

3.3 The Bayesian model for recurrent data of M donors

3.3.1 The likelihood

For any $i = 1, \dots, M$ we define the observations as:

• $n_{ik}$ = number of events experienced by the $i$-th individual in the interval $(a_{k-1}, a_k]$;

• $n_{\cdot k} = \sum_{i=1}^{M} n_{ik}$ = total number of events in the interval $(a_{k-1}, a_k]$;

• $n_i = \sum_{k=1}^{K} n_{ik}$ = total number of events experienced by the $i$-th individual;

• $Y_i(t) = I(i\text{-th individual is observed at time } t)$; it contains information about censoring and, possibly, suspensions;

• $\tau_{ik} = \int_{a_{k-1}}^{a_k} Y_i(u)\, I_{(u - T_{N_i(u^-)} > \phi_G)}(u)\, du$ = total time that the $i$-th individual has been observed in the interval $(a_{k-1}, a_k]$;

• $x_i = (x_{i1}, \dots, x_{ip})'$, $p \le 9$: $p$-dimensional vector of covariates of the $i$-th individual.
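The quantities $n_{ik}$ and $\tau_{ik}$ can be computed from the event times of a donor as in the sketch below. It assumes that only censoring enters $Y_i(t)$ (no suspensions) and that the rest period also follows the donation at $t = 0$; the function name and the toy numbers are hypothetical.

```python
from bisect import bisect_left

def counts_and_exposure(event_times, censor_time, cuts, rest):
    """Return (n_ik, tau_ik) over the intervals (cuts[k], cuts[k+1]] for one donor."""
    K = len(cuts) - 1
    n = [0] * K
    for t in event_times:
        n[bisect_left(cuts, t) - 1] += 1          # event falls in (a_{k-1}, a_k]
    # At-risk windows: from (previous donation + rest) to the next donation or censoring
    starts = [0.0 + rest] + [t + rest for t in event_times]
    ends = list(event_times) + [censor_time]
    tau = [0.0] * K
    for s, e in zip(starts, ends):
        for k in range(K):
            overlap = min(cuts[k + 1], e) - max(cuts[k], s)
            if overlap > 0:
                tau[k] += overlap
    return n, tau

n, tau = counts_and_exposure([120, 260], censor_time=300, cuts=[0, 100, 200, 300], rest=50)
print(n, tau)
```

In this toy case the donor is at risk on [50, 120] and [170, 260]; the last window would start at day 310, after censoring, so it contributes nothing.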

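The likelihood function of the proposed model, obtained by substituting the piecewise constant intensity into (3.13) for each donor and multiplying over the $M$ donors, is:

$$\prod_{k=1}^{K} \left\{ \lambda_k^{n_{\cdot k}} \prod_{i=1}^{M} w_i^{n_{ik}} \exp\big( n_{ik}\, x_i'\beta - w_i \exp(x_i'\beta)\, \lambda_k\, \tau_{ik} \big) \right\}. \tag{3.14}$$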
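A minimal sketch of the corresponding log-likelihood is given below. It follows from the piecewise constant intensity: each event of donor $i$ in interval $k$ contributes $\log(w_i \exp(x_i'\beta)\lambda_k)$, and the time at risk contributes $-w_i \exp(x_i'\beta)\lambda_k\tau_{ik}$. The function name and the plain nested-list data layout are assumptions made for illustration; this is not the Stan code used for the analysis.

```python
import math

def log_likelihood(n, tau, lam, w, X, beta):
    """Log-likelihood with n[i][k] events and tau[i][k] days at risk per donor and interval."""
    total = 0.0
    for i in range(len(w)):
        lin = sum(x * b for x, b in zip(X[i], beta))      # linear predictor x_i' beta
        for k in range(len(lam)):
            rate = w[i] * math.exp(lin) * lam[k]           # intensity of donor i in interval k
            total += n[i][k] * math.log(rate) - rate * tau[i][k]
    return total

# Toy data for two donors and two intervals (hypothetical values)
n = [[2, 0], [1, 3]]
tau = [[10.0, 5.0], [7.0, 20.0]]
lam, w = [0.1, 0.05], [1.2, 0.7]
X, beta = [[1.0, 0.0], [-0.5, 1.0]], [0.3, -0.2]
ll = log_likelihood(n, tau, lam, w, X, beta)
print(ll)
```

Working on the log scale avoids the underflow that the product form would cause with thousands of donors.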

3.3.2 Prior elicitation

The parameters of the model can be collected in a vector $\theta$ defined in the following way:

$$\theta := (\lambda_1, \dots, \lambda_K, \beta_1, \dots, \beta_p, w_1, \dots, w_M, \eta), \tag{3.15}$$

or

$$\theta := (\lambda_1, \dots, \lambda_K, \beta_1, \dots, \beta_p, w_1, \dots, w_M, \eta_1, \dots, \eta_J), \tag{3.16}$$

where:

• $\lambda_1, \dots, \lambda_K$ are the interval-specific rates;

• $\beta := (\beta_1, \dots, \beta_p)'$ is the $p$-dimensional vector of covariate coefficients;

• $w_1, \dots, w_M$ are the individual-specific random effects;

• $\eta_1, \dots, \eta_J$ or $\eta$ are the variances of the random effects with or without areal dependence, respectively.

Given the parameter $\theta$ and the vector $x_i$, the intensity function of the $i$-th individual is:

$$\lambda_i(t \mid H(t), \theta) = I_{(t - T_{N_i(t^-)} > \phi_G)}(t) \sum_{k=1}^{K} w_i \exp(x_i'\beta)\, \lambda_k\, I_{(a_{k-1}, a_k]}(t), \quad i = 1, \dots, M, \tag{3.17}$$

where $\phi_G$ is a fixed parameter that depends on the sex of the individual and represents the post-donation rest time.

A priori independence among blocks of parameters is assumed, with marginal priors as follows:

$$\beta \sim \mathcal{N}(0, \sigma^2 I_p), \quad \sigma^2 \text{ fixed}, \quad I_p \text{ identity matrix} \in \mathbb{R}^{p \times p}. \tag{3.18}$$

$$\lambda_k \overset{\text{iid}}{\sim} \Gamma(\alpha_\lambda, \beta_\lambda), \quad k = 1, \dots, K, \quad \alpha_\lambda, \beta_\lambda \text{ fixed}. \tag{3.19}$$

If the model has zone-dependent frailties:

$$w_i \mid \eta_j \overset{\text{ind}}{\sim} \Gamma(\eta_j^{-1}, \eta_j^{-1}), \quad i = 1, \dots, M. \tag{3.20}$$

$$\eta_j \mid \alpha_\eta, \beta_\eta \overset{\text{iid}}{\sim} \Gamma(\alpha_\eta, \beta_\eta), \quad j = 1, \dots, J. \tag{3.21}$$

$$\alpha_\eta, \beta_\eta \overset{\text{iid}}{\sim} \Gamma(a, b), \quad a, b \text{ fixed}. \tag{3.22}$$

otherwise:

$$w_i \mid \eta \overset{\text{iid}}{\sim} \Gamma(\eta^{-1}, \eta^{-1}), \quad i = 1, \dots, M. \tag{3.23}$$

$$\eta \sim \Gamma(a_\eta, b_\eta), \quad a_\eta, b_\eta \text{ fixed}. \tag{3.24}$$
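To make the prior structure without areal dependence concrete, the sketch below draws one parameter vector from (3.18), (3.19), (3.23) and (3.24). The default values ($\sigma^2 = 100$, Gamma hyperparameters equal to 2) are the ones mentioned elsewhere in the text; reading $\Gamma(a, b)$ as shape and rate is an assumption, and the function name is hypothetical.

```python
import math
import random

def draw_from_prior(M, K, p, rng, sigma2=100.0, a_lam=2.0, b_lam=2.0, a_eta=2.0, b_eta=2.0):
    """One joint prior draw of (lambda, beta, w, eta) for M donors, K intervals, p covariates."""
    beta = [rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(p)]
    # gammavariate(alpha, beta) takes shape and scale, so a rate b becomes a scale 1 / b
    lam = [rng.gammavariate(a_lam, 1.0 / b_lam) for _ in range(K)]
    eta = rng.gammavariate(a_eta, 1.0 / b_eta)
    w = [rng.gammavariate(1.0 / eta, eta) for _ in range(M)]   # mean 1, variance eta
    return {"lambda": lam, "beta": beta, "w": w, "eta": eta}

theta = draw_from_prior(M=5, K=10, p=9, rng=random.Random(42))
print(theta["eta"], theta["lambda"][:3])
```

Note that the frailties are drawn after $\eta$, which is what makes them exchangeable rather than independent a priori.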

3.4 The predictive distribution of the counting process of a new incoming donor

The predictive distribution of the point process of a new incoming donor with known covariates $x^{new}$ can be computed as:

$$\mathcal{L}(N^{new}(t) \mid \text{data}, x^{new}) = \int \mathcal{L}(N^{new}(t) \mid w^{new}, \lambda, x^{new}, \beta)\, \mathcal{L}(w^{new}, \lambda, \beta \mid \text{data})\, dw^{new}\, d\lambda\, d\beta, \tag{3.25}$$

which can be estimated through MCMC once $\mathcal{L}(N^{new}(t) \mid w^{new}, \lambda, x^{new}, \beta)$ is analytically known (for example in the case of a Poisson process).

The analytical expression of the law of $N^{new}(t)$ given the parameters would require the computation of the law of all the event times $T_1^{new}, T_2^{new}, \dots$, since

$$N^{new}(t) \ge k \iff T_k^{new} \le t. \tag{3.26}$$

However, the distribution of $T_k^{new}$ is not trivial if the intensity function depends on the history of the process, like the one used in this thesis. As a consequence, $\mathcal{L}(N^{new}(t) \mid \text{data}, x^{new})$ is estimated via MCMC, drawing one realization of the process for each vector of parameters drawn from the posterior distribution. To do this, it is necessary to be able to sample a realization of a recurrent event process with intensity function $\lambda(t \mid H(t), \theta)$. A possible strategy is to use the inversion method to draw a sequence of event times $\{T_1, T_2, \dots\}$ by using the cumulative distribution function of $T_j$ given $T_{j-1}$ obtained in (1.5).

Hence the sampling scheme is:

• $t_0^{new} := 0$;

At step $j$:

• derive $F(t_j^{new} \mid t_{j-1}^{new}, \theta, x^{new}) = P(T_j < t_j^{new} \mid T_{j-1} = t_{j-1}^{new}, \theta, x^{new})$;

• sample $U_j \sim \text{Unif}([0,1])$;

• $t_j^{new}$ solves the equation $U_j = F(t_j^{new} \mid t_{j-1}^{new}, \theta, x^{new})$.

The algorithm terminates when $t_j^{new}$ exits the time window.

The resulting sequence $\{t_0^{new}, t_1^{new}, \dots\}$ is a realization of a recurrent event process with intensity function $\lambda(t \mid H(t), \theta)$. An MCMC sample from $\mathcal{L}(N^{new}(t) \mid \text{data}, x^{new})$ can be obtained in the following way:

• $\theta^{(s)}$ is a vector of the MCMC sample from the posterior distribution of the model (3.14) ($s = 1, \dots, S$);

• sample $\{t_{0s}^{new}, t_{1s}^{new}, \dots\}$ from a recurrent event process with intensity function $\lambda(t \mid H(t), \theta^{(s)}, x^{new})$;

• $N_s^{new}(t) = \sum_{i \ge 1} I(t_{is}^{new} \le t)$.

$\{N_1^{new}(t), \dots, N_S^{new}(t)\}$ is a sample from $\mathcal{L}(N^{new}(t) \mid \text{data}, x^{new})$, the predictive distribution of the point process of a new incoming donor.
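The scheme above can be sketched as follows for the piecewise constant intensity with a rest period. For this intensity, solving $U_j = F(\cdot)$ is equivalent to finding the time at which the cumulative intensity accumulated after the rest period reaches $-\log(1 - U_j)$. The function names, the dictionary layout of the posterior draws and the step that draws $w^{new}$ from each $\eta^{(s)}$ (shape and rate $1/\eta^{(s)}$) are assumptions made for illustration.

```python
import math
import random
from bisect import bisect_right

def sample_event_times(cuts, rates, mult, rest, rng):
    """One path of event times; mult = w * exp(x' beta), rest = phi_G."""
    horizon, times, prev = cuts[-1], [], 0.0
    while True:
        t = prev + rest                          # intensity is 0 during the rest period
        if t >= horizon:
            break
        target = -math.log1p(-rng.random())      # cumulative intensity to be reached
        k = bisect_right(cuts, t) - 1            # interval containing the restart time
        event = None
        while k < len(rates):
            mass = mult * rates[k] * (cuts[k + 1] - t)   # intensity mass left in interval k
            if mass > target:
                event = t + target / (mult * rates[k])
                break
            target -= mass
            t = cuts[k + 1]
            k += 1
        if event is None:                        # the next event exits the time window
            break
        times.append(event)
        prev = event
    return times

def predictive_counts(posterior_draws, x_new, cuts, rest, t, rng):
    """N_s^new(t) for each posterior draw, given as a dict with keys lambda, beta, eta."""
    counts = []
    for draw in posterior_draws:
        w_new = rng.gammavariate(1.0 / draw["eta"], draw["eta"])
        lin = sum(x * b for x, b in zip(x_new, draw["beta"]))
        path = sample_event_times(cuts, draw["lambda"], w_new * math.exp(lin), rest, rng)
        counts.append(bisect_right(path, t))     # number of events with time <= t
    return counts
```

A usage example would pass the thinned Stan output as `posterior_draws`; the empirical distribution of the returned counts then approximates the predictive law of $N^{new}(t)$.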


4 Posterior inference on AVIS data

This chapter presents posterior and predictive inference for the models described in Chapter 3, applied to the AVIS data (see Chapter 2). The first section describes how the inference has been obtained, while the second section illustrates the inference on the parameters.

4.1 Posterior inference

Sampling from the posterior distribution has been done using Stan (Stan Development Team and others, 2016), which is more efficient for MCMC sampling than software based on the BUGS language, such as JAGS or WinBUGS. Stan has the drawback that it cannot sample discrete parameters, but this was not relevant for this model since all the parameters are continuous.

Unless stated otherwise, the sampling is performed with 50000 warm-up iterations followed by 50000 sampling iterations, thinned by a factor of 25. The result is an MCMC sample of 2000 draws. The likelihood of the model is (3.14); the priors for the parameters are specified in (3.18), (3.19) and (3.23) (or (3.20)). The covariates are the ones described in Section 3.2.3. The time domain (in days) is [0,3100].

The initial dataset was composed of 9175 donors. Among these individuals, 3238 entered the study without making any donation apart from the one at time t = 0. Since the goal of this work is to model the behaviour of a donor who makes multiple blood donations at a specific blood collection point, these 3238 individuals have been excluded from the analysis. Among the remaining 5937 persons, there are 87 donors who at least once did not respect the post-donation rest time of 85 days for men and 150 for women. Moreover, a further 92 donors have missing values in their covariates. The final sample is composed of 5758 donors and 25073 whole blood donations.

The convergence diagnostics of all the simulations have always been checked, showing that all the MCMC chains reached stationarity.


4.2 Inference on parameters

4.2.1 Baseline intensity function

The baseline intensity function is modelled as a step function. The steps are a priori independent Gamma-distributed random variables with fixed hyperparameters ($\alpha_\lambda = \beta_\lambda = 2$ in equation (3.4)). A sensitivity analysis showed that the model is robust with respect to the choice of the hyperparameters: simulations with $\alpha_\lambda = \beta_\lambda = 3$, with $\alpha_\lambda = \beta_\lambda = 0.01$, and with $\alpha_\lambda = 2.5$, $\beta_\lambda = 1.5$ have been run to check the robustness of the model.

Preliminary choices to be discussed are the type and the number of intervals (5, 10 or 20 intervals, denoted by $K$ in equation (3.3)). The intervals evaluated were either equispaced or had as cut-points the empirical quantiles of the donations that occurred in the eight and a half years preceding the study (recall that the study lasted from the 1st of January 2010 to the 30th of June 2018, namely eight and a half years). Posterior inference for each of these six possible choices is shown in Figure 4.1. Observe that in all the plots in Figure 4.1 the baseline intensity function has a decreasing trend over time, meaning that the propensity to donate is higher at the beginning of the life as a donor. However, there are some fluctuations at the end of the time domain in Figure 4.1e and in the first intervals in Figures 4.1d and 4.1f.

If the cut-points are defined by the quantiles, the first interval (in days) is [0,85], where no events occurred because of the post-donation rest time. In this case the MCMC algorithm sampled from the prior, namely a Gamma distribution with parameters $\alpha_\lambda = \beta_\lambda = 2$.

The six choices were evaluated using WAIC. However, the diagnostics of this method were not good (the majority of the terms of the sum in $p_{WAIC2}$ exceed the value 0.4, which, according to Gelman et al. (2014), can lead to an unreliable estimate of the lppd). Hence the log posterior predictive density was evaluated on the data. From Figure 4.2 it seems that the gain in predictive performance obtained by doubling the intervals from 10 to 20 is not significant; indeed, an "elbow" appears. Moreover, the estimated lppd values are nearly equal for the two choices of cut-points in the case of 20 intervals.
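The two quantities involved in this diagnostic can be sketched as follows, using the standard definitions: the lppd is the sum over units of the log of the posterior mean density, and $p_{WAIC2}$ is the sum over units of the posterior variance of the log density. The layout of the input (one row per MCMC draw, one column per unit) and the function name are assumptions; the 0.4 threshold is the one quoted in the text.

```python
import math
from statistics import variance

def lppd_and_pwaic2(log_lik):
    """log_lik[s][i] = log p(y_i | theta^(s)); returns (lppd, pointwise p_WAIC2 terms)."""
    S, n_units = len(log_lik), len(log_lik[0])
    lppd, p_terms = 0.0, []
    for i in range(n_units):
        col = [log_lik[s][i] for s in range(S)]
        m = max(col)
        # numerically stable log of the mean of exp(col)
        lppd += m + math.log(sum(math.exp(v - m) for v in col) / S)
        p_terms.append(variance(col))            # sample variance over the draws
    return lppd, p_terms

# Toy input with two draws and two units (hypothetical values)
log_lik = [[math.log(0.2), -1.0], [math.log(0.4), -3.0]]
lppd, p_terms = lppd_and_pwaic2(log_lik)
share_above = sum(p > 0.4 for p in p_terms) / len(p_terms)
print(lppd, sum(p_terms), share_above)
```

WAIC on the deviance scale is then $-2\,(\text{lppd} - p_{WAIC2})$, and `share_above` is the fraction of terms exceeding the threshold.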

The inference on the other parameters is robust with respect to the choice of the

intervals.


[Figure 4.1 panels, each showing the baseline rate function (vertical axis, 0 to 0.015) against days (horizontal axis, 0 to 3000): (a) 5 equispaced intervals; (b) 5 quantiles-defined intervals; (c) 10 equispaced intervals; (d) 10 quantiles-defined intervals; (e) 20 equispaced intervals; (f) 20 quantiles-defined intervals]

Figure 4.1: 95 % credibility intervals for the baseline intensity function


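[Figure 4.2 plot: estimated lppd (vertical axis, labelled elppd, from −149300 to −149000) against the number of intervals (horizontal axis, 5 to 20), with one curve for quantile-based intervals and one for equispaced intervals]

Figure 4.2: Estimated log posterior predictive density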

4.2.2 Covariates coefficients

The model also takes into account the dependence of the intensity function on some individual features. This relationship is captured by a multiplicative effect on the baseline intensity function, expressed as the exponential of a linear combination of the covariates with coefficients $\beta_i$, $i = 1, \dots, p$. The variables associated with each coefficient are presented in Section 3.2.3.

Figure 4.3 reports the 95 % credibility intervals for the covariate coefficients.

[Figure 4.3 plot: 95 % credibility intervals (vertical axis, from −0.2 to 0.6) for the coefficients of Age, Sex, BMI, Smoke, Alcohol, Active life, Type 0, Type A and Rh +]

Figure 4.3: 95 % credibility intervals for the $\beta_i$ parameters

To assess how significant these parameters are, Bayesian p-values can be computed:

$$\text{Bayesian p-value} = \min\{P(\beta_i > 0 \mid \text{data}),\; P(\beta_i < 0 \mid \text{data})\}. \tag{4.1}$$

A low Bayesian p-value denotes that 0 lies in the tail of the posterior distribution, and so that the coefficient is significant.

Coefficient    Bayesian p-value    Hazard ratio (q0.025, q0.975)
Age            0.00                (1.23, 1.30)
Sex            0.00                (1.58, 1.80)
BMI            0.19                (0.98, 1.04)
Smoke          0.00                (0.81, 0.92)
Alcohol        0.22                (0.92, 1.03)
Active life    0.14                (0.98, 1.10)
Type 0         0.35                (0.91, 1.07)
Type A         0.31                (0.90, 1.07)
Rh +           0.03                (0.86, 1.01)

Table 4.1: Bayesian p-values and hazard ratios

The variable Age, which denotes the age of the donor at entry into the study, has a positive effect on the intensity function: the older the individual, the higher the rate of donation. The variable Sex has a positive effect too, which means that men have a higher propensity to donate than women. The other two significant covariates are Smoke and Rh +, which have a negative effect on the rate function. Individuals with a negative Rhesus factor can receive transfusions only from other individuals with a negative Rhesus factor. The fact that a positive Rhesus factor lowers the intensity function may suggest that Rh negative individuals feel a greater responsibility in their role, considering that a negative Rhesus factor is known to be less frequent than a positive one.

The effect of the covariates can be quantified by computing $\exp(\beta_i)$, which is the ratio between the intensity functions of two individuals who differ by one unit in the $i$-th covariate, all other covariates being equal. For binary variables this is the effect of belonging to the group coded as 1. In survival analysis $\exp(\beta_i)$ is called the hazard ratio. The 95 % credibility intervals of the hazard ratios are in the third column of Table 4.1.
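Both summaries can be estimated directly from the MCMC draws of a coefficient, as in the sketch below. The function names and the toy draws are hypothetical; the interval uses empirical quantiles of $\exp(\beta_i)$, which is one reasonable choice among several.

```python
import math
from statistics import quantiles

def bayesian_p_value(beta_draws):
    """min{P(beta > 0 | data), P(beta < 0 | data)} estimated from MCMC draws."""
    S = len(beta_draws)
    p_pos = sum(b > 0 for b in beta_draws) / S
    p_neg = sum(b < 0 for b in beta_draws) / S
    return min(p_pos, p_neg)

def hazard_ratio_interval(beta_draws):
    """Empirical 2.5 % and 97.5 % quantiles of exp(beta)."""
    cuts = quantiles([math.exp(b) for b in beta_draws], n=40, method="inclusive")
    return cuts[0], cuts[-1]                     # cut-points at 1/40 and 39/40

draws = [-0.1, 0.1, 0.2, 0.3]                    # toy values, not thesis output
p_value = bayesian_p_value(draws)
lo, hi = hazard_ratio_interval(draws)
print(p_value, lo, hi)
```

Since the exponential is monotone, the same interval is obtained by exponentiating the 2.5 % and 97.5 % quantiles of the draws of $\beta_i$, up to interpolation between draws.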

To find an optimal subset of covariates, a sensitivity analysis has been carried out. Three models with different subsets of covariates have been compared. The cut-points were fixed to 10 quantiles-defined intervals. As explained before, WAIC was not reliable for estimating the log posterior predictive density, so 10-fold cross-validation has been used as a measure of predictive accuracy. The three sets of covariates compared are:

• the maximal set of covariates (p = 9);

• only the significant covariates (Age, Sex, Smoke, Rh +) and the dummies for the blood type (p = 6);

• only the significant covariates (p = 4).

Age  Sex  BMI  Smoke  Alcohol  Active life  Type 0  Type A  Rh   lppd         p
✓    ✓    ✓    ✓      ✓        ✓            ✓       ✓       ✓    -149085.04   9
✓    ✓    ✗    ✓      ✗        ✗            ✓       ✓       ✓    -149087.98   6
✓    ✓    ✗    ✓      ✗        ✗            ✗       ✗       ✓    -149087.14   4

Table 4.2: Predictive performance evaluation of models with different sets of covariates using 10-fold cross-validation.

The estimates of the coefficients are robust with respect to the presence of other features in the model. The results of the comparison of predictive accuracy are in Table 4.2. The differences among the estimated lppd's do not seem significant. In this case the best practice could be to select the simplest model, that is the model with p = 4 (or the one with p = 6 if one wants to keep the information about blood types in the model).
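The lppd values compared above can be computed from pointwise log-likelihoods evaluated on held-out data. The sketch below assumes the usual definition, lppd = Σ_i log( (1/S) Σ_s p(y_i | θ^(s)) ), and assumes that for each fold the draws θ^(s) come from the model fitted on the other folds. The function names and the layout `log_lik[s][i]` are illustrative choices, not taken from the thesis code.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def lppd(log_lik):
    """log_lik[s][i] = log p(y_i | theta^(s)) for draw s and held-out observation i."""
    n_obs = len(log_lik[0])
    return sum(log_mean_exp([row[i] for row in log_lik]) for i in range(n_obs))

def kfold_lppd(fold_log_liks):
    """Sum of the held-out lppd contributions over the folds."""
    return sum(lppd(fold) for fold in fold_log_liks)
```

Working on the log scale with `log_mean_exp` avoids underflow when individual likelihood values are very small, which is typical for totals of the magnitude reported in Table 4.2.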

(a) Scatterplot of the posterior mean of each w_i against the empirical rate of donations of the i-th individual (x-axis: log(Rate); y-axis: log of the posterior mean of w_i)

(b) Marginal posterior densities of the frailty of 3 donors in the sample and posterior predictive density of w_new (y-axis: Density; legend: New donor, Donor-specific)

Figure 4.4: Summaries of w_i

(a) Predictive density for the next donation of the 19-th individual (x-axis: Days; y-axis: Density; legend: Donor-specific, New donor)

(b) Posterior density of w_19 and predictive density for w_new

(c) Predictive density for the next donation of the 28-th individual (x-axis: Days; y-axis: Density; legend: Donor-specific, New donor)

(d) Posterior density of w_28 and predictive density for w_new

Figure 4.5: Predictive densities of T_{i,n_i+1} given T_{i,n_i} for some donors

Figure 4.6: 95% posterior predictive credibility intervals of w_new,j, j = 1, ..., J, the frailty of a new donor from zone j. In grey, the estimate obtained with the model with no areal dependence. (Axis: CI for w_new[j]; zones grouped as Cities in the province of Milan, Districts in the city of Milan, Provinces of Lombardy, Cities outside Lombardy, No zone dependence.)

4.2.3 Random effects

As already mentioned, the individual-specific random effects have been modelled in two ways. In the first, the random effects are given an exchangeable Gamma prior with mean equal to 1 and a common variance parameter η.

One random effect is estimated for each individual in the dataset. As can be seen in Figure 4.4a, there appears to be a linear association between the observed rate of donation of the donor and the posterior mean of the frailty (both on the logarithmic scale). Every individual in the dataset contributes to the estimate of the variance η of the random-effects population, and so it is possible to estimate the predictive density of the frailty w_new of a new incoming donor (see Section 3.2.2). Moreover, every donor in the dataset is characterized by the posterior density of his/her frailty (some examples, compared with the predictive density for a new incoming donor, are shown in Figure 4.4b), and so it is possible to make an individual-specific prediction. Figure 4.5 shows, for some donors in the dataset, the predictive density of the next donation given the last observed donation. The density displayed in blue is computed with the predictive density of w_new (as if the donor were not in the sample), while the one in red is computed with the individual-specific random effect w_i.

Summing up, Figures 4.4 and 4.5 show that there is heterogeneity between individuals that is not captured by the observable features. The random effects do not concentrate on a single value (as they would if every donor experienced the recurrent event process in the same manner), but are spread over a wide range (see Figure 4.4a). Therefore, this approach is useful for making individual-specific predictions for the individuals in the sample (see Figures 4.5a and 4.5c). Furthermore, the variability with which a new donor approaches blood donation is captured by the predictive density of the random effect w_new of a new donor.
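The predictive density of w_new can be approximated by drawing one frailty for each posterior draw of η. The sketch below assumes the Gamma prior described above (mean 1, variance η), which corresponds to shape 1/η and scale η. The function name `draw_new_frailties` and the values in `eta_draws` are hypothetical placeholders for the actual MCMC output.

```python
import random

def draw_new_frailties(eta_draws, rng):
    """One draw of w_new per posterior draw of eta.

    A Gamma with shape 1/eta and scale eta has mean 1 and variance eta.
    """
    return [rng.gammavariate(1.0 / eta, eta) for eta in eta_draws]

rng = random.Random(1)
eta_draws = [0.5] * 20000          # hypothetical posterior draws of eta
w_new = draw_new_frailties(eta_draws, rng)

mean_w = sum(w_new) / len(w_new)
var_w = sum((w - mean_w) ** 2 for w in w_new) / (len(w_new) - 1)
```

A histogram or kernel density estimate of `w_new` would play the role of the "New donor" curve, while the "Donor-specific" curves would instead be built from the MCMC draws of each w_i.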

4.2.3.1 Areal dependent frailties

The second approach to random effects has been to assign each individual to an area according to the postal code of his/her residence. The number of different zones is denoted by J. The prior is specified in (3.20), (3.21) and (3.22). In this way the n_j parameters are exchangeable and they can "borrow strength" from each other in the estimation of the posterior distribution. The division into zones has been made in the following way:

• each municipality in the province of Milan has its specific zone (51 zones);

• each of the 38 postal codes associated with a district in the municipality of Milan has its specific zone (38 zones);

• one zone for each province in Lombardy (11 zones: Bergamo, Brescia, Como, Cremona, Lecco, Lodi, Mantova, Monza e Brianza, Pavia, Sondrio, Varese);

• another zone which collects all the municipalities that do not belong to the previous categories.

This division results in J = 102 zones.

Figure 4.6 shows the posterior predictive credibility intervals for the random effect of a new incoming individual from zone j. The plot does not suggest any particular dependence on the area of origin, since the differences from the estimate obtained with the model without zone dependence (in grey in the figure) are not significant. In addition, each of the credibility intervals in Figure 4.6 has been colored according to a further division into 4 macro-areas (red for districts of Milan, blue for municipalities in the province of Milan, yellow for cities in other provinces of Lombardy and green for the "rest of the world"). Fitting the model with this division (J = 4) did not reveal any significant dependence of the random effects on the resulting zones either.

Figure 4.7: Pointwise predictive 95% credibility intervals for N_new(t) | x_new, where x_new is set to the mean (or to the mode) of the features used as covariates (panels: CI male donor, CI female donor; x-axis: Day; y-axis: N(t))

Figure 4.8: Mean functions for N_new(t) | x_new, data. Unless stated otherwise, the covariates are set to the mean (or to the mode) (panels: Male and Female; Male with Age = 0, 1, −1; Male, Non-smoker and Smoker; Male, Rh − and Rh +; x-axis: Day; y-axis: E[N(t) | data, x])

4.2.4 Predictive density for the count process of a new incoming donor

The law of N(t) given the data can be estimated by simulating one recurrent event process for each draw in the MCMC sample (see Section 3.4). Figure 4.7 displays 95% posterior predictive credibility intervals for N_new(t) in the case of a male and of a female new incoming donor (t ranges over the first two years after the first blood donation). The plots refer to a vector of covariates x_new equal to 0 for the continuous variables (i.e. the sample mean) and equal to the sample modes for the categorical variables (non-smoker, non-alcohol consumer, blood type 0, positive Rh).

Figure 4.8 displays the posterior mean function for N_new(t) | x_new for some possible configurations of the covariates. The mean function of a man is roughly double that of a woman. This is natural since, according to the law, a man has twice as many opportunities to donate in one year as a woman. The posterior predictive credible bands for N_new(t) become wider and wider over time, since the lower bound remains close to 0 over the whole time domain, while the upper bound increases with time.
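The simulation idea can be sketched as follows, under simplifying assumptions that are not taken from the thesis: the intensity is a piecewise-constant baseline multiplied by a donor-level factor (standing in for w exp(x'β)), the mandatory rest period after each donation is ignored, and all numerical values are hypothetical. One path is simulated per draw, and pointwise quantiles of the counts give the credible band.

```python
import random

def simulate_event_times(cut_points, rates, multiplier, horizon, rng):
    """Event times on [0, horizon] of a Poisson process whose intensity is
    multiplier * rates[k] on the k-th interval, which starts at cut_points[k]."""
    times, t, k = [], 0.0, 0
    while t < horizon:
        # Right end of the current constant-rate interval, capped at the horizon.
        end = min(cut_points[k + 1], horizon) if k + 1 < len(cut_points) else horizon
        rate = multiplier * rates[k]
        gap = rng.expovariate(rate) if rate > 0 else float("inf")
        if t + gap < end:
            t += gap
            times.append(t)
        else:
            # No event before the interval ends: restart from its right end
            # (valid because exponential waiting times are memoryless).
            t = end
            k = min(k + 1, len(rates) - 1)
    return times

def count_at(times, t):
    """N(t): number of events up to and including time t."""
    return sum(1 for e in times if e <= t)

def pointwise_band(paths, grid, level=0.95):
    """Equal-tailed pointwise band for N(t) over the simulated paths."""
    band = []
    for t in grid:
        counts = sorted(count_at(p, t) for p in paths)
        lo = counts[int((1 - level) / 2 * (len(counts) - 1))]
        hi = counts[int(round((1 + level) / 2 * (len(counts) - 1)))]
        band.append((lo, hi))
    return band

rng = random.Random(0)
cut_points = [0.0, 365.0]          # hypothetical interval starts (days)
baseline = [0.012, 0.008]          # hypothetical baseline rates per day
# Stand-ins for w * exp(x'beta), one value per MCMC draw.
multipliers = [rng.gammavariate(2.0, 0.5) for _ in range(500)]
paths = [simulate_event_times(cut_points, baseline, m, 730.0, rng) for m in multipliers]
band = pointwise_band(paths, [180.0, 365.0, 730.0])
```

Averaging `count_at` over the paths at each grid point would give the corresponding estimate of the mean function.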

4.3 Point predictions

Let us consider the Mean Absolute Error (MAE) between some predicted values y∗i and

the respective real observed values yi.

\[
\mathrm{MAE} = \frac{1}{M} \sum_{i=1}^{M} \left| y_i^* - y_i \right| \tag{4.2}
\]

MAE is easily interpretable as the average absolute error between the predictions and the real values. In the case of recurrent events y_i and y*_i represent days, and so the forecast error has a clear unit of measurement.

In order to have an intuitive measure of accuracy, point predictions of the last donation, obtained from the posterior predictive distribution, have been considered.

To have an unbiased measure of the prediction error, the dataset has been divided into a training set and a test set. The training set is composed of all the donations except the last one of each donor, who is considered censored after the last-but-one donation. The test set is composed of the last donation of each donor. With this division, it is possible to obtain an MCMC sample of w_i for i = 1, ..., M, and to evaluate the model using these individual-specific parameters as well.

Using the model with 10 quantile-defined cut-points, for each i = 1, ..., M an MCMC sample from L(T_{i,n_i} | T_{i,n_i−1}, data, x_i) has been obtained. Then, the mean and the median of the posterior predictive distribution have been estimated. All the different subsets of covariates (p = 4, 6, 9) have been considered to evaluate the point predictive accuracy. Summing up:

• MEANp4, MEANp6, MEANp9 are the predictors that use the mean of the posterior predictive distribution for T_{i,n_i} given T_{i,n_i−1}. The subscripts indicate how many covariates are used in the test set;

• MEDIANp4, MEDIANp6, MEDIANp9 are the predictors that use the median of the posterior predictive distribution for T_{i,n_i} given T_{i,n_i−1}. The subscripts indicate how many covariates are used in the test set.

Moreover, some "naive" predictors have been taken into account:

• NAIVE_MEAN for donor i predicts T_{i,n_i−1} plus the mean of the gap times W_{i,1} = T_{i,2} − T_{i,1}, ..., W_{i,n_i−2} = T_{i,n_i−1} − T_{i,n_i−2};

• NAIVE_MEDIAN for donor i predicts T_{i,n_i−1} plus the median of the gap times W_{i,1} = T_{i,2} − T_{i,1}, ..., W_{i,n_i−2} = T_{i,n_i−1} − T_{i,n_i−2};

• NAIVE_MEAN_ALL for donor i predicts T_{i,n_i−1} plus the mean gap time of all the donors of the same sex as i;

• NAIVE_MINIMUM for donor i predicts T_{i,n_i−1} plus the minimum gap time according to AVIS rules (i.e. 90 days if i is male, and 180 if i is female).
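The naive predictors above are simple enough to be written down directly. The following sketch is one possible reading of their definitions: the function names, the data layout (a list of training donation times in days per donor) and the pooling of all gaps in NAIVE_MEAN_ALL are assumptions, while the 90 and 180 day minimum gaps come from the text.

```python
from statistics import mean, median

MIN_GAP_DAYS = {"M": 90, "F": 180}   # minimum gap times stated in the text

def gap_times(times):
    """Gap times W_1, ..., W_{n-1} between consecutive donation times."""
    return [b - a for a, b in zip(times, times[1:])]

def naive_mean(train_times):
    return train_times[-1] + mean(gap_times(train_times))

def naive_median(train_times):
    return train_times[-1] + median(gap_times(train_times))

def naive_mean_all(train_times, sex, donors):
    """donors: list of (sex, training times); gaps of same-sex donors are pooled."""
    pooled = [g for s, times in donors if s == sex for g in gap_times(times)]
    return train_times[-1] + mean(pooled)

def naive_minimum(train_times, sex):
    return train_times[-1] + MIN_GAP_DAYS[sex]

# Hypothetical training histories (days since the start of the study).
donors = [("M", [0, 100, 220, 330]), ("M", [0, 200, 400]), ("F", [0, 190, 400])]
history = donors[0][1]
predictions = {
    "NAIVE_MEAN": naive_mean(history),
    "NAIVE_MEDIAN": naive_median(history),
    "NAIVE_MEAN_ALL": naive_mean_all(history, "M", donors),
    "NAIVE_MINIMUM": naive_minimum(history, "M"),
}
```

NAIVE_MEAN and NAIVE_MEDIAN need at least two training donations per donor, otherwise there is no gap time to summarize.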

Table 4.3 shows that, among the model-based predictors, those that perform best according to MAE are MEDIANp6 and MEDIANp9 (118.15 and 118.19 days). They are comparable to the naive predictor NAIVE_MEAN_ALL (117.81 days), which only uses the mean gap time of the donors of the same sex to predict the next donation.

Another possible measure of point prediction error is the Root Mean Square Error

(RMSE).

\[
\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left( y_i^* - y_i \right)^2} \tag{4.3}
\]

By squaring each error, RMSE penalizes large deviations between predictions and observations more heavily than MAE does. However, RMSE is not as easy to interpret as MAE.
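The difference between the two measures can be seen on a small made-up example. The sketch below implements (4.2) and (4.3) directly; the two sets of predictions are hypothetical and are chosen so that the total absolute error is the same, spread evenly in one case and concentrated on a single donor in the other.

```python
import math

def mae(predicted, observed):
    """Mean Absolute Error, as in (4.2)."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

def rmse(predicted, observed):
    """Root Mean Square Error, as in (4.3)."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

observed = [100, 200, 300, 400]        # hypothetical donation days
even_errors = [110, 210, 310, 410]     # every prediction is off by 10 days
one_large = [100, 200, 300, 440]       # a single prediction is off by 40 days

mae_even, mae_large = mae(even_errors, observed), mae(one_large, observed)
rmse_even, rmse_large = rmse(even_errors, observed), rmse(one_large, observed)
```

Both sets of predictions have an MAE of 10 days, but the RMSE is 10 days in the first case and 20 days in the second, because the single large error dominates the sum of squares.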

Posterior mean predictors perform better in terms of RMSE. Moreover, according to this measure and unlike MAE, the naive predictors (apart from NAIVE_MEAN_ALL) do not offer the same accuracy as the posterior predictive summaries (see Table 4.3).

PREDICTOR        MAE      RMSE
MEANp4           136.32   227.68
MEANp6           137.96   225.25
MEANp9           137.98   225.21
MEDIANp4         120.21   234.78
MEDIANp6         118.15   230.00
MEDIANp9         118.19   229.96
NAIVE_MEAN       125.79   247.85
NAIVE_MEDIAN     124.24   256.59
NAIVE_MEAN_ALL   117.81   231.18
NAIVE_MINIMUM    133.95   260.44

Table 4.3: Point prediction errors

5 Forecasting new donors

Previous chapters deal with modelling the behaviour of already enrolled donors. To have a complete picture of the number of blood donations in a specific collection center, a time series model for new donors is proposed in this chapter. First, State Space Models will be presented. Later, this family of models will be applied to AVIS data in order to estimate the weekly number of new incoming donors.

5.1 State Space Models

State Space Models (SSMs) are widely used in time series analysis. Within this framework, the time series is decomposed into two parts. The first part represents the observational level and usually consists of temporally independent specifications of the elements of the time series. The second part, instead, describes the evolution of the process at a latent, unobserved level. The unobservable variables introduced are often referred to as states. The result is a very flexible and general class of latent-variable models which can be used in many applications.

SSMs were originally introduced to model continuous time series data, but a straightforward extension to discrete-valued time series has subsequently been developed. SSMs can also be tackled from a Bayesian perspective. One of the most general cases of SSM was introduced by West et al. (1985) and is called the dynamic generalized linear model (DGLM). Consider the time series y_1, ..., y_T, and let EF(µ, φ) denote an exponential family distribution with mean µ and variance φ c(µ), where c(µ) is a function of the mean.

The decomposition for t = 1, . . . ,T is given by the equations

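\begin{align}
\text{Observation equation:} \quad & y_t \mid x_t, \theta \overset{\text{ind}}{\sim} EF(\mu_t, \phi) \tag{5.1} \\
\text{Link function:} \quad & g(\mu_t) = z_t' x_t \tag{5.2} \\
\text{System equation:} \quad & x_t = G_t x_{t-1} + w_t \tag{5.3} \\
\text{Residual equation:} \quad & w_t \mid \theta \overset{\text{ind}}{\sim} \mathcal{N}(0, W) \tag{5.4}
\end{align}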

where

• z_t is a known vector at time t, which could possibly include covariates;

• x_t is the latent state at time t;

• G_t is the matrix that describes the evolution of the latent state;

• θ is the vector of all the hyperparameters (including φ and W).

A prior distribution on the hyperparameters θ and on the initial state x_0 completes the formulation of the model from the Bayesian perspective.

The DGLM considers only linear models at the link-relation and system-evolution levels; however, linearity is a suitable hypothesis in many applications. The general formulation of the observation equation makes it possible to treat both continuous (e.g. with a Gaussian density) and discrete (with Poisson, Binomial or Negative Binomial distributions) time series data.

Some useful features (such as the level of the series, the local growth and the seasonality) can be represented within this formulation. This point is developed in Section 5.3, where the employed model is specified.
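As a small worked instance of (5.1)-(5.4), consider a Poisson model with a single latent level. This special case is given only for illustration and is simpler than the model of Section 5.3; it is obtained by taking the logarithm as link function, a scalar state, z_t = 1 and G_t = 1:

```latex
\begin{align*}
y_t \mid x_t &\overset{\text{ind}}{\sim} \text{Poisson}(\mu_t), \\
\log(\mu_t) &= x_t, \\
x_t &= x_{t-1} + w_t, \\
w_t \mid W &\overset{\text{ind}}{\sim} \mathcal{N}(0, W).
\end{align*}
```

Here the variance of y_t equals its mean, so that φ = 1 and c(µ) = µ in the notation EF(µ, φ), and the only hyperparameter in θ is the state variance W.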

5.2 Descriptive analysis

As mentioned in the previous chapter, 9175 individuals became donors in the period from the 1st of January 2010 to the 30th of June 2018. While not all the donors were considered in the analysis of blood donations as recurrent events, in this case there is no reason to keep any of them out of the analysis. Indeed, in the first case the goal was to estimate the behaviour of the existing donors, and so an individual who had just entered the study without any further donation could not be considered as drawn from the population of recurrent donors. On the other hand, each entry into the study is associated with a whole blood donation and is part of the blood supply chain, even if made by a non-recurrent donor.

The time series considered is the weekly number of new incoming donors. The data collection period starts on the 1st of January 2010, which is a Friday. As a consequence y_t, with t = 1, ..., N, is the number of new donors in week t, which runs from Friday to the following Thursday. The resulting time series has length N = 443. Figure 5.1, which displays boxplots of the weekly arrivals of new donors grouped by year, shows that this number has grown over the years.
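The construction of the weekly series can be sketched as follows. The start date and the length N = 443 are those given in the text; the function names are hypothetical, and discarding the days after the 443rd complete week (the 29th and 30th of June 2018) is an assumption about how the final partial week was treated.

```python
from collections import Counter
from datetime import date

START = date(2010, 1, 1)   # first day of data collection, a Friday
N_WEEKS = 443              # length of the series reported in the text

def week_index(d):
    """1-based index of the Friday-to-Thursday week containing date d."""
    return (d - START).days // 7 + 1

def weekly_counts(first_donation_dates):
    """Number of new donors in each week t = 1, ..., N_WEEKS."""
    counts = Counter(week_index(d) for d in first_donation_dates)
    return [counts.get(t, 0) for t in range(1, N_WEEKS + 1)]

# Hypothetical first-donation dates (illustration only).
example = [date(2010, 1, 1), date(2010, 1, 5), date(2010, 1, 8), date(2018, 6, 30)]
y = weekly_counts(example)
```

With this indexing, week 443 ends on Thursday the 28th of June 2018, which is consistent with the stated length of the series.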

Figure 5.1: Weekly arrivals of new donors grouped by years (x-axis: YEAR, from 2010 to 2018; y-axis: WEEK COUNT)

A seasonal pattern can be seen in Figure 5.2. Indeed, the number of new donors declines in January, August and December.

Figure 5.3 shows the whole time series. It is interesting to observe that some high peaks appear in 2016 and 2017, possibly because some exceptional events occurred at that time. In particular, at the end of August 2016 an earthquake hit Central Italy, causing casualties and damage. As a consequence, health authorities appealed to people to donate blood in order to cope with the emergency.

The empirical distribution of the time series is summarized in Table 5.1. On average there are about 20 new donors every week.

Figure 5.2: Weekly arrivals of new donors grouped by months (x-axis: MONTH, from 1 to 12; y-axis: WEEK COUNT)

Figure 5.3: Time series of the weekly arrivals (x-axis: week index, from 2010 to 2018; y-axis: Weekly arrivals)

Minimum        2
1st Quartile   13
Median         20
3rd Quartile   27
Maximum        99
Mean           20.66
Sd             9.94

Table 5.1: Summaries of the empirical distribution of the time series of the weekly arrivals

5.3 A Bayesian model for the new donors

In this section we describe the class of models used in this work. In the observation equation, the counts of new arrivals are modelled as Poisson random variables, independent conditionally on the parameters. For each t the Poisson parameter is the exponential of the sum of two components. The first component is the trend µ_t, while the second is the seasonal effect τ_t, which has a periodicity of S = 52 weeks, namely a year. A further term δ_t has the interpretation of the local growth of the parameter µ_t. All three classes of parameters µ_t, δ_t and τ_t are modelled a priori as random walks centered on a linear combination of their past values. In particular, µ_t is centered on µ_{t−1} with a slope correction given by δ_t, which in turn is centered on δ_{t−1}. As for the seasonal effect τ_t, for any t the sum of the components over a period, ∑_{s=0}^{S−1} τ_{t−s}, has mean equal to zero. No particular features are available to be used as covariates.

The standard deviations of the hidden variables are a priori assumed independent and marginally uniformly distributed on the interval [0, T], where T has been fixed to 100. The following model (Model 1) has been implemented in Stan (Stan Development Team and others, 2016), with 100000 warm-up iterations and 200000 sampling iterations (thinned every 50 iterations), so that an MCMC sample of 4000 draws has been obtained.

Summing up:

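\begin{align}
y_t \mid \lambda_t &\overset{\text{ind}}{\sim} \text{Poisson}(\lambda_t), \quad t = 1, \dots, N \tag{5.5} \\
\log(\lambda_t) &= \mu_t + \tau_t \tag{5.6} \\
\mu_t \mid \mu_{t-1}, \delta_t, \sigma_\mu &\overset{\text{ind}}{\sim} \mathcal{N}(\mu_{t-1} + \delta_t, \sigma_\mu^2) \quad \text{(trend)} \tag{5.7} \\
\sigma_\mu &\sim \text{Uniform}([0, T]) \tag{5.8} \\
\delta_t \mid \delta_{t-1}, \sigma_\delta &\overset{\text{ind}}{\sim} \mathcal{N}(\delta_{t-1}, \sigma_\delta^2) \quad \text{(local growth)} \tag{5.9} \\
\sigma_\delta &\sim \text{Uniform}([0, T]) \tag{5.10} \\
\tau_t \mid \tau_{t-1}, \dots, \tau_{t-S+1}, \sigma_\tau &\overset{\text{ind}}{\sim} \mathcal{N}\Big( -\sum_{s=1}^{S-1} \tau_{t-s}, \sigma_\tau^2 \Big) \quad \text{(seasonal effect)} \tag{5.11} \\
\sigma_\tau &\sim \text{Uniform}([0, T]) \tag{5.12}
\end{align}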

As can be seen in Figure 5.4, the traceplot of the parameter σ_δ shows that the chain has a non-negligible autocorrelation.

Figure 5.4: Traceplots of the variance parameters of Model 1 (panels: SIGMA_MU, SIGMA_DELTA, SIGMA_TAU)

As an alternative we have considered a second model (Model 2), obtained by removing the local slope component δ_t. The sampler settings are the same as for Model 1, and the convergence of the chain has been checked.

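\begin{align}
y_t \mid \lambda_t &\overset{\text{ind}}{\sim} \text{Poisson}(\lambda_t), \quad t = 1, \dots, N \tag{5.13} \\
\log(\lambda_t) &= \mu_t + \tau_t \tag{5.14} \\
\mu_t \mid \mu_{t-1}, \sigma_\mu &\overset{\text{ind}}{\sim} \mathcal{N}(\mu_{t-1}, \sigma_\mu^2) \quad \text{(trend)} \tag{5.15} \\
\sigma_\mu &\sim \text{Uniform}([0, T]) \tag{5.16} \\
\tau_t \mid \tau_{t-1}, \dots, \tau_{t-S+1}, \sigma_\tau &\overset{\text{ind}}{\sim} \mathcal{N}\Big( -\sum_{s=1}^{S-1} \tau_{t-s}, \sigma_\tau^2 \Big) \quad \text{(seasonal effect)} \tag{5.17} \\
\sigma_\tau &\sim \text{Uniform}([0, T]) \tag{5.18}
\end{align}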

5.4 Posterior inference

Figures 5.5 and 5.6 display the posterior means of each class of parameters into which the time series is decomposed under the two models. The series {µ_t, t = 1, ..., N} has an increasing trend, which confirms the rise in the number of new arrivals over the years observed in Figure 5.1.

Let θ_t^(s) denote the s-th draw of the MCMC sample of the parameter θ_t. An MCMC sample from the predictive distribution L(y_{N+k} | data), k ≥ 1, can be obtained with the following scheme:

• draw δ^(s)_{N+k} from N(δ^(s)_{N+k−1}, (σ^(s)_δ)^2) (or set it equal to 0 in the case of Model 2);

• draw µ^(s)_{N+k} from N(µ^(s)_{N+k−1} + δ^(s)_{N+k}, (σ^(s)_µ)^2);

• draw τ^(s)_{N+k} from N(−∑_{p=1}^{S−1} τ^(s)_{N+k−p}, (σ^(s)_τ)^2);

• draw y^(s)_{N+k} from Poisson(exp(µ^(s)_{N+k} + τ^(s)_{N+k})).
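The scheme can be sketched for a single MCMC draw as follows. This is an illustration under stated assumptions, not the code used for the thesis: the trend is centered on the freshly drawn local growth as in (5.7), the seasonal mean uses the last S − 1 seasonal states, the function names and all numerical values are hypothetical, and the Poisson sampler is a simple stand-in (Knuth's multiplication method for small means, a rounded normal approximation otherwise).

```python
import math
import random

def poisson_draw(lam, rng):
    """Poisson sample: Knuth's method for small lam, normal approximation otherwise."""
    if lam < 30.0:
        limit, k, prod = math.exp(-lam), 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        return k
    return max(0, int(round(rng.gauss(lam, math.sqrt(lam)))))

def forecast_path(mu, delta, tau_hist, sig_mu, sig_delta, sig_tau, steps, season, rng):
    """Simulate y_{N+1}, ..., y_{N+steps} from the last states of one MCMC draw.

    mu, delta: last trend and local growth; tau_hist: past seasonal states
    (at least season - 1 values). For Model 2 pass delta = 0 and sig_delta = 0.
    """
    tau = list(tau_hist[-(season - 1):])      # last S - 1 seasonal states
    ys = []
    for _ in range(steps):
        delta = rng.gauss(delta, sig_delta)   # local growth
        mu = rng.gauss(mu + delta, sig_mu)    # trend
        tau_new = rng.gauss(-sum(tau), sig_tau)
        tau = tau[1:] + [tau_new]             # slide the seasonal window
        ys.append(poisson_draw(math.exp(mu + tau_new), rng))
    return ys

rng = random.Random(0)
# Hypothetical last states and standard deviations for one draw.
path = forecast_path(3.0, 0.002, [0.0] * 51, 0.2, 0.05, 0.45, 12, 52, rng)
```

Repeating the call for every draw in the MCMC sample and taking pointwise quantiles of the simulated counts at each step k would give the credible bands.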

The sequence {y^(s)_{N+k} : s = 1, ..., 4000} is an MCMC sample from the predictive distribution of y_{N+k}. Figure 5.7 shows the credible bands for the prediction of new arrivals for k = 1, ..., 52. The two models are in agreement. After the 10-th week the prediction starts to oscillate, with an amplitude that grows as the weeks pass. This is due to the fluctuations of the prediction of the seasonal component (see Figure 5.8). Table 5.2 shows the numerical values of the prediction for the first 12 weeks, before the oscillations of the credible bands.

Figure 5.5: Model 1: decomposition of the time series (panels: Trend, Local growth, Seasonal Trend; x-axis: Week)

Figure 5.6: Model 2: decomposition of the time series (panels: Trend, Seasonal Trend; x-axis: Week)

(a) Prediction Model 1

(b) Prediction Model 2

Figure 5.7: Prediction of new weekly arrivals: 95% credibility intervals (x-axis: Week; y-axis: Prediction)

Figure 5.8: Predictive mean of the seasonal component (x-axis: Step; y-axis: Seasonal Trend: predictive mean)

Step forward   q0.025   Median   q0.975
 1             6.00     15.00     30.00
 2             5.00     15.00     35.00
 3             5.00     16.00     41.00
 4             4.00     17.00     44.00
 5             4.00     16.00     48.00
 6             3.00     15.00     49.00
 7             3.00     16.00     61.00
 8             3.00     16.00     69.00
 9             2.00     12.00     56.00
10             1.00     11.00     53.00
11             2.00     17.00     81.00
12             3.00     25.00    138.00

Table 5.2: Prediction of future weekly arrivals


Conclusions and further developments

In this thesis, we have proposed a statistical model to describe and predict recurrent blood donations. Forecasting the number of arrivals at a blood collection centre is very important for planning efficiently the storage of this resource in transfusion centres. A solution to this problem would benefit the whole healthcare system by improving the quality of the service from the donors' point of view, by reducing the costs of the service and by leading to an increase in the number of donations. This work has been possible thanks to the collaboration of AVIS Milan, which provided the data.

The approach followed in this work has been to enter a donor into the study once he/she donates for the first time in his/her life. Then, all the subsequent donations have been modelled as a recurrent event process, using the Bayesian approach. Since the focus was on event counts over time, the intensity function has been modelled in the framework of the multiplicative model, with a step function as the baseline intensity function. The analysis revealed a decreasing trend in the rate of donations, meaning that a donor has a higher propensity to donate at the beginning of his/her donor life than after some time has passed.

Four covariates have been identified as significant: the gender of the donor, his/her age, smoking habits and the Rhesus factor. However, all these covariates are time-fixed because they were recorded at the beginning of the study; hence a possible extension could be to introduce time-dependent features into the model.

The heterogeneity among donors has been captured using random effects in the intensity function. These parameters are individual-specific and allow us to discriminate among donors, with the posterior distribution of each random effect summarising the donor's reliability. Moreover, with this approach it is possible to customize the prediction for each donor in the sample and, for new incoming donors, to make predictions that take into account the variability between individuals.
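
The prediction for a new incoming donor can be sketched as follows. This is only an illustrative Python example, not the implementation used in the thesis: it assumes a gamma-distributed random effect with mean 1 and a Poisson count given the random effect, and the function name `predict_new_donor` and all input values are hypothetical. For each posterior draw of the model parameters, a new random effect is drawn from the population distribution, so that the resulting predictive counts reflect the variability between individuals.

```python
import numpy as np

def predict_new_donor(cum_baseline_draws, linpred_draws, frailty_shape_draws, rng):
    """Return one predictive donation count per posterior draw.

    cum_baseline_draws  -- cumulative baseline intensity over the horizon
    linpred_draws       -- linear predictor x' beta of the new donor
    frailty_shape_draws -- shape a of the assumed Gamma(a, rate a) random effect
    """
    cum = np.asarray(cum_baseline_draws, dtype=float)
    eta = np.asarray(linpred_draws, dtype=float)
    shape = np.asarray(frailty_shape_draws, dtype=float)
    # The new donor has no data, so the random effect comes from the
    # population distribution (assumption: gamma with mean 1).
    w_new = rng.gamma(shape, 1.0 / shape)
    mean_count = w_new * np.exp(eta) * cum
    return rng.poisson(mean_count)

# Placeholder "posterior draws", used only to make the sketch runnable.
rng = np.random.default_rng(0)
n_draws = 1000
counts = predict_new_donor(
    cum_baseline_draws=np.full(n_draws, 3.5),
    linpred_draws=np.zeros(n_draws),
    frailty_shape_draws=np.full(n_draws, 2.0),
    rng=rng,
)
print(np.percentile(counts, [5, 50, 95]))
```

For a donor already in the sample, the same computation would use the posterior draws of that donor's own random effect instead of fresh draws from the population distribution.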

Suspensions from donation could, in principle, be handled by the model. However, the corresponding data turned out to be noisy, so this phenomenon was not included in the model formulation. A better understanding of these data would help to formulate the model so that it handles suspensions properly.

Another question that deserves further investigation is the different deferral time for women depending on whether or not they are in menopause. Identifying the two sub-populations could be a way to improve the model, since the mandatory rest time after a donation is a fundamental part of it.

To have a complete picture of the number of blood donations in a specific blood collection centre, the time series of new donors' arrivals has also been modelled. However, this part of the work should be regarded as preliminary, and indeed some issues arose. For example, Stan software has been used to make posterior inference, but a more suitable algorithm should be used (e.g. particle filter methods). Moreover, covariates were not included in this model, but appropriate features could reduce the variability of the predictions. The resulting predictions were not satisfactory, since the credible bands showed oscillations due to the seasonal components. An improvement of the proposed model should include a theoretical study of the properties of the model to understand this phenomenon.

Bibliography

Arjas, E. and Gasbarra, D. (1994).

Nonparametric Bayesian inference from right censored survival data, using the Gibbs

sampler.

Statistica sinica, pages 505–524.

Banerjee, S., Wall, M. M., and Carlin, B. P. (2003).

Frailty modeling for spatially correlated survival data, with application to infant

mortality in Minnesota.

Biostatistics, 4(1):123–142.

Bas Güre, S., Carello, G., Lanzarone, E., and Yalçındag, S. (2018).

Unaddressed problems and research perspectives in scheduling blood collection from

donors.

Production Planning & Control, 29(1):84–90.

Bosnes, V., Aldrin, M., and Heier, H. E. (2005).

Predicting blood donor arrival.

Transfusion, 45(2):162–170.

Cook, R. J. and Lawless, J. (2007).

The statistical analysis of recurrent events.

Springer Science & Business Media.

Flegel, W., Besenfelder, W., and Wagner, F. (2000).

Predicting a donor’s likelihood of donating within a preselected time interval.

Transfusion Medicine, 10(3):181–192.

Gamerman, D., Abanto-Valle, C. A., Silva, R. S., and Martins, T. G. (2015).

Dynamic Bayesian models for discrete-valued time series.

Handbook of Discrete-Valued Time Series, pages 165–186.

Gelman, A., Hwang, J., and Vehtari, A. (2014).

Understanding predictive information criteria for Bayesian models.

Statistics and computing, 24(6):997–1016.

Gianoli, I. (2016).

Analysis of gap times of recurrent blood donations via Bayesian nonparametric models.

MSc. Thesis, Politecnico di Milano.

Gustafson, P., Aeschliman, D., and Levy, A. R. (2003).

A simple approach to fitting Bayesian survival models.

Lifetime data analysis, 9(1):5–19.

Henderson, R., Shimakura, S., and Gorst, D. (2002).

Modeling spatial variation in leukemia survival data.

Journal of the American Statistical Association, 97(460):965–972.

James, R. and Matthews, D. (1996).

Analysis of blood donor return behaviour using survival regression methods.

Transfusion medicine, 6(1):21–30.

Johnson, W., Branscum, A., Hanson, T. E., and Christensen, R. (2010).

Bayesian ideas and data analysis: an introduction for scientists and statisticians.

CRC Press.

Kalbfleisch, J. D. (1978).

Non-parametric Bayesian analysis of survival time data.

Journal of the Royal Statistical Society: Series B (Methodological), 40(2):214–221.

Li, Y. and Ryan, L. (2002).

Modeling spatial survival data using semiparametric frailty models.

Biometrics, 58(2):287–297.

Ministero Della Salute (2015).

Disposizioni relative ai requisiti di qualità e sicurezza del sangue e degli emocomponenti.

Gazzetta Ufficiale.

Ouyang, B., Sinha, D., Slate, E. H., and Van Bakel, A. B. (2013).

Bayesian analysis of recurrent event with dependent termination: an application to a

heart transplant study.

Statistics in medicine, 32(15):2629–2642.

Ownby, H., Kong, F., Watanabe, K., Tu, Y., Nass, C. C., and the Retrovirus Epidemiology Donor Study (1999).

Analysis of donor return behavior.

Transfusion, 39(10):1128–1135.

Pennell, M. L. and Dunson, D. B. (2006).

Bayesian semiparametric dynamic frailty models for multiple event time data.

Biometrics, 62(4):1044–1052.

Sahu, S. K., Dey, D. K., Aslanidou, H., and Sinha, D. (1997).

A Weibull regression model with gamma frailties for multivariate survival data.

Lifetime data analysis, 3(2):123–137.

Soyer, R., Aktekin, T., and Kim, B. (2015).

Bayesian modeling of time series of counts with business applications.

Handbook of Discrete-Valued Time Series, Davis RA, Holan SH, Lund R, Ravishanker N, pages 245–264.

Stan Development Team and others (2016).

Stan modeling language users guide and reference manual.

Technical report.

Vehtari, A., Gelman, A., and Gabry, J. (2017).

Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

Statistics and Computing, 27(5):1413–1432.

West, M., Harrison, P. J., and Migon, H. S. (1985).

Dynamic generalized linear models and Bayesian forecasting.

Journal of the American Statistical Association, 80(389):73–83.

Yin, G., Ibrahim, J. G., et al. (2006).

Bayesian transformation hazard models.

In Optimality, pages 170–182. Institute of Mathematical Statistics.
