the waiting time paradox and biases in infectious disease observational data

The Waiting Time Paradox and biases in infectious disease observational data Ping Yan Lecture at Summer School on Mathematics of Infectious Diseases Program Centre for Disease Modelling, York University

Upload: cassidy-sweet

Post on 30-Dec-2015


Page 1: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox and biases in infectious disease observational data

Ping Yan

Lecture at Summer School on Mathematics of Infectious Diseases Program

Centre for Disease Modelling, York University

Page 2: The Waiting Time Paradox and biases in infectious disease observational data

Outline


1. Data in infectious disease studies are often observational; they do not follow the hypothesis-driven design of experiments with repetition and randomization.

2. Advanced statistical methods involve statistical modelling. While many infectious disease models focus on the epidemiology aspects (transmission process), statistical models focus on the stochastic mechanism from which data are generated (data generating process).

3. The two types of models need to be well integrated. For statistical estimation purposes, even with statistical models (such as conditioning) well intended to capture the length-bias, important information is still lost without modelling the underlying transmission process.

4. Conversely, without statistical modelling for the data generation process, important input parameters in mathematical models may be severely biased based on naïve (or not-so-naïve) statistical methods.

Page 3: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox (Feller 1966)

Question: what is the expected waiting time $W_t$ to the next bus arrival?

1. Buses arrive at a constant rate $\lambda$; the inter-arrival times $X$ are independently and identically distributed (iid) with mean $\mu = 1/\lambda$.

2. An “inspector” (or a customer) inspects “at random”, so that the inspection time $t$ is uniformly distributed between the last bus and the next bus.

Argument 1: The inspection time $t$ is uniformly distributed between two buses, so by symmetry $E[W] = \frac{1}{2}\mu$.

Argument 2: Because buses arrive at a constant rate, the $X$ are iid exponentially distributed with mean $\mu$.

• The “memoryless” property of the exponential distribution implies that the remaining time to the next bus follows the same exponential distribution, thus $E[W] = \mu = 1/\lambda$.

• If $E[W] = \mu$, shouldn’t $E[X^{(B)}] = 2E[W] = 2\mu$, where $X^{(B)}$ is the interval between the two buses seen by the inspector?

Paradox! Didn’t we assume that the $X$ are independently and identically distributed with mean $\mu$?

Just to show how classic it is in introductory probability textbooks.

Page 4: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox (Feller 1966)

Question: what is the expected waiting time $W_t$ to the next bus arrival?

$E[W] = \frac{1}{2}\mu$ is correct if there is no variation in inter-arrival times, $cv = 0$ (if buses are as punctual as Swiss trains).

$E[W] = \mu$ is correct if the variance of inter-arrival times satisfies $cv = 1$ (TTC seems to be worse than this).

Variation matters: $cv = \sqrt{Var[X]}/\mu$ is the coefficient of variation, with $\mu = E[X]$.

$X^{(B)}$ = duration from the last bus to the next bus seen by the inspector.

The distribution of $X^{(B)}$ is different from that of $X$ (a paradox, since the $X$ are assumed to be iid).

$X^{(B)}$ has the length-biased distribution with p.d.f. $f_B(x) = x f_X(x)/\mu$ and $E[X^{(B)}] = \mu(1 + cv^2)$.

The waiting time $W$ has p.d.f. $f_W(x) = \bar{F}_X(x)/\mu$, where $\bar{F}_X(x) = \Pr\{X > x\}$, and $E[W] = \frac{1}{2}\mu(1 + cv^2)$ by symmetry with respect to $X^{(B)}$.
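The two arguments can be checked by simulation. The sketch below is only an illustration, not part of the lecture: the function names (`inspect_buses`, `punctual`, `exponential`) and the mean inter-arrival time of 10 are assumptions. It drops inspectors uniformly on a long timeline of bus arrivals and records the covering interval $X^{(B)}$ and the waiting time $W$.

```python
import bisect
import itertools
import random


def inspect_buses(draw_gap, n_gaps=100_000, n_inspectors=100_000, seed=1):
    """Return Monte Carlo estimates of (E[X], E[X^(B)], E[W]).

    draw_gap(rng) is a hypothetical callback returning one inter-arrival time X.
    """
    rng = random.Random(seed)
    gaps = [draw_gap(rng) for _ in range(n_gaps)]
    arrivals = list(itertools.accumulate(gaps))  # cumulative bus arrival times
    horizon = arrivals[-1]

    sum_xb = 0.0
    sum_w = 0.0
    for _ in range(n_inspectors):
        t = rng.uniform(0.0, horizon)        # inspector arrives "at random"
        i = bisect.bisect_left(arrivals, t)  # first bus arriving at or after t
        sum_xb += gaps[i]                    # length of the interval covering t
        sum_w += arrivals[i] - t             # waiting time to the next bus
    return sum(gaps) / n_gaps, sum_xb / n_inspectors, sum_w / n_inspectors


def punctual(rng):
    return 10.0                              # assumed mean 10, cv = 0


def exponential(rng):
    return rng.expovariate(1.0 / 10.0)       # assumed mean 10, cv = 1


if __name__ == "__main__":
    for name, draw in [("cv = 0", punctual), ("cv = 1", exponential)]:
        mean_x, mean_xb, mean_w = inspect_buses(draw)
        print(name, round(mean_x, 2), round(mean_xb, 2), round(mean_w, 2))
```

With punctual buses the estimate of $E[W]$ should sit near half the mean interval, and with exponential gaps near the full mean interval, in line with $E[W] = \frac{1}{2}\mu(1 + cv^2)$.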

Page 5: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox and bias in observational data

A different way of looking at the same problem:

1. Occurrence of the initiating event has constant rate (i.e. the time of occurrence is uniformly distributed over any given time interval).

2. The duration $X$ is independent of the random process that generates the initiating event.

3. The durations $X$ are iid with p.d.f. $f_X(x)$ and mean $\mu$.

4. At a snapshot, only those who have experienced the initiating event but not the subsequent event are included in a sample, with observed duration $X^{(B)}$.

A sample containing only observations of $X^{(B)}$ is called a “prevalence cohort”.

The distribution from a prevalence cohort corresponds to the p.d.f. and mean

$f_B(x) = x f_X(x)/\mu$, $E[X^{(B)}] = \mu(1 + cv^2)$,

because those with longer duration have greater chance to be included in data.

Page 6: The Waiting Time Paradox and biases in infectious disease observational data

Observational data arising in a prevalence cohort

Assume the durations $X$ are iid with p.d.f. $f_X(x)$ and mean $\mu$. Under equilibrium: the incidence of the initiating event occurs at constant rate $\lambda$.

1. The observed duration is length-biased

$X^{(B)}$ has p.d.f. $f_B(x) = x f_X(x)/\mu$.

$W$ has p.d.f. $f_W(x) = \bar{F}_X(x)/\mu$, where $\bar{F}_X(x) = \Pr\{X > x\}$.

Naïve estimation for the distribution of X (e.g. incubation time, survival time, etc.) based on such prevalence cohort data leads to over-estimation.

2. Size-biased estimation for the prevalence estimate

prevalence = # or % { individuals who have experienced the initial event but not the subsequent event }

Under equilibrium, prevalence = incidence × duration.

The length-bias in observed duration leads to “size-bias” in sampled prevalence, i.e. the sample is size-biased in favor of cohorts with larger prevalence.
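A prevalence cohort can be imitated with a short simulation. This is a sketch under stated assumptions rather than the lecture's own example: initiating events form a Poisson process, durations are gamma distributed, and the function name `prevalence_cohort` and all numerical values are illustrative.

```python
import random


def prevalence_cohort(rate, mean, cv, snapshot=200.0, seed=2):
    """Durations X of individuals still 'in progress' at the snapshot time."""
    rng = random.Random(seed)
    shape = 1.0 / cv ** 2        # gamma shape giving coefficient of variation cv
    scale = mean * cv ** 2       # gamma scale, so that shape * scale = mean
    cohort = []
    t = rng.expovariate(rate)    # initiating events form a Poisson process
    while t < snapshot:
        x = rng.gammavariate(shape, scale)
        if t + x > snapshot:     # initiating event occurred, subsequent event not yet
            cohort.append(x)
        t += rng.expovariate(rate)
    return cohort


if __name__ == "__main__":
    cohort = prevalence_cohort(rate=1000.0, mean=2.0, cv=0.5)
    print("cohort size:", len(cohort))                    # compare with rate * mean
    print("mean duration:", sum(cohort) / len(cohort))    # compare with mean * (1 + cv**2)
```

The cohort size should be close to incidence × mean duration, while the average duration inside the cohort should be close to $\mu(1 + cv^2)$ rather than $\mu$.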

Page 7: The Waiting Time Paradox and biases in infectious disease observational data

Waiting Time Paradox in disease screening via repeated testing

1. Replace buses with repeated testing: the inter-testing intervals $X$ are iid with mean $\mu$.

2. Replace the “inspector” by sero-conversion, which, under equilibrium, has constant rate, such that within any time interval (between two tests) a sero-conversion may occur and the sero-conversion time is uniformly distributed in the interval.

$X^{(B)}$ = duration from the last (negative) test to the next (positive) test covering a sero-conversion.

$X^{(B)}$ has a length-biased distribution: $E[X^{(B)}] = \mu(1 + cv^2)$.

The average waiting time from sero-conversion to the next (positive) test: $E[W] = \frac{1}{2}\mu(1 + cv^2)$.

If we add an average “window period” $\omega$ from infection to sero-conversion, the prevalence of infected but not yet tested (the queue) follows, under equilibrium conditions, from prevalence = incidence × mean duration:

$p_u = \lambda\left[\omega + \frac{1}{2}\mu(1 + cv^2)\right]$,

where $\lambda$ is the infection incidence.

Keeping $\lambda$ and $\omega$ unchanged, the testing strategy determines $p_u$ through $\mu(1 + cv^2)$.

Page 8: The Waiting Time Paradox and biases in infectious disease observational data

Waiting Time Paradox in disease screening via repeated testing

The prevalence of infected but not yet tested (the queue), under equilibrium conditions:

$p_u = \lambda\left[\omega + \frac{1}{2}\mu(1 + cv^2)\right]$

($\lambda$ = infection incidence, $\omega$ = average window period, $\mu$ and $cv$ = mean and coefficient of variation of the inter-testing intervals).

Objective: Under different scenarios of infection incidence, determine the optimal testing frequency so that the queue of infected but untested individuals is reduced to satisfy a cost-effectiveness criterion.

Generally, the larger the incidence rate $\lambda$, the more cost-effective more frequent testing becomes. Cost-effectiveness is compromised if there is large variation between inter-testing intervals or among individuals.

Each infected but not yet tested individual (in $p_u$) may be associated with a cost $c$ to society.

Each test is associated with a cost κ.

Both costs are determined according to different contexts.
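To see how the testing strategy drives the queue, the sketch below evaluates the queue size for a few testing patterns. It assumes the queue formula reads $p_u = \lambda[\omega + \frac{1}{2}\mu(1 + cv^2)]$, with $\lambda$ the incidence and $\omega$ the mean window period; the function name `untested_queue` and the numbers are hypothetical, and no cost criterion is implemented because the slide leaves it context dependent.

```python
def untested_queue(incidence, window, mean_interval, cv):
    """Equilibrium number infected but not yet tested: incidence x mean duration.

    Assumed form: p_u = incidence * (window + 0.5 * mean_interval * (1 + cv**2)).
    """
    return incidence * (window + 0.5 * mean_interval * (1.0 + cv ** 2))


if __name__ == "__main__":
    # Hypothetical scenario: 100 infections per year, window period 0.1 years.
    for mean_interval in (2.0, 1.0, 0.5):          # years between tests
        for cv in (0.0, 1.0):                      # regular vs. erratic testing
            queue = untested_queue(100.0, 0.1, mean_interval, cv)
            print(mean_interval, cv, round(queue, 1))
```

Under this assumed formula, halving the mean interval shortens the queue, but irregular testing (large $cv$) can cancel much of the gain.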

Page 9: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox as seen in R0 formulation

$N$ = number of infections produced by a typical infectious individual when seeded into an infinitely large susceptible population.

An infected individual produces new infections according to a counting process with intensity $k(x)$.

$R_0$ = mean value of $N$, which can be expressed as $R_0 = \int_0^\infty k(x)\,dx$.

The premises: if $R_0 > 1$, there exists $\rho > 0$ such that $\int_0^\infty e^{-\rho x} k(x)\,dx = 1$; $\rho$ is the Malthusian number describing the early exponential growth.

Re-write: $\int_0^\infty e^{-\rho x}\,\frac{k(x)}{R_0}\,dx = \int_0^\infty e^{-\rho x} g(x)\,dx = \frac{1}{R_0}$, then

$R_0 = \frac{1}{L_g(\rho)}$,

where $L_g(\rho)$ = Laplace transform of $g(x) = k(x)/R_0$, satisfying $\int_0^\infty g(x)\,dx = 1$.

(Ref: Wallinga and Lipsitch, 2007; Heesterbeek and Roberts, 2007)

Postulate: g(x) is p.d.f. of a well defined random variable with epidemiologic meaning.
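The relation between $R_0$, the Malthusian number and the Laplace transform of $g$ can be checked numerically. The sketch below is an illustration under assumptions: the intensity $k(x) = \beta e^{-\gamma x}$ with $\beta = 0.75$, $\gamma = 0.5$ is a hypothetical choice, the growth rate is written `rho`, and the helper names (`laplace`, `malthusian`) are not from the lecture.

```python
import math


def laplace(fn, s, upper=100.0, steps=10_000):
    """Trapezoidal approximation of the integral of exp(-s x) fn(x) over [0, upper]."""
    h = upper / steps
    total = 0.5 * (fn(0.0) + math.exp(-s * upper) * fn(upper))
    for i in range(1, steps):
        x = i * h
        total += math.exp(-s * x) * fn(x)
    return total * h


def malthusian(k, lo=0.0, hi=10.0, tol=1e-8):
    """Bisection for rho > 0 solving integral exp(-rho x) k(x) dx = 1 (needs R0 > 1)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if laplace(k, mid) > 1.0:   # transform decreases in rho: root lies above mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def k(x, beta=0.75, gamma=0.5):
    """Hypothetical intensity: rate beta while infectious, exponential duration."""
    return beta * math.exp(-gamma * x)


if __name__ == "__main__":
    r0 = laplace(k, 0.0)                      # R0 = integral of k(x)
    rho = malthusian(k)
    l_g = laplace(k, rho) / r0                # Laplace transform of g = k / R0 at rho
    print(round(r0, 4), round(rho, 4), round(1.0 / l_g, 4))
```

For this assumed $k(x)$ the exact values are $R_0 = \beta/\gamma = 1.5$ and $\rho = \beta - \gamma = 0.25$, which the numerical results should approximate.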

Page 10: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox as seen in R0 formulation

$\int_0^\infty e^{-\rho x}\,\frac{k(x)}{R_0}\,dx = \int_0^\infty e^{-\rho x} g(x)\,dx = \frac{1}{R_0}$, so $R_0 = 1/L_g(\rho)$, where $L_g(\rho)$ = Laplace transform of $g(x) = k(x)/R_0$, satisfying $\int_0^\infty g(x)\,dx = 1$.

Postulate: g(x) is p.d.f. of a well defined random variable with epidemiologic meaning.

If true, the tasks:

1. assess the meaning of this random variable;
2. assess whether it is observable;
3. if observable, collect data and estimate g(x);
4. estimate $\rho$ separately (usually via curve fitting);
5. evaluate the Laplace transform $L_g(\rho)$, analytically or numerically.

In the above, there is no assumption about the integrand $k(x)$ (the instantaneous rate at time $x$), i.e. about the model.

However, in order to assess 1., we put $k(x)$ into a structured model framework: the SEIR.

Page 11: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox as seen in R0 formulation

$R_0 = 1/L_g(\rho)$, where $L_g(\rho)$ = Laplace transform of $g(x) = k(x)/R_0$, satisfying $\int_0^\infty g(x)\,dx = 1$.

Postulate: g(x) is p.d.f. of a well defined random variable with epidemiologic meaning.

Assuming R0 > 1:

1. In the SIR model (S → I → R), with exponentially distributed infectious period:

$R_0 = 1 + \rho\mu_I$

$L_g(\rho) = (1 + \rho\mu_I)^{-1}$ is the Laplace transform of the exponential distribution with mean $\mu_I$.

In this case, g(x) is the p.d.f. of the infectious period.

2. In the SEIR model (S → E → I → R), with both the latent period and the infectious period being exponentially distributed:

$R_0 = (1 + \rho\mu_E)(1 + \rho\mu_I)$

$L_g(\rho) = (1 + \rho\mu_E)^{-1}(1 + \rho\mu_I)^{-1}$ = product of two Laplace transforms of exponential distributions.

In this case, g(x) is the p.d.f. of the sum of the latent period and the infectious period.

Anderson and May (1991) : generation time = latent period + infectious period.

Page 12: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox as seen in R0 formulation

$R_0 = 1/L_g(\rho)$, where $L_g(\rho)$ = Laplace transform of $g(x) = k(x)/R_0$, satisfying $\int_0^\infty g(x)\,dx = 1$.

Postulate: g(x) is p.d.f. of a well defined random variable with epidemiologic meaning.

In the SEIR model, if the latent period $T_E$ and the infectious period $T_I$ are arbitrarily distributed (with specific distributions), with

$L_E(s)$ = Laplace transform for the latent period $T_E$,

$L_I(s)$ = Laplace transform for the infectious period $T_I$,

then

$R_0 = \frac{\rho\mu_I}{L_E(\rho)\,[1 - L_I(\rho)]}$ (Yan, 2007)

including:

• $R_0 = 1 + \rho\mu_I$ (no latent period, exponentially distributed infectious period)

• $R_0 = (1 + \rho\mu_E)(1 + \rho\mu_I)$ (exponentially distributed latent and infectious periods)

• $R_0 = \frac{\rho\mu_I\,(1 + \rho\mu_E cv_E^2)^{1/cv_E^2}}{1 - (1 + \rho\mu_I cv_I^2)^{-1/cv_I^2}}$ (gamma distributed latent and infectious periods, Anderson and Watson, 1980)

where $\mu_E, \mu_I$ are the mean values of the latent and infectious periods and $cv_E, cv_I$ are coefficient of variation parameters.
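The special cases can be reproduced from the general expression. The sketch below assumes the general formula reads $R_0 = \rho\mu_I / \{L_E(\rho)[1 - L_I(\rho)]\}$ and uses gamma distributed periods parameterised by mean and coefficient of variation; the function names and the numerical inputs are illustrative, not taken from the lecture.

```python
import math


def gamma_laplace(s, mean, cv):
    """Laplace transform of a gamma distribution with given mean and cv.

    mean = 0 stands for 'no such period'; cv = 0 stands for a fixed duration.
    """
    if mean == 0.0:
        return 1.0
    if cv == 0.0:
        return math.exp(-s * mean)
    return (1.0 + s * mean * cv ** 2) ** (-1.0 / cv ** 2)


def r0_from_growth(rho, mean_e, cv_e, mean_i, cv_i):
    """R0 implied by growth rate rho under the assumed general formula."""
    l_e = gamma_laplace(rho, mean_e, cv_e)
    l_i = gamma_laplace(rho, mean_i, cv_i)
    return rho * mean_i / (l_e * (1.0 - l_i))


if __name__ == "__main__":
    rho = 0.25                                        # hypothetical growth rate
    print(r0_from_growth(rho, 0.0, 1.0, 2.0, 1.0))    # SIR, exponential: 1 + rho*mu_I
    print(r0_from_growth(rho, 1.0, 1.0, 2.0, 1.0))    # SEIR, exponential periods
    print(r0_from_growth(rho, 1.0, 0.5, 2.0, 0.5))    # less variable gamma periods
```

The same observed growth rate maps to different values of $R_0$ depending on the assumed means and variability of the latent and infectious periods, which is why these distributions need to be estimated without bias.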

Page 13: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox as seen in R0 formulation

$R_0 = 1/L_g(\rho)$, where $L_g(\rho)$ = Laplace transform of $g(x) = k(x)/R_0$, satisfying $\int_0^\infty g(x)\,dx = 1$.

Postulate: g(x) is p.d.f. of a well defined random variable with epidemiologic meaning.

In the SEIR model, if the latent period and the infectious period are arbitrarily distributed:

$\frac{1}{R_0} = \frac{L_E(\rho)\,[1 - L_I(\rho)]}{\rho\mu_I} = \int_0^\infty e^{-\rho x} f_E(x)\,dx \int_0^\infty e^{-\rho x}\,\frac{\bar{F}_I(x)}{\mu_I}\,dx$

$L_g(\rho) = \int_0^\infty e^{-\rho x} f_E(x)\,dx \int_0^\infty e^{-\rho x}\,\frac{\bar{F}_I(x)}{\mu_I}\,dx$

Here $f_E(x)$ is the p.d.f. of the latent period $T_E$, and $\bar{F}_I(x)/\mu_I$ is the p.d.f. of $W$ in the length-biased infectious period, from a snapshot point of view.

g(x) is the p.d.f. of $T_E + W$, with mean value $\mu_E + \frac{1}{2}\mu_I(1 + cv^2)$.

Call it the generation time:

• if $cv = 1$, consistent with that by Anderson and May (1991);

• if $cv = 0$, consistent with that by Gani and Daly (2001): mean latent period + half of the mean infectious period.

Fine (2003): the latent period + part of the infectious period ….

• not exactly, need to emphasize the length-biased infectious period;

• could be even longer than the “natural” infectious period if $cv > 1$.
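The link between $[1 - L_I(\rho)]/(\rho\mu_I)$ and the waiting-time density $\bar{F}_I(x)/\mu_I$ is an integration by parts. The short derivation below is a sketch, assuming $\bar{F}_I(x) = \Pr\{T_I > x\}$ with density $f_I$, mean $\mu_I$ and coefficient of variation $cv$ for the infectious period $T_I$.

```latex
\begin{align*}
\int_0^\infty e^{-\rho x}\,\bar{F}_I(x)\,dx
  &= \Big[-\tfrac{1}{\rho}\,e^{-\rho x}\,\bar{F}_I(x)\Big]_0^\infty
     - \tfrac{1}{\rho}\int_0^\infty e^{-\rho x} f_I(x)\,dx
   = \frac{1 - L_I(\rho)}{\rho}, \\[4pt]
\int_0^\infty e^{-\rho x}\,\frac{\bar{F}_I(x)}{\mu_I}\,dx
  &= \frac{1 - L_I(\rho)}{\rho\,\mu_I}, \\[4pt]
E[W] = \int_0^\infty x\,\frac{\bar{F}_I(x)}{\mu_I}\,dx
  &= \frac{E[T_I^2]}{2\mu_I}
   = \tfrac{1}{2}\,\mu_I\,(1 + cv^2).
\end{align*}
```

The last line shows why $E[W]$ exceeds $\mu_I$ exactly when $cv > 1$.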

Page 14: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox as seen in R0 formulation

$R_0 = 1/L_g(\rho)$, where $L_g(\rho)$ = Laplace transform of $g(x) = k(x)/R_0$, satisfying $\int_0^\infty g(x)\,dx = 1$.

Postulate: g(x) is p.d.f. of a well defined random variable with epidemiologic meaning.

If true, the tasks:

1. assess the meaning of this random variable;
2. assess whether it is observable;
3. if observable, collect data and estimate g(x);
4. estimate $\rho$ separately (usually via curve fitting);
5. evaluate the Laplace transform $L_g(\rho)$, analytically or numerically.

For 1. above, g(x) is the p.d.f. of the generation time, defined as the latent period plus part of the length-biased infectious period, with mean value $\mu_E + \frac{1}{2}\mu_I(1 + cv^2)$.

In the above definition, the generation time does not involve:

• another individual
• the transmission process

The “snapshot” may be thought of as the time of infection of an infectee in relation to the infectious period of its infector; whereas in theory, it could be a snapshot by any “inspector”.

Page 15: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox as seen in R0 formulation

From Wallinga and Lipsitch (2007): $R_0 = 1/L_g(\rho)$, where $L_g(\rho)$ = Laplace transform of $g(x) = k(x)/R_0$, satisfying $\int_0^\infty g(x)\,dx = 1$.

g(x) is the p.d.f. of the generation time, defined as the latent period plus part of the length-biased infectious period, with mean value $\mu_E + \frac{1}{2}\mu_I(1 + cv^2)$.

Svensson (2007) made the distinction:

i. from the infection time of the infector looking forward to the infection time of the infectee;

ii. from the infection time of the infectee looking back to the infection time of the infector.

It seems that, if we assign the “snapshot” as the time of infection of an infectee in relation to the infectious period of its infector, then the generation interval in Wallinga and Lipsitch (2007) should be understood in the sense of (ii) in Svensson (2007).

But … there are strings attached …

Page 16: The Waiting Time Paradox and biases in infectious disease observational data

The Waiting Time Paradox as seen in R0 formulation

Svensson (2007):

ii. from the infection time of the infectee looking back to the infection time of the infector.

If $T_E + W$ is associated with the generation interval in Wallinga and Lipsitch (2007) and understood in the sense of (ii) in Svensson (2007), there are hidden assumptions.

• The infection times of infectees must be exchangeable, so that any randomly chosen infectee (if more than one), while looking back, gives the same distribution for $T_E + W$.

• The infection times of infectees must be uniformly distributed in the (length-biased) infectious period, so that W has p.d.f. $\bar{F}_I(x)/\mu_I$ with mean $\frac{1}{2}\mu_I(1 + cv^2)$, i.e. symmetry.

• The system is at equilibrium, so that infectors arrive at constant rate.

• The infectious period contains an infectee, hence it is length-biased, with mean $\mu_I(1 + cv^2)$.

Things that I don’t understand: in $R_0 = 1/L_g(\rho)$, $L_g(\rho)$ = Laplace transform of the p.d.f. of $T_E + W$, with mean $\mu_E + \frac{1}{2}\mu_I(1 + cv^2)$:

1. this is a valid interpretation at equilibrium;
2. $\rho$ is the Malthusian number, far from equilibrium.

This puzzle further leads to the observation problem: can we collect data at the early phase of an outbreak and use the above theory?

Page 17: The Waiting Time Paradox and biases in infectious disease observational data

A generalization of the Waiting Time Paradox: left-truncation

Moving away from $R_0 = 1/L_g(\rho)$, for general observation bias without being in equilibrium.

The same issue: the observed $X^{(B)}$ is length-biased.

• The initial event occurs over time $t$ following a random process with intensity $\lambda(t)$.
• Individuals who have experienced the initial event are enrolled at time $t + E$.
• Individuals are followed until an endpoint event, taking place at time $t + X^{(B)}$.

Previously (the Waiting Time Paradox),

• assumed equilibrium: $\lambda(t) \equiv \lambda$;

• called “enrolment” a “snapshot”, assumed uniformly distributed in any fixed time interval;

• the time from initial event to enrolment, $E$, follows the distribution with p.d.f. $\bar{F}_X(e)/\mu$;

• the time from initial event to the observed endpoint, $X^{(B)}$, follows p.d.f. $x f_X(x)/\mu$.

Generalization

• $\lambda(t)$ is not constant;

• enrolment is random, independent of the random process of the initiating event;

• this observation scheme is subject to left-truncation.

Page 18: The Waiting Time Paradox and biases in infectious disease observational data

A generalization of the Waiting Time Paradox: left-truncation

The observed $X^{(B)}$ is length-biased: in favor of longer durations.

The objective: estimating the distribution of the duration X between the two events.

Naïve analyses: treating $X^{(B)}$ as if it were $X$ from designed experiments leads to over-estimation.

Not-so-naïve method through conditioning:

$X^{(B)}$ arises from the conditional distribution of $X$ given $X > E$, because the eligibility for enrolment is not having experienced the endpoint event at time $t + E$.

Statistical methods are based on the conditional distribution $f_X(x)/\bar{F}_X(e)$ rather than $f_X(x)$, where $\bar{F}_X(e) = \Pr\{X > e\}$.

Such a method provides a length-bias adjusted estimation, but is only able to estimate part of the distribution. Some information is lost in the data, unless $\lambda(t)$ is explicitly modelled.

Call for joint modelling: a transmission model for how the epidemiology generates data and a statistical model for how data are observed.
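The difference between the naïve and the conditional analysis under left-truncation can be illustrated with a toy simulation. Everything below is an assumption made for illustration: durations are exponential with mean 5, enrolment delays are uniform on (0, 10), and the function names are hypothetical. For an exponential model the conditional likelihood $f_X(x)/\bar{F}_X(e)$ gives the estimator mean$(x - e)$; this parametric shortcut is what allows the whole distribution to be recovered here.

```python
import random


def left_truncated_sample(mean=5.0, max_entry=10.0, n=200_000, seed=3):
    """Pairs (x, e): duration x and enrolment delay e, kept only if x > e."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.expovariate(1.0 / mean)      # true duration X
        e = rng.uniform(0.0, max_entry)      # time from initial event to enrolment
        if x > e:                            # eligible: endpoint not yet experienced
            sample.append((x, e))
    return sample


def naive_mean(sample):
    """Treats the observed durations as if they were an iid sample of X."""
    return sum(x for x, _ in sample) / len(sample)


def conditional_mean(sample):
    """MLE of the mean under f(x) / Pr{X > e} for an exponential model."""
    return sum(x - e for x, e in sample) / len(sample)


if __name__ == "__main__":
    data = left_truncated_sample()
    print("naive:", round(naive_mean(data), 2))
    print("conditional:", round(conditional_mean(data), 2))
```

The naïve mean overshoots the true value of 5, while the conditional estimate should be close to it.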

Page 19: The Waiting Time Paradox and biases in infectious disease observational data

Right-truncation: length-bias in favour of observing short durations

Previously, left-truncation, in favour of observing long durations

Very common in surveillance: the inclusion criterion is the occurrence of the subsequent event prior to the time of data analysis.

Example: Initiating event = diagnosis of a disease
Subsequent event = the disease is reported and entered into a registry

Objective: assessment of the reporting delay X.

Bias: the case has to be reported before the time of analysis, so data with short delays are systematically over-represented.

Page 20: The Waiting Time Paradox and biases in infectious disease observational data

Right-truncation: length-bias in favour of observing short durations

Example: Initiating event = diagnosis of a disease
Subsequent event = the disease is reported and entered into a registry

Objective: assessment of the reporting delay X.

Bias: systematically observing data with short delay.

[Figure: Annual AIDS incidence in Canada as seen in 1992 and 1999. Bars: cases as reported by Dec. 31, 1992 and as reported by Dec. 31, 1999. Lines: reporting delay adjusted trend; the trend based on 1992 data was presented in April 1993 along with the AIDS surveillance report. Vertical axis: 0 to 2500 cases.]

The gap between reported (bars) and projected (lines) trends implied a long delay between diagnosis and data entry (into the national registry).

Reporting delay is a very important issue in all disease surveillance.

Page 21: The Waiting Time Paradox and biases in infectious disease observational data

Adjustment of reporting delay

$N(t)$ = # cases diagnosed at time t (to be estimated)

$N(t; C)$ = # cases diagnosed at time t and reported by time C (as a proportion of $N(t)$)

All we need to do is to estimate this proportion, which is $F_X(C - t) = \Pr\{X \le C - t\}$. Then:

$N(t) = \frac{N(t; C)}{F_X(C - t)}$

Naïve analysis always leads to severe under-estimation of reporting delay.

Naïve analysis:
• median reporting delay 1.6 months
• 95% completeness within 14 months

Not-so-naïve analysis:
• median delay approx. 9 months
• 85% completeness within 5 years.

Adequately accounting for right-truncation and other (administrative) processes, useful tools can be developed to reflect real-time trends and be built into surveillance.
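The adjustment $N(t) = N(t; C)/F_X(C - t)$ is a one-line computation once a delay distribution is available. The sketch below is illustrative only: the exponential delay distribution, its 9-month median, the function names and the counts are all hypothetical placeholders, not the Canadian surveillance data or the method actually used.

```python
import math


def delay_cdf(d, median=9.0):
    """Hypothetical reporting-delay c.d.f. F_X(d): exponential, d in months."""
    if d <= 0.0:
        return 0.0
    return 1.0 - math.exp(-math.log(2.0) * d / median)


def adjust_for_delay(reported, diagnosis_time, cutoff, cdf=delay_cdf):
    """Estimate N(t) = N(t; C) / F_X(C - t)."""
    proportion = cdf(cutoff - diagnosis_time)
    if proportion <= 0.0:
        raise ValueError("no case diagnosed at this time can be reported yet")
    return reported / proportion


if __name__ == "__main__":
    cutoff = 24.0                                   # analysis at month 24
    toy_counts = {6.0: 80, 12.0: 70, 18.0: 45, 21.0: 20}   # made-up N(t; C)
    for t, n_reported in toy_counts.items():
        print(t, n_reported, round(adjust_for_delay(n_reported, t, cutoff), 1))
```

The most recent diagnosis periods receive the largest upward correction, which is why an unadjusted trend tends to look as if incidence is falling.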

Page 22: The Waiting Time Paradox and biases in infectious disease observational data

Other examples of reporting delay in disease surveillance: did we learn the lesson?

SARS outbreak in Toronto, 2003
Premature declaration that SARS was over

Recall the strong protest against WHO’s travel advisory on Apr. 23?

As it turned out:

H1N1 during the spring of 2009
May 14: Is the worst over?

As it turned out:

Page 23: The Waiting Time Paradox and biases in infectious disease observational data

Right-truncation: length-bias in favour of observing short durations

Another example:
Initiating event = HIV infection via transfusion
Subsequent event = onset of AIDS illnesses
Objective: estimate the incubation period X.

Data: transfusion-associated AIDS cases, assembled by the U.S. CDC, with transfusion as the only known risk factor and retrospectively ascertained date of infection / transfusion.

• naïve estimation: as if data were from a random experiment, iid from $F_X(x)$

• not-so-naïve: right-truncation, data from the conditional distribution $F_X(x)/F_X(C - t)$

[Figure: naïve and not-so-naïve estimates based on the above data, at C = June 30, 1988, with a reference level at 0.5; annotation: uncertainty subject to a constant proportionality.]

Brookmeyer and Gail (1994):

• naïve estimation potentially under-estimates the median by 50%, compared with the not-so-naïve analysis (by conditioning)

Kalbfleisch and Lawless (1989):

• with analysis by conditioning, the larger the C, the longer are the estimated mean and median

• without knowing the AIDS incidence, there is a loss of information in the data, so that one can only estimate the early part of the incubation period distribution, up to a constant of proportionality.

Page 24: The Waiting Time Paradox and biases in infectious disease observational data

Right-truncation: length-bias in favour of observing short durations


Lui, et al. (1986): C = April 30, 1985
• conditioning: mean 4.5 years
• naïve analysis: mean 2.6 years

Brookmeyer and Gail (1994):
• naïve estimation potentially under-estimates the median by 50%, compared with the not-so-naïve analysis (by conditioning)

Lagakos, et al. (1988): C = June 30, 1986
• conditioning: median 8.5 years

Kalbfleisch and Lawless (1989):
• with analysis by conditioning, the larger the C, the longer are the estimated mean and median

By the 1990s, when large-scale multi-center cohort data became available, it turned out that the median incubation period is approximately 10 years.

Page 25: The Waiting Time Paradox and biases in infectious disease observational data

Right-truncation: length-bias in favour of observing short durations

The above statements are very important for observational data on an emerging disease: for retrospectively ascertained durations (incubation period, serial interval, etc.), later analyses suggest longer distributions than earlier analyses.

The underlying disease trend matters.

Kalbfleisch and Lawless (1989):

• with analysis by conditioning, the larger the C, the longer are the estimated mean and median

• without knowing the incidence of the initiating event, there is a loss of information, so that one can only estimate the early part of the duration distribution, up to a constant of proportionality.

Call for jointly modelling the disease process (e.g. transmission model) and the data generation process.
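Right-truncation can be illustrated in the same spirit as left-truncation. The sketch below is a toy example under stated assumptions, not a re-analysis of the AIDS data: durations are exponential with mean 3, initiating events are uniform on (0, C) with C = 10, and the function names are hypothetical. The conditional estimate maximises the likelihood built from $f_X(x)/F_X(C - t)$ with a golden-section search over the exponential rate.

```python
import math
import random


def right_truncated_sample(mean=3.0, cutoff=10.0, n=20_000, seed=4):
    """Pairs (x, window) observable by the cutoff C, where window = C - t."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        t = rng.uniform(0.0, cutoff)          # time of the initiating event
        x = rng.expovariate(1.0 / mean)       # true duration X
        if t + x <= cutoff:                   # subsequent event seen before C
            sample.append((x, cutoff - t))
    return sample


def naive_mean(sample):
    return sum(x for x, _ in sample) / len(sample)


def conditional_mean(sample, lo=1e-3, hi=10.0, iters=40):
    """MLE of the mean from the density f(x) / F(C - t), exponential model."""
    def nll(theta):                           # negative log-likelihood in the rate
        return -sum(math.log(theta) - theta * x
                    - math.log(-math.expm1(-theta * w)) for x, w in sample)

    g = (math.sqrt(5.0) - 1.0) / 2.0          # golden-section ratio
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = nll(c), nll(d)
    for _ in range(iters):
        if fc < fd:                           # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = nll(c)
        else:                                 # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = nll(d)
    return 1.0 / (0.5 * (a + b))


if __name__ == "__main__":
    data = right_truncated_sample()
    print("naive:", round(naive_mean(data), 2))
    print("conditional:", round(conditional_mean(data), 2))
```

In this toy setting the naïve mean falls well below 3, while conditioning recovers a value near 3; with real data and no parametric assumption, only the early part of the distribution is identifiable, as stated above.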

Page 26: The Waiting Time Paradox and biases in infectious disease observational data

A general topic related to statistics issues and disease models

Example: stochastic versus deterministic models for $I_d(t)$

[Figure: two panels comparing stochastic paths with $I_d(t)$, one for n = 1000, R0 = 1.5 and one for n = 10000, R0 = 1.5; a third panel parameter equal to 1 is not legible.]

Every deterministic compartment model, such as SIR, has a stochastic counterpart.

Assume $R_0 > 1$. In these graphs, R0 = 1.5; n = population size.

Deterministic ↔ what must happen:

• a bell-shaped $I_d(t)$ determined by mathematical law.

Stochastic ↔ what might happen:

• even if R0 > 1, there is a positive probability (1/3 in the above cases) that very few transmissions occur, followed by extinction;

• otherwise, after “simmering” for a short random period of time, it takes off; if $n \to \infty$, the path is bell-shaped and resembles $I_d(t)$, but the origin is random.

Page 27: The Waiting Time Paradox and biases in infectious disease observational data

A general topic related to statistics issues and disease models

Statistical challenges in estimating parameters in transmission models

a. Models are built on unobservable events (e.g. time at infection, the passing of infection from one individual to another, duration of latency (not infectious), duration of infectiousness, duration of immunity, etc.)

• Data are based on observable events (e.g. clinical onset of illness, stages of illness, duration of illness, death, physical recovery, etc.)

b. Observational data are subject to length-bias, size-bias, missing values, etc.

c. Some seemingly “large” data (in terms of large population) arise from a single (or a few) realization(s) of a random phenomenon (i.e. an outbreak) … extremely small “sample size”.

d. Large number of parameters.

[Figure: the two $I_d(t)$ panels for n = 1000 and n = 10000, R0 = 1.5, repeated from the previous slide.]

Page 28: The Waiting Time Paradox and biases in infectious disease observational data

Summary

1. Data in infectious disease studies are often observational; they do not follow the hypothesis-driven design of experiments with repetition and randomization.

Data in most introductory statistics textbooks come from the repetition of random experiments by design. Naïve adaptation of these models and methods may lead to severe bias.

2. Advanced statistical methods involve statistical modelling. While many infectious disease models focus on the epidemiology aspects (transmission process), statistical models focus on the stochastic mechanism from which data are generated (data generating process).

Although the transmission process is part of the data generation mechanism, the observer sees data through additional filters, such as data management and administrative processes.

3. The two types of models need to be well integrated. For statistical estimation purposes, even with statistical models (such as conditioning) well intended to capture the length-bias, important information is still lost without modelling the underlying transmission process.

The gap is identified. Still lots of work needs to be done.

4. Conversely, without statistical modelling for the data generation process, important input parameters in mathematical models may be severely biased based on naïve (or not-so-naïve) statistical methods.

Ditto.