ICPSR 2011 - Bonus Content - Modeling with Data

Modeling with Data. Intro to Computing for Complex Systems (Session XVI). Jon Zelner, University of Michigan. 8/11/2010

Upload: daniel-martin-katz

Post on 08-Jul-2015

TRANSCRIPT

Page 1: ICPSR 2011 - Bonus Content - Modeling with Data

Modeling with Data

INTRO TO COMPUTING FOR COMPLEX SYSTEMS (Session XVI)

Jon Zelner
University of Michigan

8/11/2010

Page 2: ICPSR 2011 - Bonus Content - Modeling with Data

Data in the Modeling Process

[Diagram labels: Model (Agents, Environment); Simulated Behavior; Observed Behavior]

Page 3: ICPSR 2011 - Bonus Content - Modeling with Data

Pattern Oriented Modeling (POM)

Term coined by Grimm et al. in a 2005 Science paper.

The modeling process should be guided by patterns of interest.

Can use patterns at multiple levels:
  Individual agents
  Environment
  Aggregate agent behavior

Patterns should be used both to guide model development and to calibrate and validate models.

Page 4: ICPSR 2011 - Bonus Content - Modeling with Data

Types of data for modelers

Counts/Proportions:
  Infections
  Occupied patches

Distributions:
  Age
  Lifespan
  Duration of infection

Rates:
  Birthrates
  Transmission rate

Time Series:
  Evolution of outbreak in time
  Timeline of conflict
  Number of firms over time

Qualitative:
  ‘Norovirus-like’ outbreaks
  Size and shape of forest patches
  Clusters of settlements

Page 5: ICPSR 2011 - Bonus Content - Modeling with Data

Pattern Oriented Modeling

Page 6: ICPSR 2011 - Bonus Content - Modeling with Data

Pattern Oriented Modeling: Kayenta Anasazi (Axtell et al. 2001)

Trying to understand population growth and collapse among the Kayenta Anasazi in the U.S. Southwest.

Many factors in this:
  Weather
  Farming
  Kinship

Optimize models by explaining multiple patterns at one time.

Page 7: ICPSR 2011 - Bonus Content - Modeling with Data

Anasazi (cont’d)

Page 8: ICPSR 2011 - Bonus Content - Modeling with Data

Inference for POM

Bayesian/Qualitative:
  Use some kind of quality function to score the goodness of runs, and optimize by minimizing the distance between model output and optimum quality and/or data, e.g.:
    Number of occupied patches
    Size of elephant herds

Frequentist/Likelihood-based:
  Define a likelihood function for Data | Model.
  Simulate runs from the model and evaluate the likelihood of the data as (# runs == Data) / # runs.
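The simulation-based likelihood described above, (# runs == Data) / # runs, can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: the toy model `simulate_count`, the parameter `p_infect`, and the observation of 3 infections among 10 people are all hypothetical stand-ins.

```python
import random

def simulate_count(p_infect, n_people, rng):
    """Hypothetical toy model: each person is infected independently."""
    return sum(rng.random() < p_infect for _ in range(n_people))

def simulated_likelihood(observed, p_infect, n_people, n_runs=20000, seed=1):
    """Estimate P(Data | Model) as the fraction of runs that reproduce the data."""
    rng = random.Random(seed)
    hits = sum(simulate_count(p_infect, n_people, rng) == observed
               for _ in range(n_runs))
    return hits / n_runs

# Score three candidate parameter values against the (made-up) observation.
for p in (0.1, 0.3, 0.5):
    print(p, simulated_likelihood(observed=3, p_infect=p, n_people=10))
```

Exact matching only works when the data are discrete and low-dimensional; for richer outputs, a distance-based quality function like the one in the first bullet is the usual fallback.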

Page 9: ICPSR 2011 - Bonus Content - Modeling with Data

How Infections Propagate After Point-Source Events

An Analysis of Secondary Norovirus Transmission

Jon Zelner (a, b), Aaron A. King (a, c), Christine Moe (e) & Joseph N.S. Eisenberg (a, d)

University of Michigan: (a) Center for the Study of Complex Systems, (b) Sociology & Public Policy, (c) Ecology & Evolutionary Biology, (d) School of Public Health

Emory University: (e) Rollins School of Public Health

Page 10: ICPSR 2011 - Bonus Content - Modeling with Data

Norovirus (NoV) Epidemiology

Most common cause of non-bacterial gastroenteritis in the US and worldwide.
  Est. 90 million cases in 2007
  Explosive diarrhea & projectile vomiting in symptomatic cases.

Single-stranded, non-enveloped RNA virus
  Member of family Caliciviridae

Often transmitted via food:
  Salad greens
  Shellfish

Most person-to-person transmission is via the environment and fomites.

Page 11: ICPSR 2011 - Bonus Content - Modeling with Data

Why model transmission after point-source events?

Typical analysis of point-source events focuses on primary, one-to-many risk:
  How many cases are created by an infectious food handler?
  How many people are infected after a water treatment failure?

However, the actual size of point-source events is underestimated without including secondary transmission risk.

Within-household transmission is an important bridge between point-source events.
  So, even if within-household R0 < 1, household cases have important dynamic consequences at the community level.

[Diagram labels: H, S, IS, IA]

Page 12: ICPSR 2011 - Bonus Content - Modeling with Data

NoV Transmission Dynamics

Norovirus transmission dynamics tend to be locally unstable but globally persistent.
  E.g., small, explosive outbreaks in Mercer County, but no local NoV epidemic
  Multiple reported NoV outbreaks throughout New Jersey every week.

Stochasticity operates at multiple levels.
  Disease/Contact

Page 13: ICPSR 2011 - Bonus Content - Modeling with Data

NoV Transmission Dynamics

[Figure panels: Exponential Growth, Global Invasion (e.g., Pandemic Flu); Short, Explosive & Limited (Typical of NoV outbreaks)]

Page 14: ICPSR 2011 - Bonus Content - Modeling with Data

Outbreak Data

Gotz et al. (2001) observed 500+ households exposed to NoV after a point-source outbreak in a network of daycare centers in Stockholm, Sweden.
  Traceable to salad prepared by a food handler who was shedding post-symptoms.

Followed 153 of these households, eliminating those with only one person.
  49 had secondary cases
  104 had no secondary cases

Page 15: ICPSR 2011 - Bonus Content - Modeling with Data

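Deterministic SEIR model

Infinite population

Mass-action mixing

Frequency-dependent transmission

When I > 0, a fraction of the susceptible population is infected at every instant
Constant average rate of recovery
Doesn’t matter who is infected
‘Nano-fox’ problem

$$\frac{dS}{dt} = -\beta S I$$

$$\frac{dE}{dt} = \beta S I - \sigma E$$

$$\frac{dI}{dt} = \sigma E - \gamma I$$

$$\frac{dR}{dt} = \gamma I$$

(β is the transmission rate and γ the recovery rate; σ is used here for the rate of leaving the exposed state.)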
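To see what the deterministic SEIR equations imply, they can be integrated numerically. The sketch below uses a plain forward-Euler step with S, E, I, R as population fractions. The name `sigma` for the exposed-to-infectious rate, the step size, and the parameter values in the usage line are assumptions made for illustration only.

```python
def seir_euler(beta, sigma, gamma, s0, e0, i0, r0, dt=0.01, t_end=30.0):
    """Forward-Euler integration of dS/dt = -beta*S*I, dE/dt = beta*S*I - sigma*E,
    dI/dt = sigma*E - gamma*I, dR/dt = gamma*I. Returns a list of (t, S, E, I, R)."""
    s, e, i, r = s0, e0, i0, r0
    t = 0.0
    path = [(t, s, e, i, r)]
    for _ in range(int(round(t_end / dt))):
        new_exposed = beta * s * i * dt      # S -> E flow during this step
        new_infectious = sigma * e * dt      # E -> I flow
        new_recovered = gamma * i * dt       # I -> R flow
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        t += dt
        path.append((t, s, e, i, r))
    return path

# Illustrative values only: 1% initially infectious, 1.5-day mean E and I periods.
path = seir_euler(beta=2.0, sigma=1 / 1.5, gamma=1 / 1.5,
                  s0=0.99, e0=0.0, i0=0.01, r0=0.0)
print(path[-1])
```

Because S changes by a fraction at every instant whenever I > 0, the output is a smooth curve even when the "population" would contain less than one infected individual, which is the problem the slide alludes to.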

Page 16: ICPSR 2011 - Bonus Content - Modeling with Data

Why use a stochastic model?

Deterministic models work well when assumptions are plausible, but are less useful when:
  Populations are small:
    e.g., household outbreak
  Global contact patterns deviate from homogeneous mixing:
    Social networks
    Realistic behavior
  Disease natural history is not memoryless:
    Recovery period is not exponentially or gamma distributed
    Lots of variability in individual infectiousness

[Figure panels: Exponential RV; Lognormal RV]

Page 17: ICPSR 2011 - Bonus Content - Modeling with Data

Progression of NoV Infection

Short incubation period (~1.5 days)

Typical symptom duration is around 1.5 days. Exceptional cases of up to a year have been reported.

Most people shed asymptomatically after symptoms resolve:
  Typically for several days
  Not uncommon for shedding to last > 1 month, or even a year or more
  15-50% of all infections may be totally asymptomatic

Page 18: ICPSR 2011 - Bonus Content - Modeling with Data

Basic NoV Transmission Model for Household Outbreaks

SEIR Transmission Model

Individuals may be in one of four states:
  Susceptible
  Exposed/Incubating
  Infectious
  Recovered/Immune

Multiple boxes in the E & I states correspond to the shape parameter of gamma-distributed waiting times.

Background infection parameter, α. (Fixed to 0.001/day)

Although NoV immunity tends to be partial and short-lived, this model is adequate for analyzing short-lived outbreaks.
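The link between "multiple boxes" and a gamma shape parameter can be checked by simulation: passing through k sequential boxes, each with an exponential waiting time, gives a gamma (Erlang) total with shape k. The sketch below is illustrative; the function name and the 1.5-day mean are assumptions, not values taken from the fitted model.

```python
import random
import statistics

def staged_waiting_time(mean, n_stages, rng):
    """Total time to pass through n_stages boxes in sequence.
    Each box has an exponential waiting time with mean mean / n_stages."""
    rate = n_stages / mean
    return sum(rng.expovariate(rate) for _ in range(n_stages))

rng = random.Random(42)
for k in (1, 2, 5):
    draws = [staged_waiting_time(1.5, k, rng) for _ in range(20000)]
    # The mean stays near 1.5 while the variance shrinks roughly as mean**2 / k.
    print(k, round(statistics.mean(draws), 2), round(statistics.pvariance(draws), 2))
```

With one box the waiting time is exponential (memoryless); adding boxes keeps the mean fixed but concentrates the distribution around it.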

Page 19: ICPSR 2011 - Bonus Content - Modeling with Data

Analysis Objectives (with results)

Estimate daily person-to-person rate of infection (β).
  Result: 0.14 infections per day

Estimate average effective duration of infection (1/γ) and shape parameter of gamma-distributed infectiousness duration.
  Result: 1.2 days; γs = 1

Effect of missing household sizes on results.
  Result: Minimal

Effect of asymptomatic infection.
  Result: 0.035 increase in β for each 10% increase in the proportion of individuals who are asymptomatically infectious

Page 20: ICPSR 2011 - Bonus Content - Modeling with Data

What makes these data challenging to work with?

We want to understand:
  Daily person-to-person rate of infection (β).
  Average effective duration of infection (1/γ).
  Variability in 1/γ.
  Generation of asymptomatic infections.

But household data are noisy and only partially observed. We know the time of symptom onset but are missing:
  Time of infection
  Time of recovery
  Firm estimate of asymptomatic ratio & infectiousness
  Household sizes (!)

A strength of these data is that we can treat each household as an independent trial of a random infection process.

Page 21: ICPSR 2011 - Bonus Content - Modeling with Data

Likelihood Function for Fully Observed Household Outbreaks

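Force of infection at time t:

$$\lambda(S_{ij}, I_{ij}, \beta, \alpha) = S_{ij}\,(\beta I_{ij} + \alpha)$$

Likelihood of no infections over all infection-free intervals:

$$\mathcal{L}_{i,a} = \prod_{j=0}^{N_Q - 1} \exp\!\left(-\lambda(S_{ij}, I_{ij}, \beta, \alpha)\,(t_{j+1} - t_j)\right)$$

Probability of all infections:

$$\mathcal{L}_{i,b} = \prod_{k=1}^{N_K} \lambda(S_{ik}, I_{ik}, \beta, \alpha)$$

[Figure: household timeline marking infection (x), symptom onset, and recovery]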

Page 22: ICPSR 2011 - Bonus Content - Modeling with Data

Likelihood Function for Fully Observed Household Outbreaks

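Likelihood of a household observation:

$$\mathcal{L}_i = \mathcal{L}_{i,a} \times \mathcal{L}_{i,b}$$

Likelihood of all household observations:

$$\mathcal{L}_O = \prod_{i \in H} \mathcal{L}_i$$

[Figure: household timeline marking infection (x), symptom onset, and recovery]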

Page 23: ICPSR 2011 - Bonus Content - Modeling with Data

Likelihood Function for Fully Observed Household Outbreaks

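Force of infection at time t:

$$\lambda(S_{ij}, I_{ij}, \beta, \alpha) = S_{ij}\,(\beta I_{ij} + \alpha)$$

Likelihood of no infections over all infection-free intervals:

$$\mathcal{L}_{i,a} = \prod_{j=0}^{N_Q - 1} \exp\!\left(-\lambda(S_{ij}, I_{ij}, \beta, \alpha)\,(t_{j+1} - t_j)\right)$$

Probability of all infections:

$$\mathcal{L}_{i,b} = \prod_{k=1}^{N_K} \lambda(S_{ik}, I_{ik}, \beta, \alpha)$$

Likelihood of a household observation:

$$\mathcal{L}_i = \mathcal{L}_{i,a} \times \mathcal{L}_{i,b}$$

Likelihood of all household observations:

$$\mathcal{L}_O = \prod_{i \in H} \mathcal{L}_i$$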
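The household likelihood is easier to work with on the log scale, where the two products become sums. The sketch below is one possible way to code it, assuming a fully observed household has already been reduced to a list of infection-free intervals and a list of infection events; that data layout and the function names are hypothetical.

```python
import math

def force_of_infection(s, i, beta, alpha):
    """lambda = S * (beta * I + alpha)."""
    return s * (beta * i + alpha)

def household_log_likelihood(intervals, infections, beta, alpha):
    """Log-likelihood of one fully observed household.

    intervals:  (S, I, duration) for each infection-free interval.
    infections: (S, I) just before each infection event.
    """
    # Log of the product of exp(-lambda * (t_{j+1} - t_j)) over intervals.
    log_no_infection = -sum(force_of_infection(s, i, beta, alpha) * dt
                            for s, i, dt in intervals)
    # Log of the product of lambda over infection events.
    log_infections = sum(math.log(force_of_infection(s, i, beta, alpha))
                         for s, i in infections)
    return log_no_infection + log_infections

# Made-up household of three: one infectious index case, one secondary
# infection after 1 day, then 2 more days with no further infections.
ll = household_log_likelihood(intervals=[(2, 1, 1.0), (1, 2, 2.0)],
                              infections=[(2, 1)],
                              beta=0.14, alpha=0.001)
print(ll)
```

The log-likelihood of all households is then the sum of the per-household values, which mirrors the product over households in the last equation.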

Page 24: ICPSR 2011 - Bonus Content - Modeling with Data

Unobserved Infection States

+ 104 Households w/ No Secondary Cases

Page 25: ICPSR 2011 - Bonus Content - Modeling with Data

Unobserved Infection States

Use data augmentation to generate complete observations.

For each symptom onset event (q):
  Draw incubation time, k, from the incubation-time distribution.
  Infection time, a = q - k.
  If you draw any a < 0, the whole sample has likelihood = 0.
  Draw recovery time, r, from the symptom duration distribution.
  If r > observation period, w: set r = w, to account for right-censoring in the data.

Repeat for many (1K+) samples.
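The augmentation steps above can be sketched as follows. Several details are assumptions made so the example runs: gamma distributions are used for both the incubation time and the symptom duration, the default means and shapes are placeholders, and the recovery time is taken to be symptom onset plus a sampled duration.

```python
import random

def augment_household(onset_times, window, rng,
                      incubation_mean=1.5, incubation_shape=1.0,
                      symptom_mean=1.5, symptom_shape=1.0):
    """Return one augmented sample as a list of (infection, onset, recovery)
    times, or None when the sample is impossible (likelihood = 0)."""
    events = []
    for q in onset_times:
        # random.gammavariate takes (shape, scale); mean = shape * scale.
        k = rng.gammavariate(incubation_shape, incubation_mean / incubation_shape)
        a = q - k
        if a < 0:
            return None  # infection before the observation start: discard sample
        r = q + rng.gammavariate(symptom_shape, symptom_mean / symptom_shape)
        r = min(r, window)  # right-censor at the end of the observation period
        events.append((a, q, r))
    return events

rng = random.Random(3)
samples = [augment_household([2.0, 3.5], window=10.0, rng=rng) for _ in range(1000)]
valid = [s for s in samples if s is not None]
print(len(valid), valid[0])
```

Samples returned as None still count toward the total number of samples when averaging, since they contribute a likelihood of zero.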

Page 26: ICPSR 2011 - Bonus Content - Modeling with Data

Unobserved Infection States

[Figure: augmented household timelines marking infection (x), symptom onset, and recovery]

Evaluate the likelihood with respect to β and α for each sample. E(L), the mean over samples, is the estimated likelihood of the data.
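Averaging likelihoods over augmented samples is numerically fragile if done naively, because per-sample likelihoods are tiny products. A common workaround, shown here as a sketch rather than as the talk's implementation, is to keep per-sample log-likelihoods and average them with the log-sum-exp trick. Impossible samples are represented as negative infinity.

```python
import math

def log_mean_likelihood(log_likelihoods):
    """Return log(E[L]) given per-sample log-likelihoods.
    Samples with likelihood 0 are passed as -math.inf and still count
    in the denominator."""
    finite = [ll for ll in log_likelihoods if ll != -math.inf]
    if not finite:
        return -math.inf
    m = max(finite)  # subtract the maximum before exponentiating
    total = sum(math.exp(ll - m) for ll in finite)
    return m + math.log(total / len(log_likelihoods))

# Three samples with likelihoods 0.2, 0.4 and 0: the mean likelihood is 0.2.
print(math.exp(log_mean_likelihood([math.log(0.2), math.log(0.4), -math.inf])))
```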

Page 27: ICPSR 2011 - Bonus Content - Modeling with Data

Unobserved Household Sizes

Sizes of households in the Stockholm outbreak are unknown.

Expected number of cases is: S(βI + α)Δt

Missing S!

Solution:
  Assume exposed households are sampled at random from the whole population.
  For each augmented household time series, sample household size from the Swedish census distribution.
  Save samples by setting a lower bound: the likelihood of a household with fewer individuals than observed infections is 0, so don’t sample these.
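Sampling household sizes with a lower bound amounts to drawing from the size distribution conditioned on being at least as large as the number of observed infections. A minimal sketch follows; the probabilities in `size_probs` are invented placeholders, not the Swedish census values.

```python
import random

def sample_household_size(size_probs, n_observed_cases, rng):
    """Draw a household size from size_probs, restricted to sizes that can
    hold all observed infections (smaller sizes would have likelihood 0)."""
    allowed = {size: p for size, p in size_probs.items()
               if size >= n_observed_cases}
    if not allowed:
        raise ValueError("no household size can hold the observed cases")
    sizes = list(allowed)
    weights = [allowed[size] for size in sizes]
    # random.choices normalizes the weights, so no explicit rescaling is needed.
    return rng.choices(sizes, weights=weights, k=1)[0]

# Placeholder distribution over sizes 2..6 (single-person households excluded).
size_probs = {2: 0.35, 3: 0.25, 4: 0.25, 5: 0.10, 6: 0.05}
rng = random.Random(11)
print([sample_household_size(size_probs, 3, rng) for _ in range(10)])
```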

Page 28: ICPSR 2011 - Bonus Content - Modeling with Data

Results: MLE Parameter Values and 95% Confidence Intervals

1/γ limited to values >= 1 day; infectiousness duration < 1 day not plausible

Page 29: ICPSR 2011 - Bonus Content - Modeling with Data

Results: Likelihood Surface

Contour plot shows likelihood for combinations of β and 1/γ for γs = 1.

Triangle is the location of the MLE; dashed oval shows the 95% confidence bounds.

The parameter space isn’t very large, so optimize using brute force.
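Brute-force optimization over a small parameter space is just an exhaustive grid search. The sketch below is illustrative: `toy_log_likelihood` is a made-up smooth surface standing in for the real estimated likelihood, and its peak location is arbitrary. The duration grid starts at 1 day to reflect the constraint that 1/γ is limited to values >= 1 day.

```python
def grid_search(log_likelihood, betas, durations):
    """Evaluate log_likelihood on every (beta, duration) pair and
    return (best_value, best_beta, best_duration)."""
    best = None
    for beta in betas:
        for duration in durations:
            value = log_likelihood(beta, duration)
            if best is None or value > best[0]:
                best = (value, beta, duration)
    return best

def toy_log_likelihood(beta, duration):
    """Hypothetical stand-in surface with a single peak at (0.10, 2.0)."""
    return -((beta - 0.10) / 0.05) ** 2 - ((duration - 2.0) / 0.5) ** 2

betas = [i / 100 for i in range(5, 31)]       # 0.05, 0.06, ..., 0.30 per day
durations = [j / 10 for j in range(10, 31)]   # 1.0, 1.1, ..., 3.0 days
value, beta_hat, duration_hat = grid_search(toy_log_likelihood, betas, durations)
print(beta_hat, duration_hat)
```

Storing every grid value instead of only the best one would also give the likelihood surface needed for a contour plot.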

Page 30: ICPSR 2011 - Bonus Content - Modeling with Data

Goodness of fit

Simulate from the SEIR model using fitted parameters and the same demographics as the outbreak.

Simulation step (all infections symptomatic):

  If [condition missing in source]:
    λ = βSI
    Draw number of new infections, x, from Binomial(S, λ)
    S = S - x
    E = E + x
    Draw symptom onset times from [distribution missing in source] for all new infections.
    t = t + dt
    At end of step:
      Transition E → I those who have infectiousness onset time <= t.
      Transition I → R those who have recovery time <= t.
  Else:
    STOP

Simulation step (with never-symptomatic infections):

  If [condition missing in source]:
    λ = βSI
    Draw number of new infections, x, from Binomial(S, λ)
    Draw number never symptomatic, a, from Binomial(x, π)
    S = S - x
    E = E + (x - a)
    R = R + a
    Draw symptom onset times from [distribution missing in source] for all new infections.
    t = t + dt
    At end of step:
      Transition E → I those who have infectiousness onset time <= t.
      Transition I → R those who have recovery time <= t.
  Else:
    STOP
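A runnable version of this kind of discrete-time household simulation might look like the sketch below. It departs from the slide in labeled ways: the per-step infection probability for each susceptible is taken as 1 - exp(-(βI + α)dt) so that it is always a valid probability, the loop stops when nobody is exposed or infectious, and the incubation and infectious periods are drawn from exponential distributions with placeholder means.

```python
import math
import random

def simulate_household(size, beta, alpha, rng,
                       incubation_mean=1.5, infectious_mean=1.2,
                       dt=0.1, t_max=30.0):
    """Simulate one household seeded with a single infectious index case.
    Returns the total number infected, including the index case."""
    s = size - 1
    infectious = [rng.expovariate(1.0 / infectious_mean)]  # recovery times of I
    exposed = []            # (infectiousness onset, recovery) for those in E
    total_infected = 1
    t = 0.0
    while t < t_max and (exposed or infectious):
        # Assumed per-susceptible infection probability over one step.
        p_infect = 1.0 - math.exp(-(beta * len(infectious) + alpha) * dt)
        x = sum(rng.random() < p_infect for _ in range(s))  # Binomial(S, p) draw
        s -= x
        total_infected += x
        for _ in range(x):
            onset = t + rng.expovariate(1.0 / incubation_mean)
            exposed.append((onset, onset + rng.expovariate(1.0 / infectious_mean)))
        t += dt
        # E -> I for those whose infectiousness onset time <= t.
        infectious.extend(rec for on, rec in exposed if on <= t)
        exposed = [(on, rec) for on, rec in exposed if on > t]
        # I -> R for those whose recovery time <= t.
        infectious = [rec for rec in infectious if rec > t]
    return total_infected

rng = random.Random(7)
totals = [simulate_household(4, 0.14, 0.001, rng) for _ in range(1000)]
print(sum(n == 1 for n in totals) / len(totals))  # share with no secondary cases
```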

Page 31: ICPSR 2011 - Bonus Content - Modeling with Data

Goodness of fit

Simulate from the SEIR model using fitted parameters and the same demographics as the outbreak.

Quantify model performance based on closeness to outbreak characteristics:
  Average number of infections in households with secondary cases.
    Simulated: 1.9, SD = 0.2
    Stockholm: 1.6
  Average number of households with no secondary cases.
    Simulated: 110.5, SD = 5.5
    Stockholm: 104

[Figure panels: # of infections in households w/ 2-ary transmission; # of households with zero secondary cases]
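The two summary statistics can be computed from per-household case counts, and the gap between simulated and observed values can be expressed in units of the simulated standard deviation. The sketch below is one way to do that; the helper names are hypothetical, the sample counts are made up, and only the numbers passed to `standardized_gap` come from the slide.

```python
def outbreak_summary(secondary_cases):
    """secondary_cases: number of secondary infections in each household.
    Returns (mean count among households with any secondary case,
             number of households with none)."""
    with_secondary = [c for c in secondary_cases if c > 0]
    n_zero = len(secondary_cases) - len(with_secondary)
    mean_cases = sum(with_secondary) / len(with_secondary) if with_secondary else 0.0
    return mean_cases, n_zero

def standardized_gap(observed, simulated_mean, simulated_sd):
    """How many simulated SDs the observed value lies from the simulated mean."""
    return (observed - simulated_mean) / simulated_sd

print(outbreak_summary([0, 0, 2, 1, 0, 3]))   # made-up counts for six households
print(standardized_gap(1.6, 1.9, 0.2))        # infections per affected household
print(standardized_gap(104, 110.5, 5.5))      # households with no secondary cases
```

Both observed values sit within about 1.5 simulated standard deviations of the simulated means.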

Page 32: ICPSR 2011 - Bonus Content - Modeling with Data

Sensitivity Analysis: Household Sizes

Want to understand the extent to which using sampled household sizes biases results.

Simulate outbreaks with household sizes drawn from the Swedish census distribution. Estimate parameters using:
  Sampled household sizes
  Known sizes from simulation

Compare results.

Page 33: ICPSR 2011 - Bonus Content - Modeling with Data

Results: Sensitivity Analysis

Estimate parameters for an outbreak with β = 0.14/day and 1/γ = 1.2 days.

Dashed lines show the fit when household sizes are known; solid lines show the fit when they are unknown.

Results almost exactly the same.

Page 34: ICPSR 2011 - Bonus Content - Modeling with Data

Asymptomatic Infections

Problem: Only observed symptomatic infections.

Asymptomatics likely don’t contribute much to outbreaks in households with symptomatic cases, but can be infected during these outbreaks.

They are very important for seeding new outbreaks:
  Stockholm outbreak started by post-symptomatic food handler
  Afternoon Delight outbreak in Ann Arbor
  Subway outbreaks in Kent County, MI

Full analysis of asymptomatic infections requires active surveillance, e.g., stool and environmental samples.

Solution: Estimate parameters for outbreaks with varying levels of asymptomatic infection using simulated data.

Page 35: ICPSR 2011 - Bonus Content - Modeling with Data

Modeling asymptomatic infection

π is the proportion of new infections that are asymptomatic.
  Assume asymptomatic infections are non-infectious during the household outbreak.

Sample 20 outbreaks each for combinations of:
  β = {0.075, 0.085, …, .2}
  π = {0, .1, …, .5}

Page 36: ICPSR 2011 - Bonus Content - Modeling with Data

Modeling asymptomatic infection

Simulation step (all infections symptomatic):

  If [condition missing in source]:
    λ = βSI
    Draw number of new infections, x, from Binomial(S, λ)
    S = S - x
    E = E + x
    Draw symptom onset times from [distribution missing in source] for all new infections.
    t = t + dt
    At end of step:
      Transition E → I those who have infectiousness onset time <= t.
      Transition I → R those who have recovery time <= t.
  Else:
    STOP

Simulation step (with never-symptomatic infections):

  If [condition missing in source]:
    λ = βSI
    Draw number of new infections, x, from Binomial(S, λ)
    Draw number never symptomatic, a, from Binomial(x, π)
    S = S - x
    E = E + (x - a)
    R = R + a
    Draw symptom onset times from [distribution missing in source] for all new infections.
    t = t + dt
    At end of step:
      Transition E → I those who have infectiousness onset time <= t.
      Transition I → R those who have recovery time <= t.
  Else:
    STOP
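The asymptomatic variant changes only how new infections are allocated: a binomial share goes straight to R and the rest enter E. The sketch below isolates that bookkeeping. It assumes every newly infected person leaves S, whether or not they ever show symptoms, so the household size is conserved; `p_infect` is a per-susceptible infection probability supplied by the caller.

```python
import random

def infection_step(s, e, r, p_infect, pi, rng):
    """One allocation step with never-symptomatic infections.
    Returns the updated (S, E, R) counts."""
    x = sum(rng.random() < p_infect for _ in range(s))  # new infections ~ Binomial(S, p)
    a = sum(rng.random() < pi for _ in range(x))        # never symptomatic ~ Binomial(x, pi)
    # All x leave S; x - a start incubating; a are treated as non-infectious.
    return s - x, e + (x - a), r + a

rng = random.Random(9)
print(infection_step(5, 0, 0, p_infect=0.5, pi=0.3, rng=rng))
```

Checking that S + E + R is unchanged after each step is a cheap guard against bookkeeping errors in this kind of update.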

Page 37: ICPSR 2011 - Bonus Content - Modeling with Data

Modeling asymptomatic infection

π is the proportion of new infections that are asymptomatic.
  Assume asymptomatic infections are non-infectious during the household outbreak.

Sample 20 outbreaks each for combinations of:
  β = {0.075, 0.085, …, .2}
  π = {0, .1, …, .5}

Estimate parameters using the data augmentation method.
  Assume π = 0, as when fitting the Stockholm data.

Find the expected value of β for each π when the estimated β = 0.14.

Page 38: ICPSR 2011 - Bonus Content - Modeling with Data

Modeling asymptomatic infection

Page 39: ICPSR 2011 - Bonus Content - Modeling with Data

Norovirus outbreaks in realistic communities

Norovirus has interesting qualitative outbreak dynamics in the community.
  Outbreaks are explosive but typically limited.

Multiple levels of transmission:
  Can embed findings about household transmission.
  Community rate of transmission is unknown.

Data on community and region-level Norovirus outbreaks are rare.

Take a pattern-oriented approach to building community-level models of NoV transmission.

Build a model based on observed patterns and data that can recreate outbreaks with NoV-like characteristics.

Page 40: ICPSR 2011 - Bonus Content - Modeling with Data

Detailed Transmission Model

[Diagram: compartments S, E, IS, IA1, IA2, R]

Force of infection: $(\beta_{I_S} I_S) + (\beta_{I_A} I_A)$

NoV transmission is marked by heterogeneous asymptomatic infectious periods.

~5% of the population will shed for 100+ days.

Existing theory predicts that increasing variability in individual infectiousness makes outbreaks less predictable, but smaller on average.

Want to understand how this heterogeneity impacts outbreak dynamics in the context of heterogeneous contact structure.

Page 41: ICPSR 2011 - Bonus Content - Modeling with Data

Contact structure

Household sizes:
  Assume a representative community, i.e., household sizes are a random sample from the census distribution of household sizes.

Contacts in the community:
  Individuals separated into compartments: school, work, etc.

Social network:
  How do we choose a network topology that is useful and informative?

Food handlers:
  About 1% of U.S. adults are food handlers
  Average norovirus point-source outbreak size is about 40

Page 42: ICPSR 2011 - Bonus Content - Modeling with Data

Empirical contact networks

Many empirical community contact networks have an exponentially distributed degree.
  Moderate heterogeneity in contact

Page 43: ICPSR 2011 - Bonus Content - Modeling with Data

Outbreak Realizations

[Figure legend: markers distinguish Household Transmission, Community Transmission, and Point Source Event]