
How Many Cases Are Too Many?

Detection of Disease Outbreaks and Clusters

Lance A. Waller, Department of Biostatistics, Rollins School of Public Health, Emory University


How many are too many?

What sets off the public health “alarm”?

For anthrax and smallpox…

ONE (no statistics needed)

(rare enough and dangerous enough)

What about…

…a more subtle pattern?
5 flu cases in a single day.
20 acute asthma attacks in one neighborhood.

We want to detect anomalies: patterns of cases differing from the “usual” pattern.

What are we looking for?

Among “Epidemiologic clues that may signal a covert bioterrorism attack” in CDC’s The Public Health Response to Biological and Chemical Terrorism: Interim Planning Guidance for State Public Health Officials (July 2001):

“Disease with unusual geographic or seasonal distribution”

http://www.bt.cdc.gov/Documents/Planning/PlanningGuidance.PDF

John Snow, M.D. 1854 map


Snow, J. (1949) Snow on Cholera. Oxford University Press: London.

What we want...

Statistical assessments of the “unusualness” of observed patterns in space and time.
Suggests statistical tests of:
H0: No clusters in the data.

Yes/no answer?
Easy to ask, harder to answer.

Distributed “by chance”…

Need to “operationalize” H0

What sort of data arise under H0?
What counts as evidence against H0?

Simple random (uniform) pattern?

Scan statistics

Count events in moving window.
In time:

Consideration: Cluster “anywhen”, or outbreak now?

[Figure: event counts in successive time windows: 4, 3, 2, 20]

Wallenstein, S. (1980) A test for detection of clustering over time. American Journal of Epidemiology 111, 367-372.
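The moving-window count in time can be sketched in a few lines. This is a minimal illustration only, assuming a list of daily case counts and a fixed window width; the function name `temporal_scan` and the example counts are hypothetical, and no significance assessment is included.

```python
def temporal_scan(counts, width):
    """Return (max_count, start_index) over all windows of `width` consecutive periods."""
    if width < 1 or width > len(counts):
        raise ValueError("width must be between 1 and len(counts)")
    window = sum(counts[:width])
    best, best_start = window, 0
    for start in range(1, len(counts) - width + 1):
        # Slide the window: add the period entering, drop the period leaving.
        window += counts[start + width - 1] - counts[start - 1]
        if window > best:
            best, best_start = window, start
    return best, best_start

daily = [1, 0, 2, 1, 0, 1, 4, 3, 2, 0, 1, 0]  # hypothetical daily counts
best, start = temporal_scan(daily, 3)          # cluster "anywhen"
latest = sum(daily[-3:])                       # outbreak "now": most recent window only
print(best, start, latest)
```

Scanning all windows addresses the "anywhen" question; looking only at the most recent window addresses the "outbreak now" question.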

Scan statistic in space

[Figure: event counts within moving spatial windows: 2, 0, 3, 1]

Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics-Theory and Methods 26, 1481-1496.

Complication

Heterogeneous population density

Refine the question…

“Are there clusters in the data?” to

“Are there clusters in the data after adjusting for heterogeneities in the population at risk?”

Complication:

Where is “where”?
Which location for each case?
Example: Maxcy (1926) study of endemic typhus fever in Montgomery, AL, 1922-1925.

Lilienfeld, D.E. and Stolley, P.D. (1994) Foundations of Epidemiology, Third Edition. Oxford University Press: New York, pp. 136-140.

Maxcy, K.F. (1926) “An epidemiological study of endemic typhus (Brill’s disease) in the Southeastern United States with special reference to its mode of transmission.” Public Health Reports 41, 2967-2995.

[Figure panels: Residence location; Place of employment]

Refine the question…

“Are there clusters in the data after adjusting for heterogeneities in the population at risk?” to…

“Are there clusters of case residences in the data after adjusting for heterogeneities in the population at risk?”

We’re building a conceptual model…

What we have…

Disease surveillance (ongoing collection, monitoring, and analysis of disease data).
Vital statistics (birth/death certificates)
Notifiable diseases (required reporting)
Registries (link multiple sources of information on each case, e.g., SEER)
Health surveys (NHANES, NHIS, BRFSS)

Teutsch, S.M. and Churchill, R.E. (1994) Principles and Practice of Public Health Surveillance. Oxford University Press: New York.

Data components

Types of location (time or space) data:

Point data (case locations)
• Latitude/longitude
• Street address
• Confidentiality?

Regional count data
• Counts for enumeration districts

Background data

Types of background data:

Point locations for non-cases (“controls”)
• Is the spatial distribution of cases close to that of controls?

Regional census counts
• Is the observed number of cases close to the number expected under H0?

Point data

Case locations geocoded from registry or billing records.

Controls:
• All non-cases (e.g., birth records)
• Sample (perhaps matched) of non-cases.
• Different outcome (e.g., nonrespiratory ED visits, compared to respiratory ED visits)

Regional Count Data

Aggregate to regional counts, often to preserve confidentiality.

[Figure: map of regions labeled with case counts]

Complication:

Counts lose some resolution...

[Figure: map of regions labeled with case counts]

Modifiable Areal Unit Problem

Different aggregations can lead to different results.
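A small sketch of how the choice of aggregation can change the picture. The case coordinates and the square grid cells are hypothetical, and `grid_counts` is an illustrative helper, not a function from any library.

```python
from collections import Counter

def grid_counts(points, cell):
    """Count points falling in each square grid cell of side length `cell`."""
    return Counter((int(x // cell), int(y // cell)) for x, y in points)

# Hypothetical case locations: four cases close together, one far away.
cases = [(0.9, 0.9), (1.1, 0.9), (0.9, 1.1), (1.1, 1.1), (3.5, 3.5)]

fine = grid_counts(cases, 1.0)    # 1 x 1 cells split the tight group four ways
coarse = grid_counts(cases, 2.0)  # 2 x 2 cells keep the group together
print(max(fine.values()), max(coarse.values()))
```

With the finer grid no region holds more than one case; with the coarser grid one region holds four. The underlying locations are identical in both summaries.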

[Figure: the same cases aggregated under different regional boundaries, giving different regional counts]

MAUP example: John Snow


Monmonier, M. (1991) How to Lie with Maps. University of Chicago Press: Chicago, p. 142.

Operationalizing H0 :

Case/control point data: Random labeling hypothesis
Say n0 control, n1 case locations.
H0: Case/control label randomly assigned to the n = n0 + n1 total locations.

Operationalizing H0 :

Regional count data: Constant risk hypothesis
Each individual subject to same risk.
Expected count = (risk)*(population size).

Variable total: Poisson counts.
Fixed total: Multinomial counts.
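As a worked illustration of the constant risk calculation, the sketch below computes expected counts as (risk) × (population size). The counts and populations are hypothetical, and estimating the risk from the observed total is an assumption of this sketch (it corresponds to treating the total number of cases as fixed).

```python
def expected_counts(cases, population):
    """Expected cases per region under constant risk: E_i = risk * n_i."""
    risk = sum(cases) / sum(population)  # overall risk, estimated from the data
    return [risk * n for n in population]

cases = [4, 1, 2, 11, 2]                     # hypothetical observed counts
population = [2000, 1000, 1000, 4000, 2000]  # hypothetical persons at risk
expected = expected_counts(cases, population)
print(expected)
```

The expected counts sum to the observed total, so regions differ only through their population sizes.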

[Figure: three example sets of regional counts]

H0 drives type of test

Random labeling: often compare observed spatial intensities (expected number of events per unit area) of cases and controls.
Constant risk: compare observed counts to expected counts (goodness of fit).

What deviation from H0 ?

Tests of clustering: check tendency for cases to occur in clusters.
Tests to detect clusters: find most likely cluster(s).
General tests: detect clusters or clustering anywhere.
Focused tests: detect clusters or clustering around suspected foci.

Besag, J. and Newell, J. (1991) “The detection of clusters in rare diseases”. Journal of the Royal Statistical Society, Series A 154, 143-155.

How weird? (Monte Carlo test)

Random labeling/constant risk: simulate data sets under H0.
For any test statistic, calculate its value in the observed data, Tobs.
Simulate many data sets under H0, and calculate the test statistic for each (T1, T2, …, Tnumsim).
p-value = proportion of test statistics from simulated data sets exceeding Tobs (fraction of T’s > Tobs).
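The Monte Carlo recipe above can be sketched as follows, assuming NumPy is available. The regional counts, populations, and the test statistic `max_ratio` (largest observed/expected ratio) are hypothetical choices made only for illustration; any statistic could be substituted. The p-value follows the definition given here (fraction of simulated values exceeding Tobs); some analysts instead count ties and add one to numerator and denominator.

```python
import numpy as np

def monte_carlo_p(t_obs, t_sim):
    """Fraction of simulated statistics strictly exceeding the observed one."""
    return float(np.mean(np.asarray(t_sim) > t_obs))

def max_ratio(obs, exp):
    """Illustrative statistic: largest observed/expected ratio over regions."""
    return float(np.max(obs / exp))

rng = np.random.default_rng(0)                          # fixed seed for repeatability
observed = np.array([4, 1, 2, 11, 2])                   # hypothetical counts
population = np.array([2000, 1000, 1000, 4000, 2000])   # hypothetical persons at risk
share = population / population.sum()
expected = observed.sum() * share

# Constant risk with fixed total: multinomial counts, probabilities proportional to population.
sims = rng.multinomial(observed.sum(), share, size=999)
t_sim = [max_ratio(s, expected) for s in sims]
p = monte_carlo_p(max_ratio(observed, expected), t_sim)
print(p)
```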

Example: Regional Counts

Comparing observed to expected.
Pearson’s chi-square statistic:

X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

But X2 ignores location of lack of fit.

Spatial goodness-of-fit

Instead of squaring (Oi – Ei), what if we link (Oi – Ei) and (Ok – Ek) by proximity of regions i and k?
Say, sum over i and k of wik (Oi – Ei)(Ok – Ek), where wik gives the link between regions i and k?
This (essentially) gives Tango’s index of clustering.

Tango, T. (1990) An index for cancer clustering. Environmental Health Perspectives 87, 157-162.
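A rough sketch of the contrast between Pearson’s statistic and a proximity-weighted sum of cross-products, assuming NumPy. The weights wik = exp(-dik / tau), the value of tau, and the four regions on a line are assumptions made for illustration; the published index is defined on proportions rather than raw counts, so this is not Tango’s exact statistic.

```python
import numpy as np

def pearson_x2(obs, exp):
    """Pearson goodness of fit: sum of (O_i - E_i)^2 / E_i."""
    return float(np.sum((obs - exp) ** 2 / exp))

def proximity_index(obs, exp, coords, tau=1.0):
    """Sum over i, k of w_ik (O_i - E_i)(O_k - E_k), with w_ik = exp(-d_ik / tau)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = np.exp(-d / tau)   # assumed distance-decay weights
    r = obs - exp
    return float(r @ w @ r)

coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])  # regions on a line
exp = np.array([5.0, 5.0, 5.0, 5.0])
adjacent = np.array([8.0, 8.0, 2.0, 2.0])   # excesses in neighboring regions
scattered = np.array([8.0, 2.0, 8.0, 2.0])  # same excesses, interleaved

print(pearson_x2(adjacent, exp), pearson_x2(scattered, exp))
print(proximity_index(adjacent, exp, coords), proximity_index(scattered, exp, coords))
```

Both arrangements give the same Pearson statistic, since it ignores where the lack of fit occurs, while the proximity-weighted sum is larger when the excesses sit next to each other.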

Finding spatial clusters?

Spatial scan statistic (SaTScan)
Scan on windows with distance radii.

Turnbull et al.’s Cluster Evaluation Permutation Procedure (CEPP)
Scan on window of constant population size (e.g., 10,000 people at risk).

Besag and Newell’s approach
Scan on window of constant number of cases (e.g., 10 cases).

All seek the collection of cases least consistent with H0.
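The distance-radius version can be sketched as below, assuming NumPy. This is a simplified illustration, not the SaTScan implementation: windows are circles centered on region centroids, the score is the Poisson log likelihood ratio for windows with more cases than expected, and the coordinates, counts, and radii are hypothetical. Significance of the best window would be judged by Monte Carlo simulation under H0, which is omitted here.

```python
import numpy as np

def poisson_llr(c, e, total):
    """Log likelihood ratio for a window with c observed and e expected cases; 0 unless c > e."""
    if c <= e:
        return 0.0
    inside = c * np.log(c / e)
    out = total - c
    outside = out * np.log(out / (total - e)) if out > 0 else 0.0
    return float(inside + outside)

def spatial_scan(coords, cases, expected, radii):
    """Return (best_llr, center_index, radius) over circular windows centered on each region."""
    total = cases.sum()
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    best = (0.0, None, None)
    for i in range(len(coords)):
        for r in radii:
            members = d[i] <= r
            c, e = cases[members].sum(), expected[members].sum()
            if total - e < 1e-9:   # window covers the whole study area: skip
                continue
            llr = poisson_llr(c, e, total)
            if llr > best[0]:
                best = (llr, i, r)
    return best

coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
cases = np.array([4, 1, 2, 11, 2])               # hypothetical observed counts
expected = np.array([4.0, 2.0, 2.0, 8.0, 4.0])   # hypothetical constant-risk expectations
llr, center, radius = spatial_scan(coords, cases, expected, radii=[0.5, 1.5])
print(llr, center, radius)
```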

New York Leukemia

592 cases 1978-1982, 8 counties, 790 census regions, ~ 1 million people.

Example: case/control point data

Kelsall and Diggle (1995)
Compare ratio of case intensity to control intensity.
Random labeling simulations.
Identify locations where case intensity significantly exceeds control intensity (pointwise test of significance).

Approach to detect clusters.

Kelsall, J.E. and Diggle, P.J. (1995) Non-parametric estimation of spatial variation in relative risk. Statistics in Medicine 14, 2335-2342.
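A minimal sketch of the intensity-ratio idea with random labeling, assuming NumPy. The Gaussian kernel, the bandwidth, the simulated point locations, and the function names are all assumptions made for illustration and are not taken from the paper; edge corrections and bandwidth selection are ignored.

```python
import numpy as np

def kernel_intensity(points, grid, bw):
    """Gaussian kernel sum (intensity up to a constant) at each grid location."""
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * bw ** 2)).sum(axis=1)

def log_ratio(cases, controls, grid, bw):
    """Log ratio of case to control kernel estimates, each scaled by its sample size."""
    f = kernel_intensity(cases, grid, bw) / len(cases)
    g = kernel_intensity(controls, grid, bw) / len(controls)
    return np.log(f) - np.log(g)

def pointwise_p(cases, controls, grid, bw, nsim=199, seed=0):
    """Random labeling: fraction of relabeled log ratios exceeding the observed one, per location."""
    rng = np.random.default_rng(seed)
    everyone = np.vstack([cases, controls])
    n1 = len(cases)
    obs = log_ratio(cases, controls, grid, bw)
    exceed = np.zeros(len(grid))
    for _ in range(nsim):
        perm = rng.permutation(len(everyone))   # reassign case/control labels at random
        sim = log_ratio(everyone[perm[:n1]], everyone[perm[n1:]], grid, bw)
        exceed += sim > obs
    return exceed / nsim

rng = np.random.default_rng(1)
controls = rng.uniform(0, 1, size=(100, 2))    # hypothetical controls, spread evenly
cases = rng.normal(0.8, 0.05, size=(30, 2))    # hypothetical cases, tightly grouped
grid = np.array([[0.8, 0.8], [0.2, 0.2]])      # two locations to evaluate
p = pointwise_p(cases, controls, grid, bw=0.15)
print(p)
```

Small pointwise p-values mark locations where the case intensity is unusually high relative to controls; because many locations are tested, these are pointwise rather than overall significance statements.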

Archeology data

Alt and Vach (1991)
143 grave sites, 30 with affected teeth (“cases”)
Question: families buried together?
Tested question: Do grave sites with affected teeth cluster?

Alt, K.W., and Vach, W. (1991) “The reconstruction of ‘genetic kinship’ in prehistoric burial complexes – problems and statistics”. In Classification, Data Analysis, and Knowledge Organization: Models and Methods with Applications. H.-H. Bock and P. Ihm (eds.) Springer: Berlin.

Case and control intensities

[Figure: kernel intensity estimates, bw = 500. Panels: Affected (*), intensity f; Non-affected (o), intensity g. Axis ticks run from 4000 to 10000.]

Relative risk surface

[Figure: relative risk surface r, bw = 500, with affected (*) and non-affected (o) grave sites overlaid. Axis ticks run from 4000 to 8000.]

Spatial scan statistic

Most likely cluster (p-value = 0.067)

Important ideas

What question do I want to answer?
What data can I get?
What statistical method will I use?
What question can I answer with the data I have and the method?
Does this match my first question?

Additional important ideas

Results depend on data structure (MAUP).
Every test involves a specific definition of “cluster”… ask yourself:

What data result from H0 (the model of “no clustering”)?
• Can you simulate data from H0?

What constitutes evidence against H0 (the model of “clustering”)?
• Do your data appear consistent with H0?

Reading list

Besag, J. and Newell, J. (1991) The detection of clusters in rare diseases. Journal of the Royal Statistical Society, Series A 154, 143-155.
Kelsall, J.E. and Diggle, P.J. (1995) Non-parametric estimation of spatial variation in relative risk. Statistics in Medicine 14, 2335-2342.
Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics-Theory and Methods 26, 1481-1496.
Neutra, R.R. (1990) Counterpoint from a cluster buster. American Journal of Epidemiology 132, 1-8.
Rothman, K. (1990) A sobering start to the cluster busters’ conference. American Journal of Epidemiology 132 (Supplement), S6-S13.
Snow, J. (1946) Snow on Cholera. Oxford University Press.
Tango, T. (1990) An index for cancer clustering. Environmental Health Perspectives 87, 157-162.
Turnbull, B.W., Iwano, E.J., Burnett, W.S., Howe, H.L., and Clark, L.C. (1990) Monitoring for clusters of disease: application to leukemia incidence in upstate New York. American Journal of Epidemiology 132 (Supplement), S136-S143.
Wallenstein, S. (1980) A test for detection of clustering over time. American Journal of Epidemiology 111, 367-372.
Waller, L.A. and Jacquez, G.M. (1995) Disease models implicit in statistical tests of disease clustering. Epidemiology 6, 584-590.
Waller, L.A. (2002) Methods for detecting disease clustering in time or space. In Statistical Methods and Principles in Public Health Surveillance. R. Brookmeyer and D. Stroup (eds). Oxford University Press.