
Pandemic Response in the Era of Big Data:

Exploring the Complexities of Global Influenza Surveillance and Information Overload

Kyle Prier

Johns Hopkins Bloomberg School of Public Health


Introduction

In March 2009, the Mexican Ministry of Health reported to the World Health

Organization (WHO) Global Influenza Surveillance Network (GISN) an unusual increase of

Influenza-like Illnesses (ILI) during a period of seasonal outbreak decline (1). This initial

deviation reported by Mexican authorities became the basis for an alert to global public health

officials of an outbreak of a novel, highly contagious influenza strain, which we now refer to as

the 2009 H1N1 pandemic.

After the March 2009 report by Mexican authorities of the novel influenza A in Mexico, rapid testing protocols for virological confirmation of H1N1 were quickly developed. Using GISN specimens of viral strains, the Centers for Disease Control and Prevention (CDC) developed and shared, within days, a real-time reverse transcriptase-polymerase chain reaction (RT-PCR) protocol that could quickly identify cases of H1N1 (2, 3). These testing protocols were critical for collecting the data officials needed to respond to the emerging epidemic.

By July 2009, four months after the initial report by Mexican authorities, the H1N1 pandemic had infected over 100,000 individuals globally. Data were coming in so rapidly that it became too burdensome, at a global level, to efficiently track and validate cases using the rapid CDC protocol (2). During this time, WHO officials resorted to relying upon general qualitative indicators of basic pandemic changes that were communicated via e-mail by country officials (2). Quite literally, at the height of the H1N1 pandemic, the highest-level global public health decision makers were asking countries whether the pandemic was getting better or worse, despite the availability of reported data via existing protocols.


The WHO later determined that, during the height of the pandemic, there was not sufficient time to develop even “preliminary estimates of severity parameters such as the case fatality ratio,” and that these estimates “lagged behind key decision-making and response-planning” (2). The rapid increase of information and data at the global level during the H1N1 pandemic had negative effects on key decision making, which ultimately impaired the ability of the WHO and countries to institute appropriate pharmaceutical and non-pharmaceutical interventions (4). In the H1N1 case, key decision makers became overwhelmed with the rate and amount of data, which effectively muted their organizational ability to manage the H1N1 pandemic.

Information Overload

The organizational inability of the WHO and other key decision makers to respond

efficiently to the H1N1 pandemic can be attributed to the theory and concept of information

overload. Information overload can be described as a psychological and organizational phenomenon that occurs when the amount of information input to a system or organization exceeds its processing capacity (5, 6). In describing the concept of information overload, Shenk

argues that “at a certain level of input, the law of diminishing returns takes effect” and that the

“glut of information” leads to a negative situation that “cultivate[s] stress, confusion and even

ignorance” (7). He further argues that information overload leaves us “less cohesive as a

society”, and that on an individual level it “diminishes our control over our own lives,” while

strengthening the positions of those “already in power” (7).

Throughout the emergence and growth of electronic and computer-mediated communication

systems, researchers have warned against the threat of information overload among various

organizations (8, 9). The issue of information overload has extended into the public health


surveillance arena. More specifically, the increased availability and frequency of infectious disease data from electronic sources has made it burdensome for organizations to respond to disease outbreaks quickly and accurately (10). In 2006, the HealthMap project was created primarily to mitigate “information overload” among public health organizations so that they could effectively monitor global infectious diseases (10).

Big Data

In 2014, researchers, governments, corporations, organizations, and individuals have access to almost unfathomable amounts of data relating to human behavior, communication, and health (to name just a few domains). This influx of available data has been made possible by the growth of the Internet and the evolution of the World Wide Web.

In 2004, 14.2% of the global population had access to the Internet, compared to 35.5% in 2012 (11). The World Wide Web has matured significantly over the last decade from a communication medium of passive content distribution (Web 1.0) to a platform of interactive user collaboration and user-generated content (Web 2.0). Today’s Web is made up of numerous online social networks like Twitter, Facebook, LinkedIn, Pinterest, and Tumblr (among many

others). Such an influx of user-generated content has been accompanied by advancements in

computer hardware and software.

New disciplines within mathematics and computer science (e.g. data mining, natural language processing, and machine learning) have emerged to study such large amounts of data in order to infer trends and relationships among potential variables. A fascination with the utility of “big data” has bled into many disciplines, including public health. Consider some of the buzzwords: mHealth (12), eHealth (13, 14), infodemiology (15, 16), and infoveillance (16).

Purpose


The primary purposes of this paper are (1) to explore the evolution and complexities associated with disease surveillance, and (2) to consider the utility and implications of these methods for the capacity of organizations to promptly and effectively respond to infectious disease threats.

Additionally, this paper explores emerging syndromic surveillance systems, including an analytical evaluation of a novel global Twitter-based influenza surveillance system to determine how closely Twitter data correlate with traditional weekly influenza reports in several English-speaking countries.

Influenza

Influenza (or the “flu”) is a contagious acute viral infection that primarily impacts the bronchi and throat, and occasionally the lungs, with an incubation period of about 2 days (17). With

such a relatively short time between infection and onset, infected individuals will often quickly

develop symptoms of fever, cough, sore throat, runny/stuffy nose, muscle aches, and general

malaise and fatigue (18).

Although influenza impacts all age groups (19-21), mortality from influenza is more likely among higher-risk population groups, including young children less than 2 years old, the elderly (65+ years), and those who have preexisting chronic illnesses (e.g. chronic lung disease, heart disease, asthma, diabetes, weakened immune systems, and morbid obesity) (22-26).


Disease Burden

Although most people with the flu will recover within a few days to less than 2 weeks,

some will develop life-threatening complications from the flu including pneumonia and

worsening of preexisting conditions (26). Although it is sometimes difficult to identify influenza-related mortalities, it is estimated that seasonal influenza accounts for about 250,000 to 500,000 deaths globally (17). In addition to flu-related deaths, influenza epidemics and pandemics decrease worker productivity and economic output, while generating considerable costs for necessary treatment and prevention interventions each year.

Disease Transmission

Influenza is spread primarily through person-to-person contact. The virus is believed to be spread through droplets or aerosols created when infected individuals cough, sneeze, or talk, which can reach others up to 6 feet away (27). Influenza often spreads efficiently and quickly through villages, cities,

schools, and other areas where human-to-human contact is likely (17).

Seasonal vs. Pandemic Influenza

In temperate areas, influenza typically occurs annually as regional or national epidemics

(28). This yearly emergence of seasonal influenza is due primarily to constant antigenic drift of

influenza viruses (29, 30). In addition to annual or seasonal influenza infections, global flu pandemics occasionally occur when novel influenza A viruses emerge (31). Examples of such global influenza pandemics include the 1918 Spanish influenza, the 1957 Asian influenza, the 1968 Hong Kong influenza, and the 2009 H1N1 influenza (32).

Disease Surveillance


Traditional Methods

Timely and accurate surveillance of seasonal and pandemic flu trends is a crucial tool for global, national, state, and local organizations to better understand key epidemiological and virological aspects of an outbreak. An accurate understanding of these aspects better enables organizations to determine how to allocate key resources (e.g. vaccines and antivirals), how to communicate with the public, and whether restrictions on travel should be implemented. The World Health Organization identifies the primary goal of global influenza surveillance as “to develop a global picture of the event through sharing and analysis of information provided by individual countries” (2).

Traditional influenza surveillance methodologies typically refer to virological

surveillance, or identification of influenza specimen strains in a laboratory setting. In the United States, virological surveillance is primarily reported through FluView, which includes laboratory data from 85 WHO Collaborating Laboratories in the US as well as data from the 60 US laboratories that make up the National Respiratory and Enteric Virus Surveillance System (NREVSS) (33). Additionally, the WHO Global Influenza Surveillance Network (GISN) has

been in use for over 60 years and comprises over 131 National Influenza Centers (NICs) in 105

countries (2).

Syndromic Surveillance

In contrast to traditional methods of virological surveillance, nontraditional syndromic-

based surveillance methods have become more widely implemented and developed over the last

decade due to a greater “atmosphere of concern” after the terrorist attacks of September 11,

2001 (34, 35). Syndromic surveillance systems primarily implement various statistical analyses of

data that relate to individual behavioral patterns that indicate or suggest influenza infection (36).


Such behavioral indicators may include various data points such as healthcare visits, drug

purchases, or work absence.

Syndromic surveillance in the US is primarily conducted via the U.S. Outpatient Influenza-

like Illness Surveillance Network (ILINet). ILINet consists of about 2900 sentinel outpatient

healthcare providers that send weekly reports to the CDC with information on the number of patient visits that meet the case definition of an influenza-like illness (ILI) (37).

States also have their own methods of surveillance, and some are implementing more novel

web-based approaches. For example, the Maryland Department of Health and Mental Hygiene

has recently implemented a new online tracking survey called the Maryland Resident Influenza

Tracking Survey (MRITS) (38). MRITS is intended to complement information from sentinel providers by identifying ILI prevalence through an internet survey of residents.

Early Warning Capability

The practices and aims of virological and syndromic influenza surveillance are perhaps

better understood in terms of medical screening and testing. Syndromic surveillance as a

screening tool aims to provide easily accessible and timely data, which could then be verified

through virological testing. Generally, syndromic surveillance data should be available as close

to real-time as possible, as syndromic surveillance’s primary function is to provide early warning of potential novel outbreaks (36). As with all medical tests, there is a tradeoff that must be made between the specificity and sensitivity of such tests. In general, practitioners decide to what extent their screening methodology should be prone to type I statistical errors (false alarms), a determination made carefully based on the purpose of the surveillance system. Protocols have been developed by the CDC and the Institute of Medicine to assess the

quality and effectiveness of syndromic surveillance systems (39, 40).
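To make the screening analogy concrete, the sketch below computes sensitivity and specificity for a hypothetical syndromic alert threshold; the weekly values, outbreak labels, and thresholds are invented for illustration and do not come from any surveillance system described in this paper.

```python
# Hypothetical illustration of the sensitivity/specificity tradeoff for a
# syndromic alert threshold. All numbers are made up for demonstration.

def confusion_counts(weekly_ili_pct, outbreak_truth, threshold):
    """Flag each week as an 'alert' if the ILI percentage meets the threshold,
    then tally the confusion matrix against known outbreak weeks."""
    tp = fp = tn = fn = 0
    for pct, is_outbreak in zip(weekly_ili_pct, outbreak_truth):
        alert = pct >= threshold
        if alert and is_outbreak:
            tp += 1
        elif alert and not is_outbreak:
            fp += 1          # type I error: false alarm
        elif not alert and is_outbreak:
            fn += 1          # type II error: missed outbreak week
        else:
            tn += 1
    return tp, fp, tn, fn

# Hypothetical weekly ILI percentages and outbreak labels.
ili = [1.1, 1.3, 2.0, 3.5, 4.2, 3.8, 2.1, 1.4]
truth = [False, False, False, True, True, True, False, False]

for threshold in (1.5, 2.5, 3.5):
    tp, fp, tn, fn = confusion_counts(ili, truth, threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"threshold={threshold}: sensitivity={sensitivity:.2f}, "
          f"specificity={specificity:.2f}")
```

Lowering the alert threshold catches more outbreak weeks (fewer type II errors) at the cost of more false alarms, which is the tradeoff described above.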

Post 9/11 and Bioterrorism. The terrorist attacks of September 11, 2001 undoubtedly

mark the beginning of the emergence of new policies and actions by governments to protect

against potential vulnerabilities for terrorist attacks. In particular, the threat of bioterrorism led

the United States government to further the design and implementation of automated, electronic

surveillance systems that could provide early warnings of emerging epidemics. This new post

9/11 environment placed a “premium on timeliness,” which led to an “emphasis on automation of the full cycle of surveillance” (41).

The US military has maintained a laboratory-based global influenza surveillance program

since 1976 initially under a program of the United States Air Force (42). In 1997 this program

was expanded to include all US military services under the Department of Defense Global

Emerging Infections Surveillance and Response System (GEIS), which monitors various

infectious disease outbreaks throughout the world (43). In February 2012, the Department of

Defense (DOD) reorganized the GEIS under the newly established Armed Forces Health

Surveillance Center (AFHSC), which is currently designated as the primary source for all DOD-

level surveillance data (44). The GEIS is a member of the WHO Global Outbreak Alert and Response Network (GOARN), which comprises various international institutions through which weekly privileged public health-related alerts can be shared (45).

Novel Syndromic Systems

In 2001, the DOD began implementation of its first version of the Electronic Surveillance System for the Early Notification of Community-based Epidemics (ESSENCE I), which provided syndromic surveillance for active-duty US military personnel (46). Because of increased

concern for bioterrorism, an automated surveillance reporting mechanism was implemented with

ESSENCE I in 2002 (35).

Since ESSENCE I, multiple newer versions of the syndromic surveillance system have been developed and tailored to various civilian areas and military facilities (47). ESSENCE II is a system developed by the Johns Hopkins University Applied Physics Laboratory in conjunction with the Maryland Department of Health and Mental Hygiene, the District of Columbia Department of Health, and the Virginia Department of Health (47). ESSENCE II began incorporating secondary data sources related to disease transmission into more innovative statistical and computational models.

GEIS began researching and developing additional novel syndromic surveillance methods

with ESSENCE II. As a part of this development, the Bio-ALIRT Biosurveillance Detection

Algorithm was developed by the Defense Advanced Research Projects Agency (DARPA). Other

contractors involved include the Johns Hopkins University Applied Physics Laboratory, the Walter Reed Army Institute of Research, the University of Pittsburgh/Carnegie Mellon University, General Dynamics Advanced Information Systems, the Stanford University Medical Informatics group, the Potomac Institute, the CDC, and the IBM Corporation (48).

Additionally, GEIS introduced BioWar into its systems, a computer simulation developed by Carnegie Mellon University that models disease transmission using social networks, communication media, weather models, and other nontraditional data sources (49).

Web-based Surveillance. In addition to the collection of disease surveillance data from

sentinel outpatient providers, there has been increased interest in the use of social media and


other Web-based sources for disease surveillance. Traditional disease surveillance methods rely

on the aggregation of data sourced from actual clinical observations, which is comparatively

time-consuming and expensive (50). Alternatively, novel web-based surveillance platforms have the potential to provide more cost-effective, real-time surveillance and aggregation of disease-specific data from around the world.

Notable Projects. HealthMap is an internet-based service that collects and combines

disease outbreak data from various information sources. Since 2006, the HealthMap project has attempted to automate the process of querying, filtering, integrating, and visualizing various web-based reports of disease outbreaks (10). Examples of the project’s sources for aggregation include news reports (via the Google News aggregation service), official alerts and announcements (e.g. WHO, CDC), and “expert-curated” accounts that report disease outbreaks via the ProMED-mail service (10).

Flu Near You uses elements of crowd-sourcing to estimate and communicate local ILI

activity. Flu Near You was created as a partnership between HealthMap at Boston Children’s

Hospital, the American Public Health Association (APHA), and the Skoll Global Threats Fund

(skollglobalthreats.org). Flu Near You estimates local influenza infection rates through self-

reported symptomatic data from its users. Users must be at least 13 years old and reside in either the United States or Canada. Additionally, Flu Near You aggregates and visualizes

regional ILI activity from user self-reported data, CDC weekly flu activity reports, and Google

Flu Trends ILI detection.

Twitter. Social media platforms like Twitter have enabled people to share concise

messages about their thoughts, opinions, and feelings that often relate to their personal lives.

Twitter is a prevalent microblogging service where users can post short status updates (or tweets) of no more than 140 characters. On Twitter, users are able to follow other users, which enables followers to receive notifications and status updates from those whom they follow. Because tweets are public, a user who is followed by another user does not have to reciprocate, or confirm, the connection, as is required by other social media services like Facebook. Jansen et al. characterize microblogging platforms like Twitter by three characteristics: (1) short messages or status updates, (2) instantaneous or real-time publication of messages, and (3) a subscription component for users to receive status updates of other users (51). A fourth characteristic should be added: the dissemination of messages through various devices, platforms, and applications. This cross-platform interoperability is enabled primarily by the provision of an application programming interface (API), which enables computer programmers to develop software that can easily exchange data directly with a service provider like Twitter. The Twitter APIs not only promote the adoption of the Twitter platform on other platforms like mobile devices and other web services, but also enable researchers and other third parties to easily access status updates and other metadata on a large scale. The public accessibility of tweets, coupled with the Twitter APIs, enables researchers to quickly and cheaply collect large samples of conversational data.
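As a rough illustration of this kind of API-based collection, the sketch below uses the third-party tweepy library (3.x) against the Twitter streaming API to collect tweets that mention a few flu-related keywords. The credentials, keyword list, and output file are placeholders, and this is not the collection pipeline used by any study cited here.

```python
# Minimal sketch of keyword-filtered tweet collection via the Twitter
# streaming API, assuming the tweepy 3.x library and valid API credentials.
# The keywords and file name are illustrative placeholders.
import json
import tweepy

CONSUMER_KEY = "..."        # placeholder credentials
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."

class FluStreamListener(tweepy.StreamListener):
    """Write each matching status to a newline-delimited JSON file."""
    def on_status(self, status):
        with open("flu_tweets.jsonl", "a") as f:
            f.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        # Stop streaming if Twitter signals rate limiting (HTTP 420).
        return status_code != 420

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
stream = tweepy.Stream(auth=auth, listener=FluStreamListener())

# Track a few illustrative health-related terms.
stream.filter(track=["flu", "influenza", "fever", "sore throat"])
```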

The information contained in tweets often provides relevant real-time insight and

information that relates to the larger contexts beyond the individual (50). Researchers have used

Twitter data in a variety of applications to infer global trends and events, including political opinion (52) and earthquake monitoring (53). Social media sources like Twitter often contain

geospatial components that could prove useful in tracking health conditions in various locales

where outpatient provider data may be sparse.


Locale-specific information about users is often tracked and calculated by services for

various marketing and advertising objectives regardless of user input or awareness. At the most

granular level, the geographic details of social media users can be directly provided by the users in the form of latitudes and longitudes from GPS-equipped mobile devices. Users may be unaware that they are directly submitting or publishing data that contain geospatial elements. Additionally, users may perceive that sharing location gives them added value from services (e.g. traffic/commute times, locations of nearby events). Users may also provide geospatial data in order to help others (examples include Waze for traffic or Mr. Checkpoint for locations of DUI checkpoints). Social media services often approximate users’ locations through a combination of various metadata even if the location is not provided by the mobile device. Geospatial data can be gleaned from users by approximating location through users’ Internet Protocol addresses, phone numbers, and search

queries.

Analysis of Twitter Data

Twitter Surveillance System

Broniatowski et al. have developed and implemented an automated software platform to

query, filter, and integrate Twitter conversational data that estimates ILI prevalence globally

(54). Broniatowski et al. demonstrated that their platform’s estimates of ILI during the 2012-

2013 flu season were strongly correlated with CDC weekly surveillance data reports for the

United States (r = 0.93, p < 0.001) (54). Although Broniatowski et al. have reported how well

their Twitter-based surveillance method performs within the United States, it is currently

unknown how well their methodology will perform in other countries and locales. Furthermore,

it is unknown how the Twitter platform will perform with non-English conversational data.


Research Objectives

The primary objective of this study is to analyze the performance of the Broniatowski et al. platform for several different English-speaking countries. For this study we included the United States as a baseline, and we selected four additional English-speaking countries within the United Kingdom (UK) to assess the Twitter platform’s performance outside the United States. We included only English-speaking countries so that we could better assess the impact of location on performance. This study extends the previous performance estimates by including weekly estimates over three flu seasons between August 15, 2011 and January 5, 2014. The following research questions

are addressed:

(1) How do the global weekly ILI estimates from the proposed Twitter surveillance

platform correlate with national influenza-like illness (ILI) incidence estimates as reported

weekly by the government surveillance networks in the United States, England, Scotland,

Wales, and Northern Ireland?

(2) How does performance of ILI estimation via Tweets differ by country and year?

Methods

The proposed platform implements a supervised classification model that determines

whether a Tweet indicates infection rather than just concern or discussion of influenza

symptoms. Broniatowski et al. couple this classification model with a specialized geolocation

system to infer influenza prevalence parameters. This inference platform has been described in detail previously (54, 55).
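The authors' actual classification model is described in (54, 55); the sketch below is only a generic stand-in showing how a supervised classifier that separates self-reported infection from mere flu discussion could be trained with scikit-learn. The example tweets and labels are invented for illustration.

```python
# Generic sketch of a supervised tweet classifier (infection vs. awareness),
# assuming scikit-learn. The labeled examples are invented, and this is NOT
# the model of Broniatowski et al.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labeled training tweets: 1 = self-reported infection, 0 = mere discussion.
tweets = [
    "ugh home sick with the flu, fever all night",
    "got the flu, can't stop coughing",
    "flu season is coming, get your flu shot",
    "the news says flu cases are rising this year",
]
labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(tweets, labels)

# Classify new, unseen tweets.
print(model.predict(["i think i caught the flu, feel awful",
                     "interesting article about flu surveillance"]))
```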

Data Collection

Broniatowski et al. use the Twitter API to access and download real-time “streams” of public conversational data. Data were downloaded via two separate streams: a “general” stream that is a random sample of all Tweets and a “health” stream that only downloads Tweets mentioning predefined health-related terms (54). The general stream is a random representative sample of 1% of all Tweets, while the health stream represents 1% of all Tweets that include the health terms. This stratified sampling method makes it possible to normalize influenza prevalence estimation (the strata are Tweets that mention health terms and all Tweets). Tweets that indicate ILI infection are identified and coded by a complex, automated computer algorithm. Additionally, Broniatowski et al. used human readers to code a small portion of Tweets as a means of validating and cross-checking the computational method. After filtering both streams by location, the influenza prevalence among the health-stream Tweets is then normalized by the proportion of all Tweets from that location in the general stream. For this study, the totals of all Tweets collected were not available; instead, daily ILI estimates generated by the Broniatowski et al. procedure were provided.
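To illustrate the normalization idea described above (not the authors' exact formula), a location's ILI signal can be expressed as the number of infection-coded health-stream tweets divided by that location's volume in the general-stream sample. Everything in the sketch below, including the counts and location names, is a simplified assumption.

```python
# Simplified sketch of normalizing health-stream ILI tweet counts by each
# location's volume in the general (random 1%) stream. Counts are invented,
# and this approximates the idea rather than the published procedure.

def ili_signal(ili_tweets_by_loc, general_tweets_by_loc):
    """Return infection-coded tweets per general-stream tweet for each location."""
    signal = {}
    for loc, ili_count in ili_tweets_by_loc.items():
        total = general_tweets_by_loc.get(loc, 0)
        signal[loc] = ili_count / total if total else float("nan")
    return signal

# Hypothetical daily counts after geolocation filtering.
ili_counts = {"United States": 420, "England": 95, "Scotland": 12}
general_counts = {"United States": 180_000, "England": 40_000, "Scotland": 6_000}

for loc, value in ili_signal(ili_counts, general_counts).items():
    print(f"{loc}: {value:.5f} ILI-coded tweets per sampled tweet")
```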

Government-reported weekly ILI estimates for the various countries were collected electronically, primarily through each country’s official online data portal. All data were publicly available and contained no personal identifiers and were therefore not subject to Institutional Review Board (IRB) consideration according to the National Human Subjects Protection Advisory Committee (NHRPAC) (56) and the Johns Hopkins University IRB (57). All data retrieved were regional and country estimates that contained no individual-specific information. Data were accessed during January 2014. For the United States, online archives of the CDC’s FluView were accessed, which is a weekly report prepared by the CDC’s Influenza Division using data from ILINet (58). ILI estimates through ILINet are determined each week by the percentage of outpatient visits that are due to an influenza-like illness. ILI is symptomatically defined by ILINet as a patient having a fever (100 degrees Fahrenheit or more) and cough and/or sore throat (37). I retrieved and stored weekly ILI estimates from week 35 of 2009 through week 2 of 2014.

Weekly reporting periods for the United States are from Sunday through Saturday, whereas the ISO-8601 week is Monday through Sunday. Because the UK reports conform to the ISO Monday-Sunday week, I grouped daily Twitter estimates into weeks according to the standard used by each set of government reports. This caveat is important, especially if one tries to compare weekly ILI reports and predictions between countries. The weekly periods used in the analysis and results are based on the standard used by each specific country: weeks for both US government and Twitter estimates run from Sunday through Saturday, whereas weeks for the other countries run from Monday through Sunday. Because government-reported data provide only weekly totals of ILI, it was not possible to standardize data by time across countries.
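The sketch below shows one way daily Twitter estimates could be grouped into country-appropriate weeks with Python's standard library: ISO (Monday-start) weeks for the UK countries and Sunday-start weeks for the United States. Aggregating by the mean is an illustrative assumption, not a rule taken from the study.

```python
# Sketch of grouping daily ILI estimates into reporting weeks:
# ISO Monday-start weeks (UK) vs. Sunday-start weeks (US).
# Aggregating by mean is an illustrative choice, not the study's rule.
from collections import defaultdict
from datetime import date, timedelta

def week_key(day, sunday_start=False):
    """Return a (year, week) label for the reporting week containing `day`."""
    if sunday_start:
        # Shift forward one day so Sunday falls in the same ISO week as the
        # following Monday, producing Sunday-Saturday weeks.
        day = day + timedelta(days=1)
    iso = day.isocalendar()
    return (iso[0], iso[1])

def weekly_means(daily_estimates, sunday_start=False):
    """daily_estimates: dict mapping datetime.date -> daily ILI estimate."""
    buckets = defaultdict(list)
    for day, value in daily_estimates.items():
        buckets[week_key(day, sunday_start)].append(value)
    return {week: sum(vals) / len(vals) for week, vals in buckets.items()}

# Hypothetical daily estimates spanning a week boundary.
daily = {date(2013, 1, 5): 2.1,   # Saturday
         date(2013, 1, 6): 2.4,   # Sunday
         date(2013, 1, 7): 2.6}   # Monday

print(weekly_means(daily, sunday_start=True))   # US-style weeks
print(weekly_means(daily, sunday_start=False))  # ISO weeks (UK)
```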

For the United Kingdom, I collected weekly ILI estimates for each country from the Public Health England (PHE) influenza surveillance reports from week 40 of 2010 through week 7 of 2014. Weekly ILI estimates from PHE are determined from clinical data reported by general practitioners (GPs). In the UK, GP clinical data are provided weekly to PHE by the various GP networks for each country. For England, these data come from the Royal College of General Practitioners (RCGP), whose weekly returns service has been run since 1966 (59). Each country within the UK uses a slightly different schema for GP-based surveillance, which creates additional challenges in normalizing data between countries. For example, the RCGP in England and NHS Wales include only primary or first-time consultations, while Health Protection (HP) Scotland includes repeat consultations (59-61). Because we cannot meaningfully differentiate between first visits and follow-up visits in Scotland, we cannot normalize the data with other countries in the UK. Additionally, PHE notes that health-seeking behaviors differ among populations; specifically, according to reports from the Northern Ireland Department of Health, Social Services and Public Safety (PHA Northern Ireland), those in Northern Ireland go to a GP more often than those in England (59, 62). Furthermore, the various national surveillance systems define flu activity differently: the RCGP in England and HP Scotland use ILI, NPHS Wales uses influenza, and PHA Northern Ireland combines influenza and ILI (59).

Results

Overall Correlation. For the entire period October 2011 through December 2013, the correlation of Twitter and government ILI estimates in the United States was 0.84 (p < 0.001). Twitter ILI correlations were markedly lower in the UK: 0.41 for England (p < 0.001), 0.37 for Scotland (p < 0.001), 0.35 for Wales (p = 0.001), and 0.36 for Northern Ireland (p < 0.001).
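As a rough indication of how such coefficients can be computed once the weekly Twitter and government series are aligned by reporting week, the sketch below uses scipy.stats.pearsonr on two short invented series; the values are placeholders, not study data.

```python
# Sketch of computing a Pearson correlation between aligned weekly series,
# assuming SciPy. The two series below are invented placeholders.
from scipy.stats import pearsonr

# Weekly ILI estimates aligned by reporting week (illustrative values only).
government_ili = [1.2, 1.5, 2.3, 3.8, 4.1, 3.0, 2.2, 1.6]
twitter_ili    = [1.4, 1.7, 2.6, 3.5, 4.4, 3.3, 2.5, 1.9]

r, p_value = pearsonr(government_ili, twitter_ili)
print(f"r = {r:.2f}, p = {p_value:.4f}")
```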

Yearly. The Twitter algorithm performed better each year in the US, with a correlation of 0.71 during 2011-2012 (p < 0.001), 0.89 in 2012-2013 (p < 0.001), and 0.93 in 2013-2014 (p = 0.001). The Twitter algorithm performed best during the 2013-2014 time period in the United States compared with all other years and countries (r = 0.93).

Among all other countries, the Twitter algorithm performed worse with each subsequent time period. Additionally, for the 2013-2014 time period Pearson’s correlation coefficient was not statistically significant at α = 0.05 for any country besides the United States. Outside of the United States, the Twitter algorithm performed best in Scotland during 2011-2012 (r = 0.59).

Time Series Graphical Visualization. Both weekly Twitter and government estimates are plotted over the three periods used previously as well as over the entire 2011-2014 dataset. A visualization of the entire series from 2011 to 2014 encompasses three separate flu seasons and gives a visual overview of the performance of both surveillance methods over time.

Figures 1 and 2 cover the entire (combined) time period. Figure 1 plots estimates from both Twitter and government sources for the United States, while Figure 2 contains similar plots for each country of the United Kingdom in one figure. The combined overview helps describe seasonal trends and lags between surveillance methods.

In the United States, Twitter ILI estimates are generally higher than the CDC estimates

for both 2010 and 2013 flu seasons. During times of lower flu activity (off season), Twitter

estimates are higher in the United States. In the UK, Twitter estimates are generally higher than government estimates during the off seasons as well. Also, during the 2011-2012 and 2013-2014 flu seasons, Twitter estimates are higher in UK countries.

As is expected with these data, all of the time series exhibit seasonality; however, the strength of yearly seasonality does vary among countries and time periods (especially for UK countries in 2011-2012 and 2013-2014). During the 2011-2012 and 2013-2014 flu seasons in the UK, Twitter estimates are considerably higher and more pronounced than government reports. Government estimates lag behind Twitter estimates for all countries in the UK during all time periods.

In the United States, however, there is much less evidence of lag in 2011-2012 and 2013-2014, and there appears to be little or no lag during 2012-2013. For the 2012-2013 season in the United States, both CDC and Twitter estimates appear closely correlated. Conversely, Twitter and government estimates for countries in the UK indicate a significant lag between surveillance methods of at least 10 weeks. This lag is most evident in the 2012-2013 flu season, which was the season with the most influenza activity among all countries.


Discussion

The Twitter ILI surveillance method performed much better in the US in terms of its correlation with CDC weekly reports. It is expected that the Twitter estimates would perform better against CDC data in the US, primarily because the computational model and methods were developed and trained to predict CDC ILI. The CDC ILI estimates differ from those of other countries: as mentioned previously, the ILI case definition is slightly different, and the CDC does not use the ISO week reporting period like most other countries. The lack of uniformity in data collection, evaluation, and distribution among the various countries is problematic for implementing and assessing infectious disease activity at a global level; for example, the ILI case definition differs slightly between the CDC, the WHO, and the UK. It is particularly concerning, however, that the Twitter method generally overestimates CDC reports, especially during off seasons and during times of increasing ILI rates at the start of an outbreak. It is particularly challenging to develop a computational method that can differentiate between tweets from infected individuals and those who are merely discussing infection (i.e. people talking about flu activity who are not infected). It makes sense that chatter about flu would be higher during times of increased infection rates.

The comparatively low correlation among UK countries demonstrates the difficulty and challenges of implementing and assessing novel computational instruments such as the Twitter-based method used for this study. Currently, various countries and organizations use differing criteria and methodologies for disease surveillance. Because of the differing collection and coding methods of these countries’ national health organizations, it is difficult to estimate and compare activity between countries. Therefore, special attention is drawn to those areas where comparisons are rendered ineffective or impossible to make because of this challenge.


Limitations. While this study demonstrates compelling correlations between government-reported and socially-generated health data, it is not without limitations. Additional analytic methods should be implemented in future studies to provide further assessments of the time series data. Seasonality and lag should be assessed and modeled using more advanced statistical methods like the Box-Jenkins approach, which fits autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) models to time-series data (66). Additionally, autocorrelation and partial autocorrelation plots could be used to assess stationarity and seasonality and to identify the indicated model(s) for the data. While such statistical tools are interesting, these analyses are beyond the scope of this study. Additionally, this dataset covers a limited time period overall and includes only partial data for the 2013-2014 season, because the data were accessed in January 2014. Currently, it is difficult to access and analyze data from various countries; for example, since January 2014 the UK has changed the format and frequency of its influenza reports (59). Assessing lag is problematic because of potential confounders. Factors that may influence lag include reporting methods and turnaround times, including whether locales use internet-based data submission tools. Furthermore, the ILI estimates provided by governments are adjusted internally before they are reported; the CDC will even retrospectively change previously reported estimates if new laboratory data become available (58).
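For future work of the kind described above, seasonality and lag could be examined with standard time-series tools. The sketch below assumes the statsmodels, pandas, and matplotlib libraries and a synthetic weekly series; it is only an outline of the Box-Jenkins style workflow mentioned, not an analysis performed in this study.

```python
# Outline of a Box-Jenkins style check on a weekly ILI series, assuming
# statsmodels and pandas. The series here is synthetic, for illustration only.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

# Synthetic weekly series with yearly (52-week) seasonality plus noise.
rng = np.random.default_rng(0)
weeks = pd.date_range("2011-10-03", periods=156, freq="W-MON")
ili = 2.0 + 1.5 * np.sin(2 * np.pi * np.arange(156) / 52) + rng.normal(0, 0.2, 156)
series = pd.Series(ili, index=weeks)

# Autocorrelation and partial autocorrelation plots to assess seasonality
# and suggest candidate model orders.
plot_acf(series, lags=60)
plot_pacf(series, lags=60)
plt.show()

# Fit a simple ARIMA model; the (2, 0, 1) order is an arbitrary example.
fit = ARIMA(series, order=(2, 0, 1)).fit()
print(fit.summary())
```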

Influence of Social Factors. There are also significant social, cultural, economic, and

even language considerations that should be assessed among the populations from which the

various surveillance methods sample. One must consider the profiles or potential biases of the

populations sampled by both government and Twitter methods. With each surveillance method,

one should consider which segments of the population are misrepresented. Which groups of


people are these surveillance methods not detecting? How do rates of type II errors (false negatives, or failures to detect infection) vary among population subgroups

for a given surveillance instrument? The ideal surveillance method should have constant type II

errors across all population subgroups.

From a public health intervention standpoint, it is imperative that we better determine the

extent to which these surveillance methods potentially misrepresent key population groups. This

issue is compounded by the potential reality that underrepresented population groups may be at

greater risk for infection and subsequent morbidity and mortality. An accurate assessment of

sampling biases and population characteristics would enable us to better control for these

variations and ultimately ensure that interventional responses are prioritized and delivered to

population groups with the greatest risk.

Government Syndromic Surveillance. For example, people in Northern Ireland are more likely to go to a general practitioner when they are sick with symptoms of the flu. Because the UK’s syndromic surveillance methods rely upon ILI reports from Northern Ireland general practitioners, the government’s ILI estimates for this area tend to be higher than in other areas

where individuals do not seek care when they have ILI symptoms (59). Because most

government syndromic surveillance methods rely upon reports by sentinel physician networks,

any factors that increase barriers to care would likely increase the likelihood of type II error

among population sub-groups affected by such factors. Those who live in rural areas (or areas

where it is difficult to seek medical care) often have greater barriers to care (76), and they may

be more likely to treat ILI symptoms without visiting a physician. Similarly, individuals without

health insurance and lower socio-economic status face greater barriers to care (77), and they may

be less likely to visit a doctor when experiencing ILI symptoms. Furthermore, there is evidence


that people with chronic mental disorders experience significant barriers to primary care access,

which could indicate these individuals would be more likely to be excluded from government-

based surveillance methods (63). Conversely, individuals with pre-existing medical conditions who receive more frequent medical care (most likely older segments of the population) may in fact be more likely to go to their doctor when sick. In this particular example, these individuals would likely be overrepresented through government syndromic surveillance; however, this group is also likely at greater risk for infection and for greater mortality and morbidity. In this case, this

population subgroup may benefit more from current surveillance methods. A counterexample

would be young children (especially younger than 2 years) and pregnant women with low

socioeconomic status who have significant barriers to care. Migrant worker families with young

children and pregnant mothers may be less likely to go to a doctor when sick due to a

combination of potential factors (including legal issues, access to transportation, familiarity with

the healthcare system, language barriers, and other socioeconomic and cultural factors) (74, 75).

Like the elderly individuals in the previous example, these individuals have a high risk for influenza infection and complications; however, the latter group would more likely be underrepresented by government surveillance.

Future work is needed to integrate other existing datasets that include population-based

data that may be relevant to healthcare access and other risk factors for influenza infection and

complications. Additional research should explore methods that can control for such factors, so

we can more accurately identify disease within the populations and sub-groups from which

various surveillance methods sample.

Twitter Surveillance. In addition to the challenges with traditional syndromic influenza

surveillance methods, it is difficult to describe and characterize the users who tweet about their own health statuses. While the sensitivity, or recall, of government methods is likely influenced by access to care, the precision of Twitter methods relies on factors that influence the likelihood of individuals posting tweets that include individual syndromic information. Regarding Twitter surveillance, it is important to consider how various factors related to Twitter usage relate to various population groups and sub-groups globally. In addition to the socio-economic factors mentioned regarding government surveillance, additional factors should be considered, like internet

access, social media usage, online communication preferences, and factors that describe types of

people who share personal syndromic information via Twitter.

It is important to determine how various groups of people are misrepresented with Twitter

surveillance. Lack of internet availability and access among less developed countries is certainly

a concern regarding the ability of a Twitter surveillance method to assess infectious disease

prevalence. Geopolitical factors influence internet access and access to specific web services

including Twitter. Individuals living in some countries may be hesitant to post personal

information on the Web due to political and social pressures. Activity of Twitter users also varies by continent: North America has the greatest number of active users, while Africa has the fewest (69). Additionally, there are differences in who uses Twitter based on simple demographic indicators like age, gender, and race/ethnicity.

There is evidence that Twitter users (in the United States) make up a highly non-uniform sample of the US population with regard to geography, gender, and race/ethnicity (64). One study reports that Twitter users in the US are more likely to be men from urban areas (64). In the UK, younger people between the ages of 18 and 24 are much more likely to use Twitter than other age groups, while individuals 65 years and older are adopting social media platforms more frequently, especially Facebook (65). In reality, the demographics and characteristics of Twitter users are changing, and it is difficult to access demographic information about Twitter users. Current methods of determining the demographics of Twitter users rely on traditional survey methods using telephone interviews (68). Although Twitter stores detailed information about its users internally, we are limited to the information the company provides through the Twitter API (69). Furthermore, the long-term availability of Twitter data is not known. Although Twitter currently provides access to information, there is a possibility the company could move towards a more private model like Facebook.

There are likely many other factors that influence the likelihood that a Twitter user will mention symptomatology. There is a need to describe and better understand Twitter users who post

tweets about their own health. For example, one would expect women to be more likely to tweet

about their health, yet in the US, Twitter users are more likely to be younger men from urban

areas. There is evidence men and women use social media sites like Twitter differently (71).

There is evidence that online users discuss and share personal health-related topics, and such

users receive greater social support, emotional support, information support, and sometimes

tangible benefit (72, 73). However, additional research is needed to assess what factors influence

how people discuss personal health symptoms specifically on the Twitter platform. Future

studies and research should seek to control for various characteristics and demographics of

Twitter users, although the Twitter platform makes it difficult (if not impossible) to accurately

describe the characteristics of all Twitter users. The nature of the Twitter platform makes it challenging to normalize Twitter data because of the difficulty of characterizing Twitter users. These issues certainly challenge the

generalizability of Twitter as a stand-alone biosurveillance instrument. From a practical


perspective, public health researchers and practitioners should continue to develop methods that

combine traditional syndromic surveillance methods with other novel syndromic methods (67).

For this particular analysis, it is assumed that the government weekly reports are the most

accurate data sources that are readily accessible to the public. Because of this assumption, these

official reports are used as the standard to which the Twitter estimates are compared. There is the

possibility that the novel Twitter surveillance method is more indicative of actual influenza

infection than government reports (70), although this is difficult (if not impossible) to assess on a

global-population level. Government reports may provide a sufficient indication of influenza outbreaks within their specific countries; however, because of the varying reporting methods it is difficult to track infection trends between countries. The challenges of global infectious disease surveillance are highlighted by the WHO’s inability to respond effectively to the 2009 H1N1 pandemic. A novel passive surveillance method (like the Twitter method) is particularly

promising because it implements a single method across all areas, which bypasses the issue of

various governments’ reporting methods and practices. Further research should address how to

integrate both traditional and novel methods of infectious disease surveillance into a single

instrument that can deliver rapid report packages to key decision makers at global, national,

regional, and community levels. Additional work should focus on creating such a product that

minimizes the potential of information overload, while retaining key information needed for

decision making.

Reflection

This research practicum has been extremely challenging, to say the least, but it has also been rewarding and beneficial for me. During the course of my research, I had

to learn and develop new analytic skills to be able to work with time series data of this scale. I


was also able to develop computer and statistical programming skills in order to acquire,

organize, clean, analyze, and visualize data. I was able to use the Python programming

language to develop programs to automate the process of data collection and preparation.

In addition to analytic and technical challenges and growth, I have learned a tremendous

amount about disease surveillance in general and how disease surveillance is changing and

evolving today. It is very exciting to work with large datasets and to assess health conditions

and outcomes on a large population-based scale. In addition to working with large-scale data, it

was even more interesting for me to research and learn how individual stakeholders and groups

integrate such large amounts of data into their specific decisions and interventions at local

levels.

The analysis of the Twitter data highlights very well the difficulty of dealing with large time-series datasets. It has been difficult for me to fully understand these various datasets,

especially regarding how to make specific interventional decisions at various micro and macro

levels. It was honestly quite reassuring to learn that organizations like the WHO have also

struggled to work through this process in order to make timely and appropriate decisions to

save lives.


Table I. Correlation of Twitter Weekly ILI Estimates by Country and Calendar Year with National Surveillance ILI Reports

Pearson's product-moment correlation coefficient, r, by time period:

Country        2011-2014 (Combined)      2011-2012                2012-2013                2013-2014
               10/3/2011 - 12/29/2013    10/3/2011 - 9/30/2012    10/1/2012 - 9/29/2013    9/30/2013 - 12/29/2013
United States  0.84 (p < 0.001)          0.71 (p < 0.001)         0.89 (p < 0.001)         0.93 (p = 0.001)
England        0.41 (p < 0.001)          0.51 (p = 0.0002)        0.48 (p = 0.0004)        -0.18 (p = 0.554)*
Scotland       0.37 (p < 0.001)          0.59 (p < 0.001)         0.54 (p < 0.001)         -0.38 (p = 0.196)*
Wales          0.35 (p = 0.001)          0.54 (p < 0.001)         0.37 (p = 0.0069)        0.33 (p = 0.271)*
N. Ireland     0.36 (p < 0.001)          0.46 (p = 0.0006)        0.44 (p = 0.0014)        -0.35 (p = 0.267)*

Note: Correlation coefficients marked with an asterisk were not statistically significant at the α = 0.05 level.


Figures

Figure 1-1. United States, 2011-2014 influenza: weekly ILI estimates from government reports (ILI prevalence, % of outpatient visits) and from Twitter (influenza prevalence, %) plotted by week.

Figure 1-2. United Kingdom, 2011-2014 influenza: weekly government and Twitter ILI estimates plotted by week for England, Scotland, Wales, and N. Ireland.


Appendix

Appendix figures: weekly government and Twitter ILI estimates plotted by week for Wales (2011-2014 influenza) and for the United States during the 2011-2012, 2012-2013, and 2013-2014 influenza seasons, followed by a combined plot of government-reported percentage of visits for ILI for the USA, England, Scotland, Wales, and N. Ireland (2012-2014).


References

1. World Health Organization. New influenza A (H1N1) virus infections: Global surveillance

summary, May 2009. 2009.

2. Briand S, Mounts A, Chamberland M. Challenges of global surveillance during an influenza

pandemic. Public Health. 2011;125(5):247-56.

3. World Health Organization. CDC protocol of realtime RTPCR for swine influenza A (H1N1).

2009.

4. Lipsitch M, Riley S, Cauchemez S, Ghani AC, Ferguson NM. Managing and reducing

uncertainty in an emerging influenza pandemic. N Engl J Med. 2009;361(2):112-5.

5. Milord JT, Perry RP. A methodological study of overload. J Gen Psychol. 1977;97(1):131-7.

6. Speier C, Valacich JS, Vessey I. The influence of task interruption on individual decision

making: An information overload perspective. Decision Sciences. 1999;30(2):337-60.

7. Shenk D. Information overload, concept of. Encyclopedia of International Media and

Communications. 2003;2.

8. Hiltz SR, Turoff M. Structuring computer-mediated communication systems to avoid

information overload. Commun ACM. 1985;28(7):680-9.

9. Berghel H. Cyberspace 2000: Dealing with information overload. Commun ACM.

1997;40(2):19-24.

10. Freifeld CC, Mandl KD, Reis BY, Brownstein JS. HealthMap: Global infectious disease

monitoring through automated classification and visualization of internet media reports. J Am

Med Inform Assoc. 2008 Mar-Apr;15(2):150-7.

11. World Bank. Internet users (per 100 people) [Internet]. World DataBank: World Development Indicators database; 2014. Data retrieved June 3, 2014.

12. Kay M. mHealth: New horizons for health through mobile technologies. World Health

Organization. 2011.

13. Black AD, Car J, Pagliari C, Anandan C, Cresswell K, Bokun T, et al. The impact of eHealth

on the quality and safety of health care: A systematic overview. PLoS medicine.

2011;8(1):e1000387.

14. Eysenbach G, CONSORT-EHEALTH Group. CONSORT-EHEALTH: Improving and

standardizing evaluation reports of web-based and mobile health interventions. J Med Internet

Res. 2011 Dec 31;13(4):e126.


15. Eysenbach G. Infodemiology: Tracking flu-related searches on the web for syndromic

surveillance. AMIA Annu Symp Proc. 2006:244-8.

16. Eysenbach G. Infodemiology and infoveillance: Framework for an emerging set of public

health informatics methods to analyze search, communication and publication behavior on the

internet. J Med Internet Res. 2009 Mar 27;11(1):e11.

17. World Health Organization. Influenza (seasonal) fact sheet no. 211 [Internet]; March 2014. Available from: http://www.who.int/mediacentre/factsheets/fs211/en/

18. Centers for Disease Control and Prevention. Flu symptoms & severity [Internet]; September 2013. Available from: http://www.cdc.gov/flu/about/disease/symptoms.htm

19. Glezen WP, Greenberg SB, Atmar RL, Piedra PA, Couch RB. Impact of respiratory virus

infections on persons with chronic underlying conditions. JAMA. 2000;283(4):499-505.

20. Glezen WP, Couch RB, MacLean RA, Payne A, Baird JN, Vallbona C, et al. Interpandemic

influenza in the Houston area, 1974–76. N Engl J Med. 1978;298(11):587-92.

21. Monto AS, Kioumehr F. The Tecumseh study of respiratory illness. IX. Occurrence of influenza in the community, 1966-1971. Am J Epidemiol. 1975 Dec;102(6):553-63.

22. Glezen WP. Serious morbidity and mortality associated with influenza epidemics. Epidemiol

Rev. 1982;4:25-44.

23. Monto AS. Influenza: Quantifying morbidity and mortality. Am J Med. 1987;82(6):20-5.

24. Barker WH. Excess pneumonia and influenza associated hospitalization during influenza

epidemics in the united states, 1970-78. Am J Public Health. 1986 Jul;76(7):761-5.

25. Barker WH, Mullooly JP. Impact of epidemic type A influenza in a defined adult population.

Am J Epidemiol. 1980 Dec;112(6):798-811.

26. People at high risk of developing Flu–Related complications [Internet]. . Available from:

http://www.cdc.gov/flu/about/disease/high_risk.htmhttp://findit.library.jhu.edu/resolve?sid=Ref

works&charset=utf-

8&__char_set=utf8&genre=article&aulast=Centers%20for%20Disease%20Control%20and%20

Prevention&volume=2014&issue=4%20June%202014&atitle=People%20at%20High%20Risk%

20of%20Developing%20Flu%E2%80%93Related%20Complications&au=Centers%20for%20D

isease%20Control%20and%20Prevention%20&.

27. How flu spreads [Internet]. . Available from:

http://www.cdc.gov/flu/about/disease/spread.htmhttp://findit.library.jhu.edu/resolve?sid=Refwor

PANDEMIC RESPONSE IN THE ERA OF BIG DATA 33

ks&charset=utf-

8&__char_set=utf8&genre=article&aulast=Centers%20for%20Disease%20Control%20and%20

Prevention&volume=2014&issue=25%20June&atitle=How%20Flu%20Spreads&au=Centers%2

0for%20Disease%20Control%20and%20Prevention%20&.

28. Noble G. Epidemiological and clinical aspects of influenza. In: Beare AS, editor. Basic and Applied Influenza Research; 1982. p. 11-50.
29. Simonsen L, Clarke MJ, Schonberger LB, Arden NH, Cox NJ, Fukuda K. Pandemic versus epidemic influenza mortality: A pattern of changing age distribution. J Infect Dis. 1998 Jul;178(1):53-60.
30. Webster RG, Bean WJ, Gorman OT, Chambers TM, Kawaoka Y. Evolution and ecology of influenza A viruses. Microbiol Rev. 1992 Mar;56(1):152-79.
31. Dolin R. Influenza-interpandemic as well as pandemic disease. N Engl J Med. 2005;353(24):2535.
32. Noble G. Epidemiological and clinical aspects of influenza. 1982.
33. Centers for Disease Control and Prevention. The national respiratory and enteric virus surveillance system (NREVSS) [Internet]; 2014. Available from: http://www.cdc.gov/surveillance/nrevss/
34. Henning KJ. What is syndromic surveillance? Morb Mortal Weekly Rep. 2004:7-11.
35. Marsden-Haug N, Foster VB, Gould PL, Elbert E, Wang H, Pavlin JA. Code-based syndromic surveillance for influenzalike illness by international classification of diseases, ninth revision. Emerg Infect Dis. 2007 Feb;13(2):207-16.
36. Stoto MA, Schonlau M, Mariano LT. Syndromic surveillance: Is it worth the effort? Chance. 2004;17(1):19-24.
37. Centers for Disease Control and Prevention. Overview of influenza surveillance in the United States [Internet]. Available from: http://www.cdc.gov/flu/weekly/overview.htm#Viral
38. Maryland Department of Health and Mental Hygiene. Maryland resident influenza tracking survey [Internet]. Available from: http://flusurvey.dhmh.md.gov/
39. German RR, Lee L, Horan J, Milstein R, Pertowski C, Waller M. Updated guidelines for evaluating public health surveillance systems. MMWR Recomm Rep. 2001;50(RR-13):1-35.
40. Mallon TM. Progress in implementing recommendations in the National Academy of Sciences reports: "Protecting those who serve: Strategies to protect the health of deployed US forces". Mil Med. 2011;176(7S):9-16.
41. Buehler JW, Sonricker A, Paladini M, Soper P, Mostashari F. Syndromic surveillance practice in the United States: Findings from a survey of state, territorial, and selected local health departments. Advances in Disease Surveillance. 2008;6(3):1-20.
42. Canas LC, Lohman K, Pavlin JA, Endy T, Singh DL, Pandey P, et al. The Department of Defense laboratory-based global influenza surveillance system. Mil Med. 2000 Jul;165(7 Suppl 2):52-6.
43. Armed Forces Health Surveillance Center. Global emerging infections surveillance & response system [Internet]. Available from: http://www.afhsc.mil/geis
44. Department of Defense. Comprehensive health surveillance. 2012. Report No.: DoDD 6490.02E.
45. Homeland Security: Improving Public Health Surveillance: Hearing Before the Subcommittee on Government Reform, House of Representatives, 108th Congress, 1st Sess. (2003).
46. Mandl KD, Overhage JM, Wagner MM, Lober WB, Sebastiani P, Mostashari F, et al. Implementing syndromic surveillance: A practical guide informed by the early experience. J Am Med Inform Assoc. 2004 Mar-Apr;11(2):141-50.
47. Lombardo MJ, Burkom H, Elbert ME, Magruder S, Lewis MSH, Loschen MW, et al. A systems overview of the electronic surveillance system for the early notification of community-based epidemics (ESSENCE II). Journal of Urban Health. 2003;80(1):i32-42.
48. Siegrist D, Pavlin J. Bio-ALIRT biosurveillance detection algorithm evaluation. Morb Mortal Weekly Rep. 2004:152-8.
49. Carley KM, Altman N, Kaminsky B, Nave D, Yahja A. BioWar: A city-scale multi-agent network model of weaponized biological attacks. 2004.
50. Dredze M. How social media will change public health. IEEE Intelligent Systems. 2012;27(4):81-4.
51. Jansen BJ, Zhang M, Sobel K, Chowdury A. Twitter power: Tweets as electronic word of mouth. J Am Soc Inf Sci Technol. 2009;60(11):2169-88.
52. O'Connor B, Balasubramanyan R, Routledge BR, Smith NA. From tweets to polls: Linking text sentiment to public opinion time series. ICWSM. 2010;11:122-9.
53. Sakaki T, Okazaki M, Matsuo Y. Earthquake shakes Twitter users: Real-time event detection by social sensors. In: Proceedings of the 19th International Conference on World Wide Web; ACM; 2010.
54. Broniatowski DA, Paul MJ, Dredze M. National and local influenza surveillance through Twitter: An analysis of the 2012-2013 influenza epidemic. PLoS ONE. 2013;8(12):e83672.
55. Paul MJ, Dredze M. A model for mining public health topics from Twitter. HEALTH. 2012;11:16.
56. National Human Subjects Protection Advisory Committee. Recommendations on public use data files. Office for Human Research Protections; 2002.
57. JHSPH Institutional Review Board. IRB office preliminary determinations for MPH and other degree students [Internet]. Available from: http://www.jhsph.edu/offices-and-services/institutional-review-board/student-projects/other-degree-students.html
58. Centers for Disease Control and Prevention Influenza Division. FluView: Weekly U.S. influenza surveillance report [Internet]. Available from: http://www.cdc.gov/flu/weekly/pastreports.htm
59. Public Health England. Sources of UK flu data: Influenza surveillance in the UK [Internet]; 2014. Available from: https://www.gov.uk/sources-of-uk-flu-data-influenza-surveillance-in-the-uk
60. Public Health Wales. Weekly influenza activity in Wales report [Internet]; 2014.
61. Health Protection Scotland. National influenza report [Internet]; 2014. Available from: http://www.hps.scot.nhs.uk/resp/influenzareports.aspx
62. Northern Ireland Department of Health, Social Services and Public Safety. Nidirect [Internet]; 2014. Available from: http://www.dhsspsni.gov.uk/
63. Miller CL, Druss BG, Dombrowski EA, Rosenheck RA. Barriers to primary medical care among patients at a community mental health center. Psychiatric Services. 2003;54(8):1158-60.
64. Mislove A, Lehmann S, Ahn Y, Onnela J, Rosenquist JN. Understanding the demographics of Twitter users. In: Proceedings of the 5th International AAAI Conference on Weblogs and Social Media (ICWSM); 2011.
65. UK seniors choose Facebook [Internet]; 2013. Available from: http://www.emarketer.com/Article/UK-Seniors-Choose-Facebook/1010484
66. Anderson OD. Time series analysis and forecasting: The Box-Jenkins approach. London and Boston: Butterworths; 1976.
67. Wagner MM, Espino J, Tsui FC, Gesteland P, Chapman W, Ivanov O, et al. Syndrome and outbreak detection using chief-complaint data: Experience of the Real-Time Outbreak and Disease Surveillance project. Morb Mortal Weekly Rep. 2004:28-31.
68. Duggan M, Brenner J. The demographics of social media users, 2012. Vol. 14. Washington, DC: Pew Research Center's Internet & American Life Project; 2013.
69. Java A, Song X, Finin T, Tseng B. Why we twitter: Understanding microblogging usage and communities. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis; ACM; 2007. p. 56-65.
70. Aramaki E, Maskawa S, Morita M. Twitter catches the flu: Detecting influenza epidemics using Twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics; 2011. p. 1568-76.
71. Heil B, Piskorski M. New research: Men follow men and nobody tweets [Internet]; 2009 Jun 1. Available from: http://blogs.hbr.org/cs/2009/06/new_twitter_research_men_follo.html
72. Rains SA, Keating DM. The social dimension of blogging about health: Health blogging, social support, and well-being. Communication Monographs. 2011;78(4):511-34.
73. Mo PK, Coulson NS. Exploring the communication of social support within virtual communities: A content analysis of messages posted to an online HIV/AIDS support group. Cyberpsychology & Behavior. 2008;11(3):371-4.
74. Phillips KA, Mayer ML, Aday LA. Barriers to care among racial/ethnic groups under managed care. Health Affairs. 2000;19(4):65-75.
75. Ngo-Metzger Q, Massagli MP, Clarridge BR, Manocchia M, Davis RB, Iezzoni LI, et al. Linguistic and cultural barriers to care. J Gen Intern Med. 2003;18(1):44-52.
76. Heckman TG, Somlai AM, Peters J, Walker J, Otto-Salaj L, Galdabini CA, et al. Barriers to care among persons living with HIV/AIDS in urban and rural areas. AIDS Care. 1998;10(3):365-75.
77. Newacheck PW, Stoddard JJ, Hughes DC, Pearl M. Health insurance and access to primary care for children. N Engl J Med. 1998;338(8):513-9.