A WEB SCRAPER AND ENTITY RESOLVER FOR CONVERTING PUBLIC EPIDEMIC REPORTS INTO LINKED DATA
By
MATTHEW A. DILLER
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
2018
© 2018 Matthew A. Diller
To my Mom
ACKNOWLEDGMENTS
First and foremost, I would like to thank my mother whose support and dedication
to my education has doubtlessly been the biggest contributing factor to my academic
achievements thus far in life. I truly believe that, had you not fostered in me a passion
for learning when I was younger, I probably would not have decided to become a
scientist.
I would also like to thank my advisor and mentor, Bill Hogan, who first introduced
me to the field of biomedical informatics back in 2014. Had it not been for that chance
encounter between you and Mitch, I probably would never have had the opportunity to
dive into this field, which I have grown to love over the last few years. There have been
many times in the last few years that I’ve hit a brick wall of self-doubt only to have it
quashed by your encouragement and praise; for this reason (and many others), you
have truly been a great mentor.
In addition, I would like to thank both Jiang Bian and Amanda Hicks for their
mentorship and assistance, which have been invaluable to me throughout this journey.
In addition to having benefited from your wisdom, I have also learned from you the
value of mentorship to a student and will try to emulate it for any students that I get to
mentor in the future.
Finally, I would like to thank my best friend, Todd Sahagian, for his support and
his assistance with the validation step of corpus annotation. I couldn’t have done it
without you, man.
TABLE OF CONTENTS

                                                                            page

ACKNOWLEDGMENTS ...................................................................................................... 4
LIST OF TABLES ................................................................................................................ 6
LIST OF FIGURES .............................................................................................................. 7
LIST OF ABBREVIATIONS ................................................................................................. 8
ABSTRACT ........................................................................................................................ 10

CHAPTER

1 INTRODUCTION ........................................................................................................ 12

2 METHODS .................................................................................................................. 17
    Statement of Purpose ............................................................................................. 17
    Data Source ............................................................................................................ 17
    Components ............................................................................................................ 18
        Web Scraper ...................................................................................................... 18
        Named-entity Recognition ................................................................................. 19
        Entity Resolution ............................................................................................... 31

3 RESULTS .................................................................................................................... 43
    Web Scraper ........................................................................................................... 43
    Named-entity Recognition ...................................................................................... 43
    Entity Resolution .................................................................................................... 45

4 DISCUSSION .............................................................................................................. 51
    Limitations .............................................................................................................. 55
    Future Work ............................................................................................................ 58

APPENDIX

ANNOTATION GUIDELINE ........................................................................................ 61

LIST OF REFERENCES ................................................................................................... 63

BIOGRAPHICAL SKETCH ................................................................................................ 67
LIST OF TABLES

Table                                                                                                                          page

3-1 Baseline set of features selected for the first round of training from the Stanford NER NERFeatureFactory ........................................................................ 48

3-2 Summary of the CRF model performance for the first and second rounds of training, and the final round of testing .................................................................... 49

3-3 Additional features added to the baseline features for the second round of model training ......................................................................................................... 50
LIST OF FIGURES

Figure                                                                                                                         page

2-1 Flow chart of methods ........................................................................................... 40

2-2 Example web page from the National Wildlife Health Center’s Avian Influenza News Archive .......................................................................................................... 41

2-3 Images that illustrate heterogeneity in the formatting and use of punctuation in headings of reports ................................................................................................ 41

2-4 Example output file of NLP pre-processing step .................................................. 42

3-1 Number of avian influenza epidemics by country from 2006 to 2017 .................. 48

3-2 Number of avian influenza epidemics by year from 2005 to 2017 ....................... 49

3-3 Number of avian influenza epidemics by host from 2006 to 2017 ....................... 49

3-4 Number of avian influenza epidemics by influenza pathogen from 2006 to 2017 ..................................................................................................................... 50
LIST OF ABBREVIATIONS
API Application programming interface
BIO2 Beginning-inside-outside format 2
brat Brat rapid annotation tool
CDC Centers for Disease Control and Prevention
CMM Conditional Markov model
CRF Conditional random field
ER Entity resolution
FN False negative
FP False positive
GPHIN Global Public Health Intelligence Network
HMM Hidden Markov model
HTML Hypertext Markup Language
IAA Inter-annotator agreement
IO Inside-outside format
IRI Internationalized Resource Identifier
ISO International Organization for Standardization
JSON JavaScript Object Notation
MedISys Medical Information System
NCBI National Center for Biotechnology Information
NER Named-entity recognition
NLP Natural language processing
RDFS Resource Description Framework Schema
SARS Severe Acute Respiratory Syndrome
SPARQL SPARQL Protocol and Resource Description Framework Query Language
SQL Structured Query Language
TP True positive
UAE United Arab Emirates
UK United Kingdom
US United States
USGS United States Geological Survey
WHO World Health Organization
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
A WEB SCRAPER AND ENTITY RESOLVER FOR CONVERTING PUBLIC EPIDEMIC REPORTS INTO LINKED DATA
By
Matthew A. Diller
August 2018
Chair: William R. Hogan
Major: Medical Sciences
Public health information systems that can detect epidemics rapidly by extracting information from textual Web-based reports, such as online news articles, as they are published in real time have shown promising results for disease surveillance. Unfortunately, current resources that rely on Web-based reports for disease surveillance are not designed to extract epidemiological data from these reports automatically, and therefore do not take advantage of the full potential of the data contained in each report.
One of the challenges to extracting such data from unstructured, textual Web-
based reports is identifying and linking data that are about individual epidemics in
multiple reports that have been published over a period of time. Therefore, the focus of
this work is to develop and evaluate a set of tools that use state-of-the-art informatics
technologies to extract data about avian influenza epidemics from textual Web-based
reports and use this data to identify and link reports that are about the same epidemics.
The online data source that I use is the Avian Influenza News Archive, which is
maintained by the United States Geological Survey. This archive contains serially-
published reports about avian influenza epidemics from November 7, 2006 to
September 28, 2017 and is publicly available. The first tool that I developed is a web
scraper that extracts the report text from each web page and stores it locally. The
second tool is a named-entity recognizer that labels named entities in each report that
refer to or denote locations, dates, host organisms, and influenza pathogens. Finally,
the third tool is a rule-based entity resolver that uses the labeled terms to identify
mentions of individual epidemics in each report and link them to epidemics identified in
other reports.
In total, the scraper extracted the text of 1,963 epidemic reports from which the
entity resolver identified 1,144 individual epidemics. Despite the small dataset available for NER model training, the overall results for all four named-entity types were
satisfactory (precision=0.9220, recall=0.7821, F-score=0.8463). Across all of the
identified epidemics, China was the most common location (18.62% of all epidemics);
H5N1 was the most common influenza subtype (65.25% of all epidemics); and birds
were the most common host (46.77% of all epidemics). Taken together, this work
illustrates the feasibility of applying entity resolution to textual Web-based reports to
identify and link reports that are about the same epidemic.
CHAPTER 1
INTRODUCTION
Recent epidemics of various infectious diseases—such as Ebola, severe acute
respiratory syndrome (SARS), Zika, and avian influenza—have demonstrated the need
for public health information systems that can detect epidemics rapidly, assess the
regional risk of epidemics with a high degree of specificity and sensitivity, and update
decision makers in real time as epidemics evolve. These systems require timely and
highly accurate input data, so that epidemiologists know precisely where the epidemic is
occurring and where it might spread and so that responders can develop well-informed
plans on how to address the epidemic. Historically, disease surveillance and forecasting of disease outbreaks have relied on reports from sentinel healthcare providers and
laboratory results. These data are typically not available to physicians, researchers,
and the general public until weeks after the data were first collected. One of the
consequences of this situation is that computational epidemiologists—whose
mathematical models of disease transmission often inform public health officials and
policy makers on how to mitigate the spread of a disease—are sometimes unable to
react to an outbreak until well after the critical point at which it has evolved into an
epidemic.
To address the need for early detection of and ongoing real-time updates about
epidemics, researchers have studied the use of disparate Internet-based sources of
information, such as Google search queries [1–4], Wikipedia page access logs [5,6],
Twitter [3,7,8], online grey literature [9–11], and online news reports [9–11]. The advantages of these sources of information are that data can be obtained in real or near-real time at a high degree of spatial resolution, and that the resource cost of obtaining
the data is generally low. However, online data tend to be either unstructured or semi-
structured, thus making efforts to aggregate and interpret data from multiple sources
challenging. In addition, certain online data sources are often lacking in specificity
and/or sensitivity, as was the case for Google Flu Trends [4], which can pose practical,
economic, and ethical concerns regarding their utility to public health officials who
cannot afford to waste precious resources on false positive warnings of outbreaks.
Despite these drawbacks, Internet-based resources have proven valuable for the
early detection and modeling of epidemics. Between 2013 and 2014, digital reports of
polio cases preceded official reports by the World Health Organization (WHO) for all
seven outbreaks that occurred within that time period [12]. Likewise, in 2014, online
news reports about an outbreak of a Lassa-fever-like hemorrhagic fever in eight people
in Guinea were published one week ahead of an official case report released by the
WHO [12]. Online news and official reports on both MERS and Ebola epidemics have
also been shown to be useful for estimating the basic reproduction number and other
disease parameters that epidemiologists rely on for assessing the risk for an outbreak
and for determining the appropriate control measure(s) to implement [13,14].
Meanwhile, Paul et al. have demonstrated that combining data scraped from Twitter
with historical incidence data improves the performance of influenza forecasting models
relative to using historical incidence data alone. Taken together, these findings illustrate
that online data sources are capable of providing epidemiologists with actionable
information on possible outbreaks in a timely manner, and that aggregating them with
more traditional data sources can result in improved estimates of outbreak
characteristics.
There still remains a considerable amount of work to be done before the full
potential of these online resources is realized. Much of this work will involve the
programmatic extraction of data from the text of online outbreak reports, which is
challenging due to the unstructured or semi-structured nature of many of these reports,
the heterogeneous ways in which the data are presented in the text, the possibility of formatting
changes to the text or the Hypertext Markup Language (HTML) documents that contain
them, and the lack of a guarantee that some sources will be available in the future.
Indeed, some tools have adopted hybrid approaches, which consist of computational
and manual methods for extracting epidemiological data from online reports. Cleaton et
al. [13] and Cauchemez et al. [14] decided to forgo using computational methods
altogether, and instead manually extracted the data. Such an approach is labor
intensive and can take a long time to complete if a large volume of reports is to be
utilized.
At present, the available epidemiological tools that derive their data from official
or news-based online reports—the Global Public Health Intelligence Network (GPHIN),
HealthMap, the Medical Information System (MedISys), and ProMED-mail—are
designed simply to detect outbreaks as they emerge [10,11,15,16]. Thus, much of the epidemiological data in these reports, such as data about transmission patterns or the affected host species, are never extracted. As I have mentioned, these data can be of considerable value to epidemiologists for studying and developing responses to current
epidemics. However, without an automated way of identifying reports that are about the
same epidemic—an initial step to extracting epidemiological data from them—a great
amount of time and effort is needed to do so manually. This can be problematic when
dealing with rapidly developing epidemics where any delays in the development and
implementation of control measures can be disastrous.
Therefore, if the full potential of online reports as a data source for infectious
disease epidemiology is to be realized, part of that effort will involve developing
computational tools that can draw links between data about the same epidemic as it is
being reported on over time. Ideally, these tools would extract data from several
sources (e.g., ProMED-mail, online news articles, and Centers for Disease Control and
Prevention (CDC) reports), identify data that are about the same epidemic in each of
these sources, and then link those data together using entity resolution. These tools
would then be able to provide users with data about individual epidemics that are
occurring simultaneously and update these data as new reports are published by one or
more sources. With these data, epidemiologists and public health professionals would
then be able to develop more well-informed responses to these epidemics.
The focus of this thesis is to create and evaluate a set of tools that extracts data
from a set of Web-based reports and links multiple reports over time to the individual
epidemics they are about. The first tool in this set is a web scraper that fetches textual
reports that describe avian influenza epidemics that occurred across the globe in
various host populations from 2006 to 2017. The second tool is a named-entity
recognition classifier for identifying specific information in the text that identifies
individual epidemics. The third and final tool is a rule-based entity resolver that
generates data about individual epidemics within each report and then connects these
data to other data about the same individual epidemic identified in earlier reports. The
net result is that for each distinct epidemic, we can trace sequentially the reports that
describe it over time to see how it evolved.
CHAPTER 2
METHODS
Statement of Purpose
The goal of this project is to create queryable, structured data about avian
influenza epidemics from unstructured, web-based, narrative reports about these
epidemics. These epidemic reports are released approximately every two weeks by the
US Geological Survey (USGS). I will classify fragments of text in this corpus of USGS
epidemic reports as referring to or denoting an influenza pathogen, a location, a host
population, or a date in order to 1) perform entity resolution to identify unique epidemics
in epidemic reports that serially update information about ongoing epidemics, and 2)
distinguish among multiple different epidemics that are occurring simultaneously and
that may be in proximity to one another.
Data Source
I used the Avian Influenza News Archive of the United States Geological
Survey’s National Wildlife Health Center as the source of influenza epidemic information
[17]. As of May 15, 2018, this archive contained 1,963 avian influenza epidemic reports across 442 Web pages published from November 7, 2006 to September 28, 2017.
Each web page typically consists of at most three subsections: avian influenza in
wild animals, avian influenza in domesticated animals, and avian influenza in humans.
Each subsection typically contains one or more epidemic reports for a host–country–
influenza-subtype combination. For example, the September 9, 2016 web page consists
of two subsections—“Avian Influenza in Poultry” and “Avian Influenza in Humans”—with
four epidemic reports under “Avian Influenza in Poultry” and one under “Avian Influenza
in Humans”. Specifically, the web page contains reports of (1) H9N2 influenza in
humans in China, (2) H5N6 influenza in poultry in China, (3) H5N1 in birds in Ghana, (4)
H5N2 in ostriches in South Africa, and (5) H5N8 in ducks and chickens in South Korea
(Figure 2-2).
Components
This project consists of three main tasks: 1) creating a web scraping module to
fetch the Web pages and identify the epidemic reports contained in them, 2) training an
NLP module for named entity recognition (NER) to identify the four types of named
entities (that I describe below) in the reports, and 3) building an entity resolution module
to identify unique influenza epidemics from the output of the NLP step (Figure 2-1). Each
of these tasks consists of a set of subtasks, as I shall discuss.
Web scraper
The web scraper is a Python version 3.5 [18] module that comprises two sub-modules for data extraction. The first sub-module (html_scraper.py) sequentially
fetches each web page (including the one shown in Figure 2-2) from the USGS Avian
Influenza News Archive and saves it locally as an HTML file. The second sub-module
(scraper.py) then iterates over each HTML file and creates a parse tree from its HTML,
using the Beautiful Soup Python package [19]. From there it extracts the text-heading
and content of each epidemic report contained in the HTML file and saves them to a
structured JSON file. Accordingly, each JSON file corresponds to a single web page,
and contains the text heading and body of one or more epidemic reports.
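The heading-and-body extraction performed by scraper.py can be sketched roughly as follows. The thesis uses the Beautiful Soup package; to stay self-contained, this sketch substitutes Python's stdlib html.parser, and the tag structure (a bolded heading followed by report text inside a paragraph) is a simplified assumption rather than the archive's actual markup.

```python
import json
from html.parser import HTMLParser


class ReportExtractor(HTMLParser):
    """Collects <b> headings and the report text that follows each one.

    A stand-in for the Beautiful Soup parse-tree traversal described in
    the text; class and field names here are illustrative.
    """

    def __init__(self):
        super().__init__()
        self.reports = []          # list of {"heading": ..., "content": ...}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_heading:
            # a new bold heading starts a new report record
            self.reports.append({"heading": text, "content": ""})
        elif self.reports:
            self.reports[-1]["content"] += text


# A tiny mock page; note the period stranded outside the <b> tag, the
# formatting quirk discussed in the pre-processing step below.
page = "<p><b>Egypt</b>. Officials reported new H5N1 cases in poultry.</p>"
parser = ReportExtractor()
parser.feed(page)
print(json.dumps(parser.reports, indent=2))
```

One JSON file per web page then holds the heading/content pairs of every report on that page, as described above.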
Manual extraction was necessary for (1) three web pages that displayed some of the data in tabular form rather than as textual reports, and (2) 24 pages that contained duplicates or errors. In addition, three web pages were exact duplicates of the previous page and were therefore excluded, leaving 439 web pages to process. Of these 439 pages, 27 (6.14%) required manual extraction.
In addition, a small number of the reports within the web pages contain
information on recently published studies, health policy changes, and ecological reports
that relate to avian influenza and its various host species. This made the task of web
scraping more difficult since the headers of these latter types of reports may be
formatted or worded differently than the epidemic reports themselves. In addition, it
adds complexity to the task of entity resolution as it potentially creates a false indication
of an influenza outbreak.
Named-entity Recognition
One of the main problems with extracting data from unstructured natural
language text is how to determine if a particular word or phrase is relevant to the task at
hand. One approach to determining relevance is to manually read through the text and
select each word or phrase that meets a set of pre-established criteria of relevance.
Since one of the goals of this task is to extract data about individual epidemics, the
criteria that I selected are based on four essential properties of epidemics that are
described in the definition of ‘epidemic’ from the World Health Organization (WHO).
According to the WHO, an epidemic is (emphasis added):
The occurrence in a community or region of cases of an illness, specific health-related behaviour, or other health-related events clearly in excess of normal expectancy. The community or region and the period in which the cases occur are specified precisely. The number of cases indicating the presence of an epidemic varies according to the agent, size, and type of population exposed, previous experience or lack of exposure to the disease, and time and place of occurrence. [20]
Because this definition explicitly states that the location and temporal period of disease
cases are vital to determining whether an outbreak qualifies as an epidemic, I selected
locations and dates as two categories of interest for this task. Similarly, because agent
(i.e., pathogen) and type of host population are also listed as being important, I included
them as well.
Since it often is not feasible for humans to manually extract words or phrases
from large textual datasets, it is common to use computational methods that
automatically segment and classify the words or phrases in the text. Indeed, given that
the dataset I am using consists of 1,963 textual reports in total, manually extracting
each word or phrase that is about some epidemic would be labor intensive and time-
consuming. Therefore, for this task I used NER, which is a computational method that
segments and classifies words or phrases that represent named entities in a text
according to pre-selected categories. This method is used extensively for information
extraction tasks in the biomedical and public health domains [21–25].
I define a named entity as a thing that exists in reality and that can be referred to
or denoted by a noun or proper noun. For this project, the named entities in which I am
interested fall into the four categories that I described above (host, pathogen, date/time,
and location). As an example of how these named entities appear in a report, consider
the sentence, “Three pigs were confirmed to have been infected with H1N1 influenza in
Mexico on August 4.” This sentence contains references to four named entities: host
organisms (‘pigs’), an influenza pathogen (‘H1N1 influenza’), a location (‘Mexico’), and a
date (‘August 4’).
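Labeled training data for this kind of task is often written in the BIO2 format (listed in the abbreviations), where each token carries a B- (beginning), I- (inside), or O (outside) tag. The sketch below labels the example sentence this way; the label names (HOST, PATHOGEN, LOC, DATE) are illustrative and are not necessarily the tag set used in this project.

```python
# The example sentence, pre-tokenized (the final period split off as
# its own token, as a tokenizer would do).
sentence = ("Three pigs were confirmed to have been infected "
            "with H1N1 influenza in Mexico on August 4 .")
tokens = sentence.split()

# Hypothetical BIO2 labels, one per token: B- marks the first token of
# a named entity, I- a continuation, O a token outside any entity.
labels = ["O", "B-HOST", "O", "O", "O", "O", "O", "O",
          "O", "B-PATHOGEN", "I-PATHOGEN", "O", "B-LOC",
          "O", "B-DATE", "I-DATE", "O"]

assert len(tokens) == len(labels)
for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```

Multi-token entities such as ‘H1N1 influenza’ and ‘August 4’ get a B- tag on their first token and I- tags on the rest, which is what distinguishes BIO2 from the simpler inside-outside (IO) format.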
There are two main approaches to NER—the rule-based approach and the
learning-based approach—that I will describe here, each of which has its own
advantages and drawbacks. The rule-based approach relies on a set of grammar-based
rules that are either hand-crafted or bootstrapped from previously-developed rule sets,
which are then implemented throughout the entire data extraction process to extract
structured data from unstructured text. An example of what one of these rules might
look like informally is, “identify a match of a location type (e.g., ‘district of’) followed by a
match of a location name (e.g., ‘Shushan’).” This approach has the advantage of
achieving higher precision than the learning-based approach [26]. However, the performance of systems that use this approach depends entirely on how comprehensive the information extraction rules are, and such rule sets therefore often require a substantial amount of manual effort to construct, test, and refine. In addition, information extraction systems that use this approach are limited to the domains that their rule sets cover.
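A toy version of the informal rule quoted above (a location type followed by a location name) might be written as a regular expression. The pattern and vocabulary here are invented for illustration and are not taken from any cited rule-based system.

```python
import re

# Match a location type ("district of", "province of", "city of")
# followed by a capitalized word taken to be the location name.
LOCATION_RULE = re.compile(r"\b(district|province|city) of ([A-Z][a-z]+)")

text = "New cases were confirmed in the district of Shushan this week."
match = LOCATION_RULE.search(text)
print(match.group(0))   # the full matched phrase
print(match.group(2))   # the captured location name
```

A production rule set would need many such patterns, plus exception handling, which is the manual-effort burden described above.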
The second main approach that is used for NER is the learning-based approach.
This approach uses statistical models that automatically implement a set of rules at the
beginning of the extraction task to identify certain discriminative features of words (e.g.,
capitalization, the sequence of integers that form a year), which are then used by a
statistical model that implements the rest of the extraction process. The advantages of
this approach are the relatively higher recall that learning-based systems achieve
compared to rule-based systems and the avoidance of having to manually develop a
comprehensive set of rules, which can take several months [27]. Due to time constraints
and to recent improvements in the performance of learning-based systems for NER, I
elected to use the learning-based approach for this task.
Implementations of this approach typically rely on one of three techniques:
supervised learning, semi-supervised learning, or unsupervised learning [28]. The most
common of the three for NER, supervised learning, uses data that people have
manually annotated for positive (and sometimes negative) examples of named entities.
Semi-supervised learning uses a technique called “bootstrapping” that takes a small
number of example names of named entities, called “seeds,” searches for sentences
within a text that contain these names, and then uses their surrounding context to
identify other names elsewhere. Once new names are identified, the process is
repeated using the newly identified names. The third technique, which happens to be
relatively new for NER tasks, is unsupervised learning, which clusters groups of words
that denote or refer to named entities based on similarities in the context in which these
words occur.
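The bootstrapping idea just described can be sketched as a toy loop: seed names yield contexts, and those contexts surface new names. The seed, sentences, and one-word context window are all invented for illustration; a real bootstrapper would repeat the loop with the newly found names as seeds.

```python
seeds = {"Mexico"}
sentences = [
    "cases were reported in Mexico today",
    "cases were reported in Ghana today",
]

# Step 1: collect the context (one word on each side) around every seed.
contexts = set()
for s in sentences:
    words = s.split()
    for name in seeds:
        if name in words:
            i = words.index(name)
            contexts.add((words[i - 1], words[i + 1]))

# Step 2: any word appearing in a known context is taken as a new name.
found = set(seeds)
for s in sentences:
    words = s.split()
    for i in range(1, len(words) - 1):
        if (words[i - 1], words[i + 1]) in contexts:
            found.add(words[i])

print(sorted(found))
```

Here the context ("in", "today") learned from ‘Mexico’ is enough to pull in ‘Ghana’ from the second sentence.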
By far, the most successful technique of the three for NER is supervised learning
[26,28], which is why I selected it for this task.
Conditional Random Fields
As I mentioned previously, supervised learning techniques use annotated data,
called “training data,” to develop a set of rules for classifying words or phrases in an
unannotated sequence (e.g., a sentence) of text. However, rather than treating each
word in a sentence as individual tokens, these models take into account the
dependencies that exist between words in the sentence. Take, for example, the term ‘bird flu virus’. Its individual words have meanings of their own (i.e., ‘bird’ refers to a type of flying animal that has a beak, ‘flu’ refers to a type of infectious disease, etc.), but when they appear together like this in a sentence, they are taken to refer to a particular type of virus. Using these
dependencies and the set of rules for distinguishing certain words (i.e., that were
generated at the beginning of the extraction task during the training step), the model
then predicts the most likely sequence of named entity labels for each word. Models
that take this approach to predicting the labels of sequential data are sometimes called
“sequential labeling models.”
One such model that is commonly used for learning-based NER tasks is the
conditional random fields (CRF) model [29]. To arrive at a sequence of labels for the
words in a sentence, during the training step of NER, a CRF model will first generate
feature functions for each labeled word in the sentence based on the following inputs:
the sentence (s), the position of a word in the sentence (i), the label of the current word (l_i), and the label of the previous word (l_(i-1)). The output of each feature function is typically binary, i.e., 0 or 1. Through this process, the conditional dependence of l_i on s is defined by the model as a set of feature functions, such that the probability of each possible value for l_i is partially determined by the feature functions. In
the next step, the model assigns each feature function a weight and then sums the
weighted features over all of the words in the sentence. It then uses this sum to predict
the most likely sequence of labels for the sentence.
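As a toy illustration of the feature functions and weighted scoring described above (not the Stanford NER implementation; the features and weights are invented): each feature function takes the sentence, a position, the current label, and the previous label, and returns 0 or 1; the model sums the weighted features over the sentence. A real CRF then normalizes such scores into a probability distribution over all possible label sequences rather than using them raw.

```python
def f_capitalized_loc(s, i, li, li_prev):
    # fires when a capitalized word carries the LOC label
    return 1 if s[i][0].isupper() and li == "LOC" else 0


def f_after_in(s, i, li, li_prev):
    # fires when a word labeled LOC immediately follows "in"
    return 1 if i > 0 and s[i - 1] == "in" and li == "LOC" else 0


# (feature function, weight) pairs; weights are invented for illustration
FEATURES = [(f_capitalized_loc, 1.5), (f_after_in, 2.0)]


def score(sentence, labels):
    """Unnormalized score: weighted feature sum over all positions."""
    total = 0.0
    for i in range(len(sentence)):
        prev = labels[i - 1] if i > 0 else "START"
        for f, w in FEATURES:
            total += w * f(sentence, i, labels[i], prev)
    return total


sent = ["Cases", "reported", "in", "Mexico"]
good = ["O", "O", "O", "LOC"]   # labels "Mexico" as a location
bad = ["O", "O", "O", "O"]      # labels nothing
print(score(sent, good), score(sent, bad))
```

The labeling that tags ‘Mexico’ as a location fires both features and scores higher, so it would be preferred when the model picks the most likely label sequence.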
One of the benefits of CRFs is that they are capable of taking advantage of non-
local information in natural language text. One example of a non-local dependency that
is valuable in NER is the consistent labeling of two instances of the same word that
occur far from each other in a text. Similarly, it is also advantageous to have consistent
labeling of two words that are very similar to each other. For instance, if ‘United States’
is assigned a location label, I would also like for ‘United States of America’ to be
assigned a location label, too. Many other sequence labeling models, such as hidden
Markov models (HMMs) and conditional Markov models (CMMs), cannot account for
these non-local dependencies since they only capture local dependencies of each word
in a sentence (i.e., the label of a word and the label of the previous word). Because of
this capability and the high degree of success that CRFs have seen in NER tasks, I
decided to use CRFs for the NER task of this project.
To achieve the goal of correctly classifying all words and phrases that refer to
influenza pathogens, locations, dates, and host organisms in the epidemic reports, I
used the Stanford Named Entity Recognizer (NER) toolkit for named-entity recognition.
This Java-based toolkit uses a CRF sequence model for classifying named entities
[22,30]. It includes a pre-trained linear chain CRF model for classifying names of
people, organizations, or locations in English texts. In addition, Stanford NER provides
the functionality for users to train, test, and run their own models either
programmatically or from the command line.
Overview of NER task
The NER task of this project consisted of a pre-processing step, corpus
annotation, training and evaluation, and a final data extraction step, each of which is
described below.
Pre-processing step
The pre-processing step (1) cleans the data stored in the JSON files created by the web scraper and (2) breaks them out further into a set of text files, each of which contains exactly one report from a JSON file. For each JSON file, the text content string and respective heading string of each report are taken as input by a module (print_datasets.py) written in Python 3.5, which first removes any whitespace characters from the beginning or end of each string and any extraneous punctuation from the beginning of each string. My motivation for doing this was to prevent any future issues with the software I used for tokenizing and annotating the reports and also to clean up the text that contains the body of each report so that it begins with the first word of the report.
This step was necessary due to the formatting of some web pages that
appended periods to the end of some headers, but in a different tag than what the rest
of the text was located in. For example, some reports had a heading that denoted the
location of an outbreak (e.g., ‘Egypt.’), such that the alphabetical characters of the
heading (i.e., ‘Egypt’) were stored in a <b> tag, while ‘.’ was stored outside of the <b>
tag in the paragraph (<p>) tag that the entire heading and text content were wrapped in
(Figure 3). As a result, this period would end up being prepended to the content string of
each report that followed this heading format.
In addition to removing whitespace from the beginning and end of the string, the
module also replaced any newline (‘\n’), horizontal tab (‘\t’), or return (‘\r’) escape
sequences that may appear within the content string with a single space. These
sequences occasionally appeared in the text as apparent artifacts of poorly formatted
HTML text strings. Since I am only interested in retaining the text of the reports, and not
any of these unexplained escape sequences, I elected to remove them.
The module then writes the report heading string to the first line of a text file and
the content string to the second line to produce a single text file for each report (Figure
4). Finally, it concatenates the positional index of the report in the web page to the date
of the web page, and assigns this as the file name (e.g., the text file containing the first
report from the page for November 7, 2006 would be named “2006_11_07_1”).
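The cleaning and file-naming logic of this step might look roughly like the following sketch. Function names and details are my own, not those of print_datasets.py; the sketch only illustrates the transformations described above.

```python
import re
import string

def clean_report_text(text):
    """Strip surrounding whitespace, extraneous leading punctuation
    (e.g., a stray '.' carried over from a heading), and embedded
    newline/tab/return characters from a report string."""
    text = text.strip()
    text = text.lstrip(string.punctuation + " ")
    return re.sub(r"[\n\t\r]+", " ", text)

def report_filename(page_date, index):
    """Concatenate the page date and the report's positional index,
    e.g. ('2006_11_07', 1) -> '2006_11_07_1'."""
    return "{}_{}".format(page_date, index)

heading = clean_report_text("  Egypt.  ")
content = clean_report_text(". An outbreak\nwas reported\tin poultry.")
```

The heading and cleaned content would then be written as the first and second lines of the text file named by report_filename.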
Corpus annotation
One of the obvious disadvantages to supervised learning is that it requires a
manually annotated training dataset. As this task is often tedious and time-consuming (although usually less so than developing rule sets), it is common to use pre-annotated datasets that have been released by others [31–33]. Unfortunately, to my
knowledge there are no available pre-annotated corpora of text that consist of influenza
epidemic reports, so I had to create a training dataset manually.
I selected thirty web pages randomly from the set of scraped web pages, and
then annotated the individual reports contained in each using the brat rapid annotation
tool (brat) [34]. Brat is a web-based text annotation tool that allows users to create
structured annotations for text corpora. It stores the annotations in “standoff format”
(i.e., in a separate file from the text report being annotated) in an individual annotation
file (.ann). This annotation file takes the base name of its corresponding input text file
such that each input file remains unedited.
Of these thirty web pages, I randomly assigned five to the development set,
twenty to the training set, and another five to the testing set. Because these web pages
contain multiple reports, which I will henceforth refer to as the ‘datasets’, the total
number of development datasets was 24, the total number of training datasets was 94,
and the total number of testing datasets was 22.
As mentioned above, I selected four key named entities of interest—date,
location, influenza pathogen, and host organism—for annotation in accordance with
certain inclusion and exclusion criteria, which I describe next.
I annotated alphanumeric terms as date entities if the term referred to or denoted
a calendar day, month, year, holiday, or season (e.g., fall), or if they consisted of
relative temporal terms, such as ‘last week’ or ‘today’. If an alphanumeric term was
synonymous with ‘influenza virus’ or any of its serotypes, it was given an influenza
pathogen annotation. Terms referring to the pathogenicity of the influenza virus (e.g.,
‘highly pathogenic’ in ‘highly pathogenic H5N1 avian influenza’) were not included in the
annotation.
I assigned location annotations to alphanumeric terms that denoted a
geographical entity (e.g., Mt. Everest) or a geopolitical entity (e.g., United States or
Beijing). I excluded from annotation any descriptor terms that precede or follow a
location term (e.g., ‘state of’ or ‘town of’) that were not part of the proper name of the
entity. For example, I would not annotate ‘state of’ in ‘state of California’ since it is not
part of the proper name of California, whereas I would annotate ‘State of’ in ‘State of
Palestine’ since it is part of the official name of Palestine. I also excluded terms for
cardinal directions (e.g., ‘north’ or ‘southern’) from annotation unless they were part of
the proper name of the location entity. For example, I would not annotate ‘north’ in
‘north Switzerland’, but I would annotate ‘North’ in ‘North America’. Additionally, if a
series of words contained location terms that were sequentially ordered (e.g., ‘Gainesville, Florida’), I made sure to separately annotate each term that denoted an
individual location, rather than annotating the entire string as a single location (e.g., in
the previous example, ‘Gainesville’ and ‘Florida’ would each receive individual
annotations). In addition, I excluded from annotation those homonyms of a location term
that denoted a non-location entity, because annotating them would be inaccurate. For
example, I would not annotate the term ‘Washington’ as a location if it was used to
denote the federal government of the United States, rather than to denote the state of
Washington or Washington, DC. Finally, I also excluded from annotation those location
terms that belong to the proper name of a different type of entity, such as a private or
government organization, for example ‘Brazil’ in ‘Ministry of Health of Brazil’.
I assigned host organism annotations to alphanumeric terms that were 1) nouns,
and 2) referred to or denoted one or more organisms infected by influenza virus.
Because the Avian Influenza News Archive contains reports on avian influenza
outbreaks in wild and domesticated animals, as well as humans, the host organism
terms tended to be heterogeneous. I did not annotate host terms that were not in their
noun form (e.g., ‘poultry’ in ‘poultry farm’). In addition, I excluded from annotation any
descriptors that preceded the host term (e.g., ‘cage-free’ in ‘cage-free chickens’).
Finally, if a host term was followed by its scientific name in parentheses (e.g., ‘pigs (Sus
scrofa)’), I annotated each term individually, rather than as a single host organism.
As is best practice for any NER task, I evaluated how well defined the annotation step was. To do so, I recruited a second annotator to annotate 15 randomly selected datasets that I had annotated previously. I then compared the annotation
results from both annotators and calculated an inter-annotator agreement (IAA) score from them using Cohen’s kappa coefficient.
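Cohen's kappa compares the observed agreement between two annotators against the agreement expected by chance. A minimal sketch of the computation over two aligned sequences of token labels (the label sets and inputs below are illustrative, not the thesis corpus):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences of equal length:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # independently, per their marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["LOC", "O", "O", "HOST"]
b = ["LOC", "O", "HOST", "HOST"]
```

For these toy sequences the annotators agree on 3 of 4 tokens, but kappa discounts the agreement they would reach by chance.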
Model Training
Training a model is a preliminary step for any task that uses supervised statistical
learning and is defined as the process through which a machine learning algorithm
infers a function, given an annotated input dataset, for accurately predicting particular
output values in unannotated datasets. For NER, this process involves training an
algorithm on a corpus that has been manually annotated with the target feature set,
called a “gold standard corpus,” which the algorithm then uses to predict the class
labels of words or phrases in an unannotated corpus.
I used the Stanford NER toolkit for model training, evaluation, and classification
[30]. Because the output file of the brat rapid annotation tool is in the brat standoff
format, which is not accepted by the Stanford NER toolkit, I first converted the
annotated training dataset to inside-outside (IO) annotation format, which labels each
word with either an entity tag or an ‘O’ tag, using the open-source standoff2corenlp.py
Python 3 script [35]. I then converted the training dataset from the IO format to the
beginning-inside-outside format 2 (BIO2) for training and testing by specifying the
entitySubclassification property in the properties file and setting its value to ‘IOB2’ [36].
This format annotates each word in a corpus with a ‘B’, ‘I’, or ‘O’ according to the
following set of rules: 1) if a word is inside a noun phrase that represents a named
entity, it receives an ‘I’ tag; 2) if a word is at the beginning of a noun phrase, then it
receives a ‘B’ tag; and 2) if it is outside of a noun phrase, it receives an ‘O’ tag.
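The IO-to-BIO2 conversion can be sketched in a few lines: the first token of every entity span gets a ‘B-’ prefix and later tokens get ‘I-’. (This is an illustrative reimplementation of the conversion, which Stanford NER performs internally via the entitySubclassification property; note that pure IO input cannot distinguish two adjacent entities of the same type, which collapse into one span.)

```python
def io_to_bio2(tags):
    """Convert IO tags (an entity label or 'O') to BIO2 tags: the first
    token of every entity span gets 'B-', subsequent tokens get 'I-'."""
    out = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            out.append("O")
        elif tag == prev:
            out.append("I-" + tag)   # continuing the same entity span
        else:
            out.append("B-" + tag)   # starting a new entity span
        prev = tag
    return out
```

For example, the IO sequence O, LOC, LOC, O, DATE becomes O, B-LOC, I-LOC, O, B-DATE.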
In many tasks involving the use of statistical models to make predictions,
hyperparameters are distinguished from the standard model parameters that are arrived
at during model training. These hyperparameters are not directly learned from the data
by the model, but rather the user usually sets their values prior to training. They include
a variety of different properties of the model, such as the complexity of the model or the
number of passes that the model should take on the training data. For this task, I used the default hyperparameter values that are selected by Stanford NER.
To train a classifier, I used the following command on the command line:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop flu.prop
Here, the location of the properties file—flu.prop—was specified by the ‘-prop’
command. After training a classifier, Stanford NER then serializes the model to a
location specified in the properties file.
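For reference, a minimal properties file for such a run might look like the following. The file paths and the exact feature settings here are illustrative, not the thesis's actual flu.prop; see the Stanford NER documentation (NERFeatureFactory) for the full list of supported properties.

```properties
trainFile = ner-crf-training-data.tsv
serializeTo = flu-ner-model.ser.gz
map = word=0,answer=1
entitySubclassification = IOB2
useClassFeature = true
useWord = true
usePrev = true
useNext = true
wordShape = chris2useLC
```

The map property tells the classifier which tab-separated columns hold the token and the gold label, and serializeTo is the location the trained model is written to, as mentioned above.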
Model Evaluation
To assess the performance of the NER model in assigning labels to the words in
unannotated text, researchers commonly use three measures: precision (the fraction of
assigned labels that are accurate), recall (the fraction of all entities that were accurately
labeled), and the F-score (the harmonic average of the precision and recall). Taken
together, these measures help evaluate how well a model performs at correctly labeling
words or phrases that refer to or denote named entities and non-entities, and how well it
performs at recognizing words or phrases that do not represent named entities of
interest.
To compute these measures, I first calculated the number of true positives (TP),
false positives (FP), and false negatives (FN). For any NER task, a true positive is any
named entity label that was correctly assigned to a word or phrase in the corpus, a false
positive is any entity label that was inaccurately assigned to a word or phrase in the
corpus, and a false negative is a word or phrase that refers to or denotes an entity of
interest that was incorrectly labeled with an outside tag.
Precision (p) is the ratio of the number of true positives to the sum of true positives and false positives:

p = TP / (TP + FP)

Recall (r) is the ratio of the number of true positives to the sum of true positives and false negatives:

r = TP / (TP + FN)

Finally, the F-score is the harmonic mean of precision (p) and recall (r):

F1 = 2 · (p · r) / (p + r)
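The three measures reduce to a few lines of code; a straightforward sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from counts of true positives,
    false positives, and false negatives."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```

For instance, 8 true positives with 2 false positives and 2 false negatives yields precision, recall, and F1 of 0.8 each.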
I evaluated model performance using the following command, which calculates
the precision, recall, F-score, and the total numbers of true positives, false positives,
and false negatives:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier flu-ner-model.ser.gz -testFile ner-cfr-testing-data.tsv
Here, ‘-loadClassifier’ specifies the location of the trained model and ‘-testFile’ specifies
the location of the annotated testing dataset.
Entity Resolution
Entity resolution (ER) is the process of identifying and linking different mentions of the same real-world entity that occur across multiple data sources. For example,
for a given influenza epidemic, there may be multiple articles about it that were serially
published over time in several different online news sources, along with multiple articles
published about other epidemics. The goal of an entity resolver for this task might
therefore be to first identify different, unique epidemics in this set of news articles, and
then link together references that are about one such epidemic. Thus, the end product
is one or more sets of linked articles (or epidemiological data if data extraction was
performed prior to the ER task), where each set of articles is about one particular
epidemic.
For any ER task, as with NER, the approach used to identify and link entities will
typically be either (1) rule-based, in which a set of conditions must be fulfilled in order to
determine whether two sets of data refer to the same entity, or (2) learning-based, in
which statistical learning algorithms are used to predict whether two sets of data refer to
the same entity. While learning-based approaches are useful for data that are
heterogeneous and unstructured, a rule-based approach using a set of rules that
leverage domain-specific knowledge is sometimes easier to implement and can produce
results that are sufficient for the goal(s) of the application.
Because the rules for ER for this task were a much smaller and easier set to
specify than the rules for NER would have been, I decided to use a rule-based
approach for the entity resolver. In particular, I constructed a set of rules for (1)
identifying all the unique epidemics that a given avian influenza epidemic report
mentions, and (2) determining whether each such epidemic is the same one mentioned
by previous or later reports.
Overall, I broke the ER task into five separate steps. The first step is a pre-processing step that takes as input the tagged corpus from the NER task, determines where every sentence in each report begins and ends, and extracts the tagged words and their associated labels. It then standardizes all date entities to the
International Organization for Standardization (ISO) 8601 format, all country acronyms
to the official country name, and certain host terms from their colloquial name(s) to their
scientific name. The second step involves a term look-up that determines the scientific
names of each tagged host entity. The third step is a term look-up that determines
whether each location entity is a country, administrative country level 1 subdivision
(e.g., a United States (US) state), administrative country level 2 subdivision (e.g., a US
county or parish), or a city. In the fourth step, I generate epidemic tuples for every
epidemic identified in each report. The final step takes the epidemic tuples and performs
epidemic ER.
Pre-processing Step
For sentence boundary detection, the ER module determines the ending of a
sentence by a period, which is the only full stop punctuation that appears in each
epidemic report. Because each report consists of both a heading and a body, the
headings are treated as the equivalent of a sentence in the sentence boundary
detection sub-step. As such, a full stop for a heading can be either a period, a colon, or
a right parenthesis. However, occasionally the body of a report contains a colon or right
parenthesis in the middle of a sentence, which should not be treated as a full stop. To
overcome this, I only considered a colon or right parenthesis to be a full stop if it was
the first full stop identified in the report since this would indicate the end of the heading.
For each sentence within the report, upon determining its ending I extracted each
tagged word from the sentence, if it contained any, along with its respective label.
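The heading-aware full-stop logic described above can be sketched as follows. This is a simplification under the stated assumption that a period is the only full stop in report bodies; function and variable names are my own.

```python
def split_heading_and_sentences(report):
    """Split a report into its heading and body sentences. The first
    full stop may be '.', ':' or ')', since it marks the end of the
    heading; after that, only a period counts as a full stop."""
    sentences = []
    start = 0
    heading_found = False
    for i, ch in enumerate(report):
        stops = "." if heading_found else ".:)"
        if ch in stops:
            sentences.append(report[start:i + 1].strip())
            start = i + 1
            heading_found = True
    if report[start:].strip():
        sentences.append(report[start:].strip())
    return sentences
```

A colon or right parenthesis appearing mid-body is thus never treated as a full stop, because by then the heading has already been closed off.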
For the second sub-step of the pre-processing step, I devised a set of rules for
standardizing certain named entity terms. For dates, this set of rules ultimately
produced a date term that was formatted according to the ISO 8601 standard date
format (i.e., yyyy-mm-dd), which made it easier to compute the distance between two
dates in the final step of the ER task. In constructing this set of rules, I tried to account
for dates that were missing a year and/or a day term. For those date terms that were
missing a year, I usually assigned the same year in which the report was published. In
some cases, these reports were describing an epidemic that occurred the previous year
(e.g., reports published in January), and thus I assigned the previous year instead, since using the year of publication would place the date in the future relative to the date of the epidemic report. For date terms that contained a month
and year, but no day, if the month in the date term was the same as that of the report,
then I applied the following rules: if the day of the report fell within the 1st to the 15th of
the month, ‘01’ was assigned as the day; if the day of the report fell after the 15th of the
month, then ‘15’ was assigned as the day. Finally, if the year within the date term was
different than that of the date of the epidemic report, then the date term was ignored, as
it is unlikely that this particular report is providing information relevant to a past
epidemic. The only exception to this rule is if the report was published at the beginning
of the year and the year within the date term denotes the previous year.
Because the set of Wikidata labels for a country does not always contain the country's abbreviation, I decided to convert the abbreviations of countries
into the official name of the country before conducting the location term look-up. There
were only three countries I found that were so affected, and thus I included in this step
the United States (US, USA, etc.), the United Kingdom (UK), and the United Arab
Emirates (UAE). Similarly, for the host term look-up step, I converted host terms that
were in their colloquial form to terms that are accepted by the database I used. In
particular, I converted all terms referring to humans (e.g., man, girl, patient) to ‘humans’,
the term ‘poultry’ (which was commonly used throughout this corpus) to ‘Galliformes’
(i.e., the name of the taxonomical order that contains most species of poultry), and the
terms ‘piglet’ and ‘piglets’ to ‘Sus scrofa’ (i.e., the species name for domesticated pigs).
Host Population and Pathogen Term Look-up Step
For the second step of the ER component, I queried the NCBI Taxonomy
database to get the scientific name of each host population and pathogen term that was
extracted in the pre-processing step, as well as their respective NCBI Taxonomy
identifiers. The NCBI Taxonomy database is a repository of taxonomical information
about organisms, including organism names and lineage, for the organism data in the
NCBI sequence databases [37,38]. It contains information only on those species of
organisms for which sequence data exist in another NCBI database (about 10% of
known species worldwide). Each organism classification has a standard numeric
identifier that was curated by the NCBI Taxonomy database. The database is
accessible through the NCBI Entrez Application Programming Interface (API), which
provides a set of protocols and tools that allows outside developers to communicate
with its various components. To programmatically connect to and query the database, I
used the BioPython Entrez library version 1.68 [39]. If multiple identifiers were returned
for a single query, I selected the one that was located at the lowest level in the
taxonomical lineage (no queries returned a list of identifiers for organism classifications
that were located in disjoint lineages). Although it contains only 10% of known species,
I found the NCBI Taxonomy’s coverage to be 100% for the host and pathogen species
discussed in the epidemic reports.
Location Term Look-up Step
For the third step of the ER component, I used the tagged location entity terms to
query Wikidata. Wikidata is an open-access knowledge base that contains structured
data extracted from numerous resources (e.g., Wikipedia) that the Wikimedia
Foundation maintains [40]. One of the benefits of using Wikidata is that it provides a
SPARQL endpoint that can be accessed through a web interface or programmatically via its API. To access this endpoint and submit queries
to Wikidata, I used the Python SPARQLWrapper library version 1.8.2 [41].
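A country-matching query of the kind described below might be built as follows. This is an illustrative reconstruction, not the thesis's exact query; wd:Q6256 is Wikidata's item for 'country' and wdt:P31 is the 'instance of' property. Submission via SPARQLWrapper is shown commented out, as it requires network access.

```python
def country_query(term):
    """Build a SPARQL query matching `term` against the English
    rdfs:label of Wikidata items that are instances of country."""
    return """
    SELECT ?item ?label WHERE {
      ?item wdt:P31 wd:Q6256 ;
            rdfs:label ?label .
      FILTER(LANG(?label) = "en" && STR(?label) = "%s")
    }""" % term

# Submitting the query (requires network access):
# from SPARQLWrapper import SPARQLWrapper, JSON
# sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
# sparql.setQuery(country_query("United States of America"))
# sparql.setReturnFormat(JSON)
# results = sparql.query().convert()
```

Analogous queries for the administrative subdivision levels and cities would swap in the appropriate Wikidata classes and additionally return the parent country and subdivision, as described below.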
As I mentioned previously, one reason for querying Wikidata in this step was to
determine whether a given location term denotes a location that is a country, an
administrative country level 1 subdivision, administrative country level 2 subdivision, a
city, or none of these. This is important because a given sentence might contain
multiple location terms that denote different administrative subdivisions where one is
part of the other. For example, if a sentence describes an epidemic that occurred in the
city of Gainesville and then later mentions the state of Florida, it is important to know
that Gainesville is a city that is part of Florida so that the entity resolver does not
incorrectly infer that two different epidemics—one in Florida and one in Gainesville—
occurred, when only one epidemic in Gainesville occurred.
Another reason for querying Wikidata is to resolve multiple references to the
same location to a single identifier. For instance, if the US is written in the text as both
‘United States’ and ‘United States of America’, I want to resolve both of these text
strings to the same country identifier. This consistency is important because, when I get
to the epidemic ER step, I will need to group epidemics by country. Doing so is much
easier with resolved country identifiers.
To determine the type of administrative unit to which the location text refers
(country, admin 1, admin 2, or city), I executed separate SPARQL queries. If the query
found a match between a location term and the Resource Description Framework
Schema label (rdfs:label) of a Wikidata location individual, it returned the English
rdfs:label for the individual and its Internationalized Resource Identifier (IRI). Thus, if the
individual was a country, this first query returned the rdfs:label and IRI for that
individual, and the ER module then moved on to the next location term.
Otherwise, it would submit a second query to determine whether the location
term denotes an administrative level 1 subdivision. If it retrieves a match, the rdfs:label
and IRI of the individual, as well as the rdfs:label and the IRI of the country that it is part
of, are returned in the query results.
If this query did not return a result, the ER module submitted a third query to
determine whether the location term denotes an administrative country level 2
subdivision. If a match is found, then the rdfs:label and IRI of the individual, the
rdfs:label and IRI of the country, and the rdfs:label and IRI of the administrative country
level 1 subdivision that it is part of are all returned in the query results.
If this third query still did not find a match, then the ER module submitted a final
query to determine whether the term denotes a city, with a successful match returning
the rdfs:label and IRI of the individual, the rdfs:label and IRI of the country, and the
rdfs:label and IRI of the administrative country level 1 subdivision that the city is part of.
If no results are returned, then the module moved on to the next location term.
Epidemic Tuple Generation
In the fourth step of this component, I generated epidemic tuples using the data
from the look-up steps. To accomplish this, I used a set of rules that I constructed from
knowledge of how each epidemic report is structured. In particular, I know that most, if
not all, epidemic reports are about one or more epidemics that occurred in one country and were caused by one type of avian influenza pathogen.
From this knowledge, I constructed the following set of rules for generating
epidemic tuples: (1) for any given epidemic report, there is at most one influenza
pathogen type that is responsible for all epidemics described therein; (2) for any given
epidemic report, there is at most one country in which each epidemic described therein is located; (3a) if a sentence in an epidemic report contains one or more host population
entity terms and one or more terms denoting a location that is within the country that the
report is about, assign these terms to the same epidemic; (3b) if a sentence does not
contain any location entity terms but does contain one or more host population entity
terms that are distinct from those in the previous sentence, assign those host terms to
the epidemic identified in the previous sentence; (4) if a sentence contains terms
referring to one type of host population entity and terms denoting multiple location
entities, generate an epidemic tuple for each location entity and assign them the host
term that was mentioned; and (5) if no date entity is identified in the report, assign to the
epidemic tuple the 1st of the month if the report was published between the 1st and the 15th
of the month, or the 15th day of the month if the report was published after the 15th of
the month. In addition, I assigned a “report identifier” to each tuple, which I curated
based on the date of the epidemic report and the positional index of the report in the
web page (e.g., the third report for the update published on September 28, 2017 would
receive ‘2017_09_28_3’ as an identifier).
Each epidemic tuple therefore consists of a report identifier, a date in ISO 8601
format, an influenza pathogen identifier, a location identifier, a country identifier, and a
host population identifier.
Epidemic Entity Resolution
In the fifth and final step of the ER component, I first deduplicated the epidemic tuples: I removed any tuple that exactly matched another tuple on report identifier, influenza pathogen, location, and country, and whose host population terms were all listed in the other tuple. In addition, I merged tuples that had
exact matches on report identifier, influenza pathogen, location, and country with
another tuple, but had different host population terms.
I then loaded all the remaining tuples to an SQLite database, and ordered them
by epidemic date, influenza pathogen, country, and location (in that order). Using the
output of this ordering, I programmatically grouped the tuples by influenza pathogen and
country, such that each tuple in a group had the same influenza pathogen term and
country term and had an epidemic date that was less than 60 days apart from at least
one other tuple in the group. The final output of this step was a set of tuple groups, each consisting of one or more tuples ordered by date, where each group (ideally) refers to a single epidemic.
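The grouping pass might be sketched like this: after sorting by pathogen, country, and date, a tuple joins the current group when it shares that group's pathogen and country and its date is fewer than 60 days after the group's latest date. This is a simplified reconstruction of the logic described above (the SQLite ordering is replaced by an in-memory sort, and tuples are reduced to the three grouping fields).

```python
import datetime

def group_epidemics(tuples, max_gap_days=60):
    """Group (pathogen, country, date_iso) tuples so that each group
    shares a pathogen and country and forms a date chain in which
    consecutive tuples are fewer than max_gap_days apart."""
    def parse(d):
        return datetime.date.fromisoformat(d)

    groups = []
    for t in sorted(tuples):
        pathogen, country, date = t
        g = groups[-1] if groups else None
        if (g and g[-1][0] == pathogen and g[-1][1] == country
                and (parse(date) - parse(g[-1][2])).days < max_gap_days):
            g.append(t)      # continues an existing epidemic chain
        else:
            groups.append([t])  # starts a new epidemic
    return groups
```

Under this chaining rule, two reports 38 days apart fall into the same epidemic, while a report five months later starts a new one.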
Figure 2-1. Flow chart of methods.
Figure 2-2. Example web page from the National Wildlife Health Center’s Avian Influenza News Archive. Accessed: 2018 May 29.
Figure 2-3. Images that illustrate heterogeneity in the formatting and use of punctuation in headings of reports. A) Text content of the first influenza report taken from the June 7, 2011 update. B) HTML content for the report in (A) that shows the report heading to be encased in a <b> tag with the exception of the terminal period in the heading, which is outside of this tag at the beginning of the report content. C) Text content of the first influenza report taken from the December 14, 2012 update. D) HTML content for the report in (C) that shows the report heading to be encased in a <b> tag along with its terminal period.
Figure 2-4. Example output file of NLP pre-processing step. The first line contains the heading as it appeared in the original epidemic report, while the lines below it contain the text content of the report.
CHAPTER 3 RESULTS
Web Scraper
The web scraper consists of two sub-modules. The first sub-module
(html_scraper.py), which contains 89 lines of code, downloads the HTML content of the
web pages containing avian influenza epidemic reports and stores them locally. The
second sub-module (scraper.py), which contains 245 lines of code, parses out the text
content of each report and stores it in structured JSON files. The web scraper ultimately
parsed 442 web pages in total. I excluded three pages that were exact duplicates of
previous reports.
Named-entity Recognition
Annotation Guideline Validation
Based on two annotators using the annotation guideline to annotate 15 reports,
the initial round of validation produced a kappa value of 0.8798. Because the value was high, no re-validation was necessary. However, two disagreements arose between the annotators that are worth mentioning. The first is related to how to
annotate multiple location or host organism names that occur sequentially in a
sentence. Consider, for example, the sentence, “Avian influenza was isolated from
samples taken from two pigs (Sus scrofa) that died on a farm in Gainesville in Alachua
County, Florida.” With respect to location names, the disagreement centered on
whether to annotate the entire string ‘Gainesville in Alachua County, Florida’ or each
location name that was in the string separately (i.e., ‘Gainesville’, ‘Alachua County’, and
‘Florida’). Similarly, for host organism names, the disagreement was on whether to
annotate the entire string ‘pigs (Sus scrofa)’ or each host name individually (i.e., ‘pigs’
and ‘Sus scrofa’). Because I want the NER module to identify each individual location
name within a report, I revised the guideline to clarify that each individual location and
host organism name needs to be annotated in such circumstances.
The second disagreement was on whether the word ‘virus’ as the singular
reference to an influenza pathogen in a report should be annotated or not. For example,
one sentence might say, “Lab testing confirmed that the virus was of the highly
pathogenic form.” Ultimately, I decided to clarify in the annotation guideline that it should
not be annotated. This is because extracting this term in the data extraction process
provides little to no benefit since I assume that each epidemic report is going to be
about some avian influenza virus.
CRF Model Training
For the initial round of training, I used a baseline set of features (Table 3-1). I
then tested the trained model against the development datasets, and calculated the
number of true positives, false positives, and false negatives for each named entity, and
subsequently the precision (p), recall (r), and F-score of the model (Table 3-2).
Of the four named entities, the model performed best—as measured by F1—at classifying influenza terms (p=0.9250, r=0.9136, F1=0.9193), while its worst performance
was on locations (p=0.7500, r=0.6316, F1=0.6857). The second and third best
performances, respectively, were on host organisms (p=0.8375, r=0.8272, F1=0.8323)
and dates (p=0.9167, r=0.7097, F1=0.8000).
For the second round of training, I added three features to the set of baseline
features (Table 3-3). The second model performed considerably better at classifying
locations (p=0.8182, r=0.8289, F1=0.8397), and had modest improvements in precision
for dates (p=0.9565). However, this came at a small cost to its performance with respect
to classifying host organisms and influenza (p=0.8472, r=0.7531, F1=0.7974; and
p=0.8916, r=0.9136, F1=0.9024, respectively).
CRF Model Testing
I selected the second trained model for the NER task due to the considerable
improvement to performance with classifying locations, relative to the first trained
model. For testing, I ran the model on the testing datasets and evaluated its
performance by calculating the precision (p), recall (r), F-score, and number of true
positive, false positives, and false negatives for each named entity (Table 3-2).
Most notably, there was a considerable increase in the precision for host
organisms from that of the second round of development (0.8472 to 1.0000). One
possible explanation for this is that the training data may have contained more unique
host organism names as a proportion of the total number of host organism names in the
data, despite random sampling. The small size of the overall dataset likely would have
contributed to this effect. Despite this, the overall metrics showed little change to the F-
score (0.8397 to 0.8463), a modest increase in precision (0.8627 to 0.9220), and a
modest decrease in recall (0.8178 to 0.7821), when compared to the results for the
training data (Table 3-2). The increase in precision to 1.0 for dates and hosts was likely
offset by a reduction in the recall for locations (0.8289 to 0.7632) such that the overall
effect on the F-score was minimal.
Entity Resolution
I performed entity resolution on the tagged corpus that was generated by the
NER module. From the 1,963 reports that the web scraper module extracted, the entity
resolution module produced at least one epidemic tuple for 1,192, generating 3,461
epidemic tuples in total. After the deduplication and merge steps, 2,524 epidemic tuples
remained. From these 2,524 tuples, the entity resolution module identified 1,144
epidemic individuals, each of which contains the identifiers of the reports that described
it, the range of dates in which the epidemic occurred, the influenza pathogen that was
its cause, the locations where it occurred, the country in which it occurred, and the host
organisms that were infected.
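The deduplication and merge steps can be illustrated with a small sketch. This is not the thesis's implementation; it assumes a simplified rule under which tuples that share a pathogen and country, with dates falling within a 60-day window (the window discussed under Limitations), resolve to one epidemic individual:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Illustrative merge rule; the actual resolver's matching logic may differ.
WINDOW = timedelta(days=60)

@dataclass
class Epidemic:
    pathogen: str
    country: str
    start: date
    end: date
    reports: set = field(default_factory=set)
    hosts: set = field(default_factory=set)

def resolve(tuples):
    """tuples: iterable of (report_id, date, pathogen, country, host)."""
    epidemics = []
    for report_id, d, pathogen, country, host in sorted(tuples, key=lambda t: t[1]):
        for e in epidemics:
            # Same pathogen and country, within the 60-day window: merge.
            if (e.pathogen == pathogen and e.country == country
                    and d <= e.end + WINDOW):
                e.end = max(e.end, d)
                e.reports.add(report_id)
                e.hosts.add(host)
                break
        else:
            # No match found: this tuple starts a new epidemic individual.
            epidemics.append(Epidemic(pathogen, country, d, d, {report_id}, {host}))
    return epidemics
```

Each resulting `Epidemic` carries the report identifiers, the date range, the pathogen, and the infected hosts, mirroring the fields listed above.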
From these data, I was able to further analyze the number of avian influenza
epidemics that occurred in each country from November 7, 2006 to September 28, 2017
(Figure 3-1), the number of epidemics that occurred in each year (Figure 3-2), the
number of times each host organism was involved in an epidemic (Figure 3-3), and the
number of times each influenza subtype was the cause of an epidemic (Figure 3-4).
In total, there were 68 countries that were the location of an epidemic from
November 7, 2006 to September 28, 2017, 52 different types of host organisms that
were a participant in an epidemic, and 17 different avian influenza subtypes that were
the cause of epidemics in this time period. China was the most common location
(n=213 identified epidemics or 18.62%); birds were the most common host (n=535
identified epidemics or 46.77%); and H5N1 was the most common influenza subtype
(n=735 identified epidemics or 65.25%).
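Once epidemic individuals exist as structured records, tallies like those in Figures 3-1 through 3-4 reduce to simple aggregation. A sketch using Python's collections.Counter over hypothetical records (the real 1,144-record dataset is not reproduced here):

```python
from collections import Counter

# Toy records standing in for the identified epidemic individuals.
epidemics = [
    {"country": "China", "subtype": "H5N1", "hosts": {"bird"}},
    {"country": "China", "subtype": "H7N9", "hosts": {"human"}},
    {"country": "Indonesia", "subtype": "H5N1", "hosts": {"bird", "human"}},
]

by_country = Counter(e["country"] for e in epidemics)
by_subtype = Counter(e["subtype"] for e in epidemics)
by_host = Counter(h for e in epidemics for h in e["hosts"])

print(by_country.most_common(1))  # [('China', 2)]
```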
Table 3-1. Baseline set of features selected for the first round of training from the Stanford NER NERFeatureFactory.
Feature Selected Value
maxLeft 1
useClassFeature true
useWord true
useNGrams true
maxNGramLeng 2
usePrev true
useNext true
useWordPairs true
useSequences true
usePrevSequences true
useTypeSeqs true
useTypeSeqs2 true
useTypeySequences true
wordShape chris2useLC
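Stanford NER reads its feature configuration from a Java properties file. The baseline features in Table 3-1 could be written out roughly as follows; the `trainFile`, `serializeTo`, and `map` entries are placeholders, not the thesis's actual configuration, while the feature flags come from the NERFeatureFactory:

```
# Hypothetical training configuration for Stanford NER's CRFClassifier.
trainFile = training_data.tsv
serializeTo = avian-influenza-ner.ser.gz
map = word=0,answer=1

maxLeft = 1
useClassFeature = true
useWord = true
useNGrams = true
maxNGramLeng = 2
usePrev = true
useNext = true
useWordPairs = true
useSequences = true
usePrevSequences = true
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
```

Training is then typically invoked with a command along the lines of `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop train.prop`.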
Table 3-2. Summary of the CRF model performance for the first and second rounds of training, and the final round of testing.

Round    Entity          P       R       F1      TP   FP   FN
Round 1  Date            0.9167  0.7097  0.8000  22   2    9
         Host organism   0.8375  0.8272  0.8323  67   13   14
         Influenza       0.9250  0.9136  0.9193  74   6    7
         Location        0.7500  0.6316  0.6857  48   16   28
         Totals          0.8508  0.7844  0.8162  211  37   58
Round 2  Date            0.9565  0.7097  0.8148  22   1    9
         Host organism   0.8472  0.7531  0.7974  61   11   20
         Influenza       0.8916  0.9136  0.9024  74   9    7
         Location        0.8182  0.8289  0.8235  63   14   13
         Totals          0.8627  0.8178  0.8397  220  35   49
Final    Date            1.0000  0.6957  0.8205  16   0    7
         Host organism   1.0000  0.7083  0.8293  68   0    28
         Influenza       0.9516  0.9516  0.9516  59   3    3
         Location        0.8056  0.7632  0.7838  58   14   18
         Totals          0.9220  0.7821  0.8463  201  17   56

P=precision; R=recall; F1=F-score; TP=true positive; FP=false positive; FN=false negative.
Table 3-3. Additional features added to the baseline features for the second round of model training.
Feature Selected Value
entitySubclassification “IOB2”
noMidNGrams true
useDisjunctive true
Figure 3-1. Number of avian influenza epidemics by country from 2006 to 2017 (top 15 shown).
Figure 3-2. Number of avian influenza epidemics by year from 2005 to 2017.
Figure 3-3. Number of avian influenza epidemics by host from 2006 to 2017 (top 15 shown).
Figure 3-4. Number of avian influenza epidemics by influenza pathogen from 2006 to 2017.
CHAPTER 4
DISCUSSION
My goal for this project was to transform unstructured text data in web reports
about avian influenza outbreaks into query-able, structured data about individual
epidemics. I developed a method to extract data about dates, locations, host
populations, and influenza pathogens from these reports, and then use those data to
link multiple reports about the same epidemic over a period of time. This method
identified 1,144 individual epidemics that were reported in the USGS Avian Influenza
Archive from November 7, 2006 to September 28, 2017 and that occurred globally in
humans, birds, and other animal species.
Given the small size of the training dataset, the overall performance of the NER
classifier on the testing dataset for all four named entities is noteworthy (p=0.9220,
r=0.7821, F1=0.8463). Both date and host organism had high scores for precision
(p=1.0000 for each), however this came at the cost of recall for both (r=0.6957 and
r=0.7083, respectively). For date, the number of true positives was 16 and the number
of false negatives was seven. Of these seven false negatives, three were for relative
terms for dates (e.g., ‘last week’, ‘this year’). Meanwhile, for host organism the number
of true positives was 68 and the number of false negatives was 28. Most of these false
negatives occurred with host terms that were uncommon in the corpus (e.g., ‘cat’,
‘Oriental White eyes’, ‘black-crowned night heron’).
Of the four named entities, influenza pathogen had the best results (p=0.9516,
r=0.9516, F1=0.9516). This was probably due to the lack of heterogeneity in the terms
used to refer to avian influenza and avian influenza subtypes. Of the three false
negatives, two were for terms that were rarely used in the entire corpus (i.e., ‘swine flu’
and ‘seasonal influenza’). Meanwhile, the model mistakenly labeled ‘H5N1’ in ‘H5N1
vaccine’ and ‘2009’ in ‘H1N1 (2009)’ as influenza pathogens.
On the other hand, location had the poorest results of the four named entities
(p=0.8056, r=0.7632, F1=0.7838). Of the 14 false positives, four occurred on the
commas separating two location names, as in the string ‘Detroit, Michigan’, while the
rest were on names of people, organizations, or dates. Most of the false negatives
occurred on terms denoting villages, cities, and continents that were mentioned fewer
than two times in the training and testing datasets.
Taken together, although the performance of the NER model was satisfactory,
the results suggest that a larger dataset may contribute to improvements in
performance. Nevertheless, that my model achieved an F-score of 0.8463 with such a
small training dataset has promising implications for the application of CRF models to
epidemiological data extraction tasks, especially ones that involve names of pathogens
that are new or rare and for which few text-based datasets exist.
Another interesting result is the number of reports in which an epidemic individual
was identified (1,192). This means that out of 1,963 total reports, the entity resolver was
only able to identify complete mentions of epidemics (date, location, pathogen, host) in
60.72% of all reports. One reason for this result might be that many of the reports in the
web pages were not about epidemics, but instead were about research breakthroughs
and initiatives, ecological reports, and public health guidelines, to name a few. It is
possible, then, that many of these reports did not contain terms denoting locations or
host organisms. Because an epidemic tuple is not generated if one or both of these
named entities is not identified, this would have partly contributed to this discrepancy.
Classification errors in the NER step also likely contributed, as the misclassification or
omission of a label for a named entity term would also affect whether a host organism
and location are identified in a report. In addition, spelling mistakes in the host organism
and location terms would cause the corresponding look-up steps to fail, which prevents
an epidemic tuple from being generated even if the host organism or
location named entities were correctly labeled in the NER step. Finally, because the
entity resolver only performed a location look-up for certain types of geographical
regions (countries, administrative level 1 country subdivisions, administrative level 2
country subdivisions, and cities), many location terms that denoted things like towns
and villages were overlooked. Including these types of locations in the future may
therefore lead to an increase in the number of identified epidemics.
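The tuple-generation behavior described above can be sketched as a guard around the two look-ups. The dictionaries below are stand-ins for the actual NCBI Taxonomy and Wikidata queries, and the identifier values shown are illustrative, not taken from the thesis pipeline:

```python
# Stub look-up tables; the real pipeline queries NCBI Taxonomy and Wikidata.
KNOWN_HOSTS = {"chicken": "NCBITaxon:9031", "duck": "NCBITaxon:8839"}
KNOWN_PLACES = {"Kaohsiung City": "Q181557"}  # hypothetical Wikidata QID

def make_tuple(report_id, report_date, pathogen, host_term, location_term):
    """Emit an epidemic tuple only if both look-ups succeed."""
    host = KNOWN_HOSTS.get(host_term)
    place = KNOWN_PLACES.get(location_term)
    if host is None or place is None:
        return None  # a misspelled or unsupported term blocks the tuple
    return (report_id, report_date, pathogen, host, place)

print(make_tuple("r1", "2015-01-04", "H5N1", "chicken", "Kaosiung City"))  # None
```

The misspelled 'Kaosiung City' falls through the look-up, so no tuple is produced even though the host term resolved correctly, which is exactly the failure mode described in the text.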
The end result of the toolset I developed and its application to the USGS avian
influenza reports is a dataset about avian influenza epidemics. One end goal of creating
this dataset of discrete data about unique epidemics—with links to the reports that
discuss them—was to create the ability to apply standard data analysis techniques
(including epidemiological analysis) to understand the burden of avian influenza better.
I demonstrated that my final avian influenza epidemic dataset enables these
capabilities by conducting a basic analysis of the locations, hosts, pathogens, and dates
of the epidemics.
The 1,144 identified epidemics occurred in 68 countries, with the majority being
clustered within southeast Asia. By far, China had the largest number of identified
epidemics at 213, while Indonesia had the second highest number at 120. These results
are not surprising, given that avian influenza is endemic to southeast Asia.
With respect to the number of epidemics by year, the results show three peaks
during the time period covered by the reports. The first peak is in 2009, which coincides
with the H1N1 swine influenza pandemic. Indeed, the overwhelming majority of H1N1
epidemics that were identified in the reports (13 out of 16) occurred in 2009, although
this hardly accounts for why the number of identified epidemics (114 epidemics) was so
high for this year. The second peak was in 2011, which had 134 identified avian
influenza epidemics. The third peak (also the highest peak) occurred in 2015 and 2016,
with 149 epidemics each. Interestingly, Chatziprodromidou et al. [42] performed a
systematic review of the scientific literature and ProMED reports about avian influenza
epidemics from 2010 to 2016 and found that 2016 had the highest number of avian
influenza epidemics in that time period (144 in total), while 2015 had the second highest
number (142 in total). My results are remarkably consistent with these numbers.
The host populations mentioned in the epidemic reports mainly fell into three
groups—birds, humans, and pigs. With respect to birds, chickens (Gallus gallus), ducks
(Anas), and geese (Anser) were the most common hosts mentioned in the epidemic
reports (Aves is the class for birds, and Galliformes is the order that contains most
species of poultry). This result is not surprising since chickens are some of the most
common birds grown on poultry farms, which frequently are the source of avian
influenza epidemics [43]. Furthermore, waterfowl, which include wild ducks and geese,
are considered to be common reservoir hosts for low pathogenic avian influenza
subtypes [44]. Interestingly, as with the results for the number of epidemics by year,
these findings align with those of Chatziprodromidou et al. [42], which found that avian
influenza epidemics from 2010 to 2016 affected commercial poultry more than any other
type of host.
The breakdown of the number of epidemics by avian influenza subtype is
interesting in that the overwhelming majority (65.2%) of identified epidemics involve the
H5N1 subtype (735 identified epidemics). Although Chatziprodromidou et al. [42] also
found that H5N1 was the most common subtype in reported epidemics, they calculated
that it was the causal pathogen in only 38.2% of avian influenza epidemics from 2010 to
2016. This percentage differs considerably from my result (65.2%). Even when limiting
my epidemic dataset to the time period 2010 to 2016, the discrepancy is still large:
56.9% vs. 38.2%. One possible reason for this discrepancy might be that the USGS
placed greater emphasis on curating epidemic reports for H5N1 due to the public health
risk that it poses and to the high economic burden it places on the poultry industry within
the United States and globally, relative to other avian influenza subtypes. Much of the
concern surrounding H5N1 has been placed particularly on the highly pathogenic form,
which has the capability to infect several different host species, including humans, and
has a high mortality rate.
Limitations
Taken together, my results illustrate the feasibility of identifying and linking
individual epidemics that are reported by multiple epidemic reports over time using the
methods that I have described here. However, there are several limitations to this work
that need to be considered if these methods are to be extended to other online data
sources or to other types of pathogens.
One notable limitation is the small size of the training dataset, which may have
negatively impacted the performance of the NER model. Whereas I used 94 datasets to
train the model, many NLP tasks use training datasets that are two or more orders of
magnitude larger. Despite this limitation, the NER model that I trained still
achieved good overall results for precision (0.9220), recall (0.7821), and F1 score
(0.8463). One explanation for this performance might be that a well-defined and clear
annotation guideline contributed to consistent and precise labeling of named entities in
the training dataset, which therefore contributed to a better-trained model. Because
inconsistencies in how certain named entities, such as locations, are labeled can
negatively affect a model’s ability to identify instances of them in text, it is possible that
minimizing such inconsistencies helped to boost the overall performance of the model.
Also, because the reports were almost entirely constrained to describing avian influenza
epidemics (and I excluded many reports that did not), it is likely that there is significant
regularity to the text in the reports that makes the task simpler than other NER tasks.
Whether these results would transfer to other online sources like ProMED-mail, which
for example is not restricted to influenza pathogens and to the subset of hosts that they
infect, is uncertain.
In addition, online text-based data may be more prone to spelling and
typographical errors, relative to official sources, that can make the process of textual
data extraction more challenging and negatively impact results. With respect to the
methods used here, spelling mistakes in location or host names might have resulted in
those terms not returning results in their respective look-up steps in the entity resolver.
For example, in one report there were two mentions of Kaohsiung City—one that had
the correct spelling and one that incorrectly spelled it as ‘Kaosiung City’. A separate
report misspelled it as ‘Kaoshiung City’. The most likely consequence of misspellings
(and the fact that I did not attempt to apply any spelling correction techniques) is that my
tools would not identify the epidemic in that location or in that host population. The
effect of misspellings overall would be to decrease the sensitivity of my toolset for
detecting epidemics.
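One possible mitigation, which my pipeline does not implement, is approximate string matching against the known vocabulary before the look-up step. A sketch using Python's difflib, with the misspellings taken from the example above:

```python
import difflib

# A toy vocabulary; in practice this would be the full set of location names
# available to the look-up step.
known_locations = ["Kaohsiung City", "Gainesville", "Jakarta"]

def correct(term, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or the term unchanged if none is close."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else term

print(correct("Kaosiung City", known_locations))   # Kaohsiung City
print(correct("Kaoshiung City", known_locations))  # Kaohsiung City
```

The cutoff would need tuning: too low, and distinct place names collapse together; too high, and genuine misspellings slip through uncorrected.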
Additional limitations exist with respect to the generalizability of the methods and
tools that I have described. Due to the unique formatting of the web pages that contain
the epidemic reports, the web scraper that I have described here would not likely be
able to extract the text content of epidemic reports from other sources of data. In
addition, it has been previously shown that there are issues with the generalizability of
learning-based NER models, especially in diverse domains with limited training data,
[45]. Given the diversity in the online data sources that are used for disease
surveillance, the NER model that I trained on a subset of these epidemic reports likely
would not perform as well on a different corpus, especially a more general one like
ProMED-mail, which is unrestricted with respect to pathogens and hosts (as opposed to
the pathogens and hosts involved with avian influenza). As such, if one were to extend
these methods to other data sources, it would be necessary to develop, train, and
evaluate a new web scraper and NER model.
Another limitation is that any classification errors that arise in the NER task
almost certainly affect the performance of the entity resolver later on. Although such
errors are unavoidable, the misidentification of certain named entities as being another
type of named entity or as not being any of the four types of named entities likely
resulted in several epidemics being omitted from the results. This is especially true for
host organism and location entities. For date entities, this type of error would, at best,
produce an epidemic date that is off by several days, and at worst would likely prevent
an epidemic tuple from being linked to the epidemic that it describes.
Likewise, imperfections in the entity resolution task might also significantly skew
the results, as insufficient resolution could lead to over-counting epidemics and over-
resolution could lead to under-counting epidemics. With respect to the latter, if multiple
epidemics were reported to have occurred within 60 days of each other in countries that
have a large geographical area, it is possible that they may have been resolved to a
single epidemic, even if they occurred far from each other and therefore are different
epidemics. Because I did not measure the performance of the entity resolution task,
there is no way of knowing the exact degree to which my results truly reflect the number
of epidemics that were mentioned by the epidemic reports. Nevertheless, I am
reasonably confident that the epidemics that my entity resolver identified do represent
real epidemics, especially given how well my results align with those of
Chatziprodromidou et al. [42]. Still, it would benefit future extensions of this work to
properly evaluate the performance of the entity resolver to ensure that the most
accurate count of epidemics is being achieved.
Future Work
There are several different ways in which this work can be improved upon and
extended in the future. As I already mentioned, one way is to evaluate the performance
of the entity resolver at identifying and linking epidemic individuals. One way to achieve
this task is to recruit a domain expert to manually extract each individual epidemic that
is mentioned by a set of epidemic reports, and then compare the expert’s results to the
results of the entity resolver on that particular set of reports.
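Such a comparison could be sketched as set arithmetic, assuming each epidemic can be reduced to a hashable key; a real evaluation would need a fuzzier matching rule for dates and locations. The keys and counts below are invented examples:

```python
def evaluate(resolver_out, expert_out):
    """Precision and recall of the resolver against expert-extracted epidemics."""
    tp = len(resolver_out & expert_out)   # epidemics both found
    fp = len(resolver_out - expert_out)   # resolver-only (spurious)
    fn = len(expert_out - resolver_out)   # expert-only (missed)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical (subtype, country, year) keys for a small report set.
expert = {("H5N1", "China", 2015), ("H7N9", "China", 2015), ("H5N8", "Korea", 2016)}
resolver = {("H5N1", "China", 2015), ("H5N8", "Korea", 2016), ("H5N1", "Egypt", 2015)}
print(evaluate(resolver, expert))  # (0.6666666666666666, 0.6666666666666666)
```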
In addition, rather than using a rule-based approach to entity resolution, I could
test more sophisticated approaches that employ graph similarity measures or machine
learning, and then compare the results to those that I achieved here. The extent to
which any improvement to the results would be noticeable, however, is uncertain,
especially if the results that I arrived at here come close to the number of epidemics that
were reported on.
One task that I did not attempt is the extraction of epidemiological data, such as
the number of cases or the mode or frequency of interspecies transmission, from the
reports. As these data are frequently of high value to epidemiologists, the ability to
extract them in an automated fashion would be beneficial.
Another approach that might increase the utility of this work for others would be
to store the extracted data about these epidemics as linked data on the Semantic Web.
This would help to improve the machine-readability and standardization of the data, the
latter of which is noticeably lacking in the field of computational epidemiology.
Moreover, this might also help to improve the ability to integrate the dataset with other
resources in the future. Nevertheless, my final epidemic dataset does use standard
identifiers for host and pathogen, formats dates according to international standards
(ISO 8601), and uses a popular Semantic Web database (i.e., Wikidata) for locations.
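As an illustration of these conventions, a single epidemic record might be serialized as follows. The field names and values here are hypothetical examples rather than rows from my dataset; only the identifier schemes (NCBI Taxonomy for hosts, Wikidata for locations, ISO 8601 for dates) come from the text:

```python
import json
from datetime import date

# Hypothetical record following the dataset's identifier conventions.
epidemic = {
    "reports": ["report-0421", "report-0437"],          # invented report IDs
    "start_date": date(2015, 1, 4).isoformat(),         # ISO 8601: '2015-01-04'
    "end_date": date(2015, 2, 20).isoformat(),
    "pathogen": "H5N1 subtype",
    "host": "http://purl.obolibrary.org/obo/NCBITaxon_9031",  # Gallus gallus
    "location": "http://www.wikidata.org/entity/Q148",        # China
}
print(json.dumps(epidemic, indent=2))
```

Expressing the record with resolvable URIs like these is what would make a later conversion to RDF triples for the Semantic Web largely mechanical.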
Finally, it would be interesting to extend these methods to other data sources,
such as ProMED-mail or online news aggregators to see if similar results can be
achieved. Doing so would be the ultimate test of generalizability, as it would illustrate
the effectiveness of these methods at extracting data from multiple data sources that
differ greatly in format.
APPENDIX
ANNOTATION GUIDELINE
The purpose of this guideline is to provide a standard for annotating the given text corpus that is to be used by the annotator as a reference. For this particular task, the annotator will receive a set of 15 “datasets” that each consist of a brief report on an influenza outbreak. It is important that the annotator read through each report carefully so as not to miss or misidentify any of the four entities that will be outlined below. If at any point the annotator is unsure of whether a term deserves an annotation or not, they can refer back to this guideline.

Entities of interest:
• Date
• Location
• Influenza pathogen
• Host organism

For each of the aforementioned entities, the following rules for annotation apply:

General
• Always annotate abbreviations that refer to or denote one of the above entities.
• For terms that consist of more than one word, always annotate the entire set of words as one term. For example, ‘high pathogenicity H5N1 avian influenza virus’ would receive one annotation.
• The plural form of a word can be annotated.

Date
• Any alphanumeric term that refers to or denotes a day, month, year, season, or holiday, as well as relative temporal terms, such as “last week” or “today.”
Location
• Any alphanumeric term that refers to or denotes a geographical entity, such as ‘Lake Erie’ or ‘Mt. Everest’.
• Any alphanumeric term that refers to or denotes a geopolitical entity, such as ‘United States of America’ or ‘Paris’.
o Include countries, states/provinces, cities, towns, etc.
• DO NOT annotate any descriptors that precede a term if they are not part of the proper name of that term, for example ‘state of’ in ‘state of California’ would not be annotated, but ‘State of’ in ‘State of Palestine’ would.
• DO NOT annotate any cardinal directions, such as ‘northwest’ or ‘southern’, unless they are part of the proper name of that term. For example, ‘north’ in ‘north Switzerland’ would not be annotated, but ‘North’ in ‘North America’ would.
• DO NOT annotate any homonyms of a location term that denote some other entity that does not meet the first two criteria (this requires that you pay careful attention to the context that a term is used in). For example, ‘Washington’ can denote either Washington, DC or Washington state, in which case the term would
be annotated; however, it can also be used to denote the federal government of the United States, and therefore would not receive one.
• DO NOT annotate a term that belongs to the proper name of a different type of entity, such as the name of a company, private organization, or government organization or branch. For example, ‘Canada’ in ‘Ministry of Health of Canada’ would not be annotated.
• DO NOT annotate a series of locations that occur in sequence to one another (e.g., ‘Gainesville, FL’) as a single location. Instead, annotate each location name separately.
Influenza pathogen
• Any alphanumeric term that is synonymous with ‘influenza virus’ or any of its serotypes.
• Any alphanumeric term that refers to the pathogenicity of the virus (i.e., ‘high pathogenicity’ or ‘low pathogenicity’, or any variant thereof).
Host organism
• Any alphanumeric term that is in the noun form that refers to or denotes one or more host organisms. For example, ‘poultry’ in ‘H7N9 was responsible for the death of 8 poultry’ would be annotated, but ‘poultry’ in ‘poultry farm’ would not.
• DO NOT annotate any descriptors that precede a term that do not have any taxonomical meaning, such as ‘cage-free’ in ‘cage-free chickens’ or ‘wild’ in ‘wild birds’.
• DO NOT annotate a series of host organisms that occur in sequence to one another (e.g., ‘pigs (Sus scrofa)’) as a single host organism. Instead, annotate each organism name separately.
LIST OF REFERENCES
1 Ginsberg J, Mohebbi MH, Patel RS, et al. Detecting influenza epidemics using search engine query data. Nature 2009;457:1012–4. doi:10.1038/nature07634
2 Polgreen PM, Chen Y, Pennock DM, et al. Using internet searches for influenza surveillance. Clin Infect Dis Off Publ Infect Dis Soc Am 2008;47:1443–8. doi:10.1086/593098
3 Santillana M, Nguyen AT, Dredze M, et al. Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. PLoS Comput Biol 2015;11:e1004513. doi:10.1371/journal.pcbi.1004513
4 Lazer D, Kennedy R, King G, et al. Big data. The parable of Google Flu: traps in big data analysis. Science 2014;343:1203–5. doi:10.1126/science.1248506
5 Hickmann KS, Fairchild G, Priedhorsky R, et al. Forecasting the 2013-2014 influenza season using Wikipedia. PLoS Comput Biol 2015;11:e1004239. doi:10.1371/journal.pcbi.1004239
6 McIver DJ, Brownstein JS. Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real-time. PLoS Comput Biol 2014;10:e1003581. doi:10.1371/journal.pcbi.1003581
7 Nagar R, Yuan Q, Freifeld CC, et al. A case study of the New York City 2012-2013 influenza season with daily geocoded Twitter data from temporal and spatiotemporal perspectives. J Med Internet Res 2014;16:e236. doi:10.2196/jmir.3416
8 Paul MJ, Dredze M, Broniatowski D. Twitter improves influenza forecasting. PLoS Curr 2014;6. doi:10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117
9 Chowell G, Cleaton JM, Viboud C. Elucidating Transmission Patterns From Internet Reports: Ebola and Middle East Respiratory Syndrome as Case Studies. J Infect Dis 2016;214:S421–6. doi:10.1093/infdis/jiw356
10 Mykhalovskiy E, Weir L. The Global Public Health Intelligence Network and early warning outbreak detection: a Canadian contribution to global public health. Can J Public Health Rev Can Sante Publique 2006;97:42–4.
11 Freifeld CC, Mandl KD, Reis BY, et al. HealthMap: Global Infectious Disease Monitoring through Automated Classification and Visualization of Internet Media Reports. J Am Med Inform Assoc JAMIA 2008;15:150–7. doi:10.1197/jamia.M2544
12 Anema A, Kluberg S, Wilson K, et al. Digital surveillance for enhanced detection and response to outbreaks. Lancet Infect Dis 2014;14:1035–7. doi:10.1016/S1473-3099(14)70953-3
13 Cleaton JM, Viboud C, Simonsen L, et al. Characterizing Ebola Transmission Patterns Based on Internet News Reports. Clin Infect Dis 2016;62:24–31. doi:10.1093/cid/civ748
14 Cauchemez S, Fraser C, Van Kerkhove MD, et al. Middle East respiratory syndrome coronavirus: quantification of the extent of the epidemic, surveillance biases, and transmissibility. Lancet Infect Dis 2014;14:50–6. doi:10.1016/S1473-3099(13)70304-9
15 Health Threats Unit at Directorate General Health and Consumer Affairs of the European Commission. MedISys (Medical Intelligence System). http://medisys.newsbrief.eu/medisys/homeedition/en/home.html (accessed 30 Jun 2018).
16 Yu VL, Madoff LC. ProMED-mail: An Early Warning System for Emerging Diseases. Clin Infect Dis 2004;39:227–32. doi:10.1086/422003
17 National Wildlife Health Center. USGS National Wildlife Health Center - Avian Influenza News Archive. https://www.nwhc.usgs.gov/disease_information/avian_influenza/ (accessed 5 Jul 2018).
18 The Python Language Reference — Python 3.5.5 documentation. https://docs.python.org/3.5/reference/ (accessed 29 May 2018).
19 Beautiful Soup website. https://www.crummy.com/software/BeautifulSoup/? (accessed 11 Jun 2018).
20 WHO | Definitions: emergencies. WHO. http://www.who.int/hac/about/definitions/en/ (accessed 11 Jun 2018).
21 Collier N, Doan S, Kawazoe A, et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics 2008;24:2940–1. doi:10.1093/bioinformatics/btn534
22 Jimeno A, Jimenez-Ruiz E, Lee V, et al. Assessment of disease named entity recognition on a corpus of annotated sentences. BMC Bioinformatics 2008;9:S3. doi:10.1186/1471-2105-9-S3-S3
23 Leser U, Hakenberg J. What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform 2005;6:357–69. doi:10.1093/bib/6.4.357
24 Tsai RT-H, Sung C-L, Dai H-J, et al. NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition. BMC Bioinformatics 2006;7:S11. doi:10.1186/1471-2105-7-S5-S11
25 Zhao S. Named entity recognition in biomedical texts using an HMM model. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. Association for Computational Linguistics 2004. 84–7. doi:10.3115/1567594.1567613
26 Mohit B. Named Entity Recognition. In: Natural Language Processing of Semitic Languages. Springer, Berlin, Heidelberg 2014. 221–45. doi:10.1007/978-3-642-45358-8_7
27 Kapetanios E, Tatar D, Sacarea C. Natural Language Processing: Semantic Aspects. In: Natural Language Processing: Semantic Aspects. CRC Press 2013. 298.
28 Nadeau D, Sekine S. A survey of named entity recognition and classification. Lingvisticae Investig 2007;30:20.
29 Lafferty J, McCallum A, Pereira FCN. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. 2001;:10.
30 Finkel JR, Grenager T, Manning C. Incorporating non-local information into information extraction systems by Gibbs sampling. Association for Computational Linguistics 2005. 363–70. doi:10.3115/1219840.1219885
31 Chinchor N. Message Understanding Conference (MUC) 7 LDC2001T02. 2001.https://catalog.ldc.upenn.edu/LDC2001T02 (accessed 6 Jul 2018).
32 Ratinov L, Roth D. Design challenges and misconceptions in named entity recognition. Association for Computational Linguistics 2009. 147. doi:10.3115/1596374.1596399
33 Doğan RI, Leaman R, Lu Z. NCBI disease corpus: A resource for disease name recognition and concept normalization. J Biomed Inform 2014;47:1–10. doi:10.1016/j.jbi.2013.12.006
34 Stenetorp P, Pyysalo S, Topić G, et al. brat: a Web-based Tool for NLP-Assisted Text Annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France: : Association for Computational Linguistics 2012. 102–107.http://www.aclweb.org/anthology/E12-2021 (accessed 30 May 2018).
35 standoff2corenlp.py. https://gist.github.com/thatguysimon/6caa622be083f97b8c5c9a10478ba058 (accessed 30 May 2018).
36 Ratnaparkhi A. Maximum Entropy Models For Natural Language Ambiguity Resolution. 1998.https://repository.upenn.edu/ircs_reports/60
37 Federhen S. The NCBI Taxonomy database. Nucleic Acids Res 2012;40:D136–43. doi:10.1093/nar/gkr1178
38 Sayers EW, Barrett T, Benson DA, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2009;37:D5-15. doi:10.1093/nar/gkn741
39 Bio.Entrez module. https://biopython.org/DIST/docs/api/Bio.Entrez-module.html (accessed 9 Jul 2018).
40 Vrandečić D, Krötzsch M. Wikidata: A Free Collaborative Knowledgebase. http://korrekt.org/page/Wikidata:_A_Free_Collaborative_Knowledgebase (accessed 9 Jul 2018).
41 Fernández S, Tejo C, Herman I, et al. SPARQLWrapper: SPARQL Endpoint interface to Python. https://rdflib.github.io/sparqlwrapper/ (accessed 9 Jul 2018).
42 Chatziprodromidou IP, Arvanitidou M, Guitian J, et al. Global avian influenza outbreaks 2010–2016: a systematic review of their distribution, avian species and virus subtype. Syst Rev 2018;7. doi:10.1186/s13643-018-0691-z
43 Sims LD, Domenech J, Benigno C, et al. Origin and evolution of highly pathogenic H5N1 avian influenza in Asia. Vet Rec 2005;157:159–64.
44 Garamszegi LZ, Møller AP. Prevalence of avian influenza and host ecology. Proc R Soc Lond B Biol Sci 2007;274:2003–12. doi:10.1098/rspb.2007.0124
45 Augenstein I, Derczynski L, Bontcheva K. Generalisation in named entity recognition: A quantitative analysis. Comput Speech Lang 2017;44:61–83. doi:10.1016/j.csl.2017.01.012
BIOGRAPHICAL SKETCH
Matt’s major was in medical sciences with a concentration in biomedical
informatics. He graduated with a Master of Science in the summer of 2018 while
working as a research assistant for the Informatics Services Group of the Models of
Infectious Disease Agent Study.