2016/17 Data Analytics Project Lot 1: Measurement and Evaluation of a Gold Dataset for Text Processing
Final Technical Report: Description of the Implemented Method of Annotation and the Resulting
Dataset
1. Introduction
This technical report has been produced as a deliverable associated with Lot 1 (Measurement and
Evaluation of a Gold Dataset for Text Processing) of the RCloud Statement of Requirement for Dstl’s
2016/17 Data Analytics project. The report has been written by Aleph Insights Limited, with inputs
from Committed Software Limited (formerly Tenode Limited).
Lot 1 requires the development of a ‘Gold Standard’ dataset which is annotated in a manner which
optimises its subsequent use in the training and evaluation of machine learning approaches to
natural language processing (NLP) in a defence and security context. This report outlines the
methods that were applied to annotate the dataset to the requisite quality, and summarises and
reflects upon the dataset’s contents. It should be read in conjunction with Dstl report
Aleph/2E27A3310,1 which lays out the requirements for the structure of the dataset; the present
report addresses the practical implementation of the dataset requirement framework specified
there.
This document includes a theoretical justification of how the method used to compile the dataset
meets the project's objectives, a detailed explanation of the method itself, and a discussion of the
resulting dataset. It comprises the following sections:
• Project Overview - an overview of the purpose of the project, the hybrid approach followed in
constructing the dataset and the theoretical justification of this method.
• Practical Application of the Method - a description of the implementation of the method
followed, including illustrative examples taken from the dataset.
• Dataset Summary - a review of the contents of the dataset, complete with summary statistics
providing insight into the composition of the dataset.
• Conclusions and Recommendations - a critical reflection on the process of compiling the
dataset, and of the dataset itself, addressing considerations for future work in this area.
1 2016/17 Data Analytics Project Lot 1: Measurement and Evaluation of a Gold Dataset for Text Processing - Dataset Requirement Framework Specification - Dated: 5th January 2017
2. Project Overview
This section contains a review of the project aim and articulates how the hybrid method used to
annotate the dataset enables this aim to be met. It discusses the overall decisions taken about the
method, and seeks to justify why these decisions were taken. It draws upon references to previous
research in this area, and relates this research to the decisions taken in ‘the present project’ (i.e. the
work carried out under this research project).
The overall aim of the present project was the creation of a gold standard dataset that could be used
to train and validate machine learning approaches to NLP, specifically focussing on entity and
relationship extraction relevant to the role of a defence and security
intelligence analyst. The dataset was therefore constructed using documents and structured
schemas that were relevant to the defence and security analysis domain. The schema for entity
types was inherited from previous work2 (notably the development of Baleen (specifically version
2.3)) and the schema for relationship categorisation was developed during the project. The details of
the schemas, along with the selection of documents for the dataset, are discussed in greater depth
in Dstl report Aleph/2E27A3310. The methodological discussion within the present report will
concentrate instead purely on the implementation of the ‘hybrid’ approach employed for the
purposes of this project. In this case, the term hybrid refers to the combination of automated text
extraction together with human annotation, where the human annotation comprises both ‘expert’
(defined later in this section) annotation and crowdsourced annotation.
The rationale for a hybrid approach is that it combines the text-processing efficiency of machines
with humans’ ability to understand meaning and context. Burger et al. (2014)3 proposed a hybrid
approach along these lines as an effective method for generating corpus annotations at scale, where
both entity and relationship extraction are required. In their study, Burger et al. suggested that
automated extraction was sufficient for the classification of entities, but that human annotation was
better suited to the assignment of relationships between those entities.
Accordingly, in the present project, Baleen4 was used as the tool for automated extraction of
entities; these were subsequently served to expert annotators and crowd workers for confirmation
and addition of missed entities. However, Baleen was not used for relationship extraction, with only
human annotation of relationship types being conducted.
Hirschman et al. (2016)5 discuss different models of crowdsourcing for NLP tasks. They define
‘expertise’ in relation to annotation along three axes:
2 https://github.com/dstl/baleen/wiki/Type-System
3 Burger, John D., Emily Doughty, Ritu Khare, Chih-Hsuan Wei, Rajashree Mishra, John Aberdeen, David Tresner-Kirsch et al. "Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing." Database 2014 (2014): bau094.
4 A text analysis framework developed by Dstl: https://github.com/dstl/baleen/wiki/An-Introduction-to-Baleen
5 Hirschman, Lynette, Karën Fort, Stéphanie Boué, Nikos Kyrpides, Rezarta Islamaj Doğan, and Kevin Bretonnel Cohen. "Crowdsourcing and curation: perspectives from biology and natural language processing." Database 2016 (2016): baw115.
• Knowledge of the domain of the corpus - in the case of the current project this was the
conflict in Syria and Iraq;
• Understanding of the specific domain of the annotation-type itself (e.g. annotations focussed
on semantics) - in the current project this equated to selecting entities and relationships of
relevance to an intelligence analyst; and
• Familiarity with the annotation task itself - in the case of the present project, the schema
structure, its category definitions and the use of the annotation tool employed.
The two expert annotators in the present project met these criteria: they were both former
intelligence analysts with knowledge of the conflict in Syria and Iraq, they jointly developed the
entity and relationship schemas, and they were involved in developing the annotation tool.
Hirschman et al. (2016) consider that this definition of expertise is helpful for distinguishing the
expert annotators from the crowd workers, who may be considered as being relatively more naive in
relation to the three axes of expertise. Traditionally, in tasks requiring the annotation of domain-
specific natural language corpora, expert annotators have been considered as the arbiters of the
‘ground truth’6 for a given dataset. This has often meant that all other annotations (as provided by
automated approaches or crowd workers) have been benchmarked against expert tagging, and
annotations are only considered to be ‘correct’ if they correspond to the experts’ judgement.
Furthermore, ‘correctness’ of annotations tends to be judged by means of inter-annotator
agreement. So, the quality of a gold standard database is inferred from the number of instances that
have been accurately and consistently annotated, as compared to the experts.
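Agreement of this kind is commonly quantified with a chance-corrected statistic such as Cohen’s kappa. The following sketch is illustrative only (it is not drawn from the project’s tooling) and shows the calculation for two annotators labelling the same instances:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators classifying five entity instances (hypothetical data):
a = ["Person", "Location", "Person", "Organisation", "Person"]
b = ["Person", "Location", "Person", "Person", "Person"]
print(round(cohens_kappa(a, b), 3))   # 0.583
```

A kappa of 1.0 indicates perfect agreement; values near 0 indicate agreement no better than chance, which is why raw percentage agreement alone can overstate annotation quality.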
Researchers such as Dumitrache et al. (2017)7, however, have considered a more nuanced approach
to agreement in human-labelled data. They suggest that high levels of agreement, while potentially
implying accuracy with regards to some underlying truth about the meaning of the data, may also be
indicative of artificially-imposed homogeneity resulting from the annotation process. For example, a
small group of expert annotators who work closely together have the opportunity to develop,
standardise and apply a schema, and thus have a greater chance of applying that schema
consistently (i.e. with lower inter-annotator variance) than a disassociated crowd. But the
application of the schema by the small expert group is then at greater risk of adopting
interpretations of schema categories that do not generalise well to meaning as interpreted by a
general population. Further, high levels of inter-annotator agreement may also suggest a narrowly
selective corpus in which the inherent ambiguity of natural language has been limited by the choice
of documents. In cases where this is not representative of the type of data the corpus is trying to
reflect, this may actually undermine the validity of the dataset rather than reinforce it.
Therefore, achieving high inter-annotator agreement should not be regarded, in itself, as a wholly
reliable indication of the quality of annotations, as it can mask genuine ambiguity pertaining to the
meaning of text that could have been ignored or gone unnoticed by a group of similarly minded
expert annotators.
6 This term refers to the abstract concept that there is a single ‘truthful’ classification for each particular instance within a dataset
7 Dumitrache, Anca, Lora Aroyo, and Chris Welty. "Crowdsourcing Ground Truth for Medical Relation Extraction." arXiv preprint arXiv:1701.02185 (2017).
One method of ensuring that the ontologies associated with annotation schemas reflect the general
usage of a broader population is to employ crowd workers to verify, refute or provide alternative
classifications to expert annotators. Structured approaches for utilising crowd workers in the
annotation of gold standard datasets, such as the CrowdTruth Framework8 are becoming
increasingly common. This theoretical outlook, as espoused by Aroyo and Welty (2015)9, regards the
allocation of a single ground truth for the meaning of a given instance as naive, instead viewing the
capture of the uncertainty surrounding meaning as an essential part of the meaning itself.
Additionally, even if one doesn’t accept the contended claim that the concept of ‘ground truth’ is
incoherent, it is uncontroversially the case that the correct interpretation of a term or sentence may
require contextual or background information not contained within the sentence itself, and that the
ability to identify this kind of inherent ambiguity is an important skill for interpreters of text.
Consequently, retaining information about subjective disagreement between the crowd annotators
(or between the crowd annotators and the experts) may be actively desirable if we wish to develop
machine learning applications that are able to identify (or at least account for the existence of)
genuinely ambiguous sentences10, albeit within a carefully managed framework11. Best practice for
the use of crowds12 therefore involves allowing multiple classifications of instances, and recording
and calculating this ambiguity, whilst protecting against spam workers13.
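As an illustration of what ‘recording and calculating this ambiguity’ can mean in practice, the sketch below keeps the full distribution of crowd labels for one instance and summarises its spread with normalised entropy. This is an assumed metric shown for illustration, not the measure used in the project:

```python
import math
from collections import Counter

def label_distribution(labels):
    """Relative frequency of each class assigned by the crowd to one instance."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def ambiguity(labels):
    """Entropy of the crowd's label distribution, normalised to [0, 1].
    0 = unanimous crowd; 1 = judgements spread evenly over all classes seen."""
    dist = label_distribution(labels)
    if len(dist) == 1:
        return 0.0
    h = -sum(p * math.log(p) for p in dist.values())
    return h / math.log(len(dist))

print(ambiguity(["Location"] * 5))                                   # unanimous: 0.0
print(ambiguity(["Location", "Organisation", "Location", "Organisation"]))  # even split: 1.0
```

Retaining a score like this alongside each instance, rather than collapsing the crowd’s judgements to a single label, preserves exactly the disagreement information discussed above.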
In order to be consistent with this approach, the present project utilised a crowd to validate (or
indeed challenge) expert annotators’ judgements. The crowd were presented with text extraction
microtasks14, which contained guidance designed to explain the task and the meaning of different
classification categories, but which aimed to avoid over-prescription and the stifling of dissent.
Importantly, crowd workers were presented with candidate instances that had been identified by
experts, but not the classifications applied by the experts. Crowd workers were asked to provide
their own classification from the schema to avoid being biased by previous expert interpretation. In
8 An approach originally developed as part of the IBM Watson project, but subsequently developed by IBM and a network of academic institutions - http://crowdtruth.org/
9 Aroyo, Lora, and Chris Welty. "Truth is a lie: Crowd truth and the seven myths of human annotation." AI Magazine 36, no. 1 (2015): 15-24.
10 Indeed, Dumitrache et al. (2017) (Dumitrache, Anca, Lora Aroyo, and Chris Welty. "Crowdsourcing Ground Truth for Medical Relation Extraction." arXiv preprint arXiv:1701.02185 (2017)) argue that harnessing measures which reflect the existence of disagreement in instance classifications, rather than trying to settle on a single ground truth, results in improved performance for automated extraction classifiers
11 See Drapeau et al. (2016) for a novel approach to stimulating disagreement and resolving disputes in order to improve crowd worker engagement and the quality of information derived from a crowd (Drapeau, Ryan, Lydia B. Chilton, Jonathan Bragg, and Daniel S. Weld. "MicroTalk: Using Argumentation to Improve Crowdsourcing Accuracy." In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP). 2016.).
12 As proposed in Aroyo, Lora, and Chris Welty. "Truth is a lie: Crowd truth and the seven myths of human annotation." AI Magazine 36, no. 1 (2015): 15-24.
13 Crowd workers who exploit the revenue that can be gained through crowdsourcing platforms through delivering low effort/low quality work.
14 These were single sentence tasks, where a crowd worker was asked to classify a single entity or relationship type. The tasks presented to crowd workers were deliberately deconstructed in this way to make them as cognitively straightforward as possible and minimise the risk of task confusion.
the present project crowd workers therefore acted to provide judgements on the annotations pre-
selected by the experts, rather than to add new annotations themselves. Crowd workers who could
not meet a basic standard of performance (e.g. spam workers) were excluded from completing tasks,
but the qualification threshold was calibrated so as not to exclude different, but valid viewpoints15.
The approach taken in the present study was to undertake validation of the expert annotators’
judgements using two metrics: straightforward crowd worker agreement with experts (i.e. as a
proportion of overall crowd worker judgements), and a considerably more sophisticated machine
learning approach based on a Bayesian model of crowd and expert judgement data generation.
The machine learning approach adopted in the present study - Independent Bayesian Classifier
Combination (IBCC) - aims to generate not a ‘true / false’ decision about the expert assignment, but
a probability that the judgement is correct, using only crowd and expert judgements as inputs. The
IBCC algorithm estimates two things: the accuracy of individual annotators given different categories
of input, and the probability distribution for each instance’s correct classification. The latter feature
enables us to assign a probability to each classification that takes account of all the information
encoded in the collected judgements of the annotators16.
This approach represents a significant improvement over the use of simplistic ‘accept / reject’ rules,
as it allows dataset users to determine the level of confidence17 required for a given use case18. This
approach carried a time and computation burden, but proved to have significant advantages,
including enabling distinction between annotators, as well as between easy and hard categories of
entity or relationship, and providing a means of spotting likely errors and inherently ambiguous
instances.
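The IBCC model itself is described in Annex B. As a rough illustration of the underlying idea, the sketch below implements a simplified maximum-likelihood analogue (Dawid-Skene style expectation maximisation), which likewise estimates per-annotator confusion matrices and a per-instance class probability distribution; IBCC additionally places Dirichlet priors over these quantities and samples them, which this sketch omits:

```python
import numpy as np

def dawid_skene(judgements, n_classes, n_iter=50):
    """EM estimate of per-annotator confusion matrices and per-instance class
    probabilities, given (instance, annotator, label) triples.
    Returns (probs, confusion) where probs[i, k] = P(instance i is class k)."""
    instances = sorted({i for i, _, _ in judgements})
    annotators = sorted({a for _, a, _ in judgements})
    i_idx = {v: n for n, v in enumerate(instances)}
    a_idx = {v: n for n, v in enumerate(annotators)}

    # Initialise class probabilities from simple majority voting.
    probs = np.full((len(instances), n_classes), 1e-6)
    for i, _, l in judgements:
        probs[i_idx[i], l] += 1
    probs /= probs.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: confusion[a, true, reported] from the current soft labels.
        confusion = np.full((len(annotators), n_classes, n_classes), 0.01)
        for i, a, l in judgements:
            confusion[a_idx[a], :, l] += probs[i_idx[i]]
        confusion /= confusion.sum(axis=2, keepdims=True)
        prior = probs.mean(axis=0)
        # E-step: recompute each instance's class probabilities.
        log_p = np.tile(np.log(prior), (len(instances), 1))
        for i, a, l in judgements:
            log_p[i_idx[i]] += np.log(confusion[a_idx[a], :, l])
        log_p -= log_p.max(axis=1, keepdims=True)
        probs = np.exp(log_p)
        probs /= probs.sum(axis=1, keepdims=True)
    return probs, confusion

# Hypothetical judgements: three annotators, two classes; "a2" dissents on item 2.
data = [(0, "a1", 0), (0, "a2", 0), (1, "a1", 1), (1, "a2", 1),
        (2, "a1", 0), (2, "a2", 1), (2, "a3", 0)]
probs, _ = dawid_skene(data, n_classes=2)
```

The output probabilities play the role described above: rather than an accept/reject decision, each expert assignment can be read off against a graded confidence that reflects every annotator’s estimated reliability.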
This section has discussed the underlying theoretical justification for the hybrid approach used in
the present study to annotate the dataset (comprising automated extraction, expert annotation and
crowd annotation), and the nuanced method applied to the task of validating annotations by
calculating agreement or confidence (involving a mix of inter-annotator agreement and an IBCC
machine learning approach). In doing so, the authors hope to have articulated a robust approach to
creating the gold standard dataset that constitutes the primary output for this project. The following
section describes the approach taken in greater detail, outlining the practical application of the
method in order to enable its replication.
15 Consistent with the findings of Hirschman et al. (2016) who conclude that crowd worker input is useful only if it can be periodically validated (Hirschman, Lynette, Karën Fort, Stéphanie Boué, Nikos Kyrpides, Rezarta Islamaj Doğan, and Kevin Bretonnel Cohen. "Crowdsourcing and curation: perspectives from biology and natural language processing." Database 2016 (2016): baw115.)
16 In this instance, however, as discussed below, computational constraints meant that the entity dataset to which the algorithm could be applied contained only around 60% of all the judgements made.
17 The project team referred to this score using the informal but useful term ‘goldiness’.
18 For example, training an entity classifier for use by intelligence analysts to inform targeting might require inclusion of instances with a very high level of confidence in class assignments; using the dataset to support an academic study might require a larger dataset including lower-confidence assignments.
3. Implementation
The methodological approach discussed in the previous section involves a number of sequential
steps. These steps - discussed in more detail throughout this section - were:
• Automated Extraction - where an automated extraction tool is applied to unprocessed text (in
the present study’s case this involved single sentences) to begin to assign entities to the text
to act as an initial cue for a human annotator;
• Expert Annotation - where analysts within the project team applied entity and relationship
annotations to text using a specific annotation tool;
• Crowdsourcing - where crowd workers were presented with text containing highlighted
instances identified by the experts over a crowdsourcing platform, and were asked to provide
their own classification of these annotations;
• Confidence Calculation - where inter-annotator agreement and confidence scores were
calculated for each instance annotated through the hybrid approach represented by steps 1-3;
• Committal to Dataset - where all of this data is integrated, committed and stored in the
underlying database, enabling retrieval, interrogation and, ultimately, its utilisation as a
training or validation dataset.
An overview of this process is shown in Figure 1 overleaf.
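In outline, the five steps can be sketched as a simple pipeline. The stand-in functions below are purely illustrative placeholders for Baleen, Galaxy, Mechanical Turk, the confidence calculation and the database committal; none of this is the project’s actual code:

```python
# Minimal stand-ins for each stage so that the shape of the pipeline is runnable.
def automated_extraction(sentences):                     # step 1: Baleen candidates
    return {s: [(0, len(s.split()[0]))] for s in sentences}

def expert_annotation(candidates):                       # step 2: Galaxy expert classes
    return {s: "Organisation" for s in candidates}

def crowdsource(expert):                                 # step 3: Mechanical Turk judgements
    return {s: ["Organisation", "Organisation", "Person"] for s in expert}

def confidence(expert, crowd):                           # step 4: agreement / IBCC scoring
    return {s: sum(c == expert[s] for c in crowd[s]) / len(crowd[s]) for s in expert}

def build_gold_dataset(sentences):                       # step 5: database committal
    candidates = automated_extraction(sentences)
    expert = expert_annotation(candidates)
    crowd = crowdsource(expert)
    scores = confidence(expert, crowd)
    return [{"sentence": s, "expert": expert[s], "crowd": crowd[s],
             "confidence": scores[s]} for s in sentences]

dataset = build_gold_dataset(["Dstl published a report.", "Aleph annotated it."])
print(len(dataset))
```

Each record emerging from step 5 carries the expert classification, every crowd judgement and the derived confidence score, mirroring the integration described above.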
Figure 1 Overview of the Dataset Generation Process
3.1. Automated Extraction
The automated extraction aspect of this process was conducted by the Baleen
system, which parses and extracts text on ingest to an annotation tool
(Galaxy)19. Entities identified by Baleen were displayed to the expert
annotators and provided a prompt for some instances. The configuration was
developed through experimentation with the source documentation in an attempt to produce an
optimised set-up. Standard annotators and cleaners were used, including custom-built gazetteers20.
3.2. Expert Annotation
Expert Annotation was conducted by two members of Aleph Insights21, over
the course of the project, using the Galaxy tool produced by Committed
Software (formerly Tenode) for Data Analytics Lot 2. A screenshot of the
Galaxy annotation tool is displayed in Figure 2 below.
Text was presented to the experts in single sentence blocks. It was found that displaying larger
blocks of text (whole documents or paragraphs) made the task of annotation more fatiguing and
confusing. Furthermore, by isolating individual sentences it was possible to ensure consistency with
19 For full details of the Galaxy tool see Dstl report: CR17-RCLOUD-EGFR-20170330–R, Cloud Data Analytics Lot 2, Final Report
20 https://github.com/AlephInsights/gazetteers
21 Both expert annotators were experienced intelligence analysts who had worked in a range of analytical roles, including within teams doing tactical network analysis and on operational deployments. As such, they were familiar with the kinds of information considered salient for intelligence analysis tasks and were experienced in identifying this material within text documents.
Figure 2 Screenshot of Galaxy tool showing expert annotations in text.
the mode of display of text to crowd workers, increasing the validity of comparison between the two
populations of annotators22. The presentation of sentences to experts was randomised, to limit the
amount of contextual information the experts received. Sentences were annotated in short batches
to minimise fatigue.
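The randomisation and batching described above might be sketched as follows; the batch size and seed are illustrative assumptions, not the project’s actual values:

```python
import random

def make_batches(sentence_ids, batch_size=25, seed=42):
    """Shuffle sentences so annotators see them out of document order,
    then split them into short batches to limit fatigue."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    ids = list(sentence_ids)
    rng.shuffle(ids)
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

batches = make_batches(range(2281))   # corpus size from Section 4.1
print(len(batches))                   # 92 batches of at most 25 sentences
```

Shuffling before batching strips away document order, which is what limits the contextual information available to the experts for any one sentence.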
Experts performed annotation using one of three modes of co-working23:
• Joint annotation – This was the primary mode employed at the beginning and the end of the
overall annotation effort. Experts worked together to apply annotations to the same text, with
significant discussion. At the start of the annotation process this mode allowed experts to
agree and adjust their shared understanding of the schemas. The rules for applying the
schemas were developed through this process and then used in subsequent annotations done
independently. This set of schema interpretations has been captured and is presented in
Annex A of this report. Joint annotation was also used to generate a special set of annotations
(‘platinum’), which was considered to represent the experts’ most considered annotations and
was used to recruit and qualify crowd workers. Joint annotation was also conducted at the
end of the annotation effort to see whether divergence in schema interpretation had
developed.
• Solo annotation – Having decided and agreed the interpretation of the schemas, the experts
completed the majority of annotation alone. The full set of sentences was divided into
two lists and each expert worked through their own allocated portion of the corpus. Experts
discussed instances which were considered unusual, peculiar or difficult to make a decision
about, and subsequent adjustments were made to the schema interpretation, or new rules
were created24.
• Overlapping – Some batches of sentences were completed by both experts independently,
without communication. These sentences could then subsequently be discussed to check that
interpretation was consistent between experts. It was found that generally agreement was
good and only minor changes were needed, thus providing an indication that the ‘solo’
annotated sentences were likely to be of a similar standard.
Once expert annotations had been applied using one of these three modes, the sentences were
processed for presentation to the crowd.
22 As discussed in Dstl Report Aleph/2E27A3310 (2016/17 Data Analytics Project Lot 1: Measurement and Evaluation of a Gold Dataset for Text Processing - Dataset Requirement Framework Specification - Dated: 5th January 2017) single sentences were selected as the unit of text for the present project. While this decision was taken primarily because of the unmanageable complexity of whole document coreference, it also had a range of impacts on the ability of experts and crowd workers to judge annotations. One effect was the loss of sentence context, which made it more difficult for annotators to infer meaning, particularly with regards to the classification of relationships (where their existence is often determined based on knowledge accumulated from multiple sentences). However, the smaller units of text did present a cognitively simpler task to annotators, allowing quicker and less arduous completion, fitting better with the processing of the kinds of microtasks best suited to crowdsourcing.
23 Approximately 88% of the annotations were completed through solo annotation, 8% using joint annotation and the remainder using overlapping annotation.
24 Experts kept a shared running log of rules for interpretation, including examples, decisions about how to interpret these examples and the rationale for these decisions.
3.3. Crowdsourcing
Amazon’s Mechanical Turk25 was selected as the crowd platform for generating
annotations from crowd workers. This platform was selected due to its
technical flexibility (effectively displaying and managing large volumes of crowd
tasks) and because of the large population of crowd workers, with the requisite level of linguistic
competence, operating on the platform.
25 https://requester.mturk.com/
Figure 3 Example of an Entity Assignment Task
Mechanical Turk uses its own terminology, which this report borrows for its description of the
crowdsourcing method. It refers to those people completing tasks as ‘workers’, and those people
publishing tasks to be completed as ‘requesters’. The individual crowd tasks themselves are referred
to as ‘assignments’. In this project, two types of assignment were published and completed by
workers: Entity Assignments and Relationship Assignments26. Figure 3 (above) shows an example
template for a typical entity assignment task. Each assignment consisted of 5 separate sentences,
each with a list of options for workers to indicate their responses. In Entity Assignments, a single
instance is highlighted in a sentence of text and the worker is asked to indicate which classification
they believe best suits this instance.
26 Assignments were priced at $0.06-0.12 depending on the complexity of the assignment and the need to incentivise workers.
Figure 4 Example of a Relationship Assignment Task.
For Relationship Assignments, two entities (a source and a target) are highlighted, and the worker is
asked to indicate all relationships from a list which they believe apply in that context. Figure 4
(above) shows an example of a typical relationship assignment task.
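The grouping of expert-identified instances into five-sentence assignments could be sketched as follows; the field names are hypothetical (the real assignments were templated tasks on Mechanical Turk):

```python
def build_assignments(instances, per_assignment=5):
    """Group candidate instances into assignments of five sentences each.
    Each instance carries the sentence text and the character span of the
    highlighted entity (or entity pair, for relationship assignments)."""
    assignments = []
    for i in range(0, len(instances), per_assignment):
        chunk = instances[i:i + per_assignment]
        assignments.append({
            "type": chunk[0]["type"],   # 'entity' or 'relationship'
            "items": chunk,
        })
    return assignments

# Hypothetical candidate instances:
instances = [{"id": n, "type": "entity",
              "sentence": f"Sentence {n}", "span": (0, 8)} for n in range(12)]
assignments = build_assignments(instances)
print(len(assignments))   # 12 instances -> 3 assignments (5 + 5 + 2)
```

Note that only the instance location is carried forward, not the expert’s classification, consistent with the decision to withhold expert labels from the crowd.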
Guidance notes were presented along with each assignment, which explained how to approach the
task. The different guidance forms compiled for Entity Assignments (top) and Relationship
Assignments (bottom) are shown in Figure 5 (below).
Crowd assignments were scheduled randomly so workers would see a selection of sentences out of
their original order. Assignments were released in batches to facilitate management of large
numbers of tasks in manageable chunks. When generating a batch, it was possible to specify the
degree of repetition of each assignment and set the reward for each submitted assignment.
Mechanical Turk also provided a messaging system for workers and requesters to contact each other
regarding specific tasks or batches. This proved very useful, as workers were able to ask for
clarification in relation to assignments about which they were confused (although this happened
infrequently, giving some indication that the guidance for assignments was clear).
Figure 5 Entity and Relationship Assignment Guidance
While Mechanical Turk allows requesters to pre-select populations of workers (e.g. based on
geographic area), it was found that any worker with a good command of English could potentially
represent an effective crowd worker. Therefore, to ensure the largest possible recruitment pool,
workers were initially recruited by advertising assignments to all workers without restriction. These
initial assignments formed Qualification Batches, which were made up of entity instances from the
‘platinum’ data generated by expert annotators (see previous section on joint annotation by
experts).
The results from Qualification Batches were analysed to assess the performance of the workers.
Workers who completed more than 20 tasks and achieved 80% agreement with expert
classifications were awarded a Qualification which enabled them to participate in further work.
Periodically, workers were reassessed against the platinum results to check the quality of their work.
If a worker's performance dipped below 80% agreement with the experts, their Qualification was
revoked. Interestingly, out of the 200 most prolific workers initially recruited, only 5 needed this
sanction; generally, workers seemed consistent in their performance.
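The qualification rule described above can be sketched as a simple check. The thresholds come from the text; the function and field names are hypothetical:

```python
def update_qualification(worker, min_tasks=20, min_agreement=0.80):
    """Grant or revoke a worker's Qualification based on agreement with the
    'platinum' expert annotations. worker is a dict with counts of platinum
    tasks completed and of those matching the expert classification."""
    if worker["platinum_done"] < min_tasks:
        return "pending"          # not enough evidence yet
    agreement = worker["platinum_matched"] / worker["platinum_done"]
    return "qualified" if agreement >= min_agreement else "revoked"

print(update_qualification({"platinum_done": 25, "platinum_matched": 22}))  # qualified (88%)
print(update_qualification({"platinum_done": 25, "platinum_matched": 18}))  # revoked (72%)
print(update_qualification({"platinum_done": 10, "platinum_matched": 10}))  # pending
```

Keeping the threshold at 80% rather than higher reflects the calibration point noted above: strict enough to exclude spam workers, loose enough to retain different but valid viewpoints.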
All Entity Assignments were completed by at least three workers, and Relationship Assignments by
at least five workers (due to the greater levels of ambiguity surrounding relationship annotations).
Some instances were assigned to more workers in order to facilitate quality assurance of crowd
worker responses.
3.4. Confidence Calculation
The database of judgements was ingested (entirely, in the case of relationship
judgements, but partially in the case of entity judgements)27 into a software
tool, along with a Bayesian model of annotation data generation (‘IBCC’), to
generate probability distributions for the ‘correct’ annotation for each
instance. The probabilities thus assigned to the expert annotations were committed to the database
as the confidence score. A full explanation of the IBCC algorithm and its application is provided in
Annex B of this report.
3.5. Committal to Dataset
All annotation, agreement and confidence data was committed to the
dataset linked to the instance to which it applied. In practice this involved
all descriptive information about the instance (e.g. its location in a
document within the corpus, its classification by experts, its classification
by each crowd worker, its confidence score, etc.) being tied to a unique instance ID within the
underlying database. For a full description of the data structure refer to the Mongo data definition
spreadsheet accompanying the database.
27 Only partial ingestion, corresponding to 60% of the overall number of judgements, was computationally feasible. Beyond the 60% point, each additional crowd annotator only provided judgements for less than 2% of the overall corpus, adding something like 9,300 additional free variables to be computed per iteration of the sampling algorithm.
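For illustration, a single entity instance record of the kind described might look like the following. All field names here are hypothetical; the authoritative schema is the Mongo data definition spreadsheet:

```python
# Hypothetical shape of one instance record; the authoritative field names are
# in the Mongo data definition spreadsheet accompanying the database.
instance = {
    "instance_id": "ent-000123",
    "document_id": "doc-042",
    "sentence_index": 17,
    "span": {"start": 34, "end": 51},      # character offsets in the sentence
    "expert_class": "Organisation",
    "crowd_judgements": [
        {"worker": "w-81", "class": "Organisation"},
        {"worker": "w-07", "class": "Organisation"},
        {"worker": "w-33", "class": "Person"},
    ],
    "crowd_agreement": 2 / 3,              # proportion agreeing with the expert
    "confidence": 0.94,                    # IBCC probability for the expert class
}

# The simple agreement metric is recoverable directly from the stored judgements:
agree = sum(j["class"] == instance["expert_class"]
            for j in instance["crowd_judgements"]) / len(instance["crowd_judgements"])
print(round(agree, 3))   # 0.667
```

Tying every judgement and derived score to a unique instance ID in this way is what allows users to filter the dataset by confidence level for a given use case.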
4. Dataset Overview
This section provides an overview of the data contained within the dataset, particularly
concentrating on statistics relating to the annotations. It includes a number of subsections covering
different aspects of the data analysis; these are discussed in sequence below.
4.1. Summary Statistics
The document corpus comprises 219 individual documents drawn from 25 sources (open source
online content producers)28. The total number of sentences in the corpus was 2,281. There are
a total of 12,135 entity instances, 9,464 of which have been encoded by at least one ‘expert’
annotator. All of these have been verified by at least 3 crowd workers (with varying levels of
agreement). There are 3,694 relationship instances encoded by the expert annotators, all of which
have been verified by at least 5 workers.
4.2. Frequency of Instances by Type
Figure 6 (overleaf) shows the frequency of the entity types as annotated by the expert annotators
across the whole corpus. It should be noted that this is not the same thing as the ‘true’ frequency of
entity types, which is a disputed concept (as discussed in the project overview section of this report).
The most frequently occurring types of entity were organisations (3,959 instances as annotated by
experts), locations (2,171 instances as annotated by experts) and persons (1,404 instances as
annotated by experts). The large number of organisations suggests that it would be appropriate to
investigate unpacking the organisation type into a set of more narrowly defined sub-types which
would provide greater structure.
The most frequent entity types contrast markedly with the least frequent entity types of URLs (only
1 instance annotated by experts), radio frequencies (4 instances annotated by experts) and
communications identifiers (8 instances annotated by experts). This would seem to reflect the
frequencies that might be expected given the types of documents in the corpus (mostly news
articles, government information and blogs). This frequency could, of course, differ were military
and intelligence reports incorporated into the dataset, where the low frequency entities might be
referenced more often.
Figure 7 (overleaf) shows the same data based on expert annotations by relationship type. It should
be noted that relationship annotations achieved far lower levels of confidence (as discussed later in
this section) than entity types; therefore, expert annotations provide a weaker indication of the
overall annotation frequency (including crowd workers) for relationships than for entities.
There is a relatively even distribution of frequency across relationship types according to the data in
Figure 7, apart from the co-located relationship, which is much more frequent than the other
categories.
28 See ‘2016/17 Data Analytics Project Lot 1: Measurement and Evaluation of a Gold Dataset for Text Processing - Dataset Requirement Framework Specification - Dated: 5th January 2017’ for full details of how documents and sources were selected.
There are 1,430 co-located relationship instances, compared with a range of 96-411 for the other
categories (below). This is partly a feature of the nature of co-located relationships which often
apply multiply, simultaneously and mutually (i.e. when numerous objects are all in the same place).
Figure 6 Expert Annotated Entity Instance Frequency
Figure 7 Expert Annotated Relationship Instance Frequency
Figure 8 shows the number of crowd worker annotations produced for entity instances29. From the
graph it can be seen that the most frequent number was 3 annotations per entity instance. It should
be noted that those instances which appear on the graph as receiving zero crowd
annotations, represent entities which were extracted by Baleen, but subsequently rejected by the
expert annotators on inspection, and thus not presented to the crowd. Those instances that
received >20 annotations were part of the platinum dataset that was used to qualify workers.
29 Assignments for batches were selected randomly from the sentences annotated by experts, with a bias towards instances with no previous crowd results.
Figure 8 Number of Worker Annotations for Entities
Figure 9 Number of Worker Annotations for Relationships
Figure 9 shows the same data for crowd worker annotations for relationship instances. The graph
shows that the significant majority of relationship instances received 5-6 crowd worker annotations.
4.3. Crowd-Expert Agreement Scores
Figure 10 shows the distribution of crowd agreement with the experts’ annotations, ranging from 0%
to 100%. The graph shows a high level of agreement between the crowd and experts for the
classification of entities, with the significant majority of entity instances receiving higher than 90%
agreement. The mean agreement between crowd workers and expert annotations was 86.9%.
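The agreement statistics reported here can be reproduced mechanically: for each instance, agreement is the fraction of crowd annotations matching the expert label, and the corpus-level figure is the mean of these per-instance scores. A minimal sketch in Python (the instances, labels and helper function below are invented for illustration, not taken from the project codebase):

```python
from statistics import mean

def instance_agreement(expert_label, crowd_labels):
    """Fraction of crowd annotations that match the expert's label."""
    if not crowd_labels:
        return 0.0
    return sum(1 for label in crowd_labels if label == expert_label) / len(crowd_labels)

# Hypothetical instances: (expert label, crowd labels)
instances = [
    ("Location", ["Location", "Location", "Location"]),
    ("Organisation", ["Organisation", "Person", "Organisation"]),
    ("Nationality", ["Person", "Organisation", "Nationality"]),
]

scores = [instance_agreement(expert, crowd) for expert, crowd in instances]
print(scores)        # per-instance agreement: [1.0, 0.666..., 0.333...]
print(mean(scores))  # corpus-level mean agreement
```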
Figure 10 Distribution of Percentage Agreement Between Crowd Workers and Experts for Entity Annotations
Figure 11 Distribution of Percentage Agreement between Crowd Workers and Experts on Relationships.
Figure 11 shows the same data for crowd and expert agreement relating to relationship instances.
This shows a very different picture, and reflects a mean crowd agreement with expert annotations of
57.3%. The greater ambiguity inherent in the process of relationship classification, and the more
subjective nature of relationship schema interpretation are likely to be the major factors accounting
for the discrepancy between relationship and entity instance agreement in this project.
Figure 12 shows the distribution of crowd-expert agreement across quartiles by entity type. It can be
seen from the graph that ‘time’ and ‘person’ categories have the greatest proportion of high levels
of crowd-expert agreement. Other than ‘URL’, of which there was only a single instance30, the
‘nationality’ category possesses the largest proportion of low agreement scores. This corresponds
with the intuition of the expert annotators, who felt that classifying nationality was often very
confusing, due to the fact that nationality can often be used to refer to a person (e.g. “an Iraqi left
the building”) or an organisation (e.g. “The U.S. launched attacks on the city”).
30 The low agreement for the single occurrence of a URL within the dataset may be a feature of the rarity of this category, which meant crowd workers were unlikely to be familiar with classifying this type of entity.
Figure 12 Quartiles of Crowd-Expert Agreement by Entity Type
Figure 13 depicts the distribution of crowd-expert agreement across quartiles by relationship type.
The graph shows lower levels of crowd-expert agreement across all relationship categories. Of note
is the relatively high level of agreement for the ‘co-location’ category. Figure 7 showed that this was
the most frequently tagged type of relationship. There is no data to suggest why the ‘fighting
against’ relationship category results in such low levels of agreement. This may be a result of the
description of the category provided to crowd workers, or may be a feature of the nature of the
relationship itself.
Figure 13 Quartiles of Crowd-Expert Agreement by Relationship Type
Figure 14 Frequency of IBCC Confidence in Judgements by Range (of confidence score)
Figure 14 provides an interesting contrast to the crowd-expert agreement scores. The data in this
graph represents the scores for confidence obtained by using the IBCC algorithm. It shows the range
of confidence scores against all of the instances in the dataset (both entity and relationship). From
this it can be seen that the vast majority of instances receive confidence scores of greater than 90%.
Additional results derived from the IBCC algorithm are now discussed in greater depth, alongside an
explanation of the basis of the calculations in order to illustrate the method’s utility.
4.4. Measuring Confidence: the IBCC-M Model
4.4.1. Confidence
A standard approach to producing ‘gold’ data is to employ a decision rule on a set of annotations
(e.g. a threshold score for inter-annotator agreement, Cohen’s kappa, or similar) and to
accept instances meeting this rule into the dataset while rejecting the rest. In this project, we
decided to take a more discriminating approach that involves assigning confidence measures to each
annotation in the dataset. The rationale for this approach is primarily that accept / reject rules do
not allow dataset users to calibrate confidence to the requirements of the task (i.e. by enabling them
to make the trade-off between the quality and size of the dataset), but using confidence measures
also respects the fact that some instances may be intrinsically ambiguous.
In addition, we decided to use a more sophisticated approach to assigning confidence scores, based
on machine learning, than the widespread metrics that look solely at inter-annotator agreement.
Simple metrics (e.g. percentage agreement, Cohen’s kappa or Gwet’s AC1) fail to account for
differences between the difficulty of instance types (e.g. people being easier to identify than
nationality) or, more importantly, between the quality of annotators. The project team therefore
investigated, adapted, and deployed an algorithm based on a Bayesian model of classification
performance that produces a probability that a particular annotation is correct. Aside from its other
advantages, this has a straightforward interpretation, in contrast to other measures that cannot
straightforwardly be used to answer the question: “how likely is this annotation to be true?”.
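For comparison, the ‘simple metrics’ mentioned above can be computed directly from two annotators’ label sequences. The sketch below (with invented labels) shows percentage agreement and Cohen’s kappa, which corrects raw agreement for the agreement expected by chance; note that neither metric says anything about per-annotator quality or instance difficulty, which is the motivation for the Bayesian approach.

```python
from collections import Counter

def percent_agreement(a, b):
    """Raw proportion of instances on which two annotators agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators independently pick the same label
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

a = ["PER", "ORG", "LOC", "ORG", "PER", "LOC"]
b = ["PER", "ORG", "ORG", "ORG", "PER", "LOC"]
print(percent_agreement(a, b))  # 5/6
print(cohens_kappa(a, b))       # 0.75
```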
4.4.2. Modelling Classification Performance
A number of approaches for combining classifiers of unknown quality have been proposed; Kim and
Ghahramani (2012)31 review a number of methods, which differ in the complexity of their
assumptions. Taking account the expected sparsity of the dataset, we selected a relatively-simple
model for classification generation: ‘Independent Bayesian Classifier Combination’. The empirical
31 H. Kim and Z. Ghahramani. (2012). Bayesian Classifier Combination. In Proc. of the 15th Int. Conf. on Artificial Intelligence and Statistics, page 619
benefits of a variant of this approach are demonstrated in Simpson et al. (2013)32, and it is used in
combination with automated text classifiers with considerable success by Simpson et al. (2015)33.
This model assumes that classifiers (i.e. annotators, in the terms of our study) are all independently
drawn from a population of unknown quality, and defined by a set of independent probability
distributions that describe, for any given ‘true’ classification of an instance, the probability of every
possible classification that that classifier will assign. The model assumes that the data are the result
of these classifiers being presented with a set of instances. Estimation of model parameters is
considerably improved where ‘gold’ datasets exist (to calibrate classifier performance), but this is
not required.
In our case we needed to reflect the possibility that expert and crowd annotators were drawn from
different populations; in particular, we wanted to model the assumption that crowd annotators
could not (at least prior to analysing the data) be trusted as much as the expert annotators.
Consequently, we adapted the model to allow for two populations of classifiers - experts and crowd.
Details of this model - ‘Independent Bayesian Classifier Combination with Multiple Populations’, or
‘IBCC-M’ - are outlined in Annex B.
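To make the generative assumption concrete, the sketch below simulates annotators as confusion matrices: for each true class, a distribution over the labels an annotator will assign, with the expert population assumed more reliable than the crowd population. All class names, population sizes and accuracy figures here are invented for illustration; the actual IBCC-M specification is given in Annex B.

```python
import random

random.seed(0)
CLASSES = ["Person", "Organisation", "Location"]

def confusion_matrix(accuracy):
    """One row per true class: probability mass `accuracy` on the correct label,
    with the remainder spread evenly over the other labels."""
    k = len(CLASSES)
    return [
        [accuracy if i == j else (1 - accuracy) / (k - 1) for j in range(k)]
        for i in range(k)
    ]

# Two populations: experts are assumed a priori more reliable than crowd workers
experts = [confusion_matrix(0.95) for _ in range(2)]
crowd = [confusion_matrix(random.uniform(0.5, 0.85)) for _ in range(10)]

def annotate(true_class, cm):
    """Draw one (possibly wrong) label from an annotator's confusion matrix."""
    row = cm[CLASSES.index(true_class)]
    return random.choices(CLASSES, weights=row)[0]

# Simulate all annotators labelling a single instance whose true class is Location
labels = [annotate("Location", cm) for cm in experts + crowd]
print(labels)
```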
4.4.3. Estimating the Model
In non-technical terms, the IBCC-M model proposes that the annotated data are generated by
passing instances (of unknown types) through classifiers (of unknown performance). Given that
classification performance can only be properly inferred if we know the true classification of (at least
some of) the instances, and that the instance types can only be inferred if we know the true
performance of the classifiers, it might seem that knowing neither makes for a circular impasse.
However, this is not the case. With a very weak assumption that classifiers are slightly more likely to
be correct than not, it is possible to estimate both unknowns using statistical methods. Intuitively,
this is related to the idea that if three people agree on an annotation, it is more likely that the
annotation is correct than that all three independently (randomly) selected the same label.
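The intuition about breaking this apparent circularity can be illustrated with a simplified iterative scheme in the spirit of Dawid and Skene: alternately infer instance labels from the current reliability estimates, then re-estimate reliability from those inferred labels. This is an EM-style analogue of the idea, not the Gibbs-sampling estimation actually used in the project, and the toy annotation data are invented:

```python
from collections import defaultdict

# annotations[instance] = {annotator: label}; invented toy data
annotations = {
    0: {"a": "PER", "b": "PER", "c": "ORG"},
    1: {"a": "LOC", "b": "LOC", "c": "LOC"},
    2: {"a": "ORG", "b": "PER", "c": "ORG"},
}

# Start by assuming every annotator is equally reliable
reliability = defaultdict(lambda: 1.0)

for _ in range(10):
    # Step 1: infer each instance's label by reliability-weighted vote
    labels = {}
    for inst, votes in annotations.items():
        weights = defaultdict(float)
        for annotator, label in votes.items():
            weights[label] += reliability[annotator]
        labels[inst] = max(weights, key=weights.get)
    # Step 2: re-estimate reliability as each annotator's agreement with inferred labels
    for annotator in {a for votes in annotations.values() for a in votes}:
        judged = [(inst, lab) for inst, votes in annotations.items()
                  for a, lab in votes.items() if a == annotator]
        reliability[annotator] = sum(lab == labels[inst] for inst, lab in judged) / len(judged)

print(labels, dict(reliability))
```

After convergence, annotator "a" (who always agrees with the weighted consensus) is rated fully reliable, while "b" and "c" are discounted, without any ground truth being supplied.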
32 E. Simpson, S. Roberts, I. Psorakis, and A. Smith. (2013) Dynamic Bayesian Combination of Multiple Imperfect Classifiers. In Intelligent Systems Reference Library series: Decision Making and Imperfection, pages 1–35. Springer.
33 Simpson, E., Venanzi, M., Reece, S., Kohli, P., Guiver, J., Roberts, S. and Jennings, N. R. (2015). Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning. In 24th International World Wide Web Conference (WWW 2015), pp. 992-1002. doi:10.1145/2736277.2741689.
Nevertheless, estimating the model - i.e. calculating (or at least putting probability distributions on)
the parameters describing classifier performance and true instance characteristics - is not
straightforward. Because of the entanglement described above, the parameter values cannot be
estimated analytically. Instead, stochastic sampling methods must be used to approximate the
probabilities of the parameters, given a set of annotations. The IBCC-M model, by design, is
amenable to Gibbs Sampling34 and so we used the specialised WinBUGS35 (Bayesian Inference Using
Gibbs Sampling for Windows) software application to perform the necessary estimation.
However, although the model estimation worked extremely well for dummy data we generated for
testing, and for our individual relationship classification models36, the sheer volume of missing data37
meant that the software became unstable when we tried to deploy it using all of our entity
annotation data. Figure 15 shows the total number of unknowns in the model (each of which needs
to be re-estimated several hundred times) with increasing numbers of annotators; with all the crowd
workers included, there are more than 2.5 million unknowns. We were able to exploit the power-
law-like distribution of effort among annotators to compromise effectively in generating entity
confidence scores. Figure 16 shows that the top 15 annotators’ data (the two experts and the
busiest thirteen crowd workers) accounted for 60% of the total judgements, but less than 5% of the
unknown variables of the full model. This should be borne in mind when interpreting the confidence
scores, although it is worth noting that the scores themselves already account for missing data by
(essentially) not being as high (or low) as they would be if more data were available.
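The compromise described above, keeping only the busiest annotators, amounts to sorting annotators by judgement count and cutting off once a target share of the total judgements is covered. A sketch with an invented, power-law-like distribution of effort (the names and counts are illustrative, not the project’s actual figures):

```python
def busiest_annotators(counts, target_share):
    """Return the smallest set of annotators (busiest first) whose judgements
    reach `target_share` of the total, plus the share actually covered."""
    total = sum(counts.values())
    chosen, covered = [], 0
    for annotator, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        chosen.append(annotator)
        covered += n
        if covered / total >= target_share:
            break
    return chosen, covered / total

# Invented power-law-like distribution of annotation effort
counts = {"expert1": 4000, "expert2": 3500, "w1": 1500, "w2": 900,
          "w3": 500, "w4": 300, "w5": 150, "w6": 100, "w7": 50}
print(busiest_annotators(counts, 0.60))
```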
34 A statistical method of estimating complex probability distributions by iteratively updating individual parameter values
35 https://www.mrc-bsu.cam.ac.uk/software/bugs/the-bugs-project-winbugs/
36 For reasons outlined in the Annex, the fact that our entity schema was mutually exclusive, while our relationship schema was not, meant that we could treat the relationship classifications as a range of simple models whereas the entity model was a single, complex one
37 Each annotator, on average, made judgements about less than 3% of the total number of instances
Figure 15 Number of Unknown Parameters by Number of Annotators
Finally, the estimated parameter values were transformed straightforwardly into confidence scores
by extracting the probabilities, conditional on the data, attached to the expert annotations. In some
cases, these were lower than the probabilities attached to alternative annotations. For reasons of
transparency, however, we decided to retain the original (human expert) annotations together with
the confidence scores, rather than report the (IBCC-M-generated) alternative annotation. This
effectively meant that although the crowd could disagree (by indicating alternative annotations, thus
lowering confidence) with the experts, they could not ‘suggest’ alternatives in a way that would lead
to them entering the database.
4.4.4. Findings from the Confidence Measurement
In theory, the confidence scores measure the probability that the expert annotation is correct. We
have no real way of telling whether this is true, lacking any external source of data, although the
studies referred to above give some indication of the model’s value. However, there are a number of
indications of the approach’s validity, and indeed of its potential advantages over more simplistic
metrics.
First, and most straightforwardly, the IBCC-M confidence measures were in almost all cases higher
than simple agreement frequency (i.e. the proportion of raters who agreed with the experts); for
92% of instances, the IBCC-M confidence probability exceeded the frequency of annotator
agreement. This suggests that the model is accurately capturing the fact that if two (or more) raters
agree, this provides particular justification for the hypothesis that this is due to anchoring around
the truth rather than to chance agreement. Figure 17 illustrates this difference by plotting the
traditional measure of crowd-expert agreement against IBCC-generated confidence scores.
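The tendency for model-based confidence to exceed raw agreement frequency can be seen in a direct Bayes calculation: if annotators are assumed mostly accurate and their errors are spread over several classes, even a 2-of-3 vote split can yield a posterior well above 2/3, because coordinated agreement on the same wrong label is improbable. The sketch below is a deliberately simplified calculation with invented parameters, not the IBCC-M model itself:

```python
def posterior_correct(prior, accuracy, n_classes, votes_for, votes_against):
    """Posterior probability that the expert label is true, given crowd votes,
    assuming independent annotators with fixed accuracy and errors spread
    evenly over the other classes (a strong simplification)."""
    wrong = (1 - accuracy) / (n_classes - 1)  # P(picking one specific wrong label)
    # Likelihood of the observed votes if the expert label is true vs. false
    like_true = (accuracy ** votes_for) * ((1 - accuracy) ** votes_against)
    like_false = (wrong ** votes_for) * ((1 - wrong) ** votes_against)
    num = prior * like_true
    return num / (num + (1 - prior) * like_false)

freq = 2 / 3  # raw agreement: 2 of 3 crowd votes match the expert
post = posterior_correct(prior=0.5, accuracy=0.8, n_classes=5,
                         votes_for=2, votes_against=1)
print(freq, post)  # the posterior comfortably exceeds the raw frequency
```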
Figure 16 Number of judgements made by annotators
The IBCC-M model also exhibited some interesting behaviour. First, and without any explicit
prompting from the experts or through the design of the model, it learned to put very little credence
in crowd annotations either of ‘NONE’ or of ‘DON’T KNOW, CAN’T TELL (DKCT)’38. In most of the
other categories, the frequency of its assignment was of a very similar order of magnitude to those
of the annotators, but it assigned almost no instances to these ‘bucket’ categories. We do not know
exactly why39, but one plausible hypothesis is that those annotations (by crowd workers), when they
occurred, occurred largely randomly and were not correlated with one another.
The confidence algorithm also identified a few cases that (when followed up) appeared to be the
result of misunderstanding by the crowd (or, perhaps, miscommunication by the project team). For
example, one of the instances mentioned “6787 civilians”; some of the workers thought this was a
quantity, and the algorithm assigned a confidence of only 46% to the experts’ judgement that this
was an ‘organisation’. In fact, the experts were following a pre-agreed schema in which numbers of
people were considered organisations, but the crowd workers were not trained in this specific
interpretation. Another instance talked of the ‘Arab world’: even though many of the annotators
agreed with the experts that this was a location, the algorithm did not, and thought it was more
likely to be an ‘organisation’. Figure 18 shows the slight discrepancies between annotator and IBCC-
defined assignments, highlighting the algorithm’s lack of confidence in the ‘NONE’ and ‘DKCT’
categories.
38 For example, the algorithm discounted crowd annotators’ views concerning one instance which mentioned a ‘major oil pipeline’; three out of four of them thought it did not fall into anything in the schema, yet the model nevertheless put a confidence of 43% in the experts’ judgement that it was a location.
39 One of the drawbacks of the machine learning approach to assigning confidence is a lack of transparency in the provenance of the conclusions
Figure 17 Average confidence compared to agreement
Without further and more systematic investigation, we cannot determine exactly the performance
characteristics of the IBCC-M model and its Bayesian siblings on intelligence-related corpora of the
kind under study here. Counting against this approach is its computational burden (which, for larger
bodies of work or numbers of annotators, would require significant outlay) and its lack of immediate
transparency compared to more intuitive metrics. In its favour are the clear interpretation of its
output and its subtlety in differentiating between annotators and the difficulty of instance types.
The approach does hold out the possibility of delivering reliable gold data even in the absence of
experts. By learning the characteristics of annotators, it may be possible for an approach like this to
deliver reliable, crowd-sourced judgements at only a fraction of the expense of employing people
with domain knowledge to conduct the annotation40, and without necessarily requiring any
knowledge of ‘ground truth’. In addition, if it could be automated effectively, this approach offers
the prospect of relatively fast identification of well-performing annotators, or indeed even of
annotators who are particularly effective at identifying particular categories of instance. We believe
further investigation in this direction would be extremely worthwhile.
40 We estimate that the average cost per crowd worker annotation, including design of the template, management of the crowd and processing of the data, is £0.20-0.30 per annotation; by way of comparison, the expert annotators tagged approximately 100 individual instances per hour, which would equate to an hourly rate of £20-30 per hour for an expert annotator (including the cost of verifying their work) if paid at the same rate as the crowd. Further, the cost per crowd-supplied annotation would decrease markedly as the task is upscaled because most of the cost is attached to the management of the crowd, the development of the tasks themselves and the processing of the results.
Figure 18 Annotator vs IBCC judgements
5. Conclusions and Recommendations
5.1. Observations about the creation of the dataset and its utility
One of the significant difficulties encountered throughout this project was designing and then
implementing entity and relationship schemas. The experience of this project was that however a
schema was designed there would always be examples within the corpus that did not fit the schema,
or that forced the annotator to make seemingly nonsensical decisions. To some extent the
application of a rigid schema is something of a fool’s errand. A rigid schema must attempt to pre-
suppose the types of problem that will be addressed using the dataset. This complicates the design
of the schema and forces significant compromises to be made before any data is assessed or
structured. Schema design must attempt to satisfy conflicting aims: it must simultaneously be
suitably broad to cover generally useful types, but specific enough to be meaningful. Ultimately any
schema must be designed with representative questions in mind, and once it is in use the data
structure derived from it cannot be adapted easily to suit other specific questions without significant
reclassification or restructuring effort.
A possible solution to this problem could involve the use of emergent schema, where the task of
annotating the data is combined with creating the schema. Chapman and Dowling (2006)41 present a
structured approach to developing a schema inductively based on the corpus contents and the
question of interest. However, given advances in machine learning, it is conceivable that the process
of forming an emergent schema and determining saliency could be augmented by automated
approaches. In such a solution, the characteristics of differences between classes in an ontology
might be recognised algorithmically, in effect, leading to the generation of a meta-schema.
Crowdsourcing was another vital aspect of this project. The approach of utilising crowd workers
through Mechanical Turk proved valid and appropriate for the scope of this dataset. The population
of workers was easily able to accommodate the scale of work produced, and there would be capacity
for tasking on greater scale. The use of the crowd in this way presented excellent value for money in
terms of annotation effort and should be considered for similar tasks in the future (see the following
subsection for specific recommendations).
Throughout this project, the issue of what constitutes a gold standard dataset for natural language
processing has arisen. Wissler et al. (2014)42 have suggested a definition describing a dataset which
has been extensively tagged by a range of human annotators whose outputs are cross-validated
against one another, and against ‘objective’ automated tagging. However, this does not resolve the
issue of deciding on the requisite quality of the annotations and whether inter-annotator agreement
is a good measure of quality. Central to this issue is the question of whether or not a semantic
model, which considers all natural language text to contain ‘ground truth’ meaning, is valid or useful.
41 Chapman, Wendy W., and John N. Dowling. "Inductive creation of an annotation schema for manually indexing clinical conditions from emergency department reports." Journal of biomedical informatics 39, no. 2 (2006): 196-208.
42 Wissler, L., Almashraee, M., Díaz, D.M. and Paschke, A., 2014. The Gold Standard in Corpus Annotation. In IEEE GSC.
The preceding report associated with this project43 discussed the theoretical foundations of
meaning, ambiguity and the relevance of both concepts to annotating natural language text. It
argued that trying to reduce ambiguity in some contexts could actually erode the information value
of text, rather than increase it; this idea is encapsulated by Dumitrache et al.’s (2017) hypothesis
that “annotator disagreement is not noise, but signal”44. Further, in the present report we have
discussed and demonstrated a method for measuring ‘confidence’ in annotations and considered its
relevance to determining ambiguity and ‘true’ meaning. We conclude that such an approach is
better able to distinguish between genuine ambiguity and poor annotation than simple inter-
annotator agreement. Overall, the IBCC-M model used in this project provided higher confidence
scores across most instances, and drew attention to instances where experts’ annotation decisions
were open to valid alternative interpretations.
This ability to handle divergent interpretations is particularly important for relationship extraction,
where agreement is generally much lower (as demonstrated in this project and other research45).
However, this disagreement should not be viewed as a barrier to using the dataset for machine
learning training and validation. Li, Good and Su (2015)46 showed that even corpora where there are
fairly low levels of annotator agreement for relationships can be used successfully to train machine
learning approaches.
The dataset itself should represent a rich source of data for its intended purpose of training and
evaluating machine learning approaches to text extraction. The IBCC-M model, and approaches like
it, offer a great deal of potential value over and above simpler metrics. Although they carry a
computational and transparency burden, the intuitiveness of their output, and the prospects they
offer in terms of precision about instance categories as well as annotators, would be of substantial
benefit in a range of contexts.
5.2. Potential future developments
Whilst the dataset produced by this project is both rich in information and robust in terms of the
quality of that information, there is always the potential to improve either the efficiency of
developing a dataset, or the utility of its contents. A list of potential future developments for related
datasets is included below:
• Gold Standard data production without experts - The use of Bayesian algorithms for
assessing and scoring the confidence of annotations, demonstrated in this project presents
the possibility of delivering reliable ‘gold’ data even in the absence of experts. This would
provide the possibility of producing very large, structured datasets of high quality, with
43 2016/17 Data Analytics Project Lot 1: Measurement and Evaluation of a Gold Dataset for Text Processing - Dataset Requirement Framework Specification - Dated: 5th January 2017
44 Dumitrache, Anca, Lora Aroyo, and Chris Welty. "Crowdsourcing Ground Truth for Medical Relation Extraction." arXiv preprint arXiv:1701.02185 (2017).
45 For example, Aroyo, Lora, and Chris Welty. Harnessing disagreement in crowdsourcing a relation extraction gold standard. Tech. Rep. RC25371 (WAT1304-058), IBM Research, 2013.
46 Abstract to Li, Tong Shu, Benjamin M. Good, and Andrew I. Su. "Exposing ambiguities in a relation-extraction gold standard with crowdsourcing." arXiv preprint arXiv:1505.06256 (2015)
minimal requirement for costly subject matter expertise. We believe further investigation in
this direction would be worthwhile.
• Fully utilising confidence scores - Beyond using the confidence score to set a threshold for
data quality (i.e. discounting instances below a certain confidence score), the dataset could be
used in a more sophisticated way. The confidence score could be incorporated into
probabilistic machine learning training, for example, by informing the priors about the data,
the annotators, or other features. The IBCC algorithm is just one of a large family of similar
techniques. Other approaches and modelling could be explored to provide additional data to
assist with training of classifiers of various types.
• Dynamic schema - The previous subsection discussed the problems inherent in trying to
implement rigid schemas. A more integrated hybrid approach using a combination of expert
and crowd workers operating in concert with automated extractors could be used to define
and evolve adaptable schema.
• Large scale co-referencing - Indicating coreference between sentences was determined out of
scope for this project. However, it is recognised that such an exercise, while challenging,
would add significant value to any dataset. Indicating coreference could be performed by
crowd workers managed in a similar way to the approach followed in this project. This process
could be augmented by the targeted use of machine learning.
• Intelligent crowd management - A number of methods could be employed for future
crowdsourcing tasks, to utilise resources more efficiently, particularly in the case of larger
scale projects. Such methods might include: dynamically adjusting payment and other
incentives to increase productivity or quality; monitoring performance in real time by running IBCC
continuously over crowd results; creating hierarchies of workers or specialised
qualifications to better target the most difficult tasks; or using mutually exclusive sub-groups
of the crowd to validate different presentation methods.
Annex A - Interpretation of Entity and Relationship Schemas
This annex provides an informal guide to the interpretation of the annotation schemas used in this
project and has been produced to increase understanding of the dataset. It gives further details on
the way the schemas used in this project have been applied during the expert annotation of the
dataset, and reflects the instructions issued to crowd workers to guide their data tagging activities.
The interpretations which have evolved during this process have been driven by practical
considerations about ensuring the annotation is as efficient and unambiguous as possible, but also
by the requirement to capture information that would be relevant to the intelligence analyst using
the available schema.
A1. Entity Schema
The main challenge in annotating entities has involved deciding the exact text which constitutes a
given referent. In general terms, definite and indefinite articles have been included in the entity
annotation in order to distinguish between, for example, “the man” and just “man”. However,
interpretation has been required to determine which adjectives and other descriptive phrases form
part of the referent. For example, in the text – “the tall man in the black coat” – the entity would be
captured simply as “the tall man”. In this case, “the black coat” would be considered as a separate
object which belonged to “the tall man”. However, there are times where the context within the
sentence makes this separation more difficult. The table overleaf contains the interpretations for
different entity types.
Entity Name Entity Description Entity Interpretation Rules
CommsIdentifier An alias used to
represent people,
places, military units in
electronic
communication etc.
Could be a call sign, an
email address, a twitter
handle.
These have been relatively rare occurrences in the
dataset. On most occasions, they have included
email addresses or twitter handles, and have been
relatively straightforward to classify.
DocumentReference A unique (or semi
unique) identifier for a
document.
Within the dataset this category has mainly been
restricted to instances referring to religious texts,
specific reports produced by organised bodies, or
references to specific written legislation or policy
(e.g. UN Security Council Resolutions).
Frequency A radio frequency used
in electronic
communication.
There have been very few of these occurring in the
dataset. The limited number of examples have
been unambiguous in their classification.
Location A specific point, area or
feature on earth. May
be named (e.g. a city
name) or referenced
(e.g. with a coordinate
systems)
This has been interpreted to include geographical
features such as mountain ranges and rivers. It has
also been used to capture abodes such as ‘his
home’ or ‘their flat’. The annotation also includes
locations expressed relative to another location
(e.g. 25 miles north of the capital city). It has not
been used for references to abstract and
unspecified locations such as ‘the battleground’.
A3
MilitaryPlatform
Description: A military ship, aeroplane, land vehicle or system, or a platform onto which weapons might be mounted. May be indicated by its abstract class, e.g. "tank", or by specific designation or other aliases, e.g. "T-14 Armata".
Interpretation: The application of this category has mostly been obvious, with two exceptions. The first relates to the use of civilian vehicles for military purposes (e.g. a jeep being used by terrorists to conduct an attack); the second concerns named military vehicles (such as naval vessels) where it is not obvious, from the isolated sentence, what the name refers to (i.e. it could be a person’s name). In the first case, the decision was taken to classify vehicles whose primary purpose within the context of the sentence is military as military vehicles. In the second case, where the vehicle’s name is not obviously associated with a military vehicle (e.g. “Victoria sailed to the Mediterranean”), it has not been classified.
Money
Description: An amount of money or a reference to currency.
Interpretation: This category has been used to capture both the reference to the currency concerned and the quantity of that currency (e.g. “$40,000”). It has also been used to refer to sources of money, where they are used to indicate another entity having access to or being sent money (e.g. “benefit payments” or “oil revenue”).
Nationality
Description: A description of nationality, religious or ethnic identity.
Interpretation: Nationality has only been annotated where it is not part of another referent. For example, “a Syrian man” would be classified as a person, not a nationality. However, in the sentence – “the food eaten at the party was mainly Syrian” – the nationality reference for “Syrian” would be annotated. This category has also been used to encode other identity descriptors such as religious (e.g. Sunni) or ethnic (e.g. Kurdish) identity.
Organisation
Description: A group of people, a government, a company, or a family. May be referenced by name, e.g. "the British Army", or, relying on context, as "the Army". May also be indicated by an abstract class, e.g. "rebels".
Interpretation: This has been one of the most inclusive categories, encapsulating a very broad range of entities. It has been used to annotate official organisations and groups (such as states, companies or political movements), but also any references to collective plurals of people (e.g. “the population of Aleppo” or “the six soldiers”). Where groups of people have been annotated, the number of people has been captured within the entity as well (e.g. “30,000 refugees”).
Person
Description: A specific person. May be referenced by name, e.g. Barack Obama; by title, e.g. "President of the United States"; by a combination of title and name, e.g. "President Obama"; or by reference to some other entity, e.g. "the US's head of state".
Interpretation: In addition to names and titles of individuals, annotation under this category has been applied to references to any individual within a sentence (e.g. “the terrorist suspect”, “the Kurdish soldier” or “his wife”) and also to personal pronouns (e.g. “he” or “she”).
Quantity
Description: A quantity or amount of something, e.g. "1 kg".
Interpretation: This has been used to capture values and the units in which those values are measured (e.g. “30 kilometres”), and the quantities of other objects (e.g. “6” meetings held per year or “15” cases of ammunition). The exceptions to this are quantities of money (where the quantity may be included in the entity classified as ‘money’), quantities of people (where the number of people may be included in the entity ‘organisation’) and quantities of distance incorporated in a location (e.g. “15 kilometres to the west of Baghdad”).
Temporal
Description: A specific date, time or range of time, e.g. 10:30 PST, 12/09/2016, "next week", "Wednesday".
Interpretation: Annotations under this category include a wide range of references to periods or points in time. They might include years, months, days, dates or times, but also include relative temporal references to particular events (e.g. “today”, “three days later” or “for five years”).
Url
Description: A web URL.
Interpretation: There have not been many instances relevant to this category in the dataset. Those that have occurred have been straightforward to classify.
Vehicle
Description: A non-military maritime, land or air vehicle. May be indicated by its abstract class, e.g. "airliner"; by a specific vehicle name, e.g. "Boeing 777"; or by another related identifier, e.g. its flight number "MH370".
Interpretation: The annotations in this category have generally been uncomplicated. In addition to including types and names of vehicle (as described in the entity description), they also include possessive determiners and possessive nouns (e.g. “his car” or “the President’s helicopter”).
Weapon
Description: A weapon. May be indicated by its abstract class, e.g. "rifle", or by a specific name or alias, e.g. "AK-47".
Interpretation: In addition to the classifications included in the entity description, this category has also been used to include objects which the sentence indicates are being used as a weapon (such as “a rifle butt”).
A2. Relationship Schema
The annotation of relationships has required a more flexible interpretation of the context and meaning of a sentence, and has not lent itself to the imposition of rigid universal annotation rules. The interpretations in the table below therefore focus on the kinds of relationships that have been annotated, described in general terms, rather than on examples of specific uses of language which encapsulate those relationships.
Relationship Name | Relationship Description | Relationship Interpretation Rules
Co-located
Description: Two or more entities in the same place at a given time (e.g. two vehicles being co-located).
Interpretation: This relationship occurs commonly within the dataset and there are three specific interpretations within the annotations which require highlighting. The first relates to the tense within the sentence: co-location between entities has been recorded in cases where the text indicates two or more entities were previously in the same place, are currently in the same place or will be in the same place (excluding speculative phrases such as “should” or “may”). Secondly, co-location is sometimes recorded unidirectionally, where a smaller entity is located in a larger entity but not vice versa (e.g. “the house” was in “London”); this also applies to features such as rivers, which may be “in” a city, but not vice versa. Finally, the issue of proximity has required a degree of interpretation to decide whether entities are close enough to be considered co-located. Words such as “near” have generally been taken to indicate co-location in instances where this provides information which might have value to the hypothetical intelligence analyst.
Apart
Description: Two or more entities in different places at a given time (e.g. two people being apart).
Interpretation: This relationship has only been recorded where the text explicitly records that entities were apart (e.g. “the insurgents have been driven out of the city” or “the President missed his visit to Paris”). It does not include examples where significant inference is required to determine that two entities are apart.
Belongs to
Description: One entity being owned by another entity (e.g. one vehicle belonging to a person), or an entity being a member of another entity (e.g. a person being a member of an organisation).
Interpretation: This relationship category has been applied to text where a physical object is in the possession of another entity, where a person or group is a member of another organisation, or where money is owned by another entity. Where the possessive noun or determiner is included in the entity, the ‘belongs to’ relationship has not been recorded. For example, in the sentence – “the President’s helicopter crashed in the jungle” – the “President’s helicopter” is captured as the entity, not a “helicopter” belonging to “the President”.
In charge of
Description: One entity controlling or having responsibility for another entity (e.g. an organisation being in charge of a building).
Interpretation: This has mainly been applied to command structures and control of territory; occasionally, however, it has been used to indicate that an individual or organisation has control over the use of a weapon or vehicle.
Has the attribute of
Description: One entity being an attribute or associated quality of another entity (e.g. a type of weapon having the attribute of a quantity, or a person having the attribute of a nationality).
Interpretation: The main use of this relationship has been to attach quantities to entities such as vehicles, weapons or locations (e.g. “12” tanks). It is also used in instances where nationality is applied to another entity, but where this is not part of the referent associated with the entity (i.e. the relationship is used for examples like “the man was Syrian”, but not for examples like “the Syrian man”, where the nationality is incorporated into the ‘person’ entity).
Is the same as
Description: One entity being used to mean the same thing as another entity (e.g. multiple names for the same person).
Interpretation: This relationship has been used as a classification distinct from the coreferencing function provided by the Galaxy annotation tool. “Is the same as” has been reserved for cases where two specific phrases are used for an entity, each of which uniquely identifies the entity to the extent that it might feasibly be recognised by any one of the phrases (e.g. the terrorist organisation “Daesh”, which calls itself “the Islamic State”). Coreferencing has been used for non-specific references to the same entity in a sentence (e.g. “the man” walked back the way “he” had come).
Likes
Description: One entity being positively disposed towards another (e.g. a person likes a military platform).
Interpretation: This relationship has been used relatively infrequently, for sentences that use synonyms of “to like” in the context of the relationship between entities. It has also been employed to annotate instances where the text implies that one entity views another favourably.
Dislikes
Description: One entity being negatively disposed towards another (e.g. a person dislikes another person).
Interpretation: The main application of this relationship in the dataset has been to sentences which imply the existence of an acrimonious relationship between entities, or where one entity is quoted as saying something disparaging or universally negative about another entity.
Fighting against
Description: One entity being in armed conflict with another entity (e.g. an organisation is fighting against another organisation).
Interpretation: This relationship has predominantly been annotated on instances which specify that military action has occurred between entities, or which indicate that an entity is engaged in a military campaign against another entity.
Military allies of
Description: One entity fighting alongside another entity on the same side of a conflict (e.g. one person being a military ally of another person).
Interpretation: This relationship has been recorded between entities where it is clear that they are co-operating militarily against a common enemy. It has not generally included financial or other support provided by a partner which is not actually involved in the military operations.
Communicated with
Description: An entity has had direct or indirect contact with another entity (e.g. met, spoken with, communicated remotely with, messaged, emailed).
Interpretation: This relationship has been used to capture instances where information has been exchanged between parties, or where one entity has passed information to another. In the former case, the communication is considered to be symmetric; in the latter case it is annotated as unidirectional. The relationship has also been applied to sentences which indicate that communication must have taken place in order for an outcome to occur (e.g. “the ceasefire brokered by Russia and Turkey”).
Annex B: The IBCC-M Model
B1. Introduction
The IBCC-M model was adapted from the IBCC model presented in Kim and Ghahramani (2012)47
specifically for the present study. The adapted model allowed for multiple populations of
classifiers (in this case, two: experts and crowd workers)48.
B1.1 Model Specification
The data-generating model, in qualitative terms, is as follows. A number of ‘instances’ are
independently generated, each instance being exactly one of R different ‘types’ - these types are
unknown. Each instance is passed to one or more annotators, who classify it as being in one of the R
categories. Each annotator’s performance is governed by a ‘confusion matrix’, which consists of a
separate probability distribution for each potential ‘type’ that the annotator is presented with.
Again, we assume we do not know these distributions but must infer them. We assume there are
two types of annotator - ‘crowd’ and ‘expert’ - which are distinguished only by our prior assumptions
about their quality. Ultimately, the aim of the inference process is to generate probability
distributions for the type of each instance, indicating the ‘confidence’ of each classification, and
specifically (in this case) of the experts’ agreed-upon annotation.
The technical specification of the model is as follows:
• There are N instances, each of which is one of R different types; there are K crowd annotators
and L expert annotators.
• The ‘true’ frequency distribution of instance types is denoted by a 1xR vector 𝜿; each
instance type is drawn from a categorical distribution with this vector as its parameters;
• This vector 𝜿 is generated from a Dirichlet distribution with hyperparameters 𝝂, representing
our prior understanding of the distribution of instance types49.
• The ‘confusion matrices’ for each of the crowd annotators are represented by a set of R 1xR
vectors 𝝅[1 to K, 1 to R], where 𝝅[i, j] is the confusion vector for annotator i when presented
with an instance of type j. These are all independently drawn from a set of 1xR Dirichlet
distributions with hyperparameters 𝛼[1 to R].
• The ‘confusion matrices’ for each of the expert annotators are represented by a set of R 1xR
vectors 𝜹[1 to L, 1 to R], where 𝜹[i, j] is the confusion vector for expert annotator i when
47 H. Kim and Z. Ghahramani. (2012) Bayesian Classifier Combination. In Proc. of the 15th Int. Conf. on Artificial Intelligence and Statistics, page 619
48 Different populations of classifier are distinguished only by the data presented to the model, which reflects differences in their prolificity (experts completed more annotations than individual crowd workers) and by the prior expectation of their reliability (experts were initially considered to have a higher level of annotation accuracy than crowd workers). However, the more information there is about an annotator (i.e. the more judgements they make), whether expert or crowd, the more finely calibrated the assessment of that annotator’s reliability will be. Likewise, the more annotators exposed to a given instance, the more finely calibrated the assessment of that instance’s ‘true type’ will be. In this way, the number of classifiers in each population is not directly relevant to the way the model calculates confidence for the classification of each instance.
49 In fact, we used a nearly flat prior, essentially representing no prior expectations about the distribution of types.
presented with an instance of type j. These are all independently drawn from a set of 1xR
Dirichlet distributions with hyperparameters 𝜷[1 to R].
• For each instance i, the instance type t[i] is generated on a categorical distribution from 𝜿; for
each crowd annotator j exposed to that instance, their judgement c[j, i] is generated on a
categorical distribution from 𝝅[j, t[i]]; for each expert annotator k exposed to that instance,
their judgement d[k, i] is generated on a categorical distribution on 𝜹[k, t[i]].
• The data consists of the set of judgements c[1 to K, 1 to N] and d[1 to L, 1 to N], which in
practice was very sparse because each annotator was exposed to only a very small subset of
instances.
• Our aim is to infer the posterior probability distributions for each sentence type t[i], and also
instrumentally to infer the confusion vectors for each crowd annotator 𝝅[1 to K, 1 to R] and
expert annotator 𝜹[1 to L, 1 to R].
To summarise the model:
• 𝜿 ~ Dir(𝝂)
• 𝝅[i, j] ~ Dir(𝛼[j]) for i=1 to K and j=1 to R
• 𝜹[i, j] ~ Dir(𝜷[j]) for i=1 to L and j=1 to R
• t[i] ~ Cat(𝜿) for i=1 to N
• c[i, j] ~ Cat(𝝅[i, t[j]]) for i=1 to K and j=1 to N
• d[i, j] ~ Cat(𝜹[i, t[j]]) for i=1 to L and j=1 to N
• The data consist of the judgements c[i, j] and d[k, j] for i=1 to K, k=1 to L, and j=1 to N
The posterior distribution for the data and parameters (from which the Gibbs samplers are derived)
is as follows:
𝑝(𝝅, 𝜿, 𝜹, 𝒕, 𝒄, 𝒅 | 𝜶, 𝜷, 𝝂) =
𝑝(𝝅 | 𝜶) ⋅ 𝑝(𝜹 | 𝜷) ⋅ 𝑝(𝜿 | 𝝂) ⋅ ∏_{i=1..N} 𝜅[t[i]] ⋅ ∏_{i=1..N} ∏_{j=1..K} 𝜋[j, t[i], c[j, i]] ⋅ ∏_{i=1..N} ∏_{j=1..L} 𝛿[j, t[i], d[j, i]]
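The joint density above can be evaluated directly, which is useful for checking a sampler implementation. The sketch below is illustrative (the function names are ours, not from the study); the Dirichlet terms include their normalising constants, which is harmless since those are constant in the sampled variables.

```python
import numpy as np
from math import lgamma

def dirichlet_logpdf(x, a):
    """Log-density of Dirichlet(a) evaluated at a point x on the simplex."""
    x, a = np.asarray(x, float), np.asarray(a, float)
    return (lgamma(a.sum()) - sum(lgamma(ai) for ai in a)
            + float(((a - 1.0) * np.log(x)).sum()))

def log_joint(pi, kappa, delta, t, c, d, alpha, beta, nu):
    """Log of the joint density p(pi, kappa, delta, t, c, d | alpha, beta, nu).
    Shapes: pi (K, R, R), delta (L, R, R), kappa (R,), t (N,), c (K, N), d (L, N)."""
    K, L, N, R = pi.shape[0], delta.shape[0], t.shape[0], kappa.shape[0]
    lp = dirichlet_logpdf(kappa, nu)                        # p(kappa | nu)
    lp += sum(dirichlet_logpdf(pi[i, j], alpha[j])          # p(pi | alpha)
              for i in range(K) for j in range(R))
    lp += sum(dirichlet_logpdf(delta[i, j], beta[j])        # p(delta | beta)
              for i in range(L) for j in range(R))
    lp += float(np.log(kappa[t]).sum())                     # prod_i kappa[t[i]]
    lp += sum(np.log(pi[j, t[i], c[j, i]])                  # crowd likelihood terms
              for j in range(K) for i in range(N))
    lp += sum(np.log(delta[j, t[i], d[j, i]])               # expert likelihood terms
              for j in range(L) for i in range(N))
    return lp
```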
For the entity classification model, we used a set of 16 mutually-exclusive classifications (R=16)
(including ‘NONE’ and ‘DON’T KNOW / CAN’T TELL’), reflecting the entity schema in which only one
of the categories could be true. However, for the relationship classification model, for which the
schema left open the possibility that none or all of the (13 possible) relationships might be true of a
specific instance, we treated this effectively as 13 separate models, each of which involved classifiers
assigning ‘yes’ or ‘no’ to a particular classification (i.e. R=2). In all cases, we used very weak priors
that for crowd annotators corresponded to a prior expectation of around 50% accuracy, and for
expert annotators to a prior expectation of around 90% accuracy. However, the pseudocounts
describing these priors were very low and (apart from in the least well-represented categories) the
posterior distributions associated with the parameters were dominated by the characteristics of the
data rather than our prior assumptions.
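Priors of this kind might be encoded as follows. This is an illustrative sketch, not the study's exact pseudocounts: the helper function and the `strength` value are our own assumptions, chosen so that the prior mean accuracy matches the stated expectations while the total pseudocount per row stays low.

```python
import numpy as np

def accuracy_prior(R, expected_accuracy, strength=2.0):
    """Dirichlet pseudocounts for each row of a confusion matrix.
    The prior mean probability of a correct classification equals
    expected_accuracy; 'strength' is the total pseudocount per row, so a
    low value gives a weak prior that the data quickly dominate."""
    off_diag = (1.0 - expected_accuracy) / (R - 1)
    rows = np.full((R, R), off_diag)
    np.fill_diagonal(rows, expected_accuracy)
    return strength * rows

# Binary relationship models (R=2): crowd ~50%, expert ~90% prior accuracy
alpha = accuracy_prior(R=2, expected_accuracy=0.5)
beta = accuracy_prior(R=2, expected_accuracy=0.9)
```

Because the prior mean of a Dirichlet is proportional to its pseudocounts, the diagonal entry of each row divided by the row total recovers the intended prior accuracy.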
B1.2 Gibbs Sampling
Gibbs sampling is a well-understood approach that is useful in cases in which a marginal probability
distribution can be specified but is intractable. It is well covered in statistical literature and we do
not expound on it here. We used the WinBUGS tool to perform the requisite analysis. The sheer
volume of computation required for the entity model meant that (as outlined in section 4.4.3 of the
main report) we had to restrict the data to only the top 15 annotators (accounting for 60% of all
judgements); with the relationship annotation data (due to the considerably simpler model and
smaller dataset) we were able to use all of it.
Visual inspection suggested that convergence for the key variables (in this case, the ‘true’
classifications t[i]) was achieved very quickly. In all cases, our reported confidence scores are the
means of the values assigned in each of 900 samples (following a 100-sample burn-in period) to the
probability that the expert annotators’ classification is correct.
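The confidence computation described above can be sketched as follows, assuming the sampler has produced an array of sampled type assignments (one row per sample); the function name and toy data are illustrative, not from the study.

```python
import numpy as np

def confidence_scores(samples, expert_labels, burn_in=100):
    """samples: (S, N) array of type assignments t[i] drawn by the Gibbs
    sampler, one row per sample. expert_labels: length-N array of the expert
    annotators' agreed classifications. Returns, per instance, the proportion
    of post-burn-in samples in which the sampled 'true' type matches the
    experts' classification."""
    kept = np.asarray(samples)[burn_in:]             # discard the burn-in period
    return (kept == np.asarray(expert_labels)).mean(axis=0)

# Toy illustration: 1,000 samples for 5 instances (900 retained after burn-in)
rng = np.random.default_rng(1)
fake_samples = rng.choice(3, size=(1000, 5))
scores = confidence_scores(fake_samples, [0, 1, 2, 0, 1])
```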