
2016/17 Data Analytics Project Lot 1: Measurement and Evaluation of a Gold Dataset for Text Processing

Final Technical Report: Description of the Implemented Method of Annotation and the Resulting Dataset

1. Introduction

This technical report has been produced as a deliverable associated with Lot 1 (Measurement and

Evaluation of a Gold Dataset for Text Processing) of the RCloud Statement of Requirement for Dstl’s

2016/17 Data Analytics project. The report has been written by Aleph Insights Limited, with inputs

from Committed Software Limited (formerly Tenode Limited).

Lot 1 requires the development of a ‘Gold Standard’ dataset which is annotated in a manner which

optimises its subsequent use in the training and evaluation of machine learning approaches to

natural language processing (NLP) in a defence and security context. This report outlines the

methods that were applied to annotate the dataset to the requisite quality, and summarises and

reflects upon the dataset’s contents. It should be read in conjunction with Dstl report1

Aleph/2E27A3310, which lays out the requirements for the structure of the dataset. This current

report, therefore, addresses the practical implementation of the dataset requirement framework as

specified in Dstl report Aleph/2E27A3310.

This document includes a theoretical justification of how the method used to compile the dataset

meets the project's objectives, a detailed explanation of the method itself, and a discussion of the

resulting dataset. It comprises the following sections:

• Project Overview - an overview of the purpose of the project, the hybrid approach followed in

constructing the dataset and the theoretical justification of this method.

• Practical Application of the Method - a description of the implementation of the method

followed, including illustrative examples taken from the dataset.

• Dataset Summary - a review of the contents of the dataset, complete with summary statistics

providing insight into the composition of the dataset.

• Conclusions and Recommendations - a critical reflection on the process of compiling the

dataset, and of the dataset itself, addressing considerations for future work in this area.

1 2016/17 Data Analytics Project Lot 1: Measurement and Evaluation of a Gold Dataset for Text Processing - Dataset Requirement Framework Specification - Dated: 5th January 2017


2. Project Overview

This section contains a review of the project aim and articulates how the hybrid method used to

annotate the dataset enables this aim to be met. It discusses the overall decisions taken about the

method, and seeks to justify why these decisions were taken. It draws upon references to previous

research in this area, and relates this research to the decisions taken in ‘the present project’ (i.e. the

work carried out under this research project).

The overall aim of the present project was the creation of a gold standard dataset that could be used

to train and validate machine learning approaches to NLP; specifically focussing on entity and

relationship extraction relevant to somebody operating in the role of a defence and security

intelligence analyst. The dataset was therefore constructed using documents and structured

schemas that were relevant to the defence and security analysis domain. The schema for entity

types was inherited from previous work2 (notably the development of Baleen (specifically version

2.3)) and the schema for relationship categorisation was developed during the project. The details of

the schemas, along with the selection of documents for the dataset, are discussed in greater depth

in Dstl report Aleph/2E27A3310. The methodological discussion within the present report will

concentrate instead purely on the implementation of the ‘hybrid’ approach employed for the

purposes of this project. In this case, the term hybrid refers to the combination of automated text

extraction together with human annotation, where the human annotation comprises both ‘expert’

(defined later in this section) annotation and crowdsourced annotation.

The rationale for a hybrid approach is that it combines the text-processing efficiency of machines

with humans’ ability to understand meaning and context. Burger et al. (2014)3 proposed a hybrid

approach along these lines as an effective method for generating corpus annotations at scale, where

both entity and relationship extraction are required. In their study, Burger et al. suggested that

automated extraction was sufficient for the classification of entities, but that human annotation was

better suited to the assignment of relationships between those entities.

Accordingly, in the present project, Baleen4 was used as the tool for automated extraction of

entities; these were subsequently served to expert annotators and crowd workers for confirmation

and addition of missed entities. However, Baleen was not used for relationship extraction, with only

human annotation of relationship types being conducted.

Hirschman et al. (2016)5 discuss different models of crowdsourcing for NLP tasks. They define

‘expertise’ in relation to annotation along three axes:

• Knowledge of the domain of the corpus - in the case of the current project this was the conflict in Syria and Iraq;

• Understanding of the specific domain of the annotation-type itself (e.g. annotations focussed on semantics) - in the current project this equated to selecting entities and relationships of relevance to an intelligence analyst; and

• Familiarity with the annotation task itself - in the case of the present project, the schema structure, its category definitions and the use of the annotation tool employed.

2 https://github.com/dstl/baleen/wiki/Type-System

3 Burger, John D., Emily Doughty, Ritu Khare, Chih-Hsuan Wei, Rajashree Mishra, John Aberdeen, David Tresner-Kirsch et al. "Hybrid curation of gene–mutation relations combining automated extraction and crowdsourcing." Database 2014 (2014): bau094.

4 A text analysis framework developed by Dstl: https://github.com/dstl/baleen/wiki/An-Introduction-to-Baleen

5 Hirschman, Lynette, Karën Fort, Stéphanie Boué, Nikos Kyrpides, Rezarta Islamaj Doğan, and Kevin Bretonnel Cohen. "Crowdsourcing and curation: perspectives from biology and natural language processing." Database 2016 (2016): baw115.

The two expert annotators in the present project met these criteria: they were both former

intelligence analysts with knowledge of the conflict in Syria and Iraq, they jointly developed the

entity and relationship schemas, and they were involved in developing the annotation tool.

Hirschman et al. (2016) consider that this definition of expertise is helpful for distinguishing the

expert annotators from the crowd workers, who may be considered as being relatively more naive in

relation to the three axes of expertise. Traditionally, in tasks requiring the annotation of domain-

specific natural language corpora, expert annotators have been considered as the arbiters of the

‘ground truth’6 for a given dataset. This has often meant that all other annotations (as provided by

automated approaches or crowd workers) have been benchmarked against expert tagging, and

annotations are only considered to be ‘correct’ if they correspond to the experts’ judgement.

Furthermore, ‘correctness’ of annotations tends to be judged by means of inter-annotator

agreement. So, the quality of a gold standard database is inferred from the number of instances that

have been accurately and consistently annotated, as compared to the experts.

Researchers such as Dumitrache et al. (2017)7, however, have considered a more nuanced approach

to agreement in human-labelled data. They suggest that high levels of agreement, while potentially

implying accuracy with regards to some underlying truth about the meaning of the data, may also be

indicative of artificially-imposed homogeneity resulting from the annotation process. For example, a

small group of expert annotators who work closely together have an opportunity to develop,

standardise and apply a schema, and thus have a greater chance of applying that schema

consistently (i.e. with lower inter-annotator variance) compared to a disassociated crowd. But the

application of the schema by the small expert group is then at greater risk of adopting

interpretations of schema categories that do not generalise well to meaning as interpreted by a

general population. Further, high levels of inter-annotator agreement may also suggest a narrowly

selective corpus in which the inherent ambiguity of natural language has been limited by the choice

of documents. In cases where this is not representative of the type of data the corpus is trying to

reflect, this may actually undermine the validity of the dataset rather than reinforce it.

Therefore, achieving high inter-annotator agreement should not be regarded, in itself, as a wholly

reliable indication of the quality of annotations, as it can mask genuine ambiguity pertaining to the

meaning of text that could have been ignored or gone unnoticed by a group of similarly minded

expert annotators.

6 This term refers to the abstract concept that there is a single ‘truthful’ classification for each particular instance within a dataset

7 Dumitrache, Anca, Lora Aroyo, and Chris Welty. "Crowdsourcing Ground Truth for Medical Relation Extraction." arXiv preprint arXiv:1701.02185 (2017).


One method of ensuring that the ontologies associated with annotation schemas reflect the general

usage of a broader population is to employ crowd workers to verify, refute or provide alternative

classifications to expert annotators. Structured approaches for utilising crowd workers in the

annotation of gold standard datasets, such as the CrowdTruth Framework8 are becoming

increasingly common. This theoretical outlook, as espoused by Aroyo and Welty (2015)9, regards the

allocation of a single ground truth for the meaning of a given instance as naive, instead viewing the

capture of the uncertainty surrounding meaning as an essential part of the meaning itself.

Additionally, even if one doesn’t accept the contended claim that the concept of ‘ground truth’ is

incoherent, it is uncontroversially the case that the correct interpretation of a term or sentence may

require contextual or background information not contained within the sentence itself, and that the

ability to identify this kind of inherent ambiguity is an important skill for interpreters of text.

Consequently, retaining information about subjective disagreement between the crowd annotators

(or between the crowd annotators and the experts) may be actively desirable if we wish to develop

machine learning applications that are able to identify (or at least account for the existence of)

genuinely ambiguous sentences10, albeit within a carefully managed framework11. Best practice for

the use of crowds12 therefore involves allowing multiple classifications of instances, and recording

and calculating this ambiguity, whilst protecting against spam workers13.

In order to be consistent with this approach, the present project utilised a crowd to validate (or

indeed challenge) expert annotators’ judgements. The crowd were presented with text extraction

microtasks14, which contained guidance designed to explain the task and the meaning of different

classification categories, but which aimed to avoid over-prescription and the stifling of dissent.

Importantly, crowd workers were presented with candidate instances that had been identified by

experts, but not the classifications applied by the experts. Crowd workers were asked to provide

their own classification from the schema to avoid being biased by previous expert interpretation.

8 An approach originally developed as part of the IBM Watson project, but subsequently developed by IBM and a network of academic institutions - http://crowdtruth.org/

9 Aroyo, Lora, and Chris Welty. "Truth is a lie: Crowd truth and the seven myths of human annotation." AI Magazine 36, no. 1 (2015): 15-24.

10 Indeed, Dumitrache et al. (2017) (Dumitrache, Anca, Lora Aroyo, and Chris Welty. "Crowdsourcing Ground Truth for Medical Relation Extraction." arXiv preprint arXiv:1701.02185 (2017)) argue that harnessing measures which reflect the existence of disagreement in instance classifications, rather than trying to settle on a single ground truth, results in improved performance for automated extraction classifiers

11 See Drapeau et al. (2016) for a novel approach to stimulating disagreement and resolving disputes in order to improve crowd worker engagement and the quality of information derived from a crowd (Drapeau, Ryan, Lydia B. Chilton, Jonathan Bragg, and Daniel S. Weld. "MicroTalk: Using Argumentation to Improve Crowdsourcing Accuracy." In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP). 2016.).

12 As proposed in Aroyo, Lora, and Chris Welty. "Truth is a lie: Crowd truth and the seven myths of human annotation." AI Magazine 36, no. 1 (2015): 15-24.

13 Crowd workers who exploit the revenue that can be gained through crowdsourcing platforms through delivering low effort/low quality work.

14 These were single sentence tasks, where a crowd worker was asked to classify a single entity or relationship type. The tasks presented to crowd workers were deliberately deconstructed in this way to make them as cognitively straightforward as possible and minimise the risk of task confusion.


In the present project, crowd workers therefore acted to provide judgements on the annotations pre-

selected by the experts, rather than to add new annotations themselves. Crowd workers who could

not meet a basic standard of performance (e.g. spam workers) were excluded from completing tasks,

but the qualification threshold was calibrated so as not to exclude different, but valid viewpoints15.

The approach taken in the present study was to undertake validation of the expert annotators’

judgements using two metrics: straightforward crowd worker agreement with experts (i.e. as a

proportion of overall crowd worker judgements), and a considerably more-sophisticated machine

learning approach based on a Bayesian model of crowd and expert judgement data generation.

The machine learning approach adopted in the present study - Independent Bayesian Classifier

Combination (IBCC) - aims to generate not a ‘true / false’ decision about the expert assignment, but

a probability that the judgement is correct, using only crowd and expert judgements as inputs. The

IBCC algorithm estimates two things: the accuracy of individual annotators given different categories

of input, and the probability distribution for each instance’s correct classification. The latter feature

enables us to assign a probability to each classification that takes account of all the information

encoded in the collected judgements of the annotators16.

This approach represents a significant improvement over the use of simplistic ‘accept / reject’ rules,

as it allows dataset users to determine the level of confidence17 required for a given use case18. This

approach carried a time and computation burden, but proved to have significant advantages,

including enabling distinction between annotators, as well as between easy and hard categories of

entity or relationship, and providing a means of spotting likely errors and inherently ambiguous

instances.

This section has discussed the underlying theoretical justification for the hybrid approach used in

the present study to annotate the dataset (comprising automated extraction, expert annotation and

crowd annotation), and the nuanced method applied to the task of validating annotations by

calculating agreement or confidence (involving a mix of inter-annotator agreement and an IBCC

machine learning approach). In doing so, the authors hope to have articulated a robust approach to

creating the gold standard dataset that constitutes the primary output for this project. The following

section describes the approach taken in greater detail, outlining the practical application of the

method in order to enable its replication.

15 Consistent with the findings of Hirschman et al. (2016) who conclude that crowd worker input is useful only if it can be periodically validated (Hirschman, Lynette, Karën Fort, Stéphanie Boué, Nikos Kyrpides, Rezarta Islamaj Doğan, and Kevin Bretonnel Cohen. "Crowdsourcing and curation: perspectives from biology and natural language processing." Database 2016 (2016): baw115.)

16 In this instance, however, as discussed below, computational constraints meant that, for entities, the algorithm could only be applied to a subset of the data containing around 60% of all the judgements made.

17 The project team referred to this score using the informal but useful term ‘goldiness’.

18 For example, training an entity classifier for use by intelligence analysts to inform targeting might require inclusion of instances with a very high level of confidence in class assignments; using the dataset to support an academic study might require a larger dataset including lower-confidence assignments.


3. Implementation

In general terms, the previous section discusses a methodological approach which involves a number

of sequential steps. These steps - which are discussed in more detail throughout this section - were:

• Automated Extraction - where an automated extraction tool is applied to unprocessed text (in

the present study’s case this involved single sentences) to begin to assign entities to the text

to act as an initial cue for a human annotator;

• Expert Annotation - where analysts within the project team applied entity and relationship

annotations to text using a specific annotation tool;

• Crowdsourcing - where crowd workers were presented with text containing highlighted

instances identified by the experts over a crowdsourcing platform, and were asked to provide

their own classification of these annotations;

• Confidence Calculation - where inter-annotator agreement and confidence scores were

calculated for each instance annotated through the hybrid approach represented by steps 1-3;

• Committal to Dataset - where all of this data is integrated, committed and stored in the

underlying database, enabling retrieval, interrogation and, ultimately, its utilisation as a

training or validation dataset.

An overview of this process is shown in Figure 1 overleaf.


Figure 1 Overview of the Dataset Generation Process


3.1. Automated Extraction

The automated extraction aspect of this process was conducted by the Baleen

system which parses and extracts text on ingest to an annotation tool

(Galaxy)19. Entities identified by Baleen were displayed to the expert

annotators and provided a prompt for some instances. The configuration was

developed through experimentation with source documentation in an attempt to produce an

optimised set-up. Standard annotators and cleaners were used, including custom-built gazetteers20.

3.2. Expert Annotation

Expert Annotation was conducted by two members of Aleph Insights21, over

the course of the project, using the Galaxy tool produced by Committed

Software (formerly Tenode) for Data Analytics Lot 2. A screenshot of the

Galaxy annotation tool is displayed in Figure 2 below.

Text was presented to the experts in single sentence blocks. It was found that displaying larger

blocks of text (whole documents or paragraphs) made the task of annotation more fatiguing and

confusing. Furthermore, by isolating individual sentences it was possible to ensure consistency with the mode of display of text to crowd workers, increasing the validity of comparison between the two populations of annotators22.

19 For full details of the Galaxy tool see Dstl report: CR17-RCLOUD-EGFR-20170330–R, Cloud Data Analytics Lot 2, Final Report

20 https://github.com/AlephInsights/gazetteers

21 Both expert annotators were experienced intelligence analysts who had worked in a range of analytical roles, including within teams doing tactical network analysis and on operational deployments. As such, they were familiar with the kinds of information considered salient for intelligence analysis tasks and were experienced in identifying this material within text documents.

Figure 2 Screenshot of Galaxy tool showing expert annotations in text.

amount of contextual information the experts received. Sentences were annotated in short batches

to minimise fatigue.

Experts performed annotation using one of three modes of co-working23:

• Joint annotation – This was the primary mode employed at the beginning and the end of the

overall annotation effort. Experts worked together to apply annotations to the same text, with

significant discussion. At the start of the annotation process this mode allowed experts to

agree and adjust their shared understanding of the schemas. The rules for applying the

schemas were developed through this process and then used in subsequent annotations done

independently. This set of schema interpretations has been captured and is presented in

Annex A of this report. Joint annotation was also used to generate a special set of annotations

(‘platinum’), which was considered to represent the experts’ most considered annotations and

was used to recruit and qualify crowd workers. Joint annotation was also conducted at the

end of the annotation effort to see whether divergence in schema interpretation had

developed.

• Solo annotation – Having decided and agreed the interpretation of the schemas, the experts

completed the majority of annotation alone. The total number of sentences were divided into

two lists and each expert worked through their own allocated portion of the corpus. Experts

discussed instances which were considered unusual, peculiar or difficult to make a decision

about, and subsequent adjustments were made to the schema interpretation, or new rules

were created24.

• Overlapping – Some batches of sentences were completed by both experts independently,

without communication. These sentences could then subsequently be discussed to check that

interpretation was consistent between experts. It was found that generally agreement was

good and only minor changes were needed, thus providing an indication that the ‘solo’

annotated sentences were likely to be of a similar standard.

Once expert annotations had been applied using one of these three modes, the sentences were

processed for presentation to the crowd.

22 As discussed in Dstl Report Aleph/2E27A3310 (2016/17 Data Analytics Project Lot 1: Measurement and Evaluation of a Gold Dataset for Text Processing - Dataset Requirement Framework Specification - Dated: 5th January 2017) single sentences were selected as the unit of text for the present project. While this decision was taken primarily because of the unmanageable complexity of whole document coreference, it also had a range of impacts on the ability of experts and crowd workers to judge annotations. One effect was the loss of sentence context, which made it more difficult for annotators to infer meaning, particularly with regards to the classification of relationships (where their existence is often determined based on knowledge accumulated from multiple sentences). However, the smaller units of text did present a cognitively simpler task to annotators, allowing quicker and less arduous completion, fitting better with the processing of the kinds of microtasks best suited to crowdsourcing.

23 Approximately 88% of the annotations were completed through solo annotation, 8% using joint annotation and the remainder using overlapping annotation.

24 Experts kept a shared running log of rules for interpretation, including examples, decisions about how to interpret these examples and the rationale for these decisions.


3.3. Crowdsourcing

Amazon’s Mechanical Turk25 was selected as the crowd platform for generating

annotations from crowd workers. This platform was selected due to its

technical flexibility (effectively displaying and managing large volumes of crowd tasks) and because of the large population of crowd workers, with the requisite level of linguistic competence, operating on the platform.

25 https://requester.mturk.com/

Figure 3 Example of an Entity Assignment Task

Mechanical Turk uses its own terminology, which this report borrows for its description of the

crowdsourcing method. It refers to those people completing tasks as ‘workers’, and those people

publishing tasks to be completed as ‘requesters’. The individual crowd tasks themselves are referred

to as ‘assignments’. In this project, two types of assignment were published and completed by

workers: Entity Assignments and Relationship Assignments26. Figure 3 (above) shows an example

template for a typical entity assignment task. Each assignment consisted of 5 separate sentences,

each with a list of options for workers to indicate their responses. In Entity Assignments, a single instance is highlighted in a sentence of text and the worker is asked to indicate which classification they believe best suits this instance.

26 Assignments were priced at $0.06-0.12 depending on the complexity of the assignment and the need to incentivise workers.

Figure 4 Example of a Relationship Assignment Task.

For Relationship Assignments, two entities (a source and a target) are highlighted, and the worker is

asked to indicate all relationships from a list which they believe apply in that context. Figure 4

(above) shows an example of a typical relationship assignment task.

Guidance notes were presented along with each assignment, which explained how to approach the

task. The different guidance forms compiled for Entity Assignments (top) and Relationship

Assignments (bottom) are shown in Figure 5 (below).

Crowd assignments were scheduled randomly so workers would see a selection of sentences out of

their original order. Assignments were released in batches so that large numbers of tasks could be handled in manageable chunks. When generating a batch, it was possible to specify the

degree of repetition of each assignment and set the reward for each submitted assignment.
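The report does not state whether batches were published through the Mechanical Turk requester interface or its API; purely as an illustration of how the batch parameters described above (degree of repetition and reward) map onto the platform, a sketch using the boto3 MTurk client might look as follows. The question template file is hypothetical.

```python
import boto3

# Sketch only: one way of publishing a single assignment programmatically.
# The sandbox endpoint is used here; removing endpoint_url targets the live site.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

with open("entity_assignment.xml") as f:  # hypothetical question template (HTMLQuestion XML)
    question_xml = f.read()

response = mturk.create_hit(
    Title="Classify the highlighted entity",
    Description="Choose the category that best fits the highlighted text",
    Question=question_xml,
    Reward="0.06",                    # assignments were priced at $0.06-0.12 (footnote 26)
    MaxAssignments=3,                 # degree of repetition: at least 3 workers per entity task
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=7 * 24 * 3600,
)
print(response["HIT"]["HITId"])
```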

Mechanical Turk also provided a messaging system for workers and requesters to contact each other

regarding specific tasks or batches. This proved very useful, as workers were able to ask for

clarification in relation to assignments about which they were confused (although this happened

infrequently, giving some indication that the guidance for assignments was clear).

Figure 5 Entity and Relationship Assignment Guidance


While Mechanical Turk allows requesters to pre-select populations of workers (e.g. based on

geographic area), it was found that any worker with a good command of English could potentially

represent an effective crowd worker. Therefore, to ensure the largest possible recruitment pool,

workers were initially recruited by advertising assignments to all workers without restriction. These

initial assignments formed Qualification Batches, which were made up of entity instances from the

‘platinum’ data generated by expert annotators (see previous section on joint annotation by

experts).

The results from Qualification Batches were analysed to assess the performance of the workers.

Workers who completed more than 20 tasks and achieved 80% agreement with expert classifications were awarded a Qualification, which enabled them to participate in further work. Periodically, workers were assessed against platinum results to check the quality of their work. If a worker’s performance dipped below 80% agreement with experts, their qualification would be

revoked. Interestingly, out of the 200 most prolific workers initially recruited, only 5 needed this

sanction; generally, workers seemed consistent in their performance.
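A minimal sketch of the qualification rule described above (the thresholds are taken from the text; the function and data layout are illustrative only):

```python
MIN_TASKS = 20        # workers had to complete more than 20 platinum tasks
MIN_AGREEMENT = 0.80  # and reach 80% agreement with the experts' classifications

def review_worker(platinum_results):
    """platinum_results: list of booleans, True where the worker matched the expert label."""
    if len(platinum_results) <= MIN_TASKS:
        return "pending"   # not enough platinum tasks completed yet
    agreement = sum(platinum_results) / len(platinum_results)
    return "qualified" if agreement >= MIN_AGREEMENT else "revoked"
```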

All Entity Assignments were completed by at least three workers, and Relationship Assignments by

at least five workers (due to the greater levels of ambiguity surrounding relationship annotations).

Some instances were assigned to more workers in order to facilitate quality assurance of crowd

worker responses.

3.4. Confidence Calculation

The database of judgements was ingested (entirely, in the case of relationship

judgements, but partially in the case of entity judgements)27 into a software

tool, along with a Bayesian model of annotation data generation (‘IBCC’), to

generate probability distributions for the ‘correct’ annotation for each

instance. The probabilities thus assigned to the expert annotations were committed to the database

as the confidence score. A full explanation of the IBCC algorithm and its application is provided in

Annex B of this report.

3.5. Committal to Dataset

All annotation, agreement and confidence data was committed to the

dataset linked to the instance to which it applied. In practice this involved

all descriptive information about the instance (e.g. its location in a

document within the corpus, its classification by experts, its classification

by each crowd worker, its confidence score, etc.) being tied to a unique instance ID within the

underlying database. For a full description of the data structure refer to the Mongo data definition

spreadsheet accompanying the database.
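Purely as an illustration (the authoritative description is the Mongo data definition spreadsheet; all field names below are invented for this sketch), a single entity instance record carries information along these lines:

```python
example_instance = {
    "instance_id": "ENT-000123",           # unique instance ID
    "document_id": "DOC-042",              # source document within the corpus
    "sentence_index": 17,                  # location of the sentence within the document
    "span": [34, 52],                      # character offsets of the instance within the sentence
    "expert_classification": "ORGANISATION",
    "crowd_classifications": ["ORGANISATION", "ORGANISATION", "PERSON"],
    "crowd_agreement": 0.67,               # proportion of crowd judgements matching the expert
    "confidence_score": 0.91,              # IBCC-M probability that the expert label is correct
}
```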

27 Only partial ingestion, corresponding to 60% of the overall number of judgements, was computationally feasible. Beyond the 60% point, each additional crowd annotator only provided judgements for less than 2% of the overall corpus, adding something like 9,300 additional free variables to be computed per iteration of the sampling algorithm.


4. Dataset Overview

This section provides an overview of the data contained within the dataset, particularly

concentrating on statistics relating to the annotations. It includes a number of subsections, which

cover different aspects of the data analysis; these are discussed in sequence below.

4.1. Summary Statistics

The document corpus comprises 219 individual documents drawn from 25 sources (comprising open

source online content producers)28. The total number of sentences in the corpus was 2,281. There are

a total of 12,135 entity instances, 9,464 of which have been encoded by at least one ‘expert’

annotator. All of these have been verified by at least 3 crowd workers (with varying levels of

agreement). There are a total of 3,694 relation instances which have been encoded by

expert annotators, all of which have been verified by at least 5 workers.

4.2. Frequency of Instances by Type

Figure 6 (overleaf) shows the frequency of the entity types as annotated by the expert annotators

across the whole corpus. It should be noted that this is not the same thing as the ‘true’ frequency of

entity types, which is a disputed concept (as discussed in the project overview section of this report).

The most frequently occurring types of entity were organisations (3,959 instances as annotated by

experts), locations (2,171 instances as annotated by experts) and persons (1,404 instances as

annotated by experts). The large number of organisations suggests that it would be appropriate to

investigate unpacking the organisation type into a set of more narrowly defined sub-types which

would provide greater structure.

The most frequent entity types contrast markedly with the least frequent entity types of URLs (only

1 instance annotated by experts), radio frequencies (4 instances annotated by experts) and

communications identifiers (8 instances annotated by experts). This would seem to reflect the

frequencies that might be expected given the types of documents in the corpus (mostly news

articles, government information and blogs). This frequency could, of course, differ were military

and intelligence reports incorporated into the dataset, where the low frequency entities might be

referenced more often.

Figure 7 (overleaf) shows the same data based on expert annotations by relationship type. It should

be noted that relationship annotations achieved far lower levels of confidence (as discussed later in

this section) than entity types; therefore, expert annotations provide a weaker indication of the

overall annotation frequency (including crowd workers) for relationships than for entities.

There is a relatively even distribution of frequency of relationship types according to the data in

Figure 7, apart from the co-located relationship, which is much more frequent than the other

categories.

28 See ‘2016/17 Data Analytics Project Lot 1: Measurement and Evaluation of a Gold Dataset for Text Processing - Dataset Requirement Framework Specification - Dated: 5th January 2017’ for full details of how documents and sources were selected.


There are 1,430 co-located relationship instances, compared with a range of 96-411 for the other

categories (below). This is partly a feature of the nature of co-located relationships which often

apply multiply, simultaneously and mutually (i.e. when numerous objects are all in the same place).

Figure 6 Expert Annotated Entity Instances Frequency

Figure 7 Expert Annotated Relationship Instances Frequency


Figure 8 shows the number of crowd worker annotations produced for entity instances29. From the

graph it can be seen that the most frequent number was 3 annotations per entity instance. It should

be noted that those instances which appear on the graph (above) as receiving zero crowd

annotations represent entities which were extracted by Baleen, but subsequently rejected by the

expert annotators on inspection, and thus not presented to the crowd. Those instances that

received >20 annotations were part of the platinum dataset that was used to qualify workers.

29 Assignments for batches were selected randomly from the sentences annotated by experts, with a bias towards instances with no previous crowd results.

Figure 9 Number of Worker Annotations for Relationships.

Figure 8 Number of Worker Annotations for Entities


Figure 9 shows the same data for crowd worker annotations for relationship instances. The graph

shows that the significant majority of relationship instances received 5-6 crowd worker annotations.

4.3. Crowd-Expert Agreement Scores

Figure 10 shows the distribution of crowd agreement with the experts’ annotations, ranging from 0%

to 100% (1.0). The graph shows a high level of agreement between the crowd and experts for the

classification of entities, with the significant majority of entity instances receiving higher than 90%

agreement. The mean agreement between crowd workers and expert annotations was 86.9%.
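The agreement figure quoted here is simple proportion agreement per instance, averaged across the corpus; a minimal sketch of that calculation (with hypothetical field names rather than the actual database schema) is:

```python
from statistics import mean

def instance_agreement(expert_label, crowd_labels):
    """Proportion of crowd judgements that match the expert's classification."""
    return sum(label == expert_label for label in crowd_labels) / len(crowd_labels)

# Hypothetical records: each instance holds the expert label and the crowd labels it received.
instances = [
    {"expert": "ORGANISATION", "crowd": ["ORGANISATION", "ORGANISATION", "PERSON"]},
    {"expert": "LOCATION", "crowd": ["LOCATION", "LOCATION", "LOCATION"]},
]
scores = [instance_agreement(r["expert"], r["crowd"]) for r in instances]
print(f"Mean crowd-expert agreement: {mean(scores):.1%}")
```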

Figure 10 Distribution of Percentage Agreement Between Crowd Workers and Experts for Entity Annotations

Figure 11 Distribution of Percentage Agreement between Crowd Workers and Experts on Relationships.


Figure 11 shows the same data for crowd and expert agreement relating to relationship instances.

This shows a very different picture, and reflects a mean crowd agreement with expert annotations of

57.3%. The greater ambiguity inherent in the process of relationship classification, and the more

subjective nature of relationship schema interpretation are likely to be the major factors accounting

for the discrepancy between relationship and entity instance agreement in this project.

Figure 12 shows the distribution of crowd-expert agreement across quartiles by entity type. It can be

seen from the graph that ‘time’ and ‘person’ categories have the greatest proportion of high levels

of crowd-expert agreement. Other than ‘URL’, of which there was only a single instance30, the

‘nationality’ category possesses the largest proportion of low agreement scores. This corresponds

with the intuition of the expert annotators, who felt that classifying nationality was often very

confusing, due to the fact that nationality can often be used to refer to a person (e.g. “an Iraqi left

the building”) or an organisation (e.g. “The U.S. launched attacks on the city”).

30 The low agreement for the single occurrence of a URL within the dataset, may be a feature of the rarity of this category, which meant crowd workers were unlikely to be familiar with classifying this type of entity.

Figure 12 Quartiles of Crowd-Expert Agreement by Entity Type


Figure 13 depicts the distribution of crowd-expert agreement across quartiles by relationship type.

The graph shows lower levels of crowd-expert agreement across all relationship categories. Of note

is the relatively high level of agreement for the ‘co-location’ category. Figure 7 showed that this was

the most frequently tagged type of relationship. There is no data to suggest why the ‘fighting

against’ relationship category results in such low levels of agreement. This may be a result of the description of the category provided to crowd workers, or may be a feature of the nature of the relationship itself.

Figure 13 Quartiles of Crowd-Expert Agreement by Relationship Type

Figure 14 Frequency of IBCC Confidence in Judgements by Range (of confidence score)

Figure 14 provides an interesting contrast to the crowd-expert agreement scores. The data in this

graph represents the scores for confidence obtained by using the IBCC algorithm. It shows the range

of confidence scores against all of the instances in the dataset (both entity and relationship). From

this it can be seen that the vast majority of instances receive confidence scores of greater than 90%.

Additional results derived from the IBCC algorithm are now discussed in greater depth, alongside an

explanation of the basis of the calculations, in order to illustrate the method’s utility.

4.4. Measuring Confidence: the IBCC-M Model

4.4.1. Confidence

A standard approach to producing ‘gold’ data is to employ a decision rule on a set of annotations

(e.g. a certain score on inter-annotator agreement, Cohen’s kappa, or similar) and to

accept instances meeting this rule into the dataset while rejecting the rest. In this project, we

decided to take a more discriminating approach that involves assigning confidence measures to each

annotation in the dataset. The rationale for this approach is primarily that accept / reject rules do

not allow dataset users to calibrate confidence to the requirements of the task (i.e. by enabling them

to make the trade-off between the quality and size of the dataset), but using confidence measures

also respects the fact that some instances may be intrinsically ambiguous.
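In practice, this means a dataset user can trade size against quality with a single threshold on the confidence score; a trivial sketch (field name hypothetical) is:

```python
def select_gold(instances, min_confidence):
    """Return the subset of instances whose confidence score meets the chosen threshold."""
    return [inst for inst in instances if inst["confidence_score"] >= min_confidence]

# A high-stakes use case (see footnote 18) might demand min_confidence=0.99,
# whereas an exploratory study might accept min_confidence=0.7 to retain more data.
```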

In addition, we decided to use a more-sophisticated approach to assigning confidence scores, based

on machine learning, than the widespread metrics looking solely at inter-annotator agreement.

Simple metrics (e.g. percentage agreement, Cohen’s kappa or Gwet’s AC1) fail to account for

differences between the difficulty of instance types (e.g. people being easier to identify than

nationality) or, more importantly, between the quality of annotators. The project team therefore

investigated, adapted, and deployed an algorithm based on a Bayesian model of classification

performance, that produces a probability that a particular annotation is correct. Aside from its other

advantages, this has a straightforward interpretation, in contrast to other measures that cannot

straightforwardly be used to answer the question: “how likely is this annotation to be true?”.

4.4.2. Modelling Classification Performance

A number of approaches for combining classifiers of unknown quality have been proposed; Kim and

Ghahramani (2012)31 review a number of methods, which differ in the complexity of their

assumptions. Taking into account the expected sparsity of the dataset, we selected a relatively simple

model for classification generation: ‘Independent Bayesian Classifier Combination’. The empirical benefits of a variant of this approach are demonstrated in Simpson et al. (2013)32, and it is used in combination with automated text classifiers with considerable success by Simpson et al. (2015)33.

31 H. Kim and Z. Ghahramani. (2012). Bayesian Classifier Combination. In Proc. of the 15th Int. Conf. on Artificial Intelligence and Statistics, page 619

This model assumes that classifiers (i.e. annotators, in the terms of our study) are all independently

drawn from a population of unknown quality, and defined by a set of independent probability

distributions that describe, for any given ‘true’ classification of an instance, the probability of every

possible classification that that classifier will assign. The model assumes that the data are the result

of these classifiers being presented with a set of instances. Estimation of model parameters is

considerably improved where ‘gold’ datasets exist (to calibrate classifier performance), but this is

not required.

In our case we needed to reflect the possibility that expert and crowd annotators were drawn from

different populations; in particular, we wanted to model the assumption that crowd annotators

could not (at least prior to analysing the data) be trusted as much as the expert annotators.

Consequently, we adapted the model to allow for two populations of classifiers - experts and crowd.

Details of this model - ‘Independent Bayesian Classifier Combination with Multiple Populations’, or

‘IBCC-M’ - are outlined in Annex B.
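For reference, the standard IBCC generative model (following Kim and Ghahramani, 2012, and Simpson et al., 2013) can be summarised as follows; the notation is ours, and the IBCC-M variant described in Annex B differs in treating the expert and crowd populations separately.

```latex
\kappa \sim \mathrm{Dirichlet}(\nu)                                   % prior over the true class proportions
t_i \sim \mathrm{Categorical}(\kappa)                                 % true (latent) class of instance i
\pi^{(k)}_{j,\cdot} \sim \mathrm{Dirichlet}(\alpha_{j,\cdot})         % row j of annotator k's confusion matrix
c^{(k)}_i \mid t_i = j \sim \mathrm{Categorical}(\pi^{(k)}_{j,\cdot}) % label assigned by annotator k to instance i
```

The confidence score for an instance is then the posterior probability, given all observed labels, that its true class matches the class the experts assigned.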

4.4.3. Estimating the Model

In non-technical terms, the IBCC-M model proposes that the annotated data are generated by

passing instances (of unknown types) through classifiers (of unknown performance). Given that

classification performance can only be properly inferred if we know the true classification of (at least

some of) the instances, and that the instance types can only be inferred if we know the true

performance of the classifiers, it might seem that knowing neither makes for a circular impasse.

However, this is not the case. With a very weak assumption that classifiers are slightly more likely to

be correct than not, it is possible to estimate both unknowns using statistical methods. Intuitively,

this is related to the idea that if three people agree on an annotation, it’s more likely that that thing

is correct, than that the three are independently (randomly) selecting the same label.
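As an illustrative calculation (not drawn from the project data): suppose each of three independent annotators selects the correct label with probability 0.6 and, when wrong, picks uniformly from the remaining four labels of a five-class schema. If all three assign the same label A, Bayes’ rule with a uniform prior over the five classes gives

```latex
P(\text{true} = A \mid \text{three annotators say } A)
  = \frac{0.6^3}{0.6^3 + 4 \times 0.1^3}
  = \frac{0.216}{0.220}
  \approx 0.98
```

so agreement between even modestly reliable annotators supports a much higher confidence than any individual annotator’s accuracy of 0.6 would suggest, which is consistent with the observation later in this report that IBCC-M confidence scores generally exceed raw agreement frequencies.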

32 E. Simpson, S. Roberts, I. Psorakis, and A. Smith. (2013) Dynamic Bayesian Combination of Multiple Imperfect Classifiers. In Intelligent Systems Reference Library series: Decision Making and Imperfection, pages 1–35. Springer.

33 Simpson, Edwin, Venanzi, Matteo, Reece, Steven, Kohli, Pushmeet, Guiver, John, Roberts, Stephen and Jennings, Nicholas R. (2015) Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning At 24th International World Wide Web Conference (WWW 2015). , pp. 992-1002. (doi:10.1145/2736277.2741689).


Nevertheless, estimating the model - i.e. calculating (or at least putting probability distributions on)

the parameters describing classifier performance and true instance characteristics - is not

straightforward. Because of the entanglement described above, the parameter values cannot be

estimated analytically. Instead, stochastic sampling methods must be used to approximate the

probabilities of the parameters, given a set of annotations. The IBCC-M model, by design, is

amenable to Gibbs Sampling34 and so we used the specialised WinBUGS35 (Bayesian Inference Using

Gibbs Sampling for Windows) software application to perform the necessary estimation.
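The estimation itself was performed in WinBUGS; purely to make the mechanics concrete, the following is a minimal numpy sketch of a Gibbs sampler for the single-population IBCC model (it does not implement the two-population IBCC-M extension, and it makes no attempt at the efficiency or convergence checking a production run would need):

```python
import numpy as np

def ibcc_gibbs(labels, K, n_iter=500, burn_in=100, seed=0):
    """Minimal Gibbs sampler for Independent Bayesian Classifier Combination.

    labels : (N, M) integer array of annotator judgements, with -1 where an
             annotator did not judge that instance.
    K      : number of classes in the schema.
    Returns an (N, K) array of posterior probabilities over each instance's true class.
    """
    rng = np.random.default_rng(seed)
    N, M = labels.shape

    # Symmetric Dirichlet priors; the heavier diagonal of alpha encodes the weak
    # assumption that annotators are more likely to be right than wrong.
    alpha = np.ones((K, K)) + 2.0 * np.eye(K)
    nu = np.ones(K)

    # Initialise true labels from a majority vote (random where unjudged).
    t = np.empty(N, dtype=int)
    for i in range(N):
        obs = labels[i][labels[i] >= 0]
        t[i] = np.bincount(obs, minlength=K).argmax() if obs.size else rng.integers(K)

    posterior = np.zeros((N, K))
    for it in range(n_iter):
        # 1. Sample class proportions kappa given the current true labels.
        kappa = rng.dirichlet(nu + np.bincount(t, minlength=K))

        # 2. Sample each annotator's confusion matrix given t and their judgements.
        pi = np.empty((M, K, K))
        for m in range(M):
            judged = labels[:, m] >= 0
            counts = np.zeros((K, K))
            np.add.at(counts, (t[judged], labels[judged, m]), 1.0)
            for j in range(K):
                pi[m, j] = rng.dirichlet(alpha[j] + counts[j])

        # 3. Sample each instance's true label given kappa, pi and its judgements.
        for i in range(N):
            logp = np.log(kappa)
            for m in range(M):
                if labels[i, m] >= 0:
                    logp = logp + np.log(pi[m, :, labels[i, m]])
            p = np.exp(logp - logp.max())
            p /= p.sum()
            t[i] = rng.choice(K, p=p)
            if it >= burn_in:
                posterior[i] += p

    return posterior / (n_iter - burn_in)
```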

However, although the model estimation worked extremely well for dummy data we generated for

testing, and for our individual relationship classification models36, the sheer volume of missing data37

meant that the software became unstable when we tried to deploy it using all of our entity

annotation data. Figure 15 shows the total number of unknowns in the model (each of which needs

to be re-estimated several hundred times) with increasing numbers of annotators; with all the crowd

workers included, there are more than 2.5 million unknowns. We were able to exploit the power-

law-like distribution of effort among annotators to compromise effectively in generating entity

confidence scores. Figure 16 shows that the top 15 annotators’ data (the two experts and the

busiest thirteen crowd workers) accounted for 60% of the total judgements, but less than 5% of the

unknown variables of the full model. This should be borne in mind when interpreting the confidence

scores, although the scores themselves already account for missing

data by (essentially) not being as high (or low) as they would be if more data were available.
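The subsetting itself is straightforward; a sketch of selecting the most prolific annotators from a list of judgements (data layout hypothetical) is:

```python
from collections import Counter

def busiest_annotators(judgements, n=15):
    """judgements: iterable of (annotator_id, instance_id, label) tuples.
    Returns the IDs of the n annotators who made the most judgements."""
    counts = Counter(annotator for annotator, _, _ in judgements)
    return [annotator for annotator, _ in counts.most_common(n)]
```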

34 A statistical method of estimating complex probability distributions by iteratively updating individual parameter values

35 https://www.mrc-bsu.cam.ac.uk/software/bugs/the-bugs-project-winbugs/

36 For reasons outlined in the Annex, the fact that our entity schema was mutually exclusive, while our relationship schema was not, meant that we could treat the relationship classifications as a range of simple models whereas the entity model was a single, complex one

37 Each annotator, on average, made judgements about less than 3% of the total number of instances

Figure 15 Number of Unknown Parameters by Number of Annotators


Finally, the estimated parameter values were transformed straightforwardly into confidence scores

by extracting the probabilities, conditional on the data, attached to the expert annotations. In some

cases, these were lower than the probabilities attached to alternative annotations. For reasons of

transparency, however, we decided to retain the original (human expert) annotations together with

the confidence scores, rather than report the (IBCC-M-generated) alternative annotation. This

effectively meant that although the crowd could disagree (by indicating alternative annotations, thus

lowering confidence) with the experts, they could not ‘suggest’ alternatives in a way that would lead

to them entering the database.

4.4.4. Findings from the Confidence Measurement

In theory, the confidence scores measure the probability that the expert annotation is correct. We

have no real way of telling whether this is true, lacking any external source of data, although the

studies referred to above give some indication of the model’s value. However, there are a number of

indications of the approach’s validity, and indeed of its potential advantages over more simplistic

metrics.

First, and most straightforwardly, the IBCC-M confidence measures were in almost all cases higher

than simple agreement frequency (i.e. the proportion of raters who agreed with the experts); for

92% of instances, the IBCC-M confidence probability exceeded the frequency of annotator

agreement. This suggests that the model is accurately capturing the fact that if two (or more) raters

agree, this provides particular justification for the hypothesis that this is due to anchoring around

the truth rather than to chance agreement. Figure 17 illustrates this difference by plotting the traditional measure of crowd-expert agreement against IBCC-generated confidence scores.

Figure 16 Number of judgements made by annotators


The IBCC-M model also exhibited some interesting behaviour. First, and without any explicit

prompting from the experts or through the design of the model, it learned to put very little credence

in crowd annotations either of ‘NONE’ or of ‘DON’T KNOW, CAN’T TELL (DKCT)’38. In most of the

other categories, the frequency of its assignment was of a very similar order of magnitude to those

of the annotators, but it assigned almost no instances to these ‘bucket’ categories. We do not know

exactly why39, but one plausible hypothesis is that those annotations (by crowd workers), when they

occurred, occurred largely randomly and were not correlated with one another.

The confidence algorithm also identified a few cases that (when followed up) appeared to be the

result of misunderstanding by the crowd (or, perhaps, miscommunication by the project team). For

example, one of the instances mentioned “6787 civilians”; some of the workers thought this was a

quantity, and the algorithm assigned a confidence of only 46% to the experts’ judgement that this

was an ‘organisation’. In fact, the experts were following a pre-agreed schema in which numbers of

people were considered organisations, but the crowd workers were not trained in this specific

interpretation. Another instance talked of the ‘Arab world’: even though many of the annotators

agreed with the experts that this was a location, the algorithm did not, and thought it was more

likely to be an ‘organisation’. Figure 18 shows the slight discrepancies between annotator and IBCC-defined assignments, highlighting the algorithm’s lack of confidence in the NONE and DKCT categories.

38 For example, the algorithm discounted crowd annotators’ views concerning one instance which mentioned a ‘major oil pipeline’; three out of four of them thought it did not fall into anything in the schema, yet the model nevertheless put a confidence of 43% in the experts’ judgement that it was a location.

39 One of the drawbacks of the machine learning approach to assigning confidence is a lack of transparency in the provenance of the conclusions

Figure 17 Average confidence compared to agreement


Without further and more systematic investigation, we cannot determine exactly the performance characteristics of the IBCC-M model and its Bayesian siblings on intelligence-related corpora of the kind under study here. Counting against this approach are its computational burden (which, for larger bodies of work or larger numbers of annotators, would require significant outlay) and its lack of immediate transparency compared with more intuitive metrics. In its favour are the clear interpretation of its output and its subtlety in differentiating both between annotators and between instance types of varying difficulty.

The approach does hold out the possibility of delivering reliable gold data even in the absence of experts. By learning the characteristics of annotators, an approach like this may be able to deliver reliable, crowd-sourced judgements at only a fraction of the expense of employing people with domain knowledge to conduct the annotation40, and without necessarily requiring any knowledge of ‘ground truth’. In addition, if it could be automated effectively, this approach offers the prospect of relatively fast identification of well-performing annotators, or indeed of annotators who are particularly effective at identifying particular categories of instance. We believe further investigation in this direction would be extremely worthwhile.

40 We estimate that the average cost per crowd worker annotation, including design of the template, management of the crowd and processing of the data, is £0.20-0.30 per annotation; by way of comparison, the expert annotators tagged approximately 100 individual instances per hour, which would equate to £20-30 per hour for an expert annotator (including the cost of verifying their work) if paid at the same rate as the crowd. Further, the cost per crowd-supplied annotation would decrease markedly as the task is upscaled, because most of the cost is attached to the management of the crowd, the development of the tasks themselves and the processing of the results.

Figure 18 Annotator vs IBCC judgements
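As an illustration of how learned annotator characteristics could support the identification of well-performing annotators described above, the sketch below scores each annotator by the expected probability of a correct classification implied by an IBCC-style confusion matrix. The matrices, names and values are invented for this sketch; they are not outputs from the project’s model.

```python
import numpy as np

# Hypothetical posterior-mean confusion matrices from an IBCC-style model:
# confusion[a][t, c] = P(annotator a labels category c | true category t).
# Shapes, names and values here are illustrative assumptions.
rng = np.random.default_rng(0)
R = 16                                  # number of entity categories
confusion = {f"worker_{a}": rng.dirichlet(np.ones(R), size=R) for a in range(3)}
kappa = np.full(R, 1.0 / R)             # estimated frequency of each true category

def expected_accuracy(conf_matrix, class_freq):
    """Probability mass on the diagonal of the confusion matrix,
    weighted by how often each true category occurs."""
    return float(np.sum(class_freq * np.diag(conf_matrix)))

scores = {name: expected_accuracy(m, kappa) for name, m in confusion.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: expected accuracy {score:.2f}")
```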


5. Conclusions and Recommendations

5.1. Observations about the creation of the dataset and its utility

One of the significant difficulties encountered throughout this project was designing and then implementing entity and relationship schemas. The experience of this project was that, however a schema was designed, there would always be examples within the corpus that did not fit it, or that forced the annotator to make seemingly nonsensical decisions. To some extent, the application of a rigid schema is something of a fool’s errand. A rigid schema must attempt to presuppose the types of problem that will be addressed using the dataset. This complicates the design of the schema and forces significant compromises to be made before any data is assessed or structured. Schema design must also attempt to reconcile conflicting aims: a schema must simultaneously be suitably broad to cover generally useful types, but specific enough to be meaningful. Ultimately, any schema must be designed with representative questions in mind, and once it is in use the data structure derived from it cannot easily be adapted to suit other specific questions without significant reclassification or restructuring effort.

A possible solution to this problem could involve the use of an emergent schema, where the task of annotating the data is combined with creating the schema. Chapman and Dowling (2006)41 present a structured approach to developing a schema inductively, based on the corpus contents and the question of interest. However, given advances in machine learning, it is conceivable that the process of forming an emergent schema and determining saliency could be augmented by automated approaches. In such a solution, the characteristics of the differences between classes in an ontology might be recognised algorithmically, in effect leading to the generation of a meta-schema.

Crowdsourcing was another vital aspect of this project. The approach of utilising crowd workers

through Mechanical Turk proved valid and appropriate for the scope of this dataset. The population

of workers was easily able to accommodate the scale of work produced, and there would be capacity

for tasking on a greater scale. The use of the crowd in this way presented excellent value for money in

terms of annotation effort and should be considered for similar tasks in the future (see the following

subsection for specific recommendations).

Throughout this project, the issue of what constitutes a gold standard dataset for natural language

processing has arisen. Wissler et al. (2014)42 have suggested a definition describing a dataset which

has been extensively tagged by a range of human annotators whose outputs are cross-validated

against one another, and against ‘objective’ automated tagging. However, this does not resolve the

issue of deciding on the requisite quality of the annotations and whether inter-annotator agreement

is a good measure of quality. Central to this issue is the question of whether or not a semantic

model, which considers all natural language text to contain ‘ground truth’ meaning, is valid or useful.

41 Chapman, Wendy W., and John N. Dowling. "Inductive creation of an annotation schema for manually indexing clinical conditions from emergency department reports." Journal of Biomedical Informatics 39, no. 2 (2006): 196-208.

42 Wissler, L., Almashraee, M., Díaz, D.M. and Paschke, A., 2014. The Gold Standard in Corpus Annotation. In IEEE GSC.


The preceding report associated with this project43 discussed the theoretical foundations of meaning, ambiguity and the relevance of both concepts to annotating natural language text. It argued that trying to reduce ambiguity could, in some contexts, actually erode the information value of text rather than increase it; this idea is encapsulated by Dumitrache et al.’s (2017) hypothesis that “annotator disagreement is not noise, but signal”44. Further, in the present report we have discussed and demonstrated a method for measuring ‘confidence’ in annotations and considered its relevance to determining ambiguity and ‘true’ meaning. We conclude that such an approach is better able to distinguish between genuine ambiguity and poor annotation than simple inter-annotator agreement is. Overall, the IBCC-M model used in this project provided higher confidence scores across most instances, and drew attention to instances where the experts’ annotation decisions were open to valid alternative interpretations.

This ability to handle divergent interpretations is particularly important for relationship extraction,

where agreement is generally much lower (as demonstrated in this project and other research45).

However, this disagreement should not be viewed as a barrier to using the dataset for machine

learning training and validation. Li, Good and Su (2015)46 showed that even corpora where there are

fairly low levels of annotator agreement for relationships can be used successfully to train machine

learning approaches.

The dataset itself should represent a rich source of data for its intended purpose of training and

evaluating machine learning approaches to text extraction. The IBCC-M model, and approaches like

it, offer a great deal of potential value over and above simpler metrics. Although they carry a

computational and transparency burden, the intuitiveness of their output, and the prospects they

offer in terms of precision about instance categories as well as annotators, would be of substantial

benefit in a range of contexts.

5.2. Potential future developments

Whilst the dataset produced by this project is both rich in information and robust in terms of the

quality of that information, there is always the potential to improve either the efficiency of

developing a dataset, or the utility of its contents. A list of potential future developments for related

datasets is included below:

• Gold Standard data production without experts - The use of Bayesian algorithms for assessing and scoring the confidence of annotations, demonstrated in this project, presents the possibility of delivering reliable ‘gold’ data even in the absence of experts. This would make it possible to produce very large, structured datasets of high quality, with

43 2016/17 Data Analytics Project Lot 1: Measurement and Evaluation of a Gold Dataset for Text Processing - Dataset Requirement Framework Specification - Dated: 5th January 2017

44 Dumitrache, Anca, Lora Aroyo, and Chris Welty. "Crowdsourcing Ground Truth for Medical Relation Extraction." arXiv preprint arXiv:1701.02185 (2017).

45 For example, Aroyo, Lora, and Chris Welty. Harnessing disagreement in crowdsourcing a relation extraction gold standard. Tech. Rep. RC25371 (WAT1304-058), IBM Research, 2013.

46 Abstract to Li, Tong Shu, Benjamin M. Good, and Andrew I. Su. "Exposing ambiguities in a relation-extraction gold standard with crowdsourcing." arXiv preprint arXiv:1505.06256 (2015).


minimal requirement for costly subject matter expertise. We believe further investigation in

this direction would be worthwhile.

• Fully utilising confidence scores - Beyond using the confidence score to set a threshold for data quality (i.e. discounting instances below a certain confidence score), the dataset could be used in a more sophisticated way. The confidence score could be incorporated into probabilistic machine learning training, for example by informing the priors about the data, the annotators or other features (one simple use of the scores is sketched after this list). The IBCC algorithm is just one of a large family of similar techniques; other approaches and models could be explored to provide additional data to assist with the training of classifiers of various types.

• Dynamic schema - The previous subsection discussed the problems inherent in trying to

implement rigid schemas. A more integrated hybrid approach using a combination of expert

and crowd workers operating in concert with automated extractors could be used to define

and evolve adaptable schemas.

• Large scale co-referencing - Indicating coreference between sentences was determined to be out of scope for this project. However, it is recognised that such an exercise, while challenging,

would add significant value to any dataset. Indicating coreference could be performed by

crowd workers managed in a similar way to the approach followed in this project. This process

could be augmented by the targeted use of machine learning.

• Intelligent crowd management - A number of methods could be employed for future

crowdsourcing tasks, to utilise resources more efficiently, particularly in the case of larger

scale projects. Such methods might include: dynamically adjusting payment and other

incentives to increase productivity or quality; monitoring performance in real-time using the

constant running of IBCC over crowd results; creating hierarchies of workers or specialised

qualifications to better target the most difficult tasks; or using mutually exclusive sub-groups

of the crowd to validate different presentation methods.
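The sketch below illustrates the ‘fully utilising confidence scores’ point above in the simplest possible way: rather than informing priors, it weights each training example by the confidence attached to its label. The texts, labels, confidence values and choice of scikit-learn classifier are all assumptions made for illustration, not part of the delivered dataset or the project’s tooling.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative gold instances: text, expert label and a confidence score.
# All values are invented for this sketch.
texts = ["the rebels seized the airfield",
         "he transferred $40,000 to the group",
         "the convoy reached Aleppo on Tuesday"]
labels = ["Organisation", "Money", "Location"]
confidence = [0.95, 0.88, 0.61]

# One simple use of the confidence scores: weight each training example by the
# model's confidence in its label, rather than discarding low-confidence items.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels, logisticregression__sample_weight=confidence)

print(model.predict(["troops entered the city"]))
```

Low-confidence instances still contribute to training under this scheme, but proportionately less, which avoids the information loss of a hard confidence threshold.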


Annex A - Interpretation of Entity and Relationship Schemas

This annex provides an informal guide to the interpretation of the annotation schemas used in this

project and has been produced to increase understanding of the dataset. It gives further details on

the way the schemas used in this project have been applied during the expert annotation of the

dataset, and reflects the instructions issued to crowd workers to guide their data tagging activities.

The interpretations which have evolved during this process have been driven by practical

considerations about ensuring the annotation is as efficient and unambiguous as possible, but also

by the requirement to capture information that would be relevant to the intelligence analyst using

the available schema.

A1. Entity Schema

The main challenge in annotating entities has involved deciding the exact text which constitutes a

given referent. In general terms, definite and indefinite articles have been included in the entity

annotation in order to distinguish between, for example, “the man” and just “man”. However,

interpretation has been required to determine which adjectives and other descriptive phrases form

part of the referent. For example, in the text – “the tall man in the black coat” – the entity would be

captured simply as “the tall man”. In this case, “the black coat” would be considered as a separate

object which belonged to “the tall man”. However, there are times where the context within the sentence makes this separation more difficult. The table below contains the interpretations for the different entity types.


CommsIdentifier
Description: An alias used to represent people, places, military units etc. in electronic communication. Could be a call sign, an email address, a twitter handle.
Interpretation rules: These have been relatively rare occurrences in the dataset. On most occasions, they have included email addresses or twitter handles, and have been relatively straightforward to classify.

DocumentReference
Description: A unique (or semi-unique) identifier for a document.
Interpretation rules: Within the dataset this category has mainly been restricted to instances referring to religious texts, specific reports produced by organised bodies, or references to specific written legislation or policy (e.g. UN Security Council Resolutions).

Frequency
Description: A radio frequency used in electronic communication.
Interpretation rules: There have been very few of these occurring in the dataset. The limited number of examples have been unambiguous in their classification.

Location
Description: A specific point, area or feature on earth. May be named (e.g. a city name) or referenced (e.g. with a coordinate system).
Interpretation rules: This has been interpreted to include geographical features such as mountain ranges and rivers. It has also been used to capture abodes such as ‘his home’ or ‘their flat’. The annotation also includes locations expressed relative to another location (e.g. 25 miles north of the capital city). It has not been used for references to abstract and unspecified locations such as ‘the battleground’.


MilitaryPlatform
Description: A military ship, aeroplane, land vehicle or system, or a platform onto which weapons might be mounted. May be indicated by its abstract class, e.g. “tank”, or by specific designation or other aliases, e.g. “T-14 Armata”.
Interpretation rules: The application of this category has mostly been obvious, with two exceptions. The first relates to the use of civilian vehicles for military purposes (e.g. a jeep being used by terrorists to conduct an attack) and the second regards named military vehicles (such as naval vessels) where it is not obvious, from the isolated sentence, to what the name refers (i.e. it seems it could be a person’s name). For the first case, the decision was taken to classify vehicles whose primary purpose is military within the context of the sentence as military vehicles. In the second case, where the vehicle’s name is not obviously associated with a military vehicle (e.g. “Victoria sailed to the Mediterranean”), it has not been classified.

Money
Description: An amount of money or reference to currency.
Interpretation rules: This category has been used to capture both the reference to the currency concerned and the quantity of that currency (e.g. “$40,000”). It has also been used to refer to sources of money, where they are used to indicate another entity having access to or being sent money (e.g. “benefit payments” or “oil revenue”).

Nationality
Description: Description of nationality, religious or ethnic identity.
Interpretation rules: Nationality has only been annotated where it is not part of another referent. For example, “a Syrian man” would be classified as a person, not a nationality. However, in the sentence – “the food eaten at the party was mainly Syrian” – the nationality reference for “Syrian” would be annotated. This category has also been used to encode other identity descriptors such as religious (e.g. Sunni) or ethnic (e.g. Kurdish) identity.

Organisation
Description: A group of people, a government, a company, or a family. May be referenced by name, e.g. “the British Army”, or by relying on context, “the Army”. May also be indicated by an abstract class, e.g. “rebels”.
Interpretation rules: This has been one of the most inclusive categories, encapsulating a very broad range of entities. It has been used to annotate official organisations and groups (such as states, companies or political movements), but also any references to collective plurals of people (e.g. “the population of Aleppo” or “the six soldiers”). Where groups of people have been annotated, the number of people has been captured within the entity as well (e.g. “30,000 refugees”).


Person
Description: A specific person. May be indicated by a name, e.g. Barack Obama; by title, “President of the United States”; by a combination of title and name, e.g. “President Obama”; or by reference to some other entity, e.g. “the US’s head of state”.
Interpretation rules: In addition to names and titles of individuals, annotation under this category has been applied to references to any individual within a sentence (e.g. “the terrorist suspect”, “the Kurdish soldier” or “his wife”) and also personal pronouns (e.g. “he” or “she”).

Quantity
Description: A quantity or amount of something, e.g. “1 kg”.
Interpretation rules: This has been used to capture values and the units in which those values are measured (e.g. “30 kilometres”) and the quantities of other objects (e.g. “6” meetings held per year or “15” cases of ammunition). The exceptions to this are quantities of money (where the quantity may be included in the entity classified as ‘money’), quantities of people (where the number of people may be included in the entity ‘organisation’) and quantities of distance incorporated in a location (e.g. “15 kilometres to the west of Baghdad”).

Temporal
Description: A specific date, time or range of time, e.g. 10:30 PST, 12/09/2016, “next week”, “Wednesday”.
Interpretation rules: Annotations under this category include a wide range of references to periods or points in time. They might include years, months, days, dates or times, but also include relative temporal references to particular events (e.g. “today”, “three days later” or “for five years”).

Url
Description: A web URL.
Interpretation rules: There have not been many instances relevant to this category occurring in the dataset. Those that have occurred have been straightforward to classify.


Vehicle
Description: A non-military maritime, land or air vehicle. May be indicated by its abstract class, e.g. “airliner”, by a specific vehicle name, e.g. “Boeing 777”, or by another related identifier, e.g. its flight name “MH370”.
Interpretation rules: The annotations in this category have generally been uncomplicated. In addition to including types and names of vehicle (as described in the entity description above), they also include possessive determiners and possessive nouns (e.g. “his car” or “the President’s helicopter”).

Weapon
Description: A weapon. May be indicated by its abstract class, e.g. “rifle”, or by specific name or alias, e.g. “AK-47”.
Interpretation rules: In addition to the classifications included in the entity description, this category has also been used to include objects which the sentence indicates are being used as a weapon (such as “a rifle butt”).
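For readers processing the dataset programmatically, the entity types above, together with the ‘NONE’ and ‘DON’T KNOW / CAN’T TELL’ bucket categories used in the confidence modelling (Annex B), can be collected into a simple enumeration. The representation below is an illustrative sketch rather than part of the delivered dataset; only the category names are taken from the schema.

```python
from enum import Enum

class EntityType(Enum):
    """Entity categories from the annotation schema, plus the two
    'bucket' categories used by the crowd workers and the IBCC-M model."""
    COMMS_IDENTIFIER = "CommsIdentifier"
    DOCUMENT_REFERENCE = "DocumentReference"
    FREQUENCY = "Frequency"
    LOCATION = "Location"
    MILITARY_PLATFORM = "MilitaryPlatform"
    MONEY = "Money"
    NATIONALITY = "Nationality"
    ORGANISATION = "Organisation"
    PERSON = "Person"
    QUANTITY = "Quantity"
    TEMPORAL = "Temporal"
    URL = "Url"
    VEHICLE = "Vehicle"
    WEAPON = "Weapon"
    NONE = "NONE"
    DONT_KNOW_CANT_TELL = "DKCT"

assert len(EntityType) == 16   # matches R=16 in the IBCC-M entity model (Annex B)
```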

A2. Relationship Schema

The annotation of relationships has required a more flexible interpretation of the context and

meaning of a sentence. It has not lent itself to the imposition of rigid universal annotation rules. The

interpretations in the table below are more focussed on the kinds of relationships in general terms

that have been annotated, rather than examples of specific uses of language which encapsulate

those relationships.


Co-located
Description: Two or more entities in the same place at a given time (e.g. two vehicles being co-located).
Interpretation rules: This relationship occurs commonly within the dataset and there are three specific interpretations within the annotations which require highlighting. The first relates to the tense within the sentence. Co-location between entities has been recorded in cases where the text indicates two or more entities were previously in the same place, are currently in the same place or will be in the same place (excluding speculative phrases such as “should” or “may”). Secondly, co-location is sometimes recorded unidirectionally where a smaller entity is located in a larger entity, but not vice versa (e.g. “the house” was in “London”). This also applies to features such as rivers, which may be “in” a city, but not vice versa. Finally, the issue of proximity has required a degree of interpretation to make decisions about whether entities are close enough to be considered as co-located. Words such as “near” have generally been used to indicate co-location in instances where this provides information which might have value to the hypothetical intelligence analyst.

Apart
Description: Two or more entities in different places at a given time (e.g. two people being apart).
Interpretation rules: This relationship has only been recorded where the text explicitly records that entities were apart (e.g. “the insurgents have been driven out of the city” or “the President missed his visit to Paris”). It does not include examples where significant inference is required to determine that two entities are apart.

Belongs to
Description: One entity being owned by another entity (e.g. one vehicle belonging to a person), or an entity being a member of another entity (e.g. a person being a member of an organisation).
Interpretation rules: This relationship category has been applied to text where a physical object is in the possession of another entity, where a person or group is a member of another organisation, or where money is owned by another entity. Where the possessive noun or determiner is included in the entity, the ‘belongs to’ relationship has not been recorded. For example, in the sentence – “the President’s helicopter crashed in the jungle” – the “President’s helicopter” is captured as the entity, not a “helicopter” belonging to “the President”.


In charge of
Description: One entity controlling or having responsibility for another entity (e.g. an organisation being in charge of a building).
Interpretation rules: This has mainly been applied to command structures and control of territory; however, occasionally it has been used to indicate that an individual or organisation has control over the use of a weapon or vehicle.

Has the attribute of
Description: One entity being an attribute or associated quality of another entity (e.g. a type of weapon having the attribute of a quantity, or a person having the attribute of a nationality).
Interpretation rules: The main use of this relationship has been to attach quantities to entities such as vehicles, weapons or locations (e.g. “12” tanks). It is also used in instances where nationality is applied to another entity, but where this is not part of the referent associated with the entity (i.e. the relationship is used for examples like “the man was Syrian”, but not in examples like “the Syrian man”, where the nationality is incorporated into the ‘person’ entity).

Is the same as
Description: One entity being used to mean the same thing as another entity (e.g. multiple names for the same person).
Interpretation rules: This relationship has been used as a distinct classification compared to the coreferencing function provided by the Galaxy annotation tool. “Is the same as” has been reserved for cases where two specific phrases are used for an entity, each of which uniquely identifies the entity to the extent that it might feasibly be recognised by any one of the phrases (e.g. the terrorist organisation “Daesh”, which calls itself “the Islamic State”). Coreferencing has been used for non-specific references to the same entity in a sentence (e.g. “the man” walked back the way “he” had come).

Likes
Description: One entity being positively disposed towards another (e.g. a person likes a military platform).
Interpretation rules: This relationship has been used relatively infrequently, for sentences that use synonyms of “to like” in the context of the relationship between entities. It may also have been employed to annotate instances where the text implies that one entity views another favourably.

Dislikes
Description: One entity being negatively disposed towards another (e.g. a person dislikes another person).
Interpretation rules: The main application of this relationship in the dataset has been to sentences which imply the existence of an acrimonious relationship between entities, or where one entity is quoted as saying something disparaging or universally negative about another entity.


Fighting against
Description: One entity being in armed conflict against another entity (e.g. an organisation is fighting against another organisation).
Interpretation rules: The instances annotated with this relationship have predominantly been those which specify military action having occurred between entities, or which indicate that an entity is engaged in a military campaign against another entity.

Military allies of
Description: One entity fighting alongside another entity on the same side of a conflict (e.g. one person being a military ally of another person).
Interpretation rules: This relationship has been recorded between entities where it is clear that they are co-operating militarily against a common enemy. It has not generally included financial or other support which is provided by a partner which is not actually involved in the military operations.

Communicated with
Description: An entity has had direct or indirect contact with another entity (e.g. met, spoken with, communicated remotely with, messaged, emailed).
Interpretation rules: This relationship has been used to capture instances where information has been exchanged between parties, or where one entity has passed information to another. In the former case, the communication is considered to be symmetric; in the latter case it is annotated as unidirectional. The relationship has also been applied to sentences which indicate that communication must have taken place in order for an outcome to occur (e.g. “the ceasefire brokered by Russia and Turkey”).


Annex B: The IBCC-M Model

B1. Introduction

The IBCC-M model was adapted from the IBCC model presented in Kim and Ghahramani (2012)47 specifically for the present study. The adapted model allows for multiple populations of classifiers (in this case two: experts and crowd workers)48.

B1.1 Model Specification

The data-generating model, in qualitative terms, is as follows. A number of ‘instances’ are

independently generated, each instance being exactly one of R different ‘types’ - these are unknown.

Each of these instances is passed to one or more annotators who classify it as being in one of the R

categories. The annotators’ performance is governed by a ‘confusion matrix’, which consists of a

separate probability distribution for each potential ‘type’ that the annotator is presented with.

Again, we assume we do not know these distributions but must infer them. We assume there are

two types of annotator - ‘crowd’ and ‘expert’ - which are distinguished only by our prior assumptions

about their quality. Ultimately, the aim of the inference process will be to generate probability

distributions for the type of each instance, indicating the ‘confidence’ of each classification, and

specifically (in this case) of the experts’ agreed-upon annotation.

The technical specification of the model is as follows:

• There are N instances, each of which is one of R different types; there are K crowd annotators

and L expert annotators.

• The ‘true’ frequency distribution of sentence types is denoted by a 1xR vector 𝜿; each

sentence type is drawn from a categorical distribution with this vector as the parameters;

• This vector 𝜿 is generated from a Dirichlet distribution with hyperparameters 𝝂, representing

our prior understanding of the distribution of instance types49.

• The ‘confusion matrices’ for the crowd annotators are represented by a set of 1xR vectors 𝝅[1 to K, 1 to R], where 𝝅[i, j] is the confusion vector (a distribution over the R possible judgements) for annotator i when presented with an instance of type j. These are all independently drawn from Dirichlet distributions with hyperparameters 𝛼[1 to R].

• The ‘confusion matrices’ for the expert annotators are represented by a set of 1xR vectors 𝜹[1 to L, 1 to R], where 𝜹[i, j] is the confusion vector for expert annotator i when presented with an instance of type j. These are all independently drawn from Dirichlet distributions with hyperparameters 𝜷[1 to R].

47 H. Kim and Z. Ghahramani. (2012) Bayesian Classifier Combination. In Proc. of the 15th Int. Conf. on Artificial Intelligence and Statistics, page 619.

48 Different populations of classifier are distinguished only by the data presented to the model, which reflects differences in their prolificity (experts completed more annotations than individual crowd workers) and by the prior expectation of their reliability (experts were initially considered to have a higher level of annotation accuracy than crowd workers). However, the more information there is about an annotator (i.e. the more judgements they make), whether expert or crowd, the more finely calibrated the assessment of that annotator’s reliability will be. Likewise, the more annotators exposed to a given instance, the more finely calibrated the assessment of that instance’s ‘true type’ will be. In this way, the number of classifiers in each population is not directly relevant to the way the model calculates confidence for the classification of each instance.

49 In fact, we used a nearly flat prior, essentially representing no prior expectations about the distribution of types.

• For each instance i, the instance type t[i] is generated from a categorical distribution with parameters 𝜿; for each crowd annotator j exposed to that instance, their judgement c[j, i] is generated from a categorical distribution with parameters 𝝅[j, t[i]]; for each expert annotator k exposed to that instance, their judgement d[k, i] is generated from a categorical distribution with parameters 𝜹[k, t[i]].

• The data consists of the set of judgements c[1 to K, 1 to N] and d[1 to L, 1 to N], which in

practice was very sparse because each annotator was exposed to only a very small subset of

instances.

• Our aim is to infer the posterior probability distributions for each sentence type t[i], and also

instrumentally to infer the confusion vectors for each crowd annotator 𝝅[1 to K, 1 to R] and

expert annotator 𝜹[1 to L, 1 to R].

To summarise the model:

• 𝜿 ~ Dir(𝝂)

• 𝝅[i, j] ~ Dir(𝛼[j]) for i=1 to K and j=1 to R

• 𝜹[i, j] ~ Dir(𝜷[j]) for i=1 to L and j=1 to R

• t[i] ~ Cat(𝜿) for i=1 to N

• c[i, j] ~ Cat(𝝅[i, t[j]]) for i=1 to K and j=1 to N

• d[i, j] ~ Cat(𝜹[i, t[j]]) for i=1 to L and j=1 to N

• The data consist of c[j, i] and d[k, i] for j=1 to K, k=1 to L, and i=1 to N
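A minimal numerical simulation of the generative process summarised above is sketched below. The sizes, hyperparameter values and random seed are illustrative assumptions; variable names follow the notation in the bullets.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (assumptions for this sketch).
N, R, K, L = 200, 16, 10, 2     # instances, types, crowd annotators, experts

# Hyperparameters: nu for the type frequencies, alpha/beta rows for the
# confusion vectors of crowd and expert annotators respectively.
nu = np.ones(R)
alpha = np.ones((R, R)) + 1.0 * np.eye(R)   # weak prior, mild diagonal bias
beta = np.ones((R, R)) + 9.0 * np.eye(R)    # experts assumed more accurate a priori

# kappa ~ Dir(nu); t[i] ~ Cat(kappa)
kappa = rng.dirichlet(nu)
t = rng.choice(R, size=N, p=kappa)

# pi[i, j] ~ Dir(alpha[j]); delta[i, j] ~ Dir(beta[j])
pi = np.stack([[rng.dirichlet(alpha[j]) for j in range(R)] for _ in range(K)])
delta = np.stack([[rng.dirichlet(beta[j]) for j in range(R)] for _ in range(L)])

# c[i, j] ~ Cat(pi[i, t[j]]); d[i, j] ~ Cat(delta[i, t[j]])
c = np.array([[rng.choice(R, p=pi[i, t[j]]) for j in range(N)] for i in range(K)])
d = np.array([[rng.choice(R, p=delta[i, t[j]]) for j in range(N)] for i in range(L)])

print(c.shape, d.shape)   # (K, N) and (L, N) judgement matrices
```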

The posterior distribution for the data and parameters (from which the Gibbs samplers are derived)

is as follows:

p(𝝅, 𝜿, 𝜹, 𝒕, 𝒄, 𝒅 | 𝜶, 𝜷, 𝝂) = p(𝝅 | 𝜶) · p(𝜹 | 𝜷) · p(𝜿 | 𝝂) · ∏_{i=1..N} 𝜅[t[i]] · ∏_{i=1..N} ∏_{j=1..K} 𝜋[j, t[i], c[j, i]] · ∏_{i=1..N} ∏_{j=1..L} 𝛿[j, t[i], d[j, i]]

For the entity classification model, we used a set of 16 mutually-exclusive classifications (R=16)

(including ‘NONE’ and ‘DON’T KNOW / CAN’T TELL’), reflecting the entity schema in which only one

of the categories could be true. However, for the relationship classification model, for which the

schema left open the possibility that none or all of the (13 possible) relationships might be true of a

specific instance, we treated this effectively as 13 separate models, each of which involved classifiers

assigning ‘yes’ or ‘no’ to a particular classification (i.e. R=2). In all cases, we used very weak priors

that for crowd annotators corresponded to a prior expectation of around 50% accuracy, and for

expert annotators to a prior expectation of around 90% accuracy. However, the pseudocounts

describing these priors were very low and (apart from in the least well-represented categories) the

posterior distributions associated with the parameters were dominated by the characteristics of the

data rather than our prior assumptions.
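The hyperparameters described in the previous paragraph can be constructed mechanically. The sketch below shows one way of encoding a prior accuracy of roughly 50% for crowd annotators and 90% for experts with small pseudocounts; the exact pseudocount total is an assumption, not the value used in the project.

```python
import numpy as np

def accuracy_prior(R, prior_accuracy, total_pseudocount=2.0):
    """Dirichlet hyperparameters for the rows of a confusion matrix, placing
    `prior_accuracy` of a small total pseudocount on the correct category and
    spreading the remainder evenly over the other R-1 categories."""
    off_diag = total_pseudocount * (1.0 - prior_accuracy) / (R - 1)
    rows = np.full((R, R), off_diag)
    np.fill_diagonal(rows, total_pseudocount * prior_accuracy)
    return rows

R = 16
alpha = accuracy_prior(R, prior_accuracy=0.5)   # crowd: ~50% prior accuracy
beta = accuracy_prior(R, prior_accuracy=0.9)    # experts: ~90% prior accuracy

# With pseudocounts this small, even a handful of observed judgements for an
# annotator will dominate the posterior over their confusion vectors.
print(alpha[0, :3], beta[0, :3])
```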

B1.2 Gibbs Sampling

Gibbs sampling is a well-understood approach that is useful in cases in which a joint posterior distribution can be specified but cannot be sampled directly, provided the full conditional distributions of its variables are tractable. It is well covered in the statistical literature and we do

not expound on it here. We used the WinBUGS tool to perform the requisite analysis. The sheer

volume of computation required for the entity model meant that (as outlined in section 4.4.3 of the


main report) we had to restrict the data to only the top 15 annotators (accounting for 60% of all

judgements); with the relationship annotation data (due to the considerably simpler model and

smaller dataset) we were able to use all of it.

Visual inspection suggested that convergence for the key variables (in this case, the ‘true’

classifications t[i]) was achieved very quickly. In all cases, our reported confidence scores are the

means of the values assigned in each of 900 samples (following a 100-sample burn-in period) to the

probability that the expert annotators’ classification is correct.
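Although the project used WinBUGS, the full conditionals of this model are simple enough that a direct Gibbs sampler can be sketched compactly. The code below is a minimal illustration under stated assumptions, not the project's implementation: it reuses the notation from the model specification, assumes the judgement matrices c and d mark unseen instances with -1, and returns per-instance confidence scores as the post-burn-in mean agreement with the experts' classification (analogous to the 900-sample means described above).

```python
import numpy as np

def gibbs_ibcc_m(c, d, R, nu, alpha, beta, expert_label,
                 n_samples=900, burn_in=100, seed=0):
    """Minimal Gibbs sampler sketch for an IBCC-M style model (illustration only).
    c: (K, N) crowd judgements, d: (L, N) expert judgements, with -1 = not seen.
    expert_label: (N,) experts' agreed classification for each instance.
    Returns per-instance posterior probability that expert_label is correct."""
    rng = np.random.default_rng(seed)
    K, N = c.shape
    L = d.shape[0]

    # Initialise the latent types at the experts' labels; draw parameters from priors.
    t = expert_label.astype(int).copy()
    kappa = rng.dirichlet(nu)
    pi = np.stack([[rng.dirichlet(alpha[r]) for r in range(R)] for _ in range(K)])
    delta = np.stack([[rng.dirichlet(beta[r]) for r in range(R)] for _ in range(L)])

    correct = np.zeros(N)
    for s in range(burn_in + n_samples):
        # 1. Sample each instance type from its full conditional, proportional to
        #    kappa[t] times the likelihood of every judgement made on the instance.
        for i in range(N):
            logp = np.log(kappa)
            for j in range(K):
                if c[j, i] >= 0:
                    logp = logp + np.log(pi[j, :, c[j, i]])
            for k in range(L):
                if d[k, i] >= 0:
                    logp = logp + np.log(delta[k, :, d[k, i]])
            p = np.exp(logp - logp.max())
            t[i] = rng.choice(R, p=p / p.sum())

        # 2. Sample the type frequencies from Dir(nu + type counts).
        kappa = rng.dirichlet(nu + np.bincount(t, minlength=R))

        # 3. Sample each confusion vector from its Dirichlet full conditional.
        for j in range(K):
            for r in range(R):
                seen = (c[j] >= 0) & (t == r)
                pi[j, r] = rng.dirichlet(alpha[r] + np.bincount(c[j, seen], minlength=R))
        for k in range(L):
            for r in range(R):
                seen = (d[k] >= 0) & (t == r)
                delta[k, r] = rng.dirichlet(beta[r] + np.bincount(d[k, seen], minlength=R))

        # 4. After burn-in, record agreement with the experts' classification.
        if s >= burn_in:
            correct += (t == expert_label)

    return correct / n_samples
```

Run with hyperparameters like those in the earlier sketch, this yields one confidence score per instance between 0 and 1.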