D4.5 - Community Evaluation
Methodology January 16, 2017
Deliverable Code: D4.5
Version: 1.0 – Final
Dissemination level: Public
This document contains a formal description of the evaluation methodology and key indicators, based on community-driven application scenarios (use cases from four thematic areas), with the aim of validating and assessing the OpenMinTeD framework through community scenarios.
H2020-EINFRA-2014-2015 / H2020-EINFRA-2014-2, Topic: EINFRA-1-2014 Managing, preserving and computing with big research data. Research & Innovation action, Grant Agreement 654021
Document Description: D4.5 – Community Driven Evaluation Methodology
WP4 – Community Driven Requirements and Evaluation
WP participating organizations: ARC, University of Manchester, UKP-TUDA, INRA, EMBL, LIBER, OU, EPFL, CNIO, USFD, GESIS, GRNET, Frontiers, UoS
Contractual Delivery Date: 11/2016 Actual Delivery Date: 01/2017
Nature: Report Version: 1.0 (Final)
Public Deliverable / Confidential Deliverable, only for members of the consortium (including the Commission Services)
Preparation slip
From: Miguel Madrid, Martin Krallinger, Nicole Doelker (CNIO), 02/12/2016
Edited by: Martin Krallinger, Nicole Doelker (CNIO), 16/01/2017
Reviewed by: Matt Shardlow (UNIMAN), 19/12/2016; Piotr Przybyła (UNIMAN), 20/12/2016
Approved by: Natalia Manola (ARC), 17/01/2017
For delivery: Mike Hatzopoulos (ARC), 17/01/2017
Document change record
V0.1: Draft version (initial version), Miguel Madrid (CNIO)
V1.0: First version (final version), Martin Krallinger (CNIO)
Table of Contents
1. Introduction
2. General evaluation strategy description
3. Community use case component evaluation scenarios
   3.1 Theme: Agriculture / Biodiversity
      3.1.1 Use case: AGRIS: Automatic extraction of topics, location and figures from publications [AS-A]
      3.1.2 Use case: Food Safety: Automatic extraction of location information from publications [AS-B]
      3.1.3 Use case: Microbial biodiversity [AS-C]
      3.1.4 Use case: Linking Wheat data with literature [AS-D]
      3.1.5 Use case: Information Extraction of mechanisms involved in plant development [AS-E]
   3.2 Theme: Life Sciences
      3.2.1 Use case: Extract metabolites and their properties and modes of action [LS-A]
      3.2.2 Use case: Neuroscience [LS-B]
   3.3 Theme: Social Sciences
      3.3.1 Use case: Facilitation of complex information linking and retrieval from social sciences publications [SS-A]
   3.4 Theme: Scholarly Communication
      3.4.1 Use case: Research Analytics [SC-A]
   3.5 Community use case evaluation scenario actors
   3.6 Evaluation phase I: use-case component centric
      3.6.1 Criteria of assessment
   3.7 Evaluation phase II: use-case workflow centric
      3.7.1 General criteria of assessment
4. References
5. Appendix
   5.1 Questionnaire for evaluation phase I: use-case component centric
   5.2 Evaluation phase II: use-case workflow centric
      5.2.1 Registry service scenario
      5.2.2 Workflow service scenario
      5.2.3 Annotation service scenario
Table of Figures
Figure 1. Evaluation flowchart cycle.
Figure 2. OpenMinTeD Gold Standard dataset characterization aspects.
Figure 3. OpenMinTeD evaluation framework for component assessment (evaluation phase I).
Figure 4. Workflow service architecture (from OpenMinTeD Webinar 2).
Figure 5. OpenMinTeD framework, general evaluation levels.
Disclaimer
This document contains a description of the OpenMinTeD project findings, work and products. Certain parts of it might be subject to partner Intellectual Property Right (IPR) rules, so please contact the consortium head for approval before using its content.
If you believe that this document in any way harms IPR held by you as a person or as a representative of an entity, please notify us immediately.
The authors of this document have taken every available measure to ensure that its content is accurate, consistent and lawful. However, neither the project consortium as a whole nor the individual partners that implicitly or explicitly participated in the creation and publication of this document accept any responsibility for consequences that might arise from the use of its content.
This publication has been produced with the assistance of the European Union. The content of this publication is the sole responsibility of the OpenMinTeD consortium and can in no way be taken to reflect the views of the European Union.
The European Union is established in accordance with the Treaty on European Union (Maastricht). There are currently 28 Member States of the Union. It is based on the European Communities and the member states cooperation in the fields of Common Foreign and Security Policy and Justice and Home Affairs. The five main institutions of the European Union are the European Parliament, the Council of Ministers, the European Commission, the Court of Justice and the Court of Auditors. (http://europa.eu.int/)
OpenMinTeD is a project funded by the European Union (Grant Agreement No 654021).
Acronyms
TDM: Text and Data Mining
NLP: Natural Language Processing
OMTD: OpenMinTeD
NER: Named Entity Recognition
REST: Representational State Transfer
Publishable Summary
This document is the fifth deliverable of Work Package 4, titled “Community Driven Requirements and Evaluation”, and aims to provide an overview of the TDM-relevant requirements of the research communities that have been identified as potential end users of the OpenMinTeD project services. The first step in the process of the OpenMinTeD project developing its text- and data-mining-powered services, as well as its services to end users and e-infrastructure, was the identification of the actual needs of the communities that currently require these services.
The aim of the OpenMinTeD project is to provide an interoperability framework, which serves as a core for existing tools and resources. The technical evaluation of requirements is critical for the integration and interoperability of all abstraction levels of the framework.
Only rigorous evaluation can guarantee the reliability and usability of such a framework and therefore make it a solid tool for all user communities.
OpenMinTeD is born from the need to create and consolidate a European framework of culture, support, promotion, and training for data and text mining. To this end, it is necessary to create, host and encourage the numerous communities involved in this interdisciplinary field (suppliers, scientists, end users, etc.) through workshops, open tenders, and European and international liaisons for developing new services for the benefit of all members.
Therefore, the applied evaluation strategy necessarily needs to be Community Driven, i.e. based on the use of community driven applications on the OMTD platform.
This deliverable contains:
● An introduction
● General evaluation strategy
● Community use case derived evaluation
● Appendix (example questionnaires for evaluation)
1. Introduction
The OpenMinTeD project aims to engage different thematic use case communities, namely Life Sciences, Agriculture/Biodiversity, Social Sciences and Scholarly Communication, in order to define real-life application scenarios to be addressed through the OpenMinTeD infrastructure. These community use cases serve as guiding examples for the implementation and evaluation of the resulting text mining workflows. A general examination of the formal aspects underlying the OpenMinTeD community use cases serves as a basis for defining the key performance indicators to be considered as evaluation criteria covered by the OpenMinTeD evaluation framework. Key performance indicators refer to standard evaluation measures in the case of the use case Gold Standard data, and to structured survey outcomes in the case of the evaluation phase covering the OpenMinTeD infrastructure.
This document provides a comparative analysis of the key performance indicators derived from the community use case evaluation scenario definitions, additionally taking into account aspects related to component interoperability as well as technical aspects underlying the use case derived text mining workflows. Based on this comparative examination, the key evaluation indicators implemented in the OpenMinTeD evaluation framework are classified according to three levels of practical relevance: mandatory, important and optional evaluation indicators. Mandatory evaluation indicators refer to measurements / validation outcomes that must be accomplished in order to fulfil the OpenMinTeD evaluation criteria.
In order to facilitate a standardized assessment of some of the key evaluation indicators, an open community challenge (shared task) focusing on the technical interoperability and performance of specific components of the use case text mining workflows will be carried out. Moreover, for the design of the evaluation indicators, profiling of the actual users (actors) of the text mining workflows derived from the community use cases was carried out. An evaluation indicator survey, in the form of a structured questionnaire covering quantitative, technical and qualitative key performance measures, served to prioritize the main aspects covered by the evaluation setting considered by the OpenMinTeD evaluation framework.
D4.5 reports on the evaluation strategies for the framework services through the usage and results of the domain-specific applications that implement the concrete scenarios. The main goal of the OpenMinTeD evaluation framework is to examine the technical and functional capabilities of the framework services and workflows, as well as the integration and creation of the resources and components needed to generate results from text mining workflows, following the design of the community use case applications.
The content of this deliverable is complementary to D4.3 (OpenMinTeD Functional Specifications), D4.4 (Community Evaluation Scenarios Definition Report), D4.1 (Requirements Methodology), which defines the actual community use cases, profiles users and builds the questionnaires needed to capture the user requirements, and D4.2 (Community Requirements Analysis Report), which covers all use cases in detail, together with the actors and workflows used to extract the key performance indicators for community evaluation. This deliverable also introduces the evaluation methodology and the questionnaires that formalize this approach.
2. General evaluation strategy description
The OpenMinTeD evaluation process is structured into two evaluation cycles (or phases) that will be supported by the development of the OpenMinTeD evaluation framework (a multi-step evaluation strategy). During each evaluation phase, a set of validation scenarios will be defined to address the heterogeneity and particularities of (1) the various community use cases on one side, and (2) the end users or OpenMinTeD framework actors on the other side. For each use case and framework user, a set of evaluation criteria, derived from key indicators extracted from the community use cases, will be captured and analyzed. CNIO partners will be in charge of collecting and characterizing the minimal set of evaluation criteria.
The goal of the evaluation process is to determine to what extent the OpenMinTeD framework supports a given expected evaluation indicator or criterion, and to build a use case centered evaluation infrastructure, the OpenMinTeD evaluation framework. The evaluation indicators include metrics that assess performance as well as annotation quality at the level of the individual components required to extract the types of information associated with each community use case. These indicators also cover adequacy, quality and usability assessment at the level of text mining workflows and their support by the OpenMinTeD framework through guided use, testing, reporting, cognitive walkthroughs¹ and structured questionnaires. Figure 1 provides a general overview of the evaluation cycle, starting from descriptive community use cases to the actual evaluation of the OpenMinTeD framework through use case text mining workflows.
Figure 1. Evaluation flowchart cycle.
1 In Cognitive Walkthrough, the usability of the system is evaluated by a group of inspectors, who work through a series of tasks and ask a set of questions from the perspective of the user. Its main aim is to assess ease of understanding and learning.
3. Community use case component evaluation scenarios Each of the nine OpenMinTeD community use cases is characterized by very particular user demands in terms of the underlying document collections of interest, named entity types, relation types or ontologies and databases used for data integration and entity grounding.
In order to be able to carry out a formal component-level evaluation and to perform a comparative performance analysis across the various use case scenarios, several critical key steps have been identified. Steps 1–3 are prior formal requirements relevant for the evaluation process. The key steps consist of:
1. Gathering, revision and refinement of descriptive use cases.
2. Transformation of descriptive use cases into formal annotation workflows.
3. Design of text mining pipelines for use case workflows.
4. Selection of key component tasks for evaluation purposes.
5. Definition of the evaluation setting: Gold Standard corpora (intrinsic/extrinsic, guidelines, metrics).
6. Accessibility and integration of component applications into the OpenMinTeD framework.
Although different formal evaluation scenarios are needed to address text mining demands associated with the types of users and tasks defined in the individual community use cases, the OpenMinTeD evaluation framework will also identify and capture upper level commonalities encountered across multiple evaluation scenarios.
The OpenMinTeD evaluation framework, coordinated by CNIO, will collect key attributes capturing the use case specific Gold Standard datasets used for the evaluation of the key component tasks of the annotation workflows. Standard evaluation settings shall be followed, avoiding any overlap between training and test sets and documenting the characteristics related to data selection. Annotation criteria will be carefully described to exclude evaluation biases. These descriptions will also include baselines wherever possible (if possible these will be quantitative baseline scores; otherwise they will consist of a formal description of the manual procedures necessary to obtain the same information). Baselines in this context are needed because, without them, it is hard or impossible to assess the generated results from a perspective outside each use case field.
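As a small illustration of the non-overlap requirement, the following Python sketch checks that no document identifier appears in both the training and test portions of a Gold Standard corpus. It is illustrative only; the file layout (one JSON object per line) and the identifier field name are assumptions, not an OpenMinTeD specification.

```python
import json

def check_no_overlap(train_path, test_path, id_field="document_id"):
    """Verify that the training and test Gold Standard files share no document IDs."""
    def load_ids(path):
        with open(path, encoding="utf-8") as f:
            # Assumes one JSON object per line, each carrying a document identifier.
            return {json.loads(line)[id_field] for line in f if line.strip()}

    overlap = load_ids(train_path) & load_ids(test_path)
    if overlap:
        raise ValueError(f"{len(overlap)} documents appear in both sets, e.g. {sorted(overlap)[:5]}")
    return True

# Example usage (hypothetical file names):
# check_no_overlap("gold_train.jsonl", "gold_test.jsonl")
```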
Based on the detailed examination of the use case descriptions and the “Community Evaluation Scenarios Definition Report” (D4.4), the following collection of evaluation scenarios at the component level has been identified:
Figure 2. OpenMinTeD Gold Standard dataset characterization aspects.
3.1 Theme: Agriculture / Biodiversity
3.1.1 Use case: AGRIS: Automatic extraction of topics, location and figures from publications [AS-A]
Extracting information about topics and locations, as well as from data types such as images or figures, for the agricultural sector.
● Potential component tasks: automatic recognition of mentions of domain-specific topics, locations, viticulture terms and targets, and water pathogens.
● Potential evaluation scenarios: extraction of AGROVOC thesaurus terms, using Agrotagger and Stanford NER. The output of these systems will be used as baselines, and compared against the Gold Standard datasets.
● Component-centric evaluation setting: development of Gold Standard evaluation corpora of viticulture documents and comparative evaluation using standard metrics: precision (mandatory), recall (important), F-score (important); the formulas for these metrics are given after this list.
● Integration validation: REST-based TDM web service.
● Evaluation data provided by Agroknow, which will also carry out the evaluation.
● Annotation description: for each Gold Standard dataset, a short structured annotation guideline document and characterization description will be prepared and revised by the use case experts. This document will cover (as completely as possible) the key aspects highlighted in figure 2.
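For reference, the standard metrics named above (and in the analogous evaluation settings of the remaining use cases) are computed from the counts of true positives (TP), false positives (FP) and false negatives (FN) obtained by comparing system annotations against the Gold Standard; the F-score is typically reported as the balanced F1, i.e. the harmonic mean of precision and recall:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_{1} = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$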
3.1.2 Use case: Food Safety: Automatic extraction of location information from publications [AS-B]
Locating foodborne outbreaks and food alerts from food safety publications.
● Potential component tasks: automatic recognition of mentions of food, foodborne outbreaks, food alerts, food safety issues, food recalls and geographic information (e.g. countries), and extraction of complex events/relationships between geographic locations and foodborne disease outbreaks.
● Component-centric evaluation setting: development of Gold Standard evaluation corpora and comparative evaluation using standard metrics: precision (mandatory), recall (important), F-score (important).
● Integration validation: REST-based TDM web service.
● Evaluation data provided by Agroknow, which will also carry out the evaluation.
● Annotation guidelines: for each Gold Standard dataset, a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts. This document will cover (as completely as possible) the key aspects highlighted in figure 2.
3.1.3 Use case: Microbial biodiversity [AS-C]
Relations between microorganisms, habitats and some properties of the microorganisms, normalized using metagenomics knowledge bases.
● Potential component tasks: automatic recognition of mentions of food, microorganisms, habitats, microorganism phenotypes and microorganism-derived chemical entities; entity grounding (ontology linking/grounding of these entities, e.g. NCBI Taxonomy (microorganisms), InChI keys (chemicals), OntoBiotope (habitats)). Relation types: ‘[MICROORGANISM] lives in [HABITAT]’ and ‘[MICROORGANISM] produces [CHEMICAL]’ (an illustrative sketch of such a gold relation annotation follows this list).
● Component-centric evaluation setting: development of Gold Standard evaluation corpora and comparative evaluation using standard metrics: precision (mandatory), recall (important), F-score (important).
● Integration validation: REST-based TDM web service.
● Evaluation data provided by Agroknow/INRA (BioNLP-ST BB task corpus), which will also carry out the evaluation.
● Annotation guidelines: for each Gold Standard dataset, a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts. This document will cover (as completely as possible) the key aspects highlighted in figure 2.
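As a purely illustrative sketch (the field names, character offsets and grounding identifiers are invented for illustration and do not correspond to the BioNLP-ST BB or OpenMinTeD formats), a gold relation annotation for this use case could represent entity spans grounded to their reference resources together with typed relations between them:

```python
# Hypothetical gold annotation for one sentence of the microbial biodiversity use case.
gold_annotation = {
    "text": "Lactobacillus plantarum is commonly found in fermented vegetables.",
    "entities": [
        {"id": "T1", "type": "Microorganism", "span": [0, 23],
         "grounding": {"resource": "NCBI Taxonomy", "id": "1590"}},       # illustrative identifier
        {"id": "T2", "type": "Habitat", "span": [45, 65],
         "grounding": {"resource": "OntoBiotope", "id": "OBT:XXXXXX"}},   # placeholder identifier
    ],
    "relations": [
        {"type": "lives_in", "microorganism": "T1", "habitat": "T2"},
    ],
}
```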
3.1.4 Use case: Linking Wheat data with literature [AS-D]
Phenotype-related information on plants extracted from scientific articles.
● Potential component tasks: automatic recognition of mentions of plants (taxon names), gene/protein mentions, plant genetics, plant phenotypes, gene markers, entity grounding (ontology linking of these entities).
● Component-centric evaluation setting: development of Gold Standard evaluation corpora and comparative evaluation using standard metrics: precision (mandatory), recall (important), F-score (important).
● Integration validation: REST-based TDM web service.
● Evaluation data for particular key tasks will be provided/revised by INRA.
● Annotation guidelines: for each Gold Standard dataset, a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts. This document will cover (as completely as possible) the key aspects highlighted in figure 2.
3.1.5 Use case: Information Extraction of mechanisms involved in plant development [AS-E]
Support for plant breeding, plant reproduction and seed development research. The extracted information is complementary to existing databases.
● Potential component tasks: automatic recognition of mentions of plants (taxa), plant reproduction/anatomy terms, plant seed development (developmental stage) terms and gene/protein mentions; relation extraction of associations between plants, genes and seed development/anatomy terms, and of protein–promoter binding relations; entity and concept grounding (ontology linking of these entities).
● Component-centric evaluation setting: development of Gold Standard evaluation corpora and comparative evaluation using standard metrics: precision (mandatory), recall (important), F-score (important).
● Integration validation: explore use of a REST-based TDM web service.
● Evaluation data: adaptation/exploitation of existing Gold Standard datasets from the BioNLP-ST 2016 SeeDev task, plus additional training data generated by consortium partners for entity normalization and ontology-grounded terms.
● Annotation guidelines: for each Gold Standard dataset, a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts. This document will cover (as completely as possible) the key aspects highlighted in figure 2.
3.2 Theme: Life Sciences
3.2.1 Use case: Extract metabolites and their properties and modes of action [LS-A]
Help curators of metabolism databases by providing information tuples about metabolites from the published literature.
● Potential component tasks: literature triage (classification and ranking of curation-relevant articles and/or text passages); automatic recognition of mentions of chemical names, chemical classes, chemical structures, organisms/species, organism parts (taxon names), biological activities and biological targets; entity grounding (ontology linking of these entities, e.g. ChEBI, NCBI Taxonomy).
● Component-centric evaluation setting: development of Gold Standard evaluation corpora and comparative evaluation using standard metrics: precision (mandatory), recall (important), F-score (important).
● Integration validation: explore use of a REST-based TDM web service.
● Evaluation data: adaptation/exploitation of existing Gold Standard datasets from BioCreative, the BioNLP-STs, the LINNAEUS corpus and others.
● Annotation guidelines: for each Gold Standard dataset, a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts. This document will cover (as completely as possible) the key aspects highlighted in figure 2.
3.2.2 Use case: Neuroscience [LS-B]
A systematic way to curate the literature relevant to neuroscientific data.
● Potential component tasks: automatic detection of figure legends and table legends (optional), recognition of modeling parameters in text, recognition of experimental values, neuroscience keyword extraction, detection of cell types (e.g. neurons, by NeuroNER), detection of brain regions, detection of synapses, detection of species (taxa), detection of genes/proteins, detection of chemical substances of relevance for neuroscience, and entity grounding (ontology linking of these entities).
● Component-centric evaluation setting: development of Gold Standard evaluation corpora and comparative evaluation using standard metrics: precision (mandatory), recall (important), F-score (important).
● Integration validation: explore use of a REST-based TDM web service.
● Evaluation data: adaptation/exploitation of existing Gold Standard datasets from the GENIA and LINNAEUS corpora, plus preparation of additional Gold Standard data by consortium partners.
● Annotation guidelines: for each Gold Standard dataset, a short structured annotation guideline document and characterization description will be prepared/revised by the use case experts. This document will cover (as completely as possible) the key aspects highlighted in figure 2.
3.3 Theme: Social Sciences
3.3.1 Use case: Facilitation of complex information linking and retrieval from social sciences publications [SS-A]
Automatic detection, disambiguation and linking of entities in Social Science text corpora to enhance indexing and searching.
● Potential component tasks: automatic recognition of named entity mentions and disambiguation (overall 12 subtypes and 4 coarse types), keyword assignment, variable mention detection, entity grounding (ontology linking of these entities).
● Component-centric evaluation setting: development of Gold Standard evaluation corpora and comparative evaluation using standard metrics: precision (mandatory), recall (important), F-score (important). Complementary evaluation measurements can be added; however, as these are the most commonly used metrics and can be applied across use cases, they will be the primary evaluation metrics.
● Integration validation: explore use of a REST-based TDM web service.
● Evaluation data: the use case leads (GESIS) will be in charge of compiling/constructing suitable Gold Standard corpora/datasets.
● Annotation guidelines: for each Gold Standard dataset, a short structured annotation guideline document and characterization description will be prepared/revised by the use case leads. This document will cover (as completely as possible) the key aspects highlighted in figure 2.
3.4 Theme: Scholarly Communication
3.4.1 Use case: Research Analytics [SC-A]
An innovative framework for entity extraction from publications and automated, extensible multidimensional analysis of scholarly content.
● Potential component tasks: automatic recognition and disambiguation of entity mentions, including projects/project codes, funders/funding information and Protein Data Bank (PDB) codes; extraction of patent citations; topic (thematic information) identification; research trend identification; automatic recognition of stop words for specific domains (pre-processing task).
● Evaluation data: the use case leads, in collaboration with OpenAIRE data engineers, the marketing department and OMTD end users, will be responsible for the creation of the evaluation data (annotation guidelines and component-centric evaluation datasets).
Overall, three types of workflow component services can be distinguished across the different use cases: conventional components, adaptation of general components to use case requirements and development of new specialized components.
3.5 Community use case evaluation scenario actors
There are several actors (platform user validators) identified from the use cases introduced in D4.2. They can be roughly classified into developers of text mining systems, consumers of text mining systems, generators of text annotations and Gold Standard datasets, and validators. A more detailed list of evaluation actors comprises:
Content providers (publishers): provide, integrate, support and maintain the access to the documents.
Database curators: carry out literature curation and functional annotations.
Text, NLP & data mining (TDM) experts: high level of knowledge about text mining and the framework.
Databases, applications and services developers (Computer scientists, bioinformaticians, other computing fields): provide hardware and software services for creating and improving the access and the different uses of the framework.
Legal experts: legal advice about copyright topics, international intellectual property and laws.
Researchers (experts in specific and different sciences): improve the annotations using text mining and provide feedback by evaluating the framework. Basic to medium level of knowledge about text mining and the framework.
End users (final users, industry, business): no knowledge of or experience with text mining and the framework is required.
The community evaluation scenario requires the construction, definition, release and iterative refinement (validation cycles and procedures) of material to support the training of validators (both general and actor-specific versions of the training material). Each use case should have at least one validator, with the aim of having at least two validators for the more mature use case scenarios. Note that, as some use cases overlap in terms of the required expertise, validators may take on multiple roles, i.e. work on multiple scenarios. This will enable validators to acquire the minimal skills and knowledge needed to execute and evaluate the tasks underlying each validation scenario.
OpenMinTeD validator training material will include tutorials, videos and written documentation specifying the task aim, example cases as well as the description of the evaluation task procedure.
The OpenMinTeD evaluation framework is organized into two evaluation phases to cover different important characteristics related to platform validation scenarios. During both phases, the OpenMinTeD evaluation framework comprises task description material, validation forms and evaluation templates to capture validators’ answers and requests referring to validation criteria.
For each community evaluation scenario, a validator recruitment strategy will be provided to document the selection process and the assignment of evaluation actors to particular evaluation tasks.
Moreover, evaluation actors or validators are classified into internal and external validators and platform users. Internal validators are members of the OpenMinTeD consortium, while external validators are users of the platform outside the consortium. During evaluation phase I only internal validators will be recruited; during the second evaluation phase internal validators will be engaged first, and in a final stage a selected target group of five external validators will be enrolled to validate the platform.
3.6 Evaluation phase I: use-case component centric
The first evaluation phase will focus on the assessment of the single critical components for the specific use case tasks. A minimal evaluation covering aspects related to the performance, access and integration of those components is essential in order to be able to run the text mining workflows associated with a given use case. Therefore, for each community use case we will define a minimal evaluation form capturing key aspects of the evaluation setting, such as the Gold Standard datasets/corpora, a structured definition of the component tasks, the formal annotation workflow, the corresponding text mining workflow definition, workflow resource characteristics, test frame, test procedure, test results, absolute evaluation and comparative evaluation.
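As a minimal sketch of how such an evaluation form could be captured as a structured record (the field set simply mirrors the aspects listed above; the class and field names are illustrative, not a prescribed OpenMinTeD schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ComponentEvaluationForm:
    """Minimal evaluation form for one use-case component (evaluation phase I)."""
    use_case: str                      # e.g. "AS-C Microbial biodiversity" (illustrative)
    component_task: str                # structured definition of the component task
    gold_standard_datasets: List[str]  # Gold Standard corpora used for the evaluation
    annotation_workflow: str           # reference to the formal annotation workflow
    text_mining_workflow: str          # corresponding text mining workflow definition
    workflow_resources: List[str]      # workflow resource characteristics
    test_frame: str = ""
    test_procedure: str = ""
    test_results: str = ""
    absolute_evaluation: Dict[str, float] = field(default_factory=dict)     # e.g. {"precision": 0.91}
    comparative_evaluation: Dict[str, float] = field(default_factory=dict)  # deltas against baselines
```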
The evaluation scenarios of the OpenMinTeD component-centric assessment will include:
● For each use case, we will engage domain experts to revise the selection, annotation and documentation of the corresponding gold standard datasets.
● Key component resources will be tested against the gold standard datasets to determine their respective performance.
● Functionality and usability of a selected number of specific components will be assessed by expert text mining researchers in community shared tasks.
● The interoperability of tasks and their integration into the OMTD framework will be addressed separately from the functional evaluation.
The procedure for the design of the use case specific evaluation scenarios will be developed and tested on the basis of one selected representative use case that fulfills the following criteria:
● It is well defined.
● It requires no (or few) third-party components.
● It contains tasks and components that are of interest for other use cases in more than one domain.
Figure 3. OpenMinTeD evaluation framework for component assessment (evaluation phase I).
3.6.1 Criteria of assessment
● For each use case, select the use case descriptions and generate an initial formal definition of a common component task template, then iteratively refine the template based on feedback from the use case leads to produce the final template.
● For each use case, generate a diagram representing the design of the formal text mining workflows.
● For each use case text mining workflow, identify, implement and integrate the workflow components and validate their compliance with the OMTD functional specifications.
● For each use case, identify the required evaluation resources, in terms of existing resources and resources to be implemented, the definition of manual/automated Gold Standard creation steps, and the characterization of the needed lexical resources.
● Evaluate individual components with a focus on (1) quality, (2) technical and (3) functional aspects. This will be associated with the generation of Gold Standard data, the implementation of the technical assessment infrastructure, functional specification compliance checks, and the analysis of standards for registration / import to OMTD.
For those component-level tasks that correspond to basic named entity recognition (and grounding) and document concept indexing modules, a technical infrastructure in the form of the OpenMinTeD interoperability and performance evaluation server (named Becalm) has been implemented (figure 3). This service offers the possibility to carry out a technical validation in terms of web service monitoring, component accessibility validation and the evaluation of NER and concept indexing systems at three different levels:
1. Data level: mapping into common formats (BioC XML, JSON, BioC JSON, BeCalm TSV, PubAnnotation).
2. Technical level: stability, status, response time, batch processing, component time slot.
3. Functional specification level: metadata requirements/guidelines and annotation of services through predefined metadata types and controlled vocabularies.
These three levels are important for the integration of components into the OpenMinTeD text mining workflows. Although such systems are initially accessed and tested through simple REST (Representational State Transfer) API applications, we foresee the possibility to convert such services into other, more powerful execution schemas, system distribution types and service packaging to empower the scalability of these services. Additionally, through a community challenge evaluation effort (TIPS: Technical Interoperability and Performance of annotation Servers task), the Becalm server will evaluate third-party NER components of relevance for various use case scenarios from the thematic areas of agriculture and life sciences.
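To illustrate the kind of REST interaction involved, the following sketch sends one document to an annotation web service and reports status, latency and the number of returned annotations. The endpoint URL, request fields and response structure are hypothetical placeholders, not the actual Becalm or OpenMinTeD API.

```python
import time
import requests

def check_annotation_service(endpoint, text):
    """Send one document to a REST annotation service and report status and latency."""
    payload = {"text": text, "format": "pubannotation"}  # hypothetical request fields
    start = time.perf_counter()
    response = requests.post(endpoint, json=payload, timeout=30)
    latency = time.perf_counter() - start
    response.raise_for_status()
    result = response.json()  # assumed: annotations returned in a PubAnnotation-like JSON structure
    return {"status": response.status_code,
            "latency_seconds": round(latency, 3),
            "n_annotations": len(result.get("denotations", []))}

# Example usage (hypothetical endpoint):
# print(check_annotation_service("https://example.org/ner/annotate",
#                                "Lactobacillus plantarum was isolated from fermented cabbage."))
```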
3.7 Evaluation phase II: use-case workflow centric
OpenMinTeD will offer a suite of services to make the infrastructure components visible and accessible to all:
Registry Service: validate the support for discovering tools & services, scientific content and language resources within the OMTD registry service. A full documentation and registration mechanism for all types of resources available in the infrastructure, implementing widely agreed and used metadata models and covering all facets of documentation, from persistent identification and versioning, technical specifications and software dependencies, to rights of use, location and deployment instructions.
Workflow Service: provides users with the ability to mix and match text-mining services in workflows, i.e. by using the best-of-breed text mining components for a given task. Its implementation and use delves into component-level interoperability.
DSL: the OpenMinTeD scripting language is a Domain Specific Language that substitutes for the workflow editor.
Engine: the DSL interpreter; a minimal engine that executes components one by one in sequence and substitutes for the workflow service.
Adapters: abstract away from different architectures, component life cycle models, and data models.
Data converters: lazily convert data between components as necessary.
Components: minimal execution units of the workflow (adapters, data converters, ...).
Catalog: a very simple JSON-based component catalog that substitutes for the registry service.
Repository: components packaged as Maven artifacts and stored in a Maven repository (currently GATE and DKPro Core). They are obtained automatically when used in a workflow and substitute, to some extent, for cloud-oriented deployment.
Remote service: web services, mostly third-party NLP tools (figure 4).
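To make the execution model sketched above more concrete, here is a minimal, hypothetical Python sketch of an engine that runs components one after another and applies data converters lazily between them. It is not the actual OpenMinTeD DSL or engine; all class, function and format names are illustrative.

```python
class Component:
    """Minimal execution unit: wraps a processing function behind a uniform, adapter-like interface."""
    def __init__(self, name, func, expects, produces):
        self.name, self.func = name, func
        self.expects, self.produces = expects, produces  # declared input/output data formats

def run_workflow(components, data, data_format, converters):
    """Execute components one by one, converting data between formats lazily when needed."""
    for component in components:
        if data_format != component.expects:
            # Lazy data conversion: only invoked when the next component needs another format.
            data = converters[(data_format, component.expects)](data)
            data_format = component.expects
        data = component.func(data)
        data_format = component.produces
    return data

# Example usage with illustrative components:
tokenizer = Component("tokenizer", lambda text: text.split(), expects="text", produces="tokens")
ner_stub = Component("ner-stub", lambda toks: [t for t in toks if t[:1].isupper()],
                     expects="tokens", produces="entities")
converters = {("text", "tokens"): str.split}  # used only if a format gap appears between components
print(run_workflow([tokenizer, ner_stub], "Wheat is grown in France", "text", converters))
```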
Annotation Service: implements the interoperability specifications for annotations, providing results in common representations and protocols, including quantitative indications of the quality of the automatic (or manual) processing.
These scenarios for the OpenMinTeD framework services (registry, workflow, annotation) will be used by each use case to assess each of the services, following an approach to the evaluation of workflow services and frameworks that has been used successfully in projects of similar characteristics. We will provide usage instructions to the various types of users/communities, according to which they will test each of the framework services independently. Their experiences will then be collected by means of a structured survey with four evaluation levels:
Full - fully compliant
Partial - partially compliant. E.g. some parts of a product are compliant but not all. This is typically the case if a product is in a state of transition from a non-compliant to a compliant state.
No - not compliant.
N/A - not applicable. This is expected to occur mainly for concrete requirements if a certain requirement is not applicable for a certain implementation, e.g. a requirement on remote API access on a tool, which does not offer a remote API.
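One simple, purely illustrative way to aggregate such survey answers into a per-service compliance score (not a prescribed OpenMinTeD scoring scheme) is to map the levels to numeric values and exclude non-applicable items:

```python
LEVEL_SCORES = {"Full": 1.0, "Partial": 0.5, "No": 0.0}  # "N/A" answers are excluded

def compliance_score(answers):
    """Average compliance over all applicable survey answers, or None if none apply."""
    applicable = [LEVEL_SCORES[a] for a in answers if a != "N/A"]
    return sum(applicable) / len(applicable) if applicable else None

# Example: hypothetical answers to the registry service questions RS-1 ... RS-7
print(compliance_score(["Full", "Partial", "Full", "N/A", "Full", "No", "Full"]))  # 0.75
```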
Figure 4. Workflow service architecture (from OpenMinTeD Webinar 2).
General validation instructions will be distributed to each end user of the evaluation framework in order to carry out the validation experiments. The instructions will include material such as tutorials and task definitions, with easy-to-follow task completion steps, validation walkthroughs and validation forms to enable the completion of the structured surveys.
The proposed evaluation strategy was inspired by input and feedback of WP5 as well as existing efforts and systems:
● U-‐Compare [3-‐4]: allows a user to easily configure a workflow of components. The user can easily exchange components in their workflow to see how exchanging pre-‐processing components can make a difference to their overall results
● Argo [5]: a multi-‐purpose workflow management system built upon the UIMA standard. Argo provides a user with a library of atomic components, each of which provides a specific form of annotation upon an input text. These annotations can be processed iteratively by combining multiple components into a workflow. Workflows can be evaluated within Argo through the use
of dedicated evaluation components, or by exporting annotations in a variety of formats for further processing.
● Becalm [1]: builds on the BioCreative MetaServer (BCMS), the first distributed prototype framework for requesting, retrieving and unifying biomedical textual annotations, developed under the umbrella of the BioCreative II protein–protein interaction task; since then, several BioCreative tasks have tried to promote the development of online text annotation servers.
● Panacea [2]: a STREP project under EU FP7 that has developed a factory of Language Resources (LRs) in the form of a production line automating all the steps involved in the acquisition, production, maintenance and updating of the LRs required by Machine Translation and other Language Technologies.
● WP5: this evaluation strategy can only be understood in the context of the Interoperability Framework of OpenMinTeD. Therefore, it requires constant input and feedback from this work package and from the future task 5.4, “Alignment of service and content provider systems”.
Workflow evaluation will focus on three points:
● Technical and functional evaluation: latency times, availability and query limits.
● Metadata and annotation database evaluation: quality and quantity of the annotations and metadata available.
● Workflow manager evaluation: interoperability of components, web services and formats.
3.7.1 General criteria of assessment
In order to capture evaluation aspects related to the usability of the OpenMinTeD platform, the evaluation framework must examine how efficiently the platform can:
● Solve generic tasks,
● Be customized,
● Be used in an interactive and collaborative manner.
Those general evaluation criteria should therefore enable the validation of four characteristics:
● Generic: evaluate to what degree the OpenMinTeD platform can handle general-purpose/domain tasks. Therefore, it will be examined whether it supports a variety of formats and open standards to allow the evaluation of different domains and use cases.
● Customisable: the evaluation framework should determine to what degree the OpenMinTeD platform enables modularization and the exposure of different types of pre- and post-processing components.
● Interactive: the evaluation framework should determine to what degree the OpenMinTeD platform supports user-interactive processing components and the inspection of / intervention in automatically created text mining pipelines that can support the evaluation of modules.
● Collaborative: the evaluation framework should determine to what degree the OpenMinTeD platform supports multi-user collaboration and simultaneous modifications of, e.g., annotations or components in the pipeline.
The individual components should also be evaluated in the context of different actionable workflows. These workflows should be checked for compliance with the use case demands and refined or adapted if required during the deployment process. The community should validate the workflows through a formal usability evaluation, inspired whenever possible by the functional specification criteria (figure 5).
Figure 5. OpenMinTeD framework, general evaluation levels.
4. References
[1] http://www.becalm.eu/
[2] http://www.panacea-lr.eu/
[3] Y. Kano, W. A. Baumgartner, L. McCrohon, S. Ananiadou, K. B. Cohen, L. Hunter, and J. Tsujii, “U-Compare: share and compare text mining tools with UIMA,” Bioinformatics, vol. 25, no. 15, pp. 1997–1998, 2009.
[4] Y. Kano, M. Miwa, K. B. Cohen, L. Hunter, S. Ananiadou, and J. Tsujii, “U-Compare: a modular NLP workflow construction and evaluation system,” IBM J. Res. Dev., vol. 55, no. 3, pp. 11:1–11:10, 2011.
[5] R. Rak, A. Rowley, W. Black, and S. Ananiadou, “Argo: an integrative, interactive, text mining-based workbench supporting curation,” Database, vol. 2012, Jan. 2012.
5. Appendix
The appendix comprises sample structured questionnaires from the OpenMinTeD evaluation framework, which should be completed by the particular actors defined earlier in this document. Training material will be provided to the actors before they fill in the questionnaires, to facilitate a correct evaluation. Each question has a unique reference.
5.1 Questionnaire for evaluation phase I: use-case component centric
Template to be filled in for the component evaluation
Questions
Kind of component
Use case
Document selection (corpora)
Annotation format
Annotation editor
Manual/Semi-automatic/Automatic annotation
Task
1. Did the component do its job correctly?
2. Does any mismatch exist between the component and its description (behavior, input, output...)?
3. Could the component be improved in some way?
Evaluation scenario
1. Can the component be used in the scenarios for which it was designed?
2. Could constraints exist for any situation (extreme or not)?
Gold standard
Describe the gold standard used for the evaluation of the component.
By default:
● Precision (mandatory)
● Recall (important)
● F-score (important)
Component access
1. Does the component have a web API? Benchmark its time latency.
2. Measure the time elapsed between the input and the result.
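A minimal sketch of how the requested latency benchmark could be carried out for a component exposing a web API (the endpoint and payload are hypothetical placeholders):

```python
import statistics
import time
import requests

def benchmark_latency(endpoint, payload, n_requests=10):
    """Call a component's web API repeatedly and report mean and maximum latency in seconds."""
    timings = []
    for _ in range(n_requests):
        start = time.perf_counter()
        requests.post(endpoint, json=payload, timeout=60).raise_for_status()
        timings.append(time.perf_counter() - start)
    return {"mean": statistics.mean(timings), "max": max(timings)}

# Example usage (hypothetical endpoint and payload):
# print(benchmark_latency("https://example.org/component/annotate", {"text": "sample input"}))
```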
5.2 Evaluation phase II: use-case workflow centric
The questionnaires are an integral part of evaluation phase II, with the aim of capturing key performance indicators of the OpenMinTeD framework services¹.
5.2.1 Registry service scenario
In this scenario the OpenMinTeD registry is evaluated. The actor has to connect to the registry, explore the different knowledge resources, choose several of them, and use them in a workflow.
Steps
1. Connect to the OpenMinTeD registry.
2. Select one component from the catalog.
3. Check the information of the component.
1 https://builds.openminted.eu/job/WP%205.2%20-%20Interoperability%20Specification/eu.openminted.interop$openminted-interoperability-spec/doclinks/1/openminted-interoperability-spec.html#_requirements
4. Repeat steps 2 to 3 for several components.
(Expert actors)
5. Create and add a new resource to the registry.
6. Use the new resource in a workflow.
7. Use another resource in that workflow.
Questions
Availability
1. RS-1: Are the resources of the registry well described (license, content format, e.g. XML, DOCX, version) and URL-accessible? [Requirements 32, 33, 39]. Full-Partial-No-N/A
2. RS-2: Are the resource categories well tagged? [Requirement 36]. Full-Partial-No-N/A
3. RS-3: Are resources downloadable, or are external web services and knowledge resources well identified, so as to ensure that the processed data does not leave a particular institution? [Requirement 38]. Full-Partial-No-N/A
Adaptability (expert actors)
1. RS-4: Is it possible to create and add new knowledge resources to the registry? Full-Partial-No-N/A
2. RS-5: Is it possible to adapt knowledge resources of the registry to be used by the workflow? Full-Partial-No-N/A
Interoperability (expert actors)
1. RS-6: Is it possible to aggregate information from different knowledge resources of the registry to be used by the workflow? Full-Partial-No-N/A
Security
1. RS-7: Are the registry resources only reachable by authenticated people and components? Full-Partial-No-N/A
5.2.2 Workflow service scenario
In this scenario an example workflow is evaluated. The actor has to create a new workflow, add components and produce an output.
Steps
(Expert actors)
1. Create a new workflow.
2. Add one resource from the registry.
3. Add one component.
4. Repeat step 3 as many times as needed.
5. Link all components and produce an output.
Questions
Generic
1. WS-1: Is the workflow described using a uniform language? [Requirement 18]. Full-Partial-No-N/A
2. WS-2: Are components downloadable, or are external web services and knowledge resources well identified, so as to ensure that the processed data does not leave a particular institution? [Requirement 28]. Full-Partial-No-N/A
Customisable
1. WS-3: Are the configuration and parameterization options of the components well identified and documented? [Requirement 21]. Full-Partial-No-N/A
Interactive
1. WS-4: Are the components of the workflow well described (license, functionality, input and output format, input and output language dependencies, citation information, version) and URL-accessible? [Requirements 4, 10, 12, 13, 14, 43, 45, 57]. Full-Partial-No-N/A
2. WS-5: Do components detail all their environmental requirements for execution? [Requirement 5]. Full-Partial-No-N/A
3. WS-6: Are the component categories well tagged? [Requirements 8, 40, 90]. Full-Partial-No-N/A
4. WS-7: Do components handle failures (connection loss, inconsistent state) gracefully? [Requirement 27]. Full-Partial-No-N/A
Collaborative
1. WS-8: Can the workflow be used as a component of another workflow? [Requirement 24]. Full-Partial-No-N/A
2. WS-9: Is it possible to create and add new components and use them inside the workflow? Full-Partial-No-N/A
5.2.3 Annotation service scenario
In this scenario the annotated data is evaluated. The actor has to review the annotated data from a workflow.
Steps
1. Open an annotated output.
2. Analyze the content and draw conclusions.
(Expert actors)
3. Use it in another workflow or tool.
Questions
Functionality
1. AS-1: Is it possible to determine the source of an annotation/assigned category? [Requirement 26]. Full-Partial-No-N/A
2. AS-2: Are all the entities correctly annotated? Full-Partial-No-N/A
3. AS-3: Can the annotations be tagged? Full-Partial-No-N/A
4. AS-4: Are the annotations correctly related? Full-Partial-No-N/A
Exportability
1. AS-5: Can other tools import the annotated data? Full-Partial-No-N/A
Reusability
1. AS-6: Can other workflows reuse the annotated data? Full-Partial-No-N/A