Structured affiliations extraction from scientific literature

D. Tkaczyk, B. Tarnawski and Ł. Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw

24 June 2015

Introduction

CERMINE system

[Figure: an article page with labelled regions: AUTHORS, AFFILIATIONS, EMAILS, ABSTRACT, KEYWORDS, AUTHOR, SOURCE, YEAR, VOLUME]

CERMINE analyses born-digital scientific articles and extracts:
- document metadata, e.g. title, authors, abstract, keywords, publication date, ...
- a list of parsed bibliographic references
- full text with section hierarchy

This presentation

The presentation focuses on the following tasks:
- extracting a list of authors from a paper
- extracting a list of affiliations from a paper
- establishing relations between extracted authors and affiliations
- detecting institution, address and country in affiliation strings

Motivation

CERMINE can be used to:

- extract high-quality metadata from large PDF collections when it is missing or fragmentary
- provide intelligent user interfaces for metadata acquisition

Requirements

The metadata extraction system should be:
- comprehensive,
- automatic,
- modular,
- open and widely available,
- easily applicable,
- flexible and able to adapt to new layouts,
- well tested.

Architecture and Implementation

The workflow

[Figure: the workflow. Stages: structure extraction, classification, splitting and association, affiliation parsing, XML record generation. Sample data shown: a PDF content stream; an author line "M.K.1, J.I.2, T.W"; indexed affiliations "1 University of", "2 Institute of"; a parsed affiliation "Institute of ...", "Warsaw, 027...", "Poland"; an XML record with <author>, <aff>, <inst>, <addr> and <country> elements]

Layout Extraction

1. Character extraction: iText library
2. Page segmentation: Docstrum
3. Reading order resolving: bottom-up, heuristic-based

Content Classification

- general classification (labels: metadata, references, body and other)
- metadata classification (labels: abstract, bib_info, type, title, affiliation, author, keywords, correspondence, dates and editor)
- SVM with 83 and 62 features: geometrical, lexical, sequential, formatting, heuristics
- the best SVM parameters were found automatically by maximizing mean F-score on a validation dataset
- classifiers are trained on 2,551 and 2,716 documents, respectively
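The parameter selection step can be sketched roughly as follows. This is a minimal illustration, not CERMINE's code: the parameter names `C` and `gamma` and the `evaluate` callback (which would train an SVM with the given parameters and return its mean F-score on the validation set) are assumptions made for the example.

```python
from itertools import product


def mean_f_score(true_labels, predicted_labels):
    """Average the per-label F-score over all labels present in the true data."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for label in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


def select_parameters(grid, evaluate):
    """Try every combination in `grid` and keep the best-scoring one.

    `grid` maps a parameter name to its candidate values; `evaluate` is a
    hypothetical callback returning the validation score for one combination.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice `evaluate` would call `mean_f_score` on the validation labels and the classifier's predictions; an exhaustive grid is only one possible search strategy.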

The output so far

TrueViz XML format:

- hierarchical structure containing: pages, zones, lines, words, characters
- all elements have bounding boxes
- reading order is given
- zones have labels

[XML excerpt showing nested <Page>, <Zone> (with <ZoneCorners>), <Line>, <Word> and <Character> elements]

Authors and Affiliations Extraction

When authors and affiliations are listed separately:
- authors are split based on a list of separators
- affiliation indexes are found using a list of index symbols and superscripts
- association is done by the detected indexes

When affiliations are already assigned to authors:
- the first line is assumed to be the author
- the email is found by a regexp
- the remaining part is treated as the affiliation
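Both cases can be sketched in a few lines. This is a simplified illustration on plain text, not CERMINE's implementation: the separator list, the index symbols and the email pattern are assumptions, each index is assumed to be a single character trailing the author name, and superscript detection (which needs layout information) is left out.

```python
import re

# Assumed separator list; the slides do not enumerate the real one.
AUTHOR_SEPARATORS = re.compile(r"\s*(?:,|;|&|\band\b)\s*")
# Assumed index symbols: digits and a few common footnote marks.
INDEX_SYMBOLS = "0123456789*†‡§"
# Deliberately loose email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_authors(author_line):
    """Split an author line into (name, [index, ...]) pairs."""
    authors = []
    for chunk in AUTHOR_SEPARATORS.split(author_line):
        chunk = chunk.strip()
        if not chunk:
            continue
        name = chunk.rstrip(INDEX_SYMBOLS)
        authors.append((name.strip(), list(chunk[len(name):])))
    return authors


def parse_affiliations(lines):
    """Map the leading index symbol of each line to the affiliation text."""
    affiliations = {}
    for line in lines:
        line = line.strip()
        if line and line[0] in INDEX_SYMBOLS:
            affiliations[line[0]] = line[1:].strip()
    return affiliations


def associate(author_line, affiliation_lines):
    """First case: link authors to affiliations through shared indexes."""
    affiliations = parse_affiliations(affiliation_lines)
    return {
        name: [affiliations[i] for i in indexes if i in affiliations]
        for name, indexes in split_authors(author_line)
    }


def split_author_block(block):
    """Second case: first line is the author, the email is found by a
    regexp, and whatever remains is treated as the affiliation."""
    lines = [line.strip() for line in block.splitlines() if line.strip()]
    author, rest = lines[0], " ".join(lines[1:])
    match = EMAIL.search(rest)
    email = match.group(0) if match else None
    if match:
        rest = rest[:match.start()] + rest[match.end():]
    affiliation = " ".join(rest.split()).strip(" ,;")
    return {"author": author, "email": email, "affiliation": affiliation}
```

With the sample strings from the workflow slide, `associate("M.K.1, J.I.2, T.W", ["1 University of", "2 Institute of"])` links the first two authors to one affiliation each and leaves `T.W` with an empty list.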

Affiliation Parsing

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5A blok D, 02-106 Warsaw, Poland

- affiliation parsing detects institution, address, country
- the implementation is based on a CRF token classifier with features:
  - the classified word itself,
  - whether the token is a number, an all-uppercase word, an all-lowercase word, a lowercase word that starts with an uppercase letter,
  - whether the token is contained in dictionaries of countries or of words commonly appearing in institutions or addresses,
  - the features of two preceding and two following tokens.
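The feature set above might be computed along these lines. This is a sketch under stated assumptions: the tokenizer, the three tiny dictionaries and the feature names are hypothetical, and the CRF itself is not shown; the resulting per-token dictionaries are the kind of input a CRF toolkit would consume.

```python
import re

# Tiny hypothetical dictionaries; real ones would be far larger.
COUNTRIES = {"poland", "germany", "france"}
INSTITUTION_WORDS = {"university", "institute", "centre", "department"}
ADDRESS_WORDS = {"ul", "street", "road", "avenue"}


def tokenize(affiliation):
    """Split into word/number tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", affiliation)


def token_features(token):
    """Features that depend on the token alone."""
    lower = token.lower()
    return {
        "word": token,
        "is_number": token.isdigit(),
        "is_all_upper": token.isalpha() and token.isupper(),
        "is_all_lower": token.isalpha() and token.islower(),
        "is_capitalized": token.isalpha()
        and token[0].isupper()
        and token[1:].islower(),
        "is_country": lower in COUNTRIES,
        "is_institution_word": lower in INSTITUTION_WORDS,
        "is_address_word": lower in ADDRESS_WORDS,
    }


def sequence_features(tokens, window=2):
    """Add the features of up to `window` preceding and following tokens,
    prefixed with their relative offset (for example "-1:word")."""
    own = [token_features(token) for token in tokens]
    rows = []
    for i in range(len(tokens)):
        row = dict(own[i])
        for offset in range(-window, window + 1):
            j = i + offset
            if offset == 0 or not 0 <= j < len(tokens):
                continue
            for name, value in own[j].items():
                row[f"{offset:+d}:{name}"] = value
        rows.append(row)
    return rows
```

Looking words up one token at a time misses multi-word country names; a fuller version would match dictionary phrases across token spans.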

Evaluation

Datasets

- GROTOAP2: the evaluation and training of the zone classifiers
- GROTOAP2-affiliations: the evaluation and training of the affiliation parser
- PubMed Central Open Access Subset: the evaluation of the entire workflow

[Figure labels: PubMed Central, CERMINE, zone text matching]

Zone Classification

2,551 documents from GROTOAP2, containing:
- 355,779 zones
- 68,557 metadata zones

5-fold cross-validation

               metadata   other labels   precision   recall
metadata         66,372          2,185      96.8 %   97.0 %
other labels      2,052        285,170           -        -

               affiliation   other labels   precision   recall
affiliation          3,496            185      95.0 %   95.3 %
other labels           173         64,703           -        -

Affiliation Parsing

8,267 affiliations from PubMed Central

5-fold cross-validation

Token classification:

              address   country   institution   precision   recall
address        44,481        12         1,225      96.8 %   97.3 %
country            50     8,108             8      99.6 %   99.3 %
institution     1,434        18        92,457      98.7 %   98.5 %
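The precision and recall columns can be recomputed from the counts. The sketch below assumes that rows hold the true labels and columns the predicted labels; the slide does not state the orientation, but this reading reproduces the reported percentages.

```python
LABELS = ["address", "country", "institution"]

# Token confusion matrix from the slide.
# Assumption: rows are true labels, columns are predicted labels.
CONFUSION = [
    [44481, 12, 1225],
    [50, 8108, 8],
    [1434, 18, 92457],
]


def precision_recall(matrix, labels):
    """Return {label: (precision %, recall %)}, rounded to one decimal."""
    results = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])  # row sum
        results[label] = (
            round(100 * true_positives / predicted_total, 1),
            round(100 * true_positives / actual_total, 1),
        )
    return results


for label, (precision, recall) in precision_recall(CONFUSION, LABELS).items():
    print(f"{label:12s} precision {precision:5.1f} %  recall {recall:5.1f} %")
```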

Affiliation metadata extraction:

- institution recognized in 92.4% of cases
- address recognized in 92.2% of cases
- country recognized in 99.5% of cases
- 92.1% of affiliations entirely correctly parsed

Workflow Evaluation

1,943 documents from PMC

Evaluated tasks:
- extracting author strings
- extracting affiliation strings
- determining author-affiliation relations
- determining author-affiliation relations, if authors and affiliations were extracted flawlessly

[Figure: results for authors, affiliations, relations (total) and relations (perfect input), compared across systems: CERMINE, GROBID, ParsCit, PDFX]

Summary

System Features

- CERMINE extracts metadata and content from scholarly articles in PDF format
- the system is based on a modular workflow
- the implementation uses machine learning and heuristics
- the default system is trained on large and diverse datasets
- the source code is open and available on GitHub
- CERMINE is available as a web service and RESTful services

System Usage

- Java + Maven
- JAR file
- RESTful services:

$ curl -X POST --data-binary @article.pdf --header "Content-Type: application/binary" http://cermine.ceon.pl/extract.do

$ curl -X POST --data "affiliation=the text of the affiliation" http://cermine.ceon.pl/parse.do
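The same two calls could be made from Python using only the standard library. This is a sketch that mirrors the curl commands: the helper names are hypothetical, the UTF-8 decoding of the response is an assumption, and nothing here guarantees that the service is still reachable at these URLs.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://cermine.ceon.pl"


def build_extract_request(pdf_bytes):
    """POST the raw PDF bytes, as `curl --data-binary @article.pdf` does."""
    return Request(
        BASE_URL + "/extract.do",
        data=pdf_bytes,
        headers={"Content-Type": "application/binary"},
        method="POST",
    )


def build_parse_request(affiliation):
    """POST a form field named `affiliation`, as `curl --data` does.

    Unlike plain `curl --data`, the value is URL-encoded here.
    """
    body = urlencode({"affiliation": affiliation}).encode("ascii")
    return Request(BASE_URL + "/parse.do", data=body, method="POST")


def send(request, timeout=30):
    """Send a prepared request and return the response body as text.

    Assumes the service answers with UTF-8 encoded text.
    """
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A call would look like `send(build_parse_request("the text of the affiliation"))`; building the request separately from sending it keeps the part that needs network access small and easy to replace.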

[Figure: scatter plot; x-axis: number of pages (0 to 40)]

- CERMINE web service: http://cermine.ceon.pl
- CERMINE source code: https://github.com/CeON/CERMINE
- GROTOAP2: http://cermine.ceon.pl/grotoap2/
- GROTOAP2-affiliations: http://cermine.ceon.pl/grotoap2/affiliations/

Thank you!

linkedin.com/in/bolikowski

twitter.com/bolikowski

lukasz.bolikowski@icm.edu.pl
