Exploring Comparative Evaluation of Semantic Enrichment Tools for Cultural Heritage Metadata
TRANSCRIPT (TPDL 2016, CC BY-SA)
Hugo Manguinhas, Nuno Freire, Antoine Isaac, Valentine Charles | Europeana Foundation
Juliane Stiller | Humboldt-Universität zu Berlin
Aitor Soroa | University of the Basque Country
Rainer Simon | Austrian Institute of Technology
Vladimir Alexiev | Ontotext Corp
Agenda
• Europeana and Semantic Enrichments
• Goal and Steps of the Evaluation
• Results and Evaluation
• Conclusion
What is Europeana?
We aggregate metadata:
• From all EU countries
• ~3,500 galleries, libraries, archives and museums
• More than 52M objects
• In about 50 languages
• A huge number of references to places, agents, concepts and time periods
[Figure: Europeana aggregation infrastructure. Image: Europeana, CC BY-SA]
The Platform for Europe’s Digital Cultural Heritage
What are Semantic Enrichments?
● Key technique to contextualize objects
● Aims at adding new information at the semantic level to the metadata about certain resources
● Linked Data context -> creation of new links between the enriched resources and others
● Information Retrieval -> adding new terms to a query or document, and therefore reaching a higher visibility of documents within the document space
Semantic Enrichments in Europeana
Offers richer context to items by linking them to relevant entities (people, places, objects, types)
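As a minimal illustration (not an actual Europeana record; the field names are simplified assumptions), an enrichment can be pictured as attaching a contextual entity URI to a literal metadata value:

```python
# Illustrative only: a hypothetical metadata record before and after
# semantic enrichment. Field names loosely follow Dublin Core but are
# simplified; the "_enriched" key is purely hypothetical.

record = {
    "dc:title": "Vue du Pont Neuf, Paris",
    "dcterms:spatial": "Paris",
}

# The enrichment links the literal place name to an entity in a target
# vocabulary (here a GeoNames URI for the city of Paris), giving the
# item machine-readable context that can power linking and retrieval.
enriched_record = {
    **record,
    "dcterms:spatial_enriched": "http://sws.geonames.org/2988507/",  # Paris
}
```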
Automatic Enrichments are Challenging as
• Bewildering variety of objects;
• Differing provider practices for providing extra information;
• Data normalization issues;
• Information in a variety of languages, without the appropriate language tags to determine the language.
Established a Task Force Gathering Domain Experts
Goals of the Task Force
• Perform concrete evaluations and comparisons of enrichment tools in the CH domain
• Identify methodologies for evaluating enrichment in CH, which:
  • are realistic w.r.t. the amount of resources to be employed;
  • facilitate measuring enrichment progress over time;
  • build a reference set of metadata and enrichments that can be used for future comparative evaluations.
Enrichment Tools Evaluated
• Semantic Enrichment Framework used at Europeana
  • Dictionary-based, against a curated subset of DBpedia, GeoNames and Semium Time
• The European Library named entity recognition process
  • NERD and heuristic-based, against GeoNames and the Gemeinsame Normdatei (GND)
• The Background Link service developed within the LoCloud project
  • NERD applying supervised statistical methods, against DBpedia
• The Vocabulary Matching service (also from the LoCloud project)
  • Dictionary-based, against several thesauri in SKOS
• The Recogito open source tool used in the Pelagios 3 project
  • NERD and heuristic-based, against Wikidata
• The Ontotext Semantic Platform enrichment tool
  • NERD with NLP, against MediaGraph
Evaluation of Enrichment Tools
Steps performed:
1. Select a sample dataset for enrichment,
2. Apply different tools to enrich the sample dataset,
3. Manually create a reference annotated corpus,
4. Compare the automatic annotations with the manual ones to determine the performance of the enrichment tools.
The Sample Dataset
• A subset of the data The European Library (TEL) submitted to Europeana
• TEL is the biggest data aggregator for Europeana -> widest coverage of countries and languages.
• For each of the 19 countries, we randomly selected a maximum of 1000 records.
Resulting in a dataset of 17,300 metadata records in 10 languages.
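A minimal sketch of this sampling step; the record structure and its `country` field are assumptions for illustration, not the actual TEL data layout:

```python
import random
from collections import defaultdict

def sample_per_country(records, cap=1000, seed=42):
    """Randomly select at most `cap` records per country.

    `records` is assumed to be an iterable of dicts, each carrying a
    "country" key; a fixed seed keeps the sample reproducible.
    """
    rng = random.Random(seed)
    by_country = defaultdict(list)
    for rec in records:
        by_country[rec["country"]].append(rec)
    sample = []
    for recs in by_country.values():
        # take all records when a country has fewer than `cap`
        sample.extend(rng.sample(recs, min(cap, len(recs))))
    return sample
```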
Building the corpus
● Each tool was applied to the dataset, resulting in a total of ~360K enrichments
● Each enrichment set for each tool was:
  ○ normalized to a reference target vocabulary using coreferencing information (~62.16% success rate)
    ■ GeoNames (places), DBpedia (other resource types)
  ○ compared to identify the overlap between tools, resulting in 26 agreement combinations (plus non-agreements)
● At most 1000 enrichments were selected per set:
  ○ resulting in a main corpus of 1757 enrichments
  ○ plus a secondary corpus composed of a subset of 46 enrichments for measuring the inter-rater agreement
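The normalization and overlap-comparison steps could look roughly like this sketch, assuming each enrichment is a (record id, entity URI) pair and a coreference map (e.g. derived from owl:sameAs links) points equivalent URIs to the reference vocabularies (GeoNames for places, DBpedia otherwise); all names are illustrative:

```python
from collections import defaultdict

def normalize(enrichments, coref):
    """Map each enrichment's target URI to the reference vocabulary.

    `coref` is a dict from known URIs to their reference counterpart;
    URIs without coreference information are kept as-is.
    """
    return {(rid, coref.get(uri, uri)) for rid, uri in enrichments}

def agreement_combinations(per_tool):
    """Group enrichments by the exact set of tools that produced them.

    `per_tool` maps a tool name to its set of normalized enrichments;
    each frozenset key in the result is one "agreement combination".
    """
    producers = defaultdict(set)
    for tool, enrichments in per_tool.items():
        for e in enrichments:
            producers[e].add(tool)
    combos = defaultdict(set)
    for e, tools in producers.items():
        combos[frozenset(tools)].add(e)
    return combos
```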
Building the corpus (overview)
TEL Dataset (~175K records)
  → randomly selected ~1000 records per country (19 countries)
Subset of TEL Dataset (17.3K records)
  → applied 7 enrichment tools / tool settings
Raw Enrichments (~360K)
  → normalized using a reference target dataset
Normalized Enrichments (~360K)
  → compared amongst tools
Agreement Combinations (26 sets)
  → selected at most 1000 distinct enrichments per distinct set
Evaluation Corpus (1757 distinct enrichments)
  → annotated according to the Annotation Guidelines
Annotated Corpus (1757 enrichments), including an Inter-Annotator Corpus (46 enrichments)
Annotation criteria and guidelines
Category: Semantic correctness
  Annotations: C = Correct, I = Incorrect, U = Unsure
  Description: Is the enrichment semantically correct or not?

Category: Completeness of name match
  Annotations: F = Full match, P = Partial match, U = Unsure
  Description: Was the whole phrase/named entity enriched or only parts of it?

Category: Completeness of concept match
  Annotations: F = Full match, B = Broader than, N = Narrower than, U = Unsure
  Description: Whether the matched concept is at the same level of conceptual abstraction as the named entity/phrase. Since sometimes the exact concept is not available in the target vocabulary, a narrower or broader concept may be used in the enrichment. Use B when the concept identified (i.e. the target) is broader than the intended concept; use N when it is narrower. The same applies to geographical regions: use N when the concept identified describes a smaller entity than intended, and B when it refers to a bigger geographical region.
In addition to the criteria, the guidelines also included:
● explanation of how to apply the criteria
● concrete examples matching all possible combinations of the criteria
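One way to picture an entry of the annotated corpus is as a small record carrying the three criteria; this dataclass is an illustrative assumption, not the Task Force's actual format:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Annotation:
    """One manually assessed enrichment in the annotated corpus."""
    record_id: str        # identifier of the enriched metadata record
    entity_uri: str       # target entity the tool linked to
    semantic_correctness: Literal["C", "I", "U"]  # Correct / Incorrect / Unsure
    name_match: Literal["F", "P", "U"]            # Full / Partial / Unsure
    concept_match: Literal["F", "B", "N", "U"]    # Full / Broader / Narrower / Unsure
```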
Example of the annotated corpus
16 members of our Task Force manually assessed the correctness of both corpora, applying the criteria.
Measuring performance of each tool
● We adapted the Information Retrieval notions of precision and recall to measure the performance of the enrichment tools
● Two measures were defined considering the criteria:
  ○ relaxed: all enrichments annotated as semantically correct are considered "true", regardless of their completeness
  ○ strict: considers as "true" only the semantically correct enrichments with full name and concept match completeness
● As it was not possible within the Task Force to identify all possible true enrichments, we opted for a pooled recall using the total number of enrichments
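A minimal sketch of these measures, building on the hypothetical Annotation record above; the exact pooling used by the Task Force may differ in detail:

```python
def is_true(a, strict=False):
    """Relaxed: semantically correct. Strict: also full name and concept match."""
    if a.semantic_correctness != "C":
        return False
    return not strict or (a.name_match == "F" and a.concept_match == "F")

def precision(tool_annotations, strict=False):
    """Fraction of a tool's annotated enrichments judged true."""
    trues = sum(is_true(a, strict) for a in tool_annotations)
    return trues / len(tool_annotations)

def pooled_recall(tool_annotations, all_annotations, strict=False):
    """Recall against the pool of true enrichments found by ANY tool,
    used because enumerating all possible true enrichments was infeasible."""
    pool = {(a.record_id, a.entity_uri)
            for a in all_annotations if is_true(a, strict)}
    found = {(a.record_id, a.entity_uri)
             for a in tool_annotations if is_true(a, strict)}
    return len(found & pool) / len(pool)
```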
Measuring performance of each tool
Group   | Tool        | Annotated Enrichments (% of full corpus) | Precision (Relax. / Strict) | Max Pooled Recall | Estimated Recall (Relax. / Strict) | Estimated F-measure (Relax. / Strict)
Group A | EF          | 550 (31.3%) | 0.985 / 0.965 | 0.458 | 0.355 / 0.432 | 0.522 / 0.597
Group A | TEL         | 391 (22.3%) | 0.982 / 0.982 | 0.325 | 0.254 / 0.315 | 0.404 / 0.477
Group B | BgLinks     | 427 (24.3%) | 0.888 / 0.574 | 0.355 | 0.249 / 0.200 | 0.389 / 0.296
Group B | Pelagios    | 502 (28.6%) | 0.854 / 0.820 | 0.418 | 0.286 / 0.340 | 0.428 / 0.481
Group B | VocMatch    | 100 (5.7%)  | 0.774 / 0.312 | 0.083 | 0.048 / 0.024 | 0.091 / 0.045
Group B | Ontotext v1 | 489 (27.8%) | 0.842 / 0.505 | 0.407 | 0.272 / 0.202 | 0.411 / 0.289
Group B | Ontotext v2 | 682 (38.8%) | 0.924 / 0.632 | 0.567 | 0.418 / 0.354 | 0.576 / 0.454

The observed percentage agreement for Fleiss' kappa was 79.9%.
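For reference, the observed percentage agreement (the P-bar component of Fleiss' kappa) can be computed as below; the input format, one list of per-annotator category labels per enrichment, is an assumption:

```python
from collections import Counter

def observed_agreement(ratings_per_item):
    """Mean fraction of rater pairs that agree, averaged over items.

    `ratings_per_item`: list of lists; each inner list holds the category
    each annotator assigned to one enrichment (all inner lists equal length).
    """
    p_bar = 0.0
    for ratings in ratings_per_item:
        n = len(ratings)
        counts = Counter(ratings)
        # number of agreeing rater pairs divided by all rater pairs
        p_bar += sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    return p_bar / len(ratings_per_item)
```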
Conclusions
1. Select a dataset for your evaluation that represents the diversity of your (meta)data
2. Building a gold standard is ideal but not always possible
3. Consider using the semantics of target datasets
4. Try to keep a balance between the evaluated tools
5. Give clear guidelines on how to annotate the corpus
6. Use the right tool for annotating the corpus
Recommendations for Enrichment Tools
● Consider applying different techniques depending on the kind of values used within the enriched property
● Matches on parts of a field's textual content may result in too general or even meaningless enrichments if they fail to recognize compound expressions
  ○ In the Europeana context, concepts as broad as "general period" do not bring any value as enrichments
● Apply a strong disambiguation mechanism that considers the accuracy of a name reference together with the relevance of the entity (see the sketch after this list)
● Quality issues originating in the metadata mapping process are a great obstacle to getting good enrichments
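One plausible shape for such a disambiguation mechanism, offered as an illustration rather than the Task Force's recommendation, combines a name-accuracy score with an entity-relevance prior; the candidate structure, weighting and threshold are all assumptions:

```python
from difflib import SequenceMatcher

def disambiguation_score(mention, candidate, alpha=0.7):
    """Blend how well the candidate's label matches the name reference
    with a prior for the entity's relevance (e.g. normalized link count
    or population); `candidate` is assumed to be a dict with a 'label'
    string and a 'relevance' value in [0, 1]."""
    name_accuracy = SequenceMatcher(None, mention.lower(),
                                    candidate["label"].lower()).ratio()
    return alpha * name_accuracy + (1 - alpha) * candidate["relevance"]

def best_candidate(mention, candidates, threshold=0.8):
    """Return the top-scoring candidate, or None when too uncertain,
    so that doubtful matches produce no enrichment at all."""
    best = max(candidates, key=lambda c: disambiguation_score(mention, c))
    return best if disambiguation_score(mention, best) >= threshold else None
```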
Next steps...
● Build a gold standard with:
  ○ significant language and spatial coverage, and
  ○ diversity of subjects and domains
● Develop a tool to help annotators create and manage the gold standard:
  ○ tuned for the annotation task, collaborative, and easily accessible
● Develop (or extend) a solution for running evaluations:
  ○ possibly inspired by GERBIL
● Re-evaluate the tools
● Integrate enrichment evaluation into a holistic evaluation framework for search
Thank you!
References
● Monti, Johanna, Mario Monteleone, Maria Pia di Buono, and Federica Marano. 2013. “Cross-Lingual Information Retrieval and Semantic Interoperability for Cultural Heritage Repositories.” In Recent Advances in Natural Language Processing, RANLP 2013, 9-11 September, 2013, Hissar, Bulgaria, 483–90. http://aclweb.org/anthology/R/R13/R13-1063.pdf.
● Petras, Vivien, Toine Bogers, Elaine G. Toms, Mark M. Hall, Jacques Savoy, Piotr Malak, Adam Pawlowski, Nicola Ferro, and Ivano Masiero. 2013. “Cultural Heritage in CLEF (CHiC) 2013.” In Information Access Evaluation. Multilinguality, Multimodality, and Visualization - 4th International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013. Proceedings, 192–211. doi:10.1007/978-3-642-40802-1_23.
● Stiller, Juliane, Vivien Petras, Maria Gäde, and Antoine Isaac. 2014. “Automatic Enrichments with Controlled Vocabularies in Europeana: Challenges and Consequences.” In Digital Heritage. Progress in Cultural Heritage: Documentation, Preservation, and Protection, 238–47.
● Usbeck, Ricardo, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, et al. 2015. “GERBIL: General Entity Annotator Benchmarking Framework.” In Proceedings of the 24th International Conference on World Wide Web, 1133–43. WWW ’15. New York, NY, USA: ACM. doi:10.1145/2736277.2741626.
Previous Studies
• Evaluations of enrichments were performed sporadically:
  • Sample evaluation of alignments of places and agents to DBpedia
  • Evaluation of the relationship between enrichments, objects and the queries which retrieve enriched objects (Stiller et al. 2014)
  • Evaluations of retrieval performance in the CH domain that indirectly also target enrichments (e.g. Monti et al. 2013, Petras et al. 2013)
• Tools and frameworks for enrichment:
  • GERBIL (Usbeck et al. 2015)