Exploring Comparative Evaluation of Semantic Enrichment Tools for Cultural Heritage Metadata
TRANSCRIPT (TPDL 2016, CC BY-SA)
Hugo Manguinhas, Nuno Freire, Antoine Isaac, Valentine Charles | Europeana Foundation
Juliane Stiller | Humboldt-Universität zu Berlin
Aitor Soroa | University of the Basque Country
Rainer Simon | Austrian Institute of Technology
Vladimir Alexiev | Ontotext Corp
Agenda
• Europeana and Semantic Enrichments
• Goal and Steps of the Evaluation
• Results and Evaluation
• Conclusion
What is Europeana?
We aggregate metadata:
• From all EU countries
• ~3,500 galleries, libraries, archives and museums
• More than 52M objects
• In about 50 languages
• A huge number of references to places, agents, concepts and time periods
[Figure: Europeana aggregation infrastructure. Image: Europeana, CC BY-SA]
The Platform for Europe’s Digital Cultural Heritage
What are Semantic Enrichments?
● Key technique to contextualize objects
● Aims at adding new information at the semantic level to the metadata about certain resources
● Linked Data context -> creation of new links between the enriched resources and others
● Information Retrieval -> adding new terms to a query or document, and therefore reaching a higher visibility of documents within the document space
Semantic Enrichments in Europeana
Offers richer context to items by linking them to relevant entities (people, places, objects, types)
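As a minimal illustration (not an actual Europeana record; the field names are simplified assumptions), an enrichment can be pictured as attaching a contextual entity URI to a literal metadata value:

```python
# Illustrative only: a hypothetical metadata record before and after
# semantic enrichment. Field names loosely follow Dublin Core but are
# simplified; the "_enriched" key is purely hypothetical.

record = {
    "dc:title": "Vue du Pont Neuf, Paris",
    "dcterms:spatial": "Paris",
}

# The enrichment links the literal place name to an entity in a target
# vocabulary (here a GeoNames URI for the city of Paris), giving the
# item machine-readable context that can power linking and retrieval.
enriched_record = {
    **record,
    "dcterms:spatial_enriched": "http://sws.geonames.org/2988507/",  # Paris
}
```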
Automatic Enrichments are Challenging as
• Bewildering variety of objects;
• Differing provider practices for providing extra information;
• Data normalization issues;
• Information in a variety of languages, without the appropriate language tags to determine the language.
Established a Task Force Gathering Domain Experts
Goals of the Task Force
• Perform concrete evaluations and comparisons of enrichment tools in the CH domain
• Identify methodologies for evaluating enrichment in CH, which:
  • are realistic w.r.t. the amount of resources to be employed;
  • facilitate measuring enrichment progress over time;
  • build a reference set of metadata and enrichments that can be used for future comparative evaluations.
Enrichment Tools Evaluated
• Semantic Enrichment Framework used at Europeana
  • Dictionary-based, against a curated subset of DBpedia, GeoNames and Semium Time
• The European Library named entity recognition process
  • NERD and heuristic-based, against GeoNames and the Gemeinsame Normdatei (GND)
• The Background Link service developed within the LoCloud project
  • NERD applying supervised statistical methods, against DBpedia
• The Vocabulary Matching service (also from the LoCloud project)
  • Dictionary-based, against several thesauri in SKOS
• The Recogito open source tool used in the Pelagios 3 project
  • NERD and heuristic-based, against Wikidata
• The Ontotext Semantic Platform enrichment tool
  • NERD with NLP, against MediaGraph
Evaluation of Enrichment Tools
Steps performed:
1. Select a sample dataset for enrichment,
2. Apply different tools to enrich the sample dataset,
3. Manually create a reference annotated corpus,
4. Compare the automatic annotations with the manual ones to determine the performance of the enrichment tools.
The Sample Dataset
• A subset of the data The European Library (TEL) submitted to Europeana
• TEL is the biggest data aggregator for Europeana -> widest coverage of countries and languages.
• For each of the 19 countries, we randomly selected a maximum of 1000 records.
Resulting in a dataset of 17,300 metadata records in 10 languages.
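A minimal sketch of this sampling step; the record structure and its `country` field are assumptions for illustration, not the actual TEL data layout:

```python
import random
from collections import defaultdict

def sample_per_country(records, cap=1000, seed=42):
    """Randomly select at most `cap` records per country.

    `records` is assumed to be an iterable of dicts, each carrying a
    "country" key; a fixed seed keeps the sample reproducible.
    """
    rng = random.Random(seed)
    by_country = defaultdict(list)
    for rec in records:
        by_country[rec["country"]].append(rec)
    sample = []
    for recs in by_country.values():
        # take all records when a country has fewer than `cap`
        sample.extend(rng.sample(recs, min(cap, len(recs))))
    return sample
```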
Building the corpus
● Each tool was applied to the dataset, resulting in a total of ~360K enrichments
● Each enrichment set for each tool was:
  ○ normalized to a reference target vocabulary using coreferencing information (~62.16% success rate)
    ■ GeoNames (places), DBpedia (other resource types)
  ○ compared to identify the overlap between tools, resulting in 26 agreement combinations (plus non-agreements)
● At most 1000 enrichments were selected per set:
  ○ resulting in a main corpus of 1757 enrichments
  ○ plus a secondary corpus composed of a subset of 46 enrichments for measuring the inter-rater agreement
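The normalization and overlap-comparison steps could look roughly like this sketch, assuming each enrichment is a (record id, entity URI) pair and a coreference map (e.g. derived from owl:sameAs links) points equivalent URIs to the reference vocabularies (GeoNames for places, DBpedia otherwise); all names are illustrative:

```python
from collections import defaultdict

def normalize(enrichments, coref):
    """Map each enrichment's target URI to the reference vocabulary.

    `coref` is a dict from known URIs to their reference counterpart;
    URIs without coreference information are kept as-is.
    """
    return {(rid, coref.get(uri, uri)) for rid, uri in enrichments}

def agreement_combinations(per_tool):
    """Group enrichments by the exact set of tools that produced them.

    `per_tool` maps a tool name to its set of normalized enrichments;
    each frozenset key in the result is one "agreement combination".
    """
    producers = defaultdict(set)
    for tool, enrichments in per_tool.items():
        for e in enrichments:
            producers[e].add(tool)
    combos = defaultdict(set)
    for e, tools in producers.items():
        combos[frozenset(tools)].add(e)
    return combos
```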
Building the corpus (overview)
TEL Dataset (~175K records)
  → randomly selected ~1000 records per country (19 countries)
Subset of TEL Dataset (17.3K records)
  → applied 7 enrichment tools / tool settings
Raw Enrichments (~360K)
  → normalized using a reference target dataset
Normalized Enrichments (~360K)
  → compared amongst tools
Agreement Combinations (26 sets)
  → selected at most 1000 distinct enrichments per distinct set
Evaluation Corpus (1757 distinct enrichments)
  → annotated according to the Annotation Guidelines
Annotated Corpus (1757 enrichments), including an Inter-Annotator Corpus (46 enrichments)
Annotation criteria and guidelines
Category: Semantic correctness
  Annotations: C = Correct, I = Incorrect, U = Unsure
  Description: Is the enrichment semantically correct or not?

Category: Completeness of name match
  Annotations: F = Full match, P = Partial match, U = Unsure
  Description: Was the whole phrase/named entity enriched or only parts of it?

Category: Completeness of concept match
  Annotations: F = Full match, B = Broader than, N = Narrower than, U = Unsure
  Description: Whether the matched concept is at the same level of conceptual abstraction as the named entity/phrase. Since sometimes the exact concept is not available in the target vocabulary, a narrower or broader concept may be used in the enrichment. Use B when the concept identified (i.e. the target) is broader than the intended concept; use N when it is narrower. The same applies to geographical regions: use N when the concept identified describes a smaller entity than intended, and B when it refers to a bigger geographical region.
In addition to the criteria, the guidelines also included:
● explanation of how to apply the criteria
● concrete examples matching all possible combinations of the criteria
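One way to picture an entry of the annotated corpus is as a small record carrying the three criteria; this dataclass is an illustrative assumption, not the Task Force's actual format:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Annotation:
    """One manually assessed enrichment in the annotated corpus."""
    record_id: str        # identifier of the enriched metadata record
    entity_uri: str       # target entity the tool linked to
    semantic_correctness: Literal["C", "I", "U"]  # Correct / Incorrect / Unsure
    name_match: Literal["F", "P", "U"]            # Full / Partial / Unsure
    concept_match: Literal["F", "B", "N", "U"]    # Full / Broader / Narrower / Unsure
```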
Example of the annotated corpus
16 members of our Task Force manually assessed the correctness of both corpora, applying the criteria.
Measuring performance of each tool
● We adapted the Information Retrieval notions of precision and recall to measure the performance of the enrichment tools
● Two measures were defined considering the criteria:
  ○ relaxed: all enrichments annotated as semantically correct are considered "true", regardless of their completeness
  ○ strict: considers as "true" only the semantically correct enrichments with full name and concept match completeness
● As it was not possible within the Task Force to identify all possible true enrichments, we opted for a pooled recall using the total number of enrichments
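A minimal sketch of these measures, building on the hypothetical Annotation record above; the exact pooling used by the Task Force may differ in detail:

```python
def is_true(a, strict=False):
    """Relaxed: semantically correct. Strict: also full name and concept match."""
    if a.semantic_correctness != "C":
        return False
    return not strict or (a.name_match == "F" and a.concept_match == "F")

def precision(tool_annotations, strict=False):
    """Fraction of a tool's annotated enrichments judged true."""
    trues = sum(is_true(a, strict) for a in tool_annotations)
    return trues / len(tool_annotations)

def pooled_recall(tool_annotations, all_annotations, strict=False):
    """Recall against the pool of true enrichments found by ANY tool,
    used because enumerating all possible true enrichments was infeasible."""
    pool = {(a.record_id, a.entity_uri)
            for a in all_annotations if is_true(a, strict)}
    found = {(a.record_id, a.entity_uri)
             for a in tool_annotations if is_true(a, strict)}
    return len(found & pool) / len(pool)
```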
Measuring performance of each tool
Group   | Tool        | Annotated Enrichments (% of full corpus) | Precision (Relax. / Strict) | Max Pooled Recall | Estimated Recall (Relax. / Strict) | Estimated F-measure (Relax. / Strict)
Group A | EF          | 550 (31.3%) | 0.985 / 0.965 | 0.458 | 0.355 / 0.432 | 0.522 / 0.597
Group A | TEL         | 391 (22.3%) | 0.982 / 0.982 | 0.325 | 0.254 / 0.315 | 0.404 / 0.477
Group B | BgLinks     | 427 (24.3%) | 0.888 / 0.574 | 0.355 | 0.249 / 0.200 | 0.389 / 0.296
Group B | Pelagios    | 502 (28.6%) | 0.854 / 0.820 | 0.418 | 0.286 / 0.340 | 0.428 / 0.481
Group B | VocMatch    | 100 (5.7%)  | 0.774 / 0.312 | 0.083 | 0.048 / 0.024 | 0.091 / 0.045
Group B | Ontotext v1 | 489 (27.8%) | 0.842 / 0.505 | 0.407 | 0.272 / 0.202 | 0.411 / 0.289
Group B | Ontotext v2 | 682 (38.8%) | 0.924 / 0.632 | 0.567 | 0.418 / 0.354 | 0.576 / 0.454

The observed percentage agreement for Fleiss' kappa was 79.9%.
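For reference, the observed percentage agreement (the P-bar component of Fleiss' kappa) can be computed as below; the input format, one list of per-annotator category labels per enrichment, is an assumption:

```python
from collections import Counter

def observed_agreement(ratings_per_item):
    """Mean fraction of rater pairs that agree, averaged over items.

    `ratings_per_item`: list of lists; each inner list holds the category
    each annotator assigned to one enrichment (all inner lists equal length).
    """
    p_bar = 0.0
    for ratings in ratings_per_item:
        n = len(ratings)
        counts = Counter(ratings)
        # number of agreeing rater pairs divided by all rater pairs
        p_bar += sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    return p_bar / len(ratings_per_item)
```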
Conclusions
1. Select a dataset for your evaluation that represents the diversity of your (meta)data
2. Building a gold standard is ideal but not always possible
3. Consider using the semantics of target datasets
4. Try to keep a balance between the evaluated tools
5. Give clear guidelines on how to annotate the corpus
6. Use the right tool for annotating the corpus
Recommendations for Enrichment Tools
● Consider applying different techniques depending on the kind of values used within the enriched property
● Matches on parts of a field's textual content may result in too general or even meaningless enrichments if they fail to recognize compound expressions
  ○ In the Europeana context, concepts as broad as "general period" do not bring any value as enrichments
● Apply a strong disambiguation mechanism that considers the accuracy of a name reference together with the relevance of the entity (see the sketch after this list)
● Quality issues originating in the metadata mapping process are a great obstacle to getting good enrichments
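One plausible shape for such a disambiguation mechanism, offered as an illustration rather than the Task Force's recommendation, combines a name-accuracy score with an entity-relevance prior; the candidate structure, weighting and threshold are all assumptions:

```python
from difflib import SequenceMatcher

def disambiguation_score(mention, candidate, alpha=0.7):
    """Blend how well the candidate's label matches the name reference
    with a prior for the entity's relevance (e.g. normalized link count
    or population); `candidate` is assumed to be a dict with a 'label'
    string and a 'relevance' value in [0, 1]."""
    name_accuracy = SequenceMatcher(None, mention.lower(),
                                    candidate["label"].lower()).ratio()
    return alpha * name_accuracy + (1 - alpha) * candidate["relevance"]

def best_candidate(mention, candidates, threshold=0.8):
    """Return the top-scoring candidate, or None when too uncertain,
    so that doubtful matches produce no enrichment at all."""
    best = max(candidates, key=lambda c: disambiguation_score(mention, c))
    return best if disambiguation_score(mention, best) >= threshold else None
```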
Next steps...
● Build a gold standard with:
  ○ significant language and spatial coverage, and
  ○ diversity of subjects and domains
● Develop a tool to help annotators create and manage the gold standard:
  ○ tuned for the annotation task, collaborative, and easily accessible
● Develop (or extend) a solution for running evaluations:
  ○ possibly inspired by GERBIL
● Re-evaluate the tools
● Integrate enrichment evaluation into a holistic evaluation framework for search
Thank you!
References
● Monti, Johanna, Mario Monteleone, Maria Pia di Buono, and Federica Marano. 2013. “Cross-Lingual Information Retrieval and Semantic Interoperability for Cultural Heritage Repositories.” In Recent Advances in Natural Language Processing, RANLP 2013, 9-11 September, 2013, Hissar, Bulgaria, 483–90. http://aclweb.org/anthology/R/R13/R13-1063.pdf.
● Petras, Vivien, Toine Bogers, Elaine G. Toms, Mark M. Hall, Jacques Savoy, Piotr Malak, Adam Pawlowski, Nicola Ferro, and Ivano Masiero. 2013. “Cultural Heritage in CLEF (CHiC) 2013.” In Information Access Evaluation. Multilinguality, Multimodality, and Visualization - 4th International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013. Proceedings, 192–211. doi:10.1007/978-3-642-40802-1_23.
● Stiller, Juliane, Vivien Petras, Maria Gäde, and Antoine Isaac. 2014. “Automatic Enrichments with Controlled Vocabularies in Europeana: Challenges and Consequences.” In Digital Heritage. Progress in Cultural Heritage: Documentation, Preservation, and Protection, 238–47.
● Usbeck, Ricardo, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, et al. 2015. “GERBIL: General Entity Annotator Benchmarking Framework.” In Proceedings of the 24th International Conference on World Wide Web, 1133–43. WWW ’15. New York, NY, USA: ACM. doi:10.1145/2736277.2741626.
Previous Studies
• Evaluations of enrichments were performed sporadically:
  • Sample evaluation of alignments of places and agents to DBpedia
  • Evaluation of the relationship between enrichments, objects and the queries which retrieve enriched objects (Stiller et al. 2014)
  • Evaluations of retrieval performance in the CH domain that indirectly also target enrichments (e.g. Monti et al. 2013, Petras et al. 2013)
• Tools and frameworks for enrichment:
  • GERBIL (Usbeck et al. 2015)