Grant Agreement 621023

Europeana Food and Drink

Semantic Demonstrator Description

Contents

Contents
Introduction
Europeana Search
Why Semantic Search
Semantic Concepts
Semantic Application
Semapp Resources
Bulgarian Traditional Recipes
Objects From Oceania
Object Detailed View
Objects Related to Fermented Beverages and Asia
Objects from Alinari related to the Roman Empire and Beverages
Objects Related to Palms
Potential Extensions
Mobile Application
Semantic Discovery
FD Classifier
Enrichment Web Service
New Language for Enrichment
Geographical Mapping
Other Visualizations
Classification Tree
Coverage Map
Radial Dendrogram
Treemap
Sunburst
Timeline
References
For More Information

Introduction

The Europeana Food and Drink Semantic Demonstrator (EFD semapp) is a simple application that demonstrates the use of semantic technologies for classification and discovery of Europeana objects related to Food and Drink (FD). These functions are based on the EFD Classification scheme, which is built using several Linked Open Data (LOD) sets and the Wikipedia categories. The semapp performs semantic enrichment, and applying the classification to Cultural Heritage Objects (CHOs) strengthens the classification itself.

Cultural Heritage (CH) institutions can benefit from coreferencing their thesauri and enriching their objects with LOD such as DBpedia, Wikidata and the Getty AAT. Significantly, this provides an avenue for multilingual access to their collections [Alexiev 2015d].

The semapp also provides simple end-user functions: semantic search and faceted navigation, timeline, geographical mapping.

Europeana Search

The Europeana Portal search is based on full-text queries. It is fast, offers fielded search, and provides basic facets (e.g. provider=aggregator, dataProvider, country, language, year).

Europeana also offers a simple query translation service. E.g. a query for "beer" in 6 languages returns a translatedQuery that can be used for multilingual search:

beer OR "Beer" OR "Öl" OR "Cerveza" OR "Birra" OR "Bier" OR "Bière"
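As a minimal sketch of how such a query string can be assembled (this is not the Europeana translation service itself; the function name `build_translated_query` and the hard-coded translation list are assumptions for illustration), the original term is joined with its quoted translations:

```python
def build_translated_query(term, translations):
    """Join a search term and its quoted translations into one OR query."""
    parts = [term]
    for word in translations:
        quoted = f'"{word}"'
        if quoted not in parts:  # skip duplicate translations
            parts.append(quoted)
    return " OR ".join(parts)


# Translations copied from the example above; a real service would look them up.
translations = ["Beer", "Öl", "Cerveza", "Birra", "Bier", "Bière"]
query = build_translated_query("beer", translations)
print(query)
```

Note that the expansion works purely on strings: nothing in it records which language a word came from or which sense of the word is meant.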

Isn't this enough? Why do we need semantic search? Search for Beer and go to a random page (starting object 313): you may find that only 1 of 24 objects is relevant to your query. The rest are images of bears, objects related to de Beer (a very popular Dutch surname), etc. Mere word translation does not:

Accommodate hyponyms, e.g. Lager, Pilsner

Accommodate related topics, e.g. Stein, Jug (the Wikipedia Category:Beer does)

Prevent ambiguity, so precision is very low.

Ambiguity is a very common problem in all domains, including places and FD topics. E.g. "recipe" can relate to food or to medicine; "fork" could mean cutlery, an agricultural tool (pitchfork) or a musical device (tuning fork).

Multilingual expansion exacerbates the problem. [Olensky 2012] describes a classic failure of naïve query translation in Europeana (using the older enrichment tool AnnoCultor): if you search for "poison" in the collections provided by Swiss institutions, you may find photographs from India and Indian movie covers. The reason is that objects were enriched with the term "poison" and its multilingual equivalents. The Latvian word for "poison" is "inde", which coincides with "Inde", the French word for India and the keyword that the French-speaking domain expert gave to the objects to describe their content.

Europeana CHO fields often lack a language tag or carry an incorrect one, and dc:language describes the language of the CHO content, not the language of the metadata. So although the Europeana fielded search includes fields by language, it is not easy to filter by language.
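The "poison"/"Inde" failure can be reproduced in a few lines. The sketch below is illustrative only: the label table, the two sample objects and both matching functions are hypothetical, and it assumes each keyword carries a correct language tag, which (as noted above) is often not the case in practice.

```python
# Hypothetical multilingual labels for the concept "poison" (illustrative only).
POISON_LABELS = {"en": "poison", "lv": "inde", "de": "Gift"}

# Hypothetical object keywords, each with the language the cataloguer used.
objects = [
    {"id": "photo-1", "keyword": "Inde", "lang": "fr"},    # French for India
    {"id": "bottle-7", "keyword": "poison", "lang": "en"},
]


def naive_match(obj, labels):
    """Match against every label regardless of language (the failure mode)."""
    all_labels = {label.lower() for label in labels.values()}
    return obj["keyword"].lower() in all_labels


def language_aware_match(obj, labels):
    """Match only against the label in the keyword's own language."""
    label = labels.get(obj["lang"])
    return label is not None and label.lower() == obj["keyword"].lower()


naive_hits = [o["id"] for o in objects if naive_match(o, POISON_LABELS)]
aware_hits = [o["id"] for o in objects if language_aware_match(o, POISON_LABELS)]
print(naive_hits)  # the India photograph is wrongly included
print(aware_hits)
```

The language-aware variant only helps when the tags are present and reliable, which is one reason why matching against disambiguated concepts is preferable to matching strings.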

Why Semantic Search

Semantic enrichment and semantic search are technically more complex, but enable more powerful and richer searches. E.g. how is the WW1 front-line letter "Wounded" related to Tea? It talks about "Brooke Bond", which is a tea brand. Wikipedia has an article about it that falls in Category:Tea, so it can be discovered.

Semantic enrichment of Europeana CHOs is a highly complex problem because of CHO metadata problems. E.g. "Kettle" returns 1101 objects. One of them is a medal by "Artist: Kettle, Henry, die-engraver". Can enrichment discover that this is not a kettle and discard it as irrelevant to the query? Unfortunately the metadata as ingested by Europeana has four dc:subject fields "Henry; medals; Kettle; medal". Rather than being recorded in dc:creator, the name is split across two separate dc:subject fields, so there is no way to recognize Kettle as a person in this CHO. But in most other cases it is possible.

Semantic Concepts

The EFD Classification [Alexiev 2015a, c] and the semapp deal with the following types of semantic concepts, each organized in a hierarchical semantic facet:

FD-specific topics (domain-specific gazetteer) [Tagarev 2015, 2016]. We used the Wikipedia Categories to build an FD tree. It is a hierarchical classification with 8k categories, 111k articles and over 350k labels (mostly English). Over 800k categories are reachable from the root Category:Food_and_drink, most of them irrelevant, so we pruned the tree using manual curation and machine learning approaches.

Places. We use DBpedia, but it doesn't have a uniform property for making up the place hierarchy. E.g. an island has "archipelago" to designate the island group and "region" to designate the sea (physical parents) and "country" to designate the country (administrative parent). So we also use Geonames, which has a uniform property gn:parentFeature, after making some fixes to the top levels of that hierarchy.

(Potentially) Ethnic groups, cultures, styles and periods. We are currently building a comprehensive LOD dataset from the AAT Periods and Styles facet, the British Museum ethnic group thesaurus, DBpedia and Wikipedia.

Semantic Application

The semapp specification [Alexiev 2015b] describes a number of end-user applications that can be created over semantically enriched data. We implemented a “mini-Europeana” including the following:

Full-text search: similar to Europeana FTS, but indexes only Title and Description. Currently Europeana sometimes also indexes Narrower terms, which lowers precision significantly.

Semantic facets: in addition to the flat Europeana facets (provider, dataProvider, country, type) we created semantic facets for FD topics and places (and may include culture/style). This enables semantic browsing using the FD and/or place hierarchies. Each facet shows the number of CHOs matching the current search. This number is aggregated across the poly-hierarchy: if a CHO has several paths to an ancestor, it will be counted once.

A responsive UI design that works on both large screens and tablets.
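The facet counting rule described above (a CHO reached through several paths is counted once per ancestor) can be sketched as follows. The tiny category hierarchy, the CHO identifiers and the function names are all hypothetical; the actual semapp computes these counts in its own backend, which is not shown here.

```python
from collections import defaultdict

# Hypothetical poly-hierarchy: child category -> list of parent categories.
PARENTS = {
    "Beer": ["Fermented_beverages", "Alcoholic_drinks"],
    "Fermented_beverages": ["Beverages"],
    "Alcoholic_drinks": ["Beverages"],
    "Beverages": [],
}

# Hypothetical enrichments: CHO id -> categories assigned directly to it.
CHO_TAGS = {
    "cho1": ["Beer"],
    "cho2": ["Fermented_beverages"],
    "cho3": ["Beer", "Alcoholic_drinks"],
}


def ancestors_and_self(node, parents):
    """Return the node plus everything reachable by following parent links."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def facet_counts(cho_tags, parents):
    """Count distinct CHOs under each category, however many paths lead there."""
    members = defaultdict(set)
    for cho, tags in cho_tags.items():
        for tag in tags:
            for node in ancestors_and_self(tag, parents):
                members[node].add(cho)  # a set, so each CHO counts once
    return {node: len(chos) for node, chos in members.items()}


counts = facet_counts(CHO_TAGS, PARENTS)
print(counts)
```

Collecting CHO identifiers in a set per category, rather than summing child counts, is what prevents "cho1" from being counted twice under "Beverages" even though "Beer" reaches it by two paths.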

Semapp Resources

Data available at http://efd.ontotext.com/data

Data can be queried from SPARQL endpoint at http://efd.ontotext.com/sparql

The semantic application demo is available at http://efd.ontotext.com/app

The rest of this section shows some screenshots and interesting queries.

Bulgarian Traditional Recipes

http://efd.ontotext.com/app/search?query=&limit=24&offset=0&dataProvider=recepti.gotvach.bg

Objects From Oceania

http://efd.ontotext.com/app/search?query=&limit=24&offset=0&place=Oceania

Object Detailed View

See this link. The detailed view implements an image carousel (the above has 2 images) that starts animating after a while or can be activated by the user. It also allows access to the object on the provider's site, and in the semantic repository (CHO and Aggregation data).

Objects Related to Fermented Beverages and Asia

http://efd.ontotext.com/app/search?query=&limit=24&offset=0&category=Fermented_beverages&place=Asia

The third object is from the Roman Empire. It appears because we've added multiple parent statements (after all that empire did straddle 3 continents):

dbr:Roman_Empire gn:parentFeature dbr:Europe, dbr:Africa, dbr:Asia.

Objects from Alinari related to the Roman Empire and Beverages

http://efd.ontotext.com/app/search?query=&limit=24&offset=0&dataProvider=Alinari&place=Roman_Empire&category=Beverages

This illustrates high-precision semantic search.

Objects Related to Palms

http://efd.ontotext.com/app/search?query=&limit=24&article=Arecaceae

Includes both things made of palm, and a palm wine tapping knife.
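The search URLs in this section all follow the same pattern, so they can be generated programmatically. The helper below is a sketch: the function name `search_url` is hypothetical, and the parameter names (query, limit, offset, dataProvider, place, category, article) are taken only from the example URLs above, not from any published API description.

```python
from urllib.parse import urlencode

BASE_URL = "http://efd.ontotext.com/app/search"


def search_url(query="", limit=24, offset=0, **facets):
    """Build a semapp search URL from a text query and facet selections."""
    params = {"query": query, "limit": limit, "offset": offset}
    params.update(facets)  # e.g. place="Asia", category="Beverages"
    return BASE_URL + "?" + urlencode(params)


url = search_url(category="Fermented_beverages", place="Asia")
print(url)
```

Combining several facets in one call, as in the Alinari example, narrows the result set by all of them at once.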

Potential Extensions

Depending on priorities, we may develop additional enrichments and functionalities. Some ideas are described below.

Mobile Application

We could implement a mobile version of the application, optimized for smart phones and tablets.

Semantic Discovery

Once an FD classification is built, it is possible to discover CHOs related to the topic that are already in Europeana. The number of such objects is much greater than the number to be provided by the EFD project (hundreds of thousands rather than tens of thousands).

As a first use case, we are discovering objects related to Tea. Category:Tea includes 658 articles with 2377 labels (page titles and redirects). We hope to extract over 10k Europeana CHOs using these labels and one of two approaches:

Using the Europeana API at http://europeana.eu. However, for some datasets it returns enrichments over Narrower terms that are irrelevant, and sometimes times out after 500 objects.

Using the full-text search at http://europeana.ontotext.com/sparql. It offers two indexes luc:title (dc:title and dc:alternate) and luc:full (dc:title, dc:alternate, dc:description). Since Ontotext maintains this server, we have better control over its performance.

Then we'll perform semantic enrichment over these CHOs to confirm whether they indeed include Tea-related items, and to extract additional semantic concepts. E.g.:

Arare is a Japanese tea cracker. Its labels give the query "Arare" OR "Kaki mochi" OR "Kakemochi" OR "Mochi crunch" OR "Kakimochi" OR "Norimaki arare" OR "Hurricane popcorn".

If you search with these labels, you find a lot of ploughs from Italy. These have the title "aratro" and a description similar to "arare e rivoltare le zolle per la preparazione del terreno alla coltura" (in Italian, "arare" means "to plough").

FD Classifier

We used some machine learning techniques to create an FD Classifier. This module can predict whether an object is FD-related or not by looking at the metadata text of the object. The prediction is based on the Wikipedia text of FD-related articles.

The available labelled data consists of 4330 positive examples (articles used to tag Horniman objects), 106k maybes (all other articles in the FD hierarchy) and 3.6M negatives (articles outside the FD hierarchy). The model was trained using all positive examples and a random sub-sample of 5000 negatives. We should include more articles as positive examples, e.g. by leveraging other LOD datasets that evidence FD relevance (see sec 2.4).

The most informative features (word stems with largest weights in the model) are as follows: food, fish, cook, cake, agricultur, tree, bread, sweet, type, milk, plant, tradit, dish, common, sugar, shape, cuisin, drink, rice, edibl, coffe, water, fruit, perenni, nativ, popular, tea, hunt, dessert
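To make the training setup concrete, here is a toy sketch of a word-weight classifier trained on all positives and a random sub-sample of the negatives. It is not the project's model: the text above does not name the algorithm, so a simple add-one-smoothed log-odds score (naive Bayes style, without class priors or stemming) is swapped in, and the six training texts and all function names are made up for illustration.

```python
import math
import random
from collections import Counter


def tokens(text):
    """Very rough tokenizer; the real system uses word stems."""
    return text.lower().split()


def train(positives, negatives, sample_size, seed=0):
    """Learn one weight per word from all positives and sampled negatives."""
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(sample_size, len(negatives)))
    pos_counts = Counter(t for doc in positives for t in tokens(doc))
    neg_counts = Counter(t for doc in sampled for t in tokens(doc))
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + len(vocab)
    neg_total = sum(neg_counts.values()) + len(vocab)
    # Weight = smoothed log-odds that the word signals an FD-related text.
    return {
        w: math.log((pos_counts[w] + 1) / pos_total)
        - math.log((neg_counts[w] + 1) / neg_total)
        for w in vocab
    }


def is_fd_related(text, weights):
    """Predict FD relevance: unknown words contribute nothing to the score."""
    return sum(weights.get(t, 0.0) for t in tokens(text)) > 0


positives = ["bread milk sugar cake", "tea coffee drink", "fish rice dish"]
negatives = ["engine wheel steel", "violin music concert",
             "war treaty army", "bridge river stone"]
weights = train(positives, negatives, sample_size=3)

# Words with the largest weights play the role of "most informative features".
top_words = sorted(weights, key=weights.get, reverse=True)[:5]
print(top_words)
print(is_fd_related("milk tea cake", weights))
print(is_fd_related("steel engine", weights))
```

Sub-sampling the negatives keeps the two classes at a comparable size; with 4330 positives against 3.6M negatives, training on everything would let the negative class dominate.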

Once fine-tuned, this classifier can be a very promising module for Europeana Discovery.

Rather than making queries using specific keywords, we can run it through all Europeana CHOs, predicting which are FD-relevant.

Because there are 43M CHOs, speed is a concern. But extracting features from CHOs is fast because they are small, and prediction for a new case takes a fraction of a second.

Then we will run semantic enrichment over the positively predicted objects to find out why they are relevant.

Enrichment Web Service

ONTO will establish a web service to provide automatic enrichment suggestions to the Crowdsourcing Enrichment application to be developed by EFD. The service will:

Perform semantic enrichment of FD topics and Places to suggest automatic enrichments to curators (people from the content partners)

Provide interactive search with auto-completion for the same categories of data, to enable curators to select tags themselves.

Important issues to be tackled for this service are:

Availability, performance, monitoring

Fine tuning of the enrichment process, leveraging all manual work done to date

New Language for Enrichment

It would be very useful to extend semantic enrichment to another language (in addition to English), in order to increase the number of collections that it can handle. The considerations for selecting a new language to handle are:

Size and importance of collection data from EFD content partners

The richness of the respective national-language Wikipedia

The availability of language-specific NLP resources

ONTO's experience with the language

Geographical Mapping

Given the Place enrichments, we could implement a Geographical Map tab, in addition to the existing lightbox (thumbnail grid). It will involve the following tasks:

Eliminate superfluous ancestor places. E.g. if a CHO is tagged with Rome and Italy, we want to remove the parent place, else the same CHO will appear with two different markers on the map.

Complement with ancestors that have coordinates. If a CHO is marked with "Fleet Street" and neither GeoNames nor DBpedia has coordinates for it, we need to add its lowest ancestor with coordinates (in this case, "City of London" and not "London", which is a larger area).

Average coordinate values. Experience shows that many places have several coordinate pairs that differ by little. We need to consider that in the query, and average the coordinates of the same place, to ensure one marker per place.

Add coordinates to the semapp backend (API). An important consideration is the total response size, and whether we need to limit it somehow.

Implement geo map display in the semapp frontend. We'd like to use the "marker clusterer" library, which can display many thousands of places by using spots with the number of markers, which upon zoom are split into more fine-granularity spots.
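The first three data-preparation tasks above can be sketched as follows. Everything in this block is an assumption made for illustration: the place table, the coordinate values and the function names are invented, and each place is given a single parent for simplicity, whereas the real hierarchy allows several.

```python
# Hypothetical place data: one parent per place (a simplification).
PARENT = {
    "Fleet Street": "City of London",
    "City of London": "London",
    "London": "England",
    "England": None,
    "Rome": "Italy",
    "Italy": None,
}

# Hypothetical (lat, lon) pairs; some places have several near-identical pairs.
COORDS = {
    "City of London": [(51.5155, -0.0922), (51.5157, -0.0920)],
    "London": [(51.5074, -0.1278)],
    "Rome": [(41.9, 12.5)],
    "Italy": [(42.8, 12.8)],
}


def ancestors(place):
    """List the ancestors of a place, nearest first."""
    result = []
    current = PARENT.get(place)
    while current is not None:
        result.append(current)
        current = PARENT.get(current)
    return result


def drop_superfluous_ancestors(places):
    """Remove any place that is an ancestor of another place in the list."""
    redundant = {a for p in places for a in ancestors(p)}
    return [p for p in places if p not in redundant]


def nearest_with_coords(place):
    """Return the place itself or its lowest ancestor that has coordinates."""
    for candidate in [place] + ancestors(place):
        if candidate in COORDS:
            return candidate
    return None


def average_coords(place):
    """Average all coordinate pairs of a place into a single marker position."""
    pairs = COORDS[place]
    lat = sum(p[0] for p in pairs) / len(pairs)
    lon = sum(p[1] for p in pairs) / len(pairs)
    return (lat, lon)


def markers(places):
    """Compute one marker per place for the places a CHO is tagged with."""
    result = {}
    for place in drop_superfluous_ancestors(places):
        located = nearest_with_coords(place)
        if located is not None:
            result[located] = average_coords(located)
    return result


print(markers(["Rome", "Italy"]))
print(markers(["Fleet Street"]))
```

Keying the result by the located place ensures one marker per place even when two tags of the same CHO fall back to the same ancestor.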

Other Visualizations

Other potentially useful visualizations of the semantic information are shown below.

Classification Tree

Shows the distribution of CHOs along the EFD classification, which comprises Wikipedia articles and categories.

Coverage Map

Shows a topical grouping of the EFD classification. This will be useful if actual CHOs are used to seed the map.

Radial Dendrogram

Shows the EFD categories that include CHOs in a compact format.

Treemap

Shows the count of CHOs in various categories as the respective coloured area. One can click on a cell to make it the new root of the visualization.

Sunburst

Shows the nesting of categories and sub-categories.

Timeline

Shows a selection of CHOs (e.g. within a specific category), arranged along a timeline. For this to work, we need reliable year extraction from CHO metadata, and we'll try to reuse the respective Europeana source code.
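As a rough illustration of what year extraction involves (this is not the Europeana source code mentioned above; the pattern, its 1000 to 2099 range and the function name are assumptions), a regular expression can pick standalone four-digit years out of free-text metadata:

```python
import re

# Four-digit years from 1000 to 2099 that are not part of a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")


def extract_years(text):
    """Return all standalone four-digit years found in a metadata string."""
    return [int(match) for match in YEAR_PATTERN.findall(text)]


print(extract_years("Letter, 1916; printed ca. 1920"))
print(extract_years("inv. no. 123456"))
```

A pattern like this misses dates such as "19th century" or "ca. 500 BC" and can mistake other four-digit numbers for years, which is why reliable extraction needs more than a single regular expression.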

Above we show the Histropedia timeline, which enables highly aesthetic arrangements. (Currently it works on Wikidata/Wikipedia and would need to be adapted to work on CHOs.)

References

[Alexiev 2015a] Vladimir Alexiev. Europeana Food and Drink Classification Scheme. Deliverable D2.2, Europeana Food and Drink project, February 2015.

[Alexiev 2015b] Vladimir Alexiev. Europeana Food and Drink Semantic Demonstrator Specification. Deliverable D3.19, Europeana Food and Drink project, March 2015.

[Alexiev 2015c] Vladimir Alexiev. Europeana Food and Drink Classification Scheme. In Europeana Food and Drink annual meeting, Athens, Greece, March 2015.

[Alexiev 2015d] Vladimir Alexiev. GLAMs Working with Wikidata. In Europeana Food and Drink content provider workshop, Athens, Greece, May 2015.

[Alexiev 2015e] Vladimir Alexiev. Europeana Food and Drink Semantic Demonstrator M18 Progress Report. Progress Report D3.20a, Europeana Food and Drink project, June 2015.

[Olensky 2012] Marlies Olensky, Juliane Stiller, Evelyn Dröge. Poisonous India or the Importance of a Semantic and Multilingual Enrichment Strategy. Metadata and Semantics Research 2012. CCIS Volume 343, 2012, pp 252-263, Springer.

[Tagarev 2015] Andrey Tagarev, Laura Tolosi, Vladimir Alexiev. Domain-specific modeling: Towards a Food and Drink Gazetteer. First International Keystone Conference, Coimbra, Portugal, Sep 2015, LNCS 9398.

[Tagarev 2016] Andrey Tagarev, Laura Tolosi, Vladimir Alexiev. Domain-specific modeling: Towards a Food and Drink Gazetteer. Extended version, to appear in LNCS Transactions on Computational Collective Intelligence.

For More Information

If you'd like to obtain more information or are interested in exploring semantic approaches for your CH collection, please contact vladimir.alexiev <at> ontotext.com.