New approaches for data acquisition at Europeana: IIIF, Sitemaps and Schema.org Valentine Charles and Nuno Freire Seminar Linked Data in Research and Cultural Heritage 1 May 2017

Upload: nuno-freire

Post on 21-Jan-2018


Page 1: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

New approaches for data acquisition at Europeana: IIIF, Sitemaps and Schema.org

Valentine Charles and Nuno Freire

Seminar Linked Data in Research and Cultural Heritage

1 May 2017

Page 2: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Europeana: The Platform for Europe’s Digital Cultural Heritage

● We aggregate (and make available) metadata:
• From all EU countries
• From ~3,500 galleries, libraries, archives and museums
• Under a CC0 licence
• For more than 54M objects
• In about 50 languages

“We transform the world with culture! We want to build on Europe’s rich heritage and make it easier for people to use, whether for work, for learning or just for fun.”

New approaches for data acquisition at Europeana | CC BY-SA

Page 3: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Czech Republic, PD

1887, Uměleckoprůmyslové museum v Praze

Preissig, Vojtech

Coloured etchings

Re-thinking data aggregation in Europeana

Page 4: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

A centralised approach to data aggregation

● An organisational rationale
• Data providers, aggregators and Europeana have defined roles

● A technical rationale
• Federated search had shown its limits in previous projects
• Choice of OAI-PMH as the core technological solution

● A data rationale
• Data aggregation focused
  • on metadata
  • on cultural objects as the main entity

Page 5: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

How to go from...

Page 6: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Europeana aggregation infrastructure | Europeana | CC BY-SA

...to

Page 7: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

What kind of technology(ies) are we considering?

● What are the successors of OAI-PMH?

● Technologies widely used by CH organizations for other purposes
• Search engine optimization
• Linked data
• Social web technologies

Page 8: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Cristallisation ou Mouvement du

temps, René Bord

1987, Bibliothèque Municipale De Lyon,

public domain

Investigated technologies:

IIIF and Sitemaps

Page 9: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

International Image Interoperability Framework (IIIF)

● Why IIIF?
• Immediate access to full, high-resolution imagery and multi-page documents is something all users want (whether casual or professional)
• Some users have specific needs and pain points
• It supports Europeana in shifting its focus towards content as well
  • Storing and serving digital media on behalf of partners is a major step towards an updated value proposition for both partners and users

Page 10: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

International Image Interoperability Framework (IIIF)

● How do we support IIIF?
• We have joined the IIIF community as a founding member!
• Within the IIIF community we are engaged in the Newspapers special interest group and in prototyping the use of IIIF in web discovery and metadata harvesting
• We work with the Europeana Network to encourage the use of IIIF
  • We have updated our Europeana Data Model and documentation to include instructions on how to provide IIIF images and manifests
• We support the idea of extending IIIF to other types of media, especially audio-visual

Page 11: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Cristallisation ou Mouvement du

temps, René Bord

1987, Bibliothèque Municipale De Lyon,

public domain

Investigated technologies:

IIIF and Sitemaps

Page 12: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Sitemaps

● Sitemaps allow webmasters to inform search engines about the pages on their sites that are available for crawling

● They are supported by
• all major search engines
• many content management systems
• many Europeana data providers

● They provide a simple technological solution with a very low implementation barrier

● They can support a wide range of resource types
• There are Sitemaps extensions for images and videos (by Google)

Page 13: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Sitemaps and Schema.org

● Sitemaps can be associated with structured data markup such as Schema.org
● Europeana has already developed EDM mappings to Schema.org
● We have also worked on a series of recommendations
• The URI for an object (http://data.europeana.eu/...) should differ from the URL of the page(s) that display information about that object (http://www.europeana.eu/portal/...).
• A sitemap should also include a reference to the publisher of the data (http://europeana.eu) and to provider pages that Europeana could publish in the future.

See more in the Code4Lib article “Recommendations for the application of Schema.org to aggregated Cultural Heritage metadata to increase relevance and visibility to search engines: the case of Europeana”.

Page 14: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Case studies

Netherlands, Public Domain

1910-1925, Rijksmuseum

Anonymous

Tak met vier mangolia’s

Page 15: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Partners

Europeana & IIIF | CC BY-SA

● To study the feasibility of performing metadata aggregation via IIIF/Sitemaps, we have undertaken several case studies in cooperation with data providers of the Europeana Network

• National Library of Wales
  • Very active in the IIIF community
  • Very advanced in IIIF implementation
  • Expertise in full-text content (over IIIF)

• University College Dublin
  • Very advanced in IIIF implementation
  • Expertise in search engine optimization (Sitemaps and its media-specific extensions)

Page 16: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Brief introduction to the IIIF APIs


How can IIIF be used for metadata aggregation?

Page 17: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Ben Albritton, Mike Appleby, Tom Cramer, Jon Stroop, Rob Sanderson, Stu Snydman, Simeon Warner | IIIF.io | @bla222 @mikeapps @tcramer @jpstroop @azaroth42 @stusnydman @zimeon @iiif_io

IIIF: Two Core APIs

• Image API: “get pixels” via a simple, RESTful web service
• Presentation API: just enough metadata to drive a remote viewing experience

Page 18: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017


Image Delivery API

http://iiif.io/api/image/2.0/
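To make the "get pixels" idea concrete, the following minimal sketch assembles a request URL from the path segments defined by the Image API 2.0 URI template ({identifier}/{region}/{size}/{rotation}/{quality}.{format}). The server name and the identifier are hypothetical placeholders, not endpoints used in the case studies.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Assemble an Image API 2.x request URL from its path segments."""
    # The identifier is percent-encoded so that characters such as "/"
    # or ":" inside it are not mistaken for URL syntax.
    encoded_id = quote(identifier, safe="")
    return (f"{base.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server and identifier, for illustration only.
full_image = iiif_image_url("https://images.example.org/iiif", "book1/page1")
# "200," asks the server for a rendition scaled to a width of 200 pixels.
thumbnail = iiif_image_url("https://images.example.org/iiif", "book1/page1",
                           size="200,")
print(full_image)
print(thumbnail)
```

Because every parameter lives in the URL path, a crawler or viewer needs nothing beyond an HTTP GET to obtain a given rendition.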

Page 19: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017


Object = Image + Presentation

Page 20: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Object = Image + Presentation

Presentation API (to be continued)
• Descriptive: label, description
• Rights: license, attribution

Image API
• Image data

Page 21: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Presentation API (continued)

• Structure
  • Collections of objects
  • Manifests organizing Items, Sequences, Parts together with their metadata

• Linking
  • service: additional service endpoint
  • related: resource to display to the user
  • seeAlso: semantic metadata resource
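For metadata aggregation, the properties of interest in a manifest are the descriptive and rights fields and the seeAlso link. The sketch below reads them from a trimmed Presentation API 2.x manifest; the manifest itself, its URIs and the helper names are hypothetical, and label handling is simplified to plain strings.

```python
import json

# A trimmed, hypothetical Presentation API 2.x manifest (not a real record).
MANIFEST_JSON = """
{
  "@context": "http://iiif.io/api/presentation/2/context.json",
  "@id": "https://iiif.example.org/item/1/manifest",
  "@type": "sc:Manifest",
  "label": "Example newspaper issue",
  "description": "An illustrative record",
  "license": "http://creativecommons.org/publicdomain/mark/1.0/",
  "attribution": "Example Library",
  "seeAlso": {
    "@id": "https://iiif.example.org/item/1/edm.xml",
    "format": "application/xml"
  }
}
"""


def as_list(value):
    """A property may hold one value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def link_id(value):
    """A link may be a bare URI string or an object carrying an "@id"."""
    return value if isinstance(value, str) else value.get("@id")


def summarise(manifest):
    """Collect the fields an aggregator would look at first."""
    return {
        "label": manifest.get("label"),
        "license": [link_id(v) for v in as_list(manifest.get("license"))],
        "see_also": [link_id(v) for v in as_list(manifest.get("seeAlso"))],
    }


summary = summarise(json.loads(MANIFEST_JSON))
print(summary)
```

An empty `see_also` or `license` list is the situation case study 1 reports as common: the properties exist in the specification but are optional.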


Page 23: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Case study 1: Crawling services across the IIIF universe

• Questions addressed, with findings:

• Can Europeana find the available IIIF services through IIIF Service Registries?
  ➢ Registries are available and machine readable, but coverage was only partial

• Is the output of IIIF crawlable? Can robots follow links in IIIF output and reach all resources?
  ➢ IIIF provides all that is necessary, but some features are optional (e.g. IIIF Collections)

• How mature and uniform are existing IIIF implementations?
  ➢ Only minor compliance problems, attributable to the immaturity of the implementations

• Is metadata available?
  ➢ IIIF provides a way to link to metadata, but it is optional (and often not used)

• Are machine-readable licenses available?
  ➢ IIIF provides licensing information, but it is optional (and often not used)

Page 24: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Case study 2: Crawling IIIF services via IIIF Collections

IIIF offers a Collection construct to represent groups of objects
• By making a IIIF collection available to Europeana, all the resources it references can be crawled and their metadata harvested
• Often available or simple to implement

The two data providers already had IIIF services in operation, but...
• No collection
• No metadata

==> Implementation of a IIIF collection was easily achieved in both cases.

We identified an additional issue for metadata aggregation: IIIF collections do not provide the modification timestamp of resources. To overcome it, other technologies may be used in conjunction with IIIF.
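Crawling via a IIIF Collection amounts to a recursive walk over its child collections and manifests. The sketch below assumes Presentation API 2.x property names ("collections", "manifests", and the mixed "members" list) and uses an in-memory dictionary with hypothetical URIs in place of real HTTP requests; it is an illustration, not the crawler used in the case study.

```python
def walk_collection(collection_uri, fetch, seen=None):
    """Yield the URI of every manifest reachable from a IIIF collection."""
    seen = set() if seen is None else seen
    if collection_uri in seen:  # guard against cycles between collections
        return
    seen.add(collection_uri)
    doc = fetch(collection_uri)
    for entry in doc.get("manifests", []):
        yield entry["@id"]
    for entry in doc.get("collections", []):
        yield from walk_collection(entry["@id"], fetch, seen)
    # "members" may mix collections and manifests in one list.
    for entry in doc.get("members", []):
        if entry.get("@type") == "sc:Collection":
            yield from walk_collection(entry["@id"], fetch, seen)
        elif entry.get("@type") == "sc:Manifest":
            yield entry["@id"]


# In-memory stand-in for HTTP: hypothetical URIs mapped to parsed JSON.
FAKE_WEB = {
    "https://iiif.example.org/collection/top": {
        "@type": "sc:Collection",
        "collections": [{"@id": "https://iiif.example.org/collection/maps",
                         "@type": "sc:Collection"}],
        "manifests": [{"@id": "https://iiif.example.org/item/1/manifest",
                       "@type": "sc:Manifest"}],
    },
    "https://iiif.example.org/collection/maps": {
        "@type": "sc:Collection",
        "members": [{"@id": "https://iiif.example.org/item/2/manifest",
                     "@type": "sc:Manifest"}],
    },
}

manifests = list(walk_collection("https://iiif.example.org/collection/top",
                                 FAKE_WEB.__getitem__))
print(manifests)
```

Note that nothing in the walk tells the crawler which manifests changed since the last visit, which is the timestamp limitation described above: every manifest has to be fetched again.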

Page 25: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Case study 3: Crawling IIIF services via Sitemaps

• Aggregation using Sitemaps can be more efficient than crawling IIIF Collections alone
• Resource timestamps can be included in a Sitemap
• Three possible ways of using Sitemaps were tested:
  • Standard Sitemaps
  • Sitemaps extended with elements used in IIIF specifications
  • Sitemaps extended with elements from the ResourceSync namespace

Page 26: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Case study 3: Crawling IIIF services via Sitemaps

<url>
  <loc>https://data.ucd.ie/api/img/collection/ivrla:3573</loc>
  <lastmod>2014-08-24T04:09:09.716Z</lastmod>
</url>

Example of URL data in a Sitemap from University College Dublin. The loc element references a IIIF Manifest.
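The lastmod element is what makes incremental harvesting possible. As a minimal sketch, the code below parses a standard Sitemap and keeps only the entries modified after the previous crawl. The enclosing urlset, the second entry and the crawl date are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://data.ucd.ie/api/img/collection/ivrla:3573</loc>
    <lastmod>2014-08-24T04:09:09.716Z</lastmod>
  </url>
  <!-- hypothetical older entry, added for illustration -->
  <url>
    <loc>https://iiif.example.org/item/old/manifest</loc>
    <lastmod>2013-01-15</lastmod>
  </url>
</urlset>
"""


def parse_lastmod(text):
    """Parse the two timestamp forms used above (date, or date-time with Z)."""
    parsed = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    # Date-only values carry no zone; treat them as UTC midnight.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def changed_since(sitemap_xml, last_crawl):
    """Yield the loc of every entry modified after the last crawl."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    for url in root.iter(SM + "url"):
        loc = url.findtext(SM + "loc")
        lastmod = url.findtext(SM + "lastmod")
        # Without a timestamp we cannot tell, so re-harvest to be safe.
        if lastmod is None or parse_lastmod(lastmod) > last_crawl:
            yield loc.strip()


# Hypothetical date of the previous crawl.
to_harvest = list(changed_since(SITEMAP,
                                datetime(2014, 1, 1, tzinfo=timezone.utc)))
print(to_harvest)
```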

Page 27: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Case study 3: Crawling IIIF services via Sitemaps

<url>
  <loc>http://newspapers.library.wales/view/3679651</loc>
  <iiif:Manifest xmlns:iiif="http://iiif.io/api/presentation/2/">http://dams.llgc.org.uk/iiif/newspaper/issue/3679651/manifest.json</iiif:Manifest>
  <dcterms:isPartOf>http://dams.llgc.org.uk/iiif/newspapers/3679650.json</dcterms:isPartOf>
  <lastmod>2014-11-08</lastmod>
</url>

Example of URL data in a Sitemap from the National Library of Wales, with references to the webpage of the resource, the IIIF Manifest and its IIIF Collection.

Page 28: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Case study 3: Crawling IIIF services via Sitemaps

<url>
  <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
  <rs:ln rel="alternate" href="https://data.ucd.ie/api/img/manifests/ucdlib:38491"
         type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
  <rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488" type="application/json"
         dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
  <lastmod>2014-08-24T04:09:09.716Z</lastmod>
</url>

Example of URL data in a Sitemap from University College Dublin, with references to the webpage of the resource, the IIIF Manifest and its IIIF Collection, and an indication of the IIIF API version in use.
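A crawler consuming the ResourceSync-extended form has to pick out the rs:ln link that points at the IIIF Manifest. The sketch below does so by matching rel="alternate" together with a dcterms:conformsTo value in the IIIF Presentation API namespace. The enclosing urlset and its namespace declarations are assumptions (the slide shows only the url element), and the selection rule is one plausible reading of the example rather than an agreed convention.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}
# ElementTree exposes namespaced attributes as "{namespace-uri}localname".
DCTERMS_CONFORMS_TO = "{http://purl.org/dc/terms/}conformsTo"

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <url>
    <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
    <rs:ln rel="alternate" href="https://data.ucd.ie/api/img/manifests/ucdlib:38491"
           type="application/json"
           dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
    <lastmod>2014-08-24T04:09:09.716Z</lastmod>
  </url>
</urlset>"""


def iiif_manifests(sitemap_xml):
    """Map each page URL to the IIIF manifest it advertises, if any."""
    found = {}
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", NS):
        page = url.findtext("sm:loc", namespaces=NS)
        for link in url.findall("rs:ln", NS):
            conforms = link.get(DCTERMS_CONFORMS_TO, "")
            if link.get("rel") == "alternate" and conforms.startswith(
                    "http://iiif.io/api/presentation/"):
                found[page] = link.get("href")
    return found


manifest_by_page = iiif_manifests(SITEMAP)
print(manifest_by_page)
```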

Page 29: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Case study 4: Crawling IIIF services via IIIF Collections and HTTP cache headers

• Addresses the inefficiency of crawling IIIF Collections by using HTTP cache control
• The IIIF service must implement some HTTP cache headers for the URLs that provide access to the IIIF resources
• When resources have not changed, the IIIF service saves time and processing
• Crawling time was reduced by ~50%

➢ In every request for a IIIF manifest, the IIIF crawler includes the HTTP header If-Modified-Since, set to the timestamp of the last crawl
➢ The IIIF service only needs to send the IIIF manifest if it has been updated since then
➢ In case of deletion, the IIIF service returns a response with the HTTP status code 404 Not Found
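The crawler side of this exchange can be sketched in two small functions: one formats the last crawl time as an HTTP date for the If-Modified-Since header, the other maps the response status to an action. The 304 Not Modified branch is standard HTTP behaviour for conditional requests rather than something stated on the slide, and the action names and the fallback "retry" policy are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(last_crawl):
    """Build the request headers for a conditional GET of a manifest."""
    # HTTP dates are expressed in GMT, so normalise the timestamp first.
    as_utc = last_crawl.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}


def action_for(status):
    """Decide what the crawler does with a manifest given the HTTP status."""
    if status == 200:
        return "update"   # manifest changed: re-harvest it
    if status == 304:
        return "skip"     # Not Modified: keep the stored copy
    if status == 404:
        return "delete"   # resource was removed at the source
    return "retry"        # anything else: hypothetical policy, try later


# Hypothetical timestamp of the previous crawl (must be timezone-aware).
headers = conditional_headers(
    datetime(2014, 8, 24, 4, 9, 9, tzinfo=timezone.utc))
print(headers)
print(action_for(304))
```

The saving comes from the 304 case: the service answers with headers only, so neither side has to serialise, transfer or parse an unchanged manifest.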

Page 30: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Case study 5: Crawling resources and metadata referenced by the Sitemaps Video and Image extensions

• Google has defined Sitemaps extensions for images and videos
• Just like search engines, Europeana may reuse the media-specific metadata, however:
  • From Europeana’s metadata aggregation perspective, the main issue is that this metadata does not fulfil its data quality requirements
  • The solution adopted with University College Dublin was to further extend the Video Sitemaps with elements from ResourceSync that allow for the association of the EDM metadata

Page 31: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Example of URL data using the Sitemaps Video extension from University College Dublin. The Sitemap was extended for association of EDM metadata.

<url>
  <loc>https://digital.ucd.ie/view/ucdlib:38509</loc>
  <rs:ln rel="describedby" href="https://data.ucd.ie/api/edm/v1/ucdlib:38509"
         dcterms:conformsTo="http://www.europeana.eu/schemas/edm/"/>
  <rs:ln rel="collection" href="https://data.ucd.ie/api/img/collection/ucdlib:38488"/>
  <video:video>
    <video:thumbnail_loc>https://digital.ucd.ie/get/ucdlib:38509/thumbnail</video:thumbnail_loc>
    <video:description>Irish poet Catherine Ann Cullen reads her poem 'Meeting at the Chester Beatty' in UCD Library's Special Collections.</video:description>
    <video:player_loc allow_embed="yes">https://player.vimeo.com/video/111413587</video:player_loc>
    <video:duration>00:02:51.04</video:duration>
    <video:family_friendly>yes</video:family_friendly>
    <video:live>no</video:live>
  </video:video>
  <lastmod>2015-09-10T17:14:26.523Z</lastmod>
</url>
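From the aggregator's side, an entry like this yields two harvesting targets: the EDM record behind the rs:ln link with rel="describedby", and the media location from the video extension. The sketch below extracts both from a shortened copy of the entry; the enclosing urlset, its namespace declarations and the function name are assumptions made so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
    "video": "http://www.google.com/schemas/sitemap-video/1.1",
}

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <url>
    <loc>https://digital.ucd.ie/view/ucdlib:38509</loc>
    <rs:ln rel="describedby" href="https://data.ucd.ie/api/edm/v1/ucdlib:38509"
           dcterms:conformsTo="http://www.europeana.eu/schemas/edm/"/>
    <video:video>
      <video:player_loc allow_embed="yes">
        https://player.vimeo.com/video/111413587
      </video:player_loc>
    </video:video>
  </url>
</urlset>"""


def harvest_targets(sitemap_xml):
    """Pair each page with its EDM record and, if present, its video player."""
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        edm = [ln.get("href") for ln in url.findall("rs:ln", NS)
               if ln.get("rel") == "describedby"]
        player = url.findtext("video:video/video:player_loc", default="",
                              namespaces=NS).strip()
        yield {
            "page": url.findtext("sm:loc", namespaces=NS),
            "edm": edm[0] if edm else None,
            "player": player or None,
        }


targets = list(harvest_targets(SITEMAP))
print(targets)
```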

Page 32: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Main conclusions from the case studies

• Applying these technologies was straightforward for the providers
  • In-house knowledge is a great advantage
• None of the case studies presented serious technological obstacles
• Very simple technological solutions are available
• Only very large collections may require additional complexity
• The main challenge is to choose among the several possibilities and to establish a standard (or best practice) within the community(ies):
  • Europeana is working with the IIIF community in the context of the IIIF Discovery Technical Specification group
  • Europeana will prepare recommendations targeted at its own partner network

Page 33: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Future work

France, Public Domain

Agence Rol. Agence photographique, Bibliothèque nationale de France

Chat "regardant" à travers une longue-vue et autre chat perché dessus

Page 34: New approaches for data acquisition at europeana  iiif, sitemaps and schema.org, dans seminar, 2017

Future work

• More case studies in preparation:

• Crawling websites/LOD in search of resources represented in Schema.org

• ResourceSync: one case study in preparation with a collection containing over 600 thousand resources

• Continue monitoring and investigating technology trends in our domain:

• Continue work on IIIF and Sitemaps

• The Linked Data Platform

• Notification based solutions:

• Linked Data Notifications

• Webmention
