new approaches for data acquisition at europeana iiif, sitemaps and schema.org, dans seminar, 2017
TRANSCRIPT
New approaches for data acquisition at Europeana: IIIF, Sitemaps and Schema.org
Valentine Charles and Nuno Freire
Seminar Linked Data in Research and Cultural Heritage
1 May 2017
Title hereCC BY-SA
Europeana: The Platform for Europe’s Digital Cultural Heritage
● We aggregate (and make available) metadata:
• From all EU countries
• From ~3,500 galleries, libraries, archives and museums
• Under a CC0 licence
• More than 54M objects
• In about 50 languages
“We transform the world with culture! We
want to build on Europe’s rich heritage and
make it easier for people to use, whether
for work, for learning or just for fun.”
New approaches for data acquisition at EuropeanaCC BY-SA
Czech Republic, PD
1887, Uměleckoprůmyslové museum v Praze
Preissig, Vojtech
Coloured etchings
Re-thinking data aggregation in Europeana
● Organisational rationale
• Data providers, Aggregators, Europeana have defined roles
● A technical rationale
• Federated search had shown its limits in previous projects
• Choice of OAI-PMH as the core technological solution
● A data rationale
• Data aggregation focused
• on metadata
• on cultural objects as the main entity
A centralised approach to data aggregation
How to go from...
Europeana aggregation infrastructure | Europeana | CC BY-SA
...to
What kind of technology(ies) are we considering?
● What are the successors of OAI-PMH?
● Technologies widely used by CH organizations for other purposes
• Search engine optimization
• Linked data
• Social web technologies
Cristallisation ou Mouvement du
temps, René Bord
1987, Bibliothèque Municipale De Lyon,
public domain
Investigated technologies:
IIIF and Sitemaps
International Image Interoperability Framework (IIIF)
● Why IIIF?
• Immediate access to full, high-resolution imagery and multi-page documents is something all users want (whether casual or professional)
• Some users have specific needs and pain points
• It supports Europeana in shifting its focus to content too
• Storing and serving digital media, on behalf of partners, is a major step towards an updated value proposition to both partners and users
International Image Interoperability Framework (IIIF)
● How do we support IIIF?
• We have joined the IIIF community as a founding member!
• Within the IIIF community we are engaged in the Newspapers special interest group and in prototyping the use of IIIF in web discovery and metadata harvesting
• We work with the Europeana Network to encourage the use of IIIF
• We have updated our Europeana Data Model and documentation to include instructions on how to provide IIIF images and manifests
• And we support the idea of extending IIIF to other types of media, especially audio-visual.
Sitemaps
● Sitemaps allow webmasters to inform search engines about pages on their sites that are available for crawling
● They are supported by
• all major search engines
• many content management systems
• many Europeana data providers
● They provide a simple technological solution with a very low implementation barrier
● They can support a wide range of resource types
• There are Sitemaps extensions for images and videos (by Google)
Sitemaps and Schema.org
● Sitemaps can be associated with structured data markup such as Schema.org
● Europeana has already developed EDM mappings to Schema.org
● We have also worked on a series of recommendations
• The URI for an object (http://data.europeana.eu/...) should differ from the URL of the page(s) that display information about that object (http://www.europeana.eu/portal/...).
• A sitemap should also include a reference to the publisher of the data (http://europeana.eu) and to provider pages that Europeana could publish in the future.
See more in the Code4Lib article "Recommendations for the application of Schema.org to aggregated Cultural Heritage metadata to increase relevance and visibility to search engines: the case of Europeana"
Case studies
Netherlands, Public Domain
1910-1925, Rijksmuseum
Anonymous
Tak met vier magnolia’s
Partners
Europeana & IIIFCC BY-SA
● To study the feasibility of performing metadata aggregation via IIIF/Sitemaps, we have undertaken several case studies, in cooperation with data providers of the Europeana Network
• National Library of Wales: very active in the IIIF community; very advanced in IIIF implementation; expertise in full-text content (over IIIF)
• University College Dublin: very advanced in IIIF implementation; expertise in search engine optimization (Sitemaps and its media-specific extensions)
Brief introduction to the IIIF APIs
How can IIIF be used for metadata aggregation?
Ben Albritton Mike Appleby Tom Cramer Jon Stroop Rob Sanderson Stu Snydman Simeon Warner IIIF.io@bla222 @mikeapps @tcramer @jpstroop @azaroth42 @stusnydman @zimeon @iiif_io
IIIF: Two Core APIs
• Image API: “get pixels” via a simple, RESTful, web service
• Presentation API: just enough metadata to drive a remote viewing experience
Image Delivery API
http://iiif.io/api/image/2.0/
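To make the "get pixels" idea concrete, the following is a minimal sketch of how an Image API 2.x request URL is assembled from its path parameters (identifier, region, size, rotation, quality, format). The base URL and identifiers are hypothetical placeholders on example.org, and the function names are introduced here for illustration only.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an Image API 2.x request URL from its path parameters."""
    # Identifiers may contain reserved characters such as "/",
    # so they are percent-encoded before being placed in the path.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


def iiif_info_url(base, identifier):
    """Build the URL of the image information document (info.json)."""
    return f"{base.rstrip('/')}/{quote(identifier, safe='')}/info.json"


# Hypothetical server and identifier, for illustration only.
full_image = iiif_image_url("https://example.org/iiif", "book1/page 1")
thumbnail = iiif_image_url("https://example.org/iiif/", "abc", size="200,")
```

A crawler would normally not construct these URLs by hand; it would read the image service base URL from a manifest and only fill in the region and size it needs.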
Object = Image + Presentation
Presentation API
• Descriptive: label, description
• Rights: license, attribution
(to be continued)
Image API
Image Data
Object = Image + Presentation
Presentation API (continued)
• Structure:
• Collections of objects
• Manifests organizing Items, Sequences, Parts together with their metadata
• Linking:
• service: additional service endpoint
• related: resource to display to the user
• seeAlso: semantic metadata resource
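As a rough illustration of what an aggregator can read from a manifest, the sketch below pulls the descriptive, rights and linking properties named above out of a Presentation API 2.x manifest that has already been parsed from JSON. The sample manifest is invented (example.org URIs) and the helper names are assumptions, not part of any IIIF library.

```python
def _as_list(value):
    """Normalise a property that may be absent, a single value or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _link_id(entry):
    """A link may be a bare URI string or an object carrying an "@id"."""
    return entry if isinstance(entry, str) else entry.get("@id")


def summarize_manifest(manifest):
    """Collect the properties most relevant to metadata aggregation."""
    return {
        "id": manifest.get("@id"),
        "label": manifest.get("label"),
        "license": _as_list(manifest.get("license")),
        "attribution": manifest.get("attribution"),
        # seeAlso points at machine-readable metadata about the object.
        "see_also": [_link_id(e) for e in _as_list(manifest.get("seeAlso"))],
    }


# Hypothetical manifest, reduced to the properties discussed here.
SAMPLE_MANIFEST = {
    "@id": "https://example.org/iiif/object1/manifest",
    "@type": "sc:Manifest",
    "label": "Example object",
    "license": "http://creativecommons.org/publicdomain/mark/1.0/",
    "attribution": "Example Library",
    "seeAlso": {"@id": "https://example.org/metadata/object1.xml",
                "format": "application/xml"},
}

summary = summarize_manifest(SAMPLE_MANIFEST)
```

Since both `license` and `seeAlso` are optional, a harvester has to treat empty lists as a normal outcome rather than an error.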
Case study 1:
Crawling services across the IIIF universe
• Questions addressed, with findings:
• Can Europeana find the available IIIF services through IIIF Service Registries?
➢ Registries are available and are machine readable, but coverage was only partial
• Is the output of IIIF crawlable? Can robots follow links in IIIF output and reach all resources?
➢ IIIF provides all that is necessary, but some features are optional (e.g. IIIF Collections)
• How mature and uniform are existing IIIF implementations?
➢ Minor compliance problems only, due to immaturity of the implementations
• Is metadata available?
➢ IIIF provides a way to link to metadata, but it is optional (and often not used)
• Are machine readable licenses available?
➢ IIIF provides licensing information, but it is optional (and often not used)
Case study 2:
Crawling IIIF services via IIIF Collections
IIIF offers a Collection construct to represent groups of objects
• By making a IIIF Collection available to Europeana, all the resources it references can be crawled and their metadata harvested
• Often available or simple to implement
The two data providers already had IIIF services in operation, but...
• No collection
• No metadata
➢ Implementation of a IIIF Collection was easily achieved in both cases.
We identified an additional issue for metadata aggregation: IIIF Collections do not provide the modification timestamp of resources.
To overcome this, other technologies may be used in conjunction with IIIF.
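The crawl described here can be sketched as a recursive walk over a Presentation API 2.x Collection that yields the URI of every Manifest it references. This is only an illustration under stated assumptions: `fetch` is a hypothetical callable that returns the parsed JSON for a URI (in practice an HTTP GET), and the sample documents are invented.

```python
def iter_manifest_uris(collection_uri, fetch, _seen=None):
    """Yield the URIs of all manifests reachable from a IIIF Collection."""
    seen = _seen if _seen is not None else set()
    if collection_uri in seen:
        return  # guard against collections that reference each other
    seen.add(collection_uri)
    collection = fetch(collection_uri)

    for manifest in collection.get("manifests", []):
        yield manifest["@id"]
    for sub in collection.get("collections", []):
        yield from iter_manifest_uris(sub["@id"], fetch, seen)
    # "members" may mix manifests and sub-collections in one list.
    for member in collection.get("members", []):
        if member.get("@type") == "sc:Manifest":
            yield member["@id"]
        elif member.get("@type") == "sc:Collection":
            yield from iter_manifest_uris(member["@id"], fetch, seen)


# Hypothetical in-memory "web" standing in for HTTP requests.
DOCS = {
    "https://example.org/top": {
        "@type": "sc:Collection",
        "manifests": [{"@id": "https://example.org/m1", "@type": "sc:Manifest"}],
        "collections": [{"@id": "https://example.org/sub", "@type": "sc:Collection"}],
    },
    "https://example.org/sub": {
        "@type": "sc:Collection",
        "members": [
            {"@id": "https://example.org/m2", "@type": "sc:Manifest"},
            {"@id": "https://example.org/top", "@type": "sc:Collection"},
        ],
    },
}

found = list(iter_manifest_uris("https://example.org/top", DOCS.__getitem__))
```

Note that nothing in this walk tells the crawler which manifests changed since the last visit, which is the limitation noted above.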
Case study 3:
Crawling IIIF services via Sitemaps
• Aggregation using Sitemaps can be more efficient
• Resource timestamps can be included in a Sitemap
• Three possible ways of using Sitemaps were tested:
• Standard Sitemaps
• Sitemaps extended with elements used in IIIF specifications
• Sitemaps extended with elements from the ResourceSync namespace
<url>
<loc>https://data.ucd.ie/api/img/collection/ivrla:3573</loc>
<lastmod>2014-08-24T04:09:09.716Z</lastmod>
</url>
Example of URL data in a Sitemap from University College Dublin. The loc element references a IIIF Manifest.
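A sketch of how the lastmod timestamp could drive incremental harvesting: the function below returns only the loc values modified after the previous crawl. The first sample entry is the one shown above; the other two entries (example.org) and the function names are hypothetical. Only full dates and full timestamps are handled, and date-only values are assumed to mean midnight UTC.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"  # standard Sitemaps namespace


def parse_lastmod(text):
    """Parse a date or timestamp <lastmod> value into an aware datetime."""
    # Older Python versions reject a trailing "Z", so spell out the offset.
    value = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    # Assumption: values without a time zone are treated as UTC.
    return value if value.tzinfo else value.replace(tzinfo=timezone.utc)


def changed_since(sitemap_xml, last_crawl):
    """Return the <loc> values whose <lastmod> is later than last_crawl."""
    changed = []
    for url in ET.fromstring(sitemap_xml).iter(SM + "url"):
        loc = url.findtext(SM + "loc")
        if loc is None:
            continue
        lastmod = url.findtext(SM + "lastmod")
        # Without a timestamp we cannot tell, so the resource is re-harvested.
        if lastmod is None or parse_lastmod(lastmod) > last_crawl:
            changed.append(loc.strip())
    return changed


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://data.ucd.ie/api/img/collection/ivrla:3573</loc>
    <lastmod>2014-08-24T04:09:09.716Z</lastmod>
  </url>
  <url>
    <loc>https://example.org/iiif/manifest/2</loc>
    <lastmod>2015-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.org/iiif/manifest/3</loc>
  </url>
</urlset>"""

to_harvest = changed_since(SAMPLE, datetime(2014, 9, 1, tzinfo=timezone.utc))
```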
<url>
<loc>http://newspapers.library.wales/view/3679651</loc>
<iiif:Manifest xmlns:iiif="http://iiif.io/api/presentation/2/">http://dams.llgc.org.uk/iiif/newspaper/issue/3679651/manifest.json</iiif:Manifest>
<dcterms:isPartOf>http://dams.llgc.org.uk/iiif/newspapers/3679650.json</dcterms:isPartOf>
<lastmod>2014-11-08</lastmod>
</url>
Example of URL data in a Sitemap from the National Library of Wales, with references to the webpage of the resource, the IIIF Manifest and its IIIF Collection.
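Reading such an extended entry mostly comes down to namespace-aware lookups. The sketch below assumes the `dcterms` prefix is bound to the Dublin Core terms namespace (the fragment above does not show its declaration) and wraps the entry in a `urlset` element so that it parses on its own; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "iiif": "http://iiif.io/api/presentation/2/",
    # Assumption: the usual Dublin Core terms namespace.
    "dcterms": "http://purl.org/dc/terms/",
}


def iiif_entries(sitemap_xml):
    """Return page, manifest, collection and lastmod for each <url> entry."""
    entries = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        entries.append({
            "page": url.findtext("sm:loc", namespaces=NS),
            "manifest": url.findtext("iiif:Manifest", namespaces=NS),
            "collection": url.findtext("dcterms:isPartOf", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
        })
    return entries


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:iiif="http://iiif.io/api/presentation/2/"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <url>
    <loc>http://newspapers.library.wales/view/3679651</loc>
    <iiif:Manifest>http://dams.llgc.org.uk/iiif/newspaper/issue/3679651/manifest.json</iiif:Manifest>
    <dcterms:isPartOf>http://dams.llgc.org.uk/iiif/newspapers/3679650.json</dcterms:isPartOf>
    <lastmod>2014-11-08</lastmod>
  </url>
</urlset>"""

entries = iiif_entries(SAMPLE)
```

Entries without the extension elements simply yield `None` for the manifest and collection, so a plain Sitemap can pass through the same code.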
<url>
<loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
<rs:ln rel="alternate" href="https://data.ucd.ie/api/img/manifests/ucdlib:38491" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
<rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488" type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
<lastmod>2014-08-24T04:09:09.716Z</lastmod>
</url>
Example of URL data in a Sitemap from University College Dublin, with references to the webpage of the resource, the IIIF Manifest and its IIIF Collection, and the indication of the IIIF API version in use.
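With the ResourceSync variant, the links live in attributes of `rs:ln` elements rather than in element text. A minimal sketch of grouping them by their `rel` value follows; it assumes the `rs` prefix is bound to the ResourceSync terms namespace and `dcterms` to the Dublin Core terms namespace (neither declaration is visible in the fragment), and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"   # assumed binding of "rs"
DCTERMS = "http://purl.org/dc/terms/"          # assumed binding of "dcterms"


def resource_links(url_element):
    """Group the rs:ln links of one <url> entry by their rel attribute."""
    links = {}
    for ln in url_element.findall(f"{{{RS}}}ln"):
        links.setdefault(ln.get("rel"), []).append({
            "href": ln.get("href"),
            "type": ln.get("type"),
            # Namespaced attributes are addressed as {namespace}name.
            "conforms_to": ln.get(f"{{{DCTERMS}}}conformsTo"),
        })
    return links


SAMPLE = f"""<urlset xmlns="{SM}" xmlns:rs="{RS}" xmlns:dcterms="{DCTERMS}">
  <url>
    <loc>https://digital.ucd.ie/view/ucdlib:38491</loc>
    <rs:ln rel="alternate" href="https://data.ucd.ie/api/img/manifests/ucdlib:38491"
           type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
    <rs:ln rel="collection" href="https://digital.ucd.ie/view/ucdlib:38488"
           type="application/json" dcterms:conformsTo="http://iiif.io/api/presentation/2.1/"/>
    <lastmod>2014-08-24T04:09:09.716Z</lastmod>
  </url>
</urlset>"""

first_url = ET.fromstring(SAMPLE).find(f"{{{SM}}}url")
links = resource_links(first_url)
```

The `conforms_to` value is what lets a crawler decide, before fetching anything, whether it understands the IIIF API version on offer.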
Case study 4:
Crawling IIIF services via IIIF Collections and HTTP cache headers
• Addressed the lack of efficiency of IIIF Collections, by using HTTP cache control
• The IIIF service is required to implement some HTTP cache headers for the URLs that provide access to the IIIF resources.
• When resources have not changed, the IIIF service saves time and processing
• Reduced crawling time by ~50%
➢ The IIIF crawler includes the HTTP header If-Modified-Since, with the timestamp of the last crawl, in all requests for IIIF manifests.
➢ The IIIF service only needs to send the IIIF manifest if an update has happened.
➢ In case of deletion, the IIIF service returns a response with the HTTP status code 404 Not Found.
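The exchange can be sketched with Python's standard library as follows. This is an illustration, not the crawler used in the case study: the function names are hypothetical, and it assumes the service answers an unchanged resource with the standard 304 Not Modified status.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(last_crawl):
    """Build the If-Modified-Since header from an aware datetime."""
    # HTTP dates are always expressed in GMT.
    stamp = format_datetime(last_crawl.astimezone(timezone.utc), usegmt=True)
    return {"If-Modified-Since": stamp}


def classify(status):
    """Map an HTTP status code to what the crawler should do."""
    if status == 200:
        return "updated"    # body contains the new manifest
    if status == 304:
        return "unchanged"  # nothing to download or reprocess
    if status == 404:
        return "deleted"
    return "error"


def check_manifest(url, last_crawl):
    """Request a manifest only if it changed since last_crawl."""
    request = urllib.request.Request(url, headers=conditional_headers(last_crawl))
    try:
        with urllib.request.urlopen(request) as response:
            return classify(response.status), response.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 and 404 as HTTPError rather than as a response.
        return classify(err.code), None


headers = conditional_headers(datetime(2017, 5, 1, 12, 0, tzinfo=timezone.utc))
```

The saving comes from the "unchanged" branch: neither side transfers or parses the manifest body.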
Case study 5:
Crawling resources and metadata referenced by Sitemaps Video and Image extensions
• Google has defined Sitemaps extensions for the retrieval of images and videos
• Just like search engines, Europeana may reuse the media-specific metadata, however:
• From Europeana’s metadata aggregation perspective, the main issue is that this metadata does not fulfil its data quality requirements
• The solution adopted with University College Dublin was to further extend the Video Sitemaps with elements from ResourceSync that allow for the association of the EDM metadata
Example of URL data using the Sitemaps Video extension from University
College Dublin. The Sitemap was extended for association of EDM metadata.
<url>
<loc>https://digital.ucd.ie/view/ucdlib:38509</loc>
<rs:ln rel="describedby" href="https://data.ucd.ie/api/edm/v1/ucdlib:38509"
dcterms:conformsTo="http://www.europeana.eu/schemas/edm/"/>
<rs:ln rel="collection" href="https://data.ucd.ie/api/img/collection/ucdlib:38488"/>
<video:video>
<video:thumbnail_loc>https://digital.ucd.ie/get/ucdlib:38509/thumbnail
</video:thumbnail_loc>
<video:description>Irish poet Catherine Ann Cullen reads her poem 'Meeting at the Chester Beatty' in UCD
Library's Special Collections.</video:description>
<video:player_loc allow_embed="yes">
https://player.vimeo.com/video/111413587
</video:player_loc>
<video:duration>00:02:51.04</video:duration>
<video:family_friendly>yes</video:family_friendly>
<video:live>no</video:live>
</video:video>
<lastmod>2015-09-10T17:14:26.523Z</lastmod>
</url>
Main conclusions from the case studies
• Applying these technologies was straightforward for the providers
• In-house knowledge is a great advantage
• None of the case studies presented serious technological obstacles
• Very simple technological solutions are available
• Only very large collections may require additional complexity
• ...the main challenge is to choose among the several possibilities and establish a standard (or best practice) within the community(ies):
• Europeana is working with the IIIF community in the context of the IIIF Discovery Technical Specification group
• Europeana will prepare recommendations targeted at its own partner network.
Future work
France, Public Domain
Agence Rol. Agence photographique, Bibliothèque nationale de France
Chat "regardant" à travers une longue-vue et autre chat perché dessus
Future work
• More case studies in preparation:
• Crawling websites/LOD in search for resources represented in Schema.org
• ResourceSync: one case study in preparation with a collection containing over 600 thousand resources
• Continue monitoring and investigating technology trends in our domain:
• Continue work on IIIF and Sitemaps
• The Linked Data Platform
• Notification based solutions:
• Linked Data Notifications
• Webmention
Updated February 2016