Harvesting & Metadata, Enrich Project, EVA 2009


Upload: icl-image-communication-laboratory

Post on 16-Apr-2017


Page 1: Harvesting&Metadata Enrich Project   EVA 2009

Harvesting & Metadata

Florence, April 30th 2009

Page 2: Harvesting&Metadata Enrich Project   EVA 2009

Harvesting & Metadata

The OAI-PMH Standard

Rudy Becarelli

Page 3: Harvesting&Metadata Enrich Project   EVA 2009

The Open Archives Initiative Mission

“The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. ...”

• The OAI approach:
– to enable access to Web-accessible material
– interoperable repositories for metadata sharing, publishing and archiving.

• Low-barrier interoperability framework to access digital materials.

The OAI-PMH Standard

Page 4: Harvesting&Metadata Enrich Project   EVA 2009

• The OAI Protocol for Metadata Harvesting (OAI-PMH):
– Simple technical option based on the open standards HTTP and XML.
– Any format of metadata.
– Unqualified Dublin Core is specified to provide a basic level of interoperability.

• Metadata from many sources can be gathered together in one database

• The link between metadata and the related content is not defined by the OAI protocol

• OAI-PMH makes it possible to bring the data together in one place. In order to provide services, the harvesting approach must be combined with other mechanisms
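As a rough illustration of gathering metadata from several sources into one database, the sketch below loads unqualified Dublin Core elements into a single SQLite table. The two payloads, the source names and the table layout are assumptions made up for this example; a real harvester would obtain the XML over OAI-PMH.

```python
import sqlite3
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
WRAPPER = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">{}</oai_dc:dc>'
)

# Hypothetical payloads standing in for two harvested repositories.
SOURCES = {
    "repo-a": WRAPPER.format("<dc:title>Codex A</dc:title><dc:creator>Anonymous</dc:creator>"),
    "repo-b": WRAPPER.format("<dc:title>Map B</dc:title><dc:date>1650</dc:date>"),
}


def load(conn, sources):
    """Store every Dublin Core element of every source in one table."""
    conn.execute("CREATE TABLE IF NOT EXISTS dc (source TEXT, element TEXT, value TEXT)")
    for source, xml_text in sources.items():
        for child in ET.fromstring(xml_text):
            if child.tag.startswith(DC):
                element = child.tag[len(DC):]
                value = (child.text or "").strip()
                conn.execute("INSERT INTO dc VALUES (?, ?, ?)", (source, element, value))
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, SOURCES)
titles = [row[0] for row in conn.execute(
    "SELECT value FROM dc WHERE element = 'title' ORDER BY value")]
```

Once the records sit in one table, a service such as search can be built on top of it without contacting the original repositories.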

The OAI-PMH Standard

Page 5: Harvesting&Metadata Enrich Project   EVA 2009

Resource: object the metadata are "about"

Item: component of a repository from which metadata about a resource can be disseminated; has a unique identifier

Record: metadata in a specific metadata format

Identifier: unique key for an item in a repository

Set: optional construct for grouping items in a repository
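The relationship between these terms can be sketched as a small data model. This is only one possible reading of the definitions above; the class names, field names and the sample identifier are illustrative assumptions, not part of the protocol.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """Metadata about a resource, expressed in one specific metadata format."""
    metadata_prefix: str  # e.g. "oai_dc"
    xml: str


@dataclass
class Item:
    """Repository component from which records about a resource are disseminated."""
    identifier: str                               # unique key within the repository
    sets: set = field(default_factory=set)        # optional grouping
    records: dict = field(default_factory=dict)   # metadata_prefix -> Record

    def add_record(self, record):
        self.records[record.metadata_prefix] = record

    def disseminate(self, metadata_prefix):
        return self.records[metadata_prefix]


# Hypothetical identifier and set name, for illustration only.
item = Item(identifier="oai:example.org:item-1", sets={"museums"})
item.add_record(Record("oai_dc", "<oai_dc:dc>...</oai_dc:dc>"))
record = item.disseminate("oai_dc")
```

One item can therefore yield several records, one per metadata format, all sharing the same identifier.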

The OAI-PMH Standard

Page 6: Harvesting&Metadata Enrich Project   EVA 2009

• Archive: a repository for stored information.

• Protocol: a set of rules defining communication between systems (HTTP, XML).

• Harvesting: refers specifically to the gathering together of metadata from a number of distributed repositories into a combined data store.

• Data Provider: maintains one or more repositories (web servers) that support the OAI-PMH as a means of exposing metadata (1).

• Service Provider: issues OAI-PMH requests to data providers and uses the metadata as a basis for building value-added services (1).

(1) OAI definition quoted from FAQ on OAI Web site

The OAI-PMH Standard

Page 7: Harvesting&Metadata Enrich Project   EVA 2009

• Data Providers (open archives, repositories) provide free access to metadata, and may, but do not necessarily, offer free access to full texts or other resources.

• Service Providers use the OAI interfaces of the Data Providers to harvest and store metadata.
– no live search requests to the Data Providers;
– services are based on the data harvested via OAI-PMH;
– may select certain subsets from Data Providers.
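Selecting a subset from a Data Provider is done through request arguments. The sketch below builds a ListRecords request URL with the standard OAI-PMH arguments `metadataPrefix`, `set` and `from`; the base URL and the set name are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build a selective-harvesting request for the ListRecords verb."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # restrict the harvest to one set
    if from_date:
        params["from"] = from_date      # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address and set name.
url = list_records_url(
    "http://repository.example.org/oai",
    set_spec="museums",
    from_date="2009-01-01",
)
```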

The OAI-PMH Standard

Page 8: Harvesting&Metadata Enrich Project   EVA 2009

• Multiple Service Providers can harvest from multiple Data Providers.

• Aggregators can sit between Data Providers and Service Providers.

The OAI-PMH Standard

Page 9: Harvesting&Metadata Enrich Project   EVA 2009

• Based on HTTP.

• Request arguments are issued as GET or POST parameters.

• Six verbs: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord.

• Responses are encoded in XML syntax.

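• Error messages: protocol errors are returned as <error> elements in the XML response (OAI-PMH 2.0); HTTP status codes cover transport-level conditions.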

• Sets (optional)

• OAI-PMH supports flow control.
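Flow control in OAI-PMH relies on resumption tokens: a partial list response carries a `resumptionToken`, and the harvester repeats the request with that token until none is returned. The sketch below shows the loop, assuming a `fetch` callable that returns the XML response for a set of request arguments. Here it is replaced by a hypothetical in-memory stub (`fake_fetch`) so the example runs offline. In OAI-PMH 2.0, protocol errors arrive as `<error>` elements in the XML, which the loop also checks.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, metadata_prefix="oai_dc"):
    """Yield identifiers, following resumption tokens until the list is complete."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        error = root.find(f"{OAI}error")
        if error is not None:
            raise RuntimeError(f"{error.get('code')}: {error.text}")
        for header in root.iter(f"{OAI}header"):
            yield header.findtext(f"{OAI}identifier")
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token:  # missing or empty token: nothing left to fetch
            return
        params = {"verb": "ListIdentifiers", "resumptionToken": token}


# Hypothetical two-page response, keyed by the token that requests each page.
PAGE = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
        '<header><identifier>{}</identifier></header>{}</ListIdentifiers></OAI-PMH>')
PAGES = {
    None: PAGE.format("oai:example.org:1", "<resumptionToken>page2</resumptionToken>"),
    "page2": PAGE.format("oai:example.org:2", "<resumptionToken/>"),
}


def fake_fetch(params):
    return PAGES[params.get("resumptionToken")]


identifiers = list(harvest(fake_fetch))
```

In a real harvester, `fetch` would issue an HTTP GET or POST with the given arguments and return the response body.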

The OAI-PMH Standard

Page 10: Harvesting&Metadata Enrich Project   EVA 2009

The OAI-PMH Standard

Page 11: Harvesting&Metadata Enrich Project   EVA 2009

Harvesting & Metadata

CulturaItalia experience

Fabio Lanzi

Page 12: Harvesting&Metadata Enrich Project   EVA 2009

• An Italian experience: building an OAI-PMH Data Provider for CulturaItalia, www.culturaitalia.it

• This Data Provider is conceived as a repository for metadata about Tuscan works of art.

• The mission of CulturaItalia:
– to promote Italian culture and heritage in Italy and abroad,
– to promote and integrate existing resources.

CulturaItalia experience

Page 13: Harvesting&Metadata Enrich Project   EVA 2009

• CulturaItalia is a descriptive catalogue that indexes metadata and redirects to resources.

• Resources remain distributed and under the management of their owners.

• Each institution can establish which data will be harvested by the Portal.

CulturaItalia experience

Page 14: Harvesting&Metadata Enrich Project   EVA 2009

Standards

• CulturaItalia is based on international standards:
– OAI-PMH
– DCMI
– HTTP
– XML
– XHTML

CulturaItalia experience

Page 15: Harvesting&Metadata Enrich Project   EVA 2009

Metadata Schema: PICO DC Application Profile

• Designed for CulturaItalia by Irene Buonazia, M. E. Masci, Davide Merlitti et alii (Scuola Normale Superiore - Pisa)

• Dublin Core has been adopted as the metadata standard

• A DC Application Profile has been developed, according to DCMI recommendations, for this specific application and domain

CulturaItalia experience

Page 16: Harvesting&Metadata Enrich Project   EVA 2009

Metadata Schema: PICO DC Application Profile

• The PICO DC Application Profile joins in one metadata schema:
– All DC Elements;
– All DC Element Refinements and Encoding Schemes from the Qualified DC;
– Other Qualifiers (refinements and encoding schemes) specifically conceived for the CulturaItalia domain.

• Namespaces included in this metadata schema:
– dc:
– dcterms:
– pico:

CulturaItalia experience

Page 17: Harvesting&Metadata Enrich Project   EVA 2009

PICO AP Added Qualifiers – Element Refinements

Element: added Element Refinements

CREATOR: author, commissioner
DESCRIPTION: information, contact, service
PUBLISHER: distributor, printer
CONTRIBUTOR: editor, performer, responsible, producer, translator
FORMAT: material and technique
RELATION: promotes / is promoted by, manages / is managed by, is owner of / is owned by, produces / is produced by, performs / is performed by, is responsible for / has as responsible, contributes to / has as contributor, digitizes / is digitized by
COVERAGE: place of birth, place of death, date of birth, date of death

CulturaItalia experience

Page 18: Harvesting&Metadata Enrich Project   EVA 2009

PICO AP - Extensions to DCMI Type Vocabulary

• The DC Type element, with its controlled vocabulary (DCMI Type Vocabulary), can describe most of the resources to be managed within CulturaItalia.

• The PICO Type Vocabulary adds three more resource types.

DCMI Type Vocabulary:
dcmitype:Collection, dcmitype:Dataset, dcmitype:Event, dcmitype:Image, dcmitype:MovingImage, dcmitype:StillImage, dcmitype:PhysicalObject, dcmitype:InteractiveResource, dcmitype:Service, dcmitype:Software, dcmitype:Sound, dcmitype:Text

PICO Type Vocabulary:
picotype:Institution, picotype:PhysicalPerson, picotype:Project
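A controlled vocabulary like this lends itself to a simple membership check. The sketch below validates a prefixed type value against the twelve DCMI types plus the three PICO additions; the function name is hypothetical and the conventional prefix `dcmitype` is assumed for the DCMI terms.

```python
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "MovingImage", "StillImage",
    "PhysicalObject", "InteractiveResource", "Service", "Software", "Sound", "Text",
}
PICO_TYPES = {"Institution", "PhysicalPerson", "Project"}

VOCABULARIES = {"dcmitype": DCMI_TYPES, "picotype": PICO_TYPES}


def is_known_type(qualified_name):
    """Return True if 'prefix:Term' belongs to one of the two vocabularies."""
    prefix, _, term = qualified_name.partition(":")
    return term in VOCABULARIES.get(prefix, set())
```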

CulturaItalia experience

Page 19: Harvesting&Metadata Enrich Project   EVA 2009

PICO AP – Further Extensions

• PICO AP can be further extended:

– By adding new encoding schemes: they must be defined and published as XSD schemas;

– By using DCSV (Dublin Core Structured Values), defined in:
Simon Cox, Renato Iannella, "DCMI DCSV: A syntax for writing a list of labelled values in a text string", 2000-07-28
http://es.dublincore.org/documents/dcmi-dcsv/
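A DCSV string is a list of `name=value` components separated by semicolons, with a backslash escaping special characters. The parser below is a minimal sketch of that idea rather than a complete implementation of the DCSV note: it handles escaped `;` and `=`, but not every corner case (for example an escaped backslash directly before a separator).

```python
import re


def _unescape(text):
    """Drop the backslash from escaped characters."""
    return re.sub(r"\\(.)", r"\1", text)


def parse_dcsv(text):
    """Parse 'name=value; name=value' into a list of (name, value) pairs.

    Components without a name are returned with name None.
    """
    pairs = []
    for component in re.split(r"(?<!\\);", text):   # split on unescaped ';'
        component = component.strip()
        if not component:
            continue
        parts = re.split(r"(?<!\\)=", component, maxsplit=1)  # unescaped '='
        if len(parts) == 2:
            name, value = parts
            pairs.append((_unescape(name.strip()), _unescape(value.strip())))
        else:
            pairs.append((None, _unescape(parts[0])))
    return pairs


author = parse_dcsv("AUTN=Manzù Giacomo; AUTA=1908/1991")
```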

CulturaItalia experience

Page 20: Harvesting&Metadata Enrich Project   EVA 2009

[Architecture diagram. Labels: SIL “Museum”, NAL “In”, NAL “Out”, Web Service (two instances), CART, Adapter, Database, JDBC, OAICat, Tuscany Repository, OAI-PMH (two links), CulturaItalia]

CulturaItalia experience

Page 21: Harvesting&Metadata Enrich Project   EVA 2009

Publishing process

• Building the envelope: the elements
– Typology
– Publisher
– Local identifier
– Set
– Metadata

• Building the envelope: serialization
[Serialized envelope example; the XML markup was lost in extraction. Surviving values, in order: OAC, MUSEUM, oac_09_00000001_0, OAC_COMUNE_FIRENZE]
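The serialized envelope itself did not survive in this transcript, so the sketch below only illustrates the idea of wrapping a metadata record together with its typology, publisher, local identifier and set. All XML element names (`envelope`, `typology`, `publisher`, `localIdentifier`, `set`, `metadata`) are hypothetical, as is the pairing of the surviving values with the fields, which is inferred from their order.

```python
import xml.etree.ElementTree as ET


def build_envelope(typology, publisher, local_id, set_spec, metadata_xml):
    """Wrap a metadata record with the descriptive fields of the envelope."""
    envelope = ET.Element("envelope")                 # hypothetical element names
    ET.SubElement(envelope, "typology").text = typology
    ET.SubElement(envelope, "publisher").text = publisher
    ET.SubElement(envelope, "localIdentifier").text = local_id
    ET.SubElement(envelope, "set").text = set_spec
    ET.SubElement(envelope, "metadata").append(ET.fromstring(metadata_xml))
    return ET.tostring(envelope, encoding="unicode")


envelope = build_envelope(
    "OAC", "MUSEUM", "oac_09_00000001_0", "OAC_COMUNE_FIRENZE",
    "<record><title>Example</title></record>",       # placeholder payload
)
```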

CulturaItalia experience

Page 22: Harvesting&Metadata Enrich Project   EVA 2009

CART

[Diagram. Labels: NAL “Out”, CART, CART WS, Adapter, Tuscany Repository]

• Software on NAL “Out” sends:
– records to the Data Provider
– return receipts to publishers

CulturaItalia experience

Page 23: Harvesting&Metadata Enrich Project   EVA 2009

Publishing process

• Crosswalk from original profile to PICO

• Storage on database

[Diagram. Labels: NAL “Out”, Web Service, Adapter, JDBC, Database, Tuscany Repository]

CulturaItalia experience

Page 24: Harvesting&Metadata Enrich Project   EVA 2009

Transformer

• Based on XSLT 2.0 language

• Different profiles:
– OA, OAC (ICCD)
– MFN (Fondazione Memofonte / Museo del Bargello, Firenze)
– GIOMM (Museo Marino Marini, Pistoia)

• Character encoding: UTF-8, as required by OAI-PMH

CulturaItalia experience

Page 25: Harvesting&Metadata Enrich Project   EVA 2009

• Predefined Entity References NOT ALLOWED!

• Numerical Character References ALLOWED!

• Example:
[...] si rimanda al volume "Manzù", 1988 [...]
is encoded as
[...] si rimanda al volume &#x0022;Manz&#x000F9;&#x0022;, 1988 [...]

• Some characters handled this way (more than 300 in total):
ê, ½, <, >, &, «, », £, °, `, ´, “, ”
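The rule above (no predefined entities, numeric character references instead) can be sketched as follows. The exact set of characters the project escapes is not given here, so this example assumes all non-ASCII characters plus a handful of ASCII ones. It also pads to four hexadecimal digits, whereas the example above shows five for "ù"; both forms denote the same character.

```python
# Assumed set of ASCII characters to escape; the real list is not in the source.
ESCAPED_ASCII = set('<>&"\'`')


def to_ncr(text):
    """Replace special and non-ASCII characters with hexadecimal character references."""
    return "".join(
        f"&#x{ord(ch):04X};" if ord(ch) > 127 or ch in ESCAPED_ASCII else ch
        for ch in text
    )


encoded = to_ncr('si rimanda al volume "Manzù", 1988')
```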

CulturaItalia experience

Page 26: Harvesting&Metadata Enrich Project   EVA 2009

Source (ICCD):

<AU>
  <AUT>
    <AUTN>Manzù Giacomo</AUTN>
    <AUTA>1908/1991</AUTA>
  </AUT>
  <EDT>
    <EDTN>Della Ragione Alberto</EDTN>
  </EDT>
</AU>

Result (PICO):

<pico:author xsi:type="iccd:AUT">
  AUTN=Manz&#x000F9; Giacomo; AUTA=1908/1991
</pico:author>

<dc:publisher xsi:type="oac:EDT">
  EDTN=Della Ragione Alberto
</dc:publisher>

Ref: Mapping PICO – ICCD,
http://www.iccd.beniculturali.it/Catalogazione/standard-catalografici/metadati
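The mapping above flattens each ICCD group into a DCSV string inside a PICO element. The project's transformer is based on XSLT 2.0; the Python below is only a stand-in that sketches the same crosswalk for this one example. The `TARGETS` table and function names are hypothetical, and the output is a list of tuples rather than serialized XML.

```python
import xml.etree.ElementTree as ET

SOURCE = """
<AU>
  <AUT><AUTN>Manzù Giacomo</AUTN><AUTA>1908/1991</AUTA></AUT>
  <EDT><EDTN>Della Ragione Alberto</EDTN></EDT>
</AU>
"""

# ICCD group -> (target element, xsi:type), taken from the example above.
TARGETS = {
    "AUT": ("pico:author", "iccd:AUT"),
    "EDT": ("dc:publisher", "oac:EDT"),
}


def group_to_dcsv(group):
    """Flatten the children of a group into 'TAG=value; TAG=value'."""
    return "; ".join(f"{child.tag}={(child.text or '').strip()}" for child in group)


def crosswalk(au_xml):
    """Map each known group under <AU> to (element, xsi_type, dcsv_value)."""
    result = []
    for group in ET.fromstring(au_xml):
        if group.tag in TARGETS:
            element, xsi_type = TARGETS[group.tag]
            result.append((element, xsi_type, group_to_dcsv(group)))
    return result


mapped = crosswalk(SOURCE)
```

Escaping to numeric character references would be applied afterwards, when the values are serialized.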

CulturaItalia experience

Page 27: Harvesting&Metadata Enrich Project   EVA 2009

DATA PROVIDER

• Open source software:
– OAICat
– Apache Axis
– Apache Tomcat
– MySQL

• Personalization:
– Use of Tomcat DataSource
– JDBC2Pico crosswalk

SERVICE PROVIDER

CulturaItalia harvested more than 14000 records

[Diagram. Labels: Database, JDBC, Tomcat, OAICat, OAI-PMH, PICO harvester]

CulturaItalia experience

Page 28: Harvesting&Metadata Enrich Project   EVA 2009

Harvesting & Metadata

Enrich experience

Paolo Mazzanti

Page 29: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

• A European experience: the ENRICH Project
http://enrich.manuscriptorium.com/

• ENRICH Project goal: to create seamless access to information about the vast collections of manuscripts and incunabula distributed across major European libraries.

Italian Partners:
– MICC (Media Integration and Communication Center)
– BNCF (The National Library of Florence)

Page 30: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

• ENRICH Project:
– Based on the MANUSCRIPTORIUM Digital Library, http://www.manuscriptorium.eu (National Library of the Czech Republic, AIP-Beroun Ltd)

• ENRICH Conceptual Model:
– OAI-PMH
– XML
– TEI

Page 31: Harvesting&Metadata Enrich Project   EVA 2009

• Report on the Development and Validation of Migration Tools, 28 February 2009
http://enrich.manuscriptorium.com/files/ENRICH_WP3_D3_3_Migration_Tools_01.pdf

Migration routes for a number of different data formats to the ENRICH specification.

Enrich experience

Recommendations for Migration Routes:

- mature, open source, cross-platform technologies;

- human-readable, text-based scripting languages.

Page 32: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

Migration of the metadata to ENRICH:

– The metadata format transformation can be performed by either the Service Provider or the Data Provider, depending on the XSLT skills of the Data Provider;

– The project offers a tool, named M-Tool, that guides the Data Provider in mapping its proprietary fields to the TEI P5 ones.

Page 33: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

• Data Format:

MANUSCRIPTORIUM
– Based on the MASTER (Manuscript Access through Standards for Electronic Records) XML data format (extension to TEI P4 Guidelines)
– MASTER Reference Manual (available at http://www.teic.org.uk/Master/Reference/oldindex.html )
– The MASTER data format was updated and modified and eventually incorporated as a module into the Text Encoding Initiative TEI P5 Guidelines

ENRICH
– Based on TEI P5 (ratified by the TEI Technical Council)
– MASTER to ENRICH transformation XSL (released under a Creative Commons Attribution license)

Page 34: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

• TEI P5: http://www.tei-c.org/Guidelines/P5/
– over 1300 pages
– 23 chapters
– over 500 XML elements

• ENRICH format specification is based on chapters for:
– Manuscript Description
– Digital images
– Non-Unicode characters
– Paleographic or transcriptional data

Page 35: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

The ENRICH TEI P5 schema covers three distinct aspects of a digitized manuscript:

1. metadata describing the original source manuscript;

2. metadata describing digitized images of the original source manuscript;

3. a transcription of the text contained in the original source manuscript (not required in Manuscriptorium).
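The three aspects correspond naturally to three parts of a TEI P5 document: a manuscript description (`msDesc`) in the header, a `facsimile` section for the images, and a `text` section for the transcription. The sketch below builds such a skeleton with generic TEI P5 element names. It is not a valid ENRICH document (mandatory header parts are omitted), and the shelfmark and image URL are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def q(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def skeleton(shelfmark, image_urls, transcription=None):
    tei = ET.Element(q("TEI"))

    # 1. Metadata describing the source manuscript.
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    ms_desc = ET.SubElement(source_desc, q("msDesc"))
    ms_identifier = ET.SubElement(ms_desc, q("msIdentifier"))
    ET.SubElement(ms_identifier, q("idno")).text = shelfmark

    # 2. Metadata describing the digitized images.
    facsimile = ET.SubElement(tei, q("facsimile"))
    for url in image_urls:
        ET.SubElement(facsimile, q("graphic"), url=url)

    # 3. Optional transcription of the text.
    if transcription is not None:
        text = ET.SubElement(tei, q("text"))
        body = ET.SubElement(text, q("body"))
        ET.SubElement(body, q("p")).text = transcription
    return tei


doc = skeleton("MS Example 1", ["http://images.example.org/f001.jpg"], "Incipit ...")
parts = [child.tag.split("}")[1] for child in doc]
```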

Page 36: Harvesting&Metadata Enrich Project   EVA 2009

Contents of the Biblioteca Nazionale Centrale di Firenze (BNCF) planned for aggregation via OAI-PMH

Set  Name                            Documents   Images
1    Manoscritti in rete                    33     3865
2    Bibliotheca Universalis II            183    63980
3    Carte Geografiche II                  137      233
4    Bibliotheca Universalis I             377   159381
5    Carte Geografiche I                   810     3765
6    Magliabechi                         52096   211618
7    Galileo Galilei manuscripts           307    98650
8    Galileo Galilei printed books         183    58387

Enrich experience

Page 37: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

The goal:

• to aggregate the content while keeping the aggregated information as unconstrained as possible

• to harvest the original primary metadata contents

• The Italian case of the BNCF:
– MARCXML (slim) records (historical metadata)
– MAG records (structural metadata)

Page 38: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

Example of a BNCF record in the MAG profile

Page 39: Harvesting&Metadata Enrich Project   EVA 2009

• Two harvests: one for MAG and the other for MARCslim.

• Corresponding records are matched, and both input files are processed automatically to produce a single XML record using the TEI P5 ENRICH schema.

• This TEI record is further processed in the Manuscriptorium platform (which will be TEI P5 based) for the purposes of searching and presentation (via the end-user interface or the OAI-PMH interface).

• Migration to TEI P5 in progress…
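The matching step can be pictured as a join of the two harvests on a shared key. The slides do not say which field links a MAG record to its MARC record, so the sketch below simply assumes both harvests are already indexed by a common identifier; the function name and the sample data are hypothetical.

```python
def pair_records(mag_records, marc_records):
    """Join two harvests on a shared identifier.

    Returns (paired, unmatched): paired maps identifier -> both records,
    unmatched lists MAG identifiers with no MARC counterpart.
    """
    paired, unmatched = {}, []
    for identifier, mag in mag_records.items():
        marc = marc_records.get(identifier)
        if marc is None:
            unmatched.append(identifier)
        else:
            paired[identifier] = {"marc": marc, "mag": mag}
    return paired, unmatched


# Placeholder payloads; real records would be XML documents.
mag = {"doc-1": "<mag .../>", "doc-2": "<mag .../>"}
marc = {"doc-1": "<record .../>"}
paired, unmatched = pair_records(mag, marc)
```

Each paired entry would then be handed to the transformation that produces the single TEI P5 record.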

Enrich experience

Page 40: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

Example: online MAG record, BNCF

Page 41: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

Example: online MAG record, BNCF

Page 42: Harvesting&Metadata Enrich Project   EVA 2009

Enrich experience

Enrich Harvesting Information:

- AIP Beroun, Beroun, Czech Republic
  Tomas Psohlavec
  http://www.aipberoun.cz

Enrich Metadata Information:

- Oxford University Computing Services, Oxford, United Kingdom
  James Cummings, Sebastian Rahtz
  http://www.oucs.ox.ac.uk/

Page 43: Harvesting&Metadata Enrich Project   EVA 2009

Thank you!

Rudy Becarelli, Fabio Lanzi, Paolo Mazzanti

MICC - LCI Lab. - Media Integration and Communication Center
Viale Morgagni, 65, 50134 Florence (Italy)
Tel. +39.055.4237404
http://lci.micc.unifi.it