metadata april 8 2013
1
Open Archives Initiative - Protocol for Metadata Harvesting
April 8, 2013
Richard Sapon-White
2
Overview
Definitions
History
The OAI Model
Protocol for Metadata Harvesting
3
Definitions
Harvester - client application issuing OAI-PMH requests
Harvesting - the gathering together of metadata from a number of distributed repositories into a combined data store
Archives - synonym for a repository of scholarly papers
Protocol - a set of rules defining communication between systems (such as ftp or http)
4
History of the OAI
E-print servers = archives or repositories
E-print servers provide access to scientific and technical papers, scholarly journal articles
Authors deposit pre-prints or published articles in these repositories
Concept: public, free access to scholarly information without paid subscription to journals
5
History of the OAI (cont.)
Why?
Scholarly research belongs to people
Speeds the sharing of research
Better for authors and readers
Known as the “open archives movement”
Has nothing to do with physical archives (repositories of institutional history or collections of unpublished materials)
6
History of the OAI (cont.)
Many e-print servers grew
Overlapping disciplinary coverage
Overlapping geographic coverage
Developing need to:
search multiple repositories simultaneously (= federated searching)
automatically identify and copy papers from other repositories (= repository synchronization)
7
History of the OAI (cont.)
Meeting of experts, 1999, Santa Fe, New Mexico, USA
Defined an interface so that repositories could expose metadata for papers they held
Metadata could then be discovered by federated search services and other repositories and copied
Known as the Santa Fe Convention (later developed into PMH, the Protocol for Metadata Harvesting)
8
The Open Archives Model
Similar concept to union catalog
Metadata “harvested” and stored in central repository
“Pull” rather than “push” model
Collecting is similar to Internet spider collecting HTML content
9
PMH and Z39.50
Differs from Z39.50 (specifically rejected at Santa Fe)
Z39.50: allows a client to search a remote information server across a network
Difficult to perform high-quality federated searches across many servers – would need to deal with each server individually
Complex protocol
10
PMH and Z39.50 (cont.)
PMH is a simple protocol
User interacts with database of harvested metadata, not with individual repositories
Database is constructed by the federated search service using PMH
Therefore, performance depends only on the federated search service, not the individual repositories
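A minimal sketch of that arrangement, assuming two made-up repositories held in memory (in practice each would be harvested over OAI-PMH). The names `REPOSITORIES`, `harvest`, and `search` are illustrative only. The point is that `search` reads the harvested store and never contacts a repository.

```python
# Stand-ins for remote repositories; the names and records are invented.
REPOSITORIES = {
    "physics-archive": [
        {"id": "p1", "title": "Quantum harvesting"},
        {"id": "p2", "title": "Dark matter survey"},
    ],
    "library-archive": [
        {"id": "l1", "title": "Metadata harvesting in libraries"},
    ],
}


def harvest(repositories):
    """Pull every record into one combined store, tagging each with its source."""
    store = []
    for name, records in repositories.items():
        for record in records:
            store.append({"source": name, **record})
    return store


def search(store, term):
    """Search the harvested store only; no repository is contacted here."""
    term = term.lower()
    return [record for record in store if term in record["title"].lower()]


store = harvest(REPOSITORIES)
hits = search(store, "harvesting")
```

Because the search runs against the local store, a slow or offline repository affects only how fresh the harvested metadata is, not how quickly the user gets results.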
11
Metadata Harvesting Protocol
Queries and responses carried over HTTP
Harvester application can request a single metadata record or group of records to be exported
Application can restrict records by date to only gather new records (since previous harvesting)
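The requests described on this slide can be sketched as plain HTTP query strings. The verbs (`GetRecord`, `ListRecords`) and arguments (`identifier`, `metadataPrefix`, `from`) come from the OAI-PMH specification. The endpoint URL, the record identifier, and the helper `build_request` are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "http://repository.example.org/oai"


def build_request(verb, **params):
    """Return the URL for an OAI-PMH request with the given verb and arguments."""
    query = {"verb": verb}
    for key, value in params.items():
        # "from" is a Python keyword, so callers pass from_ and it is renamed here.
        query[key.rstrip("_")] = value
    return BASE_URL + "?" + urlencode(query)


# A single metadata record, in unqualified Dublin Core.
single = build_request(
    "GetRecord",
    identifier="oai:repository.example.org:1234",
    metadataPrefix="oai_dc",
)

# Only records added or changed since the previous harvest.
incremental = build_request("ListRecords", metadataPrefix="oai_dc", from_="2013-04-01")
```

A harvester would store the date of each run and pass it as `from` on the next one, so only new or changed records are transferred.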
12
Metadata Harvesting Protocol (cont.)
OAI-compliant data providers are capable of responding to such requests
Data provider must be able to export metadata in at least unqualified Dublin Core (DC) using XML communication syntax
Data provider includes URI with metadata
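As an illustration of what a harvester receives, the sketch below parses one record in unqualified DC. The namespaces are the ones defined for OAI-PMH 2.0 and Dublin Core. The record itself is a hand-written sample, not output from a real repository, and treating `dc:identifier` as the URI of the paper is an assumption about how this sample was encoded.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample record; every value in it is invented.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repository.example.org:1234</identifier>
    <datestamp>2013-04-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Example Pre-print</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:identifier>http://repository.example.org/papers/1234</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text):
    """Extract a few Dublin Core fields from one OAI-PMH record."""
    record = ET.fromstring(xml_text)
    dc = record.find("oai:metadata/oai_dc:dc", NS)
    return {
        "oai_identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "title": dc.findtext("dc:title", namespaces=NS),
        "creators": [element.text for element in dc.findall("dc:creator", NS)],
        "uri": dc.findtext("dc:identifier", namespaces=NS),
    }


parsed = parse_record(SAMPLE_RECORD)
```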
13
Metadata Harvesting Protocol (cont.)
Servers can also provide metadata in other schemes besides DC
Harvester applications can request metadata in other schemes besides DC
Harvester applications can also query a metadata repository for:
List of metadata formats supported by repository
List of record sets supported by the repository
List of the identifiers of all records within the repository
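The three queries listed above correspond to the OAI-PMH verbs `ListMetadataFormats`, `ListSets`, and `ListIdentifiers`. A rough sketch of the request URLs follows. The endpoint and the names `DISCOVERY_REQUESTS` and `discovery_urls` are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "http://repository.example.org/oai"

# Slide wording mapped to the protocol verb and its arguments.
DISCOVERY_REQUESTS = {
    "metadata formats": {"verb": "ListMetadataFormats"},
    "record sets": {"verb": "ListSets"},
    # ListIdentifiers requires a metadataPrefix argument.
    "record identifiers": {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"},
}


def discovery_urls(base_url=BASE_URL):
    """Return one request URL per discovery query."""
    return {
        label: base_url + "?" + urlencode(arguments)
        for label, arguments in DISCOVERY_REQUESTS.items()
    }


urls = discovery_urls()
```

A harvester would typically ask for the supported formats first, then pick a `metadataPrefix` the repository actually offers before listing identifiers or records.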
14
Why the OAI PMH is important
Provides for a minimal level of interoperability
Drives development of community-specific metadata schemes
Potential for new modes of scholarly communication
Dependent on widespread implementation by research organizations, publishers, and “memory organizations” (i.e., libraries, museums, archives)
15
QUIZ!!!
http://www.oaforum.org/tutorial/english/page1.htm#section5
Problems with Metadata Harvesting
Loss of data when mapping to unqualified DC
Incorrect data from improper mapping
Inconsistent punctuation and formatting because of diverse sources of metadata
High variance in data between institutions
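The first problem can be shown with a small sketch, assuming a record described with qualified DC refinements. In Dublin Core, `created`, `issued`, and `modified` refine `date`, `abstract` refines `description`, and `alternative` refines `title`. The function `dumb_down` and the sample record are illustrative only.

```python
# Qualified DC refinements mapped to the unqualified element they refine.
REFINEMENT_TO_ELEMENT = {
    "created": "date",
    "issued": "date",
    "modified": "date",
    "abstract": "description",
    "alternative": "title",
}


def dumb_down(qualified_record):
    """Map a qualified record to unqualified DC, collecting values per element."""
    simple = {}
    for term, value in qualified_record.items():
        element = REFINEMENT_TO_ELEMENT.get(term, term)
        simple.setdefault(element, []).append(value)
    return simple


# Invented sample record.
record = {
    "title": "An Example Pre-print",
    "alternative": "Example",
    "created": "2012-11-02",
    "issued": "2013-04-01",
}

simple = dumb_down(record)
```

After the mapping, `simple["date"]` holds two dates with nothing to say which was the creation date and which the issue date. That distinction is the data lost.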
16
Metasearching
Many systems = many metadata standards
Convert to single system (harvesting)?
Maintain individual element sets BUT create interface to search simultaneously across heterogeneous databases
Voila: Metasearching!
Not a single method
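One way to picture this, as a toy sketch rather than any particular product's method: each source keeps its own element set, and the search interface translates a common field name into each source's native field before querying. The sources, field names, and records below are all invented.

```python
# Each source keeps its own element set; field_map translates a common
# field name into the source's native one. All data here is invented.
SOURCES = {
    "catalog": {
        "field_map": {"title": "245a"},
        "records": [{"245a": "Metadata basics"}],
    },
    "repository": {
        "field_map": {"title": "dc_title"},
        "records": [{"dc_title": "Metadata harvesting"}, {"dc_title": "Open access"}],
    },
}


def metasearch(field, term):
    """Search every source in turn, using each source's own field name."""
    hits = []
    for name, source in SOURCES.items():
        native_field = source["field_map"][field]
        for record in source["records"]:
            if term.lower() in record.get(native_field, "").lower():
                hits.append((name, record[native_field]))
    return hits


results = metasearch("title", "metadata")
```

Unlike harvesting, nothing is copied or converted ahead of time. Every search goes out to every source, so the slowest source sets the response time.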
17
Definition
From NISO MetaSearch Initiative: “search and retrieval to span multiple databases, sources, platforms, protocols, and vendors at one time.”
Best known: Z39.50 protocol. Used to search remote library catalogs.
18
Z39.50
Allows computers to communicate to retrieve information – between client and server
Searches and results are restricted to Z39.50 databases
19
Z39.50 results
Server may interpret the query incorrectly
Some automatically add Boolean “and” while others add Boolean “or”
Vocabulary issues – different vocabulary in different databases
Display results in order retrieved, by database found, by date, by relevance
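The Boolean issue is easy to see in a small sketch. This is not Z39.50 code. It only models two servers that receive the same two-word query, with one joining the words with “and” and the other with “or”. The titles and the function `matches` are made up.

```python
def matches(title, query, operator):
    """Return True if the title satisfies the query under the given operator."""
    words = query.lower().split()
    found = [word in title.lower() for word in words]
    return all(found) if operator == "and" else any(found)


# Invented titles held by both servers.
TITLES = ["Open archives", "Open access journals", "Archives of history"]

# Same query, two server interpretations.
and_hits = [title for title in TITLES if matches(title, "open archives", "and")]
or_hits = [title for title in TITLES if matches(title, "open archives", "or")]
```

The “and” server returns one title and the “or” server returns all three, so merged results from the two are not comparable.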
20
Problems with Z39.50
High recall, little precision
Also present in Google Search: few studies on user satisfaction
Results may display in an irrelevant order for the searcher
21
Metasearching: pros and cons
Single database searching allows users to use specialized indexing or controlled vocabulary
Single portal: No need for searcher to select a particular database from list of databases
22
Case Studies
Divide into 3-4 groups
Read the case study
Discuss and report:
Describe the case briefly (2 min.)
What can we learn from this case study? (3 min.)
23