1 open archives initiative protocol for metadata harvesting (oai-pmh) alon kadury
Post on 20-Dec-2015
1
Open Archives Initiative Protocol for Metadata Harvesting
(OAI-PMH)
Alon Kadury
2
Content
Reminders
History
OAI overview
Technical introduction
Conclusions
Demonstrations
Resources
3
Definition - A Digital Library is a:
1. Collection of digital objects
2. Collection of knowledge structures
3. Collection of library services
4. Domain/Focus/Topic
5. Quality Control
6. Preservation/Persistence
4
Types of DLs
Single Digital Library (SDL) – also stand-alone, self-contained
Federated Digital Library (FDL) – also confederated, distributed
Harvested Digital Library (HDL)
5
Single Digital Library (SDL)
A regular DL
Self-contained material:
– purchased
– scanned/digitized
Usually localized
6
Federated Digital Library (FDL)
Contains many autonomous libraries
Usually heterogeneous repositories
Connected via network
Forms a virtual distributed library
Transparent user interface
The major problem is interoperability.
7
Harvested Digital Library (HDL)
Does not contain data, just metadata
Objects harvested into summaries
Regular DL characteristics:
– fine granularity
– rich library services
– high quality control
– annotated
8
History
As the Web evolved, the number of Web sites and search engines increased.
A similar process happened with e-prints and digital libraries.
The growing number of DLs led to the development of the OAI-PMH protocol, as we are about to see.
9
History - Problems
The development of e-prints and digital libraries led to several problems, such as:
Many user interfaces - Each DL offered its own Web interface for depositing articles and for end-user searches.
The result: End users could not work across archives without learning multiple different interfaces.
10
History - Problems
Different query syntaxes -
The result: It was difficult for users to keep track of the search syntax of each SDL, and difficult to create an FDL that could query many SDLs.
Many metadata formats - An SDL could keep its metadata in any format it wanted.
The result: FDLs had a hard time, because they had to know the format of each SDL they were harvesting.
11
History – Possible solutions
These problems led researchers to recognise the need for a single search interface to all archives: the Universal Pre-print Service (UPS).
Two possible approaches to building the UPS were considered:
12
History – Solution 1
Cross-searching multiple archives:
In this approach a client sends requests to several servers and then combines the data.
The client and servers work with a known and agreed protocol (for example Z39.50).
However, studies showed that this is not the preferred approach for distributed searching across large numbers of nodes, mainly because of problems such as knowing which collections to search, and performance issues.
13
History – Solution 2
Harvesting metadata into a ‘Central Server’:
This approach harvests the metadata and stores it in a central server, on which searches are made.
The idea was demonstrated in a convention held at Santa Fe NM, October 21-22, 1999.
UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/
More reading: http://www.dlib.org/dlib/february00/02contents.html
14
OAI overview - definitions
Let's start with a few definitions:
Interoperability
Open Archives Initiative (OAI)
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
15
OAI overview - definitions
What is Interoperability?
Interoperability refers to the ability of two or more systems to interact with one another and exchange data according to a prescribed method in order to achieve predictable results.
16
OAI overview - definitions
In order to exchange data we need to agree on things like:
– request formats
– result formats
– transport protocols (HTTP vs FTP vs …)
– metadata formats (DC vs MARC vs …)
– usage rights (who can do what with the records)
We need someone to organize it and “set the rules”.
17
OAI overview - definitions
Who will organize it?
The Open Archives Initiative:
“The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content.” (http://www.openarchives.org/organization/index.html)
18
OAI overview - definitions
What will the interoperability standard be called?
Open Archives Initiative Protocol for Metadata Harvesting
(OAI-PMH)
19
OAI overview - Key players
When talking about OAI-PMH we see three main players:
1. Data Providers
2. Service Providers
3. The protocol (OAI-PMH)
20
OAI overview - Data Provider
Data Provider:
– Handles deposit/publishing of resources in the archive.
– Exposes metadata about resources in the archive (using the OAI-PMH protocol/interface).
– May support any metadata format, but must support the Dublin Core (DC) metadata format.
– Offers free access to the archive (at least to the metadata).
– A network-accessible server that is able to process OAI-PMH requests correctly is often called a Repository.
21
OAI overview - Service Provider
Service Provider:
– Harvests metadata from data providers and uses it to offer a single user interface across all harvested metadata.
– May enrich metadata.
– Offers (value-added) services on the basis of the metadata.
– A client application issuing OAI-PMH requests is often referred to as a Harvester.
22
OAI overview - Providers
                     Data Provider               Service Provider
End user interface   Might have user interface   Has user interface
Contains             Items & metadata            Metadata only
OAI interface        must                        ?
Offers data from     Its own resources           Harvests metadata from data providers
23
OAI overview - Providers
[Diagram: each Data Provider has an input interface, a native harvesting interface and a native end-user interface; the Service Provider has a native harvesting interface and a native end-user interface. Note on the diagram: native end-user interface optional (e.g., RePEc).]
24
OAI overview - Providers
[Diagram: data providers and service providers, connected by harvesting based on OAI-PMH.]
25
OAI overview - Model
[Diagram: layered model]
Layer 1: Web
Layer 2: SDL, SDL, SDL
(OAI-PMH between Layer 2 and Layer 3)
Layer 3: Service Provider - FDL/HDL
Layer 4: Web interfaces
26
Technical introduction
Since the days of the Santa Fe convention the protocol has gone through several versions.
Version 2.0 is the latest and is considered stable.
The technical introduction refers to this version.
27
Tech’- protocol versions
             Santa Fe convention     OAI-PMH v.1.0/1.1          OAI-PMH v.2.0
nature       experimental            experimental               stable
verbs        Dienst                  OAI-PMH                    OAI-PMH
requests     HTTP GET/POST           HTTP GET/POST              HTTP GET/POST
responses    XML                     XML                        XML
transport    HTTP                    HTTP                       HTTP
metadata     OAMS                    unqualified Dublin Core    unqualified Dublin Core
about        eprints                 document-like objects      resources
model        metadata harvesting     metadata harvesting        metadata harvesting
28
Tech’- request & response
The requests of the protocol are HTTP based.
The response contents of the protocol are XML based.
Question: why?
Answer:
– Simple protocol based on existing standards, which allows rapid development and effortless implementation.
– Systems can be deployed in a variety of configurations.
– Low-barrier interoperability specification.
– Internet/firewall friendly.
29
Tech’- request & response
There are six request types which are called verbs.
The request type and additional information are passed as parameters using HTTP POST or GET methods.
[Diagram: the Harvester (Service Provider) sends requests (based on HTTP) to the Repository (Data Provider); the Repository returns metadata (encoded in XML) describing its documents.]
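The way a verb and its arguments travel as HTTP GET parameters can be sketched in a few lines of Python. This is a minimal illustration and not part of the original slides: the function name `build_request_url` and the base URL are assumptions modelled on the slide examples.

```python
from urllib.parse import urlencode


def build_request_url(base_url, verb, arguments=None):
    """Return the GET form of an OAI-PMH request as a URL string."""
    # The verb comes first; any other arguments follow as key=value pairs.
    params = {"verb": verb}
    params.update(arguments or {})
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical base URL, modelled on the examples in the slides.
    print(build_request_url("http://archive.org/oai-script", "Identify"))
    print(build_request_url(
        "http://archive.org/oai-script",
        "ListRecords",
        {"metadataPrefix": "oai_dc", "set": "biology"},
    ))
```

The same parameters could equally be sent in the body of an HTTP POST; only the GET form is shown here.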
30
Let's see a demonstration of how we can create an FDL, and then we will look at what happens behind the scenes.
Demo
31
Tech’ – more definition
[Diagram: the Harvester of a Service Provider talks to the Repositories of several Data Providers (e-prints, images, OPAC, museum, archive).]
Requests:
Identify
ListMetadataFormats
ListSets
ListIdentifiers
ListRecords
GetRecord
Responses:
General information
Metadata formats
Set structure
Record identifier
Metadata
32
Tech’- Request Types
Six different request types:
1. Identify
2. ListMetadataFormats
3. ListSets
4. ListIdentifiers
5. ListRecords
6. GetRecord
A Harvester does not have to use all types.
A Repository must implement all request types fully (all required and optional arguments for each of the requests).
33
Tech’- Request Type: Identify
function: retrieve description and general information about an archive.
example: archive.org/oai-script?verb=Identify
parameters: none
errors / exceptions:
– badArgument, e.g. archive.org/oai-script?verb=Identify&set=biology
34
Tech’- Request Type: Identify
Response format
Element              Example                                 #
repositoryName       My Archive                              1
baseURL              http://archive.org/oai                  1
protocolVersion      2.0                                     1
earliestDatestamp    1999-01-01                              1
deletedRecord        no, transient, persistent               1
granularity          YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ        1
adminEmail           oai-admin@archive.org                   +
compression          deflate, compress, …                    *
description          oai-identifier, eprints, friends, …     *
(# = number of occurrences: 1 = exactly one, + = one or more, * = zero or more)
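As an illustration of how a harvester might read these fields, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample response is hand-written for this example, not captured from a real repository, and the function name `parse_identify` is an assumption. The sample uses `deletedRecord`, which is the element name in the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response, for illustration only.
SAMPLE_IDENTIFY = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-12-01T12:00:00Z</responseDate>
  <request verb="Identify">http://archive.org/oai</request>
  <Identify>
    <repositoryName>My Archive</repositoryName>
    <baseURL>http://archive.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>oai-admin@archive.org</adminEmail>
    <earliestDatestamp>1999-01-01</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>
"""

# Elements that occur exactly once in an Identify response.
SINGLE_VALUED = ("repositoryName", "baseURL", "protocolVersion",
                 "earliestDatestamp", "deletedRecord", "granularity")


def parse_identify(xml_text):
    """Return the Identify fields as a dict; adminEmail is a list."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    info = {name: identify.findtext("oai:" + name, namespaces=OAI_NS)
            for name in SINGLE_VALUED}
    # adminEmail may repeat, so collect every occurrence.
    info["adminEmail"] = [e.text for e in identify.findall("oai:adminEmail", OAI_NS)]
    return info


if __name__ == "__main__":
    print(parse_identify(SAMPLE_IDENTIFY))
```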
35
Tech’- Request Type: Identify
Response in XML format: http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=Identify
36
Tech’- Request Type: ListMetadataFormats
function: retrieve available metadata formats from an archive.
Remember that each archive must implement at least DC.
example: archive.org/oai-script?verb=ListMetadataFormats
parameters: identifier (optional)
errors / exceptions:
– badArgument
– idDoesNotExist, e.g. archive.org/oai-script?verb=ListMetadataFormats&identifier=really-wrong-identifier
– noMetadataFormats
37
Tech’- Request Type: ListMetadataFormats
Response in XML format: http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=ListMetadataFormats
38
Tech’- Request Type: ListSets
Q: What are Sets?
A: Sets are a logical partitioning of a repository.
Q: Why use sets?
A: Sets are meant to enable selective harvesting.
Data providers don’t have to define sets.
Sets are not strictly hierarchical.
39
Tech’- Request Type: ListSets
function: retrieve set structure of a repository
example: archive.org/oai-script?verb=ListSets
parameters: resumptionToken (exclusive)
errors / exceptions:
– badArgument
– badResumptionToken, e.g. archive.org/oai-script?verb=ListSets&resumptionToken=any-wrong-token
– noSetHierarchy
40
Tech’- Request Type: ListSets
Response in XML format: http://theses.lub.lu.se/oai-service/xerxes/?verb=ListSets
41
Tech’- Request Type: ListIdentifiers
function: abbreviated form of ListRecords, retrieving only headers
example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01
parameters:
– from (optional)
– until (optional)
– metadataPrefix (required)
– set (optional)
– resumptionToken (exclusive)
errors / exceptions:
– badArgument, e.g. …&from=2002-12-01-13:45:00
– badResumptionToken
– cannotDisseminateFormat
– noRecordsMatch
– noSetHierarchy
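The badArgument example above fails because the `from` value matches neither of the two datestamp granularities (YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ). A minimal sketch of such a check, assuming a helper named `is_valid_datestamp` that is not part of the protocol or of any library:

```python
import re
from datetime import datetime

# The two datestamp shapes used by OAI-PMH: day or second granularity.
_DATESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?")


def is_valid_datestamp(value):
    """Return True if value is a well-formed from/until argument."""
    if not _DATESTAMP.fullmatch(value):
        return False
    fmt = "%Y-%m-%dT%H:%M:%SZ" if "T" in value else "%Y-%m-%d"
    try:
        # strptime also rejects impossible dates such as month 13.
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ("2002-12-01", "2002-12-01T13:45:00Z", "2002-12-01-13:45:00"):
        print(candidate, is_valid_datestamp(candidate))
```

A real repository would additionally reject the finer granularity if its Identify response announces only YYYY-MM-DD; that check is left out of this sketch.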
42
Tech’- Request Type: ListIdentifiers
Response in XML format: http://theses.lub.lu.se/oai-service/xerxes/?verb=ListIdentifiers&metadataPrefix=oai_dc
43
Tech’- Request Type: ListRecords
function: harvest records from a repository
example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology
parameters:
– from (optional)
– until (optional)
– metadataPrefix (required)
– set (optional)
– resumptionToken (exclusive)
errors / exceptions:
– badArgument
– badResumptionToken
– cannotDisseminateFormat
– noRecordsMatch
– noSetHierarchy
44
Tech’- Request Type: GetRecord
function: retrieve an individual metadata record from a repository
example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc
parameters:
– identifier (required)
– metadataPrefix (required)
errors / exceptions:
– badArgument
– cannotDisseminateFormat
– idDoesNotExist
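The required, optional and exclusive arguments listed on the request-type slides can be gathered into one lookup table. The sketch below shows how a repository might use such a table to decide when to answer badArgument. The names `VERB_ARGUMENTS` and `check_arguments` are assumptions made for this example; `badVerb` is an error code defined by the protocol that does not appear on the slides.

```python
# Argument rules per verb, taken from the request-type slides.
VERB_ARGUMENTS = {
    "Identify": {"required": set(), "optional": set(), "exclusive": None},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"},
                            "exclusive": None},
    "ListSets": {"required": set(), "optional": set(),
                 "exclusive": "resumptionToken"},
    "ListIdentifiers": {"required": {"metadataPrefix"},
                        "optional": {"from", "until", "set"},
                        "exclusive": "resumptionToken"},
    "ListRecords": {"required": {"metadataPrefix"},
                    "optional": {"from", "until", "set"},
                    "exclusive": "resumptionToken"},
    "GetRecord": {"required": {"identifier", "metadataPrefix"},
                  "optional": set(), "exclusive": None},
}


def check_arguments(verb, arguments):
    """Return an error code ('badVerb' or 'badArgument'), or None if valid."""
    rules = VERB_ARGUMENTS.get(verb)
    if rules is None:
        return "badVerb"
    names = set(arguments)
    exclusive = rules["exclusive"]
    if exclusive and exclusive in names:
        # An exclusive argument must be the only argument besides the verb.
        return None if names == {exclusive} else "badArgument"
    if not rules["required"] <= names:
        return "badArgument"  # a required argument is missing
    if names - rules["required"] - rules["optional"]:
        return "badArgument"  # an argument the verb does not accept
    return None


if __name__ == "__main__":
    print(check_arguments("Identify", {"set": "biology"}))
    print(check_arguments("GetRecord", {"identifier": "oai:HUBerlin.de:3000218",
                                        "metadataPrefix": "oai_dc"}))
```

The check covers only which arguments are present; whether their values are legal (for example a valid datestamp or a known identifier) is a separate step.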
45
Tech’- Records, items & DC
or setting the record straight
[Diagram: a resource (David) corresponds to an item (item = identifier), which stands for all available metadata about David. The item is disseminated as records: Dublin Core metadata, MARC metadata, SPECTRUM metadata.]
46
Tech’- Records, items & DC
A record consists of:
1. Header (mandatory)
– identifier (1)
– datestamp (1)
– setSpec elements (*)
– status attribute for deleted item (?)
2. Metadata (mandatory)
– XML encoded metadata with root tag, namespace
– repositories must support Dublin Core
3. About (optional)
– rights statements
– provenance statements
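To make the header structure concrete, the following sketch reads the header of one record with Python's standard `xml.etree.ElementTree`. The sample record is hand-written for illustration (the identifier is borrowed from the GetRecord example, the other values are invented), and `parse_header` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample record, for illustration only.
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:HUBerlin.de:3000218</identifier>
    <datestamp>2002-12-01</datestamp>
    <setSpec>biology</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Example title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""


def parse_header(record_xml):
    """Return identifier, datestamp, setSpecs and deleted flag of a record."""
    record = ET.fromstring(record_xml)
    header = record.find("oai:header", OAI_NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
        # setSpec may occur any number of times, including zero.
        "setSpecs": [e.text for e in header.findall("oai:setSpec", OAI_NS)],
        # The optional status attribute marks a deleted item.
        "deleted": header.get("status") == "deleted",
    }


if __name__ == "__main__":
    print(parse_header(SAMPLE_RECORD))
```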
47
Tech’- Records, items & DC
OAI-PMH supports dissemination of multiple metadata formats from a repository.
Properties of metadata formats:
– id string to specify the format (metadataPrefix)
– metadata schema URL (XML schema to test validity)
– XML namespace URI (global identifier for the metadata format)
Repositories must be able to disseminate unqualified DC.
Arbitrary metadata formats can be defined and transported via the OAI-PMH.
Returned metadata must comply with the XML namespace specification.
48
Tech’- Records, items & DC
As mentioned before, the minimum standard is unqualified Dublin Core (http://dublincore.org/).
The Dublin Core Metadata Element Set contains 15 elements.
All elements are optional.
All elements may be repeated.
The Dublin Core Metadata Element Set:
Title          Contributor    Source
Creator        Date           Language
Subject        Type           Relation
Description    Format         Coverage
Publisher      Identifier     Rights
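Because every element is optional and repeatable, a natural in-memory form for an unqualified DC record is a mapping from element name to a list of values. A minimal sketch under that assumption follows; the sample XML is hand-written and the name `extract_dc` is hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = ("title", "creator", "subject", "description", "publisher",
               "contributor", "date", "type", "format", "identifier",
               "source", "language", "relation", "coverage", "rights")

# Hand-written sample metadata, for illustration only.
SAMPLE_DC = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:creator>A. Author</dc:creator>
  <dc:creator>B. Author</dc:creator>
  <dc:date>2002-12-01</dc:date>
</oai_dc:dc>
"""


def extract_dc(xml_text):
    """Map each DC element name to the list of its values (possibly empty)."""
    root = ET.fromstring(xml_text)
    values = {}
    for name in DC_ELEMENTS:
        # Elements are looked up by their namespace-qualified tag.
        tag = "{" + DC_NS + "}" + name
        values[name] = [el.text or "" for el in root.iter(tag)]
    return values


if __name__ == "__main__":
    for name, found in extract_dc(SAMPLE_DC).items():
        if found:
            print(name, found)
```

Using lists for every element, even those that usually occur once, keeps the consumer code free of special cases.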
49
Tech’- Records, items & DC
Response in XML format: http://cs1.ist.psu.edu/cgi-bin/oai.cgi?verb=GetRecord&identifier=oai:CiteSeerPSU:1&metadataPrefix=oai_dc
50
Tech’- Flow control
Some of the requests can generate a very long response (for example, think about asking CiteSeer or the Library of Congress to list ALL their records using the ListRecords verb).
To avoid long responses that would overload the server, a flow control mechanism was added to the protocol.
Splitting long responses into shorter ones is solely the server's responsibility; the client has no control over the length of the responses.
51
Tech’- Flow control
The flow control mechanism is referred to as the “resumption token”. The server splits the long response into shorter ones and attaches a token to the end of each response; the client passes that token in its next request to get the next part.
52
Tech’- Flow control
Harvester (Service Provider) and Repository (Data Provider):
Harvester: “want to have all your records”
  archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc
Repository: “have 267, but give you only 100”
  100 records + resumptionToken “anyID1”
Harvester: “want more of this”
  archive.org/oai?verb=ListRecords&resumptionToken=anyID1
Repository: “have 267, give you another 100”
  100 records + resumptionToken “anyID2”
Harvester: “want more of this”
  archive.org/oai?verb=ListRecords&resumptionToken=anyID2
Repository: “have 267, give you my last 67”
  67 records + resumptionToken “” (an empty token means the list is complete)
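The exchange above amounts to a simple loop on the harvester side: keep requesting until the repository returns an empty resumption token. The sketch below simulates the repository in memory so that it runs without a network; `fake_repository` and `harvest_all` are hypothetical names, and real tokens are opaque strings chosen by the server, not offsets as in this stand-in.

```python
def fake_repository(arguments):
    """Stand-in for a repository holding 267 records, 100 per response."""
    records = ["record-%d" % i for i in range(267)]
    token = arguments.get("resumptionToken")
    # In this simulation the token is simply the offset of the next record.
    start = int(token) if token else 0
    end = min(start + 100, len(records))
    next_token = str(end) if end < len(records) else ""
    return records[start:end], next_token


def harvest_all(send_request, metadata_prefix):
    """Collect every record by following resumption tokens until empty."""
    arguments = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    harvested = []
    while True:
        batch, token = send_request(arguments)
        harvested.extend(batch)
        if not token:
            return harvested
        # Follow-up requests carry only the verb and the token.
        arguments = {"verb": "ListRecords", "resumptionToken": token}


if __name__ == "__main__":
    print(len(harvest_all(fake_repository, "oai_dc")))
```

Passing the request function in as a parameter keeps the loop independent of how requests are actually sent, so the same loop could drive a real HTTP client.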
53
Conclusions and future use
We saw that the increasing number of digital libraries caused problems for the different DL types:
– FDLs and HDLs had to overcome obstacles in order to federate or harvest data from SDLs, for example different metadata formats and different query formats.
– Users had to learn the different user interface that each SDL offered.
54
Conclusions and future use
Looking at OAI-PMH, it seemed that putting the protocol to use would eliminate those problems.
Service providers can lower the number of different user interfaces the user needs to handle, and federating or harvesting becomes much easier with a common standard.
However…
55
Conclusions and future use
When the protocol is put to use in a digital library environment, the lack of strict rules may cause new problems or make the old ones reappear in another form.
Let's take CiteSeer as an example.
It contains 723140 records and its metadata size is around 1 GB.
If one wanted to harvest CiteSeer efficiently for records dealing with a specific topic, how could it be done?
56
Conclusions and future use
Since searching within the metadata is done on the harvester side, the harvester cannot ask CiteSeer to give it only records dealing with "network computation", for example.
Remember the sets? Could they be used to harvest only part of the information instead of handling a gigabyte of data?
The answer is no, since CiteSeer contains only one set.
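In practice this means the topic filter has to run on the harvester side, after the metadata has already been transferred. A minimal sketch of such a filter, assuming each harvested record is held as a dict that maps Dublin Core element names to lists of strings (a hypothetical in-memory form, not something the protocol prescribes):

```python
def filter_by_topic(records, phrase):
    """Keep records whose title or subject values contain the phrase."""
    needle = phrase.lower()
    matches = []
    for record in records:
        # Elements may be absent, so fall back to an empty list.
        fields = record.get("title", []) + record.get("subject", [])
        if any(needle in value.lower() for value in fields):
            matches.append(record)
    return matches


if __name__ == "__main__":
    # Invented sample records, for illustration only.
    harvested = [
        {"title": ["A study of network computation"], "subject": ["networks"]},
        {"title": ["Digital library services"], "subject": ["libraries"]},
    ]
    print(filter_by_topic(harvested, "network computation"))
```

The cost the slide points out remains: every record still has to be harvested before it can be discarded.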
57
Conclusions and future use
DC might also be too low a barrier, which leads more and more SDLs not only to support DC but also to create their own metadata formats (CiteSeer, for example, supports two formats).
Nevertheless, OAI-PMH is increasingly becoming a standard in digital libraries and is making a large contribution to DLs. From the looks of it, it’s here to stay.
58
What's next
Riddle –
– Improving harvesting and creation of HDLs.
– Composition of HDLs.
59
What's next
[Diagram: layered model]
Layer 1: Web
Layer 2: SDL, SDL, SDL
(OAI-PMH between Layer 2 and Layer 3)
Layer 3: HDL
Layer 4: CHDL
Layer 5: Web interfaces
60
Demonstration
Independent queries.
Repositories explorer: http://re.cs.uct.ac.za/
OAIster (FDL): http://oaister.umdl.umich.edu/o/oaister/
Scirus (FDL): http://www.scirus.com/srsapp/
Riddle demo: http://riddle.dynalias.com:20055/riddle.html
61
Resources
OAI – official site: http://www.openarchives.org/
Protocol specification: http://www.openarchives.org/OAI/openarchivesprotocol.html
General mailing list: http://www.openarchives.org/mailman/listinfo/OAI-general/
Implementers mailing list: http://www.openarchives.org/mailman/listinfo/OAI-implementers/
Presentation on which this presentation was based: http://www.oaforum.org/otherfiles/lisb_tutorial.ppt
Z39.50: http://www.loc.gov/z3950/agency/
62
Questions
63
The end