Outline of the course
Introduction to Digital Libraries (15%) Description of Information (30%) Access to Information (30%)( ) User Services (10%) Additional topics (15%)p ( )
Buliding of a (small) digital library
Reference material:– Ian Witten, David Bainbridge, David Nichols, How to build a Digital
Library Morgan Kaufmann 2010 ISBN 978 0 12 374857 7Library, Morgan Kaufmann, 2010, ISBN 978-0-12-374857-7(Second edition)
– The Web
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -1
Where are we
Digital Libraries– Description of information
• MetadataMARC– MARC
– Dublin Core– MODS– METS– TEI– EADEAD– ......
• Knowledge Representation– FRBR– RDF
• InteroperabilityFUB 2012-2013 Vittore Casarosa – Digital Libraries
• InteroperabilityPart 6 -2
Interoperability
Interoperability is the ability of systems, services p y y y ,and organisations to work together seamlessly toward common or diverse goals. In the technical garena it is supported by open standards for communication between systems and for ydescription of resources and collections, among others. Interoperability is of paramountg p y prelevance in the context of resource discovery and access.
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -3
Interoperabilityin digital libraries (1/2)in digital libraries (1/2)
Interoperability between vendors– Different databases and user interfaces
Interoperability between different organisations– Eg. using different library formats
Interoperability between groups of users– Eg. Public libraries/Academic libraries– Eg. libraries in different countries
Interoperability between communities– Eg. libraries, publishers, archives, museums
Interoperability across time – Preservation
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -4
Interoperabilityin digital libraries (2/2)in digital libraries (2/2)
Traditional libraries interoperabilityU i l (OPAC )– Union catalogs (OPACs)
– Interlibrary loan Digital libraries interoperabilityg p y
– Documents (digital objects, resources)– Metadata
C t l d l– Conceptual models– Protocols
• Z39.50• OAI-PMH
– Queries• Z39.50 queries (Type 1, Z39.58, CQL)Z39.50 queries (Type 1, Z39.58, CQL) • SRU, SRW• SPARQL
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -5
Z39.50
"Information Retrieval (Z39.50); Application Service Definition and Protocol Specification ANSI/NISO Z39 50 1995"Protocol Specification, ANSI/NISO Z39.50-1995"
Developed by NISO (National Information Standards Organization), the standards development organization serving libraries, publishing
d i f ti iand information services NISO was (is) the Z39 Committee of ANSI (American National
Standards Institute), and Z39.50 was the 50th standard defined by NISONISO
Current version (Version 3) was adopted in 1995, superceding earlier versions adopted in 1992 and 1988 (1984 version was rejected)– Another revision, initiated in 2001, is still “work in progress”
Z39.50 was heavily influenced by OSI, and was an “application layer” protocol that needed a full-duplex reliable OSI connectionprotocol that needed a full duplex reliable OSI connection– In Version 3 it runs over TCP/IP
It is a wide ranging protocol for information retrieval between a client and a database server which attempts to standardize shared
FUB 2012-2013 Vittore Casarosa – Digital Libraries
and a database server, which attempts to standardize shared semantic knowledge
Part 6 -6
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 7
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 8
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 9
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 10
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 11
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 12
Z39.50 architectural model (1/2)
A server houses one or more databases containing records. g Associated with each database are a set of access points (indexes)
that can be used for searchinghow to segment logical data into relations and how to name the– how to segment logical data into relations and how to name the columns in the relations are hidden (server-specific)
Z39.50 includes a set of “registries” that provide each (application) domain with a an agreed-upon structure and attributes (query syntax, attribute fields, content retrieval format, etc.)
A search (sent from the client - origin to the server - target) produces A search (sent from the client origin to the server target) produces a set of records, called a "result set", that are maintained on the serverTh li t h l f ti f h t ( t The client has also functions for search management (e.g. request progress reports for an active search, authorize the server to continue a resource intensive search, abort an active search)
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -13
Z39.50 architectural model (2/2)
Records from the result set can be retrieved by the client, which has i f lli h d f f h dmany options for controlling the contents and format of the records
that are returned (e.g. sorting a result set, selecting a subset of the result set, using the result set for a new search)
The client has available also a general mechanism called "extended services" to invoke services on the server, which can survive past the end of the session (e g saving result sets across sessions queuingend of the session (e.g. saving result sets across sessions, queuing result sets for print or electronic mail processing at the server, registering queries that would be executed periodically on the server)
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -14
Z39.50 functionality
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -15
Initialization facility
Init service: establishes Z-association
O i i TInit requestVersion (id/password)Origin TargetVersion, (id/password),option flags,message sizes,implementation information
Init responseResult, version,option flags,message sizes,implementation information
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -16
Search facility
Search service
Search requestSearch type, query,
Origin Targetyp , q y,
databases,result setlimits for small, medium, large, g
Search responseNumber of records found,number of records attached,,status information,(records)
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -17
Retrieval facility
Present serviceOrigin TargetPresent request
Number of records,starting point,result setresult set
Present responseNumber of returned records,stat s
Segment service
status,(records)
Segment service– Allows a “Present response” that is larger than max size to
be split in segments
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -18
Sort facility
Sort service
Origin TargetSort requestresult set to sortOrigin Targetresult set to sort,sorted result set,sort directives
Sort responsestatus
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -19
Browse facility
Scan service
Origin TargetScan requestdatabase termOrigin Targetdatabase, termlist, starting point,number of terms,(step size)
Scan responsestatusnumber of elements(elements)( )
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -20
Result-set-delete facility
Delete service
Origin TargetDelete requestlist of result setsto delete
Delete responseDelete responsestatus
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -21
Access control facility
Access-control service
Origin TargetRequest
Access control response
Request
Access control responseSecurity-challenge
Access control requestSecurity-challenge-response
Response
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -22
AccountingResource control facilityResource control facility
Resource-control service Trigger-resource-control service Resource report service Resource-report service
– Complex functionality to control and report resource usageresource usage
– Mostly used for fee based operation
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -23
Termination facility
Close service– Terminates a Z-association
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -24
Explain facility
Explain service– Gives access to information about the Z39.50 target
• Databases• Access points• Access points• Query languages• Element sets• ...
This information is maintained by the server in a specific data base and therefore can be accessed using thedata base, and therefore can be accessed using the Search and Retrieve facilities of Z39.50
The idea is that a (smart) client, when accessing a ( ) , g(unknown) data base, could be able to find its access points, its element sets and other info by querying the ”Explain” data base
FUB 2012-2013 Vittore Casarosa – Digital Libraries
Explain data basePart 6 -25
Extended Service facility
Extended Services service– Persistent Result Set Extended Service– Persistent Query Extended Service– Periodic Query Schedule Extended Service– Item Order Extended Service– Database Update Extended Service– Export Specification Extended Service
Task package– Used to create, modify or delete an Extended
Service Request
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -26
Queries
Query typesy yp– Type-0: proprietary between 2 parties– Type-1: RPN (Reverse Polish Notation)Type-1: RPN (Reverse Polish Notation)– Type-2: ISO 8777
Type 100: Z39 58– Type-100: Z39.58– Type-101: Extended RPN (v 2)– Type 102: Ranked List query
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -27
Type-1 QueryReverse Polish NotationReverse Polish Notation
Consists of– One or more operands linked (RPN style) with Boolean
operators (AND, OR, AND_NOT)– Every operand is a search expression consisting of 7 parts
Example of query( d)( d)– (operand)(operand)operator
– (“Mark Twain”, 1:1003, 2:3, 3:1, 4:1, 5:100, 6:1)(“Clemence Samuel” 1:1003 2:3 3:3 4:101 5:100 6:2)( Clemence, Samuel , 1:1003, 2:3, 3:3, 4:101, 5:100, 6:2)AND_NOT
RPN(3 + 5) * (7 – 2)
+5
-27
*5
FUB 2012-2013 Vittore Casarosa – Digital Libraries
3 5 + 7 2 – * 3 8 8 40Part 6 -28
Operands in Type-1 queries
0. TermWhat you are looking for– What you are looking for
1.Use Attributes – Which abstract access point to use (e.g. title, author)
2 Relation Attributes 2.Relation Attributes– Relation between the term and the data in the access point (e.g. less than, equals,
phonetic equals) 3 Position Attributes 3.Position Attributes
– Where in the access point should the term be? (e.g. first in field, first in subfield) 4.Structure Attributes
– How is the query term to be treated? (e g as phrase as words as date asHow is the query term to be treated? (e.g. as phrase, as words, as date, as normalised name)
5.Truncation Attributes – Should truncation be applied on the match? (e.g. left truncation, right and left
truncation, no truncation) 6.Completeness Attributes
– What is the term to be matched against? (e.g. part of subfield, whole subfield, hole field)
FUB 2012-2013 Vittore Casarosa – Digital Libraries
whole field)
Part 6 -29
From Z39.50 to SRW/U
Need for a generic Information Retrieval capability more suited to the Web Architecture
Motivation to create an easy to implement protocol with (more or less) the power of Z39.50
Use existing off the shelf solutions where possible Re-evaluate Z39.50, “a good idea at the time” Avoid library-centric perspective
Solution: SRU – Search/Retrieve via URL SRU – Search/Retrieve via URL SRW – Search/Retrieve via Web Service (SRW is now
called SRU over SOAP)
FUB 2012-2013 Vittore Casarosa – Digital Libraries
called SRU over SOAP)
Part 6 -30
Simple SRU query
http://sru.miketaylor.org.uk/sru.pl?http://sru.miketaylor.org.uk/sru.pl?version=1.1&operation=searchRetrieve&query=dinosaur&query=dinosaur&startRecord=1&maximumRecords=1&recordSchema=dc
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -31
SRU response in XML (1/2)
<?xml version="1.0"?>hR t i R<zs:searchRetrieveResponse
xmlns:zs='http://www.loc.gov/zing/srw/'><zs:version>1.1</zs:version><zs:numberOfRecords>29</zs:numberOfRecords><zs:records>
.... details in a moment ....
</zs:records></zs:records></zs:searchRetrieveResponse>
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -32
SRU response in XML (2/2)
<zs:record><zs:recordSchema>info:srw/schema/1/dc-v1 1</zs:recordSchema><zs:recordSchema>info:srw/schema/1/dc v1.1</zs:recordSchema><zs:recordPacking>xml</zs:recordPacking><zs:recordPosition>1</zs:recordPosition><zs:recordData><srw dc:dc xmlns:srw dc="info:srw/schema/1/dc-schema"<srw_dc:dc xmlns:srw_dc= info:srw/schema/1/dc-schema
xmlns="http://purl.org/dc/elements/1.1/"><title>Fossils</title><creator>Lappi, Megan.</creator><type>text</type><type>text</type><publisher>New York, NY: Weigl Publishers</publisher><date>2005</date><language>en</language><d i ti >St d i f il F il f t G<description>Studying fossils -- Fossil facts -- Goneforever -- A fossil is born -- From bone to stone --Insects in amber -- Dinosaur footprints</description>
<identifier>http://www.loc.gov/catdir/toc/ecip0415/2004004136.html
</identifier><identifier>URN:ISBN:1590362136</identifier>
</srw_dc:dc>
FUB 2012-2013 Vittore Casarosa – Digital Libraries
</zs:recordData></zs:record>
Part 6 -33
Contextual Query Language
CQL (formerly known as Common Query Language) is the query language used in SRU
The conceptual model of CQL is the same as Z39.50– The server has one or more databases, containing records– The databases can be searched through access points, or
i dindexes The language defines a number of defaults to make
simple queries really simplesimple queries really simple Ar the same time it defines a number of Indexes,
Relations Relation Modifiers Booleans and BooleanRelations, Relation Modifiers, Booleans and Boolean Modifiers to increase the expressing power of the language
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -34
CQL search clause
subject any/relevant "fish frog"subject any/relevant "fish frog"subject any/relevant fish frog
i d l tiRelation Search term
subject any/relevant fish frog
i d l tiRelation Search termindex relation modifier
Search termindex relation modifierSearch term
Subject to context qualificationSubject to context qualification
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -35
Learning curves for query languageslanguages
SQLCQL
SQLCQL
SQLCQL
SQLCQLCQLCQL
lear
n
CQLCQL
lear
n
GoogleGoogleffort
to
GoogleGoogleffort
to
EfEf
Expressive PowerExpressive Power
FUB 2012-2013 Vittore Casarosa – Digital Libraries
pp
Part 6 -36
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 37
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 38
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 39
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 40
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 - 41
Z39.50 and OAI-PMH
It is interesting to see another protocol for “resourcediscovery”, namely the Open Archive Initiative (OAI-PMH) protocol
Historical separation from Z39.50– OAI-PMH appears about 15 years after Z39.50
C f Cultural separation from Z39.50– Z39.50 originated in the traditional library community
OAI PMH i i t d i th “W b C it ”– OAI-PMH originated in the “Web Community” Conceptual separation from Z39.50
Z39 50 b d lid (b t h d b lk ) f d ti– Z39.50 based on solid (but heavy and bulky) foundations– OAI-PMH based on simple and pragmatic ideas
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -42
OAI – Open Archives Initiative
The roots of OAI lie in the development of eprint archivese oots o O e t e de e op e t o ep t a c es(i.e. Institutional Repositories) such as arXiv, CogPrints, NACA (NASA), RePEc, NDLTD, NCSTRL, etc.
Each repository offered a web interface for deposit of articles and for end-user searches
It was difficult for end-users to work across archives without having to learn multiple different interfacesI iti l i t f i l h i t f t ll Initial experiments for single search interface to all archives
Universal Pre print Service (UPS) renamed OAI at the Universal Pre-print Service (UPS) renamed OAI at the Santa Fe Convention (1999)
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -43
Searching versus Harvesting
Two possible approaches for single search interface to all hiarchives
– cross searching multiple archives based on protocol like Z39.50(possibly lighter)(p y g )
– harvesting metadata into one or more ‘central’ services Problems with cross searching
– Not scalable (overall performance determined by slowest server)– Problems of deciding which servers to target (collection
descritpions not consistent))– Differences in interfaces and query languages– Problems in the ranked merging of results (different types and
size of targets can skew results)size of targets can skew results)– Browse interface very difficult to build
Decision was to go with harvesting
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -44
OAI - PMH
OAI Protocol for Metadata Harvestingg Data providers make metadata available for harvesting Service Providers harvest metadata Metadata can be centrally collected or “aggregated” Data Providers
– Are creators and keepers of the metadata for objects (repositories) and (possibly but not necessarily) archives of resources
– Handle deposit and publishing Service Providers
– Are harvesters of metadata for the purpose of providing a service such as a search interface, peer-review system, etc.
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -45
OAI – PMH overview
Harvestingb dHarvestingb dbased onOAI-PMHbased onOAI-PMH
Aggregator
Searchingbased onSearchingbased on
Aggregator
based onZ39.50 orSRW
based onZ39.50 orSRW
Service providersService providersFUB 2012-2013 Vittore Casarosa – Digital Libraries
Se ce p o de sSe ce p o de sPart 6 -46
OAI disclaimer
The OAI use of the term ‘archive’ in fact implies very little of what we normally associate with archivesnormally associate with archives
No preservation aspect is implied whatsoever (not what the protocol is about at all)
No appraisal or provenance of eprints or digital objects is implied by this descriptive term
The term simply refers to a collection of digital objects (full text, p y g j (learning objects, etc.) which might (only might) also have been harvested along with the metadata
‘Archive’ is a term within the OAI PMH which serves to distinguish a gcollection of digital objects (the ‘archive’) from the collected metadata associated with these objects, described as ‘repositories’
In OAI repositories expose metadata about ePrints (strictly there areIn OAI repositories expose metadata about ePrints (strictly there are no metadata archives)
In OAI archives hold ePrints (strictly there are no eprint repositories)
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -47
Conceptual model of OAI data
resourceresourceresourceresource
all available metadataitem = all available metadataitem = all available metadataitem = all available metadata about David
itemitem = identifier
all available metadata about David
itemitem = identifier
all available metadata about David
itemitem = identifier
recordsDublin Coremetadata
MARCmetadata
SPECTRUMmetadata recordsDublin Core
metadataMARC
metadataSPECTRUM
metadataDublin Core
metadataMARC
metadataSPECTRUM
metadata
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -48
OAI – PMH records
A record contains the metadata of a resource in a specific fformat
It has three partsheader (mandatory)– header (mandatory)
• identifier • datestamp p
– metadata (mandatory)• XML encoded metadata with root tag, namespace
it i t t D bli C• repositories must support Dublin Core• may support other formats
– about (optional)( p )• rights statements• provenance statements
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -49
OAI-PMH Protocol Overview
Protocol based on HTTP Request arguments as GET or POST parameters Six request types (verbs)q yp ( ) Responses are encoded in XML syntax Supports any metadata format (Dublin Core mandatory pp y ( y
for each data provider) Support selective harvesting
– logical set hierarchy (data providers)– date stamps (last change of metadata set)
Flow control (token to retrieve subsequent records) Error messages
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -50
OAI – PMH verbs
Identify– description of an archive
ListMetadataFormatst i il bl t d t f t f hi– retrieve available metadata formats from archive
ListSetsretrieve set structure of a repository– retrieve set structure of a repository
ListIdentifiers– abbreviated form of ListRecords, retrieving only headersabbreviated form of ListRecords, retrieving only headers
ListRecords– harvest records from a repository
GetRecord– retrieve individual metadata record from a repository
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -51
Overview of OAI - PMH
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -52
OAI – PMH request
Requests must be submitted using the GET or POST methods of HTTP
Repositories must support both methods At least one key=value pair: verb=[RequestType] Additional key=value pairs depend on request type Example for GET request
– http://archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -53
OAI – PMH response
Formatted as HTTP responses Content type must be text/xml HTTP compression optional in OAI-PMH XML declaration XML declaration
(<?xml version="1.0" encoding="UTF-8" ?>) Root element named OAI-PMH with three attributes (xmlns,
xmlns:xsi, xsi:schemaLocation) Three child elements
ResponseDate (UTC datetime)– ResponseDate (UTC datetime)– Request (copy of the request that generated the response)– a) error (in case of an error or exception condition)) ( p )– b) element with the name of the OAI-PMH request
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -54
OAI – PMH example
http://edoc.hu-berlin.de/OAI-2.0?verb=ListIdentifiers&from=2002-01-06&until=2002 01 08&until=2002-01-08&metadataPrefix=oai_dc&set=doctypes:dissertationsyp
ListIdentifiers returns the record headers
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -55
Response to ListIdentifiers (1/2)
<?xml version="1.0" encoding="UTF-8"?> <OAI PMH l "htt // hi /OAI/2 0/"<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www openarchives org/OAI/2 0/OAI PMH xsd">http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd > <responseDate>2002-10-22T17:49:49+01:00</responseDate> <request verb="ListIdentifiers" from="2002-01-03" until="2002-01-08"
metadataPrefix="oai dc" set="doctypes:dissertations">metadataPrefix= oai_dc set= doctypes:dissertations >http://edoc.hu-berlin.de/OAI-2.0</request>
<ListIdentifiers>
...... details in a moment
</ListIdentifiers> </OAI-PMH>
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -56
Response to ListIdentifiers (2/2)
<ListIdentifiers> <header><header>
<identifier>oai:HUBerlin.de:3000819</identifier> <datestamp>2002-01-08</datestamp> <setSpec>doctypes</setSpec> y<setSpec>doctypes:dissertations</setSpec> <setSpec>dnb</setSpec> <setSpec>dnb:dnb33</setSpec>
</header></header> <header>
<identifier>oai:HUBerlin.de:3000831</identifier> <datestamp>2002-01-07</datestamp> p p<setSpec>doctypes</setSpec> <setSpec>doctypes:dissertations</setSpec> <setSpec>dnb</setSpec> <setSpec>dnb:dnb27</setSpec><setSpec>dnb:dnb27</setSpec>
</header> </ListIdentifiers>
FUB 2012-2013 Vittore Casarosa – Digital Libraries Part 6 -57
Where are we
Digital LibrariesDi f i f ti– Discovery of information
• Describing Information– Metadata
• MARCD bli C• Dublin Core
• MODS• METS• TEI• EAD• ......
– Knowledge Representation• FRBR• RDF
• Interoperability– Queries
• Z39.50 queries• Common Command Language
(CCL – ISO 8777 or Z39 58)(CCL ISO 8777 or Z39.58)– Protocols
• Z39.50• SRU/SRW
FUB 2012-2013 Vittore Casarosa – Digital Libraries
• OAI-PMH
Part 6 -58