Delivering MARC/XML records
from the Library of Congress
catalogue using the open
protocols SRW/U and Z39.50
Mike Taylor, Index Data
Delivering MARCXML using SRW/U Mike Taylor, Index Data <[email protected]>
Overview
Where we're headed in the next half-hour:
Existing standards for library catalogues
The new XML equivalents of these standards
Providing XML access to existing catalogues
Two services running from two databases
Two services running from a single database
New gateway running over the existing service
The Library of Congress's solution
Existing standards for catalogues
The value of existing standards is well understood:
MARC (MAchine Readable Catalogue) records
ISO 2709 (interchange format for MARC)
ANSI/NISO Z39.50 (search and retrieve on the Internet)
These standards allow interoperability and co-operation
between libraries that other fields can only dream about.
(Librarians don't know how lucky they are!)
Z39.50 for searching catalogues
[Diagram: a Z39.50 client fetches MARC records from the Library of Congress Z39.50 server over Z39.50.]
[Diagram: the same client also searches the British Library Z39.50 server.]
[Diagram: the same client also searches a local catalogue Z39.50 server.]
Z39.50 for searching multiple catalogues
[Diagram: a metasearching Z39.50 client searches the Library of Congress, British Library and local catalogue Z39.50 servers, all over Z39.50.]
Trouble in paradise
Then the serpent saith unto Adam, “Lo, why doth thy
catalogue service not use XML?” And Adam saith, “Verily,
Z39.50 worketh just fine.” But the serpent, who was subtle
of tongue, saith unto him, “But XML is more fashionable.”
And, behold, Adam was deceived, and did fall.
-- The Book of Standards, ch. 3, v. 4-6.
[Diagram: a metasearching Z39.50 client searches the Library of Congress, British Library and local catalogue Z39.50 servers, all over Z39.50.]
Welcome to the 21st Century
Everything must be XML.
[Diagram: a metasearching Z39.50 client searches the Library of Congress, British Library and local catalogue Z39.50 servers, all over Z39.50.]
Resistance is useless!
Catalogue standards in an XML world
The binary USMARC format is superseded by MARCXML.
“As many of the original developers of Dublin Core were
Americans, various parochial national standards were
referenced. This will hopefully get fixed with the belated
discovery of the rest of the planet.” (Unattributed, sadly.)
Enter MarcXchange, a MARCXML superset that can
represent all the national MARC formats (DANMARC, etc.)
(Though repairing MARCXML might have been better.)
Catalogue standards in an XML world
The binary Z39.50 protocol is superseded by SRU
(Search/Retrieve via URL). This is a NISO-registered
standard for expressing queries using rich URLs, to obtain
XML responses that contain records matching the query.
http://sru.miketaylor.org.uk/sru.pl?version=1.1&operation=searchRetrieve&query=dinosaur&startRecord=1&maximumRecords=1&recordSchema=dc
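A URL like the one above is just a base address plus a handful of named parameters, so it can be assembled mechanically. A minimal Python sketch, assuming only the parameters visible in that URL; the function name sru_search_url and its defaults are made up for illustration:

```python
from urllib.parse import urlencode


def sru_search_url(base_url, query, start_record=1, maximum_records=1,
                   record_schema="dc", version="1.1"):
    """Build an SRU searchRetrieve URL for a CQL query (illustrative helper)."""
    params = {
        "version": version,
        "operation": "searchRetrieve",
        "query": query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
        "recordSchema": record_schema,
    }
    # urlencode percent-encodes spaces, quotes and so on in the CQL query.
    return base_url + "?" + urlencode(params)


url = sru_search_url("http://sru.miketaylor.org.uk/sru.pl", "dinosaur")
print(url)
```

Letting the library do the encoding matters once queries contain spaces or quotation marks, as most non-trivial CQL does.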
An SRU response (single DC record)
<?xml version="1.0"?>
<zs:searchRetrieveResponse xmlns:zs='http://www.loc.gov/zing/srw/'>
  <zs:version>1.1</zs:version>
  <zs:numberOfRecords>29</zs:numberOfRecords>
  <zs:records>
    <zs:record>
      <zs:recordSchema>info:srw/schema/1/dc-v1.1</zs:recordSchema>
      <zs:recordPacking>xml</zs:recordPacking>
      <zs:recordPosition>1</zs:recordPosition>
      <zs:recordData>
        <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema" xmlns="http://purl.org/dc/elements/1.1/">
          <title>Fossils</title>
          <creator>Lappi, Megan.</creator>
          <type>text</type>
          <publisher>New York, NY: Weigl Publishers</publisher>
          <date>2005</date>
          <language>en</language>
          <description>Studying fossils -- Fossil facts -- Gone forever -- A fossil is born -- From bone to stone -- Insects in amber -- Dinosaur footprints</description>
          <identifier>http://www.loc.gov/catdir/toc/ecip0415/2004004136.html</identifier>
          <identifier>URN:ISBN:1590362136</identifier>
        </srw_dc:dc>
      </zs:recordData>
    </zs:record>
  </zs:records>
</zs:searchRetrieveResponse>
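Because the response is plain XML, any XML parser can pick it apart. A rough Python sketch using only the standard library, run against a trimmed copy of the response shown here; the function name summarise is invented for this example:

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the sample response.
NS = {
    "zs": "http://www.loc.gov/zing/srw/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed copy of the sample response (fewer Dublin Core fields).
SAMPLE = """<?xml version="1.0"?>
<zs:searchRetrieveResponse xmlns:zs='http://www.loc.gov/zing/srw/'>
  <zs:version>1.1</zs:version>
  <zs:numberOfRecords>29</zs:numberOfRecords>
  <zs:records>
    <zs:record>
      <zs:recordSchema>info:srw/schema/1/dc-v1.1</zs:recordSchema>
      <zs:recordPacking>xml</zs:recordPacking>
      <zs:recordPosition>1</zs:recordPosition>
      <zs:recordData>
        <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema"
                   xmlns="http://purl.org/dc/elements/1.1/">
          <title>Fossils</title>
          <creator>Lappi, Megan.</creator>
          <date>2005</date>
        </srw_dc:dc>
      </zs:recordData>
    </zs:record>
  </zs:records>
</zs:searchRetrieveResponse>"""


def summarise(response_xml):
    """Return (total hit count, titles of the records actually returned)."""
    root = ET.fromstring(response_xml)
    total = int(root.findtext("zs:numberOfRecords", namespaces=NS))
    titles = [t.text for t in root.iterfind(".//dc:title", NS)]
    return total, titles


total, titles = summarise(SAMPLE)
print(total, titles)
```

Note that numberOfRecords counts all matches, while the records element holds only the slice requested with startRecord and maximumRecords.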
SRU's big brother: SRW
SRU works by fetching rich URLs.
SRW (Search/Retrieve Webservice) works over SOAP.
In theory, SRW is more powerful and flexible than SRU.
In practice, it is hard to implement and runs more slowly.
It is still important because many Big Players (Microsoft,
IBM, etc.) have a big investment in SOAP.
However, most implementations have used SRU. With
HTTP/1.1 persistent connections, performance is fine.
SRU's query language: CQL
CQL (Common Query Language) is used by SRU and SRW.
It may also be used in other contexts (including Z39.50).
Its syntax is easy to learn, but very expressive.
dinosaur
title=dinosaur
title=(dinosaur or pterosaur) and author=martill
dc.title=*saur and dc.author=martill
title exact "the complete dinosaur" and date < 2000
name=/phonetic "smith"
fish prox/distance<3/unit=sentence frog
Now what?
We have:
A mature, functional infrastructure based on MARC and Z39.50
A world out there that is comfortable with XML-based technology
An XML-based equivalent of MARC (MARCXML/MarcXchange)
An XML-based equivalent of Z39.50 (SRU)
But we don't have:
Actual running SRU servers that deliver MARCXML records.
Can we get there from here?
Server providers don't want to switch
[Diagram: the Library of Congress runs only an SRU server; an existing Z39.50 client tries to reach it over Z39.50. Uh-oh!]
Client applications don't want to switch
[Diagram: the Library of Congress runs only a Z39.50 server; a new SRU client tries to reach it over SRU. Uh-oh!]
Transition period: run both services
[Diagram: the Library of Congress runs both a Z39.50 server, reached by Z39.50 clients over Z39.50, and an SRU server, reached by SRU clients over SRU.]
Transition period: run both services
This approach gives client applications a choice:
Existing client applications continue to work
New applications can be built using new technology
This flexibility comes at a cost to the service providers,
who have to provide not one but two services.
How can they do this? There are three approaches.
The two-database approach
[Diagram: the Library of Congress Z39.50 server reads a MARC database and the Library of Congress SRU server reads a separate MARCXML database, each through a proprietary API.]
Why the two-database approach sucks
The two-database approach has the advantage of conceptual and
operational simplicity. The two separate systems can be
maintained by separate teams.
However: THE TWO DATABASES HAVE TO BE KEPT
SYNCHRONISED.
At best this entails duplication of effort.
At worst, it fails completely, and a record fetched from one
database may differ from the same record fetched
from the other database. (If it exists at all.)
The one-database-two-services approach
[Diagram: the Library of Congress Z39.50 server and the Library of Congress SRU server both read the same MARC database, each through a proprietary API.]
Advantages of the 1D2S approach
When both services use data from the same database,
only one copy of the database has to be maintained.
This approach has several advantages:
Eliminates duplication
Reduces redundancy
Reduces redundancy
Eliminates duplication
The horrible truth
[Diagram: the Library of Congress Z39.50 server and its proprietary database sit inside an opaque black box. The Library of Congress SRU server has no way in: No API!]
When the database (and Z39.50 server) are part of an integrated
proprietary system, the SRU server runs into a brick wall.
The solution
[Diagram: the Library of Congress SRU server reaches the proprietary database through the Library of Congress Z39.50 server: a black box with a little hole.]
Z39.50 IS the API!
Why this is so cute
When the SRU server uses Z39.50 as its API to the database,
it is an SRU-to-Z39.50 gateway. Its front-end is an SRU
server and its back-end is a Z39.50 client.
This rocks because:
No duplication of data is necessary
No co-operation is necessary from the existing software
Use of the standard Z39.50 protocol as the API to the
database means that THE SAME GATEWAY can be
used to provide SRU access to ANY CATALOGUE
that is already available via Z39.50.
A novel application of Z39.50
Z39.50 is most often used to allow a client to query a
remote server.
Here we are using it as a tightly integrated part of
a locally provided service -- the gateway will typically run
on the same machine as the Z39.50 server, or on a
“nearby” machine on the same LAN.
HOWEVER, because Z39.50 is a network API rather than
a link-time API, other interesting arrangements are possible.
Typical architecture: “integrated” SRU
[Diagram: an SRU client talks SRU to the Library of Congress SRU server, which sits in front of the Library of Congress Z39.50 server and its proprietary database (the opaque black box).]
Alternative architecture: “3rd party” SRU
[Diagram: an SRU client talks SRU to a 3rd party SRU server, which talks Z39.50 to the Library of Congress Z39.50 server and its proprietary database (the opaque black box). The components are labelled as running in England, the USA and Denmark.]
“What's it like?”
SRU client software neither knows nor cares that the
server it is connected to is really a gateway.
The application user knows nothing about the Z39.50 database.
You might expect that performance would degrade due
to the additional step.
In practice, with a high-quality gateway, performance of
the SRU server greatly exceeds that of the underlying
Z39.50 server. (This is done using magic.)
The Library of Congress's solution
The Library of Congress contracted Index Data (that's us)
to build an SRU-to-Z39.50 gateway for them.
Having built it, we released it under an Open Source licence
(the GNU General Public Licence).
The LC SRU server is available to anyone at: http://z3950.loc.gov:7090/Voyager
The gateway is freely available to download at: http://indexdata.com/yazproxy/
(Digression: why is it called YAZ Proxy?)
YAZ is our battle-tested and widely deployed Z39.50 toolkit.
(It powers 2/3 of all Z39.50 clients and servers worldwide.)
YAZ Proxy is so called because it acts as a Z39.50-to-Z39.50
gateway as well as SRU-to-Z39.50 (and SRW-to-Z39.50).
Why would you want a Z39.50 proxy? For the same reasons
you want a Web proxy such as Squid:
Reduce load on the underlying server
Improve client performance through caching
Protect fragile back-end by sanitising client requests
Balance load over multiple back-end servers
What YAZ Proxy does
For each SRU Search Request that it receives, YAZ Proxy:
Translates the CQL query into a Z39.50 Type-1 query
Embeds the translated query in a Z39.50 Search Request
Sends the request to the back-end server
(Asynchronously) awaits the Z39.50 Search Response
Extracts the MARC records from the response
Converts them into MARCXML
Embeds the converted records in an SRU Search Response
Returns the response to the client
All this is transparent to the SRU client and the Z39.50 server.
The sauropod dinosaur Brachiosaurus
(It's been a while since we had a picture.)
YAZ Proxy in detail: performance features
Access to the LC catalogue -- whether by Z39.50 or SRU --
is much faster through YAZ Proxy than directly.
YAZ Proxy re-uses a pool of initialised back-end sessions
It can pre-cache a set of ready-to-use back-end sessions
Query-caching avoids repeated identical searches
Record-caching allows repeated requests for the same
record to be instantaneous
The total effect is that access via YAZ Proxy is typically 10-100
times faster. (Source: Larry Dixson of the Library of Congress.)
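The query-caching idea is simple enough to show in a few lines. This is only a toy Python sketch of the principle, not YAZ Proxy's actual code; CachingProxy and slow_backend are invented names, and the back-end is a stand-in for a real Z39.50 search:

```python
class CachingProxy:
    """Toy query cache in front of a slow back-end search function."""

    def __init__(self, backend_search):
        self._backend_search = backend_search  # hypothetical callable: query -> records
        self._cache = {}
        self.backend_calls = 0  # how often the back-end was really consulted

    def search(self, query):
        # Only an unseen query reaches the back-end; repeats are served from memory.
        if query not in self._cache:
            self.backend_calls += 1
            self._cache[query] = self._backend_search(query)
        return self._cache[query]


def slow_backend(query):
    # Stand-in for a real (and slow) Z39.50 search.
    return [f"record matching {query!r}"]


proxy = CachingProxy(slow_backend)
first = proxy.search("dinosaur")
second = proxy.search("dinosaur")
print(proxy.backend_calls)
```

A real cache would also need a size limit and an expiry policy, which this sketch leaves out.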
YAZ Proxy in detail: load balancing
YAZ Proxy can be configured to balance load across
multiple back-end Z39.50 servers. Queries are generally
sent to the least heavily loaded back-end.
This allows a heavily-used service to be scaled across multiple
servers, distributed and made robust against system failure.
(Arrangements must be made to keep the multiple copies
up to date and synchronised.)
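"Least heavily loaded" can be approximated by counting the requests currently in flight on each back-end. A small Python sketch under that assumption; the class names and server names are hypothetical, and this is not how YAZ Proxy itself is written:

```python
class Backend:
    """One back-end Z39.50 server, tracked by its number of in-flight requests."""

    def __init__(self, name):
        self.name = name
        self.active = 0


class LeastLoadedBalancer:
    def __init__(self, backends):
        self.backends = list(backends)

    def acquire(self):
        # Pick the back-end with the fewest in-flight requests
        # (the first one listed wins a tie).
        backend = min(self.backends, key=lambda b: b.active)
        backend.active += 1
        return backend

    def release(self, backend):
        # Call this when the back-end has answered.
        backend.active -= 1


balancer = LeastLoadedBalancer([Backend("z3950-a"), Backend("z3950-b")])
one = balancer.acquire()    # both idle, so the first is chosen
two = balancer.acquire()    # the other one is now less loaded
balancer.release(one)
three = balancer.acquire()  # the first is idle again
print([b.name for b in (one, two, three)])
```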
YAZ Proxy in detail: query translation
Both CQL and the Z39.50 Type-1 query allow application-specific
extensions (e.g. geospatial searching, thesaurus navigation).
Translation from CQL to Type-1 is therefore driven by a simple
configuration file which maps CQL index-names, relations, etc.
into Z39.50 Type-1 query attributes.
index.cql.serverChoice = 1=1016
index.rec.id = 1=12
index.dc.title = 1=4
index.dc.subject = 1=21
relation.< = 2=1
relation.le = 2=2
relationModifier.relevant = 2=102
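To show how such a mapping drives translation, here is a deliberately tiny Python sketch. It reads the configuration lines above and rewrites the simplest CQL clauses into PQF, the prefix notation that YAZ uses to write Type-1 queries. The function names are invented, only three clause shapes are handled, and treating "=" as needing no relation attribute is an assumption of this sketch:

```python
import re

CONFIG = """\
index.cql.serverChoice = 1=1016
index.rec.id = 1=12
index.dc.title = 1=4
index.dc.subject = 1=21
relation.< = 2=1
relation.le = 2=2
relationModifier.relevant = 2=102
"""


def parse_config(text):
    """Read 'key = value' lines into a dict (blank lines are skipped)."""
    mapping = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" = ")
            mapping[key.strip()] = value.strip()
    return mapping


# Only 'term', 'index=term' and 'index<term' are recognised in this sketch.
CLAUSE = re.compile(r"^\s*(?:(?P<index>[\w.]+)\s*(?P<rel>=|<)\s*)?(?P<term>\w+)\s*$")


def to_pqf(cql, mapping):
    match = CLAUSE.match(cql)
    if match is None:
        raise ValueError(f"unsupported CQL in this sketch: {cql!r}")
    # A bare term searches the server's default index.
    index = match.group("index") or "cql.serverChoice"
    attrs = [mapping["index." + index]]
    rel = match.group("rel")
    if rel and rel != "=":  # assumption: '=' needs no relation attribute
        attrs.append(mapping["relation." + rel])
    return " ".join(f"@attr {a}" for a in attrs) + " " + match.group("term")


MAPPING = parse_config(CONFIG)
print(to_pqf("dc.title=dinosaur", MAPPING))
```

Boolean operators, quoted phrases and relation modifiers need a proper CQL parser; the point here is only that the index and relation names are looked up rather than hard-wired.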
YAZ Proxy in detail: record translation
Translating MARC (ISO2709) records into MARCXML is a core
function of YAZ Proxy.
It can also be configured to further transform the translated
MARCXML records using arbitrary XSLT stylesheets.
Standard stylesheets support translation to:
Dublin Core
MODS
METS
Other formats, such as OAI_DC, are easy to support.
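The first step, from ISO 2709 to MARCXML, is mostly bookkeeping: read the leader, walk the directory, and wrap each field in an element. A simplified Python sketch of that conversion, not YAZ Proxy's implementation. It assumes the record is UTF-8 encoded (real records are often MARC-8) and does no error handling; build_record is a hypothetical helper that exists only to make a sample record:

```python
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"
FIELD_END, SUBFIELD, RECORD_END = b"\x1e", b"\x1f", b"\x1d"


def el(parent, name, attrib=None):
    """Add a namespaced MARCXML child element."""
    return ET.SubElement(parent, f"{{{MARCXML_NS}}}{name}", attrib or {})


def iso2709_to_marcxml(raw):
    """Convert one ISO 2709 record (bytes) into a MARCXML record element."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])          # where the field data starts
    directory = raw[24:base - 1]       # 12-byte entries, minus the terminator
    record = ET.Element(f"{{{MARCXML_NS}}}record")
    el(record, "leader").text = leader
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length - 1]  # drop field terminator
        if tag < "010":                # control fields have no indicators
            el(record, "controlfield", {"tag": tag}).text = data.decode("utf-8")
            continue
        field = el(record, "datafield",
                   {"tag": tag, "ind1": chr(data[0]), "ind2": chr(data[1])})
        for chunk in data[2:].split(SUBFIELD)[1:]:
            el(field, "subfield", {"code": chr(chunk[0])}).text = chunk[1:].decode("utf-8")
    return record


def build_record(fields):
    """Hypothetical helper: assemble a tiny ISO 2709 record for the demo."""
    directory, body = b"", b""
    for tag, data in fields:
        data += FIELD_END
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    base = 24 + len(directory) + 1
    total = base + len(body) + 1
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_END + body + RECORD_END


raw = build_record([
    ("001", b"12345"),
    ("245", b"10" + SUBFIELD + b"aFossils" + SUBFIELD + b"cLappi, Megan."),
])
record = iso2709_to_marcxml(raw)
ET.register_namespace("", MARCXML_NS)
print(ET.tostring(record, encoding="unicode"))
```

The XSLT stage that follows would take this MARCXML as its input. The Python standard library has no XSLT processor, so that part is not shown.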
But, Mike! This is too good to be true!
Yes.
But how do you people make a living?
Apart from living on good karma, we make money from:
Bespoke development (e.g. building YAZ Proxy)
Customisation (e.g. adding support for new XML formats)
Integration (e.g. making the proxy use local authentication)
Support contracts (but these are strictly optional)
Consultancy
We also provide services such as hosted SRU-to-Z39.50
gateways, so YOUR ORGANISATION could support SRU
(and SRW) access, and accelerate its Z39.50 service,
without requiring you to install any software.
Thanks for listening!
You know where to find us. http://indexdata.com/
Tel. +45 3341 0100
Fax. +45 3341 0101