resourcesync: web-based resource synchronization
DESCRIPTION
Presentation about the NISO/OAI ResourceSync effort used at TICER 2012 Summer School.
TRANSCRIPT
ResourceSync – Herbert Van de Sompel TICER Summer School, August 22 2012, Tilburg, The Netherlands
ResourceSync: Web-Based
Resource Synchronization
Herbert Van de Sompel
Los Alamos National Laboratory @hvdsomp
ResourceSync is funded by The Sloan Foundation & JISC
ResourceSync Core Team – NISO & OAI
Cornell University & OAI: Bernhard Haslhofer, Carl Lagoze, Simeon Warner
Old Dominion University & OAI: Michael L. Nelson
Los Alamos National Laboratory & OAI: Martin Klein, Robert Sanderson, Herbert Van de Sompel
NISO: Todd Carpenter, Nettie Lagace, Peter Murray
ResourceSync Technical Group
• Manuel Bernhardt, Delving B.V.
• Kevin Ford, Library of Congress
• Richard Jones, JISC
• Graham Klyne, JISC
• Stuart Lewis, JISC
• David Rosenthal, LOCKSS
• Christian Sadilek, Red Hat
• Shlomo Sanders, Ex Libris, Inc.
• Sjoerd Siebinga, Delving B.V.
• Ed Summers, Library of Congress
• Jeff Young, OCLC Online Computer Library Center
ResourceSync
• ResourceSync: What & Why?
• Problem Perspective & Conceptual Approach
• Technical Details
• Q&A
Synchronize What?
• Web resources – things with a URI that can be dereferenced and are cache-able (no dependency on underlying OS, technologies etc.)
• Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
• That change slowly (weeks/months) or quickly (seconds), and where latency needs may vary
• Focus on needs of research communication and cultural heritage organizations, but aim for generality
Why?
… because lots of projects and services are doing synchronization but have to resort to ad-hoc, case by case, approaches!
• Project team involved with projects that need this
• Experience with OAI-PMH: widely used in repos but
o XML metadata only
o Attempts at synchronizing actual content via OAI-PMH (complex object formats, dc:identifier) not successful.
o Web technology has moved on since 1999
• Devise a shared solution for data, metadata, linked data?
Use Cases – The Basics
Use Cases - More
Out Of Scope (For Now)
• Bidirectional synchronization
• Destination-defined selective synchronization (query)
• Bulk URI migration
Use Case: arXiv Mirroring
• 1M article versions, ~800/day created or updated at 8 PM US Eastern Time
• Metadata and full-text for each article
• Accuracy important
• Want low barrier for others to use
• Look for more general solution than current homebrew mirroring (running with minor modifications since 1994!) and occasional rsync (filesystem layout specific, auth issues)
Use Case: DBpedia Live Duplication
• Average of 2 updates per second
• Want low latency => need a push technology
ResourceSync Problem
• Consideration:
o Source (server) A has resources that change over time: they get created, modified, deleted
o Destination (servers) X, Y, and Z leverage (some) resources of Source A.
• Problem:
o Destinations want to keep in step with the resource changes at Source A: resource synchronization.
• Goal:
o Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities.
o The approach must scale better than recurrent HTTP HEAD/GET on resources.
Destination: 3 Basic Synchronization Needs
1. Baseline synchronization – A destination must be able to perform an initial load or catch-up with a source
- avoid out-of-band setup
2. Incremental synchronization – A destination must have some way to keep up-to-date with changes at a source
- subject to some latency; minimal: create/update/delete
- allow to catch-up after destination has been offline
3. Audit – A destination should be able to determine whether it is synchronized with a source
- subject to some latency
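The audit need listed above can be illustrated with a minimal sketch that compares what a destination holds against what a source says it has. The function names `audit` and `in_sync`, and the choice to represent both sides as dictionaries mapping a URI to a last-modification datetime string, are assumptions made for illustration only; they are not defined by ResourceSync.

```python
def audit(source_inventory, local_state):
    """Compare two {uri: lastmod} mappings and report the differences.

    Both mappings are hypothetical stand-ins: the source side would come
    from an inventory published by the source, the local side from the
    destination's own store.
    """
    # Resources the source lists but the destination does not hold.
    missing = sorted(u for u in source_inventory if u not in local_state)
    # Resources the destination holds but the source no longer lists.
    extra = sorted(u for u in local_state if u not in source_inventory)
    # Resources held on both sides with differing modification datetimes.
    stale = sorted(
        u for u in source_inventory
        if u in local_state and local_state[u] != source_inventory[u]
    )
    return {"missing": missing, "stale": stale, "extra": extra}


def in_sync(report):
    """True when the audit report lists no differences at all."""
    return not any(report.values())
```

A destination that finds entries under `missing` or `stale` would fall back to baseline or incremental synchronization for those URIs; entries under `extra` are candidates for removal.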
Source Capability 1: Describing Content
In order to advertise the resources that a source wants destinations to know about, it may describe them:
o Publish an inventory of resource URIs and possibly associated metadata
- Destination GETs the Content Description
- Destination GETs listed resources by their URI
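As a rough sketch of the destination side of this capability, the snippet below reads an inventory expressed as a plain Sitemap (the `urlset`, `url`, `loc` and `lastmod` elements of the sitemaps.org format) and returns the listed URIs. The function name `parse_inventory` and the example URIs are assumptions for illustration; any ResourceSync-specific extension elements are deliberately left out.

```python
import xml.etree.ElementTree as ET

# Namespace of the standard Sitemap format (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def parse_inventory(xml_text):
    """Return a list of (uri, lastmod) pairs from a Sitemap document.

    lastmod is None when the entry carries no <lastmod> element.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall(f"{{{SITEMAP_NS}}}url"):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc")
        lastmod = url.findtext(f"{{{SITEMAP_NS}}}lastmod")
        if loc:
            entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return entries


# Hypothetical inventory with two resources.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/res1</loc><lastmod>2012-08-08T08:15:00Z</lastmod></url>
  <url><loc>http://example.com/res2</loc></url>
</urlset>"""

for uri, lastmod in parse_inventory(EXAMPLE):
    print(uri, lastmod)
```

A destination would then issue a GET for each returned URI; that step is omitted here so the sketch runs without network access.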
Source Capability 2: Communicating Change Events
In order to achieve lower latency, a source may communicate about changes to its resources:
o 2.1. Change Set: Publish a list of recent change events (create, update, delete resource)
- Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
o 2.2. Push Change Set: Push a list of recent change events (create, update, delete resource) towards (a) destination(s)
- Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
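How a destination might act upon change events, whether pulled or pushed, can be sketched as follows. The event representation (a dictionary with `uri` and `type` keys), the function name `apply_change_events`, and the injected `fetch` callable are all assumptions made for illustration; the slides do not prescribe them.

```python
def apply_change_events(events, mirror, fetch):
    """Apply change events, in order, to a local mirror.

    events: iterable of {"uri": ..., "type": "create" | "update" | "delete"}
            (hypothetical shape, not a ResourceSync format).
    mirror: dict mapping URI to locally stored content.
    fetch:  callable that returns the current content for a URI,
            e.g. a wrapper around an HTTP GET.
    """
    for event in events:
        uri, change = event["uri"], event["type"]
        if change in ("create", "update"):
            # GET the created or updated resource and store it.
            mirror[uri] = fetch(uri)
        elif change == "delete":
            # Remove the deleted resource; ignore it if never mirrored.
            mirror.pop(uri, None)
        else:
            raise ValueError(f"unknown change type: {change!r}")
    return mirror
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to exercise with a stub.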
Source Capability 3: Providing Access to Versions
In order to allow a destination to catch up with missed changes, a source may support:
o 3.1. Historical Change Sets: Provide access to change events that occurred prior to the ones listed in the current Change Set
o 3.2. Historical Content: Provide access to prior resource versions
Source Capability 4: Transferring Content
By default, content is transferred in response to a GET issued by a destination against a URI of a source’s resource. But a source may support additional mechanisms:
o 4.1. Dump: Publish a package of resource representations and necessary metadata
- Destination GETs the Dump
- Destination unpacks the Dump
o 4.2. Alternate Content Transfer: Support alternative mechanisms to optimize getting content, e.g. content via a mirror site, only changes not the entire changed resource.
Source: Advertise Capabilities
A source needs to advertise the capabilities it supports to allow a destination to discover them
• Some capabilities may be provided by a third party, not the source itself
o e.g. Historical Change Sets, Historical Content
o But the source should still make those third party capabilities discoverable - trust
So Many Choices
XMPP, AtomPub, SDShare, RSS, Atom, PubSubHubbub, Sitemap, rsync, OAI-PMH, WebDAV Col. Syn., OAI-ORE, DSNotify, RDFsync, Crawl, Push, Pull, SWORD, SPARQLpush
A Framework Based on Sitemaps
• Modular framework allowing selective deployment
• Sitemap is the core component throughout the framework
o Introduce extension elements and attributes:
- In ResourceSync namespace (rs:) to accommodate synchronization needs
- In XHTML namespace (xhtml:) mainly to accommodate discovery needs
o Reuse Sitemap format for Change Sets (both current and historical) and for manifest in Dump
Source Capabilities – Destination Needs
Sitemap with Added Datetime
Change Types: Extend lastmod, Use expires
Sitemap with lastmod and expires
Sitemap Discovery via robots.txt
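Sitemap discovery via robots.txt relies on the `Sitemap:` line that the Sitemap protocol allows in a site's robots.txt file. A minimal sketch of how a destination could extract those locations is shown below; the function name `sitemap_urls_from_robots` and the example URLs are assumptions for illustration.

```python
def sitemap_urls_from_robots(robots_txt):
    """Return the Sitemap locations declared in the text of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        # Split on the first colon only, so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content.
ROBOTS = """User-agent: *
Disallow: /private/
# inventory of resources
Sitemap: http://example.com/sitemap.xml
"""

print(sitemap_urls_from_robots(ROBOTS))
```

The field name is matched case-insensitively, and more than one `Sitemap:` line is allowed, so the function returns a list.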
Change Set: An rs Typed Sitemap
More rs Extension Elements
Change Set with rs and xhtml Extensions
Change Set Discovery via Sitemap
Pushing Change Sets via XMPP PubSub
XMPP Publish-Subscribe: Client to Subscription Service, Subscription Service to Client(s) communication
• One of the XMPP (Extensible Messaging and Presence Protocol) extensions http://xmpp.org/extensions/xep-0060.html
• Apple Notifications based on XMPP PubSub
• Available tools, see http://xmpp.org/about-xmpp/technology-overview/pubsub/#impl-client
o XMPP Servers with PubSub support:
- ejabberd, OpenFire, Tigase, SleekXMPP
o XMPP libraries with PubSub support:
- Strophe (C, JavaScript), XMPP4R (Ruby), SleekXMPP (Python), PubSub Client (Python)
Change Set via XMPP
Push Change Set Discovery via Sitemap
Discovering a Historical Change Set via a Current Change Set
Discovering Historical Content – Link to Version Resource
Memento Intermezzo
http://www.mementoweb.org/
Original Resources and Mementos
Bridge from Present to Past
Bridge from Past to Present
Memento Framework
Discovering Historical Content – Link to Memento TimeGate
Dump
• Two formats currently under discussion:
o Format based on ZIP:
- Package content
- Add manifest (manifest.xml) expressed in Sitemap format
- ZIP it up
o WARC files as used by the web archiving community
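The ZIP-based option can be sketched roughly as below: package the content, add a `manifest.xml` in Sitemap format, and ZIP it up. This is an illustration under stated assumptions, not the format under discussion: the namespace URI bound to `rs:` is a placeholder, expressing the file path as an `rs:path` child element of each `url` entry is a guess at the syntax, and the function name `build_dump` and the in-memory input shape are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumption: placeholder URI, NOT the real ResourceSync namespace.
RS_NS = "http://example.org/rs-placeholder#"


def build_dump(resources):
    """Build a ZIP dump in memory and return its bytes.

    resources: {uri: (path_in_zip, content_bytes)} (hypothetical shape).
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("rs", RS_NS)

    # Manifest in Sitemap format: one <url> entry per packaged resource.
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for uri, (path, _content) in resources.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
        # Assumed syntax: map the URI to its file path inside the ZIP.
        ET.SubElement(url, f"{{{RS_NS}}}path").text = path
    manifest = ET.tostring(urlset, encoding="utf-8")

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.xml", manifest)
        for path, content in resources.values():
            zf.writestr(path, content)
    return buf.getvalue()
```

A destination would do the reverse: open the ZIP, read `manifest.xml`, and use each entry to associate an unpacked file with the URI of the resource it represents.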
Mapping URI to File Path with rs:path
Manifest (manifest.xml) Expressed in Sitemap Format
Dump Discovery via Sitemap
Alternate Location
Alternate Protocol, e.g. Obtain Changes Only
Timeline
• August 2012
o First draft spec shared for feedback with ResourceSync team
• September 2012
o In-person meeting of ResourceSync Team
o Revise spec, conduct experiments
o Solicit broad feedback
o Paper in D-Lib Magazine
• December 2012 – Finalize specification (?)
Pointers
• First draft spec: http://www.openarchives.org/rs/0.1/resourcesync
• Simulator code on github http://github.org/resync/simulator
• NISO workspace http://www.niso.org/workrooms/resourcesync/
• List for public comment coming soon