resourcesync tutorial
DESCRIPTION
These slides are a tutorial for the OAI ResourceSync framework.
TRANSCRIPT
ResourceSync Tutorial, DANS, January 21 2014, Den Haag, Netherlands
ResourceSync: A Web-Based Resource Synchronization Framework
ResourceSync is funded by The Sloan Foundation & JISC
#resourcesync
These slides were presented at the LITA Forum, Louisville, Kentucky, November 10 2013
The most recent version of the slides is available at
http://www.slideshare.net/OpenArchivesInitiative/resourcesync-tutorial
ResourceSync Tutorial History
• First outing: OAI8, June 2013
• Second run: Open Repositories, July 2013
• Third run: JCDL, July 2013
• Fourth run: TPDL 2013, September 2013
• Fifth run: LITA Forum, November 2013
• Sixth run: SWIB 2013, November 2013
Presenters
Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp
Martin Klein, Los Alamos National Laboratory, @mart1nkle1n
ResourceSync Tutorial Contributors
Simeon Warner, Cornell University, @zimeon
Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp
Robert Sanderson, Los Alamos National Laboratory, @azaroth24
Richard Jones, Cottage Labs, @cottagelabs
Michael L. Nelson, Old Dominion University, @phonedude_mln
ResourceSync Technical Group
OAI
Herbert Van de Sompel, Martin Klein, Robert Sanderson (Los Alamos National Laboratory)
Simeon Warner (Cornell University)
Bernhard Haslhofer (University of Vienna)
Michael L. Nelson (Old Dominion University)
Carl Lagoze (University of Michigan)
NISO
Todd Carpenter, Nettie Lagace
University of Oxford
Graham Klyne
Lyrasis
Peter Murray
JISC
Richard Jones, Graham Klyne
Stuart Lewis
OCLC
Jeff Young
LOCKSS
David Rosenthal
RedHat
Christian Sadilek
Ex Libris Inc.
Shlomo Sanders
Library of Congress
Kevin Ford
Paul Walk
Timeline, Status of Specification(s)
• August 2013
  o Release of ResourceSync framework Core specification - Version 0.9.1
  o Public draft of ResourceSync Archives specification released
• September 2013
  o Core specification on its way to become an ANSI standard
• November 2013
  o Internal draft of ResourceSync Notification specification
• January 2014
  o Public draft of ResourceSync Notification specification
• Mid 2014
  o Core specification becomes ANSI/NISO standard
Pointers
• Specification
http://www.openarchives.org/rs/
http://www.openarchives.org/rs/resourcesync
http://www.openarchives.org/rs/notification
http://www.openarchives.org/rs/archives
• List for public comment
https://groups.google.com/d/forum/resourcesync
• Client and simulator code
http://github.org/resync/resync
http://github.org/resync/simulator
Papers
• Klein, M., and Van de Sompel, H. (2013) Extending Sitemaps for ResourceSync. http://arxiv.org/abs/1305.4890 ACM/IEEE JCDL 2013.
• Haslhofer, B., Warner, S., Lagoze, C., Klein, M., Sanderson, R., Nelson, M.L., and Van de Sompel, H. (2013) ResourceSync: Leveraging Sitemaps for Resource Synchronization. http://arxiv.org/abs/1305.1476 WWW 2013 Developer Track.
• Klein, M., Sanderson, R., Van de Sompel, H., Warner, S., Haslhofer, B., Lagoze, C., and Nelson, M.L. (2013) A Technical Framework for Resource Synchronization. http://dx.doi.org/10.1045/january2013-klein D-Lib Magazine.
• Van de Sompel, H., Sanderson, R., Klein, M., Nelson, M.L., Haslhofer, B., Warner, S., and Lagoze, C. (2012) A Perspective on Resource Synchronization. http://dx.doi.org/10.1045/september2012-vandesompel D-Lib Magazine.
ResourceSync - Agenda
1. ResourceSync: Problem Perspective & Conceptual Approach
2. Motivation & Use Cases
3. Framework Walkthrough
4. Framework (Technical) Details
5. Implementation
6. Q&A
Synchronize What?
• Web resources
  o things with a URI that can be dereferenced
• Focus on needs of research communication and cultural heritage organizations
  o but aim for generality
• Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
• Low change frequency (weeks/months) to high change frequency (seconds)
• Synchronization latency and accuracy needs may vary
Why?
… because lots of projects and services are doing synchronization but have to resort to ad-hoc, case by case, approaches!
• Project team involved with projects that need this
• Experience with OAI-PMH: widely used in repos but
  o XML metadata only
  o Web technology has moved on since 1999
• Devise a shared solution for data, metadata, linked data?
ResourceSync Problem
• Consideration:
  o Source (server) A has resources that change over time: they get created, modified, deleted
  o Destination (servers) X, Y, and Z leverage (some) resources of Source A.
• Problem:
  o Destinations want to keep in step with the resource changes at Source A: resource synchronization.
• Goal:
  o Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities.
  o The approach must scale better than recurrent HTTP HEAD/GET on resources.
Source: Core Synchronization Capabilities (PULL)
1. Describing content – publish a list of resources available for synchronization to enable Destinations to perform an initial load or catch-up with a Source
2. Packaging content – bundle resources to enable bulk download by Destinations
3. Describing changes – publish a list of resource changes to enable Destinations to stay synchronized and decrease latency
4. Packaging changes – bundle resource changes for bulk download by Destinations
Source: Notification Capabilities (PUSH)
To reduce synchronization latency and to optimize the synchronization process the Source can support:
1. Change Notification
  o Notifies about changes to particular resources
  o e.g., resource A has been updated | created | deleted
2. Framework Notification
  o Notifies about changes to capabilities, i.e., their documents
  o e.g., a Change List has been updated | created | deleted
Source: Archival Capabilities (ARCHIVES)
The Source may hold on to historical data, for example, to allow Destinations to catch up with events they missed or revisit prior resource states. To this end, the Source can publish archives, i.e. documents that enumerate historical capability documents:
1. Resource List Archive
2. Resource Dump Archive
3. Change List Archive
4. Change Dump Archive
Source: Synchronization Features
1. Discovery of capabilities – support Destinations in discovering all offered capabilities
  o Applies to PULL, PUSH, ARCHIVES capabilities
2. Linking to related resources – provide links from resources subject to synchronization to related resources
  o Applies to PULL, PUSH capabilities
Destination: Synchronization Needs
1. Baseline synchronization – A Destination must be able to perform an initial load or catch-up with a Source
  - avoid out-of-band setup
2. Incremental synchronization – A Destination must have some way to keep up-to-date with changes at a Source
  - subject to some latency; minimal: create/update/delete
  - allow to catch-up after Destination has been offline
3. Audit – A Destination should be able to determine whether it is synchronized with a Source
  - regarding coverage and accuracy
2. Motivation & Use Cases
Use Cases – The Basics (figure, panels a to d)
Use Cases – The not-so-Basics (figure, panels e and f)
Use Case 1: arXiv Mirroring and Data Sharing
• Repository of scholarly articles in physics, mathematics, computer science, etc.
• > 850k articles
• approx. 1.5 revisions per article on average
• approx. 75k new articles per year
• Each article has full-text and separate metadata record
• approx. 3.8M resources
• 2,700 updates daily
  o at 8pm EST
  o Currently using homebrew mirroring solution (running with minor modifications since 1994!)
  o occasional rsync (file system-specific, auth issues)
Use Case 1: arXiv Mirroring
• GOAL: Keep mirror sites synchronized with daily changes
• WANT:
  o high consistency
  o moderate latency
  o robustness to global network outages (low admin effort)
  o ability to verify sync status in case of questions
Use Case 1: arXiv Data Sharing
• GOAL: Make resources and update information publicly available so that any other service may synchronize at the frequency it needs, e.g.
  o Math Front at UC Davis
  o EprintWeb from IOP in UK
  o Data for bibliometric and scientometric analysis
• WANT:
  o low admin effort (i.e. standard approach, standard tools)
  o reasonable consistency, latency, efficiency
Use Case 2: DBpedia Live Duplication
• Average of 2 updates per second
• Low latency desirable => need for a push technology
• Daily traffic:
  o 99% updates
  o 0.6% deletions
  o 0.03% creations
• Figure: # of content transfer events in two 8 hour intervals
• Figure: max. queue size of remote duplication process
3. Framework Walkthrough
Source Capability 1: Describing Content
In order to advertise the resources that a Source wants Destinations to know about, it may describe them:
o Publish a Resource List, a list of resource URIs and possibly associated metadata
  - Destination GETs the Resource List
  - Destination GETs listed resources by their URI
o A Resource List describes the state of a set of resources at one point in time (snapshot)
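As a rough Source-side illustration of "publish a Resource List", the sketch below serializes a snapshot of (URI, last-modified) pairs in the Sitemap-based form that the technical part of this tutorial describes. The function name `build_resource_list` and the sample URIs are assumptions made for illustration; only the element and attribute names (`urlset`, `url`, `loc`, `lastmod`, `rs:md`, `capability`, `at`) come from the tutorial's own examples.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"

# Serialize with the same prefixes the tutorial's examples use.
ET.register_namespace("", SM)
ET.register_namespace("rs", RS)


def build_resource_list(resources, at):
    """Build a Resource List (hypothetical helper).

    resources: iterable of (uri, lastmod) pairs describing the snapshot.
    at: timestamp at which the snapshot was taken.
    """
    root = ET.Element(f"{{{SM}}}urlset")
    ET.SubElement(root, f"{{{RS}}}md", {"capability": "resourcelist", "at": at})
    for uri, lastmod in resources:
        url = ET.SubElement(root, f"{{{SM}}}url")
        ET.SubElement(url, f"{{{SM}}}loc").text = uri
        ET.SubElement(url, f"{{{SM}}}lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


resource_list_xml = build_resource_list(
    [
        ("http://example.com/res1", "2013-01-02T13:00:00Z"),
        ("http://example.com/res2", "2013-01-02T14:00:00Z"),
    ],
    at="2013-01-03T09:00:00Z",
)
```

A Destination would then GET this document and issue one GET per listed URI.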
Source Capability 2: Packaging Content
By default, content is transferred in response to a GET issued by a Destination against a URI of a Source’s resource. But a Source may support additional mechanisms:
o Publish a Resource Dump, a document that points to packages of resource representations and necessary metadata
  - Destination GETs the package
  - Destination unpacks the package
  - ZIP format supported
o A Resource Dump and the packages it points to reflect the state of a set of resources at one point in time (snapshot)
Source: Modular Capabilities
Source Capability 3: Describing Changes
In order to achieve lower latency and/or greater efficiency, a Source may communicate about changes to its resources:
o Publish a Change List, a list of recent change events (created, updated, deleted resource)
  - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
o A Change List pertains to resources that changed in a temporal interval with a start- and an end-date
  - If a resource changed more than once, it will be listed more than once
Source Capability 4: Packaging Changes
In order to reduce the number of requests to obtain resource changes, a Source may provide packaged bitstreams for changed resources:
o Publish a Change Dump, a document that points to packages containing bitstreams of recently changed resources and necessary metadata
  - Destination GETs the package
  - Destination unpacks the package
  - ZIP format supported
o A Change Dump and its packages pertain to resources that changed in a temporal interval with a start- and an end-date
  - If a resource changed more than once, it will be included more than once
Source: Modular Capabilities
Destination: Key Processes
ResourceSync - Agenda
4. Framework (Technical) Details
1. Sitemaps
2. Core synchronization capabilities (PULL)
3. Discovery
4. Linking to related resources
5. Notification Capabilities (PUSH)
6. Archival capabilities (ARCHIVES)
So Many Choices
Existing technologies shown (figure): XMPP, AtomPub, SDShare, RSS, Atom, PubSubHubbub, Sitemap, rsync, OAI-PMH, WebDAV Col. Syn., OAI-ORE, DSNotify, RDFsync, Crawl, Push, Pull, SWORD, SPARQLpush
A Framework Based on Sitemaps
• Modular framework allowing selective deployment
• Sitemap is the core format throughout the framework
  o Introduce extension elements and attributes:
    - In ResourceSync namespace (rs:) to accommodate synchronization needs
  o Reuse Sitemap format for all capability documents: Resource List, Resource Dump, Change List, Change Dump, as well as for manifest in Dumps
  o Utilize Sitemap index format where needed/allowed
Sitemap Format
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
  …
</urlset>
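Because every ResourceSync document reuses this format, a plain Sitemap reader is the natural starting point for a Destination. The following is a minimal sketch using Python's standard library; the helper name `read_sitemap` is an assumption, and the sample document simply repeats the example above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
</urlset>"""


def read_sitemap(xml_text):
    """Return (loc, lastmod) pairs in document order (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


entries = read_sitemap(SITEMAP_XML)
```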
Sitemap Index Format
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://example.com/sitemap1.xml</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/sitemap2.xml</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </sitemap>
  …
</sitemapindex>
ResourceSync Sitemap Extensions
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln …/>
  <rs:md …/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:ln …/>
    <rs:md …/>
  </url>
  <url>
    …
  </url>
</urlset>
ResourceSync Sitemap Extensions
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln …/>
  <rs:md …/>
  <sitemap>
    <loc>http://example.com/sitemap1.xml</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:ln …/>
    <rs:md …/>
  </sitemap>
  …
</sitemapindex>
Resource Metadata Summary
Element/Attribute | Description | Defined by
<loc> | Resource URI (identity) | sitemaps
<lastmod> | Timestamp of last change | sitemaps
<changefreq> | Expected update frequency | sitemaps
<rs:md> | | ResourceSync
change | Change type (Change List & Change Dump Manifest only) | ResourceSync
encoding | HTTP Content-Encoding header value | RFC2616
hash | One or more content digests (md5, sha-1, sha-256) | Atom Link Ext.
length | HTTP Content-Length header value | RFC4287
path | Path in ZIP package (Dump Manifests only) | ResourceSync
type | HTTP Content-Type header value | RFC4287
Related Resource Metadata Summary
• Attributes of the <rs:ln> element; cf. resource metadata + pri
Element/Attribute | Description | Defined by
<rs:ln> | | ResourceSync
encoding | HTTP Content-Encoding header value | RFC2616
hash | One or more content digests (md5, sha-1, sha-256) | Atom Link Ext.
href | Related resource URI (identity) | RFC4287
length | HTTP Content-Length header value | RFC4287
modified | Timestamp of last change (cf. <lastmod>) | Atom Link Ext.
path | Path in ZIP package (Dump Manifests only) | ResourceSync
pri | Priority of link | RFC6249
rel | Relation - IANA registered or URI | RFC4287
type | HTTP Content-Type header value | RFC4287
Link Relation Summary
Relation | Use in ResourceSync | Defined in
rel="alternate" | Link from generic to specific URI | HTML 5
rel="canonical" | Link from specific to generic URI | RFC6596
rel="collection" | Resource is member of collection | RFC6573
rel="contents" | Link from dump to manifest | HTML4
rel="describedby" | Has metadata | Protocol for Web Description Resources (POWDER): Description Resources
rel="describes" | Is metadata for | The 'describes' Link Relation Type
rel="duplicate" | Mirror or alternative copy | RFC6249
rel=".../rs/terms/patch" | A patch (efficient change information) | This specification
rel="memento" | Link to time-specific URI | Memento Internet Draft
rel="timegate" | Link to timegate | Memento Internet Draft
rel="via" | Provenance chain, came from | RFC4287
ResourceSync Sitemap Validation
• All ResourceSync capability documents are valid according to the Sitemap XML Schema
  o http://www.sitemaps.org/schemas/sitemap/0.9
• For a more thorough validation use the ResourceSync XML Schema
  o http://www.openarchives.org/rs/0.9.1/resourcesync.xsd
4. Framework (Technical) Details: 2. Core synchronization capabilities (PULL)
http://www.openarchives.org/rs/resourcesync
Describing Content: Resource List
http://www.openarchives.org/rs/resourcesync#DescResources
Resource List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z" completed="2013-01-03T09:01:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
  </url>
  <url>
    …
  </url>
</urlset>
• Describe Source’s resources that are subject to synchronization
  o At one point in time (snapshot)
  o Creation can take some time – duration can be conveyed
• Typical Destination use: Baseline Synchronization, Audit
• Each URI typically listed only once
• Might be expensive to generate
• Destinations use @at to determine freshness
  o [@at, @completed] – interval of uncertainty
• Destination issues GETs against URIs to obtain resources
• Very similar to current Sitemaps
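To make the baseline-synchronization and audit uses concrete, here is a minimal Destination-side sketch that reads a Resource List and checks a locally held copy against the advertised md5 digest. The helper names are hypothetical, and treating the `hash` attribute as a whitespace-separated list of `algorithm:digest` tokens is an assumption drawn from the "one or more content digests" wording in the metadata summary. The sample document carries a digest computed for the sample bytes, not the digest from the slide.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

SAMPLE_BODY = b"<html>res1</html>"
SAMPLE_MD5 = hashlib.md5(SAMPLE_BODY).hexdigest()

RESOURCE_LIST = f"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:01:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:{SAMPLE_MD5}" length="{len(SAMPLE_BODY)}" type="text/html"/>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Return (document-level rs:md attributes, {uri: per-resource metadata})."""
    root = ET.fromstring(xml_text)
    doc_md = root.find("rs:md", NS)
    header = dict(doc_md.attrib) if doc_md is not None else {}
    resources = {}
    for url in root.findall("sm:url", NS):
        item_md = url.find("rs:md", NS)
        meta = dict(item_md.attrib) if item_md is not None else {}
        meta["lastmod"] = url.findtext("sm:lastmod", namespaces=NS)
        resources[url.findtext("sm:loc", namespaces=NS)] = meta
    return header, resources


def md5_matches(content, hash_attr):
    """True/False if an md5 digest is advertised, None if there is none."""
    for token in hash_attr.split():
        algorithm, _, digest = token.partition(":")
        if algorithm == "md5":
            return hashlib.md5(content).hexdigest() == digest
    return None


header, resources = parse_resource_list(RESOURCE_LIST)
in_sync = md5_matches(SAMPLE_BODY, resources["http://example.com/res1"]["hash"])
```

The `at` and `completed` values in `header` bound the interval of uncertainty: a resource that changed between the two timestamps may or may not be reflected in the list.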
What if I have a million resources?
• Current sitemap limit is 50k resources (or maximum document size of 50MB)
• Break complete list of resources into 50k-resource chunks, each in a Resource List document
• Create a Resource List Index document to group them:
  o Based on <sitemapindex>
  o May have up to 50k component Resource Lists
  o Extends capacity to 2,500,000,000 resources within current community practices
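The arithmetic behind these limits can be sketched as follows. The 50k figure is taken from the slide; the function names are assumptions, and the 50MB document-size limit is ignored for simplicity.

```python
import math

MAX_ENTRIES = 50_000  # per-document limit quoted on the slide


def plan_resource_lists(total_resources, max_entries=MAX_ENTRIES):
    """Return (number of Resource Lists, whether a Resource List Index is needed)."""
    n_lists = max(1, math.ceil(total_resources / max_entries))
    if n_lists > max_entries:
        raise ValueError("exceeds what a single Resource List Index can group")
    return n_lists, n_lists > 1


def chunk(uris, size=MAX_ENTRIES):
    """Yield consecutive slices of at most `size` URIs, one per Resource List."""
    for start in range(0, len(uris), size):
        yield uris[start:start + size]


one_million = plan_resource_lists(1_000_000)   # 20 lists plus an index
arxiv_scale = plan_resource_lists(3_800_000)   # approx. arXiv size from Use Case 1
capacity = MAX_ENTRIES * MAX_ENTRIES           # 2,500,000,000
```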
Resource List Index <resourcelist_index.xml>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2013-01-02T09:00:02Z"/>
  <sitemap>
    <loc>http://example.com/resourcelist1.xml</loc>
    <rs:md type="application/xml"/>
  </sitemap>
  <sitemap>
    <loc>http://example.com/resourcelist2.xml</loc>
    <rs:md type="application/xml"/>
  </sitemap>
</sitemapindex>
Resource List <resourcelist1.xml>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="index" href="http://example.com/resourcelist_index.xml"/>
  <rs:md capability="resourcelist" at="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T08:07:06Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
  </url>
  ...
</urlset>
Packaging Content: Resource Dump
http://www.openarchives.org/rs/resourcesync#ResourceDump
Resource Dump
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump" at="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/resourcedump_part1.zip</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md length="97553" type="application/zip"/>
    <rs:ln rel="contents" href="http://example.com/resourcedump_manifest-part1.xml" type="application/xml"/>
  </url>
  <url>
    <loc>http://example.com/resourcedump_part2.zip</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
</urlset>
Resource Dump Manifest
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest" at="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md type="text/html" path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md type="application/pdf" path="/resources/res2"/>
  </url>
</urlset>
• A Resource Dump points to packages (ZIP files) that contain representations of the Source’s resources
  o At one point in time (snapshot)
• Resource Dump is mandatory, even if there is only one ZIP file
• ZIP package contains manifest, listing contained bitstreams
• Typical Destination use: Baseline Synchronization, bulk download
• Each URI typically listed only once
• Might be expensive to generate
• Destinations use @at to determine freshness
  o [@at, @completed] – interval of uncertainty
• GETs against individual URIs from a Resource List achieve the same result (ignoring varying freshness)
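The unpacking step can be sketched as below: the manifest's `path` attribute maps each resource URI to a bitstream inside the ZIP package. This is an illustration under assumptions: the manifest is passed in as a string rather than read from the package (the slides do not name the manifest file), paths are treated as relative to the ZIP root by stripping the leading slash, and the package is built in memory so the example is self-contained.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

MANIFEST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest" at="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md type="text/html" path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <rs:md type="application/pdf" path="/resources/res2"/>
  </url>
</urlset>"""


def manifest_paths(manifest_xml):
    """Map each resource URI to its path inside the ZIP package."""
    root = ET.fromstring(manifest_xml)
    return {
        url.findtext("sm:loc", namespaces=NS): url.find("rs:md", NS).get("path")
        for url in root.findall("sm:url", NS)
    }


def load_dump(zip_bytes, manifest_xml):
    """Return {uri: bitstream} for every resource listed in the manifest."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as package:
        return {
            uri: package.read(path.lstrip("/"))  # assumption: path is ZIP-root relative
            for uri, path in manifest_paths(manifest_xml).items()
        }


# Build a stand-in package in memory; a Destination would GET it instead.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("resources/res1", b"<html>one</html>")
    package.writestr("resources/res2", b"%PDF-two")

contents = load_dump(buffer.getvalue(), MANIFEST)
```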
Describing Changes: Change List
http://www.openarchives.org/rs/resourcesync#DesChanges
Change List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated" hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
  </url>
  <url>
    …
  </url>
</urlset>
Open Change List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-02T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated" hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
  </url>
</urlset>
• A Change List pertains to a Source’s resources that changed
  o Changes that occurred during a temporal interval with start- and end-date
• Typical Destination use: Incremental Synchronization, Audit
• Changes are listed in chronological order
• Multiple changes to one resource result in the resource being listed multiple times, once per change
• Source determines duration of temporal interval
• Destinations use @from and @until to determine freshness
• Destinations issue GETs against URIs to obtain changed resources
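A Destination's incremental synchronization loop can be sketched as replaying the change events in document (chronological) order. The helper names, the in-memory `local` dictionary, and the `fetch` callback standing in for an HTTP GET are all assumptions for illustration; the sample Change List is made up to show a resource listed twice.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url><loc>http://example.com/res1</loc><rs:md change="updated"/></url>
  <url><loc>http://example.com/res2</loc><rs:md change="deleted"/></url>
  <url><loc>http://example.com/res3</loc><rs:md change="created"/></url>
  <url><loc>http://example.com/res1</loc><rs:md change="updated"/></url>
</urlset>"""


def read_changes(xml_text):
    """Return (uri, change) pairs in the order the Source listed them."""
    root = ET.fromstring(xml_text)
    return [
        (url.findtext("sm:loc", namespaces=NS), url.find("rs:md", NS).get("change"))
        for url in root.findall("sm:url", NS)
    ]


def apply_changes(local, changes, fetch):
    """Replay change events against a local {uri: content} store."""
    for uri, change in changes:
        if change in ("created", "updated"):
            local[uri] = fetch(uri)   # GET the current representation
        elif change == "deleted":
            local.pop(uri, None)      # remove the local copy if present
        else:
            raise ValueError(f"unknown change type: {change!r}")
    return local


fetched = []


def fake_fetch(uri):
    """Stand-in for an HTTP GET so the sketch runs offline."""
    fetched.append(uri)
    return f"content of {uri}"


local = {"http://example.com/res1": "old", "http://example.com/res2": "old"}
apply_changes(local, read_changes(CHANGE_LIST), fake_fetch)
```

Since res1 is listed twice, this naive replay fetches it twice; a Destination could instead collapse the list to the last event per URI before issuing GETs.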
Change List Index <changelist_index.xml>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <sitemap>
    <loc>http://example.com/changelist1.xml</loc>
    <lastmod>2013-01-02T11:00:00Z</lastmod>
    <rs:md type="application/xml"/>
  </sitemap>
  <sitemap>
    <loc>http://example.com/changelist2.xml</loc>
    <lastmod>2013-01-02T23:00:00Z</lastmod>
    <rs:md type="application/xml"/>
  </sitemap>
</sitemapindex>
Change List <changelist1.xml>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="index" href="http://example.com/changelist_index.xml"/>
  <rs:md capability="changelist" from="2013-01-02T09:00:00Z" until="2013-01-02T21:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated" hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="8876" type="text/html"/>
  </url>
</urlset>
Open Change List Index
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-02T09:00:00Z"/>
  <sitemap>
    <loc>http://example.com/changelist1.xml</loc>
    <lastmod>2013-01-02T11:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/changelist2.xml</loc>
    <lastmod>2013-01-02T23:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/changelist_open.xml</loc>
  </sitemap>
</sitemapindex>
Packaging Changes: Change Dump
http://www.openarchives.org/rs/resourcesync#PackChanges
Capability 4: Change Dump
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changedump" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/change_dump_part1.zip</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md length="887" type="application/zip"/>
  </url>
  <url>
    <loc>http://example.com/change_dump_part2.zip</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md length="9767" type="application/zip"/>
  </url>
</urlset>
Change Dump Manifest
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changedump-manifest" from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated" length="2887" type="text/html" path="/changes/res1"/>
  </url>
  <url>
    …
  </url>
</urlset>
• A Change Dump points at packages (ZIP files) that contain bitstreams of the Source’s resources that changed
  o Changes that occurred during a temporal interval with start- and end-date
• Change Dump is mandatory, even if there is only one ZIP file
• ZIP package contains manifest, listing contained bitstreams
• Typical Destination use: Incremental Synchronization, bulk download of changes
• Changes in Change Dump Manifest listed in chronological order
• Same URI can be listed multiple times
• Might be expensive to generate
• Destinations use @from and @until to determine freshness
4. Framework (Technical) Details: 3. Discovery
http://www.openarchives.org/rs/resourcesync#Discovery
Discovery of Capabilities
Requirements:
• Need to discover capabilities, i.e. Resource List, Resource Dump, Change List, Change Dump, Archives, Notification channels
• Need to know the type of capability each document represents.
Approach:
• The Source publishes a Capability List that enumerates the capabilities it supports.
  o By pointing at Resource List, Change List, Resource Dump, etc. using appropriate relation types, e.g. “resourcelist”, “changelist”, “resourcedump” etc.
http://www.openarchives.org/rs/resourcesync#CapabilityList
Capability List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
</urlset>
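A Destination can turn a Capability List like this one into a lookup from capability name to document URI. The sketch below assumes Python's standard xml.etree.ElementTree; the function name read_capabilities is hypothetical, and the sample simply repeats the slide's example.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

def read_capabilities(capability_list_xml):
    """Map each advertised capability (e.g. 'changelist') to its document URI."""
    root = ET.fromstring(capability_list_xml)
    top = root.find("rs:md", NS)
    if top is None or top.get("capability") != "capabilitylist":
        raise ValueError("document is not a Capability List")
    capabilities = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        if loc and md is not None and md.get("capability"):
            capabilities[md.get("capability")] = loc.strip()
    return capabilities

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
</urlset>"""

caps = read_capabilities(SAMPLE)
```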
Discovery of Capability Lists
Requirements:
• Need to discover a Capability List
Approaches:
• Introduce a link in the HTTP Link header of a resource that is subject to synchronization, pointing at the Capability List with the relation type “resourcesync”
• Introduce a link from an HTML document that is subject to synchronization (<head> section), pointing at the Capability List with the relation type “resourcesync”
• Link from a Resource List, etc. to the Capability List with the relation type “up”
Link header on example.com/res1.pdf:
Link: <http://example.com/dataset1/capabilitylist.xml>; rel="resourcesync"
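A Destination that receives such a Link header can pull out the Capability List URI by looking for the "resourcesync" relation type. This is a deliberately simplified sketch: the regular expression and the name find_capability_list are illustrative assumptions, and it does not handle every corner of the Link header grammar (for example, commas inside quoted parameter values).

```python
import re

# Matches "<target>" followed by its ";name=value" parameters.
LINK_RE = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')

def find_capability_list(link_header):
    """Return the target of the first link with rel="resourcesync", else None."""
    for match in LINK_RE.finditer(link_header):
        target, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.lower() == "rel":
                # A rel value may hold several space-separated relation types.
                rels = value.strip().strip('"').split()
                if "resourcesync" in rels:
                    return target
    return None

HEADER = ('<http://example.com/res1.html>; rel="alternate"; type="text/html", '
          '<http://example.com/dataset1/capabilitylist.xml>; rel="resourcesync"')

capability_list_uri = find_capability_list(HEADER)
```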
Discovery of Capabilities
Discovery: Source Description
Requirements:
• Support for multiple Capability Lists, one per “set of resources”
• Need to discover these Capability Lists
• Need descriptive information about each set of resources that a Capability List pertains to
• Useful to have descriptive information about the Source itself
Approach:
• The Source Description document meets these requirements.
• It should be at a particular location to avoid having registries: http://(hostname)/.well-known/resourcesync
• It can be linked to from the Capability Lists as well.
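Given any resource URI at a Source, the well-known location of the Source Description can be derived from the host alone. A minimal sketch using Python's urllib.parse; the function name source_description_url is hypothetical, and reusing the resource's own scheme is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

def source_description_url(resource_url):
    """Build the /.well-known/resourcesync URI for the host of resource_url."""
    parts = urlsplit(resource_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("an absolute URL is required")
    # Keep scheme and host; drop the path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/.well-known/resourcesync", "", ""))

well_known = source_description_url("http://example.com/dataset1/res1.pdf?download=1")
```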
http://www.openarchives.org/rs/resourcesync#SourceDesc
Discovery of Capabilities
Source Description
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="description"/>
  <rs:ln rel="describedby" href="http://example.com/info_about_source.xml"/>
  <url>
    <loc>http://example.com/dataset1/capabilitylist.xml</loc>
    <rs:md capability="capabilitylist"/>
    <rs:ln rel="describedby" href="http://example.com/dataset1/info_about_dataset1.xml"/>
  </url>
  <url>
    <loc>http://example.com/dataset2/capabilitylist.xml</loc>
    <rs:md capability="capabilitylist"/>
    <rs:ln rel="describedby" href="http://example.com/dataset2/info_about_dataset2.xml"/>
  </url>
</urlset>
Discovery via robots.txt
• Resource Lists are (enhanced) Sitemaps
• Sitemaps can be discovered via robots.txt
• Ergo, Resource Lists should be discoverable via robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml
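Extracting candidate Resource List URIs from robots.txt amounts to collecting its Sitemap lines. The sketch below is a minimal illustration with a hypothetical helper name, sitemap_urls; a real Destination would still need to fetch each document and check its rs:md capability, since a Sitemap line may point at an ordinary Sitemap.

```python
def sitemap_urls(robots_txt):
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "field: value" on the first colon.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

ROBOTS = """User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml
"""

candidates = sitemap_urls(ROBOTS)
```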
Discovery of Capabilities
http://www.openarchives.org/rs/resourcesync#Discovery
Framework Navigation
http://www.openarchives.org/rs/resourcesync#Navigation
e.g., Capability List
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <rs:ln rel="up" href="http://example.com/.well-known/resourcesync"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
</urlset>
Framework Structure
http://www.openarchives.org/rs/resourcesync#Structure
ResourceSync - Agenda
4. Framework (Technical) Details
1. Sitemaps
2. Core capabilities (pull)
3. Discovery
4. Linking to related resources
5. Archives
6. Notifications (push)
http://www.openarchives.org/rs/resourcesync#LinkRelRes
Supported Linking Use Cases
Provide links to related resources to address specific resource synchronization needs.
1. Mirrored content with multiple download locations
2. Alternate representations of the same content
3. Patching content rather than replacing it
4. Resources and metadata about resources
5. Prior versions of resources
6. Collection membership of resources
7. Republishing synchronized resources
All cases are handled with a <rs:ln> element referring to the linked resource
Notes about Linked Resources
Some important things to keep in mind about linked resources:
• They may also be subject to synchronization
• They may be updated on a very different schedule than the resources that link to them
• Therefore, it is recommended to convey metadata about the linked resource too
• Links can be bi-directional: the linked resource can link back to the linking resource
Linking #1 - Mirror
1. Content with multiple download locations
This may be of interest for:
• Content distribution networks
• Mirror sites
• Backup locations
• Load balancing
http://www.openarchives.org/rs/0.9.1/resourcesync#MirCon
Linking #1 - Mirror
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="duplicate" pri="1" href="http://mirror1.example.com/res1"/>
    <rs:ln rel="duplicate" pri="2" href="http://mirror2.example.com/res1"/>
  </url>
</urlset>
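A Destination can use the duplicate links to decide where to download from. The sketch below orders mirrors by their pri attribute and falls back to the <loc> URI. It assumes that a lower pri value means a higher priority and that a link without pri sorts last; the function name download_locations is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

def download_locations(url_elem):
    """Return candidate URIs for one <url> entry: mirrors by priority, then <loc>."""
    loc = url_elem.findtext("sm:loc", namespaces=NS).strip()
    mirrors = []
    for ln in url_elem.findall("rs:ln", NS):
        if ln.get("rel") == "duplicate" and ln.get("href"):
            # Assumption: lower pri is preferred; a missing pri sorts last.
            mirrors.append((int(ln.get("pri", "999999")), ln.get("href")))
    mirrors.sort()
    return [href for _, href in mirrors] + [loc]

# Same entry as on the slide, with the mirrors deliberately listed out of order.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="duplicate" pri="2" href="http://mirror2.example.com/res1"/>
    <rs:ln rel="duplicate" pri="1" href="http://mirror1.example.com/res1"/>
  </url>
</urlset>"""

root = ET.fromstring(SAMPLE)
candidates = download_locations(root.find("sm:url", NS))
```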
Linking #2 – Alternate Representations
2. Alternate representations of the same content
This may be of interest for:
• Resources subject to HTTP content negotiation
• Format migration for preservation reasons
• Different clients wanting different formats
• Multiple languages of the content
http://www.openarchives.org/rs/resourcesync#AltRep
Linking #2 – Alternate Representations
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="alternate" type="text/html" href="http://example.com/res1.html"/>
    <rs:ln rel="alternate" type="application/pdf" href="http://example.com/res1.pdf"/>
  </url>
</urlset>
Linking #2 – Alternate Representations
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1.html</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="canonical" href="http://example.com/res1"/>
  </url>
</urlset>
Linking #3 – Patching Content
3. Patching content rather than replacing it
This may be of interest when:
• Resources are very large and server wishes to conserve bandwidth where possible
• Changes are frequent and small
• Changes are managed in a CMS that tracks differences
Need:
• Machine processable format to describe a change in a manner that allows patching a representation
• Existing or newly defined by communities
http://www.openarchives.org/rs/resourcesync#PatchCon
Linking #3 – Patching Content
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1.json</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated" length="398723"/>
    <rs:ln rel="http://www.openarchives.org/rs/terms/patch"
           type="application/json-patch"
           modified="2013-01-02T17:00:00Z"
           length="58"
           href="http://example.com/res1-patch.json"/>
  </url>
</urlset>
Linking #4 – Metadata about Resources
4. Resources and metadata about resources
This may be of interest when:
• Resources have associated descriptive metadata records, which are useful for understanding the resource
  • Such as cultural heritage images, audio, video
• Resources have associated technical, administrative, or rights metadata
http://www.openarchives.org/rs/resourcesync#ResMDLinking
Linking #4 – Metadata about Resources
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="describedby" type="application/xml" href="http://example.com/metadata/res1.xml"/>
  </url>
</urlset>
Linking #4 – Metadata about Resources
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/metadata/res1.xml</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="describes" type="text/html" href="http://example.com/res1"/>
  </url>
</urlset>
Linking #5 – Prior Versions of Resources
This may be of interest when:
• A Destination needs to have a copy of all versions of a resource
http://www.openarchives.org/rs/resourcesync#ResVers
Memento Intermezzo
http://www.mementoweb.org/
URI for Original (URI-R), URI for Version (URI-M)
Web Archive
• URI-R - http://www.cnn.com/
• URI-M - http://web.archive.org/web/20010911203610/http://www.cnn.com/
CMS
• URI-R - http://en.wikipedia.org/wiki/September_11_attacks
• URI-M - http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
Memento Time Travel extension for Chrome
Download extension at http://bit.ly/memento-for-chrome
Linking #5 – Prior Versions of Resources
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="memento" href="http://example.com/past/20130102130000/res1"/>
    <rs:ln rel="timegate" href="http://example.com/timegate/res1"/>
    <rs:ln rel="timemap" href="http://example.com/timemap/res1" type="application/link-format"/>
  </url>
</urlset>
Linking #6 – Collection Membership
6. Collection membership of resources
This may be of interest when:
• Resources are part of OAI-ORE aggregations
• Resources are part of OAI-PMH sets
• To indicate any other type of collections of resources
Collections are named with URIs and can then be linked to with rel=“collection”
• Nice if the collection URI resolves to a useful description
http://www.openarchives.org/rs/resourcesync#ColMem
Linking #6 – Collection Membership
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z"
         until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="collection" href="http://example.com/aggregation/allres"/>
  </url>
</urlset>
Linking #7 – Republishing Resources
7. Republishing synchronized resources
This may be of interest when:
• Aggregator systems harvest resources from Sources and then republish them at new URIs
Examples include blog republishing, content distribution networks, mirrored or combined collections
Hypothetical scenario: Lots of little museums with small collections, and a large European/American aggregating digital library system that wants to provide fast, combined access to the content (with permission)
http://www.openarchives.org/rs/resourcesync#RePub
Linking #7 – Republishing Resources #1
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-03T00:00:00Z"/>
  <url>
    <loc>http://original.example.com/res1</loc>
    <lastmod>2013-01-03T07:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
</urlset>
• Original Source publishes information about a changed resource via a Change List
Linking #7 – Republishing Resources #2
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-03T11:00:00Z"/>
  <url>
    <loc>http://aggregator1.example.com/res1</loc>
    <lastmod>2013-01-03T20:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="via" modified="2013-01-03T07:00:00Z" href="http://original.example.com/res1"/>
  </url>
</urlset>
• Aggregator 1 republishes information about the changed resource with reference to the original Source
Linking #7 – Republishing Resources #3
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-03T12:00:00Z"/>
  <url>
    <loc>http://aggregator2.example.com/res1</loc>
    <lastmod>2013-01-04T09:00:00Z</lastmod>
    <rs:md change="updated"/>
    <rs:ln rel="via" modified="2013-01-03T07:00:00Z" href="http://original.example.com/res1"/>
  </url>
</urlset>
• Aggregator 2 does the same
• Caution when republishing links: make sure they are still appropriate from an aggregator’s perspective
ResourceSync - Agenda
4. Framework (Technical) Details
1. Sitemaps
2. Core synchronization capabilities (PULL)
3. Discovery
4. Linking to related resources
5. Notification Capabilities (PUSH)
6. Archival capabilities (ARCHIVES)
http://www.openarchives.org/rs/notification
Motivation for Notifications
• Reduce synchronization latency by having the Source push out resource change information
  • To avoid continuous pull of Change Lists by Destinations
• Share information about changes to the Source’s ResourceSync implementation, e.g. announcement of new Resource List, new Capability List, etc.
  • To avoid continuous polling of e.g. Resource Lists, ResourceSync Description
Source: Notification Capabilities
PUSH
1. Change Notification
  • Notifies about changes to particular resources
  • e.g., resource A has been updated | created | deleted
2. Framework Notification
  • Notifies about changes to capabilities, i.e., their documents
  • e.g., a Change List has been updated | created | deleted
  • Also for Capability Lists and Source Description
Notification Channels
• Notifications are sent via channels
  • Change Notification: one channel per set of resources
  • Framework Notification: one channel per set of resources
• Sent on the level of the capability document, not on index level
• Notifications about changes to the Source Description are sent on all Framework Notification channels
• Payload for notifications: <urlset> documents
• Transport protocol for notifications:
  • PubSubHubbub - https://pubsubhubbub.googlecode.com/git/pubsubhubbub-core-0.4.html - current choice
  • WebSockets - http://tools.ietf.org/html/rfc6455 - may be added later
Framework Notification Structure
Change Notification Payload
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T09:07:00Z</lastmod>
    <rs:md change="updated"
           hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876"
           type="text/html"/>
  </url>
  <url> … </url>
</urlset>
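The hash and length attributes in such a payload let a Destination check the representation it subsequently fetches. A minimal sketch using Python's hashlib; the function name matches_metadata is hypothetical, only md5 is checked, and hash algorithms the sketch does not recognise are skipped rather than treated as failures (an assumption made for brevity).

```python
import hashlib

def matches_metadata(content, md_hash=None, length=None):
    """Check fetched bytes against the hash and length attributes of an rs:md element."""
    if length is not None and len(content) != int(length):
        return False
    if md_hash:
        # The attribute may carry several space-separated "algorithm:digest" items.
        for item in md_hash.split():
            algorithm, _, digest = item.partition(":")
            if algorithm == "md5":
                if hashlib.md5(content).hexdigest() != digest.lower():
                    return False
            # Other algorithms are not verified in this sketch.
    return True

# Illustrative content; the digest is computed here rather than taken from the slide.
body = b"hello"
advertised_hash = "md5:" + hashlib.md5(body).hexdigest()
is_valid = matches_metadata(body, advertised_hash, "5")
```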
Framework Notification Payload
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/resourceset1/resourcelist.xml</loc>
    <rs:md change="created" capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/resourceset1/resourcedump.xml</loc>
    <rs:md change="created" capability="resourcedump"/>
  </url>
</urlset>
Framework Notification Payload (w/ index)
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/resourceset1/resourcelist.xml</loc>
    <rs:md change="created" capability="resourcelist"/>
    <rs:ln rel="index" href="http://example.com/dataset1/resourcelist-index.xml"/>
  </url>
</urlset>
Framework Notification Discovery
ResourceSync - Agenda
4. Framework (Technical) Details
1. Sitemaps
2. Core synchronization capabilities (PULL)
3. Discovery
4. Linking to related resources
5. Notification Capabilities (PUSH)
6. Archival capabilities (ARCHIVES)
http://www.openarchives.org/rs/archives
Source: Archival Capabilities
The Source may hold on to historical data, for example, to allow Destinations to catch up with events they missed or revisit prior resource states. To this end, the Source can publish archives, i.e. documents that enumerate historical capability documents
1. Resource List Archive
2. Resource Dump Archive
3. Change List Archive
4. Change Dump Archive
ARCHIVES
Resource List Archive
http://www.openarchives.org/rs/archives#ResourceListArch
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist-archive" at="2013-01-09T13:00:00Z"/>
  <url>
    <loc>http://example.com/resourcelist1.xml</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/resourcelist2.xml</loc>
    <lastmod>2013-01-09T13:00:00Z</lastmod>
  </url>
  <url> … </url>
</urlset>
Resource Dump Archive
http://www.openarchives.org/rs/archives#ResourceDumpArch
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-archive" at="2013-02-10T03:00:00Z"/>
  <url>
    <loc>http://example.com/resourcedump1.xml</loc>
    <lastmod>2013-01-10T03:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/resourcedump2.xml</loc>
    <lastmod>2013-02-10T03:00:00Z</lastmod>
  </url>
  <url> … </url>
</urlset>
Change List Archive
http://www.openarchives.org/rs/archives#ChangeListArch
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist-archive"
         from="2013-02-01T23:00:00Z"
         until="2013-02-03T23:00:00Z"/>
  <url>
    <loc>http://example.com/changelist1.xml</loc>
    <lastmod>2013-02-01T23:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/changelist2.xml</loc>
    <lastmod>2013-02-02T23:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/changelist3.xml</loc>
    <lastmod>2013-02-03T23:00:00Z</lastmod>
  </url>
</urlset>
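A Destination that has fallen behind can use a Change List Archive to find the Change Lists it still needs. The sketch below selects entries whose <lastmod> is later than the Destination's last synchronization time, oldest first. Treating <lastmod> as the point up to which a Change List reports changes is an assumption made for illustration, and the names parse_utc and change_lists_since are hypothetical.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_utc(value):
    """Parse timestamps of the form used on the slides, e.g. 2013-02-01T23:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def change_lists_since(archive_xml, last_sync):
    """Return URIs of archived Change Lists modified after last_sync, oldest first."""
    root = ET.fromstring(archive_xml)
    selected = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if loc and lastmod:
            modified = parse_utc(lastmod.strip())
            if modified > last_sync:
                selected.append((modified, loc.strip()))
    selected.sort()
    return [loc for _, loc in selected]

ARCHIVE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist-archive"
         from="2013-02-01T23:00:00Z" until="2013-02-03T23:00:00Z"/>
  <url>
    <loc>http://example.com/changelist1.xml</loc>
    <lastmod>2013-02-01T23:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/changelist2.xml</loc>
    <lastmod>2013-02-02T23:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/changelist3.xml</loc>
    <lastmod>2013-02-03T23:00:00Z</lastmod>
  </url>
</urlset>"""

missed = change_lists_since(ARCHIVE, parse_utc("2013-02-02T00:00:00Z"))
```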
Change Dump Archive
http://www.openarchives.org/rs/archives#ChangeDumpArch
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changedump-archive"
         from="2013-02-10T03:00:00Z"
         until="2013-02-17T03:00:00Z"/>
  <url>
    <loc>http://example.com/changedump1.xml</loc>
    <lastmod>2013-02-10T03:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/changedump2.xml</loc>
    <lastmod>2013-02-17T03:00:00Z</lastmod>
  </url>
  <url> … </url>
</urlset>
Capability List for Archives
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  …
  <url>
    <loc>http://example.com/dataset1/resourcelist-archive.xml</loc>
    <rs:md capability="resourcelist-archive"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist-archive.xml</loc>
    <rs:md capability="changelist-archive"/>
  </url>
</urlset>
ResourceSync Framework with Archives
ResourceSync - Agenda
1. ResourceSync: Problem Perspective & Conceptual Approach
2. Motivation & Use Cases
3. Framework Walkthrough
4. Framework (Technical) Details
5. Implementation
6. Q&A
Implementation #1: The Metadata Harvesting Use Case
The Metadata Harvesting Use Case
1. Identification of metadata records within a service
2. Use of standards in metadata formats
3. Incremental updates
4. Create, Update, Delete
5. Sets
The Metadata Harvesting Use Case
1. Identification of metadata records within a service
2. Use of standards in metadata formats
ResourceSync does not specifically care about metadata records, only resources. It is up to the server to identify which of those resources are metadata.
We are free to annotate a resource's entry with appropriate metadata to indicate the format.
The Metadata Harvesting Use Case
3. Incremental updates
4. Create, Update, Delete
5. Sets
All resources that can be obtained from a Change List are annotated with the kind of change that happened to them.
ResourceSync allows the server to publish lists of resources, lists of changes, and indexes of those lists, all annotated with metadata.
ResourceSync publishes changes as static documents. The client is then free to walk up and down the Change Lists provided by the server.
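Since each entry carries its kind of change, a harvesting client can replay a Change List against its local inventory. The sketch below keeps the inventory as a plain dictionary from URI to last-modified timestamp; the function name apply_changes, the dictionary layout, and the sample data are illustrative assumptions, and fetching the changed resources themselves is left out.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

def apply_changes(inventory, change_list_xml):
    """Replay created | updated | deleted entries against {uri: lastmod}."""
    root = ET.fromstring(change_list_xml)
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        if loc is None or md is None:
            continue
        loc = loc.strip()
        change = md.get("change")
        if change == "deleted":
            inventory.pop(loc, None)
        elif change in ("created", "updated"):
            # A real client would also download the resource at this point.
            inventory[loc] = url.findtext("sm:lastmod", namespaces=NS)
    return inventory

CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T09:00:00Z" until="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod>2013-01-02T15:00:00Z</lastmod>
    <rs:md change="created"/>
  </url>
</urlset>"""

local = {
    "http://example.com/res1": "2013-01-01T00:00:00Z",
    "http://example.com/res2": "2013-01-01T00:00:00Z",
}
local = apply_changes(local, CHANGE_LIST)
```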
(Required) Documents for metadata harvesting use case
Describing Metadata Resources
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" from="2013-05-05T13:00:00Z"/>
  <url>
    <loc>http://mydspace.edu/dspace-rs/resource/123456789/7/qdc</loc>
    <lastmod>2013-05-01T19:09:35Z</lastmod>
    <changefreq>never</changefreq>
    <rs:md type="application/xml"/>
    <rs:ln href="http://mydspace.edu/bitstream/123456789/7/1/bitstream.pdf" rel="describes"/>
    <rs:ln href="http://mydspace.edu/bitstream/123456789/7/2/image.jpg" rel="describes"/>
    <rs:ln href="http://mydspace.edu/123456789/3" rel="collection"/>
  </url>
</urlset>
Describing Bitstream Resources
<urlset …
  <url>
    <loc>http://mydspace.edu/bitstream/123456789/7/1/bitstream.pdf</loc>
    <lastmod>2013-05-01T19:09:35Z</lastmod>
    <changefreq>never</changefreq>
    <rs:md hash="md5:75d0ea94097a05fce9aca5b079e2f209"
           length="419805"
           type="application/pdf"/>
    <rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/7/qdc" rel="describedby"/>
    <rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/7/mets" rel="describedby"/>
    <rs:ln href="http://mydspace.edu/dspace-rs/resource/123456789/12/qdc" rel="describedby"/>
    <rs:ln href="http://mydspace.edu/123456789/2" rel="collection"/>
  </url>
</urlset>
Serving Metadata Resources
http://mydspace.edu/dspace-rs/resource/123456789/7/qdc
URL components, in order: ResourceSync webapp, item handle, metadata format

metadata.formats = \
    qdc = http://purl.org/dc/terms/, \
    mets = http://www.loc.gov/METS/
metadata.types = \
    qdc = application/xml, \
    mets = application/xml

<loc>http://mydspace.edu/dspace-rs/resource/123456789/7/qdc</loc>
<rs:md type="application/xml"/>
<rs:ln href="http://purl.org/dc/terms/" rel="describedby"/>

<loc>http://mydspace.edu/dspace-rs/resource/123456789/7/mets</loc>
<rs:md type="application/xml"/>
<rs:ln href="http://www.loc.gov/METS/" rel="describedby"/>
Generating Documents
1. Initialise
Creates initial Capability List and Resource List documents
[dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -i
2. Update
Creates a new Change List which covers the period since the last Change List was created
[dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -u
3. Rebase
A combination of both Initialise and Update.
[dspace]/bin/dspace dsrun org.dspace.resourcesync.ResourceSyncGenerator -r
Usage of Resources by clients
Impact on DSpace
URLs
• Stable identifiers for archived items
• Stable identifiers for unarchived items
• Stable identifiers for metadata resources (in their various formats)
• Stable identifiers for previous versions
Provenance
• History of changes to an item/bitstream
• Item/bitstream deletions (vs withdraw)
• Bitstream create/update dates
• Item create/update dates
Versioning
• Access of previous versions of both metadata and bitstreams
• Stable identifiers for previous versions of both metadata and bitstreams
Metadata Resources
• Metadata in a variety of formats
• Metadata as file/bitstream
Admin Files
• ResourceSync documents (Resource Lists, Change Lists, etc)
• ResourceSync exports - Resource Dumps, Change Dumps
• Metadata exports in a number of formats
Scheduled Tasks
• Regular generation of RS documents
Complex Objects
• Item/bitstream relationships
• Collections of content
Get the software!
DSpace module: https://github.com/CottageLabs/DSpaceResourceSync
  depends on the common Java library: https://github.com/CottageLabs/ResourceSyncJava
PHP client: https://github.com/stuartlewis/resync-php
  depends on the SWORDv2 client library: https://github.com/swordapp/swordappv2-php-library/
Implementation #2: ResourceSync at arXiv.org
ResourceSync @ arXiv
• Use ResourceSync for both mirroring and public data access
  o efficient updates
  o ability to do periodic audits
  o public synchronization capability
  o reduce admin burden
• Likely start with metadata + source for mirroring use case (doing experiments now)
• Open access use cases require processed PDF also
• Some concerns about likely use/load…
Alternate download location
• Likely want to separate machine accesses from human accesses to preserve response time on main server
=> Use Mirrored Content part of spec
  o <loc> specifies canonical URI - e.g. http://arxiv.org/pdf/1306.1073v1.pdf
  o <rs:ln rel="duplicate"> specifies preferred download location - e.g. http://export.arxiv.org/pdf/1306.1073v1.pdf
<url>
  <loc>http://arxiv.org/pdf/1306.1073v1.pdf</loc>
  <lastmod>2013-06-06T00:57:12Z</lastmod>
  <rs:md hash="md5:e08e0c4e4d7b0895120014f0aa09e7c4"
         length="287714"
         type="application/pdf"/>
  <rs:ln rel="duplicate" pri="1"
         href="http://export.arxiv.org/pdf/1306.1073v1.pdf"
         modified="2013-06-06T02:00:59Z"/>
</url>
Getting a copy of arXiv
It might be as easy as:
(of course, you probably have to wait a while but it is nice to know ResourceSync is stateless so one can efficiently restart)
Python Library and Client
• Aim to provide library code implementing all ResourceSync facilities for use in both source and destination implementations
  o Designed for Python 2.6 (RHEL6) and 2.7
  o Will not work with Python <= 2.5
• Client (resync) supports many destination operations, inspired by the common Unix rsync program
• Client also supports some operations that might be useful in a source, such as generation of static Resource Lists, or periodic Change Lists (used in arXiv experiments)
• Explorer (resync-explorer) intended to allow easy inspection of a source’s resource sets and capabilities
• Developed since ResourceSync v0.5, updated for v0.9
http://github.com/resync/resync
ResourceSync Source Simulator
• Python code using Tornado server
• Provides random set of resources of different sizes updated at a particular rate
• Very useful for testing Destination code
http://github.com/resync/simulator
ResourceSync - Agenda
1. ResourceSync: Problem Perspective & Conceptual Approach
2. Motivation & Use Cases
3. Framework Walkthrough
4. Framework (Technical) Details
5. Implementation
6. Q&A
ResourceSync: A Web-Based Resource Synchronization Framework
ResourceSync is funded by The Sloan Foundation & JISC
#resourcesync