Self-Preserving Digital Objects
Michael L. Nelson
http://www.cs.odu.edu/~mln/
Several Slides from Terry L. Harrison
University of Southern California
6/15/04
Outline
• History
• Preservation
• Archives vs. Objects
• Smart Objects & Dumb Archives
• Self-Preserving Objects
My DL History
• 1992 - work begun on the first-generation Langley Technical Report Server (LTRS)
• 1993 - WWW version of LTRS
  • http://techreports.larc.nasa.gov/ltrs/
  • work w/ ODU on WATERS
• 1994 - NASA Technical Report Server (NTRS)
  • distributed searching of many “LTRS-like” servers (20 separate nodes, all NASA centers)
  • http://techreports.larc.nasa.gov/cgi-bin/NTRS
• 1996 - NACA Technical Report Server (NACATRS)
  • http://naca.larc.nasa.gov/
• 1996 - Joint research in DLs with ODU begins
• 1997 - NCSTRL+ (clustering, buckets)
• 1999 - OAI-PMH development begins
• 2001 - Arc, DP9, Archon, Kepler, etc.
• 2002 - OAI-PMH version of the NTRS
  • http://ntrs.nasa.gov/
History
• ca. 1994 - 1995: a LaRC researcher, upon seeing LTRS, remarked:
“all of these reports are nice, but what we really want is the data...”
• ca. 1995 - present: many reports in LTRS start to include data files, appendices, software and other information types
• NACATRS: the scanned nature of the reports implies that 1 report = N files
N >= (pages * 3) + 2
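As a quick worked example of the bound above, the sketch below computes the minimum file count for a scanned report. The function name `min_files` is hypothetical; only the inequality itself comes from the slide.

```python
def min_files(pages: int) -> int:
    """Lower bound on files per scanned report: N >= (pages * 3) + 2."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * 3 + 2


# A 20-page scanned report needs at least 62 files.
print(min_files(20))
```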
NASA STI
• Formal publications cover a decreasing percentage of NASA’s STI output
– most DLs focus only on formal publications
• Informal STI is maintained only by a network of collegial distribution
– an aging and shrinking workforce weakens this network
• Customers want much more than formal publications
– rather than stretch the meaning of “report” or “document”, define a new object for DL transactions
STI Observations
• Media formats are instantiations of a more general class of information
• Most DLs are uni-format, following the obsolete media boundaries of their non-digital predecessors
• “Separate but equal” DLs considered harmful
– customer should not have to re-integrate what should never have been de-integrated...
– institutional knowledge being lost because we don’t have a publishing vector established
Pyramid of Scientific and Technical Information (STI)
Journal Articles
Conference Papers
Technical Reports
time
software raw data notes video / images
Information is created in a variety of formats. Formal publications, the focus of
most DL projects, are supported by a pyramid of informal information.
Information Lost Over Time
[Diagram labels: Project; manuscript, software, raw data, images; library, ftp site, filing cabinet, thrown away; User; New Project]
Figure 7: STI Lost in Project / Archival / Reuse Process
Content is King
The information content is more important than the systems used for its storage, management and retrieval
Objects should not be “locked” in specific DLs or archives
Prelude to OAI…
• I met Herbert Van de Sompel in April 1999...
– we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce
– we wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high-quality, yet often isolated, tech report servers, e-print servers, etc.
• most digital libraries (DLs) had grown up along single disciplines or institutions
– little to no interoperability; isolated DL “gardens”
Universal Preprint Service
• A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
– Nelson: NCSTRL+; a modified version of Dienst
  • support for “clustering”
  • support for “buckets”
– Krichel: ReDIF metadata format
– Van de Sompel: SFX Linking
• Demonstrated at Santa Fe NM, October 21-22, 1999
– http://web.archive.org/web/*/http://ups.cs.odu.edu/
– D-Lib Magazine, 6(2) 2000 (2 articles)
  • http://www.dlib.org/dlib/february00/02contents.html
– UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/
UPS Participants
Archive / DL                         Records in DL   Buckets in UPS   Buckets Linked to Full Content
arXiv (www.arxiv.org)                       128943            85204                            85204
CogPrints (cogprints.soton.ac.uk)              743              742                              659
NACA (naca.larc.nasa.gov)                     3036             3036                             3036
NCSTRL (www.ncstrl.org)                      29680            25184                             9084
NDLTD (www.ndltd.org)                         1590             1590                              951
RePEc (netec.mcc.ac.uk)                      71359            71359                            13582
Totals:                                     235361           187115                           112516
totals ca. July 1999
Buckets: Information Surrogates in UPS
• Limitations on intellectual property, file size, transmission time, system load, etc. caused us to focus on metadata only
• Metadata was collected into “buckets”, with pointers back to the data files (still at the original sites)
Value Added Services Attached to the Buckets
• SFX Reference Linking Service, developed at Univ of Ghent, Belgium
– provides a layer of indirection between reference services available at a local site and the object itself
• SFX “buttons” are attached to the buckets themselves
– communication occurs between SFX server and the bucket
• Adding other services to the buckets is easy...
Data and Service Providers
• Data Providers
– publishing into an archive
– self-describing archives
  • much of the learning about the constituent UPS archives occurred out of band…
– providing methods for metadata “harvesting”
  • also provide non-technical context for sharing information
• Service Providers
– harvest metadata from providers
– implement user interface to data
• Even if these are done by the same DL, these are distinct roles
Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
– data remains at remote repositories
[Diagram: a user searches for “cfd applications” against a local copy of metadata that was harvested offline from each node. Each node is independently maintained; all searching, browsing, etc. is performed on the local metadata; individual nodes can still support direct user interaction.]
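The harvesting model can be sketched in a few lines. This is an in-memory illustration only, not OAI-PMH itself: the `Node` and `ServiceProvider` classes, the identifiers, and the titles are all made up for the example. The point it shows is that harvesting happens ahead of time and that a query touches only the local copy.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A remote repository: it keeps the data and exposes metadata records."""
    name: str
    records: dict  # identifier -> title (a stand-in for real metadata)

    def list_records(self) -> dict:
        return dict(self.records)


@dataclass
class ServiceProvider:
    """Holds a local copy of harvested metadata and searches only that copy."""
    index: dict = field(default_factory=dict)

    def harvest(self, nodes) -> None:
        # Done offline, before any user query arrives.
        for node in nodes:
            for identifier, title in node.list_records().items():
                self.index[identifier] = (node.name, title)

    def search(self, phrase: str) -> list:
        # No remote node is contacted at query time.
        words = phrase.lower().split()
        return sorted(
            (identifier, node_name)
            for identifier, (node_name, title) in self.index.items()
            if all(word in title.lower() for word in words)
        )


nodes = [
    Node("node-a", {"a:1": "CFD applications in wing design"}),
    Node("node-b", {"b:1": "Wind tunnel data", "b:2": "Applications of CFD"}),
]
provider = ServiceProvider()
provider.harvest(nodes)
hits = provider.search("cfd applications")
print(hits)
```

Each hit carries the name of the node that still holds the data, so the user can be sent back to the original repository for the full content.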
Result… OAI
• The OAI was the result of the demonstration and discussion during the Santa Fe meeting
– OAI = a bunch of people, a religion, a cult, etc.
– OAI Protocol For Metadata Harvesting (OAI-PMH) = the protocol created and maintained by the OAI
• Initial focus was on federating collections of scholarly e-print materials…
• …however, interest grew and the scope and application of OAI-PMH expanded to become a generic bulk metadata transport protocol
• Note:
– OAI-PMH is only about metadata -- not full text!
  • but what is metadata vs. full-text?
– OAI is neutral with respect to the nature of the metadata or the resources the metadata describes
  • read: commercial publishers have an interest in OAI-PMH too...
A Look Back at UPS
• Primary outcome of the meeting was the OAI & OAI-PMH
• Krichel: ReDIF metadata:
– still in use & being developed
• Van de Sompel: SFX
– OpenURL (NISO Standard)
– SFX is a commercial OpenURL resolver marketed by Ex Libris
• Nelson:
– NCSTRL+ begat Arc (arc.cs.odu.edu) and others
– Buckets?
Componentized Digital Libraries
. . .
RSS
SRW
!?
Preservation
• RLG Report: Preserving Digital Information: Final Report and Recommendations
– http://www.rlg.org/ArchTF/
– refreshing - moving to new media
  • considered (comparatively) easy
– migrating - transitioning to new systems, formats, idioms
  • considered hard
Really Long Term Preservation
• Migration is very hard, to be sure
– but given sufficient demand, this can be accomplished
– cf. early 1980s game emulation:
  • http://www.intellivisionlives.com/
  • http://stella.atari.org/
• Refreshing may actually be harder…
– or at least intrinsically bound to the migration problem
  • http://web.archive.org/web/19980128071544/http://www.usc.edu/
  • http://web.archive.org/web/*/http://library.usc.edu/
  • http://web.archive.org/web/19971210220634/http://lib-www.lanl.gov/
Preservation Metrics So Far
– Nelson & Allen
  • 3% decay of objects in DLs
  • http://www.dlib.org/dlib/january02/nelson/01nelson.html
– Lawrence, et al.
  • 3% decay of URLs included in technical papers
  • http://www.neci.nec.com/~lawrence/papers/persistence-computer01/bib.html
– Koehler
  • ~ 33% of URLs “unstable” or “partially unstable”
  • http://InformationR.net/ir/4-4/paper60.html
– Kahle
  • average URL lasts 44 days
  • http://www.hackvan.com/pub/stig/articles/trusted-systems/0397kahle.html
– Spinellis
  • 28% loss of 5-8 year old URLs from CACM / IEEE Computer
  • http://citeseer.ist.psu.edu/spinellis03decay.html
Case Study: ICASE
• Institute for Computer Applications in Science and Engineering
– independent research institute affiliated with NASA Langley Research Center
  • www.icase.edu
– years of operation: 1972-2002
– combined with other LaRC institutes, rolled into the National Institute of Aerospace (NIA)
• ICASE Report Series
– pre-prints/e-prints of all ICASE affiliated authors
  • also issued as NASA Contractor Reports
– Dienst was used for report management & workflow
  • Harrison, Zubair & Nelson, JCDL 03, Dienst <-> OAI-PMH gateway
NIA Transition
• At first, all files at www.icase.edu were lost
• then, the site was brought back online
• but how well do DLs survive bulk-transfer?
Whither the ICASE Digital Library?
it appears to be reinstated… but not completely…
How Long is Forever?
• Average human life span (from: http://www.che.uc.edu/acs/archives/cintacs/vol39no5/vol39no5.html)
– female: 78
– male: 77
• Average Fortune 500 company lifespan (from: http://www.businessweek.com/chapter/degeus.htm)
– 40 - 50 years
• Universities?
• U.S. Government agency or institution?
– what about individual labs?
  • NASA Zero Base Review
  • U.S. Military BRAC
Self-Preservation
• Objects should be prepared to outlive the people & institutions that are charged with their well-being
• Many areas of risk:
– company, agency, university, etc. ceases to exist
– funding cut
– person dies
– disaster (hurricane, earthquake, etc.)
– malicious attack
P2P Model
• Applicable for scientific and technical information?
– Napster, Gnutella, etc. rely on the repetitive nature of popular culture media (songs, movies, etc.) to ensure the availability of items
– a “bubble” of recent and popular interest
• this assumption is probably not valid in STI DLs
– cf. popularity(HBO) >> popularity(AMC)
Smart Objects, Dumb Archives
DA
Guildford Protocol
OAI-PMH
Buckets
???
Fedora? METS?
“Key Concepts in the Architecture of the Digital Library”
• next 9 slides taken from Bill Arms’s seminal article in the inaugural issue of D-Lib Magazine:
– http://www.dlib.org/dlib/July95/07arms.html
The technical framework exists within a legal and social framework
• DLs no longer represent systems specific to academics or information specialists
– content influences how the DL is used
• architecture must allow the implementation of various policies
Understanding of digital library concepts is hampered by terminology
• “common English” != “professional English”
– multiple professional jargons too
• What do these words mean to you?
– copy
– publish
– content
– document
– work
The underlying architecture should be separate from the content stored in the library
• general purpose functions and content-specific functions should be separated
• TL analogy:
– the more specific the bookshelf is to holding actual books, the harder it is to repurpose the bookshelf in the future
Names and identifiers are the basic building block for the digital library
• names != addresses
• in any DL architecture diagram, (almost) anything that can be drawn can be named
• consider the impact that handles/DOIs have had on the publishing/DL community
Digital library objects are more than collections of bits
• objects = metadata + data
– “but what is metadata?”
• don’t ask hard questions
figure 2 in http://www.dlib.org/dlib/July95/07arms.html
The digital library object that is used is different from the stored object
• what you store is not necessarily what you get
– storage and dissemination are separate events, and can represent separate formats
  • also, potentially separate from the application-specific format
Users want intellectual works, not digital objects
• The DL architect’s needs should not inconvenience the users’ needs
• recombination of objects
– what is an object in your world view?
figure 4 in http://www.dlib.org/dlib/July95/07arms.html
Repositories must look after the information they hold
• “Repository Access Protocol”
– Kahn Wilensky Framework
• http://www.cnri.reston.va.us/home/cstr/arch/k-w.html
figure 3 in http://www.dlib.org/dlib/July95/07arms.html
Objects vs. Archives
• This is the tenet that I question…
• Most DL objects still bound to the applications that generate or render the objects
Design Goals
• Aggregation
– DLs should be shielded from the transient nature of file formats
– Prevent information hemorrhaging by archiving all data types
• Intelligence
– Aggregation (above) implies code, why stop at passive objects? Make objects smart...
– Bucket-bucket & bucket-tool intelligence
Design Goals
• Self-Sufficiency
– Maximum autonomy & survivability: fully self-sufficient buckets
– Option to internally store all needed materials
• Mobility
– Why should an information object be stuck in one place?
– Mobility for replication, workflow, data collection
Design Goals
• Heterogeneity
– One size does not fit all...
– Different buckets for different applications, sites, disciplines, etc.
• Archive Independence– Focus is on information, not yet another DL “system”
• does not require an archive to function
– “Work with everything; break nothing”
Smart Objects
• aggregate:
– metadata
– data
– methods to operate on the metadata/data
  • http://www.cs.odu.edu/~mln/teaching/cs595-f03/?method=getMetadata&type=all
  • http://www.cs.odu.edu/~mln/teaching/cs595-f03/?method=listMethods
  • http://www.cs.odu.edu/~mln/teaching/cs595-f03/?method=listPreference
  • (cheat) http://www.cs.odu.edu/~mln/teaching/cs595-f03/bucket/bucket.xml
• assumptions
– Perl
– http server
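The URLs above suggest that a bucket picks its behavior from the `method` query parameter. The sketch below illustrates that dispatch idea in Python; the real buckets are Perl scripts behind an http server, and the `Bucket` class, its fields, the example.org URLs, and the choice of `display` as the default method are assumptions made for this illustration.

```python
from urllib.parse import parse_qs, urlparse


class Bucket:
    """Aggregates metadata, data, and the methods that operate on them."""

    def __init__(self, metadata, elements):
        self.metadata = metadata
        self.elements = elements
        # Method names mirror those visible in the slide's URLs and listing.
        self.methods = {
            "display": self.display,
            "getMetadata": self.get_metadata,
            "listMethods": self.list_methods,
        }

    def display(self, args):
        return sorted(self.elements)

    def get_metadata(self, args):
        return self.metadata

    def list_methods(self, args):
        return sorted(self.methods)

    def handle(self, url):
        """Dispatch on the 'method' query parameter (default is assumed)."""
        query = parse_qs(urlparse(url).query)
        name = query.get("method", ["display"])[0]
        method = self.methods.get(name)
        if method is None:
            raise ValueError(f"unknown method: {name}")
        return method(query)


bucket = Bucket({"title": "CS 595"}, {"syllabus.txt": b"..."})
print(bucket.handle("http://example.org/bucket/?method=listMethods"))
print(bucket.handle("http://example.org/bucket/?method=getMetadata&type=all"))
```

Because the methods travel with the object, a client needs only a URL to ask the bucket what it can do.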
Internal Structure
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls
bucket/ CVS/ index.cgi*
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/
bucket.xml* content/ CVS/ lib/ logs/ methods/
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/content/
~syllabus.txt ~week1~readings.html ~week5~readings.html
~week10~readings.html ~week1~week-01.ppt ~week6~readings.html
~week11~readings.html ~week2~readings.html ~week7~readings.html
~week12~readings.html ~week2~week-02.ppt ~week8~readings.html
~week13~readings.html ~week3~assignment1.ppt ~week9~readings.html
~week14~readings.html ~week3~readings.html
~week15~readings.html ~week3~week-03.ppt
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/lib
CVS/ EZXML.pm mime.e style.css
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/logs/
access.log CVS/ mylog.log
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/methods/
addElement.pl* getElement.pl* listMethods.pl* setPreference.pl*
CVS/ get_log.pl* listPreference.pl*
deleteElement.pl* getlog.pl* log.pl*
display.pl* getMetadata.pl* setMetadata.pl*
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 %
Examples
• 1.6.X bucket
– http://ntrs.nasa.gov/
– http://www.cs.odu.edu/~mln/phd/
• 2.0 buckets
– http://www.cs.odu.edu/~mln/teaching/cs595-f03/
– http://www.cs.odu.edu/~lutken/smalltest/1120/
• 3.0 buckets (under development)
– http://www.cs.odu.edu/~jallen/buckets/
– uses MPEG-21 DIDLs
  • cf. http://www.dlib.org/dlib/november03/bekaert/11bekaert.html
Self-Preservation
• Objectives:
– knowledge of the system state not required
• i.e. -- you don’t need to keep track of where everything is…
– the knowledge required for each object should be minimal
• actually, the required number of “friends” should be finite, even in very large systems
Friends and Family
• Friends
– connections to “other” buckets
• Family
– connections to replications of you
Scenario: 3 buckets / 2 pals each
A Pals: b,c
B Pals: a,c
C Pals: b,a

We want to add new_guy (D)
A Pals: b,c
B Pals: a,c
C Pals: b,a
D Pals: (none)

Tool calls: C.insert(D, "start")
A Pals: b,c
B Pals: a,c
C Pals: b,a
D Pals: (none)

D is added to C’s pal list
A Pals: b,c
B Pals: a,c
C Pals: b,a,(d)
D Pals: (none)
C pal_list is overstuffed

Return handshake: D.insert(C, "finish")
A Pals: b,c
B Pals: a,c
C Pals: b,a,(d)
D Pals: c
C pal_list is overstuffed

C “refits” pal list
A Pals: b,c
B Pals: a,c
C Pals: b,a,(d)
D Pals: c
C pal_list is overstuffed

Refit step 1: C.pop_1st_pal not known by (D)
A Pals: b,c
B Pals: a,c
C Pals: b,a,d
D Pals: c
Now C pal_list is overstuffed

Refit step 2: B.pop_pal(C)
A Pals: b,c
B Pals: a,c
C Pals: a,d
D Pals: c

Refit step 2: B.insert(D, "start")
A Pals: b,c
B Pals: a,d
C Pals: a,d
D Pals: c

Refit step 3: D.insert(B, "finish")
A Pals: b,c
B Pals: a,d
C Pals: a,d
D Pals: c,b

A Pals: b,c
B Pals: a,d
C Pals: a,d
D Pals: c,b
[Figures: network construction with 10 buckets and 4 friends, steps 2 through 10; 20 buckets with 4 friends; 100 buckets with 10 friends]
Building the Network
Bucket:
    this_node_name;
    max_friend_size;
    list_of_pals;

insert(new_guy, string handshake)
// Adds new_guy to this bucket's pal list
// handshake = "start" or "finish"
{
    if (I know new_guy) { return; }
    else {
        put new_guy at end of my pal list;
        if (handshake == "start")
            { new_guy.insert(this_node_name, "finish"); }
        if (my pal list is now overstuffed)
            { refit(); }
    }
    return list_of_pals;
}

refit()
// To keep pal_list from being overstuffed
{
    read in new_guy's pal list;
    pop_1st_pal_list();
    // I remove 1st pal "Y" from my list that's
    // not present in new_guy's pal list
    Y.pop_from_list(Me);          // Have "Y" pop "Me"
    Y.insert(new_guy, "start");   // Y adds new_guy to his list;
                                  // this will call new_guy to add "Y" as well
}
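The pseudocode above can be turned into a small runnable sketch. This is one possible reading, not the bucket implementation: the class name `Bucket`, the `max_friends` attribute, passing `new_guy` into `refit`, and the guards against missing pals are assumptions. It has only been exercised on the four-bucket scenario from the earlier slides.

```python
class Bucket:
    def __init__(self, name, max_friends):
        self.name = name
        self.max_friends = max_friends
        self.pals = []

    def insert(self, new_guy, handshake):
        """Add new_guy to this bucket's pal list ("start" or "finish")."""
        if new_guy is self or new_guy in self.pals:
            return self.pals
        self.pals.append(new_guy)
        if handshake == "start":
            # Return handshake: new_guy adds this bucket as well.
            new_guy.insert(self, "finish")
        if len(self.pals) > self.max_friends:
            self.refit(new_guy)
        return self.pals

    def refit(self, new_guy):
        """Hand off the first pal that new_guy does not already know."""
        for pal in self.pals:
            if pal is not new_guy and pal not in new_guy.pals:
                break
        else:
            return  # nobody can be handed off (assumed fallback)
        self.pals.remove(pal)
        if self in pal.pals:
            pal.pals.remove(self)      # have "Y" pop "Me"
        pal.insert(new_guy, "start")   # "Y" and new_guy befriend each other


a, b, c, d = (Bucket(n, max_friends=2) for n in "abcd")
a.pals, b.pals, c.pals = [b, c], [a, c], [b, a]

c.insert(d, "start")
for bucket in (a, b, c, d):
    print(bucket.name, [pal.name for pal in bucket.pals])
```

Run on the scenario, this ends in the same state as the slides: A keeps b,c; B and C each hold a,d; D holds c,b.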
Communications Cost: Building the Network
• Total communications cost to build the network
b^2 - f - (b - f)^2
• b = # of buckets
• f = # of friends
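A short sketch that evaluates the build-cost expression, read as b squared minus f minus (b - f) squared, for the network sizes used elsewhere in these slides. The function name and the check that f does not exceed b are assumptions.

```python
def build_cost(b: int, f: int) -> int:
    """Communications cost to build the network: b^2 - f - (b - f)^2."""
    if not 0 < f <= b:
        raise ValueError("expected 0 < f <= b")
    return b**2 - f - (b - f) ** 2


for b, f in [(10, 4), (20, 4), (100, 10)]:
    print(f"b={b}, f={f}: cost={build_cost(b, f)}")
```

Expanding the squares gives the equivalent form 2bf - f^2 - f, which makes the dependence on b and f easier to see.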
Communications Cost: Traversing the Network
• Flood algorithm:
b(f-1) - f + 2
• Spanning Tree:
b - 1
• Upper bound on the diameter of the network:
(b-f)/2 + 1
– (typically much less)
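The three traversal expressions can be compared side by side with a small sketch. The function names are hypothetical; the formulas are taken directly from the slide.

```python
def flood_cost(b: int, f: int) -> int:
    """Flood algorithm: b(f - 1) - f + 2."""
    return b * (f - 1) - f + 2


def spanning_tree_cost(b: int) -> int:
    """Spanning tree: b - 1."""
    return b - 1


def diameter_upper_bound(b: int, f: int) -> float:
    """Upper bound on the network diameter: (b - f) / 2 + 1."""
    return (b - f) / 2 + 1


for b, f in [(10, 4), (100, 10)]:
    print(b, f, flood_cost(b, f), spanning_tree_cost(b), diameter_upper_bound(b, f))
```

For 10 buckets with 4 friends, flooding costs 28 against 9 for a spanning tree, and the gap widens as b and f grow.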
Network Resiliency
• The network can survive at least f-1 node (bucket) or edge (communications) failures and still remain fully connected
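The resiliency claim can be checked by brute force on a small network. The sketch below uses the four-bucket, two-friend network from the earlier scenario and tests only node failures; the helper names are hypothetical, and the traversal follows pal lists as written, which works here because every friendship in that network is mutual.

```python
from collections import deque
from itertools import combinations


def is_connected(graph, removed=frozenset()):
    """Breadth-first search over the buckets that are still alive."""
    alive = [node for node in graph if node not in removed]
    if not alive:
        return True
    seen = {alive[0]}
    queue = deque([alive[0]])
    while queue:
        node = queue.popleft()
        for pal in graph[node]:
            if pal not in removed and pal not in seen:
                seen.add(pal)
                queue.append(pal)
    return len(seen) == len(alive)


def survives_node_failures(graph, k):
    """True if the network stays connected after any k node failures."""
    return all(
        is_connected(graph, frozenset(failed))
        for failed in combinations(graph, k)
    )


network = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["c", "b"]}
print(survives_node_failures(network, 1))  # f - 1 = 1 failure
print(survives_node_failures(network, 2))  # f = 2 failures
```

With f = 2 the network tolerates any single failure, while some pairs of failures (a and d, for instance) split it.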
Cf. Other P2P Projects
• Gnutella
– also O(N^2) to build the network
  • currently don’t know the exact message cost
• Chord, Tapestry, etc.
– content addressable networking
  • hash function to map keys to locations
– orthogonal to buckets
Chatting
• the stored objects are inactive until invoked
– if no one communicates with the object, it never wakes up, can never perform self-tests, etc.
• solution:
– circulate a number of tokens through the network to ensure that everyone is woken up
– buckets can perform a number of administrative tasks at these times
• Core to solving the migration issue
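A minimal sketch of the wake-up idea, under several assumptions: a single token is passed breadth-first along pal lists, and the administrative task each bucket performs is a checksum self-test. The slides say only that tokens circulate and that buckets do administrative work, so the traversal order, the `wake` method, and the SHA-256 check are all illustrative.

```python
import hashlib
from collections import deque


class Bucket:
    def __init__(self, name, content):
        self.name = name
        self.content = content
        self.checksum = hashlib.sha256(content).hexdigest()
        self.pals = []
        self.wakeups = 0

    def wake(self):
        """Work done when a token arrives: here, a fixity self-test."""
        self.wakeups += 1
        return hashlib.sha256(self.content).hexdigest() == self.checksum


def circulate_token(start):
    """Pass a token from pal to pal until every reachable bucket has woken."""
    visited = {start.name}
    report = {}
    queue = deque([start])
    while queue:
        bucket = queue.popleft()
        report[bucket.name] = bucket.wake()
        for pal in bucket.pals:
            if pal.name not in visited:
                visited.add(pal.name)
                queue.append(pal)
    return report


a, b, c, d = (Bucket(n, n.encode() * 8) for n in "abcd")
a.pals, b.pals, c.pals, d.pals = [b, c], [a, d], [a, d], [c, b]
c.content = b"bit rot"  # simulate silent corruption of one stored object

report = circulate_token(a)
print(report)
```

The corrupted bucket would never have noticed the damage on its own; the token gives it the occasion to run the test.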
Communications Tokens
Flocking…
• Craig Reynolds, “Flocks, Herds, and Schools: A Distributed Behavioral Model”, SIGGRAPH 87
• Observations:
– flocks, schools, herds, etc. exhibit many desirable properties:
  • scale-free: neighbors matter, not total size of flock
  • no upper bound: flocks are never “full”
– flocks, etc. can be modeled with simple rules:
  • Collision Avoidance: avoid collisions with nearby flockmates
  • Velocity Matching: attempt to match velocity with nearby flockmates
  • Flock Centering: attempt to stay close to nearby flockmates
Flocking for DLs
Rule: Collision Avoidance
– Flocking Boids: avoid collisions with nearby flockmates
– Flocking Buckets: not overwriting one's own copies nor the copies of other buckets (i.e., namespace collision avoidance)
Rule: Velocity Matching
– Flocking Boids: attempt to match velocity with nearby flockmates
– Flocking Buckets: deleting copies of oneself to provide “space” for late arrivals in a storage location
Rule: Flock Centering
– Flocking Boids: attempt to stay close to nearby flockmates
– Flocking Buckets: following others to available storage locations
Flocking (9,4): “new repository available”
Flocking (10,4)
Future Work
• Friends
– optimizing the connections while sending the communication token
  • convert to small world graph over time
– repair faults in the network
• Family
– types
  • active
  • passive
– provenance / authenticity
Other Applications for Smart Objects
• communication pulses will share the location of new services
– format conversion (migration)
– new repository locations (refreshing)
– submit logs, alerts, other messages to people, services, etc.
• self-arranging displays
Self-Arranging Displays For Buckets
• premise: to have the links in the object reflect the community’s preferences
– real-time computation; no log file processing
– Bollen & Nelson, “Adaptive Networks of Smart Objects”
– http://www.cs.odu.edu/~mln/pubs/bollenj_adaptive.pdf
Hebbian Learning
http://b2?method=display&referer=b2&redirect=http://b1?method\=display\%26redirect=http://b3?method=display\%26referer=http://b2
http://b1?method=display&referer=b1&redirect=http://b2?method\=display\%26referer=http://b1
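The URLs above carry a `referer` and a `redirect`, so each bucket can learn from traversals as they happen. The sketch below shows one simple way that could work: plain frequency reinforcement of the link that was followed. It does not reproduce the update rule from the Bollen & Nelson paper; the `LinkedBucket` class, the `rate` parameter, and the tie-breaking by name are assumptions.

```python
from collections import defaultdict


class LinkedBucket:
    def __init__(self, name, rate=1.0):
        self.name = name
        self.rate = rate                   # hypothetical learning rate
        self.weights = defaultdict(float)  # target name -> link weight

    def traverse(self, target):
        """Record that a user followed a link from this bucket to target."""
        self.weights[target.name] += self.rate

    def links(self):
        """Links ordered by learned weight, strongest first; ties by name."""
        return sorted(self.weights, key=lambda name: (-self.weights[name], name))


b1, b2, b3 = LinkedBucket("b1"), LinkedBucket("b2"), LinkedBucket("b3")
for target in (b2, b3, b3):
    b1.traverse(target)
print(b1.links())
```

The weights are updated at traversal time inside the object, which matches the premise of real-time computation with no log file processing.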
Initial Experiment
• Elango, Bollen & Nelson, "Dynamic Linking of Smart Digital Objects Based on User Navigation Patterns"
– http://www.arxiv.org/abs/cs.DL/0401029
– http://www.acm.org/technews/articles/20046/0607m.html#item8
– Take top 50 all-time pop music bands
  • from Spin Magazine’s top 50 bands of all time
– From each band, take 2 “related” bands
  • according to allmusic.com
– Create network of 150 buckets with band info (metadata from allmusic.com)
– Randomize the network
  • each band points to 3 other randomly selected bands
– Get people to traverse the network…
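The randomization step can be sketched as follows. The band names are placeholders rather than the Spin or allmusic.com data, and `randomize_network` with its `seed` parameter is a hypothetical helper; the sketch shows only the starting condition in which each of the 150 buckets points to 3 other randomly selected buckets.

```python
import random


def randomize_network(names, out_links=3, seed=None):
    """Give every bucket links to out_links other buckets chosen at random."""
    rng = random.Random(seed)
    network = {}
    for name in names:
        others = [other for other in names if other != name]
        network[name] = rng.sample(others, out_links)
    return network


bands = [f"band-{i:03d}" for i in range(150)]  # placeholder names
network = randomize_network(bands, seed=0)
print(len(network), network["band-000"])
```

Starting from random links means any structure that emerges later comes from user traversals rather than from the initial wiring.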
Sample Screenshot
Sample Results
From the Initial Node: Public Enemy
Reviews and Summaries of Related Work
• Fedora, Warwick Framework, Kahn-Wilensky Framework, VERS, Multivalent Documents, Cryptolopes, etc.
– NASA TM 211426
– http://techreports.larc.nasa.gov/ltrs/PDF/2001/tm/NASA-2001-tm211426.pdf
• Journal of Digital Libraries, forthcoming special issue on Complex Digital Objects:
– CFP http://www.dljournal.org/
Risks
• Why have these projects met with limited success or are only used in niche applications?
– it is one thing to add a layer to your DL, but changing the structure of your first-class objects incurs a level of short-term risk
– however, even the most well-thought out componentized DL is subject to long-term risks
  • cf. ICASE DL
Conclusions
• Smart objects are an idea whose time has come
– natural progression of DL R&D
• Smart objects will play a fundamental role in digital preservation
• More info on preservation:
– http://www.cs.odu.edu/~mln/teaching/cs791-s04/