Self-Preserving Digital Objects

Michael L. Nelson

mln@cs.odu.edu

http://www.cs.odu.edu/~mln/

Several Slides from Terry L. Harrison

University of Southern California

6/15/04

Outline

• History

• Preservation

• Archives vs. Objects

• Smart Objects & Dumb Archives

• Self-Preserving Objects

My DL History

• 1992 - work first begun on first generation Langley Technical Report Server (LTRS)
• 1993 - WWW version of LTRS
  • http://techreports.larc.nasa.gov/ltrs/
  • work w/ ODU on WATERS
• 1994 - NASA Technical Report Server (NTRS)
  • distributed searching of many “LTRS-like” servers (20 separate nodes, all NASA centers)
  • http://techreports.larc.nasa.gov/cgi-bin/NTRS
• 1996 - NACA Technical Report Server (NACATRS)
  • http://naca.larc.nasa.gov/
• 1996 - Joint research in DLs with ODU begins
• 1997 - NCSTRL+ (clustering, buckets)
• 1999 - OAI-PMH development begins
• 2001 - Arc, DP9, Archon, Kepler, etc.
• 2002 - OAI-PMH version of the NTRS
  • http://ntrs.nasa.gov/

History

• ca. 1994 - 1995: a LaRC researcher, upon seeing LTRS, remarked:

“all of these reports are nice, but what we really want is the data...”

• ca. 1995 - present: many reports in LTRS start to include data files, appendices, software and other information types

• NACATRS: the scanned nature of the reports implies that 1 report = N files

N >= (pages * 3) + 2

NASA STI

• Formal publications cover a decreasing percentage of NASA’s STI output
  – most DLs focus only on formal publications
• Informal STI is maintained only by a network of collegial distribution
  – aging and shrinking workforce weakens this network
• Customers want much more than formal publication
  – rather than stretch the meaning of “report” or “document”, define a new object for DL transactions

STI Observations

• Media formats are instantiations of a more general class of information
• Most DLs are uni-format, following the obsolete media boundaries of their non-digital predecessors
• “Separate but equal” DLs considered harmful
  – customer should not have to re-integrate what should never have been de-integrated...
  – institutional knowledge being lost because we don’t have a publishing vector established

Pyramid of Scientific and Technical Information (STI)

Journal Articles

Conference Papers

Technical Reports

time

software raw data notes video / images

Information is created in a variety of formats. Formal publications, the focus of

most DL projects, are supported by a pyramid of informal information.

Information Lost Over Time

Project

manuscript

software

raw data

images

library

ftp site

thrown away

filing cabinet

User New Project

Figure 7: STI Lost in Project / Archival / Reuse Process

Content is King

The information content is more important than the systems used for its storage, management and retrieval

Objects should not be “locked” in specific DLs or archives

Prelude to OAI…

• I met Herbert Van de Sompel in April 1999...
  – we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce
  – We wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high quality, yet often isolated, tech report servers, e-print servers, etc.
    • most digital libraries (DLs) had grown up along single disciplines or institutions
  – little to no interoperability; isolated DL “gardens”

Universal Preprint Service

• A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
  – Nelson: NCSTRL+; a modified version of Dienst
    • support for “clustering”
    • support for “buckets”
  – Krichel: ReDIF metadata format
  – Van de Sompel: SFX Linking
• Demonstrated at Santa Fe NM, October 21-22, 1999
  – http://web.archive.org/web/*/http://ups.cs.odu.edu/
  – D-Lib Magazine, 6(2) 2000 (2 articles)
    • http://www.dlib.org/dlib/february00/02contents.html
  – UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/

UPS Participants

Archive / DL                        Records in DL   Buckets in UPS   Buckets Linked to Full Content
arXiv (www.arxiv.org)                      128943            85204                            85204
CogPrints (cogprints.soton.ac.uk)             743              742                              659
NACA (naca.larc.nasa.gov)                    3036             3036                             3036
NCSTRL (www.ncstrl.org)                     29680            25184                             9084
NDLTD (www.ndltd.org)                        1590             1590                              951
RePEc (netec.mcc.ac.uk)                     71359            71359                            13582
Totals:                                    235361           187115                           112516

totals ca. July 1999

Buckets: Information Surrogates in UPS

• Limitations on intellectual property, file size, transmission time, system load, etc. caused us to focus on metadata only

• Metadata was collected into “buckets”, with pointers back to the data files (still at the original sites)

Value Added Services Attached

to the Buckets

SFX Reference Linking Service, developed at Univ of Ghent, Belgium. - provides a layer of indirection between reference services available at a local site and the object itself

SFX “buttons” are attached to the buckets themselves - communication occurs between SFX server and the bucket

Adding other services to the buckets is easy...

Data and Service Providers

• Data Providers
  – publishing into an archive
  – Self-describing archives
    • Much of the learning about the constituent UPS archives occurred out of band…
  – providing methods for metadata “harvesting”
    • provide non-technical context for sharing information also
• Service Providers
  – harvest metadata from providers
  – implement user interface to data

Even if these are done by the same DL, these are distinct roles

Metadata Harvesting

• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
  – data remains at remote repositories

[Diagram: a user searches for “cfd applications” against a local copy of metadata that was harvested offline from several independently maintained nodes. All searching, browsing, etc. is performed on the local metadata; individual nodes can still support direct user interaction.]
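The harvesting model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the `Record` type, and the sample titles are hypothetical, and the "harvest" is an in-memory copy rather than a real network transfer. The point it shows is the split between an offline harvest step and an online search step that touches only the local copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    identifier: str  # pointer back to the item at its home repository
    title: str       # metadata only; the data itself stays remote


# Hypothetical remote nodes; in practice each is an independently run repository.
REMOTE_NODES = {
    "node-a": [Record("node-a:1", "CFD applications in wing design")],
    "node-b": [
        Record("node-b:7", "Turbulence modeling notes"),
        Record("node-b:9", "Survey of CFD applications"),
    ],
}


def harvest(nodes):
    """Offline step: copy metadata (not data) from every node into one local list."""
    local = []
    for records in nodes.values():
        local.extend(records)
    return local


def search(local_metadata, query):
    """Online step: search the local copy only; no remote node is contacted."""
    q = query.lower()
    return [r.identifier for r in local_metadata if q in r.title.lower()]


local_copy = harvest(REMOTE_NODES)
hits = search(local_copy, "cfd applications")
print(hits)
```

Each hit is an identifier pointing back to the original node, which matches the slide's note that the data remains at the remote repositories.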

Result… OAI

• The OAI was the result of the demonstration and discussion during the Santa Fe meeting

– OAI = a bunch of people, a religion, a cult, etc.

– OAI Protocol For Metadata Harvesting (OAI-PMH) = the protocol created and maintained by the OAI

• Initial focus was on federating collections of scholarly e-print materials…

• …however, interest grew and the scope and application of OAI-PMH expanded to become a generic bulk metadata transport protocol

• Note:
  – OAI-PMH is only about metadata -- not full text!
    • but what is metadata vs. full-text?
  – OAI is neutral with respect to the nature of the metadata or the resources the metadata describes
    • read: commercial publishers have an interest in OAI-PMH too...
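As a rough illustration of what "bulk metadata transport" looks like on the wire, the sketch below builds an OAI-PMH `ListRecords` request URL. The `verb`, `metadataPrefix`, `set`, and `from` arguments are part of the protocol; the base URL `http://repository.example.org/oai` and the helper name `list_records_url` are made up for this example, and no request is actually sent.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build a ListRecords request URL for a (hypothetical) OAI-PMH base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # optional selective harvesting by set
    if from_date:
        params["from"] = from_date  # optional selective harvesting by date
    return base_url + "?" + urlencode(params)


url = list_records_url("http://repository.example.org/oai", from_date="2004-01-01")
print(url)
```

A harvester would fetch this URL, parse the XML response, and store the returned metadata locally; that part is omitted here.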

A Look Back at UPS

• Primary outcome of the meeting was the OAI & OAI-PMH

• Krichel: ReDIF metadata:
  – still in use & being developed
• Van de Sompel: SFX
  – OpenURL (NISO Standard)
  – SFX is a commercial OpenURL resolver marketed by Ex Libris
• Nelson:
  – NCSTRL+ begat Arc (arc.cs.odu.edu) and others
  – Buckets?

Componentized Digital Libraries

. . .

RSS

SRW

!?

Preservation

• RLG Report: Preserving Digital Information: Final Report and Recommendations
  – http://www.rlg.org/ArchTF/
  – refreshing - moving to new media
    • considered (comparatively) easy
  – migrating - transitioning to new systems, formats, idioms
    • considered hard

Really Long Term Preservation

• Migration is very hard, to be sure
  – but given sufficient demand, this can be accomplished
  – cf. early 1980s game emulation:
    • http://www.intellivisionlives.com/
    • http://stella.atari.org/
• Refreshing may actually be harder…
  – or at least intrinsically bound to the migration problem
    • http://web.archive.org/web/19980128071544/http://www.usc.edu/
    • http://web.archive.org/web/*/http://library.usc.edu/
    • http://web.archive.org/web/19971210220634/http://lib-www.lanl.gov/

Preservation Metrics So Far

– Nelson & Allen
  • 3% decay of objects in DLs
    – http://www.dlib.org/dlib/january02/nelson/01nelson.html
– Lawrence, et al.
  • 3% decay of URLs included in technical papers
    – http://www.neci.nec.com/~lawrence/papers/persistence-computer01/bib.html
– Koehler
  • ~ 33% of URLs “unstable” or “partially unstable”
  • http://InformationR.net/ir/4-4/paper60.html
– Kahle
  • average URL lasts 44 days
    – http://www.hackvan.com/pub/stig/articles/trusted-systems/0397kahle.html
– Spinellis
  • 28% loss of 5-8 year old URLs from CACM / IEEE Computer
    – http://citeseer.ist.psu.edu/spinellis03decay.html

Case Study: ICASE

• Institute for Computer Applications in Science and Engineering
  – independent research institute affiliated with NASA Langley Research Center
    • www.icase.edu
  – years of operation: 1972-2002
  – combined with other LaRC institutes, rolled into the National Institute for Aerospace (NIA)
• ICASE Report Series
  – pre-prints/e-prints of all ICASE affiliated authors
    • also issued as NASA Contractor Reports
  – Dienst was used for report management & workflow
    • Harrison, Zubair & Nelson, JCDL 03, Dienst <-> OAI-PMH gateway

NIA Transition

• At first, all files at www.icase.edu were lost

• then, the site was brought back online

• but how well do DLs survive bulk-transfer?

Whither the ICASE Digital Library?

it appears to be reinstated… but not completely…

How Long is Forever?

• Average human life span (from: http://www.che.uc.edu/acs/archives/cintacs/vol39no5/vol39no5.html)
  – female: 78
  – male: 77
• Average Fortune 500 company lifespan (from: http://www.businessweek.com/chapter/degeus.htm)
  – 40 - 50 years
• Universities?
• U.S. Government agency or institution?
  – what about individual labs?
    • NASA Zero Base Review
    • U.S. Military BRAC

Self-Preservation

• Objects should be prepared to outlive the people & institutions that are charged with their well-being

• Many areas of risk:

– company, agency, university, etc. ceases to exist

– funding cut

– person dies

– disaster (hurricane, earthquake, etc.)

– malicious attack

P2P Model

• Applicable for scientific and technical information?
  – Napster, Gnutella, etc. rely on the repetitive nature of popular culture media (songs, movies, etc.) to ensure the availability of items
  – a “bubble” of recent and popular interest
• this assumption is probably not valid in STI DLs
  – cf. popularity(HBO) >> popularity(AMC)

Smart Objects, Dumb Archives

DA

Guildford Protocol

OAI-PMH

Buckets

???

Fedora? METS?

“Key Concepts in the Architecture of the Digital Library”

• next 9 slides taken from Bill Arms’s seminal article in the inaugural issue of D-Lib Magazine:
  – http://www.dlib.org/dlib/July95/07arms.html

The technical framework exists within a legal and social framework

• DLs no longer represent systems specific to academics or information specialists
  – content influences how the DL is used

• architecture must allow the implementation of various policies

Understanding of digital library concepts is hampered by terminology

• “common English” != “professional English”
  – multiple professional jargons too

• What do these words mean to you?

– copy

– publish

– content

– document

– work

The underlying architecture should be separate from the content stored in the library

• general purpose functions and content-specific functions should be separated

• TL analogy:
  – the more specific the bookshelf is to holding actual books, the harder it is to repurpose the bookshelf in the future

Names and identifiers are the basic building block for the digital library

• names != addresses

• in any DL architecture diagram, (almost) anything that can be drawn can be named

• consider the impact that handles/DOIs have had on the publishing/DL community

Digital library objects are more than collections of bits

• objects = metadata + data
  – “but what is metadata?”

• don’t ask hard questions

figure 2 in http://www.dlib.org/dlib/July95/07arms.html

The digital library object that is used is different from the stored object

• what you store is not necessarily what you get
  – storage and dissemination are separate events, and can represent separate formats
    • also, potentially separate from the application-specific format

Users want intellectual works, not digital objects

• The DL architect’s needs should not inconvenience the users’ needs

• recombination of objects
  – what is an object in your world view?

figure 4 in http://www.dlib.org/dlib/July95/07arms.html

Repositories must look after the information they hold

• “Repository Access Protocol”
  – Kahn Wilensky Framework

• http://www.cnri.reston.va.us/home/cstr/arch/k-w.html

figure 3 in http://www.dlib.org/dlib/July95/07arms.html

Objects vs. Archives

• This is the tenet that I question…

• Most DL objects still bound to the applications that generate or render the objects

Design Goals

• Aggregation
  – DLs should be shielded from the transient nature of file formats
  – Prevent information hemorrhaging by archiving all data types
• Intelligence
  – Aggregation (above) implies code, why stop at passive objects? Make objects smart...
  – Bucket-bucket & bucket-tool intelligence

Design Goals

• Self-Sufficiency
  – Maximum autonomy & survivability: fully self-sufficient buckets
  – Option to internally store all needed materials
• Mobility
  – Why should an information object be stuck in one place?
  – Mobility for replication, workflow, data collection

Design Goals

• Heterogeneity
  – One size does not fit all...
  – Different buckets for different applications, sites, disciplines, etc.
• Archive Independence
  – Focus is on information, not yet another DL “system”
    • does not require an archive to function
  – “Work with everything; break nothing”

Smart Objects

• aggregate:
  – metadata
  – data
  – methods to operate on the metadata/data
    • http://www.cs.odu.edu/~mln/teaching/cs595-f03/?method=getMetadata&type=all
    • http://www.cs.odu.edu/~mln/teaching/cs595-f03/?method=listMethods
    • http://www.cs.odu.edu/~mln/teaching/cs595-f03/?method=listPreference
    • (cheat) http://www.cs.odu.edu/~mln/teaching/cs595-f03/bucket/bucket.xml
• assumptions
  – Perl
  – http server
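The idea of an object that carries its own methods, selected through a `?method=` argument as in the URLs above, can be sketched as follows. This is not the buckets' Perl implementation: the class layout, the snake_case handler names, and the choice to fall back to `display` when no method is given are all assumptions made for illustration. Only the method names `getMetadata`, `listMethods`, and `display` come from the slides.

```python
from urllib.parse import parse_qs


class Bucket:
    """Toy smart object: metadata + data + the methods that operate on them."""

    def __init__(self, metadata, content):
        self.metadata = dict(metadata)
        self.content = dict(content)
        # Method table; names mirror those in the slide URLs and listing.
        self.methods = {
            "getMetadata": self.get_metadata,
            "listMethods": self.list_methods,
            "display": self.display,
        }

    def get_metadata(self, args):
        return self.metadata

    def list_methods(self, args):
        return sorted(self.methods)

    def display(self, args):
        return sorted(self.content)

    def handle(self, query_string):
        """Dispatch on the 'method' argument of a query string."""
        args = {k: v[0] for k, v in parse_qs(query_string).items()}
        name = args.get("method", "display")  # assumed default for this sketch
        if name not in self.methods:
            raise ValueError(f"unknown method: {name}")
        return self.methods[name](args)


bucket = Bucket({"title": "cs595-f03"}, {"syllabus.txt": b"..."})
methods = bucket.handle("method=listMethods")
print(methods)
```

Because the method table travels with the object, a client needs nothing beyond the object's URL to discover what it can do.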

Internal Structure

jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls
bucket/ CVS/ index.cgi*
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/
bucket.xml* content/ CVS/ lib/ logs/ methods/
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/content/
~syllabus.txt ~week1~readings.html ~week5~readings.html
~week10~readings.html ~week1~week-01.ppt ~week6~readings.html
~week11~readings.html ~week2~readings.html ~week7~readings.html
~week12~readings.html ~week2~week-02.ppt ~week8~readings.html
~week13~readings.html ~week3~assignment1.ppt ~week9~readings.html
~week14~readings.html ~week3~readings.html
~week15~readings.html ~week3~week-03.ppt
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/lib
CVS/ EZXML.pm mime.e style.css
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/logs/
access.log CVS/ mylog.log
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 % ls bucket/methods/
addElement.pl* getElement.pl* listMethods.pl* setPreference.pl*
CVS/ get_log.pl* listPreference.pl*
deleteElement.pl* getlog.pl* log.pl*
display.pl* getMetadata.pl* setMetadata.pl*
jaga.cs.odu.edu:/home/mln/public_html/teaching/cs695-f03 %

Examples

• 1.6.X bucket
  – http://ntrs.nasa.gov/
  – http://www.cs.odu.edu/~mln/phd/
• 2.0 buckets
  – http://www.cs.odu.edu/~mln/teaching/cs595-f03/
  – http://www.cs.odu.edu/~lutken/smalltest/1120/
• 3.0 buckets (under development)
  – http://www.cs.odu.edu/~jallen/buckets/
  – uses MPEG-21 DIDLs
    • cf. http://www.dlib.org/dlib/november03/bekaert/11bekaert.html

Self-Preservation

• Objectives:
  – knowledge of the system state not required
    • i.e. -- you don’t need to keep track of where everything is…
  – the knowledge required for each object should be minimal
    • actually, the required number of “friends” should be finite, even in very large systems

Friends and Family

• Friends
  – connections to “other” buckets
• Family
  – connections to replications of you

Scenario: 3 buckets / 2 pals each

A Pals: b,c    B Pals: a,c    C Pals: b,a

We want to add new_guy (D)

A Pals: b,c    B Pals: a,c    C Pals: b,a    D Pals: (none)

Tool calls: C.insert(D, "start")

A Pals: b,c    B Pals: a,c    C Pals: b,a    D Pals: (none)

D is added to C’s pal list

A Pals: b,c    B Pals: a,c    C Pals: b,a,(d)    D Pals: (none)

C pal_list is overstuffed

Return handshake: D.insert(C, "finish")

A Pals: b,c    B Pals: a,c    C Pals: b,a,(d)    D Pals: c

C pal_list is overstuffed

C “refits” pal list

A Pals: b,c    B Pals: a,c    C Pals: b,a,(d)    D Pals: c

C pal_list is overstuffed

Refit step 1: C.pop_1st_pal not known by (D)

A Pals: b,c    B Pals: a,c    C Pals: b,a,d    D Pals: c

Now C pal_list is overstuffed

Refit step 2: B.pop_pal( C )

A Pals: b,c    B Pals: a,c    C Pals: a,d    D Pals: c

Refit step 2: B.insert( D, "start" )

A Pals: b,c    B Pals: a,d    C Pals: a,d    D Pals: c

Refit step 3: D.insert( B, "finish" )

A Pals: b,c    B Pals: a,d    C Pals: a,d    D Pals: c,b

10 Buckets, 4 Friends: Step 2

10 Buckets, 4 Friends: Step 3

10 Buckets, 4 Friends: Step 4

10 Buckets, 4 Friends: Step 5

10 Buckets, 4 Friends: Step 6

10 Buckets, 4 Friends: Step 7

10 Buckets, 4 Friends: Step 8

10 Buckets, 4 Friends: Step 9

10 Buckets, 4 Friends: Step 10

20 Buckets, 4 Friends

100 Buckets, 10 Friends

Building the Network

Bucket:
    this_node_name;
    max_friend_size;
    list_of_pals;

insert(new_guy, string handshake)
// Adds new_guy to this bucket's pal list
// handshake = "start" or "finish"
{
    if (I know new_guy) {
        return;
    } else {
        put new_guy at end of my pal list;
        if (handshake == "start") {
            new_guy.insert(this_node_name, "finish");
        }
        if (my pal list is now overstuffed) {
            refit();
        }
    }
    return list_of_pals;
}

refit()
// To keep pal_list from being overstuffed
{
    read in new_guy's pal list;
    pop_1st_pal_list();
        // I remove 1st pal "Y" from my list that's
        // not present in new_guy's pal list
    Y.pop_from_list(Me);
        // Have "Y" pop "Me"
    Y.insert(new_guy, "start");
        // Y adds new_guy to his list
        // this will call new_guy to add "Y" as well
}
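The insert/refit pseudocode can be turned into a small runnable sketch. The Python below is one possible reading of it, not the original implementation: the class and method names are adapted from the slide, pal lists hold object references rather than node names, and `refit` takes `new_guy` as an explicit argument (an assumption, since the slide's `refit()` takes none). No claim is made that it terminates for every network size; it is only exercised on the three-bucket scenario walked through earlier.

```python
class Bucket:
    def __init__(self, name, max_friends):
        self.name = name
        self.max_friends = max_friends
        self.pals = []  # ordered list of Bucket objects

    def insert(self, new_guy, handshake):
        """Add new_guy to this bucket's pal list; handshake is 'start' or 'finish'."""
        if new_guy is self or new_guy in self.pals:
            return self.pals  # already known: nothing to do
        self.pals.append(new_guy)
        if handshake == "start":
            new_guy.insert(self, "finish")  # return handshake
        if len(self.pals) > self.max_friends:
            self.refit(new_guy)  # pal list is overstuffed
        return self.pals

    def refit(self, new_guy):
        """Drop the first pal that new_guy does not know, and hand new_guy to it."""
        for pal in self.pals:
            if pal is not new_guy and pal not in new_guy.pals:
                break
        else:
            return  # no candidate found; leave the list as it is
        self.pals.remove(pal)
        pal.pop_from_list(self)        # have "Y" pop "Me"
        pal.insert(new_guy, "start")   # Y and new_guy become pals

    def pop_from_list(self, other):
        if other in self.pals:
            self.pals.remove(other)


def run_scenario():
    """Replay the slides: A, B, C each have 2 pals, then D joins via C."""
    a, b, c, d = (Bucket(n, 2) for n in "ABCD")
    a.pals, b.pals, c.pals = [b, c], [a, c], [b, a]
    c.insert(d, "start")
    return {x.name: [p.name for p in x.pals] for x in (a, b, c, d)}


result = run_scenario()
print(result)
```

Run on the scenario from the slides, the sketch ends with B and C each holding (a, d) and D holding (c, b), the same final state the walk-through shows.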

Communications Cost: Building the Network

• Total communications cost to build the network

b^2 - f - (b - f)^2

• b = # of buckets

• f = # of friends
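To get a feel for the numbers, the build-cost expression can be evaluated for the network sizes used elsewhere in the talk. Expanding the square gives the equivalent form 2bf - f^2 - f, which makes it easier to see that the cost grows linearly in b for a fixed f. The function names below are illustrative, and the slide does not say what unit the cost is counted in.

```python
def build_cost(b: int, f: int) -> int:
    """Build cost as given on the slide: b^2 - f - (b - f)^2."""
    return b**2 - f - (b - f) ** 2


def build_cost_expanded(b: int, f: int) -> int:
    """Same quantity after expanding the square: 2bf - f^2 - f."""
    return 2 * b * f - f**2 - f


sizes = [(10, 4), (20, 4), (100, 10)]  # (buckets, friends) pairs from the slides
costs = {bf: build_cost(*bf) for bf in sizes}
print(costs)
```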

Communications Cost: Building the Network

Communications Cost: Traversing the Network

• Flood algorithm:

b(f-1) - f + 2

• Spanning Tree:

b - 1

• Upper bound on the diameter of the network:

(b-f)/2 + 1
  – (typically much less)
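The three traversal expressions can be compared side by side with a short calculation. The function names are illustrative only; the formulas are taken directly from the slide, with b the number of buckets and f the number of friends.

```python
def flood_cost(b: int, f: int) -> int:
    """Flood algorithm cost: b(f-1) - f + 2."""
    return b * (f - 1) - f + 2


def spanning_tree_cost(b: int) -> int:
    """Spanning tree cost: b - 1."""
    return b - 1


def diameter_upper_bound(b: int, f: int) -> float:
    """Upper bound on the network diameter: (b-f)/2 + 1."""
    return (b - f) / 2 + 1


for b, f in [(10, 4), (100, 10)]:
    print(b, f, flood_cost(b, f), spanning_tree_cost(b), diameter_upper_bound(b, f))
```

For 10 buckets with 4 friends each, flooding costs 28 against 9 for a spanning tree, which suggests why a tree-based traversal is attractive once the network is built.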

Network Resiliency

• The network can survive at least f-1 node (bucket) or edge (communications) failures and still remain fully connected

Cf. Other P2P Projects

• Gnutella
  – also O(N^2) to build the network
    • currently don’t know the exact message cost
• Chord, Tapestry, etc.
  – content addressable networking
    • hash function to map keys to locations
  – orthogonal to buckets

Chatting

• the stored objects are inactive until invoked
  – if no one communicates with the object, it never wakes up, can never perform self-tests, etc.
• solution:
  – circulate a number of tokens through the network to ensure that everyone is woken up
  – buckets can perform a number of administrative tasks at these times
• Core to solving the migration issue
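A minimal sketch of the wake-up idea follows. It assumes the pal lists are available as a plain dictionary and models token circulation as a breadth-first pass from one starting bucket; the talk does not specify the actual circulation strategy, so both choices are assumptions, as are the names `circulate_token` and `on_wake`.

```python
from collections import deque


def circulate_token(pals, start, on_wake):
    """Pass a token along pal links until every reachable bucket has woken once."""
    woken = []
    seen = {start}
    queue = deque([start])
    while queue:
        name = queue.popleft()
        on_wake(name)          # the bucket runs its administrative tasks here
        woken.append(name)
        for pal in pals[name]:
            if pal not in seen:
                seen.add(pal)
                queue.append(pal)
    return woken


# Pal lists from the end of the three-bucket scenario in the slides.
PALS = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["C", "B"]}
log = []
order = circulate_token(PALS, "A", lambda name: log.append(f"{name}: self-test ok"))
print(order)
```

The `on_wake` callback is where a bucket would run self-tests or learn about new services; here it only records that it was called.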

Communications Tokens

Flocking…

• Craig Reynolds, “Flocks, Herds, and Schools: A Distributed Behavioral Model”, SIGGRAPH 87
• Observations:
  – flocks, schools, herds, etc. exhibit many desirable properties:
    • scale-free
      – neighbors matter, not total size of flock
    • no upper bound
      – flocks are never “full”
  – flocks, etc. can be modeled with simple rules:
    • Collision Avoidance: avoid collisions with nearby flockmates
    • Velocity Matching: attempt to match velocity with nearby flockmates
    • Flock Centering: attempt to stay close to nearby flockmates

Flocking for DLs

Rule: Collision Avoidance
  Flocking Boids: avoid collisions with nearby flockmates
  Flocking Buckets: not overwriting one's own copies nor the copies of other buckets (i.e., namespace collision avoidance)

Rule: Velocity Matching
  Flocking Boids: attempt to match velocity with nearby flockmates
  Flocking Buckets: deleting copies of oneself to provide “space” for late arrivals in a storage location

Rule: Flock Centering
  Flocking Boids: attempt to stay close to nearby flockmates
  Flocking Buckets: following others to available storage locations

Flocking (9,4) “new repository available”

“new repository available”

Flocking (10,4)

Future Work

• Friends
  – optimizing the connections while sending the communication token
    • convert to small world graph over time
  – repair faults in the network
• Family
  – types
    • active
    • passive
  – provenance / authenticity

Other Applications for Smart Objects

• communication pulses will share the location of new services
  – format conversion (migration)
  – new repository locations (refreshing)
  – submit logs, alerts, other messages to people, services, etc.

• self-arranging displays

Self-Arranging Displays For Buckets

• premise: to have the links in the object reflect the community’s preferences
  – real-time computation; no log file processing
  – Bollen & Nelson, “Adaptive Networks of Smart Objects”
  – http://www.cs.odu.edu/~mln/pubs/bollenj_adaptive.pdf

Hebbian Learning

http://b2?method=display&referer=b2&redirect=http://b1?method\=display\%26redirect=http://b3?method=display\%26referer=http://b2

http://b1?method=display&referer=b1&redirect=http://b2?method\=display\%26referer=http://b1
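The URLs above pass a `referer` argument so that a bucket learns which bucket the user arrived from. A much-simplified sketch of how such traversals could reinforce links is shown below. It implements frequency-only reinforcement with a fixed increment, which is an assumption made for illustration; the learning rules in the cited Bollen & Nelson paper may well differ, and the class and method names are hypothetical.

```python
from collections import defaultdict


class LinkWeights:
    """Toy Hebbian-style link store: links that are followed more often get stronger."""

    def __init__(self, rate=1.0):
        self.rate = rate                    # assumed fixed reinforcement per traversal
        self.weights = defaultdict(float)   # (referer, target) -> weight

    def record_traversal(self, referer, target):
        """Strengthen the referer -> target link each time a user follows it."""
        self.weights[(referer, target)] += self.rate

    def ranked_links(self, source):
        """Targets reachable from source, strongest link first (ties by name)."""
        links = [(t, w) for (s, t), w in self.weights.items() if s == source]
        return [t for t, _ in sorted(links, key=lambda tw: (-tw[1], tw[0]))]


lw = LinkWeights()
for referer, target in [("b1", "b2"), ("b1", "b3"), ("b1", "b2"), ("b2", "b3")]:
    lw.record_traversal(referer, target)
ranked = lw.ranked_links("b1")
print(ranked)
```

Because the update happens at traversal time, the ranking is always current, in line with the slide's point about real-time computation with no log file processing.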

Initial Experiment

• Elango, Bollen & Nelson, "Dynamic Linking of Smart Digital Objects Based on User Navigation Patterns"
  – http://www.arxiv.org/abs/cs.DL/0401029
  – http://www.acm.org/technews/articles/20046/0607m.html#item8
  – Take top 50 all-time pop music bands
    • from Spin Magazine’s top 50 bands of all time
  – From each band, take 2 “related” bands
    • according to allmusic.com
  – Create network of 150 buckets with band info (metadata from allmusic.com)
  – Randomize the network
    • each band points to 3 other randomly selected bands
  – Get people to traverse the network…

Sample Screenshot

Sample Results

From the Initial Node: Public Enemy

Reviews and Summaries of Related Work

• Fedora, Warwick Framework, Kahn-Wilensky Framework, VERS, Multivalent Documents, Cryptolopes, etc.
  – NASA TM 211426
  – http://techreports.larc.nasa.gov/ltrs/PDF/2001/tm/NASA-2001-tm211426.pdf
• Journal of Digital Libraries, forthcoming special issue on Complex Digital Objects:
  – CFP http://www.dljournal.org/

Risks

• Why have these projects met with limited success or found use only in niche applications?
  – it is one thing to add a layer to your DL, but changing the structure of your first-class objects incurs a level of short-term risk
  – however, even the most well-thought-out componentized DL is subject to long-term risks
    • cf. ICASE DL

Conclusions

• Smart objects are an idea whose time has come
  – natural progression of DL R&D
• Smart objects will play a fundamental role in digital preservation
• More info on preservation:
  – http://www.cs.odu.edu/~mln/teaching/cs791-s04/
