
ECDL 2005, September 18th - 23rd 2005, Vienna, Austria

File-based storage of Digital Objects: XMLtapes & Internet Archive ARC files
Xiaoming Liu, Luda Balakireva, Herbert Van de Sompel
RESEARCH LIBRARY

File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files

Xiaoming Liu (1), Luda Balakireva (1), Patrick Hochstenbach (2) and Herbert Van de Sompel (1)

(1) Digital Library Research & Prototyping Team
Research Library, Los Alamos National Laboratory

(2) University Library
Ghent University

[email protected] , [email protected] , [email protected] , [email protected]



Disclaimer

• The term Digital Object (DO) will be used as in Kahn/Wilensky:
  o Compound object
  o Multiple datastreams of different MIME types
  o Secondary information pertaining to object and datastreams
  o Identifiers for object (and datastreams)

• This roughly corresponds to OAIS Content Information

                          Type             MIME             Identifier
Digital Object            scholarly paper  N/A              DOI
Constituent Datastream 1  metadata record  application/xml  PMID
Constituent Datastream 2  fulltext file    application/pdf  –


XML-based representation of DOs

• Growing interest in XML-based representation of DOs in Digital Library architectures:
  o Platform independence
  o Industry support
  o Longevity, potential migration paths
  o Processing tools, validation capabilities

• XML-based Compound Object formats:
  o ISO/IEC 21000-2 MPEG-21 DID & DIDL
  o METS
  o IMS/CP
  o CCSDS XFDU

• Typical functionality:
  o By-Value (base64) and/or By-Reference provision of constituent datastreams
  o By-Value and/or By-Reference provision of secondary information
  o Provision of identifiers


Storing XML-based representations of DOs

• Existing approaches:
  o Storage of the XML representations as individual files in a file system:
    - Poor access performance
    - Poor backup performance
  o Storage of the XML representations in (SQL, XML, object) databases:
    - Long term? Data depend on the underlying database system
  o Storage of the XML representations by concatenating many such documents into a single file such as tar or zip:
    - Not XML-aware, hence no use of off-the-shelf XML tools
    - Increased storage space (base64 encoding of the constituent datastreams)


aDORe XMLtape/ARCfile solution

• Part of the LANL aDORe repository effort:
  o Standards-based, modular repository architecture:
    - Distributed architecture
    - Protocol-based interactions between modules
    - Usable to create interoperable federations of heterogeneous repositories
  o Actual implementation of the architecture at LANL
  o Components of aDORe software will be released

• Inspired by the Internet Archive ARC file approach:
  o File-based mechanism to store datastreams resulting from Web crawling
  o Concatenation of multiple datastreams into a single file
  o Metadata as separators between datastreams
  o But not suitable for storing XML-based representations of DOs:
    - Metadata capabilities are very limited and crawling-related
    - The power of XML processing tools is lost


• Two interconnected file-based storage mechanisms:
  o XMLtapes: file storage of XML-based representations of Digital Objects
  o ARCfiles: file storage of constituent datastreams of Digital Objects

• The ARCfiles are interconnected with one or more XMLtapes during the ingestion process

• A protocol-based access mechanism is introduced:
  o XMLtape is exposed as an autonomous OAI-PMH repository
  o ARCfile is exposed as an OpenURL Resolver

• Write once, read many:
  o Files remain stable
  o Protocol-based access mechanism remains stable
  o Indexing mechanisms can change as technologies evolve

• Storage approach is independent of the compound object format used to represent DOs as XML:
  o aDORe uses MPEG-21 DIDL


ISO/IEC 21000-2: MPEG-21 DID & DIDL

[Figure: a Digital Item has a declaration, the Digital Item Declaration, which has an XML serialization, the DIDL document. The Digital Item Declaration is based on the MPEG-21 Abstract Model, which has MPEG-21 DIDL as its XML serialization; the DIDL document is based on MPEG-21 DIDL.]

Representing DOs using MPEG-21 DID

[Figure: sample DIDL document, shown from the OAIS package perspective and the OAIS content perspective. Elements shown: <DIDL>, <Item>, <Item>, <Component>, <Component>. Labels shown: DIDLDocumentid="info:lanl-repo/i/58f202ac"; ID="uuid-0000a01c", ID="uuid-00004a42", ID="uuid-00005e90", ID="uuid-888b135e"; info:doi/10.123/44455, info:pmid/2225887, info:lanl-repo/ds/380b1f5c, info:lanl-repo/ds/f1ec7e32.]


aDORe XMLtape

• An XML file that concatenates the XML-based representations of multiple DOs

• Structure is defined by an XML Schema:
  o http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtape.xsd
  o Tape-level administrative section:
    - Open-ended content
    - Plug-in for processing-related information, indication of related ARCfiles:
      - http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtapeBasics.xsd
  o Concatenation of records, each of which consists of:
    - Record-level administrative section:
      - Identifier and datestamp of the contained record
      - Other record-level administrative information
    - A record (can be from any XML Namespace). DIDL in the case of aDORe:
      - http://purl.lanl.gov/aDORe/schemas/2005-08/DIDL.xsd

• An XMLtape is a valid and well-formed XML file

• Independent of the chosen XML-based Compound Object Format


<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin> ... </ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
      <ta:recordAdmin> ... </ta:recordAdmin>
    </ta:tapeRecordAdmin>
    <ta:record>
      <didl:DIDL>...</didl:DIDL>
    </ta:record>
  </ta:tapeRecord>
</ta:tape>

Sample XMLtape (aDORe ta:tape)
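
As an illustration of the point that off-the-shelf XML tools can process an XMLtape, the following is a minimal Python sketch that walks the tape records with the standard library. The namespace URI and element names are taken from the sample above; the payload element and its urn:example:payload namespace are placeholders, since the DIDL content is elided on the slide. This is not the aDORe implementation.

```python
import xml.etree.ElementTree as ET

# Namespace URI as shown in the sample XMLtape.
NS = {"ta": "http://library.lanl.gov/2005-08/aDORe/XMLtape/"}

# Hypothetical tape: the payload is a placeholder, not a real DIDL document.
SAMPLE_TAPE = """<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin/>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
    </ta:tapeRecordAdmin>
    <ta:record><payload xmlns="urn:example:payload">placeholder</payload></ta:record>
  </ta:tapeRecord>
</ta:tape>
"""


def list_tape_records(tape_xml):
    """Return (identifier, datestamp, payload element) for each tape record."""
    root = ET.fromstring(tape_xml.encode("utf-8"))
    records = []
    for rec in root.findall("ta:tapeRecord", NS):
        identifier = rec.findtext("ta:tapeRecordAdmin/ta:identifier", namespaces=NS)
        date = rec.findtext("ta:tapeRecordAdmin/ta:date", namespaces=NS)
        payload = rec.find("ta:record", NS)[0]  # first child of <ta:record>
        records.append((identifier.strip(), date.strip(), payload))
    return records
```

Because the record payload can come from any XML namespace, the sketch returns it as an opaque element rather than interpreting it.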


aDORe XMLtape index

[Figure: each record in the XMLtape has a corresponding index entry holding its identifier and datestamp of ingestion, both taken from the record's <ta:tapeRecordAdmin>.]

Indexing:
• Can be achieved with a variety of technologies
• Current implementation: Berkeley DB Java Edition
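
One way such an index could work is sketched below in Python. The slides do not say what the index stores besides identifier and datestamp; the byte offset and length kept here are an assumption, a plain dict stands in for Berkeley DB, and the regular expressions assume the fixed ta: prefix used in the samples. A production indexer would more likely use a streaming XML parser.

```python
import re

# Assumes records are written with the literal "ta:" prefix, as in the samples.
RECORD_RE = re.compile(rb"<ta:tapeRecord>.*?</ta:tapeRecord>", re.DOTALL)
ID_RE = re.compile(rb"<ta:identifier>\s*(.*?)\s*</ta:identifier>", re.DOTALL)
DATE_RE = re.compile(rb"<ta:date>\s*(.*?)\s*</ta:date>", re.DOTALL)


def _tape_record(identifier, date, payload):
    """Build one hypothetical tape record as bytes."""
    return (
        b"<ta:tapeRecord><ta:tapeRecordAdmin>"
        b"<ta:identifier>" + identifier + b"</ta:identifier>"
        b"<ta:date>" + date + b"</ta:date>"
        b"</ta:tapeRecordAdmin><ta:record>" + payload + b"</ta:record></ta:tapeRecord>"
    )


# Hypothetical two-record tape used only for illustration.
SAMPLE_TAPE = (
    b'<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">'
    + _tape_record(b"info:example/rec/1", b"2005-03-29T04:31:22Z", b"<x>one</x>")
    + _tape_record(b"info:example/rec/2", b"2005-03-30T10:00:00Z", b"<x>two</x>")
    + b"</ta:tape>"
)


def build_index(tape_bytes):
    """Map identifier -> (datestamp, byte offset, byte length) of its record."""
    index = {}
    for match in RECORD_RE.finditer(tape_bytes):
        chunk = match.group(0)
        identifier = ID_RE.search(chunk).group(1).decode("utf-8")
        datestamp = DATE_RE.search(chunk).group(1).decode("utf-8")
        index[identifier] = (datestamp, match.start(), len(chunk))
    return index


def read_record(tape_bytes, index, identifier):
    """Fetch one record by slicing the tape at the indexed offset."""
    _, offset, length = index[identifier]
    return tape_bytes[offset:offset + length]
```

Since the tape itself is never modified, an index like this can be discarded and rebuilt with a different technology at any time.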


aDORe XMLtape as OAI-PMH repository

[Figure: an OAI-PMH request is answered through the index (identifier/datestamp entries), which locates the matching record in the XMLtape; the DIDL document is returned.]

OAI-PMH identifier = identifier from <ta:tapeRecordAdmin>

OAI-PMH datestamp = datetime from <ta:tapeRecordAdmin>

OAI-PMH response = content of <ta:record>
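
The mapping above can be sketched in Python as follows. The request parameters (verb, identifier, metadataPrefix) and the record/header/metadata structure come from the OAI-PMH 2.0 protocol; the base URL and the metadataPrefix value in the usage note are hypothetical, and this is not the OCLC OAICat code that aDORe uses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "http://www.openarchives.org/OAI/2.0/"  # OAI-PMH 2.0 namespace


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one tape record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


def to_oai_record(identifier, datestamp, payload):
    """Wrap the tape record's identifier, date and payload as an OAI-PMH record."""
    record = ET.Element(f"{{{OAI}}}record")
    header = ET.SubElement(record, f"{{{OAI}}}header")
    ET.SubElement(header, f"{{{OAI}}}identifier").text = identifier
    ET.SubElement(header, f"{{{OAI}}}datestamp").text = datestamp
    metadata = ET.SubElement(record, f"{{{OAI}}}metadata")
    metadata.append(payload)  # content of <ta:record>, passed through unchanged
    return record
```

For example, get_record_url("http://example.org/oai", "oai:aps.org:PhysRevA.71.040101", "didl") builds a request for the record from the sample XMLtape against a made-up endpoint.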


Internet Archive ARCfile

• Concatenation of binary files

• Designed and used by the Internet Archive (Wayback Machine):
  o > 400 TB of web data

• Under revision by the International Internet Preservation Consortium (IIPC): WARC file format
  o Input from LANL to facilitate non-Web-crawling use cases

• The ARC file format is structured as follows:
  o A file header that provides administrative information about the ARC file itself
  o A sequence of document records, each consisting of:
    - A header line containing some, mainly crawl-related, metadata:
      - URI of the crawled document
      - Timestamp of acquisition of the data
      - Size of the data block
    - A response to a protocol request such as an HTTP GET


Internet Archive ARC file

filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 Alexa Internet
URL IP-address Archive-date Content-type Archive-length

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30

<HTML>
Hello World!!!
</HTML>

Sample ARC file
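
The record layout can be read back with very little code. The Python sketch below parses the five space-separated header fields (URL, IP-address, Archive-date, Content-type, Archive-length) and then reads exactly Archive-length bytes. The sample bytes are built by a helper that computes each length itself, so they illustrate the layout rather than reproduce the slide byte for byte; the helper and field names are this sketch's own.

```python
import io

FIELDS = ("url", "ip_address", "archive_date", "content_type", "archive_length")


def read_arc_record(stream):
    """Read one ARC record: a header line, then archive_length bytes of data."""
    line = stream.readline()
    while line in (b"\n", b"\r\n"):  # skip blank separator lines
        line = stream.readline()
    if not line:
        return None  # end of file
    values = line.decode("utf-8").rstrip("\r\n").split(" ")
    if len(values) != len(FIELDS):
        raise ValueError(f"unexpected ARC header line: {line!r}")
    header = dict(zip(FIELDS, values))
    header["archive_length"] = int(header["archive_length"])
    body = stream.read(header["archive_length"])
    return header, body


def make_record(url, ip, date, content_type, body):
    """Build one record for the sample below (hypothetical helper)."""
    header = f"{url} {ip} {date} {content_type} {len(body)}\n".encode("utf-8")
    return header + body + b"\n"


VERSION_BLOCK = (
    b"1 0 Alexa Internet\n"
    b"URL IP-address Archive-date Content-type Archive-length\n"
)
HTTP_RESPONSE = (
    b"HTTP/1.0 200 Document follows\r\n"
    b"Content-type: text/html\r\n\r\n"
    b"<HTML>\nHello World!!!\n</HTML>\n"
)
SAMPLE_ARC = (
    make_record("filedesc://IA-001102.arc", "0", "19960923142103",
                "text/plain", VERSION_BLOCK)
    + make_record("http://www.dryswamp.edu:80/index.html", "127.10.100.2",
                  "19961104142103", "text/html", HTTP_RESPONSE)
)
```

Calling read_arc_record repeatedly on io.BytesIO(SAMPLE_ARC) yields the file header record, then the document record, then None.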


Internet Archive ARC file in aDORe

filedesc://singletape.arc 0.0.0.0 20050922142103 text/plain 76
1 0 Internet Archive
URL IP-address Archive-date Content-type Archive-length

info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a 0.0.0.0 20050907221344 application/pdf 415025
%PDF-1.3
%âãÏÓ
290 0 obj
<< /Linearized 1 /O 295 /H [ 3642 1057 ] /L 415025…

Sample aDORe ARC file
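
In the aDORe sample, the URL field carries the datastream identifier and the IP-address field is 0.0.0.0. A minimal Python sketch of writing such a record follows. The function name is hypothetical, and treating the Archive-date as UTC is an assumption the slides do not confirm.

```python
from datetime import datetime, timezone


def arc_record(datastream_id, content_type, data, when):
    """Serialize one datastream as an ARC record in the style of the sample.

    `when` must be a timezone-aware datetime; it is written as
    YYYYMMDDhhmmss in UTC (the UTC choice is an assumption).
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    header = f"{datastream_id} 0.0.0.0 {stamp} {content_type} {len(data)}\n"
    return header.encode("utf-8") + data + b"\n"
```

The Archive-length field is computed from the bytes actually written, which is what lets a reader skip from one record to the next without inspecting the data.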


Internet Archive ARC file

[Figure: each datastream in the ARC file has a corresponding index entry keyed by the URL field of its record header (URL IP-address Archive-date Content-type Archive-length).]

Indexing:
• Can be achieved with a variety of technologies
• Current implementation in aDORe: Heritrix toolkit
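
To illustrate how a URL-keyed index over an ARC file might be built, here is a Python sketch that scans the record headers and remembers where each data block starts. It does not use Heritrix and does not produce the CDX format; the dict layout (URL mapped to offset and length) and the sample identifiers are assumptions for illustration.

```python
import io


def _record(url, content_type, data):
    """Build one hypothetical ARC record for the sample below."""
    header = f"{url} 0.0.0.0 20050907221344 {content_type} {len(data)}\n"
    return header.encode("utf-8") + data + b"\n"


SAMPLE_ARC = (
    _record("info:example/ds/1", "application/pdf", b"%PDF-1.3 placeholder")
    + _record("info:example/ds/2", "application/xml", b"<metadata/>")
)


def index_arc(arc_bytes):
    """Map each record's URL field to (offset of its data block, length)."""
    index = {}
    stream = io.BytesIO(arc_bytes)
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        if line.strip() == b"":
            continue  # blank separator between records
        url, _ip, _date, _ctype, length = line.decode("utf-8").split()
        index[url] = (stream.tell(), int(length))
        stream.seek(int(length), io.SEEK_CUR)  # jump over the data block
    return index
```

Only header lines are parsed; data blocks are skipped by length, so binary content never has to be interpreted.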


ARC file as OpenURL Resolver

[Figure: an OpenURL request is answered through the index (URL entries), which locates the matching datastream in the ARC file; the datastream is returned.]

Referent Identifier = datastream identifier = URL from ARC record header

Resolver Identifier = identifier of ARC file
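
The resolver side of this mapping can be sketched in Python as below: res_id selects the ARC file and rft_id selects the datastream through that file's index. The in-memory arc_files dict, its (bytes, index) layout and the example identifiers are assumptions made for this sketch; aDORe itself uses the OCLC OpenURL software.

```python
from urllib.parse import parse_qs

# Hypothetical ARC file with a single record, plus its index.
ARC_BYTES = b"info:example/ds/1 0.0.0.0 20050907221344 text/plain 5\nhello\n"
DATA_OFFSET = ARC_BYTES.index(b"\n") + 1  # data starts after the header line
ARC_FILES = {
    "info:example/arc/1": (ARC_BYTES, {"info:example/ds/1": (DATA_OFFSET, 5)}),
}


def resolve(query_string, arc_files):
    """Return the datastream bytes addressed by an OpenURL query string."""
    params = parse_qs(query_string)
    if params.get("url_ver") != ["Z39.88-2004"]:
        raise ValueError("not a Z39.88-2004 OpenURL")
    arc_id = params["res_id"][0]          # Resolver Identifier: which ARC file
    datastream_id = params["rft_id"][0]   # Referent Identifier: which datastream
    arc_bytes, index = arc_files[arc_id]
    offset, length = index[datastream_id]
    return arc_bytes[offset:offset + length]
```

With the sample data, the query url_ver=Z39.88-2004&rft_id=info:example/ds/1&res_id=info:example/arc/1 selects the single stored datastream.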


Associating an XMLtape with ARC Files (1)

• A Digital Object is represented using an XML-based Complex Object format (e.g. MPEG-21 DID)

• The resulting package (e.g. DIDL document) is stored in an XMLtape

• Constituent datastreams of the Digital Object are provided By-Reference:
  o Using the ref attribute of the Resource element in MPEG-21 DID
  o The value of the network location of the constituent datastream is compliant with the NISO OpenURL Framework:

    baseURL(ARCfile OpenURL Resolver)?
      url_ver = Z39.88-2004 &
      rft_id = Datastream Identifier &
      res_id = ARCfile identifier
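
The template above translates directly into code. The Python sketch below builds such a URL and reads the parameters back; the function names are hypothetical, and percent-encoding of the identifiers is a choice made by urlencode rather than something the slides prescribe.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def datastream_openurl(resolver_base_url, datastream_id, arcfile_id):
    """Build the By-Reference location of a datastream as an OpenURL."""
    query = urlencode({
        "url_ver": "Z39.88-2004",
        "rft_id": datastream_id,  # Datastream Identifier
        "res_id": arcfile_id,     # ARCfile identifier
    })
    return f"{resolver_base_url}?{query}"


def openurl_params(url):
    """Return (rft_id, res_id) decoded from an OpenURL built as above."""
    params = parse_qs(urlsplit(url).query)
    return params["rft_id"][0], params["res_id"][0]
```

During ingestion, a URL of this form would be written into the ref attribute of the Resource element, so that the DIDL document in the XMLtape points at the datastream in the ARC file.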


<?xml version="1.0" encoding="UTF-8"?>
<didl:DIDL>
  ……
  <didl:Component id="uuid-ddec9dbb-90e5-4b8a-93f3-dd1c8b781547">
    <didl:Descriptor>
      <didl:Statement mimeType="application/xml; charset=utf-8">
        <dii:Identifier … >
          info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b
        </dii:Identifier>
      </didl:Statement>
    </didl:Descriptor>
    <didl:Resource mimeType="application/pdf"
      ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver?
        url_ver=Z39.88-2004&amp;
        res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2&amp;
        rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>
  </didl:Component>
  ……
</didl:DIDL>

Extract from DIDL
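
A consumer of the DIDL document can recover the datastream and ARC file identifiers from the ref attribute. The Python sketch below does so while matching elements by local name only, because the namespace declarations are elided in the extract; the urn:example namespace URIs in the sample are placeholders, and the sample assumes the OpenURL parameters are joined with &amp; as the OpenURL template suggests.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit

# Placeholder namespaces: the real declarations are not shown in the extract.
SAMPLE_DIDL = """<didl:DIDL xmlns:didl="urn:example:didl" xmlns:dii="urn:example:dii">
  <didl:Component id="uuid-ddec9dbb-90e5-4b8a-93f3-dd1c8b781547">
    <didl:Descriptor>
      <didl:Statement mimeType="application/xml; charset=utf-8">
        <dii:Identifier>info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b</dii:Identifier>
      </didl:Statement>
    </didl:Descriptor>
    <didl:Resource mimeType="application/pdf"
      ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver?url_ver=Z39.88-2004&amp;res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2&amp;rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>
  </didl:Component>
</didl:DIDL>
"""


def local_name(tag):
    """Strip the '{namespace}' part of an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def datastream_references(didl_xml):
    """Return (rft_id, res_id, mimeType) for every By-Reference Resource."""
    refs = []
    for element in ET.fromstring(didl_xml).iter():
        if local_name(element.tag) == "Resource" and element.get("ref"):
            query = parse_qs(urlsplit(element.get("ref")).query)
            refs.append(
                (query["rft_id"][0], query["res_id"][0], element.get("mimeType"))
            )
    return refs
```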


Associating an XMLtape with ARC Files (2)

• An XMLtape is associated with its corresponding ARCfiles through a plug-in for the XMLtape-level administrative section.


<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
      <tb:processSoftware>gov.lanl.xmltape.SingleTapeWriter</tb:processSoftware>
      <tb:processTime>2005-09-07T22:13:39Z</tb:processTime>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
    ...
</ta:tape>

XMLtape header
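
Reading the association back out of the tape-level administrative section could look like the Python sketch below. Namespace URIs and element names are taken from the header above; allowing several tb:ARCfileId elements per tape is an assumption, as the sample shows only one.

```python
import xml.etree.ElementTree as ET

NS = {
    "ta": "http://library.lanl.gov/2005-08/aDORe/XMLtape/",
    "tb": "http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/",
}

# Trimmed copy of the header, closed so that it parses on its own.
SAMPLE_HEADER = """<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
</ta:tape>
"""


def related_arcfiles(tape_xml):
    """Return (XMLtape identifier, list of associated ARCfile identifiers)."""
    root = ET.fromstring(tape_xml)
    basics = root.find("ta:tapeAdmin/tb:XMLtapeBasics", NS)
    tape_id = basics.findtext("tb:XMLtapeId", namespaces=NS)
    arc_ids = [element.text for element in basics.findall("tb:ARCfileId", NS)]
    return tape_id, arc_ids
```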


[Figure: interaction between an AGENT, the Identifier Locator, an XMLtape (with a DIDLDocument-id index and a creation datetime index) and an ARC file (with a datastream-id index). Arrow labels: "DIDLDocument-id or content-id" and "List of (baseURL, DIDLDocument-id)" at the Identifier Locator; "DIDLDocument-id or content-id" and "DIDL document" at the XMLtape; "ref", "OpenURL", "datastream-id" and "datastream" at the ARC file.]


aDORe XMLtape/ARCfile environment

[Figure: an XMLtape (A.xml, with index A.idx) holds tape records and an XMLtape basics version block; two ARCfiles (1.arc with index 1.cdx, 2.arc with index 2.cdx) hold arc records and a version block. Tape records point to arc records via OpenURL. An XMLtape registry and an ARCfile registry are also shown. URLs shown: http://barracuda.lanl.gov/moai2/, http://barracuda.lanl.gov/openurl, http://cox.lanl.gov/taperegistry/OAIHandler, http://cox.lanl.gov/arcregistry/OAIHandler.]


Implementation

• XMLtapes:
  o Berkeley DB Java Edition
  o OCLC OAICat

• ARCfiles:
  o Heritrix
  o OCLC OpenURL software

• XMLtape Registry:
  o MySQL database
  o OCLC OAICat

• ARCfile Registry:
  o MySQL database
  o OCLC OAICat


Performance indicators

• System:
  o Model: Dell 2650 2U rack-mount server
  o CPU: dual 2.8 GHz Intel Xeon processors
  o RAM: 5 GB
  o Disks: 10k RPM SCSI disks

• XMLtape:
  o 1786 MB, 201872 DIDL records
  o Download 100 consecutive DIDL records (787 KB) => 0.18 seconds
  o Download static file of same size => 0.09 seconds

• ARCfile:
  o 272 MB, 4910 files
  o Download a sample PDF file (312 KB) => 0.24 seconds
  o Download static file of same size => 0.036 seconds
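
To put these timings side by side, the short Python sketch below computes the overhead relative to serving a static file of the same size, and the effective throughput. The figures are the ones listed above; the ratios (roughly 2 times for the XMLtape and roughly 6.7 times for the ARCfile) are derived here and do not appear on the slide.

```python
# (served seconds, static-file seconds, size in KB), as listed on the slide.
MEASUREMENTS = {
    "XMLtape (100 DIDL records, 787 KB)": (0.18, 0.09, 787),
    "ARCfile (sample PDF, 312 KB)": (0.24, 0.036, 312),
}


def overhead(served_seconds, static_seconds):
    """How many times slower than serving a static file of the same size."""
    return served_seconds / static_seconds


def throughput_kb_per_s(size_kb, seconds):
    """Effective transfer rate in KB per second."""
    return size_kb / seconds


for label, (served, static, size_kb) in MEASUREMENTS.items():
    print(
        f"{label}: {overhead(served, static):.1f}x static, "
        f"{throughput_kb_per_s(size_kb, served):.0f} KB/s"
    )
```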


Software

• ARC files:
  o Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. http://crawler.archive.org/
  o NetArchive.dk: a project that plans for the preservation of Denmark's cultural heritage on the internet for future generations. http://www.netarchive.dk/
  o Many other tools: http://archive-access.sourceforge.net

• XMLtapes:
  o Perl tool, XML::Tape (LANL & Ghent University), http://search.cpan.org/~hochsten/XML-Tape/

• Combined aDORe XMLtape/ARCfile environment:
  o Java tool (LANL), soon to be released on SourceForge


Conclusion

• The file-based approach is inherently simple and reduces dependency on a database system.

• Because the indexes are autonomous from the files, the files can be retained unchanged over time while the indexes are re-created with other techniques as technologies evolve.

• The protocol-based access introduces another layer of abstraction, which increases flexibility as technologies evolve.

• The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features:
  o Off-the-shelf XML tools can be used to parse/validate an XMLtape
  o All DO metadata can be stored in an XML-based compound object format

Presentation available via http://public.lanl.gov/herbertv/
Install TSCC codec for avi movies