Interoperation and Infrastructure for Digital Archiving: the LuKII Project by Michael Seadle & Peter Schirmbacher, Humboldt-Universität zu Berlin & Reinhard Altenhöner & Tobias Steinke, Deutsche Nationalbibliothek Berlin School of Library and Information Science

Introduction

In June 2007, a DFG-sponsored workshop on digital archiving took place in Berlin. Interoperability between LOCKSS (Lots of Copies Keep Stuff Safe) and KOPAL (Co-operative Development of a Long-Term Digital Information Archive) was one of the most discussed ideas to emerge from that workshop.

Michael Seadle Berlin School of Library &

Information Science Humboldt Universität zu Berlin

Scholarly infrastructure

  Today's scholarly infrastructure depends heavily on digital materials. In some fields, particularly in the natural sciences, digital publication is taken for granted.

More publishers are launching new journals only in digital formats and open-access publications are almost exclusively digital.

Repositories

Repositories offer ways to collect and give access to digital information. They lack the infrastructure to check integrity with a statistically significant likelihood of finding and fixing problems, or to address usability problems through regular migration.

Open Access

Germany has played a leading part internationally in the open access movement. As a result, its institutional repositories contain a wealth of research works.

Cost Effectiveness

Cost-effectiveness is key because long-term digital archiving is expensive.

Universities and their libraries have grown accustomed to paying the costs for retaining paper works, including their housing, handling and repair after heavy use.

Those costs will not go away any time soon, which means that the cost of digital preservation comes in addition to, not instead of, existing costs.

LuKII goals

 The first goal of this project is to establish interoperability between KOPAL (from Germany) and LOCKSS (from the US) in order to marry German goals for migration and usability with cost-effective bitstream preservation.  The second goal is to test the prototype interoperable system by harvesting a wide variety of data from German OPUS and eDoc institutional repositories.

LOCKSS

LOCKSS (Lots of Copies Keep Stuff Safe), from Stanford University, is arguably the earliest digital preservation and dissemination system.

It is known in particular for its robustness in maintaining the integrity of the digital object.  LOCKSS has faced genuine attack scenarios, shifted platforms, and tested format migration network-wide.

Bitstream integrity

Bitstream integrity is broadly seen in the US as the sine qua non of long-term digital archiving. If the file is damaged, usability/readability and authenticity cease to be meaningful.

LOCKSS is neutral toward usability/readability solutions and can function with more than one.

Archival Storage

  The Archival Storage in LOCKSS uses seven separate nodes to check routinely on the integrity of an archived bitstream and to take action to replace a damaged copy.  The updated version is copied to other LOCKSS boxes in the network, but the older version is also retained in case of future need.
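
The check-and-repair idea on this slide can be illustrated with a deliberately simplified sketch. It is not the LOCKSS polling protocol. It only assumes that each node holds a copy of the same bitstream, that copies are compared by SHA-256 digest, and that a copy disagreeing with the majority is replaced. The function `audit_and_repair` and the node names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a bitstream."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(copies: dict[str, bytes]) -> list[str]:
    """Compare all copies by digest and overwrite minority copies.

    `copies` maps a (hypothetical) node name to the bytes it holds.
    Returns the names of the nodes whose copy was replaced.
    Raises ValueError if no digest is held by a strict majority.
    """
    digests = {node: digest(data) for node, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        raise ValueError("no majority; cannot decide which copy is intact")

    # Any node that agrees with the majority can serve as the repair source.
    good = next(data for node, data in copies.items() if digests[node] == winner)

    repaired = []
    for node in copies:
        if digests[node] != winner:
            copies[node] = good
            repaired.append(node)
    return repaired


# Seven nodes, one of which holds a damaged copy.
nodes = {f"node{i}": b"archived article" for i in range(1, 8)}
nodes["node4"] = b"archived artiXle"
repaired_nodes = audit_and_repair(nodes)
print(repaired_nodes)
```

With an odd number of nodes such as seven, a single damaged copy is always outvoted, which is why the sketch insists on a strict majority before repairing anything.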

Context

  Context plays an important role in LOCKSS. The URL of the original work is stored with the digital object.

This not only allows the system to recognize and refer back to the original version of a digital document in order to check routinely for changes without requiring human intervention, but also lets the system know if the original for some reason ceases to be available online.
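
A minimal sketch of this use of context follows. It assumes that the archive keeps the original URL and a digest next to each object, and that a routine check re-fetches the URL and compares. The class `ArchivedObject`, the function `check_original` and the `fetch` callable are hypothetical names for illustration, not part of LOCKSS.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ArchivedObject:
    url: str       # URL of the original work, stored with the object
    sha256: str    # digest of the archived bitstream


def check_original(obj: ArchivedObject,
                   fetch: Callable[[str], Optional[bytes]]) -> str:
    """Classify the state of the original relative to the archived copy.

    `fetch` is a hypothetical callable returning the current bytes at a URL,
    or None when the original is no longer available online.
    """
    current = fetch(obj.url)
    if current is None:
        return "unavailable"
    if hashlib.sha256(current).hexdigest() == obj.sha256:
        return "unchanged"
    return "changed"


# Stand-in for the network: a dictionary playing the role of the publisher's site.
site = {"http://example.org/a.pdf": b"version 1"}
archived = ArchivedObject(
    url="http://example.org/a.pdf",
    sha256=hashlib.sha256(b"version 1").hexdigest(),
)

status_before = check_original(archived, site.get)
site["http://example.org/a.pdf"] = b"version 2"
status_changed = check_original(archived, site.get)
del site["http://example.org/a.pdf"]
status_gone = check_original(archived, site.get)
print(status_before, status_changed, status_gone)
```

Passing the fetcher in as a parameter keeps the check free of human intervention while letting the example run without network access.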

Ingest

 The current LOCKSS ingest process (its SIP or Submission Information Package in OAIS terms) uses a crawler that efficiently harvests all documents in a standard tree-structure website when it has permission from a “manifest” on the server being harvested.

The manifest serves as a guarantee to publishers that the LOCKSS crawler only takes materials that they have explicitly authorized.
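
The permission check can be sketched as a crawl that starts only if the manifest page carries a permission statement, and that never leaves the directory tree below the manifest. This is an illustration under stated assumptions, not the LOCKSS crawler: the wording of `PERMISSION`, the `fetch` callable and the example URLs are all assumed for the sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed wording of the permission statement; treat it as illustrative.
PERMISSION = "LOCKSS system has permission to collect, preserve, and serve"


class LinkParser(HTMLParser):
    """Collect the href targets of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(manifest_url, fetch):
    """Harvest pages below the manifest's directory, if permission is granted.

    `fetch` is a hypothetical callable mapping a URL to its HTML text.
    Returns the sorted list of harvested URLs (empty without permission).
    """
    manifest = fetch(manifest_url)
    if PERMISSION not in manifest:
        return []

    base = manifest_url.rsplit("/", 1)[0] + "/"
    seen, queue = set(), [manifest_url]
    while queue:
        url = queue.pop()
        if url in seen or not url.startswith(base):
            continue  # stay inside the authorized tree
        seen.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        queue.extend(urljoin(url, link) for link in parser.links)
    return sorted(seen)


# A tiny in-memory "website" standing in for a publisher's server.
pages = {
    "http://example.org/journal/manifest.html":
        f"<p>{PERMISSION} this Archival Unit</p><a href='v1/index.html'>Vol 1</a>",
    "http://example.org/journal/v1/index.html":
        "<a href='a1.html'>A1</a> <a href='http://other.example/x.html'>X</a>",
    "http://example.org/journal/v1/a1.html": "<p>Article</p>",
}
harvested = crawl("http://example.org/journal/manifest.html", pages.__getitem__)
print(harvested)
```

The link to `other.example` is skipped without being fetched, which mirrors the guarantee that only explicitly authorized material is taken.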

Cost-effectiveness

 Cost-effectiveness has been an integral feature of LOCKSS design from the outset. It helps to reduce costs by using cheap and simple equipment.

The fact that it is open source means that libraries and other preservation-oriented institutions world-wide can use it without paying for permission. LOCKSS is used by 197 libraries and institutions in 19 countries.

LOCKSS Alliance

   LOCKSS Alliance membership is not required for the use of an open source package like LOCKSS, though it is strongly encouraged as a way of sharing development and support costs.

Community

  LOCKSS looks to a community of developers at member institutions of the LOCKSS Alliance to help to keep it up to date.

This community-based co-development on the Linux model is particularly cost-effective. Cost is obviously a factor for a commercial firm with profits to make.

KOPAL Background

The goal of the KOPAL project (2004–2007), funded by the Federal Ministry for Education and Research (Bundesministerium für Bildung und Forschung), was the cooperative development of a long-term digital information archive.

The archival system is based on DIAS by IBM, which was originally developed for the Koninklijke Bibliotheek of the Netherlands (KB).

KOPAL

 The German National Library and the Staats- und Universitätsbibliothek Göttingen (SUB Göttingen) use KOPAL, whose DIAS (Digital Archive Information System) core was developed by IBM for the National Library of the Netherlands.

Additional open source software has enhanced the ingest procedures and has provided tools to enable preservation planning activities like systematic migration workflows.

KOPAL users

The DIAS system for the KOPAL solution is currently used by two clients, DNB and SUB Göttingen. Their data are stored and accessible independently of each other. The system is located at Göttingen, which is responsible for guaranteeing bitstream preservation.

Universal Object Format

The KOPAL system tries to deal with the problem of obsolete file formats and rendering environments by supporting file format migration throughout its architecture.

Every archival package is in an openly defined format called the Universal Object Format, which describes a structure to record preservation metadata together with the content files.

koLibRI

 The koLibRI Java software library was developed by the German National Library and SUB Göttingen within the KOPAL project to support the integration of DIAS in the local IT infrastructure of the clients. Its tasks are:

• Encapsulate the communication with DIAS
• Create archival objects conforming to the Universal Object Format
• Automatically generate technical metadata with the tool JHOVE
• Manage the ingest and the access to DIAS
• Manage the workflow to migrate file formats in archival objects based on given parameters and migration tools
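
The last task in this list, migrating file formats inside an archival object based on parameters and a migration tool, can be sketched independently of koLibRI. The sketch below is in Python and does not use koLibRI, DIAS or JHOVE. The classes `ContentFile` and `ArchivalObject`, the function `migrate` and the toy converter are hypothetical and only illustrate the shape of such a workflow.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ContentFile:
    name: str
    fmt: str      # format label, e.g. as identified by a characterization tool
    data: bytes


@dataclass
class ArchivalObject:
    files: list[ContentFile]
    history: list[str] = field(default_factory=list)  # provenance notes


def migrate(obj: ArchivalObject, source_fmt: str, target_fmt: str,
            tool: Callable[[bytes], bytes]) -> ArchivalObject:
    """Return a new object version in which files of `source_fmt` are converted.

    The input object is left untouched, so the earlier version stays available.
    """
    new_files = []
    for f in obj.files:
        if f.fmt == source_fmt:
            stem = f.name.rsplit(".", 1)[0]
            new_files.append(
                ContentFile(f"{stem}.{target_fmt}", target_fmt, tool(f.data)))
        else:
            new_files.append(f)
    note = f"migrated {source_fmt} -> {target_fmt}"
    return ArchivalObject(new_files, obj.history + [note])


def toy_converter(data: bytes) -> bytes:
    """Stand-in for a real migration tool; it only tags the bytes."""
    return b"converted:" + data


original = ArchivalObject([ContentFile("thesis.doc", "doc", b"abc"),
                           ContentFile("cover.png", "png", b"\x89PNG")])
migrated = migrate(original, "doc", "pdf", toy_converter)
print([f.name for f in migrated.files], migrated.history)
```

Returning a new object rather than modifying the old one is a design choice of the sketch: it keeps the pre-migration version intact and records what was done in the history.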

KOPAL advantages

KOPAL gains several advantages in working with LOCKSS:
• LOCKSS's strength in preserving bitstream integrity
• LOCKSS's effective dissemination package
• The shared support and development structure of the LOCKSS Alliance

LOCKSS advantages

KOPAL's state-of-the-art presentation environment offers a solution for digital objects that are no longer usable. Since KOPAL's systematic migration-flow guarantees the long-term usability and accessibility of digital objects, it complements the functions of LOCKSS well.

1st Objective

 The goal of this project is to make open access repositories in Germany, both discipline-specific and institutional, more robust over time.

The first objective involves establishing a LOCKSS network in Germany and providing the technical support to maintain it without constant reference to the LOCKSS teams in Stanford or Edinburgh.

2nd Objective

 Interoperability with KOPAL is the second objective.  David Rosenthal (Stanford/LOCKSS) in private correspondence suggested the following three types of interoperability:

• Transfer interoperability
• Dissemination interoperability
• Audit interoperability

3rd Objective

  The third objective is to test the interoperability prototype (the “LuKII prototype”) by harvesting digital contents from a selection of German institutional repositories from the OA-Netzwerk-Projekt.

3rd Objective

Among the key development issues for this third objective are:
• ingest automation,
• cost-effective metadata creation,
• format migration testing.

An essential feature of long-term digital archiving systems is that they are freed as much as possible from the need for costly human intervention.

Current status

The project has the following rough timeline:
• March/April: hiring staff
• May: development of the LOCKSS network in Germany
• June: training for Berlin technical staff at Stanford
• July/August: programming for METS and query support at Stanford; programming for the SFTP crawler, and parsing and extracting METS metadata, at Berlin
• September: koLibRI generation of data for testing LOCKSS modifications at D-NB; implementation into the test LOCKSS network (Berlin/Stanford)
• October: first repository data load; start of iterative tool development
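
The step of parsing and extracting METS metadata can be illustrated with the Python standard library. The sketch below is not project code. It assumes a METS document with a `fileSec` whose `file` elements carry `ID`, `MIMETYPE` and `CHECKSUM` attributes and an `FLocat` child with an `xlink:href`. The sample document and the function `list_files` are made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Made-up sample; the checksum is the MD5 digest of an empty file.
SAMPLE = """<?xml version="1.0"?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp>
      <mets:file ID="f1" MIMETYPE="application/pdf"
                 CHECKSUM="d41d8cd98f00b204e9800998ecf8427e" CHECKSUMTYPE="MD5">
        <mets:FLocat LOCTYPE="URL" xlink:href="thesis.pdf"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_files(mets_xml: str) -> list[dict]:
    """Extract id, MIME type, checksum and location for each METS file entry."""
    root = ET.fromstring(mets_xml)
    href_attr = f"{{{NS['xlink']}}}href"  # namespaced attribute name
    files = []
    for file_el in root.iterfind(".//mets:file", NS):
        flocat = file_el.find("mets:FLocat", NS)
        files.append({
            "id": file_el.get("ID"),
            "mimetype": file_el.get("MIMETYPE"),
            "checksum": file_el.get("CHECKSUM"),
            "href": flocat.get(href_attr) if flocat is not None else None,
        })
    return files


extracted = list_files(SAMPLE)
print(extracted)
```

Checksums extracted this way could then feed an integrity comparison between archives, though how the project wires these steps together is not stated here.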

Conclusion

Scholarly research on long-term digital archiving is just beginning. Today's system designs may no longer be the ideal in 50 or 100 years.

The more that systems can cooperate and interoperate, the greater the chances that investments in archiving systems can be carried into the future.

Sources

• Deutsche Initiative für Netzwerkinformationen (2009), “Open Access-Netzwerk Projekt”. Available (Dec 2009): http://www.dini.de/projekte/oa-netzwerk/
• Library of Congress (2009), “Metadata Encoding and Transmission Standard”. Available (Dec 2009): http://www.loc.gov/standards/mets/
• Library of Congress, National Digital Information Infrastructure Preservation Program (2009), “WARC, Web ARChive file format”. Available (Dec 2009): http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
• LOCKSS (2009), “Libraries”. Available (Dec 2009): http://www.lockss.org/lockss/Libraries
• LOCKSS (2009), “Publications”. Available (Dec 2009): http://www.lockss.org/lockss/Publications
• Country (Ranking Web of Repositories).
• Seadle, Michael & Elke Greifeneder (2008), “In archiving we trust: Results from a workshop at Humboldt University in Berlin”. First Monday 13(1).
• Directory of open access journals. Available at: http://www.doaj.org/doaj?func=findJournals [Accessed January 23, 2009].
