Post on 15-Sep-2020


Digital Preservation Workshop

5 November 2010, University of Calgary

Peter Van Garderen, Artefactual Systems

Workshop Agenda

10:00 Introductions

10:15 What is digital preservation?

From strategy to implementation

Archivematica technical overview

12:00 Lunch

13:30 Free and open source software

Archivematica & ICA-AtoM demo/tutorial

Preservation planning (time permitting)

16:00 Wrap-up

NOTE: open discussion / Q&A encouraged throughout

The content in this presentation may be freely re-used under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike 3.0 license.

Attribution:
Title: Archivematica: Digital Preservation Workshop
Creator: Peter Van Garderen, Artefactual Systems
Date: November 5, 2010

© Artefactual Systems Inc. 2010

Peter Van Garderen, President / Systems Archivist

Evelyn McLellan, Systems Archivist

Jack Bates, Software Engineer

David Juhasz, Software Engineer

Austin Trask, Systems Engineer

Jesús García Crespo, Software Engineer

Joseph Perry, Software Engineer

open-source software for archives and libraries

digital preservation consulting services

http://artefactual.com

Digital Preservation: planning for the long-term accessibility and usability of authentic digital information

a.k.a. digital curation

digital continuity

The Digital Preservation Problem:

Fragility of digital storage media

Lack or loss of adequate metadata

Lack of responsibility and resources

The complexity of digital information

The volume of digital information

Rapid technological change

[Figure: a digital object shown as information (content, context, structure, presentation, behaviour) and as digital data (intellectual entity; object entity: representation, file, bitstream, e.g. 1010 1000 1011 1101)]

Information ↔ Record ↔ Archival Material

now future

bitstream

header information

storage media

package

storage device

storage driver

file system

error correction

operating system

application software

user interface

input / output devices

metadata

find

relate / bind

authenticate

contextualize

stored

conserved

protected

Digital Preservation is Risk Management

Risk

Inability to provide services, manage programs and operate business functions efficiently because of digital information that is not accessible or usable.

Risk

Poor quality decision-making because digital information that would have been otherwise available has disappeared or can’t be trusted to be authentic.

Risk

Exposure to legal liability because the digital information that serves as evidence of the organization’s compliance or accountability in its contractual, governance and administrative obligations has been lost or can’t be trusted to be authentic.

Risk

Heightened risk of non-compliance with laws, regulations and policies due to inaccessible electronic records.

Risk

The organization acquires a reputation as an irresponsible, incompetent or untrustworthy institution.

Risk

Lost opportunities to re-use and exploit information in digital form.

Risk

Unforeseen cost-creep because the ongoing preservation of digital information is overlooked during the calculation of costs for new or modified systems.

Risk

Corporate memory loss and ‘cultural amnesia’ as the digital information that documents the governance, administration and culture of an organization or society disappears from servers and systems before steps have been taken to preserve it.

The Business Case for Digital Preservation

• Manage risks
• Manage information
• Manage storage

The anti-Business Case for Digital Preservation

• “don’t we already have backup and a business continuity plan?”
• “don’t we just upgrade the software?”
• “storage is cheap”
• “we’ll just index everything”
• “why can’t we use the ERDMS/ECM system for this?”

ERDMS / ECM

Digital Archives

Staff Desktops (email, docs, files)

Business Systems (structured data)

Staff External Researchers

active documents

inactive documents

Legacy Systems & Data

Website(s)

Scanning / Imaging

Individual/Ad Hoc Accessions

capture

capture

transfer

destroy

access access access

store

store

organize

preserve

organize

archival material

E-Record Creating Environment

Category 1: preserve in source system

Category 2: transfer to digital archives

ERP, DAM, Data Warehouse

??

records schedule

ERDMS, Shared Drives, Email ??

What are the requirements for a digital preservation system?

...that depends: what kind of future were you expecting?

now future?

Digital Preservation Strategies

● technology preservation
● emulation
● migration
● normalization

Digital Preservation System Core Requirements

● OCLC/NARA Trustworthy Repositories: Audit & Certification (TRAC) www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf
● ISO 14721 – Open Archival Information System (OAIS) http://public.ccsds.org/publications/archive/650x0b1.pdf
● PREMIS Data Dictionary for Preservation Metadata www.loc.gov/standards/premis/v2/premis-2-0.pdf

Data Management

Preservation Planning

Archival Storage

Ingest

Administration

SIP

MANAGEMENT

AIP Access DIP

PRODUCER

CONSUMER

Open Archival Information System

Open Archival Information System

● ISO 14721
● High level reference model
● Default language of the digital preservation world
● Key concepts:
  ● Mandatory Responsibilities
  ● Functional Entities
  ● Information Packages
  ● Actors

OAIS Solutions

Proprietary
● Safety Deposit Box – www.tessella.com
● Rosetta – www.exlibrisgroup.com

Open Source
● RODA – http://roda.di.uminho.pt/
● Archivematica – http://archivematica.org

OAIS Solutions vs. Digital Preservation Tools

Plato Planets Testbed DROID PRONOM JHOVE Fedora Xena Dioscuri

OAIS Solutions vs Digital Preservation Initiatives

Planets CASPAR InterPARES NDIIPP

Business Case Opportunities

● ERDMS, ECM implementation
● Enterprise search implementation
● Business process/records scheduling analysis
● Archiving and storage pressure
● Audits
● FOI and disclosure/transparency initiatives
● Response to digital preservation strategy
● Inter-institutional partnerships

Project Milestones

● Strategy
● Business Case
● Technical Analysis
● Proof-of-concept
● Pilot(s)
● Production

System maintenance System scope

Identifying Pilot Projects

Scoring Factors
● Technical difficulty
● Intrinsic value
● Obsolescence risk
● Transparency risk
● Breadth of collection
● Descriptive metadata

Pilot Candidates
● Institutional Repository
● Email
● ERDMS
● Shared directories
● External media

Project Costs

● Software licenses (proprietary)
● Software installation
  − Application server
  − Security
  − Storage integration
● Software customization
  − Legacy/source system migrations
  − System-specific ingest/transfer templates
  − Access system integration
● Annual maintenance
  − Technical support
  − Release upgrades
● Staff
● Hardware
● Storage

What is Archivematica?

Archivematica is a comprehensive digital preservation system.

Archivematica uses a micro-services design pattern to provide an integrated suite of free and open-source tools that allows users to process digital objects from ingest to access in compliance with the ISO-OAIS functional model.

Archivematica uses METS, PREMIS, Dublin Core and other best practice metadata standards.

Archivematica implements media type preservation plans based on an analysis of the significant characteristics of file formats.

Where did Archivematica come from?

● Artefactual Systems
● City of Vancouver Archives
● UNESCO Memory of the World
● International Monetary Fund Archives
● Rockefeller Archives Center
● University of British Columbia Library
● ?

ISO-OAIS

OAIS Use Cases

UML Activity Diagrams

Digital Archives software system

requirements

ISO-OAIS

OAIS Use Cases

UML Activity Diagrams

System Workflow Instructions

http://archivematica.org/docs

requirements

documentation

Producer places SIP in shared folder on host machine

[host]/sendSIP/

The producer places a folder of objects in a designated folder on his or her computer. This designated folder has been set up so that it automatically sends its contents to a shared folder in Archivematica.

SIP appears in shared folder in Archivematica

/1-receiveSIP/

The shared folder in Archivematica is 1-receiveSIP. When you are processing the SIP, leave the original copy in this folder as a backup in case you need to go back and start again.

Archivist copies SIP to SIP review folder

/2-reviewSIP/

Archivist reviews SIP

/2-reviewSIP/

Check the SIP to make sure it conforms to the Submission Agreement. If an MD5checksum.txt file is included, right-click and select Verify MD5 Checksum. Otherwise Archivematica will add checksums to the SIP logs directory and verify checksums at various times throughout the ingest process.
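The checksum check in this step can be sketched in Python. This is a minimal illustration, not Archivematica's own code. It assumes MD5checksum.txt uses the common md5sum layout (hex digest, whitespace, relative path), which the slides do not specify, and the function names `md5_of` and `verify_sip` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sip(sip_dir, manifest_name="MD5checksum.txt"):
    """Compare each manifest entry with the file on disk.

    Assumes md5sum-style lines: "<hex digest>  <relative path>".
    Returns the relative paths that are missing or whose digest differs.
    """
    sip_dir = Path(sip_dir)
    failures = []
    for line in (sip_dir / manifest_name).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*").strip()  # drop md5sum's binary-mode marker
        target = sip_dir / rel_path
        if not target.is_file() or md5_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list from `verify_sip` means every file listed in the manifest is present and unchanged. Any returned path is a file the archivist would want to check against the producer's copy.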

Archivist adds descriptive metadata

/2-reviewSIP/

Open the SIP. Right-click and select Add Dublin Core XML from the drop-down menu. Right-click the dublincore.xml file to open it with Mousepad. Add descriptive information to the appropriate Dublin Core elements and save the file.
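The same descriptive metadata could also be produced by a script rather than edited by hand. The sketch below builds a small Dublin Core XML document with Python's standard library. The wrapper element name `metadata` and the field values are assumptions for illustration only; the actual dublincore.xml template shipped with Archivematica may be structured differently.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


def build_dublin_core(fields):
    """Build a simple Dublin Core XML string from a dict of element -> value.

    The root element name "metadata" is a hypothetical wrapper, not
    taken from the Archivematica template.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical descriptive values for one SIP.
xml_text = build_dublin_core({
    "title": "Example transfer",
    "creator": "Example Records Office",
    "date": "2010-11-05",
})
print(xml_text)
```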

Archivist moves SIP to quarantine

/3-quarantineSIP/

= manual step = automated step

= file directory

Agile development method

● Time-based system releases
  ● Feb 2009: Release 0.1-alpha
  ● May 2010: Release 0.6-alpha
  ● November 2010: Release 0.7-alpha

● Each iteration leads to updated and improved:
  ● Requirements
  ● Software
  ● Documentation
  ● Development resources

http://archivematica.org/software

micro-service

watched input directory

success output directory

error output directory
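The watched-directory pattern in this diagram can be sketched in Python as follows. This is an assumption-laden illustration, not Archivematica's implementation: the function names are hypothetical, the directory is scanned once rather than watched continuously, and the task is any callable that raises an exception on failure.

```python
import shutil
from pathlib import Path


def run_micro_service(watched_dir, success_dir, error_dir, task):
    """Process every item currently in watched_dir exactly once.

    `task` takes a Path and raises an exception on failure. Each item is
    then moved to success_dir or error_dir accordingly.
    Returns a dict mapping item name to "success" or "error".
    """
    watched_dir, success_dir, error_dir = map(
        Path, (watched_dir, success_dir, error_dir)
    )
    success_dir.mkdir(parents=True, exist_ok=True)
    error_dir.mkdir(parents=True, exist_ok=True)
    outcomes = {}
    # sorted() takes a snapshot, so moving items during the loop is safe.
    for item in sorted(watched_dir.iterdir()):
        try:
            task(item)
            destination, outcome = success_dir, "success"
        except Exception:
            destination, outcome = error_dir, "error"
        shutil.move(str(item), str(destination / item.name))
        outcomes[item.name] = outcome
    return outcomes


def reject_empty_files(path):
    """Hypothetical example task: fail on zero-byte files."""
    if path.is_file() and path.stat().st_size == 0:
        raise ValueError(f"{path.name} is empty")
```

One way to chain such services into a pipeline is to use the success directory of one service as the watched directory of the next. A production service would poll the directory or subscribe to file-system events instead of scanning once.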

Archivematica is:

A classic Unix pipeline of OAIS micro-services provided by a series of open-source tools and integration code written in Python and Bash.

Packaged as a virtual appliance that bundles the Xubuntu operating system and can be run within virtual machines, as a bootable USB or Live DVD, or as a bare metal install on dedicated machines.

Free Beer!

“They’ll never take our freedom”

© 1995 Paramount Pictures & 20th Century Fox
See fair use rationale: http://en.wikipedia.org/wiki/File:Brave_mel.jpg

Free Puppy!

Foundation or Steering Committee

Governance

Coordination

Funding

Promotion

Users

Lead institutions
● Funding
● Development

All users
● Bug reports
● Enhancement requests
● Code patches
● Documentation
● Promotion

Open Source Software

Code

Knowledge

Community

Service Providers

Development

Technical Support

Hosting

Training

Promotion

Code, Time, Money, Knowledge

Code, Time, Money, Knowledge

Time, Money, Knowledge

The open-source eco-system

Preservation planning: Normalizing file formats

Defining normalization

What is it?
● Normalization means converting ingested objects into a small number of pre-selected formats

Why do it?
● Some formats are easier to preserve than others
● A smaller number of formats means fewer preservation actions required
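The effect of normalization on the number of formats a repository must manage can be illustrated with a short Python sketch. The extension-to-target pairings below are hypothetical examples chosen for illustration, not Archivematica's actual preservation plans, and the function names are invented here.

```python
from pathlib import Path

# Hypothetical preservation plan: source extension -> target extension.
PRESERVATION_TARGETS = {
    ".jpg": ".tif",
    ".bmp": ".tif",
    ".gif": ".tif",
    ".mp3": ".wav",
}


def plan_normalization(filenames, targets=PRESERVATION_TARGETS):
    """Map each filename to its target extension, or None if no rule applies."""
    return {name: targets.get(Path(name).suffix.lower()) for name in filenames}


def formats_before_and_after(plan):
    """Count distinct extensions before and after normalization."""
    before = {Path(name).suffix.lower() for name in plan}
    after = {target or Path(name).suffix.lower() for name, target in plan.items()}
    return len(before), len(after)


plan = plan_normalization(["a.jpg", "b.BMP", "c.gif", "d.mp3", "e.xyz"])
print(formats_before_and_after(plan))
```

Files with no matching rule map to `None` and keep their original format, which leaves them as candidates for a later preservation decision.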

Normalization vs. migration

Migration is similar to normalization in that it involves converting ingested objects into preservation-friendly formats

Unlike normalization, migration is typically done only when the format is at risk of obsolescence

Migration as a strategy means adopting a wait and see approach

Disadvantages of normalization

● It requires more planning up front to implement
● Re-normalization may be required as better target formats or conversion tools become available

Advantages of normalization

● Taking preservation action on ingest helps define and manage risk
● Adopting a wait and see approach means putting off an undefined amount of work for an indefinite period of time at an unknown cost
● Normalization does not preclude the future use of migration or other strategies such as emulation

Criteria for choosing formats

1. The format must be non-proprietary

− There must be no associated licenses or patents or the possibility of there being such licenses or patents in the future

Criteria for choosing formats

2. There must be freely available specifications

− A specification is a document that explains exactly how the format is structured and rendered

− This specification must be freely available to all and not subject to copyright or other restrictions

Criteria for choosing formats

3. The format should be widely endorsed and/or adopted

− Other established repositories should be using or have endorsed the format

− Formats that have been approved as international standards are particularly desirable

Criteria for choosing formats

4. For images and audio files there should be no compression

5. For video files any compression should be lossless

Criteria for choosing formats

6. There should be writing and rendering tools available for the format

− Idealized standards must be matched by practical tools

− The tools must reliably meet the requirements of the format specifications and must produce normalized objects that are faithful representations of the original objects
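The six criteria can be read as a checklist to apply to any candidate format. The Python sketch below encodes them that way. It is an illustration only: the `FormatProfile` fields, the function name and the sample profiles are hypothetical, and deciding each field's value for a real format still requires the archivist's own research.

```python
from dataclasses import dataclass


@dataclass
class FormatProfile:
    """Hypothetical description of a candidate preservation format."""
    name: str
    media_type: str           # "image", "audio", "video" or other
    proprietary: bool
    open_specification: bool
    widely_adopted: bool
    compression: str          # "none", "lossless" or "lossy"
    tools_available: bool


def unmet_criteria(fmt):
    """Return the numbers (1-6) of the criteria that the format fails."""
    unmet = []
    if fmt.proprietary:
        unmet.append(1)
    if not fmt.open_specification:
        unmet.append(2)
    if not fmt.widely_adopted:
        unmet.append(3)
    if fmt.media_type in ("image", "audio") and fmt.compression != "none":
        unmet.append(4)
    if fmt.media_type == "video" and fmt.compression == "lossy":
        unmet.append(5)
    if not fmt.tools_available:
        unmet.append(6)
    return unmet


# Invented sample profiles, not assessments of real formats.
candidate_a = FormatProfile("example-a", "image", False, True, True, "none", True)
candidate_b = FormatProfile("example-b", "audio", True, False, True, "lossy", True)
print(unmet_criteria(candidate_a), unmet_criteria(candidate_b))
```

Returning the failed criterion numbers, rather than a single pass/fail flag, keeps the reason for rejecting a format visible when the decision is documented.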

http://archivematica.org
