Preservation Metadata Workshop (2) The Hague, the Netherlands 19 June 2014 Titia van der Werf adapted from: Rebecca Guenther, “Metadata for preservation of digital objects: background, functions, and standards” – Preservation Metadata Workshop (1), Hilversum, The Netherlands, 4 March 2014
Preservation Metadata: between theory and practice
OUTLINE
1. General introduction to preservation metadata
2. The PREMIS Data Dictionary
3. A use case: the Preservation Health Check
Introduction to preservation metadata
metadata
Function
• Discovery
• Access
• Management
• Control intellectual property rights
• Identification
• Certify authenticity
• Mark content structure
• Indicate status
• Describe processes
• Etc.
Type
• Descriptive
• Administrative
• Technical
• Rights/Access
• Structural
• Meta-metadata
• Etc.
digital preservation
Digital preservation is part and parcel of the “management and preservation” tasks and responsibilities of a heritage institution. Digital information poses its own set of challenges to preservation:
• The overwhelming volume of digital information created daily and the uncontrolled duplication of information
• The complexity of digital information (content, structure, context, presentation, behaviour) and the evolving boundaries of the scholarly record and the cultural record
• The dependency on software/hardware (incl. incompatible, obscure or proprietary systems)
• The rapid technological change and the danger of obsolescence
• The ease of (accidental or malicious) content alteration
• Doubts about the reliability and integrity of electronic records and the need to vouch for their authenticity
preservation metadata in 2000
“We can then say that the main problem metadata for long term preservation will help to solve is the problem of technological obsolescence.” (p.4)
http://www.kb.nl/sites/default/files/docs/NEDLIBmetadata.pdf
preservation metadata in 2002
“Preservation metadata (…) is the information necessary to maintain the viability, renderability, and understandability of digital resources over the long-term.” (p.1)
http://www.oclc.org/content/dam/research/activities/pmwg/pm_framework.pdf?urlm=161391
preservation metadata in 2005
“Preservation metadata (…) metadata supporting the functions of maintaining viability, renderability, understandability, authenticity, and identity in a preservation context.” (p. ix)
http://www.loc.gov/standards/premis/
The SPOT Model for risk assessment
Six essential properties of successful digital preservation, each with its associated threats:
• Availability
• Identity
• Persistence
• Renderability
• Understandability
• Authenticity
http://www.dlib.org/dlib/september12/vermaaten/09vermaaten.html
metadata and preservation metadata
METADATA: “Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource”
PRESERVATION METADATA: “Metadata that supports and documents the digital preservation process”
supporting and documenting the digital preservation process
• Provenance:
– The chain of custody/ownership of the digital object; info about the depositor; etc.
• Authenticity:
– The documentation of changes affecting the authenticity of the digital object during the preservation process
• Preservation Activity:
– The documentation of actions taken to preserve the digital object
• Technical Environment:
– The documentation of the dependencies on and changes in the technical environment needed to render and use the digital object
• Rights:
– The documentation of the rights and permissions for carrying out preservation activities on the digital object (duplication, migration, transformations)
OAIS Information Model
Figure: Information Package Concepts and Relationships (Figure 2-3)
Figure: Preservation Description Information (Figure 4-16), June 2012 version
Preservation Description Information
• Reference Information: identifiers of the Content
• Provenance Information: history of the custody
• Context Information: relation of the Content to other objects
• Fixity Information: a data integrity checksum of the Content
• Access Rights Information: permissions for preservation operations
How to record and manage change
OAIS rule: if the PDI changes, the AIP version changes.
Implementation choices: e.g. fixity information in source AIP + keep log of data integrity checks and their outcomes separate from the AIP.
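One way to picture this implementation choice is the minimal sketch below. It assumes a SHA-256 digest is the fixity information recorded in the AIP and that check outcomes go to a JSON-lines log kept outside the AIP; the function names, the log format and the choice of algorithm are illustrative assumptions, not something OAIS prescribes.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file in chunks so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(path: Path, recorded_digest: str, log_path: Path) -> bool:
    """Compare the file with the digest recorded in the AIP.

    The outcome is appended to an external log (hypothetical format), so the
    AIP itself, and therefore its version, is left unchanged by the check.
    """
    ok = sha256_of(path) == recorded_digest
    entry = {
        "object": path.name,
        "checked": datetime.now(timezone.utc).isoformat(),
        "outcome": "pass" if ok else "fail",
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return ok
```

Because the log grows independently of the AIP, repeated checks do not trigger the "PDI changes, AIP version changes" rule.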
OAIS compliance relevant to preservation metadata
OAIS Mandatory Responsibilities:
1. Negotiating and accepting information
2. Obtaining sufficient control of the information to ensure long-term preservation
3. Determining the "designated community"
4. Ensuring that information is independently understandable
5. Following documented policies and procedures
6. Making the preserved information available
Digital repository certification
– RLG-NARA Task Force on Digital Repository Certification
– Various other certification initiatives (CRL, DCC, nestor, DRAMBORA)
– Trusted Repositories Audit & Certification (TRAC): Criteria and Checklist (March 2007)
  • Organisational infrastructure
    – e.g., governance, organisational structures, mandates, policy frameworks, funding systems, contracts and licenses
  • Digital Object Management (OAIS functions)
    – e.g., ingest, metadata, preservation strategies
  • Technologies, Technical Infrastructure, & Security
Functions of a trusted digital repository relevant to preservation metadata
• Maintains persistent, unique identifiers for all archived objects
• Identifies properties it will preserve
• Verifies each submitted object during ingest
• Creates archival package from submission package to include technical and rights metadata
• Has mechanisms to authenticate content and its source
• Ensures that content information isn’t corrupted and maintains integrity by using fixity information
• Manages number and location of copies of all digital objects
• Employs documented preservation strategies
Functions of a trusted digital repository relevant to preservation metadata (continued)
• Maintains precise descriptions of actions necessary to ensure that objects are preserved
• Has mechanisms for monitoring and notification when formats are becoming obsolete
• Uses tools and resources such as format registries to establish semantic and technical context
• Has processes for storage media and/or hardware changes
• Tracks and manages intellectual property rights and restrictions
• Ensures that agreements applicable to access conditions are adhered to
• Maintains descriptive metadata for access and retrieval and associates it with object
PREMIS
Standards that address preservation metadata: technical
• PREMIS
• Images
– NISO Z39.87 and MIX
– Adobe and XMP (Extensible Metadata Platform)
– Exif (Exchangeable Image File Format)
– IPTC (International Press Telecommunications Council)/XMP
• Text: textMD
• Sound
– AES57-2011: Audio Object XML Schema
– AES60-2011: Core Audio Metadata
– AudioMD (Library of Congress)
• Video
– VideoMD
– SMPTE RP210
– Technical metadata in EBUCore, PBCore
– U.S. Federal Agencies Digitization Guidelines
– MPEG-7 and MPEG-21 for video
Standards that address preservation metadata: Structural
• METS
• PREMIS
• MPEG 21 Digital Item Declaration
• OAI/ORE
• Specific format types
– MXF
– AVI
Standards that address preservation metadata: Rights
• PREMIS
• METS Rights
• CDL Copyright schema
• Creative commons
• PLUS for images
• MPEG-21 REL for moving images
• ONIX for licensing terms
• Full rights expression languages
– XRML/MPEG-21
– ODRL
PREMIS Data Dictionary
• May 2005: Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group
• March 2008: PREMIS Data Dictionary for Preservation Metadata, version 2.0
• Jan. 2011: version 2.1
• April 2012: version 2.2
• Announced in September 2013: version 3.0
• Data Dictionary:
– Comprehensive view of information needed to support digital preservation
• Guidelines/recommendations to support creation, use, management
– Based on deep pool of institutional experiences in setting up and managing operational capacity for digital preservation
Guiding principles: “implementable, core preservation metadata”
• Preservation metadata: maintain viability, renderability, understandability, authenticity, identity in a preservation context
• Core: What most preservation repositories need to know to preserve digital materials over the long-term
• Implementable: rigorously defined; supported by usage guidelines/recommendations; emphasis on automated workflows and metadata generation
• Technical neutrality: no assumptions about technologies, systems and architectures, where metadata is stored
Scope
• What PREMIS DD is:
– Common data model for organizing/thinking about preservation metadata
– Guidance for local implementations
– Standard for exchanging information packages between repositories
– Compatible with the OAIS reference and information model
• What PREMIS DD is not:
– Out-of-the-box solution: need to instantiate as metadata elements in repository system
– All needed metadata: excludes business rules, format-specific technical metadata, descriptive metadata for access, non-core preservation metadata
– Lifecycle management of objects outside repository
– Rights management: limited to permissions regarding actions taken within repository
PREMIS Data Model
Figure: the five entities of the PREMIS Data Model (Intellectual Entities, Objects, Rights Statements, Agents, Events)
Intellectual Entities
Examples:
– The Chamber by John Grisham (an ebook)
– “Maggie at the beach” (a photograph)
– The Metropolitan New York Library Council Website (a website)
• Set of content that is considered a single intellectual unit for purposes of management and description (e.g., a book, a photograph, a map, a database)
• Has one or more digital representations
• May include other Intellectual Entities (e.g. a website that includes a web page)
• Not fully described in PREMIS DD, but can be linked to in metadata describing digital representation (this will change in 3.0)
Objects
Examples:
– A PDF file
– A book composed of several XML files and many images
– TIFF file containing a header and 2 images
Objects are what the repository actually preserves:
• FILE: named and ordered sequence of bytes that is known by an operating system
• REPRESENTATION: set of files, including structural metadata, that, taken together, constitute a complete rendering of an Intellectual Entity
• BITSTREAM: data within a file with properties relevant for preservation purposes (but needs additional structure or reformatting to be stand-alone file)
• FILESTREAMS (files within files) are considered files since they can be rendered alone
Object Example: book in two versions
Intellectual Entity: Da Vinci Code by Dan Brown
• Representation 1: Page image version
– File 1: page1.tiff
– File 2: page2.tiff
– File N: pageN.tiff
– File N+1: METS.xml
• Representation 2: ebook version
– File 1: book.lit
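The book example can be sketched as a small data structure. This is only an illustration of the Intellectual Entity / Representation / File hierarchy: the class names, the identifier scheme, and the placement of METS.xml in the page image representation (as its structural metadata) are assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    identifier: str  # hypothetical local identifier
    name: str


@dataclass
class Representation:
    identifier: str
    label: str
    files: list[File] = field(default_factory=list)


@dataclass
class IntellectualEntity:
    title: str
    representations: list[Representation] = field(default_factory=list)


def build_example(page_count: int) -> IntellectualEntity:
    """Build the two-representation book example with N page images."""
    pages = [
        File(f"rep-1/file-{n}", f"page{n}.tiff")
        for n in range(1, page_count + 1)
    ]
    # Assumption: METS.xml is the structural metadata of the page image version.
    mets = File(f"rep-1/file-{page_count + 1}", "METS.xml")
    page_images = Representation("rep-1", "Page image version", pages + [mets])
    ebook = Representation(
        "rep-2", "ebook version", [File("rep-2/file-1", "book.lit")]
    )
    return IntellectualEntity("Da Vinci Code by Dan Brown", [page_images, ebook])
```

Each Representation is a complete rendering of the same Intellectual Entity, which is why the two file lists do not overlap.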
Semantic units pertaining to Objects
• Object identifier
• Preservation level
• Significant characteristics
• Object characteristics
– fixity
– format
– size
– creating application
– inhibitors
– object characteristics extension
• Original name
• Storage
• Environment (will change in 3.0)
– software
– hardware
• Digital signatures
• Relationships
• Linking event identifier
• Linking rights statement identifier
Events
Examples:
– Validation Event: use JHOVE tool to verify that chapter1.pdf is a valid PDF file
– Ingest Event: transform an OAIS SIP into an AIP (one Event or multiple Events?)
• An action that involves or impacts at least one Object or Agent associated with or known by the preservation repository
• Helps document digital provenance. Can track history of Object through the chain of Events that occur during the Object’s lifecycle
• Determining which Events are in scope is up to the repository (e.g., Events which occur before ingest, or after de-accession)
• Determining which Events should be recorded, and at what level of granularity is up to the repository
Semantic units pertaining to Events: provenance and preservation activity
• Event identifier
• Event type (e.g. capture, creation, validation, migration, fixity check, ingestion)
• Event dateTime
• Event detail
• Event outcome
• Event outcome detail
• Linking agent identifier
• Linking object identifier
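To show how these semantic units support digital provenance, here is a minimal sketch of an Event record and of reconstructing an Object's history from its linked Events. The field names paraphrase the semantic units listed above; the identifiers, dates, outcome values and the "repository ingest service" agent are invented for illustration and do not come from the Data Dictionary.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    event_identifier: str
    event_type: str
    event_date_time: datetime
    event_outcome: str
    linking_agent_identifier: str
    linking_object_identifier: str
    event_detail: str = ""
    event_outcome_detail: str = ""


def history(events, object_identifier):
    """Return the Events linked to one Object, oldest first."""
    linked = [
        e for e in events if e.linking_object_identifier == object_identifier
    ]
    return sorted(linked, key=lambda e: e.event_date_time)


# Hypothetical sample data, deliberately out of chronological order.
events = [
    Event("evt-2", "validation", datetime(2014, 6, 2, 9, 30), "valid",
          "JHOVE version 1.0", "chapter1.pdf",
          event_detail="verify that chapter1.pdf is a valid PDF file"),
    Event("evt-1", "ingestion", datetime(2014, 6, 1, 14, 0), "success",
          "repository ingest service", "chapter1.pdf"),
    Event("evt-3", "fixity check", datetime(2014, 6, 3, 8, 0), "pass",
          "repository ingest service", "chapter2.pdf"),
]
```

Calling `history(events, "chapter1.pdf")` yields the ingestion Event followed by the validation Event, which is the chain of Events the slide refers to.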
Agents
Examples:
– Rebecca Guenther (a person)
– New York Public Library (an organization)
– JHOVE version 1.0 (a software program)
• Person, organization, or software program/system associated with an Event or a Right (permission statement)
• Agents are associated only indirectly to Objects through Events or Rights
• Not defined in detail in PREMIS DD; not considered core preservation metadata beyond identification
Semantic units pertaining to Agents
• Agent Identifier
• Agent Name
• Agent Type
• Agent Note
• Agent Extension
• Linking Event Identifier
• Linking Rights Identifier
Rights Statements
Example:
– Priscilla Caplan grants FCLA digital repository permission to make three copies of metadata_fundamentals.pdf for preservation purposes.
• An agreement with a rights holder that grants permission for the repository to undertake an action(s) associated with an Object(s) in the repository.
• Not a full rights expression language; focuses exclusively on permissions that take the form:
– Agent X grants Permission Y to the repository in regard to Object Z.
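The "Agent X grants Permission Y in regard to Object Z" pattern can be sketched as a simple permission check. The class, the `max_copies` field and the act label "replicate" are illustrative assumptions for this sketch, not PREMIS definitions; a real implementation would use the repository's own vocabulary for acts and restrictions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RightsStatement:
    granting_agent: str        # Agent X
    act: str                   # Permission Y
    object_identifier: str     # Object Z
    max_copies: Optional[int] = None  # hypothetical encoding of a restriction


def may_replicate(statements, object_identifier, copies_already_made):
    """True if some statement still permits one more copy of the Object."""
    for statement in statements:
        if statement.act != "replicate":
            continue
        if statement.object_identifier != object_identifier:
            continue
        if statement.max_copies is None or copies_already_made < statement.max_copies:
            return True
    return False


# The example from the slide, expressed in this hypothetical structure.
grants = [
    RightsStatement("Priscilla Caplan", "replicate",
                    "metadata_fundamentals.pdf", max_copies=3),
]
```

With two copies made, a third is still permitted; once three exist, the check returns False, and any Object without a statement is refused by default.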
Semantic units pertaining to Rights
• Rights Statement
– Rights Statement Identifier
– Rights Basis
– Copyright Information
– License Information
– Statute Information
– Other Rights Information
– Rights Granted: act, restriction, termOfGrant, rightsGranted
– Linking Object Identifier
– Linking Agent Identifier
• rightsExtension
Relationships
• PREMIS Data Dictionary supports expression of relationships between:
– Different Objects
  • Structural: relationships between parts of a whole
  • Derivation: relationships resulting from replication or transformation of an Object
  • New relationships in 3.0: replacement, dependency, generalization, reference
– Different Entities
• Relationships are established through reference to Identifiers of other Objects or Entities
PREMIS Maintenance Activity
• Web site:
– Permanent Web presence, hosted by Library of Congress
– Central destination for PREMIS-related info, announcements, resources
– Home of the PREMIS Implementers’ Group (PIG) discussion list
• PREMIS Editorial Committee:
– Set directions/priorities for PREMIS development
– Coordinate future revisions of Data Dictionary and XML schema
– Promote implementation
– International in scope, cross domain
http://www.loc.gov/standards/premis/
Implementation resources
• Tools:
– XML schema
– PREMIS-in-METS toolbox <http://pim.fcla.edu>
– Controlled vocabularies at http://id.loc.gov
– RDF/OWL ontology for use as Linked Data
• Guidelines:
– PREMIS conformance statement
– PREMIS & METS guidelines
• Community Working groups on special topics
• Implementation Fairs
• Others:
– Understanding PREMIS (available in multiple languages)
– PIG Forum
– Implementation Registry
– Tools Registry
Some implementers …
• DAITSS (Florida)
• Ex Libris Rosetta
• OCLC’s Digital Archive™
• Archivematica
• HathiTrust
• TIPR (Towards Interoperable Preservation Repositories)
– FCLA, NYU and Cornell
• Digital libraries in Spain
– Mandated for use in cultural heritage preservation repositories
See: http://www.loc.gov/premis/premis-registry.html
PREMIS Conformance
• Conformance statement issued in 2010
• PREMIS Conformance Working Group active now
• Levels of conformance:
– Level 1: A repository uses an internal metadata schema whose elements can be mapped to PREMIS. The mapped metadata can satisfy the principles of use at both the semantic unit and Data Dictionary levels. The repository is able to produce documentation demonstrating such mapping for representative samples of its holdings.
– Level 2: A repository implements the PREMIS Data Dictionary as its internal metadata schema in a way that satisfies the principles of use at both the semantic unit and Data Dictionary levels and in a form that does not require further mapping or conversion.
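As a rough illustration of what a Level 1 mapping might look like, the sketch below translates a record from an internal schema into PREMIS semantic unit names and reports any fields left unmapped. The internal field names on the left are entirely hypothetical; the PREMIS names on the right are given as understood from the 2.x Data Dictionary and should be checked against it before use.

```python
# Hypothetical internal field name -> PREMIS semantic unit name.
INTERNAL_TO_PREMIS = {
    "asset_id": "objectIdentifierValue",
    "sha256": "messageDigest",
    "format_label": "formatName",
    "byte_count": "size",
    "uploaded_as": "originalName",
}


def map_record(record):
    """Split a record into PREMIS-mapped values and unmapped field names."""
    mapped = {}
    unmapped = []
    for key, value in record.items():
        unit = INTERNAL_TO_PREMIS.get(key)
        if unit is None:
            unmapped.append(key)
        else:
            mapped[unit] = value
    return mapped, sorted(unmapped)
```

Running this over a representative sample of holdings gives a simple starting point for the mapping documentation that Level 1 asks for; the unmapped list shows where the internal schema has no PREMIS counterpart.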
URLs, etc.
• PREMIS Maintenance Activity: http://www.loc.gov/standards/premis/
• PREMIS Data Dictionary for Preservation Metadata: http://www.loc.gov/standards/premis/v2/premis-2-1.pdf
• PREMIS Implementation Registry http://www.loc.gov/standards/premis/registry
• PREMIS Implementers Group list http://listserv.loc.gov/listarch/pig.html
A use case: the preservation health check
What is the Preservation Health Check Pilot?
- Open Planets Foundation (OPF): A community hub for digital preservation whose main goal is to jointly manage and improve tools and research outcomes for practical use.
- OCLC Research: A community resource for shared R&D that addresses challenges facing libraries and archives in a rapidly changing information technology environment.
- Bibliothèque nationale de France: The BnF runs a fully operational trusted digital repository (SPAR). They volunteered to become a PHC-pilot site.
The Preservation Health Check proposition
As part of their preservation management task, repository managers need to be able to monitor the preservation status of the content of their repository.
We are looking at regular “routine check-ups” that can support this monitoring task.
– Monitoring should be made easy (automatically generated reports or dashboard)
– Monitoring should be based on objective data, generated by the repository (e.g. preservation metadata)
The analogy
The research question
If a Preservation Health Check is a monitoring activity to be performed on a repository with digital content:
1. What are empirical indicators (i.e. measures) for PHCs?
2. Are preservation metadata recorded by repositories useful as health indicators for PHCs?
Monitoring is about tracking change ... intentional and unintentional change.
Goal: To develop an implementable logic (or protocol) to support PHCs, and to test this logic against the store of preservation metadata maintained by an operational preservation repository.
The pilot site
The BnF runs a fully operational trusted digital repository (SPAR). They volunteered to become a PHC-pilot site.
The empirical data consists of:
1. A sample (200 GB) of the PREMIS data (AIP-METS files), covering the following collections:
– Gallica = digitised periodicals, monographs, still images and manuscripts (TIFF + OCR-files)
– Legal deposit Web harvests (warc files)
– 3rd party collection (Centre Pompidou)
The pilot site (continued)
The empirical data consists of (continued):
2. All the Reference Information packages in SPAR that contain reference information/code/specifications of (external) tools used during INGEST (ex. JHOVE) and of formats ingested;
3. Per collection: SLAs defining policy agreements with SIP suppliers concerning the preservation regime to be applied at the INGEST and ARCHIVAL STORAGE stages.
Mapping PREMIS on to SPOT
Figure: semantic units of the PREMIS Data Model entities (Intellectual Entities, Objects, Agents, Rights, Events) mapped to the SPOT Model properties (Availability, Identity, Persistence, Renderability, Understandability, Authenticity) and their threats
Findings: coverage
SPOT property: # of PREMIS semantic units*
• Availability: 16
• Identity: 19
• Persistence: 10
• Renderability: 15
• Understandability: 14
• Authenticity: 16
*Container level only; Agents, Events, Rights considered one semantic unit
Findings: coverage
• What does coverage in terms of “number of PREMIS semantic units” mean?
• More meaningful: Do the PREMIS semantic units address the threats associated with a SPOT property?
Example of a gap between SPOT and PREMIS:
SPOT property: Understandability
We found no PREMIS semantic units that provide information that aids in the understanding or interpretation of the content of the archived digital object.
Findings: preservation policies
A repository usually implements a large number of explicit and implicit policy decisions; however, PREMIS currently makes few provisions for recording these in preservation metadata (the semantic unit preservationLevel being a notable exception).
Findings: explicit encoding
PREMIS conformance does not require explicit encoding of metadata if the information applies to all objects in the repository.
This impedes the provision of automated PHC services (by a third-party provider) because efficient provision of this service would likely require the information in semantic units to be explicitly recorded, and implemented in a standard way.
Logic for assessing Persistence
Figure: the SPOT Model (six essential properties of successful digital preservation and their threats)
• If storage medium information is not available in PREMIS metadata, the PHC will need to take other information sources into account – such as audit reports generated by storage management systems.
• We note that there are no pre-defined events for Corruption and Readability in PREMIS, which means that the repositories need to define their own events. PREMIS does provide a list of recommended event labels for the semantic unit eventType, but it is just a “suggested starter list”.
• The repository should have policies in place that prescribe frequencies of fixity checks, of medium refreshment, backup policy, etc. The PREMIS semantic unit preservationLevel does not address such policies. The PHC flow thus needs to get the policy information from other sources.
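A very reduced sketch of such a Persistence check follows. It assumes the repository records its own fixity-check events (there being no pre-defined PREMIS event for this) with a date and an outcome, and that the required check frequency comes from a policy source outside PREMIS. The status labels, the outcome value "pass" and the function name are assumptions for illustration, not the logic used in the pilot.

```python
from datetime import datetime, timedelta


def persistence_status(fixity_events, as_of, max_interval):
    """Classify an object from its fixity-check events.

    fixity_events: list of (event_date_time, event_outcome) tuples taken
    from the repository's preservation metadata.
    max_interval: policy value obtained from outside PREMIS.
    """
    if not fixity_events:
        return "unknown"  # no evidence either way
    latest_time, latest_outcome = max(fixity_events, key=lambda e: e[0])
    if latest_outcome != "pass":
        return "failed"   # the most recent check reported a problem
    if as_of - latest_time > max_interval:
        return "overdue"  # the policy interval has been exceeded
    return "healthy"
```

For example, an object last checked successfully on 1 June 2014 is "healthy" on 19 June 2014 under a 90-day policy, while one last checked on 1 January 2014 is "overdue".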
A use case: the preservation health check (to be continued)
Thank You!
©2014 OCLC. This work is licensed under a Creative Commons Attribution 3.0 Unported License. Suggested attribution: “This work uses content from [presentation title] © OCLC, used under a Creative Commons Attribution license: http://creativecommons.org/licenses/by/3.0/”