introduction to preservation metadata peter mckinney national library of new zealand te puna...

39
Introduction to Preservation Metadata Peter McKinney National Library of New Zealand Te Puna Mātauranga o Aotearoa

Upload: arthur-gilmore

Post on 22-Dec-2015

227 views

Category:

Documents


3 download

TRANSCRIPT

Introduction to Preservation Metadata

Peter McKinneyNational Library of New ZealandTe Puna Mātauranga o Aotearoa

OUTLINE

What is preservation metadata? What functions of a digital repository does

preservation metadata support? Background of PREMIS Data Dictionary PREMIS data model, identifiers, and relationships PREMIS maintenance activity

Metadata and preservation metadata

“Structured information thatdescribes, explains, locates,

or otherwise makes it easier toretrieve, use, or manage an

information resource”*

METADATA

“Metadata that supportsand documents the digital

preservation process”

PRESERVATIONMETADATA

*www.niso.org/publications/press/UnderstandingMetadata.pdf

Preservation metadata includes: Provenance:

• Who has had custody/ownership of the digital object?

Authenticity:• Is the digital object what it purports to be?

Preservation Activity:• What has been done to preserve it?

Technical Environment:• What is needed to render and use it?

Rights Management:• What IPR must be observed?

Makes digital objects self-documenting across time

Content

PreservationMetadata

10 years on

50 years on

Forever!

Examples of preservation functions and how metadata supports them

Object should be stored securely so it can’t be altered; checksums are stored as metadata to show whether the object has changed between two points in time

Files are stored on media that can be read by current computers; metadata supports media management by recording type and age of media and dates refreshed

Over time file formats become obsolete and preservation managers must employ preservation strategies to keep them useable; migration and emulation strategies require metadata about original format and software and hardware environments supporting them

Preservation actions involve changing original resources and authenticity comes into question; metadata supports digital provenance (chain of custody and change history)

Standards that address preservation metadata: technical

PREMIS Images

• NISO Z39.87 and MIX• Adobe and XMP (Extensible Metadata Platform)• Exif (Exchangeable Image File Format)• IPTC (International Press Telecommunications

Council)/XMP• Examples of technical metadata for images: byte

order, compression scheme, color space, color profile, image width

Standards that address preservation metadata: technical

Text: textMD• Examples: fonts, character set, byte order

Sound• AES57-2011: Audio Object XML Schema• AES60-2011: Core Audio Metadata• AudioMD (Library of Congress)• Examples: audio data encoding; byte order; physical

structure, materials, dimensions

Standards that address preservation metadata: technical

Video• VideoMD• SMPTE RP210• Technical metadata in EBUCore, PBCore• U.S. Federal Agencies Digitization Guidelines• MPEG-7 and MPEG-21 for video• Examples: generation, dimensions, byte order, bits per

sample, aspect ratio, color coding

Standards that address preservation metadata: Structural

METS PREMIS MPEG 21 Digital Item Declaration OAI/ORE Specific format types

• MXF• AVI

Standards that address preservation metadata: Rights

PREMIS METS Rights CDL Copyright schema Creative commons PLUS for images MPEG-21 REL for moving images ONIX for licensing terms Full rights expression languages

• XRML/MPEG-21• ODRL

PREMIS Data Dictionary May 2005: Data Dictionary for Preservation

Metadata: Final Report of the PREMIS Working Group

March 2008: PREMIS Data Dictionary for Preservation Metadata, version 2.0 (version 2.1 Jan. 2011)

April 2012: version 2.2

Includes PREMIS Data Dictionary, context/assumptions, data model, usage examples

XML schema to support implementation

Data Dictionary: • Comprehensive view of information needed to support digital preservation

• Guidelines/recommendations to support creation, use, management• Based on deep pool of institutional experiences in setting up and managing

operational capacity for digital preservation

http://www.loc.gov/standards/premis/premis-2-2.pdf

Guiding principles: “implementable, core preservation metadata”

Preservation metadata: maintain viability, renderability, understandability, authenticity, identity in a preservation context

Core: What most preservation repositories need to know to preserve digital materials over the long-term

Implementable: rigorously defined; supported by usage guidelines/recommendations; emphasis on automated workflows

Guiding principles: “technical neutrality”

Digital archiving system: no assumptions about specific archiving technology, system/DB architectures, preservation strategy

Metadata management: no assumptions about whether metadata is stored locally or in external registry; recorded explicitly or known implicitly; instantiated in one metadata element or multiple elements

Promotes flexibility, applicability in wide range of contexts

Scope What PREMIS DD is:

• Common data model for organizing/thinking about preservation metadata

• Guidance for local implementations• Standard for exchanging information packages between

repositories• Compatible with the OAIS reference and information model

What PREMIS DD is not:• Out-of-the-box solution: need to instantiate as metadata

elements in repository system• All needed metadata: excludes business rules, format-specific

technical metadata, descriptive metadata for access, non-core preservation metadata

• Lifecycle management of objects outside repository• Rights management: limited to permissions regarding actions

taken within repository

What’s in PREMIS?

“Things” you have to describePREMIS Data Model

What you want to say about these “things”Semantic units in the PREMIS Data Dictionary

How you want this information to be encoded and implementedIn XML PREMIS XML schemaIn RDF OWL ontologyOr any other way you like it

The PREMIS Data Model

Data model includes:• Entities: “things” relevant to digital preservation that are

described by preservation metadata (Intellectual Entities, Objects, Events, Rights, Agents)

• Properties of Entities (semantic units) • Relationships between Entities

Why have data model?• Organizational convenience (for development and use)• Useful framework for distinguishing applicability of

semantic units across different types of Entities and different types of Objects

• But: not a formal entity-relationship model; not sufficient to design databases

PREMIS Data Model

IntellectualEntities

Objects

RightsStatements

Agents

Events

Intellectual Entities

Examples: The Chamber by John Grisham

(an ebook) “Maggie at the beach”

(a photograph) The Metropolitan New York

Library Council Website (a website)

Set of content that is considered a single intellectual unit for purposes of management and description (e.g., a book, a photograph, a map, a database)

Has one or more digital representations

May include other Intellectual Entities (e.g. a website that includes a web page)

Not fully described in PREMIS DD, but can be linked to in metadata describing digital representation THIS WILL CHANGE IN 3.0

Objects

Examples: a PDF file A book composed of several

XML files and many images TIFF file containing a header

and 2 images

Objects are what repository actually preserves

FILE: named and ordered sequence of bytes that is known by an operating system

REPRESENTATION: set of files, including structural metadata, that, taken together, constitute a complete rendering of an Intellectual Entity

BITSTREAM: data within a file with properties relevant for preservation purposes (but needs additional structure or reformatting to be stand-alone file)

FILESTREAMS (files within files)are considered files since can be rendered alone

Object Example: book in two versions

Intellectual EntityDa Vinci Code by

Dan Brown

Representation 1Page image

version

Representation 2ebook version

File 1: page1.tiff

File 2:page2.tiff

File N:pageN.tiff

File 1:book.lit

File N+1:METS.xml

Technical metadata pertaining to objects

Object identifier Preservation level Significant characteristics Object characteristics

• fixity• format• size• creating application• inhibitors• object characteristics

extension Original name

Storage Environment

• software• hardware

will change in 3.0 Digital signatures Relationships Linking event identifier Linking rights statement

identifier

Events

Examples: Validation Event: use JHOVE tool

to verify that chapter1.pdf is a valid PDF file

Ingest Event: transform an OAIS SIP into an AIP (one Event or multiple Events?)

An action that involves or impacts at least one Object or Agent associated with or known by the preservation repository

Helps document digital provenance. Can track history of Object through the chain of Events that occur during the Objects lifecycle

Determining which Events are in scope is up to the repository (e.g., Events which occur before ingest, or after de-accession)

Determining which Events should be recorded, and at what level of granularity is up to the repository

Agents

Examples: Rebecca Guenther (a person) New York Public Library (an

organization) JHOVE version 1.0 (a software

program)

Person, organization, or software program/system associated with an Event or a Right (permission statement)

Agents are associated only indirectly to Objects through Events or Rights

Not defined in detail in PREMIS DD; not considered core preservation metadata beyond identification

Rights Statements

Example: Priscilla Caplan grants FCLA

digital repository permission to make three copies of metadata_fundamentals.pdf for preservation purposes.

An agreement with a rights holder that grants permission for the repository to undertake an action(s) associated with an Object(s) in the repository.

Not a full rights expression language; focuses exclusively on permissions that take the form:• Agent X grants Permission Y to the repository in regard to

Object Z.

From the data model to the data dictionary Data model: defines Entities and Relationships between them Data dictionary: for each entity lists its semantic units

A semantic unit is a property of an Entity• Something you need to know about an Object, Event, Agent, Right• Piece of information most repositories need to know in order to

carry out their digital preservation functions Two kinds of semantic units:

• Container: groups together related semantic units• Semantic components: semantic units grouped under the same

container Semantic units can be recorded explicitly, or known implicitly

• However it is implemented/recorded, a semantic unit should be recoverable from archiving system

Example:ObjectIdentifier [container]

ObjectIdentifierType [semantic component]ObjectIdentifierValue [semantic component]

Identifiers in PREMIS Identifiers used to

• identify unambiguously an object, agent, event, rights statement…

• [entity]Identifier• and link it to another entity

• linking[entity]Identifier

All identifiers have• An identifierType (category of identifier)• An identifierValue (the identifier itself)

identifierType makes the identifier unique in a given domainExamples: URL, DOI, ARK, local…

If all identifiers are local to the repository system, identifierType does not have to be recorded for each identifier in the system• BUT it should be supplied when exchanging data with others

Relationships

Many different types of information relevant to preservation can be expressed as relationships:• e.g., “A is part of B”, “A is scanned from B”, “A is a version of B”

PREMIS Data Dictionary supports expression of relationships between:• Different Objects

• Across same level or different levels• Structural: relationships between parts of a whole• Derivation: relationships resulting from replication or

transformation of an Object• Different Entities

Relationships are established through reference to Identifiers of other Objects or Entities

Example: Structural relationshipFile “is part of” Representation

relationship [part of the description of File]relationshipType = structuralrelationshipSubType = is part ofrelatedObjectIdentification [the Web page]

relatedObjectIdentifierType = repositoryIDrelatedObjectIdentifierValue = 0385503954 relatedObjectSequence = 0

relatedEventIdentification [none]

File:

graphic.gif

Representation:

Web Page

is part of

Example: Derivation relationshipFile 1 “is source of” File 2 through Migration Event

relationship [part of description of File 1]relationshipType = derivationrelationshipSubType = is source ofrelatedObjectIdentification [identifier of File 2]

relatedObjectIdentifierType = repositoryIDrelatedObjectIdentifierValue = F004400relatedObjectSequence [none]

relatedEventIdentification [Migration Event ID]relatedEventIdentifierType =

repEventIDrelatedEventIdentifierValue = E0192relatedEventSequence [none]

File 1(original)

is source of

throughevent

MigrationEvent

File 2(migrated)

Extension containers in PREMIS

PREMIS is core preservation metadata PREMIS defines an Extension container to extend PREMIS if

you need• more granular description• specific semantic units (non-core information)• out of scope semantic units (not grounded in

preservation) Extensions are empty containers

• Its semantic components are whatever you need• One schema per extension; if more schemas are needed,

the extension element needs to be repeated• Mechanism in PREMIS XML Schema: <mdSec> element

Data in the container may replace, refine or be additional to the appropriate PREMIS semantic unit

PREMIS Extensions

significantPropertiesExtension creatingApplicationExtension objectCharacteristicsExtension environmentExtension signatureInformationExtension eventOutcomeDetailExtension agentExtension rightsExtension

Sample data dictionary entry

Semantic unit size Semantic components

None

Definition The size in bytes of the file or bitstream stored in the repository.

Rationale Size is useful for ensuring the correct number of bytes from storage have been retrieved and that an application has enough room to move or process files. It might also be used when billing for storage.

Data constraint Integer Object category Representation File Bitstream Applicability Not applicable Applicable Applicable Examples 2038927 Repeatability Not repeatable Not repeatable Obligation Optional Optional Creation/ Maintenance notes

Automatically obtained by the repository.

Usage notes Defining this semantic unit as size in bytes makes it unnecessary to record a unit of measurement. However, for the purpose of data exchange the unit of measurement should be stated or understood by both partners.

Is it a container unit?

What does it contain?

Why should it be recorded?

Some implementation

guidelines

How should it be provided?

How should it be

recorded? constraints

and examples

How do people implement PREMIS?

Using METS in XML to package together metadata and content information

In relational databases In commercial preservation repository systems Part of open source repository system that supports

metadata • Archivematica• DAITTS

METS using PREMIS in METS guidelines for SIP and DIP• Open source tool available for creation and

validation

PREMIS Maintenance Activity

Web site:• Permanent Web presence, hosted by

Library of Congress• Central destination for PREMIS-related

info, announcements, resources• Home of the PREMIS Implementers’ Group (PIG)

discussion list

PREMIS Editorial Committee:• Set directions/priorities for PREMIS development• Coordinate future revisions of Data Dictionary and XML

schema• Promote implementation• International in scope, cross domain

http://www.loc.gov/standards/premis/

Implementation resources Tools:

• XML schema• PREMIS-in-METS toolbox <http://pim.fcla.edu>• Controlled vocabularies at http://id.loc.gov• RDF/OWL ontology for use as Linked Data

Guidelines:• PREMIS conformance statement• PREMIS & METS guidelines

Community Working groups on special topics Implementation Fairs Others:

• Understanding PREMIS (available in multiple languages)• PIG Forum• Implementation Registry• Tools Registry

Some implementers …

DAITTSS (Florida) • A preservation repository for the use of the libraries of the

public universities of Florida• Developed PREMIS tools and open source software

Ex Libris Rosetta• a commercial digital preservation system supporting acquisition,

validation, ingest, storage, management, preservation and dissemination of different types of digital objects

National Digital Newspaper Program• Implemented an early version of PREMIS

FDSys (US Government Printing Office)• Content management, search engine, preservation repository• Supports media refreshment, content authentication, virus

checking

Some implementers (cont.) Archivematica

• Comprehensive open-source digital preservation system• Supports many preservation repository functions

TIPR (Towards Interoperable Preservation Repositories)• Experimenting with a standard package for exchange

based on PREMIS in METS guidelines• FCLA, NYU and Cornell

HathiTrust: partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future

Digital libraries in Spain• Mandated for use in cultural heritage preservation

repositoriesSee: http://www.loc.gov/standards/premis/registry/

Making preservation metadata happen

Metadata should be the basis for design of a digital repository

Systems are needed to store information supporting preservation and access decisions

Enough metadata needs to be provided to take appropriate actions to maintain digital objects over the long-term

There should be automatic population of the maximum number of elements

Any system should ensure preservation of metadata as well as preservation of objects

PREMIS provides a critical piece of reliable digital preservation infrastructure comprised of technology, standards, and best practice

URLs, etc. PREMIS Maintenance Activity:

http://www.loc.gov/standards/premis/

PREMIS Data Dictionary for Preservation Metadata:http://www.loc.gov/standards/premis/v2/premis-2-1.pdf

PREMIS Implementation Registryhttp://www.loc.gov/standards/premis/registry

PREMIS Implementers Group listhttp://listserv.loc.gov/listarch/pig.html