introduction to preservation metadata peter mckinney national library of new zealand te puna...
TRANSCRIPT
Introduction to Preservation Metadata
Peter McKinneyNational Library of New ZealandTe Puna Mātauranga o Aotearoa
OUTLINE
What is preservation metadata? What functions of a digital repository does
preservation metadata support? Background of PREMIS Data Dictionary PREMIS data model, identifiers, and relationships PREMIS maintenance activity
Metadata and preservation metadata
“Structured information thatdescribes, explains, locates,
or otherwise makes it easier toretrieve, use, or manage an
information resource”*
METADATA
“Metadata that supportsand documents the digital
preservation process”
PRESERVATIONMETADATA
*www.niso.org/publications/press/UnderstandingMetadata.pdf
Preservation metadata includes: Provenance:
• Who has had custody/ownership of the digital object?
Authenticity:• Is the digital object what it purports to be?
Preservation Activity:• What has been done to preserve it?
Technical Environment:• What is needed to render and use it?
Rights Management:• What IPR must be observed?
Makes digital objects self-documenting across time
Content
PreservationMetadata
10 years on
50 years on
Forever!
Examples of preservation functions and how metadata supports them
Object should be stored securely so it can’t be altered; checksums are stored as metadata to show whether the object has changed between two points in time
Files are stored on media that can be read by current computers; metadata supports media management by recording type and age of media and dates refreshed
Over time file formats become obsolete and preservation managers must employ preservation strategies to keep them useable; migration and emulation strategies require metadata about original format and software and hardware environments supporting them
Preservation actions involve changing original resources and authenticity comes into question; metadata supports digital provenance (chain of custody and change history)
Standards that address preservation metadata: technical
PREMIS Images
• NISO Z39.87 and MIX• Adobe and XMP (Extensible Metadata Platform)• Exif (Exchangeable Image File Format)• IPTC (International Press Telecommunications
Council)/XMP• Examples of technical metadata for images: byte
order, compression scheme, color space, color profile, image width
Standards that address preservation metadata: technical
Text: textMD• Examples: fonts, character set, byte order
Sound• AES57-2011: Audio Object XML Schema• AES60-2011: Core Audio Metadata• AudioMD (Library of Congress)• Examples: audio data encoding; byte order; physical
structure, materials, dimensions
Standards that address preservation metadata: technical
Video• VideoMD• SMPTE RP210• Technical metadata in EBUCore, PBCore• U.S. Federal Agencies Digitization Guidelines• MPEG-7 and MPEG-21 for video• Examples: generation, dimensions, byte order, bits per
sample, aspect ratio, color coding
Standards that address preservation metadata: Structural
METS PREMIS MPEG 21 Digital Item Declaration OAI/ORE Specific format types
• MXF• AVI
Standards that address preservation metadata: Rights
PREMIS METS Rights CDL Copyright schema Creative commons PLUS for images MPEG-21 REL for moving images ONIX for licensing terms Full rights expression languages
• XRML/MPEG-21• ODRL
PREMIS Data Dictionary May 2005: Data Dictionary for Preservation
Metadata: Final Report of the PREMIS Working Group
March 2008: PREMIS Data Dictionary for Preservation Metadata, version 2.0 (version 2.1 Jan. 2011)
April 2012: version 2.2
Includes PREMIS Data Dictionary, context/assumptions, data model, usage examples
XML schema to support implementation
Data Dictionary: • Comprehensive view of information needed to support digital preservation
• Guidelines/recommendations to support creation, use, management• Based on deep pool of institutional experiences in setting up and managing
operational capacity for digital preservation
http://www.loc.gov/standards/premis/premis-2-2.pdf
Guiding principles: “implementable, core preservation metadata”
Preservation metadata: maintain viability, renderability, understandability, authenticity, identity in a preservation context
Core: What most preservation repositories need to know to preserve digital materials over the long-term
Implementable: rigorously defined; supported by usage guidelines/recommendations; emphasis on automated workflows
Guiding principles: “technical neutrality”
Digital archiving system: no assumptions about specific archiving technology, system/DB architectures, preservation strategy
Metadata management: no assumptions about whether metadata is stored locally or in external registry; recorded explicitly or known implicitly; instantiated in one metadata element or multiple elements
Promotes flexibility, applicability in wide range of contexts
Scope What PREMIS DD is:
• Common data model for organizing/thinking about preservation metadata
• Guidance for local implementations• Standard for exchanging information packages between
repositories• Compatible with the OAIS reference and information model
What PREMIS DD is not:• Out-of-the-box solution: need to instantiate as metadata
elements in repository system• All needed metadata: excludes business rules, format-specific
technical metadata, descriptive metadata for access, non-core preservation metadata
• Lifecycle management of objects outside repository• Rights management: limited to permissions regarding actions
taken within repository
What’s in PREMIS?
“Things” you have to describePREMIS Data Model
What you want to say about these “things”Semantic units in the PREMIS Data Dictionary
How you want this information to be encoded and implementedIn XML PREMIS XML schemaIn RDF OWL ontologyOr any other way you like it
The PREMIS Data Model
Data model includes:• Entities: “things” relevant to digital preservation that are
described by preservation metadata (Intellectual Entities, Objects, Events, Rights, Agents)
• Properties of Entities (semantic units) • Relationships between Entities
Why have data model?• Organizational convenience (for development and use)• Useful framework for distinguishing applicability of
semantic units across different types of Entities and different types of Objects
• But: not a formal entity-relationship model; not sufficient to design databases
Intellectual Entities
Examples: The Chamber by John Grisham
(an ebook) “Maggie at the beach”
(a photograph) The Metropolitan New York
Library Council Website (a website)
Set of content that is considered a single intellectual unit for purposes of management and description (e.g., a book, a photograph, a map, a database)
Has one or more digital representations
May include other Intellectual Entities (e.g. a website that includes a web page)
Not fully described in PREMIS DD, but can be linked to in metadata describing digital representation THIS WILL CHANGE IN 3.0
Objects
Examples: a PDF file A book composed of several
XML files and many images TIFF file containing a header
and 2 images
Objects are what repository actually preserves
FILE: named and ordered sequence of bytes that is known by an operating system
REPRESENTATION: set of files, including structural metadata, that, taken together, constitute a complete rendering of an Intellectual Entity
BITSTREAM: data within a file with properties relevant for preservation purposes (but needs additional structure or reformatting to be stand-alone file)
FILESTREAMS (files within files)are considered files since can be rendered alone
Object Example: book in two versions
Intellectual EntityDa Vinci Code by
Dan Brown
Representation 1Page image
version
Representation 2ebook version
File 1: page1.tiff
File 2:page2.tiff
File N:pageN.tiff
File 1:book.lit
File N+1:METS.xml
Technical metadata pertaining to objects
Object identifier Preservation level Significant characteristics Object characteristics
• fixity• format• size• creating application• inhibitors• object characteristics
extension Original name
Storage Environment
• software• hardware
will change in 3.0 Digital signatures Relationships Linking event identifier Linking rights statement
identifier
Events
Examples: Validation Event: use JHOVE tool
to verify that chapter1.pdf is a valid PDF file
Ingest Event: transform an OAIS SIP into an AIP (one Event or multiple Events?)
An action that involves or impacts at least one Object or Agent associated with or known by the preservation repository
Helps document digital provenance. Can track history of Object through the chain of Events that occur during the Objects lifecycle
Determining which Events are in scope is up to the repository (e.g., Events which occur before ingest, or after de-accession)
Determining which Events should be recorded, and at what level of granularity is up to the repository
Agents
Examples: Rebecca Guenther (a person) New York Public Library (an
organization) JHOVE version 1.0 (a software
program)
Person, organization, or software program/system associated with an Event or a Right (permission statement)
Agents are associated only indirectly to Objects through Events or Rights
Not defined in detail in PREMIS DD; not considered core preservation metadata beyond identification
Rights Statements
Example: Priscilla Caplan grants FCLA
digital repository permission to make three copies of metadata_fundamentals.pdf for preservation purposes.
An agreement with a rights holder that grants permission for the repository to undertake an action(s) associated with an Object(s) in the repository.
Not a full rights expression language; focuses exclusively on permissions that take the form:• Agent X grants Permission Y to the repository in regard to
Object Z.
From the data model to the data dictionary Data model: defines Entities and Relationships between them Data dictionary: for each entity lists its semantic units
A semantic unit is a property of an Entity• Something you need to know about an Object, Event, Agent, Right• Piece of information most repositories need to know in order to
carry out their digital preservation functions Two kinds of semantic units:
• Container: groups together related semantic units• Semantic components: semantic units grouped under the same
container Semantic units can be recorded explicitly, or known implicitly
• However it is implemented/recorded, a semantic unit should be recoverable from archiving system
Example:ObjectIdentifier [container]
ObjectIdentifierType [semantic component]ObjectIdentifierValue [semantic component]
Identifiers in PREMIS Identifiers used to
• identify unambiguously an object, agent, event, rights statement…
• [entity]Identifier• and link it to another entity
• linking[entity]Identifier
All identifiers have• An identifierType (category of identifier)• An identifierValue (the identifier itself)
identifierType makes the identifier unique in a given domainExamples: URL, DOI, ARK, local…
If all identifiers are local to the repository system, identifierType does not have to be recorded for each identifier in the system• BUT it should be supplied when exchanging data with others
Relationships
Many different types of information relevant to preservation can be expressed as relationships:• e.g., “A is part of B”, “A is scanned from B”, “A is a version of B”
PREMIS Data Dictionary supports expression of relationships between:• Different Objects
• Across same level or different levels• Structural: relationships between parts of a whole• Derivation: relationships resulting from replication or
transformation of an Object• Different Entities
Relationships are established through reference to Identifiers of other Objects or Entities
Example: Structural relationshipFile “is part of” Representation
relationship [part of the description of File]relationshipType = structuralrelationshipSubType = is part ofrelatedObjectIdentification [the Web page]
relatedObjectIdentifierType = repositoryIDrelatedObjectIdentifierValue = 0385503954 relatedObjectSequence = 0
relatedEventIdentification [none]
File:
graphic.gif
Representation:
Web Page
is part of
Example: Derivation relationshipFile 1 “is source of” File 2 through Migration Event
relationship [part of description of File 1]relationshipType = derivationrelationshipSubType = is source ofrelatedObjectIdentification [identifier of File 2]
relatedObjectIdentifierType = repositoryIDrelatedObjectIdentifierValue = F004400relatedObjectSequence [none]
relatedEventIdentification [Migration Event ID]relatedEventIdentifierType =
repEventIDrelatedEventIdentifierValue = E0192relatedEventSequence [none]
File 1(original)
is source of
throughevent
MigrationEvent
File 2(migrated)
Extension containers in PREMIS
PREMIS is core preservation metadata PREMIS defines an Extension container to extend PREMIS if
you need• more granular description• specific semantic units (non-core information)• out of scope semantic units (not grounded in
preservation) Extensions are empty containers
• Its semantic components are whatever you need• One schema per extension; if more schemas are needed,
the extension element needs to be repeated• Mechanism in PREMIS XML Schema: <mdSec> element
Data in the container may replace, refine or be additional to the appropriate PREMIS semantic unit
PREMIS Extensions
significantPropertiesExtension creatingApplicationExtension objectCharacteristicsExtension environmentExtension signatureInformationExtension eventOutcomeDetailExtension agentExtension rightsExtension
Sample data dictionary entry
Semantic unit size Semantic components
None
Definition The size in bytes of the file or bitstream stored in the repository.
Rationale Size is useful for ensuring the correct number of bytes from storage have been retrieved and that an application has enough room to move or process files. It might also be used when billing for storage.
Data constraint Integer Object category Representation File Bitstream Applicability Not applicable Applicable Applicable Examples 2038927 Repeatability Not repeatable Not repeatable Obligation Optional Optional Creation/ Maintenance notes
Automatically obtained by the repository.
Usage notes Defining this semantic unit as size in bytes makes it unnecessary to record a unit of measurement. However, for the purpose of data exchange the unit of measurement should be stated or understood by both partners.
Is it a container unit?
What does it contain?
Why should it be recorded?
Some implementation
guidelines
How should it be provided?
How should it be
recorded? constraints
and examples
How do people implement PREMIS?
Using METS in XML to package together metadata and content information
In relational databases In commercial preservation repository systems Part of open source repository system that supports
metadata • Archivematica• DAITTS
METS using PREMIS in METS guidelines for SIP and DIP• Open source tool available for creation and
validation
PREMIS Maintenance Activity
Web site:• Permanent Web presence, hosted by
Library of Congress• Central destination for PREMIS-related
info, announcements, resources• Home of the PREMIS Implementers’ Group (PIG)
discussion list
PREMIS Editorial Committee:• Set directions/priorities for PREMIS development• Coordinate future revisions of Data Dictionary and XML
schema• Promote implementation• International in scope, cross domain
http://www.loc.gov/standards/premis/
Implementation resources Tools:
• XML schema• PREMIS-in-METS toolbox <http://pim.fcla.edu>• Controlled vocabularies at http://id.loc.gov• RDF/OWL ontology for use as Linked Data
Guidelines:• PREMIS conformance statement• PREMIS & METS guidelines
Community Working groups on special topics Implementation Fairs Others:
• Understanding PREMIS (available in multiple languages)• PIG Forum• Implementation Registry• Tools Registry
Some implementers …
DAITTSS (Florida) • A preservation repository for the use of the libraries of the
public universities of Florida• Developed PREMIS tools and open source software
Ex Libris Rosetta• a commercial digital preservation system supporting acquisition,
validation, ingest, storage, management, preservation and dissemination of different types of digital objects
National Digital Newspaper Program• Implemented an early version of PREMIS
FDSys (US Government Printing Office)• Content management, search engine, preservation repository• Supports media refreshment, content authentication, virus
checking
Some implementers (cont.) Archivematica
• Comprehensive open-source digital preservation system• Supports many preservation repository functions
TIPR (Towards Interoperable Preservation Repositories)• Experimenting with a standard package for exchange
based on PREMIS in METS guidelines• FCLA, NYU and Cornell
HathiTrust: partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future
Digital libraries in Spain• Mandated for use in cultural heritage preservation
repositoriesSee: http://www.loc.gov/standards/premis/registry/
Making preservation metadata happen
Metadata should be the basis for design of a digital repository
Systems are needed to store information supporting preservation and access decisions
Enough metadata needs to be provided to take appropriate actions to maintain digital objects over the long-term
There should be automatic population of the maximum number of elements
Any system should ensure preservation of metadata as well as preservation of objects
PREMIS provides a critical piece of reliable digital preservation infrastructure comprised of technology, standards, and best practice
URLs, etc. PREMIS Maintenance Activity:
http://www.loc.gov/standards/premis/
PREMIS Data Dictionary for Preservation Metadata:http://www.loc.gov/standards/premis/v2/premis-2-1.pdf
PREMIS Implementation Registryhttp://www.loc.gov/standards/premis/registry
PREMIS Implementers Group listhttp://listserv.loc.gov/listarch/pig.html