Global Digital Format Registry Stephen L. Abrams Harvard University Library MacKenzie Smith Massachusetts Institute of Technology DLF Spring Forum New York, May 14-16, 2003

Global Digital Format Registry

Stephen L. Abrams
Harvard University Library

MacKenzie Smith
Massachusetts Institute of Technology

DLF Spring Forum New York, May 14-16, 2003

Why Do We Need a Registry?

• Repository functions are performed on a format-specific basis

• Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is represented

• Interchange requires mutual agreement of format syntax and semantics

Potential Use Cases

• Identification
  – “I have a digital object; what format is it?”
• Validation
  – “I have an object purportedly of format F; is it?”
• Transformation
  – “I have an object of format F, but need G; how can I produce it?”
• Characterization
  – “I have an object of format F; what are its significant properties?”
• Risk assessment
  – “I have an object of format F; is it at risk of obsolescence?”
• Delivery
  – “I have an object of format F; how can I render it?”

Repository Format Dependencies

• Ingest
  – Validation
  – SIP-to-AIP
• Access
  – AIP-to-DIP
  – Rendering
• Preservation planning
  – Migration
  – Emulation
  – UVC

Repository Format Dependencies

[Figure: repository functional diagram. Labels: SIP; AIP; DIP; Ingest; QA; Generate AIP; Archival storage; Data Management; Administer; Manage; Access; Discovery; Generate DIP; Delivery; Preservation; Strategies; Monitoring; Migration; Emulation; Descriptive Metadata; Content and representation information.]

Repository Format Dependencies

[Figure: the same repository functional diagram with a Format registry added. Labels: SIP; AIP; DIP; Format registry; Validate SIP; Transform SIP-to-AIP; Transform AIP-to-DIP; Metadata for encapsulation/archaeology; Ingest; QA; Generate AIP; Archival storage; Data Management; Administer; Manage; Access; Discovery; Generate DIP; Delivery; Preservation; Strategies; Monitoring; Migration; Emulation; Descriptive Metadata; Content and representation information.]

What’s Wrong with MIME Types?

• Insufficient depth of detail
  – Syntax and semantics
  – Public and proprietary
• Insufficient granularity
  – Both tiled RGB TIFF with LZW and striped bi-tonal TIFF with Group 4 → image/tiff
  – All of PDF 1.0 – 1.4, PDF/X-1 – 3, and PDF/A → application/pdf
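The granularity problem can be made concrete with a minimal Python sketch. The `TiffProfile` class and the `mime_type` function are hypothetical names introduced only for illustration; they are not part of the registry design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TiffProfile:
    """Hypothetical description of a TIFF variant (illustration only)."""
    layout: str       # "tiled" or "striped"
    color: str        # "RGB" or "bi-tonal"
    compression: str  # "LZW" or "Group 4"


def mime_type(profile: TiffProfile) -> str:
    # A MIME type names only the overall format, so every TIFF profile
    # maps to the same label regardless of its technical properties.
    return "image/tiff"


tiled_rgb = TiffProfile("tiled", "RGB", "LZW")
striped_bitonal = TiffProfile("striped", "bi-tonal", "Group 4")

# The two profiles differ, yet their MIME types are indistinguishable.
print(tiled_rgb != striped_bitonal)
print(mime_type(tiled_rgb) == mime_type(striped_bitonal))
```

A repository that stores only the MIME type cannot later tell these two objects apart, which is the gap a finer-grained registry is meant to close.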

A Bit of History

• DLF-sponsored invitational meetings

• Ad-hoc committee
  – Collected use cases
  – Working groups on data and governance models

During summer 2002 the Harvard LDI and MIT DSpace teams met to discuss shared concerns.

Ad-Hoc Committee

• Bibliothèque nationale de France
• California Digital Library
• Digital Library Federation
• Harvard University
• IETF
• JISC
• JSTOR
• Library of Congress
• MIT
• NARA
• National Archives of Canada
• New York University
• NIST
• OCLC
• Public Record Office, UK
• RLG
• Stanford University
• University of Pennsylvania

Global Digital Format Registry

The registry will maintain persistent, unambiguous bindings between public identifiers for digital formats and representation information for those formats.

What is a Format?

• No assumption regarding byte size

• An information model is a formal expression of exchangeable knowledge

A format is a fixed, byte-serialized encoding of an information model.

What is Representation Information?

• Significant properties are those aspects of a format that are the primary carriers of the format’s intellectual value

Representation information maps typed formats into more meaningful concepts by capturing the significant syntactic and semantic properties of those formats.

Data Model

• Registry
• Format
  – Descriptive
    • General descriptive properties
  – Characterization
    • Technical syntactic/semantic properties
  – Processing
    • Services and systems using format as input or output
  – Administrative
    • Provenance

Informative, not Evaluative

• Legal liability

• May discourage deposit of proprietary information

• Investigate ways to include (by reference?) third party evaluations/recommendations
  – Insofar as this doesn’t hamper primary goal

The format properties stored in the registry should be factual, not judgmental.

Data Model Sources

• ISO 14721, Open archival information system -- Reference model
  – CCSDS OAIS reference model
  – Representation information
    • Interpret, or provide “additional meaning” to Data Object
    • Structure and semantic information
• PRONOM
  – Public Record Office, UK
  – “information about file formats and the application software needed to open them”
  – Format, vendor, product

Data Model Sources

• Diffuse
  – EC’s Information Society Technologies programme
  – “reference and guidance information on available and emerging standards and specifications”
  – Business Guides
    • “application of standards and specifications in specific areas”
• OCLC/RLG Preservation Metadata Framework
  – “information necessary to render/display, understand, and interpret the Content Data Object”
  – Based on CEDARS, NEDLIB, NLA, OAIS, and OCLC metadata

Data Model Sources

• NIST National Software Reference Library
  – File profiles for the NSRL Reference Data Set
    • Vendor, product, operating system
  – Used for forensic identification
• Media features
  – Protocol-independent content negotiation
    • Selection of an “appropriate representation” of a resource
  – RFCs 2506, 2533, 2534

Data Model Sources

• Typed Object Model (TOM)
  – “model for identifying and describing data formats … distributed system of ‘type brokers’ that maintain and interpret these descriptions”
  – Format is aggregate of type (attributes, operations, semantics) and encoding
• JISC File Format Representation and Rendering Project
  – Assessment of formats and rendering software
  – Representation system to track formats and their rendering software

Data Model Sources

• Bitstream Syntax Description Language
  – MPEG-21 content adaptation
  – XML schema to model multimedia bitstreams

Useful for administrative properties and data types:

• ISO/IEC 11179, Specification and standardization of data elements

• OASIS/ebXML Registry Information Model

Data Model

Relation
  Target : Cognomen
  Registry : Cognomen
  Type : <<enum>>
  Note * : UTF-8

Cognomen
  Value : UTF-8
  Type : <<enum>>
  Note * : UTF-8

Person
  Title ? : UTF-8
  Affiliation + : Agent

Agent
  Name : UTF-8
  Address ? : UTF-8
  Telephone ? : ITU-T
  Fax ? : ITU-T E.164
  Email ? : RFC 2821
  Web ? : URI
  Type : <<enum>>
  Note * : UTF-8

Class
  Identifier : Cognomen
  Ontology : Cognomen
  Note * : UTF-8

Document
  Title : UTF-8
  Version ? : UTF-8
  Author * : Agent
  Publisher * : Agent
  Date ? : ISO 8601
  Type : <<enum>>
  Identifier * : Cognomen
  Accessibility : Access
  Note * : UTF-8

Signature
  Value : Byte stream
  Obligation : <<enum>>
  Note * : UTF-8

ExternalSignature
  Type : <<enum>>

InternalSignature
  Fixity : <<enum>>
  Offset ? : Non-negative

Registry
  Name : UTF-8
  Version : UTF-8
  Date : ISO 8601
  Format * : Format
  Service * : Service
  Registry * : Registry
  Note * : UTF-8

Service
  Name : UTF-8
  Protocol ? : UTF-8
  Note * : UTF-8

Event
  Agent : Agent
  Date : ISO 8601
  Type : <<enum>>
  Review : <<enum>>
  Note * : UTF-8

Process
  Type : <<enum>>
  Stream * : Stream
  Note * : UTF-8

System
  Name : UTF-8
  Version : UTF-8
  Agent : Agent
  Process * : Process
  Relationship * : Relation
  Note * : UTF-8

Format
  Identifier : UTF-8
  Alias * : UTF-8
  Author * : Agent
  Owner + : Authority
  Maintainer * : Authority
  Classification + : Class
  Relationship * : Relation
  Specification * : Document
  Signature * : Signature
  Tool * : System
  Status : <<enum>>
  Provenance * : Event
  Note * : UTF-8

Stream
  Format : Cognomen
  Type : <<enum>>
  Note * : UTF-8

Authority
  Agent : Agent
  Start : ISO 8601
  End ? : ISO 8601
  Note * : UTF-8

Access
  Type : <<enum>>
  Start : ISO 8601
  End : ISO 8601
  Note * : UTF-8

High-Level Format Properties

Format

Property           Type            Description
Identifier         UTF-8           Canonical format identifier
Alias *            UTF-8           Variant identifiers
Author *           Agent           Author
Owner +            Authority       Owner
Maintainer *       Authority       Maintenance agency
Classification +   Class           Ontological classification
Relationship *     FormatRelation  Typed relationship with another format, either registered internally or externally
Specification *    Document        Specification document
Signature *        Signature       Internal or external signature
Tool *             System          Process or service having format as input or output
Status             <<enum>>        Status: ‘Active’, ‘Withdrawn’, ‘Unknown’, ‘Other’
Provenance *       Event           Provenance event
Note *             UTF-8           Informative note
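As a rough illustration of how the multiplicity markers might be read (“+” as one or more, “*” as zero or more, unmarked as exactly one), here is a minimal Python sketch of a subset of the Format properties. Representing Agent, Authority, Class, and Document values as plain strings is a simplifying assumption, and the identifier and owner values are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    ACTIVE = "Active"
    WITHDRAWN = "Withdrawn"
    UNKNOWN = "Unknown"
    OTHER = "Other"


@dataclass
class Format:
    identifier: str                 # exactly one canonical identifier
    owner: List[str]                # "+": at least one owner
    classification: List[str]       # "+": at least one classification
    status: Status = Status.UNKNOWN
    alias: List[str] = field(default_factory=list)          # "*": zero or more
    specification: List[str] = field(default_factory=list)  # "*": zero or more
    note: List[str] = field(default_factory=list)           # "*": zero or more

    def __post_init__(self) -> None:
        # Enforce the "+" multiplicity for owner and classification.
        for name in ("owner", "classification"):
            if not getattr(self, name):
                raise ValueError(f"{name} requires at least one value")


# Hypothetical entry; none of these values come from a real registry.
example = Format(
    identifier="example-format-1.0",
    owner=["Example Agency"],
    classification=["Image"],
    status=Status.ACTIVE,
)
print(example.identifier, example.status.value)
```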

Descriptive Properties

• Identifiers
  – Canonical and alias
• Arbitrary relationships
  – Equivalence
  – Encapsulation
  – Sub-typing, with strict substitutability
    • PDF 1.0 ← … ← PDF 1.4 ← PDF/A
    • XML ← SVG
  – Versioning
• Ontological classification
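One way to picture sub-typing with strict substitutability is as a directed graph that is followed transitively. The sketch below assumes that “XML ← SVG” means SVG is a sub-type of XML; the `SUBTYPE_OF` table, the `substitutable` function, and the “Example local SVG profile” entry are hypothetical and exist only for illustration.

```python
from typing import Dict, Set

# Edges point from a sub-type to its direct super-type(s).
# Assumed reading of the slide: "XML ← SVG" means SVG is a sub-type of XML.
SUBTYPE_OF: Dict[str, Set[str]] = {
    "SVG": {"XML"},
    "PDF/A": {"PDF 1.4"},
    "Example local SVG profile": {"SVG"},  # hypothetical local profile
}


def substitutable(fmt: str, expected: str) -> bool:
    """Return True if `fmt` may be used wherever `expected` is required."""
    seen: Set[str] = set()
    stack = [fmt]
    while stack:
        current = stack.pop()
        if current == expected:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Follow sub-type edges upward toward more general formats.
        stack.extend(SUBTYPE_OF.get(current, ()))
    return False


print(substitutable("SVG", "XML"))
print(substitutable("XML", "SVG"))
print(substitutable("Example local SVG profile", "XML"))
```

The relation is deliberately one-directional: an SVG document can be processed as XML, but arbitrary XML cannot be treated as SVG.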

Format Ontology

• Content stream
  – Logical
  – Numeric
    • Scalar
      – Integer
        » Unsigned
      – Real
        » Floating point
      – Complex
  – Text
    • Structured text
      – Mark-up language
      – Programming language
    • Message
      – Mail
      – News
  – Image
    • Still
      – Font
        » Outline
        » Raster
      – Graphic
        » Vector
        » Raster
      – Page description
    • Motion
  – Audio
    • Music
  – Application
    • CAD
    • Communication
    • Database
    • Executable
    • GIS
    • Presentation
    • Spreadsheet
    • Word processing
  – Transformation
    • Compression
      – Lossless
      – Lossy
    • Container
      – File system
    • Transfer
      – 7-bit safe
• Physical media
  – Magnetic
    • Disk
    • Tape
      – Reel
      – Cartridge
  – Optical
    • Disk
      – CD-ROM
      – DVD
    • Film
  – Paper
    • Card
    • Tape

Characterization Properties

• Specification documents
  – Actionable links
  – Public identifiers
  – Hard copy
    • Public, on-site, license, and escrow access
• Signatures
  – External
    • File extension, Mac OS data fork type
  – Internal
    • Magic number
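The distinction between internal and external signatures can be sketched in a few lines of Python. The class names echo the data model, but the fields are simplified, and the `identify` function and its “internal first, external as fallback” policy are assumptions made for this example. The magic numbers shown are the well-known leading bytes of PDF, PNG, and GIF files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InternalSignature:
    format_id: str
    value: bytes     # magic number expected inside the byte stream
    offset: int = 0  # non-negative position at which the value must occur


@dataclass(frozen=True)
class ExternalSignature:
    format_id: str
    extension: str   # file extension, including the leading dot


INTERNAL: List[InternalSignature] = [
    InternalSignature("PDF", b"%PDF-"),
    InternalSignature("PNG", b"\x89PNG\r\n\x1a\n"),
    InternalSignature("GIF", b"GIF8"),
]
EXTERNAL: List[ExternalSignature] = [
    ExternalSignature("PDF", ".pdf"),
    ExternalSignature("PNG", ".png"),
    ExternalSignature("GIF", ".gif"),
]


def identify(data: bytes, filename: str = "") -> Optional[str]:
    """Guess a format: internal signatures first, then the file extension."""
    for sig in INTERNAL:
        if data[sig.offset:sig.offset + len(sig.value)] == sig.value:
            return sig.format_id
    lowered = filename.lower()
    for ext in EXTERNAL:
        if lowered.endswith(ext.extension):
            return ext.format_id
    return None


print(identify(b"%PDF-1.4\n", "report.bin"))
print(identify(b"", "scan.PNG"))
```

Preferring the internal signature reflects the fact that a file extension is easy to change, whereas the magic number is part of the content stream itself.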

Centralized vs. Distributed

• Allowing arbitrary granularity may lead to an explosion of registered formats
  – Versions
  – Local profiles

• Typed relationships support internal and external references

• Enable distributed architecture without mandating it

Core Registry Services

• Management Services
  – Approval
    • Level of review, level of public disclosure
  – Maintenance
    • Add, update, delete format entries
  – Notification
    • Notify registry clients of new/updated format or trigger events (e.g. obsolescence, new transformation service, etc.)
  – Introspection
    • Determine local policies (scope, coverage, implemented services, etc.) of a given registry to identify appropriate registry to use

Core Registry Services

• Access Services
  – Description
    • Representation information returned on request for single format
  – Export
    • Entire registry or selected subset sent to external repository
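The maintenance, notification, description, and export services could be sketched as a small in-memory class. This is an illustration under stated assumptions, not a proposed interface: the `FormatRegistry` class, its method names, the callback signature, and the JSON export are all hypothetical, and approval and introspection are left out.

```python
import json
from typing import Callable, Dict, Iterable, List, Optional

Listener = Callable[[str, str], None]  # receives (event, format identifier)


class FormatRegistry:
    """Hypothetical in-memory registry covering four of the core services."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._entries: Dict[str, dict] = {}
        self._listeners: List[Listener] = []

    # Notification: clients register a callback and are told about changes.
    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def _notify(self, event: str, identifier: str) -> None:
        for listener in self._listeners:
            listener(event, identifier)

    # Maintenance: add, update, delete format entries.
    def add(self, identifier: str, info: dict) -> None:
        if identifier in self._entries:
            raise KeyError(f"{identifier} is already registered")
        self._entries[identifier] = dict(info)
        self._notify("added", identifier)

    def update(self, identifier: str, info: dict) -> None:
        self._entries[identifier].update(info)  # KeyError if unknown
        self._notify("updated", identifier)

    def delete(self, identifier: str) -> None:
        del self._entries[identifier]           # KeyError if unknown
        self._notify("deleted", identifier)

    # Description: representation information for a single format.
    def describe(self, identifier: str) -> dict:
        return dict(self._entries[identifier])

    # Export: the entire registry or a selected subset, serialized as JSON.
    def export(self, identifiers: Optional[Iterable[str]] = None) -> str:
        keys = self._entries.keys() if identifiers is None else identifiers
        return json.dumps({k: self._entries[k] for k in keys}, sort_keys=True)


registry = FormatRegistry("example registry")
registry.subscribe(lambda event, fmt: print(event, fmt))
registry.add("example/tiff", {"name": "TIFF"})
registry.update("example/tiff", {"status": "Active"})
print(registry.export())
```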

Supported Services

• Representation Services
  – Identification services
    • Determine format of a specific digital object by comparing its attributes to the attribute profiles retrieved from the registry
  – Validation services
    • Verify format of a specific DO by comparing its attributes to the attribute profile retrieved from the registry for that format
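Both services reduce to comparing an object’s attributes with attribute profiles. The Python sketch below assumes a profile is a flat mapping of attribute names to required values; the `PROFILES` table, its identifiers, and the function names are hypothetical, with attribute values borrowed from the TIFF example earlier in the deck.

```python
from typing import Any, Dict, List

Profile = Dict[str, Any]

# Hypothetical attribute profiles, as they might be retrieved from a registry.
PROFILES: Dict[str, Profile] = {
    "tiff-tiled-rgb-lzw": {
        "layout": "tiled", "color": "RGB", "compression": "LZW",
    },
    "tiff-striped-bitonal-g4": {
        "layout": "striped", "color": "bi-tonal", "compression": "Group 4",
    },
}


def validate(attributes: Dict[str, Any], format_id: str) -> List[str]:
    """Validation: list every mismatch against one profile (empty = valid)."""
    problems = []
    for name, expected in PROFILES[format_id].items():
        actual = attributes.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected!r}, found {actual!r}")
    return problems


def identify_by_profile(attributes: Dict[str, Any]) -> List[str]:
    """Identification: return every profile the attributes fully satisfy."""
    return [fid for fid in PROFILES if not validate(attributes, fid)]


scan = {"layout": "striped", "color": "bi-tonal", "compression": "Group 4"}
print(identify_by_profile(scan))
print(validate(scan, "tiff-tiled-rgb-lzw"))
```

Returning the list of mismatches, instead of a bare yes or no, lets a repository report why an object failed validation.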

Supported Services

• Brokerage Services
  – Rendering service
    • Identify current rendering conditions for supplied DO
  – Transformation service
    • Convert DO from current (source) format to target format
  – Metadata Extraction services
    • Registry returns information supporting automated extraction of attribute metadata from a DO of a specific format
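A transformation broker has to answer the earlier use case, “I have an object of format F, but need G; how can I produce it?”. If the registry records which conversions exist, one possible approach is a breadth-first search over them. The `TRANSFORMS` table, the intermediate format H, and the `find_path` function are hypothetical, introduced only for this sketch.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical registered transformation services:
# source format -> formats it can be converted to directly.
TRANSFORMS: Dict[str, List[str]] = {
    "F": ["H"],
    "H": ["G"],
    "G": [],
}


def find_path(source: str, target: str) -> Optional[List[str]]:
    """Return a shortest chain of formats from source to target, or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in TRANSFORMS.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


print(find_path("F", "G"))
```

Breadth-first search returns a chain with the fewest conversion steps, which matters when each step risks losing significant properties.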

Service Model Sources

• ANSI X3.285, Metamodel for Management of Shareable Data
  – Service model for ISO/IEC 11179

• IANA MIME media type registry

• OASIS/ebXML Registry Services Specification

Registry Operation

• Trust is necessary to encourage deposit of proprietary information

• Sustainability is necessary to justify expense
  – As for all preservation activities, how do we generate income today, for services not needed until tomorrow?

The registry is valuable insofar as it is trustworthy and sustainable.

Registry Operation

• Will registry staff collect and manage representation information, or

• Will knowledgeable community members submit information?

• What is the level of technical review, and by whom?
  – IETF model

Is the registry self-populating, or a public bulletin board?

Governance Model

• Can this initiative reasonably be placed under the umbrella of an existing organization?

• Is global scope in conflict with national prerogatives?

• How to build sufficient trust models
• Governance model becomes more important as the operational model becomes more pro-active (distributed and contributory)

Business Model

• Costs depend on level of quality and authority required (e.g. wiki vs. OCLC)

• Assuming the registry needs to be cost-recovered, options for supporting “common good” services include:
  – Subsidy
  – Subscription
  – Pay to submit
    • Format registration accompanied by fee
  – Pay to view
    • Queries on a for-fee basis
  – Added-value services

Next Steps

• Tell people what we’re doing
  – National, academic, private libraries/archives
  – Standards bodies
  – Commercial
    • Regulated industries
    • Software vendors (developers and consumers of formats)
    • Publishers
  – Anyone with long-term digital preservation needs
• Refine documentation for a general audience
  – Vision statement and high-level project plan

Next Steps

• Look for project funding
  – Potentially two phases:
    • Design and implementation
      – Can be funded through grants, in-kind participation
    • Operational
      – Need reliable, sustainable income stream
  – Planning grant to sustain initial activity
    • Data and service models
    • Governance and business model
    • Development and operations plan
  – Library of Congress NDIIPP and/or JISC (UK) Digital Curation Centre

Why Is This Important to You?

• If you care about the long-term usability of your digital assets:
  – The registry will allow typing of digital objects at an appropriate level of granularity
  – The registry will allow the recovery in the future of the syntax and semantics associated with typed digital objects
  – The registry is an enabling technology underlying digital repository operations and preservation activities

… thanks!

hul.harvard.edu/formatregistry

[email protected]@mit.edu