File Creation, Rendering and Formats
Euan Cochrane, Archives New Zealand &
Dirk von Suchodoletz, University of Freiburg
Future Perfect 2012, 26 March 2012
Wellington, New Zealand
Contents
Euan
•Files, formats and their relationships to creating applications
•Files, formats and their relationships to rendering applications
Dirk
•Maintaining the ability to use older rendering applications
Euan
•Context and conclusions
Digital Preservation
• What is digital preservation?
Maintaining the full information content of digital objects [across time]
Maintaining the ability to render digital objects [across time]
“The goal of digital preservation is the accurate rendering of authenticated content over time”
• What is a file format?
“[pre-defined/particular] way that information is encoded for storage in a computer file”
File Creation and Formats
• In 2007, over 90% of HTML documents did not conform to standards
• Microsoft Office 2007 (and possibly 2010) create ODS files differently to most open source office suites.
• Microsoft Office 2007 and 2010 create Microsoft Office 97-2003 formatted files differently to Microsoft Office 97-2003
Format Standards are Often Ambiguous or not Available
• The JPEG standard specifies an end-of-image marker but not an end-of-file marker, so different applications write the end of the file differently
• LibreOffice 3.5 (14 February 2012) now “supports” Visio file import. This support is based on reverse engineering, as the format standard is not publicly available, and it is not complete
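The JPEG point can be illustrated with a small check of what follows the end-of-image marker in a file. This is a minimal sketch, not a validator: the function name `bytes_after_eoi` is hypothetical, the sample data is synthetic, and searching for the last FF D9 byte pair is only a heuristic, because the same bytes can also occur inside trailing data.

```python
from typing import Optional

EOI = b"\xff\xd9"  # JPEG end-of-image (EOI) marker


def bytes_after_eoi(data: bytes) -> Optional[int]:
    """Count the bytes that follow the last EOI marker in `data`.

    Returns None when no EOI marker is found. Heuristic only: the byte
    pair FF D9 may also appear inside data written after the image.
    """
    pos = data.rfind(EOI)
    if pos == -1:
        return None
    return len(data) - (pos + len(EOI))


# Synthetic stand-ins for files written by two different applications:
# one stops at the EOI marker, the other appends padding after it.
clean = b"\xff\xd8" + b"\x00" * 4 + EOI
padded = clean + b"\x00\x00"
```

Here `bytes_after_eoi(clean)` gives 0 and `bytes_after_eoi(padded)` gives 2. Both byte strings carry the same image data, yet the files differ at the end.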
“Rendering Matters” Research
• Compared the rendering of ~100 files on old software running on old hardware (the “control”) to:
1. LibreOffice version 3.3.0
2. Microsoft Office 2007
3. WordPerfect Office X5
4. Control Software running on emulated hardware
Summary Research Results
• [The choice of] Rendering [Environment] Matters
• MS-Office 2007 was a better rendering tool for the old files than either LibreOffice or WordPerfect Office
• The use of particular attributes/features in office files is inconsistent but most are used at least once.
• At least one “odd”/rare attribute/feature is included in most office files
Original Environments (OE)
Original creating application best candidate to render documents properly
Proprietary format knowledge embedded in the application
One environment renders all objects of a certain type
Keeping original software (and hardware) environments has impact on preservation and access workflows
Components of Access through OE
Emulators for different computer architectures
Software archive of all required applications, operating systems, additional components like fonts, codecs
Workflows on object ingest
Access systems for end users
Emulators
Wide range available for all relevant computer architectures
Many Open Source
Not yet DP aware – long term availability to be secured
DP community should seek more influence
Software Archive
Preserve the relevant software components and operational knowledge
Necessary Workflows
Freiburg digital preservation group leads the state-sponsored two-year bwFLA project
bwFLA project: providing access to complex, interactive digital objects
Provide extended ingest workflows with feedback loop
Extended Ingest Workflow
Make use of donor's expertise to collect complete information and components
Extend software archive if necessary
Add necessary technical metadata
Record knowledge on object handling
Let the donor check and sign-off the rendering results
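The ingest steps on this slide could be tracked with a simple record per object. The sketch below is only an illustration under stated assumptions: `IngestRecord`, `missing_from_archive`, `ready_for_preservation`, the field names, and the software names are all hypothetical and are not taken from the bwFLA project.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class IngestRecord:
    """Hypothetical record of what is collected for one object at ingest."""
    object_id: str
    required_software: List[str] = field(default_factory=list)
    technical_metadata: Dict[str, str] = field(default_factory=dict)
    handling_notes: List[str] = field(default_factory=list)
    donor_signed_off: bool = False


def missing_from_archive(record: IngestRecord, archive: Set[str]) -> List[str]:
    """List required components that the software archive does not yet hold."""
    return [name for name in record.required_software if name not in archive]


def ready_for_preservation(record: IngestRecord, archive: Set[str]) -> bool:
    """An object is ready once nothing is missing and the donor has signed off."""
    return not missing_from_archive(record, archive) and record.donor_signed_off


# Made-up example values for illustration only.
archive = {"ExampleOS 1.0"}
record = IngestRecord("obj-001", required_software=["ExampleOS 1.0", "ExampleWriter 2.0"])
```

With these values, `missing_from_archive(record, archive)` returns `["ExampleWriter 2.0"]`. This is the feedback loop: the archive is extended, the donor checks the rendering, and only then does `ready_for_preservation` return true.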
Access Workflows
Provide a reading room system or extension
– Pre-configure emulator to the OE required by the object
– Prepare the inclusion of the object into the original environment
– Automate the startup of the OSE
– Provide the user information and hints on how to interact with the OE & automate parts of this
– (Dis)allow, to a certain degree, saving results from the original environment or capturing certain states (e.g. using screenshots)
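The access steps listed above can be sketched as a plan that a reading room system might build for each request. This is an illustration, not the project's implementation: the class names, fields, and example values are hypothetical, and the plan is a list of human-readable steps rather than calls to any real emulator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OriginalEnvironment:
    """Hypothetical description of a preconfigured original environment (OE)."""
    name: str
    emulator: str
    disk_image: str
    user_hints: List[str] = field(default_factory=list)


@dataclass
class AccessSession:
    """One user's request to view an object in its original environment."""
    environment: OriginalEnvironment
    object_path: str
    allow_save: bool = False

    def plan(self) -> List[str]:
        """Return the ordered steps the access system would carry out."""
        env = self.environment
        steps = [
            f"configure {env.emulator} with {env.disk_image}",
            f"include {self.object_path} in {env.name}",
            f"start {env.name} automatically",
        ]
        steps += [f"show hint: {hint}" for hint in env.user_hints]
        steps.append("allow saving results" if self.allow_save
                     else "disallow saving results")
        return steps


# Made-up example values for illustration only.
env = OriginalEnvironment("ExampleOS 1.0 with ExampleWriter 2.0", "example-emulator",
                          "exampleos.img", ["Open the document from drive A:"])
session = AccessSession(env, "report.doc")
```

Calling `session.plan()` yields five steps, ending with "disallow saving results" because `allow_save` defaults to false. Keeping the plan as data lets the same description drive different emulators.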
Access System
Many components already exist, developed by past DP projects
Next step: Make them a usable “product”
Reading Room Access System
Make emulation accessible to standard users, such as those in memory institutions
Robust platform, extension to standard reading room systems
Unified access to a wide range of different emulators + preconfigured environments
Context and Conclusions
• Making decisions about preservation strategies
• When to Normalise?
• Variation in format implementation doesn’t matter if you maintain a compatible rendering environment
• Variation in rendering across environments doesn’t matter if you maintain the “right” rendering environment
• There are practical options for maintaining rendering environments
Thank you