National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
The Conversion Software Registry
Michal Ondrejcek, Kenton McHenry, Rob Kooper, Luigi Marini, and Peter Bajcsy
With an increasing number of file formats in use each year, preservation of electronic records has become one of the major challenges for the National Archives and Records Administration (NARA).
The Strategic Plan of the National Archives and Records Administration (NARA) 2006-2016, “Preserving the Past to Protect the Future”, 2006, URL: http://www.archives.gov/about/plans-reports/strategic-plan/
Overview
• Why?
  • Will there be software to load the file in the future?
  • If not, will the specification for the format still exist?
  • What would be the best file format conversion in terms of information preservation?
  • In the case of closed/proprietary formats, was the specification ever available to begin with?
This research is partially supported by a National Archives and Records Administration supplement to NSF PACI cooperative agreement CA #SCI-9619019.
2010 MS eScience -1
Conversions
• Convert files to an open standardized format to store with the original
• How, and to which format?
• Conversions often result in some information loss; which format would have the least?
• If we had a “universal” converter, we could test conversions and compare the before and after files to estimate information loss
• How do we convert!?
  • MANY file formats!
  • MANY closed/proprietary formats!
  • MANY with large/complex specifications
Available 3D File Formats…
• Many applications to create/view/save 3D content
• MANY of them introduce a new file format for that content!
• Most applications support only a handful of import and export formats
• Conversions perform differently depending not only on the algorithm used but also on the purpose of the format's domain
  • emphasis on texture in 3D
  • morphology vs. color histogram in 2D
NCSA file conversion technologies
• Visualization (I/O Graph)
• Conversion (Polyglot)
• Software Reuse
• Closed Source Software
• Comparison (Versus)
• I/O Graph Weights Tool
Input/Output Graphs
• Software import and export options are visualized.
• The I/O Graph chooses the shortest path, i.e. the one that uses the fewest applications.
3DS Max
Adobe 3D Reviewer
AutoCAD
Blender
Cinema 4D
K-3D
LightWave 3D
Maya
Wings 3D
Shortest conversion path
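A minimal sketch of how a shortest conversion path could be found in an I/O graph, assuming each edge is one application converting one format to another, so the path with the fewest edges uses the fewest application runs. The function name, the placeholder application names, and the placeholder format names are illustrative assumptions, not part of the NCSA tools.

```python
from collections import deque


def fewest_applications_path(conversions, source, target):
    """Breadth-first search over an I/O graph.

    conversions: iterable of (application, input_format, output_format).
    Returns a list of (application, output_format) steps, or None.
    """
    graph = {}
    for application, src, dst in conversions:
        graph.setdefault(src, []).append((application, dst))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, steps = queue.popleft()
        if fmt == target:
            return steps
        for application, dst in graph.get(fmt, []):
            if dst not in visited:
                visited.add(dst)
                queue.append((dst, steps + [(application, dst)]))
    return None  # no chain of applications connects the two formats


# Hypothetical data: placeholder applications and formats.
example_conversions = [
    ("app_1", "fmt_a", "fmt_b"),
    ("app_2", "fmt_b", "fmt_c"),
    ("app_3", "fmt_c", "fmt_d"),
    ("app_4", "fmt_b", "fmt_d"),
]
example_path = fewest_applications_path(example_conversions, "fmt_a", "fmt_d")
```

Breadth-first search is enough here only because every edge counts the same; once edges carry information-loss weights, a weighted search is needed instead.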
Software Reuse Layer
Exists for the sole purpose of providing an API to functionality in 3rd party software
• Controls software via wrapper scripts
  • AutoHotkey, AppleScript, various shell scripts
  • Vision-based scripts
• Hides away the details of using 3rd party software
• Attempts to recover from errors; can throw exceptions
Making Use of 3rd Party Software: we define this as the wrapping of 3rd party software, utilizing whatever interfaces the software vendors have made available, in order to re-introduce an API-like interface to embedded functionality.
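The wrapping idea above can be sketched as a small function that runs a wrapper script, retries on failure, and raises an exception when recovery is not possible. This is an illustration only: the actual layer uses AutoHotkey, AppleScript, and shell scripts, and `run_wrapper_script`, `ConversionError`, and the retry and timeout parameters are hypothetical names chosen for this sketch.

```python
import subprocess
import sys


class ConversionError(Exception):
    """Raised when a wrapper script fails on every attempt."""


def run_wrapper_script(command, retries=2, timeout_s=60):
    """Run a wrapper script; retry on failure, raise if all attempts fail."""
    last_error = "not run"
    for _ in range(retries + 1):
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            last_error = "timed out"  # e.g. the wrapped application hung
            continue
        if result.returncode == 0:
            return result.stdout
        last_error = f"exit code {result.returncode}"
    raise ConversionError(f"{command!r} failed: {last_error}")


# Stand-in for a real wrapper script: a Python one-liner that succeeds.
output = run_wrapper_script([sys.executable, "-c", "print('converted')"])
```

The caller sees only a function call and an exception type, which is the sense in which the layer hides the details of the 3rd party software.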
• Runs as a service on the machine where the 3rd party software is installed
• Clients provide the Java API
• Many servers can exist on many machines with different platforms
Polyglot
The sole purpose of this layer is conversions.
• Uses multiple software reuse servers
• Merges available script operations into an I/O-Graph
• Searches the I/O-Graph for conversion paths between an input format and a desired output format
• Has no knowledge of the underlying 3rd party software
• Can use redundancy in software reuse servers to improve performance and work around faults
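A rough sketch of the merging and redundancy ideas, assuming each software reuse server advertises its operations as (application, input format, output format) triples. The function names, the `run_step` callback, and all server, application, and format names are hypothetical and do not reflect Polyglot's actual interface.

```python
def merge_operations(servers):
    """Merge the operations of all servers into one I/O graph.

    servers: dict of server name -> list of (application, src, dst).
    Returns a dict of (src, dst) -> list of (server, application) providers.
    """
    graph = {}
    for server, operations in servers.items():
        for application, src, dst in operations:
            graph.setdefault((src, dst), []).append((server, application))
    return graph


def convert_with_failover(graph, path, run_step):
    """Walk a path of formats, trying redundant providers for each step.

    run_step(server, application, src, dst) returns True on success.
    """
    used = []
    for src, dst in zip(path, path[1:]):
        for server, application in graph.get((src, dst), []):
            if run_step(server, application, src, dst):
                used.append((server, application))
                break
        else:  # every provider for this step failed
            raise RuntimeError(f"no server could convert {src} to {dst}")
    return used


# Hypothetical setup: two servers, one of which is down.
example_servers = {
    "server_1": [("app_1", "fmt_a", "fmt_b")],
    "server_2": [("app_1", "fmt_a", "fmt_b"), ("app_2", "fmt_b", "fmt_c")],
}
example_graph = merge_operations(example_servers)
example_used = convert_with_failover(
    example_graph,
    ["fmt_a", "fmt_b", "fmt_c"],
    run_step=lambda server, app, src, dst: server != "server_1",
)
```

Because the graph is keyed by format pairs, the path search never needs to know which application or server performs a step.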
Comparison Layer
The sole purpose of this layer is to compare files.
• Versus, a framework for pair-wise digital object comparisons. The library extracts the same features from both objects and computes the similarity based on the chosen measure.
• Uses the Polyglot layer to convert many test files across many of the possible paths: A -> B -> A’
• Compare files before and after conversion
• I/O Graph Weights Tool: converts a set of files across many paths using Polyglot and scripts, then adds the information losses obtained from Versus as edge weights to the I/O Graph.
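The A -> B -> A’ round trip can be sketched as follows. The assumptions are loud ones: "files" are plain strings, the converters are toy functions, similarity comes from Python's `difflib` rather than Versus, and the mean round-trip loss is attached to the A -> B edge. How the actual tool attributes loss to edges is not stated in the slides.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Toy similarity in [0, 1]; stands in for a Versus measure."""
    return SequenceMatcher(None, a, b).ratio()


def round_trip_loss(original, forward, backward, similarity_fn):
    """Convert A -> B -> A' and report 1 - similarity(A, A')."""
    restored = backward(forward(original))
    return 1.0 - similarity_fn(original, restored)


def edge_weights(test_files, converters, similarity_fn):
    """Average the round-trip loss over all test files for each edge.

    converters: dict of (src, dst) -> (forward, backward) callables.
    """
    weights = {}
    for edge, (forward, backward) in converters.items():
        losses = [
            round_trip_loss(f, forward, backward, similarity_fn)
            for f in test_files
        ]
        weights[edge] = sum(losses) / len(losses)
    return weights


# Hypothetical converters: one lossless, one that drops half the data.
example_converters = {
    ("fmt_a", "fmt_b"): (lambda s: s[::-1], lambda s: s[::-1]),
    ("fmt_a", "fmt_c"): (lambda s: s[: len(s) // 2], lambda s: s),
}
example_weights = edge_weights(["abcdefgh", "12345678"], example_converters, similarity)
```

A weight of zero marks a conversion that survived the round trip unchanged on every test file; larger weights mark lossier edges.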
Conversion Software Registry (CSR)
• http://isda.ncsa.illinois.edu/NARA/CSR
• Complementary to format registries such as PRONOM and GDFR
• No similar service that we are aware of
• Community contributions encouraged
A database focused on:
• Conversion software!
• Finding subsets of software for specific conversion needs
• Finding conversion paths between pairs of formats
The CSR pseudo-tables block design
Parts: 1) Conversions, 2) Software, 3) Formats and Files, 4) Scripts, 5) User login and history
Adding Conversions
Adding Conversions - scripts
Script types present:
• Convert - full conversion
• Monitor - monitoring software behavior
• Kill - terminating the software
• Open/Save/Import/Export
Script headers are standardized, with up to four lines giving the software name and version, the software domain (image, 3d, document, etc.), and the input/output formats.
Editing Pane
• Software
• Vendors
• Software platforms
• Interfaces
• Formats
• Equivalent extensions
• Sample files
File format identifiers and extensions
Canonical and derived identifiers:
• Common usage: ‘TIFF’
• MIME: ‘image/tiff’
• UTI: ‘public.tiff’
• PRONOM PUID: ‘fmt/10’
CSR relies on these identifiers.
CSR can be searched by extension, MIME type, or PUID.
PUIDs distinguish different format versions. For example, the tiff extension corresponds to PUID ‘fmt/10’ for TIFF version 6.0 and to ‘fmt/155’ for GeoTIFF.
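A small sketch of searching by identifier, using only the TIFF values quoted on this slide. The record layout and the `find_formats` function are assumptions for illustration and do not reflect the actual CSR schema; fields the slide does not give are left as None.

```python
# Records hold only the identifiers named on the slide; None means "not given".
FORMAT_RECORDS = [
    {
        "name": "TIFF",
        "version": "6.0",
        "extension": "tiff",
        "mime": "image/tiff",
        "uti": "public.tiff",
        "puid": "fmt/10",
    },
    {
        "name": "GeoTIFF",
        "version": None,
        "extension": "tiff",
        "mime": None,
        "uti": None,
        "puid": "fmt/155",
    },
]


def find_formats(records, field, value):
    """Return every record whose identifier field equals value."""
    return [record for record in records if record.get(field) == value]


# The extension alone is ambiguous; the PUID pins down one format version.
by_extension = find_formats(FORMAT_RECORDS, "extension", "tiff")
by_puid = find_formats(FORMAT_RECORDS, "puid", "fmt/10")
```

Searching by extension returns both records, while the PUID returns exactly one, which is why a version-aware identifier is useful alongside extensions.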
Test files
Any file that can be used to assess conversion accuracy and to validate software. Files are uploaded and then verified with the UNIX file command and against the file extension entry in the CSR database.
Additional file validation has been performed semi-automatically by NARA using the GTRI (Georgia Tech Research Institute) File Type Identifier.
W. Underwood, “Extensions of the UNIX file command and magic file for file type identification”, Technical report ITTL/CSITD 09-02, Georgia Tech Research Institute, 2009, URL: http://perpos.gtri.gatech.edu/publications/index.htm
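The upload check described above might look roughly like this, assuming the registry can be reduced to a mapping from extension to expected MIME type. `verify_upload`, `detect_mime_with_file`, and the mapping are hypothetical; the only external tool used is the standard UNIX file command with its `--brief` and `--mime-type` options.

```python
import subprocess
from pathlib import Path


def detect_mime_with_file(path):
    """Ask the UNIX file command for the MIME type of path."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def verify_upload(path, extension_to_mime, detect=detect_mime_with_file):
    """Accept a file only if its detected type matches its extension's entry."""
    extension = Path(path).suffix.lstrip(".").lower()
    expected = extension_to_mime.get(extension)
    if expected is None:
        return False  # extension not known to the registry
    return detect(path) == expected


# Hypothetical registry excerpt, using the TIFF identifiers quoted earlier.
EXAMPLE_REGISTRY = {"tiff": "image/tiff"}
```

The detector is passed in as a parameter so the check can be exercised without the file command installed, for example `verify_upload("scan.tiff", EXAMPLE_REGISTRY, detect=lambda p: "image/tiff")`.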
Searching for Software
Find a conversion path for converting a file format A to a file format B.
Shortest path from file A to B
• Dijkstra's algorithm: finds the lowest-cost path (e.g. the shortest path) between one vertex/node and every other vertex, with edge costs defined by some measure
• Subjective measure: a user's ranking of a software package propagates to all of its conversions
• Quantitative measures within the domain (images, 3d, etc.)
  • Images: normalized cross correlation measure, histogram distance measure
  • 3d: surface area, statistics, spin images, light fields
  • Document (pdf)
  • Audio
• User-specified measures, for example a linear combination of measures
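A minimal sketch of Dijkstra's algorithm over an I/O graph whose edge cost is a user-specified linear combination of measures. The measure names ("loss", "rank"), the coefficients, and the placeholder applications and formats are assumptions for illustration; all combined costs are assumed non-negative, as Dijkstra's algorithm requires.

```python
import heapq


def combined_cost(measures, coefficients):
    """Linear combination of an edge's measures; missing weights count as 0."""
    return sum(coefficients.get(name, 0.0) * value for name, value in measures.items())


def cheapest_path(edges, source, target, coefficients):
    """Dijkstra's algorithm; returns (total_cost, steps) or None.

    edges: iterable of (src, dst, software, measures_dict).
    steps: list of (software, output_format).
    """
    graph = {}
    for src, dst, software, measures in edges:
        cost = combined_cost(measures, coefficients)
        graph.setdefault(src, []).append((dst, software, cost))

    queue = [(0.0, source, [])]
    settled = set()
    while queue:
        cost, fmt, steps = heapq.heappop(queue)
        if fmt == target:
            return cost, steps
        if fmt in settled:
            continue
        settled.add(fmt)
        for dst, software, weight in graph.get(fmt, []):
            if dst not in settled:
                heapq.heappush(queue, (cost + weight, dst, steps + [(software, dst)]))
    return None


# Hypothetical edges: the direct conversion is lossier than the two-step route.
example_edges = [
    ("fmt_a", "fmt_c", "app_1", {"loss": 0.9, "rank": 0.1}),
    ("fmt_a", "fmt_b", "app_2", {"loss": 0.2, "rank": 0.5}),
    ("fmt_b", "fmt_c", "app_3", {"loss": 0.3, "rank": 0.5}),
]
example_result = cheapest_path(example_edges, "fmt_a", "fmt_c", {"loss": 1.0, "rank": 0.0})
```

With these weights the two-step route through fmt_b costs less than the single direct conversion, so the cheapest path is not always the one with the fewest applications.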
Searching for Conversion Paths
Future Directions
• Compiling known “good” data of various formats
• Systematically measuring information loss across software and formats
• Possibly distributing the task among a community
• Ranking software based on performance
• Integration of CSR and Polyglot
Summary
• Currently contains 2,006 software packages
• 1,682 format extensions
• 233,810 conversions
• No similar service that we are aware of
• Complementary to format registries such as PRONOM and GDFR
• Free
• Community contributions encouraged
http://isda.ncsa.illinois.edu/NARA/CSR